|  | # CPU Scheduling events | 
|  |  | 
|  | On Android and Linux Perfetto can gather scheduler traces via the Linux Kernel | 
|  | [ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt) | 
|  | infrastructure. | 
|  |  | 
|  | This allows to get fine grained scheduling events such as: | 
|  |  | 
|  | * Which threads were scheduling on which CPU cores at any point in time, with | 
|  | nanosecond accuracy. | 
|  | * The reason why a running thread got descheduled (e.g. pre-emption, blocked on | 
|  | a mutex, blocking syscall or any other wait queue). | 
|  | * The point in time when a thread became eligible to be executed, even if it was | 
|  | not put immediately on any CPU run queue, together with the source thread that | 
|  | made it executable. | 
|  |  | 
|  | ## UI | 
|  |  | 
|  | When zoomed out, the UI shows a quantized view of CPU usage, which collapses the | 
|  | scheduling information: | 
|  |  | 
|  |  | 
|  |  | 
|  | However, by zooming in, the individual scheduling events become visible: | 
|  |  | 
|  |  | 
|  |  | 
|  | Clicking on a CPU slice shows the relevant information in the details panel: | 
|  |  | 
|  |  | 
|  |  | 
|  | Scrolling down, when expanding individual processes, the scheduling events also | 
|  | create one track for each thread, which allows to follow the evolution of the | 
|  | state of individual threads: | 
|  |  | 
|  |  | 
|  |  | 
|  |  | 
|  | ```protobuf | 
|  | data_sources { | 
|  | config { | 
|  | name: "linux.ftrace" | 
|  | ftrace_config { | 
|  | ftrace_events: "sched/sched_switch" | 
|  | ftrace_events: "sched/sched_waking" | 
|  | } | 
|  | } | 
|  | } | 
|  | ``` | 
|  |  | 
|  | ## SQL | 
|  |  | 
|  | At the SQL level, the scheduling data is exposed in the | 
|  | [`sched_slice`](/docs/analysis/sql-tables.autogen#sched_slice) table. | 
|  |  | 
|  | ```sql | 
|  | select ts, dur, cpu, end_state, priority, process.name, thread.name | 
|  | from sched_slice left join thread using(utid) left join process using(upid) | 
|  | ``` | 
|  |  | 
|  | ts | dur | cpu | end_state | priority | process.name, | thread.name | 
|  | ---|-----|-----|-----------|----------|---------------|------------ | 
|  | 261187012170995 | 247188 | 2 | S | 130 | /system/bin/logd | logd.klogd | 
|  | 261187012418183 | 12812 | 2 | D | 120 | /system/bin/traced_probes | traced_probes0 | 
|  | 261187012421099 | 220000 | 4 | D | 120 | kthreadd | kworker/u16:2 | 
|  | 261187012430995 | 72396 | 2 | D | 120 | /system/bin/traced_probes | traced_probes1 | 
|  | 261187012454537 | 13958 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0 | 
|  | 261187012460318 | 46354 | 3 | S | 120 | /system/bin/traced_probes | traced_probes2 | 
|  | 261187012468495 | 10625 | 0 | R | 120 | [NULL] | swapper/0 | 
|  | 261187012479120 | 6459 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0 | 
|  | 261187012485579 | 7760 | 0 | R | 120 | [NULL] | swapper/0 | 
|  | 261187012493339 | 34896 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0 | 
|  |  | 
|  | ## TraceConfig | 
|  |  | 
|  | ```protobuf | 
|  | data_sources: { | 
|  | config { | 
|  | name: "linux.ftrace" | 
|  | ftrace_config { | 
|  | ftrace_events: "sched/sched_switch" | 
|  | ftrace_events: "sched/sched_process_exit" | 
|  | ftrace_events: "sched/sched_process_free" | 
|  | ftrace_events: "task/task_newtask" | 
|  | ftrace_events: "task/task_rename" | 
|  | } | 
|  | } | 
|  | } | 
|  |  | 
|  | # This is to get full process name and thread<>process relationships. | 
|  | data_sources: { | 
|  | config { | 
|  | name: "linux.process_stats" | 
|  | } | 
|  | } | 
|  | ``` | 
|  |  | 
|  | ## Scheduling wakeups and latency analysis | 
|  |  | 
|  | By further enabling the following in the TraceConfig, the ftrace data source | 
|  | will record also scheduling wake up events: | 
|  |  | 
|  | ```protobuf | 
|  | ftrace_events: "sched/sched_wakeup_new" | 
|  | ftrace_events: "sched/sched_waking" | 
|  | ``` | 
|  |  | 
|  | While `sched_switch` events are emitted only when a thread is in the | 
|  | `R(unnable)` state AND is running on a CPU run queue, `sched_waking` events are | 
|  | emitted when any event causes a thread state to change. | 
|  |  | 
|  | Consider the following example: | 
|  |  | 
|  | ``` | 
|  | Thread A | 
|  | condition_variable.wait() | 
|  | Thread B | 
|  | condition_variable.notify() | 
|  | ``` | 
|  |  | 
|  | When Thread A suspends on the wait() it will enter the state `S(sleeping)` and | 
|  | get removed from the CPU run queue. When Thread B notifies the variable, the | 
|  | kernel will transition Thread A into the `R(unnable)` state. Thread A at that | 
|  | point is eligible to be put back on a run queue. However this might not happen | 
|  | for some time because, for instance: | 
|  |  | 
|  | * All CPUs might be busy running some other thread, and Thread A needs to wait | 
|  | to get a run queue slot assigned (or the other threads have higher priority). | 
|  | * Some other CPUs other than the current one, but the scheduler load balancer | 
|  | might take some time to move the thread on another CPU. | 
|  |  | 
|  | Unless using real-time thread priorities, most Linux Kernel scheduler | 
|  | configurations are not strictly work-conserving. For instance the scheduler | 
|  | might prefer to wait some time in the hope that the thread running on the | 
|  | current CPU goes to idle, avoiding a cross-cpu migration which might be more | 
|  | costly both in terms of overhead and power. | 
|  |  | 
|  | NOTE: `sched_waking` and `sched_wakeup` provide nearly the same information. The | 
|  | difference lies in wakeup events across CPUs, which involve | 
|  | inter-processor interrupts. The former is emitted on the source (wakee) | 
|  | CPU, the latter on the destination (waked) CPU. `sched_waking` is usually | 
|  | sufficient for latency analysis, unless you are looking into breaking down | 
|  | latency due to inter-processor signaling. | 
|  |  | 
|  | When enabling `sched_waking` events, the following will appear in the UI when | 
|  | selecting a CPU slice: | 
|  |  | 
|  |  | 
|  |  | 
|  | ### Decoding `end_state` | 
|  |  | 
|  | The [sched_slice](/docs/analysis/sql-tables.autogen#sched_slice) table contains | 
|  | information on scheduling activity of the system: | 
|  |  | 
|  | ``` | 
|  | > select * from sched_slice limit 1 | 
|  | id  type        ts          dur    cpu utid end_state priority | 
|  | 0   sched_slice 70730062200 125364 0   1    S         130 | 
|  | ``` | 
|  |  | 
|  | Each row of the table shows when a given thread (`utid`) began running | 
|  | (`ts`), on which core it ran (`cpu`), for how long it ran (`dur`), | 
|  | and why it stopped running: `end_state`. | 
|  |  | 
|  | `end_state` is encoded as one or more ascii characters. The UI uses | 
|  | the following translations to convert `end_state` into human readable | 
|  | text: | 
|  |  | 
|  | | end_state  | Translation            | | 
|  | |------------|------------------------| | 
|  | | R          | Runnable               | | 
|  | | S          | Sleeping               | | 
|  | | D          | Uninterruptible Sleep | | 
|  | | T          | Stopped                | | 
|  | | t          | Traced                 | | 
|  | | X          | Exit (Dead)            | | 
|  | | Z          | Exit (Zombie)          | | 
|  | | x          | Task Dead              | | 
|  | | I          | Task Dead              | | 
|  | | K          | Wake Kill              | | 
|  | | W          | Waking                 | | 
|  | | P          | Parked                 | | 
|  | | N          | No Load                | | 
|  | | +          | (Preempted)            | | 
|  |  | 
|  | Not all combinations of characters are meaningful. | 
|  |  | 
|  | If we do not know when the scheduling ended (for example because the | 
|  | trace ended while the thread was still running) `end_state` will be | 
|  | `NULL` and `dur` will be -1. | 
|  |  |