On Android and Linux Perfetto can gather scheduler traces via the Linux Kernel ftrace infrastructure.
This allows to get fine grained scheduling events such as:
When zoomed out, the UI shows a quantized view of CPU usage, which collapses the scheduling information:
However, by zooming in, the individual scheduling events become visible:
Clicking on a CPU slice shows the relevant information in the details panel:
Scrolling down, when expanding individual processes, the scheduling events also create one track for each thread, which allows to follow the evolution of the state of individual threads:
data_sources { config { name: "linux.ftrace" ftrace_config { ftrace_events: "sched/sched_switch" ftrace_events: "sched/sched_waking" } } }
At the SQL level, the scheduling data is exposed in the sched_slice table.
select ts, dur, cpu, end_state, priority, process.name, thread.name from sched_slice left join thread using(utid) left join process using(upid)
| ts | dur | cpu | end_state | priority | process.name, | thread.name | 
|---|---|---|---|---|---|---|
| 261187012170995 | 247188 | 2 | S | 130 | /system/bin/logd | logd.klogd | 
| 261187012418183 | 12812 | 2 | D | 120 | /system/bin/traced_probes | traced_probes0 | 
| 261187012421099 | 220000 | 4 | D | 120 | kthreadd | kworker/u16:2 | 
| 261187012430995 | 72396 | 2 | D | 120 | /system/bin/traced_probes | traced_probes1 | 
| 261187012454537 | 13958 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0 | 
| 261187012460318 | 46354 | 3 | S | 120 | /system/bin/traced_probes | traced_probes2 | 
| 261187012468495 | 10625 | 0 | R | 120 | [NULL] | swapper/0 | 
| 261187012479120 | 6459 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0 | 
| 261187012485579 | 7760 | 0 | R | 120 | [NULL] | swapper/0 | 
| 261187012493339 | 34896 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0 | 
data_sources: { config { name: "linux.ftrace" ftrace_config { ftrace_events: "sched/sched_switch" ftrace_events: "sched/sched_process_exit" ftrace_events: "sched/sched_process_free" ftrace_events: "task/task_newtask" ftrace_events: "task/task_rename" } } } # This is to get full process name and thread<>process relationships. data_sources: { config { name: "linux.process_stats" } }
By further enabling the following in the TraceConfig, the ftrace data source will record also scheduling wake up events:
ftrace_events: "sched/sched_wakeup_new" ftrace_events: "sched/sched_waking"
While sched_switch events are emitted only when a thread is in the R(unnable) state AND is running on a CPU run queue, sched_waking events are emitted when any event causes a thread state to change.
Consider the following example:
Thread A
condition_variable.wait()
                                     Thread B
                                     condition_variable.notify()
When Thread A suspends on the wait() it will enter the state S(sleeping) and get removed from the CPU run queue. When Thread B notifies the variable, the kernel will transition Thread A into the R(unnable) state. Thread A at that point is eligible to be put back on a run queue. However this might not happen for some time because, for instance:
Unless using real-time thread priorities, most Linux Kernel scheduler configurations are not strictly work-conserving. For instance the scheduler might prefer to wait some time in the hope that the thread running on the current CPU goes to idle, avoiding a cross-cpu migration which might be more costly both in terms of overhead and power.
NOTE: sched_waking and sched_wakeup provide nearly the same information. The difference lies in wakeup events across CPUs, which involve inter-processor interrupts. The former is emitted on the source (wakee) CPU, the latter on the destination (waked) CPU. sched_waking is usually sufficient for latency analysis, unless you are looking into breaking down latency due to inter-processor signaling.
When enabling sched_waking events, the following will appear in the UI when selecting a CPU slice: