On Android and Linux Perfetto can gather scheduler traces via the Linux Kernel ftrace infrastructure.
This allows to get fine grained scheduling events such as:
When zoomed out, the UI shows a quantized view of CPU usage, which collapses the scheduling information:
However, by zooming in, the individual scheduling events become visible:
Clicking on a CPU slice shows the relevant information in the details panel:
Scrolling down, when expanding individual processes, the scheduling events also create one track for each thread, which allows to follow the evolution of the state of individual threads:
At the SQL level, the scheduling data is exposed in the sched_slice table.
select ts, dur, cpu, end_state, priority, process.name, thread.name from sched_slice left join thread using(utid) left join process using(upid)
| ts | dur | cpu | end_state | priority | process.name, | thread.name |
|---|---|---|---|---|---|---|
| 261187012170995 | 247188 | 2 | S | 130 | /system/bin/logd | logd.klogd |
| 261187012418183 | 12812 | 2 | D | 120 | /system/bin/traced_probes | traced_probes0 |
| 261187012421099 | 220000 | 4 | D | 120 | kthreadd | kworker/u16:2 |
| 261187012430995 | 72396 | 2 | D | 120 | /system/bin/traced_probes | traced_probes1 |
| 261187012454537 | 13958 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0 |
| 261187012460318 | 46354 | 3 | S | 120 | /system/bin/traced_probes | traced_probes2 |
| 261187012468495 | 10625 | 0 | R | 120 | [NULL] | swapper/0 |
| 261187012479120 | 6459 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0 |
| 261187012485579 | 7760 | 0 | R | 120 | [NULL] | swapper/0 |
| 261187012493339 | 34896 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0 |
To collect this data, include the following data sources:
# Scheduling data from the kernel. data_sources: { config { name: "linux.ftrace" ftrace_config { compact_sched: { enabled: true } ftrace_events: "sched/sched_switch" # optional: precise thread lifetime tracking: ftrace_events: "sched/sched_process_exit" ftrace_events: "sched/sched_process_free" ftrace_events: "task/task_newtask" ftrace_events: "task/task_rename" } } } # Adds full process names and thread<>process relationships: data_sources: { config { name: "linux.process_stats" } }
By further enabling the following in the TraceConfig, the ftrace data source will record also scheduling wake up events:
ftrace_events: "sched/sched_wakeup_new" ftrace_events: "sched/sched_waking"
While sched_switch events are emitted only when a thread is in the R(unnable) state AND is running on a CPU run queue, sched_waking events are emitted when any event causes a thread state to change.
Consider the following example:
Thread A
condition_variable.wait()
Thread B
condition_variable.notify()
When Thread A suspends on the wait() it will enter the state S(sleeping) and get removed from the CPU run queue. When Thread B notifies the variable, the kernel will transition Thread A into the R(unnable) state. Thread A at that point is eligible to be put back on a run queue. However this might not happen for some time because, for instance:
Unless using real-time thread priorities, most Linux Kernel scheduler configurations are not strictly work-conserving. For instance the scheduler might prefer to wait some time in the hope that the thread running on the current CPU goes to idle, avoiding a cross-cpu migration which might be more costly both in terms of overhead and power.
NOTE: sched_waking and sched_wakeup provide nearly the same information. The difference lies in wakeup events across CPUs, which involve inter-processor interrupts. The former is emitted on the source (wakee) CPU, the latter on the destination (waked) CPU. sched_waking is usually sufficient for latency analysis, unless you are looking into breaking down latency due to inter-processor signaling.
When enabling sched_waking events, the following will appear in the UI when selecting a CPU slice:
end_stateThe sched_slice table contains information on scheduling activity of the system:
> select * from sched_slice limit 1 id type ts dur cpu utid end_state priority 0 sched_slice 70730062200 125364 0 1 S 130
Each row of the table shows when a given thread (utid) began running (ts), on which core it ran (cpu), for how long it ran (dur), and why it stopped running: end_state.
end_state is encoded as one or more ascii characters. The UI uses the following translations to convert end_state into human readable text:
| end_state | Translation |
|---|---|
| R | Runnable |
| R+ | Runnable (Preempted) |
| S | Sleeping |
| D | Uninterruptible Sleep |
| T | Stopped |
| t | Traced |
| X | Exit (Dead) |
| Z | Exit (Zombie) |
| x | Task Dead |
| I | Idle |
| K | Wake Kill |
| W | Waking |
| P | Parked |
| N | No Load |
Not all combinations of characters are meaningful.
If we do not know when the scheduling ended (for example because the trace ended while the thread was still running) end_state will be NULL and dur will be -1.