On Android and Linux Perfetto can gather scheduler traces via the Linux Kernel ftrace infrastructure.
This allows to get fine grained scheduling events such as:
The UI represents individual scheduling events as slices:
Clicking on a CPU slice shows the relevant information in the details panel:
Scrolling down, when expanding individual processes, the scheduling events also create one track for each thread, which allows to follow the evolution of the state of individual threads:
At the SQL level, the scheduling data is exposed in the sched_slice
table.
select ts, dur, cpu, end_state, priority, process.name, thread.name from sched_slice left join thread using(utid) left join process using(upid)
ts | dur | cpu | end_state | priority | process.name, | thread.name |
---|---|---|---|---|---|---|
261187012170995 | 247188 | 2 | S | 130 | /system/bin/logd | logd.klogd |
261187012418183 | 12812 | 2 | D | 120 | /system/bin/traced_probes | traced_probes0 |
261187012421099 | 220000 | 4 | D | 120 | kthreadd | kworker/u16:2 |
261187012430995 | 72396 | 2 | D | 120 | /system/bin/traced_probes | traced_probes1 |
261187012454537 | 13958 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0 |
261187012460318 | 46354 | 3 | S | 120 | /system/bin/traced_probes | traced_probes2 |
261187012468495 | 10625 | 0 | R | 120 | [NULL] | swapper/0 |
261187012479120 | 6459 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0 |
261187012485579 | 7760 | 0 | R | 120 | [NULL] | swapper/0 |
261187012493339 | 34896 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0 |
To collect this data, include the following data sources:
# Scheduling data from the kernel. data_sources: { config { name: "linux.ftrace" ftrace_config { compact_sched: { enabled: true } ftrace_events: "sched/sched_switch" # optional: precise thread lifetime tracking: ftrace_events: "sched/sched_process_exit" ftrace_events: "sched/sched_process_free" ftrace_events: "task/task_newtask" ftrace_events: "task/task_rename" } } } # Adds full process names and thread<>process relationships: data_sources: { config { name: "linux.process_stats" } }
By further enabling the following in the TraceConfig, the ftrace data source will record also scheduling wake up events:
ftrace_events: "sched/sched_wakeup_new" ftrace_events: "sched/sched_waking"
While sched_switch
events are emitted only when a thread is in the R(unnable)
state AND is running on a CPU run queue, sched_waking
events are emitted when any event causes a thread state to change.
Consider the following example:
Thread A condition_variable.wait() Thread B condition_variable.notify()
When Thread A suspends on the wait() it will enter the state S(sleeping)
and get removed from the CPU run queue. When Thread B notifies the variable, the kernel will transition Thread A into the R(unnable)
state. Thread A at that point is eligible to be put back on a run queue. However this might not happen for some time because, for instance:
Unless using real-time thread priorities, most Linux Kernel scheduler configurations are not strictly work-conserving. For instance the scheduler might prefer to wait some time in the hope that the thread running on the current CPU goes to idle, avoiding a cross-cpu migration which might be more costly both in terms of overhead and power.
NOTE: sched_waking
and sched_wakeup
provide nearly the same information. The difference lies in wakeup events across CPUs, which involve inter-processor interrupts. The former is emitted on the source (wakee) CPU, the latter on the destination (waked) CPU. sched_waking
is usually sufficient for latency analysis, unless you are looking into breaking down latency due to inter-processor signaling.
When enabling sched_waking
events, the following will appear in the UI when selecting a CPU slice:
end_state
The sched_slice table contains information on scheduling activity of the system:
> select * from sched_slice limit 1 id type ts dur cpu utid end_state priority 0 sched_slice 70730062200 125364 0 1 S 130
Each row of the table shows when a given thread (utid
) began running (ts
), on which core it ran (cpu
), for how long it ran (dur
), and why it stopped running: end_state
.
end_state
is encoded as one or more ascii characters. The UI uses the following translations to convert end_state
into human readable text:
end_state | Translation |
---|---|
R | Runnable |
R+ | Runnable (Preempted) |
S | Sleeping |
D | Uninterruptible Sleep |
T | Stopped |
t | Traced |
X | Exit (Dead) |
Z | Exit (Zombie) |
x | Task Dead |
I | Idle |
K | Wake Kill |
W | Waking |
P | Parked |
N | No Load |
Not all combinations of characters are meaningful.
If we do not know when the scheduling ended (for example because the trace ended while the thread was still running) end_state
will be NULL
and dur
will be -1.