# CPU Scheduling events

On Android and Linux, Perfetto can gather scheduler traces via the Linux kernel
[ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt)
infrastructure.

This makes it possible to capture fine-grained scheduling events such as:

* Which threads were scheduled on which CPU cores at any point in time, with
  nanosecond accuracy.
* The reason why a running thread got descheduled (e.g. pre-emption, blocked on
  a mutex, a blocking syscall or any other wait queue).
* The point in time when a thread became eligible to be executed, even if it was
  not put immediately on any CPU run queue, together with the source thread that
  made it executable.

## UI

When zoomed out, the UI shows a quantized view of CPU usage, which collapses the
scheduling information:

![]()

However, by zooming in, the individual scheduling events become visible:

![]()

Clicking on a CPU slice shows the relevant information in the details panel:

![]()

Scrolling down and expanding individual processes, the scheduling events also
create one track for each thread, making it possible to follow the evolution of
individual threads' states:

![]()

The data source for these events is `linux.ftrace`, configured as follows:

```protobuf
data_sources {
    config {
        name: "linux.ftrace"
        ftrace_config {
            ftrace_events: "sched/sched_switch"
            ftrace_events: "sched/sched_waking"
        }
    }
}
```
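To make the relationship between events and slices concrete, here is a minimal, illustrative Python sketch of how scheduling slices (as rendered in the UI and exposed in `sched_slice`) can be derived from a stream of `sched_switch` events. The event field names are assumptions for illustration, not the exact ftrace wire format:

```python
# Hypothetical sketch: deriving scheduling slices from sched_switch events.
# Field names (ts, cpu, prev_tid, prev_state, next_tid) mirror the concepts
# in the ftrace event, not its exact format.
from dataclasses import dataclass

@dataclass
class SchedSwitch:
    ts: int          # timestamp, nanoseconds
    cpu: int
    prev_tid: int    # thread being descheduled
    prev_state: str  # state it leaves in, e.g. "S", "D", "R"
    next_tid: int    # thread being scheduled

def slices_from_switches(events):
    """Turn per-CPU sched_switch events into (tid, cpu, ts, dur, end_state)."""
    last_on_cpu = {}  # cpu -> (tid, start_ts)
    out = []
    for ev in sorted(events, key=lambda e: e.ts):
        if ev.cpu in last_on_cpu:
            tid, start = last_on_cpu[ev.cpu]
            # The outgoing thread's slice ends at this switch; its
            # end_state is the prev_state reported by the event.
            out.append((tid, ev.cpu, start, ev.ts - start, ev.prev_state))
        last_on_cpu[ev.cpu] = (ev.next_tid, ev.ts)
    return out

events = [
    SchedSwitch(ts=1000, cpu=0, prev_tid=0, prev_state="R", next_tid=42),
    SchedSwitch(ts=1500, cpu=0, prev_tid=42, prev_state="S", next_tid=0),
]
print(slices_from_switches(events))
# -> [(42, 0, 1000, 500, 'S')]: thread 42 ran for 500 ns, then slept
```

Each slice's duration is simply the distance to the next `sched_switch` on the same CPU, and the `end_state` column comes from the outgoing thread's reported state.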

## SQL

At the SQL level, the scheduling data is exposed in the
[`sched_slice`](/docs/analysis/sql-tables.autogen#sched_slice) table.

```sql
select ts, dur, cpu, end_state, priority, process.name, thread.name
from sched_slice
left join thread using(utid)
left join process using(upid)
```

ts | dur | cpu | end_state | priority | process.name | thread.name
---|-----|-----|-----------|----------|--------------|------------
261187012170995 | 247188 | 2 | S | 130 | /system/bin/logd | logd.klogd
261187012418183 | 12812 | 2 | D | 120 | /system/bin/traced_probes | traced_probes0
261187012421099 | 220000 | 4 | D | 120 | kthreadd | kworker/u16:2
261187012430995 | 72396 | 2 | D | 120 | /system/bin/traced_probes | traced_probes1
261187012454537 | 13958 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0
261187012460318 | 46354 | 3 | S | 120 | /system/bin/traced_probes | traced_probes2
261187012468495 | 10625 | 0 | R | 120 | [NULL] | swapper/0
261187012479120 | 6459 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0
261187012485579 | 7760 | 0 | R | 120 | [NULL] | swapper/0
261187012493339 | 34896 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0
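As a sketch of the kind of aggregation this table enables, total CPU time per thread (the SQL equivalent of summing `dur` grouped by thread name) can be computed from rows shaped like the output above. The sample data below is copied from the table purely for illustration:

```python
# Illustrative sketch: per-thread CPU time from sched_slice-style rows.
# Rows are (ts, dur, cpu, end_state, priority, process_name, thread_name),
# with dur in nanoseconds, as in the table above.
from collections import defaultdict

rows = [
    (261187012418183, 12812, 2, "D", 120, "/system/bin/traced_probes", "traced_probes0"),
    (261187012454537, 13958, 0, "D", 120, "/system/bin/traced_probes", "traced_probes0"),
    (261187012460318, 46354, 3, "S", 120, "/system/bin/traced_probes", "traced_probes2"),
]

def cpu_time_per_thread(rows):
    totals = defaultdict(int)
    for ts, dur, cpu, end_state, prio, pname, tname in rows:
        totals[tname] += dur  # accumulate on-CPU nanoseconds per thread
    return dict(totals)

print(cpu_time_per_thread(rows))
# -> {'traced_probes0': 26770, 'traced_probes2': 46354}
```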

## TraceConfig

```protobuf
data_sources: {
    config {
        name: "linux.ftrace"
        ftrace_config {
            ftrace_events: "sched/sched_switch"
            ftrace_events: "sched/sched_process_exit"
            ftrace_events: "sched/sched_process_free"
            ftrace_events: "task/task_newtask"
            ftrace_events: "task/task_rename"
        }
    }
}

# This is to get full process names and thread<>process relationships.
data_sources: {
    config {
        name: "linux.process_stats"
    }
}
```

## Scheduling wakeups and latency analysis

By additionally enabling the following events in the TraceConfig, the ftrace
data source will also record scheduling wakeup events:

```protobuf
ftrace_events: "sched/sched_wakeup_new"
ftrace_events: "sched/sched_waking"
```

While `sched_switch` events are emitted only when a thread is in the
`R(unnable)` state AND is running on a CPU run queue, `sched_waking` events are
emitted when any event causes a thread state to change.

Consider the following example:

```
Thread A:
  condition_variable.wait()

Thread B:
  condition_variable.notify()
```
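For reference, the same wait/notify pattern in runnable form, using Python's `threading.Condition` purely as an illustration:

```python
# Illustrative version of the wait/notify pattern above. While thread_a is
# blocked in wait(), the kernel would report it in the S(leeping) state;
# notify() makes it R(unnable) again.
import threading

cv = threading.Condition()
ready = False
result = []

def thread_a():
    with cv:
        while not ready:   # Thread A suspends here (state S)
            cv.wait()
        result.append("woken")

t = threading.Thread(target=thread_a)
t.start()

with cv:                   # Thread B side: make Thread A runnable (state R)
    ready = True
    cv.notify()

t.join()
print(result)
# -> ['woken']
```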

When Thread A suspends on the wait(), it enters the `S(leeping)` state and is
removed from the CPU run queue. When Thread B notifies the variable, the kernel
transitions Thread A into the `R(unnable)` state. At that point Thread A is
eligible to be put back on a run queue. However, this might not happen for some
time because, for instance:

* All CPUs might be busy running other threads, and Thread A needs to wait to
  get a run queue slot assigned (or the other threads have higher priority).
* Some CPU other than the current one might be available, but the scheduler's
  load balancer might take some time to migrate the thread to another CPU.

Unless real-time thread priorities are used, most Linux kernel scheduler
configurations are not strictly work-conserving. For instance, the scheduler
might prefer to wait for some time in the hope that the thread running on the
current CPU goes idle, avoiding a cross-CPU migration that might be more costly
in terms of both overhead and power.
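Putting this together, scheduling latency is the gap between a thread's `sched_waking` event and the `sched_switch` that actually puts it on a CPU. A minimal sketch, assuming simplified `(ts, tid)` tuples rather than real ftrace records:

```python
# Hypothetical sketch: pairing each sched_waking event with the next
# sched_switch that schedules the woken thread in, to estimate latency.
# Inputs are simplified (ts, tid) tuples sorted by timestamp.

def wakeup_latencies(wakings, switches):
    """wakings: [(ts, tid)] sched_waking events.
    switches: [(ts, next_tid)] sched_switch events (thread switched in).
    Returns [(tid, latency_ns)] for each waking followed by a switch-in."""
    out = []
    for w_ts, tid in wakings:
        for s_ts, next_tid in switches:
            # First switch-in of this thread at or after the wakeup.
            if next_tid == tid and s_ts >= w_ts:
                out.append((tid, s_ts - w_ts))
                break
    return out

wakings = [(1_000, 42), (5_000, 99)]
switches = [(1_250, 42), (9_000, 99)]
print(wakeup_latencies(wakings, switches))
# -> [(42, 250), (99, 4000)]
```

In this toy data, thread 42 becomes runnable at t=1000 but only runs at t=1250 (250 ns of latency), while thread 99 waits 4000 ns, e.g. because all CPUs were busy.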

NOTE: `sched_waking` and `sched_wakeup` provide nearly the same information. The
difference lies in wakeup events across CPUs, which involve inter-processor
interrupts. The former is emitted on the source (waker) CPU, the latter on the
destination (wakee) CPU. `sched_waking` is usually sufficient for latency
analysis, unless you are looking into breaking down latency due to
inter-processor signaling.

When `sched_waking` events are enabled, the following will appear in the UI when
selecting a CPU slice:

![]()