commit | 9ed18981b172e0ef8f5f1a42f22733f4656138bd | |
---|---|---|
author | Ryan <rsavitski@google.com> | Wed Jul 24 15:04:40 2019 +0100 |
committer | Ryan <rsavitski@google.com> | Wed Jul 24 15:04:40 2019 +0100 |
tree | 561451a1059dc5ff141e6f935c3129abad1d98ad | |
parent | 05c925d2e7f0ffd5ef6d603bcbda90398c48e9e2 | |

traced_probes ftrace: switch back to single-threaded, nonblocking read(2)-only approach

Per the recent performance measurements (go/perfetto-bg-cpu), we no longer
think that the multi-threading and block/nonblock splice/read approach is
worthwhile in its current state. It is not just highly complex, it also has
an unnecessarily high cpu% overhead.

The reading is now guided by a single repeating task that reads & parses the
contents of all per-cpu ftrace pipes.

What we lose in this version:
* ability to sleep until a page of ftrace events is filled (with blocking
  splice). This would only make a difference for tracing sessions with truly
  low-frequency events (not important to optimize for atm).
* scalability for many-core machines. This version works well for an 8 core
  phone, but is likely to struggle on a 64 core workstation. Let's treat this
  patch as a reset for complexity, and reintroduce it only as necessary.
  Update: ran on my 72 core dev workstation as a smoke test, it kept up fine.
* possibly splice efficiency? Haven't tried a single-thread splicer, but the
  bigger immediate wins are probably in the parsing code (this version is at
  5:1 utime:stime ratio according to my measurements).

Rough measurements of traced_probes cpu% on a crosshatch (standalone ndk
build + tmux script), with the methodology as in go/perfetto-bg-cpu:

tuned cfg: 32k page (i.e. chunk) size, 1s ftrace drain period.

idle device, default cfg: 2.4%
idle device, tuned cfg: <1%
video rec, default cfg: 13.5%
video rec, tuned cfg: 6%

So we're doing much better with a tuned config, waking up for 60ms to process
all cores once a second.

Unfortunately this patch is lacking in programmatic tests, I'm not really
sure which ones would be worthwhile with the existing mock-heavy test setups.
This will likely require a separate pass (in a separate cl) to be more
unit-testable (it'd be nice to test the cpu_reader loop stop/continue logic).

Bug: 133312949
Change-Id: Ia79a267f43214f336b5396f4dd5789bc49ab1e67

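
For readers unfamiliar with the pattern, the "single repeating task that reads all per-cpu pipes" idea can be sketched roughly as below. This is an illustrative sketch only, not the code in this change: `SetNonBlocking`, `DrainOne`, `DrainAll`, `kPageSize` and the per-cpu page cap are hypothetical names and parameters, and parsing is reduced to a comment. It assumes each file descriptor stands for one per-cpu pipe (on a real device that would presumably be a `per_cpu/cpuN/trace_pipe_raw` file under tracefs) already opened for reading.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <vector>

#include <fcntl.h>
#include <unistd.h>

// Hypothetical read granularity; assumed here to be one 4 KiB page.
constexpr size_t kPageSize = 4096;

// Switches |fd| to nonblocking mode so that read(2) returns EAGAIN instead of
// sleeping when no data is available. Returns false on failure.
bool SetNonBlocking(int fd) {
  int flags = fcntl(fd, F_GETFL, 0);
  if (flags == -1)
    return false;
  return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == 0;
}

// Reads from one pipe until it is empty, hits EOF, errors out, or |max_pages|
// read attempts have been made. Returns the number of bytes consumed.
size_t DrainOne(int fd, size_t max_pages) {
  assert(fd >= 0);
  char buf[kPageSize];
  size_t total = 0;
  for (size_t i = 0; i < max_pages; i++) {
    ssize_t rd = read(fd, buf, sizeof(buf));
    if (rd > 0) {
      total += static_cast<size_t>(rd);
      // A real reader would parse the events in buf[0..rd) here.
      continue;
    }
    if (rd == -1 && errno == EINTR)
      continue;  // Interrupted by a signal: retry (uses up one attempt).
    break;  // rd == 0 (EOF), EAGAIN (pipe empty), or a real error.
  }
  return total;
}

// One pass of the repeating task: visit every per-cpu pipe once, on the
// calling thread, never blocking. Returns the total bytes consumed.
size_t DrainAll(const std::vector<int>& fds, size_t max_pages_per_cpu) {
  size_t total = 0;
  for (int fd : fds)
    total += DrainOne(fd, max_pages_per_cpu);
  return total;
}
```

A caller would invoke `DrainAll` once per drain period (for example every 1s with the tuned config above) from a single task runner. The per-cpu cap is one assumed way to keep a busy cpu from starving the others within a pass. The trade-off matches the message above: the task can no longer sleep until a page fills, but no per-cpu threads or splice plumbing are needed.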
Perfetto is an open-source project for performance instrumentation and tracing of Linux/Android/Chrome platforms and user-space apps.
See www.perfetto.dev for docs.