commit | c065c727a6fd630cc419c75795e3b802bc0f72b1 | [log] [tgz] |
---|---|---|
author | Lalit Maganti <lalitm@google.com> | Wed Apr 10 13:09:03 2019 +0100 |
committer | Lalit Maganti <lalitm@google.com> | Wed Apr 10 12:12:46 2019 +0000 |
tree | f89880569f19f5916fce93aed9a47e4689f3998a | |
parent | 86820d1f3a7d486e40bbaeda455b030d71f07c40 [diff] |
traced_probes: fix CpuReaders not waking up on non-first tracing session Symptoms of the issue: Massive overruns would be observed in the ftrace buffers even though the buffers are more than big enough for the ftrace flush period for the volume of data being written. This would only happen on the Android build but not on the standalone build started with tmux (though this was a red herring - see below for why). Upon further investigation, with a program which would context switch every 1ms (while confined to the BIG cores of the phone), we found that traced_probes was not being scheduled for up to 12s after the initial read of the buffers. However, there was plenty of CPU available because the little cores were essentially free so this was not because of starvation but rather some logical bug somewhere causing us not to wake up correctly. Cause: This is a subtle bug caused by not resetting the cpus_to_drain field when we start a tracing session. This means we leak state between an old tracing session (which likely has the cpus_to_drain field set to some non-zero value) and the current tracing session. This is a problem because this variable is checked by the CpuReaders to decide whether to wake up the main thread to drain the buffer. As every CpuReader was seeing this non-zero value, everyone thought the flush had already been scheduled which means no one scheduled the flush. This leads to the drains being ignored for all of trace; with traced_probes only waking up if there was a spurious wakeup or similar. Resolution: Similar to how flush_acks are reset just before a kFlush is issued in FtraceController, clear the cpus_to_drain just before a kRun is issued. This matches the code with the documented behaviour in ftrace_thread_sync.h which says that we are meant to do this. Unittest failure: There were actually 4 bugs in the FtraceControllerTest::BackToBackEnableDisable 1. We were expecting only one Post{Task, DelayedTask} each which is preceisely the behaviour we are trying to fix by the above change - we should get 2 PostTasks as we have two data sources and the state of the first should not affect the state of the second. 2. We were not running the DelayedTask even while we were expecting it to be present. 3. We were not resetting the data available callback between data sources which is wrong because the callback captures state about the generation which is out of date for the second data source. 4. We were relying on unspecified behaviour of std::function clearing the task when we moved from it which is not guaranteed to happen. Fix all of these bugs which essentially makes this test a correctness check for the real change. Interesting notes: This took 1 day of 3.5 engineers time for a 1 line fix. As usual, the simple, subtle failures are the hardest to find. Bug: 130202267 Change-Id: I8f5045adda84b633b80186a1e7080886366cd923
Perfetto is an open-source project for performance instrumentation and tracing of Linux/Android/Chrome platforms and user-space apps.
See www.perfetto.dev for docs.