commit | a9f287ea4bd3f01cc10c5d56c3de5965fd133726 | [log] [tgz] |
---|---|---|
author | Ryan Savitski <rsavitski@google.com> | Tue Mar 19 02:36:23 2024 +0000 |
committer | Ryan Savitski <rsavitski@google.com> | Tue Mar 19 02:36:23 2024 +0000 |
tree | 68065660c3a59012c317ff08125967b8f8527f39 | |
parent | 7e63f42d755ca255162f1f327126f90e34465581 [diff] |
ftrace: write "last read event timestamp" to assist parsing of ring buffer traces

-- Problem: --

Ftrace data is recorded per-cpu, grouping series of events from a given
per-cpu kernel buffer into an ftrace_event_bundle packet. When a perfetto
trace uses RING_BUFFER, wrapping of the oldest trace packets can leave us
with very imbalanced per-cpu state, as entire bundles of events are
overwritten. This in turn is problematic when interpreting certain ftrace
events requires pairing them up (e.g. slice begin/end) or otherwise
tracking transitions, and the related events are not constrained to a
single cpu.

Consider the following scenario of a {begin, end} pair of events that
represent a tracing slice. Assuming there's one bundle per cpu:

cpu0: [e...E..e..e] <- bundle
           ^event end
cpu1: [.B.....e..eee.] <- bundle
        ^event begin

It is possible for us to observe cpu1's events while cpu0's have been
overwritten in the ring buffer. In such a case, a naive parsing of all
events would conclude that the slice began and never ended.

The above is just the simplest example; things get more complicated once
we start reasoning about differing kernel tracing bandwidth across cpus,
and userspace transcoding breaking up reads in terms of KBs of tracing
data instead of timestamps.

-- Solution: --

Since traced_probes is the exclusive reader of the per-cpu kernel ring
buffers, it knows the latest "consumed" event's timestamp for each buffer.
By putting this timestamp into the *next* bundle, we record that "all
ftrace data in this bundle is guaranteed to cover the time range starting
from where the last read stopped".

Example sketch. Initial state of the per-cpu ring buffers:

cpu0: .eeee.e.
cpu1: ee...ee.
cpu2: ..ee.ee.

After a read pass, some data has been consumed from each buf:

cpu0: XXXee.e.
cpu1: XXXXXXe.
cpu2: XXXe.ee.

Now, on the *next* bundle per cpu, we'll emit how far the consumed ('X')
area extends.

In turn, from the trace_processor perspective, we can drop events until
the earliest timestamp covered by all per-cpu event bundles. This is the
maximum of all per-cpu "valid from" timestamps.

Bug: 192586066
Change-Id: Iebc6952a2f2a6a399b8cde64f210e28a69c04111
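The trimming rule described above can be sketched in a few lines. This is a hypothetical illustration, not Perfetto's actual trace_processor code; the names `valid_from_cutoff` and `drop_stale_events` are invented for this sketch.

```python
# Hypothetical sketch of the ring-buffer trimming rule described in the
# commit message. Names are invented for illustration; this is not
# Perfetto's real API.

def valid_from_cutoff(per_cpu_valid_from):
    """Earliest timestamp covered by *all* per-cpu bundles.

    per_cpu_valid_from: one "last read event timestamp" per cpu, taken
    from each cpu's bundles after the ring buffer wrapped.
    """
    return max(per_cpu_valid_from)

def drop_stale_events(events, cutoff):
    """Keep only (cpu, ts, name) events at or after the cutoff."""
    return [e for e in events if e[1] >= cutoff]

# cpu0's data is only valid from t=40, cpu1's from t=25:
cutoff = valid_from_cutoff([40, 25])
events = [(1, 30, "B"), (0, 45, "E"), (1, 50, "B")]
kept = drop_stale_events(events, cutoff)
# The t=30 "begin" on cpu1 is dropped: cpu0 may have lost its matching end.
```

Taking the maximum (rather than the minimum) of the per-cpu timestamps is what guarantees every surviving event lies in a range where no cpu's bundle data has been overwritten.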
Perfetto is a production-grade open-source stack for performance instrumentation and trace analysis. It offers services and libraries for recording system-level and app-level traces, native + java heap profiling, a library for analyzing traces using SQL, and a web-based UI to visualize and explore multi-GB traces.
See https://perfetto.dev/docs or the /docs/ directory for documentation.