# Buffers and dataflow

This page describes the dataflow in Perfetto when recording traces. It describes
all the buffering stages, explains how to size the buffers and how to debug
data losses.

## Concepts

Tracing in Perfetto is an asynchronous multiple-writer single-reader pipeline.
In many senses, its architecture is very similar to modern GPUs' command
buffers.

The design principles of the tracing dataflow are:

* The tracing fastpath is based on direct writes into a shared memory buffer.
* Highly optimized for low-overhead writing. NOT optimized for low-latency
  reading.
* Trace data is eventually committed into the central trace buffer by the end
  of the trace or when explicit flush requests are issued via the IPC channel.
* Producers are untrusted and should not be able to see each other's trace data,
  as that would leak sensitive information.

In the general case, there are two types of buffers involved in a trace. When
pulling data from the Linux kernel's ftrace infrastructure, there is a third
stage of buffering (one per CPU) involved:

#### Tracing service's central buffers

These buffers (yellow, in the picture above) are defined by the user in the
`buffers` section of the [trace config](config.md). In the simplest case,
one tracing session = one buffer, regardless of the number of data sources and
producers.

This is the place where the tracing data is ultimately kept, while in memory,
whether it comes from the kernel ftrace infrastructure, from some other data
source in `traced_probes` or from another userspace process using the
[Perfetto SDK](/docs/instrumentation/tracing-sdk.md).
At the end of the trace (or during the trace, if in [streaming mode]) these
buffers are written into the output trace file.

These buffers can contain a mixture of trace packets coming from different data
sources and even different producer processes. What-goes-where is defined in the
[buffers mapping section](config.md#dynamic-buffer-mapping) of the trace config.
Because of this, the tracing buffers are not shared across processes, to avoid
cross-talk and information leaks across producer processes.
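
As a concrete sketch, this is how a single central buffer can be defined
programmatically through the C++ SDK's `perfetto::TraceConfig` class (the size
and the data source name below are illustrative, not recommendations):

```c++
#include <perfetto.h>

// Minimal sketch: one tracing session with a single 16 MB central buffer.
// Data sources target buffer 0 unless told otherwise.
perfetto::TraceConfig BuildSimpleConfig() {
  perfetto::TraceConfig cfg;
  cfg.add_buffers()->set_size_kb(16 * 1024);

  auto* ds_cfg = cfg.add_data_sources()->mutable_config();
  ds_cfg->set_name("linux.ftrace");  // Example data source.
  ds_cfg->set_target_buffer(0);      // Explicit, though 0 is the default.
  return cfg;
}
```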

#### Shared memory buffers

Each producer process has one memory buffer shared 1:1 with the tracing service
(blue, in the picture above), regardless of the number of data sources it hosts.
This buffer is a temporary staging buffer and has two purposes:

1. Zero-copy on the writer path. This buffer allows direct serialization of the
   tracing data from the writer fastpath into a memory region directly readable
   by the tracing service.

2. Decoupling writes from reads of the tracing service. The tracing service has
   the job of moving trace packets from the shared memory buffer (blue) into the
   central buffer (yellow) as fast as it can.
   The shared memory buffer hides the scheduling and response latencies of the
   tracing service, allowing the producer to keep writing without losing data
   when the tracing service is temporarily blocked.

#### Ftrace buffer

When the `linux.ftrace` data source is enabled, the kernel will have its own
per-CPU buffers. These are unavoidable because the kernel cannot write directly
into user-space buffers. The `traced_probes` process will periodically read
those buffers, convert the data into binary protos and follow the same dataflow
as userspace tracing. These buffers need to be just large enough to hold data
between two ftrace read cycles (`TraceConfig.FtraceConfig.drain_period_ms`).

## Life of a trace packet

Here is a summary to understand the dataflow of trace packets across buffers.
Consider the case of a producer process hosting two data sources writing packets
at different rates, both targeting the same central buffer.

1. When each data source starts writing, it will grab a free page of the shared
   memory buffer and directly serialize proto-encoded tracing data onto it.

2. When a page of the shared memory buffer is filled, the producer will send an
   async IPC to the service, asking it to copy the shared memory page just
   written. Then, the producer will grab the next free page in the shared memory
   buffer and keep writing.

3. When the service receives the IPC, it copies the shared memory page into
   the central buffer and marks the shared memory buffer page as free again.
   Data sources within the producer are able to reuse that page at this point.

4. When the tracing session ends, the service sends a `Flush` request to all
   data sources. In reaction to this, data sources commit all outstanding
   shared memory pages, even if not completely full. The service copies these
   pages into its central buffer.
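
From the producer's point of view, steps 1-2 happen transparently inside the
SDK every time a data source writes a packet. Below is a minimal sketch of a
custom data source's write path; `MyDataSource` and the `for_testing` payload
are placeholders, not part of this page's example:

```c++
#include <perfetto.h>

// Hypothetical data source, for illustration only.
class MyDataSource : public perfetto::DataSource<MyDataSource> {};

PERFETTO_DECLARE_DATA_SOURCE_STATIC_MEMBERS(MyDataSource);
PERFETTO_DEFINE_DATA_SOURCE_STATIC_MEMBERS(MyDataSource);

void EmitOnePacket() {
  MyDataSource::Trace([](MyDataSource::TraceContext ctx) {
    // NewTracePacket() serializes directly into the shared memory buffer
    // (step 1). Once the current page fills up, the SDK commits it to the
    // service via an async IPC (step 2).
    auto packet = ctx.NewTracePacket();
    packet->set_timestamp(42);
    packet->set_for_testing()->set_string("example payload");
  });
}
```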

## Buffer sizing

#### Central buffer sizing

The math for sizing the central buffer is quite straightforward: in the default
case of tracing without `write_into_file` (when the trace file is written only
at the end of the trace), the buffer will hold as much data as has been
written by the various data sources.

The total length of the trace will be `(buffer size) / (aggregated write rate)`.
If all producers write at a combined rate of 2 MB/s, a 16 MB buffer will hold
~8 seconds of tracing data.

The write rate is highly dependent on the data sources configured and on the
activity of the system. 1-2 MB/s is a typical figure on Android traces with
scheduler tracing, but it can easily go up by one or more orders of magnitude if
chattier data sources are enabled (e.g., syscall or pagefault tracing).

When using [streaming mode] the buffer needs to be able to hold the data
written between two `file_write_period_ms` periods (default: 5s).
For instance, if `file_write_period_ms = 5000` and the write rate is 2 MB/s,
the central buffer needs to be at least 5 * 2 = 10 MB to avoid data losses.
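
As a sketch, this is how those knobs map onto the C++ SDK's
`perfetto::TraceConfig` (the 2 MB/s figure is just the example rate above, not
a measurement):

```c++
#include <perfetto.h>

// Sketch: size the central buffer for streaming mode, assuming an estimated
// aggregated write rate of ~2 MB/s (from the example above).
perfetto::TraceConfig BuildStreamingConfig() {
  constexpr uint32_t kWriteRateKbPerSec = 2 * 1024;  // Estimated, not measured.
  constexpr uint32_t kFileWritePeriodMs = 5000;

  perfetto::TraceConfig cfg;
  // The buffer must cover at least one file_write_period_ms worth of data.
  cfg.add_buffers()->set_size_kb(kWriteRateKbPerSec * kFileWritePeriodMs / 1000);
  cfg.set_write_into_file(true);  // Streaming mode.
  cfg.set_file_write_period_ms(kFileWritePeriodMs);
  return cfg;
}
```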

#### Shared memory buffer sizing

The sizing of the shared memory buffer depends on:

* The scheduling characteristics of the underlying system, i.e. for how long the
  tracing service can be blocked on the scheduler queues. This is a function of
  the kernel configuration and the niceness level of the `traced` process.
* The max write rate of all data sources within a producer process.

Suppose that a producer writes at a max rate of 8 MB/s. If `traced` gets
blocked for 10 ms, the shared memory buffer needs to be at least
8 MB/s * 0.01 s = 80 KB to avoid losses.

Empirical measurements suggest that on most Android systems a shared memory
buffer size of 128-512 KB is good enough.

The default shared memory buffer size is 256 KB. When using the Perfetto Client
Library, this value can be tweaked by setting `TracingInitArgs.shmem_size_hint_kb`.
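
A minimal sketch of overriding that hint when initializing the client library
(the 512 KB value is only an example):

```c++
#include <perfetto.h>

void InitPerfettoWithLargerShmem() {
  perfetto::TracingInitArgs args;
  args.backends = perfetto::kSystemBackend;  // Talk to the system traced daemon.
  // Request a larger per-producer shared memory buffer (example value).
  args.shmem_size_hint_kb = 512;
  perfetto::Tracing::Initialize(args);
}
```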

WARNING: if a data source writes very large trace packets in a single batch,
either the shared memory buffer needs to be big enough to handle that or
`BufferExhaustedPolicy::kStall` must be employed.

For instance, consider a data source that emits a 2 MB screenshot every 10
seconds. Its (simplified) code would look like:

```c++
for (;;) {
  ScreenshotDataSource::Trace([](ScreenshotDataSource::TraceContext ctx) {
    auto packet = ctx.NewTracePacket();
    packet->set_bitmap(Grab2MBScreenshot());
  });
  std::this_thread::sleep_for(std::chrono::seconds(10));
}
```

Its average write rate is 2 MB / 10 s = 200 KB/s. However, the data source will
create bursts of 2 MB back-to-back without yielding; it is limited only by the
tracing serialization overhead. In practice, it will write the 2 MB buffer at
O(GB/s). If the shared memory buffer is < 2 MB, the tracing service is unlikely
to catch up at that rate and data losses will be experienced.

In a case like this, the options are:

* Increase the size of the shared memory buffer in the producer that hosts the
  data source.
* Split the write into chunks spaced by some delay (see the sketch after the
  code block below).
* Adopt the `BufferExhaustedPolicy::kStall` when defining the data source:

```c++
class ScreenshotDataSource : public perfetto::DataSource<ScreenshotDataSource> {
 public:
  constexpr static BufferExhaustedPolicy kBufferExhaustedPolicy =
      BufferExhaustedPolicy::kStall;
 ...
};
```
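
For the chunking option, a hypothetical sketch building on the
`ScreenshotDataSource` above; `set_bitmap_chunk()` and the 128 KB chunk size
are made up for illustration, standing in for whatever chunk-friendly encoding
the real packet schema would use:

```c++
// Hypothetical: emit the 2 MB bitmap as sixteen 128 KB chunks, yielding
// between chunks so the service can drain the shared memory buffer.
void EmitScreenshotInChunks(const std::string& bitmap) {
  constexpr size_t kChunkSize = 128 * 1024;
  for (size_t off = 0; off < bitmap.size(); off += kChunkSize) {
    ScreenshotDataSource::Trace([&](ScreenshotDataSource::TraceContext ctx) {
      auto packet = ctx.NewTracePacket();
      // set_bitmap_chunk() is a made-up field for this sketch.
      packet->set_bitmap_chunk(bitmap.substr(off, kChunkSize));
    });
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
}
```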

## Debugging data losses

#### Ftrace kernel buffer losses

When using the Linux kernel ftrace data source, losses can occur in the
kernel -> userspace path if the `traced_probes` process gets blocked for too
long.

At the trace proto level, losses in this path are recorded:

* In the [`FtraceCpuStats`][FtraceCpuStats] messages, emitted both at the
  beginning and end of the trace. If the `overrun` field is non-zero, data has
  been lost.
* In the [`FtraceEventBundle.lost_events`][FtraceEventBundle] field. This allows
  locating precisely the point where the data loss happened.

At the TraceProcessor SQL level, this data is available in the `stats` table:

```sql
> select * from stats where name like 'ftrace_cpu_overrun_end'
name                 idx                  severity             source value
-------------------- -------------------- -------------------- ------ ------
ftrace_cpu_overrun_e                    0 data_loss            trace       0
ftrace_cpu_overrun_e                    1 data_loss            trace       0
ftrace_cpu_overrun_e                    2 data_loss            trace       0
ftrace_cpu_overrun_e                    3 data_loss            trace       0
ftrace_cpu_overrun_e                    4 data_loss            trace       0
ftrace_cpu_overrun_e                    5 data_loss            trace       0
ftrace_cpu_overrun_e                    6 data_loss            trace       0
ftrace_cpu_overrun_e                    7 data_loss            trace       0
```

These losses can be mitigated either by increasing
[`TraceConfig.FtraceConfig.buffer_size_kb`][FtraceConfig]
or by decreasing
[`TraceConfig.FtraceConfig.drain_period_ms`][FtraceConfig].
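
A sketch of setting those two knobs through the C++ SDK, assuming the generated
`FtraceConfig` class and the `set_ftrace_config_raw()` setter on
`DataSourceConfig` are available in the build; the event name and values are
examples only:

```c++
#include <perfetto.h>

perfetto::TraceConfig BuildFtraceConfig() {
  // Assumption: perfetto::protos::gen::FtraceConfig is available here.
  perfetto::protos::gen::FtraceConfig ftrace_cfg;
  ftrace_cfg.add_ftrace_events("sched/sched_switch");
  ftrace_cfg.set_buffer_size_kb(4096);  // Bigger kernel per-CPU buffers.
  ftrace_cfg.set_drain_period_ms(250);  // Read them more often.

  perfetto::TraceConfig cfg;
  cfg.add_buffers()->set_size_kb(32 * 1024);
  auto* ds_cfg = cfg.add_data_sources()->mutable_config();
  ds_cfg->set_name("linux.ftrace");
  // The ftrace config is carried as a serialized sub-message.
  ds_cfg->set_ftrace_config_raw(ftrace_cfg.SerializeAsString());
  return cfg;
}
```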

#### Shared memory losses

Tracing data can be lost in the shared memory buffer due to bursts while
`traced` is blocked.

At the trace proto level, losses in this path are recorded:

* In [`TraceStats.BufferStats.trace_writer_packet_loss`][BufferStats].
* In [`TracePacket.previous_packet_dropped`][TracePacket].
  Caveat: the very first packet emitted by every data source is also marked as
  `previous_packet_dropped=true`. This is because the service has no way to
  tell whether that was truly the first packet or whether everything before it
  was lost.

At the TraceProcessor SQL level, this data is available in the `stats` table:

```sql
> select * from stats where name = 'traced_buf_trace_writer_packet_loss'
name                 idx                  severity             source    value
-------------------- -------------------- -------------------- --------- -----
traced_buf_trace_wri                    0 data_loss            trace         0
```

#### Central buffer losses

Data losses in the central buffer can happen for two different reasons:

1. When using `fill_policy: RING_BUFFER`, older tracing data is overwritten by
   virtue of wrapping in the ring buffer.
   These losses are recorded, at the trace proto level, in
   [`TraceStats.BufferStats.chunks_overwritten`][BufferStats].

2. When using `fill_policy: DISCARD`, newer tracing data committed after the
   buffer is full is dropped.
   These losses are recorded, at the trace proto level, in
   [`TraceStats.BufferStats.chunks_discarded`][BufferStats].

At the TraceProcessor SQL level, this data is available in the `stats` table,
one entry per central buffer:

```sql
> select * from stats where name = 'traced_buf_chunks_overwritten' or name = 'traced_buf_chunks_discarded'
name                 idx                  severity             source  value
-------------------- -------------------- -------------------- ------- -----
traced_buf_chunks_di                    0 info                 trace       0
traced_buf_chunks_ov                    0 data_loss            trace       0
```

Summary: the best way to detect and debug data losses is to use Trace Processor
and issue the query:
`select * from stats where severity = 'data_loss' and value != 0`

## Atomicity and ordering guarantees

A "writer sequence" is the sequence of trace packets emitted by a given
TraceWriter from a data source. In almost all cases, one data source maps to
one or more TraceWriters; data sources that support writing from multiple
threads typically create one TraceWriter per thread.

* Trace packets written from a sequence are emitted in the trace file in the
  same order they have been written.

* There is no ordering guarantee between packets written by different sequences.
  Sequences are, by design, concurrent and more than one linearization is
  possible. The service does NOT respect global timestamp ordering across
  different sequences. Even if two packets from two sequences were emitted in
  global timestamp order, the service can still emit them in the trace file in
  the opposite order.

* Trace packets are atomic. If a trace packet is emitted in the trace file, it
  is guaranteed to contain all the fields that the data source wrote. If a
  trace packet is large and spans across several shared memory buffer pages, the
  service will save it in the trace file only if it can observe that all
  fragments have been committed without gaps.

* If a trace packet is lost (e.g. because of wrapping in the ring buffer
  or losses in the shared memory buffer), no further trace packets will be
  emitted for that sequence until all the packets before the gap are dropped
  as well.
  In other words, if the tracing service ends up in a situation where it sees
  packets 1, 2, 5, 6 for a sequence, it will only emit 1, 2. If, however, new
  packets (e.g., 7, 8, 9) are written and they overwrite 1, 2, clearing the gap,
  the full sequence 5, 6, 7, 8, 9 will be emitted.
  This behavior, however, doesn't hold when using [streaming mode] because,
  in that case, the periodic read consumes the packets in the buffer and
  clears the gaps, allowing the sequence to restart.

## Incremental state in trace packets

In many cases trace packets are fully independent of each other and can be
processed and interpreted without further context.
In some cases, however, they can have _incremental state_ and behave similarly
to inter-frame video encoding techniques, where some frames require the keyframe
to be present to be meaningfully decoded.

Here are two concrete examples:

1. Ftrace scheduling slices and /proc/pid scans. Ftrace scheduling events are
   keyed by thread id. In most cases users want to map those events back to the
   parent process (the thread-group). To solve this, when both the
   `linux.ftrace` and the `linux.process_stats` data sources are enabled in a
   Perfetto trace, the latter captures process<>thread associations from
   the /proc pseudo-filesystem, whenever a new thread-id is seen by ftrace.
   A typical trace in this case looks as follows:
   ```
   # From process_stats's /proc scanner.
   pid: 610; ppid: 1; cmdline: "/system/bin/surfaceflinger"

   # From ftrace
   timestamp: 95054961131912; sched_wakeup: pid: 610;     target_cpu: 2;
   timestamp: 95054977528943; sched_switch: prev_pid: 610 prev_prio: 98
   ```
   The /proc entry is emitted only once per process to avoid bloating the size
   of the trace. In the absence of data losses this is enough to reconstruct
   all scheduling events for that pid. If, however, the process_stats packet
   gets dropped in the ring buffer, there will be no way left to work out the
   process details for all the other ftrace events that refer to that PID.

2. The [Track Event library](/docs/instrumentation/track-events) in the Perfetto
   SDK makes extensive use of string interning. Most strings and descriptors
   (e.g. details about processes / threads) are emitted only once and later
   referred to using a monotonic ID. If the descriptor packet is lost, it is
   not possible to fully make sense of those events.

Trace Processor has a built-in mechanism that detects the loss of interning data
and skips ingesting packets that refer to missing interned strings or
descriptors.

When using tracing in ring-buffer mode, these types of losses are very likely to
happen.

There are two mitigations for this (both are sketched in the snippet after the
list):

1. Issuing periodic invalidations of the incremental state via
   [`TraceConfig.IncrementalStateConfig.clear_period_ms`][IncrStateConfig].
   This will cause the data sources that make use of incremental state to
   periodically drop the interning / process mapping tables and re-emit the
   descriptors / strings on the next occurrence. This mitigates the problem
   quite well in the context of ring-buffer traces, as long as the
   `clear_period_ms` is one order of magnitude lower than the estimated length
   of trace data in the central trace buffer.

2. Recording the incremental state into a dedicated buffer (via
   `DataSourceConfig.target_buffer`). This technique is quite commonly used in
   the ftrace + process_stats example mentioned before, recording the
   process_stats packets in a dedicated buffer that is less likely to wrap
   (ftrace events are much more frequent than descriptors for new processes).
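
A combined sketch of the two mitigations using the C++ SDK's config classes,
assuming the generated setters follow the usual proto naming; buffer indices,
sizes and the 5000 ms period are illustrative:

```c++
#include <perfetto.h>

// Sketch: periodic incremental-state clearing plus a dedicated buffer for
// the process_stats descriptors.
perfetto::TraceConfig BuildConfigWithIncrementalStateMitigations() {
  perfetto::TraceConfig cfg;
  cfg.add_buffers()->set_size_kb(32 * 1024);  // Buffer 0: ftrace ring buffer.
  cfg.add_buffers()->set_size_kb(2 * 1024);   // Buffer 1: rarely-wrapping buffer.

  // Mitigation 1: drop and re-emit interned state every 5 seconds.
  cfg.mutable_incremental_state_config()->set_clear_period_ms(5000);

  // Mitigation 2: route the incremental state to the dedicated buffer.
  auto* ftrace_ds = cfg.add_data_sources()->mutable_config();
  ftrace_ds->set_name("linux.ftrace");
  ftrace_ds->set_target_buffer(0);

  auto* ps_ds = cfg.add_data_sources()->mutable_config();
  ps_ds->set_name("linux.process_stats");
  ps_ds->set_target_buffer(1);
  return cfg;
}
```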

## Flushes and windowed trace importing

Another common problem experienced in traces that involve multiple data sources
is the non-synchronous nature of trace commits. As explained in the
[Life of a trace packet](#life-of-a-trace-packet) section above, trace data is
committed only when a full memory page of the shared memory buffer is filled (or
when the tracing session ends). In most cases, if data sources produce events
at a regular cadence, pages are filled quite quickly and events are committed
into the central buffers within seconds.

In some other cases, however, a data source can emit events only sporadically.
Imagine the case of a data source that emits events when the display is turned
on/off. Such an infrequent event might end up being staged in the shared memory
buffer for a very long time and can end up being committed into the trace buffer
hours after it happened.

Another scenario where this can happen is when using ftrace and when a
particular CPU is idle most of the time or gets hot-unplugged (ftrace uses
per-CPU buffers). In this case a CPU might record little or no data for several
minutes while the other CPUs pump thousands of new trace events per second.

This causes two side effects that end up breaking user expectations or causing
bugs:

* The UI can show an abnormally long timeline with a huge gap in the middle.
  The packet ordering of events doesn't matter for the UI because events are
  sorted by timestamp at import time. The trace in this case will contain very
  recent events plus a handful of stale events that happened hours before. The
  UI, for correctness, will try to display all events, showing a handful of
  early events, followed by a huge temporal gap when nothing happened,
  followed by the stream of recent events.

* When recording long traces, Trace Processor can show import errors of the form
  "XXX event out-of-order". This is because, in order to limit the memory usage
  at import time, Trace Processor sorts events using a sliding window. If trace
  packets are too out-of-order (trace file order vs timestamp order), the
  sorting will fail and some packets will be dropped.

#### Mitigations

The best mitigation for these sorts of problems is to specify a
[`flush_period_ms`][TraceConfig] in the trace config (10-30 seconds is usually
good enough for most cases), especially when recording long traces.

This will cause the tracing service to issue periodic flush requests to data
sources. A flush request causes the data source to commit the shared memory
buffer pages into the central buffer, even if they are not completely full.
By default, a flush is issued only at the end of the trace.
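
As a sketch, this is the corresponding knob in the C++ SDK's
`perfetto::TraceConfig` (30 s is just the ballpark suggested above):

```c++
#include <perfetto.h>

perfetto::TraceConfig BuildLongTraceConfig() {
  perfetto::TraceConfig cfg;
  cfg.add_buffers()->set_size_kb(64 * 1024);
  // Ask the service to flush all data sources every 30 seconds, so that
  // sporadic events don't linger in the shared memory buffers.
  cfg.set_flush_period_ms(30 * 1000);
  return cfg;
}
```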

In case of long traces recorded without `flush_period_ms`, another option is to
pass the `--full-sort` option to `trace_processor_shell` when importing the
trace. Doing so will disable the windowed sorting at the cost of higher
memory usage (the trace file will be fully buffered in memory before parsing).

[streaming mode]: /docs/concepts/config#long-traces
[TraceConfig]: /docs/reference/trace-config-proto.autogen#TraceConfig
[FtraceConfig]: /docs/reference/trace-config-proto.autogen#FtraceConfig
[IncrStateConfig]: /docs/reference/trace-config-proto.autogen#TraceConfig.IncrementalStateConfig
[FtraceCpuStats]: /docs/reference/trace-packet-proto.autogen#FtraceCpuStats
[FtraceEventBundle]: /docs/reference/trace-packet-proto.autogen#FtraceEventBundle
[TracePacket]: /docs/reference/trace-packet-proto.autogen#TracePacket
[BufferStats]: /docs/reference/trace-packet-proto.autogen#TraceStats.BufferStats