# Buffers and dataflow

This page describes the dataflow in Perfetto when recording traces. It describes
all the buffering stages, explains how to size the buffers and how to debug
data losses.

## Concepts

Tracing in Perfetto is an asynchronous multiple-writer single-reader pipeline.
In many senses, its architecture is very similar to modern GPUs' command
buffers.

The design principles of the tracing dataflow are:

* The tracing fastpath is based on direct writes into a shared memory buffer.
* Highly optimized for low-overhead writing. NOT optimized for low-latency
  reading.
* Trace data is eventually committed into the central trace buffer by the end
  of the trace or when explicit flush requests are issued via the IPC channel.
* Producers are untrusted and should not be able to see each other's trace data,
  as that would leak sensitive information.

In the general case, there are two types of buffers involved in a trace. When
pulling data from the Linux kernel's ftrace infrastructure, there is a third
stage of buffering (one per CPU) involved:

#### Tracing service's central buffers

These buffers (yellow, in the picture above) are defined by the user in the
`buffers` section of the [trace config](config.md). In the simplest case,
one tracing session = one buffer, regardless of the number of data sources and
producers.

This is the place where the tracing data is ultimately kept, while in memory,
whether it comes from the kernel ftrace infrastructure, from some other data
source in `traced_probes` or from another userspace process using the
[Perfetto SDK](/docs/instrumentation/tracing-sdk.md).
At the end of the trace (or during the trace, if in [streaming mode]) these
buffers are written into the output trace file.

These buffers can contain a mixture of trace packets coming from different data
sources and even different producer processes. What-goes-where is defined in the
[buffers mapping section](config.md#dynamic-buffer-mapping) of the trace config.
Because of this, the tracing buffers are not shared across processes, to avoid
cross-talk and information leaks across producer processes.
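
As a concrete sketch, this is how a single central buffer can be defined
programmatically through the C++ SDK's `perfetto::TraceConfig` class (the size
and the data source name below are illustrative, not recommendations):

```c++
#include <perfetto.h>

// Minimal sketch: one tracing session with a single 16 MB central buffer.
// Data sources target buffer 0 unless told otherwise.
perfetto::TraceConfig BuildSimpleConfig() {
  perfetto::TraceConfig cfg;
  cfg.add_buffers()->set_size_kb(16 * 1024);

  auto* ds_cfg = cfg.add_data_sources()->mutable_config();
  ds_cfg->set_name("linux.ftrace");  // Example data source.
  ds_cfg->set_target_buffer(0);      // Explicit, though 0 is the default.
  return cfg;
}
```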

#### Shared memory buffers

Each producer process has one memory buffer shared 1:1 with the tracing service
(blue, in the picture above), regardless of the number of data sources it hosts.
This buffer is a temporary staging buffer and has two purposes:

1. Zero-copy on the writer path. This buffer allows direct serialization of the
   tracing data from the writer fastpath into a memory region directly readable
   by the tracing service.

2. Decoupling writes from reads of the tracing service. The tracing service has
   the job of moving trace packets from the shared memory buffer (blue) into the
   central buffer (yellow) as fast as it can.
   The shared memory buffer hides the scheduling and response latencies of the
   tracing service, allowing the producer to keep writing without losing data
   when the tracing service is temporarily blocked.

#### Ftrace buffer

When the `linux.ftrace` data source is enabled, the kernel will have its own
per-CPU buffers. These are unavoidable because the kernel cannot write directly
into user-space buffers. The `traced_probes` process will periodically read
those buffers, convert the data into binary protos and follow the same dataflow
as userspace tracing. These buffers need to be just large enough to hold data
between two ftrace read cycles (`TraceConfig.FtraceConfig.drain_period_ms`).

## Life of a trace packet

Here is a summary to understand the dataflow of trace packets across buffers.
Consider the case of a producer process hosting two data sources writing packets
at different rates, both targeting the same central buffer.

1. When each data source starts writing, it will grab a free page of the shared
   memory buffer and directly serialize proto-encoded tracing data onto it.

2. When a page of the shared memory buffer is filled, the producer will send an
   async IPC to the service, asking it to copy the shared memory page just
   written. Then, the producer will grab the next free page in the shared memory
   buffer and keep writing.

3. When the service receives the IPC, it copies the shared memory page into
   the central buffer and marks the shared memory buffer page as free again.
   Data sources within the producer are able to reuse that page at this point.

4. When the tracing session ends, the service sends a `Flush` request to all
   data sources. In reaction to this, data sources commit all outstanding
   shared memory pages, even if not completely full. The service copies these
   pages into its central buffer.
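
From the producer's point of view, steps 1-2 happen transparently inside the
SDK every time a data source writes a packet. Below is a minimal sketch of a
custom data source's write path; `MyDataSource` and the `for_testing` payload
are placeholders, not part of this page's example:

```c++
#include <perfetto.h>

// Hypothetical data source, for illustration only.
class MyDataSource : public perfetto::DataSource<MyDataSource> {};

PERFETTO_DECLARE_DATA_SOURCE_STATIC_MEMBERS(MyDataSource);
PERFETTO_DEFINE_DATA_SOURCE_STATIC_MEMBERS(MyDataSource);

void EmitOnePacket() {
  MyDataSource::Trace([](MyDataSource::TraceContext ctx) {
    // NewTracePacket() serializes directly into the shared memory buffer
    // (step 1). Once the current page fills up, the SDK commits it to the
    // service via an async IPC (step 2).
    auto packet = ctx.NewTracePacket();
    packet->set_timestamp(42);
    packet->set_for_testing()->set_string("example payload");
  });
}
```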

## Buffer sizing

#### Central buffer sizing

The math for sizing the central buffer is quite straightforward: in the default
case of tracing without `write_into_file` (when the trace file is written only
at the end of the trace), the buffer will hold as much data as has been
written by the various data sources.

The total length of the trace will be `(buffer size) / (aggregated write rate)`.
If all producers write at a combined rate of 2 MB/s, a 16 MB buffer will hold
~8 seconds of tracing data.

The write rate is highly dependent on the data sources configured and on the
activity of the system. 1-2 MB/s is a typical figure on Android traces with
scheduler tracing, but it can easily go up by one or more orders of magnitude if
chattier data sources are enabled (e.g., syscall or pagefault tracing).

When using [streaming mode] the buffer needs to be able to hold the data
written between two `file_write_period_ms` periods (default: 5s).
For instance, if `file_write_period_ms = 5000` and the write rate is 2 MB/s,
the central buffer needs to be at least 5 * 2 = 10 MB to avoid data losses.
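
As a sketch, this is how those knobs map onto the C++ SDK's
`perfetto::TraceConfig` (the 2 MB/s figure is just the example rate above, not
a measurement):

```c++
#include <perfetto.h>

// Sketch: size the central buffer for streaming mode, assuming an estimated
// aggregated write rate of ~2 MB/s (from the example above).
perfetto::TraceConfig BuildStreamingConfig() {
  constexpr uint32_t kWriteRateKbPerSec = 2 * 1024;  // Estimated, not measured.
  constexpr uint32_t kFileWritePeriodMs = 5000;

  perfetto::TraceConfig cfg;
  // The buffer must cover at least one file_write_period_ms worth of data.
  cfg.add_buffers()->set_size_kb(kWriteRateKbPerSec * kFileWritePeriodMs / 1000);
  cfg.set_write_into_file(true);  // Streaming mode.
  cfg.set_file_write_period_ms(kFileWritePeriodMs);
  return cfg;
}
```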

#### Shared memory buffer sizing

The sizing of the shared memory buffer depends on:

* The scheduling characteristics of the underlying system, i.e. for how long the
  tracing service can be blocked on the scheduler queues. This is a function of
  the kernel configuration and the niceness level of the `traced` process.
* The max write rate of all data sources within a producer process.

Suppose that a producer writes at a max rate of 8 MB/s. If `traced` gets
blocked for 10 ms, the shared memory buffer needs to be at least
8 MB/s * 0.01 s = 80 KB to avoid losses.

Empirical measurements suggest that on most Android systems a shared memory
buffer size of 128-512 KB is good enough.

The default shared memory buffer size is 256 KB. When using the Perfetto Client
Library, this value can be tweaked by setting `TracingInitArgs.shmem_size_hint_kb`.
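
A minimal sketch of overriding that hint when initializing the client library
(the 512 KB value is only an example):

```c++
#include <perfetto.h>

void InitPerfettoWithLargerShmem() {
  perfetto::TracingInitArgs args;
  args.backends = perfetto::kSystemBackend;  // Talk to the system traced daemon.
  // Request a larger per-producer shared memory buffer (example value).
  args.shmem_size_hint_kb = 512;
  perfetto::Tracing::Initialize(args);
}
```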

WARNING: if a data source writes very large trace packets in a single batch,
either the shared memory buffer needs to be big enough to handle that or
`BufferExhaustedPolicy::kStall` must be employed.

For instance, consider a data source that emits a 2 MB screenshot every 10
seconds. Its (simplified) code would look like:

```c++
for (;;) {
  ScreenshotDataSource::Trace([](ScreenshotDataSource::TraceContext ctx) {
    auto packet = ctx.NewTracePacket();
    packet->set_bitmap(Grab2MBScreenshot());
  });
  std::this_thread::sleep_for(std::chrono::seconds(10));
}
```

Its average write rate is 2 MB / 10 s = 200 KB/s. However, the data source will
create bursts of 2 MB back-to-back without yielding; it is limited only by the
tracing serialization overhead. In practice, it will write the 2 MB buffer at
O(GB/s). If the shared memory buffer is < 2 MB, the tracing service is unlikely
to catch up at that rate and data losses will be experienced.

In a case like this, the options are:

* Increase the size of the shared memory buffer in the producer that hosts the
  data source.
* Split the write into chunks spaced by some delay (see the sketch after the
  code block below).
* Adopt the `BufferExhaustedPolicy::kStall` when defining the data source:

```c++
class ScreenshotDataSource : public perfetto::DataSource<ScreenshotDataSource> {
 public:
  constexpr static BufferExhaustedPolicy kBufferExhaustedPolicy =
      BufferExhaustedPolicy::kStall;
 ...
};
```
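
For the chunking option, a hypothetical sketch building on the
`ScreenshotDataSource` above; `set_bitmap_chunk()` and the 128 KB chunk size
are made up for illustration, standing in for whatever chunk-friendly encoding
the real packet schema would use:

```c++
// Hypothetical: emit the 2 MB bitmap as sixteen 128 KB chunks, yielding
// between chunks so the service can drain the shared memory buffer.
void EmitScreenshotInChunks(const std::string& bitmap) {
  constexpr size_t kChunkSize = 128 * 1024;
  for (size_t off = 0; off < bitmap.size(); off += kChunkSize) {
    ScreenshotDataSource::Trace([&](ScreenshotDataSource::TraceContext ctx) {
      auto packet = ctx.NewTracePacket();
      // set_bitmap_chunk() is a made-up field for this sketch.
      packet->set_bitmap_chunk(bitmap.substr(off, kChunkSize));
    });
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
}
```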

## Debugging data losses

#### Ftrace kernel buffer losses

When using the Linux kernel ftrace data source, losses can occur in the
kernel -> userspace path if the `traced_probes` process gets blocked for too
long.

At the trace proto level, losses in this path are recorded:

* In the [`FtraceCpuStats`][FtraceCpuStats] messages, emitted both at the
  beginning and end of the trace. If the `overrun` field is non-zero, data has
  been lost.
* In the [`FtraceEventBundle.lost_events`][FtraceEventBundle] field. This allows
  locating precisely the point where the data loss happened.

At the TraceProcessor SQL level, this data is available in the `stats` table:

```sql
> select * from stats where name like 'ftrace_cpu_overrun_end'
name                 idx                  severity             source value
-------------------- -------------------- -------------------- ------ ------
ftrace_cpu_overrun_e                    0 data_loss            trace       0
ftrace_cpu_overrun_e                    1 data_loss            trace       0
ftrace_cpu_overrun_e                    2 data_loss            trace       0
ftrace_cpu_overrun_e                    3 data_loss            trace       0
ftrace_cpu_overrun_e                    4 data_loss            trace       0
ftrace_cpu_overrun_e                    5 data_loss            trace       0
ftrace_cpu_overrun_e                    6 data_loss            trace       0
ftrace_cpu_overrun_e                    7 data_loss            trace       0
```

These losses can be mitigated either by increasing
[`TraceConfig.FtraceConfig.buffer_size_kb`][FtraceConfig]
or by decreasing
[`TraceConfig.FtraceConfig.drain_period_ms`][FtraceConfig].
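
A sketch of setting those two knobs through the C++ SDK, assuming the generated
`FtraceConfig` class and the `set_ftrace_config_raw()` setter on
`DataSourceConfig` are available in the build; the event name and values are
examples only:

```c++
#include <perfetto.h>

perfetto::TraceConfig BuildFtraceConfig() {
  // Assumption: perfetto::protos::gen::FtraceConfig is available here.
  perfetto::protos::gen::FtraceConfig ftrace_cfg;
  ftrace_cfg.add_ftrace_events("sched/sched_switch");
  ftrace_cfg.set_buffer_size_kb(4096);  // Bigger kernel per-CPU buffers.
  ftrace_cfg.set_drain_period_ms(250);  // Read them more often.

  perfetto::TraceConfig cfg;
  cfg.add_buffers()->set_size_kb(32 * 1024);
  auto* ds_cfg = cfg.add_data_sources()->mutable_config();
  ds_cfg->set_name("linux.ftrace");
  // The ftrace config is carried as a serialized sub-message.
  ds_cfg->set_ftrace_config_raw(ftrace_cfg.SerializeAsString());
  return cfg;
}
```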

#### Shared memory losses

Tracing data can be lost in the shared memory buffer due to bursts while
`traced` is blocked.

At the trace proto level, losses in this path are recorded:

* In [`TraceStats.BufferStats.trace_writer_packet_loss`][BufferStats].
* In [`TracePacket.previous_packet_dropped`][TracePacket].
  Caveat: the very first packet emitted by every data source is also marked as
  `previous_packet_dropped=true`. This is because the service has no way to
  tell whether that was truly the first packet or whether everything before it
  was lost.

At the TraceProcessor SQL level, this data is available in the `stats` table:

```sql
> select * from stats where name = 'traced_buf_trace_writer_packet_loss'
name                 idx                  severity             source    value
-------------------- -------------------- -------------------- --------- -----
traced_buf_trace_wri                    0 data_loss            trace         0
```

#### Central buffer losses

Data losses in the central buffer can happen for two different reasons:

1. When using `fill_policy: RING_BUFFER`, older tracing data is overwritten by
   virtue of wrapping in the ring buffer.
   These losses are recorded, at the trace proto level, in
   [`TraceStats.BufferStats.chunks_overwritten`][BufferStats].

2. When using `fill_policy: DISCARD`, newer tracing data committed after the
   buffer is full is dropped.
   These losses are recorded, at the trace proto level, in
   [`TraceStats.BufferStats.chunks_discarded`][BufferStats].

At the TraceProcessor SQL level, this data is available in the `stats` table,
one entry per central buffer:

```sql
> select * from stats where name = 'traced_buf_chunks_overwritten' or name = 'traced_buf_chunks_discarded'
name                 idx                  severity             source  value
-------------------- -------------------- -------------------- ------- -----
traced_buf_chunks_di                    0 info                 trace       0
traced_buf_chunks_ov                    0 data_loss            trace       0
```

Summary: the best way to detect and debug data losses is to use Trace Processor
and issue the query:
`select * from stats where severity = 'data_loss' and value != 0`

## Atomicity and ordering guarantees

A "writer sequence" is the sequence of trace packets emitted by a given
TraceWriter from a data source. In almost all cases, one data source maps to
one or more TraceWriters; data sources that support writing from multiple
threads typically create one TraceWriter per thread.

* Trace packets written from a sequence are emitted in the trace file in the
  same order they have been written.

* There is no ordering guarantee between packets written by different sequences.
  Sequences are, by design, concurrent and more than one linearization is
  possible. The service does NOT respect global timestamp ordering across
  different sequences. Even if two packets from two sequences were emitted in
  global timestamp order, the service can still emit them in the trace file in
  the opposite order.

* Trace packets are atomic. If a trace packet is emitted in the trace file, it
  is guaranteed to contain all the fields that the data source wrote. If a
  trace packet is large and spans across several shared memory buffer pages, the
  service will save it in the trace file only if it can observe that all
  fragments have been committed without gaps.

* If a trace packet is lost (e.g. because of wrapping in the ring buffer
  or losses in the shared memory buffer), no further trace packets will be
  emitted for that sequence until all the packets before the gap are dropped
  as well.
  In other words, if the tracing service ends up in a situation where it sees
  packets 1, 2, 5, 6 for a sequence, it will only emit 1, 2. If, however, new
  packets (e.g., 7, 8, 9) are written and they overwrite 1, 2, clearing the gap,
  the full sequence 5, 6, 7, 8, 9 will be emitted.
  This behavior, however, doesn't hold when using [streaming mode] because,
  in that case, the periodic read consumes the packets in the buffer and
  clears the gaps, allowing the sequence to restart.

## Incremental state in trace packets

In many cases trace packets are fully independent of each other and can be
processed and interpreted without further context.
In some cases, however, they can have _incremental state_ and behave similarly
to inter-frame video encoding techniques, where some frames require the keyframe
to be present to be meaningfully decoded.

Here are two concrete examples:

1. Ftrace scheduling slices and /proc/pid scans. Ftrace scheduling events are
   keyed by thread id. In most cases users want to map those events back to the
   parent process (the thread-group). To solve this, when both the
   `linux.ftrace` and the `linux.process_stats` data sources are enabled in a
   Perfetto trace, the latter captures process<>thread associations from
   the /proc pseudo-filesystem, whenever a new thread-id is seen by ftrace.
   A typical trace in this case looks as follows:
   ```
   # From process_stats's /proc scanner.
   pid: 610; ppid: 1; cmdline: "/system/bin/surfaceflinger"

   # From ftrace
   timestamp: 95054961131912; sched_wakeup: pid: 610;     target_cpu: 2;
   timestamp: 95054977528943; sched_switch: prev_pid: 610 prev_prio: 98
   ```
   The /proc entry is emitted only once per process to avoid bloating the size
   of the trace. In the absence of data losses this is enough to reconstruct
   all scheduling events for that pid. If, however, the process_stats packet
   gets dropped in the ring buffer, there will be no way left to work out the
   process details for all the other ftrace events that refer to that PID.

2. The [Track Event library](/docs/instrumentation/track-events) in the Perfetto
   SDK makes extensive use of string interning. Most strings and descriptors
   (e.g. details about processes / threads) are emitted only once and later
   referred to using a monotonic ID. If the descriptor packet is lost, it is
   not possible to fully make sense of those events.

Trace Processor has a built-in mechanism that detects the loss of interning data
and skips ingesting packets that refer to missing interned strings or
descriptors.

When using tracing in ring-buffer mode, these types of losses are very likely to
happen.

There are two mitigations for this (both are sketched in the snippet after the
list):

1. Issuing periodic invalidations of the incremental state via
   [`TraceConfig.IncrementalStateConfig.clear_period_ms`][IncrStateConfig].
   This will cause the data sources that make use of incremental state to
   periodically drop the interning / process mapping tables and re-emit the
   descriptors / strings on the next occurrence. This mitigates the problem
   quite well in the context of ring-buffer traces, as long as the
   `clear_period_ms` is one order of magnitude lower than the estimated length
   of trace data in the central trace buffer.

2. Recording the incremental state into a dedicated buffer (via
   `DataSourceConfig.target_buffer`). This technique is quite commonly used in
   the ftrace + process_stats example mentioned before, recording the
   process_stats packets in a dedicated buffer that is less likely to wrap
   (ftrace events are much more frequent than descriptors for new processes).
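
A combined sketch of the two mitigations using the C++ SDK's config classes,
assuming the generated setters follow the usual proto naming; buffer indices,
sizes and the 5000 ms period are illustrative:

```c++
#include <perfetto.h>

// Sketch: periodic incremental-state clearing plus a dedicated buffer for
// the process_stats descriptors.
perfetto::TraceConfig BuildConfigWithIncrementalStateMitigations() {
  perfetto::TraceConfig cfg;
  cfg.add_buffers()->set_size_kb(32 * 1024);  // Buffer 0: ftrace ring buffer.
  cfg.add_buffers()->set_size_kb(2 * 1024);   // Buffer 1: rarely-wrapping buffer.

  // Mitigation 1: drop and re-emit interned state every 5 seconds.
  cfg.mutable_incremental_state_config()->set_clear_period_ms(5000);

  // Mitigation 2: route the incremental state to the dedicated buffer.
  auto* ftrace_ds = cfg.add_data_sources()->mutable_config();
  ftrace_ds->set_name("linux.ftrace");
  ftrace_ds->set_target_buffer(0);

  auto* ps_ds = cfg.add_data_sources()->mutable_config();
  ps_ds->set_name("linux.process_stats");
  ps_ds->set_target_buffer(1);
  return cfg;
}
```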

## Flushes and windowed trace importing

Another common problem experienced in traces that involve multiple data sources
is the non-synchronous nature of trace commits. As explained in the
[Life of a trace packet](#life-of-a-trace-packet) section above, trace data is
committed only when a full memory page of the shared memory buffer is filled (or
when the tracing session ends). In most cases, if data sources produce events
at a regular cadence, pages are filled quite quickly and events are committed
into the central buffers within seconds.

In some other cases, however, a data source can emit events only sporadically.
Imagine the case of a data source that emits events when the display is turned
on/off. Such an infrequent event might end up being staged in the shared memory
buffer for a very long time and can end up being committed into the trace buffer
hours after it happened.

Another scenario where this can happen is when using ftrace and when a
particular CPU is idle most of the time or gets hot-unplugged (ftrace uses
per-CPU buffers). In this case a CPU might record little or no data for several
minutes while the other CPUs pump thousands of new trace events per second.

This causes two side effects that end up breaking user expectations or causing
bugs:

* The UI can show an abnormally long timeline with a huge gap in the middle.
  The packet ordering of events doesn't matter for the UI because events are
  sorted by timestamp at import time. The trace in this case will contain very
  recent events plus a handful of stale events that happened hours before. The
  UI, for correctness, will try to display all events, showing a handful of
  early events, followed by a huge temporal gap when nothing happened,
  followed by the stream of recent events.

* When recording long traces, Trace Processor can show import errors of the form
  "XXX event out-of-order". This is because, in order to limit the memory usage
  at import time, Trace Processor sorts events using a sliding window. If trace
  packets are too out-of-order (trace file order vs timestamp order), the
  sorting will fail and some packets will be dropped.

#### Mitigations

The best mitigation for these sorts of problems is to specify a
[`flush_period_ms`][TraceConfig] in the trace config (10-30 seconds is usually
good enough for most cases), especially when recording long traces.

This will cause the tracing service to issue periodic flush requests to data
sources. A flush request causes the data source to commit the shared memory
buffer pages into the central buffer, even if they are not completely full.
By default, a flush is issued only at the end of the trace.
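
As a sketch, this is the corresponding knob in the C++ SDK's
`perfetto::TraceConfig` (30 s is just the ballpark suggested above):

```c++
#include <perfetto.h>

perfetto::TraceConfig BuildLongTraceConfig() {
  perfetto::TraceConfig cfg;
  cfg.add_buffers()->set_size_kb(64 * 1024);
  // Ask the service to flush all data sources every 30 seconds, so that
  // sporadic events don't linger in the shared memory buffers.
  cfg.set_flush_period_ms(30 * 1000);
  return cfg;
}
```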

In case of long traces recorded without `flush_period_ms`, another option is to
pass the `--full-sort` option to `trace_processor_shell` when importing the
trace. Doing so will disable the windowed sorting at the cost of higher
memory usage (the trace file will be fully buffered in memory before parsing).

[streaming mode]: /docs/concepts/config#long-traces
[TraceConfig]: /docs/reference/trace-config-proto.autogen#TraceConfig
[FtraceConfig]: /docs/reference/trace-config-proto.autogen#FtraceConfig
[IncrStateConfig]: /docs/reference/trace-config-proto.autogen#TraceConfig.IncrementalStateConfig
[FtraceCpuStats]: /docs/reference/trace-packet-proto.autogen#FtraceCpuStats
[FtraceEventBundle]: /docs/reference/trace-packet-proto.autogen#FtraceEventBundle
[TracePacket]: /docs/reference/trace-packet-proto.autogen#TracePacket
[BufferStats]: /docs/reference/trace-packet-proto.autogen#TraceStats.BufferStats