This document covers the design of TraceBufferV2, the 2025 rewrite of the core trace buffer code, done on the occasion of ProtoVM.
TraceBuffer is the non-shared userspace buffer that is used by the tracing service to hold traced data in memory, until it's either read back or written into a file. There is one TraceBuffer instance for each buffers section of the trace config.
NOTE: This section assumes you are familiar with the core concepts exposed in Buffers and dataflow.
TraceBuffer is a ring buffer on steroids. Unfortunately, due to the complications of the protocol (see the Challenges section), it is far from a plain byte-oriented FIFO ring buffer when it comes to readback and deletions.
Before delving into its complications, let's explore its key operations.
Logically TraceBuffer deals with overlapping streams of data, called TraceWriter Sequences, or in short just Sequences:
From a TraceBuffer viewpoint, the only abstractions visible are the Producer (identified by a uint16_t ProducerID) and the TraceWriter (identified by a uint16_t WriterID, scoped within a producer). The 32-bit tuple {ProducerID, WriterID} constitutes the unique Sequence ID for TraceBuffer. Everything in TraceBuffer is keyed by that.
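For illustration, a minimal sketch (not the actual Perfetto code; the helper name is made up) of how a {ProducerID, WriterID} pair can be packed into a single 32-bit key suitable for such lookups:

```cpp
#include <cstdint>

// Illustrative sketch only: the real field and type names in TraceBuffer may
// differ. It shows how two 16-bit ids can form a single 32-bit sequence key.
using ProducerID = uint16_t;
using WriterID = uint16_t;
using ProducerAndWriterId = uint32_t;

constexpr ProducerAndWriterId MakeSequenceKey(ProducerID p, WriterID w) {
  return (static_cast<uint32_t>(p) << 16) | w;
}

static_assert(MakeSequenceKey(1, 2) == 0x00010002u, "packing sanity check");
```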
Basic operation:
- Each chunk is keyed by {ProducerID,WriterID}, has a sequence ID and flags.

Readback gives the following guarantees:

- Data losses are signaled by setting the TracePacket.previous_packet_dropped flag.

Readback happens in the following cases:
Code-wise there are four main entry-points:
Writer-side (Producer-side):
- CopyChunkUntrusted(): called when a CommitData IPC is received, or when the service performs SMB Scraping (more below).
- TryPatchChunkContents(): still part of the CommitData IPC.

Reader-side:
- BeginRead(): called at readback time, at the beginning of each read batch.
- ReadNextTracePacket(): called once for each packet, until either there are no more packets in the buffer or TracingServiceImpl decides it has read enough data for the current task (to avoid saturating the IPC channel).

TraceBuffer can operate in two modes.
RING_BUFFER: this is the mode used by the majority of traces, and also the one with the most complications. This document focuses on the operation in RING_BUFFER mode, unless otherwise specified. This mode can be used for pure ring-buffer tracing, or can be coupled with write_into_file to have long traces streamed to disk, in which case the ring buffer mainly serves to decouple the SMB from the I/O activity (and to handle fragment reassembly).
DISCARD: this mode is used for one-shot traces where the user cares about the left-most (earliest) part of the trace. It is conceptually easier: once it reaches the end of the buffer, TraceBuffer stops accepting data.
There is a slight behavioural change from the V1 implementation. V1 tried to be (too) smart about DISCARD and allowed writers to keep writing data into the buffer as long as the write and read cursors never crossed (i.e. as long as the reader kept up). This turned out to be useless and confusing: coupling DISCARD with write_into_file leads to a scenario where DISCARD behaves almost like a RING_BUFFER. However, if the reader doesn't catch up (e.g. due to lack of CPU bandwidth), TraceBuffer stops accepting data (forever). We later realized this was a confusing feature (a ring buffer that suddenly stops) and added warnings when trying to combine the two.
V2 doesn't try to be smart about readbacks and simply stops once we reach the end of the buffer, whether it has been read or not.
Packet fragmentation is the cause of most of TraceBuffer's design complexity.
Simple Fragmentation Example:
Chunk A (ChunkID=100)     Chunk B (ChunkID=101)     Chunk C (ChunkID=102)
┌─────────────────────┐     ┌─────────────────────┐     ┌─────────────────────┐
│[Packet1: Complete]  │     │[Packet2: Continue]  │     │[Packet2: End]       │
│[Packet2: Begin]     │     │ flags: kContFromPrev│     │ flags: kContFromPrev│
│ flags: kContOnNext  │     │        kContOnNext  │     │[Packet3: Complete]  │
└─────────────────────┘     └─────────────────────┘     └─────────────────────┘
Fragmentation Chain: Packet2 = [Begin] → [Continue] → [End]
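To make the continuation flags concrete, here is a hedged sketch, using the simplified flag names from the diagram above rather than the real SharedMemoryABI definitions, of how a reader can classify each fragment of a chunk:

```cpp
#include <cstdint>

// Simplified flags, mirroring the diagram above (the real ABI flag names differ).
enum ChunkFlags : uint8_t {
  kContFromPrev = 1 << 0,  // First fragment continues a packet from the previous chunk.
  kContOnNext = 1 << 1,    // Last fragment continues into the next chunk.
};

enum class FragType { kWholePacket, kBegin, kContinue, kEnd };

// Classifies the |i|-th of |num_frags| fragments in a chunk with |flags|.
FragType ClassifyFragment(uint32_t i, uint32_t num_frags, uint8_t flags) {
  const bool cont_from_prev = (i == 0) && (flags & kContFromPrev);
  const bool cont_on_next = (i == num_frags - 1) && (flags & kContOnNext);
  if (cont_from_prev && cont_on_next) return FragType::kContinue;  // e.g. Chunk B above.
  if (cont_from_prev) return FragType::kEnd;   // e.g. Packet2's fragment in Chunk C.
  if (cont_on_next) return FragType::kBegin;   // e.g. Packet2's fragment in Chunk A.
  return FragType::kWholePacket;               // e.g. Packet1 and Packet3.
}
```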
Key Fragmentation Challenges:
- Chunks marked kChunkNeedsPatching block readback until they are patched.

Out-of-order commits are rare but regularly present. They happen due to a feature called SMB Scraping, introduced in the early years of Perfetto.
SMB scraping happens when TracingServiceImpl, upon a Flush, forcefully reads the chunks in the SMB, even if they are not marked as completed, and writes them into the trace buffer.
This was necessary to deal with data sources like TrackEvent that can be used on arbitrary threads that don't have a TaskRunner, where it would be impossible to issue a PostTask(FlushTask) upon a trace protocol Flush request.
The challenge is that TracingServiceImpl, when scraping, scans the SMB in linear order and commits chunks as found. But that linear order translates into “chunk allocation order”, which is unpredictable, effectively causing chunks to be committed in random order.
In practice, these instances are relatively rare, as they happen:
Important note: TraceBuffer assumes that all out-of-order commits are batched together atomically. The only known use case for OOO is SMB scraping, which commits all scraped chunks in one go within a single TaskRunner task.
Hence we assume that the following cannot happen:
The logic in TraceBufferV2 treats any ChunkID gap identified, after having sorted chunks by ChunkID, as a data loss.
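As a minimal sketch of what such a gap check amounts to (names illustrative; it relies on unsigned wrap-around of ChunkIDs):

```cpp
#include <cstdint>

using ChunkID = uint32_t;

// Illustrative only: detects a gap between the last consumed chunk and the
// next one in the same sequence. ChunkIDs are monotonic per-writer and wrap
// around naturally thanks to unsigned arithmetic.
bool HasChunkIdGap(ChunkID last_chunk_id_consumed, ChunkID next_chunk_id) {
  return next_chunk_id != static_cast<ChunkID>(last_chunk_id_consumed + 1);
}
```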
There are several paths that can lead to a data loss, and TraceBuffer must track and report all of them. Debugging data losses is a very common activity. It's extremely important that TraceBuffer signals any case of data loss.
There are several different types and causes of data losses:
- Packets dropped by the TraceWriter itself, signaled by a packet of size kPacketSizeDropPacket at the end of the next chunk.

When a packet spans several fragments, it almost always involves patching, due to the nature of protobuf encoding. The problem is the following:
- The length field of a nested protobuf message is not known until the writer finishes it, which can happen after the chunk containing the start of the message has already been committed. The writer therefore reserves the length bytes and later asks the service to patch them, with requests of the form: for chunk N of {ProducerID,WriterID}, patch offset X with contents [DE,AD,BE,EF].

From a protocol viewpoint, only the last fragment of a chunk can be patched.
The information about “needs patching” is held in the SMB's Chunk flags (kChunkNeedsPatching).
The kChunkNeedsPatching state is cleared via the CommitData IPC: alongside each patch's offset and payload, the IPC carries a bool has_more_patches flag; when that flag is false, the kChunkNeedsPatching state is cleared.
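A minimal sketch of what applying one such patch amounts to, assuming 4-byte patches and illustrative names (the real TryPatchChunkContents() additionally has to locate the chunk by {ProducerID, WriterID, ChunkID} first):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative sketch: applies a 4-byte patch at |offset| within a chunk's
// payload, after bounds-checking the untrusted offset.
constexpr size_t kPatchSize = 4;

bool ApplyPatch(uint8_t* payload, size_t payload_size, size_t offset,
                const uint8_t (&patch)[kPatchSize]) {
  if (payload_size < kPatchSize || offset > payload_size - kPatchSize)
    return false;  // Malformed/untrusted offset: reject.
  memcpy(payload + offset, patch, kPatchSize);
  return true;
}
```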
From a TraceBuffer viewpoint patching has the following implications:
Recommit means committing again a chunk that already exists in the buffer, with the same ChunkID. The only legitimate case of recommit is SMB scraping followed by an actual commit. We don't expect nor support Producers trying to re-commit the same chunk N times, as that unavoidably leads to undefined behaviour (what if the tracing service has already written the packets to a file?).
This is the scenario when recommit can legitimately happen:
- The chunk is first scraped and copied into the buffer via CopyChunkUntrusted() with the chunk_complete=false argument.
- The producer later commits the final, complete version of the same chunk.

NOTE: kChunkNeedsPatching and kIncomplete are two different and orthogonal chunk states. kIncomplete has nothing to do with fragments and is purely about SMB scraping (and the fact that we have to be conservative and ignore the last fragment).
Implications:
The main complication of incomplete chunks is that we cannot know the size of their payload upfront. Because of this, we have to conservatively copy the chunk and reserve its whole size in the buffer.
Buffer cloning happens via the CloneReadOnly() method. As the name suggests, it creates a new TraceBuffer instance which contains the same contents but can only be read, not written into. This is to support CLONE_SNAPSHOT triggers.
Architecturally buffer cloning is not particularly complicated, at least in the current design. The main design implications are:
ProtoVM is an upcoming feature of TracingService. It's a non-Turing-complete VM language to describe proto merging operations, used to keep track of the state of arbitrary proto-encoded data structures as we overwrite data in the trace buffer. ProtoVM is the reason that triggered the redesign of TraceBuffer V2.
Without getting into its details, the primary requirement of ProtoVM is the following: when overwriting chunks in the trace buffer, we must pass valid packets from these soon-to-be-deleted chunks to ProtoVM. We must do so in order, hence replicating the same logic that we would use when doing a readback.
Internal docs about ProtoVM:
For the aforementioned ProtoVM reasons, in the V2 design the logic that deals with ring buffer overwrites (DeleteNextChunksFor()) is almost identical to, and shares most of its code with, the readback logic.
I say almost because there is (of course) a subtle difference: when deleting chunks, stalling (either due to pending patches or incompleteness) is NOT an option. Old chunks MUST go, to make space for new chunks, no matter what.
So overwrites are the equivalent of a no-stalling force-delete readback.
There are two main data structures involved:
TBChunk is the struct stored in the trace buffer memory as a result of calling CopyChunkUntrusted() on an SMB chunk.
A TBChunk is very similar to an SMB chunk, with the following caveats:

- The sizeof() of both is the same (16 bytes). This is very important to keep patch offsets consistent (see the illustrative sketch after this list).
- The SMB chunk maintains a counter of fragments. TBChunk instead does byte-based bookkeeping, as that reduces the complexity of the read iterators.
- The layout of the fields is slightly different, but they both contain ProducerID, WriterID, ChunkID, fragment counts/sizes and flags. The SMB chunk layout is an ABI; the TBChunk layout is not: it is an implementation detail and can change.
- TBChunk maintains a basic checksum for each chunk (used only in debug builds).
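Purely as an illustration of the points above, a hypothetical 16-byte header with the kind of fields just listed (the real TBChunk layout is an implementation detail and differs):

```cpp
#include <cstdint>

// Hypothetical layout, for illustration only. The real TBChunk fields and
// their ordering differ; the point is that a 16-byte header carries the
// sequence identity, the ChunkID, the payload size and the flags of each
// chunk copied into the buffer.
struct IllustrativeTBChunk {
  uint32_t chunk_id;      // Position of this chunk within its sequence.
  uint32_t payload_size;  // Byte-based bookkeeping of the fragments' payload.
  uint16_t producer_id;   // Together with writer_id, identifies the sequence.
  uint16_t writer_id;
  uint16_t flags;         // Continuation / needs-patching / incomplete state.
  uint16_t checksum;      // Debug-only integrity check.
};
static_assert(sizeof(IllustrativeTBChunk) == 16,
              "matches the 16-byte size mentioned above");
```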
In a nutshell:

- The buffer's base::PagedMemory contains a sequence of chunks.
- Each chunk is a struct TBChunk header followed by its fragments' payload.

SequenceState maintains the state of a {ProducerID, WriterID} sequence.
Its most important feature is maintaining a (logically) ordered list of TBChunk(s) for that sequence, sorted by ChunkID. The “list” is actually a CircularQueue of offsets, which has O(1) push_back() and pop_front() operations.
The buffer keeps a map (sequences_) of ProducerAndWriterId -> SequenceState, with one SequenceState for each {Producer,Writer} active in the buffer. SequenceState holds:

- last_chunk_id_consumed, to detect gaps in the ChunkID sequence (data losses).
- An ordered list (a CircularQueue<size_t>) of chunks, which stores their offsets in the buffer. The chunks queue is kept sorted and updated as chunks are appended and consumed (removed) from the buffer.

The lifetime of a SequenceState has a subtle tradeoff: last_chunk_id_consumed is needed to detect gaps in the ChunkIDs even after all of a sequence's chunks have been consumed, but keeping every SequenceState alive forever would grow memory unboundedly. TraceBufferV2 balances this using a lazy-sweeping approach: it allows the most recently emptied SequenceStates to stay alive, up to kKeepLastEmptySeq = 1024. See DeleteStaleEmptySequences().
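A condensed sketch of this bookkeeping, with names following the text above (std::deque and std::map stand in for the buffer's own containers; the real class holds more state):

```cpp
#include <cstdint>
#include <deque>
#include <map>

// Illustrative sketch of the per-sequence bookkeeping described above.
// std::deque stands in for Perfetto's CircularQueue, which also offers
// O(1) push_back() / pop_front().
using ChunkID = uint32_t;
using ProducerAndWriterId = uint32_t;  // Packed {ProducerID, WriterID}.

struct SequenceStateSketch {
  // Offsets (into the buffer's PagedMemory) of this sequence's TBChunks,
  // kept sorted by ChunkID and updated as chunks are appended/consumed.
  std::deque<size_t> chunks;
  // Used to detect ChunkID gaps (data losses) even after chunks are consumed.
  ChunkID last_chunk_id_consumed = 0;
  bool data_loss = false;
};

// One SequenceState per {Producer, Writer} active in the buffer.
using SequencesMap = std::map<ProducerAndWriterId, SequenceStateSketch>;
```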
FragIterator is a simple class that tokenizes fragments in a chunk and allows forward-only iteration.
It deals with untrusted data, detecting malformed / out of bounds scenarios.
It does not alter the state of the buffer.
ChunkSeqIterator is a simple utility class that iterates over the ordered list of TBChunks for a given SequenceState. It merely follows the SequenceState.chunks queue and detects gaps.
ChunkSeqReader encapsulates most of the readback complexity. It reads and consumes chunks in sequence order, as follows:
- Given a starting chunk, it locates its position in SequenceState.chunks (using ChunkSeqIterator) and starts the iteration from there.

Chunks can be visited in two different ways:
- Buffer order: in the order they have been written into the buffer. In the example below: A1, B1, B3, A2, B2.
- Sequence order: in the order they appear in the SequenceState's list.
When chunks are written via CopyChunkUntrusted(), a new TBChunk is allocated in the buffer's PagedMemory using the usual bump-pointer pattern you'd expect from a ring buffer. Chunks are variable-size and are stored contiguously with 32-bit alignment.
The offset of the chunk is also appended to the SequenceState.chunks list.
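A rough sketch of the bump-pointer step just described, with illustrative names and constants; wrapping and the overwrite path (next section) are deliberately omitted:

```cpp
#include <cstddef>
#include <cstdint>

// Simplified bump-pointer allocation of a chunk slot, ignoring wrapping and
// overwrites. Chunks are variable-size and stored contiguously, with their
// outer size rounded up to 32-bit alignment.
constexpr size_t kAlignment = 4;
constexpr size_t kTBChunkHeaderSize = 16;

size_t AllocateChunkSlot(size_t* write_off, size_t buffer_size,
                         size_t payload_size) {
  size_t outer_size = kTBChunkHeaderSize + payload_size;
  outer_size = (outer_size + kAlignment - 1) & ~(kAlignment - 1);  // Round up.
  if (*write_off + outer_size > buffer_size)
    return SIZE_MAX;  // Out of linear space: the ring buffer would wrap here.
  const size_t chunk_off = *write_off;
  *write_off += outer_size;  // Bump the write cursor.
  return chunk_off;
}
```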
After the first wrap-around, writing a chunk involves deleting one or more existing chunks. The deletion operation, DeleteNextChunksFor(), is as complex as a readback, because it reconstructs the packets being deleted, in order, to pass them to ProtoVM.
So the writing itself is straightforward, but the deletion (overwrite) of existing chunks is where most of the complexity lies. This is described in the next section.
flowchart TD
  A[DeleteNextChunksFor<br/>bytes_to_clear] --> B[Initialize: off = wr_<br/>clear_end = wr_ + bytes_to_clear]
  B --> C{off < clear_end?}
  C -->|No| M[Create padding chunks<br/>for partial deletions]
  C -->|Yes| D{off >= used_size_?}
  D -->|Yes| N[Break - nothing to delete<br/>in unused space]
  D -->|No| E[chunk = GetTBChunkAt off]
  E --> F{chunk.is_padding?}
  F -->|Yes| G[Update padding stats<br/>off += chunk.outer_size]
  F -->|No| H[Create ChunkSeqReader<br/>in kEraseMode]
  H --> I[ReadNextPacketInSeqOrder loop]
  I --> J{Packet found?}
  J -->|Yes| K[Pass packet to ProtoVM<br/>has_cleared_fragments = true]
  J -->|No| L{has_cleared_fragments?}
  K --> I
  L -->|Yes| O[Mark sequence data_loss = true]
  L -->|No| P[No data loss]
  O --> Q[Update overwrite stats<br/>off += chunk.outer_size]
  P --> Q
  Q --> R{More chunks in range?}
  R -->|Yes| C
  R -->|No| M
  G --> C
  M --> S[Scan remaining range for padding]
  S --> T{Partial chunk at end?}
  T -->|Yes| U[Create new padding chunk<br/>for remaining space]
  T -->|No| V[End]
  U --> V
  N --> V
  style A fill:#e1f5fe
  style K fill:#fff3e0
  style O fill:#ffcdd2
  style V fill:#c8e6c9
Key Differences from ReadNextTracePacket:
Readback (ReadNextTracePacket()) is where most of TraceBuffer's complexity lies, as it needs to reassemble packets from fragments, deal with gaps / data losses, and handle the interleaving of chunks from different sequences and out-of-order commits.
flowchart TD
  A[ReadNextTracePacket Start] --> B{chunk_seq_reader_ exists?}
  B -->|No| C[Get chunk at rd_]
  C --> D{Is chunk padding?}
  D -->|Yes| E[rd_ += chunk.outer_size<br/>Check wrap around]
  D -->|No| F[Create ChunkSeqReader<br/>for this chunk]
  B -->|Yes| G[ReadNextPacketInSeqOrder]
  F --> G
  G --> H{Packet found?}
  H -->|Yes| I[Set sequence properties<br/>Set data_loss flag<br/>Return packet]
  H -->|No| J[Get end chunk from reader<br/>rd_ = end_offset + size]
  J --> K{rd_ == wr_ OR<br/>wrapped to wr_?}
  K -->|Yes| L[Return false - no more data]
  K -->|No| M[Reset chunk_seq_reader_<br/>Handle wrap around]
  E --> N{rd_ wrapped around?}
  N -->|Yes| O[rd_ = 0]
  N -->|No| P[Continue with new rd_]
  O --> K
  P --> K
  M --> B
  style A fill:#e1f5fe
  style I fill:#c8e6c9
  style L fill:#ffcdd2
This is how ReadNextTracePacket() works:
ReadNextTracePacket() invocation. Done.

flowchart TD
  A[ReadNextPacketInSeqOrder] --> B{skip_in_generation?}
  B -->|Yes| C[Return false - stalled]
  B -->|No| D[NextFragmentInChunk]
  D --> E{Fragment found?}
  E -->|Yes| F{Fragment type?}
  F -->|kFragWholePacket| G[ConsumeFragment<br/>Return packet]
  F -->|kFragBegin| H[ReassembleFragmentedPacket]
  F -->|kFragEnd/Continue| I[Data loss - unexpected<br/>ConsumeFragment<br/>Continue loop]
  E -->|No| J{Chunk corrupted?}
  J -->|Yes| K[Mark data_loss = true]
  J -->|No| L{Chunk incomplete?}
  L -->|Yes| M[Set skip_in_generation<br/>Return false]
  L -->|No| N[EraseCurrentChunk]
  K --> N
  N --> O{Reached end chunk?}
  O -->|Yes| P[Return false]
  O -->|No| Q[NextChunkInSequence]
  Q --> R{Next chunk exists?}
  R -->|No| P
  R -->|Yes| S[iter_ = next_chunk<br/>Create new FragIterator]
  S --> D
  H --> T{Reassembly result?}
  T -->|Success| U[Return reassembled packet]
  T -->|NotEnoughData| V[Set skip_in_generation<br/>Return false]
  T -->|DataLoss| W[Mark data_loss = true<br/>Continue loop]
  style A fill:#e1f5fe
  style G fill:#c8e6c9
  style U fill:#c8e6c9
  style C fill:#ffcdd2
  style P fill:#ffcdd2
  style V fill:#fff3e0
But things are more complicated. Let's first take only out-of-ordering into the picture. With reference to the drawing above, let's imagine the write cursor is @ offset=48, right before B3.
If we proceeded simply in buffer order we would break FIFO-ness, as we would first emit the packets contained in B3, then A2 (this is fine) and ultimately B2 (this is problematic).
The only valid linearizations that preserve in-sequence FIFO-ness would be [A2,B2,B3], [B2,B3,A2] or [B2,A2,B3].
In order to deal with this we introduce a two-layer walk in the readback code:
- An outer walk, which advances through the buffer and looks up each chunk's sequence in the sequences_ map.
- An inner walk over that sequence's SequenceState.chunks ordered list.

In the code, the outer walk is implemented by TraceBufferV2::ReadNextTracePacket(), while the inner walk is implemented by ChunkSeqReader::ReadNextPacket().
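To tie the two walks together, a heavily simplified sketch; it ignores stalling, patching, wrapping and fragment reassembly, and the types/names are invented for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <map>
#include <string>
#include <vector>

// Heavily simplified two-layer walk: the outer loop advances over chunks in
// buffer (write) order; whenever it lands on a chunk, the inner loop drains
// that chunk's whole sequence in ChunkID order, preserving per-sequence FIFO.
struct Packet { std::string bytes; };
struct ChunkSketch { uint32_t seq_id; std::vector<Packet> packets; };
struct SequenceSketch { std::deque<size_t> chunk_offsets; };  // Sorted by ChunkID.

std::vector<Packet> ReadAllSketch(std::vector<ChunkSketch>& buffer_order_chunks,
                                  std::map<uint32_t, SequenceSketch>& sequences) {
  std::vector<Packet> out;
  for (size_t off = 0; off < buffer_order_chunks.size(); off++) {  // Outer walk.
    SequenceSketch& seq = sequences[buffer_order_chunks[off].seq_id];
    while (!seq.chunk_offsets.empty()) {  // Inner walk, in ChunkID order.
      ChunkSketch& chunk = buffer_order_chunks[seq.chunk_offsets.front()];
      seq.chunk_offsets.pop_front();
      for (Packet& p : chunk.packets)
        out.push_back(std::move(p));
      chunk.packets.clear();  // Mark consumed so the outer walk skips it later.
    }
  }
  return out;
}
```

Running this sketch over the buffer-order example above (A1, B1, B3, A2, B2) yields A1, A2, B1, B2, B3: within each sequence the FIFO order is preserved even though B3 physically precedes B2 in the buffer.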
BM_TraceBuffer_WR_SingleWriter<TraceBufferV1> bytes_per_second=9.77742G/s
BM_TraceBuffer_WR_SingleWriter<TraceBufferV2> bytes_per_second=12.6395G/s
BM_TraceBuffer_WR_MultipleWriters<TraceBufferV1> bytes_per_second=8.65385G/s
BM_TraceBuffer_WR_MultipleWriters<TraceBufferV2> bytes_per_second=11.7582G/s
BM_TraceBuffer_RD_MixedPackets<TraceBufferV1> bytes_per_second=4.27694G/s
BM_TraceBuffer_RD_MixedPackets<TraceBufferV2> bytes_per_second=4.35475G/s
BM_TraceBuffer_WR_SingleWriter<TraceBufferV1> bytes_per_second=4.4379G/s
BM_TraceBuffer_WR_SingleWriter<TraceBufferV2> bytes_per_second=3.7931G/s
BM_TraceBuffer_WR_MultipleWriters<TraceBufferV1> bytes_per_second=3.19148G/s
BM_TraceBuffer_WR_MultipleWriters<TraceBufferV2> bytes_per_second=3.47354G/s
BM_TraceBuffer_RD_MixedPackets<TraceBufferV1> bytes_per_second=1.26698G/s
BM_TraceBuffer_RD_MixedPackets<TraceBufferV2> bytes_per_second=1.35394G/s