TraceBuffer V2 Design document

Overview

This document covers the design of TraceBufferV2, which is the 2025 rewrite of the core trace buffer code in occasion of ProtoVM.

TraceBuffer is the non-shared userspace buffer that is used by the tracing service to hold traced data in memory, until it's either read back or written into a file. There is one TraceBuffer instance for each buffers section of the trace config

Basic operating principles

NOTE: This section assumes you are familiar with the core concepts exposed in Buffers and dataflow.

TraceBuffer is a ring buffer on steroids. Unfortunately due to the complications of the protocol (see Challenges section) it is far from a plain byte-oriented FIFO ring buffer when it comes to readback and deletions.

Before delving into its complications, let's explore its key operations.

Logically TraceBuffer deals with overlapping streams of data, called TraceWriter Sequences, or in short just Sequences:

  • A client process that writes trace data acts as a Producer. Typically 1 Producer = 1 Process, but there are cases where a process can host >1 producers (e.g. if it uses N libraries each statically linking the tracing SDK)
  • A Producer declares DataSources, which is the unit of enablement/configuration in the buffer. DataSources, however, don't make it as an abstraction in the buffer. Only TracingService knows about data sources.
  • Each data source uses one or more TraceWriter, typically one per thread.
  • A TraceWriter writes linear sequences of TracePackets.

From a TraceBuffer viewpoint, the only abstraction visible are the Producer (identified by a uint16_t ProducerID) and TraceWriter (identified by a uint16_t WriterID, within the scoped of a producer). The 32-bit tuple {ProducerId, WriterID} constitutes the unique Sequence ID for TraceBuffer. Everything in TraceBuffer is keyed by that.

Basic operation:

  • Producers commit “chunks” into the SMB (Shared Memory Buffer).
  • A Chunk belongs to a {ProducerID,WriterID}, has a sequence ID and flags.
  • A SMB Chunk contains one or more fragments.
  • Typically 1 fragment == 1 packet, with the exception of the first and last fragment which MIGHT be continuations of longer fragmented packets. Note that a chunk could contain only one fragment that happens to be a continuation of a larger packet.
  • Chunks are copied almost as-is into the TraceBuffer, +- some metadata tracking (more details later).
  • At readback time, TraceBuffer reconstructs the sequence of packets and reassembles larger fragmented packets.
  • Reading is a destructive operation.
  • However the destructivity of read involves (almost) the same logic of readback to reconstruct packets being overwritten (and in future pass them to ProtoVM).

Readback gives the following guarantees:

  • TraceBuffer only outputs fully formed packets which are valid protobuf-encoded TracePacket messages. Packets that are missing fragments, are missing patches or are invalid are dropped.
  • Data drops are always tracked and reported through the TracePacket.previous_packet_dropped flag.
  • TraceBuffer tries very hard to avoid hiding valid data: a missing fragment or other similar protocol violations should not invalidate the rest of the data for the sequence.
  • Packets for a sequence are always read back FIFO in the same order of writing.
  • TraceBuffer tries hard to also respect FIFO-ness of packets belonging to different sequences (this is a new behaviour introduced by TraceBufferV2). So data is read back roughly in the same order it has been written (+- pending patches and data losses which can cause jumps).

Readback happens in the following cases:

  • Read via IPC at the end of the trace: this is what perfetto_cmd does by default and is used in most tracing scenarios today. All the contents of the buffer are read after tracing stops.
  • Periodic reads into file: this happens in long tracing mode. Every O(seconds) (configurable) the buffer is read and the packets extracted are written into the file descriptor passed by the consumer.
  • Periodics reads over IPC: these are rare. Some third-party tools like GPU Inspector do that. Architecturally they are no different from the case of read into file. TraceBuffer isn't aware of any difference between “reading into a file” or “reading via IPC”. Those concepts exist only in TracingServiceImpl.

Code-wise there are four main entry-points:

Writer-side (Producer-side):

  • CopyChunkUntrusted(): called when a CommitData IPC is received, or when the service performs SMB Scraping (more below)
  • TryPatchChunkContents(): still part of CommitData IPC.

Reader-side:

  • BeginRead(): called at readback time at the beginning of each read batch.
  • ReadNextTracePacket(): called once for each packet until either there are no more packets in the buffer or TracingServiceImpl decided it has read enough data for the current task (to avoid saturating the IPC channel).

Key challenges

RING_BUFFER vs DISCARD

TraceBuffer can operate in two modes.

RING_BUFFER

This is the mode used by the majority of the traces. It is also the one with the most complications. This document focuses on the operation in RING_BUFFER mode, unless otherwise specified. This mode can be used for pure ring buffer tracing, or can be coupled with write_into_file to have long traces streamed into disk, in which case the ring buffer serves mainly to decouple the SMB and the I/O activity (and to handle fragment reassembly).

DISCARD

This mode is used for one-shot traces where the user cares about the left-most part of the trace. This is conceptually easier: once reached the end of the buffer, TraceBuffer stops accepting data.

There is a slight behavioural change from the V1 implementation. V1 tried to be (too) smart about DISCARD and allowed to keep writing data into the buffer as long as the write and read cursors never crossed (i.e. as long as the reader caught up). This turned out to be useless and confusing: coupling DISCARD with write_into_file leads to a scenario where DISCARD behaves almost like a RING_BUFFER. However if the reader doesn't catch up (e.g. due to lack of CPU bandwidth), TraceBuffer stops accepting data (forever). We later realized this was a confusing feature (a ring buffer that suddenly stops) and added warnings when trying to combine the two.

V2 doesn't try to be smart about readbacks and simply stops once we reach the end of the buffer, whether it has been read or not.

Fragmentation

Packet fragmentation is the cause of most of TraceBuffer's design complexity.

Simple Fragmentation Example:
Chunk A (ChunkID=100)      Chunk B (ChunkID=101)      Chunk C (ChunkID=102)
┌─────────────────────┐    ┌─────────────────────┐    ┌─────────────────────┐
│[Packet1: Complete]  │    │[Packet2: Begin]     │    │[Packet2: Continue]  │
│[Packet2: Begin]     │    │ flags: kContOnNext  │    │ flags: kContFromPrev│
│ flags: kContOnNext  │    └─────────────────────┘    │[Packet2: End]       │
└─────────────────────┘                               │[Packet3: Complete]  │
                                                      └─────────────────────┘

Fragmentation Chain: Packet2 = [Begin] → [Continue] → [End]

Key Fragmentation Challenges:

  • Out-of-order commits: Chunks may arrive out of ChunkID order due to SMB scraping
  • Missing fragments: Gaps in ChunkID sequence cause packet drops
  • Patch dependencies: Chunks marked kChunkNeedsPatching block readback until patched
  • Buffer wraparound: Fragmented packets may span buffer wraparound boundaries

Out-of-order commits

Out of order commits are rare but regularly present. They happen due to a feature called SMB Scraping introduced in the early years of Perfetto.

SMB scraping happens when TracingServiceImpl, upon a Flush, forcefully reads the chunks in the SMB, even if they are not marked as completed, and writes them into the trace buffer.

This was necessary to deal with data sources like TrackEvent that can be used on arbitrary threads that don't have a TaskRunner, where would it be impossible to issue a PostTask(FlushTask) upon a trace protocol Flush request.

The challenge is that TracingServiceImpl, when scraping, scans the SMB in linear order and commits chunks as found. But that linear order translates into “chunk allocation order”, which is unpredictable, effectively causing chunks to be committed in random order.

In practice, these instances are relatively rare, as they happen:

  • Only when stopping the trace, for most traces.
  • every O(seconds) in the case of long tracing mode. Hence they must be supported, but not optimized for.

Important note: TraceBuffer assumes that all out-of-order commits are batched together atomically. The only known use case for OOO is SMB scraping, which commits all scraped chunks in one go within a single TaskRunner task.

Hence we assume that the following cannot happen:

  • Task 1 (IPC message)
    • Commit chunk 1
    • Commit chunk 3
  • Task 2
    • ReadBuffers (e.g. due to periodic write_into_file)
  • Task 3 (IPC message)
    • Commit chunk 2

The logic on TraceBufferV2 treats any ChunkID gaps identified, after having sorted chunks by ChunkID, as data losses.

Tracking of data losses

There are several paths that can lead to a data loss, and TraceBuffer must track and report all of them. Debugging data losses is a very common activity. It's extremely important that TraceBuffer signals any case of data loss.

There are several different types and causes of data losses:

  • SMB exhaustion: this happens when a TraceWriter drops data because the SMB is full. TraceWriters signal this by appending a special fragment of size kPacketSizeDropPacket at the end of the next chunk.
  • Fragment reassembly failure: this happens when TraceBuffer tries to reassemble a fragmented packet and realizes there is a gap in the sequence of ChunkIDs (typically due to a chunk being overwritten in ring-buffer mode).
  • Sequence gaps: this happens when there is no fragmentation but two chunks have a discontinuity in their ChunkID(s). This happens due to ring-buffer overwrites or due to some other issue when writing in the SMB.
  • ABI violations: this happens when a Chunk is malformed, for instance:
    • One of its fragments has a size that goes out of bounds.
    • The first fragment has a “fragment continuation” flag, but there was no fragment previously initiated.

Patches

When a packet spans across several fragments, almost always it involves patching due to the nature of protobuf encoding. The problem is the following:

  • A TraceWriter starts writing a packet at the end of a chunk.
  • By doing so it starts writing a protobuf message (at very least for the root TracePacket proto). More protobuf messages might be nested and open while writing (e.g., writing a TracePacket.ftrace_events bundle)
  • The TraceWriter runs out of space in the Chunk. So it commits the current chunk in the SMB and acquires a new one to continue writing.
  • The chunk being committed contains the preamble with the message size. However that preamble at the moment is filled with zeros, because we don't know yet the size of the message(s), as they are still being written.
  • Only when the nested messages end TraceWriter can possibly know the size of the messages to put in the preamble. But at this point the chunk containing the preamble has been committed in the SMB. TraceWriters cannot touch committed chunks. More importantly, they might have been already consumed by the TracingService.
  • To deal with this, the IPC protocol exposes the ability to patch a Chunk via IPC, with the semantics: If you (TracingService/TraceBuffer) still have ChunkID 1234 for my {ProducerID,WriterID} patch offset X with contents [DE,AD, BE,EF].

From a protocol viewpoint, only the last fragment of a chunk can be patched:

  • Non-fragmented chunks don't require any patches.
  • The first fragment of a chunk (which is the last of the chain) by design does not need any patching as there is no further fragment.
  • Note that a chunk can contain a single fragment which is both the first and last, in the middle of a fragmentation chain. This can also need patching.
  • In general given a packet fragmented in N fragments, all but the last fragment can (and generally will) need patching.

The information about “needs patching” is held in the SMB's Chunk flags (kChunkNeedsPatching).

The kChunkNeedsPatching state is cleared by the CommitData IPC, which contains, alongside the patches offset and payload, the bool has_more_patches flag. When false, it causes the kChunkNeedsPatching state to be cleared.

From a TraceBuffer viewpoint patching has the following implications:

  • Chunks that are pending patches cause a stalling of the readback, for the sequence.
  • Stalling is not synchronous. TraceBuffer simply acts as if there was no more data for the sequence, either by moving on to other sequences or by returning false in ReadNextTracePacket() when all the other sequences have been read.
  • Fragment reassembly stops gracefully in presence of a chunk that has patches pending, without destroy any data or signalling a data loss.
  • Because patches travel over IPC, and the IPC channel is by design non-lossy we stall the sequence for arbitrary time in presence of missing patches.
  • The stalling, however, cannot affect other sequences.
  • So a fragmented packet missing patches can cause a long chain of packets (for the same TraceWriter sequences) to never be propagated in output when reading back, but cannot stall other sequences.
  • However, if the chunks that are pending patches get overwritten by newer data, the stalling ends, and TraceBuffer shall keep reading next packets, signalling the data loss.

Recommits

Recommit means committing again a chunk that exists in the buffer with the same ChunkID. The only legitimate case of recommit is SMB scraping followed by an actual commit. We don't expect nor support the case of Producers trying to re-commit the same chunk N times, as that unavoidably leads to undefined behaviour (what if the tracing service has written the packets to a file already?).

This is the scenario when recommit can legitimately happen:

  • SMB scraping happens, and TracingService calls CopyChunkUntrusted for a chunk that is still being written by the producer. While doing this it signals the condition to TraceBuffer passing the chunk_complete=false argument.
  • The chunk is copied into the TraceBuffer. By design when committing a scraped (incomplete) chunk TraceBuffer ignores the last fragment, because it cannot tell if the producer is still writing it or not.
  • Later on the TraceWriter (who is unaware of SMB scraping) finishes writing the chunk and commits it.
  • TraceBuffer at this point overwrites the chunk, potentially extending it with the last fragment.

NOTE: kChunkNeedsPatching and kIncomplete are two different and orthogonal chunk states. kIncomplete has nothing to do with fragments and is purely about SMB scraping (and the fact that we have to be conservative and ignore the last fragment).

Implications:

  • An incomplete chunk causes a read stall for the sequence similar to what kChunkNeedsPatching does.
  • Similarly to the former case, the stall is withdrawn if the chunk gets overwritten.
  • TraceBuffer never tries to read the last fragment of an incomplete chunk.
  • As such an incomplete chunk cannot be fragmented on the ending side (phew).

The main complication of incomplete chunks is that we cannot know upfront the size of their payload. Because of this we have to conservatively copy and reserve in the buffer the whole chunk size.

Buffer cloning

Buffer cloning happens via the CloneReadOnly() method. As the name suggests it creates a new TraceBuffer instance which contains the same contents but can only be read into. This is to support CLONE_SNAPSHOT triggers.

Architecturally buffer cloning is not particularly complicated, at least in the current design. The main design implications are:

  • Ensuring that no state in the TraceBuffer fields contains pointers.
  • For this reason, the core structure in the buffer use offsets rather than pointers (which also happen to be more memory compact and cache-friendly).
  • Stats and auxiliary metadata tends to be the thing that requires some care, where bugs can occasionally hide.

ProtoVM

ProtoVM is an upcoming feature to TracingService. It's a non-turing-complete VM language to describe proto merging operations, to keep track of the state of arbitrary proto-encoded data structures as we overwrite data in the trace buffer. ProtoVM has been the reason that triggered the redesign of TraceBuffer V2.

Without getting into its details, the primary requirement of ProtoVM is the following: when overwriting chunks in the trace buffer, we must pass valid packets from these soon-to-be-deleted chunks to ProtoVM. We must do so in order, hence replicating the same logic that we would use when doing a readback.

Internal docs about ProtoVM:

Overwrites

For the aforementioned ProtoVM reasons, in the V2 design, the logic that deals with ring buffer overwrites (DeleteNextChunksFor()) is almost identical - and shares most of its code with - the readback logic.

I say almost because there is (of course) a subtle difference: when deleting chunks, stalling (either due to pending patches or incompleteness) is NOT an option. Old chunks MUST go to make space to new chunks, no matter what.

So overwrites are the equivalent of a no-stalling force-delete readback.

Core design

There are two main data structures involved:

TBChunk

tbchunk

Is the struct, stored in the trace buffer memory as a result of calling CopyChunkUntrusted from a SMB chunk.

A TBChunk is very similar to a SMB chunk with the following caveats:

  • The sizeof() of both is the same (16 bytes). This is very important to keep patches offsets consistent.

  • The SMB chunk maintains a counter of fragments. TBChunk instead does byte-based bookkeeping, as that reduces the complexity of the read iterators.

  • The layout of the fields is slightly different, but they both contains ProducerID, WriterID, ChunkID, fragment counts/sizes and flags. The SMB chunk layout is an ABI. The TCHunk layout is not: it is an implementation detail and can change.

  • TBChunk maintains a basic checksum for each chunk (used only in debug builds).

In a nutshell TBChunk is:

  • A linear buffer of base::PagedMemory contains a sequence of chunks.
  • Each chunk is prefixed by a struct TBChunk header followed by its fragments' payload.
  • The TBChunk header contains also
    • The read state (how many bytes of fragments have been consumed)
    • ABI Flags
      • kFirstPacketContinuesFromPrevChunk
      • kLastPacketContinuesOnNextChunk
      • kChunkNeedsPatching
    • Local flags
      • kChunkIncomplete (for SMB-scraped chunks)

SequenceState

It maintains the state of a {ProducerID, WriterID} sequence.

Its important feature is maintaining an ordered list (logically) of TBChunk(s) for that sequence, sorted by ChunkID order. The “list” is actually a CircularQueue of offsets, which has O(1) push_back() and pop_front() operations.

  • TraceBuffer holds a hashmap of ProducerAndWriterId -> SequenceState.
  • There is one SequenceState for each {Producer,Writer} active in the buffer.
  • SequenceState holds:
    • The identity of the producer (uid, pid, ...)
    • The last_chunk_id_consumed, to detect gaps in the ChunkID sequence (data losses)
    • A sorted list (a CircularQueue<size_t>) of chunks, which stores their offset in the buffer.
  • The chunks queue is maintained sorted and updated as chunks are appended and consumed (removed) from the buffer.

The lifetime of a SequenceState has a subtle tradeoff:

  • On one hand, we could destroy a SequenceState when the last chunk for a sequence has been read or overwritten.
  • After all we must delete SequenceState(s) at some point. Doing otherwise would cause memory leaks in long running traces if we have many threads coming and going, as up to 64K sequences per producer are possible.
  • On the other hand, deleting sequences too aggressively have a drawback: we cannot detect data losses in long-trace mode (see Issue #114 and b/268257546). Long trace mode periodically consumes the buffer, hence making all sequences eligible to be destroyed if we were to be aggressive.
  • The problem here lies in the fact that SequenceState holds the last_chunk_id_consumed which is used to detect gaps in the chunk ids .

TraceBufferV2 balances this using a lazy sweeping approach: it allows the most recently deleted SequenceStates to stay alive, up to kKeepLastEmptySeq = 1024. See DeleteStaleEmptySequences().

FragIterator

A simple class that tokenizes fragments in a chunk and allows forward-only iteration.

It deals with untrusted data, detecting malformed / out of bounds scenarios.

It does not alter the state of the buffer.

ChunkSeqIterator

A simple utility class that iterates over the ordered list of TBChunk for a given SequenceState. It merely follows the SequenceState.chunks queue and detects gaps.

ChunkSeqReader

Encapsulates most of the readback complexity. It reads and consumes chunks in sequence order, as follows:

  • When constructed, the caller must pass a target TBChunk as argument. This is the chunk where we will stop the iteration *.
  • At readback time this is the next chunk in the buffer that we want to read.
  • At overwrite time this is the chunk that we are about to overwriter.
  • In both cases, because of OOO commits, the next chunk in buffer-order might not necessarily be the next chunk that should be consumed in FIFO order (although in the vast majority cases we expect them to be in order).
  • Upon construction, it rewinds all the way back in the SequenceState.chunks (using ChunkSeqIterator) and starts the iteration from there.
  • It keeps reading packets until we reach the target TBChunk passed in the constructor.
  • In some cases (fragmentation) it might read beyond the target chunk. This is to reassemble a packet that started in the target chunk and continued later on.
  • When doing so it just consumes the fragment required for reassembly and leaves the other packets in the chunk untouched, to preserve global FIFO-ness.

Buffer order vs Sequence order

Chunks can be visited in two different ways:

  1. Buffer order: in the order they have been written in the buffer. In the example below: A1, B1, B3, A2, B2

  2. In Sequence order: in the order they appear in the SequenceState's list.

core design

Writing chunks

When chunks are written via CopyChunkUntrusted() a new TBChunk is allocated in the buffer‘s PagedMemory using the usual bump-pointer pattern you’d expect from a ring-buffer. Chunks are variable-size, and are stored contiguously with 32-bit alignment.

The offset of the chunk is also appended in the SequenceState.chunks list.

After the first wrapping, writing a chunk involves deleting one or more existing chunks. The deletion operation RemoveNextChunksFor() is as complex as a readback, because it reconstructs packets being deleted in order, to pass them to ProtoVM.

So the writing itself is straightforward, but the deletion (overwrite) of existing chunks is where most of the complexity lies. This is described in the next section.

DeleteNextChunksFor() Flow

flowchart TD
    A[DeleteNextChunksFor<br/>bytes_to_clear] --> B[Initialize: off = wr_<br/>clear_end = wr_ + bytes_to_clear]
    
    B --> C{off < clear_end?}
    C -->|No| M[Create padding chunks<br/>for partial deletions]
    
    C -->|Yes| D{off >= used_size_?}
    D -->|Yes| N[Break - nothing to delete<br/>in unused space]
    
    D -->|No| E[chunk = GetTBChunkAt off]
    E --> F{chunk.is_padding?}
    F -->|Yes| G[Update padding stats<br/>off += chunk.outer_size]
    F -->|No| H[Create ChunkSeqReader<br/>in kEraseMode]
    
    H --> I[ReadNextPacketInSeqOrder loop]
    I --> J{Packet found?}
    J -->|Yes| K[Pass packet to ProtoVM<br/>has_cleared_fragments = true]
    J -->|No| L{has_cleared_fragments?}
    
    K --> I
    L -->|Yes| O[Mark sequence data_loss = true]
    L -->|No| P[No data loss]
    
    O --> Q[Update overwrite stats<br/>off += chunk.outer_size]
    P --> Q
    
    Q --> R{More chunks in range?}
    R -->|Yes| C
    R -->|No| M
    
    G --> C
    
    M --> S[Scan remaining range for padding]
    S --> T{Partial chunk at end?}
    T -->|Yes| U[Create new padding chunk<br/>for remaining space]
    T -->|No| V[End]
    
    U --> V
    N --> V
    
    style A fill:#e1f5fe
    style K fill:#fff3e0
    style O fill:#ffcdd2
    style V fill:#c8e6c9

Key Differences from ReadNextTracePacket:

  • No stalling: Chunks marked as incomplete or needing patches are force-deleted
  • ProtoVM integration: Valid packets are reconstructed and passed to ProtoVM before deletion
  • Padding management: Creates padding chunks for partial deletions at range boundaries
  • Stats tracking: Updates overwrite statistics rather than read statistics

Reading back packets

Readback (ReadNextTracePacket()) is where most of the TraceBuffer's complexity lies, as it needs to reassembles packets from fragments, deal with gaps / data losses, and deal with interleaving of chunks from different sequences, and out of ordering.

flowchart TD
    A[ReadNextTracePacket Start] --> B{chunk_seq_reader_ exists?}
    
    B -->|No| C[Get chunk at rd_]
    C --> D{Is chunk padding?}
    D -->|Yes| E[rd_ += chunk.outer_size<br/>Check wrap around]
    D -->|No| F[Create ChunkSeqReader<br/>for this chunk]
    
    B -->|Yes| G[ReadNextPacketInSeqOrder]
    F --> G
    
    G --> H{Packet found?}
    H -->|Yes| I[Set sequence properties<br/>Set data_loss flag<br/>Return packet]
    H -->|No| J[Get end chunk from reader<br/>rd_ = end_offset + size]
    
    J --> K{rd_ == wr_ OR<br/>wrapped to wr_?}
    K -->|Yes| L[Return false - no more data]
    K -->|No| M[Reset chunk_seq_reader_<br/>Handle wrap around]
    
    E --> N{rd_ wrapped around?}
    N -->|Yes| O[rd_ = 0]
    N -->|No| P[Continue with new rd_]
    
    O --> K
    P --> K
    M --> B
    
    style A fill:#e1f5fe
    style I fill:#c8e6c9
    style L fill:#ffcdd2

ChunkSeqReader Internal Flow

This is how ReadNextTracePacket() works:

  • We start the read iteration immediately after the write cursor. Because writes are simply FIFO, the oldest data in the buffer is by design the one after the write cursor.
  • For simplicity let's ignore fragmentation for now and assume that every chunk is self-contained (i.e. every chunk contains N fragments = N packets).
  • If we assume no fragmentation, and if we also assume no out-of-order commits (i.e. no scraping) we could just iterate linearly in buffer order, and visit the chunks until we reach back the write cursor.
  • So we could just tokenize packets out of each chunk, and return one for each ReadNextTracePacket() invocation. Done.
flowchart TD
    A[ReadNextPacketInSeqOrder] --> B{skip_in_generation?}
    B -->|Yes| C[Return false - stalled]
    
    B -->|No| D[NextFragmentInChunk]
    D --> E{Fragment found?}
    E -->|Yes| F{Fragment type?}
    
    F -->|kFragWholePacket| G[ConsumeFragment<br/>Return packet]
    F -->|kFragBegin| H[ReassembleFragmentedPacket]
    F -->|kFragEnd/Continue| I[Data loss - unexpected<br/>ConsumeFragment<br/>Continue loop]
    
    E -->|No| J{Chunk corrupted?}
    J -->|Yes| K[Mark data_loss = true]
    
    J -->|No| L{Chunk incomplete?}
    L -->|Yes| M[Set skip_in_generation<br/>Return false]
    L -->|No| N[EraseCurrentChunk]
    
    K --> N
    N --> O{Reached end chunk?}
    O -->|Yes| P[Return false]
    O -->|No| Q[NextChunkInSequence]
    
    Q --> R{Next chunk exists?}
    R -->|No| P
    R -->|Yes| S[iter_ = next_chunk<br/>Create new FragIterator]
    
    S --> D
    
    H --> T{Reassembly result?}
    T -->|Success| U[Return reassembled packet]
    T -->|NotEnoughData| V[Set skip_in_generation<br/>Return false]
    T -->|DataLoss| W[Mark data_loss = true<br/>Continue loop]
    
    style A fill:#e1f5fe
    style G fill:#c8e6c9
    style U fill:#c8e6c9
    style C fill:#ffcdd2
    style P fill:#ffcdd2
    style V fill:#fff3e0

Dealing with out-of-order chunks

But things are more complicated. Let‘s take first only out-of-ordering into the picture. With reference to the drawing above, let’s imagine the write cursor is @ offset=48, right before B3.

If we proceeded simply in buffer order we would break FIFO-ness, as we would first emit the packets contained in B3, then A2 (this is fine) and ultimately B2 (this is problematic).

The only valid linearizations that preserve in-sequence FIFO-ness, would be either [A2,B2,B3], [B2,B3,A2] or [B2,A2,B3].

In order to deal with this we introduce a two layer walk in the readback code:

  • The outer layer iterates in buffer order, as that respects the global FIFO-ness, trying to get events out in roughly the same order they got in (% chunking)
  • At every step, the inner layer proceeds in sequence order, as follows:
    • It takes the next chunk (B3 in the example above) that buffer-order visit found.
    • It finds its SequenceState by doing a hash-lookup in the sequences_ map.
    • It jumps to the first Chunk in the SequenceState.chunks ordered list.
    • It proceeds in sequence order until the target chunk (B3) has been reached.
  • The outer layer continues in buffer order and the story repeats.

In the code, the outer layer walk is implemented by TraceBufferV2::ReadNextTracePacket() while the inner walk is implemented by the class ChunkSeqReader::ReadNextPacket().

Benchmarks

Apple Macbook (M4)

BM_TraceBuffer_WR_SingleWriter<TraceBufferV1>       bytes_per_second=9.77742G/s
BM_TraceBuffer_WR_SingleWriter<TraceBufferV2>       bytes_per_second=12.6395G/s
BM_TraceBuffer_WR_MultipleWriters<TraceBufferV1>    bytes_per_second=8.65385G/s
BM_TraceBuffer_WR_MultipleWriters<TraceBufferV2>    bytes_per_second=11.7582G/s
BM_TraceBuffer_RD_MixedPackets<TraceBufferV1>      bytes_per_second=4.27694G/s
BM_TraceBuffer_RD_MixedPackets<TraceBufferV2>      bytes_per_second=4.35475G/s

Google Pixel 7

BM_TraceBuffer_WR_SingleWriter<TraceBufferV1>      bytes_per_second=4.4379G/s
BM_TraceBuffer_WR_SingleWriter<TraceBufferV2>      bytes_per_second=3.7931G/s
BM_TraceBuffer_WR_MultipleWriters<TraceBufferV1>   bytes_per_second=3.19148G/s
BM_TraceBuffer_WR_MultipleWriters<TraceBufferV2>   bytes_per_second=3.47354G/s
BM_TraceBuffer_RD_MixedPackets<TraceBufferV1>      bytes_per_second=1.26698G/s
BM_TraceBuffer_RD_MixedPackets<TraceBufferV2>      bytes_per_second=1.35394G/s