Trace Processor report Subcommand

Authors: @lalitm

Status: Draft

Note: This document is a thought experiment exploring a possible future direction for trace_processor_shell. It is NOT a proposal for immediate implementation. The goal is to capture the design space and solicit feedback.

Motivation

Perfetto traces are rich, multi-dimensional datasets. Today, extracting a useful summary requires either:

  1. Loading the trace in the UI and clicking around.
  2. Writing ad-hoc SQL queries against trace_processor.
  3. Authoring TraceSummarySpec textprotos for the summarize subcommand.

None of these serve the “I just collected a trace, what's in it?” use case well. Users coming from perf report expect to point a tool at a data file and immediately see an opinionated, useful summary — no query authoring, no spec files, no UI.

This gap is especially felt by:

  • CLI power users who want quick triage without leaving the terminal.
  • AI tools that need structured trace summaries to reason about performance.
  • CI pipelines that want a human-readable (or machine-parseable) trace summary as a build artifact.

The Firefox Profiler project is exploring a similar direction with their experimental pq CLI tool (PR 5663 in the firefox-devtools/profiler repo), which provides opinionated per-dimension views of profiling data from the command line.

Decision

Pending

Design

Relationship to summarize

report is a higher-level, opinionated cousin of summarize:

  • summarize is the general-purpose engine — users author custom TraceSummarySpec protos to define exactly what to compute.
  • report ships built-in specs that produce useful defaults across known trace dimensions.

Under the hood, report is built entirely on top of the summarization machinery. Each dimension's report is a pre-authored TraceSummarySpec that gets fed into the same engine that summarize uses.

Built-in spec embedding

Report specs are authored as human-readable textproto files in the source tree (e.g. src/trace_processor/shell/report_specs/*.textproto). A build rule converts these to binary proto and embeds them as byte arrays in the binary, following the existing perfetto_cc_proto_descriptor pattern used for metric and trace descriptors. This means:

  • Zero file I/O at runtime — specs are baked into the binary.
  • Specs are human-editable in the source tree.
  • The same build infrastructure that handles descriptor embedding is reused.
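As a sketch, the generation half of this build rule could look like the following, where the function name and header shape are purely hypothetical (this is not the actual perfetto_cc_proto_descriptor implementation) and the spec is assumed to have already been compiled from textproto to binary proto bytes:

```python
# Hypothetical sketch of the binary-proto -> embedded byte array step.
# Assumes the textproto has already been compiled to binary proto bytes
# (e.g. via protoc); names and output shape are illustrative only.

def emit_byte_array_header(var_name: str, data: bytes) -> str:
    """Render binary proto bytes as a C++ header with an embedded array."""
    body = ", ".join(f"0x{b:02x}" for b in data)
    return (
        "// Auto-generated; do not edit.\n"
        f"constexpr unsigned char {var_name}[] = {{{body}}};\n"
        f"constexpr unsigned long {var_name}_size = {len(data)};\n"
    )

header = emit_byte_array_header("kSlicesReportSpec", b"\x0a\x03foo")
print(header)
```

At runtime the shell would then hand the embedded bytes straight to the summarization engine, with no file I/O.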

CLI surface

trace_processor_shell report [dimension] [FLAGS] <trace_file>

When no dimension is specified, produce an overview covering all applicable dimensions (skipping those with no data in the trace). When a dimension is specified, produce a detailed per-dimension report.
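The optional-dimension dispatch described above can be sketched with a small argument parser; all names here are illustrative, not the real trace_processor_shell flag-parsing code:

```python
# Hypothetical sketch of the report subcommand's dispatch: an optional
# positional dimension selects a built-in spec; omitting it selects the
# overview across all applicable dimensions.
import argparse

DIMENSIONS = ["slices", "stack-samples", "heap-profile", "heap-dump", "scheduling"]

def parse_report_args(argv):
    p = argparse.ArgumentParser(prog="trace_processor_shell report")
    p.add_argument("dimension", nargs="?", choices=DIMENSIONS,
                   help="Optional dimension; omit for the overview.")
    p.add_argument("trace_file")
    p.add_argument("--format", choices=["text", "json"], default="text")
    p.add_argument("--top", type=int, default=10)
    return p.parse_args(argv)

detail = parse_report_args(["slices", "trace.pftrace", "--top", "20"])
overview = parse_report_args(["trace.pftrace"])
# detail.dimension == "slices"; overview.dimension is None -> run all
```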

Dimensions

Dimension        Description
slices           Slice aggregations (wall duration, count, max)
stack-samples    CPU profiling samples (self/total time)
heap-profile     Heap allocation profiling (bytes, count)
heap-dump        Heap snapshot analysis (retained size, objects)
scheduling       Thread scheduling (CPU time, runnable, wait time)

Output format flags

--format text|json     Output format (default: text).
  • text: Human-readable tables, similar to perf report --stdio.
  • json: Structured JSON object, for tool/AI consumption.
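One way to keep the two formats in sync is to render both from the same row data, as in this sketch (an assumption about the internal data model, not the actual output code):

```python
# Sketch of dual-format output for one report section, assuming each
# section is a list of rows with a fixed column order.
import json

def render(columns, rows, fmt="text"):
    if fmt == "json":
        return json.dumps([dict(zip(columns, r)) for r in rows], indent=2)
    # text: pad each column to the widest cell, like perf report --stdio.
    widths = [max(len(str(v)) for v in [c] + [r[i] for r in rows])
              for i, c in enumerate(columns)]
    lines = ["  ".join(str(v).ljust(w) for v, w in zip(r, widths))
             for r in [columns] + rows]
    return "\n".join(lines)

cols = ["Name", "Count", "Total dur"]
rows = [["Choreographer#doFrame", "83.2k", "4.1s"], ["DrawFrame", "83.1k", "3.5s"]]
text_out = render(cols, rows)           # aligned text table
json_out = render(cols, rows, "json")   # JSON array of objects
```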

Scoping flags

These filter the report to a subset of the trace data:

--pid <pid>            Scope to a specific process ID.
--process <name>       Scope to a process by name.
--tid <tid>            Scope to a specific thread ID.
--thread <name>        Scope to a thread by name.
--track <name>         Scope to a track by name.
--cpu <cpu>            Scope to a specific CPU.
--time <start>,<end>   Scope to a time range.
                       Accepts raw nanoseconds or human-friendly
                       format (e.g. 2.7s,3.1s).

Scoping flags are translated into structured query filters and interval_intersect clauses in the underlying TraceSummarySpec, using the existing DSL primitives — no raw SQL WHERE clauses.
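The `--time` grammar (raw nanoseconds or a unit-suffixed form) could be parsed along these lines; the accepted unit set is an assumption, since the exact flag grammar is not yet specified:

```python
# Sketch of --time parsing: accepts raw nanoseconds or a human-friendly
# suffix form like "2.7s". The unit table is an assumption.
_UNITS = {"ns": 1, "us": 1_000, "ms": 1_000_000, "s": 1_000_000_000}

def parse_time_ns(token: str) -> int:
    # Try longer suffixes first so "ms" is not mistaken for "s".
    for suffix, scale in sorted(_UNITS.items(), key=lambda kv: -len(kv[0])):
        if token.endswith(suffix):
            return int(float(token[: -len(suffix)]) * scale)
    return int(token)  # bare nanoseconds

def parse_time_range(arg: str):
    start, end = arg.split(",")
    return parse_time_ns(start), parse_time_ns(end)

span = parse_time_range("2.7s,3.1s")  # (2700000000, 3100000000)
```

The resulting (start, end) pair would then feed an interval_intersect clause in the generated spec rather than a raw SQL predicate.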

Aggregation control

--top <N>              Number of entries per section (default: 10).

Overview output

When invoked without a dimension, the overview produces a one-line trace context followed by per-dimension aggregated highlights.

Example (--format text):

Trace: 12.3s | Android 14 | Pixel 7 Pro | 12 processes | 48 threads | 156 tracks

Slices (12.3M total):
  Name                          Count     Total dur   % of trace   Max dur
  Choreographer#doFrame         83.2k     4.1s        33.2%        128ms
  DrawFrame                     83.1k     3.5s        28.4%        96ms
  measure                       41.6k     890ms       7.2%         42ms
  layout                        41.6k     620ms       5.0%         38ms
  dequeueBuffer                 24.9k     310ms       2.5%         12ms
  eglSwapBuffers                24.9k     280ms       2.3%         8ms
  RenderThread::draw            24.9k     240ms       1.9%         6ms
  BinderTransaction             12.1k     180ms       1.5%         52ms
  animation                     8.3k      120ms       1.0%         4ms
  inflate                       2.1k      95ms        0.8%         18ms

Stack Samples (3.2k total):
  Function                                  Self%     Total%    Samples
  art::Thread::RunRootClock                 18.2%     42.1%     583
  __epoll_pwait                             12.1%     12.1%     387
  art::interpreter::Execute                 8.4%      31.2%     269
  ...

Scheduling:
  Thread                        CPU time   Runnable   Sleeping   % of trace
  RenderThread                  3.2s       120ms      8.9s       26.0%
  mali-cmar-backe               1.8s       45ms       10.4s      14.6%
  HeapTaskDaemon                890ms      12ms       11.3s      7.2%
  ...

Heap Profile: (not present in trace)
Heap Dump: (not present in trace)

Per-dimension detail

Per-dimension reports provide a deeper view. For example, tp report slices <trace> would show the same columns as the overview but with a higher default --top and potentially additional breakdowns (e.g. per-thread grouping).

The exact content of per-dimension reports is left as an open question for now. As noted below, the call-tree views for stack samples (top-down / bottom-up, as seen in perf report and the Firefox Profiler's pq tool) are a natural fit here but the exact interaction model needs more thought.

Per-dimension column definitions

Slices

Default aggregation key: slice name.

Column       Description
Name         Slice name
Count        Number of instances
Total dur    Sum of wall durations across all instances
% of trace   Total duration as percentage of trace duration
Max dur      Maximum single-instance duration (outlier detection)
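The semantics of these columns can be pinned down with a small sketch over flat (name, duration) pairs; in practice the report would compute this through the summarization DSL, not Python:

```python
# Sketch of the slice aggregation: group by name, then count, sum,
# max, and express the total as a percentage of trace duration.
from collections import defaultdict

def aggregate_slices(slices, trace_dur_ns, top=10):
    agg = defaultdict(lambda: {"count": 0, "total": 0, "max": 0})
    for name, dur in slices:
        row = agg[name]
        row["count"] += 1
        row["total"] += dur
        row["max"] = max(row["max"], dur)
    rows = [(name, r["count"], r["total"],
             100.0 * r["total"] / trace_dur_ns, r["max"])
            for name, r in agg.items()]
    rows.sort(key=lambda r: -r[2])  # highest total duration first
    return rows[:top]

rows = aggregate_slices(
    [("doFrame", 100), ("doFrame", 300), ("layout", 50)], trace_dur_ns=1000)
# [('doFrame', 2, 400, 40.0, 300), ('layout', 1, 50, 5.0, 50)]
```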

Stack Samples

Modeled after perf report --stdio.

Column     Description
Function   Function name (symbol)
Self%      Samples where this function is at the top of the stack
Total%     Samples where this function appears anywhere in the stack
Samples    Absolute sample count
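The Self%/Total% distinction is worth making concrete. In this sketch each sample is a stack given leaf-first (an assumption about representation, not the actual stack-sample schema):

```python
# Sketch of Self% vs Total%: a function gets "self" credit only when it
# is the leaf frame, but "total" credit for any sample it appears in.
from collections import Counter

def sample_percentages(stacks):
    total_samples = len(stacks)
    self_c, total_c = Counter(), Counter()
    for stack in stacks:
        self_c[stack[0]] += 1      # leaf frame gets self time
        for fn in set(stack):      # each function counted once per sample
            total_c[fn] += 1
    return {fn: (100.0 * self_c[fn] / total_samples,
                 100.0 * total_c[fn] / total_samples)
            for fn in total_c}

pct = sample_percentages([
    ["malloc", "Execute", "main"],
    ["Execute", "main"],
    ["malloc", "main"],
])
# pct["main"] == (0.0, 100.0): never the leaf, present in every sample.
```

Deduplicating frames within a sample (the `set(stack)`) keeps recursive functions from exceeding 100% Total.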

Heap Profile

Same shape as stack samples but with bytes instead of sample count.

Column        Description
Allocator     Allocation site / function
Self bytes    Bytes allocated directly by this function
Total bytes   Bytes allocated by this function and its callees
Count         Number of allocations
Avg size      Average allocation size

Heap Dump

Point-in-time memory snapshot.

Column          Description
Type/Alloc      Type or allocator
Retained size   Total retained memory
Live objects    Count of live objects

Scheduling

Per-thread scheduling summary.

Column       Description
Thread       Thread name
CPU time     Total time spent running on a CPU
Runnable     Total time in runnable state (waiting for CPU)
Sleeping     Total time sleeping
% of trace   CPU time as percentage of trace duration

Sources of inspiration

  • perf report (Linux perf): Opinionated defaults, hierarchical views, sort-by-overhead, --stdio output. The gold standard for “point at data, get useful summary.”
  • Firefox Profiler pq: CLI profile querying with per-dimension formatters, top-down/bottom-up call trees, scoping via time ranges, dual human/JSON output (PR 5663 in the firefox-devtools/profiler repo).
  • pprof (Go): -top, -text views for CPU/heap profiles. Ergonomic top-N function summaries.
  • heaptrack (KDE): CLI heap profile summaries — peak consumption, top allocators, leak candidates.

Alternatives considered

Ship report specs as external files

Pro:

  • Users can inspect and modify specs without rebuilding.

Con:

  • Requires distributing spec files alongside the binary.
  • File discovery and path resolution adds complexity.
  • Forgoes the embedded approach, which has zero runtime overhead and follows existing precedent (metric descriptors, trace descriptors).

Hardcode aggregation queries in C++

Pro:

  • No proto serialization overhead.

Con:

  • Loses the declarative nature of the summarization DSL.
  • Cannot be reused by the summarize subcommand.
  • Harder to maintain and review.

Combine with summarize

Pro:

  • One subcommand to learn.

Con:

  • summarize is for custom specs; overloading it with opinionated defaults muddies its purpose.
  • Different flag surfaces (scoping flags vs spec paths) would conflict.

Open questions

  • Per-dimension drill-down interaction model: For stack samples, top-down and bottom-up call trees (à la perf report and Firefox Profiler's pq tool) are a natural fit. Should these be sub-sub-commands (tp report stack-samples top-down <trace>), flags (--view top-down), or sections within the same output?
  • Exact per-dimension report content: The overview columns are defined above. The detailed per-dimension reports may include additional breakdowns (e.g. per-thread slice grouping, per-process scheduling). Exact content TBD.
  • Spec authoring: The built-in specs need to be written against the existing PerfettoSQL stdlib tables and modules. The exact table/module references for each dimension need to be determined.
  • Trace metadata extraction: The one-line context line (OS, device, duration, process/thread/track counts) may require queries outside the summarization DSL. How to handle this cleanly?