| # Trace Processor `report` Subcommand |
| |
| **Authors:** @lalitm |
| |
| **Status:** Draft |
| |
| > **Note:** This document is a thought experiment exploring a possible future |
| > direction for trace_processor_shell. It is NOT a proposal for immediate |
| > implementation. The goal is to capture the design space and solicit feedback. |
| |
| ## Motivation |
| |
| Perfetto traces are rich, multi-dimensional datasets. Today, extracting a |
| useful summary requires either: |
| |
| 1. Loading the trace in the UI and clicking around. |
| 2. Writing ad-hoc SQL queries against trace_processor. |
| 3. Authoring `TraceSummarySpec` textprotos for the `summarize` subcommand. |
| |
| None of these serve the "I just collected a trace, what's in it?" use case |
| well. Users coming from `perf report` expect to point a tool at a data file |
| and immediately see an opinionated, useful summary — no query authoring, no |
| spec files, no UI. |
| |
| This gap is especially felt by: |
| |
| - **CLI power users** who want quick triage without leaving the terminal. |
| - **AI tools** that need structured trace summaries to reason about |
| performance. |
| - **CI pipelines** that want a human-readable (or machine-parseable) trace |
| summary as a build artifact. |
| |
| The Firefox Profiler project is exploring a similar direction with their |
| experimental `pq` CLI tool (PR 5663 in the firefox-devtools/profiler repo), |
| which provides opinionated per-dimension views of profiling data from the |
| command line. |
| |
| ## Decision |
| |
| Pending |
| |
| ## Design |
| |
| ### Relationship to `summarize` |
| |
| `report` is a higher-level, opinionated cousin of `summarize`: |
| |
| - **`summarize`** is the general-purpose engine — users author custom |
| `TraceSummarySpec` protos to define exactly what to compute. |
| - **`report`** ships built-in specs that produce useful defaults across |
| known trace dimensions. |
| |
| Under the hood, `report` is built entirely on top of the summarization |
| machinery. Each dimension's report is a pre-authored `TraceSummarySpec` that |
| gets fed into the same engine that `summarize` uses. |
| |
| ### Built-in spec embedding |
| |
| Report specs are authored as human-readable textproto files in the source tree |
| (e.g. `src/trace_processor/shell/report_specs/*.textproto`). A build rule |
| converts these to binary proto and embeds them as byte arrays in the binary, |
| following the existing `perfetto_cc_proto_descriptor` pattern used for metric |
| and trace descriptors. This means: |
| |
| - Zero file I/O at runtime — specs are baked into the binary. |
| - Specs are human-editable in the source tree. |
| - The same build infrastructure that handles descriptor embedding is reused. |
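
As a rough illustration of the embedding step only, the sketch below turns an already-serialized spec into a C++ byte-array definition. The function name `bytes_to_cc_array` and the symbol `kSlicesReportSpec` are hypothetical; the actual build rule and generated layout would follow whatever the existing descriptor tooling does.

```python
def bytes_to_cc_array(name: str, data: bytes, per_line: int = 12) -> str:
    """Render `data` as a C++ constexpr byte array named `name` (hypothetical layout)."""
    rows = []
    for i in range(0, len(data), per_line):
        chunk = data[i:i + per_line]
        # One row of hex literals, with a trailing comma so rows concatenate cleanly.
        rows.append("    " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    body = "\n".join(rows)
    return (
        "#include <array>\n"
        "#include <cstdint>\n\n"
        f"inline constexpr std::array<uint8_t, {len(data)}> {name} = {{\n"
        f"{body}\n"
        "};\n"
    )


if __name__ == "__main__":
    # Placeholder bytes standing in for a serialized spec; not a real proto.
    print(bytes_to_cc_array("kSlicesReportSpec", bytes([0x0A, 0x02, 0xFF])))
```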
| |
| ### CLI surface |
| |
| ``` |
| trace_processor_shell report [dimension] [FLAGS] <trace_file> |
| ``` |
| |
| When no dimension is specified, produce an overview covering all applicable |
| dimensions (skipping those with no data in the trace). When a dimension is |
| specified, produce a detailed per-dimension report. |
| |
| #### Dimensions |
| |
| | Dimension | Description | |
| | ---------------- | -------------------------------------------------- | |
| | `slices` | Slice aggregations (wall duration, count, max) | |
| | `stack-samples` | CPU profiling samples (self/total time) | |
| | `heap-profile` | Heap allocation profiling (bytes, count) | |
| | `heap-dump` | Heap snapshot analysis (retained size, objects) | |
| | `scheduling`     | Thread scheduling (CPU time, runnable, sleeping)   | |
| |
| #### Output format flags |
| |
| ``` |
| --format text|json Output format (default: text). |
| ``` |
| |
| - **`text`**: Human-readable tables, similar to `perf report --stdio`. |
| - **`json`**: Structured JSON object, for tool/AI consumption. |
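
To make the two formats concrete, here is a minimal sketch that renders the same rows either as an aligned text table or as JSON. The `rows` wrapper key and the column names are assumptions for illustration; the JSON schema is not defined by this document.

```python
import json


def render(rows, columns, fmt="text"):
    """Render `rows` (sequences matching `columns`) as text or JSON."""
    if fmt == "json":
        # Hypothetical shape: one object per row, keyed by column name.
        return json.dumps({"rows": [dict(zip(columns, row)) for row in rows]}, indent=2)
    if fmt != "text":
        raise ValueError(f"unknown format: {fmt}")
    cells = [[str(c) for c in columns]] + [[str(v) for v in row] for row in rows]
    # Pad every column to its widest cell so the table lines up.
    widths = [max(len(r[i]) for r in cells) for i in range(len(columns))]
    lines = ["  ".join(c.ljust(w) for c, w in zip(r, widths)).rstrip() for r in cells]
    return "\n".join(lines)


if __name__ == "__main__":
    cols = ("name", "count", "total_dur_ns")
    data = [("Choreographer#doFrame", 83200, 4100000000)]
    print(render(data, cols, "text"))
    print(render(data, cols, "json"))
```

Keeping both formats behind one row model means the text and JSON outputs cannot drift apart in content.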
| |
| #### Scoping flags |
| |
| These filter the report to a subset of the trace data: |
| |
| ``` |
| --pid <pid> Scope to a specific process ID. |
| --process <name> Scope to a process by name. |
| --tid <tid> Scope to a specific thread ID. |
| --thread <name> Scope to a thread by name. |
| --track <name> Scope to a track by name. |
| --cpu <cpu> Scope to a specific CPU. |
| --time <start>,<end> Scope to a time range. |
| Accepts raw nanoseconds or human-friendly |
| format (e.g. 2.7s,3.1s). |
| ``` |
| |
| Scoping flags are translated into structured query filters and |
| `interval_intersect` clauses in the underlying `TraceSummarySpec`, using the |
| existing DSL primitives — no raw SQL WHERE clauses. |
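
As an example of the flag handling that would sit in front of that translation, the sketch below parses a `--time` value into a nanosecond pair. Only the `s` suffix and raw nanoseconds appear in this document; the `ms`, `us` and `ns` suffixes, and the function names, are assumptions.

```python
from decimal import Decimal

# Assumed unit suffixes; only "s" and bare nanoseconds are named in the flag text.
_UNIT_NS = {"ns": 1, "us": 1_000, "ms": 1_000_000, "s": 1_000_000_000}


def parse_timestamp_ns(text: str) -> int:
    """Parse '2.7s', '250ms' or a bare nanosecond integer into nanoseconds."""
    text = text.strip()
    # Check two-letter suffixes first so "ms" is not mistaken for "s".
    for suffix in ("ns", "us", "ms", "s"):
        if text.endswith(suffix):
            # Decimal avoids float rounding error on inputs such as 2.7.
            return int(Decimal(text[: -len(suffix)]) * _UNIT_NS[suffix])
    return int(text)


def parse_time_range(arg: str) -> tuple[int, int]:
    """Parse '<start>,<end>' into a (start_ns, end_ns) pair."""
    start_text, sep, end_text = arg.partition(",")
    if not sep:
        raise ValueError("expected <start>,<end>")
    start, end = parse_timestamp_ns(start_text), parse_timestamp_ns(end_text)
    if end < start:
        raise ValueError("end precedes start")
    return start, end


if __name__ == "__main__":
    print(parse_time_range("2.7s,3.1s"))
```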
| |
| #### Aggregation control |
| |
| ``` |
| --top <N> Number of entries per section (default: 10). |
| ``` |
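
The full flag surface can be prototyped quickly to check how the optional dimension interacts with the flags. This is a Python `argparse` mock-up, not the shell's actual C++ flag parsing; `parse_intermixed_args` is used so that flags may appear between the dimension and the trace file.

```python
import argparse

DIMENSIONS = ("slices", "stack-samples", "heap-profile", "heap-dump", "scheduling")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="trace_processor_shell report")
    # Optional dimension: omitted means "overview of all applicable dimensions".
    parser.add_argument("dimension", nargs="?", choices=DIMENSIONS, default=None)
    parser.add_argument("trace_file")
    parser.add_argument("--format", choices=("text", "json"), default="text")
    # Scoping flags; all optional and left as None when absent.
    parser.add_argument("--pid", type=int)
    parser.add_argument("--process")
    parser.add_argument("--tid", type=int)
    parser.add_argument("--thread")
    parser.add_argument("--track")
    parser.add_argument("--cpu", type=int)
    parser.add_argument("--time", metavar="<start>,<end>")
    # Aggregation control.
    parser.add_argument("--top", type=int, default=10)
    return parser


def parse_report_args(argv):
    """Parse argv, allowing flags between the positional arguments."""
    return build_parser().parse_intermixed_args(argv)


if __name__ == "__main__":
    print(parse_report_args(["slices", "--top", "5", "trace.pftrace"]))
```

One consequence of an optional leading positional is that a trace file literally named after a dimension is ambiguous when passed alone, which may be worth deciding on explicitly.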
| |
| ### Overview output |
| |
| When invoked without a dimension, the overview produces a one-line trace |
| context followed by per-dimension aggregated highlights. |
| |
| Example (`--format text`): |
| |
| ``` |
| Trace: 12.3s | Android 14 | Pixel 7 Pro | 12 processes | 48 threads | 156 tracks |
| |
| Slices (12.3M total): |
| Name Count Total dur % of trace Max dur |
| Choreographer#doFrame 83.2k 4.1s 33.2% 128ms |
| DrawFrame 83.1k 3.5s 28.4% 96ms |
| measure 41.6k 890ms 7.2% 42ms |
| layout 41.6k 620ms 5.0% 38ms |
| dequeueBuffer 24.9k 310ms 2.5% 12ms |
| eglSwapBuffers 24.9k 280ms 2.3% 8ms |
| RenderThread::draw 24.9k 240ms 1.9% 6ms |
| BinderTransaction 12.1k 180ms 1.5% 52ms |
| animation 8.3k 120ms 1.0% 4ms |
| inflate 2.1k 95ms 0.8% 18ms |
| |
| Stack Samples (3.2k total): |
| Function Self% Total% Samples |
| art::Thread::RunRootClock 18.2% 42.1% 583 |
| __epoll_pwait 12.1% 12.1% 387 |
| art::interpreter::Execute 8.4% 31.2% 269 |
| ... |
| |
| Scheduling: |
| Thread CPU time Runnable Sleeping % of trace |
| RenderThread 3.2s 120ms 8.9s 26.0% |
| mali-cmar-backe 1.8s 45ms 10.4s 14.6% |
| HeapTaskDaemon 890ms 12ms 11.3s 7.2% |
| ... |
| |
| Heap Profile: (not present in trace) |
| Heap Dump: (not present in trace) |
| ``` |
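
The abbreviated values in the example (`83.2k`, `4.1s`, `890ms`) imply some formatting rules. A minimal sketch of one possible set follows; the thresholds and rounding are assumptions chosen to reproduce the example, not a specified behavior.

```python
def format_count(n: int) -> str:
    """Abbreviate counts with k/M suffixes (assumed rule: one decimal place)."""
    for threshold, suffix in ((1_000_000, "M"), (1_000, "k")):
        if n >= threshold:
            return f"{n / threshold:.1f}{suffix}"
    return str(n)


def format_duration(ns: int) -> str:
    """Format nanoseconds using the largest unit that keeps the value >= 1."""
    for threshold, suffix in ((1_000_000_000, "s"), (1_000_000, "ms"), (1_000, "us")):
        if ns >= threshold:
            value = ns / threshold
            # One decimal for seconds, whole numbers for smaller units.
            return f"{value:.1f}s" if suffix == "s" else f"{value:.0f}{suffix}"
    return f"{ns}ns"


if __name__ == "__main__":
    print(format_count(83_200), format_duration(4_100_000_000), format_duration(890_000_000))
```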
| |
| ### Per-dimension detail |
| |
| Per-dimension reports provide a deeper view. For example, |
| `trace_processor_shell report slices <trace_file>` would show the same |
| columns as the overview but with a higher default `--top` and potentially |
| additional breakdowns (e.g. per-thread grouping). |
| |
| The exact content of per-dimension reports is left as an open question for |
| now. As noted under "Open questions", call-tree views for stack samples |
| (top-down / bottom-up, as seen in `perf report` and the Firefox Profiler's |
| `pq` tool) are a natural fit here, but the exact interaction model needs |
| more thought. |
| |
| ### Per-dimension column definitions |
| |
| #### Slices |
| |
| Default aggregation key: slice name. |
| |
| | Column | Description | |
| | ------------ | ---------------------------------------------------- | |
| | Name | Slice name | |
| | Count | Number of instances | |
| | Total dur | Sum of wall durations across all instances | |
| | % of trace | Total duration as percentage of trace duration | |
| | Max dur | Maximum single-instance duration (outlier detection) | |
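
The columns above amount to a group-by on slice name. The sketch below computes them in plain Python over `(name, dur_ns)` pairs purely to pin down the semantics; in the real design this would be expressed in a `TraceSummarySpec`, and the field names here are hypothetical.

```python
from collections import defaultdict


def aggregate_slices(slices, trace_dur_ns, top=10):
    """Aggregate (name, dur_ns) pairs into per-name rows, largest total first."""
    stats = defaultdict(lambda: [0, 0, 0])  # name -> [count, total_dur, max_dur]
    for name, dur in slices:
        entry = stats[name]
        entry[0] += 1
        entry[1] += dur
        entry[2] = max(entry[2], dur)
    rows = [
        {
            "name": name,
            "count": count,
            "total_dur_ns": total,
            "pct_of_trace": 100.0 * total / trace_dur_ns,
            "max_dur_ns": longest,
        }
        for name, (count, total, longest) in stats.items()
    ]
    rows.sort(key=lambda r: r["total_dur_ns"], reverse=True)
    return rows[:top]


if __name__ == "__main__":
    demo = [("DrawFrame", 30), ("measure", 10), ("DrawFrame", 50)]
    for row in aggregate_slices(demo, trace_dur_ns=200):
        print(row)
```

Because durations are summed across all instances, "% of trace" can exceed 100% when slices with the same name nest or run concurrently on several threads.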
| |
| #### Stack Samples |
| |
| Modeled after `perf report --stdio`. |
| |
| | Column    | Description                                                     | |
| | --------- | --------------------------------------------------------------- | |
| | Function  | Function name (symbol)                                          | |
| | Self%     | Percentage of samples with this function as the leaf frame      | |
| | Total%    | Percentage of samples with this function anywhere in the stack  | |
| | Samples   | Absolute sample count behind Self%                              | |
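
A minimal sketch of how Self% and Total% could be derived from raw stacks, assuming each sample is a list of function names ordered root to leaf. "Samples" is taken to be the self count, which matches the overview example (583 of 3.2k samples is 18.2%). The function and field names are hypothetical.

```python
from collections import Counter


def self_total(samples):
    """Compute per-function self/total sample shares from root-to-leaf stacks."""
    self_counts, total_counts = Counter(), Counter()
    n = 0
    for stack in samples:
        if not stack:
            continue  # ignore samples with no frames
        n += 1
        self_counts[stack[-1]] += 1
        # Count each function once per sample so recursion does not inflate Total%.
        total_counts.update(set(stack))
    rows = [
        {
            "function": fn,
            "self_pct": 100.0 * self_counts[fn] / n,
            "total_pct": 100.0 * total / n,
            "samples": self_counts[fn],
        }
        for fn, total in total_counts.items()
    ]
    rows.sort(key=lambda r: (r["samples"], r["total_pct"]), reverse=True)
    return rows


if __name__ == "__main__":
    stacks = [["main", "run", "epoll"], ["main", "run"], ["main", "run", "run"], ["main", "idle"]]
    for row in self_total(stacks):
        print(row)
```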
| |
| #### Heap Profile |
| |
| Same shape as stack samples, but measured in bytes rather than samples, with |
| allocation count and average size added. |
| |
| | Column | Description | |
| | ------------ | ---------------------------------------------------- | |
| | Allocator | Allocation site / function | |
| | Self bytes | Bytes allocated directly by this function | |
| | Total bytes | Bytes allocated by this function and its callees | |
| | Count | Number of allocations | |
| | Avg size | Average allocation size | |
| |
| #### Heap Dump |
| |
| Point-in-time memory snapshot. |
| |
| | Column | Description | |
| | ------------- | --------------------------------------------------- | |
| | Type/Alloc | Type or allocator | |
| | Retained size | Total retained memory | |
| | Live objects | Count of live objects | |
| |
| #### Scheduling |
| |
| Per-thread scheduling summary. |
| |
| | Column | Description | |
| | ------------ | ---------------------------------------------------- | |
| | Thread | Thread name | |
| | CPU time | Total time spent running on a CPU | |
| | Runnable | Total time in runnable state (waiting for CPU) | |
| | Sleeping | Total time sleeping | |
| | % of trace | CPU time as percentage of trace duration | |
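
For the scheduling columns, the sketch below sums per-thread time by state from `(thread_name, state, dur_ns)` tuples. The three state labels and the field names are assumptions for illustration. Grouping by name also merges distinct threads that share a name, so a real spec would more likely key on a unique thread identifier.

```python
from collections import defaultdict

# Assumed state labels mapped to the hypothetical output field they feed.
_STATE_FIELD = {"running": "cpu_time_ns", "runnable": "runnable_ns", "sleeping": "sleeping_ns"}


def aggregate_scheduling(states, trace_dur_ns, top=10):
    """Sum per-thread time in each state; rows are sorted by CPU time."""
    totals = defaultdict(lambda: {"cpu_time_ns": 0, "runnable_ns": 0, "sleeping_ns": 0})
    for thread, state, dur in states:
        totals[thread][_STATE_FIELD[state]] += dur
    rows = [
        {"thread": thread, **t, "pct_of_trace": 100.0 * t["cpu_time_ns"] / trace_dur_ns}
        for thread, t in totals.items()
    ]
    rows.sort(key=lambda r: r["cpu_time_ns"], reverse=True)
    return rows[:top]


if __name__ == "__main__":
    demo = [
        ("RenderThread", "running", 300),
        ("RenderThread", "runnable", 50),
        ("RenderThread", "sleeping", 650),
        ("Binder", "running", 100),
        ("Binder", "sleeping", 900),
    ]
    for row in aggregate_scheduling(demo, trace_dur_ns=1000):
        print(row)
```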
| |
| ### Sources of inspiration |
| |
| - **`perf report`** (Linux perf): Opinionated defaults, hierarchical views, |
| sort-by-overhead, `--stdio` output. The gold standard for "point at data, |
| get useful summary." |
| - **Firefox Profiler `pq`**: CLI profile querying with per-dimension |
| formatters, top-down/bottom-up call trees, scoping via time ranges, dual |
| human/JSON output (PR 5663 in the firefox-devtools/profiler repo). |
| - **`pprof`** (Go): `-top`, `-text` views for CPU/heap profiles. Ergonomic |
| top-N function summaries. |
| - **`heaptrack`** (KDE): CLI heap profile summaries — peak consumption, top |
| allocators, leak candidates. |
| |
| ## Alternatives considered |
| |
| ### Ship report specs as external files |
| |
| Pro: |
| |
| * Users can inspect and modify specs without rebuilding. |
| |
| Con: |
| |
| * Requires distributing spec files alongside the binary. |
| * File discovery and path resolution adds complexity. |
| * Gives up embedded binary protos, which add no runtime overhead and follow |
| existing precedent (metric descriptors, trace descriptors). |
| |
| ### Hardcode aggregation queries in C++ |
| |
| Pro: |
| |
| * No proto parsing overhead at runtime. |
| |
| Con: |
| |
| * Loses the declarative nature of the summarization DSL. |
| * Cannot be reused by the `summarize` subcommand. |
| * Harder to maintain and review. |
| |
| ### Combine with `summarize` |
| |
| Pro: |
| |
| * One subcommand to learn. |
| |
| Con: |
| |
| * `summarize` is for custom specs; overloading it with opinionated defaults |
| muddies its purpose. |
| * Different flag surfaces (scoping flags vs spec paths) would conflict. |
| |
| ## Open questions |
| |
| * **Per-dimension drill-down interaction model:** For stack samples, top-down |
| and bottom-up call trees (à la `perf report` and Firefox Profiler's `pq` |
| tool) are a natural fit. Should these be sub-subcommands |
| (`trace_processor_shell report stack-samples top-down <trace_file>`), flags |
| (`--view top-down`), or sections within the same output? |
| * **Exact per-dimension report content:** The overview columns are defined |
| above. The detailed per-dimension reports may include additional breakdowns |
| (e.g. per-thread slice grouping, per-process scheduling). Exact content |
| TBD. |
| * **Spec authoring:** The built-in specs need to be written against the |
| existing PerfettoSQL stdlib tables and modules. The exact table/module |
| references for each dimension need to be determined. |
| * **Trace metadata extraction:** The one-line trace context (OS, device, |
| duration, process/thread/track counts) may require queries outside the |
| summarization DSL. How to handle this cleanly? |