In this guide, you'll learn how to:
On Linux and Android, perfetto can record per-CPU perf counters, for example hardware events such as executed instructions or cache misses. Additionally, perfetto can be configured to sample callstacks of running processes based on these performance counters. Both modes are analogous to the perf record command from the perf tool, and use the same system call (perf_event_open).
If you're only interested in the profiling (i.e. flamegraphs), skip to “Collecting a callstack profile”.
The recording is defined using the usual perfetto config protobuf, and can be freely combined with other data sources such as ftrace. This allows for hybrid traces with a single timeline showing both the sampled counter values as well as other traced data, e.g. process scheduling.
The data source configuration (PerfEventConfig) defines the following:
A sampling interval for the timebase (leader) counter, set as either a period (e.g., every 1000 events) or a frequency (e.g., 100 times per second).
One tracing configuration can define multiple “linux.perf” data sources for separate sampling groups. Note, however, that when counting hardware events you need to be careful not to exceed the PMU capacity of the platform. Otherwise the kernel will multiplex (repeatedly switch in and out) the event groups, leading to undercounting (see the perfwiki for more info).
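To make the undercounting concrete, here is a minimal sketch of the scaling convention commonly used with perf_event_open when a counter was only scheduled on the PMU for part of the time. The function name and the numbers are made up for illustration, and the sketch does not describe how perfetto itself treats multiplexed groups.

```python
def scale_multiplexed_count(raw_count: int, time_enabled_ns: int, time_running_ns: int) -> float:
    """Estimate what a counter would have read had it been on the PMU the whole time.

    raw_count: events observed while the counter was actually scheduled.
    time_enabled_ns: how long the event was enabled.
    time_running_ns: how long it was really counting (<= time_enabled_ns).
    """
    if time_running_ns <= 0:
        # The event never got PMU time, so there is nothing to extrapolate from.
        return 0.0
    return raw_count * time_enabled_ns / time_running_ns


# Hypothetical numbers: the group counted for only a quarter of a 10 ms window.
estimate = scale_multiplexed_count(500_000, 10_000_000, 2_500_000)
print(estimate)
```

The result is only an estimate: it assumes the event rate was the same while the counter was switched out, which is why staying within the PMU capacity is preferable.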
The following config defines one group of three counters per CPU. A timer event (SW_CPU_CLOCK) is used as the leader, providing a steady rate of samples. Each sample additionally includes the counts of CPU cycles (HW_CPU_CYCLES) and executed instructions (HW_INSTRUCTIONS) since the beginning of tracing.
duration_ms: 10000
buffers: {
  size_kb: 40960
  fill_policy: DISCARD
}
# sample per-cpu counts of instructions and cycles
data_sources {
  config {
    name: "linux.perf"
    perf_event_config {
      timebase {
        frequency: 1000
        counter: SW_CPU_CLOCK
        timestamp_clock: PERF_CLOCK_MONOTONIC
      }
      followers { counter: HW_CPU_CYCLES }
      followers { counter: HW_INSTRUCTIONS }
    }
  }
}
# include scheduling data via ftrace
data_sources: {
  config: {
    name: "linux.ftrace"
    ftrace_config: {
      ftrace_events: "sched/sched_switch"
      ftrace_events: "sched/sched_waking"
    }
  }
}
# include process names and grouping via procfs
data_sources: {
  config: {
    name: "linux.process_stats"
    process_stats_config {
      scan_all_processes_on_start: true
    }
  }
}
In the UI, the recorded counters appear under the “Perf Counters” track groups once these are expanded. The counter tracks show the values as counting rates by default.
The counter data can be queried as follows:
select ts, cpu, name, value
from counter c
join perf_counter_track pct on (c.track_id = pct.id)
order by 1, 2 asc
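Since the recorded values are cumulative counts, a rate or ratio has to be derived from differences between samples. The sketch below assumes the query result has been exported as (ts, cpu, name, value) tuples sorted by timestamp, with timestamps in nanoseconds. The counter names "cycles" and "instructions" are placeholders, so check the names that actually appear in your trace.

```python
from collections import defaultdict


def interval_rates(rows):
    """Turn cumulative counter samples into events-per-second between samples.

    rows: list of (ts_ns, cpu, name, value), sorted by ts_ns.
    Returns {(cpu, name): [(ts_ns, events_per_second), ...]}.
    """
    last = {}
    rates = defaultdict(list)
    for ts, cpu, name, value in rows:
        key = (cpu, name)
        if key in last:
            prev_ts, prev_value = last[key]
            dt_ns = ts - prev_ts
            if dt_ns > 0:
                rates[key].append((ts, (value - prev_value) * 1e9 / dt_ns))
        last[key] = (ts, value)
    return dict(rates)


def instructions_per_cycle(rows, cpu, instr_name, cycles_name):
    """IPC on one CPU over the span covered by rows, or None if undefined."""
    deltas = {}
    for wanted in (instr_name, cycles_name):
        values = [v for _, c, n, v in rows if c == cpu and n == wanted]
        if len(values) < 2:
            return None  # need at least two samples to form a difference
        deltas[wanted] = values[-1] - values[0]
    if deltas[cycles_name] == 0:
        return None
    return deltas[instr_name] / deltas[cycles_name]


# Hypothetical samples for CPU 0, taken 1 ms apart.
sample_rows = [
    (0, 0, "cycles", 1000),
    (0, 0, "instructions", 2000),
    (1_000_000, 0, "cycles", 3000),
    (1_000_000, 0, "instructions", 5000),
]
print(interval_rates(sample_rows))
print(instructions_per_cycle(sample_rows, 0, "instructions", "cycles"))
```

Working from differences rather than raw values also makes the result independent of when tracing started.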
The counter recording can also be configured to include a callstack (list of function frames that called each other) of the process that was interrupted at the time of the counter sampling. This is achieved by asking the kernel to record additional state (userspace register state, top of the stack memory) in each sample, and then unwinding and symbolising the callstack in the profiler. The unwinding happens outside of the process, without any need for instrumentation or injected libraries in the processes being profiled.
To enable callstack profiling, set the callstack_sampling field in the data source config. Note that the sampling will still be performed per-CPU, but you can set the scope field to have the profiler unwind callstacks only for matching processes (which in turn can help prevent the profiler from being overloaded by unwinding runtime costs).
The following is an example of a config for periodic sampling based on time (i.e. a per-CPU timer leader), unwinding callstacks only for samples taken while a process with the given name is running.
By changing the timebase, you can instead capture callstacks on other events. For example, you could see the callstacks at the points where the process wakes other threads up by setting “sched/sched_waking” as a tracepoint timebase.
Android note: the example targets “com.android.settings”, but for successful callstack sampling the app has to be declared as either profileable or debuggable in the manifest (or you must be on a debuggable build of the Android OS).
duration_ms: 10000
buffers: {
  size_kb: 40960
  fill_policy: DISCARD
}
# periodic sampling per cpu, unwinding callstacks if
# "com.android.settings" is running.
data_sources {
  config {
    name: "linux.perf"
    perf_event_config {
      timebase {
        counter: SW_CPU_CLOCK
        frequency: 100
        timestamp_clock: PERF_CLOCK_MONOTONIC
      }
      callstack_sampling {
        scope {
          target_cmdline: "com.android.settings"
        }
        kernel_frames: true
      }
    }
  }
}
# include scheduling data via ftrace
data_sources: {
  config: {
    name: "linux.ftrace"
    ftrace_config: {
      ftrace_events: "sched/sched_switch"
      ftrace_events: "sched/sched_waking"
    }
  }
}
# include process names and grouping via procfs
data_sources: {
  config: {
    name: "linux.process_stats"
    process_stats_config {
      scan_all_processes_on_start: true
    }
  }
}
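If you profile several processes or sampling rates, it can be convenient to generate the “linux.perf” block instead of editing it by hand. The helper below is a hypothetical convenience, not part of perfetto. It only reproduces the fields used in the example config above and does no validation beyond basic sanity checks.

```python
def perf_callstack_source(cmdline: str, frequency_hz: int) -> str:
    """Render a "linux.perf" data source that samples callstacks for one process.

    cmdline: value for scope.target_cmdline (e.g. an Android package name).
    frequency_hz: per-CPU timer sampling frequency.
    """
    if frequency_hz <= 0:
        raise ValueError("frequency_hz must be positive")
    if any(ch in cmdline for ch in '"\\\n'):
        # Keep the sketch simple: refuse input that would need textproto escaping.
        raise ValueError("cmdline contains characters that need escaping")
    lines = [
        "data_sources {",
        "  config {",
        '    name: "linux.perf"',
        "    perf_event_config {",
        "      timebase {",
        "        counter: SW_CPU_CLOCK",
        f"        frequency: {frequency_hz}",
        "        timestamp_clock: PERF_CLOCK_MONOTONIC",
        "      }",
        "      callstack_sampling {",
        "        scope {",
        f'          target_cmdline: "{cmdline}"',
        "        }",
        "        kernel_frames: true",
        "      }",
        "    }",
        "  }",
        "}",
    ]
    return "\n".join(lines) + "\n"


print(perf_callstack_source("com.android.settings", 100))
```

The output still needs the duration_ms, buffers, and any additional data sources from the full config to form a complete trace config.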
In the UI, the callstack samples will be shown as instant events on the timeline, within the process track group of the sampled process. There is a track per sampled thread, as well as a single track combining all samples from that process. When you select a time region containing perf samples, the bottom pane shows dynamic flamegraph views of the selected callstacks.
The sample data can also be queried from the perf_sample table via SQL.
As well as visualizing traces on a timeline, Perfetto has support for querying traces using SQL. The easiest way to do this is using the query engine available directly in the UI.
In the Perfetto UI, click on the “Query (SQL)” tab in the left-hand menu.
This will open a two-part window. You can write your PerfettoSQL query in the top section and view the results in the bottom section.
You can then execute queries with Ctrl/Cmd + Enter.
For example, by running:
INCLUDE PERFETTO MODULE linux.perf.samples;
SELECT
  -- The id of the callstack. A callstack in this context
  -- is a unique set of frames up to the root.
  id,
  -- The id of the parent callstack for this callstack.
  parent_id,
  -- The function name of the frame for this callstack.
  name,
  -- The name of the mapping containing the frame. This
  -- can be a native binary, library, JAR or APK.
  mapping_name,
  -- The name of the file containing the function.
  source_file,
  -- The line number in the file the function is located at.
  line_number,
  -- The number of samples with this function as the leaf
  -- frame.
  self_count,
  -- The number of samples with this function appearing
  -- anywhere on the callstack.
  cumulative_count
FROM linux_perf_samples_summary_tree;
you can see the summary tree of all the callstacks captured in the trace.
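Because each row points at its parent via parent_id, the tree can be flattened into "folded" stacks (root;...;leaf plus a sample count), a text form that many flamegraph tools accept. The sketch below assumes the query result has been exported as a list of dictionaries keyed by column name. The sample rows are invented for illustration.

```python
def folded_stacks(rows):
    """Flatten summary-tree rows into (stack_string, self_count) pairs.

    rows: list of dicts with at least id, parent_id, name and self_count.
    Only callstacks that were the leaf of some sample (self_count > 0) are emitted.
    """
    by_id = {row["id"]: row for row in rows}
    folded = []
    for row in rows:
        if row["self_count"] == 0:
            continue
        frames = []
        node = row
        # Walk from the leaf up to the root, collecting frame names.
        while node is not None:
            frames.append(node["name"])
            parent_id = node["parent_id"]
            node = by_id.get(parent_id) if parent_id is not None else None
        folded.append((";".join(reversed(frames)), row["self_count"]))
    return folded


# Hypothetical rows: main calls parse and render, render calls draw.
sample_tree = [
    {"id": 1, "parent_id": None, "name": "main", "self_count": 1, "cumulative_count": 6},
    {"id": 2, "parent_id": 1, "name": "parse", "self_count": 3, "cumulative_count": 3},
    {"id": 3, "parent_id": 1, "name": "render", "self_count": 0, "cumulative_count": 2},
    {"id": 4, "parent_id": 3, "name": "draw", "self_count": 2, "cumulative_count": 2},
]
for stack, count in folded_stacks(sample_tree):
    print(stack, count)
```

In this invented data the self counts sum to the root's cumulative_count (1 + 3 + 2 = 6), which is a quick consistency check you can run on an exported tree that has a single root.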
The perfetto profiling implementation is built for continuous (streaming) collection, and is therefore less optimised for short, high-frequency profiling. If all you need are aggregated flamegraphs, consider simpleperf on Android and perf on Linux. These tools are more mature and have a simpler user interface for this use case.
Now that you've recorded your first CPU profile, you can explore more advanced topics:
You can also include other data sources on the same timeline as CPU sampling to get a more complete picture of your system's performance.