| # Multi-machine architecture |
| |
| Perfetto can record a single trace that spans more than one operating system |
| image — for example, a host and one or more virtual-machine guests, an SoC |
| and a companion processor, or a fleet of test machines driving a shared |
| workload. The result is one timeline in which causality across machines is |
| visible and queryable, instead of one trace per machine that has to be |
| correlated by hand. |
| |
| This page explains *what* multi-machine tracing is and *how* the pieces fit |
| together. For the step-by-step setup, see |
| [Multi-machine recording](/docs/learning-more/multi-machine-tracing.md). |
| |
| ## Problem statement |
| |
| The standard [service model](/docs/concepts/service-model.md) assumes that |
| all producers, the `traced` service, and the consumer share one OS image: |
| they reach `traced` through a local UNIX socket, agree on PIDs, and observe |
| the same `CLOCK_BOOTTIME`. |
| |
| That assumption breaks as soon as a producer lives on a different kernel. |
| There is no shared filesystem socket. PID namespaces are independent. Boot |
| clocks start at different points and drift independently of each other. |
| Running a separate `traced` on every machine and stitching the resulting |
| traces together after the fact is possible but fragile, especially for |
| anything timing-sensitive (e.g. cross-machine scheduling or RPC latency). |
| |
| Multi-machine tracing solves this without duplicating buffers or consumer |
| machinery on every machine. |
| |
| ## Architecture |
| |
| Exactly one machine in the setup runs `traced` (the "host"). Every other |
| machine runs `traced_relay`, which forwards the producer-side IPC to the |
| host: |
| |
| ``` |
| Remote machine Host machine |
| ┌────────────────────────┐ ┌────────────────────────────┐ |
| │ traced_probes │ │ traced --enable-relay- │ |
| │ + other producers │ │ endpoint │ |
| │ │ │ │ ▲ │ |
| │ ▼ (local IPC) │ TCP/vsock │ │ (local IPC) │ |
| │ traced_relay ────────┼──────────────►│ relay endpoint │ |
| └────────────────────────┘ │ ▲ │ |
| │ │ │ |
| │ traced_probes / other │ |
| │ local producers │ |
| │ ▲ │ |
| │ │ (consumer IPC) │ |
| │ perfetto cmdline │ |
| └────────────────────────────┘ |
| ``` |
| |
| `traced_relay` is intentionally thin: it accepts producer connections on the |
| local producer socket, exchanges a small amount of metadata with the host |
| (see below), and then proxies producer IPC frames over TCP or vsock. It does |
| not buffer trace data, does not parse trace packets, and does not implement |
| any consumer-side functionality. |
| |
| The consumer (`perfetto` cmdline or the UI's WebSocket bridge) only ever |
| talks to the host's `traced`. Trace configuration, buffer ownership, and |
| final read-back stay on a single machine. |
| |
| ## Machine identity |
| |
| When `traced_relay` first connects to the host it sends a `SetPeerIdentity` |
| message containing a `machine_id_hint` — on Linux this is derived from |
| `/proc/sys/kernel/random/boot_id` when available, or a hash of `uname(2)` |
| plus a bootup-timestamp source as a fallback. The hint is stable across |
| reconnects of the same kernel, but distinct between different kernels. |
| |
| The host's `traced` maps each unique hint to a small integer `MachineId` |
| and stamps every `TracePacket` arriving from that relay with it (the |
| `machine_id` field on `TracePacket`). At import time, [Trace Processor] |
| materialises one row per machine in the `machine` table: |
| |
| | Column | Description | |
| | ------ | ----------- | |
| | `id` | Trace-Processor-assigned machine ID. Always `0` for the host. | |
| | `raw_id` | The raw machine identifier from the trace packet (`0` for the host, non-zero for remote machines). | |
| | `sysname`, `release`, `version`, `arch` | `uname(2)` fields for the machine. | |
| | `num_cpus` | CPU count visible to that kernel. | |
| | `system_ram_bytes`, `system_ram_gb` | Total RAM. | |
| | `android_build_fingerprint`, `android_device_manufacturer`, `android_sdk_version` | Populated only for Android machines. | |
| |
| Tables that have a per-CPU or per-thread dimension (`thread`, `cpu`, |
| `gpu_counter_track`, etc.) carry a nullable `machine_id` so cross-machine |
| data can be sliced by SQL. UI support for per-machine tracks is still |
| maturing, so `machine_id` joins remain the most reliable way to answer |
| cross-machine questions today. |
| |
| ## Clock synchronisation across machines |
| |
| Each remote machine has its own `CLOCK_BOOTTIME`, so timestamps written by |
| its producers cannot be compared directly to host timestamps. `traced_relay` |
| runs a lightweight ping protocol against the host's relay endpoint, sending |
| and receiving timestamped messages to estimate the per-machine clock offset |
| and round-trip time. The host periodically emits the resulting offsets as |
| `ClockSnapshot` packets in the trace. |
| |
| From there everything reuses the existing single-machine machinery |
| described in [Clock Synchronization](/docs/concepts/clock-sync.md): Trace |
| Processor folds the cross-machine offsets into the same clock graph it |
| already builds for `CLOCK_REALTIME`, `CLOCK_MONOTONIC`, etc., and resolves |
| every event to a single global trace clock at import. There is nothing |
| extra a data source has to do. |
| |
| ## {#data-source-dispatch} Data source dispatch |
| |
| By default `traced` only dispatches data sources to producers on the host |
| machine. To collect data from remote machines, the consumer's |
| `TraceConfig` must opt in, either globally with `trace_all_machines: true` |
| or per-data-source with `DataSource.machine_name_filter`. Without one of |
| these, `traced_probes` on the remote machine still registers and shows up |
| as a row in the `machine` table, but is never assigned the requested data |
| sources, so no events flow from it. |
| |
| `trace_all_machines` was introduced in v54; earlier versions matched all |
| machines by default. The remote-side machine name comes from the |
| `PERFETTO_MACHINE_NAME` env var when `traced_relay` is started, falling |
| back to `uname -s`. The literal name `"host"` is a synonym for the |
| machine running `traced`. |
| |
| Producers on a single kernel cannot stand in for "two machines" even for |
| testing. The two `traced_probes` instances would race over the same |
| `/sys/kernel/tracing/` ring buffers, and per-CPU events would be |
| partitioned arbitrarily between the two `machine_id`s — the trace looks |
| valid but is silently torn. Multi-machine setups need two kernels (two |
| machines, host plus a VM, separate containers with their own kernel |
| namespaces, etc.). |
| |
| ## Limitations and constraints |
| |
| * `traced_relay` cannot run on the same machine as `traced` — both bind the |
| local producer socket. Each machine in the setup runs *either* `traced` |
| (the host) *or* `traced_relay` (every other machine). |
| * Every remote machine must have a network path to the host's relay |
| endpoint, on TCP or vsock. |
| * Cross-machine clock alignment is only as good as the ping protocol's |
| measurement of the offset; a roughly-aligned wall clock (NTP or |
| similar) helps the first snapshots but is not strictly required. |
| * UI per-machine track rendering is still maturing. SQL on the `machine` |
| table and `machine_id` columns is the authoritative way to slice |
| cross-machine data today. |
| |
| ## Next steps |
| |
| * [Multi-machine recording](/docs/learning-more/multi-machine-tracing.md) — |
| step-by-step walk-through of recording a multi-machine trace between two |
| Linux hosts. |
| * [Clock Synchronization](/docs/concepts/clock-sync.md) — the single-machine |
| clock-sync graph that the cross-machine offsets fold into at import. |
| * [`machine` table reference](/docs/analysis/sql-tables.autogen#machine) — |
| full schema of the table populated from `SetPeerIdentity`. |
| |
| [Trace Processor]: /docs/analysis/trace-processor.md |