Docs: add Trace Processor docs
TBR: lalitm,taylori
Change-Id: I6c6688e0a6d4691fb2590988de97e38e89ff7edd
diff --git a/docs/trace-processor.md b/docs/trace-processor.md
new file mode 100644
index 0000000..5644457
--- /dev/null
+++ b/docs/trace-processor.md
@@ -0,0 +1,157 @@
+# Trace Processor
+
+The Trace Processor is a C++ library
+([/src/trace_processor](/src/trace_processor)) that is able to ingest traces of
+various format and expose them in a massaged, higher level format, queryable
+through SQL queries. The trace processor is used:
+* By the [Perfetto UI](https://ui.perfetto.dev/), in the form of a
+ Web Assembly module.
+* Standalone, using the `trace_processor_shell` target
+ (`ninja -C out/xxx trace_processor_shell`).
+* In internal google3 pipelines for batch processing.
+
+Supported input formats:
+ * Perfetto .proto traces
+ * Chrome .json trace events [WIP]
+ * [NOT IMPLEMENTED YET] ftrace format as per `/sys/kernel/debug/tracing/trace`.
+
+
+
+Rationale
+---------
+Traces are raw because they are optimized for fast & low overhead writing.
+Despite being protos, their output is not ideal for being consumed by third
+parties as-is. Some data massaging is required.
+Examples:
+* Ftrace sched_switch events only provide thread names and thread IDs.
+ In order to attribute execution times to the package/process that data needs
+ to be joined with the process_tree events to join TIDs with their parent PID
+ and process name.
+* Even after this join, sched_switch events produce two slices (one at the
+ beginning, one at the end) per sched event. What most consumers want to see
+ instead is one "interval" per thread execution time-slice.
+* Similarly ftrace ext4 events provide only inode numbers and those need to be
+ joined with inode->path events.
+
+
+Schema
+------
+
+### sched table
+`ts`
+Timestamp of the scheduling event, in nanoseconds. This comes from the
+CLOCK_BOOTTIME, when available.
+
+`dur`
+Duration of the scheduling event, in nanoseconds.
+
+`utid`
+ID of the thread. This is NOT the UNIX pid/tid (see below).
+This can be used to joined the `thread` and `process` tables.
+
+`cpu`
+CPU number where the scheduling event happened.
+
+`cycles`
+[WIP] Number of CPU cycles. This keeps into account cpu frequency changes that
+happened during the `dur` interval. This is available only if the `cpufreq`
+ftrace category has been enabled in the [trace config](/docs/trace-config.md).
+
+
+### process table
+`upid`
+Unique process ID. This is NOT the UNIX pid. This is a sequence number generated
+by the trace processor to uniquely identify a process in the trace. This is to
+deal with the fact that UNIX pids can be recycled and two distinct processes
+which lifetimes don't overlap can be assigned the same pid.
+
+`name`
+Process name, as per /proc/pid/cmdline.
+
+`pid`
+The UNIX pid (also known as Thread Group ID in Linux). This also matches the
+tid of the process' main thread.
+
+
+Example:
+
+| upid | name | pid |
+|-------------------|--------------------|--------------------|
+| 1 | /system/bin/logd | 601 |
+| 2 | rcu_preempt | 7 |
+| 3 | rcuop/4 | 44 |
+| 4 | rcuop/6 | 60 |
+
+### thread table
+`utid`
+Unique thread ID. This is NOT the Linux pid or tid. Like the above, this is a
+sequence number generated by the trace processor to uniquely identify a thread
+in the trace.
+
+`upid`
+ID of the parent process in the `process` table.
+This can be used to JOIN a thread with its process.
+
+`name`
+Threaf name, as per /proc/pid/task/tid/comm.
+
+`tid`
+The Linux thread id (confusingly named "pid" in the Linux-world).
+For the process' main thread `tid` == `tgid` == `pid`.
+
+Example:
+
+| utid | upid | name | tid |
+|--------|-------|------------------|------|
+| 1 | 1 | logd.klogd | 632 |
+| 2 | 2 | rcu_preempt | 7 |
+| 3 | 4 | rcuop/6 | 60 |
+| 4 | 6 | rcuop/3 | 36 |
+| 5 | 8 | sugov:0 | 588 |
+| 6 | 9 | kworker/u16:6 | 9283 |
+| 7 | 12 | sensors@1.0-ser | 1021 |
+| 8 | 12 | HwBinder:797_1 | 1626 |
+
+
+Sample queries for the `sched` (sched_switch events) table
+----------------------------------------------------------
+
+### Trace duration
+``` sql
+select ((select max(ts) from sched) - (select min(ts) from sched)) / 1e9 as duration_sec
+```
+
+### Total CPU usage
+``` sql
+select cpu, sum(dur)/1e9 as cpu_time_sec from sched group by cpu order by cpu
+```
+
+### List all processes
+``` sql
+select process.name, pid from process limit 100
+```
+
+### List all processes and threads
+``` sql
+select process.name as proc_name, pid, thread.name as thread_name, tid from thread left join process using(upid) limit 100
+```
+
+### CPU time for top 100 threads
+``` sql
+select thread.name as thread_name, tid, cpu_sec from (select utid, sum(dur)/1e9 as cpu_sec from sched group by utid order by dur desc limit 100) inner join thread using(utid)
+```
+
+With matching process names
+``` sql
+select thread.name as thread_name, process.name as proc_name, tid, pid, cpu_sec from (select utid, sum(dur)/1e9 as cpu_sec from sched group by utid order by dur desc limit 100) left outer join thread using(utid) left outer join process using(upid)
+```
+
+### CPU time for top 100 processes
+``` sql
+select proc_name, cpu_sec from (select process.name as proc_name, upid, cpu_sec from (select utid, sum(dur)/1e9 as cpu_sec from sched group by utid) left join thread using(utid) left join process using(upid)) group by upid order by cpu_sec desc limit 100
+```
+
+### CPU time for top 100 processes broken down by cpu
+``` sql
+select proc_name, cpu, cpu_sec from (select process.name as proc_name, upid, cpu, cpu_sec from (select cpu, utid, sum(dur)/1e9 as cpu_sec from sched group by utid) left join thread using(utid) left join process using(upid)) group by upid, cpu order by cpu_sec desc limit 100
+```