Writing Perfetto-based metrics
=============
Contents
---------
1. Background
2. The Perfetto Metrics Platform
3. Writing your first metric - step by step
4. Breaking down and composing metrics (TBD)
5. Adding a new metric or editing an existing metric (TBD)
6. Running a metric over a set of traces (TBD)
7. Metrics platform as an API (TBD)
Background
---------
Using traces allows the computation of reproducible metrics in a wide range
of situations; examples include benchmarks, lab tests, and analysis of
large corpora of traces. In these cases, trace-based metrics allow for direct
root-causing when a regression is detected.
The Perfetto Metrics Platform
----------
The metrics platform (powered by the
[trace processor](trace-processor.md)) allows metrics authors to write
SQL queries to generate metrics in the form of protobuf messages or proto text.
We strongly encourage all metrics derived from Perfetto traces to be added to
the Perfetto repo unless there is a clear reason (e.g. confidentiality) why
these metrics should not be publicly available.
In return for upstreaming metrics, authors will have first-class support for
running metrics locally and the confidence that their metrics will remain stable
as trace processor is developed.
For example, generating the full (human readable) set of Android memory
metrics on a trace is as simple as:
```shell
trace_processor_shell --run-metrics android_mem <trace>
```
As well as scaling up during development, from running on a single trace
locally to running on a large set of traces, the reverse is also very useful.
When an anomaly is observed in the metrics of a lab benchmark, you can simply
download a representative trace and run the same metric locally in the shell.
Since the same code runs locally and remotely, you can be confident in
reproducing the issue and can use the power of trace processor and/or the
Perfetto UI to identify the problem!
Writing your first metric - step by step
----------
To begin, all you need is some familiarity with SQL and you're ready to start!
Suppose that we want to write a metric which computes the CPU time for every
process in the trace and lists the top 5 processes (by CPU time) along with
the number of threads associated with each of those processes over its
lifetime.
*Note:*
* If you want to jump straight to the code: at the end of this guide, your
  workspace should look something like this [GitHub gist](https://gist.github.com/tilal6991/c221cf0cae17e298dfa82b118edf9080). See Steps 0 and 4
  below for where to get trace processor and how to run it to output the
  metrics.
### Step 0
As a setup step, you'll want to create a folder to act as a scratch workspace;
this folder will be referred to using the env variable `$WORKSPACE` in Step 4.
The other thing you'll need is trace processor shell. You can download this
[here](https://get.perfetto.dev/trace_processor) or you can build from source
using the instructions [here](trace-processor.md). Whichever method you
choose, the `$TRACE_PROCESSOR` env variable will be used to refer to the
location of the binary in Step 4.
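For example, an illustrative setup (the paths here are placeholders; point
them at wherever you created the folder and downloaded the binary):
```shell
# Illustrative setup only: substitute your own paths.
mkdir -p ~/metric-workspace
export WORKSPACE=~/metric-workspace
export TRACE_PROCESSOR=~/Downloads/trace_processor
chmod +x $TRACE_PROCESSOR
```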
### Step 1
As all metrics in the metrics platform are defined using protos, the metric
needs to be structured as a proto. For this metric, there needs to be some notion
of a process name along with its CPU time and number of threads.
Starting off, in a file named `top_five_processes.proto` in our workspace,
let's create a basic proto message called ProcessInfo with those three fields:
```protobuf
message ProcessInfo {
  optional string process_name = 1;
  optional int64 cpu_time_ms = 2;
  optional uint32 num_threads = 3;
}
```
Next up is a wrapping message which will hold the repeated field containing
the top 5 processes.
```protobuf
message TopProcesses {
  repeated ProcessInfo process_info = 1;
}
```
Finally, let's define an extension to the root proto for all metrics -
the
[TraceMetrics](https://android.googlesource.com/platform/external/perfetto/+/HEAD/protos/perfetto/metrics/metrics.proto#39)
proto.
```protobuf
extend TraceMetrics {
  optional TopProcesses top_five_processes = 450;
}
```
Adding this extension field allows trace processor to link the newly defined
metric to the `TraceMetrics` proto.
*Notes:*
* The field ids 450-500 are reserved for local development so you can use
any of them as the field id for the extension field.
* The choice of field name here is important as the SQL file and the final
table generated in SQL will be based on this name.
Putting everything together, along with some boilerplate header information
gives:
```protobuf
syntax = "proto2";
package perfetto.protos;
import "protos/perfetto/metrics/metrics.proto";
message ProcessInfo {
optional string process_name = 1;
optional int64 cpu_time_ms = 2;
optional uint32 num_threads = 3;
}
message TopProcesses {
repeated ProcessInfo process_info = 1;
}
extend TraceMetrics {
optional TopProcesses top_processes = 450;
}
```
### Step 2
Let's write the SQL to generate the table of the top 5 processes, ordered by
the sum of the CPU time they ran for, along with the number of threads
associated with each process. The following SQL should be added to a file
called `top_five_processes.sql` in your workspace:
```sql
CREATE VIEW top_five_processes_by_cpu AS
SELECT
  process.name as process_name,
  CAST(SUM(sched.dur) / 1e6 as INT64) as cpu_time_ms,
  COUNT(DISTINCT utid) as num_threads
FROM sched
INNER JOIN thread USING(utid)
INNER JOIN process USING(upid)
GROUP BY process.name
ORDER BY cpu_time_ms DESC
LIMIT 5;
```
Let's break this query down:
1. The first table used is the `sched` table. This contains all the scheduling
   data available in the trace. Each scheduling "slice" is associated with a
   thread, which is uniquely identified in Perfetto traces by its `utid`. The
   two pieces of information we need from the sched table are the `dur` -
   short for duration, this is the amount of time the slice lasted - and the
   `utid`, which will be used to join with the thread table.
2. The next table is the thread table. This gives us a lot of information we
   are not particularly interested in (including the thread name) but it does
   give us the `upid`. Similar to `utid`, `upid` is the unique identifier for
   a process in a Perfetto trace. In this case, `upid` refers to the process
   which hosts the thread given by `utid`.
3. The final table is the process table. This gives us the name of the process
   associated with the original sched slice.
4. With the process, thread and duration for each sched slice, all the slices
   for a single process are collected, their durations summed to get the CPU
   time (dividing by 1e6 as sched's duration is in nanoseconds), and the
   number of distinct threads counted.
5. Finally, we order by CPU time and limit to the top 5 processes (a quick way
   to sanity-check the resulting view is shown after this list).
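If you have a trace to hand, it can be worth sanity-checking the view before
moving on. A minimal check, run in trace processor's interactive shell after
executing the `CREATE VIEW` statement above, might look like:
```sql
-- Sanity check: the view should return at most five rows, one per process,
-- ordered by descending CPU time.
SELECT * FROM top_five_processes_by_cpu;
```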
### Step 3
Now that the result of the metric has been expressed as an SQL table, it needs
to be converted to a proto. The metrics platform has built-in support for
emitting protos using SQL functions; this is used extensively in this step.
Let's look at how it works for our table above.
```sql
CREATE VIEW top_five_processes_output AS
SELECT TopProcesses(
  'process_info', (
    SELECT RepeatedField(
      ProcessInfo(
        'process_name', process_name,
        'cpu_time_ms', cpu_time_ms,
        'num_threads', num_threads
      )
    )
    FROM top_five_processes_by_cpu
  )
);
```
Let's break this down again:
1. Starting from the inner-most SELECT statement, there is what looks like a
   function call to the ProcessInfo function; in fact, this is no coincidence.
   For each proto that the metrics platform knows about, it generates a SQL
   function with the same name as the proto. This function takes key-value
   pairs, with the key being the name of the proto field to fill and the value
   being the data to store in the field. The output is the proto created by
   writing the fields described in the function! (*)
   In this case, this function is called once for each row in the
   `top_five_processes_by_cpu` table. The output will be a fully filled
   ProcessInfo proto for each row.
   The call to the `RepeatedField` function is the most interesting and also
   the most important part. In technical terms, `RepeatedField` is an
   aggregate function; practically, this means that it takes a full table of
   values and generates a single array containing all the values passed to it.
   Therefore, the output of this whole SELECT statement is an array of
   5 ProcessInfo protos.
2. Next is the creation of the `TopProcesses` proto. By now, the syntax should
   already feel somewhat familiar; the proto builder function is called to
   fill in the `process_info` field with the array of protos from the inner
   function.
   The output of this SELECT is a single `TopProcesses` proto containing the
   ProcessInfos as a repeated field.
3. Finally, the view is created. This view is specially named to allow the
   metrics platform to query it to obtain the root proto for each metric (in
   this case `TopProcesses`). See the note below as to the pattern behind this
   view's name.
(*) - side note: this is not strictly true. To type-check the protos, we
also return some metadata about the type of the proto, but this is unimportant
for metric authors.
*Note:*
* It is important that the view be named
  `{name of TraceMetrics extension field}_output` - in our case,
  `top_five_processes_output`. This is the pattern used and expected by the
  metrics platform for all metrics.
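To get a feel for the proto builder functions in isolation, you can experiment
with a standalone query like the sketch below (the literal values are
illustrative; the `ProcessInfo` function only exists once trace processor
knows about your metric's protos):
```sql
-- Illustrative standalone call: builds a single ProcessInfo proto from
-- literal key-value pairs, without reading any table.
SELECT ProcessInfo(
  'process_name', 'example_process',
  'cpu_time_ms', 100,
  'num_threads', 2
);
```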
And that's all the SQL we need to write! Our final file should look like so:
```sql
CREATE VIEW top_five_processes_by_cpu AS
SELECT
  process.name as process_name,
  CAST(SUM(sched.dur) / 1e6 as INT64) as cpu_time_ms,
  COUNT(DISTINCT utid) as num_threads
FROM sched
INNER JOIN thread USING(utid)
INNER JOIN process USING(upid)
GROUP BY process.name
ORDER BY cpu_time_ms DESC
LIMIT 5;

CREATE VIEW top_five_processes_output AS
SELECT TopProcesses(
  'process_info', (
    SELECT RepeatedField(
      ProcessInfo(
        'process_name', process_name,
        'cpu_time_ms', cpu_time_ms,
        'num_threads', num_threads
      )
    )
    FROM top_five_processes_by_cpu
  )
);
```
*Notes:*
* The name of the SQL file should be the same as the name of the TraceMetrics
  extension field. This allows the metrics platform to associate the proto
  extension field with the SQL which needs to be run to generate it.
### Step 4
This is the last step and where we get to see the results of our work!
For this step, all we need is a one-liner, invoking trace processor
shell (see Step 0 for downloading it):
```shell
$TRACE_PROCESSOR --run-metrics $WORKSPACE/top_five_processes.sql $TRACE 2> /dev/null
```
(If you want an example trace to test this on, see the Notes section below.)
By passing the SQL file for the metric we want to compute, trace processor
uses the name of this file to find the proto, the name of the output view for
that proto, and the name of the extension field for `TraceMetrics`; this is
why it was important to choose the names of these objects carefully.
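Concretely, the three names which have to line up in this guide are:
```
top_five_processes.sql        # the metric SQL file
top_five_processes            # the TraceMetrics extension field
top_five_processes_output     # the view queried for the root proto
```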
*Notes:*
* If something doesn't work as intended, check that your workspace looks the
same as the contents of this [GitHub gist](https://gist.github.com/tilal6991/c221cf0cae17e298dfa82b118edf9080).
* A good example trace for this metric is the Android example trace used by
  the Perfetto UI, found [here](https://storage.googleapis.com/perfetto-misc/example_android_trace_30s_1).
* We're redirecting stderr to remove any noise from trace parsing that
  trace processor generates.
If everything went successfully, you should see something like the following
(this is specifically the output for the Android example trace linked above):
```
[perfetto.protos.top_five_processes] {
  process_info {
    process_name: "com.google.android.GoogleCamera"
    cpu_time_ms: 15154
    num_threads: 125
  }
  process_info {
    process_name: "sugov:4"
    cpu_time_ms: 6846
    num_threads: 1
  }
  process_info {
    process_name: "system_server"
    cpu_time_ms: 6809
    num_threads: 66
  }
  process_info {
    process_name: "cds_ol_rx_threa"
    cpu_time_ms: 6684
    num_threads: 1
  }
  process_info {
    process_name: "com.android.chrome"
    cpu_time_ms: 5125
    num_threads: 49
  }
}
```
### Conclusion
That finishes the introductory guide to writing a metric using the Perfetto
metrics platform! For more information about where to go next, the following
links may be useful:
* To understand what data is available to you and how the SQL tables are
structured see the [trace processor](trace-processor.md) docs.
* To see how you can use the `RUN_METRIC` function to extract common snippets
  of SQL and reuse them when writing bigger metrics, continue reading (a small
  sketch follows this list)!
* To see how you can add your own metrics to the platform or edit an existing
metric, continue reading!
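As a small taste of composition, one metric file can import the SQL of another
using the `RUN_METRIC` function. A minimal sketch (the path is illustrative,
naming one of the Android metric files in the Perfetto repo):
```sql
-- Sketch: RUN_METRIC executes another metric's SQL file so that the tables
-- and views it defines become available in the current file. Paths are
-- resolved relative to the metrics directory in the Perfetto repo.
SELECT RUN_METRIC('android/process_metadata.sql');
```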
Breaking down and composing metrics
----------
Coming soon!
Adding a new metric or editing an existing metric
----------
Coming soon!
Running a metric over a set of traces
----------
Coming soon!
Metrics platform as an API
----------
Coming soon!