# Batch Trace Processor
This document describes the overall design of Batch Trace Processor and
aids in integrating it into other systems.
![BTP Overview](/docs/images/perfetto-btp-overview.svg)
## Motivation
The Perfetto trace processor is the de-facto way to perform analysis on a
single trace. Using the
[trace processor Python API](/docs/analysis/trace-processor#python-api),
traces can be queried interactively, plots can be made from the results, and so on.
While queries on a single trace are useful when debugging a specific problem
in that trace or in the very early stages of understanding a domain, it soon
becomes limiting. One trace is unlikely to be representative
of the entire population, and it's easy to overfit queries, i.e. to spend a
lot of effort breaking down a problem in one trace while neglecting
other, more common issues in the population.
Because of this, what we actually want is to be able to query many traces
(usually on the order of 250-10000+) and identify the patterns which show
up in a significant fraction of them. This ensures that time is being spent
on issues which are affecting user experience instead of just a random
problem which happened to show up in the trace.
One low-effort option for solving this problem is simply to ask people to use
utilities like [Executors](https://docs.python.org/3/library/concurrent.futures.html#executor-objects)
with the Python API to load multiple traces and query them in parallel.
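
For illustration, a minimal sketch of this do-it-yourself approach might look
something like the following (the trace paths are assumed to already be known):

```python
from concurrent.futures import ThreadPoolExecutor

from perfetto.trace_processor import TraceProcessor

# Hypothetical list of already-known trace paths.
trace_paths = ['/tmp/trace_1.pftrace', '/tmp/trace_2.pftrace']

def query_trace(path):
  # Each worker loads one trace and runs the same query against it.
  tp = TraceProcessor(trace=path)
  try:
    return tp.query('SELECT ts, dur FROM slice').as_pandas_dataframe()
  finally:
    tp.close()

with ThreadPoolExecutor() as executor:
  per_trace_results = list(executor.map(query_trace, trace_paths))
```
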
Unfortunately, there are several downsides to this approach:
* Every user has to reinvent the wheel every time they want to query multiple
traces. Over time, this would likely lead to a proliferation of slightly
modified copies of the same code.
* While the basics of parallelising queries on multiple traces on a single
machine are straightforward, one day we may want to shard trace processing
across multiple machines. Once this happens, the complexity of the code would
rise significantly, to the point where a central implementation becomes a
necessity. Because of this, it's better to have the API in place before
engineers start building their own custom solutions.
* A big aim for the Perfetto team these days is to make trace analysis more
accessible to reduce the number of places where we need to be in the loop.
Having a well supported API for an important use case like bulk trace analysis
directly helps with this.
While we've discussed querying traces so far, the experience of loading traces
from different sources should be just as good. This has historically been a big
reason why the Python API has not gained as much adoption as we would have
liked.

Especially inside Google, we should not be relying on engineers knowing where
traces live on the network filesystem or how the directories are laid out.
Instead, they should simply be able to specify the data source (i.e. lab,
testing population) and some parameters (e.g. build id, date, kernel version)
that traces should match; traces meeting these criteria should then be found
and loaded.
Putting all this together, we want to build a library which can:
* Interactively query ~1000+ traces in O(s) (for simple queries)
* Expose full SQL expressiveness from trace processor
* Load traces from many sources with minimal ceremony. This should include
Google-internal sources: e.g. lab runs and internal testing populations
* Integrate with data analysis libraries for easy charting and visualization
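
To make these goals concrete, here is a hypothetical sketch of how such a
library might be used; the module path, URI format and method names below are
illustrative rather than a finalized API:

```python
# NOTE: module path, URI format and method names are illustrative only.
from perfetto.batch_trace_processor.api import BatchTraceProcessor

# A trace URI (see "Trace URIs and Resolvers" below) selecting traces from a
# hypothetical lab source by build id.
with BatchTraceProcessor('lab:build_id=1234') as btp:
  # One flattened DataFrame containing results from every matching trace.
  df = btp.query_and_flatten("SELECT dur FROM slice WHERE name = 'measure'")
  print(df.dur.describe())
```
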
## Design Highlights
In this section, we briefly discuss some of the most impactful design decisions
taken when building batch trace processor and the reasons behind them.
### Language
The choice of language is pretty straightforward. Python is already the go-to
language for data analysis in a wide variety of domains, and our problem is not
unique enough to warrant making a different decision. Another point in its
favour is the existence of the Python API for trace processor, which further
eases the implementation as we do not have to start from scratch.

The main downside of choosing Python is performance, but given that all the
data crunching happens in C++ inside trace processor, this is not a big factor.
### Trace URIs and Resolvers
[Trace URIs](/docs/analysis/batch-trace-processor#trace-uris)
are an elegant solution to the problem of loading traces from a diverse range
of public and internal sources. As with web URIs, the idea with trace URIs is
to describe both the protocol (i.e. the source) from which traces should be
fetched and the arguments (i.e. query parameters) which the traces should match.
Batch trace processor should integrate tightly with trace URIs and their
resolvers. Users should be able to pass either just the URI (which is really
just a string, for maximum flexibility) or a resolver object which can yield a
list of trace file paths.
To handle URI strings, there should be some mechanism for "registering"
resolvers to make them eligible to resolve a certain "protocol". By default, we
should provide a resolver which handles the filesystem. We should ensure that
the resolver design allows resolvers to be closed source while the rest of
batch trace processor remains open.
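
For illustration, a minimal sketch of how resolver registration and the default
filesystem resolver could look; the class names, the registry and the metadata
field (which the next paragraph discusses) are assumptions, not a finalized
interface:

```python
import glob
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TraceReference:
  # Path (or other handle) from which the trace contents can be read.
  path: str
  # Per-trace metadata attached by the resolver (see the next paragraph).
  metadata: Dict[str, str] = field(default_factory=dict)


class FileSystemResolver:
  """Default resolver handling plain filesystem globs."""

  def __init__(self, pattern: str):
    self.pattern = pattern

  def resolve(self) -> List[TraceReference]:
    return [TraceReference(path=p) for p in glob.glob(self.pattern)]


# Hypothetical registry mapping a URI "protocol" to a resolver class; a
# closed-source 'lab' resolver could be registered in the same way.
RESOLVER_REGISTRY = {'file': FileSystemResolver}
```
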
Along with the job of yielding a list of traces, resolvers should also be
responsible for creating metadata for each trace: these are different pieces of
information about the trace that the user might be interested in, e.g. OS
version, device name, collection date. The metadata can then be used when
"flattening" results across many traces, as discussed below.
### Persisting loaded traces
Optimizing the loading of traces is critical for the O(s) query performance
we want out of batch trace processor. Traces are often accessed
over the network meaning fetching their contents has a high latency.
Traces also take at least a few seconds to parse, eating into the O(s) budget
before even accounting for the running time of the queries.
To address this issue, we decided to keep all traces fully loaded in memory in
trace processor instances. That way, instead of loading them for every query or
set of queries, we can issue queries directly.
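
A rough sketch of this "load once, query many times" idea, with illustrative
names (the real implementation also has to deal with platform-specific thread
pools, discussed later):

```python
from concurrent.futures import ThreadPoolExecutor

from perfetto.trace_processor import TraceProcessor


class LoadedTraces:
  """Keeps trace processor instances alive so queries skip the load cost."""

  def __init__(self, paths):
    self._executor = ThreadPoolExecutor()
    # Fetching and parsing happen once, here; the instances stay in memory.
    self._tps = list(self._executor.map(
        lambda p: TraceProcessor(trace=p), paths))

  def query(self, sql):
    # Later queries fan out to the already-loaded instances.
    return list(self._executor.map(lambda tp: tp.query(sql), self._tps))

  def close(self):
    for tp in self._tps:
      tp.close()
    self._executor.shutdown()
```
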
For the moment, we restrict the loading and querying of traces to a
single machine. While querying n traces is "embarrassingly parallel" and shards
perfectly across multiple machines, introducing distributed systems to any
solution simply makes everything more complicated. The move to multiple
machines is explored further in the "Future plans" section.
### Flattening query results
The naive way to return the result of querying n traces is a list of n
elements, with each element being the result for a single trace. However, after
performing several case-study performance investigations using batch trace
processor, it became clear that this obvious answer was not the most convenient
for the end user.
Instead, a pattern which proved very useful was to "flatten" the results into
a single table, containing the results from all the traces. However,
simply flattening causes us to lose the information about which trace a row
originated from. We can deal with this by allowing resolvers to silently add
columns with the metadata for each trace.
So suppose we query three traces with:
```sql
SELECT ts, dur FROM slice
```
Then the flattening operation might do something like this behind the scenes:
![BTP Flattening](/docs/images/perfetto-btp-flattening.svg)
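
In code, the flattening step amounts to concatenating the per-trace results and
appending the resolver-provided metadata as extra columns; a sketch with
illustrative names:

```python
import pandas as pd


def flatten(per_trace_dfs, per_trace_metadata):
  """Concatenates per-trace DataFrames, tagging rows with trace metadata."""
  flattened = []
  for df, metadata in zip(per_trace_dfs, per_trace_metadata):
    df = df.copy()
    for key, value in metadata.items():
      # e.g. key='device', value='Pixel 6', added as a constant column.
      df[key] = value
    flattened.append(df)
  return pd.concat(flattened, ignore_index=True)
```
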
## Integration points
Batch trace processor needs to be open source yet allow deep integration with
Google-internal tooling. Because of this, various integration points are built
into the design to allow closed-source components to be slotted in place of the
default, open source ones.
The first point is the formalization of the idea of "platform" code. Ever since
the beginning of the Python API, there has been a need for internal code to run
slightly differently to open source code. For example, the Google-internal
Python distribution does not use Pip, instead packaging dependencies into a
single binary. The notion of a "platform" loosely existed to abstract these
sorts of differences, but this was very ad-hoc. As part of the batch trace
processor implementation, this has been retroactively formalized.
Resolvers are another big point of pluggability. By allowing registration of
a "protocol" for each internal trace source (e.g. lab, testing population), we
allow for trace loading to be neatly abstracted.
Finally, for batch trace processor specifically, we abstract the creation of
thread pools for loading traces and running queries. The parallelism and memory
available to programs internally often does not correspond 1:1 with the
CPUs/memory available on the system: internal APIs need to be consulted to find
out this information.
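
As an illustration of how these integration points could fit together, here is
a sketch of a platform object; the class and method names are assumptions, not
the actual interface:

```python
from concurrent.futures import ThreadPoolExecutor


class Platform:
  """Default, open source platform implementation."""

  def __init__(self):
    # The filesystem resolver would be registered here by default.
    self._resolver_factories = {}

  def register_resolver(self, protocol, resolver_factory):
    # Closed-source code can register resolvers for internal protocols
    # (e.g. 'lab') without the open source code knowing about them.
    self._resolver_factories[protocol] = resolver_factory

  def resolver_for(self, protocol):
    return self._resolver_factories[protocol]

  def create_load_executor(self):
    # Open source default: let the pool size itself from the CPUs the machine
    # reports. An internal subclass would instead consult internal APIs for
    # the real CPU/memory budget available to the program.
    return ThreadPoolExecutor()
```
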
## Future plans
One common problem when running batch trace processor is that we are
constrained by a single machine and so can only load O(1000) traces.
For rare problems, there might only be a handful of traces matching a given
pattern even in such a large sample.
A way around this would be to build a "no trace limit" mode. The idea here
is that you would develop queries as usual, with batch trace processor
operating on O(1000) traces with O(s) performance. Once the queries are
relatively finalized, we could then "switch" the mode of batch trace processor
to operate closer to a "MapReduce" style pipeline which works over O(10000)+
traces, loading O(n cpus) traces at any one time.
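
A sketch of what such a mode could look like internally, assuming a simple
chunked loop (the function name and structure are illustrative):

```python
import os
from concurrent.futures import ThreadPoolExecutor

from perfetto.trace_processor import TraceProcessor


def map_over_traces(paths, sql, chunk_size=None):
  """Runs |sql| over many traces, keeping only |chunk_size| loaded at once."""
  chunk_size = chunk_size or os.cpu_count()

  def run_one(path):
    tp = TraceProcessor(trace=path)
    try:
      return tp.query(sql).as_pandas_dataframe()
    finally:
      tp.close()

  results = []
  with ThreadPoolExecutor(max_workers=chunk_size) as executor:
    for start in range(0, len(paths), chunk_size):
      # Only this chunk of traces is ever in memory at the same time.
      results.extend(executor.map(run_one, paths[start:start + chunk_size]))
  return results
```
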
This would retain the quick iteration speed while developing queries while also
allowing for large scale analysis without needing to move code to a pipeline
model. However, this approach does not really resolve the root cause of the
problem, which is that we are restricted to a single machine.
The "ideal" solution here is to, as mentioned above, shard batch trace processor
across >1 machine. When querying traces, each trace is entirely independent of
any other, so parallelising across multiple machines yields very close to
perfect gains in performance at little cost.
This would, however, be quite a complex undertaking. We would need to design
the API in a way that allows for pluggable integration with various compute
platforms (e.g. GCP, Google internal, your custom infra). Even restricting to
just Google infra and leaving others open for contribution, internal infra's
ideal workload does not match the approach of "have a bunch of machines tied to
one user waiting for their input". There would need to be significant research
and design work before going here, but it would likely be worthwhile.