# Batch Trace Processor
This document describes the overall design of Batch Trace Processor and
aids in integrating it into other systems.

## Motivation
The Perfetto trace processor is the de-facto way to perform analysis on a
single trace. Using the
[trace processor Python API](/docs/analysis/trace-processor#python-api),
traces can be queried interactively, plots can be made from the results, and
so on.

While queries on a single trace are useful when debugging a specific problem
in that trace or in the very early stages of understanding a domain, it soon
becomes limiting. One trace is unlikely to be representative of the entire
population and it's easy to overfit queries, i.e. to spend a lot of effort
breaking down a problem in that trace while neglecting other, more common
issues in the population.

Because of this, what we actually want is to be able to query many traces
(usually on the order of 250-10000+) and identify the patterns which show
up in a significant fraction of them. This ensures that time is being spent
on issues which are affecting user experience instead of just a random
problem which happened to show up in the trace.

One low-effort option for solving this problem is simply to ask people to use
utilities like [Executors](https://docs.python.org/3/library/concurrent.futures.html#executor-objects)
with the Python API to load multiple traces and query them in parallel (a
sketch of this approach follows the list below). Unfortunately, there are
several downsides to this approach:
* Every user has to reinvent the wheel every time they want to query multiple
traces. Over time, there would likely be a proliferation of slightly modified
code, copied from place to place.
* While the basics of parallelising queries on multiple traces on a single
machine are straightforward, one day we may want to shard trace processing
across multiple machines. Once this happens, the complexity of the code would
rise significantly, to the point where a central implementation becomes a
necessity. Because of this, it's better to have the API in place before
engineers start building their own custom solutions.
* A big aim for the Perfetto team these days is to make trace analysis more
accessible, reducing the number of places where we need to be in the loop.
Having a well-supported API for an important use case like bulk trace
analysis directly helps with this.
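
For illustration, the do-it-yourself approach described above might look
something like this (a minimal sketch, assuming the `TraceProcessor(trace=...)`
constructor from the Python API and a couple of local trace files):

```python
from concurrent.futures import ThreadPoolExecutor

from perfetto.trace_processor import TraceProcessor

def query_trace(path):
    # Each call pays the full cost of loading and parsing the trace
    # before the query can run.
    tp = TraceProcessor(trace=path)
    try:
        return tp.query('SELECT ts, dur FROM slice').as_pandas_dataframe()
    finally:
        tp.close()

trace_paths = ['/tmp/trace_1.pftrace', '/tmp/trace_2.pftrace']
with ThreadPoolExecutor() as executor:
    results = list(executor.map(query_trace, trace_paths))
```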

While we've discussed querying traces so far, the experience of loading traces
from different sources should be just as good. This has historically been a
big reason why the Python API has not gained as much adoption as we would
have liked.

Especially internally in Google, we should not rely on engineers knowing
where traces live on the network filesystem or how the directories are laid
out. Instead, they should simply be able to specify the data source (e.g.
lab, testing population) and some parameters (e.g. build id, date, kernel
version) that traces should match; traces meeting these criteria should then
be found and loaded automatically.

Putting all this together, we want to build a library which can:
* Interactively query ~1000+ traces in O(s) (for simple queries)
* Expose the full SQL expressiveness of trace processor
* Load traces from many sources with minimal ceremony, including
Google-internal sources: e.g. lab runs and internal testing populations
* Integrate with data analysis libraries for easy charting and visualization

## Design Highlights
In this section, we briefly discuss some of the most impactful design decisions
taken when building batch trace processor and the reasons behind them.

### Language
The choice of language is pretty straightforward. Python is already the go-to
language for data analysis in a wide variety of domains and our problem is
not unique enough to warrant making a different decision. Moreover, another
point in its favour is the existence of the Python API for trace processor.
This further eases the implementation as we do not have to start from scratch.

The main downside of choosing Python is performance but, given that all the
data crunching happens in C++ inside trace processor, this is not a big
factor.

### Trace URIs and Resolvers
[Trace URIs](/docs/analysis/batch-trace-processor#trace-uris)
are an elegant solution to the problem of loading traces from a diverse range
of public and internal sources. As with web URIs, the idea with trace URIs is
to describe both the protocol (i.e. the source) from which traces should be
fetched and the arguments (i.e. query parameters) which the traces should
match.
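For example, a hypothetical URI for an internal lab source might look like
`lab:?build_id=1234&kernel_version=5.15`: here `lab` is the protocol and the
query parameters constrain which traces should be fetched.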

Batch trace processor should integrate tightly with trace URIs and their
resolvers. Users should be able to pass either just the URI (which is really
just a string, for maximum flexibility) or a resolver object which can yield
a list of trace file paths.

To handle URI strings, there should be some mechanism for "registering"
resolvers to make them eligible to resolve a certain "protocol". By default,
we should provide a resolver for the local filesystem. We should ensure that
the resolver design allows resolvers to be closed-source while the rest of
batch trace processor remains open.

Along with the job of yielding a list of traces, resolvers should also be
responsible for creating metadata for each trace: these are pieces of
information about the trace that the user might be interested in, e.g. OS
version, device name, collection date. The metadata can then be used when
"flattening" results across many traces, as discussed below.

### Persisting loaded traces
Optimizing the loading of traces is critical to the O(s) query performance we
want out of batch trace processor. Traces are often accessed over the
network, meaning fetching their contents has high latency. Traces also take
at least a few seconds to parse, eating into the O(s) budget before the
queries even run.

To address this issue, we decided to keep all traces fully loaded in memory
in trace processor instances. That way, instead of loading them for every
query or set of queries, we can issue queries directly.
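
In practice this means traces are loaded once, up front, and any number of
queries can then be issued against the warm instances. A sketch, assuming
the `BatchTraceProcessor` class from the Python API:

```python
from perfetto.batch_trace_processor.api import BatchTraceProcessor

# Loading happens once: every trace is parsed and kept fully loaded in
# memory by its own trace processor instance.
btp = BatchTraceProcessor(traces=['/tmp/trace_1.pftrace',
                                  '/tmp/trace_2.pftrace'])

# Subsequent queries skip loading entirely and run directly against the
# in-memory instances, keeping simple queries in O(s).
slices = btp.query('SELECT ts, dur FROM slice')
threads = btp.query('SELECT utid, name FROM thread')

btp.close()
```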

For the moment, we restrict the loading and querying of traces to a single
machine. While querying n traces is "embarrassingly parallel" and shards
perfectly across multiple machines, introducing distributed systems into any
solution simply makes everything more complicated. The move to multiple
machines is explored further in the "Future plans" section.

### Flattening query results
The naive way to return the result of querying n traces is a list of n
elements, with each element being the result for a single trace. However,
after performing several case-study performance investigations with batch
trace processor, it became clear that this answer was not the most convenient
for the end user.

Instead, a pattern which proved very useful was to "flatten" the results into
a single table, containing the results from all the traces. However, simply
flattening causes us to lose the information about which trace a row
originated from. We can deal with this by allowing resolvers to silently add
columns with the metadata for each trace.

So suppose we query three traces with:

```sql
SELECT ts, dur FROM slice
```

Then the flattening operation might do something like this behind the scenes:
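
The sketch below shows the idea using pandas (the metadata values and column
names are purely illustrative):

```python
import pandas as pd

# Per-trace query results, one DataFrame per trace.
results = [
    pd.DataFrame({'ts': [100, 200], 'dur': [10, 20]}),
    pd.DataFrame({'ts': [150], 'dur': [5]}),
    pd.DataFrame({'ts': [300], 'dur': [7]}),
]
# Resolver-created metadata for each trace.
metadata = [
    {'os_version': '12', 'device_name': 'deviceA'},
    {'os_version': '13', 'device_name': 'deviceB'},
    {'os_version': '13', 'device_name': 'deviceC'},
]

# Append each trace's metadata as extra columns, then concatenate the
# per-trace tables into one flat table.
flattened = pd.concat(
    [df.assign(**meta) for df, meta in zip(results, metadata)],
    ignore_index=True)
```
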
## Integration points
Batch trace processor needs to be open source yet allow deep integration with
Google-internal tooling. Because of this, there are various integration
points built into the design to allow closed-source components to be slotted
in place of the default, open-source ones.

The first point is the formalization of the idea of "platform" code. Ever
since the beginning of the Python API, there has been a need for internal
code to run slightly differently to the open-source code. For example,
Google's internal Python distribution does not use Pip, instead packaging
dependencies into a single binary. The notion of a "platform" loosely existed
to abstract these sorts of differences, but it was very ad-hoc. As part of
the batch trace processor implementation, this has been retroactively
formalized.

Resolvers are another big point of pluggability. By allowing registration of
a "protocol" for each internal trace source (e.g. lab, testing population), we
allow for trace loading to be neatly abstracted.

Finally, for batch trace processor specifically, we abstract the creation of
thread pools for loading traces and running queries. The parallelism and
memory available to programs running internally often does not correspond 1:1
with the CPUs/memory available on the machine: internal APIs need to be
consulted to obtain this information.
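
As an illustration, the thread pool part of the platform abstraction might be
shaped like this (hypothetical names, not the actual interface):

```python
import os
from concurrent.futures import ThreadPoolExecutor

class Platform:
    """Environment-specific behaviour; this is the open-source default."""

    def create_load_executor(self) -> ThreadPoolExecutor:
        # Open source: size the pool from the machine's CPU count. An
        # internal platform would instead consult internal APIs to find
        # the parallelism actually granted to the process.
        return ThreadPoolExecutor(max_workers=os.cpu_count())

    def create_query_executor(self) -> ThreadPoolExecutor:
        return ThreadPoolExecutor(max_workers=os.cpu_count())
```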

## Future plans
One common problem when running batch trace processor is that we are
constrained by a single machine and so can only load O(1000) traces. For rare
problems, there might only be a handful of traces matching a given pattern
even in such a large sample.

A way around this would be to build a "no trace limit" mode. The idea here is
that you would develop queries as usual, with batch trace processor operating
on O(1000) traces with O(s) performance. Once the queries are relatively
finalized, we could then "switch" the mode of batch trace processor to
operate closer to a "MapReduce"-style pipeline which runs over O(10000+)
traces, loading only O(n CPUs) traces at any one time.
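
A rough sketch of how such a mode might work (the `load_trace` and
`run_query` helpers here are assumptions for illustration, not an existing
API):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def query_all(trace_paths, sql, load_trace, run_query):
    # "No trace limit" mode: rather than keeping every trace resident in
    # memory, keep at most ~n_cpus traces loaded at any one time; each
    # trace is loaded, queried and discarded before moving on.
    n_cpus = os.cpu_count() or 4

    def map_one(path):
        tp = load_trace(path)
        try:
            return run_query(tp, sql)
        finally:
            tp.close()

    with ThreadPoolExecutor(max_workers=n_cpus) as pool:
        # The pool is the "map" stage; results are then reduced (e.g.
        # flattened) exactly as in the in-memory mode.
        return list(pool.map(map_one, trace_paths))
```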

This retains the quick iteration speed while developing queries while also
allowing for large-scale analysis without needing to move code to a pipeline
model. However, this approach does not really address the root cause of the
problem, which is that we are restricted to a single machine.

|  | The "ideal" solution here is to, as mentioned above, shard batch trace processor | 
|  | across >1 machine. When querying traces, each trace is entirely independent of | 
|  | any other so paralleising across multiple machines yields very close to perfect | 
|  | gains in performance at little cost. | 
|  |  | 
This would, however, be quite a complex undertaking. We would need to design
the API in a way that allows for pluggable integration with various compute
platforms (e.g. GCP, Google-internal, custom infra). Even restricting
ourselves to just Google infra and leaving others open for contribution,
internal infra's ideal workload does not match the approach of "have a bunch
of machines tied to one user waiting for their input". There would need to be
significant research and design work before going down this path, but it
would likely be worthwhile.