docs/tracing-101.md - third_party/perfetto - Git at Google

 # What is Tracing?

 NOTE: the word "tracing" in this document is used in the context of
 **client-side** side software (e.g. programs running on a single machine). In
 the server world, **tracing** is usually short for _distributed tracing_, a way
 to collect data from many different servers to understand the flow of a
 "request" throughout multiple services. As such, this document will _not_ be
 useful to you if you are interested in such traces.

 This page provides a bird's-eye view of performance analysis and tracing. The
 aim is to orient people who have no idea what "tracing" is.

 ## Introduction to Performance

 Performance analysis is concerned with making software run _better_. The
 definition of _better_ varies widely and depends on the situation. Examples
 include:

 - performing the same work using fewer resources (CPU, memory, network, battery,
   etc.)
 - increasing utilization of available resources
 - identifying and eliminating unnecessary work altogether

 Much of the difficulty in improving performance comes from identifying the root
 cause of performance issues. Modern software systems are complicated, having a
 lot of components and a web of cross-interactions. Techniques which help
 engineers understand the execution of a system and pinpoint issues that are
 critical.

 **Tracing** and **profiling** are two such widely-used techniques for
 performance analysis.

 ## Introduction to Tracing

 **Tracing** involves collecting highly detailed data about the execution of a
 system. A single continuous session of recording is called a trace file or
 **trace** for short.

 Traces contain enough detail to fully reconstruct the timeline of events. They
 often include low-level kernel events like scheduler context switches, thread
 wakeups, syscalls, etc. With the "right" trace, reproduction of a performance
 bug is not needed as the trace provides all necessary context.

 Application code is also **instrumented** in areas of the program which are
 considered to be _important_. This instrumentation keeps track of what the
 program was doing over time (e.g. which functions were being run, or how long
 each call took) and context about the execution (e.g. what were the parameters
 to a function call, or why was a function run).

 The level of detail in traces makes it impractical to read traces directly like
 a log file in all but the simplest cases. Instead, a combination of **trace
 analysis** libraries and **trace viewers** are used. Trace analysis libraries
 provide a way for users to extract and summarize trace events in a programmatic
 manner. Trace viewers visualize the events in a trace on a timeline which give
 users a graphical view of what their system was doing over time.

 ### Logging vs tracing

 A good intuition is that logging is to functional testing what tracing is to
 performance analysis. Tracing is, in a sense, "structured" logging: instead of
 having arbitrary strings emitted from parts of the system, tracing reflects the
 detailed state of a system in a structured way to allow reconstruction of the
 timeline of events.

 Moreover, tracing frameworks (like Perfetto) place heavy emphasis on having
 minimal overhead. This is essential so that the framework does not significantly
 disrupt whatever is being measured: modern frameworks are fast enough that they
 can measure execution at the nanosecond level without significantly impacting
 the execution speed of the program.

 _Small aside: theoretically, tracing frameworks are powerful enough to act as a
 logging system as well. However, the utilization of each in practice is
 different enough that the two tend to be separate._

 ### Metrics vs tracing

 Metrics are numerical values which track the performance of a system over time.
 Usually metrics map to high-level concepts. Examples of metrics include: CPU
 usage, memory usage, network bandwidth, etc. Metrics are collected directly from
 the app or operating system while the program is running.

 After glimpsing the power of tracing, a natural question arises: why bother with
 high level metrics at all? Why not instead just use tracing and compute metrics
 on resulting traces? In some settings, this may indeed be the right approach. In
 local and lab situations using **trace-based metrics**, where metrics are
 computed from traces instead of collecting them directly, is a powerful
 approach. If a metric regresses, it's easy to open the trace to root cause why
 that happened.

 However, trace-based metrics are not a universal solution. When running in
 production, the heavyweight nature of traces can make it impractical to collect
 them 24/7. Computing a metric with a trace can take megabytes of data vs bytes
 for direct metric collection.

 Using metrics is the right choice when you want to understand the performance of
 a system over time but do not want to or can not pay the cost of collecting
 traces. In these situations, traces should be used as a **root-causing** tool.
 When your metrics show there is a problem, targeted tracing can be rolled out to
 understand why the regression may have happened.

 ## Introduction to Profiling

 **Profiling** involves sampling some usage of a resource by a program. A single
 continuous session of recording is known as a **profile**.

 Each sample collects the function callstack (i.e. the line of code along with
 all calling functions). Generally this information is aggregated across the
 profile. For each seen callstack, the aggregation gives the percentage of usage
 of the resource by that callstack. By far the most common types of profiling are
 **memory profiling** and **CPU profiling**.

 Memory profiling is used to understand which parts of a program are allocating
 memory on the heap. The profiler generally hooks into `malloc` (and `free`)
 calls of a native (C/C++/Rust/etc.) program to sample the callstacks calling
 `malloc`. Information about how many bytes were allocated is also retained. CPU
 profiling is used for understanding where the program is spending CPU time. The
 profiler captures the callstack running on a CPU over time. Generally this is
 done periodically (e.g. every 50ms), but can be also be done when certain events
 happen in the operating system.

 ### Profiling vs tracing

 There are two main questions for comparing profiling and tracing:

 1. Why profile my program statistically when I can just trace _everything_?
 2. Why use tracing to reconstruct the timeline of events when profiling gives me
    the exact line of code using the most resources?

 #### When to use profiling over tracing

 Traces cannot feasibly capture execution of extreme high frequency events e.g.
 every function call. Profiling tools fill this niche: by sampling, they can
 significantly cut down on how much information they store. The statistical
 nature of profilers are rarely a problem; the sampling algorithms for profilers
 are specifically designed to capture data which is highly representative of the
 real resource use.

 *Aside: a handful of very specialized tracing tools exist which can capture
 every function call (e.g.
 [magic-trace](https://github.com/janestreet/magic-trace)) but they output
 *gigabytes* of data every second which make them impractical for anything beyond
 investigating tiny snippets of code. They also generally have higher overhead
 than general purpose tracing tools.*

 #### When to use tracing over profiling

 While profilers give callstacks where resources are being used, they lack
 information about _why_ that happened. For example, why was malloc being called
 by function _foo()_ so many times? All they say is _foo()_ allocated X bytes
 over Y calls to `malloc`. Traces are excellent at providing this exact context:
 application instrumentation and low-level kernel events together provide deep
 insight into why code was run in the first place.

 NOTE: Perfetto supports collecting, analyzing and visualizing both profiles and
 traces at the same time so you can have the best of both worlds!

 ## Next Steps

 Now that you have a better understanding of tracing and profiling, you can use
 Perfetto to:

 - **Record a trace** of your application and system to understand its behavior.
 - **Analyze a trace** to identify performance bottlenecks.
 - **Visualize a trace** to see a timeline of events.

 To learn how to do this, head to our
 [How do I start using Perfetto?](/docs/getting-started/start-using-perfetto.md)
 page.
	# What is Tracing?

	NOTE: the word "tracing" in this document is used in the context of
	client-side side software (e.g. programs running on a single machine). In
	the server world, tracing is usually short for _distributed tracing_, a way
	to collect data from many different servers to understand the flow of a
	"request" throughout multiple services. As such, this document will _not_ be
	useful to you if you are interested in such traces.

	This page provides a bird's-eye view of performance analysis and tracing. The
	aim is to orient people who have no idea what "tracing" is.

	## Introduction to Performance

	Performance analysis is concerned with making software run _better_. The
	definition of _better_ varies widely and depends on the situation. Examples
	include:

	- performing the same work using fewer resources (CPU, memory, network, battery,
	etc.)
	- increasing utilization of available resources
	- identifying and eliminating unnecessary work altogether

	Much of the difficulty in improving performance comes from identifying the root
	cause of performance issues. Modern software systems are complicated, having a
	lot of components and a web of cross-interactions. Techniques which help
	engineers understand the execution of a system and pinpoint issues that are
	critical.

	Tracing and profiling are two such widely-used techniques for
	performance analysis.

	## Introduction to Tracing

	Tracing involves collecting highly detailed data about the execution of a
	system. A single continuous session of recording is called a trace file or
	trace for short.

	Traces contain enough detail to fully reconstruct the timeline of events. They
	often include low-level kernel events like scheduler context switches, thread
	wakeups, syscalls, etc. With the "right" trace, reproduction of a performance
	bug is not needed as the trace provides all necessary context.

	Application code is also instrumented in areas of the program which are
	considered to be _important_. This instrumentation keeps track of what the
	program was doing over time (e.g. which functions were being run, or how long
	each call took) and context about the execution (e.g. what were the parameters
	to a function call, or why was a function run).

	The level of detail in traces makes it impractical to read traces directly like
	a log file in all but the simplest cases. Instead, a combination of **trace
	analysis libraries and trace viewers** are used. Trace analysis libraries
	provide a way for users to extract and summarize trace events in a programmatic
	manner. Trace viewers visualize the events in a trace on a timeline which give
	users a graphical view of what their system was doing over time.

	### Logging vs tracing

	A good intuition is that logging is to functional testing what tracing is to
	performance analysis. Tracing is, in a sense, "structured" logging: instead of
	having arbitrary strings emitted from parts of the system, tracing reflects the
	detailed state of a system in a structured way to allow reconstruction of the
	timeline of events.

	Moreover, tracing frameworks (like Perfetto) place heavy emphasis on having
	minimal overhead. This is essential so that the framework does not significantly
	disrupt whatever is being measured: modern frameworks are fast enough that they
	can measure execution at the nanosecond level without significantly impacting
	the execution speed of the program.

	_Small aside: theoretically, tracing frameworks are powerful enough to act as a
	logging system as well. However, the utilization of each in practice is
	different enough that the two tend to be separate._

	### Metrics vs tracing

	Metrics are numerical values which track the performance of a system over time.
	Usually metrics map to high-level concepts. Examples of metrics include: CPU
	usage, memory usage, network bandwidth, etc. Metrics are collected directly from
	the app or operating system while the program is running.

	After glimpsing the power of tracing, a natural question arises: why bother with
	high level metrics at all? Why not instead just use tracing and compute metrics
	on resulting traces? In some settings, this may indeed be the right approach. In
	local and lab situations using trace-based metrics, where metrics are
	computed from traces instead of collecting them directly, is a powerful
	approach. If a metric regresses, it's easy to open the trace to root cause why
	that happened.

	However, trace-based metrics are not a universal solution. When running in
	production, the heavyweight nature of traces can make it impractical to collect
	them 24/7. Computing a metric with a trace can take megabytes of data vs bytes
	for direct metric collection.

	Using metrics is the right choice when you want to understand the performance of
	a system over time but do not want to or can not pay the cost of collecting
	traces. In these situations, traces should be used as a root-causing tool.
	When your metrics show there is a problem, targeted tracing can be rolled out to
	understand why the regression may have happened.

	## Introduction to Profiling

	Profiling involves sampling some usage of a resource by a program. A single
	continuous session of recording is known as a profile.

	Each sample collects the function callstack (i.e. the line of code along with
	all calling functions). Generally this information is aggregated across the
	profile. For each seen callstack, the aggregation gives the percentage of usage
	of the resource by that callstack. By far the most common types of profiling are
	memory profiling and CPU profiling.

	Memory profiling is used to understand which parts of a program are allocating
	memory on the heap. The profiler generally hooks into `malloc` (and `free`)
	calls of a native (C/C++/Rust/etc.) program to sample the callstacks calling
	`malloc`. Information about how many bytes were allocated is also retained. CPU
	profiling is used for understanding where the program is spending CPU time. The
	profiler captures the callstack running on a CPU over time. Generally this is
	done periodically (e.g. every 50ms), but can be also be done when certain events
	happen in the operating system.

	### Profiling vs tracing

	There are two main questions for comparing profiling and tracing:

	1. Why profile my program statistically when I can just trace _everything_?
	2. Why use tracing to reconstruct the timeline of events when profiling gives me
	the exact line of code using the most resources?

	#### When to use profiling over tracing

	Traces cannot feasibly capture execution of extreme high frequency events e.g.
	every function call. Profiling tools fill this niche: by sampling, they can
	significantly cut down on how much information they store. The statistical
	nature of profilers are rarely a problem; the sampling algorithms for profilers
	are specifically designed to capture data which is highly representative of the
	real resource use.

	*Aside: a handful of very specialized tracing tools exist which can capture
	every function call (e.g.
	[magic-trace](https://github.com/janestreet/magic-trace)) but they output
	gigabytes of data every second which make them impractical for anything beyond
	investigating tiny snippets of code. They also generally have higher overhead
	than general purpose tracing tools.*

	#### When to use tracing over profiling

	While profilers give callstacks where resources are being used, they lack
	information about _why_ that happened. For example, why was malloc being called
	by function _foo()_ so many times? All they say is _foo()_ allocated X bytes
	over Y calls to `malloc`. Traces are excellent at providing this exact context:
	application instrumentation and low-level kernel events together provide deep
	insight into why code was run in the first place.

	NOTE: Perfetto supports collecting, analyzing and visualizing both profiles and
	traces at the same time so you can have the best of both worlds!

	## Next Steps

	Now that you have a better understanding of tracing and profiling, you can use
	Perfetto to:

	- Record a trace of your application and system to understand its behavior.
	- Analyze a trace to identify performance bottlenecks.
	- Visualize a trace to see a timeline of events.

	To learn how to do this, head to our
	[How do I start using Perfetto?](/docs/getting-started/start-using-perfetto.md)
	page.