[stdlib]: Rewrite the critical path implementation
Re-implemented using the newly added weight bounded (wb) DFS,
aosp/3011157. The key idea here is that if we treat every
thread_executing_span as a root_node and run a wb-DFS
(for performance) bounded by the prev node of the root,
the children of the root are the tasks in the critical path
of the root node.
The new algorithm traverses the graph in the same way as the old
there's a possible improvement to handle kernel stalls better
when they occur as part of critical path traversal.
We should ideally follow the stall and return to the previous node
before we began the kernel stall traversal. Currently, we follow the
stall and keep traversing parents. The current idea I have for
this improvement would complicate the API because it requires creating
tables and makes it impossible to have one e2e critical path macro.
The new algorithm makes heavy use of macros:
1. _flatten_critical_path_tasks: Flattens the critical path tasks of
every root so that there are no overlaps; the start of the next
becomes the end of the current.
2. _flatten_critical_paths: Flattens critical paths so that they don't
overlap with themselves.
3. _critical_path: This generates critical paths from 3 arguments:
wakeup_graph, roots and sleep_spans. The critical path result is
ready to be visualized in the UI without overlaps.
To improve performance, additional macros are added to limit the
search further.
4. _intervals_to_roots: Helper macro that converts a table of
<ts, dur, utid> spans into a list of <id> root nodes.
5. _critical_path_by_roots: Bounds the critical path result to
the nodes provided as roots.
6. _critical_path_by_intervals: Bounds the critical path result to
the roots overlapping the time spans provided as <ts, dur, utid>.
Note that this does a manual span join and is inefficient for
large number of spans. In the future with interval_intersect,
this would be performant.
For common use cases, the following methods exist:
1. _thread_executing_span_critical_path: Returns the critical path
for one <ts, dur, utid>. This is currently used in the UI and for
field traces takes ~100ms for an entire thread track. This makes
running the critical path on large 1G traces tractable. With the old
approach of pre-generating the entire critical path, such traces can
take >10mins. Now, the init takes ~3mins and subsequent thread tracks
load in few seconds.
2. _thread_executing_span_critical_path_all: Returns the critical path
of the entire trace. This is useful if batch processing a lot of data,
e.g all binder txns. Would be more efficient to generate all the
critical paths and span_join. This is also used in the more heavy weight
critical path generation for virtual stacks (slices) which has now been
moved to a different module thread_executing_span_with_slices. The UI
button for 'Critical path lite' will not load the _with_slices version,
but the 'Critical path' button will since that uses slice information
which is all initialized at the start and can be expensive in memory and
compute.
TODO:
1. Implement an algo in C++ that handles nesting of kernel
stalls in userspace stalls correctly.
2. Improve the performance of merging the critical_path with
thread_state and slice data. This will probably be better with the
ongoing interval_intersect feature.
3. Expose high level helpers in the relevant modules e.g binder and
monitor contention and app startup to generate detailed critical path
info and summaries
4. Improve the critical path UX in the UI so that contextual critical
paths (per-slice) are easy to generate including summaries of
sub-systems like binder transactions.
Test: tools/diff_test_trace_processor.py out/android/trace_processor_shell --name-filter '.*thread_executing_span.*'
Change-Id: I8708e03ddd61b0e022cb1fbfb3d9f3a26396c89f
diff --git a/Android.bp b/Android.bp
index 78939cf..724d479 100644
--- a/Android.bp
+++ b/Android.bp
@@ -12384,6 +12384,7 @@
"src/trace_processor/perfetto_sql/stdlib/sched/runnable.sql",
"src/trace_processor/perfetto_sql/stdlib/sched/states.sql",
"src/trace_processor/perfetto_sql/stdlib/sched/thread_executing_span.sql",
+ "src/trace_processor/perfetto_sql/stdlib/sched/thread_executing_span_with_slice.sql",
"src/trace_processor/perfetto_sql/stdlib/sched/thread_level_parallelism.sql",
"src/trace_processor/perfetto_sql/stdlib/sched/thread_state_flattened.sql",
"src/trace_processor/perfetto_sql/stdlib/sched/time_in_state.sql",