commit | 993282030a3b872f281e5077e976372441df4442 | |
---|---|---|
author | Zim <zezeozue@google.com> | Wed Mar 27 21:36:01 2024 +0000 |
committer | Zim <zezeozue@google.com> | Thu Mar 28 19:37:26 2024 +0000 |
tree | e725189571d1f908e0ebd5f129a59e27fe625ff1 | |
parent | 888ba06a4435e982005a992037435b614468fc0d |
[stdlib]: Rewrite the critical path implementation

Re-implemented using the newly added weight-bounded (wb) DFS, aosp/3011157.
The key idea is that if we treat every thread_executing_span as a root_node
and run a wb-DFS (for performance) bounded by the prev node of the root, the
children of the root are the tasks in the critical path of the root node.

The new algorithm traverses the graph in the same way as the old one. There
is a possible improvement to handle kernel stalls better when they occur as
part of critical path traversal: ideally we should follow the stall and then
return to the previous node before we began the kernel stall traversal.
Currently, we follow the stall and keep traversing parents. The current idea
I have for this improvement would complicate the API because it requires
creating tables, and it makes it impossible to have one e2e critical path
macro.

The new algorithm makes heavy use of macros:

1. _flatten_critical_path_tasks: Flattens the critical path tasks of every
   root so that there are no overlaps; the start of the next task becomes
   the end of the current one.
2. _flatten_critical_paths: Flattens critical paths so that they don't
   overlap with themselves.
3. _critical_path: Generates critical paths from 3 arguments: wakeup_graph,
   roots and sleep_spans. The critical path result is ready to be
   visualized in the UI without overlaps.

To improve performance, additional macros are added to limit the search
further:

4. _intervals_to_roots: Helper macro that converts a table of
   <ts, dur, utid> spans into a list of <id> root nodes.
5. _critical_path_by_roots: Bounds the critical path result to the nodes
   provided as roots.
6. _critical_path_by_intervals: Bounds the critical path result to the
   roots overlapping the time spans provided as <ts, dur, utid>. Note that
   this does a manual span join and is inefficient for a large number of
   spans. In the future, with interval_intersect, this should be performant.

For common use cases, the following methods exist:

1. _thread_executing_span_critical_path: Returns the critical path for one
   <ts, dur, utid>. This is currently used in the UI and, for field traces,
   takes ~100ms for an entire thread track. This makes running the critical
   path on large 1G traces tractable. With the old approach of
   pre-generating the entire critical path, such traces can take >10mins.
   Now the init takes ~3mins and subsequent thread tracks load in a few
   seconds.
2. _thread_executing_span_critical_path_all: Returns the critical path of
   the entire trace. This is useful when batch-processing a lot of data,
   e.g. all binder txns, where it is more efficient to generate all the
   critical paths once and span_join. This is also used in the more
   heavyweight critical path generation for virtual stacks (slices), which
   has now been moved to a different module, thread_executing_span_with_slices.
   The UI button for 'Critical path lite' will not load the _with_slices
   version, but the 'Critical path' button will, since that uses slice
   information, which is all initialized at the start and can be expensive
   in memory and compute.

TODO:
1. Implement an algo in C++ that handles nesting of kernel stalls in
   userspace stalls correctly.
2. Improve the performance of merging the critical_path with thread_state
   and slice data. This will probably be better with the ongoing
   interval_intersect feature.
3. Expose high-level helpers in the relevant modules, e.g. binder, monitor
   contention and app startup, to generate detailed critical path info and
   summaries.
4. Improve the critical path UX in the UI so that contextual critical paths
   (per-slice) are easy to generate, including summaries of sub-systems
   like binder transactions.

Test: tools/diff_test_trace_processor.py out/android/trace_processor_shell --name-filter '.*thread_executing_span.*'
Change-Id: I8708e03ddd61b0e022cb1fbfb3d9f3a26396c89f
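The root-bounded traversal and the flattening pass described in the commit message can be sketched in Python. This is a minimal illustration under simplifying assumptions, not the actual PerfettoSQL implementation: here each span has at most one waker edge, so the weight-bounded DFS degenerates to a bounded parent walk, and the bound is taken to be the start timestamp of the root's prev span. All names (`Node`, `critical_path`, `flatten`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class Node:
    id: int
    ts: int                         # span start timestamp
    dur: int                        # span duration
    utid: int                       # thread the span ran on
    waker_id: Optional[int] = None  # span that woke this one up
    prev_id: Optional[int] = None   # previous executing span on the same thread

def critical_path(root: Node, graph: Dict[int, Node]) -> List[Node]:
    """Walk waker edges from `root`, bounded by the start of the root's
    prev span: ancestors that started before that point cannot contribute
    to the root's critical path, so the search is pruned there."""
    bound = graph[root.prev_id].ts if root.prev_id is not None else 0
    path, node = [root], root
    while node.waker_id is not None:
        waker = graph[node.waker_id]
        if waker.ts < bound:  # bound reached: stop traversing parents
            break
        path.append(waker)
        node = waker
    return path

def flatten(tasks: List[Node]) -> List[Tuple[int, int, int]]:
    """De-overlap: sort by start time and clip each task so it ends where
    the next one begins (the start of the next becomes the end of the
    current), yielding (ts, dur, utid) rows ready to visualize."""
    tasks = sorted(tasks, key=lambda n: n.ts)
    rows = [(cur.ts, nxt.ts - cur.ts, cur.utid)
            for cur, nxt in zip(tasks, tasks[1:])]
    last = tasks[-1]
    return rows + [(last.ts, last.dur, last.utid)]

# Toy wakeup graph: span 2 wakes span 3 (the root); span 1 wakes span 2
# but started before the root's prev span (span 0), so it is pruned.
nodes = [
    Node(id=0, ts=2, dur=3, utid=3),                          # root's prev span
    Node(id=1, ts=0, dur=10, utid=1),                         # outside the bound
    Node(id=2, ts=5, dur=10, utid=2, waker_id=1),
    Node(id=3, ts=12, dur=8, utid=3, waker_id=2, prev_id=0),  # root
]
graph = {n.id: n for n in nodes}
path = critical_path(graph[3], graph)
print([n.id for n in path])  # [3, 2] — span 1 is cut off by the bound
print(flatten(path))         # [(5, 7, 2), (12, 8, 3)]
```

The pruning is what makes the per-root query cheap: instead of materializing the whole critical path for the trace up front, each root only explores ancestors newer than its own prev span.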
Perfetto is a production-grade open-source stack for performance instrumentation and trace analysis. It offers services and libraries for recording system-level and app-level traces, native + Java heap profiling, a library for analyzing traces using SQL, and a web-based UI to visualize and explore multi-GB traces.
See https://perfetto.dev/docs or the /docs/ directory for documentation.