base: add a new FlatHashMap based on Swiss Table concepts (#4205)

Note: this CL does *not* change any production code to actually use
this. All it's doing is adding the implementation, benchmarks and tests.
A future CL will actually do the switch.

Our current FlatHashMap implementation has served us well.
It was originally designed just to replace std::unordered_map inside of
TraceProcessor's StringPool but, due to its nice perf characteristics,
it's come to be used throughout the Perfetto codebase.

But over time I've realised that it has some flaws which make it
unsuitable as a truly general-purpose hash table:
* very sensitive to tombstones: we've made some modifications to
  rehash if there are tombstones but in heavy insert/erase workloads
  it's not enough
* very sensitive to collisions: if there is a collision, the default is
  quadratic probing, which can cause some quite sub-optimal memory
  access patterns. And linear probing doesn't work for high-load-factor
  hash maps.
* suboptimal memory layout: putting the ctrl/keys/values in three
  separate arrays means that we jump around memory instead of staying
  in a single place, which matters especially for small maps.
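
To make the memory-layout point concrete, the two designs can be
sketched roughly like this (a simplified illustration with hypothetical
names, not the actual FlatHashMap code):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

struct Slot {
  uint64_t key;
  uint64_t value;
};

// V1-style layout: ctrl, keys and values live in three separate arrays,
// so a single lookup can touch three unrelated cache lines.
struct SeparateStorage {
  uint8_t* ctrl;
  uint64_t* keys;
  uint64_t* values;
};

// Swiss-Table-style layout: one allocation holding [ctrl...][slots...],
// so a control byte and the slot it describes stay close in memory.
// Assumes capacity is a multiple of alignof(Slot), which holds for the
// power-of-two capacities a real table would use.
struct CombinedStorage {
  explicit CombinedStorage(size_t cap)
      : capacity(cap),
        buffer(static_cast<uint8_t*>(
            std::calloc(cap + cap * sizeof(Slot), 1))) {}
  ~CombinedStorage() { std::free(buffer); }
  uint8_t& ctrl(size_t i) { return buffer[i]; }
  Slot& slot(size_t i) {
    return reinterpret_cast<Slot*>(buffer + capacity)[i];
  }
  size_t capacity;
  uint8_t* buffer;
};
```

With the combined layout, the probe that reads the control byte for
slot i has usually already pulled the slot's neighbourhood into cache.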

All in all, these are all things that absl::flat_hash_map and Swiss
tables have solved. There's no reason for us not to take advantage of at
least their memory layout and organisation ideas even if we cannot use
them directly. If they're good enough for much of the C++ world (and
definitely all of Google), they should be good enough for us too!

So this CL adds a new FlatHashMapV2 which is based on Swiss Table
concepts. Specifically:
* the idea of storing the control bytes, keys and values all in the
  same array
* the logic to probe using triangular numbers and do group-based
  probing
* using SIMD to access 8/16 elements at a time and do comparisons of
  hashes
* the idea of storing 7 bits of the hash in the control byte to do fast
  checks, reducing the probability of a false match to 1/128 before
  even checking keys
* using SWAR on non-SIMD platforms (including ARM64, though we might
  optimize this with NEON in the future depending on what we see)
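
The control-byte scheme in the last three bullets can be sketched
roughly like this (a sketch of the general Swiss Table technique with
illustrative names and constants, not the actual FlatHashMapV2 code):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr uint64_t kLsbs = 0x0101010101010101ULL;
constexpr uint64_t kMsbs = 0x8080808080808080ULL;

// Split the hash: H1 picks the starting group, H2 is the 7-bit tag
// stored in the control byte. Bytes with the top bit set are reserved
// for special states (empty/deleted), so tags never collide with them.
inline size_t H1(uint64_t hash) { return static_cast<size_t>(hash >> 7); }
inline uint8_t H2(uint64_t hash) { return static_cast<uint8_t>(hash & 0x7F); }

// SWAR match over a group of 8 control bytes packed into a uint64_t:
// returns a mask with the high bit set in every byte lane equal to
// `tag`. The zero-byte trick can report false positives in lanes above
// a true match; that's harmless because keys are compared afterwards.
inline uint64_t MatchGroup(uint64_t ctrl_group, uint8_t tag) {
  uint64_t x = ctrl_group ^ (kLsbs * tag);  // matching lanes become 0x00
  return (x - kLsbs) & ~x & kMsbs;          // classic "has zero byte" trick
}

// Probe groups at triangular-number offsets (0, 1, 3, 6, 10, ...). With
// a power-of-two number of groups this visits every group exactly once.
inline size_t NextGroup(size_t group, size_t* step, size_t group_mask) {
  *step += 1;
  return (group + *step) & group_mask;
}
```

On SIMD platforms the same match is done with a byte-broadcast and a
compare over 16 control bytes at once; the SWAR version above is the
portable fallback.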

The benchmarks are quite good for general cases.

For TraceStrings:

```
BM_HashMap_InsertTraceStrings_AppendOnly                        554254588 ns    553651249 ns            1 409.798M/s   226.885M   1.17868M
BM_HashMap_InsertTraceStrings<Ours_PreHashed>                   698876584 ns    698092350 ns            1 325.008M/s   226.885M   1.17868M
BM_HashMap_InsertTraceStrings<OursV2_PreHashed>                 662225759 ns    661307563 ns            1 343.086M/s   226.885M   1.17868M
BM_HashMap_InsertTraceStrings<StdUnorderedMap_PreHashed>       2376921965 ns   2374903076 ns            1 95.5346M/s   226.885M   1.17868M
BM_HashMap_InsertTraceStrings<AbslFlatHashMap_PreHashed>        631832647 ns    631347455 ns            1 359.367M/s   226.885M   1.17868M
```

This is the only case where the old implementation wins (which makes
sense, because it was specifically designed for it!). But I don't think
a loss here justifies giving up the major wins elsewhere, especially
because in the "normal" case cache misses would drown out the small
differences anyway, given how widespread interning is in practice. It
also only wins with AppendOnly, which is rarely used.

For TraceTids:

```
BM_HashMap_TraceTids<Ours_Tid>                                  741954852 ns    741389092 ns            1 183.063M/s    16.099k
BM_HashMap_TraceTids<OursV2_Tid>                                504387549 ns    503475605 ns            1 269.568M/s    16.099k
BM_HashMap_TraceTids<StdUnorderedMap_Tid>                      1401707453 ns   1400083835 ns            1 96.9378M/s    16.099k
BM_HashMap_TraceTids<AbslFlatHashMap_Tid>                       510576964 ns    510105976 ns            1 266.064M/s    16.099k
```

We're significantly better than V1 here: tids are exactly the sort of
"general behaviour" which is very common in Perfetto and where V1
behaves less well than I'd like.

For LookupRandInts:

```
BM_HashMap_LookupRandInts<Ours_Default>                         574181027 ns    573859405 ns            1 17.4259M/s   10.7378P
BM_HashMap_LookupRandInts<OursV2_Default>                       350956969 ns    350711465 ns            2 28.5135M/s   10.7378P
BM_HashMap_LookupRandInts<StdUnorderedMap_Default>              571851786 ns    571499561 ns            1 17.4978M/s   10.7378P
BM_HashMap_LookupRandInts<AbslFlatHashMap_Default>              366032708 ns    365819315 ns            2 27.3359M/s   10.7378P
BM_HashMap_LookupRandInts<StdUnorderedMap_Murmur>               611111313 ns    610638342 ns            1 16.3763M/s   10.7378P
BM_HashMap_LookupRandInts<AbslFlatHashMap_Murmur>               363292801 ns    362867428 ns            2 27.5583M/s   10.7378P
BM_HashMap_LookupRandInts<Ours_PreHashed>                       713060262 ns    712048102 ns            1  14.044M/s   10.7378P
BM_HashMap_LookupRandInts<OursV2_PreHashed>                     329546786 ns    329117625 ns            2 30.3843M/s   10.7378P
BM_HashMap_LookupRandInts<StdUnorderedMap_PreHashed>            580294840 ns    579547604 ns            1 17.2548M/s   10.7378P
BM_HashMap_LookupRandInts<AbslFlatHashMap_PreHashed>            340158501 ns    339918114 ns            2 29.4188M/s   10.7378P
```

Another handy win, which appears regardless of whether the hash function
is good or not. Notice how pre-hashing badly degrades V1 but actually
makes V2 *faster*.

For InsertCollidingInts:

```
BM_HashMap_InsertCollidingInts<Ours_Default>                    701989492 ns    701578722 ns            1 14.2536M/s
BM_HashMap_InsertCollidingInts<OursV2_Default>                  427725299 ns    427302016 ns            2 23.4027M/s
BM_HashMap_InsertCollidingInts<StdUnorderedMap_Default>         896047311 ns    895384570 ns            1 11.1684M/s
BM_HashMap_InsertCollidingInts<AbslFlatHashMap_Default>         420245635 ns    420006692 ns            2 23.8091M/s
```

Another big win for V2, and another example similar to the sort we'd see
in practice in Perfetto.

For InsertDupeInts:

```
BM_HashMap_InsertDupeInts<Ours_Default>                          63926278 ns     63874785 ns           11 156.556M/s
BM_HashMap_InsertDupeInts<OursV2_Default>                        40291231 ns     40268380 ns           17 248.334M/s
BM_HashMap_InsertDupeInts<StdUnorderedMap_Default>               37554643 ns     37527602 ns           19 266.471M/s
BM_HashMap_InsertDupeInts<AbslFlatHashMap_Default>               29026714 ns     29008928 ns           24 344.721M/s
BM_HashMap_InsertDupeInts<AbslFlatHashMap_Murmur>                38883999 ns     38857011 ns           18 257.354M/s
```

Same as above, another common case. Surprisingly, std does very well
here. We also seem to be bottlenecked on the hash function, given how
much better absl does with its own hash vs murmur (its hash is slightly
lower quality but also faster). This is the only benchmark where that
happens though; the others don't show this behaviour.

For *String:

```
BM_HashMap_HeterogeneousLookup_String<Ours_String>              575812136 ns    575383946 ns            1 17.3797M/s
BM_HashMap_HeterogeneousLookup_String<OursV2_String>            420531571 ns    420120838 ns            2 23.8027M/s
BM_HashMap_RegularLookup_String<Ours_String>                    820167931 ns    819494830 ns            1 12.2026M/s
BM_HashMap_RegularLookup_String<OursV2_String>                  405975726 ns    405482422 ns            2  24.662M/s
BM_HashMap_RegularLookup_String<StdUnorderedMap_String>        1577833963 ns   1576585547 ns            1 6.34282M/s
BM_HashMap_RegularLookup_String<AbslFlatHashMap_String>         491132758 ns    490802459 ns            2 20.3748M/s
BM_HashMap_RegularLookup_String<AbslFlatHashMap_String_Murmur>  571520036 ns    571134272 ns            1  17.509M/s
```

All clear wins for us. We're actually doing much better than even absl
here!

InsertVaryingSizes:

```
BM_HashMap_InsertVaryingSize<Ours_Default>/100                        754 ns          754 ns       932167 132.677M/s
BM_HashMap_InsertVaryingSize<Ours_Default>/10000                   200216 ns       200096 ns         3531 49.9759M/s
BM_HashMap_InsertVaryingSize<Ours_Default>/1000000               54744741 ns     54674135 ns           13 18.2902M/s
BM_HashMap_InsertVaryingSize<OursV2_Default>/100                     1381 ns         1380 ns       505094 72.4606M/s
BM_HashMap_InsertVaryingSize<OursV2_Default>/10000                 164971 ns       164851 ns         4250 60.6608M/s
BM_HashMap_InsertVaryingSize<OursV2_Default>/1000000             38351821 ns     38310617 ns           18 26.1024M/s
BM_HashMap_InsertVaryingSize<AbslFlatHashMap_Default>/100            1268 ns         1267 ns       551576 78.9371M/s
BM_HashMap_InsertVaryingSize<AbslFlatHashMap_Default>/10000        137933 ns       137845 ns         5065 72.5454M/s
BM_HashMap_InsertVaryingSize<AbslFlatHashMap_Default>/1000000    39980643 ns     39943874 ns           18 25.0351M/s
```

Note how V1 does better for very small maps but drops off as maps get
bigger. This is probably because, for small maps, everything fits in
cache even with the separate key/value/ctrl arrays, and SIMD is not as
useful at such small sizes.

LookupWithMisses:

```
BM_HashMap_LookupWithMisses<Ours_Default>/0                     359571712 ns    359090314 ns            2 27.8481M/s
BM_HashMap_LookupWithMisses<Ours_Default>/50                    426399559 ns    425828348 ns            2 23.4836M/s
BM_HashMap_LookupWithMisses<Ours_Default>/100                   308777908 ns    308628319 ns            2 32.4014M/s
BM_HashMap_LookupWithMisses<OursV2_Default>/0                   195633544 ns    195481786 ns            4 51.1557M/s
BM_HashMap_LookupWithMisses<OursV2_Default>/50                  221380263 ns    221180357 ns            3  45.212M/s
BM_HashMap_LookupWithMisses<OursV2_Default>/100                 139350634 ns    139236152 ns            5 71.8204M/s
BM_HashMap_LookupWithMisses<AbslFlatHashMap_Default>/0          201567396 ns    201412429 ns            4 49.6494M/s
BM_HashMap_LookupWithMisses<AbslFlatHashMap_Default>/50         206378283 ns    206259753 ns            3 48.4826M/s
BM_HashMap_LookupWithMisses<AbslFlatHashMap_Default>/100        123735273 ns    123601106 ns            6 80.9054M/s
```

Notice how much better we do when the miss rate is very low or very
high. This suggests the branch predictor is doing a much better job on
our access patterns. But even at 50% misses we are far superior.

InsertSequential:

```
BM_HashMap_InsertSequentialInts<Ours_Default>                   781732694 ns    781402467 ns            1 12.7975M/s
BM_HashMap_InsertSequentialInts<OursV2_Default>                 423644733 ns    423098276 ns            2 23.6352M/s
BM_HashMap_InsertSequentialInts<StdUnorderedMap_Default>        473308044 ns    472810618 ns            2 21.1501M/s
BM_HashMap_InsertSequentialInts<AbslFlatHashMap_Default>        420184095 ns    419919524 ns            2 23.8141M/s
```

A huge win for what is quite a common case in trace processor. Even std
does better than V1 here, which is quite surprising!

LookupSequential:

```
BM_HashMap_LookupSequentialInts<Ours_Default>                   586390209 ns    585999772 ns            1 17.0649M/s
BM_HashMap_LookupSequentialInts<OursV2_Default>                 344488692 ns    344312148 ns            2 29.0434M/s
BM_HashMap_LookupSequentialInts<StdUnorderedMap_Default>         49338334 ns     49299886 ns           15  202.84M/s
BM_HashMap_LookupSequentialInts<AbslFlatHashMap_Default>        230591313 ns    230421841 ns            3 43.3987M/s
```

Std destroys everyone here, which I find highly surprising and don't
have a good explanation for. Maybe it's because std effectively has a
perfect hash function (since by default std::hash just returns the
int?). Regardless, we're doing better than V1.

For the erase benchmarks:

```
BM_HashMap_EraseRandInts<Ours_Erasable>                     231067139 ns    230741900 ns            3 43.3385M/s
BM_HashMap_EraseRandInts<OursV2_Erasable>                   234739797 ns    234546351 ns            3 42.6355M/s
BM_HashMap_EraseRandInts<StdUnorderedMap_Default>          2731290342 ns   2728010967 ns            1 3.66567M/s
BM_HashMap_EraseRandInts<AbslFlatHashMap_Default>           346642034 ns    346395066 ns            2 28.8688M/s

BM_HashMap_InsertEraseInterleaved<Ours_Erasable>            700174237 ns    698939463 ns            1 14.3074M/s
BM_HashMap_InsertEraseInterleaved<OursV2_Erasable>          427958820 ns    427531859 ns            2 23.3901M/s
BM_HashMap_InsertEraseInterleaved<StdUnorderedMap_Default> 3887662277 ns   3884060811 ns            1 2.57462M/s
BM_HashMap_InsertEraseInterleaved<AbslFlatHashMap_Default>  376082048 ns    375696540 ns            2 26.6172M/s
BM_HashMap_InsertEraseInterleaved<AbslFlatHashMap_Murmur>   379993339 ns    379412632 ns            2 26.3565M/s

BM_HashMap_EraseTombstoneStress<Ours_Erasable>               23412930 ns     23398168 ns           30         3M
BM_HashMap_EraseTombstoneStress<OursV2_Erasable>             14905630 ns     14893152 ns           47       4.7M
BM_HashMap_EraseTombstoneStress<StdUnorderedMap_Default>     53218188 ns     53175602 ns           13       1.3M
BM_HashMap_EraseTombstoneStress<AbslFlatHashMap_Default>     18289471 ns     18265641 ns           38       3.8M
```

While erase is not the most important operation for most usecases (we
primarily care about Find and Insert), it's important to see that it's
not *terrible* either. And in fact it's actually very good in general:
in most cases a step above V1 and very competitive with (or even better
than) absl.