tp: speed up UpdateSetBits by 10-20x (worst cases) with PDEP intrinsic

As a follow-up to aosp/2117218, speed up UpdateSetBits by another 10x
in the worst cases by using the PDEP intrinsic, which is available on
all modern x64 machines.

Also add the necessary checks and compiler flags for this feature. Note
that we're not excluding any more CPUs than we already did, because
every non-niche chipset that has AVX2 also has BMI/BMI2.
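
For readers unfamiliar with PDEP, here is a minimal single-word sketch of
the idea, not the code from this change. It assumes UpdateSetBits keeps
the i-th set bit of a word only if bit i of a "picker" word is set, which
is what PDEP computes in one instruction. The names PdepPortable, Pdep and
UpdateSetBitsWord are hypothetical. The intrinsic is used only when the
compiler defines __BMI2__ (e.g. built with -mbmi2); otherwise a portable
loop stands in for it.

```cpp
#include <cstdint>

#if defined(__BMI2__) && defined(__x86_64__)
#include <immintrin.h>
#endif

// Portable reference for PDEP: deposits the low bits of |src| into the
// positions of the set bits of |mask|, from least to most significant.
inline uint64_t PdepPortable(uint64_t src, uint64_t mask) {
  uint64_t result = 0;
  for (uint64_t src_bit = 1; mask != 0; src_bit <<= 1) {
    uint64_t lowest = mask & (~mask + 1);  // Isolate lowest set bit of mask.
    if (src & src_bit)
      result |= lowest;
    mask &= mask - 1;  // Clear the lowest set bit of mask.
  }
  return result;
}

// Uses the hardware instruction when the build enables BMI2, and the
// portable loop otherwise.
inline uint64_t Pdep(uint64_t src, uint64_t mask) {
#if defined(__BMI2__) && defined(__x86_64__)
  return _pdep_u64(src, mask);
#else
  return PdepPortable(src, mask);
#endif
}

// Hypothetical one-word version of UpdateSetBits: the i-th set bit of
// |word| survives only if bit i of |picker| is set.
inline uint64_t UpdateSetBitsWord(uint64_t word, uint64_t picker) {
  return Pdep(picker, word);
}
```

For example, with word = 0xB4 (bits 2, 4, 5 and 7 set) and picker = 0x5
(keep the 0th and 2nd set bits), UpdateSetBitsWord returns 0x24 (bits 2
and 5). A real bit vector would also need to consume the picker across
word boundaries, which this sketch leaves out.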

Before:
--------------------------------------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations  s/set bit s/set picker bit
--------------------------------------------------------------------------------------------------------------
BM_BitVectorUpdateSetBits/1234567/1/1       112970 ns       112971 ns         6195  9.20561ns        889.538ns
BM_BitVectorUpdateSetBits/1234567/5/1       118373 ns       118369 ns         5885  1.92129ns        183.803ns
BM_BitVectorUpdateSetBits/1234567/50/1      413089 ns       413089 ns         1704  668.353ps         66.735ns
BM_BitVectorUpdateSetBits/1234567/95/1      675030 ns       675000 ns         1036  575.491ps        57.2714ns
BM_BitVectorUpdateSetBits/1234567/99/1      423003 ns       422973 ns         1632  346.077ps        34.2266ns
BM_BitVectorUpdateSetBits/1234567/1/5       127430 ns       127421 ns         5520  10.3831ns        199.408ns
BM_BitVectorUpdateSetBits/1234567/5/5       201662 ns       201662 ns         3469  3.27326ns        64.7807ns
BM_BitVectorUpdateSetBits/1234567/50/5      743395 ns       743375 ns          953  1.20273ns        24.0388ns
BM_BitVectorUpdateSetBits/1234567/95/5     1177486 ns      1177326 ns          600  1003.76ps        20.0734ns
BM_BitVectorUpdateSetBits/1234567/99/5      680020 ns       679980 ns         1025  556.361ps        11.0686ns
BM_BitVectorUpdateSetBits/1234567/1/50      133754 ns       133753 ns         5286  10.8991ns        21.5766ns
BM_BitVectorUpdateSetBits/1234567/5/50      237373 ns       237366 ns         2956  3.85278ns        7.68797ns
BM_BitVectorUpdateSetBits/1234567/50/50     774907 ns       774916 ns          891  1.25377ns         2.5046ns
BM_BitVectorUpdateSetBits/1234567/95/50    1207646 ns      1207482 ns          575  1029.47ps        2.05698ns
BM_BitVectorUpdateSetBits/1234567/99/50     700648 ns       700650 ns          963  573.273ps        1.14605ns
BM_BitVectorUpdateSetBits/1234567/1/95      133795 ns       133788 ns         5249  10.9019ns         11.429ns
BM_BitVectorUpdateSetBits/1234567/5/95      239426 ns       239423 ns         2938  3.88618ns        4.08342ns
BM_BitVectorUpdateSetBits/1234567/50/95     775347 ns       775286 ns          904  1.25436ns        1.31996ns
BM_BitVectorUpdateSetBits/1234567/95/95    1213563 ns      1213393 ns          579  1034.51ps        1088.74ps
BM_BitVectorUpdateSetBits/1234567/99/95     712348 ns       712263 ns         1008  582.774ps        613.345ps
BM_BitVectorUpdateSetBits/1234567/1/99      137894 ns       137890 ns         4989  11.2361ns        11.3331ns
BM_BitVectorUpdateSetBits/1234567/5/99      242047 ns       242050 ns         2852   3.9288ns        3.96543ns
BM_BitVectorUpdateSetBits/1234567/50/99     774587 ns       774587 ns          890  1.25323ns        1.26574ns
BM_BitVectorUpdateSetBits/1234567/95/99    1206971 ns      1206867 ns          582  1028.95ps        1039.37ps
BM_BitVectorUpdateSetBits/1234567/99/99     701870 ns       701809 ns          998  574.221ps        579.968ps

After:
--------------------------------------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations  s/set bit s/set picker bit
--------------------------------------------------------------------------------------------------------------
BM_BitVectorUpdateSetBits/1234567/1/1        97454 ns        97451 ns         7081  7.94096ns        767.334ns
BM_BitVectorUpdateSetBits/1234567/5/1        63700 ns        63701 ns        11110  1033.95ps         98.914ns
BM_BitVectorUpdateSetBits/1234567/50/1       67374 ns        67373 ns        10719  109.005ps        10.8842ns
BM_BitVectorUpdateSetBits/1234567/95/1       67917 ns        67909 ns        10710  57.8976ps        5.76182ns
BM_BitVectorUpdateSetBits/1234567/99/1       58296 ns        58296 ns        12354  47.6982ps        4.71731ns
BM_BitVectorUpdateSetBits/1234567/1/5       100398 ns       100392 ns         6783   8.1806ns        157.109ns
BM_BitVectorUpdateSetBits/1234567/5/5        62765 ns        62765 ns        10998  1018.77ps        20.1623ns
BM_BitVectorUpdateSetBits/1234567/50/5       67128 ns        67128 ns        10478  108.609ps        2.17074ns
BM_BitVectorUpdateSetBits/1234567/95/5       67899 ns        67899 ns         9989  57.8895ps        1.15768ns
BM_BitVectorUpdateSetBits/1234567/99/5       59342 ns        59343 ns        12129  48.5543ps        965.976ps
BM_BitVectorUpdateSetBits/1234567/1/50       97342 ns        97338 ns         7215  7.93171ns        15.7022ns
BM_BitVectorUpdateSetBits/1234567/5/50       63323 ns        63323 ns        11089  1027.82ps        2.05095ns
BM_BitVectorUpdateSetBits/1234567/50/50      65986 ns        65980 ns        10728  106.751ps        213.252ps
BM_BitVectorUpdateSetBits/1234567/95/50      66994 ns        66993 ns         9976  57.1164ps        114.124ps
BM_BitVectorUpdateSetBits/1234567/99/50      56002 ns        56003 ns        11666  45.8217ps        91.6041ps
BM_BitVectorUpdateSetBits/1234567/1/95       96038 ns        96035 ns         7297  7.82555ns        8.20393ns
BM_BitVectorUpdateSetBits/1234567/5/95       61998 ns        61995 ns        11285  1006.26ps        1057.33ps
BM_BitVectorUpdateSetBits/1234567/50/95      64021 ns        64022 ns        10812  103.584ps            109ps
BM_BitVectorUpdateSetBits/1234567/95/95      65208 ns        65204 ns        10544  55.5912ps        58.5054ps
BM_BitVectorUpdateSetBits/1234567/99/95      55856 ns        55854 ns        12737     45.7ps        48.0973ps
BM_BitVectorUpdateSetBits/1234567/1/99       95244 ns        95242 ns         7421  7.76094ns        7.82792ns
BM_BitVectorUpdateSetBits/1234567/5/99       61757 ns        61755 ns        11265  1002.37ps        1011.72ps
BM_BitVectorUpdateSetBits/1234567/50/99      65502 ns        65503 ns        10960   105.98ps        107.037ps
BM_BitVectorUpdateSetBits/1234567/95/99      65376 ns        65371 ns        10015  55.7339ps        56.2985ps
BM_BitVectorUpdateSetBits/1234567/99/99      56801 ns        56792 ns        12370   46.467ps        46.9321ps

Bug: 235104800
Change-Id: I22babf71b6ebc4898be3f5f5cde51f74cbebf65c
3 files changed
tree: ab9e3a03015bcd500e03b934e4efc19d73b9990e
README.md

Perfetto - System profiling, app tracing and trace analysis

Perfetto is a production-grade open-source stack for performance instrumentation and trace analysis. It offers services and libraries for recording system-level and app-level traces, native and Java heap profiling, a library for analyzing traces using SQL, and a web-based UI to visualize and explore multi-GB traces.

See https://perfetto.dev/docs or the /docs/ directory for documentation.