commit | 25a45340b442bdf2231130fbfb3d5d3e44013d3f | |
---|---|---|
author | Lalit Maganti <lalitm@google.com> | Tue Jun 14 14:56:48 2022 +0100 |
committer | Lalit Maganti <lalitm@google.com> | Tue Jun 14 14:56:48 2022 +0100 |
tree | ab9e3a03015bcd500e03b934e4efc19d73b9990e | |
parent | f533ef185cb9157bf0046a50b7b23bea3382e5e0 | |
tp: speed up UpdateSetBits by 10-20x (worst cases) with PDEP intrinsic

As a follow up to aosp/2117218, further speed up UpdateSetBits by another 10x in the worst cases by using PDEP intrinsic which is available on all modern x64 machines.

Also add the necessary checks and compiler flags for this feature: note that we're not excluding any more CPUs than we already did because every non-niche chipset which had AVX2 also had BMI/BMI2.

Before:

Benchmark | Time | CPU | Iterations | s/set bit | s/set picker bit |
---|---|---|---|---|---|
BM_BitVectorUpdateSetBits/1234567/1/1 | 112970 ns | 112971 ns | 6195 | 9.20561ns | 889.538ns |
BM_BitVectorUpdateSetBits/1234567/5/1 | 118373 ns | 118369 ns | 5885 | 1.92129ns | 183.803ns |
BM_BitVectorUpdateSetBits/1234567/50/1 | 413089 ns | 413089 ns | 1704 | 668.353ps | 66.735ns |
BM_BitVectorUpdateSetBits/1234567/95/1 | 675030 ns | 675000 ns | 1036 | 575.491ps | 57.2714ns |
BM_BitVectorUpdateSetBits/1234567/99/1 | 423003 ns | 422973 ns | 1632 | 346.077ps | 34.2266ns |
BM_BitVectorUpdateSetBits/1234567/1/5 | 127430 ns | 127421 ns | 5520 | 10.3831ns | 199.408ns |
BM_BitVectorUpdateSetBits/1234567/5/5 | 201662 ns | 201662 ns | 3469 | 3.27326ns | 64.7807ns |
BM_BitVectorUpdateSetBits/1234567/50/5 | 743395 ns | 743375 ns | 953 | 1.20273ns | 24.0388ns |
BM_BitVectorUpdateSetBits/1234567/95/5 | 1177486 ns | 1177326 ns | 600 | 1003.76ps | 20.0734ns |
BM_BitVectorUpdateSetBits/1234567/99/5 | 680020 ns | 679980 ns | 1025 | 556.361ps | 11.0686ns |
BM_BitVectorUpdateSetBits/1234567/1/50 | 133754 ns | 133753 ns | 5286 | 10.8991ns | 21.5766ns |
BM_BitVectorUpdateSetBits/1234567/5/50 | 237373 ns | 237366 ns | 2956 | 3.85278ns | 7.68797ns |
BM_BitVectorUpdateSetBits/1234567/50/50 | 774907 ns | 774916 ns | 891 | 1.25377ns | 2.5046ns |
BM_BitVectorUpdateSetBits/1234567/95/50 | 1207646 ns | 1207482 ns | 575 | 1029.47ps | 2.05698ns |
BM_BitVectorUpdateSetBits/1234567/99/50 | 700648 ns | 700650 ns | 963 | 573.273ps | 1.14605ns |
BM_BitVectorUpdateSetBits/1234567/1/95 | 133795 ns | 133788 ns | 5249 | 10.9019ns | 11.429ns |
BM_BitVectorUpdateSetBits/1234567/5/95 | 239426 ns | 239423 ns | 2938 | 3.88618ns | 4.08342ns |
BM_BitVectorUpdateSetBits/1234567/50/95 | 775347 ns | 775286 ns | 904 | 1.25436ns | 1.31996ns |
BM_BitVectorUpdateSetBits/1234567/95/95 | 1213563 ns | 1213393 ns | 579 | 1034.51ps | 1088.74ps |
BM_BitVectorUpdateSetBits/1234567/99/95 | 712348 ns | 712263 ns | 1008 | 582.774ps | 613.345ps |
BM_BitVectorUpdateSetBits/1234567/1/99 | 137894 ns | 137890 ns | 4989 | 11.2361ns | 11.3331ns |
BM_BitVectorUpdateSetBits/1234567/5/99 | 242047 ns | 242050 ns | 2852 | 3.9288ns | 3.96543ns |
BM_BitVectorUpdateSetBits/1234567/50/99 | 774587 ns | 774587 ns | 890 | 1.25323ns | 1.26574ns |
BM_BitVectorUpdateSetBits/1234567/95/99 | 1206971 ns | 1206867 ns | 582 | 1028.95ps | 1039.37ps |
BM_BitVectorUpdateSetBits/1234567/99/99 | 701870 ns | 701809 ns | 998 | 574.221ps | 579.968ps |

After:

Benchmark | Time | CPU | Iterations | s/set bit | s/set picker bit |
---|---|---|---|---|---|
BM_BitVectorUpdateSetBits/1234567/1/1 | 97454 ns | 97451 ns | 7081 | 7.94096ns | 767.334ns |
BM_BitVectorUpdateSetBits/1234567/5/1 | 63700 ns | 63701 ns | 11110 | 1033.95ps | 98.914ns |
BM_BitVectorUpdateSetBits/1234567/50/1 | 67374 ns | 67373 ns | 10719 | 109.005ps | 10.8842ns |
BM_BitVectorUpdateSetBits/1234567/95/1 | 67917 ns | 67909 ns | 10710 | 57.8976ps | 5.76182ns |
BM_BitVectorUpdateSetBits/1234567/99/1 | 58296 ns | 58296 ns | 12354 | 47.6982ps | 4.71731ns |
BM_BitVectorUpdateSetBits/1234567/1/5 | 100398 ns | 100392 ns | 6783 | 8.1806ns | 157.109ns |
BM_BitVectorUpdateSetBits/1234567/5/5 | 62765 ns | 62765 ns | 10998 | 1018.77ps | 20.1623ns |
BM_BitVectorUpdateSetBits/1234567/50/5 | 67128 ns | 67128 ns | 10478 | 108.609ps | 2.17074ns |
BM_BitVectorUpdateSetBits/1234567/95/5 | 67899 ns | 67899 ns | 9989 | 57.8895ps | 1.15768ns |
BM_BitVectorUpdateSetBits/1234567/99/5 | 59342 ns | 59343 ns | 12129 | 48.5543ps | 965.976ps |
BM_BitVectorUpdateSetBits/1234567/1/50 | 97342 ns | 97338 ns | 7215 | 7.93171ns | 15.7022ns |
BM_BitVectorUpdateSetBits/1234567/5/50 | 63323 ns | 63323 ns | 11089 | 1027.82ps | 2.05095ns |
BM_BitVectorUpdateSetBits/1234567/50/50 | 65986 ns | 65980 ns | 10728 | 106.751ps | 213.252ps |
BM_BitVectorUpdateSetBits/1234567/95/50 | 66994 ns | 66993 ns | 9976 | 57.1164ps | 114.124ps |
BM_BitVectorUpdateSetBits/1234567/99/50 | 56002 ns | 56003 ns | 11666 | 45.8217ps | 91.6041ps |
BM_BitVectorUpdateSetBits/1234567/1/95 | 96038 ns | 96035 ns | 7297 | 7.82555ns | 8.20393ns |
BM_BitVectorUpdateSetBits/1234567/5/95 | 61998 ns | 61995 ns | 11285 | 1006.26ps | 1057.33ps |
BM_BitVectorUpdateSetBits/1234567/50/95 | 64021 ns | 64022 ns | 10812 | 103.584ps | 109ps |
BM_BitVectorUpdateSetBits/1234567/95/95 | 65208 ns | 65204 ns | 10544 | 55.5912ps | 58.5054ps |
BM_BitVectorUpdateSetBits/1234567/99/95 | 55856 ns | 55854 ns | 12737 | 45.7ps | 48.0973ps |
BM_BitVectorUpdateSetBits/1234567/1/99 | 95244 ns | 95242 ns | 7421 | 7.76094ns | 7.82792ns |
BM_BitVectorUpdateSetBits/1234567/5/99 | 61757 ns | 61755 ns | 11265 | 1002.37ps | 1011.72ps |
BM_BitVectorUpdateSetBits/1234567/50/99 | 65502 ns | 65503 ns | 10960 | 105.98ps | 107.037ps |
BM_BitVectorUpdateSetBits/1234567/95/99 | 65376 ns | 65371 ns | 10015 | 55.7339ps | 56.2985ps |
BM_BitVectorUpdateSetBits/1234567/99/99 | 56801 ns | 56792 ns | 12370 | 46.467ps | 46.9321ps |

Bug: 235104800
Change-Id: I22babf71b6ebc4898be3f5f5cde51f74cbebf65c
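For readers unfamiliar with PDEP (parallel bit deposit, part of BMI2), the following is a minimal sketch of how it could be applied to an UpdateSetBits-style operation. It is not the code from this commit. It assumes the semantics that the i-th set bit of the target is kept only if bit i of a "picker" bit stream is set, and every name in it (`PdepPortable`, `Pdep`, `ReadUnaligned64`, `UpdateSetBitsSketch`) is hypothetical. The `_pdep_u64` intrinsic is used only when the compiler defines `__BMI2__` (for example with `-mbmi2`); otherwise a bit-by-bit fallback with the same result is used.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__BMI2__) && defined(__x86_64__)
#include <immintrin.h>  // _pdep_u64
#endif

// Software PDEP: deposits the low bits of |src|, in order, into the
// positions of the set bits of |mask|. All other result bits are zero.
inline uint64_t PdepPortable(uint64_t src, uint64_t mask) {
  uint64_t result = 0;
  for (uint64_t bit = 1; mask != 0; bit <<= 1) {
    const uint64_t lowest = mask & (~mask + 1);  // isolate lowest set bit
    if (src & bit)
      result |= lowest;
    mask &= mask - 1;  // clear lowest set bit
  }
  return result;
}

// Uses the hardware instruction when available, else the fallback above.
inline uint64_t Pdep(uint64_t src, uint64_t mask) {
#if defined(__BMI2__) && defined(__x86_64__)
  return _pdep_u64(src, mask);
#else
  return PdepPortable(src, mask);
#endif
}

// Reads 64 bits starting at |bit_offset|; bits past the end read as zero.
inline uint64_t ReadUnaligned64(const std::vector<uint64_t>& bits,
                                size_t bit_offset) {
  const size_t word = bit_offset / 64;
  const unsigned shift = static_cast<unsigned>(bit_offset % 64);
  if (word >= bits.size())
    return 0;
  uint64_t value = bits[word] >> shift;
  if (shift != 0 && word + 1 < bits.size())
    value |= bits[word + 1] << (64 - shift);
  return value;
}

// Assumed semantics (hypothetical sketch): the i-th set bit of |words|
// stays set only if bit i of |picker| is set. One PDEP handles a whole
// 64-bit word instead of one loop iteration per set bit.
inline void UpdateSetBitsSketch(std::vector<uint64_t>& words,
                                const std::vector<uint64_t>& picker) {
  size_t consumed = 0;  // number of picker bits used so far
  for (uint64_t& word : words) {
    const uint64_t src = ReadUnaligned64(picker, consumed);
    consumed += std::bitset<64>(word).count();
    word = Pdep(src, word);  // PDEP ignores src bits beyond popcount(word)
  }
}
```

As a worked case, `Pdep(0x5, 0x1A)` yields `0x12`: the mask has set bits at positions 1, 3 and 4, and the source bits 1, 0, 1 are deposited into those positions in order. Because the per-word cost no longer depends on how many bits are set, this is consistent with the flatter timings across densities in the "After" numbers, though the actual implementation may differ in detail.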
Perfetto is a production-grade open-source stack for performance instrumentation and trace analysis. It offers services and libraries for recording system-level and app-level traces, native and Java heap profiling, a library for analyzing traces using SQL, and a web-based UI to visualize and explore multi-GB traces.
See https://perfetto.dev/docs or the /docs/ directory for documentation.