| # Writing efficient shaders |
| |
| When it comes to optimizing shaders for a wide range of devices, there is no |
| perfect strategy. The reality of different drivers written by different vendors |
targeting different hardware is that they will vary in behavior. Any attempt at
optimizing against a specific driver will likely result in a performance loss
on some other drivers that end users run Flutter apps with.
| |
| That being said, newer graphics devices have architectures that allow for both |
| simpler shader compilation and better handling of traditionally slow shader |
| code. In fact, ostensibly "unoptimized" shader code filled with branches may |
| significantly outperform the equivalent branchless optimized shader code when |
| targeting newer GPU architectures. (See the "Don't flatten simple varying |
| branches" recommendation for an explanation of this with respect to different |
| architectures). |
| |
| Flutter actively supports mobile devices that are more than a decade old, which |
| requires us to write shaders that perform well across multiple generations of |
| GPU architectures featuring radically different behavior. Most optimization |
| choices are direct tradeoffs between these GPU architectures, and so having an |
| accurate mental model for how these common architectures maximize parallelism is |
| essential for making good decisions while authoring shaders. |
| |
| For these reasons, it's also important to profile shaders against some of the |
| older devices that Flutter can target (such as the iPhone 6s) when making |
| changes intended to improve shader performance. |
| |
Also, even though branching behavior is largely architecture dependent and
should remain the same across different graphics APIs, it's still a good idea
to test changes against the different backends supported by Impeller (Metal and
GLES). Early stage shader compilation (as well as the high level shader code
generated by ImpellerC) may vary quite a bit between APIs.
| |
| ## GPU architecture primer |
| |
| GPUs are designed to have functional units running single instructions over many |
| elements (the "data path") each clock cycle. This is the fundamental aspect of |
| GPUs that makes them work well for massively parallel compute work; they're |
| essentially specialized SIMD engines. |
| |
| GPU parallelism generally comes in two broad architectural flavors: |
| **Instruction-level parallelism** and **Thread-level parallelism** -- these |
| architecture designs handle shader branching very differently and are covered |
| in the sections below. In general, older GPU architectures (on some products |
| released before ~2015) leverage instruction-level parallelism, while most if not |
| all newer GPUs leverage thread-level parallelism. |
| |
| Some of the earliest GPU architectures had no runtime control flow primitives at |
| all (i.e. jump instructions), and compilers for these architectures needed to |
| handle branches ahead of time by unrolling loops, compiling a different program |
| for every possible branch combination, and then executing all of them. However, |
| virtually all GPU architectures in use today have instruction-level support for |
| dynamic branching, and it's quite unlikely that we'll come across a mobile |
| device capable of running Flutter that doesn't. For example, the old devices we |
| test against in CI (iPhone 6s and Moto G4) run GPUs that support dynamic |
| runtime branching. For these reasons, the optimization advice in this document |
| isn't aimed at branchless architectures. |
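As a sketch of what ahead-of-time branch elimination looks like, a compiler (or
shader author) can replace a branch with a branchless select built from `step`
and `mix` (the names `a`, `b`, and `t` here are illustrative):

```glsl
// Branched form:
//   if (t > 0.5) { color = a; } else { color = b; }
//
// Flattened, branchless equivalent: step() yields 0.0 or 1.0, and
// mix() selects between the two values. Note that with this form,
// both `a` and `b` must be fully computed either way.
vec3 SelectColor(vec3 a, vec3 b, float t) {
  return mix(b, a, step(0.5, t));
}
```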
| |
| ### Instruction-level parallelism |
| |
| Some older GPUs (including the PowerVR GT7600 GPU on the iPhone 6s SoC) rely on |
| SIMD vector or array instructions to maximize the number of computations |
| performed per clock cycle on each functional unit. This means that the shader |
| compiler must figure out which parts of the program are safe to parallelize |
| ahead of time and emit appropriate instructions. This presents a problem for |
| certain kinds of branches: If the compiler doesn't know that the same decision |
| will always be taken by all of the data lanes at runtime (meaning the branch is |
| _varying_), it can't safely emit SIMD instructions while compiling the branch. |
The result is that instructions within non-uniform branches can't be
parallelized and so execute at `1/[data width]` of the throughput of
non-branched instructions.
| |
VLIW ("Very Long Instruction Word") is another common instruction-level
parallelism design that suffers from the same compile-time reasoning
disadvantage that SIMD does.
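To make the compile-time limitation concrete, here's a hypothetical fragment
shader sketch (the `ShadeLeftHalf`/`ShadeRightHalf` helpers are invented for
illustration) where the branch condition depends on a varying, so a SIMD/VLIW
compiler can't prove that all data lanes will agree at runtime:

```glsl
in vec2 uv; // varying: interpolated per fragment
out vec4 frag_color;

vec3 ShadeLeftHalf(vec2 uv);  // hypothetical helper
vec3 ShadeRightHalf(vec2 uv); // hypothetical helper

void main() {
  // `uv.x` may differ across the elements packed into one functional
  // unit, so the compiler must emit scalar (unparallelized)
  // instructions for both branch bodies.
  if (uv.x < 0.5) {
    frag_color = vec4(ShadeLeftHalf(uv), 1);
  } else {
    frag_color = vec4(ShadeRightHalf(uv), 1);
  }
}
```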
| |
| ### Thread-level parallelism |
| |
| Newer GPUs (but also some older hardware such as the Adreno 306 GPU found on the |
| Moto G4's Snapdragon SoC) use scalar functional units (no SIMD/VLIW/MIMD) and |
| parallelize instructions at runtime by running the same instruction over many |
| threads in groups often referred to as "warps" (Nvidia terminology) or |
| "wavefronts" (AMD terminology), usually consisting of 32 or 64 threads per |
| warp/wavefront. This design is also commonly referred to as SIMT ("Single |
| Instruction Multiple Thread"). |
| |
| To handle branching, SIMT programs use special instructions to write a thread |
| mask that determines which threads are activated/deactivated in the warp; only |
| the warp's activated threads will actually execute instructions. Given this |
| setup, the program can first deactivate threads that failed the branch |
| condition, run the positive path, invert the mask, run the negative path, and |
| finally restore the mask to its original state prior to the branch. The compiler |
| may also insert mask checks to skip over branches when all of the threads have |
| been deactivated. |
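The masking sequence described above can be sketched as follows (the
instruction names are illustrative pseudocode, not a real ISA):

```glsl
// Source branch:
//   if (cond) { A(); } else { B(); }
//
// Conceptual per-warp execution:
//   saved_mask = exec_mask;
//   exec_mask &= cond_per_thread;    // deactivate threads where cond is false
//   if (any(exec_mask)) A();         // run A() on the active threads
//   exec_mask = saved_mask & ~cond_per_thread; // invert relative to the branch
//   if (any(exec_mask)) B();         // run B() on the remaining threads
//   exec_mask = saved_mask;          // restore the pre-branch mask
```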
| |
| Therefore, the best case scenario for a SIMT branch is that it only incurs the |
| cost of the conditional. The worst case scenario is that some of the warp's |
| threads fail the conditional and the rest succeed, requiring the program to |
execute both paths of the branch back-to-back in the warp. Note that even this
worst case compares very favorably with the SIMD scenario for
non-uniform/varying branches, as SIMT retains significant parallelism in all
cases, whereas SIMD cannot.
| |
| ## Recommendations |
| |
| ### Don't flatten uniform or constant branches |
| |
Uniforms are pipeline variables accessible within a shader that are guaranteed
not to vary over the course of a GPU program's invocation.
| |
| Example of a uniform branch in action: |
| |
| ```glsl |
| uniform struct FrameInfo { |
| mat4 mvp; |
| bool invert_y; |
| } frame_info; |
| |
| in vec2 position; |
| |
| void main() { |
  gl_Position = frame_info.mvp * vec4(position, 0, 1);
| |
| if (frame_info.invert_y) { |
| gl_Position *= vec4(1, -1, 1, 1); |
| } |
| } |
| ``` |
| |
While it's true that driver stacks have the opportunity to generate multiple
pipeline variants ahead of time to handle these branches, this advanced
functionality isn't actually necessary for achieving good runtime performance
of uniform branches on widely used mobile architectures:
| * On SIMT architectures, branching on a uniform means that every thread in every |
| warp will resolve to the same path, so only one path in the branch will ever |
| execute. |
| * On VLIW/SIMD architectures, the compiler can be certain that all of the |
| elements in the data path for every functional unit will resolve to the same |
| path, and so it can safely emit fully parallelized instructions for the |
| contents of the branch! |
| |
| ### Don't flatten simple varying branches |
| |
| Widely used mobile GPU architectures generally don't benefit from flattening |
| simple varying branches. While it's true that compilers for VLIW/SIMD-based |
| architectures can't emit efficient instructions for these branches, the |
detrimental effects of this are minimal for small branches. On modern SIMT
architectures, flattened branches can actually perform measurably worse than
straightforward branch solutions. Also, some shader compilers can collapse
small branches automatically.
| |
| Instead of this: |
| |
| ```glsl |
| vec3 ColorBurn(vec3 dst, vec3 src) { |
| vec3 color = 1 - min(vec3(1), (1 - dst) / src); |
| color = mix(color, vec3(1), 1 - abs(sign(dst - 1))); |
| color = mix(color, vec3(0), 1 - abs(sign(src - 0))); |
| return color; |
| } |
| ``` |
| |
| ...just do this: |
| |
| ```glsl |
| vec3 ColorBurn(vec3 dst, vec3 src) { |
| vec3 color = 1 - min(vec3(1), (1 - dst) / src); |
| if (1 - dst.r < kEhCloseEnough) { |
| color.r = 1; |
| } |
| if (1 - dst.g < kEhCloseEnough) { |
| color.g = 1; |
| } |
| if (1 - dst.b < kEhCloseEnough) { |
| color.b = 1; |
| } |
| if (src.r < kEhCloseEnough) { |
| color.r = 0; |
| } |
| if (src.g < kEhCloseEnough) { |
| color.g = 0; |
| } |
| if (src.b < kEhCloseEnough) { |
| color.b = 0; |
| } |
| return color; |
| } |
| ``` |
| |
| It's easier to understand, doesn't prevent compiler optimizations, runs |
| measurably faster on SIMT devices, and works out to be at most marginally slower |
| on older VLIW devices. |
| |
| ### Avoid complex varying branches |
| |
| Consider the following fragment shader: |
| |
| ```glsl |
| in vec4 color; |
| out vec4 frag_color; |
| |
| void main() { |
| vec4 result; |
| |
| if (color.a == 0) { |
| result = vec4(0); |
| } else { |
| result = DoExtremelyExpensiveThing(color); |
| } |
| |
| frag_color = result; |
| } |
| ``` |
| |
| Note that `color` is _varying_. Specifically, it's an interpolated output from a |
| vertex shader -- so the value may change from fragment to fragment (as opposed |
| to a _uniform_ or _constant_, which will remain the same for the whole draw |
| call). |
| |
| On SIMT architectures, this branch incurs very little overhead because |
| `DoExtremelyExpensiveThing` will be skipped over if `color.a == 0` across all |
| the threads in a given warp. |
| However, architectures that use instruction-level parallelism (VLIW or SIMD) |
| can't handle this branch efficiently because the compiler can't safely emit |
| parallelized instructions on either side of the branch. |
| |
| To achieve maximum parallelism across all of these architectures, one possible |
| solution is to unbranch the more complex path: |
| |
| ```glsl |
| in vec4 color; |
| out vec4 frag_color; |
| |
| void main() { |
| frag_color = DoExtremelyExpensiveThing(color); |
| |
| if (color.a == 0) { |
| frag_color = vec4(0); |
| } |
| } |
| ``` |
| |
| However, this may be a big tradeoff depending on how this shader is used -- this |
| solution will perform worse on SIMT devices in cases where `color.a == 0` across |
| all threads in a given warp, since `DoExtremelyExpensiveThing` will no longer be |
| skipped with this solution! So if the cheap branch path covers a large solid |
| portion of a draw call's coverage area, alternative designs may be favorable. |
| |
| ### Beware of return branching |
| |
| Consider the following glsl function: |
| ```glsl |
| vec4 FrobnicateColor(vec4 color) { |
| if (color.a == 0) { |
| return vec4(0); |
| } |
| |
| return DoExtremelyExpensiveThing(color); |
| } |
| ``` |
| |
| At first glance, this may appear cheap due to its simple contents, but this |
| branch has two exclusive paths in practice, and the generated shader assembly |
| will reflect the same behavior as this code: |
| |
| ```glsl |
| vec4 FrobnicateColor(vec4 color) { |
| vec4 result; |
| |
| if (color.a == 0) { |
| result = vec4(0); |
| } else { |
| result = DoExtremelyExpensiveThing(color); |
| } |
| |
| return result; |
| } |
| ``` |
| |
| The same concerns and advice apply to this branch as the scenario under "Avoid |
| complex varying branches". |
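If profiling shows the branch matters, the same "unbranch the expensive path"
transformation sketched in that section applies here as well (with the same
SIMT tradeoff: the expensive path is no longer skipped when every thread in a
warp takes the cheap path):

```glsl
vec4 FrobnicateColor(vec4 color) {
  // Compute the expensive path unconditionally so VLIW/SIMD compilers
  // can parallelize it, then overwrite the result for the cheap case.
  vec4 result = DoExtremelyExpensiveThing(color);
  if (color.a == 0) {
    result = vec4(0);
  }
  return result;
}
```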
| |
| ### Use lower precision whenever possible |
| |
Most desktop GPUs don't support 16-bit (mediump) or 8-bit (lowp) floating point
operations. But many mobile GPUs (such as the Qualcomm Adreno series) do, and
| according to the |
| [Adreno documentation](https://developer.qualcomm.com/sites/default/files/docs/adreno-gpu/developer-guide/gpu/best_practices_shaders.html#use-medium-precision-where-possible), |
| using lower precision floating point operations is more efficient on these |
| devices. |
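For example, precision qualifiers can be set as a shader-wide default or
applied per declaration. This is an illustrative sketch (the variable names are
invented, and whether lower precision actually helps depends on the target
GPU):

```glsl
// GLES fragment shaders have no default float precision, so declare one.
precision mediump float;

uniform lowp vec4 tint;   // color data rarely needs more than lowp
uniform sampler2D tex;

in mediump vec2 uv;
out mediump vec4 frag_color;

void main() {
  frag_color = texture(tex, uv) * tint;
}
```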