| # Writing efficient shaders |
| |
| When it comes to optimizing shaders for a wide range of devices, there is no |
| perfect strategy. The reality of different drivers written by different vendors |
targeting different hardware is that they will vary in behavior. Any attempt at
optimizing against a specific driver will likely result in a performance loss
on some other drivers that end users run Flutter apps with.
| |
| That being said, newer graphics devices have architectures that allow for both |
| simpler shader compilation and better handling of traditionally slow shader |
| code. In fact, ostensibly "unoptimized" shader code filled with branches may |
| significantly outperform the equivalent branchless optimized shader code when |
| targeting newer GPU architectures. (See the "Don't flatten simple varying |
| branches" recommendation for an explanation of this with respect to different |
| architectures). |
| |
| Flutter actively supports mobile devices that are more than a decade old, which |
| requires us to write shaders that perform well across multiple generations of |
| GPU architectures featuring radically different behavior. Most optimization |
| choices are direct tradeoffs between these GPU architectures, and so having an |
| accurate mental model for how these common architectures maximize parallelism is |
| essential for making good decisions while authoring shaders. |
| |
| For these reasons, it's also important to profile shaders against some of the |
| older devices that Flutter can target (such as the iPhone 6s) when making |
| changes intended to improve shader performance. |
| |
Also, even though branching behavior is largely architecture dependent and
should remain the same across different graphics APIs, it's still a good idea
to test changes against the different backends supported by Impeller (Metal and
GLES). Early stage shader compilation (as well as the high level shader code
generated by ImpellerC) may vary quite a bit between APIs.
| |
| ## GPU architecture primer |
| |
| GPUs are designed to have functional units running single instructions over many |
| elements (the "data path") each clock cycle. This is the fundamental aspect of |
| GPUs that makes them work well for massively parallel compute work; they're |
| essentially specialized SIMD engines. |
| |
| GPU parallelism generally comes in two broad architectural flavors: |
| **Instruction-level parallelism** and **Thread-level parallelism** -- these |
| architecture designs handle shader branching very differently and are covered |
| in the sections below. In general, older GPU architectures (on some products |
| released before ~2015) leverage instruction-level parallelism, while most if not |
| all newer GPUs leverage thread-level parallelism. |
| |
| Some of the earliest GPU architectures had no runtime control flow primitives at |
| all (i.e. jump instructions), and compilers for these architectures needed to |
| handle branches ahead of time by unrolling loops, compiling a different program |
| for every possible branch combination, and then executing all of them. However, |
| virtually all GPU architectures in use today have instruction-level support for |
| dynamic branching, and it's quite unlikely that we'll come across a mobile |
| device capable of running Flutter that doesn't. For example, the old devices we |
| test against in CI (iPhone 6s and Moto G4) run GPUs that support dynamic |
| runtime branching. For these reasons, the optimization advice in this document |
| isn't aimed at branchless architectures. |
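As a sketch of what ahead-of-time branch elimination looks like, a compiler (or
shader author) can replace a branch with a branchless select built from `step`
and `mix` (the names `a`, `b`, and `t` here are illustrative):

```glsl
// Branched form:
//   if (t > 0.5) { color = a; } else { color = b; }
//
// Flattened, branchless equivalent: step() yields 0.0 or 1.0, and
// mix() selects between the two values. Note that with this form,
// both `a` and `b` must be fully computed either way.
vec3 SelectColor(vec3 a, vec3 b, float t) {
  return mix(b, a, step(0.5, t));
}
```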
| |
| ### Instruction-level parallelism |
| |
| Some older GPUs (including the PowerVR GT7600 GPU on the iPhone 6s SoC) rely on |
| SIMD vector or array instructions to maximize the number of computations |
| performed per clock cycle on each functional unit. This means that the shader |
| compiler must figure out which parts of the program are safe to parallelize |
| ahead of time and emit appropriate instructions. This presents a problem for |
| certain kinds of branches: If the compiler doesn't know that the same decision |
| will always be taken by all of the data lanes at runtime (meaning the branch is |
| _varying_), it can't safely emit SIMD instructions while compiling the branch. |
The result is that instructions within non-uniform branches can't be
parallelized and so execute at `1/[data width]` of the throughput of
non-branched instructions.
| |
VLIW ("Very Long Instruction Word") is another common instruction-level
parallelism design that suffers from the same compile-time reasoning
disadvantage that SIMD does.
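To make the compile-time limitation concrete, here's a hypothetical fragment
shader sketch (the `ShadeLeftHalf`/`ShadeRightHalf` helpers are invented for
illustration) where the branch condition depends on a varying, so a SIMD/VLIW
compiler can't prove that all data lanes will agree at runtime:

```glsl
in vec2 uv; // varying: interpolated per fragment
out vec4 frag_color;

vec3 ShadeLeftHalf(vec2 uv);  // hypothetical helper
vec3 ShadeRightHalf(vec2 uv); // hypothetical helper

void main() {
  // `uv.x` may differ across the elements packed into one functional
  // unit, so the compiler must emit scalar (unparallelized)
  // instructions for both branch bodies.
  if (uv.x < 0.5) {
    frag_color = vec4(ShadeLeftHalf(uv), 1);
  } else {
    frag_color = vec4(ShadeRightHalf(uv), 1);
  }
}
```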
| |
| ### Thread-level parallelism |
| |
| Newer GPUs (but also some older hardware such as the Adreno 306 GPU found on the |
| Moto G4's Snapdragon SoC) use scalar functional units (no SIMD/VLIW/MIMD) and |
| parallelize instructions at runtime by running the same instruction over many |
| threads in groups often referred to as "warps" (Nvidia terminology) or |
| "wavefronts" (AMD terminology), usually consisting of 32 or 64 threads per |
| warp/wavefront. This design is also commonly referred to as SIMT ("Single |
| Instruction Multiple Thread"). |
| |
| To handle branching, SIMT programs use special instructions to write a thread |
| mask that determines which threads are activated/deactivated in the warp; only |
| the warp's activated threads will actually execute instructions. Given this |
| setup, the program can first deactivate threads that failed the branch |
| condition, run the positive path, invert the mask, run the negative path, and |
| finally restore the mask to its original state prior to the branch. The compiler |
| may also insert mask checks to skip over branches when all of the threads have |
| been deactivated. |
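The masking sequence described above can be sketched as follows (the
instruction names are illustrative pseudocode, not a real ISA):

```glsl
// Source branch:
//   if (cond) { A(); } else { B(); }
//
// Conceptual per-warp execution:
//   saved_mask = exec_mask;
//   exec_mask &= cond_per_thread;    // deactivate threads where cond is false
//   if (any(exec_mask)) A();         // run A() on the active threads
//   exec_mask = saved_mask & ~cond_per_thread; // invert relative to the branch
//   if (any(exec_mask)) B();         // run B() on the remaining threads
//   exec_mask = saved_mask;          // restore the pre-branch mask
```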
| |
| Therefore, the best case scenario for a SIMT branch is that it only incurs the |
| cost of the conditional. The worst case scenario is that some of the warp's |
| threads fail the conditional and the rest succeed, requiring the program to |
execute both paths of the branch back-to-back in the warp. Note that even this
worst case compares very favorably with the SIMD scenario for
non-uniform/varying branches, as SIMT retains significant parallelism in all
cases, whereas SIMD cannot.
| |
| ## Recommendations |
| |
| ### Don't flatten uniform or constant branches |
| |
Uniforms are pipeline variables accessible within a shader that are guaranteed
not to vary over the course of a GPU program's invocation.
| |
| Example of a uniform branch in action: |
| |
| ```glsl |
| uniform struct FrameInfo { |
| mat4 mvp; |
| bool invert_y; |
| } frame_info; |
| |
| in vec2 position; |
| |
| void main() { |
  gl_Position = frame_info.mvp * vec4(position, 0, 1);
| |
| if (frame_info.invert_y) { |
| gl_Position *= vec4(1, -1, 1, 1); |
| } |
| } |
| ``` |
| |
While it's true that driver stacks have the opportunity to generate multiple
pipeline variants ahead of time to handle these branches, this advanced
functionality isn't actually necessary for achieving good runtime performance
of uniform branches on widely used mobile architectures:
| * On SIMT architectures, branching on a uniform means that every thread in every |
| warp will resolve to the same path, so only one path in the branch will ever |
| execute. |
| * On VLIW/SIMD architectures, the compiler can be certain that all of the |
| elements in the data path for every functional unit will resolve to the same |
| path, and so it can safely emit fully parallelized instructions for the |
| contents of the branch! |
| |
| ### Don't flatten simple varying branches |
| |
| Widely used mobile GPU architectures generally don't benefit from flattening |
| simple varying branches. While it's true that compilers for VLIW/SIMD-based |
| architectures can't emit efficient instructions for these branches, the |
detrimental effects of this are minimal for small branches. On modern SIMT
architectures, flattened branches can actually perform measurably worse than
straightforward branch solutions. Also, some shader compilers can collapse
small branches automatically.
| |
| Instead of this: |
| |
| ```glsl |
| vec3 ColorBurn(vec3 dst, vec3 src) { |
| vec3 color = 1 - min(vec3(1), (1 - dst) / src); |
| color = mix(color, vec3(1), 1 - abs(sign(dst - 1))); |
| color = mix(color, vec3(0), 1 - abs(sign(src - 0))); |
| return color; |
| } |
| ``` |
| |
| ...just do this: |
| |
| ```glsl |
| vec3 ColorBurn(vec3 dst, vec3 src) { |
| vec3 color = 1 - min(vec3(1), (1 - dst) / src); |
| if (1 - dst.r < kEhCloseEnough) { |
| color.r = 1; |
| } |
| if (1 - dst.g < kEhCloseEnough) { |
| color.g = 1; |
| } |
| if (1 - dst.b < kEhCloseEnough) { |
| color.b = 1; |
| } |
| if (src.r < kEhCloseEnough) { |
| color.r = 0; |
| } |
| if (src.g < kEhCloseEnough) { |
| color.g = 0; |
| } |
| if (src.b < kEhCloseEnough) { |
| color.b = 0; |
| } |
| return color; |
| } |
| ``` |
| |
| It's easier to understand, doesn't prevent compiler optimizations, runs |
| measurably faster on SIMT devices, and works out to be at most marginally slower |
| on older VLIW devices. |
| |
| ### Avoid complex varying branches |
| |
| Consider the following fragment shader: |
| |
| ```glsl |
| in vec4 color; |
| out vec4 frag_color; |
| |
| void main() { |
| vec4 result; |
| |
| if (color.a == 0) { |
| result = vec4(0); |
| } else { |
| result = DoExtremelyExpensiveThing(color); |
| } |
| |
| frag_color = result; |
| } |
| ``` |
| |
| Note that `color` is _varying_. Specifically, it's an interpolated output from a |
| vertex shader -- so the value may change from fragment to fragment (as opposed |
| to a _uniform_ or _constant_, which will remain the same for the whole draw |
| call). |
| |
| On SIMT architectures, this branch incurs very little overhead because |
| `DoExtremelyExpensiveThing` will be skipped over if `color.a == 0` across all |
| the threads in a given warp. |
| However, architectures that use instruction-level parallelism (VLIW or SIMD) |
| can't handle this branch efficiently because the compiler can't safely emit |
| parallelized instructions on either side of the branch. |
| |
| To achieve maximum parallelism across all of these architectures, one possible |
| solution is to unbranch the more complex path: |
| |
| ```glsl |
| in vec4 color; |
| out vec4 frag_color; |
| |
| void main() { |
| frag_color = DoExtremelyExpensiveThing(color); |
| |
| if (color.a == 0) { |
| frag_color = vec4(0); |
| } |
| } |
| ``` |
| |
| However, this may be a big tradeoff depending on how this shader is used -- this |
| solution will perform worse on SIMT devices in cases where `color.a == 0` across |
| all threads in a given warp, since `DoExtremelyExpensiveThing` will no longer be |
| skipped with this solution! So if the cheap branch path covers a large solid |
| portion of a draw call's coverage area, alternative designs may be favorable. |
| |
| ### Beware of return branching |
| |
| Consider the following glsl function: |
| ```glsl |
| vec4 FrobnicateColor(vec4 color) { |
| if (color.a == 0) { |
| return vec4(0); |
| } |
| |
| return DoExtremelyExpensiveThing(color); |
| } |
| ``` |
| |
| At first glance, this may appear cheap due to its simple contents, but this |
| branch has two exclusive paths in practice, and the generated shader assembly |
| will reflect the same behavior as this code: |
| |
| ```glsl |
| vec4 FrobnicateColor(vec4 color) { |
| vec4 result; |
| |
| if (color.a == 0) { |
| result = vec4(0); |
| } else { |
| result = DoExtremelyExpensiveThing(color); |
| } |
| |
| return result; |
| } |
| ``` |
| |
| The same concerns and advice apply to this branch as the scenario under "Avoid |
| complex varying branches". |
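If profiling shows the branch matters, the same "unbranch the expensive path"
transformation sketched in that section applies here as well (with the same
SIMT tradeoff: the expensive path is no longer skipped when every thread in a
warp takes the cheap path):

```glsl
vec4 FrobnicateColor(vec4 color) {
  // Compute the expensive path unconditionally so VLIW/SIMD compilers
  // can parallelize it, then overwrite the result for the cheap case.
  vec4 result = DoExtremelyExpensiveThing(color);
  if (color.a == 0) {
    result = vec4(0);
  }
  return result;
}
```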
| |
| ### Use lower precision whenever possible |
| |
Most desktop GPUs don't support 16-bit (mediump) or 8-bit (lowp) floating point
operations. But many mobile GPUs (such as the Qualcomm Adreno series) do, and
| according to the |
| [Adreno documentation](https://developer.qualcomm.com/sites/default/files/docs/adreno-gpu/developer-guide/gpu/best_practices_shaders.html#use-medium-precision-where-possible), |
| using lower precision floating point operations is more efficient on these |
| devices. |
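For example, precision qualifiers can be set as a shader-wide default or
applied per declaration. This is an illustrative sketch (the variable names are
invented, and whether lower precision actually helps depends on the target
GPU):

```glsl
// GLES fragment shaders have no default float precision, so declare one.
precision mediump float;

uniform lowp vec4 tint;   // color data rarely needs more than lowp
uniform sampler2D tex;

in mediump vec2 uv;
out mediump vec4 frag_color;

void main() {
  frag_color = texture(tex, uv) * tint;
}
```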