https://www.youtube.com/watch?v=FQVOYNczPpw

Welcome, everybody. Thanks for coming to my talk. I know I am up against Nanite foliage, so I appreciate you all being here. I do not have a lot of eye candy. I do have some number candy, I guess, later on, which is going to be fun. This is my talk about what we are doing on the Unreal Engine team about the shader problem. My name is Dan Oxnidis. I work on the rendering architecture team. Our purview is primarily the shader system, material system, that kind of thing. So let us get into it.

Agenda for today. First, I want to describe the motivation for some of the major changes we've made in the UE shader system over the past few engine releases. I will give you some background information on the shader system in general, just stuff that is going to be helpful to know to understand the rest of the presentation's contents if you do not know it already. We will get into a discussion of what leads to high shader counts and why we are dealing with constant, unbounded growth in this space, and some strategies to reduce shader counts in your own projects without having to make any engine changes. I will touch a little bit on the on-demand shader compilation mechanism that came around in the early days of UE5. Then the bulk of the talk is going to be details about all the systemic improvements we made to the shader system between UE 5.2 and now that attempt to deal with this problem. Then we will get into the results, some cook stats and numbers, and finally a glimpse into some future work that we're going to continue addressing to improve the situation systemically as best we can.

So, motivation. What was the reason we took this work on in the first place? I imagine if you're here, or you're watching this video after the fact, you're probably familiar with the shader system. You're familiar with massive shader counts, large amounts of compilation time, the problems that we've had for years and years and years. Most of what I'm going to talk about specifically pertains to cook times. So, project cook times, if you care about that kind of thing, that's what I'm going to be digging into. Around the middle of 2022, we'd kind of reached a breaking point internally. Cooks of some of our internal projects were frequently needing to recompile basically every single shader. In some cases, these cooks were timing out. When cooks timed out, that meant we didn't get QA builds, QA couldn't validate builds to get to content creators, and that is a huge workflow disruption. So it was bad times for everybody. Now, the main driver of this problem was frequent shader code changes from our rendering team stressing the shader derived data cache mechanism, DDC we call it. We were only caching full shader maps. Full shader maps generally contain tens to hundreds of shaders each, and the cache key for these shader maps was based on a hash of all the source and a bunch of other settings that went into the shaders in that shader map. This meant even trivial changes, like changing whitespace or comments in a shader file, could incur a recompile of every single shader in the project. Changes to a Common.ush file meant every shader map was invalidated and every shader in the world was recompiling, which is obviously not great in a stream with active shader development. Basically, we were recompiling everything all the time, every day, multiple times a day, which is not great.
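To make that fragility concrete, here is a minimal sketch, not UE code, of why whole-shader-map cache keys behave this way: the key is derived from every included source file verbatim plus the compile settings, so even a comment-only edit produces a different key. The hashing and function names here are illustrative stand-ins.

```cpp
// Minimal sketch (not UE code): why whole-shader-map keys are so fragile.
// The key covers every included source file verbatim, so even a comment or
// whitespace edit to a shared header changes the key and invalidates the map.
#include <functional>
#include <iostream>
#include <string>
#include <vector>

std::string MakeShaderMapKey(const std::vector<std::string>& SourceFiles,
                             const std::string& CompileSettings)
{
    std::string Combined = CompileSettings;
    for (const std::string& Source : SourceFiles)
        Combined += Source;                       // raw text, comments and all
    return std::to_string(std::hash<std::string>{}(Combined));
}

int main()
{
    std::vector<std::string> Before = { "float4 Main() { return 0; }" };
    std::vector<std::string> After  = { "float4 Main() { return 0; } // tweak" };
    // The two keys differ even though the compiled shader would be identical.
    std::cout << (MakeShaderMapKey(Before, "SM6") != MakeShaderMapKey(After, "SM6"))
              << "\n"; // prints 1
    return 0;
}
```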
We had some temporary workarounds, but spoiler alert, they sucked. First of all, we had shader submission Friday, where shader devs were only permitted to submit shader code on Friday afternoons, to allow the automated builds to catch up over the weekend. As you can imagine, this did not really go over that well, and it didn't last very long. We improved the situation a little bit by separating shader devs into a separate stream. Essentially, we had an automated CI process that was pulling in changes from main, churning through the content, pre-populating the shared DDC, and eventually merging that shader code back into the mainline. But in practice this just meant long delays between the work being finished and it getting into the hands of users, and often we would get unlucky with a string of bad build health and have no progress on this process for days at a time. So it was just really not sustainable, and we formed a strike team to tackle this problem from as many angles as possible.

So, the background information. Probably some of you are aware of this already, so bear with me; it's not going to take very long, just some stuff I'm going to refer to in the remainder of the talk. First, I want to talk about the distinction between global and material shaders. Global shaders are the ones that are basically all hand-coded HLSL, typically written entirely by a programmer, used for specific passes, compute tasks, that kind of thing unrelated to scene objects, or, more accurately, not specifically attached to a scene object. Material shaders, by contrast, are the things that are applied to objects in the scene. They're constructed from a combination of hand-coded HLSL and the material graphs that are translated into HLSL automatically. What that means, basically, is that global shaders are not content dependent, while material shaders are content dependent. In large projects, generally speaking, material shaders dominate compile time, cook time, et cetera. There are other categories of shaders as well. Niagara, Compute Framework, et cetera, do also generate shaders, but in practice the numbers are much smaller, so they don't have the same impact on cook time, and I'm not really going to talk about those in this talk.

Shader types are basically the C++ and HLSL code that are mashed together to generate shaders. Each shader type implementation roughly maps to a single rendering pass in the engine. For material shaders, you've got things like your base pass, shadow pass, depth pass, velocity, et cetera, whereas global shader types tend to be more things like post effects, lighting passes, full-screen things, compute-related stuff. We also have vertex factories. These are further variations of shaders that deal with unpacking different types of vertex data, so meshes, particles, hair, water, that kind of thing. The same material in source, the material asset, can generate shaders for multiple vertex factories depending on its usage flags. Shader format is the term we use to refer to the shader-related code and data types for a single runtime target. Most platforms only have one runtime target, but PC can target multiple: you've got SM5, SM6, you've got Vulkan, you've got the mobile renderer. All of these can be generated for PC. Each shader format has a corresponding compiler implementation in C++ that usually calls into an external compiler tool, but also does a bunch of UE-specific pre-processing and post-processing on the data.
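As a quick mental model of the taxonomy above, here is a tiny illustrative sketch, with hypothetical names rather than actual engine types, of the axes that together identify one compiled material shader permutation.

```cpp
// Illustrative sketch only; these are hypothetical names, not UE types.
// A compiled material shader is identified by one value along each axis
// described above: the material itself, the vertex factory that unpacks its
// vertex data, the shader type (render pass), and the target shader format.
#include <iostream>
#include <string>

struct ShaderPermutationId
{
    std::string Material;      // content-driven: base material or instance
    std::string VertexFactory; // e.g. static mesh, particle, hair, water
    std::string ShaderType;    // e.g. base pass, shadow, depth, velocity
    std::string ShaderFormat;  // e.g. SM5, SM6, Vulkan, mobile
};

int main()
{
    ShaderPermutationId Id{ "M_Rock", "StaticMesh", "BasePass", "SM6" };
    std::cout << Id.Material << " / " << Id.VertexFactory << " / "
              << Id.ShaderType << " / " << Id.ShaderFormat << "\n";
    return 0;
}
```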
Now, shader compilation phases. When you are running the editor, engine startup, map loading, and play-in-editor will all compile different sets of shaders at different times. This is why we always have different delays waiting for shaders in the editor; I just wanted to explain the reason for that. First, global shaders and default shaders are prerequisites for running the editor at all; they are done during the startup phase. When you load a map in the editor, we're going to compile all the shaders that are needed to render that map in the editor context, versus play in editor, where you're running game logic, loading additional assets, and you need to compile different shaders to render things in the game context. Now, contrast that with what occurs during a cook. We have the same initialization phase, except it's done for whichever target platforms you're targeting with your cook: all the global shaders and all the default shaders compile for every target platform. But then cooks are basically just processing packages, loading a package in, saving a package out, and during that process those packages will say, hey, I need to compile a bunch of shaders. So we block the package save until all that shader compilation is completed. Finally, shader invalidation: that's just the term we use to describe any code or content changes which trigger shaders recompiling. This can be a change to a base material, a shader code change, various version bumps we have in the engine. A full invalidation is one which requires all shaders to recompile. We have a ShaderVersion.ush header file in which we bump a GUID to force all shaders to recompile. We typically have to do this if we change some kind of core serialization format or core data format, or just for some reason want to compile everything again.

So let's dig into the key problem then: why do we have so many shaders? As I mentioned on the previous slide, material shaders are generally by far the most numerous, so we're just going to consider those here. We've got a simplified equation here that gives you an upper bound for the total material permutation count, which is the number of materials, multiplied by the number of vertex factories, multiplied by the number of shader types, multiplied by the number of shader formats. There are a lot of terms here, obviously. The reason I say this is an upper bound is because obviously you're not going to have all vertex factories and all shader types enabled for all materials, but you could in theory. The top magenta term in the equation is content dependent, essentially: artists creating new materials or new material instances can add new shaders to a project. Whereas all the orange terms are essentially in engineering's control: rendering engineers adding new features are quite frequently adding new shader types to a project, or new permutations of shader types. Less frequently, they'll add new vertex factories; it's not quite as common. Shader format is also generally in engineering's control, though these tend not to change very often; it's basically just down to what kind of configurations your project is targeting. But the key point I want to make here is that at least two of these variables grow regularly over time. That is where the crux of the problem is. We have to figure out some way of keeping up with that growth.
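Here is a minimal sketch of that upper-bound arithmetic; the counts below are invented placeholders, not measurements, and the real number is always lower because each material only enables a subset of vertex factories and shader types.

```cpp
// Sketch of the upper-bound equation from the talk:
//   permutations <= materials * vertex factories * shader types * shader formats
// The numbers below are invented placeholders, not measurements.
#include <cstdint>
#include <iostream>

int main()
{
    const std::uint64_t Materials       = 2000; // content-driven term
    const std::uint64_t VertexFactories = 10;   // engineering-driven terms
    const std::uint64_t ShaderTypes     = 40;
    const std::uint64_t ShaderFormats   = 2;

    // Upper bound only: real projects enable a subset of vertex factories and
    // shader types per material, so the actual count is (much) lower.
    std::cout << "Upper bound: "
              << Materials * VertexFactories * ShaderTypes * ShaderFormats << "\n";
    return 0;
}
```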
Digging a bit deeper, this content term is actually kind of an oversimplification. There are a lot more levers that content actually has to create new permutations. In reality, we can create new materials and new material instances, add static switches to a material or a material function, or add quality level variations to a material or a material function. Again, all of these can, but don't necessarily, result in permutations being created. It depends entirely on how material instances differ from their base material. If the base material exposes static switches, changing the values on a material instance will result in permutations being created for that unique set of static parameter values. You can also just toggle material property overrides in instances, which can result in the activation of additional shader types. We don't automatically create all possible variations of the base material; we only create those that are required by the superset of instances that exist in your project. This goes back to what I said before about this all being an upper bound. Basically, one of the key notes is that if you have multiple material instances that all happen to set the exact same parameter values and have the same base material, they'll be deduplicated in memory, so we're not wasting space when we don't have to.

I do want to dig into the static switch issue a bit more, though, because this is one of the biggest pitfalls, in some of our internal projects at least: their very widespread use. Each Boolean switch you add to a material can potentially increase the number of shaders compiled for that material by up to 2x, right? A long time ago, dynamic branching was always expensive on GPUs by default, so the idea of reducing instruction count to optimize shader execution cost became kind of the de facto standard way of optimizing materials. The thing is, it's not really as relevant on modern hardware, but it is still standard practice for a lot of content creators. I've had conversations many times with artists who say, oh, I can't do this because it's going to add shader instructions and I'm wasting instructions, and I say, well, maybe you should measure it, right? So that is my suggestion: basically, don't add static switches unless you really need to. We do have a similar problem with global shaders too, even though they are in engineering's control, but that is basically a historical issue. Years and years ago, the default was to permute rather than dynamically branch in a global shader, and going back and undoing this is a very time-consuming, lengthy process.
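A minimal sketch of the static switch growth described above; the numbers are made up, and this is only the worst case, since in practice only the switch combinations actually used by instances get compiled.

```cpp
// Sketch: each static switch on a base material can double the number of
// permutations, because every combination of switch values that instances
// actually use becomes its own compiled variant (up to 2^N in the worst case).
#include <cstdint>
#include <iostream>

std::uint64_t WorstCasePermutations(std::uint64_t BasePermutations,
                                    unsigned NumStaticSwitches)
{
    return BasePermutations << NumStaticSwitches; // BasePermutations * 2^N
}

int main()
{
    // Invented example: a material that already compiles 50 shaders.
    std::cout << WorstCasePermutations(50, 0) << "\n"; // 50
    std::cout << WorstCasePermutations(50, 4) << "\n"; // 800
    std::cout << WorstCasePermutations(50, 8) << "\n"; // 12800
    return 0;
}
```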
Older versions of UE already did some level of shader and shader-compilation deduplication. As I just mentioned, material instances with matching parameters will share the same shader map, so we do not even submit compile jobs for those. That is the cheapest way of avoiding shader compilation. Then, at the point of submitting a shader compilation job, if all the code and compilation settings match exactly, the compilation itself is only going to be done once, and the results are shared and cached in memory for reuse later on. This can be the case when a shader type isn't influenced by differences in static switches or material property override values. Finally, after compilation, duplicate shader bytecode outputs, what comes out of the compiler, are only stored once in the final runtime shader library, so any jobs which produce identical output will just reference a single shared asset. This prevents additional runtime memory usage and additional PSO compilation costs on platforms which do PSO compilation at runtime. This final bytecode deduplication is actually the most impactful; we save a lot here. But to get it, we can see that we are doing a lot of unnecessary compilation work; we just cannot tell it is unnecessary until after the compiler runs, dead-strips the unused code, and we get the output out the other end.

With all this in mind, there are a few strategies I can suggest to manage shader counts without needing to make any changes to the engine. First, and maybe most obvious, is just to keep tight control over the number of materials you create for your project. Fewer base materials means fewer potential permutation variations that can be created down the line, and artists should generally prefer to create material instances whenever possible; they're more likely to be deduplicated. Second is to avoid static switches in base materials and material functions whenever you can. Prefer uniform parameters, even at the cost of increased instruction count, and don't assume that a higher instruction count means slower; just profile. Quite possibly it's not going to help in the way you think it will, especially on more modern hardware. Now, a caveat: it's not always possible to avoid static switches. The Material Translator currently doesn't handle dynamic branching around uniform parameters very well; it computes both inputs and then selects on the result, which is obviously not ideal. This is one of the things our new Material Translator is going to address, so stay tuned. It's in development right now; we don't have any promised versions yet, but it's coming. You also might need a static switch to branch around a texture sample, either to avoid memory loads or just to avoid exceeding sampler limits and preventing your shaders from compiling at all in the first place. These are some of the valid reasons to have static switches right now, but we are hoping to do some things to improve this. Third is just analyzing what shaders are actually used in your game. We often find we have things that are compiled but never used. Look at your materials and disable any unnecessary material usage flags; unused rendering features can be disabled, sometimes by CVars and settings, that kind of thing. We do have a couple of new features in 5.6 that can help you analyze this stuff. First is a runtime command called ListShaders; you can see its output on the left here. Basically, it just tells you all the shaders that have been loaded and created at runtime. On the right is a cook artifact, the shader type stats, which is another CSV that basically tells you all the shaders that were compiled during the cook and packaged into the shader library. Any discrepancies between these two lists can be leads into features that you might be able to disable or usage flags you might be able to turn off. If you find any unused shaders that don't have any apparent toggle, please reach out to us. We're going through this process a lot internally right now, trying to analyze and turn things off where we can, but our use cases are probably very different from yours, so we're probably going to miss some.
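That comparison can be scripted; here is a minimal sketch under the assumption that you have exported both lists to plain text files with one shader type name per line (the file names and format here are hypothetical, not what the engine emits).

```cpp
// Sketch of the analysis described above: shaders that appear in the cook's
// shader library but never show up as loaded/created at runtime are candidates
// for disabling a usage flag or feature. The file format is an assumption:
// one shader type name per line, exported however is convenient for you.
#include <fstream>
#include <iostream>
#include <set>
#include <string>

std::set<std::string> ReadNames(const std::string& Path)
{
    std::set<std::string> Names;
    std::ifstream File(Path);
    for (std::string Line; std::getline(File, Line); )
        if (!Line.empty())
            Names.insert(Line);
    return Names;
}

int main()
{
    const std::set<std::string> Cooked  = ReadNames("cooked_shader_types.txt");
    const std::set<std::string> Runtime = ReadNames("runtime_list_shaders.txt");

    for (const std::string& Name : Cooked)
        if (Runtime.count(Name) == 0)
            std::cout << "Cooked but never used at runtime: " << Name << "\n";
    return 0;
}
```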
Now, this is just a brief detour to talk about on-demand shader compilation (ODSC). This came out in UE5. Roughly speaking, it allows you to run with incomplete, initially empty material shader maps in the editor, so only shaders required by the visible scene will be compiled, on demand, when they are needed. While this compilation is in progress, objects are just going to render with the default material. Then, as requests complete, we replace the shader maps, the rendered objects are invalidated, and they pick up the new materials on the fly. This means that map load and play in editor, those two compilation phases I mentioned before, compile a lot fewer shaders than they did in UE4 days. So it is a much better situation than it was; it is not perfect, but it is better. This, however, is basically an editor workflow thing; it does not really impact the cook process at all, so I am not going to delve into a lot of detail here. But there is a reason I bring it up, and it is that it introduced a per-shader cache. Basically, the results of a single shader compilation job are cached in DDC, meaning the results of any ODSC requests could be persisted from one editor session to the next. Now, at the time, this only happened locally, so only a single user could reuse their own results over and over again; it wasn't put into shared caches at all. And again, like I said, it was editor-only. But this was a key piece of groundwork, and it made us realize that per-shader caching was somewhat viable; it just needed some work.

Here is a bit of a diagram to show you the old version of the shader compilation flow. Only full shader maps were cached in DDC. Any miss on a DDC query meant that all shaders for that shader map would either be compiled or, possibly but not terribly commonly, retrieved from the previously mentioned in-memory shader compile job cache. The reason this cache, on average, wasn't very impactful is that the cache key for the individual shaders was also based on the full source, not preprocessed source or anything like that. So this effectively only helped when different instances of the same material contained shaders which weren't dependent on static parameters or material property overrides.

With that, I'm going to move on to what is the bulk of the talk: the work that came out of our strike team. Our first big win came in the form of a library we call the shader minifier. We were profiling DXC, the DirectX Shader Compiler, and we saw a lot of time spent in frontend parsing, AST creation, that kind of thing. UE shaders tend to be very large after pre-processing, like over a megabyte in size, with thousands and thousands of functions, often very little of which is actually used in the final shader, but it was all still being parsed. So the idea came about: what if we do a coarse dead-stripping prepass? Can we do it fast enough that its cost is offset by the gains in compile time? It turns out the answer is yes, and by quite a margin. The rough process here is that we divide our shader source files into chunks, where chunks are things like functions, cbuffer declarations, variable declarations, et cetera. We find all the chunks that are relevant to a single entry point or a set of entry points, and then we just emit a new copy of the source containing only those chunks. As for results, basically that frontend compile cost almost disappeared; it pretty much evaporated. Overall compile time, especially on some of the worst-case shaders, which strip most of the code, improved significantly. So we rolled this out. It was experimental in 5.1 and 5.2, on by default in 5.3 for everything except Metal and OpenGL, and then I think in 5.4 it was on by default for everything.
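Here is a toy sketch of the coarse dead-stripping idea just described: treat the source as named chunks, walk the reference graph from the entry point, and emit only the reachable chunks. Parsing real HLSL into chunks and references is the hard part and is omitted; the structures and names are invented for illustration.

```cpp
// Toy sketch of the minifier's core idea: treat the source as named chunks
// (functions, cbuffers, globals), walk the reference graph from the entry
// point(s), and emit only the chunks that are reachable. Parsing real HLSL
// into chunks is the hard part and is omitted here.
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Chunk
{
    std::string Code;
    std::vector<std::string> References; // other chunks this one uses
};

void CollectReachable(const std::map<std::string, Chunk>& Chunks,
                      const std::string& Name, std::set<std::string>& Out)
{
    if (!Out.insert(Name).second || Chunks.count(Name) == 0)
        return;
    for (const std::string& Ref : Chunks.at(Name).References)
        CollectReachable(Chunks, Ref, Out);
}

int main()
{
    std::map<std::string, Chunk> Chunks = {
        { "MainPS",   { "float4 MainPS() { return Tint(); }", { "Tint" } } },
        { "Tint",     { "float4 Tint() { return 1; }", {} } },
        { "UnusedFn", { "float4 UnusedFn() { return 0; }", {} } }, // stripped
    };

    std::set<std::string> Reachable;
    CollectReachable(Chunks, "MainPS", Reachable);

    for (const auto& [Name, C] : Chunks)
        if (Reachable.count(Name))
            std::cout << C.Code << "\n"; // minified output: MainPS + Tint only
    return 0;
}
```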
Now, spoiler alert: with this in place, we started noticing that a lot of shaders were exactly the same across different materials, different passes, things that don't even make sense to be related. They were just identical. This led us to think we could probably take advantage of this somehow to further improve our pre-compilation deduplication.

Next, we actually revisited our preprocessing library as well. In 5.1 and earlier, we were relying on MCPP, which is a really old, mid-2000s preprocessor library, to preprocess all our shaders. It was not really maintained, not owned by anybody, and quite slow: we profiled it, compared it against DXC's preprocessor, and found it was almost twice as slow. It also wasn't thread safe, so we always had to run preprocessing in our shader compile worker, a separate executable process, in order to parallelize. We did evaluate DXC itself, as I mentioned, but we abandoned that approach because, first of all, the way it output code was just really weird and inflexible; there was no configurability whatsoever, and we did not want to have to make changes to it. In general, we also did not want to have to rely on DXC for all of our target platforms, since it is something we do not really have direct control over. Now, around the same time, Epic acquired RAD Game Tools, which was great, and it came to light that they had a fast pure-C preprocessor that came about as part of a shelved RAD compiler project. So we started testing it. Early tests showed quite a good improvement over DXC and an even better one over MCPP, though it was incomplete; it needed some work to make it fit for purpose in the shader compile system. Most notably, we had to add a file-loading virtualization mechanism, because we have virtual shader file paths in UE. We also did a big pass to improve stability, memory usage, and performance, and heavily optimized a bunch of the bottlenecks that appeared when we analyzed how it performed on our large shader codebase. With those improvements in place, it was also introduced as experimental in 5.2 and enabled by default in 5.3, at which point we kicked MCPP to the curb and everyone was happy.

The most critical work in this whole process follows on from what I noted two slides ago, which is that shaders tend to become identical after pre-processing and minification. We had this per-shader DDC caching mechanism from ODSC, and we theorized, and confirmed with some analysis, that pre-processing and minifying the source for each individual shader during the job submission process, enabling the per-shader cache by default and allowing it to use the shared caches, and generating the cache keys for this layer from a hash of the pre-processed and stripped source, would allow us to significantly reduce the number of shaders compiled in the whole cook. Again, some work was required to accomplish this. The main thing was that we had to refactor all of our shader formats: preprocessing and compilation were previously part of a monolithic compile step, kind of intrinsically linked, so there was just a lot of refactoring work to separate the two. We also did a big optimization pass on shader job creation and shader job submission, and we parallelized it. It was previously mostly serial work on the game thread, and there was a lot of time to gain there from algorithmic improvements.
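A minimal sketch of that per-shader caching layer, with Preprocess, Minify, QueryDDC, and Compile as stand-ins for the real pipeline; the point is only where the cache key comes from and the order of lookups, not how the engine actually implements it.

```cpp
// Minimal sketch of per-shader caching keyed on preprocessed + minified source.
// Preprocess(), Minify(), QueryDDC(), Compile() are stand-ins for the real
// pipeline; the point is only where the key comes from and the lookup order.
#include <functional>
#include <map>
#include <optional>
#include <string>

std::string Preprocess(const std::string& Source) { return Source; }               // stand-in
std::string Minify(const std::string& Source)     { return Source; }               // stand-in
std::optional<std::string> QueryDDC(const std::string&) { return std::nullopt; }   // stand-in
std::string Compile(const std::string& Source) { return "bytecode(" + Source + ")"; } // stand-in

std::map<std::string, std::string> GInMemoryJobCache;

std::string GetShaderBytecode(const std::string& RawSource, const std::string& Settings)
{
    // Key is based on what the compiler will actually see, so whitespace,
    // comments, and code stripped by the minifier no longer affect it.
    const std::string Stripped = Minify(Preprocess(RawSource));
    const std::string Key = std::to_string(std::hash<std::string>{}(Stripped + Settings));

    if (auto It = GInMemoryJobCache.find(Key); It != GInMemoryJobCache.end())
        return It->second;                              // 1) in-memory job cache
    if (std::optional<std::string> Cached = QueryDDC(Key))
        return GInMemoryJobCache[Key] = *Cached;        // 2) shared DDC
    return GInMemoryJobCache[Key] = Compile(Stripped);  // 3) compile, then cache
}

int main()
{
    GetShaderBytecode("float4 Main() { return 0; }", "SM6");
    return 0;
}
```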
And then finally, we flipped the switch on allowing individual shaders to be cached in the shared caches, lifting the editor-only and local-only restrictions that ODSC came with. Now, that final step revealed some bottlenecks within DDC. Most notably, the shared file-system DDC mechanism just couldn't handle the load of millions of requests during a cook. This actually became the primary cook bottleneck and slowed our cooks down, so it was kind of taking us backwards. However, we have Zen server. So we kicked file-system DDC to the curb internally, spun up shared Zen server instances in a bunch of locations, and that pretty much solved the problem. We also did some restructuring of how DDC response tasks were prioritized to allow better interleaving with background work, which meant DDC requests completed faster, we could kick off shader compile jobs faster, and we got higher throughput and faster cooks. That was pretty impactful as well.

On top of the improved deduplication rate, there are actually a lot of other benefits from caching based on the pre-processed and minified source. The biggest advantage comes from the fact that the big shader invalidations, which were the crux of most of our cook problems, become much less frequent. There are a few reasons why this is the case. First, we added a final pre-processing step to strip comments, normalize whitespace, et cetera, to make shaders look as similar as possible, which meant those trivial invalidations, things like changing comments and changing whitespace, no longer cause recompilation, which is great. Small changes to isolated functions in a shared header won't necessarily trigger a large invalidation anymore; only the shaders that actually use the code tend to be recompiled, because otherwise the minifier probably strips those functions out. Great. This is a fun one; I don't know if you've caught this before, but in 5.3 and earlier, if you changed an Xbox header, all the PS5 shaders would recompile. It sounds dumb, but the reason this was happening was that our mechanism for hashing the shader source doesn't have enough context at that point to do preprocessing, so we can't strip out all those includes. The hashing is basically just a dumb scan of all the include directives; it loads any of the files it can, and it's permissive, so if it can't load a file, it just ignores it. But that meant that if it could find a file on disk, say some PS5-specific code included from a shared header, that was going to invalidate all shaders on all platforms, even though it should only touch PS5. Obviously, during preprocessing all the includes are resolved and removed, so this issue just magically disappears. Also, obviously, enabling per-shader caching meant that, on average, fewer shaders are compiled, because you are sharing results between multiple users. Chances are pretty good someone has already requested and compiled the shader, especially on active projects with large teams; people who use shared binary builds and are not modifying materials rarely have to compile any shaders at all anymore. We analyzed performance on this around the 5.5 timeframe, because we were not 100% sure that it was going to be faster to cache than to compile a lot of these shaders, but after a lot of measurements we determined it made sense to just turn this on by default. Both for cooks and for the editor, per-shader caching is on by default in 5.5 and later.
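Going back to that comment-stripping and whitespace-normalization step, here is a toy sketch of the idea; a real pass also has to respect string literals, preprocessor directives, and so on, which this deliberately ignores.

```cpp
// Toy sketch of source normalization before hashing: strip // and /* */
// comments and collapse runs of whitespace, so comment- or formatting-only
// edits produce the same hash. A production pass also has to be careful about
// string literals, preprocessor lines, etc., which this toy ignores.
#include <cctype>
#include <functional>
#include <iostream>
#include <string>

std::string Normalize(const std::string& In)
{
    std::string Out;
    for (size_t i = 0; i < In.size(); )
    {
        if (In.compare(i, 2, "//") == 0)                    // line comment
        {
            while (i < In.size() && In[i] != '\n') ++i;
        }
        else if (In.compare(i, 2, "/*") == 0)               // block comment
        {
            size_t End = In.find("*/", i + 2);
            i = (End == std::string::npos) ? In.size() : End + 2;
        }
        else if (std::isspace(static_cast<unsigned char>(In[i])))
        {
            if (!Out.empty() && Out.back() != ' ') Out += ' ';
            ++i;
        }
        else
        {
            Out += In[i++];
        }
    }
    return Out;
}

int main()
{
    const std::string A = "float4  Main() { return 0; }   // v1";
    const std::string B = "float4 Main()\n{\n    return 0;\n} /* reformatted */";
    std::cout << (std::hash<std::string>{}(Normalize(A)) ==
                  std::hash<std::string>{}(Normalize(B))) << "\n"; // prints 1
    return 0;
}
```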
Effectively, we now have two layers of caching for shaders. The previous shader map caching still exists and functions basically exactly the same way it did before. But the big distinction is that when we get a miss on the shader map cache, we can generally just retrieve cached results for all the individual shaders rather than needing to actually do any compilation. Cache hits on shader maps, in fact, just bypass even the preprocessing cost.

So those are the big-ticket items, but there was a bunch of other work that came out of the strike team as well, and I'll just run through that quickly. First, a number of similar vertex factories were merged. This reduced the overall shader count by a small but not insignificant margin and had negligible runtime performance cost. It saves compile time, but also saves runtime memory and PSO compilation cost, because we have fewer shaders in the final shader library. We also had a problem we'd known about for a long time: we were incurring extra duplication of shader data in DDC because of how the shader map records are structured. Essentially, raw byte arrays of shader code are packed into each individual shader map, which meant multiple shader maps that shared the same code at runtime weren't sharing it in DDC. This affected both cook memory and DDC overhead. At various points in the cook, we're basically copying these buffers around: from the job results into the shader map output object, into the in-memory cache, and then finally into the shader library. So there were multiple levels of duplication, which is unnecessary. Also, once we enabled per-shader caching, we were then duplicating between the shader map bucket and the per-shader bucket as well, so that was basically more than doubling our DDC overhead. It was a huge increase in overhead for the initial rollout of our pre-processed cache, so we did address it before it got released. We exploited a nice property of DDC here, which is that value attachments on a DDC record are stored in a content-addressed store, a CAS mechanism. To minimize this footprint and fix the problem, all we needed to do was separate these bytecode arrays out into value attachments; if their hashes are identical, they're automatically deduplicated by DDC, even across buckets, so it was a fairly easy change. With this in mind, we create a shared buffer for the code immediately on job completion, keep it in the in-memory cache, and pass it around as a pointer instead. This reduced DDC storage overhead and got us back into the range we were in before doing per-shader caching in the first place, but it was also a big memory optimization for cooks, especially when you're compiling a lot of shaders. We did a similar optimization for symbol buffers, the shader debug symbols, with similarly large gains. With both of these in place, our job cache mechanism, which already had a memory limit that previously wasn't doing a lot, now owns the bytecode and symbol buffers as well. So effectively this introduced a nice hard memory limit on how much shader output data we store in memory during a cook. If we exceed the limit, we just evict things, and if we need those things again later in the cook, we have them in DDC, so we just retrieve them again. The eviction is not really that impactful; we didn't really notice any problems from doing that. So it's good to have a fixed memory limit for shader compilation, which we never had before.
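A minimal sketch of the content-addressing idea used here: identical bytecode buffers are stored once, keyed by a hash of their contents, and everything else holds a shared pointer instead of its own copy. The real system hangs these off DDC records as value attachments; std::hash below is just a stand-in for a proper content hash.

```cpp
// Minimal sketch of content-addressed storage for shader bytecode: identical
// buffers are stored once, keyed by a hash of their contents, and everything
// else holds a shared pointer instead of its own copy. std::hash is only a
// stand-in for a real content hash.
#include <functional>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

using ByteArray = std::vector<unsigned char>;

class BytecodeStore
{
public:
    std::shared_ptr<const ByteArray> Add(ByteArray Bytes)
    {
        const size_t Key = HashBytes(Bytes);
        auto It = Store.find(Key);
        if (It != Store.end())
            return It->second;                              // deduplicated
        auto Shared = std::make_shared<const ByteArray>(std::move(Bytes));
        Store.emplace(Key, Shared);
        return Shared;
    }

private:
    static size_t HashBytes(const ByteArray& Bytes)
    {
        return std::hash<std::string>{}(std::string(Bytes.begin(), Bytes.end()));
    }
    std::map<size_t, std::shared_ptr<const ByteArray>> Store;
};

int main()
{
    BytecodeStore Store;
    auto A = Store.Add({ 0xDE, 0xAD, 0xBE, 0xEF }); // e.g. depth-pass output
    auto B = Store.Add({ 0xDE, 0xAD, 0xBE, 0xEF }); // identical shadow-pass output
    std::cout << (A.get() == B.get()) << "\n";      // prints 1: stored once
    return 0;
}
```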
We also did a round of low-level optimizations to DXC itself. This was primarily focused on improving the performance of inlining and of exporting debug information. The target here was to address the compile time of some particularly bad outlier shaders, which were taking over a minute to compile a single permutation, as well as to reduce the compile-time discrepancy between cooking with symbols on and symbols off. On D3D SM6, which is the format that uses our modified version of DXC, the situation is much improved, but this obviously only addressed one shader format; Vulkan has its own set of problems, which remain to be addressed, unfortunately.

We also implemented caching of material translation results. We noticed that large materials could take hundreds of milliseconds to translate, while the round-trip time to our Zen shared DDC is usually in the single- or double-digit millisecond range. We implemented this as a race, to avoid the possibility that cache requests would take longer than translation itself, which tended to be common for some very small materials. So whichever of the translation or the DDC request completes first causes the other one to abort.

And one final piece of the puzzle here is Unreal Build Accelerator, or UBA for short. If you are not already aware of it, it's an Epic-developed distributed build system. Out of the box, it just performs better than other external products on identical hardware. I don't have any hard numbers to back this up, but anecdotally it's been consistently faster for everyone who's tried it, so we suggest exploring it if you have the means. It was built specifically to improve code builds initially, but we also use it for distributed shader compilation now. We've had a really good collaboration with the primary developer and added a bunch of features that are specifically useful for shader compilation, so I've got a couple of examples. One is a virtual input/output file mechanism. Our shader compile worker files were previously being written out to disk, which costs I/O and isn't really necessary. Now we just keep that data in memory, register a virtual file with UBA, and it handles distributing that content to workers automatically as needed. Caveat: this isn't rolled out yet; we're aiming for 5.8. There are still a few kinks to work out, but it's coming. We also have some visualization improvements, the UBA visualizer you see on the right here. Our shader jobs are batched in groups, and you can see the list of all the shaders that compiled in this one batch of jobs right here. We also have the blue line at the top there, which gives you a graph over time of how many processes are in flight at once, which was really useful in helping us analyze how well we are making use of shared helpers in UBA during a cook.
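Here is a small sketch of that translate-versus-cache race pattern: run the shared-cache lookup and the local translation concurrently, take whichever finishes first, and cooperatively cancel the loser via an atomic flag. TranslateMaterial and QueryTranslationCache are invented stand-ins, not engine functions, and the timings are fake.

```cpp
// Sketch of the race described above: run the shared-cache lookup and the
// local translation concurrently, use whichever result arrives first, and
// cooperatively cancel the loser via an atomic flag. Both functions and their
// timings are invented stand-ins.
#include <atomic>
#include <chrono>
#include <future>
#include <iostream>
#include <string>
#include <thread>

std::string TranslateMaterial(std::atomic<bool>& Cancel)
{
    for (int Step = 0; Step < 200; ++Step)           // pretend translation work
    {
        if (Cancel.load()) return std::string();     // aborted: the cache won
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return "locally translated HLSL";
}

std::string QueryTranslationCache(std::atomic<bool>& Cancel)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(10)); // pretend round trip
    if (Cancel.load()) return std::string();         // aborted: translation won
    return "cached HLSL";
}

int main()
{
    std::atomic<bool> CancelTranslate{false};
    std::atomic<bool> CancelCache{false};

    auto Translation = std::async(std::launch::async, TranslateMaterial, std::ref(CancelTranslate));
    auto Cache       = std::async(std::launch::async, QueryTranslationCache, std::ref(CancelCache));

    // Poll both; whichever finishes first wins, and we ask the other to abort.
    std::string Result;
    for (;;)
    {
        if (Cache.wait_for(std::chrono::milliseconds(1)) == std::future_status::ready)
        {
            CancelTranslate = true;
            Result = Cache.get();
            Translation.wait();   // let the aborted task finish cleanly
            break;
        }
        if (Translation.wait_for(std::chrono::milliseconds(1)) == std::future_status::ready)
        {
            CancelCache = true;
            Result = Translation.get();
            Cache.wait();
            break;
        }
    }
    std::cout << "Winner: " << Result << "\n";
    return 0;
}
```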
So, having gone through the details of all these optimizations, let's revisit that compilation flow diagram from earlier with the new components in place. First, shader maps query DDC as before; the keys are still constructed from raw source file hashes, but this just serves to avoid additional cost when everything is guaranteed to be up to date, i.e., when no shader code has changed at all. But if we get a miss on the shader map query, instead of compiling all the shaders, we execute preprocessing and minification, generate the cache keys for all the individual shaders based on that preprocessed output, and then run another query for each shader, which first checks the in-memory cache, followed by querying DDC if we did not find it there. Only then are any outstanding compile jobs actually executed, and we cache their results. Then we combine the results of any cache hits with the results of any shaders that compiled, and cache the new complete shader map in memory.

I mentioned the number candy earlier, and here we go. There is a lot of data on this slide, so I'm going to run you through it. To gather these numbers, I ran a number of cooks of the CitySample project on both 5.1 and 5.7, well, the state of 5.7 around early August. I haven't updated these numbers since then, but I don't think there's going to be much difference in the final release. This was specifically compiling only D3D SM6; that was the only shader format that was enabled in CitySample in 5.1. We do have Vulkan enabled on CitySample by default right now, but I turned that off for this test so we're comparing apples to apples. There were three tests that I ran. First is a full invalidation: I bumped ShaderVersion.ush, which causes all shaders to recompile. I did the same thing again, but with symbols on. Then finally, I did a partial invalidation: I made a small change to a single function in Common.ush, the pseudo-volume texture function, which is used by only a small handful of shaders. I also ran all these tests with a fully primed DDC from a previous cook, just to make sure that only shader work is being done in these cooks, with no other impact.

I want to highlight a few things that can be attributed to the work we've done. First, you can see the number of shaders requested and the number of shaders compiled are quite a bit lower on 5.7 than 5.1. This is showing the impact of both the shader reduction and the deduplication improvements. Cook durations across the board are significantly better: 37 minutes down to 14 minutes for a full invalidation, and an hour and 15 minutes down to 15 minutes for a full invalidation with symbols on. Also, you'll notice that for the symbols-on cook, the discrepancy between those cooks on 5.7 has basically disappeared. The compile time is still higher; it's just that we're parallelizing way better, so that compile time is essentially hidden now. You'll also notice the average preprocessing time is quite a bit better, even though we've added two additional preprocessing passes, the minifier and the normalization pass. One note here is that we now preprocess every shader, not just the ones that actually make it to the compiler, and the averages are calculated based on the higher of those two numbers, the requested number rather than the compiled number, on 5.7 versus 5.1. So we preprocessed more, but it cost less, so it is still good. We see quite a bit of improvement to average compile time with symbols off and a much bigger improvement with symbols on. This is attributed to the combination of the minifier reducing the frontend cost and our DXC optimizations improving the remainder. We also see a bit of an increase in cache hit rate. I do want to say that CitySample is kind of an outlier here: CitySample having a 32% cache hit rate on 5.1 was very unusual; on most projects we see this being way lower. CitySample was just a good test case because it had a lot of shaders to compile.
So that's the one we used. In practice, we find the improvement in cache hit rate is much bigger on other real game projects. The main win here, at least for our internal projects, is the big difference between the two partial-invalidation cooks: we went from 37 minutes to 12 minutes. That small change in Common.ush was basically a full invalidation on 5.1; as you can see, we basically recompiled every single shader, versus 21 shaders compiled in 5.7. So that's a big improvement, obviously. We also have a bit of a sneak peek of the incremental cook mechanism. You can see for those two cooks that the wall-time improvement was not proportional to how much shader compilation we were no longer doing, which indicates to me that shader compilation wasn't the bottleneck in those cooks anymore, and the incremental cook kind of proves that: 37 minutes down to three minutes. There was a bit of DDC storage impact here, because we're still changing the shader map keys, so we need to repackage the shader maps, but because we're deduplicating bytecode, the impact is pretty small; it's only 20 megabytes. Now, there is a bit of a cost: DDC storage requirements are still about 20% higher in 5.7 than 5.1 with symbols disabled, but we feel this is well worth all the other improvements. If you're cooking with symbols on, though, you're actually saving quite a bit of space; the cost of not deduplicating those in memory was really hurting us before. I also wanted to show you a graph of memory usage over time for the same cooks. Well, not exactly the same cooks, because tracing was enabled here, which extended the wall time a little bit, but you can compare memory usage before and after, and you can see it peaked about 20 gigabytes lower in 5.7 than in 5.1. So cooks are both significantly faster and less memory intensive, and my previous measurements suggest that the shared buffer optimizations I mentioned earlier were responsible for most of this improvement.

Great. My intent for this talk was to summarize the last few years of effort to improve the default state of compiling shaders in UE, but we are not finished by any stretch of the imagination. A caveat: not everything I mention here is guaranteed to actually happen. We don't have timelines for a lot of it, but at the very minimum these are validated ideas that we have in the pipeline. First, we want to provide better in-editor mechanisms to allow content creators to see and understand when changes they're making are incurring additional permutations. As you can tell from the explanation at the start of the talk, it's not really trivial or easy to understand for the average content creator, or even the average engineer for that matter. Vertex factories, I think, may not be necessary, at least on a lot of modern GPUs; we should be able to just write dynamically branching vertex fetch code. The questions are just which platforms we can do it on, what the GPU time impact is, and whether that is acceptable. But if we can make it work, it should be a big reduction in the number of shaders we actually need overall, even in the final runtime shader library. It's also possible that, if we can't go fully dynamic, we might still be able to collapse some additional targeted vertex factories that are similar enough to not cause any performance regression. So we're looking into that right now. As I mentioned previously, the new Material Translator is coming; implementation is underway.
There are a number of areas where this is going to help us in terms of shader reduction. First, as I mentioned before, proper dynamic branching reduces the need for static switches to avoid expensive work. The one that I think is potentially more promising is that combining proper dynamic branching with bindless means we're able to convert all of the static switches that are hiding texture samples into dynamic branches and just remove tons of permutations without any real impact, because we're not going to be exceeding sampler limits anymore, and we also won't be loading the redundant data on the GPU. We're also still compiling a lot more shaders than we need: generally, we're still seeing about two to two and a half times as many compile jobs as we see final resulting bytecode outputs. A limitation of the minifier is that it doesn't modify functions; it only culls them as whole pieces. With materials, our translator outputs everything in one single monolithic function, which contains a bunch of what is effectively dead code, but we can't tell it's dead code. This especially matters for things like shadow, depth, and velocity passes, where all you really need is your position, plus your opacity in the case where you're doing alpha masking. So what the new translator is going to have is a decoupling of the translation/compile step from the code generation output. We'll be able to output multiple codegen passes very quickly and easily without incurring a big cost. Essentially, we could then emit, for depth and shadow passes, a version of the material code that only outputs vertex positions, opacity, WPO, anything that affects that kind of stuff, and basically not generate the code that is going to be dead-stripped by the compiler in the first place, which should significantly improve our deduplication rate. Our goal is to get as close to 1:1 as possible. I don't know how close we're actually going to get, but we're working on it. Now, if you want more details about the new Material Translator, not specifically related to reducing shader counts, you can go back and watch the Unreal Materials talk from Camille Kaye. It was this morning, so it's already happened, but the video will be posted later. There are a few minutes at the end of that talk going into more detail about what the new translator is about.

Preprocessing is still quite heavy, even when we're doing it fully in parallel, and on low-core-count systems we've found it can be a new bottleneck that wasn't there prior to 5.5; we've had a few UDNs related to this. So we want to introduce a shader module concept, essentially, that's going to allow us to have shared libraries of preprocessed code that we can just copy verbatim into the output shader, rather than having to preprocess the entire code for every single shader. That should allow us to cut down the preprocessing cost significantly and increase the throughput of submitting compile jobs, especially in the editor.
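As a rough illustration of that shader module idea, here is a toy sketch, with invented names and a stand-in Preprocess function: a shared library is preprocessed once and cached, and each shader only pays the preprocessing cost for its own body.

```cpp
// Toy sketch of the shader module idea described above: preprocess a shared
// library of code once, cache the result, and for each shader preprocess only
// its own body, pasting the cached module output in verbatim. Preprocess() is
// a stand-in for a real preprocessor; keys and structure are hypothetical.
#include <iostream>
#include <map>
#include <string>

std::string Preprocess(const std::string& Source)
{
    // Stand-in: a real preprocessor expands includes and macros here.
    return "/* preprocessed */ " + Source;
}

std::map<std::string, std::string> GModuleCache;

const std::string& GetPreprocessedModule(const std::string& Name, const std::string& Source)
{
    auto It = GModuleCache.find(Name);
    if (It == GModuleCache.end())
        It = GModuleCache.emplace(Name, Preprocess(Source)).first; // done once
    return It->second;
}

std::string BuildShaderSource(const std::string& ShaderBody)
{
    // The shared module is copied verbatim; only the shader body pays the
    // per-shader preprocessing cost.
    return GetPreprocessedModule("Common", "float Square(float X) { return X * X; }")
         + "\n" + Preprocess(ShaderBody);
}

int main()
{
    std::cout << BuildShaderSource("float4 MainA() { return Square(2); }") << "\n";
    std::cout << BuildShaderSource("float4 MainB() { return Square(3); }") << "\n";
    return 0;
}
```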
I also want to expand our usage of the UBA virtual files once we've rolled out the initial functionality. As I mentioned before, we batch shader compile jobs to reduce sending redundant shared data to workers over and over again, but we can also handle that with the virtual file mechanism: we can define things like our shader parameter metadata and the material shader compiler environment settings once, store them in memory, and then just have UBA distribute and deduplicate them for us; it happens automatically. What this will allow us to do is basically get rid of job batching entirely, which means we should be able to load balance a lot better and get much better throughput. And finally, we've had a bunch of requests to implement some form of ODSC for global shaders. Now, it's not quite the same situation as materials, because global shaders don't have a valid fallback, but at the very least we should be able to asynchronously submit all the global shader compile jobs and just block on a result when it's actually needed for rendering.

And yeah, obviously, this was not all my work. There are a lot of people who contributed to this initiative in one way or another. This is not in any particular order, but thanks to all these people listed here; it was a big team effort and, I think, a pretty successful result. We are continuing to work on this stuff, so please reach out to us if you have any feedback in this space. And I think I have a few minutes left for questions. So thank you very much.