https://www.youtube.com/watch?v=EdNkm0ezP0o

All right. Good afternoon, everybody. I hope you're having a good Unreal Fest so far. My name is Kevin, and together with Thijs and Vignesh from CD Projekt, we will take you behind the scenes on the topic of large-scale animated foliage, specifically in the context of the Witcher 4 Unreal Engine 5 tech demo. This is the demo we showcased at Unreal Fest Orlando just a few months ago. The story arc of today's session takes us from the difficulties of rendering foliage, to early experiments, leading into solutions, and finally to how this collaboration and demo culminate in engine features for you to try out in 5.7. Before we kick things off, here's a brief snippet from the demo just to set the tone. It is running on a base PS5 at 60 FPS with all the bells and whistles that UE5 has to offer. A bit of a spoiler for those who haven't seen it: this is the middle section of the demo, where we go on rails and fly through, looking at all the foliage and vegetation tech that we built. In case you haven't seen it, it's online, and you can watch it in its full glory. As we move up here, we catch up with our protagonist and continue the demo, but we're going to skip that part for now. So I'm going to hand it over to Thijs to start us off with the challenges of rendering large-scale animated forests, and I'll be back at the end to tell you more about the 5.7 features. All right, take it away, Thijs.

THIJS: All right. From the very start, we set ourselves two goals. First, we wanted the player to feel like they're in a huge forest. But we also wanted the forest to feel alive, so that it responds naturally to forces from the environment, especially wind. And we wanted this to work not just on a small scale, but to push it as far as we can. But first, we had to start basically from nothing, because there was no Nanite foliage at that time. So why is foliage hard?
Well, foliage has always been a challenge in video games. The state-of-the-art solution hasn't changed much in a very long time. There have been improvements, for sure, but fundamentally we were making trees and foliage the same way: we were using cards, we were using alpha testing, and many of the advantages that Nanite gave didn't apply to foliage. So first, we tried cards; we started with the known solution. On the screen, you can see the same tree made with cards and made with geometry, which is the reference mesh. The cards approach I would not recommend in any case. It's still around three to four times slower than meshing the tree out and letting Nanite handle the geometry. But if you do use cards, try to make the geometry match the card texture as closely as possible. Don't use big planes; cut the silhouette out if you can. But generally, we recommend staying away from alpha testing. So why is alpha testing slow? Especially in the software rasterizer, the main reason is the barycentrics. Nanite handles all the geometry in one fixed-function pipeline, but with alpha testing, this is no longer possible. The reason it's slow is that in the inner loop of the software rasterizer, we need to calculate the barycentrics to get the UVs and then sample the alpha mask. You can look at the Nanite rasterization common shader file for the logic. Because this happens in the inner loop of the software rasterizer, there's a heavy performance penalty to pay that is not easy to get around. So we tried the other approach: geometry. This had some advantages. We didn't use alpha, and it was faster to render because it didn't have this problem with the barycentrics.
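To make the barycentrics cost concrete, here is an illustrative CPU-side sketch (not engine code; all names are our own) of what an alpha-testing inner loop has to do per pixel: compute barycentric weights, interpolate UVs, and sample an alpha mask, work that a fully opaque pipeline skips entirely.

```python
# Illustrative sketch of per-pixel alpha testing in a software rasterizer.

def edge(ax, ay, bx, by, px, py):
    # Doubled signed area of triangle (a, b, p); the usual edge function.
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def shade_pixel(tri, uvs, alpha_mask, px, py):
    (x0, y0), (x1, y1), (x2, y2) = tri
    area = edge(x0, y0, x1, y1, x2, y2)
    # Barycentric weights: this is the extra per-pixel cost that an
    # opaque fixed-function path avoids.
    w0 = edge(x1, y1, x2, y2, px, py) / area
    w1 = edge(x2, y2, x0, y0, px, py) / area
    w2 = 1.0 - w0 - w1
    # Interpolate UVs and sample the alpha mask; discard if transparent.
    u = w0 * uvs[0][0] + w1 * uvs[1][0] + w2 * uvs[2][0]
    v = w0 * uvs[0][1] + w1 * uvs[1][1] + w2 * uvs[2][1]
    return alpha_mask(u, v) >= 0.5
```

Every covered pixel pays for two edge-function evaluations, two divides, and a texture sample before it can even be rejected, which is why this path is so much slower than writing opaque triangles directly.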
It also looked better, because we didn't have to fake the normals, which vegetation artists have gotten very good at over the years. And it gave us good subsurface scattering and all that out of the box. But there was another problem, mainly the disk size and the RAM. Nanite geometry rendered quite fast, but we were using up to 63 megabytes of VRAM for a single tree versus 6 megabytes for cards. So that's a 10x increase in memory usage, and even more on disk: a 30x increase. We would be shipping a huge game if we made all our foliage this way. Next, there is also just the inherent problem of foliage, which is that it's hard to build good LODs for it. Nanite has the Preserve Area flag, which is an absolute must for foliage, otherwise it will just disappear. But even with the flag, it doesn't do a great job of keeping the shape at a distance. As you can see here, on the left is a rock and on the right is a tree. Clearly the quality of the rock LOD is much higher, while the tree basically becomes a triangle soup. It's just an inherently hard problem to model a tree with a very low triangle count. Plus, all the holes that you can see in the geometry also cause excessive overdraw. So it's not just that the LODs are worse, they're also more expensive. And then we have animation. We started out, again, with industry best practice, which was vertex-shader-based wind. There were friendly, out-of-the-box solutions available. We started with Pivot Painter 2, a vertex-shader-based wind solution where the artist stores the pivots and the hierarchy of the tree in textures, which we then evaluate at runtime in the vertex shader. Actually, this approach reminded us a lot of skeletal animation, except it was all happening in the vertex shader. As you may remember from the earlier slides, if we go with geometry, we have a lot of vertices.
Even with cards, we're around 100k vertices, but with geometry, it's over a million vertices that need to be transformed at runtime. So it was quite clear that the work we do for each vertex needs to be absolutely minimal. And there's another problem: when using vertex shader logic to move things around, or world position offset, the way to handle the cluster bounds in Nanite is that you need to inflate them. In this case, we can imagine the tree on the right being our reference tree, and the rotated tree is the one that's animated. The little green box that you see is the cluster bounds of that particular cluster up top, which is used for culling within Nanite, and it is very, very important for performance. Now, if we want to move those vertices to that point, we have to inflate those cluster bounds so that they also fit the original position. That means they need to be absolutely huge. And while you may say, well, foliage is not good at occlusion culling anyway, remember that these cluster bounds are also used for invalidating VSM pages. So clearly, vertex shader animation was not the way to go, and we needed some new solutions. We saw potential in Nanite, especially with geometry, but it was also clear that we had to develop on top of it to make it work. So one of the first things we did was skinned foliage. This quickly proved to be a success. We switched from doing everything in the vertex shader to a two-pass system, where we first calculated all the bone transformations and did the animation in a compute shader, and then in the vertex shader just moved the vertices around. This was quite cheap. We also went with rigid skinning, where only one bone influences each vertex. That was a good enough trade-off for us, and it could also bring potential benefits later, such as with ray tracing.
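The cluster-bounds problem can be sketched numerically. This is a hypothetical illustration (not engine code): culling has to stay conservative, so a cluster's AABB must be padded by the worst-case vertex displacement the wind could ever apply, and the culling volume balloons.

```python
# Sketch: conservative inflation of a cluster AABB for vertex-shader wind.

def inflate_bounds(bounds_min, bounds_max, max_displacement):
    # Pad the box on every axis by the worst-case vertex offset, since
    # any vertex might be pushed that far by world position offset.
    new_min = tuple(c - max_displacement for c in bounds_min)
    new_max = tuple(c + max_displacement for c in bounds_max)
    return new_min, new_max

def aabb_volume(bounds_min, bounds_max):
    v = 1.0
    for lo, hi in zip(bounds_min, bounds_max):
        v *= hi - lo
    return v
```

With these illustrative numbers, a unit cluster padded for a 2-unit branch swing grows from volume 1 to 125, so both occlusion culling and VSM page invalidation operate on a far larger volume than the visible geometry.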
And we ended up using skinned foliage in the demo. We also tried to make alpha testing work; we didn't go all-in on geometry right away. The first thing we tried was Nanite, but with custom LODs. So basically, we were using Nanite as it works out of the box, but with manual LODs at certain distances. This did bring a good perf improvement, around 20% to 30%, but it also came with downsides: it brought back popping and all the other issues associated with cards. So this was not enough to achieve the vision that we had. Then, with the help of Epic, we also tried to optimize the software rasterizer as much as we could to make alpha testing as fast as possible. Significant improvements were made, but we started hitting diminishing returns. So at that point, it was clear that alpha testing was a no-go. One other thing that we considered but didn't pursue was inspired by opacity micromaps. If you're familiar with them, it's a technique used in ray tracing to speed up tracing against cards and alpha-tested geometry. The idea could be transferred to the Nanite software rasterizer to get rid of that expensive barycentric calculation I was talking about earlier, and to use a more efficient way of checking whether a pixel is opaque or not. But in the end, we didn't pursue this direction, because the geometry approach was showing more potential. So there's no alpha-tested foliage in the demo anywhere, not even the small foliage. Everything is using the same pipeline, and nowhere did we have to use cards. So, back to geometry. Fully modeling a tree was not really feasible due to memory costs. But actually, the artists were already constructing trees in a very modular way, using node systems to combine different pieces and scatter them around to create a tree. This was something we could make use of. So the first thing we tried was, well, why not just use instancing, right?
Just instance a bunch of tree parts all over the place and construct trees with that. It worked quite well memory-wise, but we quickly hit some limits. For example, we hit the instance limit in the GPU scene due to the very large number of parts we had to instance. And we also had to pay the cost that comes with each instance: the GPU scene stores quite a lot of data per instance, and we didn't need that data. What we needed was very lightweight instances. So now I will hand it over to Vignesh, who will explain how we overcame these limitations.

VIGNESH: All right. So we could not use regular instancing, like Thijs mentioned, because of the instance limit. So we came up with virtual instancing. It involves composing a tree out of lightweight instances of a set of archetype meshes. These lightweight copies do not count towards the instance limit, so they're practically free. What we see here is the set of archetype meshes I mentioned before; we compose this tree entirely out of this set of meshes. And this is the structure we use to store those lightweight instances: we just have an opaque pointer to the archetype mesh, a pointer to a transform, and the bone it belongs to in the skeleton. So you'd think we've fixed it now, right? The memory is pretty good, and we don't affect the instance limit. But we ran into another issue. This is the Nanite clusters visualization that you can see. What you see right now is the clusters stopping to simplify: they have reached the coarsest level, so those gray clusters cannot simplify anymore. And they reach it pretty quickly.
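The lightweight instance record described above might look like the following sketch. The field names are our own (the talk only says "an opaque pointer to the archetype mesh, a pointer to a transform, and the bone"), and indices stand in for the pointers; the point is that each copy carries a few small integers rather than full GPU-scene instance data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VirtualInstance:
    archetype: int   # which archetype mesh this copy renders
    transform: int   # index into a shared transform buffer
    bone: int        # skeleton bone this tree part follows

def build_tree(part_placements):
    # Compose a whole tree from (archetype, transform, bone) triples,
    # one record per scattered part.
    return [VirtualInstance(a, t, b) for (a, t, b) in part_placements]
```

Thousands of such records per tree stay cheap because the heavy data (geometry, materials) lives once per archetype mesh, not per copy.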
This is because each virtual instance is a separate entity; the virtual instances do not have any geometric connection with each other, so they do not combine together to form a simplified, continuous geometry. One of the side effects of this was an explosion in the number of virtual instances. That's the first thing. The second thing was an explosion in the triangle count: especially in larger scenes, we have very small 128-triangle clusters all over the place, and this was causing performance issues. This is the Nanite instances visualization. What we tried next was creating levels of detail for these trees and swapping them in based on the distance from the camera, to avoid this explosion of triangles. These are what we call the LODs for the virtual instances. I'll go back and forth so you can see that not only is the number of virtual instances lower, these levels of detail also have far fewer triangles. Let's look at some numbers. What you see here is the amount of time taken by the GPU to fill the Vis buffer, and on the y-axis we see the distance from the camera, from near to far. As you can see, near the camera there is not much difference between these three cases. The first case is a tree without any LODs. The second is the tree with the LODs. And the third is the reference mesh, which is the fully modeled tree. In the nearby case there is not much difference between them, because things are still simplifying. As the distance increases, the tree without simplification has this huge explosion in the number of triangles in the scene, and this contributes to the very big numbers in the Vis buffer timings. So the clear winners here are the reference tree and the tree with simplification; they are pretty close in all those cases. But now let's take a look at memory.
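The distance-based swap described above can be sketched in a few lines. This is a minimal illustration with hypothetical switch distances (the talk does not give concrete values): LOD 0 is the full set of virtual instances, and each threshold the camera crosses swaps in a coarser, pre-built level.

```python
import bisect

def select_lod(distance, lod_switch_distances):
    # Count how many switch distances the camera has passed; that count
    # is the LOD index (0 = full virtual-instance tree).
    return bisect.bisect_right(lod_switch_distances, distance)
```

Because the thresholds are sorted, a binary search picks the level in O(log n), which matters when this runs for every tree in view each frame.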
It's a similar chart, but this is the Nanite streamed-in memory. As you can see, the reference tree mesh up close is streaming in so many high-quality clusters that it takes almost 165 megs, while the tree with simplification just has to pay for the archetype meshes as well as the LODs. But we did not end up using virtual instances in the demo as-is. That's not because the tech failed; rather, we wanted to do it properly with Epic, with better integration into the engine and without any of the manual LOD generation that we saw. The second thing: Thijs previously mentioned the distance representation problem. We get poor shape preservation with triangles, even with Preserve Area turned on, as well as inefficiency due to holes in the aggregate geometry, that is, poor occlusion culling. And generally, trying to manually model out LODs for aggregate geometry is quite difficult. So we wanted a different representation, and point clouds are something we tried out. What we did was, at build time, when the foliage is imported, we created point cloud clusters from the virtual instances. At runtime, we just replaced the virtual instances with the point cloud clusters at a certain distance from the camera. We also added a pass where these point cloud clusters are splatted onto the Nanite Vis buffer. Visually, it looks really good, and it preserves the area much better. The one on the right is the point clouds, while on the left is the triangles; as you can see, the area preservation is much better. But we did not end up using this in the demo, due to performance reasons. It comes down to the nature of how we do atomic writes to the Vis buffer: with these point clouds we were writing to the Vis buffer all over the place, which caused too much traffic in the memory controller, and that was causing lots of performance issues. So we decided not to go this route. So, what did we end up doing in the tech demo?
We did three things. The first is voxels, for a better distance LOD in terms of efficiency as well as shape preservation. The second is assemblies, for reduced disk size while at the same time keeping the ability to have high-fidelity assets. And the third is fast and efficient skinning. So let's have a quick overview of how voxels are used. All these subpixel triangle clusters we have during the Nanite build process can now become voxel clusters. There is a simplification error for each cluster, and if it falls below a certain threshold, then instead of keeping them as triangles, we turn them into voxels. This happens during the Nanite builder process. The voxels are stored in 4x4x4 bricks, each represented by a 64-bit field. These bricks are traced at runtime during the Nanite rasterization pass, depending on whether a cluster is triangles or voxels. And the bricks are binned into depth buckets from front to back for early depth testing, which helps a bunch with overdraw as well. In the video you see right here, the effect has been exaggerated to show the voxels; in practice you won't see them, because they're usually subpixel. Next is assemblies. What you see here: each of the separately colored entities is an assembly part. The image at the bottom should seem familiar, because the input we give to assemblies is the same as the input we gave to virtual instances. We didn't have to change anything; we just had to re-import with the new assembly builder. So each tree is composed of these instanced assembly parts. And you can see on the right the number of instances of each assembly part that's been used in that tree. And this shows how the assemblies are simplified: each of the colored parts eventually combines to form a simple, continuous geometry. This is the advantage of assemblies over what we had with virtual instances, where we had to generate our own LODs.
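The 4x4x4 brick encoding mentioned above is simple to sketch: one bit per voxel packs a whole brick into a single 64-bit integer. The x-major bit ordering below is our assumption for illustration, not the confirmed engine layout.

```python
# Sketch of a 4x4x4 voxel brick packed into one 64-bit integer.

def voxel_bit(x, y, z):
    # Linear bit index inside the brick (assumed x-major ordering).
    return x + 4 * y + 16 * z

def set_voxel(brick, x, y, z):
    return brick | (1 << voxel_bit(x, y, z))

def has_voxel(brick, x, y, z):
    return (brick >> voxel_bit(x, y, z)) & 1 == 1
```

Occupancy tests against a brick are then single bit operations, which is what makes tracing bricks front to back through depth buckets cheap at runtime.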
Here, it's already taken care of by the Nanite builder in the offline build process. The primitive scene data now has a range of assembly transforms. At runtime, each cluster has an assembly transform index; it just picks that assembly transform and flattens it so that we know the final world-space transform of the assembly. And now a quick comparison between the alpha cards, the fully modeled geometry, and assemblies with voxels. The assemblies have the performance characteristics of the fully modeled tree while maintaining a fraction of the memory footprint. And voxels make it much faster than the fully modeled geometry, by helping with the LOD inefficiency at a distance as well as giving better shape preservation. Keep in mind that assemblies and voxels don't need to be used together. A non-assembly mesh can be voxelized, and it doesn't need to be a tree; it can be any kind of mesh, and it's really helpful especially for aggregate geometry. At the same time, the other way around is also true: a tree can be made of assemblies but doesn't necessarily have voxels enabled. Next is skinning. How did we skin these assemblies? All the assembly parts are transformed rigidly; all the clusters within an assembly part are transformed rigidly. So they are pretty watertight, and there are no cracks between them. Each assembly part can be influenced by up to eight bones, but in the case of foliage, our assembly parts are usually associated with one branch, so in our case we just use a single bone influence. We saw in one of the previous slides that assembly parts simplify to form a continuous geometry. During that process, the simplified geometry, represented in blue, undergoes per-vertex skinning instead of rigid transformation. The bone influences of these vertices are just picked up from the assembly parts they originally belonged to.
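The transform flattening step described above amounts to one matrix composition: the cluster's index selects an assembly-local transform out of the primitive's range, and folding it into the primitive's local-to-world matrix yields a single final matrix. A rough sketch with plain row-major 4x4 matrices (function names are our own):

```python
# Sketch of flattening an assembly transform into a world transform.

def mat_mul(a, b):
    # Row-major 4x4 multiply: result = a * b.
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translation(tx, ty, tz):
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def flatten_assembly_transform(primitive_to_world, assembly_transforms, index):
    # Pick the assembly-local transform by index and compose it with the
    # primitive transform once, so rasterization sees a single matrix.
    return mat_mul(primitive_to_world, assembly_transforms[index])
```

Composing once per cluster means the rasterizer never needs to know about the two-level hierarchy; it just consumes one world-space matrix.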
In code, it's pretty straightforward. Each primitive, on being added to the world, receives a skinning header; we just get the transform buffer offset, load the bone transforms, and move the clusters. What you're seeing right here is one of the optimizations we did to help with skinning lots of foliage in the world: a default min animation screen size, which is used to stop animating skinned meshes that fall below a certain screen-size threshold. This helped a lot in our case. And this is the same CVar in motion. It's pretty difficult to spot the trees that have stopped moving, because at that distance most of the foliage is just noise if you look at it. Next is animation. Our animation is fully compute-based, and it's quite fast. This is one of our bigger trees that we used in the demo; it has more than a thousand bones. There is no pre-baked animation involved; the wind is evaluated for each bone every frame. And this is our debug view to see the bones. How it works is that we dispatch a thread in the shader for each bone in the skeleton, and we recursively accumulate the position and rotation all the way from that bone to the root. This happens for all the bones. So, for example, bone B only needs to accumulate its own position and orientation over a short chain, while bones at the leaf nodes have to go all the way up. And we do this every frame. In code, it looks something like this: the skeleton data has the current bone index and the parent bone index, and we do the accumulation in a recursive manner, starting from the current bone and going all the way up to the root bone. And this is pretty fast. It takes around 300 microseconds on the critical path, but with async compute enabled we basically hide it, so it's less than 20 microseconds. Next is animation instancing. You may have noticed that with around 500,000-plus animated foliage instances, 300 microseconds is quite small.
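The per-bone accumulation loop above can be sketched on the CPU. In the compute shader there is one thread per bone; here each "thread" is a call that walks parent indices up to the root, summing local offsets (rotation is omitted to keep the sketch short, and all names are our own).

```python
# CPU-side sketch of the per-bone accumulation described in the talk.

def accumulate_bone(bone, parents, local_offsets):
    # Walk from this bone to the root, summing local offsets.
    x = y = z = 0.0
    while bone != -1:          # -1 marks "no parent" (past the root)
        ox, oy, oz = local_offsets[bone]
        x, y, z = x + ox, y + oy, z + oz
        bone = parents[bone]
    return (x, y, z)

def accumulate_all(parents, local_offsets):
    # One GPU thread per bone in the real thing; a plain loop here.
    return [accumulate_bone(b, parents, local_offsets)
            for b in range(len(parents))]
```

Note the redundancy the talk mentions: a shallow bone walks a short chain while a leaf bone re-walks the whole path to the root every frame, trading repeated work for the parallelism of independent threads.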
That's because we don't run the wind evaluation for all 500,000 foliage instances. This is one of the shots from the demo, and I would like to show the debug view of the bones. There is no way to realistically animate this many bones; we did try with a naive approach, and the render thread suffers the most: the skinning scope takes more than 20 milliseconds on PS5. So we introduced animation instancing. We still wanted a sense of directionality for our trees: we didn't want trees moving in random directions, we wanted the trees to follow a certain direction along the bend. So we evaluate eight variants of the animation for each unique skeleton — not each unique primitive, each unique skeleton in the world — and select an appropriate template based on the direction the instance is facing. So let's say I have a primitive in the world that's facing 160 degrees around the z-axis; then we'll select the skeleton variant that's been simulated for the fourth octant. It was a good approximation, it didn't look bad, and we were happy with the results. And this shows a high-level overview of what's happening. We have the unique foliage skeletons, and the wind shader is run per unique skeleton and generates eight templates. These templates are used on the instances in the world based on their orientation. When a primitive is added to the world, an animation template is assigned to it based on the direction it's facing. The template just corresponds to an offset into the skinning buffer, so it's a matter of sampling the proper bones. Using this, the number of unique items that we process in the skinning scope went down from almost 500,000 to just 150, which is the number of unique skeletons in the world. This is around a 200% improvement in the render thread.
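The direction-based template pick reduces to quantizing an instance's yaw into one of eight 45-degree octants, each mapped to a pre-simulated wind template for its skeleton. A minimal sketch (the function name is our own):

```python
# Sketch: map an instance's facing direction to one of 8 wind templates.

def wind_template_index(yaw_degrees, num_templates=8):
    # Wrap into [0, 360) and quantize into equal angular octants.
    yaw = yaw_degrees % 360.0
    return int(yaw // (360.0 / num_templates))
```

An instance facing 160 degrees lands in index 3, i.e. the fourth octant counting from one, which matches the example in the talk; the template index then becomes an offset into the shared skinning buffer.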
And that is how we also got a very low amount of time on the GPU: we just simulate 150 skeletons with 8 variants each, which amounts to almost 100,000 bones per frame. And we hide it with async compute, so it's almost free. So next up, Kevin will be coming up to show some numbers, because we love numbers. Thank you.

KEVIN: All right. So with all these juicy details and beautiful content, what did we actually end up paying for all of this? In this vista shot here, we have about 20,000 trees post-cull. The source triangle count coming into Nanite here is in the billions, but that doesn't really matter with Nanite. Across the whole valley, we have scattered over 500,000 trees, 1.7 million shrubs, and 7 million grasses, across these 26 square kilometers. We have about 28 species of vegetation in a couple of different biomes. The tree assets range from below a million to about 10 million triangles, with up to several hundred bones per tree. That varies with the state of the tree: if it's a dead tree, obviously you don't need that many leaves on it, or maybe it doesn't even move at that point. We also had one very big tree with 40 million triangles, which was the hero tree where we showed off some of the debug view modes. And here I put some of the numbers from the actual foliage section that we saw earlier. In this particular shot, we have about 100,000 bones — I think it was 135,000 — that update in 0.1 milliseconds on the GPU. So that's very fast and scalable, as Vignesh outlined. I put here the Nanite timings, roughly on average for, what is this, like 10,000 frames? This is the whole forest section. We had a budget of around four milliseconds for the Nanite passes; that excludes the shadow passes, which had their own budget. What's interesting here is that the big arrow I put here points to the actual frame of this shot, and I put the numbers for that particular frame in the image there.
And it's actually a little cheaper than the average of the whole thing. You can also see the dip in the graph. You'd think that because you're seeing so many trees and everything is in frame at once, it would be more expensive, but Nanite and VSM actually scale fairly well with distance, and that's also thanks to the voxel tech. All right. So that's what we did for the demo. Let's go through what's coming in the engine for you to play with. Starting in 5.7, we've got some experimental features for you — experimental mainly to give us some more time to polish things and take feedback for future releases. So there's kind of a long list here with 5.7. We've got a brand new Wind plugin. We've got support for authoring assemblies through our USD importer path, plus some in-editor tools if you want to stay in-editor. We've got some example tree assets from Quixel on Fab that use best practices and are easy for you to use. And finally, we have a brand new tree authoring tool directly inside of Unreal Engine. For the Dynamic Wind plugin, you can easily enable it and start playing with it right away. The idea is that you should be able to get your Nanite foliage up and running with wind easily. It supports procedural wind just like what we showed in the actual demo here. You can configure your wind easily, and it will affect instanced, skinned mesh assemblies, with support for the whole wind-direction batching and the physical parameters that Vignesh mentioned. In our USD importer in the engine, we now have support for Nanite assemblies. Through USD schemas that you can use in DCC tools to mark up your scenes, you can build your own Nanite assembly structures and import them straight into the engine. So with this, you can create your own pipelines in any tool that supports USD. We also have some in-editor functionality to create Nanite assemblies with.
In this new plugin, called Nanite Assembly Editor Utils, you'll find things such as right-click actions to create Nanite assemblies from a selection in your viewport, and a Blueprint API for manually building your own Nanite assemblies and saving them out to disk as a mesh — that's the bottom example here. We also added support for Nanite assemblies right in the Procedural Content Generation tools, which you might be familiar with; we've got an example here in the middle. So that can also spit out your Nanite assemblies as a mesh, which is actually quite powerful, because Nanite assemblies go beyond foliage. It's a micro-instancing solution, so you can use it to build up any kind of mesh if you want to. And for content examples, to get you started out of the box, Quixel is providing a pack of preset trees which are optimized for Nanite assemblies and the foliage pipeline as a whole. These are procedurally generated, game-ready assets which are extensively customizable through the new tooling coming in 5.7. And there will be more additions and foliage packs on Fab over time, so keep an eye out for that. Note that this will come to Fab when 5.7 Preview 1 goes live; if you're downloading that, these things will be in the plugin itself, in case you're looking for them. And the tool I'm mentioning is the new Procedural Vegetation Editor. It's built on Quixel's extensive experience with procedural foliage-growing algorithms. Using this tool, you can edit and create new variations of existing trees with graph-based rules and logic, right in the editor, similar to how you do things with Geometry Scripting or PCG, tools you might already be familiar with. This tool will produce Nanite assembly assets ready for you to use right in your projects, and it will also support the Dynamic Wind plugin right out of the box.
And over time, the idea is for this to become your one-stop shop for foliage generation, even from scratch. To learn more about how to set up your content using these new engine features, you can go see Simon and Hassan from Quixel in this session room, I think, tomorrow at 2:30, the same time as this talk. They will walk you through all the new content examples and tools, as well as content pipelines and best practices with Nanite foliage. I highly recommend that talk as a follow-up to this one. And that's it. I want to thank everybody for listening. We also want to extend a special thanks to everybody who worked on this at Epic and CD Projekt. We wanted to put an acknowledgment slide up, but there were just going to be too many names on it, so you'll have to make do with this. I think we have time for some Q&A. Yeah, let's do that. Thank you.