Can creating manual mip-maps be more efficient?

I’ve been circling around this topic for a while looking for solutions that don’t eat up a serious amount of gpu memory, cpu cook or gpu cook.

My issue isn’t reading from mipmaps, its generating them manually in the first place. I know I can set a TOP’s sampling mode to mipmap. This works fine, but doesn’t give me control over how I down-sample, blur, etc.

I’m pretty green still with compute shaders and c++. I’ve poked around a bit with those, but not had much luck. Compute shaders seemed promising for generating atlas mip maps, but I kept running into the issue of work groups trying to load data that had not been written yet, etc.

I’m curious if I’m chasing performance gains that can’t be had… or if it may be possible to get faster results.


Goal

Real-time mipmapping of a render pass for generating Bloom and using it in other post process FX.


Attempts

GLSL 1-pass Atlas MipMap

( 0.07ms cpu / 0.05ms gpu / 15 mb gpu )

I lifted this technique from an excellent shader toy here.

This seemed like a magic bullet at first due to it’s speed. However. it’s super fast because it’s technically doing the mip mapping wrong. It’s sampling from the input texture and generating all of the mips from input mip0. If you use textureLOD() it can achieve the desired result… However then it negates the purpose of doing it in an atlas anways.

I noticed when using this for bloom, really sharp bright highlights failed pretty badly, and the upsampling mechanism (which I’m not including here but you can see in the shadertoy) produced some very noticeable artifacts:
image

The problems with the approach stem from the fact that it’s not doing the blurs between iterative mip passes… it tries to blur the entire atlas after the fact.


Node Multi-pass MipMap

( 0.82ms cpu / 0.38ms gpu / 27.65 mb gpu )


This was the easiest to setup, it’s also the least efficient, primarily as far as cpu goes.
The nice thing about this approach is I can use nodes or glsl tops to handle each down sampling pass, so it’s really easy to get in there and do custom shader work between any two mips. It’s also a really convenient format for injecting my own totally different tops from disk, or wherever into any stage in the mip chain. Useful if I want to bring in mips from another software for example (static images)

However this is a bit harder to manage when it comes to varying resolutions and aspect ratios. need to setup complicated switching networks or connect/disconnect/replicate passes to fit the right number of mip maps.

It also scales the worst, in just about any way due to more nodes/memory.


GLSL TDPass Multi-pass MipMap

( 0.40ms cpu / 0.32ms gpu / 6.2 mb gpu )

At present, I believe this is the winner for me… but I still wonder if it could be faster.
Here I’m doing the same basic thing in the previous attempt, but instead of doing things through different tops, I’m leveraging the glsl TOP’s TD Passes functionality to iterate on itself, taking the input of the previous pass, and mipping that.

The bottleneck across most of these solutions but especially this one feels like a bandwidth issue. The cost I think comes with the fact that each pass, I first have to copy over the entire buffer 1:1, then process the mips for the current pass, and add that to the results.

This means that if I’m mipmapping a 1080p texture, that’s 11 levels, and 11 times that I sample a full HD res texture. I don’t know how to get around this fact with out something that does less sampling. If you then add even a modest gauss blur between mips, and consider each mip level’s sampling that goes on, it’s a lot more than 11x. maybe 13, or 15x. maybe more depending.

I’m doing some branching in shader to improve this quite a bit… but it’s not clear how close or far that puts this method to whatever the inherent hardware accelerated mipmap generation is.


Any suggestion or ideas for compute shaders? or ++ top?
Thanks!

1 Like

Here’s a file with the above 3 methods for comparison

MipMap_Comparison.23.toe (13.4 KB)

Nvidia released a nice library for generating mipmaps using compute shaders:

The workflow within TD could be improved if you want to this manually though. In the GLSL TOP when in Compute Shader mode I could prealloc the mipmaps and assume you are going to fill them using a compute shader.

Woww, that library looks fast as all hell from the reads of it. Is that something that at present, would have to be implemented via c++ top? Or is that not even possible yet because it’s a vulcan thing?

As far as the TD centered idea of pre-allocating mipmaps, that sounds super ideal. I guess the question is still,how does one correctly write to those in succession using a compute shader with out running into issues with conflicting workgroups, data being accessed before it’s ready, etc.

Is it a matter of using the 3rd dimension of dispatch groups to access/write to lower level mips? I tried all the barrier() functions (probably incorrectly) to no avail.

I assume you could port their shader to CUDA at the very least right now.

I haven’t dug too much into their shader yet, but I think their algo would likely answer your second question.

Awesome, that makes total sense. I’ll dig into their repo a bit more and see what I can learn from it, if nothing else a great place to understand a bit more about thread control and compute shaders. Thanks Malcom!

You’re probably on nvidia hardware, but if you’re looking to get deeper understanding there’s also an amd one AMD FidelityFX – Single Pass Downsampler - AMD GPUOpen, and they also did a case study about it Optimizing for the Radeon™ RDNA Architecture - YouTube (40 minutes in)

I looked a bit at both, the nvidia one gets a bit hairy with all the more difficult cases (non power of 2 res…), the amd doesn’t deal with those :smiley:

ahh I will def check out that video, sounds like an enlightening dive. I came across the FidelityFX but haven’t peeled back the layers on that yet either.

Thanks!

Just wanted to revive this topic to inquire if there were any new features or workflows in the latest version of Touch/Vulcan that would make doing the mipmap generation manually possible via compute, or otherwise?

Nothing currently. I am internally using that Nvidia library for our mipmap generation, but it’s still a bit hidden from you

1 Like