My first post is here.
I have a performance issue with GLSL TOP on Vulkan builds w/ NVIDIA GPU.
More info below.
Many thanks for your help!
GLSL TOP performance drop w/ NVIDIA GPU on Vulkan builds, see attached toe.
Desktop:
- CPU: AMD Ryzen Threadripper 3970X
- GPU: NVIDIA GeForce RTX 3090 w/ 522.30 WHQL, then 526.86 WHQL
- RAM: 256 GB

Laptop:
- CPU: AMD Ryzen 7 5800H
- APU: 5800H w/ Radeon Graphics
- RAM: 16 GB
Tested w/ 2021.16410
Laptop - OpenGL + AMD APU: performance is GOOD (around 7 ms/frame)
Desktop - OpenGL + NVIDIA GPU: performance is GOOD (around 0.9 ms/frame)
Tested w/ 2022.26590 and 2022.29850
Laptop - Vulkan + AMD APU: performance is GOOD (around 5 ms/frame)
Desktop - Vulkan + NVIDIA GPU: performance is BAD (around 260 ms/frame)
glsl_TOP_bug.toe (26.4 KB)
Just tested here on an NVIDIA GPU (4090) and it also runs terribly slow :D. I did some investigation and I think it’s because of the number of ‘switch’ branches in the shader. On some GPUs (as far as I understand it) branches are bad for optimization: the GPU runs all branches first and afterwards discards the ones that are not valid/accessed. This makes the shader run all the different noise algorithms at once. For example, if you return _fnlGenNoiseSingle3D() immediately before the switch, with for example a call to the OpenSimplex algorithm, it seems to run super fast again.
I think it would be better to separate those algorithms into different shaders and use a Switch TOP to select the one you need.
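The early-return trick described above looks roughly like this. A hypothetical sketch only: the function and constant names are modeled on the FastNoiseLite-style shader in the .toe, and the exact signatures may differ.

```
// Hypothetical sketch of the debugging workaround described above.
// Names (_fnlGenNoiseSingle3D, NOISE_OPENSIMPLEX2) may differ in the actual shader.
float fnlGetNoise3D(int noiseType, float x, float y, float z)
{
    // Workaround: bypass the big switch entirely and hard-code one algorithm.
    // On the affected NVIDIA Vulkan drivers this restores full speed.
    return _fnlGenNoiseSingle3D(NOISE_OPENSIMPLEX2, x, y, z);

    // Original dispatch, now dead code:
    // switch (noiseType) {
    //     case NOISE_OPENSIMPLEX2: return _fnlGenNoiseSingle3D(noiseType, x, y, z);
    //     case NOISE_PERLIN:       return _fnlGenNoiseSingle3D(noiseType, x, y, z);
    //     ...
    // }
}
```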
Also, a small side note: not sure if you are aware of the built-in GLSL TDSimplexNoise() function, which gives you some nice simplex noise relatively fast.
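For reference, a minimal GLSL TOP pixel shader using it might look like this (uTime here is a uniform you would declare and drive yourself, not a built-in):

```
// TDSimplexNoise() is TouchDesigner's built-in simplex noise; it accepts
// vec2/vec3/vec4 and returns a float in roughly [-1, 1].
uniform float uTime;    // user-supplied uniform, e.g. driven by absTime.seconds

out vec4 fragColor;

void main()
{
    vec2 uv = vUV.st;   // vUV is provided to GLSL TOP pixel shaders
    float n = TDSimplexNoise(vec3(uv * 8.0, uTime));
    // Remap to [0, 1] for display and swizzle for the output format.
    fragColor = TDOutputSwizzle(vec4(vec3(n * 0.5 + 0.5), 1.0));
}
```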
Hope this helps.
Thanks for your feedback and tips, I’ll check that!
I didn’t expect this to still be an issue in 2022.
Does anyone have a list of which GPUs “properly” support branches, i.e. without running all of them?
To be honest, I thought the same :D. I just realized it by debugging miu-lab’s code (when not doing the switch, it suddenly runs super fast). It might also be a Vulkan thing, since the OpenGL version seems to run faster. So perhaps Malcolm can enlighten us.
I don’t think any modern GPUs will run all branches unconditionally. Within a group of 32 or 64 pixels (AMD is 64), all of them need to take the same branch for full speed through that part of the code. Otherwise each divergent branch also needs to be taken, in serial (no more parallel work across the group), and parallel execution resumes once all the divergent branches have been resolved.
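A small illustrative sketch of the distinction, with hypothetical function names (cheapPath, noiseAt, etc. are placeholders, not real TD calls):

```
// Illustrative only. Pixels in a warp/wavefront (32 on NVIDIA, 64 on many
// AMD parts) execute in lockstep.

if (uMode > 0.5) {
    // Uniform branch: every pixel in the group agrees on the condition,
    // so only this side ever executes -- full speed.
    color = cheapPath();
} else {
    color = otherPath();
}

if (noiseAt(vUV.st) > 0.5) {
    // Divergent branch: pixels in the same group disagree, so the group
    // executes BOTH sides in serial, masking out inactive pixels each time.
    color = pathA();
} else {
    color = pathB();
}
```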
Yes, it’s likely a Vulkan bug in either the driver or TD, I’ll determine soon.
This does seem like an Nvidia bug, I’ve reported it to them.
One thing you should consider doing with Vulkan, though, is making use of specialization constants to assign some of these different types. It should improve performance in most cases, even after this bug is fixed.
Thanks for this report and the sample. Nvidia has fixed this bug and rolled it into upcoming drivers 530.89 and later.
I’ll mention that the specialization constant workflow is at least 2x as fast as using uniforms for these branches, so you should definitely also consider starting to use those for shader ‘modes’.
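Since specialization constants are baked in when the pipeline is created, the compiler can resolve the switch at compile time and strip the untaken branches entirely. A minimal Vulkan GLSL sketch (the constant_id value and per-algorithm function names are illustrative):

```
// Vulkan GLSL: NOISE_TYPE is fixed at pipeline-specialization time,
// so the driver compiles only the branch actually taken.
layout (constant_id = 0) const int NOISE_TYPE = 0;

float getNoise(vec3 p)
{
    switch (NOISE_TYPE) {                 // resolved at compile time
        case 0:  return openSimplex2(p);  // hypothetical per-algorithm functions
        case 1:  return perlin(p);
        default: return valueNoise(p);
    }
}
```

Compared to a uniform, the trade-off is that changing the mode requires a new pipeline variant rather than a cheap uniform update, which is why it suits rarely-changed shader ‘modes’.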
Sorry, one update. I don’t think the full speedup will be until the 535.xx series. The 530 (and even some newer 528.xx versions) will be far faster than they are now, but still quite slow compared to the AMD or Nvidia-OpenGL versions (7ms or so on my machine).
Thank you, Malcolm, for your feedback. The workarounds proposed here do the job in the meantime, and it’s always good to know that you’re on top of it! Also, I opened a new topic yesterday about another issue I encountered in compute shaders with NVIDIA’s hardware derivative functions; they don’t seem to be working on my end. Thank you for your help!