[2022.29850 Win10] RayTK textureField crash

I’m getting a consistent GPU-related crash in one of the RayTK test cases:

“The Vulkan Device has returned a fatal error code…”

To reproduce the issue, download and expand this package:

https://drive.google.com/file/d/1bdmgl2382PBtVVB-ifJ630KR95bFI_Jo/view?usp=sharing

Then:

  1. Open raytk-test.toe
  2. In the panel on the top left, in the dropdown menu, choose “Build 0.29 (exp)”
  3. Click “Load”, which will take some time.
  4. Once the UI starts responding again, in the filter bar in the panel on the middle left, type in “compositeFields” and press enter.
  5. Click the “Run Queued Tests” button on the bottom right.

After a delay, TD will crash with a message about increasing TdrDelay.

I’ve attempted this with TdrDelay set to 8 and 20.

Hey @tekt

Thanks for the report and sample.

Pretty massive impact here, as my screens went dark, ah!

I’ve logged this issue for a developer to look into it.

Best,
Michel

Thanks for the report. The problematic node seems to be
/tester/test_host/compositeFields_context_test/raymarchRender3D/render_glsl
If I force the Maxsteps constant to 1, then I no longer get a GPU hang, so somehow this operation is taking too long. There is another bug report on the forum about poor GLSL performance in some cases with Nvidia.

How would I set up a simple, functional test case with just that one test, so I can play with the shader/constant values and try to narrow this down more?

Here’s a condensed version of that part of that test:
raytk-compositeFields_context_test_crash.toe (244.1 KB)

The Extendmode parameter on textureField2 uses a specialization constant.
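
For reference, in the generated Vulkan GLSL that ends up as roughly this kind of thing (just a sketch; the constant_id, names, and modes here are placeholders, not the actual generated code):

  // Sketch only: placeholder constant_id, names, and modes.
  layout(constant_id = 3) const int THIS_EXTEND_MODE = 0;

  vec2 applyExtendMode(vec2 uv) {
      // The value is baked in at pipeline creation, so the compiler can
      // eliminate whichever branch doesn't apply.
      return (THIS_EXTEND_MODE == 1) ? fract(uv) : clamp(uv, 0.0, 1.0);
  }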

If I add a Limit CHOP to the CHOP you are using for your values, and ensure nothing is 0, then the hang disappears. I feel like you have a divide by zero somewhere that is causing an issue. Is that possible?

It’s possible, but would that cause TD to crash? I thought divide by zero in shaders just produces an undefined (arbitrary) output.

Here’s a variant that switches out the noiseField and floatToVector for a constantField, which generates a lot less code. It does not crash.
raytk-compositeFields_context_test_crash.toe (261.1 KB)

So the issue may just be the amount of code that’s generated.

It shouldn’t cause a crash itself, but if you are doing work on the undefined/infinite value that comes out of that, you may end up in an endless loop or maybe an out of bounds lookup, both of which could cause that crash.
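
For example, something along these lines (purely an illustrative sketch, not the generated RayTK code; all the names here are made up):

  // Illustrative sketch: how an undefined value from a divide-by-zero can
  // become a hang or a bad lookup.
  const float MAX_DIST = 100.0;

  vec4 marchWithBadStep(float fieldValue, float weight, sampler2D lookupTex) {
      float d = fieldValue / weight;       // weight == 0.0 -> d is inf or NaN
      float t = 0.0;
      while (t < MAX_DIST) {
          t += d;                          // NaN comparisons are undefined in GLSL,
      }                                    // so this can spin until the TDR fires
      int idx = int(d * 8.0);              // converting inf/NaN to int is undefined
      return texelFetch(lookupTex, ivec2(idx, 0), 0);   // potential out-of-bounds read
  }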

Further info: if the loop in castRay() runs only 1 iteration then it doesn’t crash; if it runs 2, it does. Diving deeper, it seems like the map() call within the second loop iteration is where it fails.

More specifically, whatever ‘res’ gets set to after the second castRay() call is ultimately what ends up causing the crash. If I set ‘res’ back to the value it got from the first iteration after calling map() on the second iteration, then the crash is avoided. Although the compiler may be removing the second call if it’s unrolling and then optimizing…
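
To clarify what I mean by the loop, it’s shaped roughly like this (a heavily simplified paraphrase with stand-in names, not the exact generated shader):

  const int MAX_STEPS = 2;                    // really the Maxsteps specialization constant
  struct Res { float dist; };                 // stand-in for the real result struct
  Res map(vec3 p);                            // stand-in for the generated scene function

  Res castRay(vec3 ro, vec3 rd) {
      Res res;
      float t = 0.0;
      for (int i = 0; i < MAX_STEPS; i++) {   // no hang with 1 iteration, hang with 2
          res = map(ro + rd * t);             // the map() call in the 2nd iteration
                                              // is where it appears to fail
          if (res.dist < 0.001) break;
          t += res.dist;
      }
      return res;                             // whatever ends up in 'res' after the
                                              // 2nd iteration ultimately triggers the crash
  }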

I’ve found another case that consistently reproduces this, in what should be a fairly simple scene.
The issue occurs when connecting a textureField, which adds a texture input to the generated shader.

It doesn’t seem to be a matter of the complexity of the scene. I’ve been able to create scenes with a lot more operators and a lot more code that don’t hit this crash.

If the Enable parameter on the assignColor is switched off, it won’t crash when the textureField is connected, but it will crash once that parameter is switched on afterwards. That parameter is passed via a uniform and basically does a runtime if/else that skips over the call to the textureField function.
So it doesn’t seem to be an issue of simply including the code that the textureField produces. It seems to be when it actually tries to execute that function.
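
Roughly, the generated code behaves like this (paraphrased sketch; the uniform name and the function signature are my shorthand, not the real generated code):

  // Paraphrased sketch of the behavior described above.
  uniform float uAssignColorEnable;              // the assignColor Enable parameter

  vec4 RTK_project1_textureField1(vec2 uv);      // generated textureField function
                                                 // (signature assumed here)

  vec4 assignColor(vec3 p, vec4 baseColor) {
      if (uAssignColorEnable > 0.5) {
          // Only taken when Enable is on; merely compiling the textureField code
          // doesn't crash, but executing this call does.
          return RTK_project1_textureField1(p.xy);
      }
      return baseColor;
  }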

The textureField operator adds <10 lines of code (when taking into account macros and specialization constant code elimination).

textureField-crash.toe (160.0 KB)

FYI I’ve spent the last 2 days working on this, more specifically on getting much better shader-hang debugging working with Nvidia Aftermath. At the very least it should show us which GLSL line the hang is occurring on, which can help us know whether it’s an Nvidia bug or a shader bug. More info coming, thanks for the example!

Thanks for putting so much time into this!
It will be really useful to be able to get more insight into what’s going on in there.

The good news is that through these efforts, interactive shader debugging (line-by-line stepping) is now working in RenderDoc with TD in the next experimental… still no solution to the hang yet, though. I need to test this on AMD and Intel next to see whether it’s more likely a driver bug.

That’s going to be incredibly useful!

I might actually be able to fix refractive materials!

How would I go about using RenderDoc to debug a crash?
Is that something that you’d need to do using internal tools, or is it exposed for external users?
I’m able to capture render frames by launching TD through RenderDoc, but only for the main UI window. And I haven’t figured out a way to capture a frame that caused a crash.

Ya I haven’t found a way to capture a crash frame either. Basically it’ll never finish the capture due to the crash even if you manage to start the capture on the correct frame.

I brought in an AMD card from the office to test this on; if it doesn’t crash there, then I’ll submit this to Nvidia as a driver bug.

Doesn’t happen on AMD. I’ve submitted this to Nvidia.

Awesome. Thanks!

The update from Nvidia is that after a few iterations of your shader, some of the threads diverge and take different paths, so some call texture() while others do not.
“Since RTK_project1_textureField1() is called from divergent control flow (since some threads in castRay()'s loop may break early), its texture() call uses undefined derivatives and produces an undefined value.”

When derivatives are needed, such as for mipmapping, all threads in the warp must call texture(), so the rate of change of the uvs used for the texture can be calculated.
So currently you can avoid this crash by forcing the non-derivative version of texture(), such as textureLod(sample, uv, 0), whose 3rd argument forces the mip LOD to be 0.

Of course, you aren’t even using mipmapping in this shader, so derivatives shouldn’t be needed. So this does seem like a driver bug, but for now this is how you avoid it.
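
In other words, the workaround is to change the sampling call along these lines (a sketch with placeholder names):

  // Workaround sketch: swap the implicit-derivative texture() call for
  // textureLod() with an explicit LOD of 0.
  uniform sampler2D sTexture;

  vec4 sampleField(vec2 uv) {
      // texture() needs implicit derivatives for mip selection, which are
      // undefined once threads in the warp have diverged (e.g. some broke
      // out of castRay()'s loop early):
      // return texture(sTexture, uv);

      // textureLod() takes an explicit LOD, so no derivatives are needed;
      // 0.0 always samples the base mip level.
      return textureLod(sTexture, uv, 0.0);
  }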

Should I just replace most calls to texture() with textureLod()? I think there are few (or no) places where I’m (intentionally) using mipmapping.