[2022.29850 Win10] RayTK textureField crash

I’m getting a consistent GPU-related crash in one of the RayTK test cases:

“The Vulkan Device has returned a fatal error code…”

To reproduce the issue, download and expand this package:

https://drive.google.com/file/d/1bdmgl2382PBtVVB-ifJ630KR95bFI_Jo/view?usp=sharing

Then:

  1. Open raytk-test.toe
  2. In the panel on the top left, in the dropdown menu, choose “Build 0.29 (exp)”
  3. Click “Load”, which will take some time.
  4. Once the UI starts responding again, in the filter bar in the panel on the middle left, type in “compositeFields” and press enter.
  5. Click the “Run Queued Tests” button on the bottom right.

After a delay, TD will crash with a message about increasing TdrDelay.

I’ve attempted this with TdrDelay set to 8 and 20.

Hey @tekt

Thanks for the report and sample.

Pretty massive impact here, as my screens went dark, ah!

I’ve logged this issue for a developer to look into it.

Best,
Michel

Thanks for the report. The problematic node seems to be
/tester/test_host/compositeFields_context_test/raymarchRender3D/render_glsl
If I force the Maxsteps constant to 1, then I no longer get a GPU hang, so somehow this operation is taking too long. There is another bug report on the forum about poor GLSL performance in some cases with Nvidia.

How would I set up a simple, functional test case with just that one test, so I can play with the shader/constant values and try to narrow this down more?

Here’s a condensed version of that part of that test:
raytk-compositeFields_context_test_crash.toe (244.1 KB)

The Extendmode parameter on textureField2 uses a specialization constant.
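
For reference, in the generated Vulkan GLSL that ends up as roughly this kind of thing (just a sketch; the constant_id, names, and modes here are placeholders, not the actual generated code):

  // Sketch only: placeholder constant_id, names, and modes.
  layout(constant_id = 3) const int THIS_EXTEND_MODE = 0;

  vec2 applyExtendMode(vec2 uv) {
      // The value is baked in at pipeline creation, so the compiler can
      // eliminate whichever branch doesn't apply.
      return (THIS_EXTEND_MODE == 1) ? fract(uv) : clamp(uv, 0.0, 1.0);
  }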

If I add a Limit CHOP to the CHOP you are using for your values, and ensure nothing is 0, then the hang disappears. I feel like you have a divide by zero somewhere that is causing an issue. Is that possible?

It’s possible, but would that cause TD to crash? I thought divide by zero in shaders just produces an undefined (arbitrary) output.

Here’s a variant that switches out the noiseField and floatToVector for a constantField, which generates a lot less code. It does not crash.
raytk-compositeFields_context_test_crash.toe (261.1 KB)

So the issue may just be the amount of code that’s generated.

It shouldn’t cause a crash itself, but if you are doing work on the undefined/infinite value that comes out of that, you may end up in an endless loop or maybe an out of bounds lookup, both of which could cause that crash.
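
For example, something along these lines (purely an illustrative sketch, not the generated RayTK code; all the names here are made up):

  // Illustrative sketch: how an undefined value from a divide-by-zero can
  // become a hang or a bad lookup.
  const float MAX_DIST = 100.0;

  vec4 marchWithBadStep(float fieldValue, float weight, sampler2D lookupTex) {
      float d = fieldValue / weight;       // weight == 0.0 -> d is inf or NaN
      float t = 0.0;
      while (t < MAX_DIST) {
          t += d;                          // NaN comparisons are undefined in GLSL,
      }                                    // so this can spin until the TDR fires
      int idx = int(d * 8.0);              // converting inf/NaN to int is undefined
      return texelFetch(lookupTex, ivec2(idx, 0), 0);   // potential out-of-bounds read
  }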

Further info: if the loop in castRay() runs only 1 iteration then it doesn’t crash; if it runs 2, it does. Diving deeper, it seems like the map() call within the second loop iteration is where it fails.

More specifically, whatever ‘res’ gets set to after the second castRay() call is ultimately what ends up causing the crash. If I set ‘res’ back to the value it got from the first iteration after calling map() on the second iteration, then the crash is avoided. Although the compiler may be removing the second call if it’s unrolling and then optimizing…
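
To clarify what I mean by the loop, it’s shaped roughly like this (a heavily simplified paraphrase with stand-in names, not the exact generated shader):

  const int MAX_STEPS = 2;                    // really the Maxsteps specialization constant
  struct Res { float dist; };                 // stand-in for the real result struct
  Res map(vec3 p);                            // stand-in for the generated scene function

  Res castRay(vec3 ro, vec3 rd) {
      Res res;
      float t = 0.0;
      for (int i = 0; i < MAX_STEPS; i++) {   // no hang with 1 iteration, hang with 2
          res = map(ro + rd * t);             // the map() call in the 2nd iteration
                                              // is where it appears to fail
          if (res.dist < 0.001) break;
          t += res.dist;
      }
      return res;                             // whatever ends up in 'res' after the
                                              // 2nd iteration ultimately triggers the crash
  }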

I’ve found another case that consistently reproduces this, in what should be a fairly simple scene.
The issue occurs when connecting a textureField, which adds a texture input to the generated shader.

It doesn’t seem to be a matter of the complexity of the scene. I’ve been able to create scenes with a lot more operators and a lot more code that don’t hit this crash.

If the Enable parameter on the assignColor is switched off, it won’t crash when the textureField is connected, but it will crash once that parameter is switched on afterwards. That parameter is passed via a uniform and basically does a runtime if/else that skips over the call to the textureField function.
So it doesn’t seem to be an issue of simply including the code that the textureField produces. It seems to be when it actually tries to execute that function.
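
Roughly, the generated code behaves like this (paraphrased sketch; the uniform name and the function signature are my shorthand, not the real generated code):

  // Paraphrased sketch of the behavior described above.
  uniform float uAssignColorEnable;              // the assignColor Enable parameter

  vec4 RTK_project1_textureField1(vec2 uv);      // generated textureField function
                                                 // (signature assumed here)

  vec4 assignColor(vec3 p, vec4 baseColor) {
      if (uAssignColorEnable > 0.5) {
          // Only taken when Enable is on; merely compiling the textureField code
          // doesn't crash, but executing this call does.
          return RTK_project1_textureField1(p.xy);
      }
      return baseColor;
  }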

The textureField operator adds <10 lines of code (when taking into account macros and specialization constant code elimination).

textureField-crash.toe (160.0 KB)

FYI I’ve spent the last 2 days working on this, more specifically on getting much better shader-hang debugging working with Nvidia Aftermath. At the very least it should show us which GLSL line the hang is occurring on, which can help us know whether it’s an Nvidia bug or a shader bug. More info coming, thanks for the example!

Thanks for putting so much time into this!
It will be really useful to be able to get more insight into what’s going on in there.

The good news is that through these efforts, interactive shader debugging (line-by-line stepping) is now working in RenderDoc with TD in the next experimental… still no solution to the hang yet, though. I need to test this on AMD and Intel next to see whether it’s more likely a driver bug.

That’s going to be incredibly useful!

I might actually be able to fix refractive materials!

How would I go about using RenderDoc to debug a crash?
Is that something that you’d need to do using internal tools, or is it exposed for external users?
I’m able to capture render frames by launching TD through RenderDoc, but only for the main UI window. And I haven’t figured out a way to capture a frame that caused a crash.

Ya I haven’t found a way to capture a crash frame either. Basically it’ll never finish the capture due to the crash even if you manage to start the capture on the correct frame.

I brought in an AMD card from the office to test this on; if it doesn’t crash there, then I’ll submit this to Nvidia as a driver bug.

Doesn’t happen on AMD. I’ve submitted this to Nvidia.

Awesome. Thanks!

The update from Nvidia is that after a few iterations of your shader, some of the threads diverge and take different paths, so some call texture() while others do not.
“Since RTK_project1_textureField1() is called from divergent control flow (since some threads in castRay()'s loop may break early), its texture() call uses undefined derivatives and produces an undefined value.”

When derivatives are needed, such as for mipmapping, all threads in the warp must call texture(), so the rate of change of the uvs used for the texture can be calculated.
So currently you can avoid this crash by forcing the non-derivative version of texture(), such as textureLod(sample, uv, 0), whose 3rd argument forces the mip LOD to be 0.

Of course, you aren’t even using mipmapping in this shader, so derivatives shouldn’t be needed. So this does seem like a driver bug, but for now this is how you avoid it.
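
In other words, the workaround is to change the sampling call along these lines (a sketch with placeholder names):

  // Workaround sketch: swap the implicit-derivative texture() call for
  // textureLod() with an explicit LOD of 0.
  uniform sampler2D sTexture;

  vec4 sampleField(vec2 uv) {
      // texture() needs implicit derivatives for mip selection, which are
      // undefined once threads in the warp have diverged (e.g. some broke
      // out of castRay()'s loop early):
      // return texture(sTexture, uv);

      // textureLod() takes an explicit LOD, so no derivatives are needed;
      // 0.0 always samples the base mip level.
      return textureLod(sTexture, uv, 0.0);
  }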

Should I just replace most calls to texture() with textureLod()? I think there are few (or no) places where I’m (intentionally) using mipmapping.