Hi,
I’m building a C++ TOP (TOP_ExecuteMode::CUDA) that integrates NVIDIA
DLSS Ray Reconstruction via the NGX CUDA API. The plugin works
correctly (NGX evaluate returns Success, output is visually
DLSS-RR-denoised). But I’m seeing some very strange behavior in the
reported cook times that I think might generalize beyond my case.
Observed behavior
The CPU and GPU cook times reported for my C++ TOP vary wildly based
on what’s happening elsewhere in the network, even when nothing
about my plugin’s actual work changes from frame to frame.
Concrete repro:
- Steady state: CPU cook ~0.6 ms, GPU cook ~1.7 ms
- Open a few node viewers in unrelated branches: GPU cook spikes to
3–4 ms
- A CHOP (CPU-only work) elsewhere in the network cooks once: GPU
cook of my C++ TOP spikes to 8–10 ms
- Toggle through several operator selections in the editor while
viewers are visible: GPU cook spikes 6–10 ms in regular bursts
- Heavy GPU work elsewhere (e.g. a complex visualizer like audio analysis
cooking): CPU cook of my C++ TOP grows to 25+ ms while the kernel
itself didn’t change
The reported “GPU cook time” is consistently inversely correlated
with the CPU cook time. When one is small, the other is huge, and
the sum looks roughly stable.
What I’ve verified
- Build is in Release (not Debug). Confirmed via the binary path.
- The behavior occurs even with DLSS-RR disabled in my plugin
(when I set a toggle that falls back to a pure CUDA passthrough
kernel using the same setupCudaSurface/cudaSurfaceObject pattern
as the official CudaTOP sample).
- Same overall pattern (large reported cook time variance unrelated
to actual GPU work) was already documented to me by Malcolm in an
earlier conversation: the official CudaTOP sample also shows
reported cook time being inflated when other GPU contention
occurs, due to how Vulkan↔CUDA interop semaphores are timed
relative to the cook measurement. Nsight Systems back then
showed the actual CUDA kernel running in microseconds while TD
reported milliseconds.
- The NVIDIA DLSS Programming Guide explicitly states:
“Any interoperability between APIs, for example, D3D11 to CUDA,
must be handled by the game or application outside of the NGX
SDK.” So NGX itself is unaware of TouchDesigner’s Vulkan↔CUDA
semaphores in beginCUDAOperations/endCUDAOperations.
My questions
- Is this expected behavior? Specifically, do the reported CPU/GPU
cook times for a CUDA-mode C++ TOP include time waiting on the
external semaphores that TD itself manages around
begin/endCUDAOperations?
- If so, is there a recommended way to measure the actual cost of
a CUDA-mode C++ TOP plugin in isolation, without going through
Nsight Systems every time?
- Are there patterns to minimize stalling? My plugin does a fair
amount of work between beginCUDAOperations and endCUDAOperations
(calls into NGX CUDA which itself enqueues a large pipeline on
the same stream). Would moving NGX init or recurring allocations
outside of begin/end help, or is that not valid for a CUDA
context that needs to be active?
- Is there any documentation on what beginCUDAOperations actually
does internally (which semaphores it waits on, what context it
activates) so I can reason about which patterns are safe?
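To make the measurement question concrete, this is roughly the kind of in-plugin timing I have in mind: a minimal sketch that splits the CPU side of a cook into phases (the beginCUDAOperations call, my own work, the endCUDAOperations call) and tracks median and worst case per phase. PhaseTimer and the phase names are hypothetical and not part of the TouchDesigner or NGX APIs, and this only captures CPU wall-clock time; GPU execution time would still need something like CUDA events on the plugin's stream, which is not shown here.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (an assumption for illustration, not a TD/NGX API).
// Collects CPU wall-clock samples per named phase so that a spike can be
// attributed to one phase instead of to the cook as a whole.
class PhaseTimer {
public:
    using Clock = std::chrono::steady_clock;

    // Runs fn and records how long it took, in milliseconds, under `phase`.
    template <typename Fn>
    void measure(const std::string& phase, Fn&& fn) {
        const Clock::time_point start = Clock::now();
        std::forward<Fn>(fn)();
        const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
        add(phase, elapsed.count());
    }

    // Records an externally measured sample (in milliseconds).
    void add(const std::string& phase, double ms) { samples_[phase].push_back(ms); }

    std::size_t count(const std::string& phase) const {
        const auto it = samples_.find(phase);
        return it == samples_.end() ? 0 : it->second.size();
    }

    // Median of the recorded samples (upper middle for an even count),
    // or 0.0 if the phase has no samples.
    double median(const std::string& phase) const {
        const auto it = samples_.find(phase);
        if (it == samples_.end() || it->second.empty()) return 0.0;
        std::vector<double> sorted = it->second;  // copy so samples keep their order
        const std::size_t mid = sorted.size() / 2;
        std::nth_element(sorted.begin(),
                         sorted.begin() + static_cast<std::ptrdiff_t>(mid),
                         sorted.end());
        return sorted[mid];
    }

    // Largest recorded sample, or 0.0 if the phase has no samples.
    double worst(const std::string& phase) const {
        const auto it = samples_.find(phase);
        if (it == samples_.end() || it->second.empty()) return 0.0;
        return *std::max_element(it->second.begin(), it->second.end());
    }

private:
    std::map<std::string, std::vector<double>> samples_;
};

// Intended use inside execute() (sketch only, actual plugin calls elided):
//   timer.measure("begin", [&] { /* beginCUDAOperations call */ });
//   timer.measure("work",  [&] { /* kernel launches / NGX evaluate */ });
//   timer.measure("end",   [&] { /* endCUDAOperations call */ });
```

If the "begin" phase is where the spikes land while "work" stays flat, that would suggest the reported CPU cook time includes the wait inside beginCUDAOperations rather than anything my plugin does.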
Thanks for any insight!
TD version: 2025.32820, Windows, RTX 5090, CUDA 12.8.