CUDA TOP cook time varies wildly based on unrelated network activity

Hi,

I’m building a C++ TOP (TOP_ExecuteMode::CUDA) that integrates NVIDIA
DLSS Ray Reconstruction via the NGX CUDA API. The plugin works
correctly (NGX evaluate returns Success, output is visually
DLSS-RR-denoised). But I’m seeing a very strange behavior with the
reported cook times that I think might generalize beyond my case.

Observed behavior

The CPU and GPU cook times reported for my C++ TOP vary wildly based
on what’s happening elsewhere in the network, even when nothing
about my plugin’s actual work changes from frame to frame.

Concrete repro:

  • Steady state: CPU cook ~0.6 ms, GPU cook ~1.7 ms
  • Open a few node viewers in unrelated branches: GPU cook spikes to
    3–4 ms
  • A CHOP (CPU-only work) elsewhere in the network cooks once: GPU
    cook of my C++ TOP spikes to 8–10 ms
  • Toggle through several operator selections in the editor while
    viewers are visible: GPU cook spikes 6–10 ms in regular bursts
  • Heavy GPU work elsewhere (e.g. a complex audio-analysis visualizer
    cooking): CPU cook of my C++ TOP grows to 25+ ms even though the
    kernel itself didn’t change

The reported “GPU cook time” is consistently inversely correlated
with the CPU cook time. When one is small, the other is huge, and
the sum looks roughly stable.

What I’ve verified

  • Build is in Release (not Debug). Confirmed via the binary path.
  • The behavior occurs even with DLSS-RR disabled in my plugin
    (when I set a toggle that falls back to a pure CUDA passthrough
    kernel using the same setupCudaSurface/cudaSurfaceObject pattern
    as the official CudaTOP sample).
  • Malcolm already described the same overall pattern (large reported
    cook time variance unrelated to actual GPU work) to me in an
    earlier conversation: the official CudaTOP sample also shows
    inflated reported cook times when there is other GPU contention,
    due to how Vulkan↔CUDA interop semaphores are timed relative to
    the cook measurement. Nsight Systems showed the actual CUDA
    kernel running in microseconds while TD reported milliseconds.
  • The NVIDIA DLSS Programming Guide explicitly states:
    “Any interoperability between APIs, for example, D3D11 to CUDA,
    must be handled by the game or application outside of the NGX
    SDK.” So NGX itself is unaware of TouchDesigner’s Vulkan↔CUDA
    semaphores in beginCUDAOperations/endCUDAOperations.

My questions

  1. Is this expected behavior? Specifically, do the reported CPU/GPU
    cook times for a CUDA-mode C++ TOP include time waiting on the
    external semaphores that TD itself manages around
    begin/endCUDAOperations?

  2. If so, is there a recommended way to measure the actual cost of
    a CUDA-mode C++ TOP plugin in isolation, without going through
    Nsight Systems every time?

  3. Are there patterns to minimize stalling? My plugin does a fair
    amount of work between beginCUDAOperations and endCUDAOperations
    (calls into NGX CUDA which itself enqueues a large pipeline on
    the same stream). Would moving NGX init or recurring allocations
    outside of begin/end help, or is that not valid for a CUDA
    context that needs to be active?

  4. Is there any documentation on what beginCUDAOperations actually
    does internally (which semaphores it waits on, what context it
    activates) so I can reason about which patterns are safe?

Thanks for any insight!

TD version: 2025.32820, Windows, RTX 5090, CUDA 12.8.

Quick update: I tested one more thing that I think is important.

I added a toggle to my plugin that disables the NGX evaluate call
and falls back to a pure CUDA passthrough kernel (same pattern as
the official CudaTOP sample). The C++ TOP still does its
beginCUDAOperations / setupCudaSurface / kernel launch /
endCUDAOperations cycle, but with no NGX work in between.

With NGX disabled, the variance pattern I described above is
still present, just less amplified. Cook times still spike when
other CHOPs cook, when viewers are visible, or when I click
through operators in the editor.

This confirms the issue is in the CUDA-mode C++ TOP / Vulkan
interop layer itself, not specifically in NGX. NGX just makes it
more visible because it does more GPU work and amplifies any
upstream stall.

So my questions in the original post still apply, but please
read them in that context: this isn’t an NGX integration issue,
it’s a more general CUDA TOP measurement concern that NGX
happens to expose.

Thanks again!

Up! :)

SARV

You will likely need to put your own CUDA timing commands in your stream to get fine-grained measurements. Yes, there are a lot of things at play here, including vsync. If your project is running at 60 Hz but takes less than 16 ms to do its work, the extra time shows up ‘somewhere’ in the network, often at GPU sync points. This is because a frame can’t complete in less than 16 ms: the GPU has to stay in step with vsync.

beginCUDAOperations() submits our current Vulkan command buffer and issues cudaWaitExternalSemaphoresAsync() calls on your cudaStream, so the stream waits for any pending Vulkan work to complete.

In endCUDAOperations(), we issue cudaSignalExternalSemaphoresAsync(), so our Vulkan commands will wait on your CUDA work.

These don’t inherently cause a stall on the CPU though, as they are GPU→GPU semaphores the CPU doesn’t need to wait for.