Compute Shader atomic operations

Background info:
https://www.khronos.org/opengl/wiki/Atomic_Counter#Operations

I think that atomicCounterIncrement and atomicCounterAdd work correctly, but I’m not sure atomicCounter works correctly.

atomicCounter(pCount) is my attempt to get the current state of the atomic uint pCount, but it seems to always be zero even though atomicCounterIncrement increments it later in the code.
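
As a rough illustration of the difference between the two calls, here is a minimal standalone GLSL sketch (it is not taken from the attached tox). The image name outImage, the binding numbers and the workgroup size are assumptions, and a host application such as TouchDesigner may supply the version line and the output image itself, so the declarations would need adapting.

```glsl
#version 430
layout(local_size_x = 8, local_size_y = 8) in;

// Assumed binding: one atomic counter, reset to 0 before the dispatch.
layout(binding = 0, offset = 0) uniform atomic_uint pCount;
// Hypothetical debug output image.
layout(binding = 0, rgba32f) uniform writeonly image2D outImage;

void main() {
    // Atomically adds 1 and returns the value from *before* the increment,
    // so every invocation receives a unique number.
    uint ticket = atomicCounterIncrement(pCount);

    // Returns whatever the counter holds at the instant of the read.
    // Invocations that have not run yet have not incremented it, so this
    // is a snapshot that can be anywhere up to the final total.
    uint snapshot = atomicCounter(pCount);

    // Write both values out so they can be inspected per pixel.
    ivec2 coord = ivec2(gl_GlobalInvocationID.xy);
    imageStore(outImage, coord, vec4(float(ticket), float(snapshot), 0.0, 1.0));
}
```

Note that atomicCounterDecrement is not symmetric with the increment: it returns the value after the decrement, not before.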

I’m also not sure atomicCounterDecrement and atomicCounterSubtract work correctly.

compute_shader_atomic_questions.tox (1.5 KB)

Any advice is appreciated.

Hey david,

I’m definitely no expert in those atomic operations, so don’t take my advice too seriously :slight_smile:
In your first example you initially set the atomic counter to 1000, then add one using the increment and use the result as a texture coordinate. That pixel is way outside the output buffer, which is why the counter appears to be 0 (if you just write the color value to pixel 0,0 you’ll see it’s 1000).

The other thing I think is happening is that those atomic operations are not completely ‘synced’. For example, if you call atomicCounterIncrement() twice on the same counter, you get unexpected, random-looking behaviour (add, for example, absTime.seconds as a uniform so it cooks every frame, and the pixels will jump all over the place).
Maybe (not sure) there is some kind of barrier function to wait for all dispatches to sync.

I used it once to make a ‘sumTOP’, but in the end I still sometimes got random output values. So I guess I don’t fully understand it either… so I’m following this post :slight_smile:

cheers,
tim

Here is that sumTOP using atomicCounterAdd(). It seems to work, so the random values might just be an old nightmare of mine :smiley:

SumCompute.tox (1.1 KB)

Thanks for checking it now and for catching that mistake. I meant for it to be initialized at 0 for all the examples. Even when it’s initialized at zero in the new tox I uploaded, the result for atomicCounter is different than for atomicCounterIncrement.

I think you’d be interested in the goal I’m trying to achieve. I have a buffer of particles (x, y, brightness).
For every particle in the buffer there are 3 situations:
The particle may be too “bright” -> “kill” the particle by NOT copying it to the new buffer (and not incrementing pCount)
The particle is nicely “bright” -> copy the particle over to the new buffer and increment pCount
The particle is too dim -> place two new particles in the new buffer with positions similar to this particle and increment pCount twice total.

So I’m trying to use pCount and I don’t necessarily want to increment it on every thread.
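
A minimal GLSL sketch of that kill / copy / split logic, assuming the particles are stored one per pixel as (x, y, brightness) in a float texture. The names particlesIn, particlesOut, uNumParticles, uTooBright, uTooDim and uSplitOffset are hypothetical, and the thresholds and the way the two child particles are offset are placeholders rather than the method actually used in the project.

```glsl
#version 430
layout(local_size_x = 8, local_size_y = 8) in;

// Assumed to be reset to 0 before each dispatch.
layout(binding = 0, offset = 0) uniform atomic_uint pCount;

layout(binding = 0) uniform sampler2D particlesIn;                   // hypothetical input
layout(binding = 0, rgba32f) uniform writeonly image2D particlesOut; // hypothetical output

uniform uint  uNumParticles; // hypothetical: number of live particles in particlesIn
uniform float uTooBright;    // hypothetical thresholds
uniform float uTooDim;
uniform vec2  uSplitOffset;  // hypothetical offset between the two children

// Reserve the next free output slot and write the particle there.
// If the output buffer is already full, the particle is dropped.
void appendParticle(vec4 p) {
    ivec2 outSize = imageSize(particlesOut);
    uint slot = atomicCounterIncrement(pCount); // value before the increment
    if (slot >= uint(outSize.x * outSize.y)) return;
    ivec2 outCoord = ivec2(int(slot) % outSize.x, int(slot) / outSize.x);
    imageStore(particlesOut, outCoord, p);
}

void main() {
    ivec2 inSize = textureSize(particlesIn, 0);
    ivec2 coord  = ivec2(gl_GlobalInvocationID.xy);
    if (coord.x >= inSize.x || coord.y >= inSize.y) return;

    // Skip pixels that do not hold a live particle.
    uint index = uint(coord.y * inSize.x + coord.x);
    if (index >= uNumParticles) return;

    vec4 p = texelFetch(particlesIn, coord, 0);
    float brightness = p.z;

    if (brightness > uTooBright) {
        return;                                           // too bright: kill, counter untouched
    } else if (brightness < uTooDim) {
        appendParticle(p - vec4(uSplitOffset, 0.0, 0.0)); // too dim: two children
        appendParticle(p + vec4(uSplitOffset, 0.0, 0.0));
    } else {
        appendParticle(p);                                // nicely bright: copy as is
    }
}
```

Slots are handed out in whatever order the invocations happen to run, so the order of particles in the output can change from frame to frame. The counter also keeps counting past the buffer capacity, so it records how many slots were requested, not how many were written.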

This other tox is like a simpler case: a random shuffle from one buffer into the other. But again, only the atomicCounterIncrement way works.

compute_shader_random_shuffle.tox (1.2 KB)

Update:
I was able to write the code a little differently and avoid solving this problem. I only call increment if I really need to. I don’t just automatically do it in the middle of the main function. If I want to “kill” the particle, increment hasn’t been called yet.

Hey david,

Was just playing with it some more and I realized I truly don’t understand what is happening :). See attached tox: if you update the compute shader every frame, the value of the atomic counter won’t be the same. It feels like we are missing something.
I saw @vinz99 using the imageAtomicAdd() function; I haven’t played around with it yet, but perhaps that gives some more understandable results :smiley:
(https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/imageAtomicAdd.xhtml)

When solving your problem, did you still use an atomic counter? I’m not sure I’m following what you are trying to do with this pCount. From your description it seems like pCount will be incremented either once or twice per particle every frame, so it will grow towards infinity over time.

compute_shader_random_shuffle_over_time.tox (1.7 KB)

Yes I’m using atomic counters in this section of my project. pCount gets reset to zero every frame. At the end of the frame, it’s ideally about the same as the number of pixels in the image. So some particles get deleted (not copied over), some get copied over as is, some get copied twice. Hopefully you end up with a number of particles equal to what the size of the buffer can handle. If pCount goes above that, which is possible, don’t write at all, just return in main.

What I implemented was “Weighted Linde-Buzo-Gray Stippling”: https://kops.uni-konstanz.de/bitstream/handle/123456789/41075/Deussen_2-gu29mv4u87jh2.pdf;jsessionid=CBA553E15000B0419FF1BD6A5FB9B1A9?sequence=1

One step is getting a voronoi diagram rendered at a big resolution like 1280x720 or 1920x1080. You can do this with either the cone method or jump-flood-algorithm. Every pixel needs to be an integer indicating which voronoi seed is closest. No anti-aliasing is allowed.

The next step is a shader with imageAtomicAdd. The number of invocations needs to be the size of the input image, so 1280x720=921600. The output pixel dimensions need to be the number of voronoi seeds*3. This is because imageAtomicAdd only works on single integers, not vec4, and we need 3 numbers calculated for each seed: the centroid x, the centroid y, and the overall density. So the compute shader visits every pixel in the input image, determines which voronoi cell index it belongs to, and does some stuff to determine what to store in the output pixels.
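
A minimal GLSL sketch of such an accumulation pass, under several assumptions that are not from this thread: the voronoi result is an integer texture holding the seed index, the density comes from a second texture, and floats are converted to integers with a fixed-point scale because imageAtomicAdd only operates on integers. The names voronoiIds, densityTex, sums and WEIGHT_SCALE are hypothetical.

```glsl
#version 430
layout(local_size_x = 8, local_size_y = 8) in;

layout(binding = 0) uniform usampler2D voronoiIds; // hypothetical: R = index of the nearest seed
layout(binding = 1) uniform sampler2D  densityTex; // hypothetical: R = density in [0, 1]

// Hypothetical accumulator, (3 * seed count) x 1 pixels, cleared to 0 beforehand.
// Per seed: [weighted sum of x, weighted sum of y, sum of weights].
layout(binding = 0, r32ui) uniform uimage2D sums;

// Assumed fixed-point scale. Large cells can overflow 32 bits; lower it if needed.
const float WEIGHT_SCALE = 255.0;

void main() {
    ivec2 size  = textureSize(voronoiIds, 0);
    ivec2 coord = ivec2(gl_GlobalInvocationID.xy);
    if (coord.x >= size.x || coord.y >= size.y) return;

    int   seed    = int(texelFetch(voronoiIds, coord, 0).r);
    float density = texelFetch(densityTex, coord, 0).r;
    uint  weight  = uint(clamp(density, 0.0, 1.0) * WEIGHT_SCALE + 0.5);

    // Each add is atomic, so many pixels can update the same seed safely.
    imageAtomicAdd(sums, ivec2(3 * seed + 0, 0), weight * uint(coord.x));
    imageAtomicAdd(sums, ivec2(3 * seed + 1, 0), weight * uint(coord.y));
    imageAtomicAdd(sums, ivec2(3 * seed + 2, 0), weight);
}
```

A later pass would then divide the first two sums by the third to get each seed's weighted centroid.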

In your example, if you put an analyze TOP set to average and then multiply the red channel by 16 (number of pixels) you get about the same sum as what the compute shader reports.

Aah, that looks interesting indeed! I’ll have to read the paper to fully understand it I think, but I’m very curious to see the end result! So did you manage to use this atomic counter in a stable way?
In the end I’m indeed using the analyzeTOP * amount of pixels, but what I tried to achieve was a more optimized way of counting the pixels (used in TDNeuron to sum up big matrices). So if you did find a way to use it in a stable way, it would help me a lot :slight_smile:

Well I still don’t understand the function atomicCounter(pCount) because it doesn’t work how I expected. atomicCounterIncrement(pCount) works the way I want it to, fortunately.

Hello,

Sorry, late to the atomic counters party!
Besides the imageAtomicAdd() function, I had also been using the atomicCounterIncrement() function with success.
I think the atomicCounter() function had also left me baffled, but I took another look. It seems that, despite what https://www.khronos.org/opengl/wiki/Atomic_Counter says, just reading the value is not an atomic operation, and the result depends on what has been completed in the different threads.

(Interestingly this https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/atomicCounter.xhtml doesn’t say it reads the value atomically, but this does https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/atomicCounterIncrement.xhtml)

It seems that calling it after the atomicCounterIncrement() and after a memoryBarrier() (per “You will need to ensure internal visibility if you want to use ordering guarantees within a rendering command.”) gives the final result of the counter for all the threads.
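
A minimal GLSL sketch of that ordering, as a standalone shader rather than anything from the attached files; the bindings and the output image outImage are assumptions. One caveat: memoryBarrier() only orders the memory accesses of the invocation that calls it and does not wait for other invocations, so the value read here is not guaranteed to be the final total on every GPU.

```glsl
#version 430
layout(local_size_x = 8, local_size_y = 8) in;

// Assumed binding: one atomic counter, reset to 0 before the dispatch.
layout(binding = 0, offset = 0) uniform atomic_uint pCount;
// Hypothetical debug output image.
layout(binding = 0, rgba32f) uniform writeonly image2D outImage;

void main() {
    // 1. This invocation's own increment.
    uint ticket = atomicCounterIncrement(pCount);

    // 2. Order this invocation's memory accesses around the read below.
    memoryBarrier();

    // 3. Read the counter. Any increments by invocations that have not
    //    executed yet are not included.
    uint seen = atomicCounter(pCount);

    ivec2 coord = ivec2(gl_GlobalInvocationID.xy);
    imageStore(outImage, coord, vec4(float(ticket), float(seen), 0.0, 1.0));
}
```

If the exact total is needed, reading the counter in a later pass, after the whole dispatch has finished, avoids depending on execution order.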

Strangely enough it still seems to give unstable results with atomicCounterAdd() after a texelFetch(), as in Tim’s sum example. Might have to do with the fact that you can’t sync different groups no matter what (barrier() only works within the workgroup)

And about David’s question on doing a decrement after the imageStore(): it seems all the increments are completed by the various threads before the write, so anything after doesn’t matter.

And doing a decrement just after an increment seems to give random results because of a race condition.
Adding memoryBarrier() gives a more predictable outcome but since all the increments would then be complete before the first decrement, the counter still goes from 0 to max overall.

See attached tox for various tests:

compute_shader_atomic_questions_workgroup_sync_issues.1.toe (5.6 KB)

So in short it seems the only reliable thing is one increment per thread and the rest is kinda confusing ;D

Some interesting insights:



In any case, it seems atomic counters are going away with Vulkan; only imageAtomic operations remain.

Also @timgerritsen, what you probably want is the parallel prefix sum algorithm https://en.wikipedia.org/wiki/Prefix_sum. There’s a sample in the OpenGL SuperBible sample code https://github.com/openglsuperbible/sb7code

Might be easier to wrap the thrust functions in a cuda TOP :wink:
https://thrust.github.io/doc/group__prefixsums.html
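
For a rough idea of what a prefix sum looks like in a compute shader, here is a minimal GLSL sketch of an inclusive scan within a single workgroup (the simple Hillis-Steele variant, not the SuperBible sample and not Thrust). The buffer names, uCount and GROUP_SIZE are hypothetical, and combining the results of several workgroups would need an additional pass that is not shown.

```glsl
#version 430
#define GROUP_SIZE 256
layout(local_size_x = GROUP_SIZE) in;

layout(std430, binding = 0) readonly  buffer DataIn  { float dataIn[];  }; // hypothetical
layout(std430, binding = 1) writeonly buffer DataOut { float dataOut[]; }; // hypothetical
uniform uint uCount; // hypothetical: number of valid input elements

// Two halves used as ping-pong buffers, so reads and writes never overlap.
shared float scratch[2 * GROUP_SIZE];

void main() {
    const uint N = uint(GROUP_SIZE);
    uint lid = gl_LocalInvocationID.x;
    uint gid = gl_GlobalInvocationID.x;

    // Load one element per invocation, padding with zeros past the end.
    uint src = 0u;
    scratch[lid] = (gid < uCount) ? dataIn[gid] : 0.0;
    memoryBarrierShared();
    barrier();

    // Each step adds the value 'offset' places to the left, doubling offset.
    for (uint offset = 1u; offset < N; offset *= 2u) {
        uint dst = 1u - src;
        float v = scratch[src * N + lid];
        if (lid >= offset) v += scratch[src * N + lid - offset];
        scratch[dst * N + lid] = v;
        memoryBarrierShared();
        barrier(); // works here because everything is inside one workgroup
        src = dst;
    }

    // Element i now holds the sum of elements 0..i of this workgroup.
    if (gid < uCount) dataOut[gid] = scratch[src * N + lid];
}
```

The last element written by each workgroup is that workgroup's total, so a plain sum falls out of the scan without any atomic counter.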

vincent, wow, thanks for this valuable information! :slight_smile:
gonna dive into it later today!

cheers,
tim