GLSL efficiency question : Samplers vs Vectors vs Arrays vs Matrices?

Ok, so the premise to this question: I have been combing through my GLSL code looking for ways to optimize it. I am trying to work out whether I have a bandwidth bottleneck or a computational bottleneck, and am trying some things to see if anything helps. I feel like some of these are obvious questions to a seasoned graphics programmer, but I don’t know for sure, so I wanted to get some other input.


For the sake of simplicity, I’ll ignore shader MATs and just use a 10x10 GLSL TOP as an example.
I’ll also assume that sampling a texture is done with texelFetch(), so no filtered / multi-sample lookups are happening via texture().


First, Samplers - my understanding of them is that each fragment must sample an input texture even if the information is the same for every fragment, or if the input texture is small, with say only 1 pixel.
So, for an output that is 10x10 pixels in size, the GLSL shader performs 100 sampling calls.

Moving on to the Vectors page, we can specify a name and up to a vec4 as an input.
My understanding here is that a uniform value is much faster / more efficient because only 1 value is uploaded to the shader, and the shader does not need to sample another texture 100 times; it already has this value in memory.
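
For example, this is roughly how I picture the difference (sColorMap and uTint are just placeholder names I made up for the sketch):

uniform sampler2D sColorMap; // every fragment issues a fetch from this, even if it is only 1x1
uniform vec4 uTint;          // uploaded once, read directly by every fragment

out vec4 fragColor;

void main(){
	vec4 c = texelFetch( sColorMap , ivec2( gl_FragCoord.xy ) , 0 ); // one texture fetch per fragment
	fragColor = c * uTint; // no fetch here, just a uniform read
}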

On the third tab there are Arrays. Again, like Vectors, my understanding here is that these are very efficient too because the data is uploaded once and referenced many times once it’s in memory. No sampling a bunch of times.

  • Does this change when we switch the array type to Texture Buffer instead of Uniform Array? (A rough sketch of how I picture the two access patterns is below.)
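
Here is that sketch - uProjArray and sProjBuffer are placeholder names, and the two functions are just how I’d expect the two modes to be read on the GLSL side:

uniform vec4 uProjArray[64];       // Uniform Array mode: plain uniform storage, indexed directly
uniform samplerBuffer sProjBuffer; // Texture Buffer mode: a buffer texture read through a sampler

vec4 readUniformArray( int i ){
	return uProjArray[i]; // direct indexed read, no sampler involved
}

vec4 readTextureBuffer( int i ){
	return texelFetch( sProjBuffer , i ); // fetched through a samplerBuffer with an integer index
}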

Lastly, the Matrices tab - which is where most of my confusion lies, and where I am looking for clarity the most.

  • Are these set up the same way as Vectors and Arrays - faster because everything is uploaded at once?
  • Is there a difference between the plain Matrix input vs the Xform From and To inputs?
  • Is there a practical limit or a hard limit to how many times the “+” button can be pressed to add more matrices?
  • Are matrix inputs faster than supplying a sequence of 16 values via the Arrays tab?

It’s a hard question to answer because drivers and the GPUs can all solve these issues differently.

Samplers do get cached, and a sample of a texture is going to grab a bunch of nearby pixels at once, on the assumption that nearby fragment shader executions are going to want them. So most samples are much faster than a ‘slow’ uncached fetch every time. If you are sampling in the way the GPU assumes you will, then you’ll get a lot of cache hits and be much faster. Sampling a texture at random locations can be slower though. Samplers are always going to be in GPU memory.
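
Purely as an illustration (sMap is a placeholder name), the difference between a coherent read and a scattered read looks something like this:

uniform sampler2D sMap;

vec4 coherentRead( ivec2 coord ){
	// Neighbouring fragments fetch neighbouring texels, so most of these hit the cache.
	return texelFetch( sMap , coord + ivec2( 1 , 0 ) , 0 );
}

vec4 scatteredRead( ivec2 coord ){
	// A per-fragment index that jumps around the texture is far more likely to miss the cache.
	ivec2 jump = ivec2( ( coord.x * 7919 ) % 512 , ( coord.y * 104729 ) % 512 );
	return texelFetch( sMap , jump , 0 );
}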

Vectors (Uniforms) are very fast, but they have a very limited amount of memory available to them. However, there is nothing to say that the memory holding these uniforms is any closer to the shader cores than the memory holding samplers. They may have the same cache behavior.

Arrays are using texture buffer objects, which are likely slower than raw uniforms. How fast they are depends on a lot. They aren’t necessarily uploaded to the GPU though, as they may be read from CPU memory directly. This depends on how the drivers and the GPU do things.

Matrices use the same method as Vectors (raw uniforms), so they are as fast as those.
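
From the shader’s point of view they arrive as plain mat4 uniforms, so reading one looks something like this sketch (uLightProj is a placeholder name):

uniform mat4 uLightProj; // a matrix parameter arrives as a plain mat4 uniform

vec4 toLightSpace( vec4 worldPos ){
	return uLightProj * worldPos; // no texture fetch, just uniform reads and multiply-adds
}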

Unfortunately the real answer is the same answer I get when I ask similar questions to Nvidia or AMD about ‘which way is faster’: you need to try it and see for your particular GPU, drivers and shader. The GPU timing in TOPs helps with this a lot.


Wow, this is a goldmine of extremely useful knowledge, thank you!
The original code that prompted this post was this one:

// projMatricies is fetched here as a 16-texel-wide texture, one row per light,
// using only the red channel of each texel (16 fetches per matrix).
mat4 getLightProjectionMatrix(int lightIndex){
	mat4 m;
	
	m[0][0] = texelFetch( projMatricies , ivec2( 0 , lightIndex ) , 0 ).r;
	m[0][1] = texelFetch( projMatricies , ivec2( 1 , lightIndex ) , 0 ).r;
	m[0][2] = texelFetch( projMatricies , ivec2( 2 , lightIndex ) , 0 ).r;
	m[0][3] = texelFetch( projMatricies , ivec2( 3 , lightIndex ) , 0 ).r;
	
	m[1][0] = texelFetch( projMatricies , ivec2( 4 , lightIndex ) , 0 ).r;
	m[1][1] = texelFetch( projMatricies , ivec2( 5 , lightIndex ) , 0 ).r;
	m[1][2] = texelFetch( projMatricies , ivec2( 6 , lightIndex ) , 0 ).r;
	m[1][3] = texelFetch( projMatricies , ivec2( 7 , lightIndex ) , 0 ).r;
	
	m[2][0] = texelFetch( projMatricies , ivec2( 8 , lightIndex ) , 0 ).r;
	m[2][1] = texelFetch( projMatricies , ivec2( 9 , lightIndex ) , 0 ).r;
	m[2][2] = texelFetch( projMatricies , ivec2( 10 , lightIndex ) , 0 ).r;
	m[2][3] = texelFetch( projMatricies , ivec2( 11 , lightIndex ) , 0 ).r;
	
	m[3][0] = texelFetch( projMatricies , ivec2( 12 , lightIndex ) , 0 ).r;
	m[3][1] = texelFetch( projMatricies , ivec2( 13 , lightIndex ) , 0 ).r;
	m[3][2] = texelFetch( projMatricies , ivec2( 14 , lightIndex ) , 0 ).r;
	m[3][3] = texelFetch( projMatricies , ivec2( 15 , lightIndex ) , 0 ).r;
	
	return m;
}

First question that occurred to me: is sampling just the red channel 16 times slower than sampling all of r/g/b/a 4 times? From suggestions elsewhere, it seems that it is.

Then also, I wondered whether providing these matrices a different way would be faster or not…

@malcolm it sounds like, from what you’re saying, that if I could provide them via the actual matrix inputs in the shader, that would be the clear winner - but what if I have dozens to hundreds of matrices to input?

Would it be better to pass these in via 1 long vec4 uniform array? Or is pulling from a sampler probably ok if I keep the incoming data organized in a cache-friendly way?

One suggestion I got was to do something like this (pseudocode):

// This version assumes projMatricies is repacked as a 4-texel-wide RGBA texture,
// with one full matrix column per texel.
mat4 getLightProjectionMatrix(int lightIndex){
	
	mat4 m;
	
	m[0] = texelFetch( projMatricies , ivec2( 0 , lightIndex ) , 0 );
	m[1] = texelFetch( projMatricies , ivec2( 1 , lightIndex ) , 0 );
	m[2] = texelFetch( projMatricies , ivec2( 2 , lightIndex ) , 0 );
	m[3] = texelFetch( projMatricies , ivec2( 3 , lightIndex ) , 0 );
	
	return m;
}

It’s probably not 16x slower, but likely quite a bit slower. You are issuing 16 instructions vs. 4 at the very least. I would definitely try to group them so you can do 4 texelFetch() instructions instead.
If you have lots of these, but they are small enough that they’ll fit in a uniform array, then that’ll likely be the fastest. Otherwise you need to use texture buffers.
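
As a sketch of the uniform array route (uProjMatrices and its size are placeholders; it assumes each matrix is packed as four consecutive vec4 columns):

uniform vec4 uProjMatrices[64]; // 64 vec4s = 16 mat4s, packed column by column

mat4 getLightProjectionMatrix( int lightIndex ){
	int base = lightIndex * 4;
	return mat4( uProjMatrices[ base + 0 ],
	             uProjMatrices[ base + 1 ],
	             uProjMatrices[ base + 2 ],
	             uProjMatrices[ base + 3 ] ); // four uniform reads, no texture fetches
}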

Awesome, will definitely apply this to my code soon. Thanks again!

So, along a maybe similar line of thinking - would sampling one 32-bit pixel generally be faster than sampling four 8-bit pixels? Or is it pretty close and really dependent on how the drivers and caching behave in a given scenario?

I’m wondering if I should start packing all my info into higher bit depth textures in an effort to reduce the number of texture fetches even more…

In general, CPUs and GPUs work with at least 32 bits at a time when doing operations, so yes, grouping things into 32 bits is likely better.
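
As a sketch of what that packing could look like on the GLSL side (assuming a GLSL version recent enough to have the packing built-ins, and sPacked as a placeholder name for a 32-bit single-channel integer texture):

uniform usampler2D sPacked; // one 32-bit uint per pixel, with four 8-bit values packed inside

vec4 readPacked( ivec2 coord ){
	uint bits = texelFetch( sPacked , coord , 0 ).r; // one fetch instead of four
	return unpackUnorm4x8( bits ); // expand back into four normalized 0-1 floats
}
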
Would love to hear about what the timing differences are when you do these changes!


Awesome, thank you! As soon as I have a more conclusive before-and-after, I would love to put together a post on the findings here with some examples.