GLSL: projecting pixels from screen space into world space and back

Hello!

My goal:

  1. For a given FOV and aspect ratio, project screen space pixels into 3D world space based on pixel depth value. Depth is in absolute floating point format.
    (as seen done in depthProjection comp)
  2. Transform these points in 3D space based on transformations of a given camera view. A default camera state should correspond to no transformation (all zeroes, scale at 1)
  3. Project these 3D points back into the screen space (of camera from step 1).
  4. Output the offset in screen space between pixel UV position before and after 3D imaginary camera transformation.

Essentially the effect is analogous to a scenario where we would create geometry based on a depth map, then move with a camera around it, except here I want to have UV offset data to use with pixel displacements.

Despite my best efforts, I feel completely defeated on this relatively simple task after several days of trial and only error. I started knowing nothing about GLSL, but able to visualize all the transformations in my head.
I ended up knowing a lot more about matrix transformations, shader language, NDC, and conversions between world, view, and projection spaces, but alas, after countless YouTube tutorials and an infinitely long ChatGPT thread, I’m only making the AI and myself more confused at this point. I need real humans.

I’ve tried to put the aforementioned steps into a GLSL pixel shader.
The “imaginary” camera is a Camera COMP that I feed in as a uniform mat4 through the .transform() method. FOV and aspect come in as a vec2. Depth is one of the TOP inputs.

The code in the depthProjection component seemed like a good start, but most of the information online, and especially what ChatGPT follows, uses the conventions of graphics pipelines built on various OpenGL libraries, which I struggled to fit together with that code. For example, it often suggests the perspective() function to build the projection matrix, which is not available by default as far as I am aware?

If anyone is willing to look into this:
glsl projection.toe (719.2 KB)

I am attaching a template .toe file with locked TOPs that hold an example depth and image frame, but I essentially only left pseudocode (my logical thinking) inside the GLSL main function, because all of my countless variants failed in one way or another, working almost right, but not quite right. It became a spaghetti of rethinking and redoing everything, and nobody deserves to be tortured by that.
Also attached is a displace node, that alludes to further steps in my pipeline and should visualize correct UV offsets by displacing the scene with apparent depth.

Thank you! I’d be excited to eventually share my project after I get over this bottleneck.

Hi @aulerius,

you can do this all in TOPs as well and the nice part is, it explains the Math quite nicely:

Let’s say we need a position which is built from x, y, and z.
We already know z, as this is the value stored in your depth map.
x and y can be retrieved from the depth map’s pixel uv value, normalized over the aspect-correct width of the renderer. The formula is something like this:

  posXY = (uv * 2.0 - 1.0 - c) * z / f

with uv being the texture coordinate, c the lens’ principal point and f the focal length.

The texture coordinate in TOPs can be created using Ramp TOPs: one horizontal for the u and one vertical for the v. These need to have normalized value ranges (Math TOP). Finally, these values are multiplied with the depth texture multiplied by the focal length. The horizontal and vertical focal lengths have the same ratio as the Render TOP’s aspect. You can use a Reorder TOP to bring them all together into one texture and use them as your source for instancing.

Hope the attached file makes sense. You can certainly also write this as a glsl shader following the steps of these operators.
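
For anyone who prefers the shader route, here is a minimal GLSL TOP sketch of the posXY formula above. It is not taken from the attached file. It assumes the principal point c sits at the image centre (c = 0), that the FOV is horizontal and given in degrees, and that the first input holds absolute depth in its red channel. The uniform name uFovAspect and the function name unproject are made up for illustration.

```glsl
// Hypothetical uniform: x = horizontal FOV in degrees, y = aspect (width / height)
uniform vec2 uFovAspect;

out vec4 fragColor;

// Rebuild a camera-space position from a texture coordinate and a depth value.
// Assumes c = 0 (principal point at the image centre).
vec3 unproject(vec2 uv, float depth, vec2 fovAspect)
{
	// focal length in NDC units; the vertical one is scaled by the aspect
	float fx = 1.0 / tan(radians(fovAspect.x) * 0.5);
	vec2 f = vec2(fx, fx * fovAspect.y);

	// posXY = (uv * 2.0 - 1.0 - c) * z / f, with c = 0
	vec2 posXY = (uv * 2.0 - 1.0) * depth / f;

	// z is kept positive here, exactly as stored in the depth map
	return vec3(posXY, depth);
}

void main()
{
	// assumption: input 0 is the depth map, absolute depth in the red channel
	float depth = texture(sTD2DInputs[0], vUV.st).r;
	vec3 pos = unproject(vUV.st, depth, uFovAspect);
	fragColor = TDOutputSwizzle(vec4(pos, 1.0));
}
```

The output needs a 32-bit float pixel format to hold the positions without clamping. If the FOV in your setup is vertical instead of horizontal, the aspect scaling moves to the x component.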

cheers
Markus

glsl projection.2.toe (718.1 KB)

Hey @aulerius,

totally forgot that there is a component for this in the Palette too. You can find the depthProjection component under PointClouds.

cheers
Markus

Yes! I’ve already seen it and mentioned it in my original post. It only takes me halfway though.

I’ll investigate your previous reply more deeply, but on my first impression I think you’re also talking only about the first half of my goal.
Sorry if my post is misleading or the wording is insufficient, but I do not need to actually render the points in 3D space. Projecting them into 3D space is only an intermediate step; ultimately I want to project them back into 2D space, to figure out their offsets in 2D screen space after camera movement, and output that. THIS is the final result, meant to be used as a displacement map for a Displace TOP.

Basically, I want to reverse the projection process after they’re in 3D, but so far I could not do that successfully, as all my attempts seemed to have one thing or another wrong.

You’re trying to build a novel view from a known RGBD dataset. There are several ways to achieve this, but perhaps the simplest one to implement is to build your UV lookup as a forward render pass: take @snaut’s example and modify it so that the resulting point cloud contains XYZ+UV instead of XYZ+RGB, do a render pass from the novel view (ensuring that the output buffer is at least 32-bit RG), and then calculate the UV offsets against the original view from that resulting buffer.
A single-pass backward reprojection (starting from the novel view viewport’s space) would require more involved techniques, like raycasting or perhaps something similar to shadow mapping.

I see. I was considering something like this. Won’t there be some artifacts due to how point clouds are rendered? (They scale to distance, may overlap, and generally do not map 1:1 to pixels on a rendered view)
With this in mind, I’ve discarded this option, because I want each input UV pixel to have corresponding UV offset on the output.

I’ll look into shadow mapping and such, but I’m still convinced that it could be achieved with pure glsl and some appropriate matrix transformations, because 2D-3D conversions are just that in the end. It felt like I was 90% there already with my attempts.

Yes.

This is not feasible in the general sense: novel views can (and will) show parts of the scene that are not visible in the source material. NeRFs (or Gaussian Splatting) will get you far closer to a realistic result than simple linear interpolation (displacement maps).

Shadow mapping will exemplify how you can use a depth map to determine visibility / occlusion in a novel view in a single pass, and with ‘single pass’ I mean that a single shader invocation will suffice (you do all the math in there). It is not ‘just’ 2D-3D conversions, because you’re dealing with a discrete (or sampled) signal in your source view; the information contained there corresponds to a relatively small (or null) subset of the information that can exist in a different view of the same scene. You are looking to approximate, or reconstruct, information that could be there based on your finite dataset, so it is up to you to pick the strategy to fill in the blanks.
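
To make the shadow-mapping comparison concrete, here is a rough GLSL TOP sketch of the depth test it relies on: a candidate point is projected into the source camera and its distance is compared with the depth stored there. It only shows the visibility check, not a full reprojection. The uniforms uSrcView and uSrcProj, the function visibleFromSource, the bias value and the use of a second input for candidate positions are all assumptions for illustration, as is the OpenGL-style camera looking down -Z.

```glsl
// Hypothetical uniforms: view and projection matrices of the source (depth map) camera
uniform mat4 uSrcView;
uniform mat4 uSrcProj;

out vec4 fragColor;

// True when worldPos is not hidden behind the surface recorded in the depth map.
// Assumes input 0 stores absolute depth in red and the camera looks down -Z.
bool visibleFromSource(vec3 worldPos, float bias)
{
	vec4 viewPos = uSrcView * vec4(worldPos, 1.0);
	vec4 clipPos = uSrcProj * viewPos;

	// points behind the source camera were never seen
	if (clipPos.w <= 0.0)
		return false;

	// perspective divide, then NDC [-1, 1] to UV [0, 1]
	vec2 uv = (clipPos.xy / clipPos.w) * 0.5 + 0.5;

	// outside the source frame there is no information at all
	if (any(lessThan(uv, vec2(0.0))) || any(greaterThan(uv, vec2(1.0))))
		return false;

	float storedDepth = texture(sTD2DInputs[0], uv).r;
	float pointDepth = -viewPos.z;

	// the small bias avoids self-occlusion from sampling error
	return pointDepth <= storedDepth + bias;
}

void main()
{
	// assumption: input 1 holds candidate world-space positions to test
	vec3 worldPos = texture(sTD2DInputs[1], vUV.st).xyz;
	float visible = visibleFromSource(worldPos, 0.01) ? 1.0 : 0.0;
	fragColor = TDOutputSwizzle(vec4(vec3(visible), 1.0));
}
```

Pixels that fail this test are exactly the "blanks" that have to be filled by some other strategy.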
Studying projective geometry (homographies, in particular) will net you the background to pick the better tools for your use case.

The following tutorial could be useful as a starting point:
https://docs.opencv.org/4.x/d9/dab/tutorial_homography.html

I’m completely unsure if I understand the request fully but would something like this not work to get the uv offsets between the position of the pixel in the depthmap and the position in the rendered camera view?

// camera transform from the camera via op('cam1').localTransform
uniform mat4 mCamTrans;
// camera projection from the camera via op('cam1').projection(1280,720)
uniform mat4 mCamProj;

out vec4 fragColor;
void main()
{
	// get xyz from previous calculation (TOPs)
	vec4 xyz = texture(sTD2DInputs[0], vUV.st);

	// transform points to clipspace
	mat4 fullMat = mCamProj * mCamTrans;
	vec4 clipPos = fullMat * xyz;

	// calc normalized device coord (perspective divide)
	vec3 ncPos = clipPos.xyz / clipPos.w;
	// map NDC [-1, 1] to UV [0, 1]; the negative scale also mirrors x and y
	vec2 screenPos = ncPos.xy * -0.5 + 0.5;

	// calc difference org uv to screenPos
	vec2 diff = screenPos - vUV.st;

	fragColor = TDOutputSwizzle(vec4(diff,1.0,1.0));
}

I feel like I’m missing something crucial :rofl:

To add to the discussion, here’s a couple of whitepapers describing methods for novel view synthesis from single and multiple RGBD image datasets:

Jeez, there are also methods to get camera projection matrices!! I completely missed that.

Anyway, yes! It’s that simple, this is the second half of the behavior I seek.
I was very close to it by myself, but somewhere along the way got tangled up in my own mess and had to quit or at least take a step back. Thank you so much for sticking with me on this one!

What I still couldn’t figure out was the odd flip of X Y rotation and Z translation, when compared to ground truth (actual 3D points). I tried a workaround where I created a duplicate camera that references the transform params of the main one, except flips those axes of transformation, and it ends up working just fine.
Could this be something with the right/left handed coordinate things? Maybe the transform matrix from localTransform is flipped on those axes on purpose?

Here’s a comparison video of displacing with these UV coords against the point cloud. (remap is same as displace, just one step removed)

I know it has artifacts; they’re quite visible in the video. I came into this knowing the shortcomings of such a technique.
And thank you @r-ssek for sharing additional research goodies and tips. I am well aware that a single depth projection is a very limited amount of information, but I wasn’t looking to be clever about it at this specific step in my larger pipeline/workflow. The camera shifts will always be small, and I am not oblivious to potential occlusion problems, ghosting, stretching, etc with pixel displacements like this.

I was already using pixel displacement frame-by-frame with data from Blender’s vector pass, which lists pixel displacements in screen space for each frame based on how geometry is moving. You can see how my goal here was essentially to replicate that kind of data inside TD using synthetic camera and depth data.
For more advanced cases in my workflow, I might be intrigued to follow alternative paths though. Your answers stirred up some curiosity in me. I see now that displacing a flat plane with a displacement vertex shader could also work, and the same math could carry over.

Hey @aulerius,

looking forward to seeing where you end up with all this, especially with the additional input from @r-ssek. It is fascinating for me to follow along.

I guess if you invert the depth values, the axes should be aligned, as in TouchDesigner z points towards you.
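
One way to read this suggestion, as a sketch only: negate the depth so the points sit in front of a camera that looks down -Z, use the inverse of the camera transform as the view matrix, and map NDC to UV without the mirroring. It reuses the mCamTrans and mCamProj uniforms from the shader earlier in the thread. Whether the inverse is needed, and whether this removes the axis flips in the attached file, is an assumption and has not been checked against it.

```glsl
uniform mat4 mCamTrans;  // camera transform, as in the shader earlier in the thread
uniform mat4 mCamProj;   // camera projection, as in the shader earlier in the thread

out vec4 fragColor;
void main()
{
	// xyz from the previous calculation; assumed to hold positive depth in z
	vec4 xyz = texture(sTD2DInputs[0], vUV.st);

	// assumption: invert depth so the point lies in front of a -Z facing camera
	vec4 pos = vec4(xyz.xy, -xyz.z, 1.0);

	// assumption: the view matrix is the inverse of the camera's transform
	vec4 clipPos = mCamProj * inverse(mCamTrans) * pos;

	// perspective divide, then NDC [-1, 1] to UV [0, 1] without mirroring
	vec2 screenPos = (clipPos.xy / clipPos.w) * 0.5 + 0.5;

	// offset between the reprojected and the original UV
	fragColor = TDOutputSwizzle(vec4(screenPos - vUV.st, 0.0, 1.0));
}
```

With an untransformed camera whose projection matches the FOV and aspect used to build xyz, screenPos equals vUV.st and the offset is zero, which is the "default camera means no transformation" behaviour asked for in the first post.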

cheers
Markus
