optimisation

Quote from: eiffie on January 26, 2013, 06:52:33 PM

One last question and I'll shut up.

This creates a data dependency...
float tmp=x;
x=y;
//wait for memory to settle
y=tmp;

But does this???
v.xy=v.yx;

It must, right? Or since tmp wasn't used in arith. the first example does not need to wait?

I did a few tests with Nvidia's NVEmulate, which can dump the compiled GLSL.

In your case, using a temp and using a swizzle compiled to exactly the same code; both became a single swizzled move:
MOV.F result_color0.xy, fragment.attrib[0].yxzw;

I think swizzles are basically free on GPUs because the GPU instructions seem to specify a swizzle mask, so I'm not sure the GPU would actually need to wait for the register latency in this case.
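For anyone who wants to repeat this kind of test, here is a minimal sketch of the two swap variants in one fragment shader. It is not the exact shader that was dumped above: the `coord` varying and the `USE_TEMP` switch are assumptions made for illustration.

```glsl
#version 120
// Assumption: `coord` is a varying written by the vertex shader.
varying vec2 coord;

// Hypothetical switch: uncomment to build the temp-variable variant.
// #define USE_TEMP

void main() {
	vec2 v = coord;
#ifdef USE_TEMP
	// Swap through an explicit temporary.
	float tmp = v.x;
	v.x = v.y;
	v.y = tmp;
#else
	// Swap with a swizzle.
	v.xy = v.yx;
#endif
	gl_FragColor = vec4(v, 1.0, 1.0);
}
```

Compile it once with `USE_TEMP` defined and once without, then compare the two dumps. If the compiler behaves as described above, both should reduce to a single MOV with a swizzled source.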

I also did a few other tests to see how the compiler behaves.

Here is my proposed optimization:

Code:

void main() {
	float x = coord.x;
	float y = coord.y;
	float x4 = x*x; // x4 and y4 hold x^2 and y^2 for now; the squares are only needed for tenx2y2
	float y4 = y*y;
	float tenx2y2 = 10.*x4*y4;
	x4 = x4*x4;
	y4 = y4*y4;
	x = (5.*y4+1.-tenx2y2+x4)*x; // we don't need tempx anymore: y doesn't use x
	y = (y4-tenx2y2+5.*x4+1.)*y;
	gl_FragColor = vec4(x,y,1.0,1.0);
}

Which is compiled into

Code:

MUL.F R0.w, fragment.attrib[0].x, fragment.attrib[0].x;
MUL.F R0.x, fragment.attrib[0].y, fragment.attrib[0].y;
MUL.F R0.y, R0.x, R0.w;
MUL.F R0.y, R0, {10, 0, 0, 0}.x;
MUL.F R0.x, R0, R0;
ADD.F R0.z, R0.x, -R0.y;
MAD.F R0.x, R0, {5, 0, 0, 0}, -R0.y;
MUL.F R0.w, R0, R0;
ADD.F R0.x, R0, R0.w;
MAD.F R0.y, R0.w, {5, 0, 0, 0}.x, R0.z;
MAD.F result_color0.x, R0, fragment.attrib[0], fragment.attrib[0];
MAD.F result_color0.y, fragment.attrib[0], R0, fragment.attrib[0];
MOV.F result_color0.zw, {1, 0, 0, 0}.x;
END
# 13 instructions, 1 R-regs

Notice that all of these are scalar operations, except the last one, which writes two components.

Here is Knighty's proposal:

Code:

void main() {
	vec2 v=coord.xy;
	vec2 v2=v*v;
	v=v*(5.*v2.yx*v2.yx + v2*(v2 - 10.*v2.yx) + 1.);
	gl_FragColor = vec4(v,1.0,1.0);
}

Which turns into

Code:

MUL.F R0.xy, fragment.attrib[0], fragment.attrib[0];
MAD.F R0.zw, -R0.xyyx, {10, 0, 0, 0}.x, R0.xyxy;
MUL.F R0.zw, R0.xyxy, R0;
MUL.F R0.xy, R0.yxzw, R0.yxzw;
MAD.F R0.xy, R0, {5, 0, 0, 0}.x, R0.zwzw;
MAD.F result_color0.xy, R0, fragment.attrib[0], fragment.attrib[0];
MOV.F result_color0.zw, {1, 0, 0, 0}.x;
END
# 7 instructions, 1 R-regs

Only 7 instructions here, but they are all two-component vector operations (corresponding to 14 scalar operations, the same as above).

And, as you might expect, the two fragments execute at exactly the same speed. So it appears there is no reason to explicitly vectorize your instructions.
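As a sanity check that the two fragments above really compute the same thing, here is a small sketch that evaluates both forms in one shader and outputs their difference. The function names, the `coord` declaration and the scale factor of 1000 are assumptions for illustration, not part of either original shader.

```glsl
#version 120
// Assumption: `coord` is a vec2 varying supplied by the vertex shader.
varying vec2 coord;

// Scalar formulation (the hand-factored version).
vec2 scalarForm(vec2 p) {
	float x2 = p.x*p.x;
	float y2 = p.y*p.y;
	float tenx2y2 = 10.*x2*y2;
	float x4 = x2*x2;
	float y4 = y2*y2;
	return vec2((5.*y4+1.-tenx2y2+x4)*p.x,
	            (y4-tenx2y2+5.*x4+1.)*p.y);
}

// Vector formulation (Knighty's version).
vec2 vectorForm(vec2 v) {
	vec2 v2 = v*v;
	return v*(5.*v2.yx*v2.yx + v2*(v2 - 10.*v2.yx) + 1.);
}

void main() {
	// Expanding the vector form per component gives
	//   x: x*(5y^4 + x^4 - 10x^2y^2 + 1)
	//   y: y*(5x^4 + y^4 - 10x^2y^2 + 1)
	// which matches the scalar form term by term.
	vec2 d = abs(scalarForm(coord) - vectorForm(coord));
	// Red/green show the per-component difference, scaled up so that
	// small rounding differences become visible; black means they agree.
	gl_FragColor = vec4(d * 1000.0, 0.0, 1.0);
}
```

The two forms are algebraically identical, so any colour that shows up would come from floating-point rounding, since the operations are grouped differently.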

I also tried AMD GPU ShaderAnalyzer (which lets you see what GLSL gets compiled into for the different AMD/ATI architectures). As far as I could tell, the compiler will do some automatic vectorization of scalar instructions, even for older cards. But I'm not sure I'm interpreting the assembly code the right way.

END OF AN ERA, FRACTALFORUMS.COM IS CONTINUED ON FRACTALFORUMS.ORG

it was a great time but no longer maintainable by c.Kleinhuis contact him for any data retrieval,
thanks and see you perhaps in 10 years again

The All New FractalForums is now in Public Beta Testing! Visit FractalForums.org and check it out!
