One last question and I'll shut up.
This creates a data dependency...
float tmp=x;
x=y;
//wait for memory to settle
y=tmp;
But does this???
v.xy=v.yx;
It must, right? Or, since tmp wasn't used in any arithmetic, does the first example not need to wait either?
I did a few tests with NVIDIA's NVEmulate, which allows dumping the compiled GLSL.
In your case, using a temp and using a swizzle compiled to exactly the same code; both were turned into a single swizzled move:
MOV.F result_color0.xy, fragment.attrib[0].yxzw;
I think swizzles are basically free on GPUs because the GPU instructions seem to specify a swizzle mask, so I'm not sure the GPU would actually need to wait for the register latency in this case.
I also did a few other tests to see how the compiler behaves.
Here is my proposed optimization:
void main() {
  float x = coord.x;
  float y = coord.y;
  float x4 = x*x; // x4 and y4 hold x^2 and y^2 for now; the squares are only needed for tenx2y2
  float y4 = y*y;
  float tenx2y2 = 10.*x4*y4;
  x4 = x4*x4; // now x^4
  y4 = y4*y4; // now y^4
  x = (5.*y4+1.-tenx2y2+x4)*x; // we don't need tempx anymore: y doesn't use x
  y = (y4-tenx2y2+5.*x4+1.)*y;
  gl_FragColor = vec4(x,y,1.0,1.0);
}
Which is compiled into
MUL.F R0.w, fragment.attrib[0].x, fragment.attrib[0].x;
MUL.F R0.x, fragment.attrib[0].y, fragment.attrib[0].y;
MUL.F R0.y, R0.x, R0.w;
MUL.F R0.y, R0, {10, 0, 0, 0}.x;
MUL.F R0.x, R0, R0;
ADD.F R0.z, R0.x, -R0.y;
MAD.F R0.x, R0, {5, 0, 0, 0}, -R0.y;
MUL.F R0.w, R0, R0;
ADD.F R0.x, R0, R0.w;
MAD.F R0.y, R0.w, {5, 0, 0, 0}.x, R0.z;
MAD.F result_color0.x, R0, fragment.attrib[0], fragment.attrib[0];
MAD.F result_color0.y, fragment.attrib[0], R0, fragment.attrib[0];
MOV.F result_color0.zw, {1, 0, 0, 0}.x;
END
# 13 instructions, 1 R-regs
Notice that all of these are scalar operations, except the last one, which writes two components.
Here is Knighty's proposal:
void main() {
  vec2 v=coord.xy;
  vec2 v2=v*v;
  v=v*(5.*v2.yx*v2.yx + v2*(v2 - 10.*v2.yx) + 1.);
  gl_FragColor = vec4(v,1.0,1.0);
}
Which turns into
MUL.F R0.xy, fragment.attrib[0], fragment.attrib[0];
MAD.F R0.zw, -R0.xyyx, {10, 0, 0, 0}.x, R0.xyxy;
MUL.F R0.zw, R0.xyxy, R0;
MUL.F R0.xy, R0.yxzw, R0.yxzw;
MAD.F R0.xy, R0, {5, 0, 0, 0}.x, R0.zwzw;
MAD.F result_color0.xy, R0, fragment.attrib[0], fragment.attrib[0];
MOV.F result_color0.zw, {1, 0, 0, 0}.x;
END
# 7 instructions, 1 R-regs
Only 7 instructions here, but they are all two-component vector operations, which corresponds to 14 scalar operations, the same as above (12 scalar instructions plus one two-component move).
And, as you might expect, the two versions execute at exactly the same speed. So it appears there is no reason to explicitly vectorize your instructions.
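If you want to convince yourself that the two listings really compute the same values, here is a rough sketch of a test shader that evaluates both and displays the difference. It is not something I dumped through NVEmulate. The `#version 120` line, the `varying` declaration for `coord`, the function names `scalarForm` and `vectorForm`, and the `gain` constant are all my own assumptions for illustration.

```glsl
#version 120
// Assumption: 'coord' is a varying written by the vertex shader,
// as in the two listings above.
varying vec2 coord;

// Scalar formulation (the first listing), with the squares and
// fourth powers kept in separate variables for readability.
vec2 scalarForm(vec2 c) {
    float x = c.x;
    float y = c.y;
    float x2 = x * x;
    float y2 = y * y;
    float tenx2y2 = 10. * x2 * y2;
    float x4 = x2 * x2;
    float y4 = y2 * y2;
    return vec2((5. * y4 + 1. - tenx2y2 + x4) * x,
                (y4 - tenx2y2 + 5. * x4 + 1.) * y);
}

// Vector formulation (Knighty's listing).
vec2 vectorForm(vec2 v) {
    vec2 v2 = v * v;
    return v * (5. * v2.yx * v2.yx + v2 * (v2 - 10. * v2.yx) + 1.);
}

void main() {
    // Per-component absolute difference between the two formulations.
    vec2 diff = abs(scalarForm(coord) - vectorForm(coord));
    // Hypothetical gain so that tiny rounding differences become visible.
    const float gain = 1000.0;
    gl_FragColor = vec4(diff * gain, 0.0, 1.0);
}
```

Expanding Knighty's expression by hand gives x*(5y^4 + x^4 - 10x^2y^2 + 1) and y*(5x^4 + y^4 - 10x^2y^2 + 1), which are the same polynomials as in the scalar version, so the output should be black or very close to it. Any faint colour would come from the different order of the floating-point operations, not from a different formula.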
I also tried AMD GPU ShaderAnalyzer (which lets you see what GLSL gets compiled into for all the different AMD/ATI architectures), and as far as I could tell, the compiler will do some automatic vectorization of scalar instructions, even for older cards. But I'm not sure I'm interpreting the assembly code the right way.