Topic: optimisation
lycium
« Reply #15 on: January 27, 2013, 01:33:59 AM »

Quote:
I use a 1D memory array of unsigned chars for saving the colors (shades of grey, from 0 to 255) of pixels.
Do you mean that I should think about other structures, for example a 2D memory array?

Nope, you simply process many of them (e.g. 4+) in parallel: compute them together, then write the results out in one batch. Besides opening up vectorisation potential, this also reduces loop overhead (a technique called unrolling).
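
To make the unrolling idea concrete, here is a minimal C sketch. It assumes a flat array of unsigned chars as described in the question; `shade()` is a hypothetical stand-in for the real per-pixel computation and is not taken from the thread.

```c
#include <stddef.h>

/* Hypothetical per-pixel grey value (0..255); a placeholder for the
   actual fractal computation. */
static unsigned char shade(size_t i) {
    return (unsigned char)((i * 7u) & 0xFFu);
}

/* Straightforward version: one pixel per loop iteration. */
static void fill_simple(unsigned char *data, size_t n) {
    for (size_t i = 0; i < n; ++i)
        data[i] = shade(i);
}

/* Unrolled by 4: compute four pixels together, then write them out
   in one batch. The second loop handles the leftover 0..3 pixels. */
static void fill_unrolled4(unsigned char *data, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        unsigned char a = shade(i);
        unsigned char b = shade(i + 1);
        unsigned char c = shade(i + 2);
        unsigned char d = shade(i + 3);
        data[i]     = a;
        data[i + 1] = b;
        data[i + 2] = c;
        data[i + 3] = d;
    }
    for (; i < n; ++i)
        data[i] = shade(i);
}
```

Both functions fill the array with identical values; the unrolled one just runs a quarter as many loop tests and gives the compiler four independent computations to schedule or vectorise. Whether it is actually faster depends on the compiler and the real per-pixel work, so measure before keeping it.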

Syntopia
« Reply #16 on: January 27, 2013, 11:46:20 PM »

Quote:
One last question and I'll shut up.

This creates a data dependency...
float tmp=x;
x=y;
//wait for memory to settle
y=tmp;

But does this???
v.xy=v.yx;

It must, right? Or, since tmp wasn't used in arithmetic, does the first example not need to wait?

I did a few tests with NVIDIA's NVEmulate, which allows dumping the compiled GLSL.

In your case, using a temp and using swizzling compiled to exactly the same code; both were turned into a single swizzled move:
MOV.F result_color0.xy, fragment.attrib[0].yxzw;

I think swizzles are basically free on GPUs because the GPU instructions seem to specify a swizzle mask, so I'm not sure the GPU would actually need to wait for the register latency in this case.
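
As a rough CPU-side analogy in C (the `vec2f` struct and both function names are made up for illustration), the temp-based swap and the swizzle express the same thing: a new value built from the old one's components. This only shows that the two forms are semantically identical; it says nothing about GPU register latency.

```c
/* Hypothetical two-component vector, standing in for GLSL's vec2. */
typedef struct { float x, y; } vec2f;

/* Swap through a temporary, as in the first snippet of the question. */
static vec2f swap_with_temp(vec2f v) {
    float tmp = v.x;
    v.x = v.y;
    v.y = tmp;
    return v;
}

/* Swizzle-style swap: both components are read from the source value
   before anything is written, like v.xy = v.yx in GLSL. */
static vec2f swap_swizzle(vec2f v) {
    vec2f r = { v.y, v.x };
    return r;
}
```

Since the two functions are equivalent, it is not surprising that a compiler can reduce both to the same single move.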

I also did a few other tests to see how the compiler behaves.

Here is my proposed optimization:

Code:
void main() {
    float x = coord.x;
    float y = coord.y;
    float x4 = x*x; // x4 and y4 hold x^2 and y^2 for now; the squares are only needed for tenx2y2
    float y4 = y*y;
    float tenx2y2 = 10.*x4*y4;
    x4 = x4*x4; // now x^4
    y4 = y4*y4; // now y^4
    x = (5.*y4+1.-tenx2y2+x4)*x; // we don't need tempx anymore: y doesn't use x
    y = (y4-tenx2y2+5.*x4+1.)*y;
    gl_FragColor = vec4(x,y,1.0,1.0);
}

Which is compiled into

Code:
MUL.F R0.w, fragment.attrib[0].x, fragment.attrib[0].x;
MUL.F R0.x, fragment.attrib[0].y, fragment.attrib[0].y;
MUL.F R0.y, R0.x, R0.w;
MUL.F R0.y, R0, {10, 0, 0, 0}.x;
MUL.F R0.x, R0, R0;
ADD.F R0.z, R0.x, -R0.y;
MAD.F R0.x, R0, {5, 0, 0, 0}, -R0.y;
MUL.F R0.w, R0, R0;
ADD.F R0.x, R0, R0.w;
MAD.F R0.y, R0.w, {5, 0, 0, 0}.x, R0.z;
MAD.F result_color0.x, R0, fragment.attrib[0], fragment.attrib[0];
MAD.F result_color0.y, fragment.attrib[0], R0, fragment.attrib[0];
MOV.F result_color0.zw, {1, 0, 0, 0}.x;
END
# 13 instructions, 1 R-regs

Notice that these are all scalar operations (except the last one, which writes two components).

Here is Knighty's proposal:

Code:
void main() {
    vec2 v=coord.xy;
    vec2 v2=v*v;
    v=v*(5.*v2.yx*v2.yx + v2*(v2 - 10.*v2.yx) + 1.);
    gl_FragColor = vec4(v,1.0,1.0);
}

Which turns into

Code:
MUL.F R0.xy, fragment.attrib[0], fragment.attrib[0];
MAD.F R0.zw, -R0.xyyx, {10, 0, 0, 0}.x, R0.xyxy;
MUL.F R0.zw, R0.xyxy, R0;
MUL.F R0.xy, R0.yxzw, R0.yxzw;
MAD.F R0.xy, R0, {5, 0, 0, 0}.x, R0.zwzw;
MAD.F result_color0.xy, R0, fragment.attrib[0], fragment.attrib[0];
MOV.F result_color0.zw, {1, 0, 0, 0}.x;
END
# 7 instructions, 1 R-regs

Only 7 instructions here, but they are all two-component vector operations (corresponding to 14 scalar operations, the same as above).

And, as you might expect, the two shaders execute at exactly the same speed. So it appears there is no reason to explicitly vectorize your instructions.
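
For anyone who wants to convince themselves that the two shaders really compute the same thing, here is a small CPU-side sketch in C (not GLSL). The function names, the `cplx` struct and the tolerance are all made up for this check. Expanding (x+iy)^5 + (x+iy) gives the same polynomials, so a plain complex z^5 + z is used as the reference.

```c
/* Scalar formulation, transcribed from the first shader. */
static void step_scalar(double x, double y, double *ox, double *oy) {
    double x4 = x * x;              /* x^2 for now */
    double y4 = y * y;              /* y^2 for now */
    double tenx2y2 = 10.0 * x4 * y4;
    x4 = x4 * x4;                   /* x^4 */
    y4 = y4 * y4;                   /* y^4 */
    *ox = (5.0 * y4 + 1.0 - tenx2y2 + x4) * x;
    *oy = (y4 - tenx2y2 + 5.0 * x4 + 1.0) * y;
}

/* Vector formulation, with the .yx swizzles written out per component. */
static void step_vector(double x, double y, double *ox, double *oy) {
    double v2x = x * x, v2y = y * y;
    *ox = x * (5.0 * v2y * v2y + v2x * (v2x - 10.0 * v2y) + 1.0);
    *oy = y * (5.0 * v2x * v2x + v2y * (v2y - 10.0 * v2x) + 1.0);
}

/* Reference: z^5 + z using explicit complex multiplication. */
typedef struct { double re, im; } cplx;

static cplx cmul(cplx a, cplx b) {
    cplx r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

static cplx step_reference(cplx z) {
    cplx p = z;
    for (int i = 0; i < 4; ++i)     /* four multiplies: z -> z^5 */
        p = cmul(p, z);
    p.re += z.re;
    p.im += z.im;
    return p;
}

/* Relative comparison; the tolerance is an arbitrary choice. */
static int nearly_equal(double a, double b) {
    double d = a - b;
    double m = b < 0.0 ? -b : b;
    if (d < 0.0) d = -d;
    return d <= 1e-12 * (1.0 + m);
}
```

Feeding the same (x, y) to `step_scalar`, `step_vector` and `step_reference` should give results that agree up to floating-point rounding, which fits the observation that both shaders end up doing the same amount of arithmetic.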

I also tried AMD GPU ShaderAnalyzer (which lets you see what GLSL gets compiled into for the different AMD/ATI architectures), and as far as I could tell, the compiler does some automatic vectorization of scalar instructions, even for older cards. But I'm not sure I'm interpreting the assembly code the right way.
eiffie
Guest
« Reply #17 on: January 29, 2013, 04:57:24 PM »

Thanks to everyone for answering those lingering questions. In short, my GPU is smarter than me, so no need to worry about it.