Title: optimisation
Post by: Adam Majewski on January 18, 2013, 11:14:58 PM

Hi. I have made an image of the Julia set for f(z) = z + z^5. The image and code are here: http://commons.wikimedia.org/wiki/File:Julia_set_z%2Bz%5E5.png
I have tried to optimize the inner loop from:

Code:
tempx = 5*x*y*y*y*y - 10*x*x*x*y*y + x*x*x*x*x + x; // temporary variable
y = y*y*y*y*y - 10*x*x*y*y*y + 5*x*x*x*x*y + y;
x = tempx;

to:

Code:
tempx = 5*x*y4 - 10*x2my2*x + x4*x + x; // 5*x*y^4 - 10*x^3*y^2 + x^5 + x
y = y4*y - 10*x2my2*y + 5*x4*y + y;     // y^5 - 10*x^2*y^3 + 5*x^4*y + y
x = tempx;
x2 = x*x;
y2 = y*y;
y4 = y2*y2;
x4 = x2*x2;
x2y2 = x2 + y2;
x2my2 = x2*y2;

How should I do it? TIA

Title: Re: optimisation
Post by: eiffie on January 19, 2013, 05:43:25 PM

I can spot one optimisation - don't calculate x2y2 :) It is not used.
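(For reference, a hedged sketch of how the rearranged loop could be put together, with the temporaries computed before they are used and the bailout test done on x2 + y2. The function wrapper and the names iMax and ER2 (squared escape radius) are illustrative, not from the original post.)

Code:
int iterate(double x, double y, int iMax, double ER2)
{
    double x2 = x*x, y2 = y*y;          /* squares, reused below           */
    double x4 = x2*x2, y4 = y2*y2;      /* fourth powers                   */
    double x2my2 = x2*y2;               /* x^2 * y^2                       */
    int i;
    for (i = 0; i < iMax && x2 + y2 < ER2; ++i) {
        double tempx = 5.0*x*y4 - 10.0*x2my2*x + x4*x + x;  /* Re(z + z^5) */
        y            = y4*y - 10.0*x2my2*y + 5.0*x4*y + y;  /* Im(z + z^5) */
        x = tempx;
        x2 = x*x;  y2 = y*y;            /* refresh temporaries for the     */
        x4 = x2*x2; y4 = y2*y2;         /* next pass and the bailout test  */
        x2my2 = x2*y2;
    }
    return i;                           /* iteration count for colouring   */
}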
Title: Re: optimisation
Post by: asimes on January 19, 2013, 06:55:29 PM

Maybe this:
Code:
x2 = x*x;

Title: Re: optimisation
Post by: lycium on January 19, 2013, 10:16:17 PM

Vectorise it with SSE2 (http://en.wikipedia.org/wiki/SSE2) and use streaming writes to aligned memory. FLOPs are essentially "free" on modern architectures; memory access is the key issue.
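(To make lycium's suggestion concrete, here is a hedged sketch - not code from the thread - of one z -> z + z^5 step for four single-precision points at once using SSE intrinsics; the same pattern works for pairs of doubles with __m128d and the _pd intrinsics from SSE2. The streaming stores lycium mentions, e.g. _mm_stream_ps to 16-byte-aligned memory, would come in when the finished pixel values are written out.)

Code:
#include <xmmintrin.h>                      /* SSE: __m128 and _mm_*_ps    */

/* One iteration step for four points held in xs (real parts) and ys
   (imaginary parts), using the factored forms
   x' = (5*y^4 - 10*x^2*y^2 + x^4 + 1)*x  and
   y' = (5*x^4 - 10*x^2*y^2 + y^4 + 1)*y.                                  */
static void step4(__m128 *xs, __m128 *ys)
{
    __m128 x  = *xs,               y  = *ys;
    __m128 x2 = _mm_mul_ps(x, x),  y2 = _mm_mul_ps(y, y);
    __m128 x4 = _mm_mul_ps(x2, x2), y4 = _mm_mul_ps(y2, y2);
    __m128 ten_x2y2 = _mm_mul_ps(_mm_set1_ps(10.0f), _mm_mul_ps(x2, y2));
    __m128 one  = _mm_set1_ps(1.0f);
    __m128 five = _mm_set1_ps(5.0f);
    __m128 px = _mm_add_ps(_mm_sub_ps(_mm_mul_ps(five, y4), ten_x2y2),
                           _mm_add_ps(x4, one));
    __m128 py = _mm_add_ps(_mm_sub_ps(_mm_mul_ps(five, x4), ten_x2y2),
                           _mm_add_ps(y4, one));
    *xs = _mm_mul_ps(px, x);
    *ys = _mm_mul_ps(py, y);
}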
Title: Re: optimisation
Post by: Syntopia on January 20, 2013, 12:23:57 AM

Quote from: asimes
Maybe this:
Code:
x2 = x*x;

That's 16 muls, 6 adds, 1 unary minus, 7 temp vars. You can do better than that - for instance this one:

Code:
float x4 = x*x; // we only use x2 or y2 for tenx2y2

10 muls, 6 adds/subs, 3 temp vars. Compared to the original code (with 28 muls!) this gave a modest 33% speedup (on a GPU).

Title: Re: optimisation
Post by: eiffie on January 22, 2013, 04:57:32 PM

Syntopia, I am curious - I don't know much about GPU architecture - does it help to vectorize on the GPU as well? I work with vectors because I'm usually more concerned about optimizing my typing :)
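(Syntopia's code block above is cut off after its first line. A hedged reconstruction that matches the stated counts - 10 muls, 6 adds/subs - and the (5.*y4+1.-tenx2y2+x4)*x form quoted by panzerboy in the next post might look like the sketch below; it is a guess at the idea, not Syntopia's actual code.)

Code:
static void step(float *px, float *py)      /* one z -> z + z^5 step       */
{
    float x = *px, y = *py;
    float x4 = x*x;                         /* holds x^2 for the moment    */
    float y4 = y*y;                         /* holds y^2 for the moment    */
    float tenx2y2 = 10.0f*x4*y4;            /* the only use of x^2 and y^2 */
    x4 *= x4;                               /* now x^4                     */
    y4 *= y4;                               /* now y^4                     */
    *px = (5.0f*y4 + 1.0f - tenx2y2 + x4) * x;
    *py = (5.0f*x4 + 1.0f - tenx2y2 + y4) * y;
}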
Title: Re: optimisation
Post by: panzerboy on January 22, 2013, 11:50:54 PM

eiffie, x2y2 is used to test against the bailout value in a Julia or Mandelbrot program. Technically you're testing sqrt(x2+y2) > bailout, but it's much faster to test x2+y2 > bailout^2, square roots being computationally prohibitive.

Syntopia, see above. You still probably want to keep the x2 and y2 temp variables for testing against the bailout. Interesting that you change
Code:
x = (5.*y4-tenx2y2+x4)*x +x;
to be
Code:
x = (5.*y4+1.-tenx2y2+x4)*x;
I'm guessing x86 can store a floating-point constant directly in the machine code instruction without reference to a memory location? So I looked up the x86 assembler wiki and found the FLD1 operation. Cool! It loads a floating-point 1 in one machine code instruction; that's got to be faster than referencing a variable off the stack, as long as the compiler is smart enough to do this. I don't think there is an equivalent for ARM - well, that's CISC vs RISC for you.

Title: Re: optimisation
Post by: fractower on January 23, 2013, 12:29:50 AM

Quote from: eiffie
Syntopia, I am curious - I don't know much about GPU architecture - does it help to vectorize on the GPU as well? I work with vectors because I'm usually more concerned about optimizing my typing :)

I hope Syntopia won't mind my trying to answer this one. SIMD vectors are generally 8 to 16 double or single floats which are all treated exactly the same. Unfortunately 3D vectors don't always fit nicely in this model. This is especially true for something like a bulb calculation, which treats one dimension differently.

To make best use of SIMD processors it is often best to structure your memory as structures of arrays instead of arrays of structures. An array of structures, Vect3D P[N];, produces an array with the x, y and z components sequential in memory. A structure of arrays keeps separate arrays x[N]; y[N]; z[N];. To vectorize the array of structures, the SIMD processor must perform a gather to get all the components into separate vector registers. With the structure of arrays the memory is already SIMD friendly. Since it is a pain to code in SoA, the gather operations are constantly getting faster and compilers are getting better. If you don't like my answer, just wait a GPU generation or two.

Title: Re: optimisation
Post by: Syntopia on January 24, 2013, 11:16:28 PM

Quote from: eiffie
Syntopia, I am curious - I don't know much about GPU architecture - does it help to vectorize on the GPU as well? I work with vectors because I'm usually more concerned about optimizing my typing :)

Quote from: fractower
I hope Syntopia won't mind my trying to answer this one. SIMD vectors are generally 8 to 16 double or single floats which are all treated exactly the same. Unfortunately 3D vectors don't always fit nicely in this model.

I'm no expert on GPUs, but my understanding is quite different. I don't think it matters at all whether you use vec3 or single components in GLSL code. A modern GPU will not try to parallelize an operation such as vec3 addition across multiple threads (i.e. it will not assign one thread to the x addition and another to the y addition). Instead, all the 8 or 16 stream processors on an Nvidia multiprocessor will perform exactly the same instructions, on all components of the vector. I'm pretty sure any GLSL program will perform exactly the same whether you code component-wise or vector-wise, but it would be easy to test.
GPUs would be really difficult to program if you could only harvest the parallelism when using special vector operations on special data types, like the SIMD instructions on a CPU.

Title: Re: optimisation
Post by: eiffie on January 25, 2013, 06:10:47 PM

It makes sense to me now - the pixel fragments act as the array and they are done in parallel. vecs and mats are just for optimizing our thinking :)
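(A hedged sketch in C of the two memory layouts fractower contrasts; the array names and N are illustrative.)

Code:
#define N 1024

/* Array of structures: the x, y, z of each point sit next to each other,
   so loading four consecutive x's into a SIMD register needs a gather.   */
typedef struct { float x, y, z; } Vect3D;
Vect3D points_aos[N];

/* Structure of arrays: all x's are contiguous, all y's are contiguous,
   so a SIMD unit can load 4/8/16 neighbouring x's with one aligned read. */
struct {
    float x[N];
    float y[N];
    float z[N];
} points_soa;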
Title: Re: optimisation
Post by: lycium on January 25, 2013, 06:41:29 PM

On the other hand, I am an expert, but was ignored :tease:
In case anyone still cares: FLOPs are pretty much free on modern architectures; it's the memory access that matters most for performance. On GPUs which are not only MIMD (as they all are) but also SIMD (as older Radeons before GCN were), vectorising does help with floating-point performance, but this is not always possible, so NVIDIA's scalar architecture often won, which is why AMD now do the same in GCN.

Title: Re: optimisation
Post by: Syntopia on January 25, 2013, 09:53:54 PM

Quote from: lycium
On the other hand, I am an expert, but was ignored :tease:

In case anyone still cares: FLOPs are pretty much free on modern architectures; it's the memory access that matters most for performance. On GPUs which are not only MIMD (as they all are) but also SIMD (as older Radeons before GCN were), vectorising does help with floating-point performance, but this is not always possible, so NVIDIA's scalar architecture often won, which is why AMD now do the same in GCN.

At least for the GPU fractal stuff I do, the FLOPs are certainly the bottleneck. For raymarching fractals you don't need much memory access - the few uniforms you pass to the shader will reside in local memory, and each fragment shader writes just a single value to the frame buffer object as output. Interesting that AMD only recently switched to a scalar architecture - I didn't know that.

Title: Re: optimisation
Post by: knighty on January 25, 2013, 11:26:53 PM

Maybe you will find this document useful:
www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf

BTW, the formula could be written in this form:
Code:
vec2 v=vec2(x,y);

Title: Re: optimisation
Post by: eiffie on January 26, 2013, 06:52:33 PM

One last question and I'll shut up.
This creates a data dependency...
Code:
float tmp=x;
x=y; //wait for memory to settle
y=tmp;
But does this???
Code:
v.xy=v.yx;
It must, right? Or, since tmp wasn't used in arithmetic, does the first example not need to wait?

Title: Re: optimisation
Post by: Adam Majewski on January 26, 2013, 09:14:08 PM

Quote from: lycium
Vectorise it with SSE2 (http://en.wikipedia.org/wiki/SSE2) and use streaming writes to aligned memory. FLOPs are essentially "free" on modern architectures; memory access is the key issue.

I use a 1D memory array of unsigned chars for saving the colors (shades of grey, from 0 to 255) of the pixels. Do you mean that I should think about other structures, for example a 2D memory array?

Title: Re: optimisation
Post by: lycium on January 27, 2013, 01:33:59 AM

Quote from: Adam Majewski
I use a 1D memory array of unsigned chars for saving the colors (shades of grey, from 0 to 255) of the pixels.

Nope, you simply process many of them (e.g. 4+) in parallel, computing them together and then writing the results out in one batch. Besides the vectorisation potential, this also reduces loop overhead (a technique called unrolling).

Quote from: Adam Majewski
Do you mean that I should think about other structures, for example a 2D memory array?

Title: Re: optimisation
Post by: Syntopia on January 27, 2013, 11:46:20 PM

Quote from: eiffie
One last question and I'll shut up. This creates a data dependency...
Code:
float tmp=x;
x=y; //wait for memory to settle
y=tmp;
But does this???
Code:
v.xy=v.yx;
It must, right? Or, since tmp wasn't used in arithmetic, does the first example not need to wait?

I did a few tests with Nvidia's NVEmulate, which allows dumping compiled GLSL. In your case, using a temp or using swizzling compiled to the exact same code, turning both into a swizzle:
Code:
MOV.F result_color0.xy, fragment.attrib[0].yxzw;
I think swizzles are basically free on GPUs, because the GPU instructions seem to specify a swizzle mask - so I'm not sure the GPU would actually need to wait for the register latency in this case.

I also did a few other tests to see how the compiler behaves. Here is my proposed optimization:
Code:
void main() {
Which is compiled into
Code:
MUL.F R0.w, fragment.attrib[0].x, fragment.attrib[0].x;
Notice that the operations are scalar operations (except the last one). Here is Knighty's proposal:
Code:
void main() {
Which turns into
Code:
MUL.F R0.xy, fragment.attrib[0], fragment.attrib[0];
Only 7 instructions here, but they are all two-component vector operations (corresponding to 14 scalar operations, as above). And, as you might expect, the two fragments execute at exactly the same speed. So it appears there is no reason to explicitly vectorize your instructions.

I also tried AMD GPU ShaderAnalyzer (which lets you see what GLSL gets compiled into for all the different AMD/ATI architectures) - and as far as I could tell, the compiler will do some automatic vectorization of scalar instructions, even for older cards. But I'm not sure I'm interpreting the assembly code the right way.

Title: Re: optimisation
Post by: eiffie on January 29, 2013, 04:57:24 PM

Thanks to everyone for answering those lingering questions. In short, my GPU is smarter than me, so no need to worry about it.
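(Tying it back to Adam's 1-D buffer of unsigned chars, a hedged sketch of the batching/unrolling idea lycium describes: compute a few pixels per loop pass and write them out together. The function render_row and the per-pixel routine iterate_point are illustrative names, not from the thread.)

Code:
#include <string.h>

extern unsigned char iterate_point(int ix, int iy);  /* assumed per-pixel routine */

/* Render one row, four pixels per loop pass; a tail loop for widths that
   are not a multiple of 4 is omitted for brevity.                          */
void render_row(unsigned char *colour, int width, int iy)
{
    int ix, k;
    for (ix = 0; ix + 4 <= width; ix += 4) {
        unsigned char c[4];
        for (k = 0; k < 4; ++k)                /* four independent pixels -  */
            c[k] = iterate_point(ix + k, iy);  /* natural SIMD candidates    */
        memcpy(&colour[iy*width + ix], c, 4);  /* one small batched write    */
    }
}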