Adam Majewski
« on: January 18, 2013, 11:14:58 PM »
Hi. I have made an image of the Julia set for f(z) = z + z^5. The image and code are here: http://commons.wikimedia.org/wiki/File:Julia_set_z%2Bz%5E5.png

I have tried to optimise the inner loop from:

Code:
tempx = 5*x*y*y*y*y - 10*x*x*x*y*y + x*x*x*x*x + x; // temporary variable
y = y*y*y*y*y - 10*x*x*y*y*y + 5*x*x*x*x*y + y;
x = tempx;

to:

Code:
tempx = 5*x*y4 - 10*x2my2*x + x4*x + x; // 5*x*y^4 - 10*x^3*y^2 + x^5 + x
y = y4*y - 10*x2my2*y + 5*x4*y + y;     // y^5 - 10*x^2*y^3 + 5*x^4*y + y
x = tempx;
x2 = x*x;
y2 = y*y;
y4 = y2*y2;
x4 = x2*x2;
x2y2 = x2 + y2;
x2my2 = x2 * y2;

How should I do it? TIA
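One way to arrange those recurrences (a sketch, not from the original post - the function name, iMax and the bailout radius 2.0 are assumed values): prime the temporaries before the loop, then refresh them at the bottom of each iteration so the next pass can reuse them.

Code:
/* sketch: iterate z -> z + z^5 with the temporaries primed before the
   loop; iMax and the bailout radius 2.0 are assumptions, not from the post */
int iterate(double x, double y, int iMax)
{
    double x2 = x*x, y2 = y*y;
    double x4 = x2*x2, y4 = y2*y2;
    double x2y2  = x2 + y2;  /* |z|^2, kept for the bailout test */
    double x2my2 = x2 * y2;  /* x^2 * y^2 */
    int i;
    for (i = 0; i < iMax && x2y2 <= 4.0; ++i) {
        double tempx = 5*x*y4 - 10*x2my2*x + x4*x + x; /* 5xy^4 - 10x^3y^2 + x^5 + x */
        y = y4*y - 10*x2my2*y + 5*x4*y + y;            /* y^5 - 10x^2y^3 + 5x^4y + y */
        x = tempx;
        x2 = x*x;   y2 = y*y;   /* refresh the temporaries for the next pass */
        x4 = x2*x2; y4 = y2*y2;
        x2y2  = x2 + y2;
        x2my2 = x2 * y2;
    }
    return i;
}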
eiffie
« Reply #1 on: January 19, 2013, 05:43:25 PM »
I can spot one optimisation: don't calculate x2y2, it is not used.
asimes
« Reply #2 on: January 19, 2013, 06:55:29 PM »
Maybe this:

Code:
x2 = x*x;  x4 = x2*x2;
y2 = y*y;  y4 = y2*y2;
xt5 = x*5;  xt_10 = x*-10;
tempx = xt5*y4 + xt_10*x2*y2 + x4*x + x;
y = y4*y + xt_10*x*y2*y + xt5*x2*x*y + y;
x = tempx;
lycium
« Reply #3 on: January 19, 2013, 10:16:17 PM »
Vectorise it with SSE2 and use streaming writes to aligned memory. FLOPs are essentially "free" on modern architectures; memory access is the key issue.
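A minimal sketch of the compute half of that suggestion (not lycium's code; double precision and two points per 128-bit register are assumptions - the streaming-write half is sketched further down the thread):

Code:
#include <emmintrin.h>  /* SSE2 intrinsics */

/* One z -> z + z^5 step for two points at once, using the factored form
   x' = x*(5y^4 - 10x^2y^2 + x^4 + 1), y' = y*(y^4 - 10x^2y^2 + 5x^4 + 1),
   which is algebraically equal to z + z^5 componentwise. */
static void step2(__m128d *x, __m128d *y)
{
    const __m128d one  = _mm_set1_pd(1.0);
    const __m128d five = _mm_set1_pd(5.0);
    const __m128d ten  = _mm_set1_pd(10.0);
    __m128d x2 = _mm_mul_pd(*x, *x);
    __m128d y2 = _mm_mul_pd(*y, *y);
    __m128d x4 = _mm_mul_pd(x2, x2);
    __m128d y4 = _mm_mul_pd(y2, y2);
    __m128d t  = _mm_mul_pd(ten, _mm_mul_pd(x2, y2));  /* 10*x^2*y^2 */
    /* x' = x * (5y^4 - 10x^2y^2 + x^4 + 1) */
    *x = _mm_mul_pd(*x, _mm_add_pd(_mm_sub_pd(_mm_mul_pd(five, y4), t),
                                   _mm_add_pd(x4, one)));
    /* y' = y * (y^4 - 10x^2y^2 + 5x^4 + 1); x4 and t were computed from
       the old x, so no tempx is needed */
    *y = _mm_mul_pd(*y, _mm_add_pd(_mm_sub_pd(y4, t),
                                   _mm_add_pd(_mm_mul_pd(five, x4), one)));
}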
Syntopia
« Reply #4 on: January 20, 2013, 12:23:57 AM »
Quote from: asimes on January 19, 2013, 06:55:29 PM
x2 = x*x; x4 = x2*x2; y2 = y*y; y4 = y2*y2; xt5 = x*5; xt_10 = x*-10;
tempx = xt5*y4 + xt_10*x2*y2 + x4*x + x; y = y4*y + xt_10*x*y2*y + xt5*x2*x*y + y; x = tempx;

That's 16 muls, 6 adds, 1 unary minus, 7 temp vars. You can do better than that - for instance this one:

Code:
float x4 = x*x; // we only use x2 or y2 for tenx2y2
float y4 = y*y;
float tenx2y2 = 10.*x4*y4;
x4 = x4*x4;
y4 = y4*y4;
x = (5.*y4+1.-tenx2y2+x4)*x; // we don't need tempx anymore: y doesn't use x
y = (y4-tenx2y2+5.*x4+1.)*y;

That's 10 muls, 6 adds/subs, 3 temp vars. Compared to the original code (with 28 muls!) this gave a modest 33% speedup (on a GPU).
eiffie
« Reply #5 on: January 22, 2013, 04:57:32 PM »
Syntopia, I am curious - I don't know much about GPU architecture - does it help to vectorize on the GPU as well? I work with vectors because I'm usually more concerned about optimizing my typing.
panzerboy
« Reply #6 on: January 22, 2013, 11:50:54 PM »
eiffie, x2y2 is used to test against the bailout value in a Julia or Mandelbrot program. Technically you're testing against sqrt(x^2+y^2) > bailout, but it's much faster to test x^2+y^2 > bailout^2, square roots being computationally prohibitive.

Syntopia, see above - you still probably want to keep the x2 and y2 temp variables for testing against the bailout. Interesting that you change

Code:
x = (5.*y4-tenx2y2+x4)*x + x;
y = (y4-tenx2y2+5.*x4)*y + y;

to be

Code:
x = (5.*y4+1.-tenx2y2+x4)*x;
y = (y4-tenx2y2+5.*x4+1.)*y;

Placing that floating-point constant one inside the brackets and removing the addition of a variable suggests there's some efficiency to be gained. I'm guessing x86 can store a floating-point constant directly in the machine code instruction without reference to a memory location? So I looked up the x86 assembler wiki and found the FLD1 operation. Cool! It loads a floating-point 1 in one machine code instruction; that's got to be faster than referencing a variable off the stack, as long as the compiler is smart enough to do this. I don't think there is an equivalent for ARM - well, that's CISC vs RISC for you.
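In C, the squared-magnitude test described above is just (a sketch; the bailout radius 2.0 is an assumed value):

Code:
#include <stdbool.h>

/* sketch: escape test against the squared bailout radius;
   the radius 2.0 is an assumption, not from the thread */
static bool escaped(double x, double y)
{
    const double bailout = 2.0;
    /* compare |z|^2 = x^2 + y^2 against bailout^2 instead of
       computing sqrt(x^2 + y^2) every iteration */
    return x*x + y*y > bailout * bailout;
}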
fractower
« Reply #7 on: January 23, 2013, 12:29:50 AM »
Quote from: eiffie on January 22, 2013, 04:57:32 PM
Syntopia, I am curious - I don't know much about GPU architecture - does it help to vectorize on the GPU as well? I work with vectors because I'm usually more concerned about optimizing my typing.

I hope Syntopia won't mind my trying to answer this one. SIMD vectors are generally 8 to 16 double or single floats which are all treated exactly the same. Unfortunately, 3D vectors don't always fit nicely in this model. This is especially true for something like a bulb calculation, which treats one dimension differently.

To make best use of SIMD processors it is often best to structure your memory as structures of arrays instead of arrays of structures. An array of structures, Vect3D P[N];, produces an array with the x, y and z components sequential in memory. A structure of arrays keeps separate arrays x[N]; y[N]; z[N]; - see the sketch below.

In order for the SIMD processor to vectorize the array of structures, it must perform a gather to get all the components into separate vector registers. With the structure of arrays the memory is already SIMD friendly. Since it is a pain to code in SoA, gather operations are constantly getting faster and compilers are getting better. If you don't like my answer, just wait a GPU generation or two.
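Spelled out in C, the two layouts look like this (a sketch; the names and N are illustrative):

Code:
#define N 1024

/* array of structures (AoS): x, y, z interleaved in memory, so a SIMD
   load of 4 consecutive floats grabs x0 y0 z0 x1 and must be shuffled
   (gathered) apart before vector arithmetic */
typedef struct { float x, y, z; } Vect3D;
Vect3D p_aos[N];

/* structure of arrays (SoA): each component contiguous, so a SIMD load
   of 4 consecutive floats grabs x0 x1 x2 x3 directly */
struct {
    float x[N];
    float y[N];
    float z[N];
} p_soa;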
Syntopia
« Reply #8 on: January 24, 2013, 11:16:28 PM »
Quote from: fractower on January 23, 2013, 12:29:50 AM
SIMD vectors are generally 8 to 16 double or single floats which are all treated exactly the same. Unfortunately, 3D vectors don't always fit nicely in this model.

I'm no expert on GPUs, but my understanding is quite different. I don't think it matters at all whether you use vec3 or single components in GLSL code. A modern GPU will not try to parallelise an operation such as vec3 addition onto multiple threads (i.e. it will not assign one thread to the x addition and another to the y addition). Instead, all 8 or 16 stream processors on an Nvidia multiprocessor will perform exactly the same instructions, on all components of the vector. I'm pretty sure any GLSL program will perform exactly the same whether you code component-wise or vector-wise, but it would be easy to test. GPUs would be really difficult to program if you could only harvest the parallelism when using special vector operations on special data types, like the SIMD instructions on a CPU.
eiffie
« Reply #9 on: January 25, 2013, 06:10:47 PM »
It makes sense to me now - the pixel fragments act as the array and they are done in parallel. vecs and mats are just there to optimize our thinking.
lycium
« Reply #10 on: January 25, 2013, 06:41:29 PM »
On the other hand, I am an expert, but was ignored. In case anyone still cares: FLOPs are pretty much free on modern architectures; it's the memory access that matters most for performance. On GPUs which are not only MIMD (as they all are) but also SIMD (as older Radeons before GCN were), vectorising does help with floating-point performance, but this is not always possible, so NVIDIA's scalar architecture often won - which is why AMD now do the same in GCN.
Syntopia
« Reply #11 on: January 25, 2013, 09:53:54 PM »
Quote from: lycium on January 25, 2013, 06:41:29 PM
FLOPs are pretty much free on modern architectures; it's the memory access that matters most for performance. On GPUs which are not only MIMD (as they all are) but also SIMD (as older Radeons before GCN were), vectorising does help with floating-point performance, but this is not always possible, so NVIDIA's scalar architecture often won - which is why AMD now do the same in GCN.

At least for the GPU fractal stuff I do, the FLOPs are certainly the bottleneck. For raymarching fractals you don't need much memory access - the few uniforms you pass to the shader will reside in local memory, and each fragment shader writes just a single value to the frame buffer object as output. Interesting that AMD only recently switched to a scalar architecture - I didn't know that.
knighty
« Reply #12 on: January 25, 2013, 11:26:53 PM »
Maybe you will find this document useful: www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf

BTW, the formula could be written in this form:

Code:
vec2 v = vec2(x, y);
vec2 v2 = v*v;
v = v*(5.*v2.yx*v2.yx + v2*(v2 - 10.*v2.yx) + 1.);
eiffie
« Reply #13 on: January 26, 2013, 06:52:33 PM »
One last question and I'll shut up. This creates a data dependency:

Code:
float tmp = x;
x = y; // wait for memory to settle
y = tmp;

But does this?

Code:
v.xy = v.yx;

It must, right? Or, since tmp wasn't used in arithmetic, does the first example not need to wait?
Adam Majewski
« Reply #14 on: January 26, 2013, 09:14:08 PM »
Quote from: lycium on January 19, 2013, 10:16:17 PM
Vectorise it with SSE2 and use streaming writes to aligned memory. FLOPs are essentially "free" on modern architectures; memory access is the key issue.

I use a 1D memory array of unsigned chars for saving the colours (shades of grey, from 0 to 255) of pixels. Do you mean that I should think about other structures, for example a 2D memory array?
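For what it's worth, lycium's streaming-write suggestion applied to a 1D unsigned-char buffer might look roughly like this (a sketch under assumptions: the image size and the constant grey value are placeholders, and real code would compute 16 pixel shades before each store):

Code:
#include <emmintrin.h>  /* SSE2 streaming stores */
#include <stdlib.h>     /* aligned_alloc (C11) */

enum { WIDTH = 800, HEIGHT = 600 };  /* assumed image size */

int main(void)
{
    /* 16-byte alignment lets us use 128-bit streaming stores */
    unsigned char *img = aligned_alloc(16, WIDTH * HEIGHT);
    __m128i grey = _mm_set1_epi8((char)128);  /* 16 identical placeholder pixels */
    for (size_t i = 0; i < (size_t)WIDTH * HEIGHT; i += 16)
        _mm_stream_si128((__m128i *)(img + i), grey);  /* write, bypassing the cache */
    _mm_sfence();  /* make the streaming stores visible before the buffer is read */
    free(img);
    return 0;
}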