Logo by AGUS - Contribute your own Logo!

END OF AN ERA, FRACTALFORUMS.COM IS CONTINUED ON FRACTALFORUMS.ORG

it was a great time but no longer maintainable by c.Kleinhuis contact him for any data retrieval,
thanks and see you perhaps in 10 years again

this forum will stay online for reference
News: Visit us on facebook
 
*
Welcome, Guest. Please login or register. December 02, 2025, 03:27:10 PM


Login with username, password and session length


The All New FractalForums is now in Public Beta Testing! Visit FractalForums.org and check it out!


Pages: [1] 2   Go Down
  Print  
Share this topic on DiggShare this topic on FacebookShare this topic on GoogleShare this topic on RedditShare this topic on StumbleUponShare this topic on Twitter
Author Topic: optimisation  (Read 2054 times)
0 Members and 1 Guest are viewing this topic.
Adam Majewski
Fractal Lover
**
Posts: 221


WWW
« on: January 18, 2013, 11:14:58 PM »

Hi. I have made image of Julia set for f(z) = z+z^5. Image and code are here :  http://commons.wikimedia.org/wiki/File:Julia_set_z%2Bz%5E5.png
I have tried to optimize inner loop  from :

 tempx = 5*x*y*y*y*y-10*x*x*x*y*y + x*x*x*x*x + x ; // temporary variable
 y = y*y*y*y*y -10*x*x*y*y*y + 5*x*x*x*x*y + y ;
 x=tempx;

to

 tempx = 5*x*y4 -10*x2my2*x + x4*x + x ; // 5*x*y^4-10*x^3*y^2+x^5+x
 y = y4*y -10*x2my2*y + 5*x4*y + y ; // y^5-10*x^2*y^3+5*x^4*y+y
 x=tempx;
 x2=x*x;
 y2=y*y;
 y4=y2*y2;
 x4=x2*x2;   
 x2y2= x2+y2;
 x2my2= x2*y2;

How I should do it ?

TIA
Logged
eiffie
Guest
« Reply #1 on: January 19, 2013, 05:43:25 PM »

I can spot 1 optimisation - don't calculate x2y2 smiley it is not used.
« Last Edit: January 19, 2013, 05:45:20 PM by eiffie, Reason: spell check » Logged
asimes
Fractal Lover
**
Posts: 212



asimes
WWW
« Reply #2 on: January 19, 2013, 06:55:29 PM »

Maybe this:

Code:
x2 = x*x;
x4 = x2*x2;
y2 = y*y;
y4 = y2*y2;
xt5 = x*5;
xt_10 = x*-10;

tempx = xt5*y4 + xt_10*x2*y2 + x4*x + x;
y = y4*y + xt_10*x*y2*y + xt5*x2*x*y + y;
x = tempx;
Logged
lycium
Fractal Supremo
*****
Posts: 1158



WWW
« Reply #3 on: January 19, 2013, 10:16:17 PM »

Vectorise it with SSE2, use streaming writes to aligned memory. FLOPs are essentially "free" on modern architectures, memory access is the key issue.
Logged

Syntopia
Fractal Molossus
**
Posts: 681



syntopiadk
WWW
« Reply #4 on: January 20, 2013, 12:23:57 AM »

Maybe this:

Code:
x2 = x*x;
x4 = x2*x2;
y2 = y*y;
y4 = y2*y2;
xt5 = x*5;
xt_10 = x*-10;

tempx = xt5*y4 + xt_10*x2*y2 + x4*x + x;
y = y4*y + xt_10*x*y2*y + xt5*x2*x*y + y;
x = tempx;

Thats 16 muls, 6 adds, 1 unary minus, 7 temp vars.

You can do better than that - for instance this one:

Code:
float x4 = x*x; // we only use x2 or y2 for tenx2y2
float y4 = y*y;
float tenx2y2 = 10.*x4*y4;
x4 = x4*x4;
y4 = y4*y4;
x = (5.*y4+1.-tenx2y2+x4)*x; // we don't need tempx anymore: y doesn't use x
y = (y4-tenx2y2+5.*x4+1.)*y;


10 muls, 6 adds/subs, 3 temp vars

Compared to the original code (with 28 muls!) this gave a modest 33% speedup (on a GPU).

Logged
eiffie
Guest
« Reply #5 on: January 22, 2013, 04:57:32 PM »

Syntopia I am curious - I don't know much about GPU architecture - does it help to vectorize on the GPU as well? I work with vectors because I'm usually more concerned about optimizing my typing smiley
Logged
panzerboy
Fractal Lover
**
Posts: 242


« Reply #6 on: January 22, 2013, 11:50:54 PM »


eiffie,
 x2y2 is used to test against the bailout value in a julia or mandelbrot program.
I guess technically you're testing against sqrt(x)+sqrt(y) > bailout, but it's much faster to test x2+y2>bailout2
Square roots being computationally prohibitive.

Syntopia,
See above. You still probably want to keep the x2 and y2 temp variables for testing against the bailout.
Interesting that you change

Code:
x = (5.*y4-tenx2y2+x4)*x +x;
y = (y4-tenx2y2+5.*x4)*y +y;

to be

Code:
x = (5.*y4+1.-tenx2y2+x4)*x;
y = (y4-tenx2y2+5.*x4+1.)*y;
Placing that (floating point constant) one inside the brackets and removing the addition of a variable suggests there's some efficiency to be gained.
I'm guessing x86 can store a floating point constant directly in the machine code instruction without reference to a memory location?
So I looked up the x86 assembler wiki and found the FLD1 operation. Cool!
It loads a floatingpoint 1 in one machine code instruction, well that's got to be faster than referencing a variable off the stack.
As long as the compiler is smart enough to do this.
Don't think there is an equivalent for ARM, well that's CISC vs RISC for you.
« Last Edit: January 23, 2013, 03:45:33 AM by panzerboy, Reason: weird [/code] tags appended to end? » Logged
fractower
Iterator
*
Posts: 173


« Reply #7 on: January 23, 2013, 12:29:50 AM »

Syntopia I am curious - I don't know much about GPU architecture - does it help to vectorize on the GPU as well? I work with vectors because I'm usually more concerned about optimizing my typing smiley

I hope Syntopia won't mind my trying to answer this one.

SIMD vectors are generally 8 to 16 double or single floats which are all treated exactly the same. Unfortunately 3D vectors don't always fit nicely in this model. This is especially true for something like a bulb calculation which treats one dimension differently. To make best use of SIMD processors it is often best to structure your memory in structures of arrays instead of arrays of structures.

Example of an array of structures

Vect3D P[N];

produces an array with the x, y and z components sequential in memory.

Example of a structure of arrays:

x[N];
y[N];
z[N];

In order for the SIMD processor to vectorize the array of structures it must perform a gather to get all the components in separate vector registers. With the structure of arrays the memory is already SIMD friendly. Since it is a pain to code in SOA the gather operations are constantly getting faster and compilers getting better. If you don't like my answer just wait a GPU generation or two..
Logged
Syntopia
Fractal Molossus
**
Posts: 681



syntopiadk
WWW
« Reply #8 on: January 24, 2013, 11:16:28 PM »

Syntopia I am curious - I don't know much about GPU architecture - does it help to vectorize on the GPU as well? I work with vectors because I'm usually more concerned about optimizing my typing smiley

I hope Syntopia won't mind my trying to answer this one.

SIMD vectors are generally 8 to 16 double or single floats which are all treated exactly the same. Unfortunately 3D vectors don't always fit nicely in this model.

I'm no expert on GPU's, but my understanding is quite different.

I don't think it matters at all whether you use vec3 or single components in GLSL code. A modern GPU will not try to parallellize an operation such as vec3 addition onto multiple threads (i.e. it will not assign one thread to the x addition, and another to the y-addition). Instead all the 8 or 16 stream processors on an Nvidia multiprocessor will perform exactly the same instructions, on all components of the vector.

I'm pretty sure any GLSL program will perform exactly the same whether you code component-wise or vector-wise, but it would be easy to test.

GPU would be really difficult to program, if you only could harvest the parallellism when using special vector operations on special data types, like the SIMD instructions on a CPU.
Logged
eiffie
Guest
« Reply #9 on: January 25, 2013, 06:10:47 PM »

It makes sense to me now - the pixel fragments act as the array and they are done in parallel. vecs and mats are just for optimize our thinking smiley
« Last Edit: January 25, 2013, 06:50:35 PM by eiffie » Logged
lycium
Fractal Supremo
*****
Posts: 1158



WWW
« Reply #10 on: January 25, 2013, 06:41:29 PM »

On the other hand I am an expert, but was ignored  tease

In case anyone still cares: FLOPs are pretty much free on modern architectures, it's the memory access that matters most for performance. On GPUs which are not only MIMD (as they all are) but also SIMD (as older Radeons before GCN were), vectorising does help with floating point performance, but this is not always possible, so NVIDIA's scalar architecture often won, which is why AMD now do the same in GCN.
Logged

Syntopia
Fractal Molossus
**
Posts: 681



syntopiadk
WWW
« Reply #11 on: January 25, 2013, 09:53:54 PM »

On the other hand I am an expert, but was ignored  tease

In case anyone still cares: FLOPs are pretty much free on modern architectures, it's the memory access that matters most for performance. On GPUs which are not only MIMD (as they all are) but also SIMD (as older Radeons before GCN were), vectorising does help with floating point performance, but this is not always possible, so NVIDIA's scalar architecture often won, which is why AMD now do the same in GCN.

At least for the GPU fractal stuff I do, the FLOPs are certainly the bottleneck. For raymarching fractals, you don't need much memory access - the few uniforms you pass to the shader will reside in local memory. And each fragment shader writes just a single value to the frame buffer object as output.

Interesting that AMD only recently switched to a scalar architecture - I didn't know that.
Logged
knighty
Fractal Iambus
***
Posts: 819


« Reply #12 on: January 25, 2013, 11:26:53 PM »

Maybe you will find this document useful:
www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf

BTW, the formula could be written in this form:
Code:
vec2 v=vec2(x,y);
vec2 v2=v*v;
v=v*(5.*v2.yx*v2.yx + v2*(v2 - 10.*v2.yx) + 1.);
Logged
eiffie
Guest
« Reply #13 on: January 26, 2013, 06:52:33 PM »

One last question and I'll shut up.

This creates a data dependency...
float tmp=x;
x=y;
//wait for memory to settle
y=tmp;

But does this???
v.xy=v.yx;

It must, right? Or since tmp wasn't used in arith. the first example does not need to wait?
Logged
Adam Majewski
Fractal Lover
**
Posts: 221


WWW
« Reply #14 on: January 26, 2013, 09:14:08 PM »

Vectorise it with SSE2, use streaming writes to aligned memory. FLOPs are essentially "free" on modern architectures, memory access is the key issue.

I use 1d memory array of unsigned chars for saving colors ( shades of grey = from 0 to 255 ) of pixels.
Do you mean that I should think about other structures, like for example 2d memory array ?
Logged
Pages: [1] 2   Go Down
  Print  
 
Jump to:  


Powered by MySQL Powered by PHP Powered by SMF 1.1.21 | SMF © 2015, Simple Machines

Valid XHTML 1.0! Valid CSS! Dilber MC Theme by HarzeM
Page created in 0.251 seconds with 25 queries. (Pretty URLs adds 0.01s, 2q)