Adam Majewski
« on: January 18, 2013, 11:14:58 PM »
Hi. I have made an image of the Julia set for f(z) = z + z^5. The image and code are here: http://commons.wikimedia.org/wiki/File:Julia_set_z%2Bz%5E5.png

I have tried to optimise the inner loop from:

Code:
tempx = 5*x*y*y*y*y - 10*x*x*x*y*y + x*x*x*x*x + x; // temporary variable
y = y*y*y*y*y - 10*x*x*y*y*y + 5*x*x*x*x*y + y;
x = tempx;

to:

Code:
tempx = 5*x*y4 - 10*x2my2*x + x4*x + x; // 5*x*y^4 - 10*x^3*y^2 + x^5 + x
y = y4*y - 10*x2my2*y + 5*x4*y + y;     // y^5 - 10*x^2*y^3 + 5*x^4*y + y
x = tempx;
x2 = x*x;
y2 = y*y;
y4 = y2*y2;
x4 = x2*x2;
x2y2 = x2 + y2;
x2my2 = x2 * y2;

How should I do it? TIA
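One way to arrange those recurrences (a sketch, not from the original post - the function name, iMax and the bailout radius 2.0 are assumed values): prime the temporaries before the loop, then refresh them at the bottom of each iteration so the next pass can reuse them.

Code:
/* sketch: iterate z -> z + z^5 with the temporaries primed before the
   loop; iMax and the bailout radius 2.0 are assumptions, not from the post */
int iterate(double x, double y, int iMax)
{
    double x2 = x*x, y2 = y*y;
    double x4 = x2*x2, y4 = y2*y2;
    double x2y2  = x2 + y2;  /* |z|^2, kept for the bailout test */
    double x2my2 = x2 * y2;  /* x^2 * y^2 */
    int i;
    for (i = 0; i < iMax && x2y2 <= 4.0; ++i) {
        double tempx = 5*x*y4 - 10*x2my2*x + x4*x + x; /* 5xy^4 - 10x^3y^2 + x^5 + x */
        y = y4*y - 10*x2my2*y + 5*x4*y + y;            /* y^5 - 10x^2y^3 + 5x^4y + y */
        x = tempx;
        x2 = x*x;   y2 = y*y;   /* refresh the temporaries for the next pass */
        x4 = x2*x2; y4 = y2*y2;
        x2y2  = x2 + y2;
        x2my2 = x2 * y2;
    }
    return i;
}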
eiffie
« Reply #1 on: January 19, 2013, 05:43:25 PM »
I can spot one optimisation: don't calculate x2y2, it is not used.
asimes
« Reply #2 on: January 19, 2013, 06:55:29 PM »
Maybe this:

Code:
x2 = x*x;  x4 = x2*x2;
y2 = y*y;  y4 = y2*y2;
xt5 = x*5;  xt_10 = x*-10;
tempx = xt5*y4 + xt_10*x2*y2 + x4*x + x;
y = y4*y + xt_10*x*y2*y + xt5*x2*x*y + y;
x = tempx;
lycium
« Reply #3 on: January 19, 2013, 10:16:17 PM »
Vectorise it with SSE2 and use streaming writes to aligned memory. FLOPs are essentially "free" on modern architectures; memory access is the key issue.
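A minimal sketch of the compute half of that suggestion (not lycium's code; double precision and two points per 128-bit register are assumptions - the streaming-write half is sketched further down the thread):

Code:
#include <emmintrin.h>  /* SSE2 intrinsics */

/* One z -> z + z^5 step for two points at once, using the factored form
   x' = x*(5y^4 - 10x^2y^2 + x^4 + 1), y' = y*(y^4 - 10x^2y^2 + 5x^4 + 1),
   which is algebraically equal to z + z^5 componentwise. */
static void step2(__m128d *x, __m128d *y)
{
    const __m128d one  = _mm_set1_pd(1.0);
    const __m128d five = _mm_set1_pd(5.0);
    const __m128d ten  = _mm_set1_pd(10.0);
    __m128d x2 = _mm_mul_pd(*x, *x);
    __m128d y2 = _mm_mul_pd(*y, *y);
    __m128d x4 = _mm_mul_pd(x2, x2);
    __m128d y4 = _mm_mul_pd(y2, y2);
    __m128d t  = _mm_mul_pd(ten, _mm_mul_pd(x2, y2));  /* 10*x^2*y^2 */
    /* x' = x * (5y^4 - 10x^2y^2 + x^4 + 1) */
    *x = _mm_mul_pd(*x, _mm_add_pd(_mm_sub_pd(_mm_mul_pd(five, y4), t),
                                   _mm_add_pd(x4, one)));
    /* y' = y * (y^4 - 10x^2y^2 + 5x^4 + 1); x4 and t were computed from
       the old x, so no tempx is needed */
    *y = _mm_mul_pd(*y, _mm_add_pd(_mm_sub_pd(y4, t),
                                   _mm_add_pd(_mm_mul_pd(five, x4), one)));
}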
Syntopia
« Reply #4 on: January 20, 2013, 12:23:57 AM »
Quote from: asimes on January 19, 2013, 06:55:29 PM
x2 = x*x; x4 = x2*x2; y2 = y*y; y4 = y2*y2; xt5 = x*5; xt_10 = x*-10;
tempx = xt5*y4 + xt_10*x2*y2 + x4*x + x; y = y4*y + xt_10*x*y2*y + xt5*x2*x*y + y; x = tempx;

That's 16 muls, 6 adds, 1 unary minus, 7 temp vars. You can do better than that - for instance this one:

Code:
float x4 = x*x; // we only use x2 or y2 for tenx2y2
float y4 = y*y;
float tenx2y2 = 10.*x4*y4;
x4 = x4*x4;
y4 = y4*y4;
x = (5.*y4+1.-tenx2y2+x4)*x; // we don't need tempx anymore: y doesn't use x
y = (y4-tenx2y2+5.*x4+1.)*y;

That's 10 muls, 6 adds/subs, 3 temp vars. Compared to the original code (with 28 muls!) this gave a modest 33% speedup (on a GPU).
eiffie
« Reply #5 on: January 22, 2013, 04:57:32 PM »
Syntopia, I am curious - I don't know much about GPU architecture - does it help to vectorize on the GPU as well? I work with vectors because I'm usually more concerned about optimizing my typing.
panzerboy
« Reply #6 on: January 22, 2013, 11:50:54 PM »
eiffie, x2y2 is used to test against the bailout value in a Julia or Mandelbrot program. Technically you're testing against sqrt(x^2+y^2) > bailout, but it's much faster to test x^2+y^2 > bailout^2, square roots being computationally prohibitive.

Syntopia, see above - you still probably want to keep the x2 and y2 temp variables for testing against the bailout. Interesting that you change

Code:
x = (5.*y4-tenx2y2+x4)*x + x;
y = (y4-tenx2y2+5.*x4)*y + y;

to be

Code:
x = (5.*y4+1.-tenx2y2+x4)*x;
y = (y4-tenx2y2+5.*x4+1.)*y;

Placing that floating-point constant one inside the brackets and removing the addition of a variable suggests there's some efficiency to be gained. I'm guessing x86 can store a floating-point constant directly in the machine code instruction without reference to a memory location? So I looked up the x86 assembler wiki and found the FLD1 operation. Cool! It loads a floating-point 1 in one machine code instruction; that's got to be faster than referencing a variable off the stack, as long as the compiler is smart enough to do this. I don't think there is an equivalent for ARM - well, that's CISC vs RISC for you.
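In C, the squared-magnitude test described above is just (a sketch; the bailout radius 2.0 is an assumed value):

Code:
#include <stdbool.h>

/* sketch: escape test against the squared bailout radius;
   the radius 2.0 is an assumption, not from the thread */
static bool escaped(double x, double y)
{
    const double bailout = 2.0;
    /* compare |z|^2 = x^2 + y^2 against bailout^2 instead of
       computing sqrt(x^2 + y^2) every iteration */
    return x*x + y*y > bailout * bailout;
}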
fractower
« Reply #7 on: January 23, 2013, 12:29:50 AM »
Quote from: eiffie on January 22, 2013, 04:57:32 PM
Syntopia, I am curious - I don't know much about GPU architecture - does it help to vectorize on the GPU as well? I work with vectors because I'm usually more concerned about optimizing my typing.

I hope Syntopia won't mind my trying to answer this one. SIMD vectors are generally 8 to 16 double or single floats which are all treated exactly the same. Unfortunately, 3D vectors don't always fit nicely in this model. This is especially true for something like a bulb calculation, which treats one dimension differently.

To make best use of SIMD processors it is often best to structure your memory as structures of arrays instead of arrays of structures. An array of structures, Vect3D P[N];, produces an array with the x, y and z components sequential in memory. A structure of arrays keeps separate arrays x[N]; y[N]; z[N]; - see the sketch below.

In order for the SIMD processor to vectorize the array of structures, it must perform a gather to get all the components into separate vector registers. With the structure of arrays the memory is already SIMD friendly. Since it is a pain to code in SoA, gather operations are constantly getting faster and compilers are getting better. If you don't like my answer, just wait a GPU generation or two.
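Spelled out in C, the two layouts look like this (a sketch; the names and N are illustrative):

Code:
#define N 1024

/* array of structures (AoS): x, y, z interleaved in memory, so a SIMD
   load of 4 consecutive floats grabs x0 y0 z0 x1 and must be shuffled
   (gathered) apart before vector arithmetic */
typedef struct { float x, y, z; } Vect3D;
Vect3D p_aos[N];

/* structure of arrays (SoA): each component contiguous, so a SIMD load
   of 4 consecutive floats grabs x0 x1 x2 x3 directly */
struct {
    float x[N];
    float y[N];
    float z[N];
} p_soa;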
Syntopia
« Reply #8 on: January 24, 2013, 11:16:28 PM »
Quote from: fractower on January 23, 2013, 12:29:50 AM
SIMD vectors are generally 8 to 16 double or single floats which are all treated exactly the same. Unfortunately, 3D vectors don't always fit nicely in this model.

I'm no expert on GPUs, but my understanding is quite different. I don't think it matters at all whether you use vec3 or single components in GLSL code. A modern GPU will not try to parallelise an operation such as vec3 addition onto multiple threads (i.e. it will not assign one thread to the x addition and another to the y addition). Instead, all 8 or 16 stream processors on an Nvidia multiprocessor will perform exactly the same instructions, on all components of the vector. I'm pretty sure any GLSL program will perform exactly the same whether you code component-wise or vector-wise, but it would be easy to test. GPUs would be really difficult to program if you could only harvest the parallelism when using special vector operations on special data types, like the SIMD instructions on a CPU.
eiffie
« Reply #9 on: January 25, 2013, 06:10:47 PM »
It makes sense to me now - the pixel fragments act as the array and they are done in parallel. vecs and mats are just there to optimize our thinking.
lycium
« Reply #10 on: January 25, 2013, 06:41:29 PM »
On the other hand, I am an expert, but was ignored. In case anyone still cares: FLOPs are pretty much free on modern architectures; it's the memory access that matters most for performance. On GPUs which are not only MIMD (as they all are) but also SIMD (as older Radeons before GCN were), vectorising does help with floating-point performance, but this is not always possible, so NVIDIA's scalar architecture often won - which is why AMD now do the same in GCN.
Syntopia
« Reply #11 on: January 25, 2013, 09:53:54 PM »
Quote from: lycium on January 25, 2013, 06:41:29 PM
FLOPs are pretty much free on modern architectures; it's the memory access that matters most for performance. On GPUs which are not only MIMD (as they all are) but also SIMD (as older Radeons before GCN were), vectorising does help with floating-point performance, but this is not always possible, so NVIDIA's scalar architecture often won - which is why AMD now do the same in GCN.

At least for the GPU fractal stuff I do, the FLOPs are certainly the bottleneck. For raymarching fractals you don't need much memory access - the few uniforms you pass to the shader will reside in local memory, and each fragment shader writes just a single value to the frame buffer object as output. Interesting that AMD only recently switched to a scalar architecture - I didn't know that.
knighty
« Reply #12 on: January 25, 2013, 11:26:53 PM »
Maybe you will find this document useful: www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf

BTW, the formula could be written in this form:

Code:
vec2 v = vec2(x, y);
vec2 v2 = v*v;
v = v*(5.*v2.yx*v2.yx + v2*(v2 - 10.*v2.yx) + 1.);
eiffie
« Reply #13 on: January 26, 2013, 06:52:33 PM »
One last question and I'll shut up. This creates a data dependency:

Code:
float tmp = x;
x = y; // wait for memory to settle
y = tmp;

But does this?

Code:
v.xy = v.yx;

It must, right? Or, since tmp wasn't used in arithmetic, does the first example not need to wait?
Adam Majewski
« Reply #14 on: January 26, 2013, 09:14:08 PM »
Quote from: lycium on January 19, 2013, 10:16:17 PM
Vectorise it with SSE2 and use streaming writes to aligned memory. FLOPs are essentially "free" on modern architectures; memory access is the key issue.

I use a 1D memory array of unsigned chars for saving the colours (shades of grey, from 0 to 255) of pixels. Do you mean that I should think about other structures, for example a 2D memory array?
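For what it's worth, lycium's streaming-write suggestion applied to a 1D unsigned-char buffer might look roughly like this (a sketch under assumptions: the image size and the constant grey value are placeholders, and real code would compute 16 pixel shades before each store):

Code:
#include <emmintrin.h>  /* SSE2 streaming stores */
#include <stdlib.h>     /* aligned_alloc (C11) */

enum { WIDTH = 800, HEIGHT = 600 };  /* assumed image size */

int main(void)
{
    /* 16-byte alignment lets us use 128-bit streaming stores */
    unsigned char *img = aligned_alloc(16, WIDTH * HEIGHT);
    __m128i grey = _mm_set1_epi8((char)128);  /* 16 identical placeholder pixels */
    for (size_t i = 0; i < (size_t)WIDTH * HEIGHT; i += 16)
        _mm_stream_si128((__m128i *)(img + i), grey);  /* write, bypassing the cache */
    _mm_sfence();  /* make the streaming stores visible before the buffer is read */
    free(img);
    return 0;
}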