cKleinhuis
« Reply #15 on: April 04, 2015, 11:38:28 AM »
All nice. I used to code assembler on the MC68000 (Amiga). In my eyes the existing formulas should be optimized in a way that today's CPU pipelines support best, but I am unaware of the optimizations compilers actually do: restructuring a formula so that independent parts can be calculated in parallel steps (CPU pipelining), and making good use of the caches (not so important for the small memory accesses a math formula usually needs). Since I don't know what the processors do internally I can't say much about that, but some tricks might be real performance boosts, like the one another user posted earlier in this thread: using the cycles of a longer calculation to squeeze in another one that can run in parallel on such hardware is the way to go. I don't know if precalculating constants or prerequisites in such a way and then using the results for output would help too...
So, simply reorganizing the order of commands might give another speedup above 10 percent, and restructuring so that independent operations can run in parallel could be a way to squeeze out some amazing performance.
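A tiny C sketch of the reordering idea above (a generic illustration, not code from this thread): summing with two independent accumulators breaks the single dependency chain of a naive loop, so a pipelined CPU can overlap the add latencies.

```c
#include <stddef.h>

/* Hypothetical example: two independent accumulators instead of one.
   s0 and s1 form separate dependency chains, so their additions can
   execute in parallel in the pipeline; a single accumulator would
   serialize every add behind the previous one. */
double sum_two_chains(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];     /* chain 0 */
        s1 += a[i + 1]; /* chain 1, independent of chain 0 */
    }
    if (i < n)
        s0 += a[i];     /* leftover element for odd n */
    return s0 + s1;
}
```

The same trick generalizes to four or more accumulators, limited by register count and the latency of the operation being chained.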
---
divide and conquer - iterate and rule - chaos is No random!
DarkBeam
Global Moderator
Fractal Senior
Posts: 2512
Fragments of the fractal -like the tip of it
« Reply #16 on: April 04, 2015, 06:07:01 PM »
The speedup is partially killed by mixing floating-point and MOV/XOR instructions; too bad, in theory my version would be a lot faster...
No sweat, guardian of wisdom!
eiffie
Guest
« Reply #17 on: April 04, 2015, 06:09:05 PM »
I just have to add my thoughts even though we have gotten way off topic. Although quaz0r makes a good point (if you are a coder by profession), I think hobold makes a better one about what we do. We aren't getting rich off this, so FUN is a prerequisite. And I find the mindset that anything worth doing has already been done better to be very unappealing. I am not a genius (had to look up how to spell that), but even I have come up with new methods, like distance-estimated lighting and enlarged-sphere tracing to mimic Monte Carlo sampling. I would hate to think what this world would look like if everyone waited for the team of smart people to tell them "how to". I love the innovation here, and that so many people are going in different directions to achieve similar goals... call it evolution.
That being said, I used to code machine language on the Apple IIe, but back then there were only about 100 instructions.
knighty
Fractal Iambus
Posts: 819
« Reply #18 on: April 04, 2015, 07:04:24 PM »
It should be much faster. Using evaldraw (which is not very good at optimizing) I get a 2x speedup:

Code:
(x,y)
{
   //return sin(x);
   #define INV_TWO_PI 0.15915494309189533576888376337251
   #define TWO_PI 6.283185307179586476925286766559
   x=x*INV_TWO_PI+(-0.25);
   x=x-floor(x);
   x=abs(x+(-0.5))+(-0.25);
   x=x*TWO_PI;
   x2=x*x;
   //from: http://stackoverflow.com/questions/345085/how-do-trigonometric-functions-work/394512#394512
   x = x * (0.99999660 + x2*(-0.16664824 + x2*(0.00830629 + x2*-0.00018363)));
}

(Set [esc]->View->Show Clock Cycles, and set multithreading to 1 (with [F5]) in order to compare the timings. One also has to account for the overhead of the function call and other operations (10 cycles here).)

Here is the generated asm code:

Code:
1244240:83 ec 0c             sub esp, 12
1244243:f2 0f 10 44 24 10    movsd xmm0, [esp+16]
1244249:f2 0f 59 46 30       mulsd xmm0, [esi+48]
124424e:f2 0f 58 46 10       addsd xmm0, [esi+16]
1244253:f2 0f 11 44 24 10    movsd [esp+16], xmm0
1244259:dd 44 24 10          fld qword ptr [esp+16]
124425d:db 5c 24 f8          fistp dword ptr [esp-8]
1244261:f2 0f 2a 4c 24 f8    cvtsi2sd xmm1, [esp-8]
1244267:f2 0f 5c c1          subsd xmm0, xmm1
124426b:f2 0f 11 44 24 10    movsd [esp+16], xmm0
1244271:f2 0f 58 46 08       addsd xmm0, [esi+8]
1244276:0f 54 46 e0          andps xmm0, [esi-32]
124427a:f2 0f 58 46 10       addsd xmm0, [esi+16]
124427f:f2 0f 11 44 24 10    movsd [esp+16], xmm0
1244285:f2 0f 59 46 58       mulsd xmm0, [esi+88]
124428a:f2 0f 11 44 24 10    movsd [esp+16], xmm0
1244290:f2 0f 59 c0          mulsd xmm0, xmm0
1244294:f2 0f 11 04 24       movsd [esp], xmm0
1244299:f2 0f 59 46 20       mulsd xmm0, [esi+32]
124429e:f2 0f 58 46 28       addsd xmm0, [esi+40]
12442a3:f2 0f 59 04 24       mulsd xmm0, [esp]
12442a8:f2 0f 58 46 18       addsd xmm0, [esi+24]
12442ad:f2 0f 59 04 24       mulsd xmm0, [esp]
12442b2:f2 0f 58 46 48       addsd xmm0, [esi+72]
12442b7:f2 0f 59 44 24 10    mulsd xmm0, [esp+16]
12442bd:f2 0f 11 44 24 10    movsd [esp+16], xmm0
12442c3:dd 44 24 10          fld qword ptr [esp+16]
12442c7:83 c4 0c             add esp, 12
12442ca:c3                   ret

It uses a lot of unnecessary memory operations and mixes FPU and SSE for the floor() function. A good compiler should give better results: for example, reordering the instructions, reducing memory operations, using immediate values, etc.
cKleinhuis
« Reply #19 on: April 04, 2015, 07:20:56 PM »
Hehe, never leave CPU registers alone! Don't processors nowadays provide huge register space to keep everything right in the registers?!
DarkBeam
« Reply #20 on: April 04, 2015, 07:55:10 PM »
I don't know how to thank you guys, Knighty!!! Will try, but I don't know how to replicate assembly's FRNDINT inside Evaldraw, which always takes me lots of time.
DarkBeam
« Reply #21 on: April 04, 2015, 08:14:35 PM »
This is my super-slim assembly version. The EDI block contains more or less the same constants as Knighty's, but it is not that much faster (still noticeably faster in some circumstances!). I use PI instead of 2*PI... and multiplied all constants by powers of PI to kill one multiplication. The AL-manipulating stuff changes odd/even signs with FCHS (a "fast" fold replica): the FIST integer is even/odd exactly when you need to flip the graph upside down. (Crappiest explanation ever, but still.) The coefficients are from the minimax polynomial, so I am not claiming ownership! -> http://www.geometrictools.com/Source/Functions.html

[CONSTANTS]
Double = 0.31830988618379067153776752674503 // [edi]
Double = 3.1415926535897932384626433832795 // [edi+8]
... and so on
Double = -5.167375251323450
Double = 2.543310591226590
Double = -0.557168173477934
« Last Edit: April 05, 2015, 10:51:42 PM by DarkBeam, Reason: updating nerdy description so it's human readable or something »
3dickulus
« Reply #22 on: April 04, 2015, 08:43:24 PM »
@ eiffie
knighty
« Reply #23 on: April 04, 2015, 08:58:40 PM »
Quote from: cKleinhuis
hehe, never leave cpu registers alone arent nowadays processors provide huge register space to keep everything right in the registers?!

Yes, but when there is an instruction that tells the CPU to store something, it will be executed, or at least decoded, etc. That said, I'm not an expert.

@DarkBeam: impressive. Is it possible to make it use SSE?
DarkBeam
« Reply #24 on: April 04, 2015, 10:32:49 PM »
Knighty, I never understood SSE opcodes, it's a nightmare; they always give me weird errors and I don't know what alignment they expect. Not even 1% as impressive as your scripts, which I always adore.
hobold
Fractal Bachius
Posts: 573
« Reply #25 on: April 05, 2015, 12:26:14 AM »
Quote from: eiffie
Although quaz0r makes a good point (if you are a coder by profession) I think hobold makes a better one about what we do.

I didn't mean to sound so definitive. I really don't think that one specific way of doing things is the best one. I don't mean to tell others how they ought to live.

Quote from: eiffie
I would hate to think what this world would look like if everyone waited for the team of smart people to tell them "how to".

That may well be true, but imagine what the world would look like if no one ever got any software done to the point where it was "good enough" to give it to non-expert users. We really do need all kinds.
DarkBeam
« Reply #26 on: April 05, 2015, 07:10:58 PM »
Notice, I updated the code to a tighter version: no more fstp, xor, fld, just a simple AND/check/FCHS, and oh boy, performance is incredibly boosted now, probably even more than 2x! The assembly snippet can be freely reused if someone finds it useful. Still haven't tried to boost FATAN, but that should be an even better optimization, as it takes roughly 2x the cycles of fsin().
ker2x
Fractal Molossus
Posts: 795
« Reply #27 on: April 06, 2015, 09:44:10 AM »
ASM is fun and I personally love to write ASM, but doing it for speed optimization isn't really going to work on modern CPUs. Get the mighty Intel compiler and development suite: https://software.intel.com/en-us/qualify-for-free-software
quaz0r
Fractal Molossus
Posts: 652
« Reply #28 on: April 06, 2015, 05:00:13 PM »
With the vector units on processors getting bigger nowadays, the main thing for number crunching is to make sure your compiler does a good job of vectorizing your code; otherwise you are missing out on a lot of potential performance.
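As a generic illustration of this point (not code from the thread): the loop shape auto-vectorizers handle best is a simple counted loop over restrict-qualified arrays with no cross-iteration dependencies. With gcc, -O3 -march=native enables vectorization and -fopt-info-vec reports whether a given loop was actually vectorized.

```c
#include <stddef.h>

/* Hypothetical example of a vectorizer-friendly loop: restrict tells
   the compiler the arrays don't alias, and each iteration is
   independent, so the whole loop maps to wide SIMD operations. */
void scale_add(float *restrict out, const float *restrict a,
               const float *restrict b, float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * k + b[i]; /* one multiply-add per lane */
}
```

Conditionals, function calls that can't be inlined, and pointer aliasing are the usual reasons a loop like this silently fails to vectorize.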
claude
Fractal Bachius
Posts: 563
« Reply #29 on: April 23, 2015, 08:49:17 PM »
I had a go at a fast single-precision cosf() approximation: http://mathr.co.uk/blog/2015-04-21_approximating_cosine.html

Code:
// gcc -std=c99 -Wall -Wextra -pedantic -O3 -march=native -ffast-math -funroll-loops -D_BSD_SOURCE -D_POSIX_C_SOURCE=200112L
// note: -ffast-math can break some numerical code, verify results carefully, use it only for translation units where it's known to be ok
#ifndef COSF9_H
#define COSF9_H 1

#include <math.h>

#define likely(x) __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)

// postcondition: 0 <= phase <= 1
static inline float cosf9_reduce(float phase) {
  float p = fabsf(phase); // p >= 0
  if (likely(p < (1 << 24))) {
    int q = p;
    return p - q;
  } else {
    if (unlikely(isnanf(p) || isinff(p))) {
      // return NaN
      return p - p;
    } else {
      // int could overflow, and it will be integral anyway
      return 0.0f;
    }
  }
}

// precondition: 0 <= phase <= 1
static inline float cosf9_unsafe(float phase) {
  float p = fabsf(4.0f * phase - 2.0f) - 1.0f; // p in -1 .. 1
  float s = 1.5707963267948965580e+00f * p
          - 6.4596271553942852250e-01f * p * p * p
          + 7.9685048314861006702e-02f * p * p * p * p * p
          - 4.6672571910271187789e-03f * p * p * p * p * p * p * p
          + 1.4859762069630022552e-04f * p * p * p * p * p * p * p * p * p;
  // compiler figures out optimal simd multiplications
  return s;
}

static inline float cosf9(float phase) {
  return cosf9_unsafe(cosf9_reduce(phase));
}

#endif

(Note: updated/improved versions might appear at a later date in the git repo linked from the blog post.)

For input in [-0.5 .. 0.5] (corresponding to [-pi .. pi] for the real cosf()), it's accurate to about 21 bits worst case, about 23 bits rms average; single-precision float has only 24 bits, so that's pretty good.

Quote from: quaz0r
the main thing for number crunching is to make sure your compiler does a good job of vectorizing your code

Absolutely! Vectorizing one evaluation on its own is hard, but if you have, say, 64 input values ready and waiting and you need 64 evaluations, gcc can do a pretty good job of vectorizing (provided the function is inlined and not too complicated).

Quote from: cKleinhuis
arent nowadays processors provide huge register space to keep everything right in the registers?!

They've got more than before, but various factors are still involved; sometimes making two passes over the data can be faster than a single pass (assuming the work can be split nicely into phases). I don't really understand it, but apparently doing too many things in one loop puts pressure on the translation lookaside buffer (http://en.wikipedia.org/wiki/Translation_lookaside_buffer). For example, splitting cosf9_reduce() and cosf9_unsafe() into two passes over the 64 input data samples sped it up by around 25%, compared to doing cosf9() in one pass.