Author Topic: cycles needed for floating point asm?  (Read 5796 times)
cKleinhuis (Administrator, Fractal Senior)
« Reply #15 on: April 04, 2015, 11:38:28 AM »

All nice. I used to code assembler on the MC68000 (Amiga). In my eyes, the existing formulas should be restructured to suit today's CPU pipelines, though I am unaware of the optimizations compilers actually do: reorganizing a formula so that independent steps can be calculated in parallel (via CPU pipelining), and making good use of the caches (not so important for the small memory access a math formula usually needs). Some tricks might give real performance boosts, like the one another user posted earlier in this thread: using the cycles of a longer calculation to squeeze in another one that can run in parallel on such hardware.

So simply reordering instructions might give another speedup of 10 percent or more; overlapping work that can be done in parallel could squeeze out some amazing performance.
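The reordering idea can be sketched in C (a hypothetical illustration, not code from this thread): summing with several independent accumulators breaks the serial dependency chain, so the pipeline can keep multiple additions in flight at once.

```c
#include <stddef.h>

/* One long dependency chain: each add must wait for the previous one. */
long sum_serial(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent chains: the CPU can overlap the additions.
   (Identical result for integers; with floats this reassociation
   changes rounding, which is why compilers need -ffast-math to do it.) */
long sum_unrolled(const long *a, size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Integers are used here so both versions give bit-identical answers; the point is only the shape of the dependency chains.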

---

divide and conquer - iterate and rule - chaos is No random!
DarkBeam (Global Moderator, Fractal Senior)
« Reply #16 on: April 04, 2015, 06:07:01 PM »

The speedup is partially killed by mixing floating-point and MOV/XOR instructions. Too bad; in theory my version would be a lot faster... sad

No sweat, guardian of wisdom!
eiffie (Guest)
« Reply #17 on: April 04, 2015, 06:09:05 PM »

I just have to add my thoughts even though we have gotten way off topic.
Although quaz0r makes a good point (if you are a coder by profession) I think hobold makes a better one about what we do. We aren't getting rich off this, so FUN is a prerequisite. And I find the mindset that anything worth doing has already been done better to be very unappealing. I am not a genius (had to look up how to spell that), but even I have come up with new methods, like distance-estimated lighting and enlarged-sphere tracing to mimic Monte Carlo sampling. I would hate to think what this world would look like if everyone waited for the team of smart people to tell them "how to". I love the innovation here, and that so many people are going in different directions to achieve similar goals... call it evolution.

That being said, I used to code machine language on the Apple IIe, but back then there were only about 100 instructions.
knighty (Fractal Iambus)
« Reply #18 on: April 04, 2015, 07:04:24 PM »

It should be much faster.
Using evaldraw (which is not very good at optimizing) I get a 2x speedup:
Code:
(x,y){
//return sin(x);
#define INV_TWO_PI 0.15915494309189533576888376337251
#define TWO_PI 6.283185307179586476925286766559
   x=x*INV_TWO_PI+(-0.25);
   x=x-floor(x);
   x=abs(x+(-0.5))+(-0.25);
   x=x*TWO_PI;
   x2=x*x;
   //from: http://stackoverflow.com/questions/345085/how-do-trigonometric-functions-work/394512#394512
   x = x * (0.99999660 + x2*(-0.16664824 + x2*(0.00830629 +x2*-0.00018363)));
}
(Set [esc] -> View -> Show Clock Cycles, and set multithreading to 1 (with [F5]) in order to compare the timings. One also has to account for the overhead of the function call and other operations (10 cycles here).)
Here is the asm code generated:
Code:
1244240:83 ec 0c                   sub esp, 12
1244243:f2 0f 10 44 24 10          movsd xmm0, [esp+16]
1244249:f2 0f 59 46 30             mulsd xmm0, [esi+48]
124424e:f2 0f 58 46 10             addsd xmm0, [esi+16]
1244253:f2 0f 11 44 24 10          movsd [esp+16], xmm0
1244259:dd 44 24 10                fld qword ptr [esp+16]
124425d:db 5c 24 f8                fistp dword ptr [esp-8]
1244261:f2 0f 2a 4c 24 f8          cvtsi2sd xmm1, [esp-8]
1244267:f2 0f 5c c1                subsd xmm0, xmm1
124426b:f2 0f 11 44 24 10          movsd [esp+16], xmm0
1244271:f2 0f 58 46 08             addsd xmm0, [esi+8]
1244276:0f 54 46 e0                andps xmm0, [esi-32]
124427a:f2 0f 58 46 10             addsd xmm0, [esi+16]
124427f:f2 0f 11 44 24 10          movsd [esp+16], xmm0
1244285:f2 0f 59 46 58             mulsd xmm0, [esi+88]
124428a:f2 0f 11 44 24 10          movsd [esp+16], xmm0
1244290:f2 0f 59 c0                mulsd xmm0, xmm0
1244294:f2 0f 11 04 24             movsd [esp], xmm0
1244299:f2 0f 59 46 20             mulsd xmm0, [esi+32]
124429e:f2 0f 58 46 28             addsd xmm0, [esi+40]
12442a3:f2 0f 59 04 24             mulsd xmm0, [esp]
12442a8:f2 0f 58 46 18             addsd xmm0, [esi+24]
12442ad:f2 0f 59 04 24             mulsd xmm0, [esp]
12442b2:f2 0f 58 46 48             addsd xmm0, [esi+72]
12442b7:f2 0f 59 44 24 10          mulsd xmm0, [esp+16]
12442bd:f2 0f 11 44 24 10          movsd [esp+16], xmm0
12442c3:dd 44 24 10                fld qword ptr [esp+16]
12442c7:83 c4 0c                   add esp, 12
12442ca:c3                         ret
It uses a lot of unnecessary memory operations and mixes FPU and SSE instructions for the floor() function. A good compiler should give better results: for example, reordering the instructions, reducing memory operations, using immediate values, etc.
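One way to avoid that FPU/SSE bouncing, as a sketch: for the small magnitudes a sine reduction sees, floor() can be done with plain integer truncation, which compilers turn into a single cvttsd2si instead of the fld/fistp round trip in the listing above. This illustrates the idea; it is not evaldraw's actual codegen.

```c
/* floor() for |x| < 2^31 using integer truncation only.
   (int)x truncates toward zero, so negative non-integers land one
   too high; the comparison fixes them up branchlessly. */
static double floor_small(double x) {
    int i = (int)x;        /* trunc(-0.3) == 0, trunc(1.7) == 1 */
    return i - (i > x);    /* subtract 1 when truncation overshot */
}
```

The range restriction matters: outside |x| < 2^31 the int conversion is undefined, which is why general-purpose floor() cannot be written this way.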
cKleinhuis (Administrator, Fractal Senior)
« Reply #19 on: April 04, 2015, 07:20:56 PM »

Hehe, never leave the CPU registers alone wink. Don't today's processors provide huge register space to keep everything right in the registers?!
DarkBeam (Global Moderator, Fractal Senior)
« Reply #20 on: April 04, 2015, 07:55:10 PM »

I don't know how to thank you guys and Knighty!!! cheesy

I will try, but I don't know how to replicate assembly's FRNDINT inside Evaldraw, which always takes me lots of time :smiley
DarkBeam (Global Moderator, Fractal Senior)
« Reply #21 on: April 04, 2015, 08:14:35 PM »

This is my super-slim assembly version. The EDI table contains more or less the same constants as Knighty's, but it's not that much faster (still noticeably faster in some circumstances!).
I use PI instead of 2*PI grin and multiplied all constants by powers of PI to kill one multiplication shocked
The AL-manipulating code changes the sign for odd/even quadrants (FCHS), a "fast" fold replica: the FIST integer is even or odd exactly when you need to flip the graph upside down. (Crappiest explanation ever, but still wink )
The coefficients are from the minimax polynomial, so I am not claiming ownership! smiley -> http://www.geometrictools.com/Source/Functions.html

Code:
[CONSTANTS]
Double = 0.31830988618379067153776752674503 // [edi]
Double = 3.1415926535897932384626433832795 // [edi+8] ... and so on
Double = -5.167375251323450
Double = 2.543310591226590
Double = -0.557168173477934
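The constants check out: with t = x/pi in [-0.5, 0.5], sin(x) = sin(pi*t) ~ t*(pi + c3*t^2 + c5*t^4 + c7*t^6), where c3, c5, c7 are the minimax coefficients pre-multiplied by powers of pi as described. A C sketch of how the constant table combines (my reconstruction; the real routine runs in x87 registers):

```c
#include <math.h>

/* sin(pi*t) for t in [-0.5, 0.5], using the pi-scaled minimax
   coefficients from the constant table above. */
static double sin_pi_t(double t) {
    const double c1 = 3.1415926535897932384626433832795; /* [edi+8] */
    const double c3 = -5.167375251323450;  /* ~ -pi^3/6, minimax-tuned */
    const double c5 = 2.543310591226590;   /* ~  pi^5/120 */
    const double c7 = -0.557168173477934;  /* ~ -pi^7/5040 */
    double t2 = t * t;
    return t * (c1 + t2 * (c3 + t2 * (c5 + t2 * c7)));
}
```

Scaling the coefficients this way folds the final multiply-by-pi into the polynomial, which is the "kill one multiplication" trick.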


[Attachment: Cattura.PNG (7.35 KB, 294x216)]

3dickulus (Global Moderator, Fractal Senior)
« Reply #22 on: April 04, 2015, 08:43:24 PM »

@ eiffie  A Beer Cup

Resistance is fertile...
You will be illuminated!

                            #B^] https://en.wikibooks.org/wiki/Fractals/fragmentarium
knighty (Fractal Iambus)
« Reply #23 on: April 04, 2015, 08:58:40 PM »

Quote
Hehe, never leave the CPU registers alone wink. Don't today's processors provide huge register space to keep everything right in the registers?!
Yes, but when there is an instruction that tells the CPU to store something, it will be executed, or at least decoded, etc. That said, I'm not an expert. embarrass

@Darkbeam: impressive. Is it possible to make it use SSE? 
DarkBeam (Global Moderator, Fractal Senior)
« Reply #24 on: April 04, 2015, 10:32:49 PM »

Knighty, cheesy I never understood SSE opcodes; it's a nightmare. They always give me weird errors, and I don't know what alignment they expect. sad
Not even 1% as impressive as your scripts, which I always adore smiley
hobold (Fractal Bachius)
« Reply #25 on: April 05, 2015, 12:26:14 AM »

Quote
Although quaz0r makes a good point (if you are a coder by profession) I think hobold makes a better one about what we do.
I didn't mean to sound so definitive. I really don't think that one specific way of doing things is the best one. I don't mean to tell others how they ought to live.

Quote
I would hate to think what this world would look like if everyone waited for the team of smart people to tell them "how to".
That may well be true, but imagine what the world would look like if no one ever got any software done to the point where it was "good enough" to give it to non-expert users. smiley

We really do need all kinds.
DarkBeam (Global Moderator, Fractal Senior)
« Reply #26 on: April 05, 2015, 07:10:58 PM »

Notice: I updated the code to a tighter version, no more fstp/xor/fld but a simple AND/check/FCHS, and oh boy, performance is incredibly boosted now, probably even more than 2x! cheesy
The assembly snippet can be freely reused if someone finds it useful wink

I still haven't tried to boost FATAN, but it should benefit even more from optimization, as it takes roughly 2x the cycles of fsin().
ker2x (Fractal Molossus)
« Reply #27 on: April 06, 2015, 09:44:10 AM »

ASM is fun and I personally love to write ASM, but doing it for speed optimization isn't really going to work on a modern CPU.
Get the mighty Intel compiler and development suite: https://software.intel.com/en-us/qualify-for-free-software smiley


often times... there are other approaches which are kinda crappy until you put them in the context of parallel machines
(en) http://www.blog-gpgpu.com/ , (fr) http://www.keru.org/ ,
Sysadmin & DBA @ http://www.over-blog.com/
quaz0r (Fractal Molossus)
« Reply #28 on: April 06, 2015, 05:00:13 PM »

With the vector units on processors getting bigger nowadays, the main thing for number crunching is to make sure your compiler does a good job of vectorizing your code; otherwise you are missing out on most of your potential performance.
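As a sketch of what "vectorizer-friendly" means in practice (a generic illustration, not code from this thread): independent iterations, no possible aliasing, and a simple body. With gcc -O3 -march=native a loop like this typically compiles to packed SSE/AVX multiply-adds; -fopt-info-vec reports whether it vectorized.

```c
#include <stddef.h>

/* restrict promises out and in don't overlap, which is what lets
   the compiler load/compute/store several elements per iteration. */
void scale_add(double *restrict out, const double *restrict in,
               double a, double b, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a * in[i] + b;
}
```

Without restrict the compiler must either prove non-overlap itself or emit a runtime overlap check before the vector path.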
claude (Fractal Bachius)
« Reply #29 on: April 23, 2015, 08:49:17 PM »

I had a go at a fast single-precision cosf() approximation:

http://mathr.co.uk/blog/2015-04-21_approximating_cosine.html

Code:
// gcc -std=c99 -Wall -Wextra -pedantic -O3 -march=native -ffast-math -funroll-loops -D_BSD_SOURCE -D_POSIX_C_SOURCE=200112L
// note: -ffast-math can break some numerical code, verify results carefully, use it only for translation units where it's known to be ok
#ifndef COSF9_H
#define COSF9_H 1

#include <math.h>

#define   likely(x) __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)

// postcondition: 0 <= phase <= 1
static inline float cosf9_reduce(float phase) {
  float p = fabsf(phase);
  // p >= 0
  if (likely(p < (1 << 24))) {
    int q = p;
    return p - q;
  } else {
    if (unlikely(isnanf(p) || isinff(p))) {
      // return NaN
      return p - p;
    } else {
      // int could overflow, and it will be integral anyway
      return 0.0f;
    }
  }
}

// precondition: 0 <= phase <= 1
static inline float cosf9_unsafe(float phase) {
  float p = fabsf(4.0f * phase - 2.0f) - 1.0f;
  // p in -1 .. 1
  float s
    = 1.5707963267948965580e+00f * p
    - 6.4596271553942852250e-01f * p * p * p
    + 7.9685048314861006702e-02f * p * p * p * p * p
    - 4.6672571910271187789e-03f * p * p * p * p * p * p * p
    + 1.4859762069630022552e-04f * p * p * p * p * p * p * p * p * p;
  // compiler figures out optimal simd multiplications
  return s;
}

static inline float cosf9(float phase) {
  return cosf9_unsafe(cosf9_reduce(phase));
}

#endif
(Note: updated/improved versions might appear at a later date in the git repo linked from the blog post)

For input in [-0.5..0.5] (corresponding to [-pi .. pi] for the real cosf()), it's accurate to about 21 bits worst case, about 23 bits rms average - single precision float has only 24 bits, so that's pretty good.

Quote
the main thing for number crunching is to make sure your compiler does a good job of vectorizing your code

absolutely!  vectorizing 1 evaluation on its own is hard, but if you have say 64 input values ready and waiting and you need 64 evaluations, gcc can do a pretty good job of vectorizing (provided the function is inlined and not too complicated)

Quote
Don't today's processors provide huge register space to keep everything right in the registers?!

They've got more than before, but still various factors are involved - sometimes making two passes over the data can be faster than a single pass (assuming the work can be split nicely into phases).  I don't really understand it, but apparently doing too many things in one loop puts pressure on the http://en.wikipedia.org/wiki/Translation_lookaside_buffer

For example, splitting the cosf9_reduce() and cosf9_unsafe() into two passes over the 64 input data samples sped it up by around 25%, compared to doing cosf9() in one pass.
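The one-pass vs two-pass structure looks like this in C (a sketch of the loop shapes only; reduce() and poly() are condensed from the cosf9 listing above, keeping just the fast paths):

```c
#include <math.h>
#include <stddef.h>

/* Fast-path range reduction: |phase| wrapped into [0, 1). */
static float reduce(float phase) {
    float p = fabsf(phase);
    return p - (int)p;
}

/* Degree-9 odd polynomial on the folded argument, as in cosf9_unsafe. */
static float poly(float p) {
    p = fabsf(4.0f * p - 2.0f) - 1.0f;
    float p2 = p * p;
    return p * (1.5707963267948965580e+00f
         + p2 * (-6.4596271553942852250e-01f
         + p2 * (7.9685048314861006702e-02f
         + p2 * (-4.6672571910271187789e-03f
         + p2 * 1.4859762069630022552e-04f))));
}

/* One pass: reduce and evaluate each sample before moving on. */
void cos_batch_1pass(float *out, const float *in, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = poly(reduce(in[i]));
}

/* Two passes: reduce everything, then evaluate everything.
   Each loop is simpler, which can vectorize better. */
void cos_batch_2pass(float *out, const float *in, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = reduce(in[i]);
    for (size_t i = 0; i < n; i++)
        out[i] = poly(out[i]);
}
```

Both variants compute the same values; the ~25% figure quoted above is a measurement on the full cosf9, not on this condensed sketch.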