cKleinhuis
« Reply #15 on: April 04, 2015, 11:38:28 AM »
All nice. I used to code assembler on the MC68000 (Amiga). In my eyes the existing formulas should be optimized in a way that today's CPU pipelines support best, but I am unaware of the optimizations compilers actually do: restructuring a formula so that independent parts can be calculated in parallel steps (CPU pipelining), and making good use of the caches (not so important for the small memory accesses a math formula usually needs). Since I don't know what the processors do internally I can't say much about that, but some tricks might be real performance boosts, like the one another user posted earlier in this thread: using the cycles of a longer calculation to squeeze in another one that can run in parallel on such hardware is the way to go. I don't know if precalculating constants or prerequisites in such a way and then using the results for output would help too...
So, simply reorganizing the order of commands might give another speedup above 10 percent, and restructuring so that independent operations can run in parallel could be a way to squeeze out some amazing performance.
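A tiny C sketch of the reordering idea above (a generic illustration, not code from this thread): summing with two independent accumulators breaks the single dependency chain of a naive loop, so a pipelined CPU can overlap the add latencies.

```c
#include <stddef.h>

/* Hypothetical example: two independent accumulators instead of one.
   s0 and s1 form separate dependency chains, so their additions can
   execute in parallel in the pipeline; a single accumulator would
   serialize every add behind the previous one. */
double sum_two_chains(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];     /* chain 0 */
        s1 += a[i + 1]; /* chain 1, independent of chain 0 */
    }
    if (i < n)
        s0 += a[i];     /* leftover element for odd n */
    return s0 + s1;
}
```

The same trick generalizes to four or more accumulators, limited by register count and the latency of the operation being chained.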
---
divide and conquer - iterate and rule - chaos is No random!
DarkBeam
Global Moderator
Fractal Senior
Posts: 2512
Fragments of the fractal -like the tip of it
« Reply #16 on: April 04, 2015, 06:07:01 PM »
The speedup is partially killed by mixing floating-point and MOV/XOR instructions; too bad, in theory my version would be a lot faster...
No sweat, guardian of wisdom!
eiffie
Guest
« Reply #17 on: April 04, 2015, 06:09:05 PM »
I just have to add my thoughts even though we have gotten way off topic. Although quaz0r makes a good point (if you are a coder by profession), I think hobold makes a better one about what we do. We aren't getting rich off this, so FUN is a prerequisite. And I find the mindset that anything worth doing has already been done better to be very unappealing. I am not a genius (had to look up how to spell that), but even I have come up with new methods, like distance-estimated lighting and enlarged-sphere tracing to mimic Monte Carlo sampling. I would hate to think what this world would look like if everyone waited for the team of smart people to tell them "how to". I love the innovation here, and that so many people are going in different directions to achieve similar goals... call it evolution.
That being said, I used to code machine language on the Apple IIe, but back then there were only about 100 instructions.
knighty
Fractal Iambus
Posts: 819
« Reply #18 on: April 04, 2015, 07:04:24 PM »
It should be much faster. Using evaldraw (which is not very good at optimizing) I get a 2x speedup:

Code:
(x,y)
{
   //return sin(x);
   #define INV_TWO_PI 0.15915494309189533576888376337251
   #define TWO_PI 6.283185307179586476925286766559
   x=x*INV_TWO_PI+(-0.25);
   x=x-floor(x);
   x=abs(x+(-0.5))+(-0.25);
   x=x*TWO_PI;
   x2=x*x;
   //from: http://stackoverflow.com/questions/345085/how-do-trigonometric-functions-work/394512#394512
   x = x * (0.99999660 + x2*(-0.16664824 + x2*(0.00830629 + x2*-0.00018363)));
}

(Set [esc]->View->Show Clock Cycles, and set multithreading to 1 (with [F5]) in order to compare the timings. One also has to account for the overhead of the function call and other operations (10 cycles here).)

Here is the generated asm code:

Code:
1244240:83 ec 0c             sub esp, 12
1244243:f2 0f 10 44 24 10    movsd xmm0, [esp+16]
1244249:f2 0f 59 46 30       mulsd xmm0, [esi+48]
124424e:f2 0f 58 46 10       addsd xmm0, [esi+16]
1244253:f2 0f 11 44 24 10    movsd [esp+16], xmm0
1244259:dd 44 24 10          fld qword ptr [esp+16]
124425d:db 5c 24 f8          fistp dword ptr [esp-8]
1244261:f2 0f 2a 4c 24 f8    cvtsi2sd xmm1, [esp-8]
1244267:f2 0f 5c c1          subsd xmm0, xmm1
124426b:f2 0f 11 44 24 10    movsd [esp+16], xmm0
1244271:f2 0f 58 46 08       addsd xmm0, [esi+8]
1244276:0f 54 46 e0          andps xmm0, [esi-32]
124427a:f2 0f 58 46 10       addsd xmm0, [esi+16]
124427f:f2 0f 11 44 24 10    movsd [esp+16], xmm0
1244285:f2 0f 59 46 58       mulsd xmm0, [esi+88]
124428a:f2 0f 11 44 24 10    movsd [esp+16], xmm0
1244290:f2 0f 59 c0          mulsd xmm0, xmm0
1244294:f2 0f 11 04 24       movsd [esp], xmm0
1244299:f2 0f 59 46 20       mulsd xmm0, [esi+32]
124429e:f2 0f 58 46 28       addsd xmm0, [esi+40]
12442a3:f2 0f 59 04 24       mulsd xmm0, [esp]
12442a8:f2 0f 58 46 18       addsd xmm0, [esi+24]
12442ad:f2 0f 59 04 24       mulsd xmm0, [esp]
12442b2:f2 0f 58 46 48       addsd xmm0, [esi+72]
12442b7:f2 0f 59 44 24 10    mulsd xmm0, [esp+16]
12442bd:f2 0f 11 44 24 10    movsd [esp+16], xmm0
12442c3:dd 44 24 10          fld qword ptr [esp+16]
12442c7:83 c4 0c             add esp, 12
12442ca:c3                   ret

It uses a lot of unnecessary memory operations and mixes FPU and SSE for the floor() function. A good compiler should give better results: for example, reordering the instructions, reducing memory operations, using immediate values, etc.
cKleinhuis
« Reply #19 on: April 04, 2015, 07:20:56 PM »
Hehe, never leave CPU registers alone! Don't processors nowadays provide huge register space to keep everything right in the registers?!
DarkBeam
« Reply #20 on: April 04, 2015, 07:55:10 PM »
I don't know how to thank you guys, Knighty!!! Will try, but I don't know how to replicate assembly's FRNDINT inside Evaldraw, which always takes me lots of time.
DarkBeam
« Reply #21 on: April 04, 2015, 08:14:35 PM »
This is my super-slim assembly version. The EDI block contains more or less the same constants as Knighty's, but it is not that much faster (still noticeably faster in some circumstances!). I use PI instead of 2*PI... and multiplied all constants by powers of PI to kill one multiplication. The AL-manipulating stuff changes odd/even signs with FCHS (a "fast" fold replica): the FIST integer is even/odd exactly when you need to flip the graph upside down. (Crappiest explanation ever, but still.) The coefficients are from the minimax polynomial, so I am not claiming ownership! -> http://www.geometrictools.com/Source/Functions.html

[CONSTANTS]
Double = 0.31830988618379067153776752674503 // [edi]
Double = 3.1415926535897932384626433832795 // [edi+8]
... and so on
Double = -5.167375251323450
Double = 2.543310591226590
Double = -0.557168173477934
« Last Edit: April 05, 2015, 10:51:42 PM by DarkBeam, Reason: updating nerdy description so it's human readable or something »
3dickulus
« Reply #22 on: April 04, 2015, 08:43:24 PM »
@ eiffie
knighty
« Reply #23 on: April 04, 2015, 08:58:40 PM »
Quote from: cKleinhuis
hehe, never leave cpu registers alone arent nowadays processors provide huge register space to keep everything right in the registers?!

Yes, but when there is an instruction that tells the CPU to store something, it will be executed, or at least decoded, etc. That said, I'm not an expert.

@DarkBeam: impressive. Is it possible to make it use SSE?
DarkBeam
« Reply #24 on: April 04, 2015, 10:32:49 PM »
Knighty, I never understood SSE opcodes, it's a nightmare; they always give me weird errors and I don't know what alignment they expect. Not even 1% as impressive as your scripts, which I always adore.
hobold
Fractal Bachius
Posts: 573
« Reply #25 on: April 05, 2015, 12:26:14 AM »
Quote from: eiffie
Although quaz0r makes a good point (if you are a coder by profession) I think hobold makes a better one about what we do.

I didn't mean to sound so definitive. I really don't think that one specific way of doing things is the best one. I don't mean to tell others how they ought to live.

Quote from: eiffie
I would hate to think what this world would look like if everyone waited for the team of smart people to tell them "how to".

That may well be true, but imagine what the world would look like if no one ever got any software done to the point where it was "good enough" to give it to non-expert users. We really do need all kinds.
DarkBeam
« Reply #26 on: April 05, 2015, 07:10:58 PM »
Notice, I updated the code to a tighter version: no more fstp, xor, fld, just a simple AND/check/FCHS, and oh boy, performance is incredibly boosted now, probably even more than 2x! The assembly snippet can be freely reused if someone finds it useful. Still haven't tried to boost FATAN, but that should be an even better optimization, as it takes roughly 2x the cycles of fsin().
ker2x
Fractal Molossus
Posts: 795
« Reply #27 on: April 06, 2015, 09:44:10 AM »
ASM is fun and I personally love to write ASM, but doing it for speed optimization isn't really going to work on modern CPUs. Get the mighty Intel compiler and development suite: https://software.intel.com/en-us/qualify-for-free-software
quaz0r
Fractal Molossus
Posts: 652
« Reply #28 on: April 06, 2015, 05:00:13 PM »
With the vector units on processors getting bigger nowadays, the main thing for number crunching is to make sure your compiler does a good job of vectorizing your code; otherwise you are missing out on a lot of potential performance.
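As a generic illustration of this point (not code from the thread): the loop shape auto-vectorizers handle best is a simple counted loop over restrict-qualified arrays with no cross-iteration dependencies. With gcc, -O3 -march=native enables vectorization and -fopt-info-vec reports whether a given loop was actually vectorized.

```c
#include <stddef.h>

/* Hypothetical example of a vectorizer-friendly loop: restrict tells
   the compiler the arrays don't alias, and each iteration is
   independent, so the whole loop maps to wide SIMD operations. */
void scale_add(float *restrict out, const float *restrict a,
               const float *restrict b, float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * k + b[i]; /* one multiply-add per lane */
}
```

Conditionals, function calls that can't be inlined, and pointer aliasing are the usual reasons a loop like this silently fails to vectorize.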
claude
Fractal Bachius
Posts: 563
« Reply #29 on: April 23, 2015, 08:49:17 PM »
I had a go at a fast single-precision cosf() approximation: http://mathr.co.uk/blog/2015-04-21_approximating_cosine.html

Code:
// gcc -std=c99 -Wall -Wextra -pedantic -O3 -march=native -ffast-math -funroll-loops -D_BSD_SOURCE -D_POSIX_C_SOURCE=200112L
// note: -ffast-math can break some numerical code, verify results carefully, use it only for translation units where it's known to be ok
#ifndef COSF9_H
#define COSF9_H 1

#include <math.h>

#define likely(x) __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)

// postcondition: 0 <= phase <= 1
static inline float cosf9_reduce(float phase) {
  float p = fabsf(phase); // p >= 0
  if (likely(p < (1 << 24))) {
    int q = p;
    return p - q;
  } else {
    if (unlikely(isnanf(p) || isinff(p))) {
      // return NaN
      return p - p;
    } else {
      // int could overflow, and it will be integral anyway
      return 0.0f;
    }
  }
}

// precondition: 0 <= phase <= 1
static inline float cosf9_unsafe(float phase) {
  float p = fabsf(4.0f * phase - 2.0f) - 1.0f; // p in -1 .. 1
  float s = 1.5707963267948965580e+00f * p
          - 6.4596271553942852250e-01f * p * p * p
          + 7.9685048314861006702e-02f * p * p * p * p * p
          - 4.6672571910271187789e-03f * p * p * p * p * p * p * p
          + 1.4859762069630022552e-04f * p * p * p * p * p * p * p * p * p;
  // compiler figures out optimal simd multiplications
  return s;
}

static inline float cosf9(float phase) {
  return cosf9_unsafe(cosf9_reduce(phase));
}

#endif

(Note: updated/improved versions might appear at a later date in the git repo linked from the blog post.)

For input in [-0.5 .. 0.5] (corresponding to [-pi .. pi] for the real cosf()), it's accurate to about 21 bits worst case, about 23 bits rms average; single-precision float has only 24 bits, so that's pretty good.

Quote from: quaz0r
the main thing for number crunching is to make sure your compiler does a good job of vectorizing your code

Absolutely! Vectorizing one evaluation on its own is hard, but if you have, say, 64 input values ready and waiting and you need 64 evaluations, gcc can do a pretty good job of vectorizing (provided the function is inlined and not too complicated).

Quote from: cKleinhuis
arent nowadays processors provide huge register space to keep everything right in the registers?!

They've got more than before, but various factors are still involved; sometimes making two passes over the data can be faster than a single pass (assuming the work can be split nicely into phases). I don't really understand it, but apparently doing too many things in one loop puts pressure on the translation lookaside buffer (http://en.wikipedia.org/wiki/Translation_lookaside_buffer). For example, splitting cosf9_reduce() and cosf9_unsafe() into two passes over the 64 input data samples sped it up by around 25%, compared to doing cosf9() in one pass.