Title: cycles needed for floating point asm? Post by: DarkBeam on April 03, 2015, 01:01:45 PM Hello guys,
I am evaluating about making a "faster" but very accurate polynomial version of trig functions. Too bad I can't find anywhere basic infos. How many cycles does it take to eval FRNDINT versus FSIN? And how many for FPATAN? And FSQRT? Thankies :) Title: Re: cycles needed for floating point asm? Post by: claude on April 03, 2015, 04:17:17 PM http://www.agner.org/optimize/ "4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs"
The trouble is it isn't just cycles per instruction that you can add up - pipelining has a big effect so you can do something else useful while a many-cycle instruction is running... measuring real code is probably the best way... Some quick numbers from the PDF for Intel Haswell (latency column): FRNDINT 11 FSIN 47-106 FPATAN 96-156 FSQRT 10-23 Which does correspond to my own experience (trig is slow, replacing it if possible is usually good). Title: Re: cycles needed for floating point asm? Post by: cKleinhuis on April 03, 2015, 04:43:57 PM hehe, just use a 256 entry sine lookup table, and round PI to 3 ;)
Title: Re: cycles needed for floating point asm? Post by: DarkBeam on April 03, 2015, 04:58:59 PM Hehe Chris, lut are very hard to treat for me!
Only 11 cycles for frndint, hrm! Sure? Either my processor has a fast sin or a slow roundint. Or maybe I added too many instructions afterwards :( Ok, my idea was this, it gives a very accurate result (1e-8 error worst case) 1. x = x * twopiinv; x = x - roundint (x); x = 4. * x; // not sure about those passages ... 2. Fold (x, 1); // maybe a source of slowness due to abs() inside fold. 3. ox =x; x = abs(x); x2 = x^2 ... etc 4. x = ox* (pihalf + a x + b x2 + c x3 + d x4) / (1 + e x + f x2 + g x3) // divisions are slow A...G coeffs are found using Maxima using a simple linear system, to force the function to be exact in some points. I use a fractional function to reduce fp errors without losing flexibility while the traditional approach was to use high x powers. There are premade very good approx around but not that precise. ;) will see! Thanks for the kind reply! Title: Re: cycles needed for floating point asm? Post by: DarkBeam on April 03, 2015, 05:18:01 PM Ok, of course now I find all :)
A great solution is Chebychev up to x^7. Chebyshev: max error around 6e-7, f(x) = 0.99999660x-0.16664824x3+0.00830629x5-0.00018363x7 Range = -pi/2 to +pi/2, degree 7 (4 terms) Out of range use reflections (and frndint) ;) Found in - http://stackoverflow.com/questions/345085/how-do-trigonometric-functions-work/394512#394512 Now I hope it's really fast! Title: Re: cycles needed for floating point asm? Post by: hobold on April 03, 2015, 06:15:14 PM Chebychev polynomials are a good starting point. If you are looking for perfection, you could construct one of the so-called "minimax polynomials". These are the most numerically precise polynomial approximation of a given smooth function. However, there are no explicit formulas to determine the coefficients.
In practice, one uses some numerical optimization algorithm to find the coefficients which minimize the maximal error. A Chebychev polynomial is probably a good starting point for that kind of search. Edit: you might find Estrin's scheme interesting, an evaluation method for polynomials that allows modern processors to better utilize their parallel execution units: https://en.wikipedia.org/wiki/Estrin%27s_scheme Title: Re: cycles needed for floating point asm? Post by: DarkBeam on April 03, 2015, 06:29:56 PM Another thing I did not found. Do you know if there's a place with premade Cheby polynomes for ArcTan (even for x inside [-1,+1]? :) I found many interesting formulas, but it's dispersive because error is rarely specified.
Title: Re: cycles needed for floating point asm? Post by: claude on April 03, 2015, 06:43:19 PM http://www.coranac.com/2009/07/sines/ nice article about fixed point sin() approximation on ARM, probably much will also apply to floating point on x86 or amd64
Title: Re: cycles needed for floating point asm? Post by: DarkBeam on April 03, 2015, 08:04:31 PM :banana: Ahyeeah! :banana:
The Chebychev polynome works perfectly, using folds and I save one multiplication, using modified multipliers like that. Code: 1. x = x * 0.15915494309189533576888376337251; x = x - roundint (x); You have a sin() but 20% faster (circa) :) I am happy! :worm: Title: Re: cycles needed for floating point asm? Post by: DarkBeam on April 04, 2015, 12:19:35 AM Ah! :D found.
Minimax polynomes for any trig function here: http://www.geometrictools.com/Source/Functions.html Title: Re: cycles needed for floating point asm? Post by: quaz0r on April 04, 2015, 08:25:16 AM i know i sound like kind of a party pooper but the thing about this stuff is that you are never going to be the first person who has asked "how can i make some basic math functions fast and efficient?" everyone has already asked that question before, everyone has already worked on solving that question a million times over already, and the best approaches have already been implemented and refined a million times over in standard libraries and such since the dawn of computing. and hand tailored assembly for every architecture has also been written already to maximize performance of these things with architecture specific code (the intel MKL and amd's ACML for example). so the question isnt so much "what new and groundbreaking way can i invent to calculate the cosine" so much as "what methods have already been implemented a million times by tons of other people for every possible language and architecture?" :angel1:
on a similar note ive noticed in general that pretty much all the programs ive seen that are written by people on fractalforums do not make use of libraries, not even the standard library for the language a program was written in. it kinda reminds me of how more people tended to approach programming 30 years ago. maybe people arent super experienced coders, but there also appears to be a mindset that people want to do something that is way more amazing than what anyone else has ever done. while that sounds like a fun endeavor, in practice that is just hardly ever going to be the case, especially if you are not an expert coder. need to save a jpeg? the culmination of all the brightest minds dedicated to that task and the pinnacle of human achievement in the history of saving jpegs is already available in a library somewhere. need some arbitrary precision? the culmination of all the brightest minds in the history of humanity dedicated to coming together and solving that task has already produced a library for that. need to calculate a cosine? same deal. kate gregory gave an excellent little presentation about this at the last CppCon, specifically at 29m30s: http://youtu.be/BiYliKliFvs (http://youtu.be/BiYliKliFvs) Title: Re: cycles needed for floating point asm? Post by: hobold on April 04, 2015, 09:30:03 AM while that sounds like a fun endeavor Added highlighting to answer your question.Longer answer: when I am paid to write software, then time is short, perfection is frowned upon (because "good enough" is what sells), libraries are good (because someone else is responsible for testing and bug fixing code in there), and nobody cares what I learn in the process. When I am programming on my own, I have all the time in the world to polish whatever irrelevant detail I always wanted to understand better. I use libraries that I want to familiarize with. I write stuff myself that I want to learn the basics of. I skip things that would be an absolute requirement for product quality code, but that I don't want to bother with for a single run of a prototype algorithm. In short: my goals can differ a lot. In one situation I want to get something done. In the other situation I want to gain insight. A note on existing numerical libraries. Earlier in my life, I earned the friendship of folks who wrote such highly tuned code for a living. (That happened due to a few cases of undue perfectionism on which I had wasted my spare time, and which caught their attention.) These libraries are usually created in "get something done" mode. The experts who do that work are on somebody's payroll. And that somebody, being a successful business(wo)man, usually tends to impatience. :) Existing high performance libraries are typically very good. But they are never perfect. When the next processor revision is released, the timing of the execution pipelines is rebalanced. But the library is still tuned for the previous hardware. There is usually no money budgeted to rewrite the library's existing functions. Instead, new features keep getting added to the library; tuned for whatever machine is interesting at the time. In other words, the hardware target keeps shifting under the code tuners' feet. But their boss would rather release that software product now, instead of spending another three months re-tuning it. The point of my ramblings is this: 1. It would be a herculean task not to use libraries at all and do everything over and over again. Libraries are good. Trying to write a big application from scratch without using any existing libraries would be pretty insane. 2. A single programmer with sufficient time and dedication, when focusing on a single detail of implementation, can occasionally outdo a team of experts. Trying to do that is not nearly as insane. :) Title: Re: cycles needed for floating point asm? Post by: quaz0r on April 04, 2015, 09:52:05 AM Quote from: hobold obligatory devil's advocacy of course. no person is perfect, no software is perfect, no code is perfect, no library is perfect, no team of all the world's biggest geniuses is perfect, no library that team of geniuses created is perfect. pretty much no human and no thing ever created by a human or group of humans is ever as perfect as it could ever possibly be. then again, you probably arent perfect either. and whatever you naively came up with because you never heard of libraries before probably isnt the pinnacle of human achievement either. but yes, it is of course always possible for one dedicated person or team of persons to reinvent a particular wheel a little better than somebody else might have done. my point (of course) was more about this general mindset of "maybe this doesnt already exist," or "maybe nobodys ever thought of this before," or "maybe nobodys ever tried to make it fast before," or "maybe ill just reinvent that wheel regardless and hope for groundbreaking perfection and eternal glory" being in the main a rather naive and ill conceived approach to programming. the point of my ramblings was a general observation regarding a general pattern ive noticed coming from multiple parties responsible for multiple pieces of software being more likely an indicator of inexperience than... the opposite. Title: Re: cycles needed for floating point asm? Post by: youhn on April 04, 2015, 10:35:29 AM Well, here it's not so much about perfection, efficiency or other Key Progress Indicator like goals. People here do whatever the irrational-fluff they want, just because it's fun. Some like the idea of using libs, some experience it as a constraint. Go preach at some other group of coders.
Title: Re: cycles needed for floating point asm? Post by: DarkBeam on April 04, 2015, 11:02:57 AM Hey friends don't go crazy :dontcare:
(New emoticon :P ! ) In my case I am not even programming, that stuff is going into a new set of Mb3d formulas, and of course I didn't claim I found a fancy great polynome invented by me :D as I pretty much copied everything from other (wise and thoughtful) sources with a minimum effort. (After I found those of course) Because many mb3d formulas using sin() are not totally precision-critical so speedups would be very welcomed. In any case - the 20% speedup can be pushed further using a cheat over rounding integer, so good news ;) back to code! Title: Re: cycles needed for floating point asm? Post by: cKleinhuis on April 04, 2015, 11:38:28 AM all nice, i used to code assembler on the mc68000 (amiga) in my eyes the existing formulas should be optimized in a way that the cpu pipelines nowadays support the best, but i am unaware of the optimizations they actually do, restructuring a formula in a way that stuff that can be calculated in paralel steps (on cpu pipelining), making good use of the caches ( not so important for the small memory access a math-formula usually needs ) but since i dont know what the processors do i totally dont care about that, but some tricks might be some real performance boosts, like the one another user posted earlier in this thread, using the cycles of a longer calculation to squeeze in another one that can run in parallel on such hardware, is the way to go, i dunno if calculating constants or prerequisites in such a way and then using the results for outuput ....
so, simple reorganizing order of commands might give another speedup above 10 percent, to get rid of cycles that can be done in parallel could be a way to squeeze out some amazing performance Title: Re: cycles needed for floating point asm? Post by: DarkBeam on April 04, 2015, 06:07:01 PM The speedup is partially killed by mixing floating point and MOV/XOR instructions, too bad, in theory my version would be a lot faster ... :sad1:
Title: Re: cycles needed for floating point asm? Post by: eiffie on April 04, 2015, 06:09:05 PM I just have to add my thoughts even tough we have gotten way off topic.
Although quaz0r makes a good point (if you are a coder by profession) I think hobold makes a better one about what we do. We aren't getting rich off this so FUN is a prerequisite. And I find that mindset that anything that can be done has already been done better to be very unappealing. I am not a genius (had to look up how to spell that) but even I have come up with new methods like distance estimated lighting and enlarged-sphere tracing to mimic monticarlo sampling. I would hate to think what this world would look like if everyone waited for the team of smart people to tell them "how to". I love the innovation here and that so many people are going in different directions to achieve similar goals... call it evolution. That being said I use to code machine language on Apple IIe but then there were only about 100 instructions. Title: Re: cycles needed for floating point asm? Post by: knighty on April 04, 2015, 07:04:24 PM It should be much faster.
Using evaldraw (which is not very good at optimizing) I get 2x speedup: Code: (x,y){Here is the asm code generated: Code: 1244240:83 ec 0c sub esp, 12 Title: Re: cycles needed for floating point asm? Post by: cKleinhuis on April 04, 2015, 07:20:56 PM hehe, never leave cpu registers alone ;) arent nowadays processors provide huge register space to keep everything right in the registers?!
Title: Re: cycles needed for floating point asm? Post by: DarkBeam on April 04, 2015, 07:55:10 PM I don't know how to thank you guys and Knighty!!! :D
Will try, but I don't know how to replicate assembly's FRNDINT inside Evaldraw which takes me always lots of time ::) Title: Re: cycles needed for floating point asm? Post by: DarkBeam on April 04, 2015, 08:14:35 PM This is my superslim assembly version. EDI stuff contains more or less same constants as Knighty's.
I use PI instead of 2PI... ;D and multiplied all constants by powers of PI to kill one multiplication :o The AL manipulating stuff changes odd/even signs. (FCHS) ("fast" fold replica) Because, the FIST integer is even/odd when you need to flip upside down the graph. (crappiest explanation ever but still ;) ) The coeffs are from the minimax polynome, so I am not claiming the ownership! :) -> http://www.geometrictools.com/Source/Functions.html Code: [CONSTANTS] Title: Re: cycles needed for floating point asm? Post by: 3dickulus on April 04, 2015, 08:43:24 PM @ eiffie :beer:
Title: Re: cycles needed for floating point asm? Post by: knighty on April 04, 2015, 08:58:40 PM hehe, never leave cpu registers alone ;) arent nowadays processors provide huge register space to keep everything right in the registers?! Yes but when there is an instruction that tells the CPU to store something, it will be executed or at least decoded..etc. That said I'm not an expert. :embarrass:@Darkbeam: impressive. Is it possible to make it use SSE? :troll: Title: Re: cycles needed for floating point asm? Post by: DarkBeam on April 04, 2015, 10:32:49 PM Knighty, :D I never understood SSE opcodes it's a nightmare - they always give me weird errors an I don't know the alignment they pretend? :(
Not even 1% as impressive as your scripts that I always adore :) Title: Re: cycles needed for floating point asm? Post by: hobold on April 05, 2015, 12:26:14 AM Although quaz0r makes a good point (if you are a coder by profession) I think hobold makes a better one about what we do. I didn't mean to sound so definitive. I really don't think that one specific way of doing things is the best one. I don't mean to tell others how they ought to live.Quote I would hate to think what this world would look like if everyone waited for the team of smart people to tell them "how to". That may well be true, but imagine what the world would look like if no one ever got any software done to the point where it was "good enough" to give it to non-expert users. :)We really do need all kinds. Title: Re: cycles needed for floating point asm? Post by: DarkBeam on April 05, 2015, 07:10:58 PM Notice, I updated the code to a more tight version, no longer fstp, xor, fld but a simple AND/check/FCHS and oh boy, performance is incredibly boosted now, probably even more than 2x! :D
The assembly snippet can be freely reused if someone finds it useful ;) Still not tried to boost FATAN but should be even more better as optimization... as it takes roughly 2x the cycles than fsin(). Title: Re: cycles needed for floating point asm? Post by: ker2x on April 06, 2015, 09:44:10 AM ASM is fun and i personally love to write ASM but doing it for speed optimization isn't really going to work on modern CPU.
Get the mighty intel compiler and development suite https://software.intel.com/en-us/qualify-for-free-software :) Title: Re: cycles needed for floating point asm? Post by: quaz0r on April 06, 2015, 05:00:13 PM with the vector units getting bigger on processors nowadays the main thing for number crunching is to make sure your compiler does a good job of vectorizing your code, otherwise you are missing out on all your potential performance.
Title: Re: cycles needed for floating point asm? Post by: claude on April 23, 2015, 08:49:17 PM I had a go at a fast single-precision cosf() approximation: http://mathr.co.uk/blog/2015-04-21_approximating_cosine.html Code: // gcc -std=c99 -Wall -Wextra -pedantic -O3 -march=native -ffast-math -funroll-loops -D_BSD_SOURCE -D_POSIX_C_SOURCE=200112L For input in [-0.5..0.5] (corresponding to [-pi .. pi] for the real cosf()), it's accurate to about 21 bits worst case, about 23 bits rms average - single precision float has only 24 bits, so that's pretty good. the main thing for number crunching is to make sure your compiler does a good job of vectorizing your code absolutely! vectorizing 1 evaluation on its own is hard, but if you have say 64 input values ready and waiting and you need 64 evaluations, gcc can do a pretty good job of vectorizing (provided the function is inlined and not too complicated) arent nowadays processors provide huge register space to keep everything right in the registers?! They've got more than before, but still various factors are involved - sometimes making two passes over the data can be faster than a single pass (assuming the work can be split nicely into phases). I don't really understand it, but apparently doing too many things in one loop puts pressure on the http://en.wikipedia.org/wiki/Translation_lookaside_buffer For example, splitting the cosf9_reduce() and cosf9_unsafe() into two passes over the 64 input data samples sped it up by around 25%, compared to doing cosf9() in one pass. |