Topic: double VS long double performance in x86 (Core2Duo)
aluminumstudios
« on: March 13, 2010, 09:10:00 AM »

I've searched around the net and found info on float vs. double performance, but haven't been able to find much on long double vs. double performance.  I'm hoping that someone with more experience than me can enlighten me a bit.

I'm writing software using GCC (C++) on a Core2Duo MacBook running OS X 10.6.

Floats are 32 bit (giving around 6 decimal digits of precision), doubles are 64 bit (yielding 15 decimal digits of precision), and long doubles are 80 bits and yield an additional 3 decimal digits of precision over doubles (this is according to some code I found and compiled that outputs information about data types on a system.)
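
Roughly, that is the kind of check the program performs; a minimal sketch (using std::numeric_limits, not the exact code I found) would be:

// Minimal sketch (not the exact program): print size and precision of the
// three floating-point types via std::numeric_limits.
#include <iostream>
#include <limits>

template <typename T>
void describe(const char* name) {
    std::cout << name
              << ": storage = " << sizeof(T) * 8 << " bits"   // includes padding for long double
              << ", mantissa bits = " << std::numeric_limits<T>::digits
              << ", decimal digits = " << std::numeric_limits<T>::digits10
              << ", max exponent = " << std::numeric_limits<T>::max_exponent
              << "\n";
}

int main() {
    describe<float>("float");              // 24 mantissa bits, ~6 decimal digits
    describe<double>("double");            // 53 mantissa bits, ~15 decimal digits
    describe<long double>("long double");  // 64 mantissa bits (x87 extended), ~18 decimal digits
    return 0;
}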

The extra precision of a long double isn't critical to my program, but I like it.  My question is: what is the performance difference on a Core2Duo (or x86 in general) when using long double instead of double?

I compile with the options -march=core2 -mfpmath=sse -msse2 -O2.  I have also tried -march=core2 -mfpmath=387 -O2 which hasn't yielded measurably different execution times.

I've heard that x86 CPUs have good hardware support for long doubles, so crunching them isn't much different from doubles.  I don't know, however, whether using long double might prevent the compiler from automatically generating SSE instructions that could otherwise speed things up?

Please fill me in on what the consensus is or if I am missing anything.

Thanks!

-William Milberry


« Last Edit: March 18, 2013, 11:45:23 PM by aluminumstudios »
hobold
« Reply #1 on: March 13, 2010, 10:11:12 AM »

These days, the floating point units of personal computers are very advanced. Addition and multiplication are generally implemented in full width, so they are equally quick no matter the data type. Only division and square root timings are somewhat proportional to operand width. Higher math functions like trigonometry and such will also be slower for wider operands.

However, there is one important issue that potentially speeds up the narrower data types in significant ways. Modern machines have a special type of parallelism called SIMD ("Single Instruction Multiple Data"). This is a technology which originated in the old "vector supercomputers" and finally made its way into commodity machines. The hardware can pack several narrow values into a wider data path and wider processing units. For example with SSE2 and onwards ("Streaming SIMD Extensions"), those data paths are 128 bits wide. Potentially, two double precision or four single precision floats can be packed in there.

This only works if the program exhibits enough data parallelism, and if the compiler can recognize that (or if a programmer tunes the program accordingly). But in that case, double precision has the potential to be twice as fast as long double, because doubles can be processed in parallel pairs. Single precision can potentially be another factor of two faster than that.
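
As an illustration, a loop of the following shape is the kind of code a compiler can auto-vectorize (just a sketch; the names are made up, and whether GCC actually vectorizes it depends on the optimization flags, e.g. -O3 or -ftree-vectorize):

// Sketch of a data-parallel loop that a compiler can auto-vectorize:
// every iteration is independent, so SSE2 can process two doubles
// (or four floats) per packed instruction.
#include <cstddef>
#include <vector>

void scale_and_add(std::vector<double>& out,
                   const std::vector<double>& a,
                   const std::vector<double>& b,
                   double k)
{
    const std::size_t n = out.size();
    for (std::size_t i = 0; i < n; ++i)
        out[i] = k * a[i] + b[i];   // independent iterations -> candidate for MULPD/ADDPD
}

With long double there is no packed instruction to fall back on, so the same loop stays scalar x87 code.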


GPUs can potentially process dozens of values in parallel in this manner, which is how they achieve their enormous number crunching throughput.
Timeroot
« Reply #2 on: March 13, 2010, 07:37:37 PM »

You could always use quads... maybe a bit excessive in the precision, but it sounds like, if SIMD is the dominant factor, they might not be too much slower than long doubles.

Someday, man will understand primary theory; how every aspect of our universe has come about. Then we will describe all of physics, build a complete understanding of genetic engineering, catalog all planets, and find intelligent life. And then we'll just puzzle over fractals for eternity.
aluminumstudios
« Reply #3 on: March 15, 2010, 06:37:20 AM »

Thanks for the replies.

I wasn't sure if there was a difference or not in speed for floats/doubles/long doubles as far as execution hardware goes.

My program is really memory intensive, though.  It does operations in a loop and writes the results of each loop iteration to an array.  I did a (very non-methodical) test and found that floats were about 10% faster than long doubles (I didn't test double vs. long double yet).  I suspect the reason for that was that floats are much smaller, and thus memory bandwidth/latency was less of an issue.

I guess I'll just have to do some tests to see if there is a difference between double and long double.  Memory bandwidth might play a small role, but I'm also wondering how much of a role SSE can play if I switch to double (the SSE units should handle two at a time wherever the compiler can parallelize the work...)
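
Something like the following is what I have in mind for the test (a rough sketch with made-up sizes and iteration counts, not my actual renderer; std::chrono needs C++11, so clock() from <ctime> would do on older compilers):

// Rough sketch: time the same Mandelbrot-style inner loop with double
// and with long double.
#include <chrono>
#include <iostream>

template <typename T>
T iterate(T cr, T ci, int maxIter) {
    T zr = 0, zi = 0;
    for (int i = 0; i < maxIter; ++i) {
        T t = zr * zr - zi * zi + cr;   // z = z^2 + c
        zi = 2 * zr * zi + ci;
        zr = t;
    }
    return zr + zi;                     // returned so the loop is not optimized away
}

template <typename T>
void timeIt(const char* name) {
    const auto start = std::chrono::steady_clock::now();
    volatile T sink = 0;
    for (int y = 0; y < 1000; ++y)
        for (int x = 0; x < 1000; ++x)
            sink = sink + iterate<T>(T(-2) + x * T(0.003),
                                     T(-1.5) + y * T(0.003), 200);
    const auto stop = std::chrono::steady_clock::now();
    std::cout << name << ": "
              << std::chrono::duration<double>(stop - start).count() << " s\n";
}

int main() {
    timeIt<double>("double");
    timeIt<long double>("long double");
    return 0;
}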

Quote from: Timeroot
You could always use quads... maybe a bit excessive in the precision, but it sounds like, if SIMD is the dominant factor, they might not be too much slower than long doubles.

I'm not aware of C++ having any kind of quad-precision type.  If there's an easy way to do it (without coding it myself), please tell me!
« Last Edit: March 15, 2010, 07:37:05 AM by aluminumstudios »
ker2x
« Reply #4 on: March 15, 2010, 08:57:40 AM »

It's highly dependent on the compiler and the way you code your program.

I'm very disappointed with GCC's auto-vectorization capabilities, and that's why I switched to the Intel Fortran compiler (instead of C/C++ with GCC).
Using SIMD instructions (SSE*), there is a potential 2x speedup.

Simply:
- SSE registers (xmm0 to xmm15) are 128 bits wide.
- You can do "packed" math operations on them.
- That means 4 floats at once, or 2 doubles at once.

eg :
ADDPS : Add Packed Single-Precision Floating-Point Values
Destination[0..31] = Destination[0..31] + Source[0..31];
Destination[32..63] = Destination[32..63] + Source[32..63];
Destination[64..95] = Destination[64..95] + Source[64..95];
Destination[96..127] = Destination[96..127] + Source[96..127];

ADDPD : Add Packed Double-Precision Floating-Point Values
Destination[0..63] = Destination[0..63] + Source[0..63];
Destination[64..127] = Destination[64..127] + Source[64..127];
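
From C/C++ the same packed add is reachable through compiler intrinsics. A minimal sketch, assuming SSE2 and the <emmintrin.h> header:

// Minimal SSE2 intrinsics sketch: add two pairs of doubles with one ADDPD.
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdio>

int main() {
    double a[2] = {1.0, 2.0};
    double b[2] = {10.0, 20.0};
    double r[2];

    __m128d va = _mm_loadu_pd(a);      // load 2 doubles (unaligned load)
    __m128d vb = _mm_loadu_pd(b);
    __m128d vr = _mm_add_pd(va, vb);   // packed add: compiles to ADDPD
    _mm_storeu_pd(r, vr);

    std::printf("%f %f\n", r[0], r[1]);   // 11.000000 22.000000
    return 0;
}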

Edit: obviously, you can't use SIMD on 80-bit values. They will be x87 FPU only, or scalar operations on the xmm registers for 32/64-bit values (e.g. ADDSD/ADDSS).
Edit2: I'm not even sure that you can use the xmm* registers for 80-bit scalars at all.
http://siyobik.info/index.php?module=x86
« Last Edit: March 15, 2010, 09:02:23 AM by ker2x »

often times... there are other approaches which are kinda crappy until you put them in the context of parallel machines
(en) http://www.blog-gpgpu.com/ , (fr) http://www.keru.org/ ,
Sysadmin & DBA @ http://www.over-blog.com/
Duncan C
« Reply #5 on: March 15, 2010, 02:48:25 PM »


Quote from: aluminumstudios
Floats are 32 bit (giving around 6 decimal digits of precision), doubles are 64 bit (yielding 15 decimal digits of precision), and long doubles are 80 bits and yield an additional 3 decimal digits of precision over doubles (this is according to some code I found and compiled that outputs information about data types on a system.)

Actually long doubles are 128 bits (twice as large as doubles) in the GCC compiler.


I don't think there is any direct hardware support for long doubles. It's my understanding that the math libraries use multiple double precision (hardware accelerated) operations to calculate long double values. I did a little bit of testing and found long doubles to take about twice as long as doubles. That surprised me. I expected them to take about 4 times as long. I would be shocked if the time penalty was only 10% for long doubles.


Duncan C
aluminumstudios
« Reply #6 on: March 15, 2010, 03:18:29 PM »


Quote from: Duncan C
Actually long doubles are 128 bits (twice as large as doubles) in the GCC compiler.

I don't think there is any direct hardware support for long doubles. It's my understanding that the math libraries use multiple double precision (hardware accelerated) operations to calculate long double values. I did a little bit of testing and found long doubles to take about twice as long as doubles. That surprised me. I expected them to take about 4 times as long. I would be shocked if the time penalty was only 10% for long doubles.

Duncan C

GCC can optionally align long doubles on different byte boundaries so that they can be transferred and handled more efficiently, but according to the documentation this doesn't add any extra precision; they are fundamentally still 80 bits.  This is from GCC's documentation:

"-m96bit-long-double
-m128bit-long-double
    These switches control the size of long double type. The i386 application binary interface specifies the size to be 96 bits, so -m96bit-long-double is the default in 32 bit mode.

    Modern architectures (Pentium and newer) would prefer long double to be aligned to an 8 or 16 byte boundary. In arrays or structures conforming to the ABI, this would not be possible. So specifying a -m128bit-long-double will align long double to a 16 byte boundary by padding the long double with an additional 32 bit zero.

    In the x86-64 compiler, -m128bit-long-double is the default choice as its ABI specifies that long double is to be aligned on 16 byte boundary.

    Notice that neither of these options enable any extra precision over the x87 standard of 80 bits for a long double."


This is the output of a program I found that tells you about your architecture's data types (run on my Core2Duo MacBook):
$ ./limits.o
Float format:
                radix: 2
         radix digits: 24
       decimal digits: 6
       radix exponent: -125 to 128
     decimal exponent: -37 to 38

Double format:
                radix: 2
         radix digits: 53
       decimal digits: 15
       radix exponent: -1021 to 1024
     decimal exponent: -307 to 308

Long double format:
                radix: 2
         radix digits: 64
       decimal digits: 18
       radix exponent: -16381 to 16384
     decimal exponent: -4931 to 4932

I'm fairly certain that long doubles are directly supported in hardware, because the 387 traditionally used 80 bits to store intermediate values.  This is what causes its numeric instability when working with 64-bit values (depending on the order of operations and when intermediate values are stored, rounding error may differ between the 387 and other architectures for the same operations, both using supposedly "64-bit" variables).  I read once that it wasn't until SSE2 supported 64-bit values with 64-bit internals that x86 became useful for more scientific work.  It would also explain why C++ offers a long double that is 80 bits, but not a quad, which would be more logical if it were going to be a software-supported implementation.  http://en.wikipedia.org/wiki/Long_double  Wikipedia also mentions hardware support.  However, both I and Wikipedia have been known to be wrong!
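
A quick way to see the extra mantissa bits from C++ (a small sketch; it shows the 64-bit significand of the extended format, not which instructions the compiler actually emits):

// Small check of the extra mantissa bits the x87 extended format carries
// (64-bit significand for long double vs. 53 bits for double).
#include <iostream>
#include <limits>

int main() {
    std::cout << "double mantissa bits:      "
              << std::numeric_limits<double>::digits << "\n";        // 53
    std::cout << "long double mantissa bits: "
              << std::numeric_limits<long double>::digits << "\n";   // 64 on x87

    const long double eps = std::numeric_limits<long double>::epsilon(); // ~1.08e-19

    // volatile forces each value out to memory, so it is rounded to its own
    // format before the comparison (this rounding-on-store is exactly where
    // x87 code can behave differently from SSE2 code).
    volatile long double ld = 1.0L + eps;
    volatile double d = 1.0 + static_cast<double>(eps);

    std::cout << "1 + eps != 1 as long double: " << (ld != 1.0L) << "\n"; // 1: fits in the 64-bit mantissa
    std::cout << "1 + eps != 1 as double:      " << (d  != 1.0)  << "\n"; // 0: lost in the 53-bit mantissa
    return 0;
}

(To see which instructions actually get emitted, x87 fld/fmul vs. SSE2 mulsd, compiling with "g++ -S" and reading the assembly is the more direct check.)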

I have a pretty solid understanding of architecture and data types.  I just haven't coded enough to be aware of what the relative performances of the different options are on current systems.

ker2x
« Reply #7 on: March 15, 2010, 04:47:38 PM »

Quote from: aluminumstudios
I'm fairly certain that long doubles are directly supported in hardware, because the 387 traditionally used 80 bits to store intermediate values.  This is what causes its numeric instability when working with 64-bit values (depending on the order of operations and when intermediate values are stored, rounding error may differ between the 387 and other architectures for the same operations, both using supposedly "64-bit" variables).  I read once that it wasn't until SSE2 supported 64-bit values with 64-bit internals that x86 became useful for more scientific work.  It would also explain why C++ offers a long double that is 80 bits, but not a quad, which would be more logical if it were going to be a software-supported implementation.  http://en.wikipedia.org/wiki/Long_double  Wikipedia also mentions hardware support.  However, both I and Wikipedia have been known to be wrong!

I have a pretty solid understanding of architecture and data types.  I just haven't coded enough to be aware of what the relative performances of the different options are on current systems.

Very interesting, feel free to tell us more about it!  You have a new fan.
hobold
« Reply #8 on: March 15, 2010, 05:20:19 PM »

Quote from: aluminumstudios
I suspect the reason for that was that floats are much smaller, and thus memory bandwidth/latency was less of an issue.
A good observation. Fractals are not typically bandwidth limited, but many other workloads are.

In the last two or three decades, processor clock frequencies have vastly outrun memory clock frequencies. In a very real sense, main memory has gradually turned into a slow peripheral device. This is why processors got caches, and why the number of cache levels is increasing.

If you are running a well-tuned computation on large data sets, managing memory bandwidth or controlling cache usage often becomes the most important tuning point.
aluminumstudios
« Reply #9 on: March 15, 2010, 05:26:51 PM »

Quote from: ker2x
Very interesting, feel free to tell us more about it!  You have a new fan.

If you're interested in learning more, this web page has a decent explanation:  http://en.allexperts.com/e/s/ss/sse2.htm
aluminumstudios
« Reply #10 on: March 18, 2010, 12:22:17 PM »

To follow up on my post, I did a test.

I rendered a short zoom sequence using my buddhabrot program (that I made this video with: http://www.fractalforums.com/movies-showcase-%28rate-my-movie%29/orbit-density-map-%28aka-anti-buddhabrot%29-rotation/)

On the first run I used the program as it was, with long doubles.  On the second run I replaced all of the long doubles with doubles and recompiled with the same options.

long double
real   14m7.605s
user   24m43.939s
sys   0m3.453s

double
real   8m36.919s
user   15m24.214s
sys   0m2.117s

The difference is quite dramatic for me!  Going by those times, the double run was roughly 1.6x as fast as the long double run!

I see two possible reasons for this.  One is that the compiler was able to generate some SSE instructions for the doubles that it couldn't with the long doubles, thereby reducing the time needed for the arithmetic (doing two operations at once).  Two is the fact that my program is very memory-bandwidth intensive and stores/retrieves lots of values from large arrays.  Transferring 64-bit values instead of 80 would use less bandwidth and quite possibly be more efficient, because they can most likely be moved in one memory transfer, whereas an 80-bit long double might have to be split into two 64-bit transfers (with the second one obviously being 16 bits of data and 48 bits of padding, probably created by GCC's alignment of long doubles to certain byte boundaries).
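
The padding part of reason two is easy to check directly (a tiny sketch; the exact sizes depend on the ABI and on the -m96bit/-m128bit-long-double switches quoted earlier):

// Tiny check of how much storage (data + padding) each type occupies,
// and what that means for the footprint of a large array.
#include <iostream>

int main() {
    std::cout << "sizeof(float)       = " << sizeof(float)       << "\n";  // 4
    std::cout << "sizeof(double)      = " << sizeof(double)      << "\n";  // 8
    std::cout << "sizeof(long double) = " << sizeof(long double) << "\n";  // 12 (32-bit) or 16 (64-bit), only 10 bytes are data
    std::cout << "1M-element array, double:      " << sizeof(double)      * 1000000 << " bytes\n";
    std::cout << "1M-element array, long double: " << sizeof(long double) * 1000000 << " bytes\n";
    return 0;
}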

My program renders Buddhabrots, so while it's the same math, the rendering procedure is different from other fractals, and this dramatic difference may not apply to other apps.  Also, for some, the extra precision might be more important than the speed.  But it's something to consider and test if you use long doubles and haven't thought about it.
« Last Edit: March 18, 2010, 12:39:06 PM by aluminumstudios »
ker2x
« Reply #11 on: March 19, 2010, 11:09:32 AM »

wooooo \o/

I think that we should create a Buddhabrot fan club!

EDIT: By the way, I don't think that long double, or even plain double, is useful for Buddhabrot.  Considering that the rendering time is insanely slower than regular Mandelbrot rendering, speed is a must-have.
« Last Edit: March 19, 2010, 11:12:26 AM by ker2x »

aluminumstudios
« Reply #12 on: March 19, 2010, 01:02:49 PM »

Quote from: ker2x
EDIT: By the way, I don't think that long double, or even plain double, is useful for Buddhabrot.  Considering that the rendering time is insanely slower than regular Mandelbrot rendering, speed is a must-have.

I'm working on a program to zoom into Buddhabrots, and the method I'm developing requires higher precision than floats.
johandebock
« Reply #13 on: March 20, 2010, 01:12:08 PM »

I work with double in my multithreaded buddhabrot renderer. Enough precision and still fast enough.

Botond Kósa
« Reply #14 on: April 15, 2010, 11:45:20 PM »

I am planning to implement long double (80-bit extended) precision calculation in my own Mandelbrot generator. It is written in Java, which doesn't support this floating point type, so I am going to implement it in C or C++ and use JNI calls. Which C++ IDE/compiler would you recommend for this (for WinXP 32bit)?
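
Roughly, what I have in mind on the native side looks like this (just a sketch with hypothetical names; the real signature depends on the Java class, and the generated header would come from javah):

// Rough sketch of the native side of a JNI call doing the iteration in
// long double (hypothetical names: class MandelMachine, method iterateNative).
#include <jni.h>

extern "C" JNIEXPORT jint JNICALL
Java_MandelMachine_iterateNative(JNIEnv* env, jobject self,
                                 jdouble cre, jdouble cim, jint maxIter)
{
    (void)env; (void)self;                      // not needed for pure arithmetic
    // Note: the jdouble arguments themselves are only 64-bit, so for deeper
    // zooms the coordinates would have to be passed in some wider form.
    long double cr = cre, ci = cim;
    long double zr = 0.0L, zi = 0.0L;
    jint i = 0;
    while (i < maxIter && zr * zr + zi * zi <= 4.0L) {
        long double t = zr * zr - zi * zi + cr;  // z = z^2 + c in 80-bit precision
        zi = 2.0L * zr * zi + ci;
        zr = t;
        ++i;
    }
    return i;                                    // iteration count at escape (or maxIter)
}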

Botond

Check out my Mandelbrot set explorer:
http://web.t-online.hu/kbotond/mandelmachine/