Title: Sharing knowledge about SSE programming - possibly running a forum?
Post by: cbuchner1 on May 14, 2011, 11:25:52 PM
Hi,

Are people around here getting into SSE vector programming on x86 chips? It can be done with so-called compiler intrinsics in C/C++, or with bare assembly code. The achievable speed gain is up to a factor of 4 for single precision floating point per CPU core, and even a factor of 8 on the new Intel Sandy Bridge chip generation.
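
Just to give a taste of what the intrinsics look like, here is a tiny made-up example (nothing from a real library, purely an illustration) that adds four single precision floats with one SSE instruction:

    #include <xmmintrin.h>

    /* illustration only: one ADDPS instruction performs four additions at once */
    void add4(const float *a, const float *b, float *out)
    {
        __m128 va = _mm_loadu_ps(a);             /* load 4 floats (unaligned) */
        __m128 vb = _mm_loadu_ps(b);
        _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* add and store 4 results */
    }

The nice thing is that this reads almost like normal C; the compiler takes care of register allocation and instruction selection.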

The main issue, I think, is that vendor libraries (provided by AMD or Intel) typically favor that vendor's own chip generations, and the source code is not open. They are often free to use, but they do not run equally well on both AMD and Intel chips, so you often have to code your own math subroutines to get the best out of any given hardware. With Intel's new AVX instruction set on Sandy Bridge this kind of optimization could become much more interesting for fractal generation (4 double precision floats processed with one instruction).

I've recently checked whether there is any dedicated forum where people can share code snippets and other information related to SSE programming, but I found no specialized web site. I was considering opening and running one myself, but I have no experience yet with running bulletin board software on a rented server.

I would like to know whether anyone here is interested in the SSE/AVX topic, and maybe willing to help create a dedicated web forum or wiki as an information resource for it.

Christian



Title: Re: Sharing knowledge about SSE programming - possibly running a forum?
Post by: panzerboy on May 27, 2011, 04:54:59 AM
G'day Christian,

I just stumbled upon the new AVX extensions while investigating the i7s on Wikipedia. Am I right in thinking this will allow 256-bit arithmetic? I downloaded Intel's "Advanced Vector Extensions Programming Reference" PDF, but it's been decades since I last looked at assembly code (on the 8088) and I'm way behind on the current Intel nomenclature. I assume there's compiler support for AVX in MS VC++, but so far my searches have only turned up compiler switches to enable AVX and not the capabilities.
256-bit arithmetic would be a big speed boost for my Mandelbrot zooms.
It seems someone is already using the AVX instructions, more as a benchmark perhaps? Until I spend the money and get an i7 I can't run it.
http://board.flatassembler.net/topic.php?p=128226

Jeremy Thomson


Title: Re: Sharing knowledge about SSE programming - possibly running a forum?
Post by: cbuchner1 on May 27, 2011, 08:59:20 PM
Quote from: panzerboy on May 27, 2011, 04:54:59 AM
G'day Christian,

I just stumbled upon the new AVX extensions while investigating the i7s on Wikipedia. Am I right in thinking this will allow 256-bit arithmetic?

Jeremy Thomson

I doubt that. You can squeeze 8 single precision floats or 4 double precision floats into one 256-bit register, but the arithmetic remains limited to 32- or 64-bit words. The integer and floating point ALUs are still limited in precision; there are just up to 8 of them running in parallel now. There is no single 256-bit multiplier ALU on the CPU. ;)

The main benefit of AVX is that you could now simultaneously compute up to 8 pixels of a Mandelbrot zoom using a single core of the CPU.
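
To illustrate the idea, here is a rough sketch (my own toy code, untested and certainly not tuned) of one escape-time iteration loop that processes 8 pixels at once with single precision AVX intrinsics:

    #include <immintrin.h>

    /* sketch only: iterate z = z*z + c for 8 pixels in parallel */
    void mandel8(const float *cre, const float *cim, int maxIter, int *iters)
    {
        __m256 cr = _mm256_loadu_ps(cre);
        __m256 ci = _mm256_loadu_ps(cim);
        __m256 zr = _mm256_setzero_ps();
        __m256 zi = _mm256_setzero_ps();
        __m256 count = _mm256_setzero_ps();
        __m256 four  = _mm256_set1_ps(4.0f);
        __m256 one   = _mm256_set1_ps(1.0f);

        for (int i = 0; i < maxIter; ++i) {
            __m256 zr2 = _mm256_mul_ps(zr, zr);
            __m256 zi2 = _mm256_mul_ps(zi, zi);
            /* mask is all-ones in the lanes that have not escaped yet (|z|^2 <= 4) */
            __m256 mask = _mm256_cmp_ps(_mm256_add_ps(zr2, zi2), four, _CMP_LE_OQ);
            if (_mm256_movemask_ps(mask) == 0) break;      /* all 8 pixels escaped */
            /* z = z^2 + c, using the old zr and zi */
            __m256 zrzi = _mm256_mul_ps(zr, zi);
            zi = _mm256_add_ps(_mm256_add_ps(zrzi, zrzi), ci);
            zr = _mm256_add_ps(_mm256_sub_ps(zr2, zi2), cr);
            /* count one more iteration, but only in the still-active lanes */
            count = _mm256_add_ps(count, _mm256_and_ps(mask, one));
        }

        float tmp[8];
        _mm256_storeu_ps(tmp, count);
        for (int k = 0; k < 8; ++k) iters[k] = (int)tmp[k];
    }

You would compile this with /arch:AVX in Visual C++ 2010 SP1 (or -mavx with gcc) and feed it 8 neighbouring pixel coordinates at a time.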

Of course you could try tricks like double-single or double-double arithmetic to simulate higher precision. With the help of AVX instructions one might also create a faster arbitrary precision implementation.
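
For the curious, those double-double tricks are built on two tiny "error-free transforms". A minimal sketch of the building blocks (my own illustration, not taken from any library; it needs strict IEEE semantics, i.e. no fast-math or contraction of the expressions):

    #include <math.h>

    /* illustration only: value = hi + lo, with lo holding the rounding error */
    typedef struct { double hi, lo; } dd;

    /* Knuth's TwoSum: hi + lo equals a + b exactly */
    static dd two_sum(double a, double b)
    {
        double s = a + b;
        double v = s - a;
        double e = (a - (s - v)) + (b - v);
        dd r = { s, e };
        return r;
    }

    /* TwoProd via fused multiply-add: hi + lo equals a * b exactly */
    static dd two_prod(double a, double b)
    {
        double p = a * b;
        double e = fma(a, b, -p);
        dd r = { p, e };
        return r;
    }

Chaining and renormalizing these gives roughly twice the mantissa (around 106 bits for double-double) at the cost of a handful of extra operations per add or multiply.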

I've just brought my new Intel Core i5-2600k CPU online and I will try some things with Visual C++ 2010 (which has AVX support starting with Service Pack 1). What I like about it is that it has a moderate thermal design power and it makes all my AMD CPUs look old.

Hint: don't look at the AVX assembly instruction set, but rather learn how to use compiler intrinsics to do the maths. I find this much easier. Also, there is far more source code out there using intrinsics than there is documented assembly code.

Christian


Title: Re: Sharing knowledge about SSE programming - possibly running a forum?
Post by: cbuchner1 on May 28, 2011, 01:30:52 AM
Okay, now I am torn deciding which of these two projects to attack:

- doing a Mandelbrot deep zoomer in CUDA. The idea is to use the shared memory of CUDA devices to hold fixed point numbers; shared memory is some of the fastest memory on the device, and the arithmetic would happen entirely within it. Let's say for simplicity one number shall be at most 1 KB in size, that is 8192 bits or about 2466 decimal digits. We'd need to implement addition and multiplication on these numbers to compute the Mandelbrot set. The various multiprocessors of a CUDA card could then crunch different pixels of the set in parallel, while the threads running on each multiprocessor perform the necessary additions and multiplications in parallel. Way faster than doing the same on the CPU, I think.

- learning Intel AVX, possibly also by implementing a parallel Mandelbrot set. Most likely not with high precision maths (that would be too difficult as a first project).


Title: Re: Sharing knowledge about SSE programming - possibly running a forum?
Post by: panzerboy on May 28, 2011, 04:42:11 AM
Thanks for the reply. If you're interested in CUDA, I found what appears to be an arbitrary precision library for Linux here: http://code.google.com/p/gpuprec/


Title: Re: Sharing knowledge about SSE programming - possibly running a forum?
Post by: cbuchner1 on May 29, 2011, 12:33:31 AM
Quote from: panzerboy on May 28, 2011, 04:42:11 AM
Thanks for the reply. If you're interested in CUDA, I found what appears to be an arbitrary precision library for Linux here: http://code.google.com/p/gpuprec/


I may try to roll my own implementation based on a divide-and-conquer algorithm (Karatsuba) for the multiplication. It is simple enough for an average Joe to understand (unlike the FFT-based multiplications). I'm not sure yet how many bits of precision I will need for deep zooms.
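
For anyone who wants to play along, this is roughly the shape of the recursion. A plain serial toy version with 16-bit legs (illustrative only, not my CUDA code and not optimized):

    #include <cstdint>
    #include <vector>
    #include <algorithm>

    using Num = std::vector<uint32_t>;   // little-endian, one 16-bit leg per element

    // drop trailing zero legs
    static void trim(Num& x) { while (!x.empty() && x.back() == 0) x.pop_back(); }

    static Num add(const Num& a, const Num& b) {
        Num r(std::max(a.size(), b.size()) + 1, 0);
        uint32_t carry = 0;
        for (size_t i = 0; i < r.size(); ++i) {
            uint32_t s = carry + (i < a.size() ? a[i] : 0) + (i < b.size() ? b[i] : 0);
            r[i] = s & 0xFFFF;
            carry = s >> 16;
        }
        return r;
    }

    static void sub_inplace(Num& a, const Num& b) {   // a -= b, requires a >= b
        int32_t borrow = 0;
        for (size_t i = 0; i < a.size(); ++i) {
            int32_t d = (int32_t)a[i] - (i < b.size() ? (int32_t)b[i] : 0) - borrow;
            borrow = d < 0;
            a[i] = (uint32_t)(d + (borrow << 16));
        }
    }

    static Num mul(const Num& a, const Num& b) {
        const size_t n = std::max(a.size(), b.size());
        if (n <= 8) {                                 // schoolbook base case
            Num r(a.size() + b.size(), 0);
            for (size_t i = 0; i < a.size(); ++i) {
                uint32_t carry = 0;
                for (size_t j = 0; j < b.size(); ++j) {
                    uint32_t s = r[i + j] + a[i] * b[j] + carry;
                    r[i + j] = s & 0xFFFF;
                    carry = s >> 16;
                }
                if (!b.empty()) r[i + b.size()] += carry;
            }
            return r;
        }
        const size_t h = n / 2;                       // split into low/high halves
        Num a0(a.begin(), a.begin() + std::min(h, a.size()));
        Num a1(a.begin() + std::min(h, a.size()), a.end());
        Num b0(b.begin(), b.begin() + std::min(h, b.size()));
        Num b1(b.begin() + std::min(h, b.size()), b.end());

        Num z0 = mul(a0, b0);                         // low  * low
        Num z2 = mul(a1, b1);                         // high * high
        Num z1 = mul(add(a0, a1), add(b0, b1));       // the single extra multiplication
        sub_inplace(z1, z0);                          // z1 becomes the middle term
        sub_inplace(z1, z2);
        trim(z0); trim(z1); trim(z2);                 // keep the shifted adds in range

        Num r(a.size() + b.size() + 1, 0);            // result = z0 + z1<<h + z2<<2h (in legs)
        auto accumulate = [&r](const Num& x, size_t shift) {
            uint32_t carry = 0;
            for (size_t i = 0; i < x.size() || carry; ++i) {
                uint32_t s = r[shift + i] + (i < x.size() ? x[i] : 0) + carry;
                r[shift + i] = s & 0xFFFF;
                carry = s >> 16;
            }
        };
        accumulate(z0, 0);
        accumulate(z1, h);
        accumulate(z2, 2 * h);
        return r;
    }

Each level replaces one big multiplication by three of half the size plus a few additions and shifts; that is the whole trick.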

By the way, there is a CUDA-enabled deep zoom program called FastFractal256. It uses 256-bit arithmetic and combines CUDA and the CPU for rendering. The music in the sample video reminds me of some of the deep zoom videos that were posted on these forums previously. I could not find this software mentioned here using the search function.





Title: Re: Sharing knowledge about SSE programming - possibly running a forum?
Post by: panzerboy on May 29, 2011, 07:47:46 AM
Yes, I've tried FastFractal256; it's a bit buggy for me, perhaps because of my low spec 9800GT card (CUDA 1.0?), but it works most of the time.
The killer for me was its limited palette support. I'm used to Fractal Extreme's and Ultra Fractal's palette editors (called Gradients in UF). FastFractal256 just allows you to alter the 'cycle frequency' of red, green and blue along with their respective intensities.
Rendering resolutions are fixed.
The program becomes quite unresponsive when rendering, showing 'ghost' menus when selected until a second or more passes and the menu shows fully. Perhaps that is Aero's use of the GPU conflicting with FF256. It's also a total CPU hog; forget about doing anything else with your PC while it renders.
But sin of all sins, changing the palette forces a re-render. When oh when will Mandelbrot program designers realise that calculated iterations are precious and not to be discarded just because I want to see what a different colour setting looks like? That's the main reason I use Fractal Extreme over Ultra Fractal: once calculated you can play with the colours all you like and you don't have to wait hours (days?) for a recalculation. UF allows a bit of this, unless you change the gradient mapping from, say, square root to cube root. Even better, in FE you can increase the iterations and only the black (M-set) pixels recalculate.
Perhaps the weirdest thing about FastFractal256 is that the co-ordinates are shown in hexadecimal :verysurprised:, so you can't even use it to explore quickly and then use the co-ordinates in another program.
And unlicensed it will only run for 10 minutes, and I'm not even remotely tempted to part with the $4.95 to license it.
So FastFractal256 has a decent enough 'engine', but practically everything else about it needs work.


Title: Re: Sharing knowledge about SSE programming - possibly running a forum?
Post by: cbuchner1 on June 03, 2011, 11:05:21 PM
I am making some progress on 1024-bit integer multiplication using the Karatsuba algorithm. I am storing the numbers in a way that they can be processed efficiently on CUDA devices of the older (pre-Fermi) generation. That GPU generation only supports a fast 24x24 bit integer multiplication in hardware, because nVidia engineers re-used the multiplier circuits of the single precision floating point ALU, which has just 24 bits of mantissa.

The numbers consist of 64 legs ("bigits") of 16 bits each. I allow for 8 extra bits of carry overflow per leg, giving me the 24 bits that CUDA can work with. Some bignum implementations (e.g. GMP) call these extra bits "nails". They allow me to delay the carry propagation step until the final result of the multiplication is available. Carry propagation is hard to parallelize efficiently, and having to do it just once is a big win.
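
To make the layout a bit more concrete, here is a rough serial model of that representation (the names and the final pass are just illustrative, not the actual CUDA code):

    #include <cstdint>

    // 64 legs of 16 significant bits = 1024 bits; each leg lives in a 32-bit word
    // so that extra carry ("nail") bits can pile up before anything overflows.
    constexpr int LEGS = 64;
    using Bignum = uint32_t[LEGS];               // little-endian, low 16 bits significant

    // One final pass pushes the accumulated nail bits into the next leg and
    // restores the canonical 16-bits-per-leg form.
    void propagate_carries(Bignum x)
    {
        uint32_t carry = 0;
        for (int i = 0; i < LEGS; ++i) {
            uint32_t v = x[i] + carry;           // leg value plus incoming carry
            x[i]  = v & 0xFFFF;                  // keep the low 16 bits
            carry = v >> 16;                     // defer the rest to the next leg
        }
        // anything still left in 'carry' here would not fit into 1024 bits
    }

During the multiplication itself the legs are allowed to grow past 16 bits; only this single pass at the very end brings them back into canonical form, which is what makes the scheme friendly to parallel hardware.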

Right now I have a proof-of-concept multiplication running on the CPU that seems to generate correct results. The main difficulty is generating highly parallel CUDA code from this serial implementation. Afterwards it's all about deep Mandelbrot zooms... and fast, I hope ;)  1024 bits should allow zooms to approximately 1E300.

Does anyone have some *really* deep reference coordinates for a deep Mandelbrot zoom? This would help me test whether my implementation is sound.

Afterwards I might try a similar thing using Intel AVX instructions.

Source code will be available. I hate not to share.


Title: Re: Sharing knowledge about SSE programming - possibly running a forum?
Post by: msltoe on June 04, 2011, 02:57:35 AM
cbuchner,

 The CUFFT library may be useful for building a fairly simple CUDA-based arbitrary precision program. I know that FFT-based multiplication is a little trickier than Karatsuba and it doesn't pay off as well for numbers of the size you'll be using, but the CUFFT library would make the slowest part of the method (the FFT) a breeze.

 When I was a kid, I tried doing FFT-based multiplication using a straightforward "multiply the FFTs of the big numbers together and then reverse the FFT" approach. It didn't yield perfect answers, as there were some round-off errors, and there's some sort of carry operation needed. But perhaps if I had aimed for a smaller number of digits per block, the round-off error would have disappeared.

-mike


Title: Re: Sharing knowledge about SSE programming - possibly running a forum?
Post by: cbuchner1 on June 04, 2011, 02:48:52 PM
Those FFTs only get more efficient than the Karatsuba or Toom-3/Toom-4 algorithms for very large numbers (thousands of digits). For really deep Mandelbrot zooms a few hundred bits of precision are sufficient, so I think my choice of 1024 bits is more than enough.

Bignum packages like GMP are tuned to dynamically pick the best algorithm depending on problem size. For what I do, I think Karatsuba is well suited. Here are the current stats of my serial CPU implementation. To multiply two 1024-bit numbers and get a (truncated) 1024-bit result I need:

729 multiplications (24-bit x 24-bit, 8 clock cycles each to compute the resulting HI and LO words)
5320 additions (4 clock cycles each).

Now the cool part is that there are hundreds of CUDA cores available on the GPU, so most of the above arithmetic operations will run in parallel, and memory accesses will be minimal (all data lives in the GPU's internal user-managed cache, also known as shared memory).

I will still have to map the serial CPU code to parallel CUDA code, which is non-trivial. The data flow graph for a single iteration of Karatsuba already looks nasty when drawn on a sheet of paper. Now imagine that graph being recursively inserted into the three multiplication nodes (3rd level from the top) with six levels of recursion. And now find a thread-parallel way to go through that mess ;)


Title: Re: Sharing knowledge about SSE programming - possibly running a forum?
Post by: panzerboy on August 13, 2011, 07:57:25 AM
Further to my previous rant about FastFractal256: in the Edit menu there is a System Setting dialog where you can reduce the number of CPU threads to 1 or 0. This seems to stop the freeze-up of everything else running while it's rendering. All other gripes remain.  :tease: