Topic: Sharing knowledge about SSE programming - possibly running a forum?
cbuchner1
« on: May 14, 2011, 11:25:52 PM »

Hi,

are people around here getting into SSE vector programming on x86 chips? It can be done with so-called compiler intrinsics in C/C++, or with bare assembly code. The achievable speed gain per CPU core is up to a factor of 4 for single-precision floating point, and even a factor of 8 on the new Intel Sandy Bridge chip generation.
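
For a taste, here is a minimal sketch of one z = z*z + c Mandelbrot step for four points at once using SSE intrinsics (single precision; the function name is made up):

Code:
#include <xmmintrin.h>  /* SSE intrinsics */

/* One z = z*z + c step for 4 single-precision points at once.
   zr/zi/cr/ci each pack 4 floats into one 128-bit register. */
void step4(__m128 *zr, __m128 *zi, __m128 cr, __m128 ci)
{
    __m128 zr2 = _mm_mul_ps(*zr, *zr);           /* zr^2 in all 4 lanes */
    __m128 zi2 = _mm_mul_ps(*zi, *zi);           /* zi^2 */
    __m128 zri = _mm_mul_ps(*zr, *zi);           /* zr*zi */
    *zi = _mm_add_ps(_mm_add_ps(zri, zri), ci);  /* 2*zr*zi + ci */
    *zr = _mm_add_ps(_mm_sub_ps(zr2, zi2), cr);  /* zr^2 - zi^2 + cr */
}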

The main issue, I think, is that vendor math libraries (from AMD or Intel) typically favor that vendor's own chip generation, and their source code is not open. They are often free to use but do not run equally well on both AMD and Intel chips, so you often have to code your own math subroutines to get the best out of any given hardware. With Intel's new AVX instruction set on Sandy Bridge this kind of optimization could become much more interesting for fractal generation (4 double-precision floats processed with one instruction).

I've recently checked whether there is any dedicated forum where people share code snippets and other information about SSE programming, but I found no specialized site. I was considering setting up and running a forum myself, but I have no experience yet running bulletin-board software on a rented server.

I would like to know whether anyone here is interested in the SSE/AVX topic, and maybe willing to help create a dedicated web forum or wiki resource about it.

Christian

panzerboy
« Reply #1 on: May 27, 2011, 04:54:59 AM »

G'day Christian,

I just stumbled upon the new AVX extensions while investigating the i7s on Wikipedia. Am I right in thinking this will allow 256-bit arithmetic? I downloaded Intel's "Advanced Vector Extensions Programming Reference" PDF, but it's been decades since I last looked at assembly code (on the 8088) and I'm way behind on current Intel nomenclature. I assume there's compiler support for AVX in MS VC++, but so far my searches have only turned up compiler switches that enable AVX, not the actual capabilities.
256-bit arithmetic would be a big speed boost for my Mandelbrot zooms.
It seems someone is already using the AVX instructions, more as a benchmark perhaps? Until I spend the money on an i7 I can't run it.
http://board.flatassembler.net/topic.php?p=128226

Jeremy Thomson
cbuchner1
« Reply #2 on: May 27, 2011, 08:59:20 PM »

Quote from: panzerboy on May 27, 2011, 04:54:59 AM
I just stumbled upon the new AVX extensions while investigating the i7s on Wikipedia. Am I right in thinking this will allow 256-bit arithmetic?

I doubt that. You can squeeze 8 single-precision floats or 4 double-precision floats into one 256-bit register, but the arithmetic remains limited to 32- or 64-bit words. The integer and floating-point ALUs are still limited in precision; there are just up to 8 of them running in parallel. There is no single 256-bit multiplier ALU on the CPU.

The main benefit of AVX is that you can now compute up to 8 pixels of a Mandelbrot zoom simultaneously on a single CPU core.

Of course you could try tricks like double-single or double-double arithmetic to simulate higher precision. With the help of AVX instructions one might also build a faster arbitrary-precision implementation.
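
For the curious, the classic double-double building blocks look roughly like this (a sketch; two_sum and two_prod are my own names for the well-known Knuth/Dekker error-free transforms):

Code:
#include <math.h>  /* for fma() */

/* Knuth two-sum: s + e == a + b exactly; e captures the rounding error */
static void two_sum(double a, double b, double *s, double *e)
{
    *s = a + b;
    double bb = *s - a;
    *e = (a - (*s - bb)) + (b - bb);
}

/* Exact product: p + e == a * b exactly. Note Sandy Bridge AVX has no
   FMA instruction yet, so fma() may be emulated and slow there. */
static void two_prod(double a, double b, double *p, double *e)
{
    *p = a * b;
    *e = fma(a, b, -*p);
}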

I've just brought my new Intel Core i7-2600K CPU online and will try some things with Visual C++ 2010 (which gained AVX support in Service Pack 1). What I like about this chip is its moderate thermal design power, and it makes all my AMD CPUs look old.

Hint: don't start with the AVX assembly instructions; learn how to use compiler intrinsics to do the math instead. I find this much easier, and there is far more source code out there using intrinsics than there is documented assembly code.
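
As an illustration, here is a sketch of the 4-pixels-per-register idea in double precision (my own function names; compile with the /arch:AVX switch in VC++ or -mavx in gcc):

Code:
#include <immintrin.h>  /* AVX intrinsics */

/* Iterate 4 Mandelbrot points at once (4 doubles per 256-bit register);
   per-lane iteration counts are returned as doubles in counts[]. */
void iterate4(__m256d cr, __m256d ci, int maxiter, double counts[4])
{
    __m256d zr   = _mm256_setzero_pd(), zi = _mm256_setzero_pd();
    __m256d n    = _mm256_setzero_pd();
    __m256d one  = _mm256_set1_pd(1.0);
    __m256d four = _mm256_set1_pd(4.0);
    for (int i = 0; i < maxiter; ++i) {
        __m256d zr2 = _mm256_mul_pd(zr, zr);
        __m256d zi2 = _mm256_mul_pd(zi, zi);
        /* lanes with |z|^2 <= 4 are still iterating */
        __m256d alive = _mm256_cmp_pd(_mm256_add_pd(zr2, zi2), four, _CMP_LE_OQ);
        if (_mm256_movemask_pd(alive) == 0) break;        /* all 4 escaped */
        n = _mm256_add_pd(n, _mm256_and_pd(alive, one));  /* masked count++ */
        __m256d zri = _mm256_mul_pd(zr, zi);
        zi = _mm256_add_pd(_mm256_add_pd(zri, zri), ci);  /* 2*zr*zi + ci  */
        zr = _mm256_add_pd(_mm256_sub_pd(zr2, zi2), cr);  /* zr^2-zi^2+cr  */
    }
    _mm256_storeu_pd(counts, n);
}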

Christian
cbuchner1
« Reply #3 on: May 28, 2011, 01:30:52 AM »

Okay, now I am torn over which of these two projects to attack:

- doing a Mandelbrot deep zoomer in CUDA. The idea is to use the shared memory of CUDA devices to hold fixed-point numbers; shared memory is some of the fastest memory on the device, and the arithmetic would happen entirely within it. Say, for simplicity, one number is at most 1 KB in size, i.e. 8192 bits or about 2466 decimal digits. We'd need to implement addition and multiplication on these numbers to compute the Mandelbrot set. The various multiprocessors of a CUDA card could then crunch different pixels of the set in parallel, while the threads running on each multiprocessor perform the necessary additions and multiplications in parallel (see the sketch after this list). Way faster than doing the same on the CPU, I think.

- learning Intel AVX, possibly also by implementing a parallel Mandelbrot set. Most likely not with high-precision math (that would be too difficult as a first project).
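
To make the first idea concrete, here is a minimal sketch of just the addition part (my own names; one thread per 16-bit leg, with the carry pass kept serial for clarity - a real implementation would parallelize it too):

Code:
#define LEGS 64  /* 64 x 16-bit legs = 1024 bits per number */

/* Each block adds two 1024-bit fixed-point numbers. The per-leg adds run
   in parallel; sums are staged in shared memory as 32-bit values so the
   carries can be resolved in a single pass at the end. */
__global__ void bignum_add(const unsigned short *a, const unsigned short *b,
                           unsigned short *out)
{
    __shared__ unsigned int acc[LEGS];
    int i = threadIdx.x;                        /* one thread per leg */
    if (i < LEGS)
        acc[i] = (unsigned int)a[i] + b[i];     /* carries deferred */
    __syncthreads();
    if (i == 0) {                               /* serial carry pass */
        unsigned int carry = 0;
        for (int k = 0; k < LEGS; ++k) {
            unsigned int v = acc[k] + carry;
            out[k] = (unsigned short)(v & 0xFFFFu);
            carry = v >> 16;
        }
    }
}
/* launched as, e.g., bignum_add<<<1, LEGS>>>(d_a, d_b, d_out); */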
panzerboy
« Reply #4 on: May 28, 2011, 04:42:11 AM »

Thanks for the reply. If you're interested in CUDA, I found what appears to be an arbitrary-precision library for Linux here: http://code.google.com/p/gpuprec/
cbuchner1
« Reply #5 on: May 29, 2011, 12:33:31 AM »

Quote from: panzerboy on May 28, 2011, 04:42:11 AM
Thanks for the reply. If you're interested in CUDA, I found what appears to be an arbitrary-precision library for Linux here: http://code.google.com/p/gpuprec/


I may try to roll my own implementation based on the divide-and-conquer Karatsuba algorithm for multiplication. It is simple enough for the average Joe to understand (unlike the FFT-based multiplications). I'm not sure yet how many bits of precision I will need for deep zooms.
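
The core trick, one level of it, in a nutshell (a sketch using 16-bit halves so everything fits comfortably in 64-bit arithmetic):

Code:
#include <stdint.h>

/* One Karatsuba level: a 32x32 -> 64-bit product from only three 16x16
   multiplications instead of four. Applied recursively, multiplying
   n-digit numbers costs O(n^1.585) instead of the schoolbook O(n^2). */
uint64_t karatsuba32(uint32_t x, uint32_t y)
{
    uint32_t x1 = x >> 16, x0 = x & 0xFFFFu;
    uint32_t y1 = y >> 16, y0 = y & 0xFFFFu;
    uint64_t hi  = (uint64_t)x1 * y1;
    uint64_t lo  = (uint64_t)x0 * y0;
    /* cross terms x1*y0 + x0*y1 recovered with a single multiply */
    uint64_t mid = (uint64_t)(x1 + x0) * (y1 + y0) - hi - lo;
    return (hi << 32) + (mid << 16) + lo;
}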

By the way, there is a CUDA-enabled deep-zoom program called FastFractal256. It uses 256-bit arithmetic and combines CUDA and the CPU for rendering. The music in the sample video reminds me of some of the deep-zoom videos posted on these forums previously. I could not find this software mentioned here using the search function.



panzerboy
« Reply #6 on: May 29, 2011, 07:47:46 AM »

Yes, I've tried FastFractal256. It's a bit buggy for me, perhaps because of my low-spec 9800GT card (CUDA 1.0?), but it works most of the time.
The killer for me was its limited palette support. I'm used to Fractal Extreme's and Ultra Fractal's palette editors (called gradients in UF). FastFractal256 just lets you alter the 'cycle frequency' of red, green, and blue along with their respective intensities.
Rendering resolutions are fixed.
The program becomes quite unresponsive while rendering, showing 'ghost' menus for a second or more before the menu draws fully. Perhaps that is Aero's use of the GPU conflicting with FF256. It's also a total CPU hog; forget about doing anything else with your PC while it renders.
But, sin of all sins, changing the palette forces a re-render. When, oh when, will Mandelbrot program designers realise that calculated iterations are precious and should not be discarded just because I want to see what a different colour setting looks like? That's the main reason I use Fractal Extreme over Ultra Fractal: once calculated, you can play with the colours all you like without waiting hours (days?) for a recalculation. UF allows a bit of this, unless you change the gradient mapping from, say, square root to cube root. Even better, in FE you can increase the iterations and only the black (M-set) pixels recalculate.
Perhaps the weirdest thing about FastFractal256 is that the coordinates are shown in hexadecimal, so you can't even use it to explore quickly and then reuse the coordinates in another program.
And unlicensed it will only run for 10 minutes; I'm not even remotely tempted to part with the $4.95 to license it.
So FastFractal256 has a decent enough 'engine', but practically everything else about it needs work.
cbuchner1
« Reply #7 on: June 03, 2011, 11:05:21 PM »

I am making some progress on 1024-bit integer multiplication using the Karatsuba algorithm. I am storing the numbers in a way that can be processed efficiently on CUDA devices of the older (pre-Fermi) generation, which only supports a fast 24x24-bit integer multiplication in hardware: nVidia's engineers re-used the multiplier circuits of the single-precision floating-point ALU, which has just 24 bits of mantissa.

The numbers consist of 64 legs ("bigits") of 16 bits each. I allow 8 extra bits of carry overflow per leg, giving me the 24 bits that CUDA can work with. Some bignum implementations (e.g. GMP) call these extra bits "nails". They let me delay the carry-propagation step until the final result of the multiplication is available. Carry propagation is hard to parallelize efficiently, and having to do it just once is a big win.
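
On the CPU side, the delayed-carry idea looks roughly like this (a sketch only: schoolbook instead of Karatsuba to keep it short, with 64-bit column accumulators standing in for the nail bits):

Code:
#include <stdint.h>

#define LEGS 64  /* 64 x 16-bit legs = 1024 bits */

/* Truncated 1024-bit product: partial products are summed into wide
   columns first, and the carries are propagated exactly once at the end. */
void mul1024_trunc(const uint16_t a[LEGS], const uint16_t b[LEGS],
                   uint16_t out[LEGS])
{
    uint64_t col[LEGS] = {0};
    for (int i = 0; i < LEGS; ++i)
        for (int j = 0; i + j < LEGS; ++j)      /* keep low 1024 bits only */
            col[i + j] += (uint32_t)a[i] * b[j];
    uint64_t carry = 0;                          /* single carry pass */
    for (int k = 0; k < LEGS; ++k) {
        uint64_t v = col[k] + carry;
        out[k] = (uint16_t)(v & 0xFFFFu);
        carry = v >> 16;
    }
}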

Right now I have a proof-of-concept multiplication running on the CPU that seems to generate correct results. The main difficulty is turning this serial implementation into highly parallel CUDA code. Afterwards it's all about deep Mandelbrot zooms... and fast, I hope. 1024 bits should allow zooms to approximately 1e300.

Does anyone have some *really* deep reference coordinates for a deep-zoom Mandelbrot? That would help me test whether my implementation is sound.

Afterwards I might try a similar thing using Intel AVX instructions.

Source code will be available. I hate not sharing.
msltoe
« Reply #8 on: June 04, 2011, 02:57:35 AM »

cbuchner,

 The CUFFT library may be useful for building a fairly simple CUDA-based arbitrary-precision program. I know that FFT-based multiplication is a little trickier than Karatsuba and doesn't scale as well for numbers of the size you'll be using, but the CUFFT library would make the slowest part of the method (the FFT itself) a breeze.

 When I was a kid, I tried FFT-based multiplication using the straightforward "multiply the FFTs of the big numbers together, then inverse-FFT" approach. It didn't yield perfect answers, as there were some round-off errors, and a carry operation is needed afterwards. But perhaps if I had aimed for fewer digits per block, the round-off error would have disappeared.

-mike
cbuchner1
« Reply #9 on: June 04, 2011, 02:48:52 PM »

FFT-based methods only become more efficient than the Karatsuba or Toom-3/Toom-4 algorithms for very large numbers (thousands of digits). For really deep Mandelbrot zooms a few hundred bits of precision are sufficient, so I think my choice of 1024 bits is more than enough.

Bignum packages like GMP are tuned to dynamically pick the best algorithm depending on problem size. For what I'm doing I think Karatsuba is well suited. Here are the current stats of my serial CPU implementation. To multiply two 1024-bit numbers and get a (truncated) 1024-bit result I need:

- 729 multiplications (24-bit x 24-bit, 8 clock cycles each to compute the resulting HI and LO words)
- 5320 additions (4 clock cycles each)
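
(That multiply count checks out: halving 64 legs down to 1 takes six levels of recursion, 64 -> 32 -> 16 -> 8 -> 4 -> 2 -> 1, and Karatsuba spawns three sub-multiplications per level, so 3^6 = 729 single-leg multiplications.)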

Now the cool part: there are hundreds of CUDA cores available on the GPU, so most of the above arithmetic operations will run in parallel, and memory accesses will be minimal (all data lives in the GPU's internal user-managed cache, also known as shared memory).

I will still have to map the serial CPU code to parallel CUDA code, which is non-trivial. The data-flow graph for a single iteration of Karatsuba already looks nasty when drawn on a sheet of paper. Now imagine that graph recursively inserted into the three multiplication nodes (third level from the top) with six levels of recursion, and then find a thread-parallel way to work through that mess.


[Attached image: flowgraph.jpg - data-flow graph for a single Karatsuba iteration]
panzerboy
« Reply #10 on: August 13, 2011, 07:57:25 AM »

Further to my previous rant about FastFractal256: in the Edit menu there is a System Settings dialog where you can reduce the number of CPU threads to 1 or 0. This seems to stop everything else freezing up while it renders. All other gripes remain.