cbuchner1


« on: June 16, 2011, 11:10:00 AM » 

Hi,

I've recently reached a milestone by implementing bignum integer arithmetic on nVidia GPUs (targeting primarily the older first-generation cards, which can only perform 24-bit integer multiplication in their ALUs). I've unrolled and parallelized the recursive Karatsuba multiplication algorithm for 1024-bit operands. This code could easily be converted to fixed-point arithmetic and used for deep zooms into fractals.

A first benchmark shows that my laptop's nVidia 9600M GT can do 320,000 bignum multiplications per second. Extrapolating from this, I would expect around 2.2 million such multiplications per second on an nVidia GTX 260 card. What I cannot really tell is whether this is a high value for bignum mults; it is certainly nowhere near the giga-multiplications performance I had hoped for. Does anyone have CPU benchmarks for 1024-bit multiplications available? The highest I have found so far is for 512 bits, using SSE2 code.

How many iterations per pixel can one expect on average when zooming deeply into a Mandelbrot fractal (say 1E-100 screen space)? How do these iteration counts tend to grow with the zoom depth?

Christian


« Last Edit: June 16, 2011, 05:55:55 PM by cbuchner1 »





marius
Fractal Lover
Posts: 206


« Reply #1 on: June 16, 2011, 06:39:39 PM » 

Quote from: cbuchner1
"A first benchmark shows that my laptop's nVidia 9600M GT can do 320,000 bignum multiplications per second. [...] Does anyone have CPU benchmarks for 1024-bit multiplications available?"

You could write a little benchmark against the OpenSSL bignum library. Normally folk use it for modular bignum arithmetic for crypto purposes, but the raw primitives include a regular multiply etc. Or GMP, the GNU multiprecision library. The openssl command-line tool has benchmarking built in (openssl speed), but that focuses on modular arithmetic, so it's not directly comparable.







marius


« Reply #2 on: June 17, 2011, 12:40:20 AM » 

Quote from: cbuchner1
"Does anyone have CPU benchmarks for 1024-bit multiplications available?"

Here's a rough test. Linked against OpenSSL 0.9.8k, 64-bit, on a Xeon W3690 with gcc 4.4.3, I get 2.3M 'fixed point' 1024-bit multiplies / second / core. (Mind you, that's a pretty spiffy 6-core CPU, but still: having the GPU for lunch here using a single core.)

#include <stdio.h>
#include <time.h>
#include <openssl/bn.h>

int main(int argc, char* argv[]) {
    BN_CTX* ctx = BN_CTX_new();
    BIGNUM* a = BN_new();
    BIGNUM* b = BN_new();
    BIGNUM* c = BN_new();
    int i;
    clock_t start, end;

    BN_rand(a, 1024, 0, 0);
    BN_print_fp(stdout, a); printf("\n");
    BN_sqr(b, a, ctx);
    BN_rshift(b, b, 512);
    BN_mask_bits(b, 1024);
    BN_print_fp(stdout, b); printf("\n");

    start = clock();
    for (i = 0; i < 1000000; ++i) {
        BN_mul(c, b, a, ctx);   // c = a * b
        BN_rshift(b, c, 512);   // b = c >> 512
        BN_mask_bits(b, 1024);  // b &= 2^1024 - 1
    }
    end = clock();
    BN_print_fp(stdout, b); printf("\n");
    printf("%lf / second\n", ((double)i * CLOCKS_PER_SEC) / (end - start));

    BN_free(c); BN_free(b); BN_free(a);
    BN_CTX_free(ctx);
    return 0;
}







cbuchner1


« Reply #3 on: June 17, 2011, 12:47:40 AM » 

Thanks Marius for your results.
I've just done similar tests using MPIR (a Windows-friendly fork of GMP) on 64-bit Windows. I used mpn_kara_mul_n with 16 limbs (one limb being an unsigned long long, i.e. 64 bits).
This performed 10 million muls in about 5 seconds on a single core of an Intel Core i5-2500K. Okay, that puts my GPU version to shame. Back to the drawing board. And here I'd thought I was getting good at this CUDA thing.
Update: I'm actually getting 3 million muls/sec on my GTX 260 card. Better, but still no cigar.
2nd update: Okay, now I get in the region of 5 million bignum mults per second on the GTX 260. That would start to rival a fast dual-core CPU now...


« Last Edit: June 23, 2011, 10:29:11 PM by cbuchner1 »





real_het
Forums Freshman
Posts: 13


« Reply #4 on: June 18, 2011, 06:14:02 PM » 

Hello! Good to know there are other people trying to implement bigint math on GPUs (and that's the reason I joined this forum now).

I have some early benchmarks for a 1024-bit fixed-point (32-bit integer part) multiply on a 1 TFlops GPU (ATI HD4850), and it's 80 million / sec (100x1000 work area, and 1000 while-loop iterations in the kernel finished in 1219 ms). But these are kind of theoretical results, because I've only tested 128-bit multiplies (with my own eyes, and only a few tries), so there's a big chance my multiply code generator is wrong. I need a few days or weeks to finish it.

I'm using the 'school grade multiplication' algo, optimized for fixed point. (I've already tried the FFT method, described on google > big integer multiply, but it uses too many registers and float32 math is too small for it: for example, when you use 256 digits you can have only 7 bits in each digit, and the rest is needed for the FFT mults.)

I think you should unroll the recursion in the Karatsuba algo to get faster results on a GPU. And also consider forgetting about Karatsuba, because it has more operations than the school grade multiplication algo. AFAIK on ATI cards it takes 4 clocks to calculate a 32-bit mul or a 32-bit add, so removing a mul and replacing it with adds is not a win (however, it is good on a CPU, where a mul has 3 clocks of latency and an addition only 1). I think the same rules apply to NV SIMD hardware as well.

Here are 2 vids generated with the FFT algo (only 144 bits precision) > http://het.ath.cx/mandel But I hope the 'school grade' algo I'm writing will do it faster and will be able to run at 1024 bits precision. Anyway, I hope it can help optimizing your kernel. I still need a few days of coding to see an actual Mandelbrot rendering with this 'school grade' stuff; I'm kind of excited now. One mul and 2 sqrs already ran on my card at 2048 bits (it was 350 KBytes of GPU microcode, almost too big), but I have to double-check it with some automated tests.







cbuchner1


« Reply #5 on: June 18, 2011, 08:00:13 PM » 

For Karatsuba I need 1330 + 3990 32-bit adds (the first 1330 are needed before doing the multiplications, the other 3990 for recombining the multiplication results). I also need 729 integer mults (24x24-bit mults with 48-bit results) requiring 8 cycles each. A multiplication takes twice as long as an add on my hardware because the high and low parts of the result are computed separately.
I hold my data in limbs of 32 bits, of which 24 bits are significant (8 bits hold a sign and carries, and 16 bits represent the actual limb). All carry propagation is delayed until after the multiplication.
My main problem seems to be that I need way more cycles for extra bookkeeping than I actually spend on useful computation: for example, accessing the tables of permutation indices that control where to read the operands from and where to store the results. And because I work in shared memory, which is arranged in 16 memory banks, I have to optimize hard to avoid bank conflicts.
Also, I do not always have work for all 243 threads that I throw at each multiplication. At the beginning and the end I need fewer threads simultaneously (the dataset grows and shrinks during the process). I'm beginning to think that I chose the wrong parallelization approach.


« Last Edit: June 18, 2011, 08:25:44 PM by cbuchner1 »





marius


« Reply #6 on: June 18, 2011, 08:42:08 PM » 

Quote from: cbuchner1
"My main problem seems to be that I need way more cycles for extra bookkeeping than I actually spend on useful computation. [...] I begin to think that I chose the wrong parallelization approach."
I figure you have searched around a bit for work on implementing RSA on GPUs? Stuff like http://www.marcelokaihara.com/papers/An_Implementation_of_RSA2048_on_GPUs_using_CUDA.pdf etc.? (There's a bunch of material in there, and in other RSA papers, about Montgomery representation; ignore that. In the end large numbers need to get multiplied, so the scalability issues are similar.)

For your target deep MB zoom, you care more about throughput than latency. So give each thread its own multiplication rather than having threads cooperate on a single one? Perhaps I'm misinterpreting your approach.

It sounds like real_het generates the whole multiplication sequence as a straight-line chunk of code, almost without loops, and then runs many of those in parallel? But I'm just guessing; I haven't looked into this deeply, although I'm pretty familiar with crypto bignum code on non-GPUs :/







cbuchner1


« Reply #7 on: June 18, 2011, 11:04:48 PM » 

Quote from: marius
"So give each thread their own rather than cooperate on a single? Perhaps I'm misinterpreting your approach."
You're right. But I was worried I would be unable to fit the workspace for many parallel multiplications into the user-managed cache (16 KB of shared memory). That would require many additional accesses to the graphics card's main memory, and then the speed becomes limited by how much memory throughput we can achieve. But as things look, I will have to try your suggestion anyway.

I don't expect to beat ATI cards with my slightly outdated nVidia cards. They run circles around nVidia in integer applications such as RSA, SHA, bignum, and bitcoin mining.


« Last Edit: June 18, 2011, 11:13:20 PM by cbuchner1 »





real_het


« Reply #8 on: June 27, 2011, 12:06:37 AM » 

Hi again! Finally I'm rendering right now.

At 1024-bit precision, 512x512 pixels, 3,361 million total Mandelbrot iterations took 80 seconds; that's 42.1 million iterations per second (at zoom level 2^697). This is the point where register shortage becomes a bottleneck: there are 128 regs for 16*4 stream processors, so most of the threads must wait for available registers, and the GPU activity meter shows only 50%. 1280-bit precision will run for 120-180 secs (measured while finding a spot to zoom), and I guess only one thread out of 16 was active in each SIMD engine. And 1536-bit precision wasn't able to run without errors at all with my technique. At lower precisions the program was good: 768 bits -> 15-20 seconds, 640 bits -> 7-10 seconds, 512 bits -> 4-5 seconds. (That's expected from the long multiplication algo: when you double the precision, the computing time quadruples.) Times were measured on 2x HD6990 @ 950 MHz, which is 11.5 (FMAD) TFlops.

Anyway, here are some things I've found:
- Use the school-grade multiplication algo, because it has the lowest memory complexity, so we can take advantage of the superfast registers. Only 2 extra 32-bit registers are needed to calculate it (an accumulator and a carry). The total register usage is something like (precision/32bit)*3/4 (in xyzw regs).
- I've tried the FFT method before; it was bad because one float32 limb was only good for storing 7..9 bits (through the many mul/adds those numbers go through, their 24-bit capacity gets used up), and on top of that the memory complexity is many times bigger than school-grade mult (2x complex numbers * 2x FFT symmetry, all of which must be present to do the final inverse FFT, while school-grade works only with the inputs and 2 temps).
- Unroll everything (except the main loop, ofc).
- Try using the 32-bit mullo/mulhi (I've checked those in a 2007 CUDA pdf). They will do the dirty job; the more bits we can multiply at once, the better!

Now the render is at 2^741 (512x512, rendering at every 2x zoom); the total iteration count is now at 3,562 million, it started 2:15 hours ago, 40% of the frames are still left, and it will take 20 more hours... This is so exponential :S Plus that register bottleneck...

But wait a minute: it's at 37 million Mandels/sec now, and you said you have 10 million muls/sec on an x86. That seems totally slow then: 37/3 = only 12 mill mults/sec (dividing by the 3 multiplies in a Mandel)... Maybe I'm miscalculating something, but I'm afraid not.







cbuchner1


« Reply #9 on: June 28, 2011, 01:10:28 AM » 

Very nice to see your progress. I haven't started to actually render anything yet.

Did you try to exploit the fact that squaring a number is less complex than multiplying two arbitrary numbers? Some research papers I've skimmed suggest that some complexity (and memory access) can be saved when squaring bignums as opposed to multiplying two different numbers. The complex squaring z^2 = (a+bi)*(a+bi) required for a Mandelbrot iteration requires squaring the bignums a and b and computing the product 2ab. Because bignum squaring is required twice per iteration, some notable gains could be expected.

I agree with the reasons you cite for dropping FFT, but I am still a little sceptical about the school grade multiplication, because its complexity (the number of operations required) grows with the square of the length of the input numbers, i.e. O(n^2). Karatsuba splits the big problem into many smaller problems, and that actually saves real work. There is a reason some CPU-based bignum libraries use Karatsuba, Toom-3 and similar algorithms...

The mullo/mulhi you found in the CUDA docs actually only use 24-bit operands and produce a 48-bit result: mulhi returns the upper 32 bits, mullo returns the lower 32 bits, so there is overlap. For integer multiplication nVidia reused the multiplication circuits contained in the floating-point ALU, and floats happen to have a 24-bit mantissa. Newer cards (Fermi etc.) have a redesigned ALU without this limitation.

I wonder if a Mandelbulb deep zoom could be performed with bignum arithmetic as well? But I guess we'd need transcendental functions such as a bignum logarithm for proper distance estimation.


« Last Edit: June 28, 2011, 11:49:44 PM by cbuchner1 »





real_het


« Reply #10 on: July 01, 2011, 12:48:25 AM » 

Hi! Lemme present my first zoom vid


« Last Edit: July 01, 2011, 01:05:05 AM by real_het »





marius


« Reply #11 on: July 01, 2011, 07:03:05 AM » 

Quote from: real_het
"Hi! Lemme present my first zoom vid"
Awesome. I like all the computation shortcuts like constructing the final movie from 512x512 computed double zooms. Makes perfect sense for this type of thing.







Botond Kósa


« Reply #12 on: September 21, 2011, 09:17:55 PM » 

Quote from: real_het
"I've optimized for sqrs, yes; this was the exact inner loop I used:
xx := x*x; yy := y*y; xy := x*y;
x := xx - yy + x0;
y := xy + xy + y0;
z := xx + yy;
(and check z: if it's >= 4 then bail out)"
There is a trick to eliminate the x*y multiplication in your code: 2*x*y = (x+y)*(x+y) - x*x - y*y, that is, 3 squares (and some extra adds/subtracts, which are almost free compared to a square/multiply). Since x*x and y*y are already computed, you effectively replace a multiply with a square.

My own Mandelbrot deep zoom generator can perform 3.7 million squares and 1.1 million Mandelbrot iterations / core / second at 1020 bits of precision on a mobile Core 2 Duo T9400 (2.53 GHz). Although written in pure Java, it is a pretty fast application.







David Makin


« Reply #13 on: September 21, 2011, 09:50:19 PM » 

Quote from: cbuchner1
"How many iterations on average can one expect per pixel when zooming deeply into a Mandelbrot fractal (say 1E-100 screen space)? How do these iteration counts tend to grow with the zoom depth?"
In case it helps, here's a quote from Damien Jones' "Whoosh", though it's considerably deeper than 1e-100 (edit: oops, ignore that bit about the depth!):

"The final frame of this movie needs 35 decimals, and has a minimum iteration count of more than 1.6 million (the maximum calculated was just under ten million)."

http://www.damienmjones.com/whoosh/demo3.php


« Last Edit: September 25, 2011, 02:39:18 PM by David Makin »





marius


« Reply #14 on: September 21, 2011, 10:30:53 PM » 

Quote from: David Makin
"In case it helps here's a quote from Damien Jones' "Whoosh", though it's considerably deeper than 1e-100: 'The final frame of this movie needs 35 decimals, and has a minimum iteration count of more than 1.6 million (the maximum calculated was just under ten million).'"
Isn't '35 decimals' merely ~1e-35?







