Logo by mclarekin - Contribute your own Logo!

END OF AN ERA, FRACTALFORUMS.COM IS CONTINUED ON FRACTALFORUMS.ORG

it was a great time but no longer maintainable by c.Kleinhuis contact him for any data retrieval,
thanks and see you perhaps in 10 years again

this forum will stay online for reference
News: Visit us on facebook
 
*
Welcome, Guest. Please login or register. June 27, 2019, 03:20:14 PM


Login with username, password and session length


The All New FractalForums is now in Public Beta Testing! Visit FractalForums.org and check it out!


Pages: [1] 2   Go Down
  Print  
Share this topic on DiggShare this topic on FacebookShare this topic on GoogleShare this topic on RedditShare this topic on StumbleUponShare this topic on Twitter
Author Topic: bignum benchmark?  (Read 5554 times)
0 Members and 1 Guest are viewing this topic.
cbuchner1
Fractal Phenom
******
Posts: 443


« on: June 16, 2011, 11:10:00 AM »

Hi,

I've recently achieved a milestone by implementing bignum integer arithmetics on nVidia GPUs (targeting older first generation cards primarily which can only perform 24 bit integer multiplication in their ALU). I've unrolled and parallelized the recursive Karatsuba multiplication algorithm for 1024 bit operands. This code could easily be converted into fixed point arithmetics and be used for deep zooms into fractals.

A first benchmark shows that my laptop's nVidia 9600M GT can do 320000 bignum multiplications per second. Extrapolating from this I would expect around 2.2 million such multiplications per second on an nVidia GTX 260 card.

Now what I cannot really tell is if this is a high value for bignum mults. It is certainly nowhere near a Giga-Multiplications performance which I had hoped for wink

Does anyone have CPU benchmarks for 1024 bit multiplications available? The highest I have found so far is for 512 bits using SSE2 code.

How many iterations on average can one expect per pixel when zooming deeply into a Mandelbrot fractal (say 1E-100 screen space)? How do these iteration counts tend to grow with the zoom depth?

Christian
« Last Edit: June 16, 2011, 05:55:55 PM by cbuchner1 » Logged
marius
Fractal Lover
**
Posts: 206


« Reply #1 on: June 16, 2011, 06:39:39 PM »

A first benchmark shows that my laptop's nVidia 9600M GT can do 320000 bignum multiplications per second. Extrapolating from this I would expect around 2.2 million such multiplications per second on an nVidia GTX 260 card.

Now what I cannot really tell is if this is a high value for bignum mults. It is certainly nowhere near a Giga-Multiplications performance which I had hoped for wink

Does anyone have CPU benchmarks for 1024 bit multiplications available? The highest I have found so far is for 512 bits using SSE2 code.

You could write a little benchmark against the openssl bignum library? Normally folk use that to do modular bignum arithmetic for crypto purposes but the raw primitives have regular multiply etc. Or gmp, gnu multi-precision library.

The openssl command line tool has benchmarking built-in (openssl speed) but that focuses on modular arithmetic so not directly comparable.
Logged
marius
Fractal Lover
**
Posts: 206


« Reply #2 on: June 17, 2011, 12:40:20 AM »

A first benchmark shows that my laptop's nVidia 9600M GT can do 320000 bignum multiplications per second. Extrapolating from this I would expect around 2.2 million such multiplications per second on an nVidia GTX 260 card.

Now what I cannot really tell is if this is a high value for bignum mults. It is certainly nowhere near a Giga-Multiplications performance which I had hoped for wink

Does anyone have CPU benchmarks for 1024 bit multiplications available? The highest I have found so far is for 512 bits using SSE2 code.

You could write a little benchmark against the openssl bignum library? Normally folk use that to do modular bignum arithmetic for crypto purposes but the raw primitives have regular multiply etc. Or gmp, gnu multi-precision library.

The openssl command line tool has benchmarking built-in (openssl speed) but that focuses on modular arithmetic so not directly comparable.

Here's rough test:
Linked against openssl 0.9.8k, 64 bit on a Xeon W3690, gcc 4.4.3 I get 2.3M 'fixed point' 1024 bit multiplies / second / core  undecided
(mind you, that's a pretty spiffy 6 core CPU but still, having the GPU for lunch here using a single core)

Code:
#include <stdio.h>
#include <time.h>

#include <openssl/bn.h>

int main(int argc, char* argv[]) {
  BN_CTX* ctx = BN_CTX_new();
  BIGNUM* a = BN_new();
  BIGNUM* b = BN_new();
  BIGNUM* c = BN_new();

  int i;
  time_t start, end;

  BN_rand(a, 1024, 0, 0);
  BN_print_fp(stdout, a); printf("\n");
  BN_sqr(b, a, ctx);
  BN_rshift(b, b, 512);
  BN_mask_bits(b, 1024);
  BN_print_fp(stdout, b); printf("\n");

  start = clock();
  for (i = 0; i < 1000000; ++i) {
    BN_mul(c, b, a, ctx);  // c = a * b
    BN_rshift(b, c, 512);  // b = (c >> 512)
    BN_mask_bits(b, 1024);  // b &= 2^1024 - 1
  }
  end = clock();
  BN_print_fp(stdout, b); printf("\n");

  printf("%lf / second\n", ((double)i * CLOCKS_PER_SEC) / (end - start));

  BN_free(c);
  BN_free(b);
  BN_free(a);
  BN_CTX_free(ctx);

  return 0;
}


Logged
cbuchner1
Fractal Phenom
******
Posts: 443


« Reply #3 on: June 17, 2011, 12:47:40 AM »

Thanks Marius for your results.

I've just done similar tests using mpir (a windows friendly version of GMP) on Windows 64bit. I used mpn_kara_mul_n with 16 limbs (one limb being an unsigned long long, i.e. 64 bits).

This performed 10 million muls in about 5 seconds on a single core of an Intel Core i5-2500K. Okay, that puts my GPU version to shame. Back to the drawing board. wink  And I had thought I was getting good at this CUDA thing.

Update: actually getting 3 million muls/sec on my GTX 260 card. Better but still no cigar.

2nd Update: Okay, now I get in the region of 5 million bignum mults per second on a GTX 260 graphics card. This would start to rival one fast dual core CPU now...


« Last Edit: June 23, 2011, 10:29:11 PM by cbuchner1 » Logged
real_het
Forums Freshman
**
Posts: 13


« Reply #4 on: June 18, 2011, 06:14:02 PM »

Hello!

Good to know threre are other people who trying to implement bigint math on GPU's. (And that's the reason I joined this forum now cheesy).

I have some early benchmarks for 1024bit fixpoint (32bit integer part) multiply on a 1TFlops gpu (ati HD4850), and its 80 million / sec. (100*1000 workarea, and 1000 'whileloop' iterations in the kernel finished in 1219ms)
But these are kinda theoretical results because I only tested 128bit multiplies (with my own eyes and only a few tries), so there's a big chance my multiply_code_generator is wrong. Need a few days or weeks to finish it.

I'm using the 'school grade multiplication' algo optimized for fixpoint. (I've already tried the FFT method (described on google -> big integer multiply, but it uses too many registers and float32 math is too small for it -> for example when you use 256 digits you can have only 7 bits in each digit and the rest will need for the fft mults)

I think you should unroll the recursion in the Karatsuba algo to get faster results on a GPU. And also forget about Karatsuba because it has more operations than the "school grade multiplication algo".  AFAIK on ATI cards there are 4 clocks to calculate a 32bit mul or a 32bit add, so removing a mul and replacing it with adds is not good (however is't good on CPU where mul has 3 clocks(latency) and addition is only 1 clocks) I think same rules apply for NV simd hardware as well.

Here are 2 vids generated with the FFT algo, (only 144bits precision) -> http://het.ath.cx/mandel
But I hope the 'School grade' algo I write will do it faster and will be able to run at 1024 bits precision.

Anyways, I hope it can help optimizing your kernel. I still need a few days of coding to see an actual Mandenbrot rendering with this 'school grade' stuff, I'm kinda excite now.  One mul and 2x Sqr's already ran on my card at 2048 bits (it was 350KBytes of gpu microcode, it's almost too big), but I have to double-check it with some automated tests.
Logged
cbuchner1
Fractal Phenom
******
Posts: 443


« Reply #5 on: June 18, 2011, 08:00:13 PM »

for Karatsuba I need 1330 + 3990 32 bit adds (the first 1330 are needed before doing the multiplications, the other 3990 for recombining the multiplication results). I also need 729 integer mults (24 x 24 bit mults with 48 bit results) requiring 8 cycles each.  The multiplication takes twice as long as an add on my hardware because high and low part of the result are computed separately.

I hold my data in limbs of 32 bits, of which 24 bits are significant (8 bits of which hold a sign and carries and 16 bits represent the actual limb). All carry propagation is delayed until after the multiplication.

My main problem seems to be is that I need way more cycles for extra bookkeeping than I actually spend on useful computation. For examples accessing the tables with permutation indices that control where from to read the operands and where to store results. And because I work in shared memory that is arranged in 16 memory banks, I have to optimize hard to avoid bank conflicts.

Also I do not always have work for all the 243 threads that I throw at each multiplication. In the beginning and in the end I need fewer threads simultaneously (the dataset grows and shrinks during the process). I begin to think that I chose the wrong parallelization approach.
« Last Edit: June 18, 2011, 08:25:44 PM by cbuchner1 » Logged
marius
Fractal Lover
**
Posts: 206


« Reply #6 on: June 18, 2011, 08:42:08 PM »

for Karatsuba I need 1330 + 3990 32 bit adds (the first 1330 are needed before doing the multiplications, the other 3990 for recombining the multiplication results). I also need 729 integer mults (24 x 24 bit mults with 48 bit results) requiring 8 cycles each.  The multiplication takes twice as long as an add on my hardware because high and low part of the result are computed separately.

I hold my data in limbs of 32 bits, of which 24 bits are significant (8 bits of which hold a sign and carries and 16 bits represent the actual limb). All carry propagation is delayed until after the multiplication.

My main problem seems to be is that I need way more cycles for extra bookkeeping than I actually spend on useful computation. For examples accessing the tables with permutation indices that control where from to read the operands and where to store results. And because I work in shared memory that is arranged in 16 memory banks, I have to optimize hard to avoid bank conflicts.

Also I do not always have work for all the 243 threads that I throw at each multiplication. In the beginning and in the end I need fewer threads simultaneously (the dataset grows and shrinks during the process). I begin to think that I chose the wrong parallelization approach.


I figure you have searched around a bit for work on implementing RSA on GPUs? Stuff like
http://www.marcelokaihara.com/papers/An_Implementation_of_RSA2048_on_GPUs_using_CUDA.pdf etc.?
(there's a bunch of stuff in there, and in others regarding RSA, about Montgomery representation; ignore that. In the end large numbers need to get multiplied, so similar scalability issues)

For your target deep MB zoom, you care more about throughput than latency.
So give each thread their own rather than cooperate on a single? Perhaps I'm misinterpreting your approach.

Sounds like real_het generates the whole multiplication sequence as a straight-line chunk of code, almost w/o loops? And then run many of those in parallel?
But just guessing, haven't looked into this deeply, although i'm pretty familiar with crypto bignum code on non-GPUs :-/

Logged
cbuchner1
Fractal Phenom
******
Posts: 443


« Reply #7 on: June 18, 2011, 11:04:48 PM »

So give each thread their own rather than cooperate on a single? Perhaps I'm misinterpreting your approach.

You're right. But I was worried I would be unable fit the workspace for many parallel multiplications into the user managed cache (16kb of shared memory). This would require many additional memory accesses to the graphics card's main memory - and thus the speed becomes limited by how much memory throughput we can achieve.

But as things look, I will have to try your suggestion anyway wink I don't expect to beat ATI cards with my slightly outdated nVidia cards. They run circles around nVidia in integer applications such as RSA, SHA, bignum, bitcoin mining
« Last Edit: June 18, 2011, 11:13:20 PM by cbuchner1 » Logged
real_het
Forums Freshman
**
Posts: 13


« Reply #8 on: June 27, 2011, 12:06:37 AM »

Hi again!

Finally I'm rendering right now cheesy

At 1024bit precision, 512x512 pixels, 3361million total Mandelbrot iterations took 80seconds, and it's 42.1 million iterations per second. (its' at zoom level 2^697)
This is the point where register shortage became a bottleneck. (there are 128 regs for 16*4 stream processors so most of the threads must wait for available registers, and so the GPU activity meter shows only 50%)
1280bit precision will run for 120-180secs (measured while finding a spot to zoom), and only one thread was active out of 16 in each SIMD engine, i guess. And finally 1536bit precision wasn't able to run without errors at all with my technique.
On lower levels the program was good: 768bit precision -> 15-20 seconds, 640bits -> 7-10seconds, 512bits ->4-5seconds. (It's expected from the long multiplication algo -> when You double then precision, the computing time will be quadrupled)

Times were measured on 2xHD6990@950MHz, that's 11.5(FMAD)TFlops.

Anyways here's I some things I've found:
- Use the school_grade_multiplication algo, because it has the lowest memory complexity, so we can take advantage of the superfast registers. Only 2x 32bit extra registers needed to calculate it (an accumulator and a carry). The total register usage is something like (precision/32bit)*3/4 (in xyzw regs).
- I've tried the FFT method before: It was bad, because:
  • one float32 limb was only good for storing 9..7 bits (because the many mul/adds those numbers go through their 24bit capacity will used out)
  • up on that, memory complexity is so many times bigger than school_grade_mult (2x complex numbers * 2x FFT symmetry, this must be all present to make the final inverse fft, while schoolgrade works only with the inputs and 2 temps)
- Unroll everything(except the main loop ofc).
- Try using the 32bit mullo/mulhi (I've checked those in a 2007 cuda pdf cheesy). This will do the dirty job, the more bits we can multiply at once, the better!

Now the render is at 2^741 (512x512, rendering every 2x zooms, and the total iterationb count now is at 3562 million, started 2:15 hours ago and still 40% of the frames are left, and it will tok 20 more hours... This is so exponential :S plus that register bottleneck...

But wait a minute! -> It's at 37MillionMandels/sec now and you said that you have 10million muls/sec on an x86. This is totally slow then either cheesy 37/3= only 12mill mults/sec (dividing with the 3 multiplies in a mandel)...

Maybe I'm miscalculating something, but I'm affraid that not cheesy
Logged
cbuchner1
Fractal Phenom
******
Posts: 443


« Reply #9 on: June 28, 2011, 01:10:28 AM »

Very nice to see your progress. I haven't started to actually render anything.

Did you try to exploit the fact that squaring a number is less complex than multiplying arbitrary numbers? Some research papers I've skimmed suggest that some complexity (and memory access) can be saved when squaring bignums as opposed to multiplying two different numbers. The complex squaring z^2 = (a+bi)*(a+bi) required for a Mandelbrot requires squaring bignums a and b and computing the product 2ab. Because bignum squaring is required twice, some notable gains would be expected.

I agree with the reasons you cite for dropping FFT, but I am still a little sceptical about the school grade multiplication because its complexity (the number of operations required) will grow with the length of the input numbers squared, i.e. O(n^2).  Karatsuba would split the big problem into many smaller problems and that actually saves real work.  There is a reason some CPU based bignum libraries use Karatsuba, Toom-3 and like algorithms... wink

The mullo/mulhi you found in the CUDA doc actually only uses 24 bit operands and produces a 48 bit result. mulhi returns the upper 32 bits, mullo returns the lower 32 bits. There is overlap. For integer multiplication nVidia reused the multiplication circuits contained in the floating point multiplication ALU... floats happen to have a 24 bit mantissa... Newer cards (Fermi etc) have a redesigned ALU without this limitation.

I wonder if a Mandelbulb deep zoom could be performed with bignum arithmetics as well? But I guess we'd need transcendental functions such as a bignum logarithm for proper distance estimation.
« Last Edit: June 28, 2011, 11:49:44 PM by cbuchner1 » Logged
real_het
Forums Freshman
**
Posts: 13


« Reply #10 on: July 01, 2011, 12:48:25 AM »

Hi!
Lemme present my first zoom vid cheesy <a href="http://vimeo.com/moogaloop.swf?clip_id=25833277&amp;server=vimeo.com&amp;fullscreen=1&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=01AAEA" target="_blank">http://vimeo.com/moogaloop.swf?clip_id=25833277&amp;server=vimeo.com&amp;fullscreen=1&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=01AAEA</a>

I've optimized for sqrs, yes, this was exact inner loop I've used:
    xx:=x*x;
    yy:=y*y;
    xy:=x*y;
    x:=xx-yy+x0;
    y:=xy+xy+y0;
    z:=xx+yy;
(and check z if it's ge 4 then bailout)

And later in the code generator inside the 2 squares, there were some more optimization opportunities when doing the 32bit muls.When A2*B0 = A0*B2 because A=B. This eleminates some muls, but can't eleminate more '64bit carry adders' (3 instructions: an add, a carry check, and an add_1_to_carry thing). If the gpu was designed to maintain a carry flag and anc inrementor based on that flag, this would halve the operations. But now 50% of the time is spent on determining the carry bit and adding that to a 32 bit number cheesy

"i.e. O(n^2)" -> Weirdly it was not the operation count that became the bottleneck even on my old hd4850, but the need of registers when launching 16 threads on a simd engine. At 1280 bits precision it was around 16*40 registers out of 128, thus so many threads spent most of the time waiting for available registers. You were right, to use memory cheesy but at lower precision we can use regs and 'burn' the cards at 100% efficiency. Beyond this zoom level I gotta learn using that gpu memory, it's still 10x faster than cpu memory but 10-100x slower than what you can do in a single gpu thread and 8*16=128bytes of superfast register memory.

"For integer multiplication nVidia reused the multiplication circuits contained in the floating point multiplication ALU..." Damn how tricky those engineers are when it comes to do the most things on the lowest number of transistors cheesy I'm still amazed that they using only 1 control unit for 64x 32bit execution units, that's why we have no true IFs at all cheesy I've seen that the NV gpu is kinda intelligent like an Intel cpu when it comes to allocate execution units for a linear program code, but I know that ati(amd) cards are so 'dumb' that they are getting pure microcode (VLIW) from the pcie bus that tells every exec units when and what to do, and this dumbness saves a lot of space on the chip to place more exec units, but when it comes to memory operations, then lots of exec units have to sleep cheesy
« Last Edit: July 01, 2011, 01:05:05 AM by real_het » Logged
marius
Fractal Lover
**
Posts: 206


« Reply #11 on: July 01, 2011, 07:03:05 AM »

Hi!
Lemme present my first zoom vid cheesy <a href="http://vimeo.com/moogaloop.swf?clip_id=25833277&amp;server=vimeo.com&amp;fullscreen=1&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=01AAEA" target="_blank">http://vimeo.com/moogaloop.swf?clip_id=25833277&amp;server=vimeo.com&amp;fullscreen=1&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=01AAEA</a>

I've optimized for sqrs, yes, this was exact inner loop I've used:
    xx:=x*x;
    yy:=y*y;
    xy:=x*y;
    x:=xx-yy+x0;
    y:=xy+xy+y0;
    z:=xx+yy;
(and check z if it's ge 4 then bailout)

And later in the code generator inside the 2 squares, there were some more optimization opportunities when doing the 32bit muls.When A2*B0 = A0*B2 because A=B. This eleminates some muls, but can't eleminate more '64bit carry adders' (3 instructions: an add, a carry check, and an add_1_to_carry thing). If the gpu was designed to maintain a carry flag and anc inrementor based on that flag, this would halve the operations. But now 50% of the time is spent on determining the carry bit and adding that to a 32 bit number cheesy

"i.e. O(n^2)" -> Weirdly it was not the operation count that became the bottleneck even on my old hd4850, but the need of registers when launching 16 threads on a simd engine. At 1280 bits precision it was around 16*40 registers out of 128, thus so many threads spent most of the time waiting for available registers. You were right, to use memory cheesy but at lower precision we can use regs and 'burn' the cards at 100% efficiency. Beyond this zoom level I gotta learn using that gpu memory, it's still 10x faster than cpu memory but 10-100x slower than what you can do in a single gpu thread and 8*16=128bytes of superfast register memory.

"For integer multiplication nVidia reused the multiplication circuits contained in the floating point multiplication ALU..." Damn how tricky those engineers are when it comes to do the most things on the lowest number of transistors cheesy I'm still amazed that they using only 1 control unit for 64x 32bit execution units, that's why we have no true IFs at all cheesy I've seen that the NV gpu is kinda intelligent like an Intel cpu when it comes to allocate execution units for a linear program code, but I know that ati(amd) cards are so 'dumb' that they are getting pure microcode (VLIW) from the pcie bus that tells every exec units when and what to do, and this dumbness saves a lot of space on the chip to place more exec units, but when it comes to memory operations, then lots of exec units have to sleep cheesy


Awesome. I like all the computation shortcuts like constructing the final movie from 512x512 computed double zooms. Makes perfect sense for this type of thing.
Logged
Botond Kˇsa
Fractal Lover
**
Posts: 233



WWW
« Reply #12 on: September 21, 2011, 09:17:55 PM »

I've optimized for sqrs, yes, this was exact inner loop I've used:
    xx:=x*x;
    yy:=y*y;
    xy:=x*y;
    x:=xx-yy+x0;
    y:=xy+xy+y0;
    z:=xx+yy;
(and check z if it's ge 4 then bailout)

There is a trick to eliminate the x*y multiplication in your code: 2*x*y = (x+y)*(x+y) - x*x - y*y, that is 3 squares (and some extra add/subtract, which are almost free compared to square/multiply). Since x*x and y*y are already computed, you actually replace a multiply with a square.

My own Mandelbrot deep zoom generator can perform 3.7 million squares and 1.1 million Mandelbrot iterations / core / second at 1020 bits of precision on a mobile Core 2 Duo T9400 (2.53 GHz). Although written in pure Java, it is a pretty fast application.
Logged

Check out my Mandelbrot set explorer:
http://web.t-online.hu/kbotond/mandelmachine/
David Makin
Global Moderator
Fractal Senior
******
Posts: 2286



Makin' Magic Fractals
WWW
« Reply #13 on: September 21, 2011, 09:50:19 PM »


How many iterations on average can one expect per pixel when zooming deeply into a Mandelbrot fractal (say 1E-100 screen space)? How do these iteration counts tend to grow with the zoom depth?


In case it helps here's a quote from Damien Jones' "Whoosh", though it's considerably deeper than 1e-100: Edit:Ooops - ignore that bit about the depth !!

"The final frame of this movie needs 35 decimals, and has a minimum iteration count of more than 1.6 million (the maximum calculated was just under ten million)."

http://www.damienmjones.com/whoosh/demo3.php
« Last Edit: September 25, 2011, 02:39:18 PM by David Makin » Logged

The meaning and purpose of life is to give life purpose and meaning.

http://www.fractalgallery.co.uk/
"Makin' Magic Music" on Jango
marius
Fractal Lover
**
Posts: 206


« Reply #14 on: September 21, 2011, 10:30:53 PM »

In case it helps here's a quote from Damien Jones' "Whoosh", though it's considerably deeper than 1e-100:

"The final frame of this movie needs 35 decimals, and has a minimum iteration count of more than 1.6 million (the maximum calculated was just under ten million)."

'35 decimals' is not merely ~1e-35?  huh?
Logged
Pages: [1] 2   Go Down
  Print  
 
Jump to:  

Related Topics
Subject Started by Replies Views Last post
A processor benchmark for Mandelbulb 3d Mandelbulb 3d « 1 2 3 » ant123 44 12703 Last post December 10, 2017, 09:44:52 PM
by AtomicNixon
Fractal Extreme Performance Benchmark Fractal eXtreme stardust4ever 0 6568 Last post March 21, 2012, 03:21:06 PM
by stardust4ever
Benchmark-file for Mandelbulb3D, Mandelbulber, etc. Let's collaborate on something! VanlindtMarc 3 1028 Last post August 04, 2013, 12:26:52 PM
by taurus
squaring benchmark Complex Numbers claude 13 667 Last post April 11, 2016, 09:43:51 AM
by Kalles Fraktaler

Powered by MySQL Powered by PHP Powered by SMF 1.1.21 | SMF © 2015, Simple Machines

Valid XHTML 1.0! Valid CSS! Dilber MC Theme by HarzeM
Page created in 0.162 seconds with 27 queries. (Pretty URLs adds 0.007s, 2q)