
 Author Topic: bignum benchmark?  (Read 5555 times)
David Makin
Global Moderator
Fractal Senior

Posts: 2286

 « Reply #15 on: September 25, 2011, 02:38:04 PM »

In case it helps here's a quote from Damien Jones' "Whoosh", though it's considerably deeper than 1e-100:

"The final frame of this movie needs 35 decimals, and has a minimum iteration count of more than 1.6 million (the maximum calculated was just under ten million)."

Isn't '35 decimals' merely ~1e-35?

Duh! Thanks Marius - I really should concentrate more! It looked deeper than some of the 1e-100 zooms I've seen before, but probably only because those cheated, being in dull and uninteresting locations that are quick to calculate.
Post corrected.
 Logged

The meaning and purpose of life is to give life purpose and meaning.

http://www.fractalgallery.co.uk/
"Makin' Magic Music" on Jango
real_het
Forums Freshman

Posts: 13

 « Reply #16 on: September 28, 2011, 04:48:04 PM »

There is a trick to eliminate the x*y multiplication in your code: 2*x*y = (x+y)*(x+y) - x*x - y*y, that is 3 squares...
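To illustrate the trick, here is a minimal C sketch of one Mandelbrot iteration z → z² + c computed with three squarings instead of two squarings plus a general multiply. It uses plain doubles (hypothetical names; a real bignum library would call its sqr() routine at each marked spot, which is where the saving comes from):

```c
#include <stdint.h>

typedef struct { double x, y; } complex_d;

/* One iteration z -> z^2 + c using the identity 2*x*y = (x+y)^2 - x^2 - y^2,
 * so the general x*y multiply is replaced by one extra (cheaper) squaring. */
complex_d step_three_squares(complex_d z, complex_d c) {
    double xx = z.x * z.x;                  /* sqr() in a bignum setting */
    double yy = z.y * z.y;                  /* sqr() */
    double s  = (z.x + z.y) * (z.x + z.y);  /* sqr() of the sum */
    complex_d r;
    r.x = xx - yy + c.x;                    /* Re(z^2) + Re(c) */
    r.y = (s - xx - yy) + c.y;              /* 2*x*y + Im(c), no general mul */
    return r;
}
```

For example, (1 + 2i)² = -3 + 4i, and the function reproduces that without ever multiplying x by y directly.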

A quick revisit of my code, to find out how I could improve it with your hint, revealed that the sqr() function is also suboptimal. It should accumulate the symmetric multiplications (a5*a8 = a8*a5) only once and then shift the sum left by 1, rather than accumulate and add the previously calculated sum. (That suboptimal version was generated by my little optimizer, but at least it was automatic.)

So let's say the x*y bigmul uses 4+4 instructions in its basic 'cell' (1 mul_lo or mul_hi, plus an add, then a carry check, and finally an add for the accumulated carry values),
and the s*s bigmul uses 4+3 instructions (the smaller '3' cell only reuses a previous multiplication result).

Then x*x, y*y, x*y costs (4+3)*2 + (4+4) = 22 (scaling with limbcount^2).
Further optimizing with the trick: x*x, y*y, (x+y)^2 costs (4+3)*3 = 21.
And finally, implementing sqr() with the proper optimization (not with the dumb automatic optimizer I used) can reduce it to 4*3 = 12 (around 2x faster).
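The squaring optimization described above can be sketched as follows: accumulate each symmetric cross product a[i]*a[j] (i < j) only once, shift the whole cross sum left by 1, then add the diagonal terms. This is a simplified illustration (limb count and function name are made up, and carries are kept in 64-bit columns rather than propagated, so it assumes small limb values):

```c
#include <stdint.h>

#define LIMBS 4

/* Column-wise squaring of a LIMBS-limb number: cross products counted once,
 * then doubled with a single shift, then diagonal squares added.
 * Carry propagation is omitted for clarity; a real implementation must
 * normalize col[] back to 32-bit limbs afterwards. */
void sqr_columns(const uint32_t a[LIMBS], uint64_t col[2 * LIMBS]) {
    for (int k = 0; k < 2 * LIMBS; k++) col[k] = 0;
    /* each symmetric product a[i]*a[j] computed exactly once */
    for (int i = 0; i < LIMBS; i++)
        for (int j = i + 1; j < LIMBS; j++)
            col[i + j] += (uint64_t)a[i] * a[j];
    /* double the whole cross sum: one shift instead of adding twice */
    for (int k = 0; k < 2 * LIMBS; k++) col[k] <<= 1;
    /* diagonal terms a[i]^2 are not doubled */
    for (int i = 0; i < LIMBS; i++)
        col[2 * i] += (uint64_t)a[i] * a[i];
}
```

Squaring the "number" with limbs {1, 2, 3, 4} gives the same columns as the polynomial (1 + 2x + 3x² + 4x³)²: {1, 4, 10, 20, 25, 24, 16, 0}.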

(note: the basic computing 'cell' in my long multiplication algorithm is this:
1. calculate a*b (return hi or lo 32 bits)
2. add the product to the 'Accum' register
3. compare the new 'Accum' with the previous 'Accum' value (result is -1 when old>new, otherwise 0)
4. add the result of the comparison (3.) to the 'CarryAccum' register. )
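The cell above can be sketched in plain C like this (names are made up; the carry is counted as +1 here, whereas the post accumulates the -1 comparison result and presumably compensates for the sign later):

```c
#include <stdint.h>

typedef struct { uint32_t accum, carry_accum; } cell_state;

/* One 'cell' of the long multiplication: take one 32-bit partial product
 * (mul_lo or mul_hi of a*b), add it to the accumulator, and detect the
 * carry by comparing the new accumulator with the old one - with unsigned
 * wraparound, new < old exactly when a carry occurred. */
void cell_step(cell_state *s, uint32_t a, uint32_t b, int want_hi) {
    uint64_t p = (uint64_t)a * b;
    uint32_t part = want_hi ? (uint32_t)(p >> 32) : (uint32_t)p;
    uint32_t old = s->accum;
    s->accum += part;                   /* step 2: accumulate */
    s->carry_accum += (s->accum < old); /* steps 3+4: carry check + add */
}
```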

Well, in theory this could be cool.
Wish I had more time to play with it...
 Logged
Botond Kósa
Fractal Lover

Posts: 233

 « Reply #17 on: October 06, 2011, 10:00:22 AM »

1. calculate a*b (return hi or lo 32 bits)
2. add the product to the 'Accum' register
3. compare the new 'Accum' with the previous 'Accum' value (result is -1 when old>new, otherwise 0)
4. add the result of the comparison (3.) to the 'CarryAccum' register. )

Another possible way of speeding up your basic computing cell is to replace the comparison in step 3 with some bitwise operations and shifting. For example, when using signed integers,
Code:
if(old>new) then result = -1; else result = 0;
can be replaced with
Code:
result = (new-old)>>31;
which does the same, since if new-old is negative, its two's complement representation starts with a 1 bit, otherwise with 0. Shifting it right by 31 bits preserves the sign, resulting in 0 for nonnegative numbers and -1 for negative ones (with all bits set to 1). Although this is 2 operations, it will be faster than the IF statement: comparisons are slow because they introduce a conditional branch, and despite the advanced branch predictors in modern CPUs, a misprediction causes a pipeline flush, which degrades performance.
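As a self-contained C version of that branchless comparison (the function name is made up): note that in C, unlike Java, right-shifting a negative signed value is implementation-defined rather than guaranteed arithmetic, though every mainstream compiler does shift arithmetically; the subtraction is done in unsigned arithmetic here to avoid signed-overflow undefined behavior.

```c
#include <stdint.h>

/* Branchless equivalent of: if (oldv > newv) result = -1; else result = 0;
 * valid whenever newv - oldv does not exceed the signed 32-bit range. */
int32_t cmp_flag(int32_t oldv, int32_t newv) {
    int32_t diff = (int32_t)((uint32_t)newv - (uint32_t)oldv); /* no UB */
    return diff >> 31;  /* arithmetic shift: -1 when oldv > newv, else 0 */
}
```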
 Logged

Check out my Mandelbrot set explorer:
http://web.t-online.hu/kbotond/mandelmachine/
cbuchner1
Fractal Phenom

Posts: 443

 « Reply #18 on: November 03, 2014, 12:17:52 PM »

Hi,

an associate and I have implemented big-number maths in CUDA. We use warp-synchronous programming, i.e. 32 threads cooperatively process the maths for one number. The minimum system requirement is an NVIDIA GPU with Compute Capability 3.0 (Kepler and later).

Our current approach is a 1024-bit unsigned integer implementation, using one 32-bit limb in each thread. We have implemented bit shifts, comparison, addition, multiplication (with the 2048-bit result split into lo and hi parts), and division. I also started working on a 2048-bit version using 64-bit integers per thread. I think going up to 4096 bits should be feasible too - beyond that it gets increasingly difficult to scale this up.
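The limb layout described above (1024 bits = 32 x 32-bit limbs) can be illustrated with a serial CPU sketch of addition; the CUDA version would give one limb to each of the 32 threads in a warp and resolve carries with warp-wide intrinsics (e.g. ballot/shuffle), but the ripple-carry logic is the same. All names here are illustrative, not the library's actual API:

```c
#include <stdint.h>

#define NLIMBS 32  /* 32 limbs x 32 bits = 1024-bit number */

/* 1024-bit unsigned addition over little-endian 32-bit limbs.
 * Each column adds two limbs plus the incoming carry in 64-bit arithmetic;
 * the high half of the sum becomes the next carry. Returns the final
 * carry out (1 if the result overflowed 1024 bits). */
uint32_t bignum_add(uint32_t r[NLIMBS],
                    const uint32_t a[NLIMBS], const uint32_t b[NLIMBS]) {
    uint32_t carry = 0;
    for (int i = 0; i < NLIMBS; i++) {
        uint64_t s = (uint64_t)a[i] + b[i] + carry;
        r[i] = (uint32_t)s;
        carry = (uint32_t)(s >> 32);
    }
    return carry;
}
```

Adding 1 to the all-ones value wraps every limb to zero and returns a carry out, which is the worst case for carry propagation and the reason the warp-parallel version needs cross-thread communication.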

We've also done modular exponentiation with Montgomery multiplication (for crypto-related applications, e.g. dealing with large prime numbers).

It would be lovely to do some deep Mandelbrot zooms with these libraries, but I fear in its current form we can't do it unless we track the sign bit of each operation separately.
 « Last Edit: November 03, 2014, 12:21:51 PM by cbuchner1 » Logged