Author Topic: CUDA Y.A.M.Z  (Read 14174 times)
Description: Yet Another Mandelbrot Zoomer
ker2x
Fractal Molossus
**
Posts: 795


WWW
« Reply #15 on: February 03, 2014, 11:44:00 AM »

nvcc fatal   : Unsupported gpu architecture 'compute_30'

Looks like I need a software upgrade :(
Logged

often times... there are other approaches which are kinda crappy until you put them in the context of parallel machines
(en) http://www.blog-gpgpu.com/ , (fr) http://www.keru.org/ ,
Sysadmin & DBA @ http://www.over-blog.com/
ker2x
Fractal Molossus
**
Posts: 795


WWW
« Reply #16 on: February 03, 2014, 11:54:05 AM »

accessing an array on the GPU like
Code:
index = y*width + x;
n = array[index];

is slower than accessing it like
Code:
index = y*width;
n = array[index + x];

       I think it's because the compiler optimizes "index" as an incremented register (I haven't checked the assembler output)

I don't know where you found this in the source code (I did a quick grep), but
Code:
n = array[y*width+x]
should be even faster, to avoid read-after-write latency.
And since it's a fused multiply-add:
Code:
n = array[fma(y,width, x)]

no ?
(sorry, can't try it right now)
Logged

3dickulus
Global Moderator
Fractal Senior
******
Posts: 1558



WWW
« Reply #17 on: February 03, 2014, 02:04:04 PM »

In the CMakeLists.txt file you can change the "GENCODE" option from 30 to 13 (compute_13 instead of compute_30) :)

in "simplerender"

Code:
    // map from threadIdx/BlockIdx to pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    // offset into output buffer
    int offset = (y*width);
    int clu;

    clu = fracfunc(lookup.x[offset+x], lookup.y[offset+x], d_divergence, d_maxiter, d_maxcol);

    // set pixel to color
    pixels[offset+x] = clookup[clu];


just speculating on my part:

using fma would calculate the value and store it in the variable named offset, then look it up three times
using a register would initialize the register once, then for each use it would increment the register once and read it three times

the speedup was very small, a few µs
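
A rough host-side illustration of the two indexing styles being compared, in plain C++ rather than the actual kernel. The helper names `fill_direct` and `fill_row_offset` and the test pattern are made up for this sketch. Both forms address the same element of a row-major buffer, so any timing difference can only come from how the compiler schedules the arithmetic:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: recompute the full index y*width + x for every pixel.
std::vector<int> fill_direct(int width, int height) {
    assert(width > 0 && height > 0);
    std::vector<int> buf(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            buf[static_cast<std::size_t>(y) * width + x] = y * 1000 + x;
    return buf;
}

// Hypothetical helper: compute the row offset once, then add x when indexing.
std::vector<int> fill_row_offset(int width, int height) {
    assert(width > 0 && height > 0);
    std::vector<int> buf(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; ++y) {
        const std::size_t offset = static_cast<std::size_t>(y) * width;
        for (int x = 0; x < width; ++x)
            buf[offset + x] = y * 1000 + x;
    }
    return buf;
}
```

In a CUDA kernel each thread handles a single (x, y), so there is no inner loop to hoist the offset out of; whether the separate `offset` variable helps there is best checked in the generated PTX.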

Struct of Arrays is also better than Array of Structs, but again not by a lot
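
For readers unfamiliar with the two layouts, here is a minimal plain C++ sketch. `PointAoS`, `LookupSoA`, `to_soa` and `sum_x` are names invented for this example; only the `lookup.x[]` / `lookup.y[]` shape is taken from the kernel snippet above:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of Structs: x and y of one point sit next to each other in memory.
struct PointAoS { double x; double y; };

// Struct of Arrays: all x values are contiguous, then all y values
// (the same shape as lookup.x[] / lookup.y[] in the kernel snippet).
struct LookupSoA {
    std::vector<double> x;
    std::vector<double> y;
};

// Hypothetical conversion helper for this sketch.
LookupSoA to_soa(const std::vector<PointAoS>& pts) {
    LookupSoA out;
    out.x.reserve(pts.size());
    out.y.reserve(pts.size());
    for (const PointAoS& p : pts) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
    }
    assert(out.x.size() == out.y.size());
    return out;
}

// Sum of all x values: with SoA this walks one contiguous array.
double sum_x(const LookupSoA& l) {
    double s = 0.0;
    for (double v : l.x) s += v;
    return s;
}
```

On a GPU the usual argument for SoA is that neighbouring threads reading `lookup.x[offset+x]` touch adjacent addresses, which lets the hardware combine the loads; how much that matters depends on the kernel.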

I haven't tried explicitly specifying fma(y,width, x) and I haven't examined the PTX assembler output to see whether the compiler optimizes "x*y+n" as such. From what I've read I understood that it does, but I'll give it a go the next time I'm tinkering with that code.

Logged

Resistance is fertile...
You will be illuminated!

                            #B^] https://en.wikibooks.org/wiki/Fractals/fragmentarium
ker2x
Fractal Molossus
**
Posts: 795


WWW
« Reply #18 on: February 03, 2014, 02:48:52 PM »

Code:
/usr/local/cuda/include/host_config.h:82:2: error: #error -- unsupported GNU version! gcc 4.5 and up are not supported!
I still need to do some software upgrades anyway >:(

In my experience (which is with OpenCL, but a GPU is a GPU), using a temporary variable to store a reusable intermediate result is not always a good idea on a GPU (as it is on a CPU).
Memory access (any memory) is slooow, while an fma is done in 1 cycle (well, 8 fma per cycle actually, in the ideal case; I don't remember the details).

I found in my code that using one long brute-force formula was faster than splitting it into small parts that could be reused.
I'm busy learning FPGA and haven't played with GPUs in a long time (that's why I need some software updates ;D).
Logged

3dickulus
Global Moderator
Fractal Senior
******
Posts: 1558



WWW
« Reply #19 on: February 03, 2014, 03:35:20 PM »

Global memory is slowest, __constant__ is faster, and __shared__ is supposed to be almost as fast as registers. I'm just taking it for granted that most operations (individually) are going to be as fast as or faster than on the CPU. If your graphics card supports it, you can configure __shared__ memory access as 8 bytes instead of the default 4 bytes, but this too raises some timing issues when mixing access to floats and doubles.

I'm very new to this GPU stuff and really just tinkering around; I'm sure I'll be making lots of mistakes :-[

The Mandelbrot function in this was originally written on a 16 MHz Motorola 68000 CPU and optimized to compare with assembly language; I'm sure it's nowhere near as fast as more recent methods.
Logged

3dickulus
Global Moderator
Fractal Senior
******
Posts: 1558



WWW
« Reply #20 on: February 04, 2014, 10:13:57 PM »

Code:
/usr/local/cuda/include/host_config.h:82:2: error: #error -- unsupported GNU version! gcc 4.5 and up are not supported!
I still need to do some software upgrades anyway >:(

gcc 4.7 works with CUDA 5.5 :)

FPGA looks really interesting. Will hard drives soon be obsolete?
Logged

3dickulus
Global Moderator
Fractal Senior
******
Posts: 1558



WWW
« Reply #21 on: February 11, 2014, 12:05:06 AM »

should be even faster, to avoid read-after-write latency.
And since it's a fused multiply-add:
Code:
n = array[fma(y,width, x)]

no ?
(sorry, can't try it right now)


finally found some time to test this.

when I try
Code:
n = array[fma(y,width, x)]
I get
Code:
error: expression must have integral or enum type


using...
Code:
index = y*width;
n = array[index + x];
Benchmark:
        Number of frames:       1000
        Avg msec per frame:     2.307000

using...
Code:
index = fma(y,width,x);
n = array[index];
Benchmark:
        Number of frames:       1000
        Avg msec per frame:     2.315000

Based on this result it seems that, in this particular case, fma(x,y,z) is not faster than x*y + z: fma took 0.008 ms longer per frame (8 ms over the 1000 frames), really small but still...
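
A likely reason for the "expression must have integral or enum type" error is that fma is a floating-point function, so its result cannot be used directly as a subscript. The sketch below shows this with the standard C++ `std::fma` on the host. That CUDA's `fma` resolves integer arguments the same way is an assumption, and `fma_index` / `plain_index` are hypothetical helper names:

```cpp
#include <cassert>
#include <cmath>
#include <type_traits>

// std::fma with integer arguments promotes them to double, so the result
// is a double and cannot be used directly as an array subscript.
static_assert(std::is_same<decltype(std::fma(1, 2, 3)), double>::value,
              "fma on ints yields double");

// Hypothetical helper: go through fma, then cast back to an integer index.
inline int fma_index(int y, int width, int x) {
    assert(y >= 0 && width > 0 && x >= 0);
    return static_cast<int>(std::fma(y, width, x));
}

// Plain integer form: no int -> double -> int round trip is needed,
// and the compiler may still pick a multiply-add instruction itself.
inline int plain_index(int y, int width, int x) {
    return y * width + x;
}
```

Assigning the fma result to an integer `index` first, as in the benchmark above, compiles because the conversion back to an integer happens implicitly; the extra conversions could plausibly account for the tiny slowdown, though that was not verified here.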
Logged

3dickulus
Global Moderator
Fractal Senior
******
Posts: 1558



WWW
« Reply #22 on: February 13, 2014, 09:51:43 PM »

using GPU ARPREC vs CPU ARPREC

Precision = 1000 digits
Iterations = 50000

CenterX=-1.6285250343823361883838262548545045410348661236584141601769446480597742613978652866884534147015438695866330324908358967896639954103891124710858983400930548513556561043206532203421245449026144184088972414350968083884667444419967905346296122369829125808715959589508004033683543856979653688355703357362596797912790215729307519005701670693821446556437233617104789888651797464159254445779960831204398799963531144151826157979487498653952347762098181463615781771842676200742084357646694586924130378944176232745311158540371210889162520250043231951

CenterY=0.0006786726672986807036534783258711032797071726013030485649555004108079612273532854099054877878854548091617523457504272839323659938218011891095623421488662732771512925574351926524067445041101561085174335116653613367712657068886759881133817710332110609908168838538010271006938137369081253790287372868878042144253940125139876166256963471716935289610917988374315494511164693787453719396359779881539576510119495792820935158658401193999362816789422637990442661756991633307202607735211837364599264587661238085446201036331218816037486284880927809

-Z=(distance from center to edge of rendered area) = 2.420740E-530

CPU = 1310.34 sec
GPU = 213.925 sec (about 6.1x faster)

No optimizations, estimations or perturbation: just calculate every pixel until we get a value using the standard M-set calculation, Xn+1 = (Xn * Xn) + X0.
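
For reference, the standard calculation named above can be sketched in ordinary double precision as follows. This is not the ARPREC/garprec code used for the 1000-digit run; the function name `escape_iterations` and the default divergence value of 4.0 are assumptions made for the example:

```cpp
#include <cassert>

// Iterate X(n+1) = X(n)*X(n) + X0 on complex numbers, starting from X0 itself.
// Returns how many iterations ran before |X|^2 exceeded `divergence`,
// or max_iter if the point never escaped.
int escape_iterations(double x0_re, double x0_im, int max_iter,
                      double divergence = 4.0) {
    assert(max_iter > 0);
    double re = x0_re, im = x0_im;
    for (int n = 0; n < max_iter; ++n) {
        const double re2 = re * re;
        const double im2 = im * im;
        if (re2 + im2 > divergence) return n;   // escaped
        const double new_re = re2 - im2 + x0_re; // real part of X*X + X0
        im = 2.0 * re * im + x0_im;              // imaginary part of X*X + X0
        re = new_re;
    }
    return max_iter;
}
```

At a zoom of E-530 a double runs out of digits long before neighbouring pixels can be told apart, which is why the timings above use an arbitrary-precision type for the same loop.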
Logged

3dickulus
Global Moderator
Fractal Senior
******
Posts: 1558



WWW
« Reply #23 on: February 20, 2014, 11:54:26 AM »

Something of a milestone for me, using my old code to zoom this deep. Now that I have a GUI for low-bit (double) zoom rendering on the GPU and garprec for high-precision rendering on the GPU, I am going to fiddle around with the SuperFractalThing maths and may try to build a garprec complex data type to use with those formulae.


* 1.2420740E-530.jpg (140.68 KB, 512x512 - viewed 284 times.)
Logged

3dickulus
Global Moderator
Fractal Senior
******
Posts: 1558



WWW
« Reply #24 on: March 15, 2014, 11:07:46 AM »

Another installment in the YAMZee file....
I've managed to cobble together a working ??? Qt C++ clone ??? of SuperFractalThing ???

Recap:
1: Really fast double-precision GPU Mandel zoomer
2: GpuARbitraryPRECision Mandel zoomer
3: SFT Qt C++ (ported from the Java source, uses the ARPREC lib)

Now all I have to do is mash them together into a GPU-friendly MandelZoomer :D

Anyone interested in the SFT C++ code, just drop me a note. GCC, CMake, ARPREC and Linux are required.

Disclaimer: highly experimental miscellaneous hackings best described as "brutish butchery"


* cudabrotSFT.jpg (49.7 KB, 800x651 - viewed 298 times.)
Logged

3dickulus
Global Moderator
Fractal Senior
******
Posts: 1558



WWW
« Reply #25 on: March 19, 2014, 02:50:38 AM »

Well, after a lot of crashes and weird-looking fractal rejects it is starting to stabilize: no more lost 25MB chunks, zoom/pan is not choking, and it seems to be OK with minor rendering tasks. No big zooms yet, but now I have some code to play with :) a C++ version of SFT :) My humblest thanks and appreciation to K.I. Martin for the SuperFractalThing.

As always, source code is available. It's a little rough, may crash, and probably won't compile out-of-the-box, but it does work on my machine :D Input is always appreciated. I have some ideas about where to apply GPU code but need to get things more stable first, i.e. checks and balances.



* cudabrot2.jpg (135.41 KB, 600x532 - viewed 292 times.)
« Last Edit: March 26, 2014, 05:32:01 PM by 3dickulus, Reason: url » Logged

3dickulus
Global Moderator
Fractal Senior
******
Posts: 1558



WWW
« Reply #26 on: March 23, 2014, 05:17:24 AM »

So...

I have the SFT engine ported from Java to C++, shoehorned into a Qt GUI and functional enough to get some deep images. There are still a few blobs and other issues, but it is basically a clone of the Java version. The translation to equivalent Qt GNU C++ was a breeze (it still needs fine tuning); the thread spawner was probably the hardest part. I think the 2DIndexBuffer class and the GNU threading can be replaced by QImage and QThread stuff. Currently it renders to a 2DIndexBuffer and then converts that into a pixmap that gets set as a GL texture. I'm doing it this way so that moving code to the GPU will be easier, as the GPU has access to texture buffers and this sets up all the GL code I need.

Speed test between the C++ version and the Java version using the attached parameters file @ 1024x768, 250000 iterations, zoom e-1550

3Dickulus C++ version (ARPREC)

  1st run (warm up)   58.88 sec
  2nd run             57.55 sec

Java version, standalone desktop app downloaded from SFT SourceForge (BigDecimal)

  1st run (warm up?)  767.314 sec
  2nd run             766.096 sec


I recall someone here saying "the math libs do all the work so the language doesn't really matter, interpreted or compiled". I beg to differ ;)

Moving this to the GPU shouldn't be too hard. Some stuff will work while other stuff just won't fly, but in the end a LOT of stuff can be offloaded to the GPU.

Java -> C++ = 10x faster (about 13x in this test). I'm hoping the GPU will be 10x faster again :)


* 1.09550882553E-1550.txt (3.19 KB - downloaded 115 times.)
Logged

3dickulus
Global Moderator
Fractal Senior
******
Posts: 1558



WWW
« Reply #27 on: March 26, 2014, 05:28:12 PM »

Found some interesting things in the source code, removed some bits, and optimized a few do-nothing tests/loops (with the delete key).
I think I've got a stable base to work from, so now the plan is to tweak the GUI a little, add a palette twiddler and a settings dialog, then start on the GPU/GL part.

This is the deepest I've gone, but after spending a couple of hours zooming and panning I think I can say it runs pretty smoothly :D

The attached settings are in SFT format; the latest source code is available.


* 1645.jpg (244.44 KB, 1024x768 - viewed 277 times.)
* 6.76383099125E-1645.txt (3.38 KB - downloaded 105 times.)
Logged

knighty
Fractal Iambus
***
Posts: 819


« Reply #28 on: April 10, 2014, 03:44:05 PM »

Thank you for the C++ port of SFT! :)
I haven't been able to compile it yet, mostly because OpenGL extensions are not handled the same way under Win32, but after reading some of the code I have a (noobish ;D) question:
- It looks like most of the allocated memory is not freed anywhere. For example, in sftgui.cpp, whenever the user selects New, a new QPixmap is allocated on the heap. Does Qt provide a garbage collector?
Logged
3dickulus
Global Moderator
Fractal Senior
******
Posts: 1558



WWW
« Reply #29 on: April 10, 2014, 04:59:47 PM »

Qt provides excellent GC: anything allocated (not malloc) in an object is freed when the object is destroyed, and if needed you can add cleanup code very easily.
EDIT: if you reuse a pixmap (or any QObject, afaik) the old one is destroyed first.
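
What Qt offers here is deterministic ownership rather than a tracing garbage collector: a QObject deletes its children when it is destroyed. The sketch below imitates that idea in plain C++ with `std::unique_ptr`. It does not use Qt, and `Node`, `add_child`, `g_live_nodes` and `build_and_destroy` are hypothetical names for this illustration only:

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Counts live Node objects so the cleanup can be observed.
static int g_live_nodes = 0;

// Hypothetical stand-in for a QObject-like class: a parent owns its children
// and destroys them when it is destroyed itself.
class Node {
public:
    Node() { ++g_live_nodes; }
    ~Node() { --g_live_nodes; }
    Node(const Node&) = delete;
    Node& operator=(const Node&) = delete;

    // Create a child owned by this node; the returned pointer is non-owning.
    Node* add_child() {
        children_.push_back(std::unique_ptr<Node>(new Node()));
        return children_.back().get();
    }

private:
    std::vector<std::unique_ptr<Node>> children_;
};

// Builds a three-node tree, lets it go out of scope, and returns how many
// nodes were alive while the tree existed.
int build_and_destroy() {
    int peak = 0;
    {
        Node root;
        Node* child = root.add_child();
        child->add_child();
        peak = g_live_nodes;
        assert(peak == 3);
    }  // root is destroyed here and takes the whole tree with it
    return peak;
}
```

Anything created with a bare `new` and never handed to an owner falls outside this scheme and still needs its own `delete`.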

On my system the memory consumption looks like this...

before running...
KiB Mem:   6123220 total,  2733684 used,  3389536 free,   313704 buffers
while running...
KiB Mem:   6123220 total,  2792516 used,  3330704 free,   313780 buffers
after running...
KiB Mem:   6123220 total,  2718576 used,  3404644 free,   313864 buffers

There seems to be a little extra free memory after a run (about 15 MB), probably due to Firefox tossing out some caches or something.

I have made some changes since posting the code: separated the Engine from the GUI, added a fractional iteration count, mapped the iteration count to visible-light wavelengths, i.e. 380-780 nm (just for fun), and the color map now has 1024 places.
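
One way such a mapping could look, as a sketch only: the actual code is not shown in the thread, so the linear mapping, the reading of the 380-780 range as nanometres of visible light, and the names `normalised`, `wavelength_nm` and `palette_index` are all assumptions:

```cpp
#include <cassert>
#include <cmath>

const int    kPaletteSize   = 1024;   // "color map has 1024 places"
const double kMinWavelength = 380.0;  // assumed nm, violet end
const double kMaxWavelength = 780.0;  // assumed nm, red end

// Normalise a (possibly fractional) iteration count to the range [0, 1].
double normalised(double iter, double max_iter) {
    assert(max_iter > 0.0);
    double t = iter / max_iter;
    if (t < 0.0) t = 0.0;
    if (t > 1.0) t = 1.0;
    return t;
}

// Linear map from iteration count to a wavelength in the assumed range.
double wavelength_nm(double iter, double max_iter) {
    return kMinWavelength +
           (kMaxWavelength - kMinWavelength) * normalised(iter, max_iter);
}

// Linear map from iteration count to a slot in the 1024-entry color map.
int palette_index(double iter, double max_iter) {
    const int i = static_cast<int>(
        std::floor(normalised(iter, max_iter) * kPaletteSize));
    return i < kPaletteSize ? i : kPaletteSize - 1;
}
```

A fractional iteration count simply lands between two integer counts, so neighbouring pixels get neighbouring palette slots instead of visible bands.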

I have been fiddling with the code a lot because I want to make sure that it's as stable as possible before trying to move it to the GPU. Recently I got this error when increasing the zoom past E-2023:

------------------------------
*** MPROUN: Exponent overflow.
*** mpabrt: execution terminated, error code =69
Segmentation fault
------------------------------

I'm in the process of tracking that down. Please be aware that this port is only a hack'n'chop job to get the engine running. The GL stuff is not required, but the intent is to have the GPU write the texture buffer; it can just as easily map the pixmap as widget contents directly. A bonus of using QGLWidget is that it exploits hardware multisampling :)




* MultiSampled.jpg (28.67 KB, 575x433 - viewed 281 times.)

* noMultiSample.jpg (36.19 KB, 559x417 - viewed 276 times.)
« Last Edit: April 10, 2014, 05:15:41 PM by 3dickulus, Reason: verbosity » Logged
