ker2x
Fractal Molossus
Posts: 795
|
|
« Reply #15 on: February 03, 2014, 11:44:00 AM » |
|
nvcc fatal : Unsupported gpu architecture 'compute_30'
Looks like I need a software upgrade.
|
ker2x
Fractal Molossus
Posts: 795
|
|
« Reply #16 on: February 03, 2014, 11:54:05 AM » |
|
Quote:
accessing an array on gpu like
index=y*width+x; n = array[index]
is slower than accessing like
index=y*width; n = array[index+x]
I think it's because the compiler optimizes "index" as an incremented register (haven't checked assembler output)

I don't know where you found this in the source code (I did a quick grep), but this should be even faster, since it avoids read-after-write latency. And since it's a fused multiply-add:
n = array[fma(y,width, x)]
no? (Sorry, can't try it right now.)
|
3dickulus
|
|
« Reply #17 on: February 03, 2014, 02:04:04 PM » |
|
In the CMakeLists.txt file you can change the "GENCODE" option from 30 to 13.

In "simplerender":

    // map from threadIdx/blockIdx to pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    // offset into output buffer
    int offset = (y*width);
    int clu;
    clu = fracfunc(lookup.x[offset+x], lookup.y[offset+x], d_divergence, d_maxiter, d_maxcol);
    // set pixel to color
    pixels[offset+x] = clookup[clu];

Just speculating on my part: using fma would calculate the value and store it in the variable named offset, which is then looked up three times. Using a register would initialise the register once, then for each use increment it once and read it three times. The speedup was very small, a few µsec. Struct of Arrays vs Array of Structs is better too, but again not by a lot.

I haven't tried explicitly specifying fma(y,width, x), and I haven't examined the PTX assembler output to see if the compiler optimizes "x*y+n" as such. I understood that it does from what I've read, but I'll give it a go the next time I'm tinkering with that code.
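For anyone who wants to compare the two indexing forms outside the kernel, here is a minimal host-side C++ sketch. It is not the code from the project: `LookupSoA`, `make_table`, `sum_full_index` and `sum_row_offset` are hypothetical names, and the struct only loosely mirrors the `lookup.x` / `lookup.y` struct-of-arrays accesses shown above. It only demonstrates that both forms address the same elements; it says nothing about GPU timing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical struct-of-arrays table: all x values together, all y values together.
struct LookupSoA {
    std::vector<double> x;
    std::vector<double> y;
};

// Fill a width*height table with simple, exactly representable values.
LookupSoA make_table(int width, int height) {
    assert(width > 0 && height > 0);
    const std::size_t n = static_cast<std::size_t>(width) * static_cast<std::size_t>(height);
    LookupSoA t;
    t.x.resize(n);
    t.y.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        t.x[i] = static_cast<double>(i);
        t.y[i] = 2.0 * static_cast<double>(i);
    }
    return t;
}

// Form 1: index = y*width + x, recomputed in full for every pixel.
double sum_full_index(const LookupSoA& t, int width, int height) {
    const std::size_t w = static_cast<std::size_t>(width);
    const std::size_t h = static_cast<std::size_t>(height);
    double sum = 0.0;
    for (std::size_t y = 0; y < h; ++y) {
        for (std::size_t x = 0; x < w; ++x) {
            const std::size_t index = y * w + x;
            sum += t.x[index] + t.y[index];
        }
    }
    return sum;
}

// Form 2: offset = y*width computed once per row, x added at each access.
double sum_row_offset(const LookupSoA& t, int width, int height) {
    const std::size_t w = static_cast<std::size_t>(width);
    const std::size_t h = static_cast<std::size_t>(height);
    double sum = 0.0;
    for (std::size_t y = 0; y < h; ++y) {
        const std::size_t offset = y * w;
        for (std::size_t x = 0; x < w; ++x) {
            sum += t.x[offset + x] + t.y[offset + x];
        }
    }
    return sum;
}
```

Whether form 2 is actually faster depends on what the compiler does with it, so the PTX output is still the thing to check.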
|
ker2x
Fractal Molossus
Posts: 795
|
|
« Reply #18 on: February 03, 2014, 02:48:52 PM » |
|
/usr/local/cuda/include/host_config.h:82:2: error: #error -- unsupported GNU version! gcc 4.5 and up are not supported!
I still need to do some software upgrades anyway.

In my experience (which is OpenCL, but a GPU is a GPU), using a temporary variable to store a reusable intermediate result is not always a good idea on a GPU (as it is on a CPU). Memory access (any memory) is slowwwwwwwwww, while an fma is done in 1 cycle (well, 8 fma per cycle actually, in the ideal case; I don't remember the details). I found in my code that using one long brute-force formula was faster than splitting it into small parts that could be reused.

I'm busy learning FPGA and haven't played with GPUs for a long time (that's why I need some software updates).
|
3dickulus
|
|
« Reply #19 on: February 03, 2014, 03:35:20 PM » |
|
Global mem is slowest, __constant__ is faster, and __shared__ is supposed to be almost as fast as registers. I'm just taking it for granted that most operations (individually) are going to be as fast as or faster than on the CPU. If your gfx card supports it you can configure __shared__ mem access as 8 bytes instead of the default 4 bytes, but this too raises some timing issues when mixing access to floats and doubles.

I'm very new to this GPU stuff and really just tinkering around; I'm sure I'll be making lots of mistakes.

The mandelbrot function in this was originally written on a 16 MHz Motorola 68000 CPU and optimized to compare with assembly language; I'm sure that it's nowhere near as fast as more recent methods.
|
3dickulus
|
|
« Reply #20 on: February 04, 2014, 10:13:57 PM » |
|
Quote:
/usr/local/cuda/include/host_config.h:82:2: error: #error -- unsupported GNU version! gcc 4.5 and up are not supported!
I still need to do some software upgrade anyway

gcc 4.7 works with CUDA 5.5.

FPGA looks really interesting. Soon hard drives will be obsolete?
|
3dickulus
|
|
« Reply #21 on: February 11, 2014, 12:05:06 AM » |
|
Quote:
should be even faster, to avoid Read-after-write latency. And since it's a fused multiply-add : n = array[fma(y,width, x)] no ? (sorry, can't try it right now)

Finally found some time to test this. When I try
n = array[fma(y,width, x)]
I get
error: expression must have integral or enum type

using...
index=y*width; n = array[index+x]
Benchmark: Number of frames: 1000 Avg msec per frame: 2.307000

using...
index=fma(y,width,x); n = array[index]
Benchmark: Number of frames: 1000 Avg msec per frame: 2.315000

Based on this result it seems that in this particular case fma(x,y,z) is not faster than x*y ... +z. fma took 0.008 msec longer per frame (about 0.008 seconds over the 1000 frames), really small but still...
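A likely explanation for the "expression must have integral or enum type" error is that fma is a floating-point function, so its result cannot be used directly as an array subscript. The sketch below shows the same thing on the host with standard C++ `std::fma`; this is an assumption about the cause, not something checked against the CUDA build, and `index_int` / `index_fma` are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>
#include <type_traits>

// With integer arguments std::fma promotes them to double and returns double,
// so the result needs a cast before it can be used as an index.
static_assert(std::is_same<decltype(std::fma(1, 2, 3)), double>::value,
              "fma on integer arguments yields double, not an integer type");

// Plain integer arithmetic: the index never leaves the integer domain.
inline int index_int(int x, int y, int width) {
    return y * width + x;
}

// fma route: the multiply-add is done in double, then converted back to int.
inline int index_fma(int x, int y, int width) {
    return static_cast<int>(std::fma(y, width, x));
}
```

The extra int to double to int conversions may be part of why the fma variant did not come out ahead in the benchmark, though that is speculation without looking at the generated code.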
|
3dickulus
|
|
« Reply #22 on: February 13, 2014, 09:51:43 PM » |
|
using GPU ARPREC vs CPU ARPREC Precision = 1000 digits Iterations = 50000 CenterX=-1.6285250343823361883838262548545045410348661236584141601769446480597742613978652866884534147015438695866330324908358967896639954103891124710858983400930548513556561043206532203421245449026144184088972414350968083884667444419967905346296122369829125808715959589508004033683543856979653688355703357362596797912790215729307519005701670693821446556437233617104789888651797464159254445779960831204398799963531144151826157979487498653952347762098181463615781771842676200742084357646694586924130378944176232745311158540371210889162520250043231951 CenterY=0.0006786726672986807036534783258711032797071726013030485649555004108079612273532854099054877878854548091617523457504272839323659938218011891095623421488662732771512925574351926524067445041101561085174335116653613367712657068886759881133817710332110609908168838538010271006938137369081253790287372868878042144253940125139876166256963471716935289610917988374315494511164693787453719396359779881539576510119495792820935158658401193999362816789422637990442661756991633307202607735211837364599264587661238085446201036331218816037486284880927809 -Z=(distance from center to edge of rendered area) = 2.420740E-530 CPU = 1310.34 sec GPU = 213.925 sec No optimizations, estimations or perturbation, just calculate every pixel until we get a value using standard M set calculation. Xn+1 = (Xn * Xn) + X0
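For reference, the "standard M set calculation" mentioned above can be sketched in plain C++ with doubles. This is only an illustration of the iteration z(n+1) = z(n)^2 + c (the post writes the constant as X0); it does not use ARPREC, and the function name, parameter names and default bailout value are assumptions, not taken from the project.

```cpp
#include <cassert>

// Escape-time iteration for z(n+1) = z(n)^2 + c, starting from z(0) = 0.
// Returns the number of iterations completed before |z|^2 exceeded
// `divergence`, or max_iter if the point never escaped.
int mandel_iterations(double cx, double cy, int max_iter, double divergence = 4.0) {
    assert(max_iter >= 0);
    double zx = 0.0;
    double zy = 0.0;
    for (int i = 0; i < max_iter; ++i) {
        const double zx2 = zx * zx;
        const double zy2 = zy * zy;
        if (zx2 + zy2 > divergence) {
            return i;  // escaped: the point is outside the set
        }
        // (zx + i*zy)^2 + (cx + i*cy), split into real and imaginary parts
        zy = 2.0 * zx * zy + cy;
        zx = zx2 - zy2 + cx;
    }
    return max_iter;  // treated as inside the set
}
```

At the zoom depth quoted above a double runs out of precision long before the view is reached, which is the reason for the arbitrary-precision type in the first place.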
|
3dickulus
|
|
« Reply #23 on: February 20, 2014, 11:54:26 AM » |
|
Something of a milestone for me, using my old code to zoom this deep. So now that I have a gui for low bit (double) zoom render on GPU and garprec for hi precision render on GPU I am going to fiddle around with the SuperFractalThing maths and may try to build a garprec complex data type to use with those formulae.
|
3dickulus
|
|
« Reply #24 on: March 15, 2014, 11:07:46 AM » |
|
Another installment in the YAMZee file....

I've managed to cobble together a working Qt C++ clone of SuperFractalThing.

Recap:
1: Really fast double precision GPU Mandel zoomer
2: Gpu ARbitrary PRECision Mandel zoomer
3: SFT Qt C++ (ported from Java source, uses ARPREC lib)

Now all I have to do is mash them together into a GPU friendly MandelZoomer.

Anyone interested in the SFT C++ code just drop me a note. GCC, CMake, ARPREC, Linux required.

Disclaimer: highly experimental miscellaneous hackings best described as "brutish butchery"
|
3dickulus
|
|
« Reply #25 on: March 19, 2014, 02:50:38 AM » |
|
Well, after a lot of crashes and weird looking fractal rejects it is starting to stabilize: no more lost 25MB chunks, zoom/pan not choking, and it seems to be ok with minor rendering tasks. No big zooms yet, but now I have some code to play with, a C++ version of SFT. My humblest thanks and appreciation to K.I. Martin for the SuperFractalThing.

As always, source code is available. It's a little rough, may crash, and probably won't compile out-of-the-box, but it does work on my machine. Input is always appreciated. I have some ideas about where to apply GPU code but need to get things more stable first, i.e. checks and balances.
|
|
« Last Edit: March 26, 2014, 05:32:01 PM by 3dickulus, Reason: url »
|
3dickulus
|
|
« Reply #26 on: March 23, 2014, 05:17:24 AM » |
|
So... I have the SFT engine ported from Java to C++, shoehorned into a Qt GUI and functional enough to get some deep images. There are still a few blobs and other issues, but it is basically a clone of the Java version. The translation to equivalent Qt GNU C++ was a breeze (still needs fine tuning); the thread spawner was probably the hardest part. I think the 2DIndexBuffer class and GNU threading can be replaced by QImage and QThread stuff.

It currently renders to a 2DIndexBuffer, then converts that into a pixmap that gets set as a GL texture. I'm doing it this way so that moving code to the GPU will be easier, as the GPU has access to texture buffers and this sets up all the GL code I need.

Speed test between the C++ version and the Java version using the attached parameters file @ 1024x768, 250000 iterations, zoom e-1550:

3Dickulus C++ version (ARPREC) (warm up)
1st run 58.88 Sec
2nd run 57.55 Sec

Java version, standalone desktop app downloaded from SFT Sourceforge (BigDecimal) (warm up?)
1st run 767.314 Sec
2nd run 766.096 Sec

I recall someone here saying "the math libs do all the work so the language doesn't really matter, interpreted or compiled". I beg to differ.

Moving this to GPU shouldn't be too hard; some stuff will work while other stuff just won't fly, but in the end a LOT of stuff can be offloaded to the GPU.

Java > C++ = 10x faster (about 13x going by the timings above). I'm hoping the GPU will be 10x faster again.
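The "render to an index buffer, then convert to pixels" step described above might look roughly like the following. This is a sketch under assumptions, not the SFT code: `IndexBuffer` is a hypothetical stand-in for the 2DIndexBuffer class, the palette is a plain vector of packed 0xAARRGGBB values, and no Qt or GL calls are involved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the 2DIndexBuffer: one iteration index per pixel.
struct IndexBuffer {
    int width;
    int height;
    std::vector<int> indices;  // row-major, width * height entries
};

// Convert iteration indices to packed 0xAARRGGBB pixels via a palette lookup.
// The resulting vector could then be handed to whatever builds the pixmap/texture.
std::vector<std::uint32_t> to_pixels(const IndexBuffer& buf,
                                     const std::vector<std::uint32_t>& palette) {
    assert(!palette.empty());
    std::vector<std::uint32_t> pixels;
    pixels.reserve(buf.indices.size());
    for (int idx : buf.indices) {
        // wrap around so every index lands on a valid palette slot
        const std::size_t slot = static_cast<std::size_t>(idx) % palette.size();
        pixels.push_back(palette[slot]);
    }
    return pixels;
}
```

Keeping the index buffer separate from the colour lookup means a palette change only needs the cheap conversion pass, not a re-render.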
|
3dickulus
|
|
« Reply #27 on: March 26, 2014, 05:28:12 PM » |
|
Found some interesting things in the source code, removed some bits, and optimized a few do-nothing tests/loops (with the delete key). I think I've got a stable base to work from, so now the plan is to tweak the GUI a little, add a palette twiddler and a settings dialog, then start on the GPU/GL part.

This is the deepest I've gone, but after spending a couple of hours zooming and panning I think I can say it runs pretty smoothly. Attached settings are in SFT format; latest source code is available.
|
knighty
Fractal Iambus
Posts: 819
|
|
« Reply #28 on: April 10, 2014, 03:44:05 PM » |
|
Thank you for the C++ port of SFT! I haven't been able to compile it yet (mostly because of OpenGL extensions, which are not handled the same way under win32), but after reading some of the code I have a (noobish) question:
- It looks like most of the allocated memory is never freed. For example, in sftgui.cpp, whenever the user selects new, a new QPixmap is allocated on the heap. Does Qt provide a garbage collector?
|
3dickulus
|
|
« Reply #29 on: April 10, 2014, 04:59:47 PM » |
|
Qt handles cleanup very well. It isn't a garbage collector in the Java sense, but anything allocated with new (not malloc) as a child of a QObject is freed when the parent object is destroyed, and if needed you can add cleanup code very easily.

EDIT: if you reuse a pixmap (or any QObject afaik) the old one is destroyed first.

On my system the memory consumption looks like this...

before running...
KiB Mem: 6123220 total, 2733684 used, 3389536 free, 313704 buffers
while running...
KiB Mem: 6123220 total, 2792516 used, 3330704 free, 313780 buffers
after running...
KiB Mem: 6123220 total, 2718576 used, 3404644 free, 313864 buffers

There is a bit more free memory after a run than before (about 15 MB), probably due to tossing out some Firefox caches or something.

I have made some changes since posting the code: separated Engine from GUI, added fractional iteration count, mapped iteration count to wavelengths, i.e. 380-780 nm (just for fun); the color map has 1024 places.

I have been fiddling with the code a lot because I want to make sure that it's as stable as possible before trying to move it to the GPU. I recently had this error when increasing zoom past E-2023:

------------------------------
*** MPROUN: Exponent overflow.
*** mpabrt: execution terminated, error code =69
Segmentation fault
------------------------------

I'm in the process of tracking that down. Please be aware that this port is only a hack'n'chop job to get the engine running. The GL stuff is not required, but the intent is to have the GPU writing the texture buffer; it can just as easily map the pixmap as widget contents directly. A bonus from using QGLWidget is that it exploits hardware multisampling.
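One way the 1024-place color map could be tied to the 380-780 range is a simple linear map from slot number to value. The sketch below assumes the range is the visible-light band in nanometres and that the mapping is linear; both are assumptions, and `slot_to_wavelength_nm` is a hypothetical name rather than a function from the posted source.

```cpp
#include <cassert>

// Linearly map a color-map slot (0 .. slots-1) onto the 380-780 nm band.
// Slot 0 gives 380 nm and the last slot gives 780 nm.
double slot_to_wavelength_nm(int slot, int slots = 1024) {
    assert(slots > 1);
    assert(slot >= 0 && slot < slots);
    const double low_nm = 380.0;
    const double span_nm = 400.0;  // 780 - 380
    return low_nm + span_nm * static_cast<double>(slot) / static_cast<double>(slots - 1);
}
```

Turning a wavelength into an RGB value is a separate step and is not covered here.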
|
|
« Last Edit: April 10, 2014, 05:15:41 PM by 3dickulus, Reason: verbosity »