lycium
« Reply #15 on: May 05, 2008, 12:43:28 PM »
it's SIMD, and i'd strongly suggest tackling altivec in your situation, since you have access to an older mac; that experience will let you dabble with cell spu programming (via a ps3), which i think will be a kind of nirvana for you in terms of mandelbrot flops.

yeah, it's a good idea on modern processors to use masks instead of small-scale branching, because processors are deeply pipelined to reach high levels of speed (in ghz) and efficiency (in terms of balancing compute vs bandwidth resources); this applies even more so to gpus, whereas the cell broadband engine is a nice agile mix of both.
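to make the mask idea concrete, here's a minimal C sketch - not altivec intrinsics, just scalar code modelling what a SIMD group does. the names (mandel_lockstep, LANES) are mine, not from any library: all lanes run the mandelbrot iteration in lockstep, and a per-lane mask retires lanes that have bailed out instead of branching per pixel.

```c
/* run LANES mandelbrot pixels in lockstep, the way one simd group does:
   no per-lane branching, just a mask that retires finished lanes. */
enum { LANES = 4 };

void mandel_lockstep(const double *cr, const double *ci,
                     int maxiter, int *iters)
{
    double zr[LANES] = {0}, zi[LANES] = {0};
    int active[LANES];
    for (int l = 0; l < LANES; l++) { active[l] = 1; iters[l] = maxiter; }

    for (int n = 1; n <= maxiter; n++) {
        int any_active = 0;
        for (int l = 0; l < LANES; l++) {
            if (!active[l])            /* masked-out lane: real hardware would */
                continue;              /* still issue the ops, discarding them */
            double r = zr[l]*zr[l] - zi[l]*zi[l] + cr[l];
            double i = 2.0*zr[l]*zi[l] + ci[l];
            zr[l] = r; zi[l] = i;
            if (r*r + i*i > 4.0) {     /* |z| > 2: this lane bails out, but   */
                active[l] = 0;         /* the group keeps going until all do  */
                iters[l] = n;
            } else {
                any_active = 1;
            }
        }
        if (!any_active) break;        /* whole group finished early */
    }
}
```

the cost of a group is the slowest lane in it, which is exactly why packing nearby pixels into one group works so well for fractals.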
keldor314
Guest
« Reply #16 on: June 10, 2008, 10:47:41 PM »
Actually, concerning the Geforce 8x00, SIMD isn't completely correct either - the 8800GTX, for instance, has 16 fully separate multiprocessors, each with 16 ALUs which are truly SIMD, for a total of 128 ALUs. What this means is that within each multiprocessor, the 16 SIMD processors run in lockstep with each other, BUT each of the 16 multiprocessors can branch separately. Think of it like a 16 core processor, where each core operates on a 16-wide group of registers simultaneously.

In practice, this will give you perhaps 90% efficiency on fractal rendering, because the 16 units of a given "core" most likely operate on a 4x4 grid of pixels, meaning that the vast majority of the time, the bailout of one of these pixels is not much sooner than any of the others. The unit must iterate until all 16 pixels have bailed out (or MAX_ITERATIONS is reached), but since all 16 pixels are in a pretty tiny area (about the size of this comma , assuming you're using a small font) they will mostly end up finishing at about the same time. Branches are handled by masking out the computations of the units that take the branch, so that they effectively do nothing until all units are "finished" and ready to join in beyond the branch. This means that for an iteration loop, for instance, the units that have hit the bailout will "sit around" and do nothing until all units (in the 4x4 grid, remember) have bailed out. At this point, the multiprocessor can proceed to the next line of code. Note that the multiprocessor doesn't necessarily have to do the whole MAX_ITERATIONS, just enough that all the processors have reached bailout.

It's interesting to contrast this to the Radeon 2800 (I THINK - I'm not up as well on their specific GPU numbers...), which has 4 separate multiprocessors, each with 16 SIMD processors, each of THOSE in turn having 5 ALUs, for a collective total of 320 ALUs. The thing to note here (and why the 320 ALUs do not trump the 128 of the GeForce) is that the groups of 5 ALUs are bound together on the same pixel, rather than split among multiple pixels. This leaves it to the programmer to somehow spread the arithmetic for a single pixel out among the 5 ALUs. (Actually, it's a bit more complicated than that, since the 5 ALUs are not exactly SIMD; rather, they are 4+1 SIMD, with 4 standard ALUs in SIMD and one special super-ALU which kinda supplements the others - e.g. the 5th unit is the only one that has hardware accelerated transcendental operations (sin, cos, e^x, etc.), and you might have an opcode that tells the 4 units to multiply, then add the 5th unit's result to each of theirs.)

ANYWAY... before I make your headaches from reading this any worse (it's HARD to explain this sort of thing without going even further into the jargon than I already did...), the ultimate point is GPUZ DO L33T FRACTALZ!!11! XD For example, I have a screensaver I wrote which can render some pretty hairy fractals (e^(sin(a*z)+cos(b*z)) for instance) at a smooth framerate (30-60fps on average) with 50 or so iterations at a whopping screen resolution of 2048x1536! What do you do with fractals rendered in 1/40th of a second? Realtime animation, of course! Watch the fractal change as the values of a and b change...

The downside to all this (and it's a big one) is that the GPU only does native single precision floating point - no zooming (at least not much) for you! Still, you can of course write code in the shader to emulate higher precision the same way some fractal programs do deep zoom - it will just run a good deal slower, although still faster than the CPU...

By the way (insert shameless plug), if you want to see that screensaver, you can download it yourself from https://sourceforge.net/projects/gpujuliascreens/. Note that it uses DirectX 10 and will not run on anything less, so don't bother if you don't have Windows Vista and a new graphics card (Geforce 8x00+, Radeon 2*00+).
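the "emulate higher precision" idea is usually done with double-single ("df64"-style) arithmetic: a value is carried as an unevaluated sum of two floats, roughly doubling the effective precision. here's a C sketch of the addition step only (the names dsfloat/ds_add are mine; a full emulation also needs multiply and divide built the same way, and the shader version looks the same with float2):

```c
/* double-single arithmetic: represent x as hi + lo, two floats whose sum
   carries roughly twice the precision of a single float.  the error term
   below is knuth's two-sum; it requires strict IEEE float evaluation
   (no fast-math / contraction). */
typedef struct { float hi, lo; } dsfloat;

dsfloat ds_add(dsfloat a, dsfloat b)
{
    float s = a.hi + b.hi;                    /* rounded sum             */
    float v = s - a.hi;
    float e = (a.hi - (s - v)) + (b.hi - v);  /* exact rounding error    */
    e += a.lo + b.lo;                         /* fold in the low parts   */
    dsfloat r;
    r.hi = s + e;                             /* renormalise so that     */
    r.lo = e - (r.hi - s);                    /* |lo| << |hi|            */
    return r;
}
```

this is the usual trade keldor describes: maybe 10-20x slower per operation than native float, but still far ahead of falling back to the CPU.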
cKleinhuis
« Reply #17 on: June 10, 2008, 10:58:49 PM »
hey there, great to see another gpu fanatic! it is true, single precision floating point is far too little to achieve deep zooms, but this way you stay realtime. do you have experience with the extended floating point unit on nvidia cards? i heard they can do 64 bit floating point precision - or something, but on the gpu ...
---
divide and conquer - iterate and rule - chaos is No random!
cKleinhuis
« Reply #18 on: June 10, 2008, 11:06:17 PM »
i see you are using directx for rendering ... why? i do hope you are not using dx10 ....

but be sure to read about my alternating mandelbrot method; this way you can easily combine each of your fractal formulas together with many many parameters. i have used this method in my fractalmovies.com screensaver. in fact the method yields fantastic new fractals without changing the calculation complexity. it is simply done by using more than one seed in the iteration. e.g. mandelbrot is z^2+c, so if you alternate between c1 and c2 throughout the iteration, it yields smooth animation. in fact you can use any number of seeds in one iteration (clearly no more than the iteration depth), and beside that, it is easy to just switch the whole formula in each step - even this yields smoothly animatable results.

please check out my screensaver demonstrating this method with a plain mandelbrot formula: http://fractalmovies.com/index.php?option=com_remository&Itemid=21&func=fileinfo&id=19
i have also written a tiny explanation about it: http://fractalmovies.com/index.php?option=com_content&task=view&id=31&Itemid=34
please take your time to think about it, i think it would be perfect in a gpu accelerated screensaver running on pixel shader 3.0 hardware

cheers ck
---
divide and conquer - iterate and rule - chaos is No random!
lycium
« Reply #19 on: June 10, 2008, 11:30:26 PM »
hmm, there's a few extremes going on in this topic. on the one hand, i wrote it quite a while ago to explain to a then-uninterested audience (e.g. charles wehner, who is probably STILL hacking around with x86 and x87 asm); now there are 3 of us who all code 3d graphics professionally, explaining to each other things we already know (the simd comment was directed at duncan c).
keldor314
Guest
« Reply #20 on: June 11, 2008, 12:09:21 AM »
LOL, maybe you didn't provide enough shiny pictures? XD
Anyway, thinking about that multiple-parameters idea: alternating between c1 and c2 would produce a 4th order julia set, in that every 2 iterations would be z = (z^2+c1)^2+c2, which is z = z^4 + 2*c1*z^2 + c1*c1 + c2, right?
The problem would be that the image produced by iteration 39 would be quite different from that from iteration 40, though a lot like iteration 41, so your users might be confused when you tweak the iteration value and suddenly end up with a much different fractal!
In any case, it does indeed use dx10, partly because part of the reason I started it was to learn dx10 (it's quite a bit different from dx 9), and partly because one of the features of dx10 is the newer hlsl compiler, which supports dynamic branching. The dx9 hlsl compiler only supported static branching - e.g. SM 2. In order to take advantage of dynamic branching, you had to code the shaders in assembler. Ick! After digging through the documentation, I just found that the dx10 hlsl compiler is now the one used for dx9 as well, so it will be happy targeting SM 3 at least.
I guess this means it won't be too horrific to backport the thing to dx9, since it's not actually using any dx10 exclusive features. Still, given that setting up the DX api is about as fun as gouging one's eyes out with a dinner fork...
lycium
« Reply #21 on: June 11, 2008, 12:17:05 AM »
LOL, maybe you didn't provide enough shiny pictures? XD

well i haven't done much fractal computing with a gpu, still prefer my 64bit x86 processor cores.

In any case, it does indeed use dx10, partly because part of the reason I started it was to learn dx10 (it's quite a bit different from dx 9), and partly because one of the features of dx10 is the newer hlsl compiler, which supports dynamic branching. The dx9 hlsl compiler only supported static branching - e.g. SM 2.

dx9 definitely supports dynamic branching, in hlsl.

I guess this means it won't be too horrific to backport the thing to dx9, since it's not actually using any dx10 exclusive features. Still, given that setting up the DX api is about as fun as gouging one's eyes out with a dinner fork...

even for crysis, dx9 is enough. there's a funny fiasco about the dx9/dx10/vista-only issues with that game, where actually you can just enable "dx10 only" features in dx9 via a registry key! sure, there are things that really only dx10 can do and dx9 can't, but if you're worried about computational capabilities, skip dx altogether and go straight for cuda.
keldor314
Guest
« Reply #22 on: June 11, 2008, 12:36:55 AM »
sure, there are things that really only dx10 can do and dx9 can't, but if you're worried about computational capabilities, skip dx altogether and go straight for cuda

Problem with that is that then, instead of being Vista only, it becomes Nvidia only. Besides, CUDA took FOREVER to be released for Vista. As in, the first public Vista beta was only a month or two ago >.< Hence, CUDA will have to wait for my next project.
cKleinhuis
« Reply #23 on: June 11, 2008, 12:36:04 PM »
Anyway, thinking about that multiple parameters alternating between c1 and c2 would produce a 4th order julia set, in that every 2 iterations would be z = (z^2+c1)^2+c2 which is z = z^4+2*c1*z^2+c1*c1+c2, right?
yes, you are absolutely right, this is the outcome - and you see that the repetition appears every second iteration, so you have decent control over the evaluation.

The problem would be that the image produced by iteration 39 would be quite different from that from iteration 40, though a lot like iteration 41, so your users might be confused when you tweak the iteration value, and suddenly end up with a much different fractal!

this is not true !!! consider the fractal calculated up to iteration 39: certain areas of the picture have already bailed out. this means that the next iteration can only create new pixels in the areas that have not yet bailed out. now the distance between c1 and c2 comes into play: if they are nearly the same, you would not see any difference compared to the c1-only fractal. if c1 and c2 are far away from each other, the image simply bails out earlier and at some unexpected positions, but iteration depth alone does not lead to unwanted results, believe me. the only thing which really changes the outcome is adding more seeds. please give it a try: replace your seed variable with a seed[iteration%numSeeds] statement, and animate those seeds with some nice sine curves. have fun c.k
---
divide and conquer - iterate and rule - chaos is No random!
David Makin
« Reply #24 on: June 11, 2008, 08:52:46 PM »
Hi, I'd love to have any one of you guys utilising the GPUs see if you can come up with a realtime GPU version of my escape-time 3D IFS algorithm (based on Hart's method). It should be possible to get it done just about realtime - for instance, the CPU/FPU render version of the 3D Menger Sponge takes around 13 seconds at 640*480 on my P4HT, and the algorithm as written allows for any affine transforms. Obviously there's a lot of potential for optimisation if the GPU version is restricted to a particular fractal such as the Sponge.

If anyone's interested, the algorithm is the 3D IFS formula in mmf4.ufm from the Ultrafractal Formula database - http://formulas.ultrafractal.com/ - go to browse collection and "Makin, Dave".

bye Dave
lycium
« Reply #25 on: June 13, 2008, 08:50:19 AM »
i've tried to grok that src a few times, always finding:
1. the source is HUGE!
2. the variable names aren't very descriptive
3. lack of vector functions etc (everything is super inline) means the code is quite opaque, and just dropping it into cg or hlsl/glsl wouldn't work.
4. no idea where the meat of the algorithm is, because there's so much uf-specific stuff around it...

could you post the bit that would actually get ported here, please?
David Makin
« Reply #26 on: June 13, 2008, 12:38:24 PM »
i've tried to grok that src a few times, always finding: 1. the source is HUGE! 2. the variable names aren't very descriptive 3. lack of vector functions etc (everything is super inline) means the code is quite opaque and just dropping it into cg or hlsl/glsl wouldn't work. 4. no idea where the meat of the algorithm is, because there's so much uf-specific stuff around it... could you post the bit that would actually get ported here, please?

1. I know, there's quite a bit that's superfluous to a GPU implementation (such as calculating the fractal dimension of the object).
2. That's just me, I'm a lazy typist (1 hand only).
3. In UF4 you can only do it "inline".
4. I'll post something here tonight - with comments (am at work now)

bye Dave
David Makin
« Reply #27 on: June 14, 2008, 02:06:40 AM »
Hi Thomas and all,
Here's the meat of the routine. I should point out that the transforms in the arrays here are the reverse transforms, i.e. the expanding ones, though I haven't changed the sign of the translations from the original, and the scale values are the original contracting values. You should also note that the order you put the transforms in will greatly affect the render speed, especially for fractals without rotation/skew such as the Menger - it's best to have a pre-test routine to find which transforms have areas closest to the viewpoint, and order the transforms so the closest are done first.
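the first test in the routine below - rejecting rays whose nearest approach to the bounding-sphere centre exceeds the bailout distance - can be sketched in C like this (the function name is mine, not from the original formula; as I read the test, gdist is a squared radius, and this version assumes a normalised ray direction):

```c
/* ray-vs-bounding-sphere rejection test, as in the first 'if' of the
   routine: if the squared distance of nearest approach from the ray to
   the sphere centre exceeds gdist (squared bailout radius), the ray
   misses the fractal entirely.  assumes (dx,dy,dz) is unit length. */
int ray_hits_bound(double tx, double ty, double tz,      /* ray base point */
                   double dx, double dy, double dz,      /* unit direction */
                   double cx, double cy, double cz,      /* sphere centre  */
                   double gdist)                         /* radius squared */
{
    double x1 = cx - tx, y1 = cy - ty, z1 = cz - tz;
    double along = x1*dx + y1*dy + z1*dz;    /* projection onto the ray */
    return x1*x1 + y1*y1 + z1*z1 - along*along <= gdist;
}
```

the same test is reapplied at every node of the transform tree after mapping the ray through a transform, which is what prunes whole subtrees of the IFS per ray.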
x1 = gtestx - tx ; gtestx, gtesty and gtestz are the centre of the
y1 = gtesty - ty ; bailout test sphere
z1 = gtestz - tz ; tx, ty and tz are the base point on the viewing ray
found = 1e200 ; "found" is a distance, large value means "not found"
nomore = true ; flag is initially true to indicate no extra rays to traverse
if x1*x1 + y1*y1 + z1*z1 - (x1*dx + y1*dy + z1*dz)^2 <= gdist
  ; tests if distance of nearest approach of ray to centre of test sphere is less
  ; than the bailout distance (gdist), if not then we've missed the fractal
  xv[0] = t5 = tx ; set temp variables and ray position for depth 0
  yv[0] = t6 = ty
  zv[0] = t7 = tz
  dxv[0] = dx ; dx, dy and dz are the viewing ray's direction vector
  dyv[0] = dy ; here we set the direction vectors at depth 0
  dzv[0] = dz
  k = gft ; k = the first transform, gft
  s[0] = 1.0 ; scale at depth 0 = 1.0
  t12 = 1e200 ; not really necessary (see t12 below)
  j = 0 ; start at depth 0 in the transform tree
  repeat
    ss = s[j]*gt[(i[j]=k),12] ; scale = oldscale*scale for this transform
    ; In the above note the i[j]=k i.e. set transform at depth j to k
    ; also note that the scale values are the original contracting values
    ; now transform the base point on the ray, storing in "current" values,
    ; temp variables and variables for the next depth - note the (j=j+1)
    xv[(j=j+1)] = t5 = (x1=t5-gt[k,9])*gt[k,0] + (y1=t6-gt[k,10])*gt[k,3] \
                       + (z1=t7-gt[k,11])*gt[k,6]
    yv[j] = t6 = x1*gt[k,1] + y1*gt[k,4] + z1*gt[k,7]
    zv[j] = t7 = x1*gt[k,2] + y1*gt[k,5] + z1*gt[k,8]
    ; transform the ray's direction vectors and store in current and for depth
    dxv[j] = dx = (x1=dx)*gt[k,0] + (y1=dy)*gt[k,3] + dz*gt[k,6]
    dyv[j] = dy = x1*gt[k,1] + y1*gt[k,4] + dz*gt[k,7]
    dzv[j] = dz = x1*gt[k,2] + y1*gt[k,5] + dz*gt[k,8]
    x1 = gtestx - t5 ; distances from bailout sphere centre
    y1 = gtesty - t6
    z1 = gtestz - t7
    f = false ; a flag to make things easier (see below)
    s[j] = ss ; set scale for this depth
    ; Now test if we're within the bailout distance
    if (t9 = gdist - x1*x1 - y1*y1 - z1*z1 \
        + (x2=1.0/(dx*dx+dy*dy+dz*dz))*(t11=x1*dx+y1*dy+z1*dz)^2)>=0.0
      ; if we are, compute the distances along the ray to the intersections
      ; with the sphere (in original unscaled terms)
      t12 = (t11=x2*t11) - (t9=sqrt(x2*t9))
      ; if the nearest distance is <= the distance to the back clip (end)
      ; and the farthest is >= the front clip
      if t12<=end && t11+t9>=t4
        ; if we've reached the max depth (@igen) or the depth is >= the
        ; minimum depth (@imin) and we've gone below the min scale
        if j>=@igen || (j>=@imin && ss<minscale)
          ; You can safely ignore the "if @flat" bit and use the "else" only
          if @flat
            if t11<found
              found = end = t11
              if (@colouring==0 || @colouring>3) && @colmeth==0
                nomore = false
              endif
              if (@colouring>0 || @colmeth==1) && ii==0
                jj = l = j
                while (l=l-1)>=0
                  is[l] = i[l]
                endwhile
              endif
            endif
          else
            if t12<found ; if the nearest intersection distance is < the best found
              found = end = t12 ; set found AND the back clip to the new distance
              ; if using lighting set nomore to false so we calculate the
              ; adjacent rays for getting the normals
              if (@colouring==0 || @colouring>3) && @colmeth==0
                nomore = false
              endif
              ; store the transform numbers from each depth for certain colourings
              if (@colouring>0 || @colmeth==1) && ii==0
                jj = l = j
                while (l=l-1)>=0
                  is[l] = i[l]
                endwhile
              endif
            endif
          endif
        ; else (within clips but not found a collision) get first transform at next depth
        elseif (k=gn[i[j-1],gntm1])>=0
          ; check depth flag for next transform k at depth j
          if !(f=df[k,j]) ; if k not allowed
            i[j] = k ; adjust values for
            j = j + 1 ; routine to get the next valid one (below)
          endif
        endif
      endif
    endif
    if !f ; if not going to new depth
      k = i[(j=j-1)] ; go to previous depth, get transform from previous depth
      repeat
        if (k=k-1)<0 ; next transform from previous depth, OK ?
          if (j=j-1)<0 ; no, so go up another depth; if done
            f = true ; set flag to exit the repeat
          else
            k = i[j] ; transform from previous
          endif
        elseif j>0 ; if not on first depth
          if (k=gn[i[j-1],k])>=0 ; transform k valid ?
            f = df[k,j] ; yes, flag if k valid at depth j
          else
            k = i[(j=j-1)] ; no, go back up
          endif
        else
          f = df[k,0] ; f = k valid at depth 0
        endif
      until f
      if (j>=0) ; depth >= 0
        t5 = xv[j] ; retrieve ray base point and direction vectors
        t6 = yv[j] ; for the new depth
        t7 = zv[j]
        dx = dxv[j]
        dy = dyv[j]
        dz = dzv[j]
      endif
    endif
  until j<0
endif
; If found a point on the fractal store values accordingly
if (sc[ii]=found)<1e100
  if ii==0
    vv = found
    px[0] = xx + found*xc
    py[0] = yy + found*yc
    pz[0] = zz + found*zc
  elseif abs(vv-found)<onepix
    px[ii] = xx + found*xc
    py[ii] = yy + found*yc
    pz[ii] = zz + found*zc
  else
    sc[ii] = 1e200
  endif
endif
ii = ii + 1
until (ii==1 && nomore) || ii>=5
« Last Edit: June 14, 2008, 02:21:46 AM by David Makin »
David Makin
« Reply #28 on: June 14, 2008, 01:41:54 PM »
Hi again,
I should have mentioned that I left out the top of the ii loop; all that does is set up the viewing ray base position, the viewing ray direction vectors, the initial back-clip (end) and the front clip (t4). Also (fairly important) the initial direction vector of the viewing ray (dx,dy,dz) is normalised.
bye Dave
« Last Edit: June 14, 2008, 02:23:12 PM by David Makin »
lycium
« Reply #29 on: June 14, 2008, 02:55:06 PM »
Also (fairly important) the initial direction vector of the viewing ray (dx,dy,dz) is normalised.

of course, but it seems you forgot about that:

; Now test if we're within the bailout distance
if (t9 = gdist - x1*x1 - y1*y1 - z1*z1 \
    + (x2=1.0/(dx*dx+dy*dy+dz*dz))*(t11=x1*dx+y1*dy+z1*dz)^2)>=0.0

thanks for the code! it doesn't look very gpu-friendly though, with all that branching... can you explain it in pseudocode perhaps? to make sure i'm getting this right: you start out with some linear maps M_k which would be used for the normal "chaos game" (stochastic) ifs, from those you get their inverses M^{-1}_k, and then... well, i read up on hart's method back when and would like to brush up on it again, but exactly how does your implementation differ? thanks, and sorry for all the questions!