Title: New version of FractalWorks w/ full multiprocessor support Post by: Duncan C on March 17, 2007, 12:42:53 AM I just posted a new demo version of my FractalWorks fractal generating program. It runs native on PowerPC and Intel Macs running Mac OS X 10.4 (or later).
This version now fully supports multiprocessor machines. A dual processor machine renders about twice as fast as a single processor machine at the same clock speed. This version also takes into account the symmetry of Mandelbrot and Julia sets, and only renders the unique parts of each plot. Performance on Intel Macs is amazing. I clocked it on my wife's Intel MacBook at over 211 million iterations/second. You can download a demo version at http://homepage.mac.com/dmchampney1/FileSharing1.html. This version expires 4/31/07. There should be a new version available by then, however. There is a readme file in the archive that includes very rough directions that should be enough to get you started. Better documentation will be available at some future date. Duncan C Title: Comments/feedback? Post by: Duncan C on March 18, 2007, 05:51:45 PM This is a work in progress, so I'm very interested in comments, feedback, and suggestions.
Duncan C Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: lycium on March 18, 2007, 06:54:32 PM since you ask, i'll bite. by now there are an abundance of programs to generate julia and mandelbrot sets, why make another and then have it expire? next up, the speed is completely irrelevant (though still technically impressive to us programmers) since even an old processor can generate unviewably huge julia/mandelbrot images in seconds, using a completely naive, 10-seconds-to-program algorithm.
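to be concrete, by "naive algorithm" i mean the entire renderer is basically the following (a quick c++ sketch typed from memory, untested; the window, resolution and palette are made up):

    #include <complex>
    #include <cstdio>

    int main() {
        const int w = 800, h = 600, maxiter = 256;
        static unsigned char img[h][w];  // one greyscale byte per pixel
        for (int py = 0; py < h; ++py)
            for (int px = 0; px < w; ++px) {
                // map the pixel into roughly [-2.5, 1] x [-1.3, 1.3] on the complex plane
                std::complex<double> c(-2.5 + 3.5 * px / w, -1.3 + 2.6 * py / h);
                std::complex<double> z = 0.0;
                int i = 0;
                while (i < maxiter && std::norm(z) < 4.0) { z = z * z + c; ++i; } // norm = |z|^2, no sqrt
                img[py][px] = (unsigned char)(255 * i / maxiter);
            }
        FILE *f = std::fopen("mandel.pgm", "wb");  // dump a pgm so there's something to look at
        std::fprintf(f, "P5\n%d %d\n255\n", w, h);
        std::fwrite(img, 1, sizeof img, f);
        std::fclose(f);
    }

that's the whole program, and it finishes more or less instantly on anything modern.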
in short, what are your goals with this program? Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: Duncan C on March 19, 2007, 03:35:21 AM since you ask, i'll bite. by now there are an abundance of programs to generate julia and mandelbrot sets, why make another and then have it expire? next up, the speed is completely irrelevant (though still technically impressive to us programmers) since even an old processor can generate unviewably huge julia/mandelbrot images in seconds, using a completely naive, 10-seconds-to-program algorithm. in short, what are your goals with this program? My goals are: 1. Eye candy. While it's true that there are lots of programs to generate Mandelbrot and Julia sets, I don't find many of them visually appealing. I last worked with this some 20 years ago, and found that getting aesthetically pleasing plots was tricky. I'm focusing on a flexible approach to color mapping in order to create appealing plots. I am also planning to create animations of various sorts ("movies"). Still images are just the beginning. I think there's plenty of room for unique approaches to the problem. As far as performance, you say "...even an old processor can generate unviewably huge julia/mandelbrot images in seconds, using a completely naive, 10-seconds-to-program algorithm." That's news to me. Modern processors are much faster than the machines available back when I first tackled this, but a high magnification plot at a very high max iterations still makes even a high end computer sweat. My program supports up to 64k iterations, and allows you to specify a unique color for every iteration in a plot. Is there some magic to rendering these plots that I'm not aware of? As far as I can tell, doing a plot involving hundreds of millions, or even billions, of iterations at double precision still takes enough time that fast algorithms matter. I've created high resolution plots with 40 million pixels, and those take a good long time to render, even on a fast machine. (At an average of just 1,000 iterations per pixel, a 40-million-pixel plot is 4 x 10^10 iterations; even at 200 million iterations/second that's over three minutes.) 2. I'm also pursuing this as a learning project. I let my Macintosh development skills go stale, and have a lot of catching up to do. There are lots of challenges in this project that I find interesting. Duncan C Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: ericbigas on March 19, 2007, 06:57:26 AM Animation, you say? Bring it on. I'd love something new and fast for parameter-interpolated zoom animations on OSX. Have you used Graham Anderson's Escape or Jesse Jones' Mandella?
Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: lycium on March 19, 2007, 11:11:04 AM I am also planning to create animations of various sorts ("movies"). Still images are just the beginning. I think there's plenty of room for unique approaches to the problem. fully agree with you. right now the only way to do real exploration (that i'm aware of) is to use ultrafractal or write your own app. along those lines, i rendered some julia set animations a while back, and i think the results would be difficult to replicate with any of the apps out there today (they're in my fractographer.com/wip folder if you're interested but i'm not sure if quicktime will play that h.264 stream - they're encoded with x264). the problem is of course that it's all done from c++ code, and is not tweakable/saveable through a nice interface; while i firmly believe the one true way to fractal exploration is to code it yourself, compromises must be made for those who don't walk the programmer's path... the difficulty is maintaining flexibility while still being usable. As far as performance, you say "...even an old processor can generate unviewably huge julia/mandelbrot images in seconds, using a completely naive, 10-seconds-to-program algorithm." That's news to me. Modern processors are much faster than the machines available back when I first tackled this, but a high magnification plot at a very high max iterations still makes even a high end computer sweat. depends what you call sweat. with two 2.2ghz cores (amd x2, so weaker than your intel core 2 duo mac) i can produce this in some seconds: http://www.fractographer.com/propaganda/tlset.png that has 144 supersamples per pixel, i don't remember the maximum iteration depth unfortunately. the iteration itself is done with my generic c++ complex number class, via operator overloading; i also wrote a very fast (double precision) sse-based iteration that would totally scream on your core 2 duo (since it's twice as fast doing sse than the athlon64), especially in 64bit mode... but the end result is that it's just not necessary (and specific to the mandelbrot/julia), and was done only for exercise. My program supports up to 64k iterations, and allows you to specify a unique color for every iteration in a plot. Is there some magic to rendering these plots that I'm not aware of? As far as I can tell, doing a plot involving hundreds of millions, or even billions, of iterations at double precision still takes enough time that fast algorithms matter. a somewhat indirect comparison from my own experiences: when ray tracing you need billions and billions of rays to generate images with global illumination; sampling a 2d brightness function for something simple like a julia is a walk in the park for any cpu since the pentium2, since that can do even realtime ray tracing and basic global illumination. 2. I'm also pursuing this as a learning project. I let my Macintosh development skills go stale, and have a lot of catching up to do. There are lots of challenges in this project that I find interesting. that's a fine reason of course, and people seem to be interested in the animation support, so definitely it's not just for you! the reason i asked is because, i'm sure you're aware, it's really trivial to render mandelbrots and julia sets and we've all been rendering them since at least the early 90s (interesting animations notwithstanding, but then again it is 2007!). in essence, i was hoping to see something really new...
mutatorkammer for example takes a holistic approach to this by trying a huge variety (hmm perhaps too huge, but that's a different point) of random iteration formulae and variable types. Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: ericbigas on March 19, 2007, 12:35:13 PM along those lines, i rendered some julia set animations a while back, and i think the results would be difficult to replicate with any of the apps out there today (they're in my fractographer.com/wip folder if you're interested but i'm not sure if quicktime will play that h.264 stream - they're encoded with x264). Nope. QuickTime won't even open them. Apple made a huge deal out of H.264 being the next big thing but is doing a poor job of supporting it. I'm already tired of having to avoid using b-frames in order to maintain QT compatibility. They play in VLC, of course. I don't know what's going on in the Julia2 anim, but it looks very cool. the problem is of course that it's all done from c++ code, and is not tweakable/saveable through a nice interface; while i firmly believe the one true way to fractal exploration is to code it yourself, compromises must be made for those who don't walk the programmer's path * meekly raises hand * I can code web apps but not C++ or the like. I understand what you mean, though. When you're trying to get a specific task done as well as possible, good custom code smokes general-purpose apps (or libraries). But that's why the rest of us rely on you 1337 programmers. If you share your know-how and tools, we'll plug away with them and create some fairly cool stuff, even if it's not cutting-edge. Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: lycium on March 19, 2007, 02:12:41 PM Nope. QuickTime won't even open them. Apple made a huge deal out of H.264 being the next big thing but is doing a poor job of supporting it. I'm already tired of having to avoid using b-frames in order to maintain QT compatibility. i hear that a lot from apple folks, it's a shame :| They play in VLC, of course. I don't know what's going on in the Julia2 anim, but it looks very cool. thanks, unfortunately it's incomplete though - had to go to airport before the frames finished rendering! I can code web apps but not C++ or the like. I understand what you mean, though. When you're trying to get a specific task done as well as possible, good custom code smokes general-purpose apps (or libraries). But that's why the rest of us rely on you 1337 programmers. If you share your know-how and tools, we'll plug away with them and create some fairly cool stuff, even if it's not cutting-edge. careful who you call l33t, the truly l33t ones are easily offended and i would prefer not to incur their wrath... as for making good tools, i've mentioned elsewhere on the forums that my partner in crime and i will be starting a commercial project soon; that ought to be flexible (read: scriptable) and have really good 3d rendering, so it should be quite a good foothold for exploration. Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: Duncan C on March 20, 2007, 12:28:22 AM I am also planning to create animations of various sorts ("movies"). Still images are just the beginning. I think there's plenty of room for unique approaches to the problem. fully agree with you. right now the only way to do real exploration (that i'm aware of) is to use ultrafractal or write your own app.
along those lines, i rendered some julia set animations a while back, and i think the results would be difficult to replicate with any of the apps out there today (they're in my fractographer.com/wip folder if you're interested but i'm not sure if quicktime will play that h.264 stream - they're encoded with x264). the problem is of course that it's all done from c++ code, and is not tweakable/saveable through a nice interface; while i firmly believe the one true way to fractal exploration is to code it yourself, compromises must be made for those who don't walk the programmer's path... the difficulty is maintaining flexibility while still being usable. Those animations are cool. Thanks for showing them to me. As far as performance, you say "...even an old processor can generate unviewably huge julia/mandelbrot images in seconds, using a completely naive, 10-seconds-to-program algorithm." That's news to me. Modern processors are much faster than the machines available back when I first tackled this, but a high magnification plot at a very high max iterations still makes even a high end computer sweat. depends what you call sweat. with two 2.2ghz cores (amd x2, so weaker than your intel core 2 duo mac) i can produce this in some seconds: http://www.fractographer.com/propaganda/tlset.png that has 144 supersamples per pixel, i don't remember the maximum iteration depth unfortunately. the iteration itself is done with my generic c++ complex number class, via operator overloading; i also wrote a very fast (double precision) sse-based iteration that would totally scream on your core 2 duo (since it's twice as fast doing sse than the athlon64), especially in 64bit mode... but the end result is that it's just not necessary (and specific to the mandelbrot/julia), and was done only for exercise. I've been toying with the idea of writing AltiVec-specific code for G4 based Macs, but that would be a fair amount of work for an older processor. Back 20 years ago, I wrote the floating point code for my program in machine code for the FPU. I'm not really inclined to get that down and dirty any more. As you say, it's not really necessary any more. My program supports up to 64k iterations, and allows you to specify a unique color for every iteration in a plot. Is there some magic to rendering these plots that I'm not aware of? As far as I can tell, doing a plot involving hundreds of millions, or even billions, of iterations at double precision still takes enough time that fast algorithms matter. a somewhat indirect comparison from my own experiences: when ray tracing you need billions and billions of rays to generate images with global illumination; sampling a 2d brightness function for something simple like a julia is a walk in the park for any cpu since the pentium2, since that can do even realtime ray tracing and basic global illumination. 2. I'm also pursuing this as a learning project. I let my Macintosh development skills go stale, and have a lot of catching up to do. There are lots of challenges in this project that I find interesting. that's a fine reason of course, and people seem to be interested in the animation support, so definitely it's not just for you! the reason i asked is because, i'm sure you're aware, it's really trivial to render mandelbrots and julia sets and we've all been rendering them since at least the early 90s (interesting animations notwithstanding, but then again it is 2007!). in essence, i was hoping to see something really new...
mutatorkammer for example takes a holistic approach to this by trying a huge variety (hmm perhaps too huge, but that's a different point) of random iteration formulae and variable types. I'm still pretty early in my development, and nowhere near feature complete. I'm definitely planning to generate movies. Right now, my program will animate changes to the color tables, and that looks quite cool. If you have access to a Mac, it's worth downloading my application and trying the color table animation tutorial I included. You can even share a color definition ("color scheme") between several plots, and changes to the color scheme are applied in real time to all the windows. Duncan C Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: lycium on March 20, 2007, 01:18:57 PM Sounds like ultrafractal is the 700 lb gorilla of the fractal rendering programs. Is it expensive? depends what you call expensive, but they certainly charge enough to tempt my friend aexion and i into joining the market! I'm still limping along with a 1.25 GHz PowerPC G4. I'm several generations behind the Core 2 duo. My wife has a Core duo (not Core 2 duo) MacBook. That's what I did my Intel timing on. the core duo chip is still a bit faster (and certainly more efficient) than the x2 afaik, but the core 2 duo completely smashes it when it comes to most number crunching apps (128bit sse instructions take only a single cycle, versus 2 on core duo and athlon64). amd's upcoming quadcore chip will do two 128bit sse instructions per cycle, but that's another matter... I had never heard of SSE until your post. It sounds very much like AltiVec for the PowerPC G4. I wonder if Apple's compiler generates SSE instructions for their Intel machines? indeed sse (and sse2, sse3, and sse4 on core 2 duo) is a lot like altivec in that they are simd extensions for coherent/streaming execution. as for "apple's" compiler, they now use gcc, which version i'm not sure, but i do know that the latest gcc is much better than microsoft visual c++ in producing sse code, particularly in 64bit mode. that's because the register allocator in gcc is apparently poor compared to intel and microsoft's compilers, and with amd64 you have a lot more registers to exploit. i totally love 64bit computing :) it's incredibly useful for specific things like random number generation, and in general helps to plough through complex code where juggling just a few general purpose registers with the stack introduces a ton of overhead. I've been toying with the idea of writing AltiVec-specific code for G4 based Macs, but that would be a fair amount of work for an older processor. definitely don't bother unless you're feeling nostalgic (see my next point). Back 20 years ago, I wrote the floating point code for my program in machine code for the FPU. I'm not really inclined to get that down and dirty any more. As you say, it's not really necessary any more. i had a long (and excruciating) discussion with another member of the forums about that (in this thread: http://www.fractalforums.com/index.php?topic=132.0). i too was an asm optimisation junkie in my earlier years, but these days cpus are so complex that a compiler will almost always do better - a sentiment that is echoed by very, very knowledgeable low-level programmers. the One True Way to fast code these days (that's a mild contradiction, i know ;)) is: 1. plan your attack, organise your program's execution (and only the critical parts) into wider runs 2. test it with plain c/c++ 3.
write an additional optimised path, using the intel-supplied intrinsics for the instructions you wish to use (you should have a plain c/c++ fallback path anyway) this is the One True Way because: 1. it requires minimal effort from the programmer 2. that super-intelligent compiler knows exactly what you'd like to do, but is free to schedule the instructions optimally according to which processor architecture you're compiling for (p4 is wildly different from core duo and athlon64, for example) 3. with just a recompile you can take advantage of improved compilers and/or the enhanced capabilities of future architectures (for example the extra registers 64bit mode offers) 4. it requires minimal effort from the programmer ;) regarding points 1 and 4, a programming proverb might go "a programmer that is not lazy invariably ends up wasting their time." ;) anyway, that's enough advocacy for now. unfortunately i don't have access to a mac (and as a simple programmer am not nearly trendy enough to use one anyway ;)) so i can't try out your program, but if you paste code we can talk speed for sure :D edit: i just realised that i dropped some rather vague terms like "coherent" and "plan your attack"; to clarify, therein lies the difficulty of doing good simd optimisation: you should have low branching complexity (high coherence), since your instructions will be executed on all the data elements and obviously you don't want to be too wasteful there. a certain degree of redundancy is fine and desirable if it reduces code complexity, since it takes just as long to do 4 adds as it does 1, but the code you want to simd-ify should have "nice" execution properties, and unless they are really nice you have to take a trip back to the 80s and use execution masks and so on to unify your computation. sorry if that's a poor explanation, but i needed to clarify that there are typically complications you have to deal with, and it's not always as simple as i made it out to be above :) Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: Duncan C on March 23, 2007, 02:31:42 AM ...was an asm optimisation junkie in my earlier years, but these days cpus are so complex that a compiler will almost always do better - a sentiment that is echoed by very, very knowledgeable low-level programmers. the One True Way to fast code these days (that's a mild contradiction, i know ;)) is: 1. plan your attack, organise your program's execution (and only the critical parts) into wider runs 2. test it with plain c/c++ 3. write an additional optimised path, using the intel-supplied intrinsics for the instructions you wish to use (you should have a plain c/c++ fallback path anyway) I'd really prefer not writing hardware-specific optimizations if I can help it. I'd like my code to run on PowerPC G4 and G5 Macs as well as the newer intel-based models. I need to look into library support for vector functions, or perhaps complex math functions. My HOPE is that the complex math libraries are written to run optimized code on the target platform. I found some trig functions that operate on complex numbers, but not simple functions like addition and multiplication.
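For illustration, here's the sort of thing I'm hoping a library can hand me (a sketch using the C++ <complex> template; I haven't benchmarked this on any platform, so whether it's actually tuned is an open question):

    #include <complex>
    typedef std::complex<double> cplx;

    // one iteration step written with the library type instead of hand-unrolled doubles
    inline cplx iterate(cplx z, cplx c) { return z * z + c; }   // operator* and operator+ come with the type
    inline bool escaped(cplx z) { return std::norm(z) > 4.0; }  // norm(z) is re^2 + im^2, so no square root

If the library's operator* is tuned for the target processor, I'd get the platform-specific optimization for free.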
(I don't claim to understand how you take the cosine of a complex number) edit: i just realised that i dropped some rather vague terms like "coherent" and "plan your attack"; to clarify, therein lies the difficulty of doing good simd optimisation: you should have low branching complexity (high coherence), since your instructions will be executed on all the data elements and obviously you don't want to be too wasteful there. a certain degree of redundancy is fine and desirable if it reduces code complexity, since it takes just as long to do 4 adds as it does 1, but the code you want to simd-ify should have "nice" execution properties, and unless they are really nice you have to take a trip back to the 80s and use execution masks and so on to unify your computation. sorry if that's a poor explanation, but i needed to clarify that there are typically complications you have to deal with, and it's not always as simple as i made it out to be above :) Right now, the code I want to optimize is the "iterate a point" routine. That's where my program spends the VAST majority of its time, and it's pretty simple. It does the Z(n+1) = Z(n)^2 + c iteration, checks for escape, and does it again. Right now I've unrolled the complex math into floating point operations and written it so I re-use the Z^2 value for both the iteration and checking for escape. If I could rewrite the code to use a complex math library, it might execute much faster. Duncan Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: Nahee_Enterprises on March 23, 2007, 06:30:57 AM Duncan C. wrote:
> > I don't claim to understand how you take the cosine of a complex number I believe you could create a function called "ccos" that would return the cosine of a complex number represented by "x + iy". You would supply it two parameter values, such as ccos(x,y), where ccos(x,y) is defined as cos(x + iy), which is equal to (cos x cosh y - i sin x sinh y). Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: gandreas on March 23, 2007, 03:25:17 PM I'd really prefer not writing hardware-specific optimizations if I can help it. I'd like my code to run on PowerPC G4 and G5 Macs as well as the newer intel-based models. If you want speed on an OS X based machine, you'll need to do two things: 1) Use AltiVec/SSE code where possible (which means you'll have different code for G4/G5 and x86) 2) Use multi-threading. Properly done, these can easily accelerate your code by a factor of 8 or more. Note that it's hard to get really good speed improvements on SSE as easily as you can with AltiVec, due to the limited number of vector registers (until you go to 64-bit x86 code, which you won't be able to reasonably do until Leopard ships), but depending on the image, it's not that hard to get 3x+ due to AltiVec (quadrium often gets 3.5 times the performance with AltiVec enabled, but only about 2x for SSE) Threading is also a must - basically split the image into parts, and have each part calculated by a different thread. It's not all that hard to add in once you've got the basics of threading (and this is a fairly simple threading approach). You pretty much get N times the performance then (where N is the number of cores your machine has). Quote I need to look into library support for vector functions, or perhaps complex math functions. My HOPE is that the complex math libraries are written to run optimized code on the target platform. I found some trig functions that operate on complex numbers, but not simple functions like addition and multiplication. (I don't claim to understand how you take the cosine of a complex number) gcc offers two different complex number packages - one from the C99 (?) complex data type (which offers an intrinsic complex data type that allows the compiler to do things like basic operations), and another that involves the <complex> C++ template based library. Neither are accelerated to support AltiVec/PPC (but due to the way that C++ works, you can extend that one via template specialization, but it's a fair amount of work). There is also the "Accelerate.framework" that Apple provides, which provides a full range of basic AltiVec/SSE based numerics (the usual assortment of transcendental and trigonometric functions), as well as routines to apply simple operations to large arrays of numbers (which are highly tuned to the specific machine, unfortunately these aren't of much use for fractal generation). There is also macstl, a derivative of STL that includes support for AltiVec and SSE based operation, which might be of use to you (though the last time I checked, it was still missing a few features I needed). Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: lycium on March 23, 2007, 04:03:32 PM Properly done, these can easily accelerate your code by a factor of 8 or more.
Note that it's hard to get really good speed improvements on SSE as easily as you can with AltiVec, due to the limited number of vector registers (until you go to 64-bit x86 code, which you won't be able to reasonably do until Leopard ships), but depending on the image, it's not that hard to get 3x+ due to AltiVec (quadrium often gets 3.5 times the performance with AltiVec enabled, but only about 2x for SSE) i got nearly a clean 4x speedup from using sse, and that's on the simple mandelbrot/julia - not something more compute-limited like quadrium; what gives? Threading is also a must - basically split the image into parts, and have each part calculated by a different thread. It's not all that hard to add in once you've got the basics of threading (and this is a fairly simple threading approach). You pretty much get N times the performance then (where N is the number of cores your machine has). that's of course assuming you have an equal workload for all pixels, otherwise you'll have threads twiddling thumbs after a while. to be fair you did mention that uniform subdivision is a simple approach, so this is obviously @ duncan and not at you. gcc offers two different complex number packages - one from the C99 (?) complex data type (which offers an intrinsic complex data type that allows the compiler to do things like basic operations), and another that involves the <complex> C++ template based library. Neither are accelerated to support AltiVec/PPC (but due to the way that C++ works, you can extend that one via template specialization, but it's a fair amount of work). i'm pretty sure core 2 duo has specific instructions for accelerating complex number computations (finally!), so it's reasonable to assume that a recent gcc will spit those instructions if compiled with the right march-flag (microarchitecture). of course that doesn't help our friend duncan, but i'd be interested to know if that's indeed the case. finally, there are ways to speed up general fractal computations without resorting to special cases for various fractal types, but in the end your time will be more rewarded by just biting the bullet and going for the big speedup. put another way, if you ever want to catch up with quadrium, you'll have to get your hands dirty ;) Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: Duncan C on March 23, 2007, 05:21:14 PM I'd really prefer not writing hardware-specific optimizations if I can help it. I'd like my code to run on PowerPC G4 and G5 Macs as well as the newer intel-based models. If you want speed on an OS X based machine, you'll need to do two things: 1) Use AltiVec/SSE code where possible (which means you'll have different code for G4/G5 and x86) 2) Use multi-threading. Properly done, these can easily accelerate your code by a factor of 8 or more. Note that it's hard to get really good speed improvements on SSE as easily as you can with AltiVec, due to the limited number of vector registers (until you go to 64-bit x86 code, which you won't be able to reasonably do until Leopard ships), but depending on the image, it's not that hard to get 3x+ due to AltiVec (quadrium often gets 3.5 times the performance with AltiVec enabled, but only about 2x for SSE) I have also written logic that detects the symmetric parts of Mandelbrot and Julia sets, and only renders the unique parts. It copies and flips the symmetric parts (Mandelbrot sets have line symmetry across the x axis, and Julia sets are point symmetric about 0,0.)
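The copy-and-flip step itself is cheap; the work is in figuring out how much of the view straddles the axis. Roughly, for the Mandelbrot case it amounts to this (a sketch, not my actual code; iters is the per-pixel iteration buffer and midRow is the row the x axis lands on):

    #include <cstring>

    // Point (x, -y) always gets the same iteration count as (x, y), so after
    // computing rows 0..midRow, fill the bottom of the view by mirroring.
    void mirrorAcrossXAxis(unsigned *iters, int width, int height, int midRow) {
        for (int y = midRow + 1; y < height; ++y) {
            int src = 2 * midRow - y;  // reflected row index
            if (src < 0) break;        // view extends further below the axis than above
            std::memcpy(&iters[y * width], &iters[src * width], width * sizeof *iters);
        }
    }

For Julia sets the same idea applies, except the 180-degree rotation about 0,0 means each copied row also has to be reversed left to right.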
Threading is also a must - basically split the image into parts, and have each part calculated by a different thread. It's not all that hard to add in once you've got the basics of threading (and this is a fairly simple threading approach). You pretty much get N times the performance then (where N is the number of cores your machine has). Quote I need to look into library support for vector functions, or perhaps complex math functions. My HOPE is that the complex math libraries are written to run optimized code on the target platform. I found some trig functions that operate on complex numbers, but not simple functions like addition and multiplication. (I don't claim to understand how you take the cosine of a complex number) gcc offers two different complex number packages - one from the C99 (?) complex data type (which offers an intrinsic complex data type that allows the compiler to do things like basic operations), and another that involves the <complex> C++ template based library. Neither are accelerated to support AltiVec/PPC (but due to the way that C++ works, you can extend that one via template specialization, but it's a fair amount of work). There is also the "Accelerate.framework" that Apple provides, which provides a full range of basic AltiVec/SSE based numerics (the usual assortment of transcendental and trigonometric functions), as well as routines to apply simple operations to large arrays of numbers (which are highly tuned to the specific machine, unfortunately these aren't of much use for fractal generation). There is also macstl, a derivative of STL that includes support for AltiVec and SSE based operation, which might be of use to you (though the last time I checked, it was still missing a few features I needed). Can you point me in the direction of some resources on AltiVec/SSE optimization? I haven't tackled either before. Also, isn't the optimization different for G4 vs G5 and Intel Core duo vs. Core 2 duo (SSE vs SSE2)? It seems to me I'd get decent performance benefits by teaching my code to do simultaneous operations for complex add and multiply operations. (since Z^2 works out to (a+bi)^2, or a^2 - b^2 + 2abi. The a^2, b^2, and 2ab terms could all be computed at once. Adding two complex numbers, I should also be able to use AltiVec/SSE to add the real and imaginary components simultaneously.) Is it true that PPC G4 AltiVec is single precision floating point only? That's a deal-killer for me. My app is based on double precision. Single precision falls apart too fast for high "magnification" zooms. Double precision seems to fall apart at around 10^-14. Duncan C Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: Duncan C on March 23, 2007, 05:23:00 PM Animation, you say? Bring it on. I'd love something new and fast for parameter-interpolated zoom animations on OSX. Have you used Graham Anderson's Escape or Jesse Jones' Mandella? I haven't tried either. Clearly I need to get my head out of my code long enough to survey the other apps that are out there. Can you point me to links to these two? Duncan C Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: Duncan C on March 23, 2007, 05:43:34 PM Properly done, these can easily accelerate your code by a factor of 8 or more.
Note that it's hard to get really good speed improvements on SSE as easily as you can with AltiVec, due to the limited number of vector registers (until you go to 64-bit x86 code, which you won't be able to reasonably do until Leopard ships), but depending on the image, it's not that hard to get 3x+ due to AltiVec (quadrium often gets 3.5 times the performance with AltiVec enabled, but only about 2x for SSE) i got nearly a clean 4x speedup from using sse, and that's on the simple mandelbrot/julia - not something more compute-limited like quadrium; what gives? Threading is also a must - basically split the image into parts, and have each part calculated by a different thread. It's not all that hard to add in once you've got the basics of threading (and this is a fairly simple threading approach). You pretty much get N times the performance then (where N is the number of cores your machine has). that's of course assuming you have an equal workload for all pixels, otherwise you'll have threads twiddling thumbs after a while. to be fair you did mention that uniform subdivision is a simple approach, so this is obviously @ duncan and not at you. The way I deal with that is to break the plot into a fairly large number of pieces, and queue the pieces up for computation. I break the plot into at least (num_processors x 2) pieces. I spawn as many compute threads as I have processors, and when a thread finishes, I spawn another one for the next piece of the plot. I'm still fine-tuning the optimum number of pieces. I do sometimes end up with one thread working alone at the end if it contains a large number of pixels at max iterations. For brute-force computation, it would be pretty easy to break the plot into pieces that have roughly the same number of iterations, by sub-sampling the region and estimating the total number of iterations required. However that approach breaks down for a boundary following algorithm (see below), since contiguous regions with the same iteration value would only require that the perimeter pixels be computed. I haven't yet figured out how I'll break up plots using that approach. I might just go for equal area and call that "close enough." My program already builds a histogram for a plot, tracking the number of pixels at each iteration value. I use the histogram in building a color table to display the plot. I use a large color change between iteration values with lots of pixels, and small color change between iteration values with a small number of pixels. This makes color tables that adapt nicely to the complexity of the plot. Once I implement boundary following, I may add a second histogram that tracks "average number of continuous pixels" for each iteration value. That should give me a better metric for building color tables than my current histogram does. gcc offers two different complex number packages - one from the C99 (?) complex data type (which offers an intrinsic complex data type that allows the compiler to do things like basic operations), and another that involves the <complex> C++ template based library. Neither are accelerated to support AltiVec/PPC (but due to the way that C++ works, you can extend that one via template specialization, but it's a fair amount of work). i'm pretty sure core 2 duo has specific instructions for accelerating complex number computations (finally!), so it's reasonable to assume that a recent gcc will spit those instructions if compiled with the right march-flag (microarchitecture).
of course that doesn't help our friend duncan, but i'd be interested to know if that's indeed the case. finally, there are ways to speed up general fractal computations without resorting to special cases for various fractal types, but in the end your time will be more rewarded by just biting the bullet and going for the big speedup. put another way, if you ever want to catch up with quadrium, you'll have to get your hands dirty ;) I'm planning to tackle a boundary following algorithm for Mandelbrot and Julia sets. This produces HUGE speed improvements for all but the most convoluted of plots, since each portion of an iteration band usually has an area that's large compared to its perimeter. For plots where the iteration values change too rapidly to have regions with significant interiors, there's still a large payoff in using boundary following to identify Mandelbrot and Julia regions, since those are the most expensive of all to calculate. There's a Java app I found that blows the doors off anything else I've seen, using double precision math in Java, with no machine-specific optimization. It is able to render several frames a second at 1000 iterations on my single processor G4. See this link: http://www.ibiblio.org/e-notes/MSet/Anim/ManJuOrb.htm. It's very impressive. The author even posted the source code, which should help me quite a bit. Duncan C Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: lycium on March 23, 2007, 05:47:33 PM I haven't tried either. Clearly I need to get my head out of my code long enough to survey the other apps that are out there. Can you point me to links to these two? lazy! 10 seconds of googling reveals: http://asia.cnet.com/downloads/mac/swinfo/0,39001908,20000561r-39035314s,00.htm http://xahlee.org/PageTwo_dir/MathPrograms_dir/fractals.html now to answer your other questions... The way I deal with that is to break the plot into a fairly large number of pieces, and queue the pieces up for computation. I break the plot into at least (num_processors x 2) pieces. I spawn as many compute threads as I have processors, and when a thread finishes, I spawn another one for the next piece of the plot. I'm still fine-tuning the optimum number of pieces. I do sometimes end up with one thread working alone at the end if it contains a large number of pixels at max iterations. more pieces == more overhead, it's not about that at all. it's about load balancing, which is best solved by a thread pool. see recent discussions between trifox and i about mutatorwhatitsthingy's multithreading on these forums. For brute-force computation, it would be pretty easy to break the plot into pieces that have roughly the same number of iterations, by sub-sampling the region and estimating the total number of iterations required. that's not easy at all, and demands a priori knowledge of your fractal iteration. i say again: mandelbrots and julias belong in the 80s, you're fighting a losing battle trying to optimise that specific problem more than others have - rather make ANY fractal render more efficiently, and render something new! I'm planning to tackle a boundary following algorithm for Mandelbrot and Julia sets. This produces HUGE speed improvements for all but the most convoluted of plots, since each portion of an iteration band usually has an area that's large compared to its perimeter.
For plots where the iteration values change too rapidly to have regions with significant interiors, there's still a large payoff in using boundary following to identify Mandelbrot and Julia regions, since those are the most expensive of all to calculate. let's look ahead a little bit... say you get your boundary following algo working, massively outrageously optimised to the point where you can do 1920x1200 images at 120fps like my graphics card can. you're still rendering mandelbrots while everyone else is doing the cool stuff!! i'd really like to know why you're focusing all this attention on such a ridiculously simple problem, with no intention of walking your own path :( Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: Duncan C on March 23, 2007, 06:14:35 PM I haven't tried either. Clearly I need to get my head out of my code long enough to survey the other apps that are out there. Can you point me to links to these two? lazy! 10 seconds of googling reveals: http://asia.cnet.com/downloads/mac/swinfo/0,39001908,20000561r-39035314s,00.htm http://xahlee.org/PageTwo_dir/MathPrograms_dir/fractals.html Guilty as charged. Thanks for the links. now to answer your other questions... The way I deal with that is to break the plot into a fairly large number of pieces, and queue the pieces up for computation. I break the plot into at least (num_processors x 2) pieces. I spawn as many compute threads as I have processors, and when a thread finishes, I spawn another one for the next piece of the plot. I'm still fine-tuning the optimum number of pieces. I do sometimes end up with one thread working alone at the end if it contains a large number of pixels at max iterations. more pieces == more overhead, it's not about that at all. it's about load balancing, which is best solved by a thread pool. see recent discussions between trifox and i about mutatorwhatitsthingy's multithreading on these forums. Understood. That's what I was saying. There's an optimum number of pieces that gets the best balance between overhead and load balancing. For brute-force computation, it would be pretty easy to break the plot into pieces that have roughly the same number of iterations, by sub-sampling the region and estimating the total number of iterations required. that's not easy at all, and demands a priori knowledge of your fractal iteration. i say again: mandelbrots and julias belong in the 80s, you're fighting a losing battle trying to optimise that specific problem more than others have - rather make ANY fractal render more efficiently, and render something new! I'm planning to tackle a boundary following algorithm for Mandelbrot and Julia sets. This produces HUGE speed improvements for all but the most convoluted of plots, since each portion of an iteration band usually has an area that's large compared to its perimeter. For plots where the iteration values change too rapidly to have regions with significant interiors, there's still a large payoff in using boundary following to identify Mandelbrot and Julia regions, since those are the most expensive of all to calculate. let's look ahead a little bit... say you get your boundary following algo working, massively outrageously optimised to the point where you can do 1920x1200 images at 120fps like my graphics card can. I've heard about people writing code to use the multiprocessor ability of their graphics cards to do fractal rendering, but 1920x1200 sounds a bit over the top.
you're still rendering mandelbrots while everyone else is doing the cool stuff!! i'd really like to know why you're focusing all this attention on such a ridiculously simple problem, with no intention of walking your own path :( I'm just getting started. I'm tackling Mandelbrot and Julia sets first, to get warmed up. I also think there's a lot of potential for original work there, even still. It's a very rich subject. I have significant plans for animated "movies," for example. When I exhaust that, I'll move on to other formulas. Many of the tools and approaches I'm using will be quite applicable to other types of iterative fractals. Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: lycium on March 23, 2007, 06:25:10 PM Are you saying your GRAPHICS CARD can do high iteration plots at 1920x1200 at 120 fps, or that it can display such plots at that rate? I've heard about people writing code to use the multiprocessor ability of their graphics cards to do fractal rendering, but 1920x1200 sounds a bit over the top. there's a screenshot in an older thread of mine: http://www.fractalforums.com/index.php?topic=386.0 yeah it's "only" single precision floating point (also used for colour-computations), but double precision can easily be used too. the take-away point is that no one cares because mandelbrots and julias have been done TO DEATH, and that application was written in a few seconds probably just to illustrate the computational power of modern gpus. I'm just getting started. I'm tackling Mandelbrot and Julia sets first, to get warmed up. I also think there's a lot of potential for original work there, even still. It's a very rich subject. I have significant plans for animated "movies," for example. When I exhaust that, I'll move on to other formulas. Many of the tools and approaches I'm using will be quite applicable to other types of iterative fractals. the "one step at a time" point is fair enough, but you didn't really answer my question: doing a single specific kind of fractal fast is sidestepping the greater goal of writing a good, generic fractal renderer - that code simply won't be used because as soon as you start exploring interesting other kinds of fractals 9/10ths of your program is useless! edge tracing, symmetry exploitation, none of those techniques will help when you have fully flexible fractal iteration. an excellent point of comparison at this stage is quadrium. he constructs his fractal iterations via a directed graph, entered via a gui, and has to iterate THAT fast! that's what i call a good learning experience, and certainly his results speak volumes: http://gandreas.deviantart.com/ edit: oh and you misunderstood what i said about load balancing; splitting the image into smaller pieces alone is a very poor way to distribute your computation. once you've split the image, you should dole out work to the worker threads as soon as they are done with their previous task, in a dynamic fashion. a simple and near-perfect alternative that only works in this instance, is to just spawn two threads to do the even/odd scanlines (with obvious generalisation for N cores). Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: Duncan C on March 24, 2007, 04:22:17 AM snip... oh and you misunderstood what i said about load balancing; splitting the image into smaller pieces alone is a very poor way to distribute your computation.
once you've split the image, you should dole out work to the worker threads as soon as they are done with their previous task, in a dynamic fashion. a simple and near-perfect alternative that only works in this instance, is to just spawn two threads to do the even/odd scanlines (with obvious generalisation for N cores). I think I understood you. I divide my plot into pieces, and create a single queue of "render tasks", each of which represents a piece of roughly the same size. (Much like a single line at a bank feeds customers to the tellers.) I then start a rendering process where I take render tasks off my task queue and hand them to worker threads. My code keeps as many worker threads running as there are processors. When a thread completes, I take another render task off the queue and spawn another worker thread to render it. As a result, all the processors on the system stay busy until the queue of render tasks has run out. The worst case for my code is when the last task in the queue takes significantly longer than any other. In that case, the queue is empty and the other processors end up idle. It would be better if each task required roughly the same time to complete, but I am going to re-factor my rendering soon in a way such that estimating the compute time will be pretty difficult. (boundary following.) I expect the variation in compute times between tasks to drop quite a bit, however, since large areas at max. iterations are the largest reason for slow rendering areas, and boundary following greatly reduces the number of points in a region that actually have to be computed. I should probably do a little extra work and have my worker threads go into an idle state between tasks rather than shutting down. That would save the overhead of spawning a new thread for each task. My implementation was definitely "quick and dirty." Is that what you were referring to? Do you think the overhead of spawning a worker thread for each task has a major impact on efficiency? Duncan C Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: lycium on March 24, 2007, 01:22:58 PM clarified for sure, misunderstanding was on my part; yeah there is considerable overhead in spawning/terminating threads compared to just having a pool of them and re-using. it depends on your application though, like if you use lots of supersampling i doubt you can measure the difference (unless you're creating really a ton of threads); for realtime purposes (like generating fractals at 120 fps ;)) you'll want the pool.
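for reference, the pool itself is maybe twenty lines with pthreads - here's a sketch (error handling omitted, and renderPiece is a made-up stand-in for whatever computes one chunk of the plot):

    #include <pthread.h>

    struct Pool { pthread_mutex_t lock; int next, count; };

    // each thread grabs the next task index until none remain - dynamic load balancing for free
    void *worker(void *arg) {
        Pool *p = (Pool *)arg;
        for (;;) {
            pthread_mutex_lock(&p->lock);
            int task = (p->next < p->count) ? p->next++ : -1;
            pthread_mutex_unlock(&p->lock);
            if (task < 0) return 0;
            // renderPiece(task);  // hypothetical: render one piece of the plot
        }
    }

    // spawn once, join once - no per-task thread creation
    void renderAll(int pieces, int cores) {
        Pool p = { PTHREAD_MUTEX_INITIALIZER, 0, pieces };
        pthread_t t[32];
        if (cores > 32) cores = 32;
        for (int i = 0; i < cores; ++i) pthread_create(&t[i], 0, worker, &p);
        for (int i = 0; i < cores; ++i) pthread_join(t[i], 0);
    }

the threads live for the whole render and the mutex is held only for an increment, so the queueing overhead is essentially zero.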
still, in this simple instance where your workload is still relatively consistent, i think the best (measured by results/effort) approach would just be to spawn as many threads as you have processors*cores, and let them process the scanlines in an interleaved fashion. Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: Duncan C on March 24, 2007, 06:45:23 PM clarified for sure, misunderstanding was on my part; yeah there is considerable overhead in spawning/terminating threads compared to just having a pool of them and re-using. it depends on your application though, like if you use lots of supersampling i doubt you can measure the difference (unless you're creating really a ton of threads); for realtime purposes (like generating fractals at 120 fps ;)) you'll want the pool. still, in this simple instance where your workload is still relatively consistent, i think the best (measured by results/effort) approach would just be to spawn as many threads as you have processors*cores, and let them process the scanlines in an interleaved fashion. I create less than 10 threads for a typical plot. I'll have to look at refactoring my code so that I keep the worker threads around until my task pool is empty. Given the small number of threads, however, I suspect the speed improvement will be very small. Doing alternating scan-lines is not a good choice for me because my next step is the boundary following algorithm. My goal is multiple frames per second on an average machine for a typical plot. I want my program to be able to do animated "walks" through the Mandelbrot set, and through Julia sets. That's the real "meat" of my application, and where I feel there's room for innovation. Julia sets transform in fascinating ways as the seed point moves from one place to another, and the animations can be quite striking. If I can get the rendering speed fast enough, I'll be able to generate these walks "on the fly." I've discovered through investigation that Julia sets have different structure on several different scales (at least 3) based on the location of their source point. The structure of the macro scale (width of 3 - 4) is dictated by the Julia's source-point on the macro scale (e.g. in the "cleft" of the cardioid, in "seahorse valley", along the central spike at i=0, etc.) If you zoom in on the Julia, at a width of .2 or so, the detail at that scale takes the structure of the second level detail from the Mandelbrot set. If you zoom in further on the Julia plot, its third scale of detail (at a width of around .02 or so) takes the form of the next level of detail from the source point, etc. I'm planning on creating animations that show a Mandelbrot plot at several scales, or levels of magnification, and corresponding Julia sets, also at several levels of magnification. I'll create a Julia walk that moves through different structures and sub-structures of the Mandelbrot set and shows the change in structure of the Julia set, also at several different scales. I need as much rendering speed as possible in order to get a reasonable frame rate for such an animation. I also need at least double precision for my calculations, because the computations fall apart when the magnification gets too high. The different scales in Mandelbrot plots seem to require larger changes in magnification than different scales in Julias. After 4 or 5 scales on Mandelbrot plots, I run out of precision (my computations seem to fall apart at around 10^-14 or so.)
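To put a number on that: a double carries roughly 16 significant decimal digits, so once the pixel spacing shrinks to within a couple of orders of magnitude of 10^-16 times the coordinate values, neighboring pixels start collapsing to the same number. A two-line check (illustrative only, assuming a view centered near -0.75):

    #include <cstdio>
    int main() {
        double center = -0.75, pixelStep = 1e-17;          // spacing below one ulp of the center
        std::printf("%d\n", center + pixelStep == center); // prints 1: the "next" pixel is the same point
    }

Visible artifacts show up a decade or two before that total collapse, which is consistent with things falling apart around 10^-14.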
Duncan C Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: lycium on March 24, 2007, 07:01:31 PM i look forward to seeing the results! ;D
Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: gandreas on March 26, 2007, 06:18:19 PM Can you point me in the direction of some resources on AltiVec/SSE optimization? I haven't tackled either before. http://developer.apple.com/hardwaredrivers/ve/quickstart.html http://developer.apple.com/documentation/Performance/Conceptual/Accelerate_sse_migration/migration_intro/chapter_1_section_1.html http://developer.apple.com/hardwaredrivers/ve/sse.html Quote Also, isn't the optimization different for G4 vs G5 and Intel Core duo vs. Core 2 duo (SSE vs SSE2)? Not all that much - especially with regards to the vector stuff (and the compiler can handle the instruction scheduling issues). The G4 and G5 AltiVec is essentially identical, and the baseline hardware requirement for OS X on Intel is SSE2 (and SSE3 may be available as well, but it's not guaranteed). Quote It seems to me I'd get decent performance benefits by teaching my code to do simultaneous operations for complex add and multiply operations. (since Z^2 works out to (a+bi)^2, or a^2 - b^2 + 2abi. The a^2, b^2, and 2ab terms could all be computed at once. Adding two complex numbers, I should also be able to use AltiVec/SSE to add the real and imaginary components simultaneously.) Ideally, you want to do many similar operations on many parallel independent chunks of data - so you'll get better results if you calculate 4 pixels at once than if you try to split up parts of the calculation for only 1 pixel. Quote Is it true that PPC G4 AltiVec is single precision floating point only? That's a deal-killer for me. My app is based on double precision. Single precision falls apart too fast for high "magnification" zooms. Double precision seems to fall apart at around 10^-14. That is true - AltiVec only does single precision. SSE can do double precision, but you can only process two elements at a time then (so you can only theoretically get a maximum of 2x improvement instead of 4x). Also note that unless you're doing 64 bit mode (which requires Leopard to do anything other than a command line tool), you may not have enough SSE registers available to come close to that 4x (or 2x) improvement. Note there are tricks that can be done with single precision to get more precision, but basically they involve using two of them to simulate double precision (basically, the number is broken down into two numbers whose sum is the original number; for example, 123.456 would be broken down into 123e0 + 456e-3, allowing for twice the precision bits, though the same exponent range). For other sorts of performance enhancements, you may want to look at XGrid, which would allow for distributed rendering (actually, the XGrid sample projects include one to draw a mandelbrot image, though not very well). The XGrid sample may even have an AltiVec version as well... Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: David Makin on March 27, 2007, 11:20:15 AM (my computations seem to fall apart at around 10^-14 or so.) Duncan C Hi Duncan - if you're using double and the calculations start falling apart at 10^-14 or so then I'm guessing you've implemented the same optimisation that I did in MMFrac i.e. you're using deltas for the x(real) and y(imag) loops.
To get greater precision using double you need to explicitly calculate start + pixelcount*delta for the real and imag loop values instead of accumulating deltas. Title: Re: New version of FractalWorks w/ full multiprocessor support Post by: lycium on March 27, 2007, 11:34:21 AM should be roughly the same speed.
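i.e. the two loop variants are just (a sketch; the iterate call is elided):

    // (a) running sum: every += rounds, and the error can grow as you walk across the row
    // (b) recompute: one multiply-add per pixel, error bounded by a single rounding
    void scanline(double xmin, double dx, int w) {
        double xa = xmin;
        for (int px = 0; px < w; ++px) { /* iterate at xa */ xa += dx; }  // (a)
        for (int px = 0; px < w; ++px) {                                  // (b)
            double xb = xmin + px * dx;
            /* iterate at xb */ (void)xb;
        }
    }

one extra multiply per pixel is noise next to the thousands of iterations spent on that pixel, hence no measurable speed difference.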
Title: Re: New version of Fractalworks w full mulitprocessor support Post by: Duncan C on March 28, 2007, 04:31:06 AM

Quote: Hi Duncan - if you're using double and the calculations start falling apart at 10^-14 or so then I'm guessing you've implemented the same optimisation that I did in MMFrac, i.e. you're using deltas for the x (real) and y (imag) loops. To get greater precision using double you need to explicitly calculate start + pixelcount*delta for the real and imag loop values instead of using deltas.

Hmm. I do indeed use a precalculated delta (or step) value to calculate my x and y value for a point. How much more precision do you think I'd gain if I refactored my code to calculate the x and y position each time? And I wonder if this would make a detectable difference in performance. Multiplies usually take longer than adds. Of course, we're talking a single multiply vs. a single add per pixel, not per iteration.

P.S. I already ran into precision problems. My code does a "snap to grid" on my real and imaginary bounds so that the origin lands on a pixel. That lets me map symmetric regions of a plot without being off by a fraction of a step. Those calculations run into precision problems, and I had to resort to "long double" variables for that (once per plot).

Duncan C

Title: Re: New version of Fractalworks w full mulitprocessor support Post by: lycium on March 28, 2007, 04:38:25 AM

Quote: How much more precision do you think I'd gain if I refactored my code to calculate x and y position each time?

lazy! test it out, that's < 1 line of code. or you could do a little thought experiment: how often have your x/y values been subjected to roundoff errors in the further reaches of your image (depending on how you split it)? i guarantee you that the "snapping" code you have, besides precluding supersampling by the sound of things, is a zillion times slower than just 2 extra multiplies.

Quote: And I wonder if this would make a detectable difference in performance. Multiplies usually take longer than adds.

lazy! read my post above your last ;) with sse, afaik multiply and add have the same latency but you have less throughput with a multiply (half speed?).

Title: Re: New version of Fractalworks w full mulitprocessor support Post by: Duncan C on March 28, 2007, 04:58:43 AM

Quote: lazy! test it out, that's < 1 line of code. or you could do a little thought experiment: how often have your x/y values been subjected to roundoff errors in the further reaches of your image (depending on how you split it)? i guarantee you that the "snapping" code you have, besides precluding supersampling by the sound of things, is a zillion times slower than just 2 extra multiplies.

I think not. My "snap to grid" logic takes place ONCE, at the beginning of a plot, and involves a half-dozen floating point calculations, once. 2 multiplies/pixel is a lot more calculations than a half-dozen calculations for the whole plot. Also, my snap to grid code lets me recognize symmetric areas of my plot and skip ALL the calculations for symmetric areas, and instead just copy the pixels. That saves potentially millions of floating point calculations.

Title: Re: New version of Fractalworks w full mulitprocessor support Post by: lycium on March 28, 2007, 05:21:11 AM

Quote: How much more precision do you think I'd gain if I refactored my code to calculate x and y position each time?

design change? you will still use those very same step values!

Quote: I think not. My "snap to grid" logic takes place ONCE, at the beginning of a plot, and involves a half-dozen floating point calculations, once. 2 multiplies/pixel is a lot more calculations than a half-dozen calculations for the whole plot. Also, my snap to grid code lets me recognize symmetric areas of my plot and skip ALL the calculations for symmetric areas, and instead just copy the pixels. That saves potentially millions of floating point calculations.

hmm, i don't really understand the snapping then, but what's especially enigmatic to me is your investment in symmetry exploitation... 50% at best (i.e. the centred mandelbrot and julia we all have burnt into our retinas from the 80s) is small fish, and with any degree of zooming you can basically call it 0%.

Title: Re: New version of Fractalworks w full mulitprocessor support Post by: cKleinhuis on March 28, 2007, 08:18:35 AM

Quote: hmm, i don't really understand the snapping then, but what's especially enigmatic to me is your investment in symmetry exploitation... 50% at best (i.e. the centred mandelbrot and julia we all have burnt into our retinas from the 80s) is small fish, and with any degree of zooming you can basically call it 0%.
i must agree with that; besides, even an optimization based on rectangles with the same edge color gains you nothing if you use a coloring method other than iteration->colorindex

Title: Re: New version of Fractalworks w full mulitprocessor support Post by: Duncan C on March 28, 2007, 03:36:56 PM

Quote: i must agree with that; besides, even an optimization based on rectangles with the same edge color gains you nothing if you use a coloring method other than iteration->colorindex

My program is based on iteration counts and color tables. I calculate the iteration values for a plot, and then map iteration values to colors for the points in the plot. I map iteration values to colors at display time. My program allows me to go back and use DIFFERENT colors at any time, in near real time. Changing colors on an on-screen plot makes for interesting animations. Colors are only used for display. The boundary following algorithm I am planning to use will trace contiguous areas with the same ITERATION VALUE, not color. It's true that this approach will not be useful for other methods of coloring a plot, like using fractional escape values.

Duncan C

Title: Re: Snap to grid, symmetry, rounding errors from repeated adding of deltas Post by: Duncan C on March 28, 2007, 03:58:28 PM

Quote: How much more precision do you think I'd gain if I refactored my code to calculate x and y position each time?

Quote: lazy! test it out, that's < 1 line of code. or you could do a little thought experiment: how often have your x/y values been subjected to roundoff errors in the further reaches of your image (depending on how you split it)? design change? you will still use those very same step values!

complex_x_coord = min_complex_x + (pixel_x/total_x_pixels) * x_width
complex_y_coord = min_complex_y + (pixel_y/total_y_pixels) * y_height

That way each x and y position would be calculated directly, rather than based on a step value. I guess multiplying the pixel coordinates by the step value would give fewer rounding errors than adding the step value to each previous coordinate, but it seems to me that getting rid of the step value entirely would give the most accurate results.

Quote: i guarantee you that the "snapping" code you have, besides precluding supersampling by the sound of things, is a zillion times slower than just 2 extra multiplies.

Quote: I think not. My "snap to grid" logic takes place ONCE, at the beginning of a plot, and involves a half-dozen floating point calculations, once. 2 multiplies/pixel is a lot more calculations than a half-dozen calculations for the whole plot. Also, my snap to grid code lets me recognize symmetric areas of my plot and skip ALL the calculations for symmetric areas, and instead just copy the pixels. That saves potentially millions of floating point calculations.

Quote: hmm, i don't really understand the snapping then, but what's especially enigmatic to me is your investment in symmetry exploitation... 50% at best (i.e. the centred mandelbrot and julia we all have burnt into our retinas from the 80s) is small fish, and with any degree of zooming you can basically call it 0%.

For Mandelbrot plots, symmetry isn't all that useful, since most of the interesting areas are well off the x axis.
For Julias, however, it has real benefit. I find the symmetry of Julia sets to be part of their visual appeal. As mentioned in a previous post, I've been investigating the way Julia sets have structure on several different scales, depending on the source Mandelbrot point's location. For this type of investigation, I zoom in on the center of a Julia plot repeatedly. All the resulting plots are symmetric, and mapping symmetric areas and copying them doubles the plot speed. I think a 100% speed improvement is worth the effort. I'll post a screen shot from my application that shows what I'm talking about later today if I get a chance.

Duncan C

Title: Re: Snap to grid, symmetry, rounding errors from repeated adding of deltas Post by: David Makin on March 28, 2007, 04:43:34 PM

Quote: That's not how I understood the original poster's comments. The way I understood it, the step values were the problem. As the code handles plots at higher and higher magnifications, the step values get very small, and start to have rounding errors. I got the impression that he was suggesting rewriting the plot code to calculate the x y position of each point as complex_x_coord = min_complex_x + (pixel_x/total_x_pixels) * x_width, complex_y_coord = min_complex_y + (pixel_y/total_y_pixels) * y_height.

But surely your delta/step values are precalculated as x_width/total_x_pixels and y_height/total_y_pixels in any case (that's what lycium meant by "the very same step values") - you only add the slight overhead of 1 multiply per outer loop and 1 per inner loop, which is not much extra for the relevant increase in accuracy considering the speed of FPU multiplies nowadays (it should give you around an extra *100 magnification).

Title: Re: Snap to grid, symmetry, rounding errors from repeated adding of deltas Post by: Duncan C on March 28, 2007, 10:18:35 PM

Quote: But surely your delta/step values are precalculated as x_width/total_x_pixels and y_height/total_y_pixels in any case (that's what lycium meant by "the very same step values") - you only add the slight overhead of 1 multiply per outer loop and 1 per inner loop, which is not much extra for the relevant increase in accuracy considering the speed of FPU multiplies nowadays (it should give you around an extra *100 magnification).

Is it the repetitive addition that leads to errors, or is it the small scale of the delta value? My thinking is that the method that would lead to the LEAST error is what I posted, calculating the x and y coordinates directly, rather than multiplying by a precalculated delta value. I'm up to my elbows in other code changes right now. When I get a chance I'll do some testing.

Duncan C

Title: Sample images of Julia structure at different scales Post by: Duncan C on March 28, 2007, 10:33:32 PM

I posted an image on my pBase account that shows what I am talking about with Julia sets having different structure at different scales, reflecting the morphology of the Mandelbrot set at different scales.
Click the link below to see the full sized image, along with an explanation of the plot:
http://www.pbase.com/duncanc/image/76305937/large.jpg

Click to see the larger "original" version on pbase: http://www.pbase.com/duncanc/image/76305937/original

Duncan C

Title: Re: New version of Fractalworks w full mulitprocessor support Post by: David Makin on March 29, 2007, 01:21:07 PM

When using plain delta values (i.e. without the extra multiplies) it's the repetitive add that causes the rounding errors.
For example, if the real value is 1.0 but the delta is 0.000000000001234556, then some of those trailing digits will get lost on each addition (i.e. the error is incremental), but if you use basevalue + numpixels*delta, e.g. 1.0 + numpixels*0.000000000001234556, then the rounding error is reduced (it is no longer incremental). There's still a rounding error, but it allows zooming to an extra *100 or so.

Title: Re: New version of Fractalworks w full mulitprocessor support Post by: Duncan C on March 29, 2007, 01:46:33 PM

Quote: When using plain delta values (i.e. without the extra multiplies) it's the repetitive add that causes the rounding errors. For example, if the real value is 1.0 but the delta is 0.000000000001234556, then some of those trailing digits will get lost on each addition (i.e. the error is incremental), but if you use basevalue + numpixels*delta, e.g. 1.0 + numpixels*0.000000000001234556, then the rounding error is reduced (it is no longer incremental). There's still a rounding error, but it allows zooming to an extra *100 or so.

That makes sense. Thanks for the clarification. Don't you think I would gain additional precision if I didn't use a delta value at all, but instead calculated my x and y position for each pixel position?

Duncan C

Title: Re: New version of Fractalworks w full mulitprocessor support Post by: David Makin on March 29, 2007, 06:44:41 PM

No, there's no point doing the divide to calculate the deltas within the loops - the extra precision from doing that would be negligible, if any, and divides are considerably slower than multiply or add.
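David's explanation is easy to check numerically. A small C test, using his 1.0 + tiny-delta example (the pixel count of 4000 is arbitrary), prints the incremental error directly:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        const double base  = 1.0;
        const double delta = 1.234556e-12;  /* David's example step */
        const int    n     = 4000;          /* e.g. pixels across a row */

        /* Repeated addition: low-order bits of each sum are rounded
           away, and the loss compounds step by step. */
        double acc = base;
        for (int i = 0; i < n; i++) acc += delta;

        /* base + numpixels*delta: one multiply, one rounding step. */
        double direct = base + n * delta;

        printf("accumulated: %.17g\n", acc);
        printf("direct:      %.17g\n", direct);
        printf("difference:  %g\n", fabs(acc - direct));
        return 0;
    }

The same reasoning answers Duncan's divide question: pixel_x * delta and (pixel_x/total_x_pixels) * x_width round to essentially the same coordinate, so the per-pixel divide buys nothing.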
Title: Re: New version of Fractalworks w full mulitprocessor support Post by: lycium on March 30, 2007, 11:40:05 AM

Quote: No, there's no point doing the divide to calculate the deltas within the loops - the extra precision from doing that would be negligible, if any, and divides are considerably slower than multiply or add.

correct, by a factor of 200 or more, ESPECIALLY in double precision. that's the big kicker duncan: you're willing to sacrifice a ton of speed to get that extra accuracy, but then you squander it on cumulative roundoff error! try this: time how long it takes to add 1/x x times to an accumulating variable, and see how far off the answer gets from 1 as x increases. then time how long it takes you to do x multiplies, and finally time how long x divides take...

a quote from donald knuth is appropriate here: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." (emphasis mine)

i've said several times that you're spending your time in the wrong places. i bet there are speedups of 10x or more lurking in your program if you're not aware of basics like constant optimisation, effective use of reciprocals, threading principles, etc. spending time on this symmetry stuff* and worrying about the (negligible) speed difference between an add and a multiply, particularly at an extreme precision cost, is holding you back: rather learn to write tight generic code than desperately try to speed up poor algorithms.

* that's REALLY clutching at straws, and if you can't convince yourself of that you have good reason for despair. your speedup is directly proportional to the boringness of the resulting image, up to a maximum of only 2x (cf. the 200x mentioned above!), and as soon as you look away from the real axis it slows down. even if you're obsessive about that real axis, i don't think all your users (still considering going commercial?) will be.

Title: Re: New version of Fractalworks w full mulitprocessor support Post by: gandreas on March 30, 2007, 04:45:19 PM

Quote: a quote from donald knuth is appropriate here: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." (emphasis mine)

And the corollary to this is "The biggest performance optimization is going from not working to working." After all, getting the wrong answer really fast isn't exactly useful...

Title: Precision improvements Post by: Duncan C on April 02, 2007, 06:10:59 AM

Quote: correct, by a factor of 200 or more, ESPECIALLY in double precision. that's the big kicker duncan: you're willing to sacrifice a ton of speed to get that extra accuracy, but then you squander it on cumulative roundoff error! ... i've said several times that you're spending your time in the wrong places. i bet there are speedups of 10x or more lurking in your program if you're not aware of basics like constant optimisation, effective use of reciprocals, threading principles, etc. spending time on this symmetry stuff and worrying about the (negligible) speed difference between an add and a multiply, particularly at an extreme precision cost, is holding you back: rather learn to write tight generic code than desperately try to speed up poor algorithms.

lycium,

You lay it on a bit thick, don't you think? How about you lighten up a little? I thought this was a friendly discussion, not an inquisition. I've got over 25 years of commercial software development experience, and am able to pick and choose the work I do as a result of my commercial success. Please don't preach, okay?

You and David are probably right that the small increase in accuracy from doing a divide to calculate the coordinates of each pixel would not be worth the performance cost. Remember, though, that we are talking about one divide per pixel (plus one divide for each row), not a divide for each iteration.

I adjusted my code so that I don't do repetitive adds any more, and it improved things slightly. My images don't get distorted (skewed to one side due to roundoff errors) any more, but I am still seeing artifacts from computation error at the same magnification. I haven't done rigorous testing, but the changes seem to keep the images from completely falling apart for about another 5x magnification. Thanks, David, for the suggestion. I miscounted, by the way: my images are showing artifacts at a width of around 4E-15, not -14.

You're right that for Mandelbrot plots, the most interesting images are not on the real axis. However, I feel that the symmetry of Julia sets is part of their appeal. Further, there are different structures in Julia sets at a variety of different scales, but those structures appear centered on the origin. See my post of a few days ago, which included a set of images showing what I am talking about (or click here: http://www.pbase.com/duncanc/image/76305937/original).

Duncan C

Title: Re: Precision improvements Post by: lycium on April 02, 2007, 04:02:27 PM

Quote: You lay it on a bit thick, don't you think?

fair enough, the bit about despairing wasn't nice; i was frustrated that you left alone the obvious point which trifox and i were making about the relative utility of the symmetry optimisation.

Quote: How about you lighten up a little?

you telling me that certainly did lighten me up, lol! you also talk about me preaching:

Quote: I've got over 25 years of commercial software development experience, and am able to pick and choose the work I do as a result of my commercial success.

good for you; being highly qualified ought to speak for itself.

Quote: Please don't preach, okay?

facts are facts, i'm not preaching; if you dispute something i said, please let me know - we're all here to learn. note that i'm not making this stuff up for sport, it's for your benefit - feel free to tell me if you'd rather not hear it. that's really all i'd like (but not have) to say on this matter.
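As a footnote to the symmetry argument: the copy Duncan describes relies on the fact that a z^2 + c Julia set is unchanged under z -> -z, i.e. 180-degree rotation about the origin, so a view centred on the origin only needs half its iteration values computed. A hypothetical sketch of that copy, not FractalWorks' actual implementation:

    /* Mirror the already-computed top half of an iteration buffer into
       the bottom half, using the 180-degree rotational symmetry of a
       z^2 + c Julia set. Only valid when the view is centred on the
       origin -- which is what the "snap to grid" step guarantees. */
    static void mirror_bottom_half(int *iters, int width, int height)
    {
        for (int y = 0; y < (height + 1) / 2; y++) {
            for (int x = 0; x < width; x++) {
                int ry = height - 1 - y;  /* pixel rotated 180 degrees */
                int rx = width  - 1 - x;  /* about the image centre    */
                iters[ry * width + rx] = iters[y * width + x];
            }
        }
    }

The copy is pure memory traffic, which is where the near-2x for centred Julia zooms comes from; off-centre views get no benefit, which is lycium's objection.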