FWIW, Nanoscope's post-e308 scaling is working perfectly on 64-bit Windows. Java, no strictfp. I don't think I've implemented it the same as Kalle though. Kalle seems to be using a pair of doubles whose product is the real value of delta. I'm using one double and an exponent, and most iterations don't need to care about the exponent. The innermost post-e308 loop in Nanoscope is just:
delta_scaled_n+1 = delta_scaled_n * reference_orbit_n + delta_scaled_0.
No delta-squared term, because if delta is O(10^-308) or smaller, delta squared is smaller than delta by a similar factor and is lost in the noise. No 2*, because the copy of the reference orbit used for this loop is premultiplied by 2. If the reference orbit isn't about to escape (and if it does, it takes delta with it, as the two are sure to diverge past the bailout distance), it's confined to a disk of radius 2, so its doubled version lies in a disk of radius 4, and the modulus of the sum above is at most about five times that of delta_scaled_n -- provided delta_scaled_0 is still comparable in magnitude to delta_scaled_n, and more usually delta_scaled_0 is smaller. So delta_scaled's modulus grows by a factor of four or less on most iterations, which means it takes delta_scaled around 500 iterations to go from O(1) to O(10^308). Only then are delta_scaled (and delta_scaled_0) rescaled and the stored exponent adjusted. This makes for a very fast inner loop below e308.
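A loop in that style might look something like the following sketch. This is my own toy reconstruction, not Nanoscope's actual code: all names, the toy reference orbit (c = i, which is bounded), and the starting exponent are mine. The point it illustrates is that the exponent is shared between delta_scaled and delta_scaled_0 and is only touched on the rare rescale.

```java
// Hypothetical sketch of a post-e308 inner loop: delta is one complex double pair
// plus a shared power-of-two exponent, adjusted only when delta_scaled nears overflow.
public class ScaledDeltaLoop {

    /** Runs the toy scaled loop and returns the final stored exponent. */
    static long runScaledLoop() {
        // Build a premultiplied-by-2 reference orbit for c = i (bounded, period 2).
        int maxIter = 2000;
        double[] refX = new double[maxIter], refY = new double[maxIter];
        double rx = 0, ry = 0, cx = 0, cy = 1;
        for (int i = 0; i < maxIter; i++) {
            refX[i] = 2 * rx; refY[i] = 2 * ry;          // store 2*R_i directly
            double nx = rx * rx - ry * ry + cx;
            ry = 2 * rx * ry + cy;
            rx = nx;
        }

        // True delta = delta_scaled * 2^scaleExp; start far below the e308 floor.
        double d0x = 1.0, d0y = 0.5;                     // delta_scaled_0
        long scaleExp = -1200;
        double dx = d0x, dy = d0y;                       // delta_scaled_n

        for (int n = 0; n < maxIter; n++) {
            // delta_scaled_{n+1} = delta_scaled_n * (2*R_n) + delta_scaled_0
            // (no delta^2 term: it's below the noise at these magnitudes)
            double nx = dx * refX[n] - dy * refY[n] + d0x;
            double ny = dx * refY[n] + dy * refX[n] + d0y;
            dx = nx; dy = ny;

            // Rescale rarely: only when delta_scaled approaches double overflow.
            double m = Math.max(Math.abs(dx), Math.abs(dy));
            if (m > 1e300) {
                int shift = Math.getExponent(m);         // power-of-two magnitude of m
                double f = Math.scalb(1.0, -shift);      // exact: a pure exponent shift
                dx *= f; dy *= f; d0x *= f; d0y *= f;    // rescale delta AND delta_0
                scaleExp += shift;
                if (scaleExp >= 0) break;                // delta is O(1): back to plain PT
            }
        }
        return scaleExp;
    }

    public static void main(String[] args) {
        System.out.println("final scale exponent: " + runScaledLoop());
    }
}
```

Using Math.scalb with a power of two keeps the rescale exact: only exponents move, mantissas are untouched, so the loop's results don't depend on extended-precision quirks of the host FPU.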
In principle, this method allows zooming to nearly arbitrary depths; there's no further barrier at e616 or anything like that.
I've also greatly streamlined the memory use, without even having to recalculate the high precision orbit every frame or persist it to disk. High precision points are only needed in three places:
- The first point, and only the first point, for computing delta_0 per pixel.
- Each term R_n-1 is needed to compute the series coefficients A_n, B_n, C_n, and those only need wide exponents, not wide mantissas.
- Each term R_n is needed to compute the premultiplied-by-two, double-precision version of itself used in the iteration loop above and in the above-e308 loop.
So we only need to store:
- The first point R_0 at full precision.
- A_i, B_i, C_i up to some point n, as far as will work at the current magnification. We only need wide exponents, not wide mantissas, so these take up much less space.
- The n we've computed those up to.
- R_n at full precision, for computing R_n+1, A_n+1, B_n+1, C_n+1 when some deeper image with this reference point may need them.
- The points 2*R_i, up to R_maxit, at double precision.
And to compute:
- Per sequence of zooms with a given center point R_0, the double-precision points 2*R_i up to maxit. (Kept.)
- Per image, any additional coefficients A_i, B_i, C_i that can now be used at the deeper magnification (smaller side length), and R_i along with them, stopping at R_n_new, A_n_new, B_n_new, C_n_new. Only R_n_new and narrow-mantissa, wide-exponent versions of the new A_i need to be kept; the previous R_i can be discarded, except of course for R_0.
- Per pixel, delta_0 and delta_n (using A_n, B_n, C_n, delta_0).
- Per iteration, delta_i (or delta_scaled_i, and, every 500 iterations, scale and delta_scaled_0).
The high precision R_i are traversed twice, though, first to build the arrays of doubled doubles and again gradually to build up the sequence of ABC_i values. Small price to pay.
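To make the "wide exponents, not wide mantissas" idea concrete, here is a minimal sketch of such a representation and of building A_n, B_n, C_n with it, using the standard series-approximation recurrences (K. I. Martin / Superfractalthing). This is my own illustration, not Nanoscope's code, and it uses a real-axis reference point so the arithmetic can stay real; the real thing needs complex coefficients.

```java
// A minimal "wide exponent" number: a double mantissa plus a separate int exponent,
// value = m * 2^e. Plenty of range beyond e308, without a wide mantissa.
final class FloatExp {
    double m; int e;
    FloatExp(double m, int e) {
        if (m == 0) { this.m = 0; this.e = 0; }
        else { int k = Math.getExponent(m); this.m = Math.scalb(m, -k); this.e = e + k; }
    }
    FloatExp mul(FloatExp o) { return new FloatExp(m * o.m, e + o.e); }
    FloatExp add(FloatExp o) {
        if (m == 0) return o;
        if (o.m == 0) return this;
        return e >= o.e ? new FloatExp(m + Math.scalb(o.m, o.e - e), e)
                        : new FloatExp(Math.scalb(m, e - o.e) + o.m, o.e);
    }
    double toDouble() { return Math.scalb(m, e); }
}

public class SeriesCoeffs {
    /**
     * Builds A_n, B_n, C_n from a real reference orbit R using the usual
     * series-approximation recurrences:
     *   A_{n+1} = 2 R_n A_n + 1
     *   B_{n+1} = 2 R_n B_n + A_n^2
     *   C_{n+1} = 2 R_n C_n + 2 A_n B_n
     */
    static FloatExp[] coeffs(double[] R, int n) {
        FloatExp A = new FloatExp(0, 0), B = new FloatExp(0, 0), C = new FloatExp(0, 0);
        FloatExp one = new FloatExp(1, 0), two = new FloatExp(2, 0);
        for (int i = 0; i < n; i++) {
            FloatExp twoR = new FloatExp(2 * R[i], 0);
            FloatExp A1 = twoR.mul(A).add(one);
            FloatExp B1 = twoR.mul(B).add(A.mul(A));
            FloatExp C1 = twoR.mul(C).add(two.mul(A).mul(B));
            A = A1; B = B1; C = C1;
        }
        return new FloatExp[] { A, B, C };
    }

    public static void main(String[] args) {
        // Toy real-axis reference orbit for c = -1.25: R_0 = 0, R_{n+1} = R_n^2 + c.
        double[] R = new double[64];
        for (int i = 1; i < R.length; i++) R[i] = R[i - 1] * R[i - 1] - 1.25;
        System.out.println("A_2 = " + coeffs(R, 2)[0].toDouble());
    }
}
```

Per pixel, delta_n would then be initialized as A_n*delta_0 + B_n*delta_0^2 + C_n*delta_0^3 in the same representation, with delta_0 held as a FloatExp until it's folded into the double-plus-exponent form the inner loop uses.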
I've also found the need to identify the highest-iteration region in an image before calculating it, and sometimes to use a second reference point there. The algorithm I'm currently using for this proceeds in two phases.

First, it samples 100 points in the image region using perturbation theory (fast!), sorts them by iteration count, and discards all but the top ten. It then does a bit of graph-theoretic manipulation to identify a big cluster of close-together points: it finds the closest pair of samples, then collects every sample transitively reachable from those two by sample-to-sample hops no longer than 1.5x that pair's separation. The resulting, possibly reduced, set of points tends to be a "blob" of high-iteration material which, if off-center (i.e. not containing the primary reference point), may not be rendered correctly. The blob's diameter (longest distance between any two of its samples) and barycenter (average of the member samples' coordinates) are measured, and if the blob is significantly bigger than one pixel but significantly smaller than the region just examined, the process is recursively applied to the smaller region to refine the estimate of the high-iteration core's size and center.

When the blob stops shrinking, or is only a few pixels wide, the second phase begins. If the blob is now tiny (a few tens of pixels at most), all its pixels are iterated with high precision normal Mandelbrot calculations and the highest-iteration point is returned. If the blob is still substantial, high precision calculations take over from low, but instead of brute scanning, a recursive refinement that evaluates only logarithmically many points across the blob's diameter climbs the iteration gradient to guess the highest-iteration point.
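Phase one's clustering step might look something like the following sketch. This is my own toy reconstruction of the description above, not the actual implementation; all names are mine, and the sample data is made up to show an outlier being rejected.

```java
import java.util.*;

// Hypothetical sketch: keep the top samples by iteration count, find the closest
// pair, then grow the cluster by hops no longer than 1.5x that pair's separation.
public class HighIterBlob {
    record Sample(double x, double y, long iters) {}

    static double dist(Sample a, Sample b) { return Math.hypot(a.x - b.x, a.y - b.y); }

    /** Returns the "blob": samples transitively reachable from the closest pair. */
    static List<Sample> findBlob(List<Sample> samples, int keepTop) {
        // Keep only the highest-iteration samples.
        List<Sample> top = new ArrayList<>(samples);
        top.sort(Comparator.comparingLong((Sample s) -> s.iters).reversed());
        top = top.subList(0, Math.min(keepTop, top.size()));

        // Find the closest pair among the survivors.
        int bi = 0, bj = 1;
        double best = Double.MAX_VALUE;
        for (int i = 0; i < top.size(); i++)
            for (int j = i + 1; j < top.size(); j++) {
                double d = dist(top.get(i), top.get(j));
                if (d < best) { best = d; bi = i; bj = j; }
            }

        // Grow the cluster transitively, hop length capped at 1.5x the pair's separation.
        double maxHop = 1.5 * best;
        Set<Integer> blob = new HashSet<>(List.of(bi, bj));
        Deque<Integer> frontier = new ArrayDeque<>(blob);
        while (!frontier.isEmpty()) {
            int i = frontier.pop();
            for (int j = 0; j < top.size(); j++)
                if (!blob.contains(j) && dist(top.get(i), top.get(j)) <= maxHop) {
                    blob.add(j);
                    frontier.push(j);
                }
        }
        List<Sample> out = new ArrayList<>();
        for (int i : blob) out.add(top.get(i));
        return out;
    }

    public static void main(String[] args) {
        // A tight cluster near the origin plus one far-away outlier to be rejected.
        List<Sample> s = List.of(
            new Sample(0.00, 0.000, 900), new Sample(0.01, 0.000, 950),
            new Sample(0.00, 0.014, 940), new Sample(0.02, 0.010, 930),
            new Sample(5.00, 5.000, 999));
        System.out.println("blob size: " + findBlob(s, 10).size());
    }
}
```

Note that the blob is grown from the closest *pair*, not from the single highest-iteration sample, which is what lets an isolated high-iteration outlier (here the sample at (5, 5)) be excluded from the blob.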
Once the high-iteration point is identified, its iteration count is compared with the count the PT algorithm assigns to it. If they differ by ten or more iterations, PT is bailing out early in that region and we need a second reference point. The highest-iteration point itself is used as that point, which means a second 2*S_i array and D_i, E_i, F_i series coefficients. Points in the image are then iterated using PT with each reference set in turn, and the highest iteration count wins. This seems to clear up the "blank blobs" in the test images I've thrown at it so far, and even a corrupted minibrot in one test (where the reference point was in an incomparably smaller minibrot several diameters of the larger minibrot away from it). Sadly, detecting the need for a second reference point is slowish, taking a minute or so in the e70-e100, few-thousand-iteration "shallows" and half an hour at e300+ and 1e5+ iterations (java.math.BigDecimal is slow).
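Stripped of the PT machinery, the arbitration reduces to something like this sketch (hypothetical names, and the threshold of ten is from the description above):

```java
import java.util.function.ToLongBiFunction;

// Hypothetical sketch of the dual-reference arbitration: the high-iteration point's
// true count is compared with what PT gives it, and if they disagree by >= 10 that
// point becomes a second reference; per pixel, the higher iteration count wins.
public class DualReference {
    static final int MISMATCH_THRESHOLD = 10;

    /** True if PT with the primary reference bails out early at the high-iter point. */
    static boolean needsSecondReference(long trueIters, long ptIters) {
        return trueIters - ptIters >= MISMATCH_THRESHOLD;
    }

    /** Per-pixel: iterate against both reference sets; the highest count wins. */
    static long iterate(double x, double y,
                        ToLongBiFunction<Double, Double> ptPrimary,
                        ToLongBiFunction<Double, Double> ptSecondary) {
        return Math.max(ptPrimary.applyAsLong(x, y), ptSecondary.applyAsLong(x, y));
    }

    public static void main(String[] args) {
        System.out.println(needsSecondReference(120_000, 119_950)); // off by 50
        System.out.println(needsSecondReference(120_000, 119_995)); // off by 5
    }
}
```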
I am, of course, working on figuring out ways to squeeze more speed out of the highest-iter-point-finding algorithm. Even as things are, it seems to be as accurate as UF and dozens to hundreds of times faster, depending on the zoom.
AIUI, Superfractalthing also has some method for determining when a second reference point is needed; I don't know about Kallesfractaler, though, or the details of the method either program uses. I've described mine in the hope that it may be helpful to either of their authors, and in the hope that someone might have a useful suggestion for speeding my own up. Similarly, I hope my explanation of Nanoscope's e308+ loop structure can help Kalle overcome his(?) difficulties with different behavior on different processor architectures, and get past e616.