Hoping some of this might be of some value.
In the code below, notice how NTHREADS is defined as 64. It's perhaps fractionally faster than 32 threads, but most certainly faster than 8 threads. I don't know if this will translate to Mandelbrot iterations though.
-snip-
My app, FractalWorks, renders Mandelbrot and Julia sets using an arbitrary number of worker threads.
I always create as many worker threads as I have cores. With more than that, the system spends too much time task switching. I'm not sure how much "hyperthreading" helps, though I'm skeptical.
I create a thread-safe queue of compute jobs, and each worker thread takes a job off the queue, calculates those pixels, and then asks the queue for another job. Each thread writes its pixel data into a shared plot memory buffer, but because each job maps to a unique set of pixels, there are no collisions.
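Here's a rough sketch of that job-queue scheme in Python (FractalWorks isn't written in Python, and the names and scanline-block job shape here are illustrative, not the actual code):

```python
import os
import queue
import threading

WIDTH, HEIGHT = 120, 90
MAX_ITER = 80

def mandel_iters(cx, cy):
    """Escape-time iteration count for the point c = cx + cy*i."""
    x = y = 0.0
    for i in range(MAX_ITER):
        if x * x + y * y > 4.0:
            return i
        x, y = x * x - y * y + cx, 2.0 * x * y + cy
    return MAX_ITER

def worker(jobs, plot):
    """Pull jobs off the shared queue until it is empty."""
    while True:
        try:
            y0, y1 = jobs.get_nowait()  # a job = a block of scanlines
        except queue.Empty:
            return
        for py in range(y0, y1):
            cy = -1.2 + 2.4 * py / HEIGHT
            for px in range(WIDTH):
                cx = -2.0 + 3.0 * px / WIDTH
                # Each job owns a disjoint set of pixels, so threads can
                # write into the shared buffer without locking.
                plot[py * WIDTH + px] = mandel_iters(cx, cy)
        jobs.task_done()

def render(n_threads=None):
    n_threads = n_threads or os.cpu_count()
    jobs = queue.Queue()
    n_jobs = 2 * n_threads  # twice as many jobs as threads, for load balancing
    step = (HEIGHT + n_jobs - 1) // n_jobs
    for y0 in range(0, HEIGHT, step):
        jobs.put((y0, min(y0 + step, HEIGHT)))
    plot = [0] * (WIDTH * HEIGHT)
    threads = [threading.Thread(target=worker, args=(jobs, plot))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return plot
```

(One caveat if you actually try this: CPython's global interpreter lock means pure-Python threads won't show the multi-core speedup described here; the structure of the scheme, not its performance, is the point of the sketch.)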
What I do is divide the target plot into blocks. I divide along the imaginary axis, and allow for symmetry. After mapping out any symmetric regions, I take the asymmetric area(s) and add them to an array of jobs, where each job is a block of pixels. While the number of jobs is less than my target size, I look for the job that contains the largest block of pixels and split that job in half. I repeat that step until I have the desired number of jobs in the queue, and then I begin rendering. I use this rather screwy approach because after mapping out symmetric areas of a plot, I can wind up with blocks of pixels of widely varying size.
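The split-the-largest-job loop might look something like this (a sketch only; the real jobs also carry symmetry info, and I'm guessing at details such as splitting along the imaginary axis first and falling back to the real axis for one-pixel-tall blocks):

```python
def area(job):
    """Pixel count of a (y0, y1, x0, x1) block, half-open on both axes."""
    y0, y1, x0, x1 = job
    return (y1 - y0) * (x1 - x0)

def split_jobs(jobs, target):
    """Repeatedly split the biggest pixel block until we have `target` jobs."""
    jobs = list(jobs)
    while len(jobs) < target:
        # Find the job covering the most pixels.
        i = max(range(len(jobs)), key=lambda k: area(jobs[k]))
        y0, y1, x0, x1 = jobs.pop(i)
        if y1 - y0 >= 2:        # split along the imaginary (y) axis first
            mid = (y0 + y1) // 2
            jobs += [(y0, mid, x0, x1), (mid, y1, x0, x1)]
        elif x1 - x0 >= 2:      # fall back to x for one-pixel-tall blocks
            mid = (x0 + x1) // 2
            jobs += [(y0, y1, x0, mid), (y0, y1, mid, x1)]
        else:
            jobs.append((y0, y1, x0, x1))  # single pixel: can't split further
            break
    return jobs
```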
Each job in the job queue also has information about symmetric areas of the plot, so once the computing is done, the results are simply flipped and copied to the symmetric areas.
I have run my app on machines from 1 core to 8 cores, and it usually runs at 95 - 98% of maximum theoretical efficiency. Sometimes the jobs require very different amounts of compute time so the efficiency can go down, but in general it works extremely well.
In computer science lingo, Mandelbrot and Julia sets are an "embarrassingly parallel problem".
I wrote my code to create twice as many jobs as there are worker threads by default. My thinking was that some jobs may have a low average iteration count, and complete quickly, and some may have a high average iteration count, and take longer. With twice as many jobs in the queue as worker threads, those threads that finish their jobs early can pick up another job.
In practice, cores only sit idle at the end, when there are only a few jobs left to compute.
One very interesting twist:
My program also uses a boundary following algorithm to trace the boundaries of the Mandelbrot set, and of connected iteration bands. That way it can trace the outline of an iteration band, and "flood fill" the interior without computing it.
In many cases, boundary following makes a much larger difference in compute times than multi-threading. That's because the interior area of an iteration band is often many times larger than the "perimeter." By tracing the boundary, I can avoid computing the iteration values of those interior pixels entirely.
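My tracer follows the boundary of each iteration band directly, which is too involved to show compactly. A simpler cousin of the same idea is Mariani-Silver rectangle subdivision: compute only the border of a rectangle, and if every border pixel has the same iteration value, flood-fill the interior without computing it; otherwise subdivide. A Python sketch of that related technique (my own illustration, not FractalWorks code):

```python
WIDTH, HEIGHT = 120, 90
XMIN, XMAX, YMIN, YMAX = -2.0, 1.0, -1.2, 1.2
MAX_ITER = 80

def mandel_iters(cx, cy):
    """Escape-time iteration count for the point c = cx + cy*i."""
    x = y = 0.0
    for i in range(MAX_ITER):
        if x * x + y * y > 4.0:
            return i
        x, y = x * x - y * y + cx, 2.0 * x * y + cy
    return MAX_ITER

def iters_at(px, py, cache):
    """Compute (or reuse) the iteration count for one pixel."""
    if (px, py) not in cache:
        cx = XMIN + (XMAX - XMIN) * px / WIDTH
        cy = YMIN + (YMAX - YMIN) * py / HEIGHT
        cache[(px, py)] = mandel_iters(cx, cy)
    return cache[(px, py)]

def render_rect(x0, y0, x1, y1, plot, cache):
    """Render pixels [x0,x1) x [y0,y1), flood-filling uniform rectangles."""
    border = ([(px, y0) for px in range(x0, x1)] +
              [(px, y1 - 1) for px in range(x0, x1)] +
              [(x0, py) for py in range(y0, y1)] +
              [(x1 - 1, py) for py in range(y0, y1)])
    values = set()
    for px, py in border:
        plot[py][px] = iters_at(px, py, cache)
        values.add(plot[py][px])
    if len(values) == 1:
        # Uniform border: fill the interior without iterating it at all.
        v = values.pop()
        for py in range(y0 + 1, y1 - 1):
            for px in range(x0 + 1, x1 - 1):
                plot[py][px] = v
    elif x1 - x0 <= 3 or y1 - y0 <= 3:
        # Too small to subdivide: brute-force the remaining pixels.
        for py in range(y0, y1):
            for px in range(x0, x1):
                plot[py][px] = iters_at(px, py, cache)
    else:
        # Subdivide into four overlapping quadrants; shared edges are
        # computed only once thanks to the cache.
        mx, my = (x0 + x1) // 2, (y0 + y1) // 2
        render_rect(x0, y0, mx + 1, my + 1, plot, cache)
        render_rect(mx, y0, x1, my + 1, plot, cache)
        render_rect(x0, my, mx + 1, y1, plot, cache)
        render_rect(mx, my, x1, y1, plot, cache)

plot = [[None] * WIDTH for _ in range(HEIGHT)]
cache = {}
render_rect(0, 0, WIDTH, HEIGHT, plot, cache)
```

Note that at coarse resolutions a fill like this can paper over sub-pixel detail (a tiny minibrot whose surrounding high-iteration bands slip between border samples), which is one more reason not to rely on it blindly.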
In fact, cutting up the plot into too many pieces (for multi-threading) can actually slow down the plot time, because the smaller the pieces, the larger the total "surface area" of each contiguous blob of pixels with the same iteration value. To adjust for this, when boundary following is turned on, my program backs off the number of jobs in the queue to only slightly more than the number of worker threads (cores). I don't remember how I calculate "slightly more" off the top of my head.
Boundary following has the biggest payoff for blobs of Mandelbrot points, because those have to be computed to max iterations. Contiguous blobs of pixels at lower iteration values don't yield as big a payoff in the number of calculations that can be skipped. And, for deep zooms, there are not usually very many contiguous pixels at iteration counts less than max iterations.
Finally, if the user asks for distance estimates or fractional iteration values, I can't use boundary following for non-Mandelbrot points at all. I have to brute-force calculate all non-Mandelbrot pixels in that case. I still use boundary following for Mandelbrot points. My program doesn't do interior distance estimates or orbit traps, so I can always skip calculating the interior of "blobs" of Mandelbrot points.
Interestingly, the boundary following algorithm works pretty well for Julia sets as well, even when they break down into "Fatou dust". When a Julia set becomes disconnected, a given iteration band can develop "holes" that contain higher iteration values. However, these holes only occur at fairly low iteration values (10-12 iterations, if memory serves). In practice, I simply turn off boundary following when the iteration value is below a small threshold, and my algorithm doesn't miss any "holes".
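For reference, the Julia iteration differs from the Mandelbrot one only in that c is fixed and the starting z varies; the low-band guard described above then amounts to a simple threshold check. A sketch (the names and the exact threshold are my guesses, based on the "10-12 iterations" recollection above):

```python
def julia_iters(zx, zy, cx, cy, max_iter=80):
    """Escape-time count for a Julia set: c is fixed, the start z varies."""
    for i in range(max_iter):
        if zx * zx + zy * zy > 4.0:
            return i
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
    return max_iter

# Guessed threshold, from the "10-12 iterations" recollection above.
HOLE_THRESHOLD = 12

def safe_to_flood_fill(band_value, julia_connected):
    """Only flood-fill a traced band if a disconnected Julia set
    can't hide higher-iteration 'holes' inside it."""
    return julia_connected or band_value >= HOLE_THRESHOLD
```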
Regards,
Duncan C