Logo by reallybigname - Contribute your own Logo!

END OF AN ERA, FRACTALFORUMS.COM IS CONTINUED ON FRACTALFORUMS.ORG

it was a great time but no longer maintainable by c.Kleinhuis contact him for any data retrieval,
thanks and see you perhaps in 10 years again

this forum will stay online for reference
News: Visit us on facebook
 
*
Welcome, Guest. Please login or register. April 25, 2024, 02:38:46 AM


Login with username, password and session length


The All New FractalForums is now in Public Beta Testing! Visit FractalForums.org and check it out!


Pages: [1]   Go Down
  Print  
Share this topic on DiggShare this topic on FacebookShare this topic on GoogleShare this topic on RedditShare this topic on StumbleUponShare this topic on Twitter
Author Topic: What is the general approach to threading for 2d plotting  (Read 10288 times)
0 Members and 1 Guest are viewing this topic.
jwm-art
Iterator
*
Posts: 171



WWW
« on: January 05, 2010, 09:32:37 PM »

Hello,

I'm thinking I need my program to be using both cores of my CPU for calculating deep zoomed images in decent resolution. Unfortunately I don't know a lot about threading other than being aware that problems can arise with two or more threads accessing the same data.

I'm wondering what the general approach is to threads when plotting a 2d image of the M-set as an example:

Is the image divided into lines so that cpu1 does lines 0,2,4,8... cpu2 does lines 1,3,5,7...

Or can it be divided even further into pixels?

But I better get straight on a nagging suspicion that I read somewhere not to use the MPFR library in multithreaded programs (but I have another nagging suspicion that I'm wrong about that).
Logged
gaston3d
Guest
« Reply #1 on: January 05, 2010, 09:52:30 PM »

I use several threads in my renderer (C#).
Main thread for GUI, one background thread for managing calculation threads (send/receive data) and 'worker' threads in number: cpu cores + 1 (it is always as much 'heavy' threads as cpu cores).
I split rendering image in blocks ie. 20x20 pixels, and write to one common array. Synchronisation of threads depends on platform you use.

On dual-core machine this works well, but tested on 2x4cores and, I don't know why, processors were saturated in 70-90%...
« Last Edit: January 05, 2010, 09:54:05 PM by gaston3d » Logged
Buddhi
Fractal Iambus
***
Posts: 895



WWW
« Reply #2 on: January 05, 2010, 10:56:46 PM »

I have 4xcores and I always make multi-threaded programs. My approach is very similar to proposed by @jwm-art. Program is not sending and receiving any data form main thread. Main thread only one time initialise all threads. Every thread have to render whole image but I made some simple algorithm to avoid rendering the same line of image by two threads. Program has array with markers which informs all threads which lines are during rendering and which was already rendered. Main thread only waits for finish of all threads. Additionally I added one thread for refreshing graphics in window in some cycle period (it is usable for long time renders)
I think that dividing it into pixels is less effective.
Using this IMHO simple algorithm all processors are utilized in 100%.
Logged

gaston3d
Guest
« Reply #3 on: January 05, 2010, 11:25:12 PM »

of course nothing is copied between threads, only references (pointers) to arrays are passed.
in this send/receive approach i plan to extend rendering engine to network computing... maybe spring huh?

Buddhi, are you sure that main thread does not consume cpu time during spinnig and waiting?
« Last Edit: January 05, 2010, 11:27:07 PM by gaston3d » Logged
Buddhi
Fractal Iambus
***
Posts: 895



WWW
« Reply #4 on: January 05, 2010, 11:31:23 PM »

Buddhi, are you sure that main thread does not consume cpu time during spinnig and waiting?

Yes, I'm sure because I'm using g_thread_join(GThread *thread);     -- function from GLib (Linux GTK+)
This function freezes thread and waits for semaphore from requested thread.
Logged

jwm-art
Iterator
*
Posts: 171



WWW
« Reply #5 on: January 05, 2010, 11:48:09 PM »

Yes, I'm sure because I'm using g_thread_join(GThread *thread);     -- function from GLib (Linux GTK+)
This function freezes thread and waits for semaphore from requested thread.

Are you using the GLib threads because you're using GTK, or because they have some other advantage over pthreads?
I was going to start with pthreads, using GLib had not crossed my mind, although I am using GTK.

Regarding how the image rendering is distributed between threads, processing of blocks of 20x20 pixels would probably be more work to code than a line by line approach. Having to take care of additional row strides through the image's data array... (thinking out loud) Although that is already happening to some degree. I won't write it off yet. I need to write some simple test programs to get to grips with threads first!

of course nothing is copied between threads, only references (pointers) to arrays are passed.
in this send/receive approach i plan to extend rendering engine to network computing... maybe spring huh?

Rendering distributed over the network was something I briefly thought about until I looked at the wikipedia pages about it.
Logged
jwm-art
Iterator
*
Posts: 171



WWW
« Reply #6 on: January 07, 2010, 09:32:01 AM »

Found quite a good introduction to parallel computing today, after reading their tutorial on POSIX threads:
https://computing.llnl.gov/tutorials/parallel_comp/

I've written a test program which does a series of additions on a [5000 * 25] array so that each index holds 2.0. Several POSIX threads doing this simultaneously. Come up against the problem which I had not thought of: there's no guarantee the scheduler will run the threads in the order they were called so the lines in the image will/might not be generated in order.

Basically a bit of an aesthetics problem, plus requires a few flags to tell which lines are done and which aren't. Someone elsewhere suggested each thread does 20 lines rather than just one which made me think of the 20x20 pixel blocks Buddhi suggested. These blocks seem more attractive imagining them arriving on the screen unordered.

Incidently, testing this "my first threading program" experiment, it seems setting the noof threads to an odd number was slower than an even, and in a handful of tests 6 was better than 4. I'm running a dual core sys.
Logged
jwm-art
Iterator
*
Posts: 171



WWW
« Reply #7 on: January 07, 2010, 10:29:36 AM »

Hoping some of this might be of some value.

In the code below, notice how NTHREADS is defined to 64. It's perhaps fractionally faster than 32 threads, but most certainly faster than 8 threads. I don't know if this will translate to mandelbrot iterations though.

Code:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <mpfr.h>
#include <string.h>

/*
gcc mutex2.c -lgmp -lmpfr -pthread -D_REENTRANT -ggdb
*/
#define W 15000
#define H 25

#define NTHREADS 64

typedef struct
{
    int arr[H * W];
    int next_line;
    int lines_done;
} image_info;

void *watch_image(void* ptr);
void *render(void* ptr);
int next_line(image_info* img);

pthread_mutex_t next_line_mutex =  PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lines_done_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t image_done_cond =   PTHREAD_COND_INITIALIZER;

main()
{
    int i;
    int rc[NTHREADS];
    pthread_t thread[NTHREADS];
    pthread_attr_t attr;
    image_info* img = malloc(sizeof(image_info));
    memset(img, 0, sizeof(image_info));

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    rc[0] = pthread_create(&thread[0], &attr, &watch_image, (void*)img);
    if (rc[0])
        printf("Thread #0 creation failed: %d\n", rc[0]);

    for (i = 1; i < NTHREADS; ++i)
    {
        rc[i] = pthread_create(&thread[i], &attr, &render, (void*)img);
        if (rc[i])
            printf("Thread #%d creation failed: %d\n", i, rc[i]);
    }

    for (i = 0; i < NTHREADS; ++i)
        pthread_join(thread[i], NULL);

    exit(0);
}

void *watch_image(void* ptr)
{
    image_info* img = (image_info*)ptr;
    pthread_mutex_lock(&lines_done_mutex);

    if (img->lines_done < H)
    {
        pthread_cond_wait(&image_done_cond, &lines_done_mutex);
    }

    printf("Image Complete!\n");
    pthread_mutex_unlock(&lines_done_mutex);
}

void* render(void* ptr)
{
    int line = 0;
    image_info* img = (image_info*)ptr;

    while(next_line(img))
        ;
}

int next_line(image_info* img)
{
    int line;
    pthread_mutex_lock(&next_line_mutex);
    line = img->next_line++;
    if (line >= H)
    {
        pthread_mutex_unlock(&next_line_mutex);
        return 0;
    }
    pthread_mutex_unlock(&next_line_mutex);

    int offset = line * W;
    int ix;
    mpfr_t a,b;
    mpfr_init2(a, 128);
    mpfr_init2(b, 128);
    for (ix = offset; ix < offset + W; ++ix)
    {
        int n;
        mpfr_set_d(a, 0.0f, GMP_RNDN);
        mpfr_set_d(b, 0.001f, GMP_RNDN);
        for (n = 0; n < 200; ++n)
        {
            mpfr_add(a, a, b, GMP_RNDN);
        }
        img->arr[ix] = mpfr_get_d(a, GMP_RNDN) * 100;
    }
    pthread_mutex_lock(&lines_done_mutex);
    img->lines_done++;
    printf("line %d complete\n", line);
    if (img->lines_done == H)
    {
        pthread_cond_signal(&image_done_cond);
        pthread_mutex_unlock(&lines_done_mutex);
        return 0;
    }
    pthread_mutex_unlock(&lines_done_mutex);
    return 1;
}


And here's a sample output:
Code:
musys2@Scrapyard:~/Projects/pthread_tests$ time ./a.out 
line 0 complete
line 1 complete
line 15 complete
line 6 complete
line 7 complete
line 8 complete
line 13 complete
line 9 complete
line 16 complete
line 10 complete
line 14 complete
line 17 complete
line 11 complete
line 3 complete
line 4 complete
line 5 complete
line 12 complete
line 24 complete
line 21 complete
line 2 complete
line 18 complete
line 23 complete
line 20 complete
line 22 complete
line 19 complete
Image Complete!

real    0m1.383s
user    0m2.531s
sys     0m0.250s
« Last Edit: January 07, 2010, 10:32:00 AM by jwm-art, Reason: error in code » Logged
Duncan C
Fractal Fanatic
****
Posts: 348



WWW
« Reply #8 on: February 20, 2010, 04:11:35 AM »

Hoping some of this might be of some value.

In the code below, notice how NTHREADS is defined to 64. It's perhaps fractionally faster than 32 threads, but most certainly faster than 8 threads. I don't know if this will translate to mandelbrot iterations though.
-snip-

My app, FractalWorks, renders Mandelbrot and Julia sets using an arbitrary number of worker threads.

I always create as many worker threads as I have cores. More than that and your system spends too much time task switching. I am not sure about the efficiencies of "hyperthreading", although I am skeptical.

I create a thread-safe queue of compute jobs, and each worker thread takes a job off the queue, calculates those pixels, and then asks for another job from the queue. Each thread writes it's pixel data into a shared plot memory buffer, but because each job maps to a unique set of pixels, there are no collisions.

What I do is to divide the target plot into blocks. I divide along the imaginary axis, and allow for symmetry. After mapping out any symmetric regions, I take the asymmetric area(s) and add them to an array of jobs where each job is a block of pixels. While the number of jobs is less than my target size, I look for the job that contains the largest block of pixels and split that job in half. I repeat that step until I have the desired number of jobs in the queue, and then I begin rendering. I use this rather screwy approach because after mapping out symmetric areas of a plot, I can wind up with blocks of pixels of widely varying size.

Each job in the job queue also has information about symmetric areas of the plot, so once the computing is done, the results are simply flipped and copied to the symmetric areas.

I have run my app on machines from 1 core to 8 cores, and it usually runs at 95 - 98% of maximum theoretical efficiency. Sometimes the jobs require very different amounts of compute time so the efficiency can go down, but in general it works extremely well.

In computer science lingo, Mandelbrot and Julia sets are an "embarrassingly parallel problem".

I wrote my code to create twice as many jobs as there are worker threads by default. My thinking was that some jobs may have a low average iteration count, and complete quickly, and some may have a high average iteration count, and take longer. With twice as many jobs in the queue as worker threads, those threads that finish their jobs early can pick up another job.

In practice, cores only sit idle at the end, when there are only a few jobs left to compute.

One very interesting twist:

My program also uses a boundary following algorithm to trace the boundaries of the Mandelbrot set, and of connected iteration bands. That way it can trace the outline of an iteration band, and "flood fill" the interior without computing it.

In many cases, boundary following makes a much larger difference in compute times than multi-threading. That's because the interior area of an iteration band is often many times larger than the "perimeter." By tracing the boundary, I can avoid computing the iteration values of those interior pixels entirely.

In fact, cutting up the plot into too many pieces (for multi-threading) can actually slow down the plot time, because the smaller the pieces, the larger the total "surface area" of each contiguous blob of pixels with the same iteration value. To adjust for this, when boundary following is turned on, my program backs off the number of jobs in the queue to only slightly more than the number of worker threads (cores). I don't remember how I calculate "slightly more" off the top of my head.

Boundary following has the biggest payoff for blobs of Mandelbrot points, because those have to be computed to max iterations. Contiguous blobs of pixels at lower iteration values don't yield as big a payoff in the number of calculations that can be skipped. And, for deep zooms, there are not usually very many contiguous pixels at iteration counts less than max iterations.

Finally, if the user asks for distance estimate or fractional iteration values, I can't use boundary following for non-mandelbrot points at all. I have to brute-force calculate all non-mandelbrot pixels in that case. I still use boundary following for Mandelbrot points. My program doesn't do interior distance estimates or orbit traps, so I can always skip calculating the interior of "blobs" of Mandelbrot points.

Interestingly, the boundary following algorithm works pretty well for Julia sets as well, even when they break down into "Fatou dust". When a Julia set becomes disconnected, a given iteration band can develop "holes" that contain higher iteration values. However, these holes only occur at fairly low iteration values (10-12 iterations, if memory serves). I found that in practice, I simply turn off boundary following when the iteration value is lower than a small threshold value, and my algorithm doesn't miss any "holes".


Regards,

Duncan C
Logged

Regards,

Duncan C
Pages: [1]   Go Down
  Print  
 
Jump to:  


Powered by MySQL Powered by PHP Powered by SMF 1.1.21 | SMF © 2015, Simple Machines

Valid XHTML 1.0! Valid CSS! Dilber MC Theme by HarzeM
Page created in 0.165 seconds with 24 queries. (Pretty URLs adds 0.009s, 2q)