Author Topic: If you liked Assembler, you’ll love OpenCL (and Cuda too)
ker2x
Fractal Molossus
Posts: 795


« on: July 30, 2010, 09:18:22 AM »

I found this article, http://www.streamcomputing.nl/blog/2010-07-22/the-rise-of-the-gpgpu-compilers , on the Khronos homepage http://www.khronos.org/news/C124/ (home of the OpenCL, OpenGL, ... standards).

I agree (of course grin ). I like assembler (as long as it supports SSE2 and higher instructions), I write some assembler (PureBasic + TASM), and I love OpenCL cheesy
(I've wanted to do GPGPU since the first 3Dfx; of course it was nearly impossible back then, and I had to wait for CUDA & OpenCL. I tried cGL, but it was too weird for me.)

I bought the NVIDIA book "Programming Massively Parallel Processors"; the book is all about speed, optimization, speed and optimization (and speed (and optimization)).
That makes sense: the only goal of GPGPU is ... being fast, and faster (and maybe the fastest, too). So the book focuses on speed, on optimization, and on how the GPU architecture can be used at its best (and there is a lot to learn; it's totally different from CPU architecture).

If you do not care about speed, forget about GPGPU. It's weird, it's a completely different architecture, it has some crazy limitations and behaviours, and memory access is painfully slow compared to arithmetic operations ... e.g. an addition takes about 1/8th of a cycle, while a global memory access has around 800 cycles of latency.  angry

If you care about speed, well ... as long as your problem is embarrassingly parallel, it's insanely, incredibly f*cking fast! Forget about what you know (easy for me, I don't know much about programming) and be ready to launch hundreds of thousands of threads (did I mention that high-end GPU cards have 300~500 cores, each running at more than 1 GHz?).
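To make that "one thread per pixel" idea concrete, here is a minimal escape-time kernel sketch in OpenCL C (not from the original post; the argument names and the flat output layout are my own assumptions). Each work-item computes one pixel independently, which is exactly the embarrassingly parallel case described above.

Code:
// Minimal escape-time (Mandelbrot) kernel: one work-item per pixel.
// Illustrative sketch; width/height/maxIter and the uint output buffer are assumptions.
__kernel void mandelbrot(__global uint *out,
                         const uint width, const uint height,
                         const float x0, const float y0,   // origin of the view
                         const float step,                 // pixel size in the complex plane
                         const uint maxIter)
{
    const uint px = get_global_id(0);
    const uint py = get_global_id(1);
    if (px >= width || py >= height) return;

    const float cr = x0 + px * step;   // map the pixel to the complex plane
    const float ci = y0 + py * step;
    float zr = 0.0f, zi = 0.0f;
    uint i = 0;
    while (i < maxIter && zr * zr + zi * zi <= 4.0f) {
        const float t = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = t;
        ++i;
    }
    out[py * width + px] = i;   // a single global write per work-item
}

The host enqueues roughly width x height work-items; the scheduler swaps groups of threads in and out to hide the global memory latency mentioned above.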

High-end graphics cards support 64-bit floating-point operations; a Tesla M2050/M2070 delivers about 500 GFlops double precision, or 1 TeraFlops(!) single precision.
An Intel Q6600 delivers about 38 GFlops, and a high-end i7 CPU about 50 GFlops.

Most CPU optimizations are about trading memory for CPU, which is a bad thing to do on a GPU. On the other hand, you have more than 8000 registers (I forget the exact amount) and the "read-after-write" register latency is "only" 24 cycles. Okay, that still sucks considering you can do 8 additions per cycle, but memory latency can be hidden if you run many, many, many threads (one thread can run while another is waiting on memory).

My best optimization (so far) involved: brute force \o/

And... oh wait, I'm late for work, bbl ...   sad  cry angry
« Last Edit: July 30, 2010, 09:20:00 AM by ker2x »

often times... there are other approaches which are kinda crappy until you put them in the context of parallel machines
(en) http://www.blog-gpgpu.com/ , (fr) http://www.keru.org/ ,
Sysadmin & DBA @ http://www.over-blog.com/
kram1032
Fractal Senior
Posts: 1863


« Reply #1 on: July 30, 2010, 09:04:35 PM »

Quote from: ker2x
speed and optimization (and speed (and optimization)).

Are that few iterations enough to show the fractal structure of this?
ker2x
Fractal Molossus
Posts: 795


« Reply #2 on: July 31, 2010, 11:00:59 PM »

Quote from: kram1032
Are that few iterations enough to show the fractal structure of this?

Only if coded in LISP smiley

often times... there are other approaches which are kinda crappy until you put them in the context of parallel machines
(en) http://www.blog-gpgpu.com/ , (fr) http://www.keru.org/ ,
Sysadmin & DBA @ http://www.over-blog.com/
Synaesthesia
Forums Newbie
Posts: 2


« Reply #3 on: August 03, 2010, 02:39:25 PM »

Thank you, I wanna try this out..
Duncan C
Fractal Fanatic
Posts: 348



« Reply #4 on: August 24, 2010, 01:54:01 PM »

Quote from: ker2x on July 30, 2010, 09:18:22 AM (the original post, quoted in full)

ker2x,

Thanks for the description. Does the NVIDIA book cover OpenCL?

And is there any built-in support for higher precision math?

I really want a card that supports double precision in hardware. However, Apple is really terrible about video card support. They support only a couple of cards for any given machine, and they are always yesterday's middle-of-the-road cards.

On another subject, from the reading I've done about GPGPU, they are not great for fractal calculations because they are highly tuned to SIMD operations. How do you handle situations where each GPU thread needs to run a different code path in order to iterate a separate pixel?


Duncan C

Regards,

Duncan C
ker2x
Fractal Molossus
Posts: 795


« Reply #5 on: September 06, 2010, 09:11:31 AM »

Quote
ker2x,
Thanks for the description. Does the NVIDIA book cover OpenCL?

Only a small chapter.
If you follow the NVIDIA book chapter by chapter (and you should), you learn CUDA first; then there is a chapter that explains how to write OpenCL code based on what you wrote in CUDA.

It's not so bad if you didn't know OpenCL first.
The problem is: there is a *lot* to remember, the names are sometimes very confusing, and the naming differences between OpenCL and CUDA are even more confusing.

Besides that, it's the very same architecture: you write essentially the same C code for the CUDA/OpenCL kernels and functions, and the optimisation tips and tricks are exactly the same.
OpenCL and CUDA are just two different APIs to "talk" to the GPU card. The code you upload to the GPU is essentially the same for both; it's only the "host" (CPU) code that changes.
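As a rough illustration of how the two vocabularies line up (my own sketch, not from the book), here is a trivial per-element kernel written in OpenCL C, with the corresponding CUDA spellings noted in comments:

Code:
// OpenCL C version of a trivial per-element kernel; CUDA equivalents in comments.
// Illustrative sketch only.
__kernel void scale(__global float *data,   // CUDA: __global__ void scale(float *data, ...)
                    const float factor,
                    const uint n)
{
    // OpenCL: get_global_id(0)
    // CUDA:   blockIdx.x * blockDim.x + threadIdx.x
    const uint i = get_global_id(0);
    if (i < n)
        data[i] *= factor;
}
// Vocabulary mapping: work-item <-> thread, work-group <-> thread block,
// local memory <-> shared memory, NDRange <-> grid.

The kernel body is the part that stays the same; the confusing part is the host-side API and the renamed concepts around it.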

Quote
And is there any built-in support for higher precision math?

I don't know of any usable arbitrary-precision library for the GPU yet. You should probably take a close look at http://www.mpir.org/
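One workaround worth mentioning (my own addition, not something discussed above) is to emulate extra precision in software with "float-float" arithmetic: a value is stored as an unevaluated sum of two machine numbers, the second holding the rounding error of the first. A minimal sketch of the addition step in OpenCL C, using the classic Dekker/Knuth two-sum:

Code:
// "Float-float" addition sketch (illustrative only): a value is hi + lo,
// where lo carries the rounding error of hi. A real library needs the full
// set of renormalized operations (mul, div, sqrt, ...), and this relies on
// strict FP semantics (do not compile with -cl-fast-relaxed-math).
float2 two_sum(float a, float b)
{
    const float s = a + b;
    const float v = s - a;
    const float e = (a - (s - v)) + (b - v);   // exact rounding error of a + b
    return (float2)(s, e);
}

float2 ff_add(float2 a, float2 b)
{
    float2 s = two_sum(a.x, b.x);
    s.y += a.y + b.y;                  // fold in the low-order parts
    const float hi = s.x + s.y;        // quick renormalization
    const float lo = s.y - (hi - s.x);
    return (float2)(hi, lo);
}

This only buys roughly twice the native precision (at the cost of several extra operations per add), so it is a stopgap, not a replacement for a real arbitrary-precision library.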


Quote
I really want a card that supports double precision in hardware. However, Apple is really terrible about video card support. They support only a couple of cards for any given machine, and they are always yesterday's middle-of-the-road cards.

What you (usually) want is the latest generation of GPU (with the latest features, e.g. double precision), not the fastest card.
I'm perfectly happy developing GPU code on my laptop powered by an ION2.
I'm not buying a $150 latest-generation GPU card for my desktop, because I'm a gamer and I'm happier with my old high-end 8800GTX than with a last-generation low-end graphics card.

If you're not gaming (or are happy with gaming on Apple hardware (lol?)), I suggest buying a low-end PC with a low-end last-generation graphics card, and there you go!

Quote
On another subject, from the reading I've done about GPGPU, they are not great for fractal calculations because they are highly tuned to SIMD operations. How do you handle situations where each GPU needs to run a different code-path in order to iterate a separate pixel?

I don't know any nice way to avoid that problem. So ... brute force and confidence in the GPU scheduler.   embarrass
As far as I understand, a different code path does not block all the other threads, just one CUDA core (mid-range and high-end cards have between 256 and 512 cores).

Considering our (fractal) problem, you will not get 100% GPU occupancy; live with it.
But even 50% occupancy is still incredibly fast smiley

often times... there are other approaches which are kinda crappy until you put them in the context of parallel machines
(en) http://www.blog-gpgpu.com/ , (fr) http://www.keru.org/ ,
Sysadmin & DBA @ http://www.over-blog.com/
cbuchner1
Fractal Phenom
Posts: 443


« Reply #6 on: September 06, 2010, 11:07:09 AM »


Yeah, the scheduling granularity is "warps" of 32 threads. If you compute rectangular chunks of 8x4 pixels in each warp, for example, you get good spatial locality within each warp, meaning the iteration depths for all threads in that warp should be similar (statistically speaking). A sketch of that mapping is below.
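To illustrate (my own sketch, assuming an NVIDIA-style warp of 32 threads): launch the kernel with a work-group size of (8, 4), so each work-group is exactly one warp and covers a compact 8x4 pixel tile.

Code:
// 8x4 tiling sketch: with a work-group size of (8, 4), each work-group maps
// to one 32-thread warp on NVIDIA hardware, and that warp covers a compact
// 8x4 pixel rectangle, so iteration depths inside the warp tend to be similar.
__kernel void iterate_tiled(__global uint *out,
                            const uint width, const uint height,
                            const uint maxIter)
{
    // get_group_id() selects the 8x4 tile, get_local_id() the pixel inside it.
    const uint px = get_group_id(0) * 8 + get_local_id(0);   // 0..7 inside the tile
    const uint py = get_group_id(1) * 4 + get_local_id(1);   // 0..3 inside the tile
    if (px >= width || py >= height) return;

    // ... per-pixel escape-time loop as in the earlier sketch ...
    out[py * width + px] = maxIter;   // placeholder write
}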

Christian