Title: If you liked Assembler, you'll love OpenCL (and Cuda too) Post by: ker2x on July 30, 2010, 09:18:22 AM

I found this article http://www.streamcomputing.nl/blog/2010-07-22/the-rise-of-the-gpgpu-compilers on the homepage of http://www.khronos.org/news/C124/ (home of the OpenCL, OpenGL, ... standards).
I agree (of course ;D ). I like assembler (as long as it supports SSE2 and higher instructions), I write some assembler (PureBasic + TASM), and I love OpenCL :D (I had wanted to do GPGPU since the first 3Dfx; of course it was nearly impossible back then, and I had to wait for Cuda & OpenCL (I tried cGL, but it was too weird for me)).

I bought the NVidia book "Programming Massively Parallel Processors". The book is all about speed, optimization, speed and optimization (and speed (and optimization)). That makes sense: the only goal of GPGPU is... being fast, and faster (and maybe the fastest, too). So the book focuses on speed, optimization, and how the GPU architecture can be used at its best (and there is a lot to learn; it's totally different from CPU architecture).

If you do not care about speed, forget about GPGPU. It's weird, it's a completely different architecture, it has some crazy limitations and behaviours, and memory access is painfully slow compared to arithmetic operations. E.g.: an addition takes 1/8th of a cycle, while a global memory access has ~800 cycles of latency. :angry:

If you care about speed... well, as long as your problem is embarrassingly parallel, it's insanely, incredibly f*cking fast! Forget what you know (easy for me, I don't know much about programming) and be ready to run hundreds of thousands of threads (did I mention that high-end GPU cards have 300~500 cores, each running at more than 1 GHz?). High-end graphics cards support 64-bit floating point: the Tesla M2050/M2070 runs at ~500 GFlops double precision, or ~1 TeraFlops(!) single precision. An Intel Q6600 runs at ~38 GFlops; a high-end i7 CPU at ~50 GFlops.

Most CPU optimizations are about trading memory for CPU time, which is a bad thing to do on a GPU. But you have more than 8000 registers (I forgot the exact amount), and "read-after-write" register latency is "only" 24 cycles. Okay, that still sucks, considering you can do 8 additions per cycle, but memory latency can be hidden if you run many, many, many threads (one thread runs while another waits on memory). My best optimization (for now) involved: brute force \o/

And... oh wait, I'm late for work, bbl... :sad1: :'( :angry:
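A concrete sketch of what such an embarrassingly parallel kernel looks like (illustrative CUDA, not code from this thread; the kernel name, image size and coordinate mapping are all made up): one thread per pixel, no communication between threads, so the scheduler can keep about a million threads in flight and hide the ~800-cycle global memory latency behind arithmetic.

Code:
// One CUDA thread per pixel of a Mandelbrot escape-time image.
// Hypothetical example; error checking omitted for brevity.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void escapeTime(int *out, int width, int height, int maxIter)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Map the pixel into the complex plane.
    float cr = -2.0f + 3.0f * x / width;
    float ci = -1.5f + 3.0f * y / height;
    float zr = 0.0f, zi = 0.0f;

    int i = 0;
    while (i < maxIter && zr * zr + zi * zi < 4.0f) {
        float t = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = t;
        ++i;
    }
    // One global write per thread; its latency is hidden by
    // the thousands of other threads the scheduler keeps busy.
    out[y * width + x] = i;
}

int main()
{
    const int W = 1024, H = 1024, maxIter = 256;
    int *d_out;
    cudaMalloc(&d_out, W * H * sizeof(int));

    dim3 block(16, 16);                       // 256 threads per block
    dim3 grid((W + 15) / 16, (H + 15) / 16);  // ~1M threads in total
    escapeTime<<<grid, block>>>(d_out, W, H, maxIter);

    std::vector<int> img(W * H);
    cudaMemcpy(img.data(), d_out, W * H * sizeof(int), cudaMemcpyDeviceToHost);
    printf("iterations at center pixel: %d\n", img[(H / 2) * W + W / 2]);
    cudaFree(d_out);
    return 0;
}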
Title: Re: If you liked Assembler, you'll love OpenCL (and Cuda too) Post by: kram1032 on July 30, 2010, 09:04:35 PM

Quote: "speed and optimization (and speed (and optimization))."
Are that few iterations enough to show the fractal structure of this?

Title: Re: If you liked Assembler, you'll love OpenCL (and Cuda too) Post by: ker2x on July 31, 2010, 11:00:59 PM

Quote: "Are that few iterations enough to show the fractal structure of this?"
Only if coded in LISP :)

Title: Re: If you liked Assembler, you'll love OpenCL (and Cuda too) Post by: Synaesthesia on August 03, 2010, 02:39:25 PM

Thank you, I wanna try this out..

Title: Re: If you liked Assembler, you'll love OpenCL (and Cuda too) Post by: Duncan C on August 24, 2010, 01:54:01 PM

Quote from: ker2x on July 30, 2010 [...]

ker2x,

Thanks for the description. Does the NVIDIA book cover OpenCL? And is there any built-in support for higher-precision math? I really want a card that supports double precision in hardware. However, Apple is really terrible about video card support. They support only a couple of cards for any given machine, and they are always yesterday's middle-of-the-road cards.

On another subject: from the reading I've done about GPGPU, GPUs are not great for fractal calculations because they are highly tuned for SIMD operations. How do you handle situations where each GPU thread needs to run a different code path in order to iterate a separate pixel?

Duncan C
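On Duncan's double-precision question, a quick hardware check from CUDA looks roughly like this (a sketch, not code from the thread; on NVIDIA GPUs of this generation, compute capability 1.3 or higher means double precision in hardware):

Code:
// List CUDA devices and whether they support double precision in hardware.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // Compute capability 1.3+ has hardware doubles (e.g. GTX 260/280, Tesla).
        bool fp64 = (prop.major > 1) || (prop.major == 1 && prop.minor >= 3);
        printf("%d: %s (sm_%d%d) double precision: %s\n",
               i, prop.name, prop.major, prop.minor, fp64 ? "yes" : "no");
    }
    return 0;
}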
Title: Re: If you liked Assembler, you'll love OpenCL (and Cuda too) Post by: ker2x on September 06, 2010, 09:11:31 AM

Quote: "ker2x, Thanks for the description. Does the NVIDIA book cover OpenCL?"
Only a small chapter. If you follow the NVidia book chapter by chapter (and you should), you learn CUDA first; then there is a chapter that explains how to write OpenCL code based on what you wrote in CUDA. It's not so bad if you didn't know OpenCL beforehand. The problem is: there is a *lot* to remember, the names are sometimes very confusing, and the naming differences between OpenCL and CUDA are even more confusing. Besides that, it's the very same architecture: you write the very same C code for the Cuda/OpenCL kernels and functions, and the optimisation tips and tricks are exactly the same. OpenCL and CUDA are just two different APIs to "talk" to the GPU card, but the code you upload to the GPU is the same for both; only the "host" (CPU) code changes.

Quote: "And is there any built-in support for higher precision math?"
I don't know of any usable arbitrary-precision lib for the GPU yet. You should probably take a close look at http://www.mpir.org/

Quote: "I really want a card that supports double precision in hardware. However, Apple is really terrible about video card support."
What you (usually) want is the latest generation of GPU (with the latest features, e.g. double precision), not the fastest card. I'm perfectly happy developing GPU code on my laptop powered by an ION2. I'm not buying a $150 latest-generation GPU card for my desktop because I'm a gamer, and I'm happier with my old high-end 8800GTX than with a latest-generation low-end graphics card. If you're not gaming (or are happy gaming on Apple hardware (lol?)), I suggest buying a low-end PC with a low-end latest-generation graphics card, and there you go!

Quote: "On another subject ... How do you handle situations where each GPU thread needs to run a different code path in order to iterate a separate pixel?"
I don't know any nice way to avoid that problem. So... brute force, and confidence in the GPU scheduler. :embarrass: As far as I understood, a divergent code path does not block all the other threads, just one CUDA core (mid-range and high-end cards have between 256 and 512 cores). Considering our (fractal) problem, you will not get 100% GPU occupancy; live with it. But even 50% occupancy is still incredibly fast :)

Title: Re: If you liked Assembler, you'll love OpenCL (and Cuda too) Post by: cbuchner1 on September 06, 2010, 11:07:09 AM

Yeah, the scheduling granularity is a "warp" of 32 threads. If you compute rectangular chunks of 8x4 pixels in each warp, for example, you will get good spatial localization for each warp, meaning the iteration depths of all threads in the warp should be similar (statistically speaking).

Christian
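A sketch of the 8x4 tiling Christian describes (illustrative CUDA, not his code; the kernel name is made up). With an 8x8-thread block, the 32 consecutive thread IDs of each warp cover a compact 8-wide by 4-tall pixel rectangle instead of a 32x1 strip, so the pixels handled by one warp tend to have similar escape times and diverge less:

Code:
// Launch with dim3 block(8, 8). The linear thread id inside a block is
// threadIdx.y * 8 + threadIdx.x, and a warp is 32 consecutive ids, so
// rows 0-3 of the block form warp 0 and rows 4-7 form warp 1: each warp
// maps to an 8x4 pixel tile of the image.
__global__ void escapeTimeTiled(int *out, int width, int height, int maxIter)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float cr = -2.0f + 3.0f * x / width;
    float ci = -1.5f + 3.0f * y / height;
    float zr = 0.0f, zi = 0.0f;
    int i = 0;

    // Neighbouring pixels in the 8x4 tile usually bail out after a
    // similar number of iterations, so the warp stays mostly converged.
    while (i < maxIter && zr * zr + zi * zi < 4.0f) {
        float t = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = t;
        ++i;
    }
    out[y * width + x] = i;
}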