As it stands, this program is completely GPU-unfriendly.
Some general advice on GPU friendliness:
A structure of arrays is much faster on GPUs than an array of structures. In other words, your array of sComplexImage structs is a worst case for the GPU. I converted it to a struct of arrays like this:
unsigned short *shadowsBuf16;
unsigned short *shadingBuf16;
unsigned short *specularBuf16;
unsigned short *glowBuf16;
unsigned short *colorIndexBuf16;
Of course this needs quite a few more new/delete calls to (re)allocate the buffers, but it's worth the cost.
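Here is a minimal host-side sketch of that layout change, in plain C++ so it compiles without CUDA. The names ComplexPixelAoS and ComplexImageSoA are made up for illustration, and the per-pixel fields are only assumed to match the five buffers listed above; the real sComplexImage may well hold more. I use std::vector here instead of raw new/delete just to keep the sketch short.

```cpp
#include <cstddef>
#include <vector>

// Assumed array-of-structures pixel: one struct per pixel (hypothetical layout).
struct ComplexPixelAoS {
    unsigned short shadows, shading, specular, glow, colorIndex;
};

// Structure of arrays: one contiguous buffer per field.
struct ComplexImageSoA {
    std::vector<unsigned short> shadows, shading, specular, glow, colorIndex;

    explicit ComplexImageSoA(std::size_t pixelCount)
        : shadows(pixelCount), shading(pixelCount), specular(pixelCount),
          glow(pixelCount), colorIndex(pixelCount) {}

    std::size_t size() const { return shadows.size(); }
};

// Byte distance between the same field of two neighbouring pixels.
// With AoS, neighbouring threads reading "shadows" are a whole struct apart;
// with SoA they touch adjacent elements, which is what allows coalesced loads.
std::size_t aosStrideBytes() { return sizeof(ComplexPixelAoS); }
std::size_t soaStrideBytes() { return sizeof(unsigned short); }
```

On the device side each of these buffers would be its own allocation, and thread i would read element i of whichever buffer it needs.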
Another issue is the sRGB8 and sRGB16 structures, which currently have three elements. The GPU works much better with 4-element vectors, which are aligned so that a whole pixel can be fetched in a single load, so I am going to use the CUDA native ushort4 and uchar4 types here instead.
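To show what the switch to 4-element types buys in terms of size and alignment, here is a rough sketch in plain C++. The structs Rgb8Vec4 and Rgb16Vec4 are stand-ins that imitate the size and alignment of CUDA's uchar4 and ushort4 (4 and 8 bytes); they are not the real CUDA types. The three-element structs are only my assumption of what sRGB8 and sRGB16 look like, and widen is a hypothetical helper.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed shape of the existing three-element pixels (hypothetical).
struct Rgb8Packed3  { std::uint8_t  r, g, b; };   // 3 bytes, alignment 1
struct Rgb16Packed3 { std::uint16_t r, g, b; };   // 6 bytes, alignment 2

// Stand-ins imitating the layout of CUDA's uchar4 and ushort4.
struct alignas(4) Rgb8Vec4  { std::uint8_t  x, y, z, w; };
struct alignas(8) Rgb16Vec4 { std::uint16_t x, y, z, w; };

static_assert(sizeof(Rgb8Vec4) == 4 && alignof(Rgb8Vec4) == 4,
              "4-byte pixel, 4-byte aligned");
static_assert(sizeof(Rgb16Vec4) == 8 && alignof(Rgb16Vec4) == 8,
              "8-byte pixel, 8-byte aligned");

// Convert a three-element pixel to the padded form; w is unused padding.
Rgb16Vec4 widen(Rgb16Packed3 p) {
    return Rgb16Vec4{p.r, p.g, p.b, 0};
}
```

The fourth component costs some memory, but every pixel then starts on an aligned boundary.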
The above modifications result in much faster memory access, letting you reach the full 80 GB/sec or so of modern GPUs (some have even higher bandwidth, in excess of 100 GB/sec).