Hello, I am trying to understand how to get CUDAImageMultiply to use the same precision as ImageMultiply.
Here is a prototype example that illustrates the question:
m1 = Table[Sin[4. Pi x/500] Sin[4. Pi y/500], {x, 500}, {y, 500}];
m2 = RandomReal[{0.5, 1}, {500, 500}];
im1 = Image[m1, ColorSpace -> "Grayscale"]
im2 = Image[m2, ColorSpace -> "Grayscale"]
ImageMultiply[im1, im2] (*as expected*)
Using CUDAImageMultiply but without explicitly allocating memory on the GPU:
Needs["CUDALink`"]
CUDAImageMultiply[im1, im2] (*as expected*)
The above doesn't really give any speedup, probably because of the memory transfer between the host and the GPU. So it is natural to try loading the images onto the GPU first:
cimg1 = CUDAMemoryLoad[im1]
cimg2 = CUDAMemoryLoad[im2]
Allocate GPU memory for the product (I suspect the problem lies in this step):
cimg3 = CUDAMemoryLoad[im2]
One gets a nice speedup without the memory transfer:
RepeatedTiming[
CUDAImageMultiply[cimg1, cimg2, "OutputMemory" -> cimg3];]
But the result appears to have only 256 shades of gray:
CUDAMemoryGet[cimg3] (*not as expected*)
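In case it helps narrow things down, the data types at each stage could be compared along the following lines. This is only a diagnostic sketch that reuses the variables defined above; I am assuming that the type reported for the GPU buffers is what determines the precision of the result, which I have not confirmed.

```wolfram
(* Element type of the original images, e.g. "Real32" or "Byte" *)
ImageType /@ {im1, im2}

(* Metadata CUDALink keeps for each GPU buffer, including its element type *)
CUDAMemoryInformation /@ {cimg1, cimg2, cimg3}

(* Element type of the image that comes back from the GPU *)
ImageType[CUDAMemoryGet[cimg3]]
```

If the types differ between the first and last lines, that would at least show at which step the precision is lost.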
Does anyone have a fix for this? Or, even better, some CUDA or OpenCL code for element-wise matrix multiplication m1*m2 (not Dot)?
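For concreteness, the kind of kernel I have in mind is roughly the following. It is a minimal, untested sketch that bypasses the image functions and works on the flattened matrices in single precision; the names src, multiplyElementwise, a, b, out, and result are just placeholders I made up, and the block size of 256 is an arbitrary choice.

```wolfram
Needs["CUDALink`"]

(* CUDA source: each thread multiplies one pair of elements.
   mint is CUDALink's integer type for _Integer arguments. *)
src = "
__global__ void multiplyElementwise(float *a, float *b, float *out, mint n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) out[i] = a[i] * b[i];
}";

(* Compile and load the kernel; all three buffers are single-precision floats *)
multiplyElementwise = CUDAFunctionLoad[src, "multiplyElementwise",
   {{"Float", _, "Input"}, {"Float", _, "Input"},
    {"Float", _, "Output"}, _Integer}, 256];

(* Copy the inputs to the GPU once and allocate the output buffer there *)
n = Length[Flatten[m1]];
a = CUDAMemoryLoad[Flatten[m1], "Float"];
b = CUDAMemoryLoad[Flatten[m2], "Float"];
out = CUDAMemoryAllocate["Float", n];

(* Launch the kernel, then copy the product back and restore the 500 x 500 shape *)
multiplyElementwise[a, b, out, n];
result = Partition[CUDAMemoryGet[out], 500];

(* Compare against the CPU product; differences should be at the level of
   single-precision rounding rather than 8-bit quantization *)
Max[Abs[result - m1 m2]]
```

Something like this would keep the data as floats throughout, but it gives up the convenience of working with Image objects directly, so a fix for CUDAImageMultiply itself would still be preferable.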
Thanks