Message Boards Message Boards

Why is CUDA slow for me?

Posted 11 years ago
I have CUDA 6.0 and Mathematica 9.  CUDADot runs slower than Dot for large matrix multiplication.
 In[1]:= ClearAll[Evaluate[Context[] <> "*"]]
 Needs["CUDALink`"]
 CUDAQ[]
 CUDAInformation[]
 CUDAResourcesInformation[]
 n = 3*^3;
 M = RandomReal[1, {n, n}];
 AbsoluteTiming[M.M;]
 MG = CUDAMemoryLoad[M];
AbsoluteTiming[A = CUDADot[MG, MG]]
B = CUDAMemoryUnload[A];


Out[3]= True


Out[4]= {1 -> {"Name" -> "GeForce GT 650M", "Clock Rate" -> 900000,
   "Compute Capabilities" -> 3., "GPU Overlap" -> 1,
   "Maximum Block Dimensions" -> {1024, 1024, 64},
   "Maximum Grid Dimensions" -> {2147483647, 65535, 65535},
   "Maximum Threads Per Block" -> 1024,
   "Maximum Shared Memory Per Block" -> 49152,
   "Total Constant Memory" -> 65536, "Warp Size" -> 32,
   "Maximum Pitch" -> 2147483647,
   "Maximum Registers Per Block" -> 65536, "Texture Alignment" -> 512,
    "Multiprocessor Count" -> 2, "Core Count" -> 64,
   "Execution Timeout" -> 1, "Integrated" -> False,
   "Can Map Host Memory" -> True, "Compute Mode" -> "Default",
   "Texture1D Width" -> 65536, "Texture2D Width" -> 65536,
   "Texture2D Height" -> 65536, "Texture3D Width" -> 4096,
   "Texture3D Height" -> 4096, "Texture3D Depth" -> 4096,
   "Texture2D Array Width" -> 16384,
   "Texture2D Array Height" -> 16384,
   "Texture2D Array Slices" -> 2048, "Surface Alignment" -> 512,
   "Concurrent Kernels" -> True, "ECC Enabled" -> False,
   "TCC Enabled" -> False, "Total Memory" -> 1073414144}}


Out[5]= {{"Name" -> "CUDAResources", "Version" -> "9.0.0.0",
  "BuildNumber" -> "", "Qualifier" -> "OSX",
  "MathematicaVersion" -> "9.0.0+", "SystemID" -> {"MacOSX-x86-64"},
  "Description" -> "{ToolkitVersion -> 5.0, MinimumDriver -> 270.0}",
  "Category" -> "", "Creator" -> "", "Publisher" -> "",
  "Support" -> "", "Internal" -> False,
  "Location" ->
   "/Users/Brett/Library/Mathematica/Paclets/Repository/CUDAResources-\
OSX-9.0.0.0", "Context" -> {}, "Enabled" -> True, "Loading" -> Manual,
   "Hash" -> "fa491b5d7dd0144b2608a1daf4530222"}}


Out[8]= {1.819966, Null}


Out[10]= {3.032584, CUDAMemory["<477942836>", "Double"]}
POSTED BY: B R
2 Replies
When the input data is small, the overhead (passing data from ram to gpu and other way round) takes longer time than the time it saved on parallelization. 
POSTED BY: Shenghui Yang
Posted 11 years ago
This introduction shows significant speed increase on 4000 x 4000 matrix multiplication using CUDADot

http://reference.wolfram.com/mathematica/CUDALink/tutorial/Introduction.html#104550813

I run out of memory if I use a matrix this large: 
 In[28]:= ClearAll[Evaluate[Context[] <> "*"]]
 Needs["CUDALink`"]
 n = 4*^3;
 M = RandomReal[1, {n, n}];
 AbsoluteTiming[M.M;]
 MG = CUDAMemoryLoad[M];
 AbsoluteTiming[A = CUDADot[MG, MG]]
 B = CUDAMemoryUnload[A];
 

Out[32]= {3.864413, Null}


During evaluation of In[28]:= CUDADot::outmem: CUDALink ran out of available memory, possibly due to not freeing memory using the memory manager. >>


Out[34]= {0.073591,
CUDADot[CUDAMemory["<1682054120>", "Double"],
  CUDAMemory["<1682054120>", "Double"]]}


During evaluation of In[28]:= CUDAMemoryUnload::unlmem: Unable to unload memory {CUDADot[CUDAMemory[<1682054120>,Double],CUDAMemory[<1682054120>,Double]]}. Make sure it is a valid CUDALink memory. >>
POSTED BY: B R
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract