Message Boards

Why is CUDA slow for me?

Posted 10 years ago
I have CUDA 6.0 and Mathematica 9.  CUDADot runs slower than Dot for large matrix multiplication.
 In[1]:= ClearAll[Evaluate[Context[] <> "*"]]
 Needs["CUDALink`"]
 CUDAQ[]
 CUDAInformation[]
 CUDAResourcesInformation[]
 n = 3*^3;
 M = RandomReal[1, {n, n}];
 AbsoluteTiming[M.M;]
 MG = CUDAMemoryLoad[M];
 AbsoluteTiming[A = CUDADot[MG, MG]]
 B = CUDAMemoryUnload[A];


Out[3]= True


Out[4]= {1 -> {"Name" -> "GeForce GT 650M", "Clock Rate" -> 900000,
   "Compute Capabilities" -> 3., "GPU Overlap" -> 1,
   "Maximum Block Dimensions" -> {1024, 1024, 64},
   "Maximum Grid Dimensions" -> {2147483647, 65535, 65535},
   "Maximum Threads Per Block" -> 1024,
   "Maximum Shared Memory Per Block" -> 49152,
   "Total Constant Memory" -> 65536, "Warp Size" -> 32,
   "Maximum Pitch" -> 2147483647,
   "Maximum Registers Per Block" -> 65536, "Texture Alignment" -> 512,
    "Multiprocessor Count" -> 2, "Core Count" -> 64,
   "Execution Timeout" -> 1, "Integrated" -> False,
   "Can Map Host Memory" -> True, "Compute Mode" -> "Default",
   "Texture1D Width" -> 65536, "Texture2D Width" -> 65536,
   "Texture2D Height" -> 65536, "Texture3D Width" -> 4096,
   "Texture3D Height" -> 4096, "Texture3D Depth" -> 4096,
   "Texture2D Array Width" -> 16384,
   "Texture2D Array Height" -> 16384,
   "Texture2D Array Slices" -> 2048, "Surface Alignment" -> 512,
   "Concurrent Kernels" -> True, "ECC Enabled" -> False,
   "TCC Enabled" -> False, "Total Memory" -> 1073414144}}


Out[5]= {{"Name" -> "CUDAResources", "Version" -> "9.0.0.0",
  "BuildNumber" -> "", "Qualifier" -> "OSX",
  "MathematicaVersion" -> "9.0.0+", "SystemID" -> {"MacOSX-x86-64"},
  "Description" -> "{ToolkitVersion -> 5.0, MinimumDriver -> 270.0}",
  "Category" -> "", "Creator" -> "", "Publisher" -> "",
  "Support" -> "", "Internal" -> False,
  "Location" ->
   "/Users/Brett/Library/Mathematica/Paclets/Repository/CUDAResources-OSX-9.0.0.0",
   "Context" -> {}, "Enabled" -> True, "Loading" -> Manual,
   "Hash" -> "fa491b5d7dd0144b2608a1daf4530222"}}


Out[8]= {1.819966, Null}


Out[10]= {3.032584, CUDAMemory["<477942836>", "Double"]}
POSTED BY: B R
2 Replies
Posted 10 years ago
This introduction shows a significant speed increase for 4000 x 4000 matrix multiplication using CUDADot:

http://reference.wolfram.com/mathematica/CUDALink/tutorial/Introduction.html#104550813

I run out of memory if I use a matrix this large: 
 In[28]:= ClearAll[Evaluate[Context[] <> "*"]]
 Needs["CUDALink`"]
 n = 4*^3;
 M = RandomReal[1, {n, n}];
 AbsoluteTiming[M.M;]
 MG = CUDAMemoryLoad[M];
 AbsoluteTiming[A = CUDADot[MG, MG]]
 B = CUDAMemoryUnload[A];
 

Out[32]= {3.864413, Null}


During evaluation of In[28]:= CUDADot::outmem: CUDALink ran out of available memory, possibly due to not freeing memory using the memory manager. >>


Out[34]= {0.073591,
CUDADot[CUDAMemory["<1682054120>", "Double"],
  CUDAMemory["<1682054120>", "Double"]]}


During evaluation of In[28]:= CUDAMemoryUnload::unlmem: Unable to unload memory {CUDADot[CUDAMemory[<1682054120>,Double],CUDAMemory[<1682054120>,Double]]}. Make sure it is a valid CUDALink memory. >>
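The two messages are related: when CUDADot runs out of memory it returns unevaluated, so the subsequent CUDAMemoryUnload[A] receives a plain expression rather than a CUDAMemory handle, which is exactly what the unlmem message complains about. A minimal sketch, assuming the 3000 x 3000 buffers from the earlier run were still resident on the 1 GB card, that frees every handle explicitly (CUDAMemoryGet and CUDAMemoryUnload are standard CUDALink functions):

 Needs["CUDALink`"]
 n = 4*^3;
 M = RandomReal[1, {n, n}];
 MG = CUDAMemoryLoad[M];   (* copy the matrix to the GPU once *)
 AG = CUDADot[MG, MG];     (* product stays on the GPU as a CUDAMemory handle *)
 A = CUDAMemoryGet[AG];    (* copy the result back only when needed *)
 CUDAMemoryUnload[MG];     (* free both GPU buffers explicitly *)
 CUDAMemoryUnload[AG];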
POSTED BY: B R
When the input data is small, the overhead (transferring data from RAM to the GPU and back) takes longer than the time saved by parallelization.
POSTED BY: Shenghui Yang
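The transfer cost can be timed separately from the multiply itself; a minimal sketch along those lines, assuming the CUDALink calls block until each step completes so every timing covers its own step (note also that the GT 650M reported above has only 2 multiprocessors):

 Needs["CUDALink`"]
 n = 3*^3;
 M = RandomReal[1, {n, n}];
 AbsoluteTiming[MG = CUDAMemoryLoad[M];]   (* host -> GPU transfer *)
 AbsoluteTiming[AG = CUDADot[MG, MG];]     (* on-GPU multiply *)
 AbsoluteTiming[A = CUDAMemoryGet[AG];]    (* GPU -> host transfer *)
 CUDAMemoryUnload[MG]; CUDAMemoryUnload[AG];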