Message Boards Message Boards

GROUPS:

This OpenCL code seems to be leaking memory. Any ideas?

Posted 6 years ago
7435 Views
|
3 Replies
|
9 Total Likes
|
Hello all,

I've been playing around with OpenCL and I've run into a problem that I can't pin down. When running this n-body simulation, the amount of memory in use gradually increases with every time step (until I run out completely). I've tried using Bytecount[] on every variable that I can think of, and none of them seem to be the culprit.

Anyway, the OpenCL code in question is here:
 __kernel void nbody_kern(
     float timeStep,
     float eps,
     __global float4* position1,
     __global float4* velocity1,
     __global float4* acceleration1,
     __global float4* position2,
     __global float4* velocity2,
     __global float4* acceleration2,
    __local float4* localPosition
) {
    const float4 dt = (float4)(timeStep,timeStep,timeStep,0.0f);

    int idxGlobal = get_global_id(0);
    int idxLocal = get_local_id(0);

    int globalSize = get_global_size(0);
    int localSize = get_local_size(0);
    int blocks = globalSize / localSize;
   
    float4 currentPosition = position1[idxGlobal];
    float4 currentVelocity = velocity1[idxGlobal];
    float4 currentAcceleration = acceleration1[idxGlobal];
   
    float4 newPosition = (float4)(0.0f,0.0f,0.0f,0.0f);
    float4 newVelocity = (float4)(0.0f,0.0f,0.0f,0.0f);
    float4 newAcceleration = (float4)(0.0f,0.0f,0.0f,0.0f);
   
    for(int currentBlock = 0; currentBlock < blocks; currentBlock++)
    {
        localPosition[idxLocal] = position1[currentBlock * localSize + idxLocal];
        barrier(CLK_LOCAL_MEM_FENCE);
        for(int idx = 0; idx < localSize; idx++)
        {
            float4 dir = localPosition[idx] - currentPosition;
            float invRadius = rsqrt(dir.x * dir.x + dir.y * dir.y + dir.z * dir.z + eps);
            float magnitude = localPosition[idx].w * invRadius * invRadius * invRadius;
            newAcceleration += magnitude * dir;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
   
    // leapfrog integration
    newPosition = currentPosition + currentVelocity * dt + 0.5f * currentAcceleration * dt * dt;
    newVelocity = currentVelocity + 0.5f * (currentAcceleration + newAcceleration) * dt;
   
    position2[idxGlobal] = newPosition;
    velocity2[idxGlobal] = newVelocity;
    acceleration2[idxGlobal] = newAcceleration;
}
 
and the Mathematica code is here:
 initialVelocity[pt_] := Module[
   {y, v0, \[Theta]},
   y = Norm[pt[[1 ;; 3]]]/(2 rd);
   v0 = 4*\[Pi]*\[CapitalSigma]0*rd*
     y^2 (BesselI[0, y]*BesselK[0, y] - BesselI[1, y]*BesselK[1, y]);
   \[Theta] = ArcTan[pt[[1]], pt[[2]]];
   {-v0 Sin[\[Theta]], v0 Cos[\[Theta]], 0.0, 0.0} +
    RandomVariate[NormalDistribution[0, .001], 4]
   ]

Needs["OpenCLLink`"];
source = {"D:\\Google Drive\\Mathematica\\nbody.cl"};

stepSimulation = OpenCLFunctionLoad[
   source,
   "nbody_kern",
   {
    "Float",
    "Float",
    {"Float", _, "Input"},
    {"Float", _, "Input"},
    {"Float", _, "Input"},
    {"Float", _, "Output"},
    {"Float", _, "Output"},
    {"Float", _, "Output"},
    {"Local", "Float"}
    },
   256, "ShellOutputFunction" -> Print
   ];

dt = .01;
eps = .0001;
rd = 1.0;
\[CapitalSigma]0 = 0.3;
len = 2^14;

particles = (RandomVariate[ExponentialDistribution[rd]]*{Cos[#],
       Sin[#], 0.0, 0.0}) & /@ RandomReal[{0, 2 Pi}, len];
particles[[All, 4]] = 1.0/len;
velocity = initialVelocity /@ particles;
acceleration = Array[{0.0, 0.0, 0.0, 0.0} &, len];

position1 = OpenCLMemoryLoad[particles, "Float"];
velocity1 = OpenCLMemoryLoad[velocity, "Float"];
acceleration1 = OpenCLMemoryLoad[acceleration, "Float"];

position2 = OpenCLMemoryAllocate["Float", {len, 4}];
velocity2 = OpenCLMemoryAllocate["Float", {len, 4}];
acceleration2 = OpenCLMemoryAllocate["Float", {len, 4}];

pr = 8;
memList = {MemoryInUse[]};
Graphics3D[{
  PointSize[Small],
  Point[Dynamic[Refresh[
     AppendTo[memList, MemoryInUse[]];
     stepSimulation[dt, eps, position1, velocity1, acceleration1,
      position2, velocity2, acceleration2, 256*4, len];
     stepSimulation[dt, eps, position2, velocity2, acceleration2,
      position1, velocity1, acceleration1, 256*4, len];
     OpenCLMemoryGet[position1][[All, 1 ;; 3]],
     UpdateInterval -> 0]]
   ]}, PlotRange -> {{-pr, pr}, {-pr, pr}, {-pr/4, pr/4}},
ImageSize -> 1000
]
Dynamic[ListLinePlot[memList]]

As far as I can tell, the code does what it's supposed to and I'm generally pleased with the results. This is what I get after a few steps:

However, the memory problem limits how long I can run this thing for. This plot shows the amount of memory in use with respect to the number of steps completed in the simulation.


The only way I've been able to free up all that memory is by quitting the kernel, which is hardly ideal. I'm guessing that I'm just missing some important aspect of OpenCL programming. If anyone is more familiar with this stuff and has some ideas, I'd be thrilled to hear them.
3 Replies
Richard,

(Very nice OpenCL simulation)

I think I am observing the same problem when eliminating the OpenCL code:
 memList = {MemoryInUse[]};
 Graphics3D[{
   PointSize[Small],
   Point[Dynamic[Refresh[
      AppendTo[memList, MemoryInUse[]];
      RandomReal[1, {1000, 3}],
      UpdateInterval -> 0]]]},
  PlotRange -> {{0, 1}, {0, 1}, {0, 1}},
  ImageSize -> 1000]
Dynamic[ListLinePlot[memList]]
When I run this piece of code, I see similar memory usage increases. So your problem may not lie with your OpenCL code, but with how the kernel manages its memory. Of course memList itself grows in memory usage (due to the AppendTo), but that does not nearly account for the memory increases you are seeing.
 memList = {
    MemoryInUse[]};
 Graphics3D[
  {
   PointSize[Small],
   Point[Dynamic[Refresh[memList ~~ Join ~~ MemoryInUse[];
      RandomReal[1, {1000, 3}], UpdateInterval -> 0]]]
   },
  PlotRange -> {{0, 1}, {0, 1}, {0, 1}}, ImageSize -> 1000]
Dynamic[ListLinePlot[memList]]
Changing it to join seems to fix the memory issue. Keep in mind append makes a copy of the data each time its run, Appendto's memory ussage is just downright funky (presumably, harder on the computer, i dun understand it lol).
I'm not entirely sure I got the original coding work as intinded... but changing 
  AppendTo[memList, MemoryInUse[]];
 memList ~~ Join ~~ MemoryInUse[];
 (*IE*)
 Graphics3D[{PointSize[Small],
   Point[Dynamic[Refresh[memList ~~ Join ~~ MemoryInUse[];
      stepSimulation[dt, eps, position1, velocity1, acceleration1,
       position2, velocity2, acceleration2, 256*4, len];
      stepSimulation[dt, eps, position2, velocity2, acceleration2,
       position1, velocity1, acceleration1, 256*4, len];
     OpenCLMemoryGet[position1][[All, 1 ;; 3]],
     UpdateInterval -> 0]]]},
PlotRange -> {{-pr, pr}, {-pr, pr}, {-pr/4, pr/4}}, ImageSize -> 1000]
Dynamic[ListLinePlot[memList]]
Seemed to do the trick... 

Hopefully this helped,
Good Luck.

PS: You should add manipulates everwhere for good measure.

Edit- I completly lied, this did nothing but break the graph... I'm bad at this.

Edit 2- Now I'm just confused... It fix's his mathematic simplfication code, but not richard's...
Well, hopefully something I said was useful >.>
I ran a few tests in an attempt to isolate the issue, but I'm still stuck without a solution. In each of these cases, I started with a freshly restarted kernel and the following initializtion code:
 initialVelocity[pt_] := Module[
   {y, v0, \[Theta]},
   y = Norm[pt[[1 ;; 3]]]/(2 rd);
   v0 = 4*\[Pi]*\[CapitalSigma]0*rd*
     y^2 (BesselI[0, y]*BesselK[0, y] - BesselI[1, y]*BesselK[1, y]);
   \[Theta] = ArcTan[pt[[1]], pt[[2]]];
   {-v0 Sin[\[Theta]], v0 Cos[\[Theta]], 0.0, 0.0} +
    RandomVariate[NormalDistribution[0, .0001], 4]
   ]

Needs["OpenCLLink`"];
source = {"D:\\Google Drive\\Mathematica\\nbody.cl"};

stepSimulation = OpenCLFunctionLoad[
   source,
   "nbody_kern",
   {
    "Float",
    "Float",
    {"Float", _, "Input"},
    {"Float", _, "Input"},
    {"Float", _, "Input"},
    {"Float", _, "Output"},
    {"Float", _, "Output"},
    {"Float", _, "Output"},
    {"Local", "Float"}
    },
   256, "ShellOutputFunction" -> Print
   ];

dt = .01;
eps = .0001;
rd = 1.0;
\[CapitalSigma]0 = 0.3;
len = 2^13;

particles = (RandomVariate[ExponentialDistribution[rd]]*{Cos[#],
       Sin[#], 0.0, 0.0}) & /@ RandomReal[{0, 2 Pi}, len];
particles[[All, 4]] = 1.0/len;
velocity = initialVelocity /@ particles;
acceleration = Array[{0.0, 0.0, 0.0, 0.0} &, len];

position1 = OpenCLMemoryLoad[particles, "Float"];
velocity1 = OpenCLMemoryLoad[velocity, "Float"];
acceleration1 = OpenCLMemoryLoad[acceleration, "Float"];

position2 = OpenCLMemoryAllocate["Float", {len, 4}];
velocity2 = OpenCLMemoryAllocate["Float", {len, 4}];
acceleration2 = OpenCLMemoryAllocate["Float", {len, 4}];
The nbody.cl code is the same from the original post. After each test, I used the respective command below to keep track of my data between kernel resets.
DumpSave["memList1.mx", memList1];
DumpSave["memList2.mx", memList2];
DumpSave["memList3.mx", memList3];
DumpSave["memList4.mx", memList4];

Here are the details of each test I ran. Test 1 replicates my original conditions. Test 2 uses Arnoud's example for comparison. Test 3 keeps everything regarding the simulation on the GPU, and the Mathematica kernel only handles the loop and the memory tracking. Lastly, test 4 is the same as 3, except the kernel accesses the GPU memory to get the current particle positions at each step.

Test 1
 pr = 8;
 mem0 = MemoryInUse[];
 memList1 = {MemoryInUse[] - mem0};
 Graphics3D[{PointSize[Small],
   Point[Dynamic[Refresh[AppendTo[memList1, MemoryInUse[] - mem0];
      stepSimulation[dt, eps, position1, velocity1, acceleration1,
       position2, velocity2, acceleration2, 256*4, len];
      stepSimulation[dt, eps, position2, velocity2, acceleration2,
       position1, velocity1, acceleration1, 256*4, len];
     OpenCLMemoryGet[position1][[All, 1 ;; 3]],
     UpdateInterval -> 0]]]},
PlotRange -> {{-pr, pr}, {-pr, pr}, {-pr/4, pr/4}}, ImageSize -> 1000]
Dynamic[ListLinePlot[memList1]]
Test 2
mem0 = MemoryInUse[];
memList2 = {MemoryInUse[] - mem0};
Graphics3D[{PointSize[Small],
  Point[Dynamic[Refresh[AppendTo[memList2, MemoryInUse[]];
     RandomReal[1, {len, 3}], UpdateInterval -> 0]]]},
PlotRange -> {{0, 1}, {0, 1}, {0, 1}}, ImageSize -> 1000]
Dynamic[ListLinePlot[memList2]]
Test 3
 mem0 = MemoryInUse[];
 memList3 = {MemoryInUse[] - mem0};
 Do[
  AppendTo[memList3, MemoryInUse[] - mem0];
  stepSimulation[dt, eps, position1, velocity1, acceleration1,
   position2, velocity2, acceleration2, 256*4, len];
  stepSimulation[dt, eps, position2, velocity2, acceleration2,
   position1, velocity1, acceleration1, 256*4, len];
  , {15000}]
Test 4
 mem0 = MemoryInUse[];
 memList4 = {MemoryInUse[] - mem0};
 Do[
  AppendTo[memList4, MemoryInUse[] - mem0];
  stepSimulation[dt, eps, position1, velocity1, acceleration1,
   position2, velocity2, acceleration2, 256*4, len];
  stepSimulation[dt, eps, position2, velocity2, acceleration2,
   position1, velocity1, acceleration1, 256*4, len];
  OpenCLMemoryGet[position1][[All, 1 ;; 3]];
, {15000}]
Results
After all of these had run, I used
Get /@ {"memList1.mx", "memList2.mx", "memList3.mx", "memList4.mx"};
to recover the memory tracking data. The results are interesting:
ListLinePlot[{memList1, memList2, memList3, memList4},
PlotStyle -> Thick,
PlotLegends -> Automatic,
PlotRange -> {{1, 15000}, Automatic},
AxesOrigin -> {1, -10^9}]


The memory usage from tests 1 and 2 is clearly a problem, since it appears to be increasing quadratically with each step. I ended up using an additional 20 GB of memory after 15000 steps. Although there's clearly a memory leak here somewhere that isn't OpenCL related, there's still a difference between test 1 and 2 that must be related to the OpenCL code. Although it's not as severe, it's the one I can't avoid, so I need to pin that one down.

The results from test 3 show that the simulation running on the GPU alone isn't the problem. I verified that the code did in fact run by getting the final position state afterwards, and everything appeared to have functioned properly. However, this code is entirely useless to me, since I need the particle positions at each step. Unfortunately, as you can see from the test 4 results, reading the position data from the GPU (even though I'm not assigning it to anything) increases the amount of memory in use linearly with every step. After 15000 steps, I was using approximately 4 GB of additional memory.

So from what I can tell is that using the function OpenCLMemoryGet[] is copying GPU memory to the kernel, but it never lets it go. This becomes a serious problem when you need to repeatedly get updates from the GPU, since each time the memory is read, a new section of kernel memory is allocated for it. What I need is some way to free up this memory from the kernel, but I have no idea how to refer to it. The function OpenCLMemoryUnload[] seems like it would be the obvious choice, but that is for unloading the memory I have allocated in the GPU, which doesn't help me. Does anyone know of a way to clear this memory from the kernel after each step?
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract