Unable to train network on GTX 970 -- unspecified failure

Posted 8 years ago

Hi -- As a stopgap until 11.1, when I can use my 1080 GPU, I've switched to a machine with a 970, but I'm having a different problem. I can train a simple network (with stripped-down parameters) on the CPU, but when I set the GPU as the target, it starts to run and then flashes, stops, and turns most of the previous computations back to undefined.

For reference, I have successfully run this same code targeting a Quadro 5000M GPU, where it worked perfectly.

I hope it is something simple that I need to do differently, but I don't know what to try next. I tried uninstalling and re-installing the display driver, but that didn't seem to matter. I'm currently using the latest version from Nvidia/EVGA for my card (an FTW version, with all factory settings).

I've attached a notebook that includes a call to SystemInformation, as well as the CPU & GPU training calls.

Thanks for any thoughts! -- David

POSTED BY: David Cardinal
9 Replies

Stefan -- That makes sense. However, I've noticed that once it runs out of memory with a failure message, even a simpler network (one that would succeed if I ran it first thing) can fail and crash the kernel. So there may also be an issue with how the kernel recovers state and resources after an out-of-memory condition, in addition to the problem you mentioned where its internal checks on memory requirements fail.

PS The 970 (and possibly 4GB 960s) has that weird Nvidia memory issue, where the full 4GB isn't really available. I have no idea whether that might confuse an application, but it is something memory-related that's unique to these cards. (Nvidia is having to send $30 to every 970 owner as a result -- not 960 owners, since Nvidia never configured the 960 with 4GB; those boards were built by vendors aftermarket.)

POSTED BY: David Cardinal

I also just noticed that sometimes the first time I run the code I get GPU Memory exhausted, and then if I try to simplify it and run again, the kernel often crashes. So perhaps there is something flawed with the way the kernel recovers from the OOM condition.

POSTED BY: David Cardinal

I need to check with the developers, but what I think is happening is that the internal checks that guard against exhausting GPU memory during training are failing. Normally it should immediately see that training wouldn't work and bail out (giving the Failure["GPU memory exhausted"] answer), but here it fails to detect this and proceeds with training. Then when the GPU runs out of memory in the middle of the computations, the kernel doesn't know what to do and crashes.
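
The normal path described above can be handled from the user's side. This is only a minimal sketch, and it assumes NetTrain returns a Failure object rather than crashing. The helper name `trainWithFallback` and the arguments `net` and `data` are hypothetical placeholders, not taken from the attached notebook.

```mathematica
(* Hypothetical helper: try training on the GPU first and fall back to
   the CPU if NetTrain returns a Failure object. net and data stand in
   for the network and training set defined in the notebook. *)
trainWithFallback[net_, data_] :=
 Module[{result},
  result = NetTrain[net, data, TargetDevice -> "GPU"];
  If[FailureQ[result],
   (* The GPU attempt failed cleanly, e.g. GPU memory exhausted *)
   NetTrain[net, data, TargetDevice -> "CPU"],
   result]]
```

This only helps when the failure is reported cleanly. It cannot catch a kernel crash, because the kernel that would evaluate the check is the one that goes down.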

Yes, that's entirely possible. The Quadro that runs it successfully has 8GB, while my 970 (and the 960 I also tried) have 4GB.

As to your question about the In/Out counter resetting, yes it does.

So I guess the solution (at least for now) is to keep reducing the complexity of the NN until it works. Are there any resources available to help estimate the amount of GPU memory Mathematica needs to represent various layers?

Thanks for your help!

POSTED BY: David Cardinal
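
On the question of estimating GPU memory, a rough lower bound for a fully connected layer can be worked out by hand from its parameter count. The sketch below assumes 4-byte single-precision reals and a made-up input size; `denseLayerBytes` is a hypothetical helper, not a built-in function, and this is not necessarily how Mathematica itself accounts for memory.

```mathematica
(* Rough lower bound on the memory held by one fully connected layer's
   parameters: a weight matrix (inputSize x outputSize) plus a bias vector.
   Assumption: 4 bytes per real (single precision). *)
denseLayerBytes[inputSize_Integer, outputSize_Integer, bytesPerReal_Integer: 4] :=
 (inputSize*outputSize + outputSize)*bytesPerReal

(* Hypothetical example: a 512-neuron layer fed by a flattened 64x64x3 image.
   The input size is illustrative only; result is in MiB. *)
N[denseLayerBytes[64*64*3, 512]/2^20]
```

With these illustrative numbers the parameters alone come to roughly 24 MiB. The footprint during training is larger than this, since gradients, optimizer state, and per-batch activations are held as well, so the figure is best treated as a floor rather than a prediction.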

What you describe sounds like a kernel crash. To be clear, when you run into the issue, does the In/Out cell counter reset? That is, if you immediately evaluated 1+1, would it look like this:

In[1]:= 1+1
Out[1]= 2

I ripped the sample down to nothing, and built it back up. From what I can tell so far, the problem arises with the 512 neuron DotPlusLayer. If I take that down to 256 or less, I can train the network.

POSTED BY: David Cardinal

When I run the example on my machine (which has a GeForce GTX 960), it returns a Failure answer saying that GPU memory was exhausted. Perhaps that is also happening on your machine, but instead of properly bailing out and returning a Failure, the kernel is crashing...

Is there a way to access your training data? I.e. the files that are in "c:\\projects\\Submissions\\Thumbnails\\Selects", referenced in the notebook.
