Training loss goes to 0 when specifying TargetDevice = GPU

Posted 8 years ago

Hi -- The neural network package is really cool. I'm learning a lot from experimenting with it. I have noticed one odd thing I can't explain. I have a fairly simple CNN that I train on some images. It trains (although super-slowly) on my CPU, with a more or less reasonable loss that gets smaller on each training round. But when I set TargetDevice = "GPU" it instantly reports 0 loss and finishes in just a couple of seconds. As one clue, I had an Nvidia 970 with the latest Nvidia drivers and this didn't happen. I just upgraded to a new 1080 (partially for faster network training!), and that is when this started happening.

I've attached a Notebook that demonstrates this behavior and has SystemInformation[] in it. (It does use images that aren't included, but I don't think there is anything special about them; they are just a bunch of JPEGs.) I'm running Nvidia driver 372.70. If there is a different driver I should be using instead, please let me know. Thanks!

POSTED BY: David Cardinal
17 Replies

So, in Linux this statement:

So it looks like you will need to wait for 11.1, which will definitely support CUDA 8.0.

does not seem to be true. To wit:

Detailed Information (3 items)
Driver Path: /usr/lib/nvidia-378/libnvidia-tls.so.378.13
Library Path: /usr/lib/x86_64-linux-gnu/libcuda.so.378.13
System Libraries:
{/usr/lib/x86_64-linux-gnu/libcuda.so,
/home/flip/.Mathematica/Paclets/Repository/CUDAResources-LIN64-10.5.0/CUDAToolkit/lib64/libcudart.so,
/home/flip/.Mathematica/Paclets/Repository/CUDAResources-LIN64-10.5.0/CUDAToolkit/lib64/libcudart.so.7.5,
/home/flip/.Mathematica/Paclets/Repository/CUDAResources-LIN64-10.5.0/CUDAToolkit/lib64/libcufft.so,
/home/flip/.Mathematica/Paclets/Repository/CUDAResources-LIN64-10.5.0/CUDAToolkit/lib64/libcufft.so.7.5,
/home/flip/.Mathematica/Paclets/Repository/CUDAResources-LIN64-10.5.0/CUDAToolkit/lib64/libcufftw.so,
/home/flip/.Mathematica/Paclets/Repository/CUDAResources-LIN64-10.5.0/CUDAToolkit/lib64/libcufftw.so.7.5,
/home/flip/.Mathematica/Paclets/Repository/CUDAResources-LIN64-10.5.0/CUDAToolkit/lib64/libcublas.so,
/home/flip/.Mathematica/Paclets/Repository/CUDAResources-LIN64-10.5.0/CUDAToolkit/lib64/libcublas.so.7.5,
/home/flip/.Mathematica/Paclets/Repository/CUDAResources-LIN64-10.5.0/CUDAToolkit/lib64/libcurand.so,
/home/flip/.Mathematica/Paclets/Repository/CUDAResources-LIN64-10.5.0/CUDAToolkit/lib64/libcurand.so.7.5}

You'll see a bunch of 7.5s there at the end of the .sos. I have 8.0 installed here, locally, and use it to build various things, but for some reason, Mathematica doesn't want to use it.
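
The version suffixes in a library list like the one above can be pulled out with a few lines of Python. This is only a sketch: the helper name `bundled_versions` and the `PACLET` shorthand are made up for illustration, and the list is a partial copy of the System Libraries entry.

```python
import re
from collections import Counter

# Shorthand for the paclet directory shown in the System Libraries entry.
PACLET = ("/home/flip/.Mathematica/Paclets/Repository/"
          "CUDAResources-LIN64-10.5.0/CUDAToolkit/lib64")

# A few entries copied from the System Libraries list (not the full list).
system_libraries = [
    "/usr/lib/x86_64-linux-gnu/libcuda.so",
    f"{PACLET}/libcudart.so",
    f"{PACLET}/libcudart.so.7.5",
    f"{PACLET}/libcufft.so.7.5",
    f"{PACLET}/libcublas.so.7.5",
    f"{PACLET}/libcurand.so.7.5",
]


def bundled_versions(paths):
    """Count the numeric suffix after '.so.' in each path (hypothetical helper)."""
    versions = Counter()
    for path in paths:
        match = re.search(r"\.so\.(\d+(?:\.\d+)*)$", path)
        if match:  # unversioned names such as libcudart.so are skipped
            versions[match.group(1)] += 1
    return versions


print(bundled_versions(system_libraries))
```

Every versioned entry in the sample reports 7.5, which matches the observation that the paclet ships its own CUDA 7.5 libraries regardless of the 8.0 toolkit installed system-wide.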

libcuda1-378/zesty,now 378.13-0ubuntu0~gpu17.04.3 amd64 [installed,automatic]
libcudart8.0/zesty,now 8.0.44-3 amd64 [installed,automatic]
nvidia-cuda-dev/zesty,now 8.0.44-3 amd64 [installed]
nvidia-cuda-doc/zesty,zesty,now 8.0.44-3 all [installed]

Any thoughts?

POSTED BY: Flip Phillips

CUDALink is entirely separate from the NeuralNetworks GPU computing via CUDA. Neither depends on the other in any way. CUDALink still needs to be updated for CUDA Toolkit 8.

Ah- silly me to think they would be integrated / related somehow. :)

POSTED BY: Flip Phillips
Posted 8 years ago

I have my Titan X working now with Mathematica 11.1, without any trouble :)

POSTED BY: Fred Hugen

This issue seems to be resolved in version 11.1 (at least for my GTX 1050)!

Gijsbert

Great! NeuralNetworks in general and NetTrain in particular got a huge overhaul on nearly every level in 11.1, so many of the bugs and limitations in 11.0.1 should be fixed, or at least ameliorated.

Posted 8 years ago

Jeez... I just finally purchased both Mathematica and my Nvidia Titan X Pascal video card and ran into the same issue as you guys :( . I was hoping to avoid the CUDA problem on Windows, since I heard it was plaguing TensorFlow on Linux a few months back. I guess I will have to wait for a fix to come out.

POSTED BY: P J
Posted 8 years ago

David -- thanks, that helps me out for the moment. I'll try a 4GB Tesla K10 from a colleague, and if it doesn't have enough memory I'll buy the GTX 970 with 8GB.

POSTED BY: Fred Hugen

Fred -- I doubt it is the fastest card that works, but I've fallen back to a 970 until the bug that prevents my 1080 from working is fixed. It works okay, but of course is both slower and has less memory.

POSTED BY: David Cardinal
Posted 8 years ago

Hi -- I am experiencing the same problem with the Titan X GPU card from NVIDIA, which I bought for the Deep Learning toolkit of Mathematica.

The Mathematica Deep Learning toolkit is so much easier to use compared to Caffe, Theano or TensorFlow in combination with Python.

But without a reasonable GPU, the Deep Learning toolkit is too slow for solving applications. Is there an estimate of when Mathematica 11.1 will be released?

Can I subscribe to a beta release or RC of 11.1?

Which is the fastest graphics/GPU card on which the Mathematica toolkit will run the deep learning functions?

POSTED BY: Fred Hugen

Sebastian -- Thanks for the prompt reply, although "ouch" on the timing.

POSTED BY: David Cardinal

Any estimate of when this might get patched?

We were hoping to rebuild the 11.0 release backend with CUDA 8.0 Release Candidate, and provide a patch. Unfortunately, it does not appear to be compatible with CUDA 8.0 RC, so we can't go this simple route.

So it looks like you will need to wait for 11.1, which will definitely support CUDA 8.0.

Any estimate of when this might get patched? Training on the CPU is, of course, almost useless, so I'd love to be using my 1080. Thanks for any info!

POSTED BY: David Cardinal

Thanks for the quick reply. I assume that there isn't any way I can update the runtime libraries for CUDA & cuDNN that Mathematica uses on my system, but that I need to wait for a patch from you guys?

POSTED BY: David Cardinal

We build other libraries against CUDA 7.5, which would need to be rebuilt. So we would need to push a patch.

A number of TensorFlow users are reporting problems using a 1080/1070 with CUDA v7.5 and cuDNN v5.0 (which we are currently using for 11.0). The latest is CUDA v8RC and cuDNN v5.1:

This seems to be resolved by upgrading to CUDA 8RC and cuDNN 5.1. We are in the process of acquiring our own 1080 GPUs and will verify soon whether this fixes the problem. We will keep you posted on what we find.

This might be a problem on NVIDIA's end, an interaction between the 1080/1070 cards and CUDAToolkit 7.5 and cuDNN library v5.0. We are investigating.
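
For context, one plausible explanation (an assumption, not something confirmed in this thread) is architecture support: the 1070/1080 are Pascal cards with compute capability 6.1, while CUDA 7.5 targets native code only up to Maxwell-era architectures and CUDA 8.0 added the Pascal ones. A rough Python sketch of that check follows. The tables and the `has_native_support` helper are illustrative, and lacking a native target does not by itself prove that a card will fail.

```python
# Compute capabilities published by NVIDIA for cards named in this thread.
COMPUTE_CAPABILITY = {
    "GTX 970": (5, 2),           # Maxwell
    "GTX 1050": (6, 1),          # Pascal
    "GTX 1070": (6, 1),          # Pascal
    "GTX 1080": (6, 1),          # Pascal
    "Titan X (Pascal)": (6, 1),  # Pascal
}

# Highest architecture each toolkit can generate native code for.
MAX_NATIVE_ARCH = {
    "7.5": (5, 3),
    "8.0": (6, 2),
}


def has_native_support(card, toolkit):
    """Return True if the toolkit has a native target for the card (hypothetical helper)."""
    return COMPUTE_CAPABILITY[card] <= MAX_NATIVE_ARCH[toolkit]


for card in COMPUTE_CAPABILITY:
    print(card, {tk: has_native_support(card, tk) for tk in MAX_NATIVE_ARCH})
```

Under these assumptions the 970 is covered by CUDA 7.5 while the Pascal cards are covered only from 8.0 on, which lines up with the reports here that the 970 trains normally and the 1080 does not.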
