Yesterday I found the paper Dropout Inference in Bayesian Neural Networks with Alpha-divergences by Yingzhen Li and Yarin Gal, in which they address one of the shortcomings of the approach presented in the blog post. The method I showed above is based on variational Bayesian inference, which has a tendency to under-fit the posterior: it underestimates the uncertainty, so the results look more optimistic (confident) than they should be. To address this, they propose a modified loss function to train the neural network with.
In the attached notebook I tried to implement their loss function. It took a bit of tinkering, but I think it works adequately. I haven't given much thought yet to calibrating the network and the training parameters, which is definitely an important next step.
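I won't reproduce the whole notebook here, but the core of the modified objective boils down to something like the sketch below: a minimal PyTorch version of the Monte Carlo α-divergence energy (roughly the training objective in Li and Gal's paper), under a couple of simplifications I'm making for brevity. The function name alpha_divergence_loss and the Gaussian likelihood with precision tau are my own choices, and the dropout/weight-decay regularisation term of the full objective is left out.

```python
import math
import torch

def alpha_divergence_loss(y_true, y_pred_samples, alpha=0.5, tau=1.0):
    """Monte Carlo alpha-divergence objective (a sketch, not the notebook code).

    y_true:          tensor of shape (N, D)
    y_pred_samples:  tensor of shape (K, N, D), from K stochastic forward
                     passes with dropout kept active at training time
    """
    K = y_pred_samples.shape[0]
    # Per-sample Gaussian log-likelihood (up to an additive constant),
    # with observation precision tau.
    sq_err = ((y_pred_samples - y_true.unsqueeze(0)) ** 2).sum(dim=-1)    # (K, N)
    log_lik = -0.5 * tau * sq_err                                         # (K, N)
    # log of the average over the K samples of exp(alpha * log-likelihood)
    log_mean_exp = torch.logsumexp(alpha * log_lik, dim=0) - math.log(K)  # (N,)
    # Sum over data points and rescale by -1/alpha; as alpha -> 0 this tends
    # to the standard (variational) expected log-likelihood term.
    return -(1.0 / alpha) * log_mean_exp.sum()
```

In practice you would add the usual L2/weight-decay term (the dropout approximation to the KL regulariser) on top of this, and feed it the K dropout forward passes of your network.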
Edit:
For those of you who are interested in understanding what the alpha parameter does in the modified loss function, it might be instructive to look at figure 2 in the paper Black-Box α-Divergence Minimization by Hernández-Lobato et al.
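For reference, the α-divergence in question (written here in Amari's parameterisation, which is the one the BB-α paper uses) is shown below; the limiting cases noted in the comments are essentially what that figure illustrates, with α → 0 giving the usual mode-seeking variational fit and α → 1 a mass-covering, expectation-propagation-style fit.

```latex
% Amari's alpha-divergence between the true posterior p and the approximation q:
D_\alpha\!\left[p \,\|\, q\right]
  = \frac{1}{\alpha(1-\alpha)}
    \left(1 - \int p(\theta)^{\alpha}\, q(\theta)^{1-\alpha}\, \mathrm{d}\theta\right)
% alpha -> 0 recovers KL(q || p) (standard variational Bayes, mode-seeking),
% alpha -> 1 recovers KL(p || q) (expectation propagation, mass-covering).
```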
Attachments: