Message Boards Message Boards


Sequence-to-sequence regression with recurrent neural networks

Posted 2 years ago
20 Replies
15 Total Likes

I recently wanted to learn more about recurrent neural networks (RNNs) and how to use them for regression problems. In particular, I wanted to try and combine RNNs with the ideas for Bayesian error bar estimation I discussed in my blog/community posts a while back:

I also found the found the following answer on StackExchange about sequence learning with SequenceAttentionLayer, which helped me quite a bit to understand how to use to use these layers:

With those resources in hand, I wanted to give time-series regression a shot.

First, let's use FinancialData as an easy source of a test dataset. Here I download 5 years of data from 5 different indices and turn them into a regularly sampled matrix of vectors and then split the dataset into 80/20% training/test data:

indices = {"GOOGL", "AS:ASML", "MSFT", "AAPL", "GE"};
data = FinancialData[indices, {Today - 5 Quantity[1, "Years"], Today}];
   matrix = N@Normal[
        TimeSeriesResample[TimeSeriesThread[Identity, TimeSeries /@ data]], {0, Max[Length /@ data] - 1}]
    ][[All, 2]]
  {fe, data} = FeatureExtraction[matrix,  "StandardizedVector", {"ExtractorFunction", "ExtractedFeatures"}]
{trainingData, testData} = TakeDrop[data, Round[0.8*Length[data]]];
Dimensions /@ {trainingData, testData}

enter image description here

Next, we define a function that can generate training data for the network from the data matrix we just made:

genFun[data_, { seqLengthIn_, seqLengthout_, offset_}] := With[{
    dim = Dimensions[data][[2]],
    max = Max[1, Length[data] - seqLengthout - offset - seqLengthIn - 2]
    With[{ri = RandomInteger[{1, max}, #BatchSize]},
      "Input" -> Function[data[[# ;; # + seqLengthIn - 1]]] /@ ri,
      "Target" -> Join[
        ConstantArray[{ConstantArray[0., dim]}, #BatchSize],
        Function[With[{j = # + offset + seqLengthIn - 1},   data[[j ;; j + seqLengthout - 1]]]] /@ ri,

Each time the function is called, it generates a batch of sequences that will be used as training input and target sequences that the network should try to learn and reproduce, but the targets are prepended with a blank start-of-sequence vector:

Dimensions /@ genFun[data, {10, 6, 1}][<|"BatchSize" -> 3|>]
Out[72]= <|"Input" -> {3, 10, 5}, "Target" -> {3, 7, 5}|>

Next we put together our neural network:

Clear[encoderNetAttend, decoderNet, lossLayer, trainingNetAttention];
encoderNetAttend[n_, dropout_] := NetGraph[<|
    "lstm" -> LongShortTermMemoryLayer[n, "Dropout" -> dropout], 
    "last" -> SequenceLastLayer[],
    "post" -> 
     NetChain[{LinearLayer[n], ElementwiseLayer["SELU"], 
       DropoutLayer[], LinearLayer[]}]
   {"lstm" -> "last" -> "post" -> NetPort["StateInit"],  "lstm" -> NetPort["States"]}
decoderNet[n_, dropout_] := NetGraph[
    "part1" -> PartLayer[{All, 1}], "part2" -> PartLayer[{All, 2}],
    "lstm" -> LongShortTermMemoryLayer[n, "Dropout" -> dropout],
    "shape1" -> ReshapeLayer[{n, 1}],  "shape2" -> ReshapeLayer[{n, 1}], "catState" -> CatenateLayer[2]
    NetPort["Input"] -> NetPort["lstm", "Input"], 
    NetPort["lstm", "Output"] -> NetPort["Output"],
    NetPort["StateIn"] -> "part1" -> NetPort["lstm", "State"],
    NetPort["StateIn"] -> "part2" -> NetPort["lstm", "CellState"],
    NetPort["lstm", "State"] -> "shape1",  NetPort["lstm", "CellState"] -> "shape2",
    {"shape1", "shape2"} -> "catState" -> NetPort["StateOut"]
   "StateIn" -> {n, 2}
lossLayer["LogPrecision"] =  ThreadingLayer@Function[{yObserved, yPredicted, logPrecision},
    (yPredicted - yObserved)^2*Exp[logPrecision] - logPrecision
regressionNet[n_, dimIn_, dropout : _?NumericQ : 0.5] := 
     "encoder" -> encoderNetAttend[n, dropout], 
     "decoder" -> decoderNet[n, dropout],
     "attention" -> SequenceAttentionLayer["Bilinear"], 
     "catenate" -> CatenateLayer[2],
     "regression" -> NetMapOperator[LinearLayer[{2, dimIn}]],
     "part1" -> PartLayer[{All, 1}], "part2" -> PartLayer[{All, 2}]
     NetPort["Input"] -> "encoder",
     NetPort["encoder", "StateInit"] -> NetPort["decoder", "StateIn"],
      NetPort["encoder", "States"] -> NetPort["attention", "Input"],
     NetPort["Prediction"] -> NetPort["decoder", "Input"],
     NetPort[{"decoder", "Output"}] -> 
      NetPort["attention", "Query"], {"decoder", "attention"} -> 
      "catenate" -> "regression",
     "regression" -> "part1" -> NetPort["mu"], 
     "regression" -> "part2" -> NetPort["logTau"]
    "Input" -> {"Varying", dimIn},
    "Prediction" -> {"Varying", dimIn}
trainingNetAttention[n_, dimIn_, dropout : _?NumericQ : 0.5] := NetInitialize@NetGraph[
     "regression" -> regressionNet[n, dimIn, dropout],
     "loss" -> lossLayer["LogPrecision"],
     "total" -> SummationLayer[],  "most" -> SequenceMostLayer[],  "rest" -> SequenceRestLayer[]
     NetPort["Input"] -> NetPort["regression", "Input"],
     NetPort["Target"] ->  "most" -> NetPort["regression", "Prediction"],
     NetPort["Target"] -> "rest",
     {"rest", NetPort["regression", "mu"],  NetPort["regression", "logTau"]} -> "loss" -> "total" -> NetPort["Loss"]
    "Input" -> {"Varying", dimIn},
    "Target" -> {"Varying", dimIn}
nHidden = 250;
pDrop = 0.25;
NetInformation[encoderNetAttend[nHidden, pDrop], "SummaryGraphic"]
NetInformation[decoderNet[nHidden, pDrop], "SummaryGraphic"]
NetInformation[regressionNet[nHidden, 2, 0.25], "SummaryGraphic"]
NetInformation[ trainingNetAttention[nHidden, 2, 0.25], "SummaryGraphic"]

enter image description here

The idea here is that the encoder produces a state vector that the decoder uses to start making predictions. The decoder, in turn, accepts the networks' own predictions to iteratively crank out the next one. During the training, though, we don't have the actual predictions from the network, so instead we feed in the sequence of actual target values and offset the target sequence at the loss layer by one with a SequenceRest layer. This is called teacher forcing.

Time to train the network:

seqConfig = {15, 7, 1};
trainedTrainObject = NetTrain[
   trainingNetAttention[nHidden, Length[indices], pDrop],
   genFun[trainingData, seqConfig],
   TargetDevice -> "GPU",  LossFunction -> "Loss",
   TimeGoal -> 30*60, BatchSize -> 200,
   ValidationSet -> genFun[testData, seqConfig][<|"BatchSize" -> 200|>]
trainedNet = trainedTrainObject["TrainedNet"];

enter image description here

To make predictions, the network needs to eat its own results iteratively, which we can do with a NestList

makePredictorFunction[trainedNet_, evaluationMode : ("Test" | "Train") : "Test", device : ("CPU" | "GPU") : "CPU"] := With[{
   dim = Last@NetExtract[trainedNet, {"Input"}],
   encoder = NetExtract[trainedNet, {"regression", "encoder"}],
   regressionNet = 
    NetDelete[NetExtract[trainedNet, "regression"], "encoder"]
  Function[{input, n},
     encodedInput = encoder[input, NetEvaluationMode -> evaluationMode, TargetDevice -> device],
     vectorInputQ = Depth[input] === 4,
     length = Length[input],
    reshape = If[vectorInputQ, ConstantArray[#, Length[input]] &, Identity];
         <|"Input" -> encodedInput["States"], "Prediction" -> #mu, "StateIn" -> #StateOut|>,
          NetEvaluationMode -> evaluationMode, 
         TargetDevice -> device] &,
        "mu" -> reshape@{ConstantArray[0, dim]},
        "StateOut" -> encodedInput["StateInit"]
       ][[2 ;;, {"mu", "logTau"}, If[vectorInputQ, All, Unevaluated[Sequence[]]], 1]]

For example, we can predict 5 steps ahead based on 30 points of historical data:

predictor = makePredictorFunction[trainedNet, "Train", "CPU"];
Dimensions /@ predictor[data[[;; 30]], 7]

Out[299]= <|"mu" -> {7, 5}, "logTau" -> {7, 5}|>

Here's a manipulate to visualize the predictions against the actual data. We sample the network several times to get a sense of the uncertainty from the training weights and combine that uncertainty with the variance predicted by the network:

     predictions = Dot[
       {{1., 1.}, {1., 0.}, {1., -1.}},
          nstdev*Sqrt[Variance[#mu] + Mean[Exp[-#logTau]]]} &@Map[
          predictor[Table[dat[[Max[1, n - t] ;; UpTo[n]]], s], l]
          ][[{"mu", "logTau"}, All, All, index]]
     trueData = ListPlot[
       MapIndexed[{First[#2] + Max[1, n - t] - 1, #1} &,
        Flatten[dat[[Max[1, n - t] ;; UpTo[n + l], index]]]
       Joined -> True,
       PlotStyle -> Black
           {First[#2] + Max[1, n - t] - 1, #1} &,
             dat[[Max[1, n - t] ;; UpTo[n], index]],
           ] & /@ predictions,
        {StringForm["\[Mu] + `1` \[Sigma]", Dynamic[nstdev]], 
         "\[Mu]", StringForm["\[Mu] - `1` \[Sigma]", Dynamic[nstdev]]}
      Joined -> True,
      PlotStyle -> (Directive[#, Dashed] & /@ {Blue, Red, Blue}),
      ImageSize -> 500,
      Filling -> {1 -> {2}, 3 -> {2}}
     ImageSize -> Large,
     PlotRange -> All
  {{index, 1, "Index"}, Thread[Range[Length[indices]] -> indices]},
  {{n, Round[Length[dat]/10], "Position"}, 1, Length[dat], 1},
  {{t, 20, "History length"}, 1, 30, 1},
  {{l, 30, "Extrapolation"}, 1, 50, 1},
  {{s, 5, "Number of samples"}, 1, 20, 1},
  {{nstdev, 2, "Standard deviations"}, 0, 4, 0.1},
  Paneled -> False,
  ContinuousAction -> True,
  TrackedSymbols :> {index, n, t, l, s, nstdev},
  SynchronousInitialization -> False
 {{dat, trainingData, "Dataset"}, {trainingData -> "Training data", 
   testData -> "Test data"}},
 TrackedSymbols :> {dat},
 SynchronousInitialization -> False

enter image description here

The results on the validation set do not look great, but the comparison between the first 4 years of the data and the last one may not be fair. Or it may simply be necessary to involve many more indices before it becomes feasible to do extrapolations of this type. At any rate, I hope that this code is useful to other users here! I will probably tweak bits of code here and there over the next few days and maybe add more commentary, so be sure to check back occasionally if you're interested.

Update Some of the code needed to be updated slightly to work in the recently released Mathematica V12. The SequenceAttentionLayer is being phased out, but still works (see the "Properties & Relations" documentation for AttentionLayer for information about how to replace SequenceAttentionLayer with AttentionLayer). The financial data also needed a few more processing steps in V12 before it can be used. Please see the attached notebook for the V12 version of the code.

Update 2 (10 June 2019) I attached another notebook with code for a neural network that does not use teacher forcing but instead feeds its own predictions back into itself during training. This makes sure that the network is trained in exactly the same way as it is used afterwards. This also removes the need to use a NestList afterwards to generate the predictions. Training the network becomes trickier, though, and you need to experiment more with learning rate multipliers and gradient clipping to get convergence.

Update 3 (13 July 2020) I just updated the notebooks to make sure that all the code works correctly in Mathematica V12.1

To make this code work, you may need to update your neural network paclet first by evaluating:

20 Replies

enter image description here - Congratulations! This post is now a Staff Pick as distinguished by a badge on your profile! Thank you, keep it coming, and consider contributing your work to the The Notebook Archive!

Posted 2 years ago

Sjoerd, could you update it for Mathematica 12 and provide a link to a notebook?

Thanks in advance.

Sure, I'd be happy to :). I can't guarantee that I can do that right away, but be sure to check back in a few days.

Posted 1 month ago

Dear Sjoerd,

May I have your email address? I have a question about the LSTM model.

Thank you so much.

Hi Alex. Please check my profile for my LinkedIn account. You can send me a personal message there and maybe set up a Zoom call. I prefer not to post my e-mail address publicly; I hope you understand. I'm also a bit busy at the moment, so please forgive me if I don't respond too quickly.

Posted 2 years ago

Thank you, looking forward to this.

I'm actually struggling with V12 myself, so apologies if it takes a bit longer than expected ;).

@jerome: I just updated the main post with a notebook with the V12 version of the code.

Posted 2 years ago

i am also stuck at the same thing looking forward for solution.

What are you stuck on? Have you checked the notebook I attached to the original post this morning? That code should work in V12.

Posted 2 years ago
Posted 2 years ago

@Sjoerd Smit How do you think the model can be improved?

One method that I want to try and implement once my V12 installation works properly again, is to change the way the network is trained. Right now, the output sequence is fed into the Prediction port with a delay to simulate the predictions the network will make. However, with the changes to NetFoldOperator (in particular the ability to feed constant vectors into these layers), it should be possible to train the network on its own predictions, making the training process more similar to the way it is used afterwards. That's what I want to try out next.

The most recent paclet update to the neural networks fixed a bug in NetFoldOperator that stopped me before from being able to use it for this application. I updated the main post with a new notebook, so please take a look if you're interested.

Excellent work! Now it works with my 12-version. But I had to update my Ndvidia driver. One more question. I have own time series Financial data and tried to work with that, but I did not manage to do it. Is there any suggestion how to use them? Thanks!

Hi Christos. Glad to hear it's working. As for using other financial data: without further details I can only recommend that you first try to mold your data into a matrix with the same shape as trainingData in my example. That is: a matrix with the time-axis (regularly sampled) as the first dimension and the different indices (or whatever you're trying to predict) in the 2nd dimension. After that, you should be able to generate inputs for the network with the line

trainingExamples = genFun[trainingData, {10, 6, 1}][<|"BatchSize" -> 3|>]

Next, verify that the loss network (i.e., the one that goes into NetTrain) initializes correctly with NetInitialize and returns real-valued losses when you apply it to the trainingExamples you just generated.

Thanks for your quick answer! I will try later and hope it will work (I correct exams right now...)

Posted 1 month ago

Hi Sjoerd I looked at the NetFoldOperator version of your seq2seq regression but couldn't get it to evaluate: decoderNetInternal function fails complaining about CatenateLayer "Type inconsistency in CatenateLayer: \!(\"specified level (2) cannot exceed rank of lowest-rank input (Min[1, Var[0]])\")." Might this be a version related issue? I'm running version 12.2.

Hello Steve. Yes, it seems there was a small change in how types are inferred in the network since V12.2. You should be able to fix the issue by changing:

{"rec", "attention"} ->  "catenate" -> "lin" -> NetPort["Prediction"]


{NetPort["rec", "Output"], "attention"} ->  "catenate" -> "lin" -> NetPort["Prediction"]

I updated the notebook accordingly. I haven't tested the rest of the code in 12.2, so thanks for bringing this to my attention!

Posted 1 month ago

Hi Sjoerd the changes you gave fixed the issue, thank you (although I haven't tried training it yet). I'll try it out with version 11.3 when that's available.

Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
or Discard

Group Abstract Group Abstract