NetDecoder[ ] failed to decode NetEncoder["AudioSTFT"] output?

Posted 2 years ago

Question:

Mathematica's ShortTimeFourier[] & InverseShortTimeFourier[] functions work well, e.g.:

audio = Import["C:\\folder\\file.wav"]
ShortTimeFourier[audio, 4096, 512, HannWindow]

gives a ShortTimeFourierData[] object and an audio file/signal is easily reconstructed from it using:

InverseShortTimeFourier[%, 4096,512, HannWindow]

How can an audio file/signal, encoded by NetEncoder["AudioSTFT"], be equivalently reconstructed by NetDecoder[]?

The Mathematica Documentation reads:

NetDecoder[NetEncoder[...]] will create a decoder based on the parameters of an existing encoder.

But when I define a NetEncoder[], e.g.:

enc = NetEncoder[{"AudioSTFT", "WindowSize" -> 4096, "Offset" -> 2048, "SampleRate" -> 16000}]

and then try to define a NetDecoder[] according to that documentation, e.g.:

dec = NetDecoder[enc]

or

dec = NetDecoder[NetEncoder[{"AudioSTFT", "WindowSize" -> 4096, "Offset" -> 2048, "SampleRate" -> 16000}]]

The output is always $Failed.

How can I write a NetDecoder[] that can decode NetEncoder["AudioSTFT"] output in the same way, e.g., NetDecoder["Image"] decodes NetEncoder["Image"] output, or InverseShortTimeFourier[] reconstructs an audio file/signal from the ShortTimeFourierData[] objects given by ShortTimeFourier[]?

Can someone please help me?

I've tried many different NetEncoder["Function"] / NetDecoder["Function"] variations to encode/decode input/output, e.g.,

STFenc[x_]:= ShortTimeFourier[x,4096,512,HannWindow]
ISTFdec[x_]:= InverseShortTimeFourier[x,4096,512,HannWindow]
enc = NetEncoder[{"Function", STFenc[#] &}]
dec = NetDecoder[{"Function", ISTFdec[#] &}]

But whenever/however I try wrapping the ShortTimeFourier[] /InverseShortTimeFourier[] into NetEncoder[]/NetDecoder[] using their "Function" option, I get errors like: "expecting a ShortTimeFourierData object or a numeric matrix instead of Numeric Array, etc."

Again, what I'm asking here is this: how do I reconstruct an audio file/signal using NetDecoder[] after it's been encoded using NetEncoder["AudioSTFT"] or some variation of NetEncoder[{"Function"}] where "Function" is ShortTimeFourier[]?

Please let me know if you have any info, tips, ideas.
Thank You.

POSTED BY: John M.
6 Replies
Posted 2 years ago

Thank you so much for taking the time to help me, Jérôme.

All the audio signals in the dataset are 1.024 s at 16000 Hz (i.e., 16384 samples in length):

In[]:= test = Import["C:\\folder\\example.wav"];
In[]:= Information[test]

Out[]= enter image description here

The reason my "c" is 2 is that those are the dimensions output by the encoder:

In[]:= Dimensions[enc@test]
Out[]:= {256,256,2}

The 2 channels are the R & I parts:

     ...Block[
      {data = ShortTimeFourier[#, window, offset, HannWindow]["Data"]
       },
      ArrayPad[
       Transpose[
        {
         Re[data], Im[data]
         }, {3, 1, 2}
        ], {{pad}, {0}, {0}}
       ]
      ], {"Varying", window, 2}
    }...

If I even try to define my discriminator beginning from {256,256,1} using this encoder it throws this error:

NetChain::invspenc: NetEncoder[{"Function", \[Ellipsis]}] producing a n*256*2 array of real numbers, cannot be attached to port Input, which must be a 256*256*1 array.

Any ideas on how I could drop the I part & implement signal reconstruction from magnitude (e.g., Griffin-Lim, etc.) on the decoder end?
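As a rough illustration of what dropping the imaginary part involves, the sketch below splits a complex STFT-like matrix into magnitude and phase. The matrix is random stand-in data with the {128, 256} shape mentioned in the thread, not real ShortTimeFourier[] output, and the names `mag`, `phase` and `magInput` are made up for this example. It only shows that magnitude plus phase recovers the data and that a magnitude-only input has a single channel. A Griffin-Lim style decoder would have to estimate `phase` iteratively, which is not attempted here.

```mathematica
(* Stand-in for ShortTimeFourier[...]["Data"]: 128 frames x 256 bins (assumed shape) *)
data = RandomComplex[{-1 - I, 1 + I}, {128, 256}];

(* Polar decomposition: magnitude and phase of every bin *)
mag = Abs[data];
phase = Arg[data];

(* Magnitude and phase together reproduce the complex data *)
roundTripOK = Max[Abs[mag*Exp[I*phase] - data]] < 10^-10;

(* A magnitude-only net input with one channel: {frames, bins, 1} *)
magInput = Transpose[{mag}, {3, 1, 2}];
Dimensions[magInput]
```

With only `mag` passed through the net, the decoder no longer receives `phase`, so InverseShortTimeFourier[] cannot be applied directly to the net output without some phase estimate.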

Passing the R & I parts BOTH through the net isn't the only thing that isn't in the paper. ArrayPad[]ing the {128, 256, 2} data to {256, 256, 2} in the encoder:

...ArrayPad[
 Transpose[
  {
   Re[data], Im[data]
   }, {3, 1, 2}
  ], {{pad}, {0}, {0}}
 ]...

also differs from the paper & from every implementation (e.g., MATLAB, Python, etc.) I've seen, because they somehow begin with {128,128,1} from STFTs with a 256 window & a 128 offset.

I can't figure out how an STFT with a 256 window & a 128 offset can be cropped to 128x128 & still be reconstructed using the decoder/InverseShortTimeFourier[] (or without passing the I parts through the net).
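One possible explanation for the 128 frequency bins, offered as a sketch rather than a confirmed reading of the MATLAB or Python code: the DFT of a real frame is conjugate-symmetric, so a 256-point spectrum is fully determined by its first 129 bins. The example below uses a random real frame and plain Fourier[] as a stand-in for one row of the STFT data. It assumes each row is the full two-sided DFT of a real frame, as the {128, 256} dimensions in the thread suggest.

```mathematica
(* One real-valued frame of 256 samples (random stand-in data) *)
frame = RandomReal[{-1, 1}, 256];
spec = Fourier[frame];

(* Keep only bins 0..128 (positions 1..129), the "one-sided" spectrum *)
half = spec[[1 ;; 129]];

(* Rebuild bins 129..255 from the conjugates of bins 127..1 *)
full = Join[half, Conjugate[Reverse[half[[2 ;; 128]]]]];

(* The rebuilt spectrum matches the original up to rounding *)
symmetricOK = Max[Abs[full - spec]] < 10^-10;
{Length[half], Length[full], symmetricOK}
```

Going from 129 bins to 128 would then mean discarding one more bin (for instance the last one), which makes the reconstruction approximate rather than exact. Whether the other implementations do exactly this is an assumption here.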

Here is the MATLAB example. I don't understand either language well enough to translate this into a Mathematica NetEncoder[], but I was able to underline where, firstly, it ensures the training data is 128x128, & secondly, it converts the data from {128,128,2} to {128,128,1}:

enter image description here

It's definitely something with the encoder/decoder that's off right now: either {n,n,2} needs to be {n,n,1}, &/or ArrayPad[] needs to be changed so the data is 128x128 (with 256 window/128 offset STFTs) rather than 256x256 (i.e., 128x256 padded to 256x256), &/or I'm still doing something wrong with the TransposeLayer[]s. I tried your suggestion:

discriminator =
NetChain[
{
TransposeLayer[{2, 3, 1}]
...
},
"Input" -> enc
]

.

generator =
 NetChain[
  {
   ...
TransposeLayer[{3, 1, 2}]
  },
 "Input" -> 100,
 "Output" -> dec
 ]

&, unfortunately, it didn't help with the results at all.

The net architecture/transpositions should be easy enough, though, if I can get the encoder/decoders working such that data is correctly input & output of the net.

Here is the discriminator architecture in some other languages, MATLAB:

enter image description here

& Python:

enter image description here

& the generator architecture in MATLAB:

enter image description here

& Python

enter image description here

It's simple enough (indeed, the simplest) to achieve that in Mathematica:

enter image description here enter image description here

I just need to figure out this encoder/decoder problem.

Any tips would be much appreciated. Thank you!

POSTED BY: John M.

Sorry, you should try TransposeLayer[{2, 3, 1}] in the discriminator instead of TransposeLayer[{3, 2, 1}] (which is the same as TransposeLayer[1 <-> 3]).

And TransposeLayer[{3, 1, 2}] in the generator.

BTW, when I try to use your EXAMPLE2.nb, I don't understand how it can fit the dimensions. I have this error:

NetInitialize[discriminator][Audio[File["ExampleData/car.mp3"]]]
During evaluation of In[17]:= NetChain::invindata3: Data supplied to port "Input" could not be encoded; "Function" encoder did not produce an output that was a 256*256*2 array of real numbers.
Out[17]= $Failed

because indeed the NetEncoder is not producing arrays of size {256,256,2} (the first dimension varies depending on the length of the signal):

Dimensions[enc[Audio[File["ExampleData/car.mp3"]]]]
Out[18]= {2693,256,2}

Do you use audio signals (FileNames["*.wav", NotebookDirectory[]]) that all have a particular length?

Also, do you get why your "c" seems to be 2 while it's 1 in the paper?

Happy to see GANs with audio in the Wolfram Language :)

Quick guess: Can you try TransposeLayer[{3, 1, 2}] instead of TransposeLayer[{1 <-> 3}] and TransposeLayer[{3 <-> 1}]

Posted 2 years ago

For sure! I wish there were more examples of how to use NetGANOperator[] online, & I was excited when it was implemented.

I tried changing the TransposeLayer[]s from {3 <-> 1} to {3, 1, 2} & it gave this error:

NetChain::valfail: Validation failed for ConvolutionLayer: kernel size 4*4 cannot exceed input size 1*128 plus padding size 2*2.
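The error above is consistent with how the permutation reorders the dimensions. The sketch below applies the three permutations from this thread to a dummy {256, 256, 2} array with the ordinary Transpose[] function. It assumes TransposeLayer[] follows the same convention as Transpose[] (level k of the input becomes level n_k of the output), which matches the reported error but is not verified against the layer itself.

```mathematica
(* Dummy array with the encoder's output shape: {time, frequency, channel} *)
x = ConstantArray[0., {256, 256, 2}];

(* {3, 1, 2}: the channel axis lands in the middle, leaving a 2 x 256 "image" *)
d312 = Dimensions[Transpose[x, {3, 1, 2}]];

(* {2, 3, 1}: channels first, time and frequency keep their order *)
d231 = Dimensions[Transpose[x, {2, 3, 1}]];

(* {3, 2, 1}: channels first, but time and frequency are swapped *)
d321 = Dimensions[Transpose[x, {3, 2, 1}]];

{d312, d231, d321}
```

With {3, 1, 2} the first convolution would see 256 channels over a 2 x 256 spatial grid, and a stride-2 convolution would shrink that to 1 x 128, which is the input size quoted in the error message.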

Then I changed them from {3 <-> 1} to {3, 2, 1} & I could evaluate the nets, but I still got bad results from the generator after training. I even tried adjusting my parameters:

kern = {4, 4};
chan = 128;
α = 0.2;

& restructuring the generator & discriminator more closely following the example :

discriminator =
 NetChain[
  {
   TransposeLayer[{3, 2, 1}, "Input" -> {256, 256, 2}],
   ConvolutionLayer[chan, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> \[Alpha]],
   ConvolutionLayer[chan*2, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> \[Alpha]],
   ConvolutionLayer[chan*4, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> \[Alpha]],
   ConvolutionLayer[chan*8, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> \[Alpha]],
   ConvolutionLayer[chan*16, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> \[Alpha]],
   ConvolutionLayer[chan*32, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> \[Alpha]],
   ReshapeLayer[{4*4*128*32, 1}],
   LinearLayer[{}]
   }, "Input" -> enc

  ]

.

generator =
NetChain[
{

LinearLayer[{4096*4*4 }],
ReshapeLayer[{4096, 4, 4}],
ElementwiseLayer["ReLU"],
DeconvolutionLayer[chan*32, kern, "Stride" -> 2, PaddingSize -> 1],
ElementwiseLayer["ReLU"],
DeconvolutionLayer[chan*16, kern, "Stride" -> 2, PaddingSize -> 1],
ElementwiseLayer["ReLU"],
DeconvolutionLayer[chan*8, kern, "Stride" -> 2, PaddingSize -> 1],
ElementwiseLayer["ReLU"],
DeconvolutionLayer[chan*4, kern, "Stride" -> 2, PaddingSize -> 1],
ElementwiseLayer["ReLU"],
DeconvolutionLayer[chan*2, kern, "Stride" -> 2, PaddingSize -> 1],
ElementwiseLayer["ReLU"],
DeconvolutionLayer[2, kern, "Stride" -> 2, PaddingSize -> 1],
ElementwiseLayer[Tanh],
TransposeLayer[{3, 2, 1}]
},
"Input" -> 100,
"Output" -> dec
]

After training, though, the generator only generated noise. I'm certain it has something to do with the dimensions {256,256,2} getting somehow switched around in the net, but I don't know where/how. In the MATLAB example, the TransposeLayer[] equivalents come at the opposite ends of the generator & discriminator (i.e., BEFORE the DeconvolutionLayer[]s in the generator & AFTER the ConvolutionLayer[]s in the discriminator). I tried building the nets that way, but I get errors & can't evaluate the cells with my NetChain[]s until I do it in reverse. The dimensions in the NetChain[] box are also the reverse of the way the dimensions are outlined in the paper, e.g.,

enter image description here enter image description here

I'm sure it's just a simple transposition issue, any tips would be greatly appreciated, I'd love to get this going in Mathematica but there are obviously some details here I'm missing.

I've included an updated EXAMPLE notebook. Thanks.

Attachments:
POSTED BY: John M.

Indeed, the NetDecoder[NetEncoder[...]] conversion is not implemented for audio features, because these feature extractions are in general not invertible.

I don't know how to do it with NetEncoder["AudioSTFT"], but the way to do it with NetEncoder/NetDecoder["Function"] is:

enc = NetEncoder[{"Function",
	Function @ Block[
		{data = ShortTimeFourier[#, 4096, 512, HannWindow]["Data"]},
		Transpose[{Re[data], Im[data]}, {3,1,2}]
	],
	{"Varying", 4096, 2}}];
dec = NetDecoder[{"Function",
	Function @ Block[
		{re = Normal[#[[All,All,1]]], im = Normal[#[[All,All,2]]]},
		Audio @ InverseShortTimeFourier[re + I * im, 4096, 512, HannWindow]
	]}];

Which can be tested with

dec @ enc @ Audio[File["ExampleData/car.mp3"]]

(that gives back almost the original audio signal, with a bit of distortion).

Note that the real and imaginary parts of the FT need to be separated (neural networks need real numbers only).

Concerning the error you had, "expecting a ShortTimeFourierData object or a numeric matrix instead of Numeric Array", it's just the subtlety that the input of the function f in NetDecoder[{"Function", f}] is a NumericArray, which can be converted to an ordinary array using Normal.
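The NumericArray point can be checked on a small array without any net. This is a minimal sketch with made-up values, and the name `na` is arbitrary.

```mathematica
(* A small NumericArray, similar in kind to what a "Function" decoder receives *)
na = NumericArray[{{1., 2.}, {3., 4.}}, "Real32"];

(* It is not an ordinary list until Normal is applied *)
headBefore = Head[na];
headAfter = Head[Normal[na]];

{headBefore, headAfter, Dimensions[Normal[na]]}
```

Functions that expect a plain numeric matrix, such as InverseShortTimeFourier[], may therefore need Normal applied first, as in the decoder above.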

Posted 2 years ago

Thank you much, Jérôme, the code you provide above works perfectly.

Unfortunately, however, I'm still having problems with NetTrain[] using these encoders/decoders. Maybe you, or someone at Wolfram Research, could help me out?

I'm attempting adversarial audio synthesis (as described in this paper & demonstrated in this MATLAB example) using Mathematica.

I define my STFT parameters:

window = 256;
offset = 128;
pad = (window - offset)/2;
samplerate = 16000;

I define my NetEncoder[] using the code you provide above. Notice, however, I also include/use ArrayPad[] so the output is square (i.e., {256, 256, 2} instead of {128, 256, 2} outputs):

enc = NetEncoder[
   {"Function",
    Function@
     Block[
      {data = ShortTimeFourier[#, window, offset, HannWindow]["Data"]
       },
      ArrayPad[
       Transpose[
        {
         Re[data], Im[data]
         }, {3, 1, 2}
        ], {{pad}, {0}, {0}}
       ]
      ], {"Varying", window, 2}
    }
   ];

I define my NetDecoder[]. Notice, again, I also include/use ArrayPad[] so the encoder padding is removed (i.e., {128, 256, 2} instead of {256, 256, 2} outputs for the InverseShortTimeFourier[]):

dec = NetDecoder[
   {"Function",
    Function@
     Block[
      {re = ArrayPad[Normal[#[[All, All, 1]]], {{-pad}, {0}}], 
       im = ArrayPad[Normal[#[[All, All, 2]]], {{-pad}, {0}}]
       },
      Audio[
       InverseShortTimeFourier[
        re + I*im, window, offset, HannWindow
        ], SampleRate -> samplerate
       ]
      ]
    }
   ];

The test you provide above still works perfectly even with my ArrayPad[] changes:

dec @ enc @ Audio[File["ExampleData/car.mp3"]]
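The padding arithmetic used by this encoder/decoder pair can be sketched on a dummy array, independent of any audio. The example below assumes the {128, 256, 2} shape reported in the thread and writes the padding amounts out explicitly as {before, after} pairs, using pad = (256 - 128)/2 = 64. The array and variable names are illustrative only.

```mathematica
pad = (256 - 128)/2;  (* 64, as in the parameters above *)

(* Dummy encoder output before padding: {frames, bins, Re/Im} *)
x = ConstantArray[1., {128, 256, 2}];

(* Pad 64 zero frames before and after along the first (time) axis *)
padded = ArrayPad[x, {{pad, pad}, {0, 0}, {0, 0}}];

(* Negative amounts remove the same frames again, here on the real-part slice *)
unpadded = ArrayPad[padded[[All, All, 1]], {{-pad, -pad}, {0, 0}}];

{Dimensions[padded], Dimensions[unpadded]}
```

Half of the padded frames are then exactly zero, so a generator trained on this data also has to learn to output (near) zeros in those rows, which the decoder discards.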

I start running into problems, however, when I attempt to train a net with these encoders/decoders, so, e.g., I define my net parameters:

kern = {4, 4};
chan = 64;
α = 0.2;

I define my discriminator (Table 5 in [1]), adding an additional ConvolutionLayer[] for 256x256:

discriminator =
 NetChain[
  {
   TransposeLayer[{3 <-> 1}, "Input" -> {256, 256, 2}],
   ConvolutionLayer[chan, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> α],
   ConvolutionLayer[chan*2, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> α],
   ConvolutionLayer[chan*4, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> α],
   ConvolutionLayer[chan*8, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> α],
   ConvolutionLayer[chan*16, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> α],
   ConvolutionLayer[chan*32, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> α],
   LinearLayer[{}]
   },
  "Input" -> enc
  ]

Out[]= enter image description here

I define my generator (Table 4 in [1]), adding an additional DeconvolutionLayer[] for 256x256:

generator =
NetChain[
{
LinearLayer[chan*(window*2)],
ReshapeLayer[{2048, 4, 4}],
ElementwiseLayer["ReLU"],
DeconvolutionLayer[chan*16, kern, "Stride" -> 2, PaddingSize -> 1],
ElementwiseLayer["ReLU"],
DeconvolutionLayer[chan*8, kern, "Stride" -> 2, PaddingSize -> 1],
ElementwiseLayer["ReLU"],
DeconvolutionLayer[chan*4, kern, "Stride" -> 2, PaddingSize -> 1],
ElementwiseLayer["ReLU"],
DeconvolutionLayer[chan*2, kern, "Stride" -> 2, PaddingSize -> 1],
ElementwiseLayer["ReLU"],
DeconvolutionLayer[chan, kern, "Stride" -> 2, PaddingSize -> 1],
ElementwiseLayer["ReLU"],
DeconvolutionLayer[2, kern, "Stride" -> 2, PaddingSize -> 1],
ElementwiseLayer[Tanh],
TransposeLayer[{1 <-> 3}]
},
"Input" -> 100,
"Output" -> dec
]

Out[]= enter image description here
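The spatial sizes through both nets can be checked with the usual convolution arithmetic. This is a sketch under the assumption that ConvolutionLayer[] and DeconvolutionLayer[] follow the standard output-size formulas for kernel 4, stride 2 and padding 1. The helper names `convOut` and `deconvOut` are made up for this example.

```mathematica
(* Standard output-size formulas (assumed to match the layers used here) *)
convOut[n_, k_, s_, p_] := Floor[(n + 2 p - k)/s] + 1;
deconvOut[n_, k_, s_, p_] := (n - 1) s + k - 2 p;

(* Discriminator: six stride-2 convolutions starting from 256 *)
discSizes = NestList[convOut[#, 4, 2, 1] &, 256, 6];

(* Generator: six stride-2 deconvolutions starting from 4 *)
genSizes = NestList[deconvOut[#, 4, 2, 1] &, 4, 6];

(* The linear layer must supply 2048*4*4 values for the first reshape *)
linearOK = 64*(256*2) == 2048*4*4;

{discSizes, genSizes, linearOK}
```

Under these formulas each layer halves or doubles the spatial size, so the two nets are dimensionally consistent with 256 x 256 inputs. That makes a pure shape mismatch a less likely cause of the poor results than the data layout or value range.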

I load my training data (a dataset of wav files, each 16384 samples in length [i.e., 1.024 s at 16000 Hz]):

wav = FileNames["*.wav", NotebookDirectory[]];
trainingData = Import[#] & /@ wav;

I define my "latent" input for training (following the documentation for NetGANOperator[]):

RandomLatent[batchSize_] := 
  Map[NumericArray[#, "Real32"] &, 
   RandomVariate[NormalDistribution[], {batchSize, 100}]];
datagen = 
  Function[<|"Sample" -> RandomSample[trainingData, #BatchSize], 
    "Latent" -> RandomLatent[#BatchSize]|>];

I train the GAN:

gan = NetGANOperator[{generator, discriminator}];

trained = NetTrain[
  gan,
  {
   datagen,
   "RoundLength" -> Length[trainingData]
   },
  TrainingUpdateSchedule -> {"Discriminator", "Generator"},
  Method -> {"ADAM", "Beta1" -> 0.5, "LearningRate" -> 0.0002},
  BatchSize -> 64,
  MaxTrainingRounds -> 100,
  TargetDevice -> "GPU"]

However, when I generate new samples from the trained generator:

trainedgen = NetExtract[trained, "Generator"];
trainedgen[RandomLatent[1]]

it's clear that something is wrong with my net architecture or my decoders/encoders or something because the generated samples don't resemble:

dec @ enc @ Audio[File["ExampleData/car.mp3"]]

they're either silent or just noise. I can't figure it out.

I must be doing something wrong because I've used Mathematica's NetTrain[] on image datasets with similar architectures & I've never run into this problem (i.e., the generated samples always resemble the input).

Could it be I've set some parameter wrong?

Could it be the TransposeLayer[]s in my generator & discriminator are somehow making the data unusable?

Could it be the function (RandomLatent[]) I'm using for "latent" input for the generator is suitable only for image datasets & not for audio/STFT datasets?

Any help you, Jérôme, or anyone at Wolfram Research, could give me would be greatly appreciated. I'd really like to figure out how to NetTrain[] GANs with these encoders/decoders in Mathematica.

I've included an EXAMPLE notebook. Thank You.

Attachments:
POSTED BY: John M.