WOLFRAM COMMUNITY


NetDecoder[ ] failed to decode NetEncoder["AudioSTFT"] output?

John M.

Posted 3 years ago

Question: Mathematica's `ShortTimeFourier[]` & `InverseShortTimeFourier[]` functions work well, e.g.:

audio = Import["C:\\folder\\file.wav"]
ShortTimeFourier[audio, 4096, 512, HannWindow]

gives a `ShortTimeFourierData[]` object, and an audio file/signal is easily reconstructed from it using:

InverseShortTimeFourier[%, 4096, 512, HannWindow]

How can an audio file/signal encoded by `NetEncoder["AudioSTFT"]` be equivalently reconstructed by `NetDecoder[]`?

The Mathematica documentation reads: "`NetDecoder[NetEncoder[...]]` will create a decoder based on the parameters of an existing encoder." But when I define a `NetEncoder[]`, e.g.:

enc = NetEncoder[{"AudioSTFT", "WindowSize" -> 4096, "Offset" -> 2048, "SampleRate" -> 16000}]

and then try to define a `NetDecoder[]` according to that documentation, e.g.:

dec = NetDecoder[enc]

or

dec = NetDecoder[NetEncoder[{"AudioSTFT", "WindowSize" -> 4096, "Offset" -> 2048, "SampleRate" -> 16000}]]

the output is always `$Failed`.

How can I write a `NetDecoder[]` that decodes `NetEncoder["AudioSTFT"]` output in the same way that, e.g., `NetDecoder["Image"]` decodes `NetEncoder["Image"]` output, or `InverseShortTimeFourier[]` reconstructs an audio file/signal from the `ShortTimeFourierData[]` objects given by `ShortTimeFourier[]`? Can someone please help me?

I've tried many different `NetEncoder["Function"]` / `NetDecoder["Function"]` variations to encode/decode input/output, e.g.:

STFenc[x_] := ShortTimeFourier[x, 4096, 512, HannWindow]
ISTFdec[x_] := InverseShortTimeFourier[x, 4096, 512, HannWindow]
enc = NetEncoder[{"Function", STFenc[#] &}]
dec = NetDecoder[{"Function", ISTFdec[#] &}]

But however I try wrapping `ShortTimeFourier[]` / `InverseShortTimeFourier[]` into `NetEncoder[]` / `NetDecoder[]` using their `"Function"` option, I get errors like: "expecting a ShortTimeFourierData object or a numeric matrix instead of Numeric Array, etc."

Again, what I'm asking here is this: how do I reconstruct an audio file/signal using `NetDecoder[]` after it's been encoded using `NetEncoder["AudioSTFT"]`, or some variation of `NetEncoder[{"Function", ...}]` where the function is `ShortTimeFourier[]`? Please let me know if you have any info, tips, or ideas. Thank you.

POSTED BY: John M.

6 Replies


John M.

Posted 3 years ago

Thank you so much for taking the time to help me, Jérôme. All the audio signals in the dataset are 1.024 s at 16 kHz (i.e., 16384 samples in length):

In[]:= test = Import["C:\\folder\\example.wav"];
In[]:= Information[test]
Out[]:= [output image not preserved]

The reason my "c" is 2 is that those are the dimensions given by the encoder:

In[]:= Dimensions[enc@test]
Out[]:= {256,256,2}

The 2 channels are the R & I parts:

...Block[
  {data = ShortTimeFourier[#, window, offset, HannWindow]["Data"]},
  ArrayPad[Transpose[{Re[data], Im[data]}, {3, 1, 2}], {{pad}, {0}, {0}}]
 ], {"Varying", window, 2}}...

If I even try to define my discriminator beginning from {256,256,1} using this encoder, it throws this error:

NetChain::invspenc: NetEncoder[{"Function", …}] producing a n×256×2 array of real numbers, cannot be attached to port Input, which must be a 256×256×1 array.

Any ideas on how I could drop the I part & implement signal reconstruction from magnitude (e.g., Griffin-Lim, etc.) on the decoder end?

Passing BOTH the R & I parts through the net isn't the only thing that isn't in the paper. `ArrayPad[]`ing the {128, 256, 2} data to {256, 256, 2} in the encoder:

...ArrayPad[Transpose[{Re[data], Im[data]}, {3, 1, 2}], {{pad}, {0}, {0}}]...

also differs from the paper & from every implementation (e.g., MATLAB, Python, etc.) I've seen, because they somehow begin with {128,128,1} from STFTs with a 256 window & a 128 offset. I can't figure out how a 256 window & a 128 offset can be cropped to 128x128 & still be reconstructed using the decoder/`InverseShortTimeFourier[]` (or without passing the I parts through the net).

Here is the MATLAB example. I don't understand either language well enough to translate this into a Mathematica `NetEncoder[]`, but I was able to underline where, firstly, it ensures the training data is 128x128, & secondly, it converts the data from {128,128,2} to {128,128,1}:

[screenshot of the MATLAB example not preserved]

It's definitely something with the encoder/decoder that's off right now: either {n,n,2} needs to be {n,n,1}, &/or `ArrayPad[]` needs to be changed so the data is 128x128 (still with 256-window/128-offset STFTs) rather than 256x256 (i.e., 128x256 padded to 256x256), &/or I'm still doing something wrong with the `TransposeLayer[]`s. I tried your suggestion:

discriminator = NetChain[{TransposeLayer[{2, 3, 1}] ...}, "Input" -> enc]

generator = NetChain[{... TransposeLayer[{3, 1, 2}]}, "Input" -> 100, "Output" -> dec]

&, unfortunately, it didn't help with the results at all. The net architecture/transpositions should be easy enough, though, if I can get the encoder/decoder working such that data is correctly input to & output from the net.

Here is the discriminator architecture in MATLAB & Python, & the generator architecture in MATLAB & Python:

[architecture screenshots not preserved]

It's simple enough (indeed, the simplest) to achieve that in Mathematica: I just need to figure out this encoder/decoder problem. Any tips would be much appreciated. Thank you!

POSTED BY: John M.

Jérôme Louradour, Wolfram Research

Posted 3 years ago

Sorry, you should try `TransposeLayer[{2, 3, 1}]` in the discriminator instead of `TransposeLayer[{3, 2, 1}]` (which is the same as `TransposeLayer[1 <-> 3]`), and `TransposeLayer[{3, 1, 2}]` in the generator.

BTW, when I try to use your EXAMPLE2.nb, I don't understand how the dimensions can fit. I get this error:

NetInitialize[discriminator][Audio[File["ExampleData/car.mp3"]]]
During evaluation of In[17]:= NetChain::invindata3: Data supplied to port "Input" could not be encoded; "Function" encoder did not produce an output that was a 256×256×2 array of real numbers.
Out[17]= $Failed

because indeed the NetEncoder is not producing arrays of size {256,256,2} (the first dimension varies depending on the length of the signal):

Dimensions[enc[Audio[File["ExampleData/car.mp3"]]]]
Out[18]= {2693,256,2}

Do you use audio signals (`FileNames["*.wav", NotebookDirectory[]]`) that all have a particular length?

Also, do you get why your "c" seems to be 2 while it's 1 in the paper?
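As a small illustration of what these permutations do, here is a sketch that applies plain `Transpose` to a dummy array. It assumes `TransposeLayer` uses the same convention as `Transpose` (level k of the input becomes level n_k of the output), which is consistent with the error messages reported in this thread. The sizes 5, 7 and 2 are arbitrary stand-ins for time, frequency and the Re/Im channel.

```wolfram
(* dummy array: 5 time frames, 7 frequency bins, 2 channels (Re, Im) *)
x = ConstantArray[0., {5, 7, 2}];

(* {2, 3, 1}: channel moves to the front, time/frequency order is kept *)
Dimensions[Transpose[x, {2, 3, 1}]]   (* {2, 5, 7} *)

(* {3, 2, 1}: channel moves to the front, but time and frequency swap *)
Dimensions[Transpose[x, {3, 2, 1}]]   (* {2, 7, 5} *)

(* {3, 1, 2} is the inverse of {2, 3, 1}: back to the encoder layout *)
Dimensions[Transpose[Transpose[x, {2, 3, 1}], {3, 1, 2}]]   (* {5, 7, 2} *)
```

Under that assumption, `{2, 3, 1}` in the discriminator followed by `{3, 1, 2}` in the generator keeps time and frequency in the order the decoder expects, whereas `{3, 2, 1}` swaps them.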

POSTED BY: Jérôme Louradour

John M.

Posted 3 years ago

For sure! I wish there were more examples of how to use `NetGANOperator[]` online, & I was excited when it was implemented.

I tried changing the `TransposeLayer[]`s from {3 <-> 1} to {3, 1, 2} & it gave this error:

NetChain::valfail: Validation failed for ConvolutionLayer: kernel size 4*4 cannot exceed input size 1*128 plus padding size 2*2.

Then I changed them from {3 <-> 1} to {3, 2, 1} & I could evaluate the nets, but I still got bad results from the generator after training. I even tried adjusting my parameters:

kern = {4, 4};
chan = 128;
α = 0.2;

& restructuring the generator & discriminator to follow the example more closely:

discriminator =
 NetChain[
  {
   TransposeLayer[{3, 2, 1}, "Input" -> {256, 256, 2}],
   ConvolutionLayer[chan, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> \[Alpha]],
   ConvolutionLayer[chan*2, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> \[Alpha]],
   ConvolutionLayer[chan*4, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> \[Alpha]],
   ConvolutionLayer[chan*8, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> \[Alpha]],
   ConvolutionLayer[chan*16, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> \[Alpha]],
   ConvolutionLayer[chan*32, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> \[Alpha]],
   ReshapeLayer[{4*4*128*32, 1}],
   LinearLayer[{}]
  },
  "Input" -> enc
 ]

generator =
 NetChain[
  {
   LinearLayer[{4096*4*4}],
   ReshapeLayer[{4096, 4, 4}],
   ElementwiseLayer["ReLU"],
   DeconvolutionLayer[chan*32, kern, "Stride" -> 2, PaddingSize -> 1],
   ElementwiseLayer["ReLU"],
   DeconvolutionLayer[chan*16, kern, "Stride" -> 2, PaddingSize -> 1],
   ElementwiseLayer["ReLU"],
   DeconvolutionLayer[chan*8, kern, "Stride" -> 2, PaddingSize -> 1],
   ElementwiseLayer["ReLU"],
   DeconvolutionLayer[chan*4, kern, "Stride" -> 2, PaddingSize -> 1],
   ElementwiseLayer["ReLU"],
   DeconvolutionLayer[chan*2, kern, "Stride" -> 2, PaddingSize -> 1],
   ElementwiseLayer["ReLU"],
   DeconvolutionLayer[2, kern, "Stride" -> 2, PaddingSize -> 1],
   ElementwiseLayer[Tanh],
   TransposeLayer[{3, 2, 1}]
  },
  "Input" -> 100,
  "Output" -> dec
 ]
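The spatial sizes in the two chains above can be checked by hand. The sketch below uses the standard output-size formulas for strided convolution and deconvolution, assuming they match what `ConvolutionLayer` and `DeconvolutionLayer` compute with a symmetric `PaddingSize`. The helper names `convOut` and `deconvOut` are hypothetical and are not part of the neural network framework.

```wolfram
(* hypothetical helpers: standard size formulas, assumed to match
   ConvolutionLayer / DeconvolutionLayer with symmetric padding p *)
convOut[in_, k_, s_, p_] := Floor[(in + 2 p - k)/s] + 1
deconvOut[in_, k_, s_, p_] := s (in - 1) + k - 2 p

(* discriminator: six convolutions, kernel 4, stride 2, padding 1 *)
NestList[convOut[#, 4, 2, 1] &, 256, 6]
(* {256, 128, 64, 32, 16, 8, 4} *)

(* generator: six deconvolutions with the same settings *)
NestList[deconvOut[#, 4, 2, 1] &, 4, 6]
(* {4, 8, 16, 32, 64, 128, 256} *)

(* a spatial dimension of size 2 collapses to 1 after one convolution *)
convOut[2, 4, 2, 1]
(* 1 *)
```

The first two results agree with the `ReshapeLayer` sizes used in the chains (4 x 4 with `chan*32` = 4096 channels). The last one suggests why the `{3, 1, 2}` attempt failed with "input size 1*128": the Re/Im dimension of size 2 was being treated as a spatial dimension.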

After training, though, the generator only generated noise. I'm certain it has something to do with the dimensions {256,256,2} getting somehow switched around in the net, but I don't know where/how. In the MATLAB example, the `TransposeLayer[]` equivalents come at the opposite ends of the generator & discriminator (i.e., BEFORE the `DeconvolutionLayer[]`s in the generator & AFTER the `ConvolutionLayer[]`s in the discriminator). I tried building the nets that way, but I get errors & can't evaluate the cells with my `NetChain[]`s until I do it in reverse. The dimensions in the `NetChain[]` box are also the reverse of the way the dimensions are outlined in the paper, e.g.,

[image not preserved]

I'm sure it's just a simple transposition issue. Any tips would be greatly appreciated. I'd love to get this going in Mathematica, but there are obviously some details here I'm missing.

I've included an updated EXAMPLE notebook. Thanks.

Attachments: EXAMPLE2.nb

POSTED BY: John M.

Jérôme Louradour, Wolfram Research

Posted 3 years ago

POSTED BY: Jérôme Louradour

John M.

Posted 3 years ago

Attachments: EXAMPLE.nb

POSTED BY: John M.

Jérôme Louradour, Wolfram Research

Posted 3 years ago

Indeed, the `NetDecoder[NetEncoder[...]]` conversion is not implemented for audio features, because these feature extractions are in general not invertible. I don't know how to do it with `NetEncoder["AudioSTFT"]`, but the way to do it with `NetEncoder`/`NetDecoder` `"Function"` is:

enc = NetEncoder[{"Function",
   Function @ Block[
     {data = ShortTimeFourier[#, 4096, 512, HannWindow]["Data"]},
     Transpose[{Re[data], Im[data]}, {3, 1, 2}]
    ],
   {"Varying", 4096, 2}}];

dec = NetDecoder[{"Function",
   Function @ Block[
     {re = Normal[#[[All, All, 1]]], im = Normal[#[[All, All, 2]]]},
     Audio @ InverseShortTimeFourier[re + I * im, 4096, 512, HannWindow]
    ]}];

This can be tested with

dec @ enc @ Audio[File["ExampleData/car.mp3"]]

(which gives back almost the original audio signal, with a bit of distortion).

Note that the real and imaginary parts of the FT need to be separated (neural networks need real numbers only).

Concerning the error you had, `"expecting a ShortTimeFourierData object or a numeric matrix instead of Numeric Array"`, it's just the subtlety that the input of the function `f` in `NetDecoder[{"Function", f}]` is a `NumericArray`, which can be converted to an ordinary array using `Normal`.
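To make the `NumericArray` point concrete, here is a minimal sketch on a toy 2 x 2 array (the values and the `"Real32"` type are arbitrary choices for illustration). It only shows the conversion that `Normal` performs, not the decoder itself.

```wolfram
(* a small NumericArray, similar in kind to what a "Function" decoder receives *)
na = NumericArray[{{1., 2.}, {3., 4.}}, "Real32"];

Head[na]                 (* NumericArray *)
Head[Normal[na]]         (* List *)
Dimensions[Normal[na]]   (* {2, 2} *)
```

Functions that expect an ordinary numeric matrix, such as `InverseShortTimeFourier`, can then be given `Normal[na]` rather than `na`.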

POSTED BY: Jérôme Louradour
