NetDecoder[ ] failed to decode NetEncoder["AudioSTFT"] output?

Posted 3 years ago

Question:

Mathematica's ShortTimeFourier[] & InverseShortTimeFourier[] functions work well, e.g.:

audio = Import["C:\\folder\\file.wav"]
ShortTimeFourier[audio, 4096, 512, HannWindow]

gives a ShortTimeFourierData[] object and an audio file/signal is easily reconstructed from it using:

InverseShortTimeFourier[%, 4096,512, HannWindow]

How can an audio file/signal, encoded by NetEncoder["AudioSTFT"], be equivalently reconstructed by NetDecoder[]?

The Mathematica Documentation reads:

NetDecoder[NetEncoder[...]] will create a decoder based on the parameters of an existing encoder.

But when I define a NetEncoder[], e.g.:

enc = NetEncoder[{"AudioSTFT", "WindowSize" -> 4096, "Offset" -> 2048, "SampleRate" -> 16000}]

and then try to define a NetDecoder[] according to that documentation, e.g.:

dec = NetDecoder[enc]

or

dec = NetDecoder[NetEncoder[{"AudioSTFT", "WindowSize" -> 4096, "Offset" -> 2048, "SampleRate" -> 16000}]]

The output is always $Failed.

How can I write a NetDecoder[] that decodes NetEncoder["AudioSTFT"] output in the same way that, e.g., NetDecoder["Image"] decodes NetEncoder["Image"] output, or InverseShortTimeFourier[] reconstructs an audio file/signal from the ShortTimeFourierData[] objects given by ShortTimeFourier[]?

Can someone please help me?

I've tried many different NetEncoder["Function"] / NetDecoder["Function"] variations to encode/decode input/output, e.g.,

STFenc[x_]:= ShortTimeFourier[x,4096,512,HannWindow]
ISTFdec[x_]:= InverseShortTimeFourier[x,4096,512,HannWindow]
enc = NetEncoder[{"Function", STFenc[#] &}]
dec = NetDecoder[{"Function", ISTFdec[#] &}]

But whenever/however I try wrapping the ShortTimeFourier[] /InverseShortTimeFourier[] into NetEncoder[]/NetDecoder[] using their "Function" option, I get errors like: "expecting a ShortTimeFourierData object or a numeric matrix instead of Numeric Array, etc."

Again, what I'm asking here is this: how do I reconstruct an audio file/signal using NetDecoder[] after it's been encoded using NetEncoder["AudioSTFT"] or some variation of NetEncoder[{"Function"}] where "Function" is ShortTimeFourier[]?

Please let me know if you have any info, tips, ideas.
Thank You.

POSTED BY: John M.
6 Replies
Posted 3 years ago

Thank you so much for taking the time to help me, Jérôme.

All the audio signals in the dataset are 1.024 s at 16000 Hz (i.e., 16384 samples in length):

In[]:= test = Import["C:\\folder\\example.wav"];
In[]:= Information[test]

Out[]:= [image]

The reason my "c" is 2 is that those are the dimensions output by the encoder:

In[]:= Dimensions[enc@test]
Out[]:= {256,256,2}

The 2 channels are the R & I parts:

     ...Block[
      {data = ShortTimeFourier[#, window, offset, HannWindow]["Data"]
       },
      ArrayPad[
       Transpose[
        {
         Re[data], Im[data]
         }, {3, 1, 2}
        ], {{pad}, {0}, {0}}
       ]
      ], {"Varying", window, 2}
    }...

If I even try to define my discriminator beginning from {256,256,1} using this encoder it throws this error:

NetChain::invspenc: NetEncoder[{"Function", \[Ellipsis]}] producing a n*256*2 array of real numbers, cannot be attached to port Input, which must be a 256*256*1 array.

Any ideas on how I could drop the I part & implement signal reconstruction from magnitude (e.g., Griffin-Lim, etc.) on the decoder end?
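
As a minimal sketch of what dropping the imaginary part would involve (not a Griffin-Lim implementation): a magnitude-only pipeline usually keeps Abs[] of the STFT rather than Re[], & the discarded phase is what a method like Griffin-Lim would have to estimate. Here `data` is a made-up stand-in for a ShortTimeFourier[...]["Data"] matrix, & the variable names are only illustrative.

```wolfram
(* Stand-in for an STFT matrix; the values are made up for illustration. *)
data = {{1. + 2. I, 3. - 1. I}, {-2. + 0.5 I, 0.5 + 0.5 I}};

mag = Abs[data];     (* real-valued: what a one-channel {n, n, 1} net would see *)
phase = Arg[data];   (* real-valued angles: discarded in a magnitude-only setup *)

(* With the true phase, magnitude and phase recombine into the complex matrix. *)
rebuilt = mag*Exp[I*phase];
Chop[rebuilt - data]   (* a matrix of zeros, up to rounding *)
```

Without `phase`, the decoder only has `mag`, so some phase estimate has to be supplied before InverseShortTimeFourier[] can be applied.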

Passing the R & I parts BOTH through the net isn't the only thing that isn't in the paper. ArrayPad[]ing the {128, 256, 2} data to {256, 256, 2} in the encoder:

...ArrayPad[
 Transpose[
  {
   Re[data], Im[data]
   }, {3, 1, 2}
  ], {{pad}, {0}, {0}}
 ]...

also differs from the paper & from every implementation (e.g., MATLAB, Python, etc.) I've seen (i.e., because they somehow begin with {128,128,1} from STFTs with a 256 window & a 128 offset).

I can't figure out how an STFT with a 256 window & a 128 offset can be cropped to 128x128 & still be reconstructed using the decoder/InverseShortTimeFourier[] (or without passing the I parts through the net).
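
One possible explanation, which is an assumption & not something confirmed by the paper or the screenshots: the spectrum of a real signal is conjugate-symmetric, so a 256-point frame has only 129 independent bins (0 to 128), & an implementation could keep those & drop one more to reach 128. The sketch below checks the symmetry with Fourier[] on a random real frame; the frame itself is made up for illustration.

```wolfram
(* A random real frame of length 256, matching the window size discussed above. *)
SeedRandom[1];
frame = RandomReal[{-1, 1}, 256];
spec = Fourier[frame];

(* For real input, bin k and bin 256 - k are complex conjugates (bin 0 excluded). *)
Max[Abs[Rest[spec] - Conjugate[Reverse[Rest[spec]]]]] < 10^-10   (* True *)

(* So bins 0..128 carry all the information in the frame. *)
oneSided = Take[spec, 129];
Length[oneSided]   (* 129 *)
```

If that is what those implementations do, the dropped half can be rebuilt from the kept half before inverting, which would be why a 128-bin crop can still be reconstructed.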

Here is the MATLAB example. I don't understand either language well enough to translate this into a Mathematica NetEncoder[], but I was able to underline where, firstly, it ensures the training data is 128x128, & secondly, it converts the data from {128,128,2} to {128,128,1}:

[image]

Something is definitely off with the encoder/decoder right now: either {n,n,2} needs to be {n,n,1}, &/or ArrayPad[] needs to be changed so the data is 128x128 (with 256-window/128-offset STFTs) rather than 256x256 (i.e., 128x256 padded to 256x256), &/or I'm still doing something wrong with the TransposeLayer[]s. But I tried your suggestion:

discriminator =
NetChain[
{
TransposeLayer[{2, 3, 1}]
...
},
"Input" -> enc
]

.

generator =
 NetChain[
  {
   ...
TransposeLayer[{3, 1, 2}]
  },
 "Input" -> 100,
 "Output" -> dec
 ]

&, unfortunately, it didn't help with the results at all.

The net architecture/transpositions should be easy enough, though, if I can get the encoder/decoders working such that data is correctly input & output of the net.

Here is the discriminator architecture in some other languages, MATLAB:

[image]

& Python:

[image]

& the generator architecture in MATLAB:

[image]

& Python:

[image]

It's simple enough (indeed, the simplest) to achieve that in Mathematica:

[image] [image]

I just need to figure out this encoder/decoder problem.

Any tips would be much appreciated. Thank you!

POSTED BY: John M.

Sorry, you should try TransposeLayer[{2, 3, 1}] in the discriminator instead of TransposeLayer[{3, 2, 1}] (which is the same as TransposeLayer[1 <-> 3]).

And TransposeLayer[{3, 1, 2}] in the generator.
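
To see what these permutations do to the dimensions, here is a small check using Transpose[] on a dummy array. It assumes TransposeLayer[] follows the same convention as Transpose[] (level k of the input becomes level n_k of the output), & it uses a non-square {128, 256, 2} array so the axes can be told apart.

```wolfram
(* Dummy array shaped like an encoder output: {frames, bins, channels}. *)
arr = ConstantArray[0., {128, 256, 2}];

Dimensions[Transpose[arr, {2, 3, 1}]]   (* {2, 128, 256}: channels first, frames & bins in order *)
Dimensions[Transpose[arr, {3, 2, 1}]]   (* {2, 256, 128}: channels first, frames & bins swapped *)
Dimensions[Transpose[arr, {3, 1, 2}]]   (* {256, 2, 128}: channels end up in the middle *)

(* {3, 1, 2} undoes {2, 3, 1}, which is why the pair fits discriminator & generator. *)
Dimensions[Transpose[Transpose[arr, {2, 3, 1}], {3, 1, 2}]]   (* {128, 256, 2} *)
```

With a square {256, 256, 2} input, {2, 3, 1} & {3, 2, 1} give the same dimensions, so a swapped time/frequency axis would not show up as a shape error.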

BTW, when I try to use your EXAMPLE2.nb, I don't understand how it can fit the dimensions. I have this error:

NetInitialize[discriminator][Audio[File["ExampleData/car.mp3"]]]
During evaluation of In[17]:= NetChain::invindata3: Data supplied to port "Input" could not be encoded; "Function" encoder did not produce an output that was a 256*256*2 array of real numbers.
Out[17]= $Failed

because indeed the NetEncoder is not producing arrays of size {256,256,2} (the first dimension varies depending on the length of the signal):

Dimensions[enc[Audio[File["ExampleData/car.mp3"]]]]
Out[18]= {2693,256,2}
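
One way to get a fixed first dimension, sketched here with a hypothetical helper `fixFrames` (not part of the notebook), is to crop or zero-pad the frames inside the encoder function. Whether cropping is acceptable depends on the data.

```wolfram
(* Hypothetical helper: force a {frames, bins, 2} array to exactly n frames. *)
fixFrames[arr_, n_Integer] :=
  If[Length[arr] >= n,
    Take[arr, n],                                           (* crop extra frames *)
    ArrayPad[arr, {{0, n - Length[arr]}, {0, 0}, {0, 0}}]   (* zero-pad at the end *)
  ]

Dimensions[fixFrames[ConstantArray[1., {2693, 256, 2}], 256]]   (* {256, 256, 2} *)
Dimensions[fixFrames[ConstantArray[1., {128, 256, 2}], 256]]    (* {256, 256, 2} *)
```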

Do you use audio signals (FileNames["*.wav", NotebookDirectory[]]) that all have a particular length?

Also, do you get why your "c" seems to be 2 while it's 1 in the paper?

Posted 3 years ago

For sure! I wish there were more examples of how to use NetGANOperator[] online, & I was excited when it was implemented.

I tried changing the TransposeLayer[]s from {3 <-> 1} to {3, 1, 2} & it gave this error:

NetChain::valfail: Validation failed for ConvolutionLayer: kernel size 4*4 cannot exceed input size 1*128 plus padding size 2*2.

Then I changed them from {3 <-> 1} to {3, 2, 1} & I could evaluate the nets, but I still got bad results from the generator after training. I even tried adjusting my parameters:

kern = {4, 4};
chan = 128;
α = 0.2;

& restructuring the generator & discriminator to follow the example more closely:

discriminator =
 NetChain[
  {
   TransposeLayer[{3, 2, 1}, "Input" -> {256, 256, 2}],
   ConvolutionLayer[chan, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> \[Alpha]],
   ConvolutionLayer[chan*2, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> \[Alpha]],
   ConvolutionLayer[chan*4, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> \[Alpha]],
   ConvolutionLayer[chan*8, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> \[Alpha]],
   ConvolutionLayer[chan*16, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> \[Alpha]],
   ConvolutionLayer[chan*32, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> \[Alpha]],
   ReshapeLayer[{4*4*128*32, 1}],
   LinearLayer[{}]
   }, "Input" -> enc

  ]

.

generator =
NetChain[
{

LinearLayer[{4096*4*4 }],
ReshapeLayer[{4096, 4, 4}],
ElementwiseLayer["ReLU"],
DeconvolutionLayer[chan*32, kern, "Stride" -> 2, PaddingSize -> 1],
ElementwiseLayer["ReLU"],
DeconvolutionLayer[chan*16, kern, "Stride" -> 2, PaddingSize -> 1],
ElementwiseLayer["ReLU"],
DeconvolutionLayer[chan*8, kern, "Stride" -> 2, PaddingSize -> 1],
ElementwiseLayer["ReLU"],
DeconvolutionLayer[chan*4, kern, "Stride" -> 2, PaddingSize -> 1],
ElementwiseLayer["ReLU"],
DeconvolutionLayer[chan*2, kern, "Stride" -> 2, PaddingSize -> 1],
ElementwiseLayer["ReLU"],
DeconvolutionLayer[2, kern, "Stride" -> 2, PaddingSize -> 1],
ElementwiseLayer[Tanh],
TransposeLayer[{3, 2, 1}]
},
"Input" -> 100,
"Output" -> dec
]

After training, though, the generator only generated noise. I'm certain it has something to do with the dimensions {256,256,2} getting somehow switched around in the net, but I don't know where/how. In the MATLAB example, the TransposeLayer[] equivalents come at the opposite ends of the generator & discriminator (i.e., BEFORE the DeconvolutionLayer[]s in the generator & AFTER the ConvolutionLayer[]s in the discriminator). I tried building the nets that way, but I get errors & can't evaluate the cells with my NetChain[]s until I do it in reverse. The dimensions in the NetChain[] box are also the reverse of the way the dimensions are outlined in the paper, e.g.,

[image] [image]

I'm sure it's just a simple transposition issue. Any tips would be greatly appreciated. I'd love to get this going in Mathematica, but there are obviously some details here I'm missing.

I've included an updated EXAMPLE notebook. Thanks.

POSTED BY: John M.

Indeed, the NetDecoder[NetEncoder[...]] conversion is not implemented for audio features, because these feature extractions are in general not invertible.

I don't know how to do it with NetEncoder["AudioSTFT"], but the way to do it with NetEncoder/NetDecoder["Function"] is:

enc = NetEncoder[{"Function",
	Function @ Block[
		{data = ShortTimeFourier[#, 4096, 512, HannWindow]["Data"]},
		Transpose[{Re[data], Im[data]}, {3,1,2}]
	],
	{"Varying", 4096, 2}}];
dec = NetDecoder[{"Function",
	Function @ Block[
		{re = Normal[#[[All,All,1]]], im = Normal[#[[All,All,2]]]},
		Audio @ InverseShortTimeFourier[re + I * im, 4096, 512, HannWindow]
	]}];

Which can be tested with

dec @ enc @ Audio[File["ExampleData/car.mp3"]]

(that gives back almost the original audio signal, with a bit of distortion).

Note that the real and imaginary parts of the FT need to be separated (neural networks need real numbers only).
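
As a small illustration of that packing, using a made-up 2 x 3 complex matrix in place of real STFT data, the same Transpose[] call as in the encoder puts the two parts in the last dimension, & indexing that dimension recovers the complex values:

```wolfram
(* Made-up stand-in for STFT data: 2 frames x 3 bins. *)
data = {{1. + 2. I, 3. - 1. I, 2. + 0.5 I}, {-2. + 0.5 I, 0.5 + 0.5 I, 4. - 1. I}};

(* Pack real & imaginary parts as two real channels, as in the encoder. *)
packed = Transpose[{Re[data], Im[data]}, {3, 1, 2}];
Dimensions[packed]   (* {2, 3, 2} *)

(* Unpack, as in the decoder. *)
unpacked = packed[[All, All, 1]] + I*packed[[All, All, 2]];
unpacked == data   (* True *)
```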

Concerning the error you had, "expecting a ShortTimeFourierData object or a numeric matrix instead of Numeric Array": it comes from the subtlety that the input of the function f in NetDecoder[{"Function", f}] is a NumericArray, which can be converted to an ordinary array using Normal.
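
A minimal sketch of that conversion, with a hand-built NumericArray standing in for what the decoder function receives:

```wolfram
(* Hand-built stand-in for the array passed to the decoder function. *)
na = NumericArray[{{1., 2.}, {3., 4.}}, "Real32"];
Head[na]    (* NumericArray *)

(* Normal turns it into an ordinary nested list. *)
arr = Normal[na];
Head[arr]   (* List *)
Dimensions[arr]   (* {2, 2} *)
```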
