Thank you very much, Jérôme; the code you provided above works perfectly.
Unfortunately, I'm still having problems with NetTrain[] using these encoders/decoders.  Maybe you, or someone at Wolfram Research, could help me out?
I'm attempting adversarial audio synthesis (as described in this paper & demonstrated in this MATLAB example) using Mathematica.
I define my STFT parameters:
window = 256;
offset = 128;
pad = (window - offset)/2;
samplerate = 16000;
I define my NetEncoder[] using the code you provided above.  Notice, however, that I also use ArrayPad[] so the output is square (i.e., {256, 256, 2} instead of {128, 256, 2}):
enc = NetEncoder[
   {"Function",
    Function@
     Block[
      {data = ShortTimeFourier[#, window, offset, HannWindow]["Data"]
       },
      ArrayPad[
       Transpose[
        {
         Re[data], Im[data]
         }, {3, 1, 2}
        ], {{pad}, {0}, {0}}
       ]
      ], {"Varying", window, 2}
    }
   ];
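Before wiring this encoder into a net, it can be spot-checked by hand to confirm that a 16384-sample clip really encodes to {256, 256, 2}.  This is just a sanity-check sketch (not part of the training code); I use a silent clip only to get the length right:

test = Audio[ConstantArray[0., 16384], SampleRate -> samplerate];
Dimensions[enc[test]]
(* expect {256, 256, 2}, if the STFT yields 128 frames before the pad of 64 on each side *)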
I define my NetDecoder[].  Notice, again, that I also use ArrayPad[] to remove the encoder padding (i.e., feeding {128, 256, 2} instead of {256, 256, 2} into InverseShortTimeFourier[]):
dec = NetDecoder[
   {"Function",
    Function@
     Block[
      {re = ArrayPad[Normal[#[[All, All, 1]]], {{-pad}, {0}}], 
       im = ArrayPad[Normal[#[[All, All, 2]]], {{-pad}, {0}}]
       },
      Audio[
       InverseShortTimeFourier[
        re + I*im, window, offset, HannWindow
        ], SampleRate -> samplerate
       ]
      ]
    }
   ];
The test you provided above still works perfectly, even with my ArrayPad[] changes:
dec @ enc @ Audio[File["ExampleData/car.mp3"]]
I start running into problems, however, when I attempt to train a net with this encoder/decoder pair.  First, I define my net parameters:
kern = {4, 4};
chan = 64;
α = 0.2;
I define my discriminator (Table 5 in [1]), adding an additional ConvolutionLayer[] for 256x256:
discriminator =
 NetChain[
  {
   TransposeLayer[{3 <-> 1}, "Input" -> {256, 256, 2}],
   ConvolutionLayer[chan, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> α],
   ConvolutionLayer[chan*2, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> α],
   ConvolutionLayer[chan*4, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> α],
   ConvolutionLayer[chan*8, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> α],
   ConvolutionLayer[chan*16, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> α],
   ConvolutionLayer[chan*32, kern, "Stride" -> 2, PaddingSize -> 1],
   ParametricRampLayer[{}, "Slope" -> α],
   LinearLayer[{}]
   },
  "Input" -> enc
  ]
I define my generator (Table 4 in [1]), adding an additional DeconvolutionLayer[] for 256x256:
generator =
 NetChain[
  {
   LinearLayer[chan*(window*2)],
   ReshapeLayer[{2048, 4, 4}],
   ElementwiseLayer["ReLU"],
   DeconvolutionLayer[chan*16, kern, "Stride" -> 2, PaddingSize -> 1],
   ElementwiseLayer["ReLU"],
   DeconvolutionLayer[chan*8, kern, "Stride" -> 2, PaddingSize -> 1],
   ElementwiseLayer["ReLU"],
   DeconvolutionLayer[chan*4, kern, "Stride" -> 2, PaddingSize -> 1],
   ElementwiseLayer["ReLU"],
   DeconvolutionLayer[chan*2, kern, "Stride" -> 2, PaddingSize -> 1],
   ElementwiseLayer["ReLU"],
   DeconvolutionLayer[chan, kern, "Stride" -> 2, PaddingSize -> 1],
   ElementwiseLayer["ReLU"],
   DeconvolutionLayer[2, kern, "Stride" -> 2, PaddingSize -> 1],
   ElementwiseLayer[Tanh],
   TransposeLayer[{1 <-> 3}]
   },
  "Input" -> 100,
  "Output" -> dec
  ]
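Even before training, the untrained generator can be run end to end, which at least confirms the decoder plumbing.  This is only a sketch: NetInitialize[] fills in random weights, so the result should be an Audio object of roughly 1.024 s of meaningless noise:

NetInitialize[generator][RandomVariate[NormalDistribution[], 100]]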
I load my training data (a dataset of WAV files, each 16384 samples in length [i.e., 1.024 s at 16 kHz]):
wav = FileNames["*.wav", NotebookDirectory[]];
trainingData = Import[#] & /@ wav;
I define my "latent" input for training (following the documentation for NetGANOperator[]):
RandomLatent[batchSize_] := 
  Map[NumericArray[#, "Real32"] &, 
   RandomVariate[NormalDistribution[], {batchSize, 100}]];
datagen = 
  Function[<|"Sample" -> RandomSample[trainingData, #BatchSize], 
    "Latent" -> RandomLatent[#BatchSize]|>];
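The data-generator function can be spot-checked by calling it by hand with an association supplying #BatchSize, the way NetTrain[] does.  A quick sketch (assumes trainingData has at least 4 elements):

batch = datagen[<|"BatchSize" -> 4|>];
Length[batch["Sample"]]  (* 4 randomly chosen clips *)
Length[batch["Latent"]]  (* 4 latent vectors of length 100 *)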
I train the GAN:
gan = NetGANOperator[{generator, discriminator}];
trained = NetTrain[
  gan,
  {
   datagen,
   "RoundLength" -> Length[trainingData]
   },
  TrainingUpdateSchedule -> {"Discriminator", "Generator"},
  Method -> {"ADAM", "Beta1" -> 0.5, "LearningRate" -> 0.0002},
  BatchSize -> 64,
  MaxTrainingRounds -> 100,
  TargetDevice -> "GPU"]
However, when I generate new samples from the trained generator:
trainedgen = NetExtract[trained, "Generator"];
trainedgen[RandomLatent[1]]
it's clear that something is wrong with my net architecture or my encoder/decoder, because the generated samples don't resemble the output of:
dec @ enc @ Audio[File["ExampleData/car.mp3"]]
They're either silent or just noise.  I can't figure it out.
I must be doing something wrong, because I've used Mathematica's NetTrain[] on image datasets with similar architectures and I've never run into this problem (i.e., the generated samples always resembled the training data).
Could it be I've set some parameter wrong? 
Could it be the TransposeLayer[]s in my generator & discriminator are somehow making the data unusable? 
Could it be the function (RandomLatent[]) I'm using for "latent" input for the generator is suitable only for image datasets & not for audio/STFT datasets? 
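On that last point, one concrete thing I can compare is value ranges: the generator ends in Tanh, so its output is confined to [-1, 1], whereas the raw Re/Im STFT values produced by enc are not normalized to that range.  A diagnostic sketch (assumes trainingData is nonempty):

MinMax[Flatten[Normal[enc[First[trainingData]]]]]
(* values far outside [-1, 1] would mean the Tanh output can never match the scale of the real data *)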
Any help you, Jérôme, or anyone at Wolfram Research, could give me would be greatly appreciated.  I'd really like to figure out how to NetTrain[] GANs with these encoders/decoders in Mathematica.
I've included an example notebook.  Thank you.