Message Boards Message Boards

GROUPS:

Music Generation with GAN MidiNet

Posted 3 months ago
1452 Views
|
4 Replies
|
8 Total Likes
|

I generate a music with reference to MidiNet. Most neural network models for music generation use recurrent neural networks. However, MidiNet use convolutional neural networks. There are three models in MidiNet. Model 1 is Melody generator, no chord condition. Model 2,3 are Melody generators with chord condition. I try Model 1, because it is most interesting in the three models compared in the paper.

Get MIDI data

My favorite Jazz bassist is Jaco Pastorius. I get MIDI data from here. For example, I get MIDI data of "The Chicken".

url = "http://www.midiworld.com/download/1366";
notes = Select[Import[url, {"SoundNotes"}], Length[#] > 0 &];

There are some styles in the notes. I get base style from them.

notes[[All, 3, 3]]
Sound[notes[[1]]]

enter image description here enter image description here

I change MIDI data to Image data. I fix the smallest note unit to be the sixteenth note. I divide the MIDI data into the sixteenth note period and select the sound found at the beginning of each period. And the pitch of SoundNote function is from 1 to 128. So, I change one bar to grayscale image(h=128*w=16). First, I create the rule to change each note pitch(C-1,...,G9) to number(1,...,128), C4 -> 61.

codebase = {"C", "C#", "D", "D#", "E" , "F", "F#", "G", "G#" , "A", 
   "A#", "B"};
num = ToString /@ Range[-1, 9];
pitch2numberrule = 
 Take[Thread[
   StringJoin /@ Reverse /@ Tuples[{num, codebase}] -> 
    Range[0, 131] + 1], 128]

enter image description here

Next, I change each bar to image (h = 128*w = 16).

tempo = 108;
note16 = 60/(4*tempo);  (* length(second) of 1the sixteenth note *)
select16[snlist_, t_] := 
 Select[snlist, (t <= #[[2, 1]] <= t + note16) || (t <= #[[2, 2]] <= 
      t + note16) || (#[[2, 1]] < t && #[[2, 2]] > t + note16) &, 1]
selectbar[snlist_, str_] := 
 select16[snlist, #] & /@ Most@Range[str, str + note16*16, note16]
selectpitch[x_] := If[x === {}, 0, x[[1, 1]]] /. pitch2numberrule
pixelbar[snlist_, t_] := Module[{bar, x, y},
  bar = selectbar[snlist, t];
  x = selectpitch /@ bar;
  y = Range[16];
  Transpose[{x, y}]
  ]
imagebar[snlist_, t_] := Module[{image},
  image = ConstantArray[0, {128, 16}];
  Quiet[(image[[129 - #[[1]], #[[2]]]] = 1) & /@ pixelbar[snlist, t]];
  Image[image]
  ]
soundnote2image[soundnotelist_] := Module[{min, max, data2},
  {min, max} = MinMax[#[[2]] & /@ soundnotelist // Flatten];
  data2 = {#[[1]], #[[2]] - min} & /@ soundnotelist;
  Table[imagebar[data2, t], {t, 0, max - min, note16*16}]
  ]

(images1 = soundnote2image[notes[[1]]])[[;; 16]]

enter image description here

Create the training data

First, I drop images1 to an integer multiple of the batch size. Its length is 128 bars and about 284 seconds with a batch size of 16.

batchsize = 16;
getbatchsizeimages[i_] := i[[;; batchsize*Floor[Length[i]/batchsize]]]
imagesall = Flatten[Join[getbatchsizeimages /@ {images1}]];
{Length[imagesall], Length[imagesall]*note16*16 // N}

enter image description here

MidiNet proposes a novel conditional mechanism to use music from the previous bar to condition the generation of the present bar to take into account the temporal dependencies across a different bar. So, each training data of MidiNet (Model 1: Melody generator, no chord condition) consists of three "noise", "prev", "Input". "noise" is a 100-dimensions random vector. "prev" is an image data(112816) of the previous bar. "Input" is an image data(112816) of the present bar. The first "prev" of each batch is all 0. I generate training data with a batch size of 16 as follows.

randomDim = 100;
n = Floor[Length@imagesall/batchsize];
noise = Table[RandomReal[NormalDistribution[0, 1], {randomDim}], 
   batchsize*n];
input = ArrayReshape[ImageData[#], {1, 128, 16}] & /@ 
   imagesall[[;; batchsize*n]];
prev = Flatten[
   Join[Table[{{ConstantArray[0, {1, 128, 16}]}, 
      input[[batchsize*(i - 1) + 1 ;; batchsize*i - 1]]}, {i, 1, n}]],
    2];
trainingData = 
  AssociationThread[{"noise", "prev", 
       "Input"} -> {#[[1]], #[[2]], #[[3]]}] & /@ 
   Transpose[{noise, prev, input}];

Create GAN

I create generator with reference to MidiNet.

generator = NetGraph[{
   1024, BatchNormalizationLayer[], Ramp, 256, 
   BatchNormalizationLayer[], Ramp, ReshapeLayer[{128, 1, 2}],
   DeconvolutionLayer[64, {1, 2}, "Stride" -> {2, 2}], 
   BatchNormalizationLayer[], Ramp, 
   DeconvolutionLayer[64, {1, 2}, "Stride" -> {2, 2}], 
   BatchNormalizationLayer[], Ramp, 
   DeconvolutionLayer[64, {1, 2}, "Stride" -> {2, 2}], 
   BatchNormalizationLayer[], Ramp, 
   DeconvolutionLayer[1, {128, 1}, "Stride" -> {2, 1}], 
   LogisticSigmoid,                                  
   ConvolutionLayer[16, {128, 1}, "Stride" -> {2, 1}], 
   BatchNormalizationLayer[], Ramp, 
   ConvolutionLayer[16, {1, 2}, "Stride" -> {1, 2}], 
   BatchNormalizationLayer[], Ramp,    
   ConvolutionLayer[16, {1, 2}, "Stride" -> {1, 2}], 
   BatchNormalizationLayer[], Ramp,    
   ConvolutionLayer[16, {1, 2}, "Stride" -> {1, 2}], 
   BatchNormalizationLayer[], Ramp,     CatenateLayer[], 
   CatenateLayer[], CatenateLayer[], 
   CatenateLayer[]},                             {NetPort["noise"] -> 
    1, NetPort["prev"] -> 19, 
   19 -> 20 -> 
     21 -> 22 -> 23 -> 24 -> 25 -> 26 -> 27 -> 28 -> 29 -> 30,
   1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7, {7, 30} -> 31, 
   31 -> 8 -> 9 -> 10, {10, 27} -> 32, 
   32 -> 11 -> 12 -> 13, {13, 24} -> 33, 
   33 -> 14 -> 15 -> 16, {16, 21} -> 34, 34 -> 17 -> 18}, 
  "noise" -> {100}, "prev" -> {1, 128, 16}
     ]

enter image description here

I create discriminator which does not have BatchNormalizationLayer and LogisticSigmoid, because I use Wasserstein GAN easy to stabilize the training.

discriminator = NetGraph[{
   ConvolutionLayer[64, {89, 4}, "Stride" -> {1, 1}], Ramp,
   ConvolutionLayer[64, {1, 4}, "Stride" -> {1, 1}], Ramp,
   ConvolutionLayer[16, {1, 4}, "Stride" -> {1, 1}], Ramp,
   1},
  {1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7}, "Input" -> {1, 128, 16}
     ]

enter image description here

I create Wasserstein GAN network.

ganNet = NetInitialize[NetGraph[<|"gen" -> generator,
    "discrimop" -> NetMapOperator[discriminator], 
    "cat" -> CatenateLayer[], 
    "reshape" -> ReshapeLayer[{2, 1, 128, 16}], 
    "flat" -> ReshapeLayer[{2}], 
    "scale" -> ConstantTimesLayer["Scaling" -> {-1, 1}], 
    "total" -> SummationLayer[]|>,
   {{NetPort["noise"], NetPort["prev"]} -> "gen" -> "cat", 
    NetPort["Input"] -> "cat", 
    "cat" -> 
     "reshape" -> "discrimop" -> "flat" -> "scale" -> "total"}, 
   "Input" -> {1, 128, 16}]]

enter image description here

NetTrain

I train by using the training data created before. I use RMSProp as the method of NetTrain according to Wasserstein GAN. It take about one hour by using GPU.

net = NetTrain[ganNet, trainingData, All, LossFunction -> "Output", 
  Method -> {"RMSProp", "LearningRate" -> 0.00005, 
    "WeightClipping" -> {"discrimop" -> 0.01}}, 
  LearningRateMultipliers -> {"scale" -> 0, "gen" -> -0.2}, 
  TargetDevice -> "GPU", BatchSize -> batchsize, 
  MaxTrainingRounds -> 50000]

enter image description here

Create MIDI

I create image data of 16 bars by using generator of trained network.

bars = {};
newbar = Image[ConstantArray[0, {1, 128, 16}]];
For[i = 1, i < 17, i++,
 noise1 = RandomReal[NormalDistribution[0, 1], {randomDim}];
 prev1 = {ImageData[newbar]};
 newbar = 
  NetDecoder[{"Image", "Grayscale"}][
   NetExtract[net["TrainedNet"], "gen"][<|"noise" -> noise1, 
     "prev" -> prev1|>]];
 AppendTo[bars, newbar]
 ]
bars

enter image description here

I select only the pixel having the max value among each column of the image, because there is a feature that the image generated by Wasserstein GAN is blurred. I clear the images.

clearbar[bar_, threshold_] := Module[{i, barx, col, max},
  barx = ConstantArray[0, {128, 16}];
  col = Transpose[bar // ImageData];
  For[i = 1, i < 17, i++,
   max = Max[col[[i]]];
   If[max >= threshold,
    barx[[First@Position[col[[i]], max, 1], i]] = 1]
   ];
  Image[barx]
  ]
 bars2 = clearbar[#, 0.1] & /@ bars

enter image description here

I change the image to SoundNote. I concatenate the same continuous pitches.

number2pitchrule = Reverse /@ pitch2numberrule;
images2soundnote[img_, start_] := 
 SoundNote[(129 - #[[2]]) /. 
     number2pitchrule, {(#[[1]] - 1)*note16, #[[1]]*note16} + start, 
    "ElectricBass", SoundVolume -> 1] & /@ 
  Sort@(Reverse /@ Position[(img // ImageData) /. (1 -> 1.), 1.])
snjoinrule = {x___, SoundNote[s_, {t_, u_}, v_, w_], 
    SoundNote[s_, {u_, z_}, v_, w_], y___} -> {x, 
    SoundNote[s, {t, z}, v, w], y};

I generate music and attach its mp3 file.

Sound[Flatten@
  MapIndexed[(images2soundnote[#1, note16*16*(First[#2] - 1)] //. 
      snjoinrule) &, bars2]]

enter image description here

Conclusion

I try music generation with GAN. I am not satisfied with the result. I think that the causes are various, poor training data, poor learning time, etc. Jaco is gone. I hope Neural Networks will be able to express Jaco's base.

Attachments:
4 Replies

enter image description here - Congratulations! This post is now a Staff Pick as distinguished by a badge on your profile! Thank you, keep it coming!

Dear @Kotaro Okazaki, thank you for sharing this interesting work and trying out Wolfram Neural Network framework! I am looking forward to more of your contributions!

Perhaps you would also be interested in taking a look at a related project that was done in Wolfram Summer School: "Generating Music with Expressive Timing and Dynamics":

http://community.wolfram.com/groups/-/m/t/1380021

It also works with MIDI, but, as you already noticed with the most approaches, it uses Recurrent Neural Networks.

Dear Kaurov, thank you for the information. Apisov’'s post is very helpful. I used a MIDI file as I input and output. The reason is because it is easy to handle as data. However, I also feel that there is a limit to express Jazz. In order to generate Jazz vividly, I think whether input and output need to be raw data (wav,mp3), or End-to-End.

Nice example, I really hope that generalized GAN training is supported in version 12!

Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract