
Music Generation with GAN MidiNet

Posted 6 years ago

I generate music with reference to MidiNet. Most neural network models for music generation use recurrent neural networks; MidiNet, however, uses convolutional neural networks. There are three models in MidiNet: Model 1 is a melody generator with no chord condition, and Models 2 and 3 are melody generators with a chord condition. I try Model 1 because it is the most interesting of the three models compared in the paper.

Get MIDI data

My favorite jazz bassist is Jaco Pastorius. I get MIDI data from midiworld.com. For example, I get the MIDI data of "The Chicken".

url = "http://www.midiworld.com/download/1366";
notes = Select[Import[url, {"SoundNotes"}], Length[#] > 0 &];

There are several styles (instruments) among the tracks in the notes. I pick out the bass track from them.

notes[[All, 3, 3]]
Sound[notes[[1]]]


I convert the MIDI data to image data. I fix the smallest note unit to the sixteenth note: I divide the MIDI data into sixteenth-note periods and select the sound found at the beginning of each period. The pitch of the SoundNote function ranges from 1 to 128, so I convert one bar to a grayscale image (height 128 × width 16). First, I create the rule that converts each note pitch (C-1, ..., G9) to a number (1, ..., 128), e.g. C4 -> 61.

codebase = {"C", "C#", "D", "D#", "E" , "F", "F#", "G", "G#" , "A", 
   "A#", "B"};
num = ToString /@ Range[-1, 9];
pitch2numberrule = 
 Take[Thread[
   StringJoin /@ Reverse /@ Tuples[{num, codebase}] -> 
    Range[0, 131] + 1], 128]

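As a quick check (my addition, not in the original post), applying the rule to a pitch name returns its number:

"C4" /. pitch2numberrule
(* 61 *)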

Next, I convert each bar to an image (height 128 × width 16).

tempo = 108;
note16 = 60/(4*tempo);  (* length (seconds) of one sixteenth note *)
(* notes sounding during the sixteenth-note period starting at t *)
select16[snlist_, t_] := 
 Select[snlist, (t <= #[[2, 1]] <= t + note16) || (t <= #[[2, 2]] <= 
      t + note16) || (#[[2, 1]] < t && #[[2, 2]] > t + note16) &, 1]
(* one bar = 16 consecutive sixteenth-note periods starting at str *)
selectbar[snlist_, str_] := 
 select16[snlist, #] & /@ Most@Range[str, str + note16*16, note16]
selectpitch[x_] := If[x === {}, 0, x[[1, 1]]] /. pitch2numberrule
pixelbar[snlist_, t_] := Module[{bar, x, y},
  bar = selectbar[snlist, t];
  x = selectpitch /@ bar;
  y = Range[16];
  Transpose[{x, y}]
  ]
imagebar[snlist_, t_] := Module[{image},
  image = ConstantArray[0, {128, 16}];
  (* pitch 0 (a rest) indexes outside the image and is silently skipped *)
  Quiet[(image[[129 - #[[1]], #[[2]]]] = 1) & /@ pixelbar[snlist, t]];
  Image[image]
  ]
soundnote2image[soundnotelist_] := Module[{min, max, data2},
  {min, max} = MinMax[#[[2]] & /@ soundnotelist // Flatten];
  data2 = {#[[1]], #[[2]] - min} & /@ soundnotelist; (* shift start time to 0 *)
  Table[imagebar[data2, t], {t, 0, max - min, note16*16}]
  ]

(images1 = soundnote2image[notes[[1]]])[[;; 16]]


Create the training data

First, I trim images1 to an integer multiple of the batch size. With a batch size of 16, the result is 128 bars long, about 284 seconds.

batchsize = 16;
getbatchsizeimages[i_] := i[[;; batchsize*Floor[Length[i]/batchsize]]]
imagesall = Flatten[Join[getbatchsizeimages /@ {images1}]];
{Length[imagesall], Length[imagesall]*note16*16 // N}


MidiNet proposes a novel conditional mechanism that uses the music of the previous bar to condition the generation of the present bar, taking the temporal dependencies across bars into account. So each training example for MidiNet (Model 1: melody generator, no chord condition) consists of three parts: "noise", "prev", and "Input". "noise" is a 100-dimensional random vector. "prev" is the image data (1×128×16) of the previous bar. "Input" is the image data (1×128×16) of the present bar. The first "prev" of each batch is all zeros. I generate the training data with a batch size of 16 as follows.

randomDim = 100;
n = Floor[Length@imagesall/batchsize];
noise = Table[RandomReal[NormalDistribution[0, 1], {randomDim}], 
   batchsize*n];
input = ArrayReshape[ImageData[#], {1, 128, 16}] & /@ 
   imagesall[[;; batchsize*n]];
prev = Flatten[
   Join[Table[{{ConstantArray[0, {1, 128, 16}]}, 
      input[[batchsize*(i - 1) + 1 ;; batchsize*i - 1]]}, {i, 1, n}]],
    2];
trainingData = 
  AssociationThread[{"noise", "prev", 
       "Input"} -> {#[[1]], #[[2]], #[[3]]}] & /@ 
   Transpose[{noise, prev, input}];
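
As a quick shape check (my addition, not in the original post), every training example should pair a 100-dimensional noise vector with two 1×128×16 bar images:

Map[Dimensions, First[trainingData]]
(* <|"noise" -> {100}, "prev" -> {1, 128, 16}, "Input" -> {1, 128, 16}|> *)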

Create GAN

I create the generator with reference to MidiNet.

generator = NetGraph[{
   (* layers 1-18: generator path that upsamples the noise vector to a 1×128×16 bar *)
   1024, BatchNormalizationLayer[], Ramp, 256,
   BatchNormalizationLayer[], Ramp, ReshapeLayer[{128, 1, 2}],
   DeconvolutionLayer[64, {1, 2}, "Stride" -> {2, 2}],
   BatchNormalizationLayer[], Ramp,
   DeconvolutionLayer[64, {1, 2}, "Stride" -> {2, 2}],
   BatchNormalizationLayer[], Ramp,
   DeconvolutionLayer[64, {1, 2}, "Stride" -> {2, 2}],
   BatchNormalizationLayer[], Ramp,
   DeconvolutionLayer[1, {128, 1}, "Stride" -> {2, 1}],
   LogisticSigmoid,
   (* layers 19-30: conditioner CNN that encodes the previous bar *)
   ConvolutionLayer[16, {128, 1}, "Stride" -> {2, 1}],
   BatchNormalizationLayer[], Ramp,
   ConvolutionLayer[16, {1, 2}, "Stride" -> {1, 2}],
   BatchNormalizationLayer[], Ramp,
   ConvolutionLayer[16, {1, 2}, "Stride" -> {1, 2}],
   BatchNormalizationLayer[], Ramp,
   ConvolutionLayer[16, {1, 2}, "Stride" -> {1, 2}],
   BatchNormalizationLayer[], Ramp,
   (* layers 31-34: catenate conditioner features into the generator path *)
   CatenateLayer[], CatenateLayer[], CatenateLayer[], CatenateLayer[]},
  {NetPort["noise"] -> 1, NetPort["prev"] -> 19,
   19 -> 20 -> 21 -> 22 -> 23 -> 24 -> 25 -> 26 -> 27 -> 28 -> 29 -> 30,
   1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7, {7, 30} -> 31,
   31 -> 8 -> 9 -> 10, {10, 27} -> 32,
   32 -> 11 -> 12 -> 13, {13, 24} -> 33,
   33 -> 14 -> 15 -> 16, {16, 21} -> 34, 34 -> 17 -> 18},
  "noise" -> {100}, "prev" -> {1, 128, 16}]


I create the discriminator without BatchNormalizationLayer and LogisticSigmoid, because I use a Wasserstein GAN, which makes the training easier to stabilize.

discriminator = NetGraph[{
   ConvolutionLayer[64, {89, 4}, "Stride" -> {1, 1}], Ramp,
   ConvolutionLayer[64, {1, 4}, "Stride" -> {1, 1}], Ramp,
   ConvolutionLayer[16, {1, 4}, "Stride" -> {1, 1}], Ramp,
   1},
  {1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7}, "Input" -> {1, 128, 16}
     ]
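
As a quick check (my addition, not in the original post), the critic maps a single 1×128×16 bar image to one score:

NetInitialize[discriminator][RandomReal[1, {1, 128, 16}]]
(* a length-1 vector; unbounded, since there is no final LogisticSigmoid *)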


I create the Wasserstein GAN network.

ganNet = NetInitialize[NetGraph[<|"gen" -> generator,
    "discrimop" -> NetMapOperator[discriminator], 
    "cat" -> CatenateLayer[], 
    "reshape" -> ReshapeLayer[{2, 1, 128, 16}], 
    "flat" -> ReshapeLayer[{2}], 
    "scale" -> ConstantTimesLayer["Scaling" -> {-1, 1}], 
    "total" -> SummationLayer[]|>,
   {{NetPort["noise"], NetPort["prev"]} -> "gen" -> "cat", 
    NetPort["Input"] -> "cat", 
    "cat" -> 
     "reshape" -> "discrimop" -> "flat" -> "scale" -> "total"}, 
   "Input" -> {1, 128, 16}]]


NetTrain

I train by using the training data created before. Following the Wasserstein GAN paper, I use RMSProp as the method of NetTrain. It takes about one hour using a GPU.

net = NetTrain[ganNet, trainingData, All, LossFunction -> "Output", 
  Method -> {"RMSProp", "LearningRate" -> 0.00005, 
    "WeightClipping" -> {"discrimop" -> 0.01}}, 
  LearningRateMultipliers -> {"scale" -> 0, "gen" -> -0.2}, 
  TargetDevice -> "GPU", BatchSize -> batchsize, 
  MaxTrainingRounds -> 50000]
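
Because NetTrain is called with the All return specification, net is a NetTrainResultsObject, so the training history can be inspected afterwards (my addition, assuming a recent Wolfram Language version):

net["LossEvolutionPlot"]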


Create MIDI

I create image data for 16 bars by using the generator of the trained network.

bars = {};
newbar = Image[ConstantArray[0, {128, 16}]]; (* the first "prev" is a blank bar *)
For[i = 1, i < 17, i++,
 noise1 = RandomReal[NormalDistribution[0, 1], {randomDim}];
 prev1 = {ImageData[newbar]};
 (* each new bar is conditioned on the previously generated bar *)
 newbar = 
  NetDecoder[{"Image", "Grayscale"}][
   NetExtract[net["TrainedNet"], "gen"][<|"noise" -> noise1, 
     "prev" -> prev1|>]];
 AppendTo[bars, newbar]
 ]
bars


Because images generated by a Wasserstein GAN tend to be blurred, I keep only the pixel with the maximum value in each column of the image. This cleans up the images.

clearbar[bar_, threshold_] := Module[{i, barx, col, max},
  barx = ConstantArray[0, {128, 16}];
  col = Transpose[bar // ImageData];
  For[i = 1, i < 17, i++,
   max = Max[col[[i]]];
   If[max >= threshold, (* keep a column's max pixel only if it is bright enough *)
    barx[[First@Position[col[[i]], max, 1], i]] = 1]
   ];
  Image[barx]
  ]
bars2 = clearbar[#, 0.1] & /@ bars


I convert the images to SoundNotes and concatenate consecutive notes of the same pitch.

number2pitchrule = Reverse /@ pitch2numberrule;
images2soundnote[img_, start_] := 
 SoundNote[(129 - #[[2]]) /. 
     number2pitchrule, {(#[[1]] - 1)*note16, #[[1]]*note16} + start, 
    "ElectricBass", SoundVolume -> 1] & /@ 
  Sort@(Reverse /@ Position[(img // ImageData) /. (1 -> 1.), 1.])
snjoinrule = {x___, SoundNote[s_, {t_, u_}, v_, w_], 
    SoundNote[s_, {u_, z_}, v_, w_], y___} -> {x, 
    SoundNote[s, {t, z}, v, w], y};
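
For illustration (my addition, not in the original post), two adjacent sixteenth notes of the same pitch merge into one longer note:

{SoundNote["C4", {0, note16}, "ElectricBass", SoundVolume -> 1],
  SoundNote["C4", {note16, 2 note16}, "ElectricBass", 
   SoundVolume -> 1]} //. snjoinrule
(* one SoundNote spanning {0, 2*note16} *)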

I generate music and attach its mp3 file.

Sound[Flatten@
  MapIndexed[(images2soundnote[#1, note16*16*(First[#2] - 1)] //. 
      snjoinrule) &, bars2]]
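
To save the result as an audio file (my addition; the filename is hypothetical, and older Wolfram Language versions may only export WAV, not MP3):

Export["generated-bass.wav", Sound[Flatten@
   MapIndexed[(images2soundnote[#1, note16*16*(First[#2] - 1)] //. 
       snjoinrule) &, bars2]]]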


Conclusion

I tried music generation with a GAN. I am not satisfied with the result. I think the causes are various: insufficient training data, insufficient training time, and so on. Jaco is gone. I hope neural networks will one day be able to express Jaco's bass.

Attachments:
POSTED BY: Kotaro Okazaki
4 Replies

Nice example! I really hope that generalized GAN training is supported in version 12!

POSTED BY: Michael Sollami

Dear @Kotaro Okazaki, thank you for sharing this interesting work and for trying out the Wolfram Neural Network framework! I am looking forward to more of your contributions!

Perhaps you would also be interested in taking a look at a related project that was done in Wolfram Summer School: "Generating Music with Expressive Timing and Dynamics":

http://community.wolfram.com/groups/-/m/t/1380021

It also works with MIDI but, as you already noticed about most approaches, it uses recurrent neural networks.

POSTED BY: Vitaliy Kaurov

Dear Kaurov, thank you for the information. Apisov's post is very helpful. I used MIDI files as input and output because they are easy to handle as data. However, I also feel there is a limit to expressing jazz this way. To generate jazz vividly, I wonder whether the input and output need to be raw audio data (wav, mp3), that is, an end-to-end approach.

POSTED BY: Kotaro Okazaki

Congratulations! This post is now a Staff Pick as distinguished by a badge on your profile! Thank you, keep it coming!

POSTED BY: Moderation Team
