
[WSS18] Generating Dynamic Piano Performances with Deep Learning


Goal of the project

The idea is to generate a sequence of notes that represents a musical performance. One way to generate music is algorithmic: a musical sequence is produced by applying a long list of rules.

The approach in this project is different: I build a neural network that knows nothing about music and try to teach it with thousands of songs in MIDI format. Beyond generating a meaningful sequence of notes, I also want the output to include dynamics in note loudness and human-like imperfections in timing.

  • Dynamics and timing?
    No human can play a musical instrument with exactly the same loudness on every note and in strict time. People make mistakes, and in music those mistakes help create what we call more lively music. Simply put, dynamic music with slight time shifts sounds more interesting.

  • Why performances?
    The dataset I use for the project contains performances by participants of the Yamaha e-piano competition. This makes it possible to learn dynamics and timing mistakes.

Here's an example generated by our model.

All examples will be attached to this post as files just in case.


This is not original work; it is mostly an attempt to recreate the work of the Magenta team described in their blog post. In this post, I will add more detail about the preprocessing steps and show how you can build a similar neural network model in Wolfram Language.

Getting Data

As mentioned above, part of our dataset is the Yamaha e-piano performances. Their website doesn't let you download all performances at once, so we used another resource that has the same performances and also contains a set of classical and jazz compositions.

In the original work, the Magenta team used only the Yamaha dataset, but with heavy augmentation on top of it: time-stretching (making each performance up to 5% faster or slower) and transposition (raising or lowering the pitch of each performance by up to a major third).

You can create your own list of MIDI files and build a dataset with the help of the code provided later in this post. Here are some sources of free MIDI songs: The Lakh MIDI Dataset (a dataset very well prepared for ML projects), MidiWorld, and FreeMidi.
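
To make the two augmentations concrete, here is a minimal sketch. It assumes a hypothetical note format, {onset in seconds, duration in seconds, MIDI pitch, velocity}, which is not the event representation used later in this post. The function names are also made up for illustration, and dropping notes that leave the piano range is just one possible choice.

```wolfram
(* Hypothetical note format: {onsetSeconds, durationSeconds, pitch, velocity} *)

(* Time-stretching: factor 1.05 makes a performance 5% slower, 0.95 makes it 5% faster *)
timeStretch[notes_List, factor_] :=
  Map[{#[[1]]*factor, #[[2]]*factor, #[[3]], #[[4]]} &, notes];

(* Transposition by a number of semitones; a major third is 4 semitones.
   Notes that fall outside the piano range (MIDI pitches 21-108) are dropped. *)
transposeNotes[notes_List, semitones_Integer] :=
  Select[
   Map[{#[[1]], #[[2]], #[[3]] + semitones, #[[4]]} &, notes],
   21 <= #[[3]] <= 108 &];

notes = {{0.0, 0.5, 60, 80}, {0.5, 0.25, 64, 72}};

(* 3 tempo variants x 9 transpositions = 27 augmented performances *)
augmented = Flatten[
   Table[transposeNotes[timeStretch[notes, f], s],
    {f, {0.95, 1.0, 1.05}}, {s, -4, 4}], 1];
Length[augmented]  (* 27 *)
```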


MIDI is short for Musical Instrument Digital Interface. It’s a language that allows computers, musical instruments, and other hardware to communicate. MIDI carries event messages that specify musical notation, pitch, velocity, vibrato, panning, and clock signals (which set tempo).

For this project, we need only the events that mark where each note starts and ends, together with its pitch and velocity.

Preprocessing Data

Even though MIDI is already a digital representation of music, we can't just take the raw bytes of a file and feed them to an ML model, as we can with models that work on images. First of all, images and music are conceptually different kinds of data: an image is a single item of data, while a piece of music (a song, a composition, etc.) is a sequence of events. Another reason is that the raw MIDI representation, and even a single MIDI event, contains a lot of information that is irrelevant to our task.

Thus I need a special data representation, a MIDI-like stream of musical events. Specifically, I use the following set of events:

  • 88 note-on events, one for each of the 88 MIDI pitches of piano range. These events start a new note.
  • 88 note-off events, one for each of the 88 MIDI pitches of piano range. These events release a note.
  • 100 time-shift events in increments of 10 ms up to 1 second. These events move forward in time to the next note event.
  • 34 velocity events, corresponding to MIDI velocities quantized into 34 bins. These events change the velocity applied to subsequent notes.

The neural network operates on a one-hot encoding over these 310 different events (88 + 88 + 100 + 34). This is the same representation as in the original work except for the number of note-on/note-off events: I encode the 88 notes of the piano range instead of all 128 notes of the MIDI pitch range, which reduces the size of the one-hot vector and makes learning easier.
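
One way to picture the 310-event vocabulary is as four consecutive index ranges. The sketch below assumes a particular ordering (note-on, note-off, time-shift, velocity) and hypothetical function names; the actual ordering in the project code may differ.

```wolfram
(* Assumed layout, for illustration only:
   1-88 note-on, 89-176 note-off, 177-276 time-shift, 277-310 velocity.
   The piano range covers MIDI pitches 21 (A0) to 108 (C8). *)
noteOnIndex[pitch_Integer] /; 21 <= pitch <= 108 := pitch - 20;
noteOffIndex[pitch_Integer] /; 21 <= pitch <= 108 := 88 + pitch - 20;
timeShiftIndex[steps_Integer] /; 1 <= steps <= 100 := 176 + steps;
velocityIndex[bin_Integer] /; 1 <= bin <= 34 := 276 + bin;

(* One-hot vector of length 310 with a single 1 at the event index *)
oneHot[index_Integer] := UnitVector[310, index];

{noteOnIndex[60], noteOffIndex[60], timeShiftIndex[6], velocityIndex[34]}
(* {40, 128, 182, 310} *)

Length[oneHot[noteOnIndex[60]]]  (* 310 *)
```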

Let's dive into the code for importing and preprocessing a MIDI file.

Wolfram Language has built-in support for MIDI files, which greatly simplifies the initial work. To get data from a MIDI file, you import it with specific elements. In the code below I also extract and calculate the information related to the tempo of a song.

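(* Import the raw event data and the header of the MIDI file *)
{raw, header} = Import[path, #]& /@ {"RawData", "Header"};
(* Use the first tempo found, or the MIDI default of 500000 microseconds per beat (120 BPM) *)
tempos = Cases[Flatten[raw], HoldPattern["SetTempo" -> tempo_] :> tempo];
microsecondsPerBeat = If[Length@tempos > 0, First[tempos], 500000];

(* Ticks per beat come from the header; convert them to seconds per tick *)
ticksPerBeat = First@Cases[header,HoldPattern["TimeDivision"->x_] :> x];
secondsPerTick = (microsecondsPerBeat/ticksPerBeat) * 10^-6.;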
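
As a worked example of this arithmetic, assume a file with no tempo event (so the default tempo applies) and 480 ticks per beat. The value 480 is only an assumption for illustration; the real value comes from the file header.

```wolfram
microsecondsPerBeat = 500000;  (* MIDI default tempo, 120 beats per minute *)
ticksPerBeat = 480;            (* assumed value, for illustration only *)
secondsPerTick = (microsecondsPerBeat/ticksPerBeat)*10^-6.;

(* A gap of 56 ticks between two events, converted to seconds *)
deltaTicks = 56;
deltaSeconds = deltaTicks*secondsPerTick;   (* about 0.0583 seconds *)

(* Number of 10 ms time-shift steps that best approximates the gap *)
Round[deltaSeconds/0.01]  (* 6 *)
```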

One MIDI event in the WL representation looks like this:

{56, {9, 0}, {46, 83}}

  • 56 is the time elapsed since the previous event (in ticks), which we encode as time-shift events
  • 9 is the status byte, which distinguishes between MIDI event types (9 is note-on, 8 is note-off)
  • 0 is the MIDI channel (irrelevant for us)
  • 46 is the pitch of the note (used in note-on/note-off events)
  • 83 is the velocity, which we encode in a velocity event
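
For illustration, the parts of such an event can be pulled apart with a destructuring assignment. The variable names here are hypothetical and are not the accessor functions used in the project code.

```wolfram
event = {56, {9, 0}, {46, 83}};

(* Destructure the nested list into named parts *)
{deltaTicks, {status, channel}, {pitch, velocity}} = event;

(* Translate the status value into a readable event kind *)
kind = Switch[status, 9, "NoteOn", 8, "NoteOff", _, "Other"];

{deltaTicks, kind, pitch, velocity}
(* {56, "NoteOn", 46, 83} *)
```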

If you want to understand how real raw MIDI data is structured, this blog is particularly useful.

So we need to parse the sequence of MIDI events and keep only the note-on and note-off events, plus any other events (excluding meta-messages) whose time shift is greater than 0.
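
A minimal sketch of such a filter follows. It assumes that channel events have the shape {deltaTicks, {status, channel}, {data...}} shown above and that meta-messages do not match this shape. The name keepEventQ and the sample track are hypothetical.

```wolfram
(* Keep note-on (9) and note-off (8) events, plus any other channel event
   that advances time; everything else, including meta-messages, is dropped *)
keepEventQ[{delta_Integer, {status_Integer, _Integer}, _List}] :=
  MemberQ[{8, 9}, status] || delta > 0;
keepEventQ[_] := False;

track = {
   {0, {9, 0}, {46, 83}},    (* note-on: kept *)
   {0, {11, 0}, {64, 127}},  (* control change with no time shift: dropped *)
   {30, {11, 0}, {64, 0}},   (* control change that advances time: kept *)
   {56, {8, 0}, {46, 0}}};   (* note-off: kept *)

Select[track, keepEventQ]
(* {{0, {9, 0}, {46, 83}}, {30, {11, 0}, {64, 0}}, {56, {8, 0}, {46, 0}}} *)
```

Events that are neither note-on nor note-off are kept only for their time shift, so that the gaps between notes stay correct.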

After filtering the MIDI data down to the needed events, I use the code below to encode each MIDI event into our event representation:

encode[track_, secondsPerTick_] := Block[{lastVelocity = 0},
  Flatten[
    Map[
      Block[{list = {}},
        (* Add time shifts when needed *)
        If[timeShiftByte[#] > 0,
          list = Join[list, encodeTimeShift[timeShiftByte[#, secondsPerTick]]]];

        (* Proceed only if it's a note event *)
        If[statusByte[#] == NoteOnByte || statusByte[#] == NoteOffByte,

          (* Add a velocity event if the velocity differs from the last one seen *)
          If[lastVelocity != quantizedVelocity[velocityByte[#]] && statusByte[#] == NoteOnByte,
            lastVelocity = quantizedVelocity[velocityByte[#]];
            list = Join[list, List[encodeVelocity[velocityByte[#]]]]];

          (* Add note event *)
          list = Join[list, List[encodeNote[noteByte[#], statusByte[#] == NoteOnByte]]]];

        (* Return the encoded list for this event *)
        list
      ] &,
      track],
    1]];
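
The helper functions called by encode are not shown in this post. As a rough idea of what the velocity quantization and the time-shift splitting might look like, here is a sketch. The names velocityBin and timeShiftSteps and the exact binning are assumptions for illustration, not the code from the repository.

```wolfram
(* Assumed binning: map a MIDI velocity (0-127) to one of 34 bins, numbered 1-34 *)
velocityBin[velocity_Integer] := Floor[velocity*34/128] + 1;

(* Assumed splitting: turn a gap in seconds into 10 ms steps, using several
   time-shift events when the gap is longer than 1 second (100 steps) *)
timeShiftSteps[seconds_?NumericQ] := Module[{steps = Round[seconds/0.01]},
   Join[
    ConstantArray[100, Quotient[steps, 100]],
    If[Mod[steps, 100] > 0, {Mod[steps, 100]}, {}]]];

velocityBin /@ {0, 83, 127}  (* {1, 23, 34} *)

timeShiftSteps[2.35]  (* {100, 100, 35} *)
```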

For sanity's sake, I won't flood this post with the full implementation, but you can find the whole code on GitHub.

Building a Model

Because music data is a sequence of events, we need an architecture that knows how to remember. This is exactly what recurrent neural networks try to do. I won't explain how they work here, but I recommend watching this introduction video. The Wolfram Documentation also has a very detailed guide to sequence learning with neural networks.

In my case, I use a modified version of the architecture from the Wolfram English Character-Level Language Model V1, with teacher forcing for more efficient training.

Compared to the original language model, I removed one GRU layer and the Dropout layer, because the dataset is not very large.
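
As a rough illustration of how a network of this kind can be set up in Wolfram Language, here is a minimal sketch. The number of GRU layers, the layer sizes, and the names eventNet and teacherForcingNet are illustrative assumptions, not the exact configuration of the trained model.

```wolfram
(* Sketch only: layer count and sizes are assumed values *)
eventNet = NetChain[{
    UnitVectorLayer[310],              (* one-hot encode event indices 1..310 *)
    GatedRecurrentLayer[256],
    GatedRecurrentLayer[256],
    NetMapOperator[LinearLayer[310]],  (* per-step scores over the 310 events *)
    SoftmaxLayer[]}];

(* Teacher forcing: the net sees events 1..n-1 of a sequence and is trained
   to predict events 2..n of the same sequence *)
teacherForcingNet = NetGraph[<|
    "most" -> SequenceMostLayer[],
    "rest" -> SequenceRestLayer[],
    "predict" -> eventNet,
    "loss" -> CrossEntropyLossLayer["Index"]|>,
   {NetPort["Input"] -> "most" -> "predict" -> NetPort["loss", "Input"],
    NetPort["Input"] -> "rest" -> NetPort["loss", "Target"]}];
```

With this wiring, each training example is a single list of event indices, and the loss compares the prediction at every step with the event that actually came next.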

[Figure: neural network architecture]

After the first 10 rounds of training the model was slowly but steadily converging:

[Figure: training and validation errors over the first 10 rounds of training]

Blue points are training errors and orange are validation errors.

Here are more examples after these training rounds: 2, 3

I'll keep working on this and update the post soon.

POSTED BY: Pavlo Apisov
5 days ago
