
[WSS19] OCR for Sheet Music

Posted 5 years ago

Overview

Given an image of some sheet music, we generate an internal representation of it that can then be played, reprinted, or exported to external musical notation software.

To do this, we first split the sheet music into images of each staff and remove the staff lines; we then find bounding boxes for each musical component using image segmentation, and classify them. All of this is then passed to a parser that produces the final sheet-music representation.
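As a rough outline, the steps below chain together like this (the function names are the ones defined in the following sections; the segmentation and classification steps are elided):

(* hedged outline of the pipeline, using the functions defined below;
   "page" is a placeholder for the imported score image *)
{staffImages, \[CapitalDelta]} = detectStaffLineImages[page];
cleanedStaffs = removeStaffLines[#, \[CapitalDelta]] & /@ staffImages;
(* segmentation and classification then yield pitched notes and rests,
   which ParseSheetMusic sorts into the final representation *)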

Finding the Staffs

Staff lines are long, horizontal and thin; as such, they can be highlighted by a bottom hat transform with a short vertical kernel and then extracted with ImageLines. Then, by looking for close groups of five such lines, we can extract an image for each staff in the original score. We also find the median distance between the staff lines, both for padding the staff images and for later reuse.

detectStaffLineImages[image_] :=
 Module[{staffLines, \[CapitalDelta], staffLineGroups},
  (* highlight thin horizontal features, then keep only near-horizontal lines *)
  staffLines = Select[
    ImageLines[BottomHatTransform[image, BoxMatrix[{1, 0}]],
      Automatic, 0.0065][[All, 1]],
    Abs@VectorAngle[Subtract @@ #, {-1, 0}] < 10 \[Degree] &];
  (* median vertical spacing between adjacent staff lines *)
  \[CapitalDelta] = Median@Differences@Sort@Map[Mean, staffLines[[All, All, 2]]];
  (* sort lines top to bottom and gather them into groups of five per staff *)
  staffLineGroups = Gather[
    Sort[staffLines, Last@Mean@#1 > Last@Mean@#2 &],
    Last@Mean@#1 - Last@Mean@#2 < 5 \[CapitalDelta] &];
  (* trim one padded image per staff, returning the line spacing as well *)
  {ImageTrim[image, Flatten[#, 1], {0, 2.25 \[CapitalDelta]}] & /@ staffLineGroups,
   \[CapitalDelta]}
  ]

Original page image

One of the split staff images

Removing the Staff Lines

Staff lines are dark, long, thin and horizontal. To remove them, we take the pixel-wise minimum of a closing of the image with a short vertical kernel (which erases the thin dark horizontal lines) and the negated bottom hat transform of the image with a long horizontal kernel; taking the pixel-wise maximum of the result with its morphological binarization then cleans up most of the remaining artefacts, leaving an image of the staff with no staff lines.

(* pixel-wise minimum and maximum of two images, as described above *)
imageMin[a_, b_] := ImageApply[Min, {a, b}]
imageMax[a_, b_] := ImageApply[Max, {a, b}]

removeStaffLines[image_, \[CapitalDelta]_] :=
 With[
  {withoutStaffLines =
    imageMin[
     Closing[image, BoxMatrix[{\[CapitalDelta]/6, 0}]],
     ColorNegate@BottomHatTransform[image, BoxMatrix[{0, \[CapitalDelta]}]]]},
  (* whiten faint leftover artefacts via the max with a binarized copy *)
  imageMax[withoutStaffLines,
   MorphologicalBinarize[withoutStaffLines, {0.66, 0.33}]]
  ]

Staffs removed

Finding Bounding Boxes for the Musical Components

Image Segmentation

For the image segmentation, we use a SegNet [^segnet] trained on the DeepScores [^deepscores] [Segmentation Dataset](https://drive.google.com/drive/folders/1KFxqi0rO-bJrd03rLk87fF1iOmnjpaoG), with some preprocessing: each pair of images (original and segmented) was separated into individual staffs, and the symbol colors were negated so that there would be more contrast with the black background.
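A hedged sketch of that preprocessing, where splitStaffs stands in for a helper (not shown in the post) that trims both images of a pair at the same staff coordinates:

(* hypothetical sketch of the dataset preprocessing: split a pair of
   images into per-staff images and negate the segmented symbol colors *)
preprocessPair[original_Image, segmented_Image] :=
 MapThread[Rule,
  {splitStaffs[original],                   (* original staff images *)
   ColorNegate /@ splitStaffs[segmented]}]  (* negated target images *)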

SegNet architecture

Segmented staff and Binarize@*Blur of the segmented staff

Finding the Bounding Boxes

In order to find the bounding boxes from the segmented image, we look for its morphological components using ComponentMeasurements. Before doing so, we blur and binarize the segmented image, as this usually yields more accurate results; we also discard any components with an area of less than 25 pixels, as such components are very likely just noise.
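A minimal sketch of this step, assuming segmented is the segmented staff image; the 25-pixel area cutoff is the one mentioned above:

(* blur and binarize the segmented image, then measure bounding boxes
   of its morphological components, discarding areas under 25 px *)
findBoundingBoxes[segmented_Image] :=
 Values@ComponentMeasurements[
   Binarize@Blur[segmented], "BoundingBox", #Area >= 25 &]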

Bounding boxes

Classifying the Musical Components

For the component classification, we use a ClassifierFunction trained on the DeepScores [^deepscores] [Classification Dataset](https://drive.google.com/file/d/1bdBrX0dAX734I3MA_6-wH_-N2eqq_tf_/view), cropping the images according to a lookup table, so that the images contain only the actual symbol and the classifier doesn't learn the context instead of the symbol.
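A hedged sketch of this step; trainingExamples (a list of associations with "Image" and "Class" keys) and the cropRegion lookup are hypothetical names, not from the post:

(* hypothetical sketch: train on symbol images cropped per the lookup
   table, then classify a crop of the staff at a detected bounding box *)
symbolClassifier = Classify[
   ImageTrim[#Image, cropRegion[#Class]] -> #Class & /@ trainingExamples];
symbolClassifier[ImageTrim[staffImage, boundingBox]]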

Unfortunately, the classifier trained during the Wolfram Summer Program misclassifies some images; however, it does still provide something that can be worked with.

Classified symbols

Parsing the Sheet Music

Only very limited parsing could be done during the Wolfram Summer Program: we pitch the notes, add in the rests, and sort everything by horizontal position.

Finding the Pitches

Pitching the notes depends on their distance to the bottom of the staff and on a reference pitch, which is determined by the clef. Because of the way we trimmed the staff images, we know that the first staff line (from the bottom) has the Y coordinate $2.25 \Delta$, where $\Delta$ is the distance between the staff lines in the original image. We also know, from the way the staff images are resized, that the distance between the staff lines in each resized image is fixed at $\delta = 10$.

PitchNumber calculates the distance from the note centroid to the bottom of the staff, normalizing by half of the distance between the staff lines. PitchToLetter finds the name for the given output of PitchNumber in relation to some reference pitch.

Centroids of the correctly-classified notes

PitchNumber[noteCentroid_, \[Delta]_, bottom_] := (noteCentroid[[2]] - bottom)/(0.5 \[Delta])

naturalNotes =
  Flatten[Table[# <> ToString[i] & /@ {"A", "B", "C", "D", "E", "F", "G"}, {i, 1, 7}], 1]

PitchToLetter[pitchNumber_, referenceNumber_, referenceLetter_] :=
 RotateLeft[naturalNotes,
   FirstPosition[naturalNotes, referenceLetter] - 1][[
  Round[pitchNumber - referenceNumber] + 1]]
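For example, with $\delta = 10$ the staff bottom sits at $y = 2.25\,\delta = 22.5$ in the trimmed image; taking "E4" (the bottom line of a treble clef) as the reference letter at pitch number 0 is an assumption made here for illustration:

(* hypothetical usage: a note centroid 20 px above the staff bottom *)
PitchNumber[{120., 42.5}, 10, 22.5]  (* -> 4. *)
PitchToLetter[4, 0, "E4"]            (* -> "B5", in naturalNotes' octave numbering *)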

Basic Parsing

In this case, the parsing is a simple sorting - no dynamics, articulations, or even accidentals are taken into consideration.

(* centroid accessor; SheetMusicNote / SheetMusicRest store the centroid
   as their first argument (assumed from the output below) *)
NotationCentroid[notation_] := First[notation]

ParseSheetMusic[pitchedNotes_, rests_] :=
 SortBy[Join[pitchedNotes, rests], NotationCentroid[#][[1]] &]

In the end, this is our final result:

{SheetMusicNote[{66.5, 64.}, "B5"],
 SheetMusicRest[{97., 45.5}, "restHBar"],
 SheetMusicRest[{97.5, 64.5}, "restHalf"],
 SheetMusicNote[{121., 64.}, "B5"],
 SheetMusicNote[{159., 64.}, "B5"],
 SheetMusicNote[{196.5, 64.}, "B5"],
 SheetMusicRest[{287., 49.5}, "rest32nd"],
 SheetMusicNote[{327.5, 50.}, "F4"],
 SheetMusicNote[{366., 59.5}, "A5"],
 SheetMusicRest[{456., 75.}, "restQuarter"]}
POSTED BY: Daniel Csillag
10 Replies
Posted 5 years ago

Wow, I thought the notebook had been uploaded here. In any case, it's available at https://github.com/dccsillag/wss-omr.

Classification dataset link: it works; just download it and ignore Google Drive being unable to preview *.zip files. Segmentation dataset: available at https://repository.cloudlab.zhaw.ch/artifactory/deepscores/archives/2017/ (download the first archive; DeepScores itself is now rooted at https://tuggeluk.github.io/downloads/). The SegNet implementation should work fine.

As for inputting a PDF, you need to convert it to a raster image. Maybe use xpdf/mupdf?
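In the Wolfram Language itself, a minimal sketch (the file name is a placeholder):

(* import the PDF pages as vector graphics, then rasterize one page
   before passing it to detectStaffLineImages *)
pages = Import["score.pdf", "Pages"];
image = ColorConvert[
   Rasterize[First[pages], ImageResolution -> 300], "Grayscale"];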

This still isn't a robust system. I do plan on working on it at a later date though.

POSTED BY: Daniel Csillag

I have not been able to do anything with these functions. For instance, removeStaffLines uses the undefined functions imageMin and imageMax. There is a reference to SegNet and its source tutorial at http://mi.eng.cam.ac.uk/projects/segnet/tutorial.html, but the problem is that it is written for Python, not Mathematica. This post needs to be updated and explained in more detail.

POSTED BY: Michael Kelly
Posted 5 years ago

Has anyone been able to create anything that would be helpful?

POSTED BY: Larry Lange

Daniel

Thanks for writing up this material. Is it possible that you could also supply the notebook that you used for this post? It would help immeasurably in replicating your results. I am having difficulty finding and applying the DeepScores NN.

Thank you,

Michael Kelly

POSTED BY: Michael Kelly

As Larry identified above, neither of us can access the SegNet page on Google at https://drive.google.com/drive/folders/1KFxqi0rO-bJrd03rLk87fF1iOmnjpaoG

POSTED BY: Michael Kelly
Posted 5 years ago

I have sheet music in PDF format. I don't know how to get it into your program. Can someone provide an example?

POSTED BY: Larry Lange
Posted 5 years ago

Could you show the method you used to load the image that begins the process, and share the sheet music you used?

POSTED BY: Larry Lange
Posted 5 years ago

> Classifying the Musical Components: For the component classification, we use a ClassifierFunction trained on the DeepScores [^deepscores] [Classification Dataset](https://drive.google.com/file/d/1bdBrX0dAX734I3MA_6-wH_-N2eqq_tf_/view), cropping the images in relation to a lookup table - this is so that the images contain only the actual symbol, and the classifier doesn't learn the context instead of the symbol.
>
> Unfortunately, the classifier trained during the Wolfram Summer Program misclassifies some images; however, it does still provide something that can be worked with.

I am having a hard time getting to the [Classification Dataset](https://drive.google.com/file/d/1bdBrX0dAX734I3MA_6-wH_-N2eqq_tf_/view).

POSTED BY: Larry Lange

Thank you very much, Daniel. Wonderful presentation!

Posted 5 years ago

Thank you very much. It is a big help to me. It shows me the great potential of the language. Since I am a complete novice in both the language and music theory, I was struggling with creating user functions.

POSTED BY: Larry Lange