# [WSS19] OCR for Sheet Music

Posted 1 year ago | 4593 Views | 10 Replies | 18 Total Likes

# Overview

Given an image of sheet music, generate an internal representation of it that can then be played, reprinted, or exported to external musical notation software.

In order to do this, we first split the sheet music into images of each staff and remove the staff lines; we then find bounding boxes for each musical component using image segmentation and classify them. Everything is then passed to a parser that produces the final representation of the sheet music.

# Finding the Staffs

Staff lines are long, thin and horizontal; as such, they can be highlighted using a bottom hat transform with a short vertical kernel, and then extracted with ImageLines. By looking for close groups of five such lines, we can extract an image for each staff in the original score. We also find the median distance between the staff lines, both for padding the staff images and for reuse later.

detectStaffLineImages[image_] := Module[
  {staffLines =
    Select[
     ImageLines[BottomHatTransform[image, BoxMatrix[{1, 0}]], Automatic, 0.0065][[All, 1]],
     (Abs@VectorAngle[Subtract @@ #, {-1, 0}] < 10 \[Degree]) &]},
  \[CapitalDelta] = Median@Differences@Sort@Map[Mean, staffLines[[All, All, 2]]];
  staffLineGroups =
   Gather[
    Sort[staffLines, (Last@Mean@#1 > Last@Mean@#2) &],
    (Last@Mean@#1 - Last@Mean@#2 < 5 \[CapitalDelta]) &] // Echo;
  {Map[ImageTrim[image, Flatten[#, 1], {0, 2.25 \[CapitalDelta]}] &, staffLineGroups],
   \[CapitalDelta]}
  ]
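As a usage sketch, the function can be applied to a scanned page; the file name here is a placeholder, not one from the project:

```mathematica
(* "score.png" is a placeholder; the function returns the list of trimmed
   staff images together with the median staff-line distance *)
{staffImages, \[CapitalDelta]} = detectStaffLineImages[Import["score.png"]];
```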


# Removing the Staff Lines

Staff lines are dark, long, thin and horizontal. In order to remove them, we take the pixel-wise minimum of the image's closing with a short vertical kernel and the negated bottom hat transform of the image with a long horizontal kernel; taking the pixel-wise maximum of the result with its morphological binarization then gives an image of the staff with no staff lines and not many artefacts.

(* imageMin and imageMax are taken to be pixel-wise minimum/maximum helpers,
   as described in the text *)
removeStaffLines[image_, \[CapitalDelta]_] := With[
  {withoutStaffLines =
    imageMin[
     Closing[image, BoxMatrix[{\[CapitalDelta]/6, 0}]],
     ColorNegate@BottomHatTransform[image, BoxMatrix[{0, \[CapitalDelta]}]]]},
  imageMax[withoutStaffLines,
   MorphologicalBinarize[withoutStaffLines, {0.66, 0.33}]]
  ]


# Finding Bounding Boxes for the Musical Components

## Image Segmentation

For the image segmentation, we use a SegNet [^segnet] trained on the DeepScores [^deepscores] [Segmentation Dataset](https://drive.google.com/drive/folders/1KFxqi0rO-bJrd03rLk87fF1iOmnjpaoG), with some preprocessing: each pair of images (original and segmented) was separated into individual staffs, and the symbol colors were negated (so that there would be more contrast with the black background).
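The color negation step might be sketched as follows; `imageStaffs` and `maskStaffs` are assumed names for the per-staff crops of an (image, segmentation) pair, not identifiers from the project:

```mathematica
(* Pair each staff crop with its negated segmentation mask as training data *)
trainingPairs[imageStaffs_List, maskStaffs_List] :=
 Thread[imageStaffs -> (ColorNegate /@ maskStaffs)]
```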

## Finding the Bounding Boxes

In order to find the bounding boxes from the segmented image, we look for its morphological components using ComponentMeasurements. Before using ComponentMeasurements, we blur and binarize the segmented image, as this usually yields more accurate results; we also discard any bounding box with an area of less than 25 pixels, as such components are very likely just noise.
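A minimal sketch of this step, assuming a blur radius of 2 (the exact radius used in the project is not stated):

```mathematica
findBoundingBoxes[segmented_Image] :=
 Module[{components},
  (* blurring and binarizing before measuring usually gives cleaner components *)
  components =
   ComponentMeasurements[Binarize@Blur[segmented, 2], {"BoundingBox", "Area"}];
  (* discard components with an area below 25 pixels as likely noise *)
  Cases[components, (_ -> {box_, area_}) /; area >= 25 :> box]
  ]
```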

# Classifying the Musical Components

For the component classification, we use a ClassifierFunction trained on the DeepScores [^deepscores] [Classification Dataset](https://drive.google.com/file/d/1bdBrX0dAX734I3MA_6-wH_-N2eqq_tf_/view), cropping the images according to a lookup table - this is so that the images contain only the actual symbol, and the classifier doesn't learn the context instead of the symbol.
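Training such a classifier might look roughly like this; the variable names and the choice of method are assumptions, not the exact project code:

```mathematica
(* croppedImages and symbolLabels are assumed to come from the DeepScores
   classification data, after cropping with the lookup table *)
symbolClassifier =
 Classify[Thread[croppedImages -> symbolLabels], Method -> "NeuralNetwork"];
```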

Unfortunately, the classifier trained during the Wolfram Summer Program misclassifies some images; however, it does still provide something that can be worked with.

# Parsing the Sheet Music

Very limited parsing could be done during the Wolfram Summer Program: we sort the notes, pitch them, and add in the rests as well.

## Finding the Pitches

Pitching the notes depends on their distance to the bottom of the staff and a reference pitch, which is determined by the clef. Because of the way we trimmed the staff images, we know that the first staff line (from the bottom) has the Y coordinate $2.25 \Delta$, where $\Delta$ is the distance between the staff lines in the original image. We also know, from the way we trimmed the staffs from the original page, that the distance between the staff lines in the resized image is fixed - and is $\delta = 10$.

PitchNumber calculates the distance from the note centroid to the bottom of the staff, normalizing by half of the distance between the staff lines. PitchToLetter finds the name for the given output of PitchNumber in relation to some reference pitch.

PitchNumber[noteCentroid_, \[Delta]_, bottom_] :=
 (noteCentroid[[2]] - bottom)/(0.5 \[Delta])

naturalNotes =
 Flatten[Table[# <> ToString[i] & /@ {"A", "B", "C", "D", "E", "F", "G"}, {i, 1, 7}], 1]

PitchToLetter[pitchNumber_, referenceNumber_, referenceLetter_] :=
 RotateLeft[naturalNotes, FirstPosition[naturalNotes, referenceLetter] - 1][[
  Round[pitchNumber - referenceNumber] + 1]]
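As a usage sketch: in the treble clef the bottom staff line is E4, so mapping reference pitch number 0 to "E4", a note two positions up (on the second line) comes out as G4:

```mathematica
PitchToLetter[2, 0, "E4"]  (* -> "G4" *)
```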


## Basic Parsing

In this case, the parsing is a simple sort - no dynamics, articulation or even accidentals are taken into consideration.

(* NotationCentroid is taken to be a helper that extracts the centroid of a
   note or rest expression *)
ParseSheetMusic[pitchedNotes_, rests_] :=
 SortBy[Join[pitchedNotes, rests], NotationCentroid[#][[1]] &]


In the end, this is our final result:

{SheetMusicNote[{66.5, 64.}, "B5"],
SheetMusicRest[{97., 45.5}, "restHBar"],
SheetMusicRest[{97.5, 64.5}, "restHalf"],
SheetMusicNote[{121., 64.}, "B5"],
SheetMusicNote[{159., 64.}, "B5"],
SheetMusicNote[{196.5, 64.}, "B5"],
SheetMusicRest[{287., 49.5}, "rest32nd"],
SheetMusicNote[{327.5, 50.}, "F4"],
SheetMusicNote[{366., 59.5}, "A5"],
SheetMusicRest[{456., 75.}, "restQuarter"]}
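Since the overview promises that the representation can be played, here is a sketch that renders it with SoundNote; durations are not parsed yet, so giving every symbol a fixed length of 0.5 seconds is an assumption:

```mathematica
playSheetMusic[parsed_List] :=
 Sound[parsed /. {
    SheetMusicNote[_, pitch_String] :> SoundNote[pitch, 0.5],
    _SheetMusicRest :> SoundNote[None, 0.5]  (* SoundNote[None] is a rest *)
   }]
```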

10 Replies
Posted 1 year ago
 Thank you very much. It is a big help to me, and it shows me the great potential of the language. Since I am a complete novice in both the language and music theory, I was struggling with creating user functions.
Posted 1 year ago
 Thank you very much Daniel. Wonderful presentation!
Posted 1 year ago
 I am having a hard time getting to the [Classification Dataset](https://drive.google.com/file/d/1bdBrX0dAX734I3MA_6-wH_-N2eqq_tf_/view).
Posted 1 year ago
 Could you share the method you used to load the image that begins the process, along with the sheet music you used?
Posted 1 year ago
 I have sheet music in PDF format. I don't know how to get it into your program. Can someone provide an example?
Posted 1 year ago
 As Larry identified above, neither of us can access the SegNet page on Google at https://drive.google.com/drive/folders/1KFxqi0rO-bJrd03rLk87fF1iOmnjpaoG
Posted 1 year ago
 Daniel, thanks for writing up this material. Is it possible that you could also supply the notebook that you used for this post? It would help immeasurably in replicating your results. I am having difficulty finding and applying the DeepScores NN. Thank you, Michael Kelly
Posted 1 year ago
 Has anyone been able to create anything that would be helpful?