It's a small step, but emcee-3PO can now identify the staves in an image of sheet music for my
single test case of "My Darling Clementine." I need to include hundreds more test cases, and I plan
to once I implement code that makes the tests mark up the sheet music with what emcee-3PO detected,
so I can visually inspect the accuracy.
Ichiro Fujinaga's "Optical Music Recognition using Projections"
explains the process in detail, but it turns out to be relatively simple.
To locate the staves:
Do a y-projection on the image.
A projection just reduces the number of dimensions in an image. In this case, we just count
the number of dark-colored pixels in each row of the image. It's similar in theory to
3D projection, but instead of projecting
three dimensions onto a plane, we're projecting a plane onto a line.
I used a threshold of 50% to determine if a pixel was dark enough to include in the projection:
if R+G+B < (FF+FF+FF) / 2, I count the pixel as dark.
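Here's a minimal sketch of the y-projection in Python. It is not emcee-3PO's actual code: the function names are made up for illustration, and I'm assuming the image has already been loaded as a list of rows, where each row is a list of (R, G, B) tuples with values from 0 to 255.

```python
# Hypothetical helper names; the image format (rows of RGB tuples) is an assumption.
DARK_THRESHOLD = (0xFF + 0xFF + 0xFF) / 2  # the 50% cutoff described above


def is_dark(pixel):
    """Return True if the pixel's R+G+B sum falls below the 50% threshold."""
    r, g, b = pixel
    return r + g + b < DARK_THRESHOLD


def y_projection(image):
    """Count the dark pixels in each row, collapsing the image onto a line."""
    return [sum(1 for pixel in row if is_dark(pixel)) for row in image]


WHITE = (255, 255, 255)
BLACK = (0, 0, 0)

tiny_image = [
    [WHITE, WHITE, WHITE, WHITE],  # blank row
    [BLACK, BLACK, BLACK, BLACK],  # a staff line spans the whole row
    [BLACK, WHITE, WHITE, WHITE],  # a stray dark pixel
]
projection = y_projection(tiny_image)  # [0, 4, 1]
```

The result has one entry per row of the image, and the rows that contain staff lines stand out as the largest values.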
Find the local maxima.
We want to find the places where the number of dark pixels in a row is highest - those will indicate the horizontal
lines on the staff. To do that, we find all the places where the number of dark pixels stops growing and starts getting smaller, that is, where the
slope changes from positive to negative. To ignore noise, we set a threshold, as Fujinaga suggests, at
the average of the row counts in the projection, so we don't include anything less than that in our collection of local maxima.
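A rough sketch of that peak search in Python might look like the following. The function name is hypothetical, and how to treat a flat-topped peak is my own assumption (this version records the last row of the plateau); the threshold is the mean of the projection, as described above.

```python
def local_maxima(projection):
    """Return the row indexes where the projection peaks at or above its mean."""
    if not projection:
        return []
    threshold = sum(projection) / len(projection)
    maxima = []
    rising = False
    for i in range(1, len(projection)):
        diff = projection[i] - projection[i - 1]
        if diff > 0:
            rising = True
        elif diff < 0:
            # The slope just went from positive to negative, so row i - 1 is a peak.
            # On a flat-topped peak this is the last row of the plateau (an assumption).
            if rising and projection[i - 1] >= threshold:
                maxima.append(i - 1)
            rising = False
    return maxima


sample_projection = [0, 1, 9, 1, 0, 2, 1, 0, 8, 8, 0]
peaks = local_maxima(sample_projection)  # [2, 9]; the small bump at row 5 is below the mean
```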
Find the tightest groups of 5.
We want to find all the places where 5 local maxima are the smallest distance apart, which should indicate
the 5 lines in a staff. This part is accomplished by examining each 5-element window in the array of
local maxima, and finding the one with the smallest distance between its points. Then you can remove
all the windows that include any of those points, and continue until there are no more windows.
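As a sketch of the windowing step in Python: the function name is hypothetical, and I'm assuming the local maxima are sorted row indexes and that "distance between its points" can be measured as the span from the first to the last line in the window.

```python
def find_staves(maxima, lines_per_staff=5):
    """Greedily pick the tightest non-overlapping windows of 5 local maxima."""
    windows = [
        maxima[i:i + lines_per_staff]
        for i in range(len(maxima) - lines_per_staff + 1)
    ]
    staves = []
    while windows:
        # The tightest window has the smallest span from its top to bottom line.
        tightest = min(windows, key=lambda w: w[-1] - w[0])
        staves.append(tightest)
        # Drop every window that shares a point with the one we just took.
        used = set(tightest)
        windows = [w for w in windows if used.isdisjoint(w)]
    return sorted(staves)


# Two staves with a stray maximum (row 120) between them.
sample_maxima = [10, 20, 30, 40, 50, 120, 200, 210, 220, 230, 240]
staves = find_staves(sample_maxima)
# [[10, 20, 30, 40, 50], [200, 210, 220, 230, 240]]
```

One caveat with this sketch: it has no upper limit on how wide a window may be, so several stray maxima left over at the end could still be grouped into a bogus staff.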
Expand those indexes to collect the places where the notes fall outside the staff lines.
I don't remember Fujinaga mentioning this in the paper I linked to above, but I'm thinking it must be in there.
Essentially, since the local maxima get us only what's in between the 5 lines of the staff, we need
to expand it a bit so we can get the notes that don't fall directly between the 5 lines. Right now,
I've used 1/4 of the average of the rows in the projection, but I think it will need to be
an even smaller threshold because I'm still not reliably getting all of the notes.
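Here's how I'd sketch the expansion in Python, under some assumptions: the function name is hypothetical, a staff is a sorted list of its 5 line rows, and "expanding" means walking outward from the top and bottom lines until the projection drops below 1/4 of its average. The fraction is a parameter so a smaller cutoff is easy to try.

```python
def expand_staff(projection, staff, fraction=0.25):
    """Grow a staff's row range outward while the projection stays above the cutoff."""
    cutoff = fraction * sum(projection) / len(projection)
    top, bottom = staff[0], staff[-1]
    # Walk up from the top line while the rows above still have enough dark pixels.
    while top > 0 and projection[top - 1] >= cutoff:
        top -= 1
    # Walk down from the bottom line in the same way.
    while bottom < len(projection) - 1 and projection[bottom + 1] >= cutoff:
        bottom += 1
    return top, bottom


# Staff lines at rows 3, 5, 7, 9 and 11, with notes spilling above and below.
sample_rows = [0, 0, 3, 9, 2, 9, 2, 9, 2, 9, 2, 9, 4, 3, 0, 0]
bounds = expand_staff(sample_rows, [3, 5, 7, 9, 11])  # (2, 13)
```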
Up next: reading the notes on the staves. That's going to be cool.
Leave a comment
Shouldn't this be a prime example for Hough Transform for lines? Projecting along the y axis will screw up if the image is not straight (or if you have keystone effect due to distance parallax).
Posted by Dat Chu
on Oct 28, 2011 at 11:33 AM UTC - 5 hrs
I think if the image is reasonably straight (meaning the slope of the staves is low) it corrects for itself, because it's looking for maxima in groups of 5, and then expands outward until the magnitude of the projection is lower than some threshold.
That said, I do have an example of a scan-gone-wrong that made the music not straight, so when I get a chance, I'll run that through it to see how well it does.
Also, it does sound like a great example to use Hough Transform. I'll look into doing an implementation of that too, if for no other reason than the fun of it.
Thanks for pointing it out to me!
Posted by Sammy Larbi
on Oct 31, 2011 at 02:18 PM UTC - 5 hrs