"Reading Handwritten Numbers on Data Sheet"

Posted 9 years ago
2 Replies
2 Total Likes

This is a sample data sheet. I have a lot of these, and the task is relatively simple. I want to read the image into Mathematica, then extract a table of "0" and "1" values where the underlines on the data sheet are. If each underline on the sheet is a cell, then the rule is: assign a "0" wherever the cell is blank (only has an underline); assign a "1" if there is a mark that looks like a "1" in the cell. Then export the table to a data file or spreadsheet.

Does anyone have experience with this type of task? Thanks.

POSTED BY: William Shankle

I'm not an expert on this area by any measure, but let's take a look at one part of your image:

enter image description here

Given an image with some horizontal lines, how can we tell which ones have a check over them?

  1. First, let's clean up the image. I like LocalAdaptiveBinarize for this. Finding the right coefficients for the function can be done nicely with Manipulate:

    Manipulate[
     LocalAdaptiveBinarize[image, 10, {a, b, c}],
     {{a, .6}, -2, 2}, {{b, 2}, -2, 2}, {{c, 0}, -1, 1}]

     Once the sliders give a good result, fix the coefficients:

    clean = LocalAdaptiveBinarize[image, 10, {-0.15, 2, 0.05}]

    enter image description here

  2. Get rid of the horizontal lines in the image with a Sobel filter. This kernel responds to vertical edges, so the horizontal lines largely drop out. I basically stole this from the documentation. Then clean up the results a bit:

    checks = DeleteSmallComponents@ImageConvolve[clean, {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}}] 

    enter image description here

  3. "Subtract" the checks from the original image to get an image with just the horizontal lines. I had to make the checks a bit bigger with dilation to make sure they were completely removed. ImageAdd was used because the previous result was a negative.

    lines = ImageAdd[clean, Dilation[checks, 1]]

    enter image description here

  4. The last image called "lines" needs to be cleaned up a bit. This is done by removing some small components from the image, dilating the results to ensure that lines are connected and then thinning them out again.

        Dilation[DeleteSmallComponents[ColorNegate@lines, 10], 5]

    enter image description here

  5. We want to get the ends of each of those lines. Here's how to get a simple bounding box for each from left to right, top to bottom:

    boxes = ComponentMeasurements[%,  "BoundingBox"]

    Let's average out the vertical components of each bounding box, since they're supposed to be horizontal lines. This should give us a list of end points for each line:

    boxes2 = boxes[[All, 2]] /. {{h1_, v1_}, {h2_, v2_}} :> {{h1, Mean[{v1, v2}]}, {h2,  Mean[{v1, v2}]}}

    Here's what the points look like (in red)

    HighlightImage[clean, Partition[Flatten@boxes2, 2]]

    enter image description here

  6. Add thirty or so pixels to the height of each box to get a region where you'd expect a check mark:

    regionBoxes = boxes2 /. {{{x_, y_}, p2_} :> {{x, y + 30}, p2}}
    HighlightImage[clean, Partition[Flatten@regionBoxes, 2]]

    enter image description here

  7. Now we crop each of those regionBoxes out of the image that contains the checks:

    separatedChecks = ImageTrim[checks, #] & /@ regionBoxes

    enter image description here

  8. A reasonable rule of thumb is that there's a check if over 100 of the pixels are white:

    checkedQ[img_] := 
     Total@Flatten@ImageData[img] > 100 (*Threshold number of points*)

    For your actual application, we'd use either the TextRecognize function or the Classify function to make a digit recognizer. There are examples in the documentation on how to make a digit recognizer with the Classify function.

    Since the components are stored from left to right and top to bottom, we can easily visualize the results and compare with the original image:

      Partition[checkedQ /@ separatedChecks, 5] /. {False -> \[EmptySquare], True -> \[FilledSquare]}

    enter image description here
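To get from the True/False grid to the 0/1 table asked for in the question, something like the following might work. This is only a sketch: `whitePixelCounts` and `markTable` are names I made up for illustration, the default threshold of 100 and the 5 columns are just the values used above, and the input is assumed to be a list of cropped images like `separatedChecks`.

```wolfram
(* Hypothetical helper: sum of pixel values (the number of white pixels
   in a binary image) for each cropped cell. *)
whitePixelCounts[crops_List] := Total[Flatten[ImageData[#]]] & /@ crops

(* Hypothetical helper: 1 where a cell has more than `threshold` white
   pixels, 0 otherwise, arranged with nCols entries per row.
   The default of 100 is the rule of thumb from step 8, an assumption
   that should be checked against your own sheets. *)
markTable[crops_List, nCols_Integer, threshold_ : 100] :=
 Partition[Boole[# > threshold] & /@ whitePixelCounts[crops], nCols]
```

With the variables from the steps above, `Export["datasheet.csv", markTable[separatedChecks, 5]]` should then write the table to a spreadsheet-readable file (the file name is just a placeholder). Looking at `Histogram[whitePixelCounts[separatedChecks]]` first may help in picking a threshold that separates blank cells from marked ones.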

We would probably want to build in some checks to make sure that the image is being recognized properly, and maybe some extra code to handle such problems. For example, this wouldn't have worked so easily if the horizontal lines weren't as easy to find. Also, we really aren't using the fact that the lines are in a grid, so there's a lot more we could do for your application. If the images are all very consistent, we might not even have to find where the checks/numbers should be for each image.
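As one example of such a check, here is a rough sketch that uses the grid structure. `sortIntoGrid` is a hypothetical helper, not a built-in function. It assumes the line segments have the same form as `boxes2` (pairs `{{xLeft, y}, {xRight, y}}`), that every row has the same number of underlines, and that the rows are far enough apart that every line in one row sits higher than every line in the next.

```wolfram
(* Hypothetical helper, not part of the steps above.
   segments: list of {{xLeft, y}, {xRight, y}} pairs, as in boxes2.
   nCols: assumed number of underlines per row. *)
sortIntoGrid[segments_List, nCols_Integer?Positive] :=
 If[! Divisible[Length[segments], nCols],
  (* A partial row means a line was missed or a stray one was found. *)
  $Failed,
  Flatten[
   (* Within each row, order the lines left to right. *)
   Function[row, SortBy[row, #[[1, 1]] &]] /@
    (* Image coordinates have y increasing upward, so sorting by -y
       orders the lines top to bottom before splitting into rows. *)
    Partition[SortBy[segments, -#[[1, 2]] &], nCols],
   1]]
```

A result of `$Failed` would be a signal to look at that sheet by hand. Otherwise the sorted list can be used in place of `boxes2`, so the later steps no longer depend on the order in which the components happened to be found.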

POSTED BY: Sean Clarke

Dear Sean, thank you so much. There are a few more issues to figure out, but I will see what I can do. You have helped us evaluate the largest database on the planet for Alzheimer's. Greatly appreciated. Rod Shankle

POSTED BY: William Shankle