What is Optical Character Recognition (OCR)?

Optical Character Recognition (OCR) is something you do every day without noticing. When you read a sign or an email, you use your eyes (optical) to recognize the letters (characters) and read them as words. (You are doing it right now.) A computer has no such ability built in. When a document is scanned, it is stored as an image file, and to the computer that image is just a grid of pixels with no discernible meaning.
That is where Optical Character Recognition comes in. With OCR, the image you scan is no longer just a bunch of pixels. OCR software recognizes the individual characters in the image and converts them into useful, editable text.

How does OCR software work?

The challenge of creating a software program that can “read” text is similar to the challenge we all face when trying to read certain people’s handwriting: everyone writes differently. Even with documents made on a computer, every font is slightly different from every other font. OCR software reads text by using two methods.

1. Pattern Recognition

The software evaluates and reads the text by recognizing each character in its entirety. For example, OCR software is programmed to recognize the different variations of “A” found in each font it knows.
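As a rough illustration of whole-character matching, the sketch below compares a scanned glyph against stored bitmaps and picks the one that agrees on the most pixels. The 5x5 templates, the `#`/`.` encoding, and the function names are assumptions made for this example; real OCR engines store many templates per letter (one or more per font) and use more robust comparisons.

```python
# Hypothetical 5x5 templates: '#' is ink, '.' is background.
TEMPLATES = {
    "A": [".###.",
          "#...#",
          "#####",
          "#...#",
          "#...#"],
    "L": ["#....",
          "#....",
          "#....",
          "#....",
          "#####"],
    "T": ["#####",
          "..#..",
          "..#..",
          "..#..",
          "..#.."],
}

def match_score(glyph, template):
    """Count the pixels on which the glyph and the template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )

def recognize(glyph):
    """Return the letter whose template agrees with the glyph on the most pixels."""
    return max(TEMPLATES, key=lambda letter: match_score(glyph, TEMPLATES[letter]))

# A scanned "A" with one stray pixel of noise in the top-left corner.
noisy_a = ["####.",
           "#...#",
           "#####",
           "#...#",
           "#...#"]
print(recognize(noisy_a))  # prints "A"
```

Because the whole bitmap is compared at once, this approach only works when the glyph closely resembles a stored template, which is why an unfamiliar font defeats it.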

2. Feature Detection

OCR software evaluates and identifies the specific features (the lines and strokes) that make up the character. For example, if the software scans a document with an “A” from a font it does not recognize, pattern recognition will not work. Instead, the software can use feature detection. It looks at the lines of the character, determines that they match the lines of a typical “A”, and therefore correctly identifies the character as an “A”.
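The idea can be sketched with two very simple features that do not depend on the font: the number of enclosed loops in the glyph and the number of separate strokes touching the bottom edge. These features are stand-ins chosen for brevity, not the line-and-stroke analysis a real engine performs, and the feature table and function names are hypothetical.

```python
def count_holes(glyph):
    """Count background regions completely enclosed by ink (flood fill)."""
    rows, cols = len(glyph), len(glyph[0])
    seen = set()
    holes = 0
    for r in range(rows):
        for c in range(cols):
            if glyph[r][c] != "." or (r, c) in seen:
                continue
            # Flood-fill this background region and note whether it reaches the edge.
            stack, enclosed = [(r, c)], True
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                if y in (0, rows - 1) or x in (0, cols - 1):
                    enclosed = False
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and glyph[ny][nx] == "." and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            if enclosed:
                holes += 1
    return holes

def bottom_strokes(glyph):
    """Count separate runs of ink on the bottom row (the 'legs' of the letter)."""
    return len(glyph[-1].replace(".", " ").split())

# Hypothetical table: (holes, bottom strokes) -> letter.
FEATURE_TABLE = {(1, 2): "A", (2, 1): "B", (0, 1): "L"}

def recognize(glyph):
    """Look up the glyph's features; return '?' if no letter matches."""
    return FEATURE_TABLE.get((count_holes(glyph), bottom_strokes(glyph)), "?")

block_a = [".###.", "#...#", "#####", "#...#", "#...#"]
pointed_a = ["..#..", ".#.#.", ".#.#.", "#####", "#...#", "#...#"]
print(recognize(block_a), recognize(pointed_a))  # prints "A A"
```

The two glyphs differ in shape and even in height, yet both have one loop and two legs, so both are read as “A” without a pixel-for-pixel template.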

Can OCR software recognize handwriting?

While it is more challenging for OCR to recognize handwriting than computer-designed fonts, some OCR programs are capable of using special feature detection methods to read handwriting. Success will vary, depending on the handwriting of the individual. For example, cursive is extremely difficult for software to read.

If you will be scanning many handwritten forms and want to use OCR to convert the handwriting into a text file, there are things you can do to increase the success rate of the software. Design your forms with comb fields, where people have to write each letter in a separate box. This tends to improve people’s printing. Printing the comb fields in a dropout color (a special color different from the ink used for handwriting) allows the software to easily separate the blue or black ink of someone’s handwriting from the form itself.
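A software-side view of the dropout color can be sketched as follows: every pixel close to the form's dropout color is turned white, so only the handwriting remains. The RGB values, the tolerance, and the function name are assumptions for illustration only.

```python
WHITE = (255, 255, 255)

def drop_color(pixels, dropout, tolerance=60):
    """Turn pixels close to the dropout color white, leaving the handwriting."""
    def is_dropout(pixel):
        # A pixel matches if every RGB channel is within the tolerance.
        return all(abs(a - b) <= tolerance for a, b in zip(pixel, dropout))
    return [[WHITE if is_dropout(p) else p for p in row] for row in pixels]

RED_BOX = (250, 40, 30)    # comb-field outline printed in the dropout color
BLUE_INK = (20, 30, 200)   # handwriting
row = [RED_BOX, BLUE_INK, WHITE, RED_BOX]
print(drop_color([row], dropout=(255, 0, 0)))
```

After the call, the red box pixels are white and the blue ink pixel is unchanged, which is why the dropout color must be clearly different from the ink people write with.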

How to Use OCR Software

1. Start with a High Quality, Readable Document or Image.

Select the best version of the document before scanning it. The more readable the document, the more successful the OCR software will be. Readability requires a clean document with strong contrast between the text and the white of the page. If the document is dirty or faded, try making a photocopy of it with the contrast turned up. The higher-contrast copy may produce a clearer, more readable scan.

2. Scan the Document (or Take a Picture).

You will most likely use a flatbed scanner or a sheet-fed scanner. If you are scanning multiple pages, a scanner with a feeder will be much quicker and more efficient. The OCR software recognizes the text as each page is scanned. Alternatively, you may be able to take still photographs of your documents with a good digital camera and then load them into an OCR program.

3. Conversion to Black and White

Once your document is scanned, the OCR software will convert it to black and white. Anything that is white will be ignored. Anything that is black will be read and turned into text. This is why it is important to have strong contrast on your original document before scanning it. Stains, smudges or marks on the document may be turned to black, and the software will then try to read them as text.
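A minimal sketch of this conversion is a fixed threshold: every grayscale pixel darker than the cutoff becomes ink, and everything else becomes background. The cutoff of 128 and the sample pixel values are assumptions; real software often chooses the threshold automatically for each page.

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel (0 = black, 255 = white) to ink (1) or background (0)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]

page = [
    [250, 248,  20, 251],   # 20 is a dark text pixel
    [249, 110, 252, 250],   # 110 is a mid-gray stain
    [247, 245,  15, 246],   # 15 is a dark text pixel
]
print(binarize(page))                # the stain becomes ink
print(binarize(page, threshold=64))  # a stricter cutoff ignores the stain
```

With the default cutoff the stain is turned to black alongside the text, which is the failure described above. The stricter cutoff removes it, but only because the real text is much darker than the stain, which is exactly what strong contrast buys you.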

4. Optical Character Recognition

After converting the document to black and white, the optical character recognition process begins. The software recognizes patterns and features character by character, word by word, and line by line, converting them into text.
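Before a character can be recognized it has to be isolated. One simple way to do this, sketched below under the assumption that neighboring characters never touch, is to cut a line of text wherever a column of pixels contains no ink. The `#`/`.` encoding and the function name are illustrative only.

```python
def split_characters(line):
    """Split a black-and-white text line into glyphs at columns that contain no ink."""
    width = len(line[0])
    # A column is "inked" if any row has ink in it.
    inked = [any(row[c] == "#" for row in line) for c in range(width)]
    glyphs, start = [], None
    for c, has_ink in enumerate(inked + [False]):  # the sentinel closes the last glyph
        if has_ink and start is None:
            start = c                               # a new glyph begins here
        elif not has_ink and start is not None:
            glyphs.append([row[start:c] for row in line])
            start = None
    return glyphs

line = ["#...###",
        "#....#.",
        "###..#."]
for glyph in split_characters(line):
    print(glyph)
```

Here the blank fourth column separates an “L” from a “T”. The same idea, applied to blank rows instead of blank columns, separates one line of text from the next.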

5. Error Correction

Once OCR is complete, many software programs will allow you to review the text. An automatic spellcheck feature highlights words that appear misspelled, which often points to text that was not recognized correctly. If you do not want to go through that process, you can turn the feature off. Advanced OCR software has additional features to check for errors, such as near-neighbor analysis, which uses the surrounding words to determine which word is most likely to occur in the given context.
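One way to picture context-based correction is shown below. For each word, the sketch collects similar-looking vocabulary words with Python's standard `difflib.get_close_matches` and keeps the one that most often appears next to the following word. The vocabulary, the pair counts, and the 0.7 similarity cutoff are all made up for the example, and this is only one possible reading of near-neighbor analysis, not how any particular product implements it.

```python
import difflib

# Hypothetical vocabulary and counts of how often two words appear side by side.
VOCABULARY = ["the", "museum", "of", "modem", "modern", "art"]
PAIR_COUNTS = {("modern", "art"): 50, ("modem", "art"): 0}

def correct(words):
    """Replace each word with the similar vocabulary word that best fits its right-hand neighbor."""
    fixed = []
    for i, word in enumerate(words):
        # Candidates come back ordered from most to least similar.
        candidates = difflib.get_close_matches(word, VOCABULARY, n=3, cutoff=0.7)
        if not candidates:
            fixed.append(word)  # unknown word: leave it for manual review
            continue
        neighbor = words[i + 1] if i + 1 < len(words) else None
        # On a tie, max() keeps the first (most similar) candidate.
        best = max(candidates, key=lambda cand: PAIR_COUNTS.get((cand, neighbor), 0))
        fixed.append(best)
    return fixed

# "rn" misread as "m" is a classic OCR confusion.
print(correct(["the", "museum", "of", "modem", "art"]))
```

A plain spellcheck would not flag “modem” because it is a valid word. Only the neighboring word “art” reveals that “modern” was the likelier original.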

6. Layout Analysis

Advanced OCR software can automatically analyze complicated page layouts, including tables, images, graphics, etc. The software should be able to split up text into proper columns, depending on the layout, and convert graphics into images.
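A small part of layout analysis, putting text blocks from a multi-column page into reading order, can be sketched as follows. The sketch assumes the blocks have already been located and that each is a hypothetical `(left, top, text)` tuple measured in pixels. The 20-pixel gutter is likewise an assumption.

```python
def reading_order(blocks, gutter=20):
    """Group text blocks into columns by left edge, then read each column top to bottom."""
    columns = []  # each column is a list of blocks with similar left edges
    for block in sorted(blocks):  # tuples sort by left edge first
        # Compare against the left edge of the first block in the current column.
        if columns and block[0] - columns[-1][0][0] < gutter:
            columns[-1].append(block)
        else:
            columns.append([block])
    return [text
            for column in columns
            for _, _, text in sorted(column, key=lambda b: b[1])]

blocks = [
    (310, 40, "col2 line1"),
    (12, 40, "col1 line1"),
    (10, 80, "col1 line2"),
    (312, 80, "col2 line2"),
]
print(reading_order(blocks))
```

Sorting the same blocks purely from top to bottom would interleave the two columns, which is the kind of scrambled output that layout analysis is meant to prevent.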

7. Proofreading

Once everything is complete, always proofread your documents. No matter how good your OCR software is, mistakes can occur, especially if your original scan is of low quality.

