OCR

Basics

OCR accuracy depends on several factors, the most important being the font and the font size of the source text. The best font size for OCR is 10 to 12 points. Smaller sizes tend to give poor results. Font sizes greater than 72 points are treated as images and should be avoided.

Very large font sizes, such as 24 points and higher, don’t add any reliability; they just take up more space.

Other factors are color, contrast, brightness, and density of content. Dark, bold letters on a light background (or light letters on a dark background) usually yield good results. The text should be laid out cleanly, with good spacing between words and sentences. If the source file contains Asian languages, scanning at 300 dpi is recommended for accurate results.
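To relate font size to scan resolution, here is a minimal sketch of the arithmetic, assuming the usual definition of 1 point = 1/72 inch. The function name `em_height_px` is made up for illustration, and the result is the height of the font's em box, not the height of any particular glyph.

```python
def em_height_px(font_size_pt: float, dpi: int) -> float:
    """Height of the font's em box in pixels (1 pt = 1/72 inch)."""
    return font_size_pt * dpi / 72.0


# Compare how many pixels a line of text gets at common scan resolutions.
for dpi in (150, 200, 300):
    for size_pt in (6, 10, 12):
        px = em_height_px(size_pt, dpi)
        print(f"{size_pt:>2} pt at {dpi} dpi -> {px:5.1f} px")
```

A 10 pt font scanned at 300 dpi gets roughly 42 pixels of height per line, while the same font at 150 dpi gets about 21, which is one way to see why small fonts and low resolutions both hurt recognition.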

Fonts

Tahoma is present on any Windows system. The main reason it works so well for OCR is that its look-alike characters differ enough for the OCR engine to interpret each one correctly, even without any context.

The popular Arial font, by contrast, shows almost no difference between l (lowercase L) and I (uppercase i), which makes it a poor candidate for OCR.

OCR-A and OCR-B

OCR-A uses simple, thick strokes to form recognizable characters; for example, 0 and o are easy to tell apart. Although advances in OCR algorithms have made such simple fonts unnecessary, OCR-A remains widely used in the encoding of cheques around the world. Some lockbox companies still insist that the account number and amount owed on a bill return form be printed in OCR-A.[13] Because of its unusual look, it is also sometimes used in advertising and display graphics.

OCR-B is an upgraded version of OCR-A. It is easier for the human eye and brain to read, and it has a less technical look.

Concepts

MICR

MICR (Magnetic Ink Character Recognition) is used mainly by the banking industry to ease the processing and clearance of cheques and other documents.

Unlike barcodes and similar technologies, MICR characters can be read easily by humans.

There are two major MICR fonts in use: E-13B and CMC-7. The MICR E-13B font has been adopted as the international standard in ISO 1004:1995, but the CMC-7 font is widely used in Europe, Brazil and Mexico.

CMC-7 consists of 10 digits and 5 more control characters: internal, terminator, amount, routing, and an unused character.

These have been added to Unicode.
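As a quick way to inspect what is encoded, the sketch below lists the four code points U+2446 to U+2449 from the "Optical Character Recognition" block using Python's standard `unicodedata` module. It assumes these are the MICR symbols the note has in mind; they are usually described as the E-13B symbols, so whether the CMC-7 control characters are covered is not confirmed here. The variable name `micr_symbols` is only for illustration.

```python
import unicodedata

# Assumption: U+2446..U+2449 are the MICR-related symbols meant by the note.
micr_symbols = []
for cp in range(0x2446, 0x244A):
    ch = chr(cp)
    # Fall back to a placeholder if the local Unicode database lacks a name.
    name = unicodedata.name(ch, "<unnamed>")
    micr_symbols.append((cp, ch, name))

for cp, ch, name in micr_symbols:
    print(f"U+{cp:04X}  {ch}  {name}")
```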

The ink used in the printing is a magnetic ink or toner, usually containing iron oxide. The ink in the plane of the paper is first magnetized. Then the characters are passed over a MICR read head, a device similar to the playback head of a tape recorder.

The error rate for the magnetic scanning of a typical cheque is lower than with optical character recognition systems.

Others

MOR: Multi-Lingual Omnifont Recognition.

accusoft.com: OmniMOR.

FAQ

Which type is preferred: color or grayscale?

ABBYY technologies use color information to detect areas and objects in the image. So, if complex layouts have to be processed, it is recommended to use color or, at least, grayscale images.

Character recognition is always performed on a bitonal image that contains only black and white.

In most cases, scanning in grayscale mode gives the best trade-off between recognition quality and processing time.

wcfNote: so in the character recognition phase the input is always bitonal. In other words, the conventional approach is to extract the structure from the gray or color input image, and then to focus on structure/topology rather than the intensity pattern during character recognition. In the final character classification stage only the topological structure matters; the grayscale information is fully used up during binarization, where it is converted into structure.
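To make the binarization step in the note concrete, here is a minimal pure-Python sketch that turns a grayscale image into a bitonal one. It uses Otsu's global threshold, which is a common textbook choice and only an assumption here: the source does not say which method ABBYY uses. The function names and the tiny sample image are made up for illustration.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0..255) that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels with value <= t
    sum_dark = 0  # sum of those pixel values
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue              # no pixels in the dark class yet
        w_light = total - w_dark
        if w_light == 0:
            break                 # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map rows of 0..255 grayscale values to 0 (ink) or 1 (paper)."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p > t else 0 for p in row] for row in image]


# Hypothetical 3x5 scan: a dark plus-shaped stroke on light paper.
gray = [
    [210, 205,  40, 200, 215],
    [208,  35,  30,  38, 212],
    [211, 207,  42, 203, 209],
]
bitonal = binarize(gray)
for row in bitonal:
    print("".join("#" if v == 0 else "." for v in row))
```

After this step the varying intensities of ink and paper are gone and only the shape of the stroke remains, which is the "structure" the note refers to.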

Why 300 DPI is a gold standard

ScanSnapCommunity.com: why is OCR at 300 DPI a standard.

Most leading OCR and Automated Forms Processing software companies recommend scanning at a minimum resolution of 300 dots per inch for effective data extraction.

Some people think you can scan at a lower dpi, such as 200 dpi, and then use scanner software to increase the resolution through interpolation. However, interpolation doesn’t actually provide a meaningful benefit for OCR. Your image always loses some clarity and quality. You’re better off just scanning your document at 300 dpi to begin with.
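The point about interpolation can be illustrated on a single scanline. The sketch below is a simplified model, not a description of any scanner software: it halves the resolution by averaging neighbouring pixels (a factor of 2 is used instead of the 300-to-200 ratio to keep the arithmetic simple), then stretches the result back with linear interpolation. All function names are hypothetical.

```python
def downsample_by_2(row):
    """Simulate a half-resolution scan: each sample averages two pixels."""
    return [(row[i] + row[i + 1]) / 2 for i in range(0, len(row) - 1, 2)]


def upsample_linear(row, new_len):
    """Stretch a row to new_len samples using linear interpolation."""
    scale = (len(row) - 1) / (new_len - 1)
    out = []
    for i in range(new_len):
        x = i * scale                       # position in the low-res row
        left = int(x)
        right = min(left + 1, len(row) - 1)
        frac = x - left
        out.append(row[left] * (1 - frac) + row[right] * frac)
    return out


# One-pixel-wide black and white stripes, e.g. very fine strokes of small text.
fine = [0, 255] * 8
low = downsample_by_2(fine)
restored = upsample_linear(low, len(fine))

contrast_before = max(fine) - min(fine)
contrast_after = max(restored) - min(restored)
print("contrast before:", contrast_before)
print("contrast after: ", round(contrast_after, 6))
```

In this model the stripes collapse into a uniform gray at the lower resolution, and interpolating back to the original length only produces more gray samples: the fine detail was never captured, so it cannot be recovered.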

So to answer this question quite simply, higher resolution scanning equates to improved automated OCR accuracy.