Tuesday, October 7, 2008

All you want to know about OCR

A Matlab Project in Optical Character Recognition (OCR)

Vijay Dhaka


Introduction: What is OCR?

The goal of Optical Character Recognition (OCR) is to classify optical patterns (often contained in a digital image) corresponding to alphanumeric or other characters. The process of OCR involves several steps including segmentation, feature extraction, and classification. Each of these steps is a field unto itself, and is described briefly here in the context of a Matlab implementation of OCR.

One example of OCR is shown below. A portion of a scanned image of text, borrowed from the web, is shown along with the corresponding (human recognized) characters from that text.

of descriptive bibliographies of authors and presses. His ubiquity in the broad field of bibliographical and textual study, his seemingly complete possession of it, distinguished him from his illustrious predecessors and made him the personification of bibliographical scholarship in his time.


A few examples of OCR applications are listed here. The most common use of OCR is the first item; people often wish to convert text documents to some sort of digital representation.

1. Scanning a document so that its text is available in a word processor
2. Recognizing license plate numbers
3. Recognizing ZIP codes on mail at the post office

Other Examples of Pattern Recognition:

Facial feature recognition (airport security) – Is this person a bad-guy?
Speech recognition – Translate acoustic waveforms into text.
Underwater sound classification – A submarine wishes to classify what it hears: a whale? A Russian sub? A friendly ship?

The Classification Process:

The following applies to any type of classifier. There are two steps in building a classifier: training and testing. These steps can be broken down further into sub-steps.

Training

Pre-processing – Processes the data so it is in a suitable form for…
Feature extraction – Reduces the amount of data by extracting the relevant information, usually producing a vector of scalar values. (The features must also be normalized before distances are measured.)
Model estimation – From the finite set of feature vectors, estimate a model (usually statistical) for each class of the training data.

Testing

Pre-processing
Feature extraction – (both same as above)
Classification – Compare feature vectors to the various models and find the closest match. One can use a distance measure.
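The remark about normalizing features matters because the features live on very different scales (an orientation in degrees next to a skewness near zero), so an unnormalized distance would be dominated by the larger one. Below is a minimal sketch in Python rather than the project's Matlab. It assumes z-score normalization, which the write-up does not specify, and the function names and sample values are made up for illustration.

```python
import math

def fit_normalizer(train_vectors):
    """Estimate per-feature mean and standard deviation from training vectors."""
    n = len(train_vectors)
    dims = len(train_vectors[0])
    means = [sum(v[d] for v in train_vectors) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((v[d] - means[d]) ** 2 for v in train_vectors) / n
        # A constant feature has zero spread; fall back to 1.0 to avoid
        # dividing by zero.
        stds.append(math.sqrt(var) or 1.0)
    return means, stds

def normalize(vector, means, stds):
    """Rescale one feature vector with the training-set statistics."""
    return [(x - m) / s for x, m, s in zip(vector, means, stds)]

# Hypothetical (orientation in degrees, skewness) pairs.
train = [[90.0, 0.03], [88.0, 0.05], [10.0, -0.40], [12.0, -0.44]]
means, stds = fit_normalizer(train)
normalized = [normalize(v, means, stds) for v in train]
```

The statistics are estimated once from the training data and then reused unchanged on the test data, so that both stages measure distances in the same units.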

Figure: block diagram of the two stages.
1. Training: Training Data -> Pre-processing -> Feature Extraction -> Model Estimation
2. Recognition (Testing): Test Data -> Pre-processing -> Feature Extraction -> Classification

OCR – Pre-processing

These are the pre-processing steps often performed in OCR:
Binarization – The input is usually a grayscale image, so binarization is simply a matter of choosing a threshold value.
Morphological operators – Remove isolated specks and fill holes in characters; the majority operator can be used for this.
Segmentation – Check the connectivity of shapes, then label and isolate them. Matlab 6.1’s bwlabel and regionprops functions can be used. Characters that are not connected cause difficulties, e.g. the letter i, a semicolon, or a colon (; or :).
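The first two steps can be sketched as follows, in Python rather than the project's Matlab. The sketch assumes dark text on a light page, a fixed threshold of 128, and a majority rule that sets a pixel to 1 when at least five pixels of its 3x3 neighborhood are 1. The function names and the test pattern are made up for illustration.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 1 = ink, 0 = background.

    Assumes dark text on a light page, so pixels below the threshold are ink.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def majority(binary):
    """Set each pixel to 1 if at least 5 pixels in its 3x3 neighborhood are 1.

    Pixels outside the image are counted as background.
    """
    h, w = len(binary), len(binary[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        total += binary[rr][cc]
            out[r][c] = 1 if total >= 5 else 0
    return out

# A 3x3 blob with a one-pixel hole, plus an isolated speck.
pattern = [
    ".......",
    ".###...",
    ".#.#...",
    ".###...",
    ".......",
    ".....#.",
]
gray = [[30 if ch == "#" else 220 for ch in row] for row in pattern]
binary = binarize(gray)
cleaned = majority(binary)
```

On this pattern the hole in the blob is filled and the speck disappears. The corners of the blob are rounded off as well, which is the price of such a simple rule on very small shapes.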
Segmentation is by far the most important aspect of the pre-processing stage. It allows the recognizer to extract features from each individual character. In the more complicated case of handwritten text, the segmentation problem becomes much more difficult as letters tend to be connected to each other.
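To make the segmentation step concrete, here is a minimal connected-component labeling sketch in Python, not the project's Matlab code. It assumes 8-connectivity and a breadth-first flood fill; the function names and the test pattern are made up for illustration.

```python
from collections import deque

def label_components(binary):
    """Label 8-connected groups of 1-pixels; returns (label image, count)."""
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if binary[r][c] != 1 or labels[r][c] != 0:
                continue
            # Found an unlabeled ink pixel: start a new component here.
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < h and 0 <= nc < w
                                and binary[nr][nc] == 1
                                and labels[nr][nc] == 0):
                            labels[nr][nc] = count
                            queue.append((nr, nc))
    return labels, count

def bounding_boxes(labels):
    """Return {label: (top, left, bottom, right)} with inclusive bounds."""
    boxes = {}
    for r, row in enumerate(labels):
        for c, lab in enumerate(row):
            if lab == 0:
                continue
            if lab not in boxes:
                boxes[lab] = (r, c, r, c)
            else:
                top, left, bottom, right = boxes[lab]
                boxes[lab] = (min(top, r), min(left, c),
                              max(bottom, r), max(right, c))
    return boxes

# A crude letter "i" next to a vertical bar.
pattern = [
    ".#...#",
    ".....#",
    ".#...#",
    ".#...#",
    ".#...#",
]
binary = [[1 if ch == "#" else 0 for ch in row] for row in pattern]
labels, count = label_components(binary)
boxes = bounding_boxes(labels)
```

The pattern yields three components rather than two, because the dot of the "i" is not connected to its stem. This is exactly the difficulty with disconnected characters noted above, and a real system needs an extra rule to merge such pieces.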

OCR – Feature extraction (see reference [2])

Given a segmented (isolated) character, what are useful features for recognition?
1. Moment-based features
   Think of each character as a pdf. The 2-D moments of the character are (in the standard definition, with f(x, y) the pixel value at column x and row y):
   $m_{pq} = \sum_x \sum_y x^p y^q f(x, y)$
   From the moments we can compute features like:
   Total mass (number of pixels in a binarized character)
   Centroid (center of mass)
   Elliptical parameters
      i. Eccentricity (ratio of major to minor axis)
      ii. Orientation (angle of major axis)
   Skewness
   Kurtosis
   Higher-order moments
2. Hough and chain code transforms
3. Fourier transform and series
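A few of the moment-based features can be computed as in the sketch below, written in Python rather than the project's Matlab. It assumes the standard raw moments (the sum of x^p y^q over the ink pixels of a binarized character) and derives the ellipse parameters from the second-order central moments. The function name is made up, and axis_ratio follows the definition of eccentricity used in the list above (major over minor axis), which is not the same quantity as the Eccentricity property of Matlab's regionprops.

```python
import math

def moment_features(binary):
    """Compute mass, centroid, orientation and axis ratio of the 1-pixels."""
    # (x, y) = (column, row) of every ink pixel.
    pixels = [(c, r) for r, row in enumerate(binary)
              for c, v in enumerate(row) if v == 1]
    mass = len(pixels)                      # m00
    if mass == 0:
        raise ValueError("character has no ink pixels")
    cx = sum(x for x, _ in pixels) / mass   # m10 / m00
    cy = sum(y for _, y in pixels) / mass   # m01 / m00
    # Second-order central moments, normalized by the mass.
    mu20 = sum((x - cx) ** 2 for x, _ in pixels) / mass
    mu02 = sum((y - cy) ** 2 for _, y in pixels) / mass
    mu11 = sum((x - cx) * (y - cy) for x, y in pixels) / mass
    # Angle of the major axis in image coordinates (y grows downward).
    orientation = math.degrees(0.5 * math.atan2(2 * mu11, mu20 - mu02))
    # The eigenvalues of the 2x2 covariance matrix are proportional to
    # the squared lengths of the major and minor axes.
    spread = math.sqrt(4 * mu11 ** 2 + (mu20 - mu02) ** 2)
    major = (mu20 + mu02 + spread) / 2
    minor = (mu20 + mu02 - spread) / 2
    axis_ratio = math.sqrt(major / minor) if minor > 0 else float("inf")
    return {"mass": mass, "centroid": (cx, cy),
            "orientation": orientation, "axis_ratio": axis_ratio}

# A bar two pixels wide and six pixels tall.
bar = [[1, 1] for _ in range(6)]
features = moment_features(bar)
```

For the bar, the mass is 12 and the centroid is (0.5, 2.5); the major axis is vertical, and the axis ratio is well above 1 because the shape is elongated.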

OCR – Model Estimation (see reference [1])

Given labeled sets of features for many characters, where each label gives the class that the character belongs to, we wish to estimate a statistical model for each character class. For example, suppose we compute two features for each realization of the characters 0 through 9. Plotting each character class as a function of the two features we have:

Figure 3: Character classes plotted as a function of two features

Each character class tends to cluster together. This makes sense; a given number should look about the same for each realization (provided we use the same font type and size). We might try to estimate a pdf (or pdf parameters such as mean and variance) for each character class. For example, in Figure 3, we can see that the 7’s have a mean Orientation of 90 and HPSkewness of 0.033.
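Estimating the mean and variance of each class can be sketched as below, in Python rather than the project's Matlab. The function name and the feature values are made up for illustration; they are not the data behind Figure 3.

```python
from collections import defaultdict

def estimate_class_models(features, labels):
    """Return {label: (mean vector, variance vector)} for labeled features."""
    grouped = defaultdict(list)
    for vec, lab in zip(features, labels):
        grouped[lab].append(vec)
    models = {}
    for lab, vecs in grouped.items():
        n = len(vecs)
        dims = len(vecs[0])
        mean = [sum(v[d] for v in vecs) / n for d in range(dims)]
        # Population variance of each feature within the class.
        var = [sum((v[d] - mean[d]) ** 2 for v in vecs) / n
               for d in range(dims)]
        models[lab] = (mean, var)
    return models

# Hypothetical (Orientation, HPSkewness) pairs for two character classes.
features = [[90.0, 0.03], [91.0, 0.04], [89.0, 0.02],
            [0.0, -0.5], [2.0, -0.3]]
labels = ["7", "7", "7", "1", "1"]
models = estimate_class_models(features, labels)
```

Each class is summarized by one mean vector and one variance vector, which is the simplest model consistent with the clustering described above. A full covariance matrix per class would be the next refinement.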
OCR – Classification (see reference [1])

According to Tou and Gonzalez, “The principal function of a pattern recognition system is to yield decisions concerning the class membership of the patterns with which it is confronted.” In the context of an OCR system, the recognizer is confronted with a sequence of feature patterns from which it must determine the character classes.

A rigorous treatment of pattern classification is beyond the scope of this paper. We’ll simply note that if we model the character classes by their estimated means, we can use a distance measure for classification. A test character is assigned to the class whose mean lies at the minimum distance.
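A minimum-distance classifier over class means can be sketched as follows, in Python rather than the project's Matlab. It assumes Euclidean distance on features that have already been normalized; the function name and the class means are made up for illustration.

```python
import math

def classify(vector, class_means):
    """Return (label, distance) of the class mean nearest to the vector."""
    best_label, best_dist = None, math.inf
    for label, mean in class_means.items():
        # Euclidean distance between the test vector and this class mean.
        dist = math.sqrt(sum((x - m) ** 2 for x, m in zip(vector, mean)))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist

# Hypothetical class means in a normalized two-feature space.
class_means = {"7": [1.2, 0.8], "1": [-1.0, -0.9], "0": [0.1, 1.5]}
label, dist = classify([1.0, 1.0], class_means)
```

Here the test vector is assigned to class "7", whose mean is the closest of the three. Returning the distance as well makes it possible to reject characters that are far from every class.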

The Matlab Implementation:

The Character Classifier Graphical User Interface (GUI)
A Matlab GUI was written to encapsulate the steps involved with training an OCR system. This GUI permits the user to load images, binarize and segment them, compute and plot features, and save these features for future analysis. The file is called train.m, and is available at:
http://www.uri.edu/~hansenj/projects/ele585/OCR/

Figure 4: The Character Classifier Graphical User Interface

Loading an Image

Images can be imported into the GUI by clicking on the Image menu and selecting Open. Both TIF and JPG file formats are supported. Most of the testing was done with grayscale TIF images (with no LZW compression).

Binarize and Segment

After opening an image, it can be converted to black and white and segmented by clicking the button in the upper right corner of the window (see Figure 4). This button will also extract the various features.

Labeling the Characters

Once the training image is segmented, a character will appear below the text box titled Class Label. It’s the user’s job to label each segmented character appropriately. Once a character label has been entered into the text box, click “>>” to move to the next character. One can navigate back by clicking “<<”.

Figure 5: Labeling the characters.

Saving and Loading the Features, Labels, etc.

Segmented images, character features, and labels can be saved by clicking on the Data menu and selecting Save. The characters need not be labeled for data saving to occur. Load image data (features, etc.) by clicking on the Data menu and selecting Load. See Figure 4.

Plotting Class/Features Information

Figure 6: Select features to plot

All the characters must be labeled before class/feature information can be plotted. If the characters are labeled, select two of the features by checking the appropriate boxes. Next, click on the unlabeled button to plot the character classes as a function of the features. If more than two boxes are checked, only the first two selected features will be used.

References

[1] J.T. Tou and R.C. Gonzalez, Pattern Recognition Principles, Addison-Wesley Publishing Company, Inc., Reading, Massachusetts, 1974.
[2] M. Szmurlo, Master’s Thesis, Oslo, May 1995 (users.info.unicaen.fr/~szmurlo/papers/masters/master.thesis.ps.gz).
