In the field of document image analysis, handwritten word-spotting refers to the problem of detecting specific keywords in handwritten document images. Until now, the most popular framework for handwritten word-spotting has been the so-called query-by-example (QBE). In this framework, an image is provided as the query, and the goal is to retrieve all the occurrences in a word image database (or regions of a document collection) that are close to the query in terms of a specific dissimilarity measure. Many works have successfully applied this idea using dynamic time warping (DTW) as the image dissimilarity. Historical document collections have been a primary field of application for word-spotting, because making such documents accessible and searchable has great practical value for cultural heritage. However, word-spotting for business documents has remained less explored. Corporations and large institutions receive a large volume of correspondence in paper form, and a surprisingly high fraction of it is handwritten. Word-spotting could help detect keywords, such as "urgent" or "cancellation", that provide additional information about the contents of the documents automatically, as opposed to the high amount of human intervention that this task currently requires.
In this thesis, we thus face the challenging problem of detecting keywords in real business documents. This is a difficult problem due to the large number of writing styles and other intricacies inherent to modern documents, such as spontaneous writing, spelling mistakes or crossed-out words. We show that the traditional QBE approach with DTW gives limited performance on this type of data, probably too low for most applications of practical value. Most previous works in handwritten word-spotting have considered small collections of historical documents, typically with one or a few writers, which represents a simplified problem.
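As an illustration of the QBE baseline discussed above, the following is a minimal sketch of DTW-based ranking. It assumes each word image has already been converted into a sequence of fixed-length feature vectors, and it uses the Euclidean distance as local cost and the basic three-way step pattern. The function names `dtw_distance` and `rank_candidates` are hypothetical and chosen only for this example; the sketch omits path-length normalization and the band constraints that are often added in practice.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping cost between two sequences of feature vectors.

    Assumption: Euclidean local cost and the basic (insert, delete, match)
    step pattern, with no path-length normalization.
    """
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


def rank_candidates(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    return sorted(
        range(len(candidates)),
        key=lambda k: dtw_distance(query, candidates[k]),
    )


if __name__ == "__main__":
    query = [[0.0], [1.0]]
    candidates = [
        [[5.0], [5.0]],         # unrelated sequence
        [[0.0], [0.0], [1.0]],  # time-stretched version of the query
    ]
    print(rank_candidates(query, candidates))
```

Note that with several example queries this scheme needs one DTW computation per query and per candidate.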
This thesis makes several contributions to overcome this difficulty and to robustly spot keywords in documents of this type. The basic contribution is to extend the QBE approach. In our application of interest, a large number of incoming documents are processed over long periods of time in order to search for keywords that are known a priori. Therefore, we can afford to collect several queries and combine them into a statistical model. This gives more robust results because statistical models compress information into parameters and learn to generalize. In addition, each candidate image is processed only once, in contrast to the multiple computations that would be required by a QBE approach using multiple queries.
By exploiting this idea, we construct a statistical framework for handwritten word-spotting that obtains very robust performance on challenging data. The framework is composed of word scoring in terms of hidden Markov models (HMMs), a novel score normalization with Gaussian mixture models, and a novel feature set called local gradient histogram (LGH) features.
Of particular importance in this framework is the special case of the semi-continuous HMM (SC-HMM). In this type of HMM, the Gaussian parameters are constrained to a shared Gaussian codebook, referred to as the universal vocabulary. The research carried out in this thesis demonstrates that the universal vocabulary can incorporate a priori information about the problem of interest. A first result shows that the SC-HMM is superior to the traditional continuous HMM (C-HMM) for small training sets, which indicates the resistance of the SC-HMM to overfitting. More surprisingly, the SC-HMM performs better than DTW even with a single training sample. Although training with a single sample is of little use in most statistical models, in this case the information from the single sample is combined with that of the universal vocabulary.
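To make the parameter sharing concrete, the following is a minimal sketch of an SC-HMM state emission score. Every state of every word model uses the same Gaussian codebook, and a state is described only by its mixture weights over that codebook. Diagonal covariances are assumed here for simplicity, and the names `log_gaussian_diag` and `sc_hmm_emission_logprob` are hypothetical; this is not necessarily the exact parameterization used in the thesis.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance (assumed form)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def sc_hmm_emission_logprob(x, codebook, state_weights):
    """Log emission probability of frame x for one SC-HMM state.

    codebook:      list of (mean, var) pairs shared by ALL states and models
                   (the "universal vocabulary").
    state_weights: mixture weights of this state; the only state-specific
                   emission parameters.
    """
    terms = [
        math.log(w) + log_gaussian_diag(x, mean, var)
        for w, (mean, var) in zip(state_weights, codebook)
        if w > 0.0
    ]
    # log-sum-exp for numerical stability
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


if __name__ == "__main__":
    # Toy 1-D codebook with two Gaussians.
    codebook = [([0.0], [1.0]), ([10.0], [1.0])]
    state_a = [0.9, 0.1]  # state mostly explained by the first Gaussian
    state_b = [0.1, 0.9]  # state mostly explained by the second Gaussian
    frame = [0.0]
    print(sc_hmm_emission_logprob(frame, codebook, state_a))
    print(sc_hmm_emission_logprob(frame, codebook, state_b))
```

Because only the weights are estimated per state, far fewer parameters must be learned from the keyword samples than in a C-HMM, which has its own means and variances in every state.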
This very important finding suggests that, by incorporating the appropriate information in the universal vocabulary, different types of problems can be solved. This new perspective is the reason why we refer to the proposed research as a "framework" and not as a "system". In particular, we have employed this framework to solve three problems with novel contributions. The first one is a novel writer adaptation method for word-spotting, in which the universal vocabulary is statistically adapted to derive vocabularies that represent different writer styles. Using this concept, one can derive word models that are tuned to the writing style of each document, with an overall increase in performance. Secondly, we show the importance of the prior information in situations where there is a clear mismatch between training and test samples. We demonstrate that the proposed statistical framework is a non-trivial and effective solution for matching typed words with handwritten words, a task for which one would expect poor results when using traditional techniques. Thirdly, taking advantage of the fact that the SC-HMM works very well even with a single training sample, we define a model-based similarity measure between word images. Here, the feature sequences are converted to SC-HMMs and then a distance is computed in the parameter space. This novel distance outperforms other existing distances, including DTW, precisely because of the prior information incorporated in the SC-HMM.