Pattern Recognit. | 2021

Holistic word descriptor for lexicon reduction in handwritten Arabic documents

 

Abstract


Most word recognition systems rely on a predefined lexicon to achieve high performance. The recent availability of training and testing data makes it possible to include a very large number of words in the lexicon to be recognized. However, computation cost rises as the lexicon grows. In addition, adding more word classes increases the burden on classification methods and may degrade the recognition rate. In this work, we propose a holistic word descriptor for lexicon reduction in handwritten Arabic documents. The proposed descriptor represents geometrical features of the word shape through three main feature sets, derived from multi-scale convexity/concavity analysis. The first two sets capture the number of convexity/concavity peaks and their intensity levels, respectively. The third set assigns region codes to the peaks by analyzing their regions according to their spatial information. Given a query word and a lexicon (reference dataset), the lexicon reduction system first computes the holistic word descriptor for the query word and for each word in the lexicon. The lexicon is then indexed according to the distance of each entry's descriptor from the query word descriptor. Finally, the reduced lexicon is formed from the first k entries of the indexed lexicon. The proposed system has been evaluated on two well-known Arabic datasets, namely Ibn Sina and IFN/ENIT. Reported results show superior performance compared to prior art, with 93.7% and 91.2% reduction efficacy for Ibn Sina and IFN/ENIT, respectively.
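
The ranking and truncation steps described in the abstract can be illustrated with a minimal sketch. It assumes each word has already been mapped to a fixed-length numeric descriptor. The convexity/concavity feature extraction itself is not reproduced here, and the abstract does not name the distance measure, so Euclidean distance is used purely as a placeholder. The function name `reduce_lexicon`, its parameters, and the toy descriptors are hypothetical and not taken from the paper.

```python
import numpy as np


def reduce_lexicon(query_desc, lexicon_descs, lexicon_words, k):
    """Return the k lexicon words whose descriptors lie closest to the query.

    query_desc    : 1-D sequence, descriptor of the query word
    lexicon_descs : 2-D sequence, one descriptor row per lexicon word
    lexicon_words : labels aligned with the rows of lexicon_descs
    k             : size of the reduced lexicon
    """
    query = np.asarray(query_desc, dtype=float)
    refs = np.asarray(lexicon_descs, dtype=float)

    # Assumption: Euclidean distance stands in for the paper's (unspecified) metric.
    dists = np.linalg.norm(refs - query, axis=1)

    # Index the lexicon by increasing distance; a stable sort keeps ties in lexicon order.
    order = np.argsort(dists, kind="stable")

    # Keep only the first k entries of the indexed lexicon.
    return [lexicon_words[i] for i in order[:k]]


# Toy data: four lexicon entries with made-up 3-D descriptors.
lexicon_words = ["w0", "w1", "w2", "w3"]
lexicon_descs = [[0, 0, 0], [1, 1, 1], [5, 5, 5], [1, 0, 1]]

reduced = reduce_lexicon([1, 1, 1], lexicon_descs, lexicon_words, k=2)
print(reduced)
```

In this toy example the query descriptor coincides with that of `w1`, so `w1` is ranked first and the next-closest entry follows. A downstream recognizer would then only need to discriminate among the k retained word classes instead of the full lexicon.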

Volume 119
Pages 108072
DOI 10.1016/J.PATCOG.2021.108072
Language English
Journal Pattern Recognit.
