Arabic Script and Open-Source OCR: a Graphic and Linguistic Analysis of Processing Results for Catalog Data Retrieval
In large-scale archival digitisation for document preservation, OCR systems have become essential tools, offering a compromise between good character recognition and low cost. However, they still lack training on historical and religious texts, particularly those written in non-Latin scripts. Large amounts of unstructured data, characterised by scatter and noise, expose the limitations of text-mining techniques on several levels. These limitations compound the contextual variables that hinder OCR systems from achieving optimal character recognition. This is a real problem when such systems are central to applications that feed OCR output into downstream NLP techniques. From this point of view, error analysis belongs to a post-processing phase that can correct the output and improve recognition results, especially when combined with context analysis. Greater attention to post-processing at the level of both glyphs and graphemes could considerably improve OCR effectiveness and significantly advance the current state of the art. The brief study presented here aims to outline common traits of the errors committed by the OCR systems tested in the Digital Maktaba project.
