Section: Documents and discussions
Publication date: 2025-12-22

Arabic Script and Open-Source OCR: a Graphic and Linguistic Analysis of Processing Results for Catalog Data Retrieval

Abstract

In digitisation projects for document preservation in large archives, OCR systems have become essential tools because they offer a compromise between good character recognition and low cost. However, they still lack training on historical and religious texts, particularly those in non-Latin alphabets. Large amounts of unstructured, scattered, and noisy data expose the limitations of text mining techniques on several levels. These limitations combine with contextual variables to prevent OCR systems from achieving optimal character recognition. This is a real problem when such systems are central to applications that rely on the downstream use of other NLP techniques. From this point of view, error analysis is part of a post-processing phase that can correct the output and improve the recognition result, especially when combined with context analysis. Greater attention to post-processing at the level of both glyphs and graphemes could considerably improve OCR effectiveness and significantly advance the current state of the art. The brief study presented here outlines the common traits of the errors made by the open-source OCR systems tested in the Digital Maktaba project.

Authors

Riccardo Amerigo Vigliermo - Università di Modena e Reggio Emilia – FSCIRE https://orcid.org/0000-0001-9914-3295

How to Cite

Vigliermo, R. A. (2025). Arabic Script and Open-Source OCR: a Graphic and Linguistic Analysis of Processing Results for Catalog Data Retrieval. DigItalia, 20(2), 179–202. https://doi.org/10.36181/digitalia-00150
