Arabic Script and Open-Source OCR: a Graphic and Linguistic Analysis of Processing Results for Catalog Data Retrieval
In large-scale archival digitisation for document preservation, OCR systems have become essential tools, offering a compromise between good character recognition and low cost. However, they still lack training on historical and religious texts, particularly those written in non-Latin scripts. Large amounts of unstructured data, characterised by scatter and noise, expose the limitations of text-mining techniques on several levels. These limitations compound the contextual variables that hinder OCR systems from achieving optimal character recognition. This is a real problem when such systems are central to applications that feed OCR output into downstream NLP techniques. From this point of view, error analysis belongs to a post-processing phase that can correct the output and improve recognition results, especially when combined with context analysis. Greater attention to post-processing at the level of both glyphs and graphemes could considerably improve OCR effectiveness and significantly advance the current state of the art. The brief study presented here aims to outline common traits of the errors committed by the OCR systems tested in the Digital Maktaba project.
