New Approaches to OCR for Early Printed Books
DOI:
https://doi.org/10.36181/digitalia-00015
Keywords:
History of the Book, Font Group Recognition, OCR, Document Analysis, Neural Network, Early Printed Books
Abstract
Books printed before 1800 present major problems for OCR. One of the main obstacles is the lack of diversity of historical fonts in training data. The OCR-D project, which brings together book historians and computer scientists, aims to address this deficiency by focusing on three major issues. Our first target was to create a tool that automatically identifies font groups in images of historical documents. We concentrated on Gothic font groups that were commonly used in German texts printed in the 15th and 16th centuries: the well-known Fraktur and the lesser-known Bastarda, Rotunda, Textura and Schwabacher. The tool was trained on 35,000 images and reaches an accuracy of 98%. It distinguishes not only the above-mentioned font groups but also Hebrew, Greek, Antiqua and Italic, and it can identify woodcut images and irrelevant data (book covers, empty pages, etc.). In a second step, we created an online training infrastructure (okralact), which allows for the use of various open source OCR engines such as Tesseract, OCRopus, Kraken and Calamari. It also facilitates the training of models for specific font groups. The high accuracy of the recognition tool opens up the unprecedented possibility of differentiating between the fonts used by individual printers. With more training data and further adjustments, the tool could help to fill a major gap in historical research.
License
Copyright (c) 2020 DigItalia

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
The Authors publishing their contributions on this journal agree to the following conditions:
- The Authors retain intellectual property rights of their work and transfer the right of first publication of the work to the journal, under the following Licence: Attribution-ShareAlike 3.0 Italy (CC BY-SA 3.0 IT). This Licence allows third parties to share the work by attributing it to the Authors and clarifying that the work has been first published on this journal.
- Authors can sign other, non-exclusive licence agreements for the dissemination of the published work (e.g. to deposit it in an institutional archive or publish it in a monograph), provided that they state that the work has been first published on this journal.
- Authors can disseminate their work online (e.g. in institutional repositories or on their personal websites) after its publication, to enhance knowledge sharing, foster productive intellectual exchange and increase citations (see The Effect of Open Access).
