FP-THD: Full page transcription of historical documents
arXiv:2601.17040v1 Announce Type: new
Abstract: The transcription of historical documents written in Latin in XV and XVI centuries has special challenges as it must maintain the characters and special symbols that have distinct meanings to ensure that historical texts retain their original style and significance. This work proposes a pipeline for the transcription of historical documents preserving these special features. We propose to extend an existing text line recognition method with a layout analysis model. We analyze historical text images using a layout analysis model to extract text lines, which are then processed by an OCR model to generate a fully digitized page. We showed that our pipeline facilitates the processing of the page and produces an efficient result. We evaluated our approach on multiple datasets and demonstrate that the masked autoencoder effectively processes different types of text, including handwritten, printed and multi-language.