Training a Tesseract model for East Cree syllabics — looking for advice on fine-tuning workflow
Hey all,
I’m working on an OCR project for East Cree, a Canadian Indigenous language that uses a syllabic writing system. There’s currently no Tesseract model for East Cree, but I’ve been getting decent results using the Inuktitut (iku) trained model as a starting point since the scripts share a lot of the same syllabic characters.
Right now, running the iku model against high-quality scans of East Cree text, I’m seeing roughly 70% character accuracy, which is honestly better than I expected given it’s a different language. The shared Unified Canadian Aboriginal Syllabics Unicode block is doing a lot of the heavy lifting here.
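For anyone curious how I’m getting that number, here’s roughly how I’m measuring it: run the stock iku model over a page, then compute a plain edit-distance character error rate against the hand-corrected transcript. This is just a sketch — the file names are placeholders and the tesseract CLI plus the iku traineddata are assumed to be installed; use whatever evaluation you prefer.

```python
# Sketch: OCR a scan with the stock iku model and score it against a
# hand-corrected transcript. File names are placeholders.
import subprocess
from pathlib import Path

def ocr_with_iku(image_path: Path) -> str:
    """OCR a page image with the Inuktitut model, returning UTF-8 text."""
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout", "-l", "iku"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance over Unicode code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    hypothesis = ocr_with_iku(Path("scan_page_001.png"))
    reference = Path("scan_page_001.gt.txt").read_text(encoding="utf-8")
    cer = levenshtein(hypothesis, reference) / max(len(reference), 1)
    print(f"Character error rate: {cer:.2%} (accuracy ~{1 - cer:.2%})")
```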
The plan:
We have a growing dataset of OCR output from these runs paired with manually corrected ground truth: human-verified, character-by-character corrections. The goal is to use these pairs to fine-tune the iku model into a proper East Cree model via tesstrain.
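For the data side, here’s how I’m planning to lay the pairs out — as I understand it, tesstrain wants line images next to matching *.gt.txt transcripts under a single ground-truth directory. The paths and the model name "crj" below are placeholders, not a finished pipeline, so please correct me if I’ve misread the expected layout.

```python
# Sketch: copy cropped line images and their corrected transcripts into
# the layout tesstrain expects (foo.tif paired with foo.gt.txt).
# Directory names and the "crj" model name are placeholders.
import shutil
from pathlib import Path

CORRECTED = Path("corrected")           # human-verified transcripts, one .txt per line image
LINE_IMAGES = Path("line_images")       # cropped line images from the scans
GT_DIR = Path("data/crj-ground-truth")  # tesstrain's MODEL_NAME-ground-truth directory

GT_DIR.mkdir(parents=True, exist_ok=True)

for txt in sorted(CORRECTED.glob("*.txt")):
    image = LINE_IMAGES / (txt.stem + ".tif")
    if not image.exists():
        continue  # skip lines we haven't cropped yet
    shutil.copy(image, GT_DIR / image.name)
    (GT_DIR / (txt.stem + ".gt.txt")).write_text(
        txt.read_text(encoding="utf-8").strip() + "\n", encoding="utf-8"
    )
```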
Where I’m looking for guidance:
∙ For fine-tuning from an existing .traineddata, is it better to use lstmtraining --continue_from on the iku model directly, or should I be extracting the LSTM component with combine_tessdata -e first and working from there? (A rough sketch of the workflow I’m currently leaning toward is below, after this list.)
∙ What’s a realistic minimum number of ground truth lines/pages before fine-tuning starts to meaningfully improve over the base model? We’re still building out the corrected dataset.
∙ Any tips on handling syllabic-specific issues? Things like finals (superscript characters), ring modifiers, and the long vowel dot — these seem to be where most of the iku model’s errors concentrate.
∙ Is anyone aware of other projects fine-tuning Tesseract for Canadian Syllabics languages? Would love to compare notes.
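For reference, this is the variant I’m currently leaning toward for the first question: extract the LSTM from iku.traineddata with combine_tessdata -e, then point lstmtraining --continue_from at the extracted network. It’s sketched as a Python wrapper around the CLI tools; the output name "crj", the paths, the iteration count, and the list files (which I’m expecting tesstrain to generate from the ground-truth pairs) are all placeholders, and I’d welcome corrections if the --continue_from usage is wrong.

```python
# Sketch of the fine-tuning workflow I'm leaning toward. All paths,
# the "crj" output name, and the iteration count are placeholders.
import subprocess

def run(cmd: list[str]) -> None:
    """Echo and run one training step, failing loudly on error."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Pull the LSTM network out of the stock Inuktitut model.
run(["combine_tessdata", "-e", "tessdata/iku.traineddata", "iku.lstm"])

# 2. Fine-tune from that network on the East Cree line data.
run([
    "lstmtraining",
    "--model_output", "output/crj",
    "--continue_from", "iku.lstm",
    "--traineddata", "tessdata/iku.traineddata",
    "--train_listfile", "data/crj/list.train",
    "--eval_listfile", "data/crj/list.eval",
    "--max_iterations", "10000",
])

# 3. Convert the best checkpoint back into a .traineddata when done.
run([
    "lstmtraining",
    "--stop_training",
    "--continue_from", "output/crj_checkpoint",
    "--traineddata", "tessdata/iku.traineddata",
    "--model_output", "crj.traineddata",
])
```

The alternative would be pointing --continue_from straight at iku.traineddata without the extraction step, which is exactly the part I’m unsure about.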
submitted by /u/ARollingShinigami