Tinkering with Tesseract

I have recently been experimenting with Tesseract, an open-source Optical Character Recognition (OCR) engine originally developed at Hewlett-Packard and later sponsored by Google. My primary objective was to extract text from scans of a 1920s Armenian newspaper and run search queries on it. For instance, terms like պատերազմ (war) or Ֆրանսիա (France) are likely to appear in the document. Some initial observations on the document: Image segmentation: the raw document contains many distinct text blocks, and distinguishing between them might be challenging....

October 13, 2023 · 4 min · 782 words · v4nn4