OCR Explained: How Scanned PDFs Become Searchable

How optical character recognition turns scanned PDFs into searchable, copyable text — the technology, accuracy expectations, and tips for getting the best results.

What OCR actually does

OCR — Optical Character Recognition — converts pictures of text into actual, selectable, searchable text. Modern engines combine convolutional neural networks with language models to reach >99% accuracy on clean scans.

When you OCR a PDF, the visible page doesn't change. What changes is that an invisible text layer is added underneath the image. Your viewer can now copy/paste, search with Ctrl+F, and feed the text to screen readers — but to your eyes the document looks identical.
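One common way to store that layer is ordinary PDF text drawn in text rendering mode 3, which neither fills nor strokes the glyphs. The sketch below builds the content-stream operators for one recognized word. It illustrates the idea and does not describe what any particular tool emits; the helper names, the font resource `/F1`, and the coordinates are assumptions made for the example.

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(word, x, y, font_size=12.0):
    """Return content-stream operators that place `word` invisibly at (x, y).

    /F1 is a hypothetical font resource name; a real page must declare it.
    """
    return (
        "BT "                                # begin a text object
        "3 Tr "                              # rendering mode 3: invisible text
        f"/F1 {font_size:g} Tf "             # select font and size
        f"{x:g} {y:g} Td "                   # move to the word's position
        f"({escape_pdf_string(word)}) Tj "   # emit the recognized word
        "ET"                                 # end the text object
    )


print(invisible_text_ops("Invoice", 72, 700))
```

Because the glyphs are never painted, the scanned image is all you see, while the viewer can still select and search the word at that position.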

A 30-second history

OCR research goes back to the 1920s, but the field really took off in the 2010s with deep learning. Tesseract — the open-source OCR engine PDFWix uses for many languages — went from ~85% accuracy to >99% on clean documents in that decade.

Today's best-in-class cloud services (Google Cloud Vision, Amazon Textract) between them handle tables, handwriting, and, in some cases, 100+ languages. Open-source equivalents like Tesseract 5 are now within a few percentage points on printed text.

How to get the best accuracy

Scan at 300 DPI in greyscale. Below 200 DPI, accuracy drops sharply on small fonts. Above 400 DPI, you're just making the file bigger without measurable accuracy gain.
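If you only have the image file and not the scanner settings, you can estimate the resolution from the pixel width and the physical paper width. This is a small sketch using the thresholds quoted above; the function names and the rating labels are made up for the example.

```python
def effective_dpi(pixels, inches):
    """Dots per inch implied by a pixel dimension and a physical dimension."""
    if inches <= 0:
        raise ValueError("inches must be positive")
    return pixels / inches


def rate_scan_dpi(dpi):
    """Classify a resolution using the 200 and 400 DPI thresholds above."""
    if dpi < 200:
        return "too low"             # small fonts will suffer
    if dpi > 400:
        return "larger than needed"  # bigger file, no real accuracy gain
    return "ok"


# A US Letter page is 8.5 inches wide.
dpi = effective_dpi(2550, 8.5)
print(dpi, rate_scan_dpi(dpi))
```

A Letter page scanned 2550 pixels wide works out to 300 DPI, which lands in the recommended range.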

Pick the correct language. Multi-language OCR is slower and slightly less accurate per language; if you know the document is purely English, don't enable French and German.
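With the Tesseract command-line tool, the language is set with the `-l` option, and several languages are joined with `+`. The sketch below only builds the argument list and does not run anything. It assumes Tesseract 4 or later is installed, and the file names are placeholders.

```python
def tesseract_command(image_path, output_base, languages=("eng",), dpi=300):
    """Build the argument list for a Tesseract run that writes a searchable PDF."""
    if not languages:
        raise ValueError("at least one language code is required")
    return [
        "tesseract",
        image_path,                 # input image (placeholder name in the demo)
        output_base,                # output name without extension
        "-l", "+".join(languages),  # e.g. "eng" or "eng+fra"
        "--dpi", str(dpi),          # tell Tesseract the scan resolution
        "pdf",                      # config that produces <output_base>.pdf
    ]


print(" ".join(tesseract_command("scan.png", "scan_ocr")))
print(" ".join(tesseract_command("scan.png", "scan_ocr", languages=("eng", "fra"))))
```

To actually run it, pass the list to `subprocess.run(cmd, check=True)`. Keeping the default at a single language follows the advice above: add more codes only when the document really contains them.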

OCR for accessibility

A scanned PDF is invisible to a screen reader — it's just a picture. OCR makes the document accessible to blind and low-vision users by giving the screen reader something to read aloud.

For full accessibility (PDF/UA compliance), OCR is necessary but not sufficient — you also need tagged structure (headings, paragraphs, alt text on images). OCR is step one; tagging is step two.

When OCR fails

Handwriting is the hardest case. Even cutting-edge engines hit only 70-90% on neat handwriting and much lower on cursive. For high-stakes handwritten content, manual transcription remains the standard.

Tables with merged cells, multi-column layouts with figures, and equations all degrade accuracy. For these, use 'layout-aware' OCR: Tesseract's page segmentation modes (the --psm option) or a service like Amazon Textract.
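As a rough guide to the --psm option, the sketch below maps a few layout descriptions to page segmentation mode numbers from Tesseract's help output. The layout labels are hypothetical names chosen for this example, and which mode works best for a given page is something to test rather than assume. These modes only change how the page is split into text regions; they do not rebuild table structure on their own.

```python
# Mode numbers as listed by Tesseract's help output.
# The dictionary keys are hypothetical labels invented for this example.
PSM_BY_LAYOUT = {
    "auto": 3,           # fully automatic page segmentation (the default)
    "single_column": 4,  # a single column of text of variable sizes
    "single_block": 6,   # a single uniform block of text
    "sparse": 11,        # sparse text, found in no particular order
}


def psm_args(layout):
    """Return the command-line arguments that select a mode for `layout`."""
    if layout not in PSM_BY_LAYOUT:
        known = ", ".join(sorted(PSM_BY_LAYOUT))
        raise ValueError(f"unknown layout {layout!r}; expected one of: {known}")
    return ["--psm", str(PSM_BY_LAYOUT[layout])]


print(psm_args("single_column"))
```

If automatic segmentation scrambles the reading order of a page, trying one of the other modes on a cropped region is a reasonable first experiment.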

Frequently asked questions

Does OCR change how the PDF looks?

No. The original page image is preserved. OCR only adds an invisible text layer that makes the document searchable and accessible.

What accuracy can I expect?

On clean, 300-DPI printed text in a supported language, expect 99%+ accuracy. On poor scans, expect 80-95%. On handwriting, expect 70-90% for neat, printed-style lettering and considerably less for cursive.
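To see what those percentages mean in practice, it helps to turn them into errors per page. The sketch below treats the figures as character-level accuracy and assumes a page of about 2,000 characters; both are assumptions made for illustration, not measurements.

```python
def expected_errors(characters, accuracy):
    """Approximate number of wrong characters at a given accuracy (0 to 1)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round(characters * (1.0 - accuracy))


PAGE_CHARACTERS = 2000  # assumed size of a dense text page

for label, accuracy in [("clean scan", 0.99), ("poor scan", 0.90), ("handwriting", 0.80)]:
    print(label, expected_errors(PAGE_CHARACTERS, accuracy))
```

Even at 99%, a page this size would still carry roughly 20 wrong characters, which is why proofreading matters for names, amounts, and dates.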

Can OCR be reversed?

Yes. The text layer can be removed without affecting the original page images, which leaves you back where you started. If you simply don't need the text, you can also leave it in place, since it doesn't change how the pages look.