Document Management System with OCR
This article explains by example what Optical Character Recognition (OCR) technology is, what applications it has, and how it empowers document management systems such as Papermerge.
OCR is an abbreviation of optical character recognition. OCR is the process of extracting plain text (and associated information) from an image, such as a photo or a scanned page. Here is an example from daily life: John uses his mobile phone to take a photo of a paper bank statement, which is basically a document. Let's say an IBAN appears on that document. From the resulting photo, bank-statement.jpeg, John cannot copy the IBAN and paste it into an online web form, because the photo contains only pixels, not text. On the other hand, if the same photo is processed with OCR, the text is extracted from it (for example, into a file named bank-statement.txt). John can then open bank-statement.txt, select the IBAN, and copy and paste it into the web form.
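Once the text has been extracted, John does not even need to select the IBAN by hand. The following is a minimal sketch, not part of Papermerge, that assumes the OCR step has already produced the text below; the sample statement and the names OCR_TEXT, IBAN_PATTERN, is_valid_iban and find_ibans are made up for illustration. It finds IBAN-like strings with a regular expression and keeps only those that pass the standard mod-97 checksum, which helps reject candidates where OCR misread a character.

```python
import re

# Hypothetical OCR output, standing in for the contents of bank-statement.txt.
OCR_TEXT = """\
Example Bank - Account Statement
Account holder: John Doe
IBAN: GB82 WEST 1234 5698 7654 32
Balance: 1,024.00
"""

# Country code, two check digits, then groups of four characters
# (optionally separated by spaces) and a shorter trailing group.
IBAN_PATTERN = re.compile(
    r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"
)


def is_valid_iban(candidate):
    """Check the mod-97 checksum that every IBAN must satisfy."""
    compact = candidate.replace(" ", "")
    # Move the first four characters to the end, then map A..Z to 10..35.
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def find_ibans(text):
    """Return checksum-valid IBANs found in the text, with spaces removed."""
    return [
        match.group(0).replace(" ", "")
        for match in IBAN_PATTERN.finditer(text)
        if is_valid_iban(match.group(0))
    ]


if __name__ == "__main__":
    print(find_ibans(OCR_TEXT))
```

The checksum does not prove that the account exists; it only catches most single-character recognition mistakes.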
OCR technology is widely used across many areas. It enables computers to read the text contained in pictures. Once a computer knows what text an image contains, users can search for specific terms across their photos. A scanned document is just a photo of the document, usually of higher quality than one taken with a mobile phone. Informally speaking, scanners are specialized devices for taking photos of documents.
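Searching for terms across photos can be illustrated with a small inverted index. This is a sketch under stated assumptions: the file names and their extracted text in EXTRACTED are invented, and the helpers tokenize, build_index and search are hypothetical names, not the API of any OCR tool. The index maps each word to the set of images whose OCR text contains it, and a query returns the images that contain every query word.

```python
import re
from collections import defaultdict

# Hypothetical OCR results: image file name -> text extracted from it.
EXTRACTED = {
    "bank-statement.jpeg": "Account statement IBAN GB82 WEST 1234 5698 7654 32",
    "invoice-scan.png": "Invoice for software license, total due 120 EUR",
    "license.tiff": "Software license number 1234 5678 9012 3456",
}


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(texts):
    """Map each term to the set of files whose text contains it."""
    index = defaultdict(set)
    for name, text in texts.items():
        for term in tokenize(text):
            index[term].add(name)
    return index


def search(index, query):
    """Return the files that contain every term of the query, sorted by name."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)


if __name__ == "__main__":
    index = build_index(EXTRACTED)
    print(search(index, "license"))
    print(search(index, "software license number"))
```

Real search engines add ranking, stemming and tolerance for OCR errors on top of this basic idea.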
Papermerge is a document management system that takes full advantage of OCR technology. Papermerge extracts text from scanned documents, whether they are PDF files or JPEG, PNG, or TIFF images. The extracted text is used first of all to index documents, which enables users to quickly find any archive or document. The extracted text is also mapped as an additional layer over the document pages, which enables users to copy text directly from the document. This is a very practical feature: imagine that your document contains a 16-digit license number and you need to copy that long number into another application. Typing the number character by character would be slow and inefficient, whereas copying the text is a snappy operation.
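The idea of a text layer over the page can be sketched as follows. This is not Papermerge's actual implementation; it only assumes that the OCR step reports, for each recognized word, its position on the page in pixels. The Word class, the sample coordinates in PAGE_WORDS and the function names are hypothetical. Given the rectangle a user drags over the page image, the sketch returns the words whose boxes overlap it, in reading order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One recognized word and its bounding box on the page, in pixels."""
    text: str
    x: int
    y: int
    width: int
    height: int


# Hypothetical OCR output for a single scanned page.
PAGE_WORDS = [
    Word("License", 40, 100, 90, 20),
    Word("number:", 140, 100, 95, 20),
    Word("1234", 250, 100, 50, 20),
    Word("5678", 310, 100, 50, 20),
    Word("9012", 370, 100, 50, 20),
    Word("3456", 430, 100, 50, 20),
    Word("Valid", 40, 140, 60, 20),
    Word("until", 110, 140, 60, 20),
    Word("2030", 180, 140, 50, 20),
]


def overlaps(word, left, top, right, bottom):
    """True if the word's box intersects the selection rectangle."""
    return (
        word.x < right
        and word.x + word.width > left
        and word.y < bottom
        and word.y + word.height > top
    )


def text_in_selection(words, left, top, right, bottom):
    """Join the selected words top to bottom, then left to right."""
    selected = [w for w in words if overlaps(w, left, top, right, bottom)]
    selected.sort(key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in selected)


if __name__ == "__main__":
    # The user drags a rectangle around the license number on the first line.
    print(text_in_selection(PAGE_WORDS, 245, 95, 485, 125))
```

Because the words carry their coordinates, the same data can drive both selection on the page and highlighting of search hits.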