How is document capture software different from OCR?
While Optical Character Recognition (OCR) technology captures all the text in images and files, document capture goes one step further and converts that text into structured data. Examples of structured data in images and documents include key-value pairs (e.g. bank account numbers, customer names in invoices) and tables.
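To make the distinction concrete, here is a minimal sketch contrasting raw OCR output (plain text) with document capture output (key-value pairs). The invoice text, field names, and regex rules are illustrative assumptions; real document capture products use trained extraction models rather than hand-written patterns.

```python
import re

# Hypothetical OCR output: just the text, no structure.
raw_ocr_output = "Invoice #4521\nCustomer: Acme Corp\nTotal: $1,250.00"

def capture_fields(text: str) -> dict:
    """Convert OCR text into key-value pairs (a toy capture step)."""
    fields = {}
    m = re.search(r"Invoice #(\d+)", text)
    if m:
        fields["invoice_number"] = m.group(1)
    m = re.search(r"Customer: (.+)", text)
    if m:
        fields["customer_name"] = m.group(1)
    m = re.search(r"Total: \$([\d,]+\.\d{2})", text)
    if m:
        fields["total"] = m.group(1)
    return fields

print(capture_fields(raw_ocr_output))
# {'invoice_number': '4521', 'customer_name': 'Acme Corp', 'total': '1,250.00'}
```

OCR alone stops at `raw_ocr_output`; the capture step is what turns it into machine-usable fields.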
What is document capture software?
Document capture software specializes in extracting structured data from unstructured documents.
There are three types of data: structured, semi-structured, and unstructured:
- Structured data forms 5-10% of all data. It is in tabular form and machines can process it without errors. Structured data includes most Excel tables, data in SQL databases, and XML or JSON files that follow strict structure requirements.
- Semi-structured data forms 5-10% of all data. It is not in tabular form but still has a structure, though this structure is not explicitly declared and is not followed 100% of the time. Semi-structured data can be processed with low error rates, but achieving zero errors is challenging. Semi-structured data includes invoice slips, most PDF forms, and XML or JSON files that do not follow strict structure requirements.
- Unstructured data forms ~80% of all data. It includes free text and images that do not follow any explicit structure. Extracting structured data from these documents with low error rates is challenging. If unstructured data is found to follow a structure and that structure is identified, it can be recategorized as structured or semi-structured data, based on how strictly the identified structure is followed throughout the document.
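The three categories above can be illustrated with small, hypothetical samples: structured data parses deterministically against a fixed schema, semi-structured data has a recognizable but variable shape, and unstructured data needs an extraction step before it is usable.

```python
import csv
import io
import json

# Structured: tabular, fixed schema; parsed without errors.
structured_csv = "account,balance\n123456,100.00\n654321,250.50\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: a recognizable shape, but fields may vary per document.
semi_structured = json.loads('{"invoice": "4521", "total": "1,250.00"}')

# Unstructured: free text; fields must be extracted by a capture model or rules.
unstructured = "Hi, please pay invoice 4521 (1,250 dollars) by Friday."

print(rows[0]["balance"], semi_structured["total"])
```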
What is the error rate?
The error rate in data extraction can be measured in a few ways, but not every error has the same cost. Imagine making an incorrect payment because your data extractor misread a character with high confidence: that is a costly error. Failing to read a character and flagging it as unreadable is a less costly issue, since a flagged field can be routed to a human for review. Therefore it is important to focus on cases where data extraction tools make extraction errors while claiming a high level of confidence. These should be minimized.
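The distinction between cheap and costly errors can be sketched as a simple scoring rule. This is a minimal illustration, assuming each extracted field carries a confidence score; the function name and the 0.9 threshold are hypothetical choices, not part of any specific product.

```python
def classify_error(predicted: str, actual: str, confidence: float,
                   threshold: float = 0.9) -> str:
    """Classify one field extraction result by its cost.

    Returns "correct", "flagged" (cheap: low confidence, sent for
    human review), or "silent_error" (costly: a wrong value reported
    with high confidence).
    """
    if predicted == actual:
        return "correct"
    if confidence < threshold:
        return "flagged"       # low confidence: route to manual review
    return "silent_error"      # high-confidence mistake: the costly case

results = [
    ("1,250.00", "1,250.00", 0.98),  # correct read
    ("1,25O.00", "1,250.00", 0.55),  # misread, but low confidence
    ("7,250.00", "1,250.00", 0.97),  # confident misread: costly
]
for predicted, actual, conf in results:
    print(predicted, "->", classify_error(predicted, actual, conf))
```

Minimizing the `silent_error` bucket, rather than the raw miss count, is what the paragraph above argues for.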