Planning an Architecture for OCR and Text Extraction

Date: 7/31/2014
Applies to Version: v4.7 and v4.8

OCR and text extraction can be resource intensive and need an appropriate system architecture to achieve the throughput the business requires. This article provides information to assist with planning.

Document Analysis

The first step is to analyze the number and type of documents by gathering the following information:

  • Number of documents
  • Average number of pages per document
  • Image Color (Black & White, Grayscale, Full Color)
  • Type of input file (TIFF, PDF …)
  • Typical resolution (200dpi, 300dpi…)
  • Typical page size (US Letter, A4, A3…)
  • Typical text density
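
The inventory above can be recorded in a small structure so that the total page count, which drives the resource calculation, falls out directly. The following is a minimal sketch only: the class name, field names, and sample values are hypothetical and are not part of DocuNECT.

```python
from dataclasses import dataclass


@dataclass
class DocumentProfile:
    """Hypothetical record of the document analysis inputs."""
    document_count: int
    avg_pages_per_document: float
    color_mode: str        # e.g. "Black & White", "Grayscale", "Full Color"
    input_format: str      # e.g. "TIFF", "PDF"
    resolution_dpi: int    # e.g. 200, 300
    page_size: str         # e.g. "US Letter", "A4", "A3"
    text_density: str      # free-form label; the scale is an assumption

    def total_pages(self) -> int:
        # Total pages = number of documents * average pages per document.
        return round(self.document_count * self.avg_pages_per_document)


# Illustrative values only.
profile = DocumentProfile(
    document_count=50_000,
    avg_pages_per_document=10,
    color_mode="Black & White",
    input_format="TIFF",
    resolution_dpi=300,
    page_size="US Letter",
    text_density="medium",
)
print(profile.total_pages())
```

The page total is the only value used in the time estimate; the remaining fields are kept because they influence how close a real job will come to the per-page average.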

Calculating the Required Resources

OCR and text extraction can be highly CPU-intensive, so we recommend using a high-performance server with an Intel i5 processor or better. You can use the OCR connector on any DocuNECT station server; however, you will need to take into account the existing utilization of the station's resources. To maximize throughput, we recommend using a dedicated server for the OCR connector.

The number of hours required is:
(P * 2.5 seconds) / (PR * 3,600 seconds per hour)

Where:

  • P = Number of Pages
  • PR = Number of Processes (Number of OCR Connector instances)

Note: the 2.5 seconds per page includes OCR, text extraction, and associated I/O operations.

For example:
A medium-complexity conversion job to process 500,000 pages, making use of two OCR connector instances:

(500,000 pages * 2.5 seconds per page) / 2 processes = 625,000 seconds, or approximately 174 hours.
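
The same calculation can be scripted to compare scenarios quickly. This is a rough sketch, assuming the 2.5 seconds per page average holds and that work divides evenly across connector instances; the function name and the sample job size are illustrative only.

```python
SECONDS_PER_PAGE = 2.5      # average from this article (OCR, extraction, I/O)
SECONDS_PER_HOUR = 3600


def estimate_hours(pages: int, processes: int,
                   seconds_per_page: float = SECONDS_PER_PAGE) -> float:
    """Estimate elapsed hours for a job split evenly across processes."""
    if processes < 1:
        raise ValueError("at least one OCR connector instance is required")
    total_seconds = pages * seconds_per_page / processes
    return total_seconds / SECONDS_PER_HOUR


# Hypothetical job: 100,000 pages on 4 connector instances.
print(f"{estimate_hours(100_000, 4):.1f} hours")
```

Treat the result as a planning figure rather than a guarantee, since color depth, resolution, and text density all move the real per-page time.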

Multi-Processor Capabilities

In v4.8, the Document Processor Connector has multi-processor capabilities (with an additional license) that can help with the throughput of OCR'd documents. Each additional instance of the connector increases throughput. We recommend adding one CPU core for each instance to cover the additional resources required.
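
When the deadline is fixed and the instance count is the unknown, the time formula can be rearranged to size the server. The sketch below assumes throughput scales linearly with the number of instances and uses the 2.5 seconds per page average; both are planning assumptions, and the function name is hypothetical.

```python
import math


def instances_for_deadline(pages: int, deadline_hours: float,
                           seconds_per_page: float = 2.5) -> int:
    """Smallest number of connector instances that meets the deadline."""
    if deadline_hours <= 0:
        raise ValueError("deadline_hours must be positive")
    total_seconds = pages * seconds_per_page
    available_seconds = deadline_hours * 3600
    # Round up: a fractional instance cannot be deployed.
    return max(1, math.ceil(total_seconds / available_seconds))


# Hypothetical target: 500,000 pages within 48 hours.
instances = instances_for_deadline(500_000, 48)
# One CPU core per instance, following the recommendation above.
print(instances, "instances,", instances, "CPU cores")
```

Real scaling is rarely perfectly linear, so it is worth validating the estimate with a sample batch before committing to hardware or licenses.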