Planning an Architecture for OCR and Text Extraction

Date: 7/31/2014
Applies to Version: v4.7, v4.8, v5.1 and v5.2

OCR and text extraction can be resource-intensive and need an appropriate system architecture to achieve the throughput the business requires. This article provides information to assist with planning.

Document Analysis

The first step is to analyze the number and type of documents by gathering the following information:

  • Number of documents
  • Average number of pages per document
  • Image color (black and white, grayscale, full color)
  • Type of input file (TIFF, PDF …)
  • Typical resolution (200dpi, 300dpi…)
  • Typical page size (US Letter, A4, A3…)
  • Typical text density
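
As an illustration, the figures gathered above could be recorded in a small structure so that the total page count is available for later sizing. This is a minimal sketch, not part of DocuNECT: the `DocumentProfile` class, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DocumentProfile:
    """Hypothetical record of the inventory figures listed above."""
    document_count: int
    avg_pages_per_document: float
    color_mode: str       # e.g. "black-and-white", "grayscale", "full-color"
    input_format: str     # e.g. "TIFF", "PDF"
    resolution_dpi: int   # e.g. 200, 300
    page_size: str        # e.g. "US Letter", "A4", "A3"
    text_density: str     # free-form label chosen by the analyst

    def total_pages(self) -> int:
        """Estimated page count: documents times average pages per document."""
        return round(self.document_count * self.avg_pages_per_document)


# Sample values are made up for illustration only.
profile = DocumentProfile(1_000, 5, "black-and-white", "TIFF", 300, "US Letter", "medium")
print(profile.total_pages())  # 5000
```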

Calculating the Required Resources

OCR and text extraction can be highly CPU-intensive, so we recommend a high-performance server with an Intel i7 processor or better. You can run the OCR connector on any DocuNECT station server; however, you will need to take into account the existing utilization of that station's resources. To maximize throughput, we recommend a dedicated server for the Document Processor Connector.

The estimated number of hours required is:
(P * 2.5 seconds) / (PR * 3,600 seconds per hour)

  • P = Number of Pages
  • PR = Number of Processes (Number of OCR Connector instances)

Note: the 2.5 seconds per page includes OCR, text extraction, and associated I/O operations.

For example:
A medium-complexity conversion job that processes 5,000 pages using two OCR connector instances:

(5,000 pages * 2.5 seconds per page) / 2 processes = 6,250 seconds, or about 104 minutes (1.7 hours).
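
The same arithmetic can be sketched in a few lines of Python. The function name `estimate_seconds` is hypothetical, and the sketch assumes the work divides evenly across the connector instances, which a real job may not achieve.

```python
SECONDS_PER_PAGE = 2.5  # planning figure from this article (OCR, text extraction, I/O)


def estimate_seconds(pages: int, processes: int,
                     seconds_per_page: float = SECONDS_PER_PAGE) -> float:
    """Estimated wall-clock seconds, assuming pages split evenly across processes."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    if processes < 1:
        raise ValueError("processes must be at least 1")
    return pages * seconds_per_page / processes


seconds = estimate_seconds(5_000, 2)
print(f"{seconds:.0f} seconds")           # 6250 seconds
print(f"{seconds / 60:.0f} minutes")      # 104 minutes
print(f"{seconds / 3600:.2f} hours")      # 1.74 hours
```

Treat the result as a planning estimate: the 2.5 seconds per page is an average, and actual times will vary with the document characteristics gathered during analysis.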

Multi-Processor Capabilities

The Document Processor Connector now has multi-processor capabilities that can help with the throughput of OCR'd documents. Each additional instance of the connector increases throughput. We recommend adding one CPU core for each instance to cover the additional resources it requires.
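
For capacity planning, the instance count can be worked backwards from a target completion time. The sketch below is an assumption-laden illustration: the function name `instances_for_deadline` is hypothetical, it reuses the 2.5 seconds per page planning figure, and it assumes throughput scales linearly with the number of instances, which this article does not guarantee.

```python
import math


def instances_for_deadline(pages: int, deadline_hours: float,
                           seconds_per_page: float = 2.5) -> int:
    """Smallest instance count that meets the deadline under linear scaling (assumed)."""
    if deadline_hours <= 0:
        raise ValueError("deadline_hours must be positive")
    total_seconds = pages * seconds_per_page
    return max(1, math.ceil(total_seconds / (deadline_hours * 3600)))


instances = instances_for_deadline(5_000, deadline_hours=1)
extra_cores = instances  # one CPU core per instance, per the recommendation above
print(instances, extra_cores)  # 4 4
```

Because scaling is rarely perfectly linear, it is worth validating the estimate with a small pilot run before committing to hardware.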