Document Discovery

What is DocuNECT?



Classifying and Extracting Data

We first need to understand what type of document we are dealing with: for example, is it an invoice or a loan document? This process is called Classification. If several document types are contained in one document "blob", it must first be separated into the individual documents so that each can be processed. To classify and index documents, we need to gather information about their content.

Extracting the Text From Documents

The business rules work on the content of the document. DocuNECT can extract the text of the document from multiple formats:

  • Extract data from the content of text-based documents (Microsoft Office, Adobe PDF Text, Adobe PDF Forms, Text, XML, HTML)
  • Full-text OCR of image-based documents (TIFF, Adobe PDF Image, JPG, BMP, GIF)
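The split between the two extraction paths can be pictured as a simple routing step. The sketch below is only an illustration in Python, not DocuNECT's implementation: the function name `extraction_route`, the extension sets and the `pdf_has_text_layer` flag are all assumptions made for the example.

```python
from pathlib import Path

# Hypothetical extension sets, loosely based on the formats listed above.
TEXT_BASED = {".docx", ".xlsx", ".pptx", ".txt", ".xml", ".html", ".htm"}
IMAGE_BASED = {".tif", ".tiff", ".jpg", ".jpeg", ".bmp", ".gif"}


def extraction_route(filename: str, pdf_has_text_layer: bool = False) -> str:
    """Return 'text' for direct text extraction or 'ocr' for full-text OCR."""
    ext = Path(filename).suffix.lower()
    if ext == ".pdf":
        # A PDF may be text-based or image-only, so the extension is not enough.
        return "text" if pdf_has_text_layer else "ocr"
    if ext in TEXT_BASED:
        return "text"
    if ext in IMAGE_BASED:
        return "ocr"
    raise ValueError(f"Unsupported format: {ext or filename}")
```

For instance, `extraction_route("scan.TIFF")` selects the OCR path, while a PDF is routed according to whether it carries a text layer.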

Separation, Classification and Indexing

  • Document Separation - Documents can be separated by using a number of different methods:
    • Manual (A user can split and re-arrange pages within the classification UI)
    • Barcodes (Code 11, Code 39, Code 128, CodaBar, Interleaved 2 of 5, EAN13, EAN18, UPCE, Add2, PDF417, Data Matrix, QR Code)
    • Patch Codes
    • Business Rules (using values in the content of the document to determine where to separate)
  • Document Classification - Once the document has been separated (if required) then the document can be classified:
    • Manual (A user can split and re-arrange pages within the classification UI)
    • Barcodes (Code 11, Code 39, Code 128, CodaBar, Interleaved 2 of 5, EAN13, EAN18, UPCE, Add2, PDF417, Data Matrix, QR Code)
    • Business Rules (Using business rules to identify text, tags and properties of the document to determine the type)
  • Document Indexing - Once we have identified the type of document then we can assign/extract the index data:
    • Manual Indexing
    • Powerful database lookup functionality to gather information about the document from external databases and business applications
    • Extract data from an external reference file (text file, Microsoft Excel, XML, etc.)
    • Business Rules (using the content of the document to extract data)
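As a rough illustration of rule-based separation and classification, the Python sketch below starts a new document whenever a page matches a classification rule and attaches the following pages to it. The rule patterns, the `RULES` table and the function names are hypothetical and are not taken from DocuNECT; real rules would be defined in the product itself.

```python
import re

# Hypothetical rules: a page matching a pattern starts a new document of that type.
RULES = {
    "Invoice": re.compile(r"\binvoice\s+(number|no\.?)", re.IGNORECASE),
    "Promissory Note": re.compile(r"\bpromissory\s+note\b", re.IGNORECASE),
}


def classify_page(text):
    """Return the first document type whose rule matches the page, else None."""
    for doc_type, pattern in RULES.items():
        if pattern.search(text):
            return doc_type
    return None


def separate(pages):
    """Split a list of page texts into (doc_type, [page_numbers]) groups."""
    documents = []
    for number, text in enumerate(pages, start=1):
        doc_type = classify_page(text)
        if doc_type is not None or not documents:
            # A matching page (or the very first page) opens a new document.
            documents.append((doc_type or "Unclassified", [number]))
        else:
            # Pages without a match continue the current document.
            documents[-1][1].append(number)
    return documents
```

Given a four-page blob where page 1 contains "Invoice Number" and page 3 contains "Promissory Note", `separate` yields two documents covering pages 1-2 and 3-4.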

How DocuNECT stores and classifies documents


Using Business Rules

Business rules can be assigned to the Separation, Classification and Indexing stages. Percentage confidence thresholds can be set to flag potential issues to users, who can then verify the information. The business rules for extraction use DocuNECT's DocScript capabilities, which allow comprehensive rule sets to be created.
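The idea of per-stage confidence thresholds can be sketched as follows. This is an illustration in Python rather than DocScript, and the threshold values, the `StageResult` type and the `needs_review` helper are assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical thresholds (percent) for each stage; real values are configured per project.
THRESHOLDS = {"separation": 90.0, "classification": 85.0, "indexing": 80.0}


@dataclass
class StageResult:
    stage: str         # "separation", "classification" or "indexing"
    value: str         # what the rule produced
    confidence: float  # percentage between 0 and 100


def needs_review(result: StageResult) -> bool:
    """Flag a result for user verification when it falls below its stage threshold."""
    return result.confidence < THRESHOLDS[result.stage]
```

A classification result with 72% confidence would be flagged for review under these example thresholds, while one with 95% would pass through.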

For example, if we have a Promissory Note document type where the loan interest rate is buried in a paragraph, the DocScript text extraction capabilities can identify and extract the interest rate based on proximity information. We can then apply business rules to check, for example, that the interest rate falls within a specific range; if the extracted rate is 304%, the decimal point was probably missed during the OCR process.
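The Promissory Note example can be approximated in Python. The sketch below does not use DocScript syntax; the regular expression, the 60-character proximity window, the 0.5 to 25 percent range and the function name are all assumptions chosen for illustration.

```python
import re

# Proximity rule (hypothetical): a percentage within 60 non-digit characters
# after the phrase "interest rate".
RATE_NEAR_KEYWORD = re.compile(
    r"interest\s+rate\D{0,60}?(\d+(?:\.\d+)?)\s*%", re.IGNORECASE
)


def extract_interest_rate(text, low=0.5, high=25.0):
    """Extract the rate near 'interest rate' and validate it against a range."""
    match = RATE_NEAR_KEYWORD.search(text)
    if match is None:
        return None
    raw = match.group(1)
    rate = float(raw)
    result = {"value": rate, "flag": None, "suggestions": []}
    if not (low <= rate <= high):
        result["flag"] = "out_of_range"
        if "." not in raw:
            # OCR may have dropped the decimal point: try each insertion position
            # and keep the candidates that fall inside the allowed range.
            for i in range(1, len(raw)):
                candidate = float(raw[:i] + "." + raw[i:])
                if low <= candidate <= high:
                    result["suggestions"].append(candidate)
    return result
```

With the text "an annual interest rate of 304% on the unpaid principal", the value is flagged as out of range and 3.04 is offered as the only in-range candidate, which a user could then confirm.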

Document Information

For each document, DocuNECT generates document information that records the following for each index value:

  • The business rule that was used to extract the information
  • The text location of where the information was found (Page number and location)
  • A description of the source to provide reference information on how and why the value was extracted
  • Whether the value was corrected by a user and, if so, what it was corrected to. Note that this information is used by the DocuNECT Machine Learning module.
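One way to picture the per-value record described in the list above is as a small data structure. The Python sketch below is an assumption about shape only: the class name `IndexValueInfo`, its field names and the bounding-box layout are hypothetical and do not reflect DocuNECT's actual storage format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class IndexValueInfo:
    field_name: str                       # index field, e.g. "InterestRate"
    value: str                            # value extracted by the rule
    rule_name: str                        # business rule that produced the value
    page: int                             # page number where the text was found
    location: Tuple[int, int, int, int]   # hypothetical (x, y, width, height) box
    source_description: str               # how and why the value was extracted
    corrected_value: Optional[str] = None # set only when a user corrects the value

    @property
    def was_corrected(self) -> bool:
        return self.corrected_value is not None

    @property
    def final_value(self) -> str:
        """The user's correction if present, otherwise the extracted value."""
        return self.corrected_value if self.was_corrected else self.value
```

Keeping the original and corrected values side by side is what would let a learning component compare what a rule extracted with what a user accepted.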

Web Based Indexing and Classification

DocuNECT has a web-based classification review and indexing module. This allows users to review the information extracted by the business rules and make manual adjustments. To support manual indexing, DocuNECT also has a Data Grab feature that allows users to grab data directly from the image itself to save typing.
