Lifecycle DocScript Discovery Functions in v5.2

Building Lifecycle Applications for v5.0



The Discovery DocScript functions work in conjunction with the Discovery module and are typically used within Classification Rules and Indexing Rules.


Understanding Search Results Blocks

DocuNECT has a powerful document architecture called DocInfo that stores different characteristics of the document. Discovery functions, including the document indexing functions, return a Search Result Block, which is a DocScript array with the following elements:

  • Position 0 - Extracted value from the text.
  • Position 1 - The page number the value was extracted from.
  • Position 2 - OCR block x position.
  • Position 3 - OCR block y position.
  • Position 4 - OCR block width.
  • Position 5 - OCR block height.

The Search Result Block is part of the information that is displayed against the field and the field zoning, as shown in the screenshot:

SRB.png
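
As a rough illustration of the array layout described above, the following Python sketch gives each of the six positions a name. Python is used for illustration only, since this page does not show DocScript syntax for reading array elements. The `SearchResultBlock` type, the `parse_block` helper, and the sample values are hypothetical and are not part of the product.

```python
from typing import NamedTuple, Sequence


class SearchResultBlock(NamedTuple):
    """Named view over the six positions of a Search Result Block."""
    value: str   # Position 0: extracted value from the text
    page: int    # Position 1: page number the value was extracted from
    x: int       # Position 2: OCR block x position
    y: int       # Position 3: OCR block y position
    width: int   # Position 4: OCR block width
    height: int  # Position 5: OCR block height


def parse_block(raw: Sequence) -> SearchResultBlock:
    """Convert a raw array into a SearchResultBlock.

    Only the first six elements are read, so trailing elements are ignored.
    """
    if len(raw) < 6:
        raise ValueError("a Search Result Block needs at least 6 elements")
    value, page, x, y, width, height = raw[:6]
    return SearchResultBlock(str(value), int(page), int(x), int(y),
                             int(width), int(height))


# Hypothetical sample data, not taken from a real document.
block = parse_block(["LN-004512", 1, 120, 340, 96, 18])
print(block.value, block.page)
```

Naming the positions once avoids scattering index numbers through later logic.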

Classification Functions

The following table details the functions related to the automated Classification and Indexing rules:

Function Description
[Classification.ApplyRules](pagetext)

This command is run automatically when Execute Classification Rules is set to Yes in the lifecycle. If you want to control when the rules are run, disable the flag and use this command in the DocScript.

Parameters

  • pagetext - The page text of the document.

Indexing Functions (Finding Information in Documents)

The following table details the functions related to the automated Classification and Indexing rules:

Function Description
[Indexing.ApplyRules]() This command is run automatically when Execute Indexing Rules is set to Yes in the lifecycle. If you want to control when the rules are run, disable the flag and use this command in the DocScript.
[Indexing.FindText](string, mode, page)

Finds text within a specific page.

Returns Search Result Block array related to the text found.

Parameters

  • string - String to find.
  • mode - Either true - search using RegEx, or false - search using text. Default is true.
  • page - If left blank, then the whole document will be searched by default. For pages, you can specify a single page, or page ranges such as 1-5 (pages 1 through 5), or 1,2,3,4 (pages 1,2,3 and 4).
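
The page formats accepted by the page parameter (a single page, a range such as 1-5, or a list such as 1,2,3,4) can be sketched in Python as follows. This is an illustration of the notation only, not the product's parser. The `parse_pages` name is hypothetical, and returning an empty list to mean "whole document" is an assumption of this sketch, as is accepting a mix of ranges and single pages.

```python
from typing import List


def parse_pages(spec: str) -> List[int]:
    """Expand a page specification into a sorted list of page numbers.

    An empty specification returns an empty list, which this sketch
    treats as "search the whole document".
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid page range: {part}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_pages("1-5"))
print(parse_pages("1,2,3,4"))
```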
[Indexing.FindAround](tag, above, below, left, right, removetag, includeintersections)

Finds the text around an anchor. For example, if you are trying to find a Loan No. you can search for the label "Loan No" and then look at the text above, below, to the left and right of the label to extract the value.

Returns Search Result Block array related to the text found.

Parameters

  • tag - The anchor text (tag) to center the find around.
  • above - Specifies the zone above (pixels) to find the search value.
  • below - Specifies the zone below (pixels) to find the search value.
  • left - Specifies the zone to the left (pixels) to find the search value.
  • right - Specifies the zone to the right (pixels) to find the search value.
  • removetag - Removes the tag text from the result. Accepts 1 or 0 (true or false).
  • includeintersections - Indicates whether to include where OCR blocks intersect or not. Accepts 1 or 0 (true or false). The default is false.
  • page - Targets a specific page number. If left blank then the whole document is searched.
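
To make the above, below, left and right zones concrete, the Python sketch below expands an anchor's OCR rectangle by the given pixel margins. It is a geometric illustration only. The `search_zone` helper is hypothetical, and the sketch assumes a coordinate origin at the top-left of the page with y increasing downward, which this page does not confirm.

```python
from typing import Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height)


def search_zone(anchor: Rect, above: int, below: int,
                left: int, right: int) -> Rect:
    """Grow the anchor rectangle by the given pixel margins.

    Coordinates are clamped at 0 so the zone never starts off the page.
    """
    x, y, width, height = anchor
    zone_x = max(0, x - left)
    zone_y = max(0, y - above)
    zone_right = x + width + right
    zone_bottom = y + height + below
    return (zone_x, zone_y, zone_right - zone_x, zone_bottom - zone_y)


# Hypothetical anchor block for the label "Loan No": look 150 pixels to its right.
print(search_zone((100, 200, 80, 20), above=0, below=0, left=0, right=150))
```

With only a right margin set, the zone is the label's own row extended to the right, which suits labels whose value is printed on the same line.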
[Indexing.FindBetween](string1,string2)

Finds text between two anchors and returns a Search Result Block array related to the text found.

Parameters

  • string1 - Start string.
  • string2 - End string. Note, use 'eol' if there is no text before the end of the line.
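
The idea of extracting text between a start string and an end string, including the 'eol' marker, can be sketched with a regular expression in Python. This is not the product's implementation. The `find_between` helper is hypothetical, it works on plain text rather than OCR blocks, and it only matches when both anchors are on the same line.

```python
import re
from typing import List


def find_between(text: str, start: str, end: str) -> List[str]:
    """Return the text found between start and end on each matching line.

    Passing 'eol' as the end string captures up to the end of the line.
    """
    end_pattern = r"$" if end == "eol" else re.escape(end)
    pattern = re.escape(start) + r"(.*?)(?=" + end_pattern + r")"
    return [match.strip()
            for match in re.findall(pattern, text, flags=re.MULTILINE)]


# Hypothetical page text.
page_text = "Loan No: 12345 Date: 01/01/2019\nCustomer: Jane Doe\n"
print(find_between(page_text, "Loan No:", "Date:"))
print(find_between(page_text, "Customer:", "eol"))
```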
[Indexing.FindValue](anchordictionary, formatdictionary, samplevalue, wholepage)

This function is designed to get specific values from documents, such as Invoice Nos, Purchase Order Nos, Order Nos, etc. For more unstructured text extraction scenarios, use the FindText, FindBetween, or FindAround functions.

Returns Search Result Block array related to the text found. The anchor used is positioned at the end of the array.

Parameters

  • anchordictionary - The Anchor Dictionary is used to identify the areas on the page where the value can be extracted from and is a DocScript array. For example, if you were trying to find the Invoice No. on an invoice this could be represented using a number of possible anchors (depending on how the vendor formats the invoice). For example, a dictionary for Invoice No could be:
{anchorDictionary[0]} = 'Invoice No'
{anchorDictionary[1]} = 'INVOICE NO'
{anchorDictionary[2]} = 'Invoice Number'
{anchorDictionary[3]} = 'INVOICE NUMBER'
{anchorDictionary[4]} = 'Invoice #'
{anchorDictionary[5]} = 'INVOICE #'

This parameter is optional, but is useful to target specific text blocks in the document. If this value is not specified then the whole page text is used.

  • formatdictionary - A DocScript array of possible regular expression formats. For example, you can create an array of possible date format regular expressions to extract. This parameter is optional.
  • page - The specific page to target.
  • samplevalue - You can use this parameter to pass a value that has been used in the document before and this will be turned into a regular expression. For example, in our Invoice No example you can pass in a previous invoice number used and it will use the format (converted to a regular expression) to identify the value in the page. If no Format Dictionary or Sample Value is specified then the system will try and search for values that have a length of more than 6 characters (no spaces) and are at least 50% numeric.
  • wholepage - If no value has been found using the more specific methods, then the system tries to extract the generic value from the whole page. This is either true or false with the default being false.
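
The fallback rule described for samplevalue (values longer than 6 characters, with no spaces, that are at least 50% numeric) can be expressed as a small Python sketch. It follows the rule as worded above and is not the product's code. The function names and the sample text are hypothetical, and splitting the page text on whitespace is an assumption of the sketch.

```python
from typing import List


def looks_like_reference(token: str) -> bool:
    """Apply the fallback rule: more than 6 characters, at least 50% digits."""
    if len(token) <= 6:
        return False
    digit_count = sum(ch.isdigit() for ch in token)
    return digit_count / len(token) >= 0.5


def candidate_values(page_text: str) -> List[str]:
    """Return whitespace-separated tokens that pass the fallback rule."""
    return [token for token in page_text.split()
            if looks_like_reference(token)]


# Hypothetical page text.
print(candidate_values("Invoice No INV-20231 issued to Acme Ltd"))
```

A rule this broad can also pick up amounts or phone numbers, which is one reason to supply an anchor dictionary, a format dictionary, or a sample value where possible.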
[Indexing.GetOCRBlocksInRegion](pagenumber, rectangle, includeintersections)

Returns text found on pagenumber at the specified rectangle.

Parameters

  • pagenumber - Target page number to find the text.
  • rectangle - An array that contains 4 elements: x, y, width, height.
{rectangle[0]} = x
{rectangle[1]} = y
{rectangle[2]} = width
{rectangle[3]} = height
  • includeintersections - Indicates whether to include where OCR blocks intersect or not. Accepts 1 or 0 (true or false).

Best used in conjunction with one of the Find functions.
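
The difference between selecting OCR blocks that lie fully inside a rectangle and also selecting blocks that merely intersect it can be sketched in Python as below. This is one plausible reading of includeintersections, not a description of the product's behavior. The helper names and the sample blocks are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height)
OcrBlock = Tuple[str, Rect]       # (text, rectangle)


def contains(region: Rect, block: Rect) -> bool:
    """True when the block lies entirely inside the region."""
    rx, ry, rw, rh = region
    bx, by, bw, bh = block
    return (bx >= rx and by >= ry
            and bx + bw <= rx + rw and by + bh <= ry + rh)


def intersects(region: Rect, block: Rect) -> bool:
    """True when the block and the region overlap at all."""
    rx, ry, rw, rh = region
    bx, by, bw, bh = block
    return bx < rx + rw and bx + bw > rx and by < ry + rh and by + bh > ry


def blocks_in_region(blocks: List[OcrBlock], region: Rect,
                     include_intersections: bool = False) -> List[str]:
    """Return the text of the blocks selected by the region."""
    test = intersects if include_intersections else contains
    return [text for text, rect in blocks if test(region, rect)]


# Hypothetical OCR blocks on one page.
blocks = [("Total", (10, 10, 40, 10)),
          ("123.45", (60, 10, 50, 10)),
          ("Footer", (10, 500, 60, 10))]
region = (0, 0, 100, 30)
print(blocks_in_region(blocks, region))
print(blocks_in_region(blocks, region, include_intersections=True))
```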


Page Functions

The following table details the functions related to document pages:

Function Description
[Page.GetSize](page)

Returns the page size as an array with two values:

Position 1: Page Width
Position 2: Page Height

Parameters

  • page - The target page no.
[Page.GetOrientation](page)

Returns the page orientation in degrees.

Parameters

  • page - The target page no.

Match Functions

The following table details the functions related to (fuzzy) matching text within a string.

Function Description
[Match.FuzzySearch](searchvalue, intext, difference, ignorecase)

This function performs a fuzzy text search that allows a number of characters in the text to differ from the search value. For example, if you OCR a poor-quality document, some text values may be misread. The phrase "The dog is in the bed" could be mis-OCR'd as "The dog is in the bel", so a search for the word "bed" will not find it. If you add a difference of 1, the function searches for the word in the phrase while allowing a one-character difference. Another use of this function is to search for a name that could be represented slightly differently. For example, you can search for "John Smith" when the name appears in the document as "John A. Smith".

Parameters

  • searchvalue - The value to search.
  • intext - The text in which to search for the value.
  • difference - The number of characters that are allowed to differ from the search value.
  • ignorecase - True ignores the case and false includes case sensitivity. The default is true.
  • ignorespaces - True ignores the spaces and false does not ignore spaces. The default is false.
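
One common way to implement this kind of search is approximate substring matching based on edit distance (Sellers' variant of the Levenshtein algorithm). The Python sketch below uses that technique as a stand-in, since this page does not say which algorithm the product uses. The `fuzzy_search` function is hypothetical, and it counts insertions, deletions and substitutions each as one difference, which is an assumption.

```python
def fuzzy_search(search_value: str, in_text: str, difference: int,
                 ignore_case: bool = True, ignore_spaces: bool = False) -> bool:
    """Return True if some substring of in_text is within `difference`
    edits of search_value."""
    if ignore_case:
        search_value, in_text = search_value.lower(), in_text.lower()
    if ignore_spaces:
        search_value = search_value.replace(" ", "")
        in_text = in_text.replace(" ", "")

    length = len(search_value)
    # previous[i] = best edit distance between the first i characters of
    # the search value and a substring ending at the previous text position.
    previous = list(range(length + 1))
    best = previous[length]
    for ch in in_text:
        current = [0]  # a match may start at any position in the text
        for i in range(1, length + 1):
            cost = 0 if search_value[i - 1] == ch else 1
            current.append(min(previous[i] + 1,          # extra text character
                               current[i - 1] + 1,       # missing text character
                               previous[i - 1] + cost))  # match or substitution
        best = min(best, current[length])
        previous = current
    return best <= difference


print(fuzzy_search("bed", "The dog is in the bel", 0))
print(fuzzy_search("bed", "The dog is in the bel", 1))
print(fuzzy_search("John Smith", "Signed by John A. Smith", 3))
```

Under this counting, the "John A. Smith" case needs a difference of 3, because the three characters "A. " have to be skipped.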
[Match.GetValueRegEx](text)

This will convert the text passed in to a regular expression.

Parameters

  • text - The text to convert to a regular expression.
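
A minimal Python sketch of turning a sample value into a regular expression is shown below, to illustrate the general idea. The mapping used here (digits to `\d`, letters to `[A-Za-z]`, everything else matched literally) is an assumption, and the pattern the product actually generates may differ. The `value_to_regex` name is hypothetical.

```python
import re


def value_to_regex(sample: str) -> str:
    """Build a pattern that matches values with the same shape as sample."""
    parts = []  # list of [token, repeat_count]
    for ch in sample:
        if ch.isdigit():
            token = r"\d"
        elif ch.isalpha():
            token = "[A-Za-z]"
        else:
            token = re.escape(ch)
        # Merge consecutive identical tokens into one counted token.
        if parts and parts[-1][0] == token:
            parts[-1][1] += 1
        else:
            parts.append([token, 1])
    return "".join(token if count == 1 else f"{token}{{{count}}}"
                   for token, count in parts)


# Hypothetical previous invoice number used as the sample.
pattern = value_to_regex("INV-20231")
print(pattern)
print(bool(re.fullmatch(pattern, "ABC-99887")))
```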
[Match.GetDateRegEx](datetext, delimiter)

This will convert a date text passed in to a regular expression. In poor-quality documents, dates can be misread. For example, with dates formatted as '01/01/2019', the '/' can be misread as '1', which invalidates the date. This function allows a regular expression to be built to help find and extract date values.

Parameters

  • datetext - The date text to convert.
  • delimiter - The date delimiter, e.g. '/', '-'.
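
The misread-delimiter problem described above can be illustrated with the Python sketch below, which builds a pattern that accepts either the real delimiter or the digit 1 between the date fields. This is an assumption about what such a pattern could look like, not the expression the product generates. The `date_regex` name is hypothetical.

```python
import re


def date_regex(date_text: str, delimiter: str) -> str:
    """Build a pattern for dates shaped like date_text, tolerating a
    delimiter that OCR has misread as the digit 1."""
    fields = date_text.split(delimiter)
    if not all(field.isdigit() for field in fields):
        raise ValueError("this sketch only handles numeric date fields")
    separator = "(?:" + re.escape(delimiter) + "|1)"
    return separator.join(r"\d{%d}" % len(field) for field in fields)


pattern = date_regex("01/01/2019", "/")
print(bool(re.fullmatch(pattern, "12/03/2020")))  # clean date
print(bool(re.fullmatch(pattern, "1210312020")))  # both delimiters misread as 1
```

A pattern this tolerant also matches some plain digit strings, so the extracted value still needs validating as a date.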
[Match.GetDateRegEx](sampledate, dateextracted)

This validates a date extracted against a sample date.

Parameters

  • sampledate - A sample date.
  • dateextracted - The date text to be validated.
[Match.WordToNumber](string)

Converts a number written as a word to its numeric value, e.g. eight -> 8.

Parameters

  • string - Number as a string.
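
For illustration, a word-to-number conversion can be sketched in Python with lookup tables. This sketch only covers zero to ninety-nine and is not the product's implementation. The `word_to_number` name and the supported range are assumptions.

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def word_to_number(text: str) -> int:
    """Convert a number word such as 'eight' or 'twenty-one' to an int."""
    words = text.lower().replace("-", " ").split()
    if len(words) == 1 and words[0] in UNITS:
        return UNITS[words[0]]
    if len(words) == 1 and words[0] in TENS:
        return TENS[words[0]]
    if (len(words) == 2 and words[0] in TENS
            and 1 <= UNITS.get(words[1], 0) <= 9):
        return TENS[words[0]] + UNITS[words[1]]
    raise ValueError(f"unsupported number text: {text!r}")


print(word_to_number("eight"))
print(word_to_number("Twenty-One"))
```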