System and method for format-agnostic document ingestion
Interested in licensing this patent?
MTEC can help explore whether this patent might be available for licensing for your application.
Abstract
A system for format-agnostic document ingestion including a document ingestion server and a database is disclosed. The server is configured to receive an image of a document comprising text in an unknown format and to convert the image, using OCR, into a plurality of text elements, each comprising content, a size, and an absolute position. The server is also configured to retrieve data detectors from the database, each associated with a data type anticipated to be in the document and comprising at least one identifier, at least one direction, and at least one validation criteria. The server is also configured to identify a potential descriptor by comparing the content of each text element with the at least one identifier, and then to determine whether the text element pointed to by the data detector meets the validation criteria. Finally, the server is configured to associate the validated text element with the data detector and store the content.
Core Innovation
A format-agnostic document ingestion system uses a document ingestion server to receive an image of a document having text arranged in an unknown format. The system converts the image of the document into a plurality of text elements using optical character recognition, and each text element includes content, a size, and an absolute position within the document.
The system identifies a document type by searching the content of each text element for a plurality of distinguishing strings, where each distinguishing string is unique to one document type, and retrieves a plurality of data detectors from the database based on the document type. Each data detector is associated with a data type anticipated to be in the document and includes at least one identifier, at least one direction describing a potential relative direction of a text element having a label associated with the data detector, and at least one validation criteria describing a valid format or a valid range.
The system determines a source of the document by comparing identifiers of data detectors whose data type is unique among potential document sources with the content of each text element. It orders identifiers and directions according to a history stored in the database and associated with the source, updates the history for each data detector based on which identifier and direction matched the most text elements of the data type, identifies a table by calculating relative positions of neighboring text elements using absolute positions, and associates validated header and row or column text elements with the detecting data detectors.
The system identifies a potential descriptor for non-table text by comparing content to data detector identifiers, validates descriptor text elements using the direction and validation criteria, stores associated text element content in the database, and updates detector histories.
Claims Coverage
The partial content includes three independent claims. Across these independent claims, the inventive features center on OCR text elements with absolute positioning, detector-based document type and source determination, identifier- and direction-based validation and association of extracted text, and per-source history updates; broader scope adds table/header extraction, postal address-based source determination, and optional machine-learning detector replacement and expanded validation criteria.
Format-agnostic ingestion with OCR text elements and absolute positioning
A document ingestion server receives an image of a document with text arranged in an unknown format and converts the image into a plurality of text elements using optical character recognition, where each text element comprises content, a size, and an absolute position within the document.
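As an illustration only, a text element of the kind described above could be modeled as a small record. The class name, field names, pixel units, and the tuple layout of the OCR output are assumptions made for this sketch; the patent does not specify a representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextElement:
    """Hypothetical record for one OCR text element."""
    content: str   # recognized text
    x: float       # absolute left edge on the page (assumed pixels)
    y: float       # absolute top edge on the page (assumed pixels)
    width: float   # size of the element's bounding box
    height: float


def from_ocr_rows(rows):
    """Build TextElements from (text, left, top, width, height) tuples.

    Rows whose text is empty after stripping whitespace are skipped.
    """
    return [
        TextElement(text.strip(), left, top, width, height)
        for text, left, top, width, height in rows
        if text.strip()
    ]


# Made-up OCR output used only to exercise the sketch.
rows = [
    ("Invoice No.", 40, 30, 90, 12),
    ("  ", 0, 0, 0, 0),
    ("INV-1042", 140, 30, 70, 12),
]
elements = from_ocr_rows(rows)
```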
Distinguishing-string document type selection of data detectors
The processor identifies a document type by searching the content of each text element for distinguishing strings unique to one document type and retrieves a plurality of data detectors from the database based on the document type.
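A minimal sketch of the distinguishing-string lookup follows. The document types, the strings, and the case-insensitive substring comparison are all assumptions chosen for illustration; the patent only requires that each distinguishing string be unique to one document type.

```python
# Hypothetical table: each distinguishing string belongs to exactly one type.
DISTINGUISHING_STRINGS = {
    "invoice": ["invoice no", "amount due"],
    "bill_of_lading": ["bill of lading", "consignee"],
}


def identify_document_type(contents, table=DISTINGUISHING_STRINGS):
    """Return the first document type whose distinguishing string appears
    in any text element's content, or None if nothing matches."""
    lowered = [c.lower() for c in contents]
    for doc_type, strings in table.items():
        for s in strings:
            if any(s in c for c in lowered):
                return doc_type
    return None


doc_type = identify_document_type(["ACME Corp", "Invoice No.", "INV-1042"])
```

The returned type would then serve as the key for retrieving that type's data detectors from the database.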
Identifier- and direction-based data detector matching with validation criteria
Each data detector comprises at least one identifier that is one of a potential label and a potential format, at least one direction describing a potential relative direction of a text element having a label associated with the data detector, and at least one validation criteria describing a valid format or a valid range. The processor identifies a potential descriptor by comparing content of text elements not part of a table with identifiers, determines if the text element pointed to by the direction meets the validation criteria, and associates the validated text element with the data detector.
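The matching step above can be sketched as follows. This is one possible reading, not the claimed implementation: the `DataDetector` fields, the two supported directions, the nearest-neighbor rule, and the use of a regular expression as the validation criterion are all assumptions.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TextElement:
    content: str
    x: float
    y: float
    width: float
    height: float


@dataclass
class DataDetector:
    data_type: str
    identifiers: list  # potential labels, tried in order
    directions: list   # "right" or "below", tried in order
    pattern: str       # validation criterion: a valid format, as a regex


def neighbor(label, elements, direction):
    """Nearest element in the given direction from the label, or None."""
    if direction == "right":
        cands = [e for e in elements
                 if e.x > label.x and abs(e.y - label.y) <= label.height]
        return min(cands, key=lambda e: e.x - label.x, default=None)
    if direction == "below":
        cands = [e for e in elements
                 if e.y > label.y and abs(e.x - label.x) <= label.width]
        return min(cands, key=lambda e: e.y - label.y, default=None)
    raise ValueError(f"unsupported direction: {direction}")


def extract(detector, elements):
    """Return (value, identifier, direction) for the first validated match."""
    for identifier in detector.identifiers:
        for label in elements:
            if identifier.lower() not in label.content.lower():
                continue  # this element is not a potential descriptor
            for direction in detector.directions:
                value = neighbor(label, elements, direction)
                if value and re.fullmatch(detector.pattern, value.content):
                    return value.content, identifier, direction
    return None


# Made-up page layout used only to exercise the sketch.
elements = [
    TextElement("Invoice No.", 40, 30, 90, 12),
    TextElement("INV-1042", 140, 30, 70, 12),
    TextElement("Date", 40, 60, 30, 12),
    TextElement("2021-03-04", 40, 80, 70, 12),
]
invoice = DataDetector("invoice_number", ["Invoice #", "Invoice No"],
                       ["below", "right"], r"INV-\d+")
date = DataDetector("date", ["Date"], ["right", "below"],
                    r"\d{4}-\d{2}-\d{2}")
invoice_hit = extract(invoice, elements)
date_hit = extract(date, elements)
```

In this example the element below "Invoice No." is "Date", which fails the format check, so the detector falls through to the "right" direction and accepts "INV-1042".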
Source determination by unique identifiers and per-source detector history
The processor determines a source of the document by comparing an identifier of a data detector associated with a data type unique among potential document sources with the content of each text element. For each data detector, it orders identifiers and directions according to a history stored in the database and associated with the source, and it updates the history for each data detector based on which identifier and direction matched the most text elements of the data type described by the data detector.
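The per-source history could be kept as simple hit counts, as in the sketch below. The `DetectorHistory` class, its in-memory storage, and the choice to count one win per document are assumptions for illustration; the patent states only that identifiers and directions are ordered by a stored history and that the history is updated with the pair that matched the most text elements.

```python
from collections import Counter, defaultdict


class DetectorHistory:
    """Hypothetical in-memory stand-in for the history stored in the database."""

    def __init__(self):
        # (source, data_type) -> Counter of (identifier, direction) wins
        self._hits = defaultdict(Counter)

    def ordered(self, source, data_type, identifiers, directions):
        """Identifier/direction pairs, most historically successful first.

        sorted() is stable, so pairs with equal counts keep default order.
        """
        hits = self._hits[(source, data_type)]
        pairs = [(i, d) for i in identifiers for d in directions]
        return sorted(pairs, key=lambda p: -hits[p])

    def record(self, source, data_type, matches):
        """Credit the pair that validated the most text elements.

        matches: list of (identifier, direction) pairs, one per validated
        text element found in the current document.
        """
        if not matches:
            return
        best, _ = Counter(matches).most_common(1)[0]
        self._hits[(source, data_type)][best] += 1


history = DetectorHistory()
ids, dirs = ["Invoice #", "Invoice No"], ["below", "right"]
before = history.ordered("acme", "invoice_number", ids, dirs)
history.record("acme", "invoice_number",
               [("Invoice No", "right"), ("Invoice No", "right"),
                ("Invoice #", "below")])
after = history.ordered("acme", "invoice_number", ids, dirs)
```

Ordering by past success means the combination that usually works for a given source is tried first on that source's next document.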
Table detection via relative positions from absolute coordinates and header association
The processor identifies a table by calculating for each text element a relative position of at least one neighboring text element using the absolute position of the text element and comparing the relative positions. It locates a header by comparing content within the table with identifiers of the data detectors to identify the data type, where the header is one of a row and a column, validates header text elements using validation criteria, associates validated header text elements with validated text elements within the corresponding row and column, and associates the validated text elements with the data detector that identified the header elements.
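One way to picture the table step is the sketch below, which groups elements into rows by absolute y, computes the horizontal offsets between neighbors, and treats consecutive rows with matching offsets as a table. The tolerances, the tuple layout, and the shortcut of treating the first table row as the header are assumptions; the patent locates the header by comparing table content with detector identifiers, which is omitted here.

```python
def group_rows(elements, y_tol=3):
    """Group (content, x, y) tuples into rows of similar y, sorted by x."""
    rows = []
    for el in sorted(elements, key=lambda e: (e[2], e[1])):
        if rows and abs(rows[-1][0][2] - el[2]) <= y_tol:
            rows[-1].append(el)
        else:
            rows.append([el])
    return [sorted(r, key=lambda e: e[1]) for r in rows]


def gaps(row):
    """Relative horizontal offsets between neighboring elements in a row."""
    return [b[1] - a[1] for a, b in zip(row, row[1:])]


def aligned(row_a, row_b, x_tol):
    """True if two rows have the same number of similar neighbor offsets."""
    ga, gb = gaps(row_a), gaps(row_b)
    return len(ga) == len(gb) and all(
        abs(p - q) <= x_tol for p, q in zip(ga, gb))


def find_table(elements, x_tol=5, min_rows=2):
    """Longest run of consecutive multi-element rows with aligned offsets."""
    best, run = [], []
    for row in group_rows(elements):
        if len(row) >= 2 and run and aligned(row, run[-1], x_tol):
            run.append(row)
        else:
            run = [row] if len(row) >= 2 else []
        if len(run) > len(best):
            best = list(run)
    return best if len(best) >= min_rows else []


# Made-up page layout used only to exercise the sketch.
elements = [
    ("ACME Corp", 40, 10),
    ("Item", 40, 100), ("Qty", 200, 100), ("Price", 300, 100),
    ("Widget", 40, 120), ("2", 200, 120), ("9.99", 300, 120),
    ("Gadget", 40, 140), ("1", 201, 140), ("24.50", 300, 140),
    ("Thank you", 40, 200),
]
table = find_table(elements)
header, body = table[0], table[1:]  # assumption: first row is the header
columns = {h[0]: [r[i][0] for r in body] for i, h in enumerate(header)}
```

Each header in `columns` would then be compared with detector identifiers, and the values beneath it validated against that detector's criteria.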
Potential descriptor association and storage of validated content
The processor associates validated text elements with data detectors and stores for each text element associated with one data detector the content of the text element in the database.
Detector-based ingestion without distinguishing strings and with validation criteria restricted to valid format
A format-agnostic document ingestion system receives an image of a document with text arranged in an unknown format, converts it with OCR into text elements with content, size, and absolute position, retrieves a plurality of data detectors each having identifiers as potential labels, directions for potential relative direction, and validation criteria describing a valid format, identifies a potential descriptor by comparing content to identifiers, validates the text element pointed to by the direction against the validation criteria, associates the validated text element with the data detector, stores the associated text element content, and determines source using unique identifiers and updates per-source detector history by ordering identifiers and directions.
Method version of detector-based ingestion with source determination and history updating
A method for format-agnostic document ingestion receives an image of a document, converts it using OCR into text elements with content, size, and absolute position, retrieves data detectors associated with anticipated data types having identifiers as potential labels, directions, and validation criteria describing a valid format, identifies potential descriptors by comparing text content with identifiers, determines whether the text element pointed to by a direction meets validation criteria, associates validated text elements with the data detectors, stores associated text content, and determines source by comparing identifiers of data detectors unique among potential document sources with text element content while ordering and updating per-source detector history based on which identifier and direction matched the most text elements.
Across the independent claims, the core inventive coverage is the combination of OCR-produced text elements with absolute position, detector-driven identification and validation using identifiers and direction, source determination via identifiers unique to sources, per-source history ordering and updating, and table detection and header/row/column association using relative positions computed from absolute coordinates.
Stated Advantages
Not explicitly described in patent.
Documented Applications
Not explicitly described in patent.