Protected indexing and querying of large sets of textual data
Inventors
Rogers, Daniel J. • Carbone, Tyler • Blazakis, Dionysus
Assignees
Terbium Labs LLC • Deloitte Development LLC
Publication Number
US-9552494-B1
Publication Date
2017-01-24
Expiration Date
2036-08-10
Abstract
A protected querying technique involves creating shingles from a query and then fingerprinting the shingles. The documents to be queried are also shingled and then fingerprinted. The overlap between adjacent shingles differs between the query and the documents to be queried, with less or no overlap for the document shingles. The query fingerprints are compared to the fingerprints of the documents to be queried to determine whether there are any matches.
Core Innovation
The invention provides a protected querying technique in which both queries and documents are transformed into fingerprints generated from shingles of their underlying text. The key distinction lies in the different overlaps between adjacent shingles: the query uses greater or maximal overlap between shingles, while the documents (referred to as artifacts) use lesser or no overlap. Both query and artifact shingles are then cryptographically hashed to produce fingerprints, which are stored in a database.
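The asymmetric shingling described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the shingle length of 5 characters, the use of SHA-256, and the names `shingles` and `fingerprints` are all assumptions made for the example. The artifact is shingled with no overlap (step equal to the shingle length) and the query with maximal overlap (step of 1).

```python
import hashlib

def shingles(text, size, step):
    """Character shingles of length `size`, advancing `step` characters each time.

    step == size gives no overlap; step == 1 gives maximal overlap.
    """
    return [text[i:i + size] for i in range(0, len(text) - size + 1, step)]

def fingerprints(text, size, step):
    """Hash each shingle (SHA-256 is an assumption) and return the set of digests."""
    return {
        hashlib.sha256(s.encode("utf-8")).hexdigest()
        for s in shingles(text, size, step)
    }

artifact = "the quick brown fox jumps"   # hypothetical protected document
query = "quick brown fox"                # hypothetical query text

artifact_fps = fingerprints(artifact, size=5, step=5)  # no overlap
query_fps = fingerprints(query, size=5, step=1)        # maximal overlap

shared = artifact_fps & query_fps
print(len(artifact_fps), len(query_fps), len(shared))
```

In this example the artifact yields 5 fingerprints and the query 11, of which 2 coincide ("uick " and "brown"). Shingling the query without overlap would produce "quick", " brow", and "n fox", none of which line up with the artifact's shingle boundaries, which is one way to see why the query side carries the overlap.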
When a querying entity submits a query (in protected form as fingerprints), the system compares the query fingerprints with those of the stored artifacts to determine if any matches exist. If a match is found, the querying entity is provided with an indication of the entity that owns the corresponding artifact, without revealing the plaintext of the matched document itself. The approach emphasizes artifact protection while allowing identification of potential matches, accommodating scenarios where artifact confidentiality is paramount but the security of the query is less critical.
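The query flow in the preceding paragraph can be sketched with an in-memory index that maps each artifact fingerprint to the entity that supplied it. The class name `FingerprintIndex`, the owner identifiers, the 5-character shingle length, and SHA-256 are hypothetical choices for illustration; the source does not specify them. Note that the index holds only digests and owner identifiers, never artifact plaintext.

```python
import hashlib

def fingerprint_set(text, size, step):
    """Hashed character shingles; SHA-256 and this helper are assumptions."""
    return {
        hashlib.sha256(text[i:i + size].encode("utf-8")).hexdigest()
        for i in range(0, len(text) - size + 1, step)
    }

class FingerprintIndex:
    """Hypothetical store: fingerprint -> set of owning entity identifiers."""

    def __init__(self):
        self._owners = {}

    def add_artifact(self, owner_id, text, size=5):
        # Artifacts are shingled without overlap (step == size).
        for fp in fingerprint_set(text, size, step=size):
            self._owners.setdefault(fp, set()).add(owner_id)

    def match_owners(self, query_fps):
        # Return only owner identifiers; no artifact text is stored or returned.
        owners = set()
        for fp in query_fps:
            owners |= self._owners.get(fp, set())
        return owners

index = FingerprintIndex()
index.add_artifact("agency-A", "the quick brown fox jumps")
index.add_artifact("agency-B", "lorem ipsum dolor sit amet")

# The querying entity submits fingerprints, shingled with maximal overlap.
query_fps = fingerprint_set("quick brown fox", size=5, step=1)
owners = index.match_owners(query_fps)
print(owners)
```

Here the query matches only the artifact supplied by the hypothetical "agency-A", so the querying entity learns whom to contact without seeing the matched document.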
The invention addresses challenges in existing search systems and Private Set Intersection (PSI) protocols that either require both queries and documents to be in plaintext, or enforce complex protocols to protect both. This approach simplifies searching in circumstances where only the artifacts require robust protection, improving flexibility and efficiency in cross-entity query scenarios, such as law enforcement data sharing.
Claims Coverage
The patent provides several inventive features covering methods for secure querying and matching of textual artifacts using fingerprinting with varying shingle overlaps and cryptographic hashing.
Fingerprinting artifacts with shingling and cryptographic hashing
Artifacts are processed by generating shingles from their text content, with a defined overlap between adjacent shingles, and then applying cryptographic hashing to produce artifact fingerprints. These fingerprints are stored in a database for future query matching.
Query fingerprinting with greater shingle overlap
A query is processed by generating shingles with a higher degree of character overlap than used for artifacts, then cryptographically hashing these shingles to produce query fingerprints. This difference in overlap enhances security for the artifacts.
Cosine distance-based fingerprint matching
Matches between query fingerprints and stored artifact fingerprints are determined by computing vector representations from the hashed shingles and calculating cosine distances to assess similarity, optionally applying a median or a threshold to identify significant matches.
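A minimal sketch of cosine-distance matching follows. The source does not say how the vectors are built, so this example assumes sparse count vectors keyed by fingerprint; the function names, the placeholder fingerprint labels, and the 0.5 distance threshold are likewise assumptions. The median-based variant mentioned above is not shown.

```python
import math
from collections import Counter

def cosine_distance(a, b):
    """1 minus the cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def matches(query_vec, artifact_vecs, max_distance=0.5):
    """Names of artifacts whose distance to the query is within the threshold."""
    return sorted(
        name for name, vec in artifact_vecs.items()
        if cosine_distance(query_vec, vec) <= max_distance
    )

# Placeholder fingerprint labels stand in for real hash digests.
query_vec = Counter({"fp1": 1, "fp2": 1})
artifact_vecs = {
    "artifact-1": Counter({"fp1": 1, "fp2": 1, "fp3": 1, "fp4": 1}),
    "artifact-2": Counter({"fp5": 1, "fp6": 1}),
}

hits = matches(query_vec, artifact_vecs)
print(hits)
```

With these inputs, "artifact-1" shares two of its four fingerprints with the query (distance of about 0.29) and is reported, while "artifact-2" shares none (distance 1.0) and is not.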
Output of artifact provider identification without revealing document plaintext
Upon detecting a match, the system outputs to the querying entity an identifier for the artifact’s providing entity, enabling contact for further information, while the plaintext of the underlying matched artifact remains protected and unrevealed.
Distributed key-value store implementation and control flexibility
The fingerprints may be stored in a distributed key-value store, facilitating scalable management, and the system can be controlled and maintained by an independent third party, the querying entity, or the artifact providers.
Enhanced artifact security by removing stop words and common fingerprints
Artifact shingles that match a stop word list are removed before hashing, and fingerprints corresponding to artificially common values are filtered out, further strengthening the privacy of the artifact data and reducing the chance of false matches.
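The two filtering steps can be sketched as below. The stop list contents, the rule for what counts as a "common" fingerprint (here, one that appears in more than a given fraction of artifacts), and all function names are assumptions for illustration; the source does not define them.

```python
import hashlib
from collections import Counter

def sha(shingle):
    """SHA-256 digest of a shingle (the hash choice is an assumption)."""
    return hashlib.sha256(shingle.encode("utf-8")).hexdigest()

def artifact_fingerprints(text, size, stop_shingles):
    """Non-overlapping shingles, dropping any on the stop list before hashing."""
    kept = []
    for i in range(0, len(text) - size + 1, size):
        shingle = text[i:i + size]
        if shingle not in stop_shingles:
            kept.append(shingle)
    return {sha(s) for s in kept}

def drop_common(fp_sets, max_fraction):
    """Remove fingerprints present in more than `max_fraction` of artifacts."""
    counts = Counter(fp for fps in fp_sets.values() for fp in fps)
    limit = max_fraction * len(fp_sets)
    common = {fp for fp, n in counts.items() if n > limit}
    return {name: fps - common for name, fps in fp_sets.items()}

# Stop-list filtering: "hello" is on the hypothetical stop list.
filtered_single = artifact_fingerprints("hello world", 5, {"hello"})

# Common-fingerprint filtering: every document opens with "Dear ".
docs = {"a": "Dear Alice", "b": "Dear Bobby", "c": "Dear Carol"}
raw = {name: artifact_fingerprints(t, 5, set()) for name, t in docs.items()}
filtered = drop_common(raw, max_fraction=0.5)
print({name: len(fps) for name, fps in filtered.items()})
```

In the second example the shared "Dear " fingerprint appears in all three documents and is removed, leaving one distinguishing fingerprint per document, so a query containing only the common greeting would no longer match anything.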
Collectively, these inventive features enable secure, privacy-preserving querying of large sets of textual data while allowing artifact provenance to be indicated without exposure of the document content.
Stated Advantages
The system enables identification of matches between queries and protected documents without revealing plaintext document content to the querying entity.
It allows for artifact protection while supporting insecure or lower-security queries, facilitating search in sensitive data environments where artifact confidentiality matters.
The approach reduces protocol complexity compared to conventional PSI systems, which protect both queries and artifacts, offering a flexible balance between security and practicality.
The method allows for distributed, scalable storage and matching using key-value databases, enabling efficient handling of large data volumes.
Documented Applications
Searching law enforcement databases for information about suspects without exposing underlying sensitive artifact data.
Identifying matches between queries and large sets of public or private textual artifacts, such as determining if a quoted passage is present in a corpus of novels.
De-duplication or identification of similar or derivative email messages in large email datasets, such as those published in regulatory investigations.