Protected indexing and querying of large sets of textual data
Inventors
Rogers, Daniel Jordan • Moore, Michael • Blazakis, Dionysus
Assignees
Publication Number
US-9171173-B1
Publication Date
2015-10-27
Expiration Date
2035-03-10
Interested in licensing this patent?
MTEC can help explore whether this patent might be available for licensing for your application.
Abstract
A protected querying technique involves creating shingles from a query and then fingerprinting the shingles. The documents to be queried are also shingled and then fingerprinted. The overlap between adjacent shingles for the query and the documents to be queried is different, there being less, or no overlap for the query shingles. The query fingerprint is compared to the fingerprints of the documents to be queried to determine whether there are any matches.
Core Innovation
Exemplary embodiments of the present invention are directed to identifying matches between a query and one or more documents in a database, wherein the query is protected but the database can be relatively insecure because the underlying data in the database is already publicly available. The invention involves creating shingles from both the query and the documents, fingerprinting these shingles via cryptographic hashing, and comparing the resulting fingerprints to identify matches, all while minimizing potential exposure of the plaintext query.
The problem addressed is that existing search techniques, including private set intersection (PSI) protocols, often require both the search query and the searched documents to be in plaintext form or necessitate complex protocols with third-party involvement for high security. However, this level of complexity is not always necessary, especially when only the query needs to remain confidential and the documents are public. The present invention allows searching and matching queries against large, possibly public datasets while maintaining the confidentiality of the search terms.
The core technique operates by generating shingles from artifact text with at least one character of overlap between adjacent shingles, cryptographically hashing the shingles, and storing the resulting fingerprints in a database. The query is processed into shingles, often with less or no overlap, optionally filtered to remove stop words or common fingerprints, and hashed to form secure query fingerprints. These are then compared against the database fingerprints to determine matches using methods such as cosine distance calculations, enabling matches to be found without revealing the query's actual content.
Claims Coverage
The independent claims cover four main inventive features in the patent.
Protected querying via shingling and cryptographic fingerprinting
A method that obtains a plurality of artifacts in plaintext form, generates shingles from artifact text with at least one character overlap, cryptographically hashes these shingles to produce artifact fingerprints, stores the fingerprints in a database, receives at least one hashed fingerprint from a querying party, compares the received fingerprint(s) to artifact fingerprints, and outputs the artifact in plaintext form upon a match.
Separation of query and artifact processing between different systems
The processor and database are controlled by a first entity that stores artifact fingerprints, while a different querying party generates query fingerprints and submits hashed versions for matching, ensuring that the system holding the artifacts does not need access to plaintext queries.
Determination of similarity by vector-based cosine distance calculation
A method wherein the comparison of query and artifact fingerprints employs generation of query vectors and artifact vectors from cryptographically hashed shingles, performs cosine distance calculations to generate distance scores, and selects the closest match based on these scores.
Security through reduced shingle overlap and optional filtering
In generating fingerprints from the query, the character overlap between adjacent query shingles is less than that of artifact shingles (potentially no overlap), and additional protections such as stop word filtering and removal of common fingerprints are performed prior to matching.
The inventive features focus on enabling protected querying by using differential shingling and fingerprinting, secure separation between query and artifact systems, and vector-based similarity calculations to detect matches while maintaining privacy of the query.
Stated Advantages
The invention enables matches to be identified between a query and a set of artifacts while ensuring security of the plaintext query, so that the operator of the database typically cannot determine the original query.
Using reduced or no overlap for query shingles, and filtering such as stop word removal, significantly improves query security and lowers the risk of hash reversal or brute force attacks.
The system supports efficient searching across large datasets using distributed key value stores, enabling scalability and effective retrieval as data size grows.
The method allows for the identification of derivative or copied texts within a database using the same fingerprinting and comparison mechanism.
Documented Applications
Searching for matches to secure queries, such as an individual's Social Security Number, against public databases or the internet without exposing the query string.
Detecting near-duplicate documents or emails within large datasets, such as matching near-duplicate emails in a public email corpus.
Identifying similarity or copied/derivative texts across multiple artifacts within a database.
Allowing users to search for personal confidential information that might have been exposed in publicly available artifacts, enabling them to take remedial actions.
Interested in licensing this patent?