Detecting copied computer code using cryptographically hashed overlapping shingles

Inventors

Rogers, Daniel J.; Blazakis, Dionysus

Assignees

Deloitte Development LLC

Publication Number

US-10261784-B1

Publication Date

2019-04-16

Expiration Date

2038-06-20

Abstract

Systems and methods of detecting copying of code or portions of code involve disassembling a set of compiled code into an architecture-agnostic intermediate representation. The intermediate representation is used to form a number of cryptographically hashed overlapping shingles. The number of cryptographically hashed overlapping shingles can be searched against a database of cryptographically hashed overlapping shingles to identify copied code.

Core Innovation

The invention provides a technique for detecting copied computer code by disassembling a set of compiled code into an architecture-agnostic intermediate representation. This representation is then used to generate cryptographically hashed overlapping shingles, which are derived by dividing control flow graph paths into overlapping instruction segments. These hashed shingles can then be compared against a database of hashed shingles to identify instances of copied code.
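
The shingling step described above can be illustrated with a minimal sketch. It is not the patented implementation: the summary does not name a hash function, a shingle length, or an instruction format, so SHA-256, a window of three instructions, the newline separator, and the toy instruction strings are all assumptions made for illustration.

```python
import hashlib

def hashed_shingles(instructions, k=3):
    """Hash every window of k consecutive instructions (stride 1).

    Assumptions, not taken from the patent summary: SHA-256 as the
    cryptographic hash, k=3 as the common shingle length, and a
    newline as the separator between instructions in a window.
    """
    if len(instructions) < k:
        # Sequences shorter than the shingle length yield no shingle.
        return []
    digests = []
    for i in range(len(instructions) - k + 1):
        window = "\n".join(instructions[i:i + k])
        digests.append(hashlib.sha256(window.encode("utf-8")).hexdigest())
    return digests

# Hypothetical architecture-agnostic instructions, for illustration only.
ir = ["load r1, [a]", "load r2, [b]", "add r3, r1, r2", "store [c], r3"]
shingles = hashed_shingles(ir, k=3)
```

Because consecutive windows overlap by all but one instruction, a copied fragment produces the same digests wherever it sits inside a larger body of code, provided it is at least one shingle long.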

The problem addressed by this invention is the inefficiency and inaccuracy of existing methods for detecting copied code, particularly when comparing compiled code resulting from different hardware/software configurations or when only small portions of a larger codebase have been copied. Existing solutions are resource-intensive, limited to pairwise comparisons, or ineffective when copying introduces structural changes to the code. These methods often fail to scale or to recognize copied subsections embedded within larger sets of code.

The solution described offers a more efficient and accurate approach by creating architecture-independent hashed fingerprints (shingles) for even small code segments, reducing both processing and memory requirements. Because shingles can be compared using plain text search engines and the original code does not need to be modified (unlike watermarking), the invention makes large-scale copied-code detection practical while maintaining the integrity of the original compiled code.
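
One way to picture why text-search techniques suffice: each hashed shingle is an opaque token, so matching reduces to exact token lookup in an inverted index, which is the core data structure of a text search engine. The sketch below is an assumption-laden illustration rather than the patent's method. The names `build_index` and `find_matches`, the program identifiers, and the short tokens standing in for hex digests are all hypothetical.

```python
from collections import defaultdict

def build_index(corpus):
    """Map each shingle token to the set of programs that contain it."""
    index = defaultdict(set)
    for program_id, shingles in corpus.items():
        for token in shingles:
            index[token].add(program_id)
    return index

def find_matches(index, query_shingles):
    """Count, per indexed program, how many distinct query shingles it shares."""
    hits = defaultdict(int)
    for token in set(query_shingles):
        for program_id in index.get(token, ()):
            hits[program_id] += 1
    return dict(hits)

# Short stand-in tokens; real entries would be full hash digests.
corpus = {"libfoo": ["h1", "h2", "h3"], "appbar": ["h3", "h4"]}
index = build_index(corpus)
matches = find_matches(index, ["h2", "h3", "h9"])
```

A single query touches only the index entries for its own shingles, so the cost does not grow with the number of program pairs, in contrast to pairwise comparison.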

Claims Coverage

The patent contains three independent claims, each introducing a core inventive feature for detecting copied computer code.

Detecting copied code using cryptographically hashed overlapping shingles generated from control flow graphs

A processor disassembles compiled code into an architecture-agnostic intermediate representation. A control flow graph is generated, and each path through it is divided into overlapping instruction segments to create a plurality of cryptographically hashed overlapping shingles. These shingles are compared to a database of hashed shingles. Copied code is identified by detecting matching shingles, and an indication is output to the owner. All shingles have a common length, and no shingle is created for paths shorter than that length.
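
The path-based variant in this claim can be sketched as follows, under several assumptions that the summary does not confirm: the control flow graph is a small adjacency mapping of basic blocks, a path ends at a block with no unvisited successor (a simplification that cuts loops at the back edge), SHA-256 is the hash, and the shingle length is three. The block names, instruction strings, and function names are hypothetical.

```python
import hashlib

def enumerate_paths(cfg, entry):
    """Yield block-name paths from entry until no unvisited successor remains."""
    stack = [(entry, [entry])]
    while stack:
        node, path = stack.pop()
        # Ignore successors already on the path so loops terminate.
        onward = [s for s in cfg.get(node, []) if s not in path]
        if not onward:
            yield path
        for nxt in onward:
            stack.append((nxt, path + [nxt]))

def path_shingles(cfg, blocks, entry, k=3):
    """Return the set of hashed shingles over every enumerated path."""
    result = set()
    for path in enumerate_paths(cfg, entry):
        instrs = [ins for name in path for ins in blocks[name]]
        # Paths shorter than k produce no windows, hence no shingles.
        for i in range(len(instrs) - k + 1):
            window = "\n".join(instrs[i:i + k])
            result.add(hashlib.sha256(window.encode("utf-8")).hexdigest())
    return result

# Toy if/else graph with hypothetical intermediate-representation text.
blocks = {
    "entry": ["cmp r1, 0", "br_if_zero"],
    "then": ["add r2, r2, 1"],
    "else": ["sub r2, r2, 1"],
    "exit": ["ret r2"],
}
cfg = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}

found = path_shingles(cfg, blocks, "entry", k=3)
database = path_shingles({"entry": ["then"], "then": ["exit"], "exit": []},
                         blocks, "entry", k=3)
copied = bool(found & database)  # any shared shingle signals a match
```

Shingling along paths rather than over the raw instruction listing lets a window span a branch, so instructions that are adjacent at run time but not in the binary layout still end up in the same shingle.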

Detecting copied code using architecture-agnostic hashed shingles from intermediate representation, without requiring control flow graph

A processor disassembles compiled code into an architecture-agnostic intermediate representation and generates cryptographically hashed overlapping shingles directly from this representation. At least one of these shingles is selected and compared to a hashed shingle database. A match indicates code copying, and a notification is issued. Every shingle has the same length, and no shingle is created for paths shorter than this length.

A system for detecting copied code using hashed overlapping shingles derived from control flow graphs and intermediate representation

The system comprises a processor and non-transitory memory storing instructions. The processor disassembles compiled code into an architecture-agnostic intermediate representation, generates a control flow graph, then uses the graph to derive cryptographically hashed overlapping shingles. The processor selects shingles, compares them to a database, identifies copied code based on matches, and outputs a notification. Each shingle has a common length, and no shingle is created for paths shorter than that length.

The inventive features broadly cover efficient detection of copied code by forming architecture-neutral, cryptographically hashed overlapping shingles from compiled code, enabling scalable comparison against large hashed-code databases using simple search techniques.

Stated Advantages

Reduces processor and memory resources required for code fingerprinting and comparison.

Allows scalable detection of copied code in large code databases without complex heuristics or pairwise comparisons.

Provides higher accuracy and is less likely to produce false negatives when copied code changes program structure.

Enables detection based on smaller portions of code instead of relying on large, complex functions.

Operates on compiled code, protecting the underlying source code, which is often a trade secret.

Documented Applications

Open source compliance verification.

Third-party security audits.

Intellectual property (IP) theft monitoring.
