Method of cyberthreat detection by learning first-order rules on large-scale social media
Inventors
Rao, Praveen • KAMHOUA, CHARLES • KWIAT, KEVIN • NJILLA, LAURENT
Assignees
United States Department of the Air Force
Publication Number
US-10812500-B2
Publication Date
2020-10-20
Expiration Date
2038-01-30
Interested in licensing this patent?
MTEC can help explore whether this patent might be available for licensing for your application.
Abstract
A cyberthreat detection method and system includes a distributed file system and a commodity cluster. The commodity cluster has a plurality of servers. A data array of key-value pairs related to social media is received; it stores a plurality of predetermined ground predicates. A ground predicate graph is constructed for each user, then partitioned into balanced portions Pi, each corresponding to a server and the ground predicates stored on that server. In parallel on each server, a plurality of learned rules are determined for the files stored on that server. From a union of the plurality of learned rules, the system determines a respective weight for each of the learned rules. The plurality of rules are ranked in order of accuracy by the plurality of weights.
Core Innovation
The invention provides a cyberthreat detection system and method that operate on large-scale social media data, specifically Twitter® posts. It combines first-order logic, probability theory, and graph theory, and runs on a commodity cluster with multiple servers and a distributed file system. Central to the innovation is the automatic learning of first-order rules from massive datasets by constructing ground predicate graphs for users, forming a user-centric graph with weighted vertices and edges representing relationships between users, partitioning this graph into balanced portions corresponding to servers, and then performing parallel rule learning on each partition. The learned rules from all partitions are combined, weighted, and ranked to identify the most relevant rules indicative of cyberthreats.
The problem addressed arises from the social and economic damage caused by cyber threats on social media platforms like Twitter®, including malicious postings, misinformation, and malware spread through malicious links. Prior approaches used handcrafted first-order logic rules in knowledge bases and probabilistic inference to detect suspicious users and malicious content. However, malicious users adapt and change their behavior, and the volume of tweets grows very large, so the rules needed for detection keep evolving and new ones keep emerging. Modeling and reasoning about the veracity of social media posts is difficult because tweet attributes and the relationships among users are complex, noisy, and diverse. This necessitates a scalable automated method that can learn relevant first-order rules efficiently across large datasets and adapt to dynamic social media environments.
To solve these issues, the invention employs a divide-and-conquer strategy using a distributed commodity cluster. It constructs a user-centric graph that clusters ground predicates around users, defines edges based on social interactions such as mentions, friendships, follows, and retweets, and assigns weights to vertices and edges to capture social relationships. Parallel graph partitioning is applied to minimize cut edges and balance the partitions, enabling efficient distribution of data and processing across servers. On each partition, Markov Logic Network (MLN) structure learning algorithms identify probabilistic first-order logic rules, then MLN weight learning assigns weights to these rules. The method finally combines and ranks these rules through an objective function that multiplies and sums weights across partitions, enabling selection of the most relevant rules for cyberthreat detection. This approach drastically reduces processing time while maintaining accuracy in identifying suspicious and malicious behavior on large-scale social media data.
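The graph construction and balancing steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented procedure: the predicate strings, the user names, and the choice of weights (vertex weight as the number of ground predicates per user, edge weight as the number of interactions between two users) are assumptions made for the example. The partitioner is a simple greedy heuristic that balances vertex weight only; it stands in for the parallel graph partitioner, which would also minimize cut edge weight.

```python
from collections import Counter

def build_user_centric_graph(predicates_by_user, interactions):
    """Return (vertex_weight, edge_weight) for a user-centric graph.

    Assumed weighting: a vertex weighs as much as the number of ground
    predicates clustered around that user; an undirected edge weighs as
    much as the number of interactions between its two users.
    """
    vertex_weight = {u: len(preds) for u, preds in predicates_by_user.items()}
    edge_weight = Counter()
    for source, target, _kind in interactions:
        if source != target and source in vertex_weight and target in vertex_weight:
            edge_weight[frozenset((source, target))] += 1
    return vertex_weight, dict(edge_weight)

def greedy_balanced_partition(vertex_weight, k):
    """Assign heaviest users first, each to the currently lightest part.

    Stand-in heuristic: it balances load but ignores edges entirely.
    """
    loads = [0] * k
    assignment = {}
    for user in sorted(vertex_weight, key=lambda u: (-vertex_weight[u], u)):
        part = min(range(k), key=lambda i: (loads[i], i))
        assignment[user] = part
        loads[part] += vertex_weight[user]
    return assignment, loads

# Hypothetical ground predicates, grouped by the user they describe.
predicates_by_user = {
    "alice": ["tweeted(alice, t1)", "containsLink(t1, u1)", "verified(alice)"],
    "bob": ["tweeted(bob, t2)", "containsHashtag(t2, h1)"],
    "carol": ["retweeted(carol, t1)"],
}
# Hypothetical interactions: (source user, target user, kind).
interactions = [
    ("alice", "bob", "mentions"),
    ("bob", "alice", "follows"),
    ("carol", "bob", "friend"),
]

vertex_weight, edge_weight = build_user_centric_graph(predicates_by_user, interactions)
assignment, loads = greedy_balanced_partition(vertex_weight, k=2)
print(assignment, loads)
```

Each part of the resulting assignment would correspond to one server, which then holds the ground predicates of the users assigned to it.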
Claims Coverage
The patent contains three independent claims covering a system, a method, and a system for detecting suspicious social media activity, each incorporating a distributed file system, a commodity cluster, ground predicate construction, user-centric graph formation, partitioning, parallel learned rule generation, weight determination, and rule ranking.
Distributed cyberthreat detection system using a user centric graph partitioned across servers
The system comprises a distributed file system and a commodity cluster with multiple servers. It receives a data array of key-value pairs relating to social media posts and users, stores predetermined ground predicates, constructs ground predicate graphs per user, then builds a user-centric graph where each vertex represents a user's ground predicate graph. This graph is partitioned into balanced portions corresponding to servers, with ground predicates stored as files on respective servers. The system determines learned rules in parallel on each server, receives a union of these rules, determines respective weights for each learned rule, and ranks the rules by these weights.
Method for cyberthreat detection by partitioning user centric graphs and learning weighted first-order rules in parallel
The method includes receiving first-order predicates and ground predicates characterized by key-value pairs relating to social media data, storing the ground predicates, constructing ground predicate graphs per user, and forming a user-centric graph with vertices and edges representing user relationships. The graph is partitioned into balanced portions corresponding to servers, with ground predicates partitioned accordingly. Learned rules are determined in parallel on each server, combined as a union, weighted per rule, and ranked by these weights. The method utilizes map and reduce operations, splitting data into blocks, and stores intermediate and final data in Hadoop Distributed File System files. Edges in the graph are defined according to specific social relation conditions among users.
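The map and reduce pattern mentioned in this claim can be illustrated in plain Python, without Hadoop. The summary does not spell out what each map and reduce step computes, so the sketch below assumes one plausible use: grouping ground predicates by user. The record layout (`user`, `predicate`, `args`) and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import islice

def split_into_blocks(records, block_size):
    """Yield consecutive blocks of at most block_size records."""
    iterator = iter(records)
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return
        yield block

def map_block(block):
    """Map step: emit one (user, ground predicate) pair per record."""
    for record in block:
        args = ", ".join(record["args"])
        yield record["user"], f"{record['predicate']}({args})"

def reduce_by_user(pairs):
    """Reduce step: collect every ground predicate emitted for a user."""
    grouped = defaultdict(list)
    for user, predicate in pairs:
        grouped[user].append(predicate)
    return dict(grouped)

def run_job(records, block_size):
    """Run map over each block in turn, then a single reduce."""
    pairs = []
    for block in split_into_blocks(records, block_size):
        pairs.extend(map_block(block))
    return reduce_by_user(pairs)

# Hypothetical key-value records describing posts and users.
records = [
    {"user": "alice", "predicate": "tweeted", "args": ["alice", "t1"]},
    {"user": "bob", "predicate": "tweeted", "args": ["bob", "t2"]},
    {"user": "alice", "predicate": "mentions", "args": ["alice", "bob"]},
    {"user": "carol", "predicate": "retweeted", "args": ["carol", "t1"]},
    {"user": "bob", "predicate": "follows", "args": ["bob", "alice"]},
]

grouped = run_job(records, block_size=2)
print(grouped)
```

In a real cluster the blocks would be processed on different servers and the intermediate pairs written to distributed files; here everything runs in one process to keep the data flow visible.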
System for detecting suspicious social media activity employing Markov Logic Network learning on partitioned user-centric graphs
This system includes a distributed file system and a commodity cluster of servers, with a processor configured to store predetermined ground predicates and construct ground predicate graphs per user. It builds a user-centric graph with vertices representing each user's ground predicate graph and edges defined by social interactions such as mentions, friendships, follows, and retweets. The graph is partitioned into balanced parts with ground predicates stored accordingly. The system determines a plurality of learned rules in parallel on each server by running an MLN learning algorithm, combines these rules via union, determines respective weights for each rule, and ranks the rules by calculating partial products and summing them to define total weights.
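The ranking step can be sketched as follows. The summary says only that partial products are calculated and summed into total weights; it does not say which quantities are multiplied. The sketch therefore assumes that each partial product is a rule's learned weight in one partition multiplied by that partition's share of all ground predicates. The rule strings and numbers are invented for illustration.

```python
def rank_rules(partition_results):
    """Rank rules by total weight, highest first.

    partition_results is a list of (ground_predicate_count, {rule: weight}).
    Assumed objective: total(rule) = sum over partitions of
    weight_in_partition * (partition_count / total_count).
    """
    total_count = sum(count for count, _ in partition_results)
    totals = {}
    for count, rule_weights in partition_results:
        share = count / total_count
        for rule, weight in rule_weights.items():
            # Partial product for this rule in this partition.
            totals[rule] = totals.get(rule, 0.0) + weight * share
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical rules, as they might come out of MLN learning on two servers.
RULE_A = "tweeted(u, t) ^ containsLink(t, l) ^ malicious(l) => attacker(u)"
RULE_B = "follows(u, v) ^ attacker(v) => attacker(u)"
RULE_C = "retweeted(u, t) ^ malicious(t) => suspicious(u)"

partition_results = [
    (60, {RULE_A: 2.0, RULE_B: 0.5}),
    (40, {RULE_A: 1.0, RULE_C: 1.5}),
]

ranking = rank_rules(partition_results)
for rule, total in ranking:
    print(f"{total:.2f}  {rule}")
```

A rule learned on only one partition still receives a total weight, but a rule supported on several partitions accumulates more partial products, which is one way the union of per-server rules could be ordered.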
The independent claims collectively disclose a distributed system and method that partition a user-centric graph derived from social media ground predicates, perform parallel learning of weighted first-order logic rules using MLN techniques, and rank these rules to detect cyberthreats on social media efficiently and accurately.
Stated Advantages
The divide-and-conquer approach significantly reduces the time required to learn relevant knowledge base rules over large datasets by operating in parallel on partitioned ground predicates rather than a single large set.
User-centric graph partitioning minimizes the chance of missing important rules that span multiple partitions by balancing partitions and minimizing cut edge weights.
Automated rule learning adapts to evolving malicious behaviors on social media, improving accuracy in detecting suspicious users and malicious content.
The use of commodity clusters and distributed file systems enhances scalability and flexibility for large-scale social media data processing.
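The trade-off between balanced partitions and low cut edge weight, mentioned in the advantages above, can be made concrete with two small metrics. This is an illustrative sketch: the graph, the weights, and the imbalance measure (heaviest part divided by the ideal even load) are assumptions, not definitions taken from the patent.

```python
def cut_weight(edge_weight, assignment):
    """Sum the weights of edges whose endpoints sit in different parts."""
    return sum(
        weight
        for (u, v), weight in edge_weight.items()
        if assignment[u] != assignment[v]
    )

def imbalance(vertex_weight, assignment, k):
    """Heaviest part's load divided by the ideal even load (1.0 is perfect)."""
    loads = [0] * k
    for user, weight in vertex_weight.items():
        loads[assignment[user]] += weight
    ideal = sum(loads) / k
    return max(loads) / ideal

# Hypothetical user-centric graph: vertex and edge weights are invented.
vertex_weight = {"alice": 3, "bob": 2, "carol": 1}
edge_weight = {("alice", "bob"): 2, ("bob", "carol"): 1}

balanced = {"alice": 0, "bob": 1, "carol": 1}  # even loads, heavier cut
low_cut = {"alice": 0, "bob": 0, "carol": 1}   # lighter cut, uneven loads

for name, assignment in [("balanced", balanced), ("low_cut", low_cut)]:
    print(name, cut_weight(edge_weight, assignment),
          round(imbalance(vertex_weight, assignment, 2), 3))
```

A rule that relates users on both sides of a cut edge cannot be learned from either partition alone, so a partitioner that keeps strongly connected users together while holding imbalance near 1.0 addresses both goals at once.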
Documented Applications
Detecting suspicious users and malicious content on large-scale Twitter® data for cyberthreat detection.
Protecting against cyberattacks on social media platforms by identifying misinformation and malicious posts.