Machine learning methods and systems for identifying patterns in data

Inventors

VIRKAR, Hemant; Stark, Karen; Borgman, Jacob

Assignees

Virkar Hemant V; Hemant V Virkar; DIGITAL INFUZION Inc

Publication Number

US-10402748-B2

Publication Date

2019-09-03

Expiration Date

2029-09-10

Abstract

Methods for training machines to categorize data, and/or recognize patterns in data, and machines and systems so trained. More specifically, variations of the invention relate to methods for training machines that include providing one or more training data samples encompassing one or more data classes, identifying patterns in the one or more training data samples, providing one or more data samples representing one or more unknown classes of data, identifying patterns in one or more of the data samples of unknown class(es), and predicting one or more classes to which the data samples of unknown class(es) belong by comparing patterns identified in said one or more data samples of unknown class with patterns identified in said one or more training data samples. Also provided are tools, systems, and devices, such as support vector machines (SVMs) and other methods and features, software implementing the methods and features, and computers or other processing devices incorporating and/or running the software, where the methods and features, software, and processors utilize specialized methods to analyze data.

Core Innovation

The invention provides methods, systems, and devices for training machines to categorize or recognize patterns in data, such as biological and image data, and enables searches for relevant information within this data. These methods include training two or more learning machines with different or supplemented training data sets, evaluating the performance of these machines using multiple derived measures, and selecting the best-trained machine based on automated or learned criteria. The selected machine is then used for search queries or classification of data in databases, with hardware and software implementing these processes.

The problem addressed is the lack of effective, efficient, and user-accessible tools for pattern recognition and data querying across complex, high-dimensional, or diverse datasets, specifically in fields like bioinformatics, where traditional methods either require advanced mathematical knowledge or rely on annotation-based searches. Existing machine learning tools, such as SVMs, have seen limited application to searching and querying large or varied repositories, and selecting a trained machine that generalizes well still calls for improvements in speed, automation, and usability by non-experts.

The core innovation introduces a machine learning framework that automates the selection of optimal trained learning machines by using multiple performance measures (e.g., divergence, number of support vectors, cross-validation methods). It incorporates novel weighting schemes, including randomly generated low-weighted negative samples, and feature reduction methods, facilitating the analysis and search of large complex datasets. Automated ranking and learning-machine selection are achieved via secondary learning machines, such as neural networks, to further predict and optimize the generalization and query-success of the chosen trained machine.

Claims Coverage

The claims provide coverage for multiple inventive features, mainly focused on training, selecting, and utilizing optimal learning machines for searching and classifying data.

Method for selecting among trained learning machines using secondary selecting learning machines

The method includes:
- Training two or more learning machines with different training data sets to obtain multiple trained machines
- Obtaining multiple measures of training output from each trained machine
- Training one or more selecting learning machines (which may be neural networks) to select among the trained machines based on these multiple measures
- Selecting one trained machine using the selecting learning machines
- Loading the selected trained machine into computer memory to configure a computer to perform search queries or classification of a database or dataset based on the selected machine
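The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: a nearest-centroid classifier stands in for the learning machines (the patent discusses SVMs), the toy data is invented, and a fixed weighted score (`SELECTOR_WEIGHTS`, hypothetical) stands in for the trained selecting learning machine.

```python
from statistics import mean

def train_centroid(samples, labels, features):
    """Train a nearest-centroid classifier restricted to the given feature indices."""
    centroids = {}
    for cls in sorted(set(labels)):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        centroids[cls] = [mean(r[f] for r in rows) for f in features]
    return {"features": features, "centroids": centroids}

def predict(machine, sample):
    """Return the class whose centroid is closest to the sample."""
    def dist2(centroid):
        return sum((sample[f] - v) ** 2
                   for f, v in zip(machine["features"], centroid))
    return min(machine["centroids"],
               key=lambda cls: dist2(machine["centroids"][cls]))

def loo_accuracy(samples, labels, features):
    """Leave-one-out accuracy: retrain without each sample, then test on it."""
    hits = 0
    for i in range(len(samples)):
        machine = train_centroid(samples[:i] + samples[i + 1:],
                                 labels[:i] + labels[i + 1:], features)
        hits += predict(machine, samples[i]) == labels[i]
    return hits / len(samples)

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
samples = [[1.0, 5.0], [1.2, 1.0], [0.8, 3.0],
           [3.0, 2.5], [3.2, 4.0], [2.8, 0.4]]
labels = [0, 0, 0, 1, 1, 1]

# Step 1: train one machine per training-set variant (here, per feature subset).
candidates = [(0,), (1,), (0, 1)]
machines = {f: train_centroid(samples, labels, f) for f in candidates}

# Step 2: obtain several measures of training output for each machine.
measures = {f: {"loo": loo_accuracy(samples, labels, f), "n_features": len(f)}
            for f in candidates}

# Step 3: stand-in selector. ASSUMPTION: these fixed weights replace the
# trained selecting learning machine described in the claims.
SELECTOR_WEIGHTS = {"loo": 1.0, "n_features": -0.05}

def selector_score(m):
    return sum(SELECTOR_WEIGHTS[k] * m[k] for k in SELECTOR_WEIGHTS)

# Step 4: select one trained machine and keep it in memory for later queries.
best_features = max(candidates, key=lambda f: selector_score(measures[f]))
best = machines[best_features]
```

On this toy data the machine trained on feature 0 alone is selected, and `predict(best, record)` can then be applied to each record of a dataset. In the claimed method the weighting of the measures would be learned rather than fixed by hand.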

Assigning and weighting measures for learning machine selection

The inventive feature encompasses:
- Using measures of training output, including number of features, support vectors, unbound margins, divergence, margins, angles, positive/negative/total leave-one-out, and stability
- Selecting the optimal trained learning machine by means of weighting these measures within a neural network or similar selector
- The neural network assigns weights based on methods including LOO, LTO, n-fold cross-validation, number of support vectors, VC dimension, ratio of support vectors, parameter magnitude, sigma for Gaussian kernels, Lagrange multiplier bounds, divergence, and combinations thereof
- Automatically selecting the best trained machine without user intervention
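As an illustration of one family of measures listed above, the sketch below computes positive, negative, and total leave-one-out error. It assumes, as one plausible reading of "positive/negative/total leave-one-out", that the error is reported separately for each class and overall. A 1-nearest-neighbor classifier is used only because it is short to write; the function names and data are hypothetical.

```python
def nearest_label(train_x, train_y, query):
    """Label of the training sample closest to the query (1-nearest-neighbor)."""
    def dist2(row):
        return sum((a - b) ** 2 for a, b in zip(row, query))
    closest = min(range(len(train_x)), key=lambda i: dist2(train_x[i]))
    return train_y[closest]

def loo_errors(samples, labels, positive=1):
    """Leave-one-out error rate for positives, negatives, and all samples."""
    errors = {"positive": 0, "negative": 0}
    counts = {"positive": 0, "negative": 0}
    for i, (x, y) in enumerate(zip(samples, labels)):
        # Hold out sample i and classify it with the remaining samples.
        rest_x = samples[:i] + samples[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        group = "positive" if y == positive else "negative"
        counts[group] += 1
        errors[group] += nearest_label(rest_x, rest_y, x) != y
    return {
        "positive": errors["positive"] / counts["positive"],
        "negative": errors["negative"] / counts["negative"],
        "total": (errors["positive"] + errors["negative"]) / len(samples),
    }

# Toy one-dimensional data (hypothetical): one positive sits next to the negatives.
samples = [[0.0], [1.0], [2.0], [2.4], [4.0], [5.0], [6.0]]
labels = [-1, -1, -1, 1, 1, 1, 1]
result = loo_errors(samples, labels)
```

Here one of four positives and one of three negatives are misclassified when held out, so the three values differ. A selector can weight them separately when the cost of missing a positive is not the same as the cost of a false hit.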

Supplementing training data with randomly-generated negative examples assigned reduced weight

This inventive feature comprises:
- Providing training data samples of known classes
- Supplementing the training data with randomly-generated negative examples assigned reduced weight
- Training two or more learning machines with the supplemented data
- Selecting the trained learning machine optimizing a performance function based on variables between classes
- The negative examples may be generated by estimating density functions, importance-sampling Monte Carlo, or random selection from existing data
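A minimal sketch of the supplementation step follows. It assumes one simple reading of "random selection from existing data": each synthetic negative is built by drawing every feature value at random from the values already observed for that feature. The reduced weight (`REDUCED_WEIGHT = 0.1`), the toy data, and the weighted nearest-centroid classifier are all illustrative choices, not taken from the patent.

```python
import random

REDUCED_WEIGHT = 0.1  # hypothetical weight for synthetic negatives

def random_negatives(samples, count, rng):
    """Build synthetic negatives by drawing each feature value at random
    from the values already observed for that feature."""
    n_features = len(samples[0])
    columns = [[s[f] for s in samples] for f in range(n_features)]
    return [[rng.choice(col) for col in columns] for _ in range(count)]

def weighted_centroids(samples, labels, weights):
    """Per-class centroid in which each sample counts according to its weight."""
    centroids = {}
    for cls in sorted(set(labels)):
        rows = [(s, w) for s, y, w in zip(samples, labels, weights) if y == cls]
        total = sum(w for _, w in rows)
        centroids[cls] = [sum(s[f] * w for s, w in rows) / total
                          for f in range(len(samples[0]))]
    return centroids

def classify(centroids, sample):
    """Assign the class of the nearest weighted centroid."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(sample, c))
    return min(centroids, key=lambda cls: dist2(centroids[cls]))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
positives = [[5.0, 5.2], [4.8, 5.0], [5.2, 4.9]]
negatives = [[0.0, 0.2], [0.3, 0.0], [0.1, 0.1]]
X = positives + negatives
y = [1] * 3 + [-1] * 3
w = [1.0] * 6  # known samples keep full weight

# Supplement the training data with low-weighted random negatives.
extra = random_negatives(X, 20, rng)
X_aug = X + extra
y_aug = y + [-1] * len(extra)
w_aug = w + [REDUCED_WEIGHT] * len(extra)

centroids = weighted_centroids(X_aug, y_aug, w_aug)
```

Some synthetic negatives may land close to real positives. Because each carries only a tenth of the weight of a known sample, they widen the negative class without overriding the labeled data, which is the motivation for assigning them reduced weight.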

Machine learning selection and optimization by training sets differing in features, samples, and weights

The inventive feature includes:
- Generating different training sets for each learning machine by inclusion or exclusion of features or samples, or by assigning different weights to training samples
- Training each learning machine on its respective dataset
- Comparing performance and selecting the best trained machine using an automated or learned selector
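One way to enumerate such training-set variants is sketched below. The three strategies (every non-empty feature subset, leaving out one sample, down-weighting one sample) and the names `make_variants` and `materialize` are illustrative assumptions; the patent does not prescribe a particular enumeration.

```python
from itertools import combinations

def make_variants(n_samples, n_features, factor=0.5):
    """Describe training-set variants by row indices, feature indices, and weights."""
    all_rows = tuple(range(n_samples))
    all_feats = tuple(range(n_features))
    variants = []
    # Variants that include or exclude features: every non-empty subset.
    for k in range(1, n_features + 1):
        for feats in combinations(all_feats, k):
            variants.append({"rows": all_rows, "features": feats,
                             "weights": (1.0,) * n_samples})
    # Variants that exclude one sample each.
    for i in all_rows:
        rows = all_rows[:i] + all_rows[i + 1:]
        variants.append({"rows": rows, "features": all_feats,
                         "weights": (1.0,) * len(rows)})
    # Variants that down-weight one sample each.
    for i in all_rows:
        weights = tuple(factor if r == i else 1.0 for r in all_rows)
        variants.append({"rows": all_rows, "features": all_feats,
                         "weights": weights})
    return variants

def materialize(variant, samples, labels):
    """Build the concrete (X, y, w) training set that a variant describes."""
    X = [[samples[r][f] for f in variant["features"]] for r in variant["rows"]]
    y = [labels[r] for r in variant["rows"]]
    return X, y, list(variant["weights"])

# Toy data (hypothetical): 4 samples, 3 features.
samples = [[1.0, 10.0, 0.1], [2.0, 20.0, 0.2],
           [3.0, 30.0, 0.3], [4.0, 40.0, 0.4]]
labels = [0, 0, 1, 1]
variants = make_variants(len(samples), len(samples[0]))
```

Each variant would be passed through `materialize` and handed to its own learning machine. The number of feature subsets grows exponentially with the feature count, so a real system would need to sample or prune this space rather than enumerate it.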

Configuring selected trained machine for search queries or classification

This feature covers:
- Outputting the selected trained learning machine to a device or loading it into computer memory
- Using the selected trained machine for classifying or querying databases (including genetic sequences and microarray expression data)
- Providing either binary or ranked outputs for classification or search results
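The difference between binary and ranked outputs can be shown with a short sketch. It assumes the selected machine exposes a real-valued decision score per record; a linear decision function with made-up weights and record names stands in for it here.

```python
def decision_score(weights, bias, record):
    """Linear decision function: positive scores indicate a match."""
    return sum(w * x for w, x in zip(weights, record)) + bias

def binary_output(scores, threshold=0.0):
    """Binary result: does each record match the query or not?"""
    return {name: score > threshold for name, score in scores.items()}

def ranked_output(scores):
    """Ranked result: record names ordered from best to worst match."""
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical selected machine and database records.
weights, bias = [1.0, -1.0], 0.0
database = {"a": [3.0, 1.0], "b": [1.0, 2.0], "c": [2.0, 1.5]}

scores = {name: decision_score(weights, bias, rec)
          for name, rec in database.items()}
matches = binary_output(scores)
ranking = ranked_output(scores)
```

With these values records "a" and "c" are reported as matches and the ranking is "a", "c", "b". A binary output suits classification, while a ranked output suits search, where the user reviews the closest records first.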

In summary, the claims cover methods for automated training, evaluation, and selection of learning machines using multiple performance criteria, supplementation of training sets with special negative samples, and deployment of the selected machine for efficient data querying and classification.

Stated Advantages

Enables accurate, efficient, and automated selection of optimal trained learning machines without requiring user expertise in mathematics.

Improves generalization and performance for classification and querying, especially in high-dimensional and noisy data sets.

Provides a querying system that can search large, complex databases for similar or related patterns directly in the data, not just by annotation.

Enables supplementation of training data with low-weighted negative examples for better balanced and more effective machine learning.

Facilitates broad application across various data domains including biological, chemical, financial, climate, image, and more.

Provides faster machine learning training and feature reduction compared to existing methods.

Allows ranking and selection of important features or genes relevant to the classification or query task.

Enables query by hypothetical or user-defined patterns, allowing exploration of potential relationships in data.

Documented Applications

Diagnosis and prognosis of diseases and changes in biological systems using gene expression data and trained learning machines.

Searching and querying large biological data repositories, such as microarray or gene expression databases, for similar or related patterns.

Identification of therapeutic compounds by querying databases for chemical compounds that modulate gene expression associated with specific diseases.

Analysis of climate data, document classification and similarity, financial data mining, geospatial data analysis, handwriting and character recognition, speech recognition, strategy-based tasks (business, military, games), and vision recognition.

Testing and treating individuals who exhibit changes identified by the trained machine, including preparation of diagnostic test kits or targeted therapies.

Pattern-based searching of unknown or hypothetical patterns to determine if they exist in actual data.
