Machine learning methods and systems for identifying patterns in data using a plurality of learning machines wherein the learning machine that optimizes a performance function is selected

Inventors

VIRKAR, Hemant; Stark, Karen; Borgman, Jacob

Assignees

DIGITAL INFUZION Inc

Publication Number

US-8386401-B2

Publication Date

2013-02-26

Expiration Date

2029-09-10

Abstract

Methods for training machines to categorize data, and/or recognize patterns in data, and machines and systems so trained. More specifically, variations of the invention relate to methods for training machines that include providing one or more training data samples encompassing one or more data classes, identifying patterns in the one or more training data samples, providing one or more data samples representing one or more unknown classes of data, identifying patterns in the one or more data samples of unknown class(es), and predicting one or more classes to which the data samples of unknown class(es) belong by comparing patterns identified in said one or more data samples of unknown class with patterns identified in said one or more training data samples. Also provided are tools, systems, and devices, such as support vector machines (SVMs) and other methods and features, software implementing the methods and features, and computers or other processing devices incorporating and/or running the software, where the methods and features, software, and processors utilize specialized methods to analyze data.

Core Innovation

The invention provides methods and systems for training machines to categorize data and recognize patterns in data by supplying training data samples with known classes, identifying patterns within these samples, and then using two or more learning machines to independently learn the classification task. A key aspect is the selection of the optimally trained machine based on maximizing specific performance functions—such as divergence between classes, cross-validation, support vector parameters, and other mathematical criteria—without the need for a separate test data set.

This approach enables both supervised and unsupervised machine learning, pattern-based classification, and querying, relying on automated or user-driven feature reduction and optimization. The machine that best generalizes, according to internal criteria like divergence or support vector metrics, is selected and used for further query or classification tasks. This selection can incorporate adjustments for feature weighting, quality measures within data samples, and optimization of generalization performance, all conducted without requiring the end user to possess mathematical expertise.
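As a rough illustration of selecting among trained machines by an internal support vector metric, the sketch below trains several machines that share a Gaussian kernel but differ in sigma, then keeps the one with the lowest ratio of support vectors to training samples. Several points are assumptions rather than details from the patent: a kernel perceptron stands in for an SVM trainer, "lower ratio is better" is a common heuristic whose direction the summary does not state, and all function names and the toy data are hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Gaussian (RBF) kernel value for two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def train_kernel_perceptron(samples, labels, sigma, epochs=20):
    """Stand-in learner (an assumption, not the patent's SVM trainer).

    Returns one coefficient per training sample; samples with a non-zero
    coefficient play the role of support vectors.
    """
    alphas = [0] * len(samples)
    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(samples, labels)):
            score = sum(
                a * yj * gaussian_kernel(xj, x, sigma)
                for a, xj, yj in zip(alphas, samples, labels)
                if a
            )
            if y * score <= 0:  # misclassified or undecided: add as support vector
                alphas[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alphas


def support_vector_ratio(alphas):
    """Fraction of training samples that ended up as support vectors."""
    return sum(1 for a in alphas if a) / len(alphas)


def select_machine(samples, labels, sigmas):
    """Train one machine per sigma; keep the one with the lowest ratio.

    Only training data is consulted, so no separate test set is needed.
    Ties go to the sigma listed first.
    """
    best = None
    for sigma in sigmas:
        alphas = train_kernel_perceptron(samples, labels, sigma)
        ratio = support_vector_ratio(alphas)
        if best is None or ratio < best[0]:
            best = (ratio, sigma, alphas)
    return best


# Hypothetical toy data: two well-separated clusters with known classes.
SAMPLES = [(0, 0), (0, 1), (1, 0), (1, 1), (4, 4), (4, 5), (5, 4), (5, 5)]
LABELS = [1, 1, 1, 1, -1, -1, -1, -1]

best_ratio, best_sigma, best_alphas = select_machine(SAMPLES, LABELS, [0.01, 1.0, 10.0])
print("selected sigma:", best_sigma, "support vector ratio:", best_ratio)
```

A very narrow kernel memorizes every training point, so every sample becomes a support vector and that machine is rejected. A production system would replace the perceptron with a real SVM solver and might combine this metric with the other criteria listed above.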

The problem addressed is that existing machine learning techniques are inefficient and complex: they require mathematical expertise for optimal use, generalize poorly, and recognize patterns suboptimally in large and varied data sets, for example in bioinformatics and gene expression analysis. Existing systems offer limited support for automated, accurate selection of the best-trained machine and lack flexible, automatable methods for achieving optimal queries and classification directly from training data, especially in complex domains such as biological or medical data.

Claims Coverage

There are three independent claims, each introducing a principal inventive feature relating to machine learning systems, selection of optimal trained machines, and implementation in a computer program product.

Selection of the optimal trained learning machine without use of test data

A machine learning method wherein:
- One or more training data samples with known classes are provided.
- Two or more learning machines (of the same kernel type) are trained using these data.
- The trained learning machine that optimizes a performance function is selected for use. The performance function depends on variables such as maximizing divergence between classes, n-fold cross validation, number of support vectors, VC dimension, support vector ratios, relative magnitude of parameters, sigma for a Gaussian kernel, or upper bounds for Lagrange multipliers.
- No test data set is used for selection, distinguishing the process from conventional approaches reliant on held-out data.
- The selected trained machine is output to computer memory.
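The steps of this claim can be sketched with n-fold cross validation as the performance function. The folds are carved out of the training samples themselves, so no separate test data set is involved. This is a minimal sketch under stated assumptions: a kernel perceptron stands in for the patent's learning machines, the candidate machines differ only in the Gaussian kernel's sigma, and every function name and the toy data are hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Gaussian (RBF) kernel value for two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def train(samples, labels, sigma, epochs=20):
    """Kernel perceptron used as a stand-in learner (an assumption)."""
    alphas = [0] * len(samples)
    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(samples, labels)):
            if y * decision(samples, labels, alphas, sigma, x) <= 0:
                alphas[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return (samples, labels, alphas, sigma)


def decision(samples, labels, alphas, sigma, x):
    """Signed score of x under the current coefficients."""
    return sum(
        a * yj * gaussian_kernel(xj, x, sigma)
        for a, xj, yj in zip(alphas, samples, labels)
        if a
    )


def predict(model, x):
    """Class label (+1 or -1) predicted by a trained machine."""
    return 1 if decision(*model, x) > 0 else -1


def cross_validation_accuracy(samples, labels, sigma, n_folds):
    """n-fold cross validation computed on the training samples only."""
    correct = 0
    for fold in range(n_folds):
        kept = [i for i in range(len(samples)) if i % n_folds != fold]
        held = [i for i in range(len(samples)) if i % n_folds == fold]
        model = train([samples[i] for i in kept], [labels[i] for i in kept], sigma)
        correct += sum(predict(model, samples[i]) == labels[i] for i in held)
    return correct / len(samples)


def select_by_cross_validation(samples, labels, sigmas, n_folds=4):
    """Score each candidate, then retrain the winner on all training data."""
    scored = [(cross_validation_accuracy(samples, labels, s, n_folds), s) for s in sigmas]
    best_accuracy, best_sigma = max(scored, key=lambda pair: pair[0])
    return best_sigma, best_accuracy, train(samples, labels, best_sigma)


# Hypothetical toy data: two well-separated clusters with known classes.
SAMPLES = [(0, 0), (0, 1), (1, 0), (1, 1), (4, 4), (4, 5), (5, 4), (5, 5)]
LABELS = [1, 1, 1, 1, -1, -1, -1, -1]

best_sigma, best_accuracy, selected_machine = select_by_cross_validation(
    SAMPLES, LABELS, sigmas=[0.01, 1.0, 10.0]
)
print("selected sigma:", best_sigma, "cross-validation accuracy:", best_accuracy)
```

The selected machine is simply held in memory here as a tuple; persisting it to storage or an output device would be a separate step.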

Identification of data pattern correspondence using trained machines selected by internal performance functions

A machine learning method comprising:
- Providing one or more data patterns and one or more data samples.
- Training two or more learning machines (sharing the same kernel type) to identify which data samples correspond to the data patterns.
- Selecting the trained learning machine by optimizing the same performance functions as above (e.g., divergence, cross validation, support vector characteristics), without using a test data set for selection.
- Storing the selected trained machine in computer memory for subsequent use.

Computer program product for automated selection and output of optimal trained learning machine

A computer program product incorporating control logic that:
- Provides one or more training data samples with known classes.
- Trains two or more learning machines (of the same kernel type) using these samples.
- Selects the trained machine that optimizes a performance function dependent on variables such as divergence, cross validation, support vector metrics, VC dimension, or kernel or Lagrange multiplier parameters, without reference to test data sets.
- Outputs the selected trained machine on an output device.

The product can be implemented in a computer system with associated computer-usable media.

The inventive features focus on automating the selection of the optimally trained learning machine using only internal performance functions derived from the training data, which eliminates the need for test data in the selection process. The approach applies to pattern classification, data querying, and modular software or computer systems.

Stated Advantages

Provides improved generalization by selecting the optimal trained machine using internal criteria without needing test data.

Enables automation of machine learning optimization, allowing use by individuals without specialized mathematical knowledge.

Increases speed and performance of machine learning processes, particularly during feature reduction, training, and classification.

Enhances the ability to query large, complex, and diverse databases for both exact and similar pattern matches, including in biological and medical data.

Permits effective handling of noisy or limited data sets through feature weighting and supplemental negative sample generation, improving classifier robustness.

Facilitates discovery of important features (variables) within data sets, supporting knowledge discovery and research insights beyond simple classification.

Documented Applications

Diagnosis and prognosis of changes in biological systems, such as diseases, using machine learning on gene expression and other biomedical data.

Development of diagnostic tests and treatments based on machine-learned relationships in biological and medical information.

Drug discovery by querying and analyzing gene expression databases for relationships linked to disease states.

Pattern recognition and data mining for areas including climate data, document classification, financial data mining, geospatial data, handwriting and character recognition, information retrieval, population data, search engines, speech recognition, business, military, games, and vision recognition.
