Bagged filtering method for selection and deselection of features for classification
Inventors
Röder, Heinrich • Röder, Joanna • Steingrimsson, Arni • Oliveira, Carlos
Assignees
Abstract
Classifier generation methods are described in which features used in classification (e.g., mass spectral peaks) are selected or deselected using bagged filtering. A development sample set is split into two subsets, one of which is used as a training set and the other of which is set aside. We define a classifier (e.g., K-nearest neighbor, decision tree, margin-based classifier or other) using the training subset and at least one of the features (or subsets of two or more features in combination). We apply the classifier to a subset of samples. A filter is applied to the performance of the classifier on the sample subset, and the at least one feature is added to a “filtered feature list” if the classifier performance passes the filter. We do this for many different realizations of the separation of the development sample set into two subsets and, for each realization, for different features or sets of features in combination. After all the iterations are performed, the filtered feature list is used to either select features or deselect features for a final classifier.
Core Innovation
The invention relates to a method of improving the functioning of a computer as a classifier by selecting or deselecting one or more features in a data set used to generate the classifier. Physical measurement data and a class label are obtained for each sample in a development set of samples, where the physical measurement data comprises a feature value for each of a multitude of individual features. The method uses a programmed computer to repeatedly separate the development set into a training subset and a held-aside remainder subset, with each separation constituting one realization.
For each realization, a classifier is defined using the training subset and at least one of the features, and the classifier is applied to the training subset. A filter is then applied to the performance of the classifier, and the at least one feature is added to a filtered feature list if the classifier performance passes the filter. The process is repeated for different realizations of the separation and for different choices of the one or more features.
After repeating the steps, the filtered feature list is used to either select features or deselect features from the multitude of individual features for use in a final classifier generated from the development set of samples. The description emphasizes an ensemble, “bagged filtering” approach that keeps or removes features based on whether performance satisfies the filter across realizations. The filter can be tailored: it may be simple or compound, with compound filters combining multiple criteria by logical AND, and it may use performance metrics such as accuracy and hazard ratio.
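The repeated split, classify, and filter loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the single-feature k-nearest-neighbor classifier, the accuracy-only filter, the threshold values, and the function names (`bagged_filter`, `select_features`, `deselect_features`) are all hypothetical choices made for the example. Counting passes per feature and thresholding on the pass fraction is one possible way to turn the filtered feature list into a selection or deselection decision.

```python
import numpy as np
from collections import Counter


def knn_predict(train_x, train_y, query_x, k=5):
    """Majority vote of the k nearest training values (one feature, labels 0/1)."""
    dists = np.abs(query_x[:, None] - train_x[None, :])
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = train_y[nearest]
    return (votes.mean(axis=1) > 0.5).astype(int)


def bagged_filter(X, y, n_realizations=200, train_fraction=0.5,
                  min_accuracy=0.7, k=5, seed=0):
    """Return a Counter: feature index -> number of realizations passing the filter."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    n_train = int(train_fraction * n_samples)
    passes = Counter()
    for _ in range(n_realizations):
        # One realization: a random training subset; the remainder is held aside.
        train_idx = rng.permutation(n_samples)[:n_train]
        y_train = y[train_idx]
        for j in range(n_features):
            x_train = X[train_idx, j]
            # Define the classifier on the training subset and apply it to that subset.
            predicted = knn_predict(x_train, y_train, x_train, k=k)
            accuracy = float(np.mean(predicted == y_train))
            # Assumed simple filter: accuracy must reach a minimum value.
            if accuracy >= min_accuracy:
                passes[j] += 1
    return passes


def select_features(passes, n_realizations, min_pass_fraction=0.8):
    """Keep features that passed the filter in enough realizations (assumed rule)."""
    return sorted(j for j, count in passes.items()
                  if count / n_realizations >= min_pass_fraction)


def deselect_features(passes, n_realizations, n_features, min_pass_fraction=0.8):
    """Drop features that passed the filter in enough realizations (assumed rule)."""
    flagged = set(select_features(passes, n_realizations, min_pass_fraction))
    return [j for j in range(n_features) if j not in flagged]
```

For selection, the filter would encode desirable performance, and the features that pass are kept. For deselection, the filter would encode undesirable behavior (for example, association with a confounder), and the features that pass are removed.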
Claims Coverage
The document contains one independent claim that defines the core bagged filtering method for selecting or deselecting features for improving a computer classifier. The dependent claims refine the approach with additional context for measurement and with filter-performance criteria and logical relationships, yielding multiple inventive features built on the independent claim steps.
Bagged filtering for filtered feature list formation
Separating the development set into two subsets for different realizations; defining a classifier on the training subset using at least one feature; applying the classifier to the training subset; applying a filter to the performance; adding the at least one feature to a filtered feature list if the classifier performance passes the filter; repeating for different realizations and different features; and using the filtered feature list to select features or deselect features for use in a final classifier.
Mass spectrometry feature definition with integrated intensity values across m/z ranges
Defining the at least one feature using integrated intensity values across one or more m/z ranges obtained from mass spectrometry measurements, and using these features in the method of generating the classifier with the filtered feature list.
Hazard-ratio threshold filter criterion between classification groups
Using a filter such that the performance meets a specified hazard-ratio threshold between two classification groups to determine whether features are added to the filtered feature list.
Compound filter logic using logical AND with classifier performance criteria
Using a filter that uses two classifier performance criteria combined with a logical AND operation, including one criterion based on classifier performance on a set of patient samples from patients without cancer, to decide whether features are added to the filtered feature list.
Overall, the claim set covers a repeated split-and-evaluate framework that constructs a filtered feature list via a performance filter, and then uses that filtered list to select or deselect features for a final classifier. Dependent inventive features explicitly narrow the measurement-derived feature definition and specify quantitative and compound filter-performance criteria, including hazard-ratio thresholds and logical AND combinations that include performance on patients without cancer.
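The simple and compound filters summarized above can be sketched as small composable predicates. This is an illustrative sketch only: the metric names (`hazard_ratio`, `accuracy_no_cancer`), the threshold values, the direction of each comparison, and the helper names `in_range` and `all_of` are assumptions made for the example and are not taken from the claims.

```python
import math
from typing import Callable, Dict

Metrics = Dict[str, float]
Criterion = Callable[[Metrics], bool]


def in_range(metric: str, low: float = -math.inf,
             high: float = math.inf) -> Criterion:
    """Simple filter: the named performance metric must lie in [low, high]."""
    def criterion(metrics: Metrics) -> bool:
        return low <= metrics[metric] <= high
    return criterion


def all_of(*criteria: Criterion) -> Criterion:
    """Compound filter: logical AND of several simple criteria."""
    def compound(metrics: Metrics) -> bool:
        return all(c(metrics) for c in criteria)
    return compound


# Hypothetical compound filter: a hazard-ratio threshold between the two
# classification groups AND a minimum accuracy on samples from patients
# without cancer. Both numbers are placeholders.
compound_filter = all_of(
    in_range("hazard_ratio", low=2.0),
    in_range("accuracy_no_cancer", low=0.9),
)
```

A feature would be added to the filtered feature list only when `compound_filter(metrics)` returns `True` for the classifier built on that feature in a given realization.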
Stated Advantages
Improves the functioning of a computer as a classifier by selecting or deselecting one or more features.
Avoids overfitting and provides robust classifier performance by applying the filter across many different separations (realizations) of the development set.
Enables clinical tailoring by using simple or compound filters, including logical AND based on classifier performance criteria and hazard ratio.
Allows the filter to be tuned toward prognostic or predictive test behavior when time-to-event data are available.
Documented Applications
Feature selection and deselection for generating a classifier using biomedical data, including mass spectrometry/MALDI-TOF integrated m/z intensity features.
Feature selection and deselection for classifiers using genomic/mRNA data and proteomic data, with evaluation including prognostic and predictive testing behavior using time-to-event data.
Use in contexts including hepatocellular carcinoma (HCC), including mention of AFP (alpha-fetoprotein) and a liver function confounder.
Use in a lung cancer genomics context, including reference to GEO dataset GSE14814 and its adjuvant chemotherapy (ACT) vs observation (OBS) arms.
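For the mass spectrometry application listed above, a feature defined as an integrated intensity over an m/z range could be computed roughly as follows. This is a hedged sketch: the trapezoidal integration, the function names, and the example m/z window are assumptions for illustration, and real spectra would normally be preprocessed (background subtraction, normalization, alignment) before integration.

```python
import numpy as np


def integrated_intensity(mz, intensity, mz_low, mz_high):
    """Trapezoidal integral of intensity over [mz_low, mz_high].

    Assumes mz is sorted in increasing order and aligned with intensity.
    """
    mask = (mz >= mz_low) & (mz <= mz_high)
    x, y = mz[mask], intensity[mask]
    if x.size < 2:
        return 0.0  # not enough points inside the window to integrate
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


def feature_values(mz, intensity, ranges):
    """One feature value per (mz_low, mz_high) range for a single spectrum."""
    return np.array([integrated_intensity(mz, intensity, low, high)
                     for low, high in ranges])
```

Applying `feature_values` to every spectrum in the development set would yield the sample-by-feature table on which the bagged filtering operates.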