Rapid adaptation to contemporary text datasets

Inventors

Cadigan, John; Graciarena, Martin

Assignees

SRI International Inc

Publication Number

US-12430377-B2

Patent

Publication Date

2025-09-30

Expiration Date


Abstract

In an example, a method for adapting a machine learning model includes receiving first input data; choosing a first set of unlabeled textual spans in the first input data, wherein the chosen first set of unlabeled textual spans is associated with a first domain; labeling the chosen first set of unlabeled textual spans to generate a labeled first set of textual spans; categorizing the labeled first set of textual spans to generate a categorized labeled first set of textual spans; receiving second input data; choosing a second set of unlabeled textual spans, wherein the second set of unlabeled textual spans is associated with a second domain; and adapting the machine learning model to the second domain based on the categorized second set of unlabeled textual spans that is generated based on the categorized labeled first set of textual spans.

Core Innovation

A method adapts a machine learning model to a second domain by choosing unlabeled textual spans associated with a first domain, labeling the chosen first set of unlabeled textual spans to generate a labeled first set of textual spans, and categorizing the labeled first set of textual spans to generate a categorized labeled first set of textual spans for the machine learning model. The method then receives second input data and chooses a second set of unlabeled textual spans in the second input data associated with a second domain.

The method categorizes the second set of unlabeled textual spans based on the categorized labeled first set of textual spans, using a relationship between the first domain, the text of the labeled first set of textual spans, and a corresponding category. The result is a categorized second set of unlabeled textual spans. The method then adapts the machine learning model to the second domain using that categorized second set, with the adaptation based at least in part on a performed consistency loss analysis.
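The relationship-based categorization step can be illustrated with a minimal sketch. The summary does not say how the relationship between first-domain text and categories is computed, so the token-overlap (Jaccard) nearest-neighbor rule below is purely an assumption chosen for illustration, and the names `tokens`, `jaccard`, and `categorize_target_spans` as well as the example spans and categories are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def tokens(span: str) -> Set[str]:
    """Lowercased whitespace tokens; a stand-in for a real tokenizer."""
    return set(span.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-set overlap in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def categorize_target_spans(
    labeled_source: List[Tuple[str, str]],
    target_spans: List[str],
) -> Dict[str, str]:
    """Assign each target span the category of its most similar source span.

    Assumption: similarity is plain token overlap. The patent summary does
    not specify the relationship, so this is only one possible reading.
    """
    if not labeled_source:
        raise ValueError("need at least one labeled source span")
    result: Dict[str, str] = {}
    for span in target_spans:
        span_tokens = tokens(span)
        _, best_category = max(
            labeled_source,
            key=lambda pair: jaccard(span_tokens, tokens(pair[0])),
        )
        result[span] = best_category
    return result


# Hypothetical first-domain (labeled) and second-domain (unlabeled) spans.
source = [
    ("the battery drains too fast", "negative"),
    ("great screen and fast shipping", "positive"),
]
target = ["battery drains fast on this phone", "great fast shipping"]
categorized = categorize_target_spans(source, target)
```

In practice a learned text encoder would likely replace the token-overlap measure; the sketch only shows how labeled first-domain spans can drive categories for unlabeled second-domain spans.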

The described machine learning system performs consistency loss analysis in an unsupervised domain adaptation (UDA) setting by combining supervised cross-entropy on labeled source data with an unsupervised consistency loss on augmented target-domain data. The approach is framed as rapid contemporary-domain adaptation in the presence of dataset drift and concept drift, and includes system components such as a training component, a model adaptation system, and a consistency-loss analyzer.
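The combination of supervised cross-entropy and unsupervised consistency loss described above can be sketched as a single objective. This is a minimal sketch under stated assumptions, not the patented implementation: the consistency term is taken to be the KL divergence between predictions on a target span and on its augmented version, and the function names and the `weight` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Supervised loss for one labeled source example."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q); eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def uda_loss(
    source_logits: Sequence[Sequence[float]],
    source_labels: Sequence[int],
    target_logits: Sequence[Sequence[float]],
    augmented_logits: Sequence[Sequence[float]],
    weight: float = 1.0,
) -> float:
    """Mean source cross-entropy plus weighted mean target consistency loss.

    Assumption: consistency is KL(original prediction || augmented prediction).
    """
    supervised = sum(
        cross_entropy(logits, y) for logits, y in zip(source_logits, source_labels)
    ) / len(source_labels)
    consistency = sum(
        kl_divergence(softmax(t), softmax(a))
        for t, a in zip(target_logits, augmented_logits)
    ) / len(target_logits)
    return supervised + weight * consistency


# Hypothetical logits: one labeled source example, one target example
# scored before and after augmentation.
loss = uda_loss([[2.0, 0.5]], [0], [[1.0, 2.0]], [[1.5, 1.8]])
```

When the model predicts the same distribution for a target span and its augmentation, the consistency term vanishes and only the supervised term remains, so the unlabeled target data influences training only where predictions are unstable.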

Claims Coverage

The available claim text provides three independent claim types (a method, a computing system, and non-transitory computer-readable media) that share the same core inventive sequence. Together, the independent claims cover four main inventive features: domain-associated unlabeled span selection, labeling and categorizing for the first domain, relationship-based categorization of second-domain unlabeled spans, and model adaptation using a performed consistency loss analysis.

Domain-associated unlabeled textual span selection and labeling

choose a first set of unlabeled textual spans in the first input data, wherein the chosen first set of unlabeled textual spans is associated with a first domain; label the chosen first set of unlabeled textual spans to generate a labeled first set of textual spans

Categorization of labeled spans and relationship-based categorization of second-domain unlabeled spans

categorize the labeled first set of textual spans to generate a categorized labeled first set of textual spans for the machine learning model; categorize the second set of unlabeled textual spans, based on the categorized labeled first set of textual spans, to generate a categorized second set of unlabeled textual spans using a relationship between the first domain, text of the labeled first set of textual spans, and a corresponding category

Model adaptation using performed consistency loss analysis

adapt the machine learning model to the second domain based on the categorized second set of unlabeled textual spans using model adaptation, at least in part, based on a performed consistency loss analysis

Across the independent claims, the inventive coverage is directed to adapting a machine learning model across domains by selecting unlabeled textual spans per domain, labeling and categorizing the first-domain spans, generating categorized second-domain unlabeled spans via a relationship between first-domain text and categories, and adapting the model to the second domain using model adaptation based at least in part on a performed consistency loss analysis.

Stated Advantages

Rapid contemporary-domain adaptation under dataset drift and concept drift.

Documented Applications

Adapting a machine learning model to contemporary data relative to old data via unsupervised domain adaptation (UDA) framed around dataset drift and concept drift.

Active-learning context involving uncertainty sampling, query-by-committee, and weak labeling (as part of the described approach).

Mention of SemEval dataset.
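The uncertainty sampling mentioned in the active-learning context above can be sketched as follows. The summary does not state which uncertainty measure is used, so ranking spans by the entropy of the model's predicted class distribution is an assumption, and `entropy`, `select_most_uncertain`, and the example probabilities are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(
    predictions: Dict[str, Sequence[float]], k: int
) -> List[str]:
    """Return the k spans whose predictions have the highest entropy."""
    ranked = sorted(
        predictions,
        key=lambda span: entropy(predictions[span]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical model outputs for three unlabeled spans.
predictions = {
    "span a": [0.98, 0.01, 0.01],
    "span b": [0.40, 0.35, 0.25],
    "span c": [0.70, 0.20, 0.10],
}
to_label = select_most_uncertain(predictions, k=2)
```

Spans selected this way would be the candidates sent for labeling, since they are the ones the current model is least sure about.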
