User targeted content generation using multimodal embeddings
Inventors
Divakaran, Ajay • Sikka, Karan • Ray, Arijit • Lin, Xiao • Yao, Yi
Assignees
Publication Number
US-12367420-B2
Publication Date
2025-07-22
Expiration Date
2041-03-04
Interested in licensing this patent?
MTEC can help explore whether this patent might be available for licensing for your application.
Abstract
A method, apparatus, and system for determining user-content associations and providing user-preferred content using multimodal embeddings include creating an embedding space for multimodal content by: creating a first modality vector representation of the multimodal content having a first modality; creating a second modality vector representation of the multimodal content having a second modality; creating a user vector representation, as a third modality, for each user associated with at least a portion of the multimodal content; and embedding the first and second modality vector representations and the user vector representations in the common embedding space using at least a mixture of loss functions, for each modality pair of the first, second, and third modalities, that pushes co-occurring pairs of multimodal content closer together. Embodiments can further include generating content using determined attributes of a message to be conveyed and features of the user-preferred content.
Core Innovation
Embodiments of the present principles disclose methods, apparatuses, and systems for determining and providing user-preferred content using multimodal embeddings. The invention creates a common embedding space for multimodal content by generating vector representations for different content modalities, such as text, images, and audio, and associating these with user vector representations, which are embedded into the same common space. The embeddings are trained using mixtures of loss functions that bring co-occurring pairs of multimodal content closer across different modalities, including user-content pairs.
The problem addressed arises from current approaches that rely on coarse polling and intuition-based methods lacking a precise mathematical model linking content and users. Such approaches do not enable fine-grained associations between users and content across multiple content modalities. The invention addresses the need for a precise mathematical embedding framework that associates users with multimodal content, supporting improved content-preference determination and content generation tailored to user interests.
Claims Coverage
The patent claims include multiple independent claims covering methods and apparatuses for training neural networks, methods for predicting user content preferences, and methods for generating content using user-preferred content. The main inventive features pertain to creating a common embedding space with multimodal user-content associations, predicting user preferences using that embedding space, and generating content that conveys message intents using determined user-preferred content.
Training a neural network to create a common embedding space with multimodal user-content associations
Creating respective modality vector representations for multiple modalities of multimodal content using machine learning models; creating user vector representations as a third modality; training the neural network in stages for each modality embedding; embedding the first, second, and user vector representations into a common embedding space using at least a combined loss function comprising ranking loss for modality pairs that pushes co-occurring pairs closer.
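The "ranking loss for modality pairs that pushes co-occurring pairs closer" described above can be illustrated with a standard margin-based ranking (triplet-style) loss over cosine similarity. This is a minimal sketch of that general technique, not the patent's actual implementation; the function name and margin value are illustrative assumptions.

```python
import numpy as np

def margin_ranking_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style ranking loss: require the co-occurring (anchor, positive)
    pair to be more similar than the non-co-occurring (anchor, negative)
    pair by at least `margin`, measured with cosine similarity.
    Margin of 0.2 is an illustrative choice, not from the patent."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(0.0, margin - cos(anchor, positive) + cos(anchor, negative))
```

When the co-occurring pair is already well separated from the negative, the loss is zero; when the negative is closer than the positive, the loss grows, so gradient descent on it pushes co-occurring pairs together in the embedding space.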
Using at least three mixture loss functions to embed multiple modalities
Applying at least three mixture loss functions to embed the first modality, second modality, and user vector representations into the common embedding space to improve embedding quality and user-content associations.
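With three modalities (text, image, user) there are three modality pairs, so a mixture of at least three loss terms can be formed by summing one ranking term per pair. The sketch below assumes a cosine-based hinge term per pair; the exact pairing, weighting, and negatives used by the patent may differ.

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pair_loss(x, y_pos, y_neg, margin=0.2):
    """Ranking term for one modality pair (illustrative margin)."""
    return max(0.0, margin - cos(x, y_pos) + cos(x, y_neg))

def combined_loss(text, image, user, neg_image, neg_user, margin=0.2):
    """Mixture of three pairwise ranking losses, one per modality pair:
    text-image, text-user, and image-user. `neg_image` and `neg_user`
    are non-co-occurring (negative) embeddings."""
    return (pair_loss(text, image, neg_image, margin)   # text-image pair
            + pair_loss(text, user, neg_user, margin)   # text-user pair
            + pair_loss(image, user, neg_user, margin)) # image-user pair
```

Minimizing the sum jointly aligns all three modalities in the common space, rather than optimizing any single cross-modal pairing in isolation.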
Creating modality vector representations with specific machine learning models
Using a text-based convolutional neural network (CNN) with fully connected layers for text modality vector representations and a deep image encoder implementing a convolutional neural network with fully connected layers for image modality vector representations.
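The shape of these encoders can be sketched as follows: a text CNN that convolves over token embeddings, pools, and projects through a fully connected layer, and an image encoder that ends in a fully connected projection into the joint space. This is a toy stand-in for illustration only; real deep encoders would have many more layers and learned weights.

```python
import numpy as np

def text_encoder(token_embs, conv_w, fc_w):
    """Toy text-CNN: 1-D convolution over token embeddings, ReLU,
    max-pool over time, then a fully connected projection.
    token_embs: (seq_len, emb_dim); conv_w: (k, emb_dim, channels);
    fc_w: (joint_dim, channels)."""
    k = conv_w.shape[0]
    feats = np.stack([
        np.einsum('ke,kec->c', token_embs[i:i + k], conv_w)
        for i in range(len(token_embs) - k + 1)])
    pooled = np.maximum(feats, 0).max(axis=0)  # ReLU + max-pool over time
    return fc_w @ pooled                       # project into the joint space

def image_encoder(pixels, fc_w):
    """Stand-in for a deep image encoder: flatten the (convolutional)
    features, then a fully connected projection into the joint space."""
    return fc_w @ pixels.ravel()
```

Both encoders emit vectors of the same dimensionality, which is what makes embedding them (and the user vectors) into one common space possible.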
Generating additional modality vector representations and embedding with multiple loss functions
Creating at least a fourth modality vector representation such as audio using machine learning; embedding all modality vectors (first, second, user, and fourth) into the common embedding space with a mixture of loss functions for each modality pair that brings co-occurring pairs closer.
Generating user vector representations as arithmetic averages
Creating user vector representations by computing the arithmetic average of respective modality vector representations (e.g., text and image) of multimodal content associated with the user.
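The arithmetic-average construction is straightforward: the user vector is the mean of the embedded modality vectors of content the user engaged with. A minimal sketch:

```python
import numpy as np

def user_vector(modality_vectors):
    """User representation as the arithmetic mean of the embedded modality
    vectors (e.g., text and image) of content associated with the user."""
    return np.mean(np.stack(modality_vectors), axis=0)
```

Because the average lives in the same space as the content embeddings, the user vector can be compared directly against any content vector.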
Determining similarity for user preference identification
Determining similarity between embedded user vector representations and embedded modality vector representations using distance functions to identify user multimodal content preferences from the common embedding space.
Predicting user content preferences using the trained embedding space
Identifying a user and locating the embedded user vector representation in the trained embedding space; determining similarity between the user vector and multimodal content embeddings to identify preferences; comparing the user preferences with multimodal content to predict user content preferences.
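The prediction step reduces to nearest-neighbor retrieval in the trained space: score every content embedding against the user's embedding with a similarity (or distance) function and return the top matches. This sketch uses cosine similarity as the distance function; the patent's choice of distance function is not specified here.

```python
import numpy as np

def predict_preferences(user_vec, content_vecs, k=3):
    """Rank embedded multimodal content by cosine similarity to the
    embedded user vector and return the indices of the top-k items,
    i.e., the predicted user content preferences."""
    c = np.stack(content_vecs)
    sims = (c @ user_vec) / (np.linalg.norm(c, axis=1)
                             * np.linalg.norm(user_vec))
    return list(np.argsort(-sims)[:k])
```

Because users and content share one embedding space, the same similarity call also supports the reverse query: given a piece of content, find the users most likely to prefer it.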
Generating content to convey message intents using user-preferred content
Determining attributes of a message to be conveyed and features of the user-preferred content using respective pre-trained embedding spaces; generating content by combining these determined attributes and features, optionally by modifying existing content or generating new content, with generation performed using an adversarial relationship between a discriminator and a generator.
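The adversarial relationship between generator and discriminator can be illustrated with the standard GAN objectives: the discriminator is trained to score real user-preferred content high and generated content low, while the generator is trained to fool it. The linear generator, sigmoid discriminator, and weights below are toy stand-ins, not the patent's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, w):
    """Toy linear generator: maps a noise/conditioning vector z to a
    candidate content vector."""
    return np.tanh(w @ z)

def discriminator(x, v):
    """Toy discriminator: sigmoid score that x is real preferred content."""
    return 1.0 / (1.0 + np.exp(-(v @ x)))

# One adversarial round under the usual GAN objectives:
# discriminator maximizes  log D(real) + log(1 - D(fake)),
# generator     minimizes  log(1 - D(fake)).
w = rng.normal(size=(4, 4))
v = rng.normal(size=4)
real = np.array([0.5, -0.2, 0.1, 0.7])   # stand-in for preferred content
fake = generator(rng.normal(size=4), w)
d_loss = -np.log(discriminator(real, v)) - np.log(1 - discriminator(fake, v))
g_loss = np.log(1 - discriminator(fake, v))
```

Alternating gradient steps on `d_loss` and `g_loss` drive the generator toward content the discriminator cannot distinguish from the user-preferred examples.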
The claims cover a comprehensive system for representing multimodal content and users in a common embedding space using combined loss functions; predicting user content preferences from this space; and generating content that conveys message intents by leveraging user-preferred content features. The claims also recite apparatus elements configured to perform these methods.
Stated Advantages
The method enables precise mathematical modeling of user-content associations across multiple content modalities, improving fine-grained user preference determination.
Embedding users along with content modalities in a common embedding space provides anchor points and regularizes learning, thereby enhancing cross-modal retrieval performance.
Combining multiple modality pairs in the embedding training improves prediction accuracy of user interests compared to single-modality embeddings.
The approach enables generation of content customized to user preferences that accurately conveys desired message intents, leveraging adversarial generation techniques.
Documented Applications
Identifying purveyors of anti-U.S.-Government or terrorist-recruitment content and their target audiences for Law Enforcement and Defense.
Targeted advertising and identification of vendors and desired products for Marketing.
Matching content providers with interested consumers in Entertainment.
Identifying content influencers in Social Media.