System and method for prompt searching
Inventors
Willmott, Devin T. • Akinwande, Victor Abayomi • Jiang, Yiding • Sam, Dylan Jiang • Kolter, Jeremy
Interested in licensing this patent?
MTEC can help explore whether this patent might be available for licensing for your application.
Assignees
Carnegie Mellon University
Carnegie Mellon University is a global research institution based in Pittsburgh, Pennsylvania, recognized for interdisciplinary education, research, and innovation in science, engineering, arts, technology, and social sciences. The university leads advancements in artificial intelligence, robotics, digital health, and performing arts. Located in a technology-driven and culturally rich city, CMU powers real-world impact through research centers, industry engagement, workforce training, and initiatives that shape regional and global communities.
Abstract
A computer-implemented method that includes receiving a plurality of input images; generating a visual matrix utilizing the plurality of input images and an image encoder, wherein the visual matrix includes a list of encoded images; receiving a plurality of text prompts; selecting a first one of the text prompts from the plurality of text prompts; sending the first one of the text prompts to a language model to generate a candidate list of tokens; selecting tokens; converting the text prompts into updated text prompts via appending the tokens; generating a text matrix utilizing the text prompt and a text encoder; utilizing numerical values assigned at an image-text similarity matrix, determining a score associated with the image-text similarity matrix; and evaluating a criterion and outputting a final token to the updated text prompt in response to identifying a highest score associated with the final token after evaluating each of the plurality of text prompts.
Core Innovation
The invention relates to prompt search and prompt engineering for a pre-trained machine-learning network that receives a plurality of input images and a plurality of text prompts. The method generates a visual matrix of encoded images using an image encoder, then selects a first one of the text prompts and sends it to a language model, such as a large language model (LLM), to generate a candidate list of tokens.
One or more tokens from the candidate list are selected and appended to convert the text prompt into updated text prompts. A text matrix is generated using the updated text prompts and a text encoder of the machine-learning network, and the updated text prompts are represented as a list of encoded visual descriptors.
The text matrix is multiplied with the visual matrix to generate an image-text similarity matrix whose entries assign numerical values indicating similarities between each encoded visual descriptor and each encoded image. A score is determined utilizing the numerical values, and threshold-controlled repetition is carried out until a final token is output in response to identifying a highest score associated with the final token after evaluating each of the plurality of text prompts.
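The similarity computation described above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: it assumes each matrix is a list of row vectors of equal length, so the product is taken between the text matrix and the transpose of the visual matrix. The function names, the toy vectors, the image labels, and the choice of classification accuracy as the score are all assumptions made for illustration; the summary does not say how the score is derived from the matrix entries.

```python
def similarity_matrix(text_matrix, visual_matrix):
    """Multiply the text matrix by the transposed visual matrix.

    Entry [i][j] is the dot product between encoded descriptor i
    and encoded image j.
    """
    return [
        [sum(t * v for t, v in zip(text_row, image_row)) for image_row in visual_matrix]
        for text_row in text_matrix
    ]


def accuracy_score(similarity, labels):
    """Assumed scoring rule: fraction of images whose most similar
    descriptor index equals the (hypothetical) label for that image."""
    num_images = len(labels)
    correct = 0
    for j in range(num_images):
        column = [row[j] for row in similarity]
        predicted = max(range(len(column)), key=column.__getitem__)
        if predicted == labels[j]:
            correct += 1
    return correct / num_images


# Toy stand-ins for encoder outputs (two descriptors, three images).
text_matrix = [[1.0, 0.0], [0.0, 1.0]]
visual_matrix = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]  # hypothetical ground-truth descriptor index per image

similarity = similarity_matrix(text_matrix, visual_matrix)
score = accuracy_score(similarity, labels)
print(similarity)
print(score)
```

In practice the rows would come from the image and text encoders of the machine-learning network, and the embeddings would typically be normalized before the product so that entries behave like cosine similarities.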
Claims Coverage
The provided set includes three independent claims (clm-00001, clm-00011, clm-00015) centered on iterative prompt/token updating driven by a language model and scored via an image-text similarity matrix, using a threshold-controlled repetition process.
Iterative LLM token candidate selection for updating text prompts
Selecting a first one of the text prompts from the plurality of text prompts and sending it to a language model, such as a large language model (LLM), to generate a candidate list of tokens. The candidate list is a subset smaller than all tokens associated with the first one of the text prompts and includes the highest-probability tokens calculated in response to the output of the LLM. One or more tokens are then selected from the candidate list.
Updated text prompts via appending selected tokens
Converting the first one of the text prompts into updated text prompts via appending the one or more selected tokens associated with the plurality of text prompts.
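Candidate selection and prompt updating, as described in the two items above, might look like the following sketch. It assumes the language model's output has already been reduced to a dictionary of next-token probabilities; that dictionary, the function names, the example prompt, and the cutoff k are hypothetical and stand in for whatever interface the language model actually exposes.

```python
import heapq


def candidate_tokens(next_token_probs, k):
    """Return the k highest-probability tokens, a strict subset of the
    vocabulary whenever k is smaller than the number of tokens."""
    top = heapq.nlargest(k, next_token_probs.items(), key=lambda item: item[1])
    return [token for token, _ in top]


def append_token(prompt, token):
    """Build an updated prompt by appending one selected token."""
    return f"{prompt} {token}"


# Hypothetical next-token distribution for one prompt.
prompt = "a photo of a cat, which is"
next_token_probs = {"striped": 0.4, "furry": 0.3, "the": 0.2, "metal": 0.1}

candidates = candidate_tokens(next_token_probs, k=2)
updated_prompts = [append_token(prompt, token) for token in candidates]
print(candidates)
print(updated_prompts)
```

Restricting the search to a short candidate list keeps the number of prompts that must be encoded and scored small compared with trying every token in the vocabulary.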
Scoring updated prompts using an image-text similarity matrix from encoded images and encoded descriptors
Generating a visual matrix utilizing the plurality of input images and an image encoder of the machine-learning network, generating a text matrix utilizing both the updated text prompts and a text encoder of the machine-learning network, multiplying the text matrix and the visual matrix to generate an image-text similarity matrix, and determining a score associated with the image-text similarity matrix using the numerical values assigned at the image-text similarity matrix.
Threshold-controlled repetition and final token output based on highest score
When the score falls below a threshold, repeating steps for a second token for the first one of the text prompts, and when the score exceeds the threshold, adding the one or more tokens to the updated text prompts and repeating steps for a remainder of each of the plurality of text prompts; outputting a final token to the updated text prompt in response to identifying a highest score associated with the final token after evaluating each of the plurality of text prompts.
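One possible reading of the threshold-controlled repetition is sketched below. It is an interpretation under stated assumptions, not the claimed procedure: `propose` and `score_prompt` are hypothetical callables standing in for the language-model candidate step and the similarity-matrix scoring step, a score equal to the threshold is treated as not exceeding it, and the final token is taken to be the highest-scoring token seen across all prompts.

```python
def search_prompts(prompts, propose, score_prompt, threshold):
    """Try candidate tokens per prompt until one scores above the threshold,
    then move on to the next prompt. Track the best token seen overall."""
    updated = []
    best_token, best_score = None, float("-inf")
    for prompt in prompts:
        accepted = prompt  # stays unchanged if no candidate passes
        for token in propose(prompt):
            candidate = f"{prompt} {token}"
            s = score_prompt(candidate)
            if s > best_score:
                best_token, best_score = token, s
            if s > threshold:
                accepted = candidate  # keep the token, go to next prompt
                break
            # Otherwise the score did not exceed the threshold: try the next token.
        updated.append(accepted)
    return updated, best_token, best_score


# Toy stand-ins for the language model and the scoring step.
toy_candidates = {
    "a photo of a cat": ["small", "striped"],
    "a photo of a dog": ["loyal", "furry"],
}
toy_scores = {
    "a photo of a cat small": 0.4,
    "a photo of a cat striped": 0.7,
    "a photo of a dog loyal": 0.9,
    "a photo of a dog furry": 0.5,
}

updated, final_token, final_score = search_prompts(
    list(toy_candidates),
    propose=toy_candidates.__getitem__,
    score_prompt=toy_scores.__getitem__,
    threshold=0.6,
)
print(updated, final_token, final_score)
```

In the toy run, the first candidate for the cat prompt scores below the threshold, so a second token is tried; the dog prompt accepts its first candidate immediately.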
Multimodal system for images indicative of radar, sonar, video, picture, sound, or LiDAR information
Receiving a plurality of input images indicative of radar, sonar, video, picture, sound, or LiDAR information, and executing the same iterative token selection, updated text prompt creation, visual and text encoding, image-text similarity matrix computation, threshold-based repetition, and final token output as recited.
Across the independent claims, the method or system generates an image-text similarity matrix from a visual matrix and a text matrix, uses a candidate token subset produced by an LLM or other language model to iteratively update the prompts, and applies a threshold to decide whether to repeat token selection or proceed, then outputs a final token associated with a highest score.
Stated Advantages
Not explicitly described in patent.
Documented Applications
Not explicitly described in patent.