VPA with integrated object recognition and facial expression recognition

Inventors

Divakaran, Ajay; Tamrakar, Amir; Acharya, Girish; Mark, William; Ho, Greg; Huang, Jihua; Salter, David; Kalns, Edgar; Wessel, Michael; Yin, Min; Carpenter, James; Mombourquette, Brent; Nitz, Kenneth; Shriberg, Elizabeth; Law, Eric; Frandsen, Michael; Kim, Hyong-Gyun; Albright, Cory; Tsiartas, Andreas

Assignees

SRI International Inc

Publication Number

US-12282606-B2

Publication Date

2025-04-22

Expiration Date

2036-10-24

Abstract

Methods, computing devices, and computer-program products are provided for implementing a virtual personal assistant. In various implementations, a virtual personal assistant can be configured to receive sensory input, including at least two different types of information. The virtual personal assistant can further be configured to determine semantic information from the sensory input, and to identify a context-specific framework. The virtual personal assistant can further be configured to determine a current intent. Determining the current intent can include using the semantic information and the context-specific framework. The virtual personal assistant can further be configured to determine a current input state. Determining the current input state can include using the semantic information and one or more behavioral models. The behavioral models can include one or more interpretations of previously-provided semantic information. The virtual personal assistant can further be configured to determine an action using the current intent and the current input state.

Core Innovation

The invention provides methods, computing devices, and computer-program products for implementing a virtual personal assistant (VPA) that integrates object recognition and facial expression recognition. The VPA is designed to receive sensory input from at least two different modalities, such as audio and visual input, and to determine semantic information from that input. Using this semantic information, the VPA identifies a context-specific framework, which can include dialog history and ontologies, and applies that context to determine the user's current intent.

The core problem addressed is that conventional virtual personal assistants can process only spoken or typed input and cannot effectively interpret non-verbal cues such as emotional expressions, gestures, or the context of an ongoing conversation. This makes their interactions less natural and prevents them from accurately tracking the flow and context of a conversation, especially when users refer to prior interactions or use ambiguous references.

In this invention, the VPA further determines the user's input state, such as an emotional, cognitive, or mental state, using behavioral models and scene information extracted from the sensory input, including object, event, and motion data. The system adapts its questioning style to the detected user state, enabling more effective, contextually appropriate, and user-sensitive dialog. Actions performed by the VPA may include outputting responses, requesting confirmation, searching for information, or generating control signals.
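
The flow described above can be made concrete with a short sketch. The Python below is a hypothetical illustration only, not the patented implementation; every name (SemanticInfo, Context, determine_intent, choose_action) and the toy rules inside them are assumptions invented for this example.

    from dataclasses import dataclass, field

    @dataclass
    class SemanticInfo:
        """Semantic information fused from at least two input types."""
        words: str            # e.g., from automatic speech recognition
        scene_objects: list   # e.g., from object recognition
        expression: str       # e.g., from facial expression recognition

    @dataclass
    class Context:
        """Stands in for a context-specific framework (dialog history, ontology)."""
        dialog_history: list = field(default_factory=list)
        domain: str = "general"

    def determine_intent(sem, ctx):
        # Combine semantic information with the context-specific framework.
        if "refill" in sem.words and ctx.domain == "pharmacy":
            return "refill_prescription"
        return "unknown"

    def determine_input_state(sem):
        # Stand-in for behavioral models interpreting audio/visual cues.
        return "frustrated" if sem.expression == "frown" else "neutral"

    def choose_action(intent, state):
        # Select an action from the current intent and current input state.
        if state == "frustrated":
            return "request_confirmation"  # e.g., ask a short confirming question
        return "execute:" + intent

    sem = SemanticInfo(words="I need a refill",
                       scene_objects=["pill bottle"], expression="frown")
    ctx = Context(domain="pharmacy")
    print(choose_action(determine_intent(sem, ctx), determine_input_state(sem)))
    # -> "request_confirmation"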

Claims Coverage

The patent claims three primary independent inventive features, spanning methods, devices, and memory devices that implement adaptive questioning styles in virtual personal assistants using multi-modal input.

Adaptive questioning style based on audio and visual input for user state

A method is disclosed that receives sensory input including both audio and visual data, determines semantic information from that input, selects an initial questioning style, and determines scene information from the visual input. After receiving audio from the user, the method detects the user's input state using both the audio and the scene information, adapts the questioning style to a second style based on that state, and outputs a corresponding question to the user.
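
As a rough sketch of this claimed flow: the function names, questioning-style labels, and the toy state-detection rule below are all assumptions for illustration, not language from the claims. The device and memory-device claims that follow describe essentially the same operations.

    def detect_input_state(user_audio, scene):
        # Toy rule combining audio cues with scene information (both hypothetical).
        tense_voice = "raised_voice" in user_audio
        busy_scene = len(scene) > 5
        return "frustrated" if tense_voice or busy_scene else "neutral"

    def questioning_flow(visual_scene, user_audio):
        style = "open_ended"                    # initial questioning style
        state = detect_input_state(user_audio, visual_scene)
        if state == "frustrated":
            style = "short_confirming"          # adapt to a second style
        questions = {
            "open_ended": "How can I help you today?",
            "short_confirming": "Do you want to refill your prescription?",
        }
        return questions[style]                 # output in the adapted style

    print(questioning_flow(["counter", "queue"], "raised_voice: yes"))
    # -> "Do you want to refill your prescription?"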

Device implementing questioning adaptation using scene and input state analysis

A device comprising at least one processor, a sensor, and memory performs operations including receiving audio and visual input, extracting semantic information, selecting and adapting a questioning style based on the combination of semantic and scene information, determining the user's input state after waiting for audio input, and outputting questions in the selected questioning style.

Memory device storing instructions for adaptive questioning utilizing multi-modal input

At least one memory device stores instructions that, when executed, cause a processor to receive audio and visual input, determine semantic and scene information, select and modify the questioning style in response to a user input state derived from both the audio and scene information, and output questions in the adapted questioning style.

Together, the independent claims cover methods, devices, and memory products for multi-modal virtual personal assistant systems that analyze audio and visual inputs to dynamically adapt the questioning style to the user's state.

Stated Advantages

The system enables a virtual personal assistant to interpret and adapt to both verbal and non-verbal user cues for more natural, context-aware, and personalized interactions.

The adaptive questioning approach allows the assistant to respond to user emotional, cognitive, or mental states as detected from combined audio and visual information.

Integration of object recognition and scene analysis enhances the assistant's ability to understand context, track dialog history, and resolve ambiguities in user input.

Documented Applications

Integration in tablet devices for conversational interfaces supporting entertainment, communication, and information retrieval.

Implementation in automobiles for vehicle and travel-related assistance, including responding to driver queries and adapting to driver state.

Deployment in service robots for customer service in retail, education, healthcare, and therapy, providing assistance and adapting to user frustration or satisfaction.

Use in automated customer service systems, such as pharmacy prescription refilling, to adapt questioning style for better user engagement.
