Controllable, natural paralinguistics for text to speech synthesis
Inventors
Bratt, Harry • Richey, Colleen • Yadav, Maneesh
Assignees
Publication Number
US-12361925-B2
Publication Date
2025-07-15
Expiration Date
2040-12-29
Abstract
A speech recognition module receives training data of speech and creates a representation for individual words, non-words, phonemes, and any combination. A set of speech processing detectors analyze the training data of speech from humans communicating. The set of speech processing detectors detect speech parameters that are indicative of paralinguistic effects on top of enunciated words, phonemes, and non-words in the audio stream. One or more machine learning models undergo supervised machine learning on their neural network to train on how to associate one or more mark-up markers with a textual representation, for each individual word, individual non-word, individual phoneme, and any combinations of these, that was enunciated with a particular paralinguistic effect. Each mark-up marker can correspond to its own paralinguistic effect.
Core Innovation
The invention relates to a system and method for generating and understanding controllable, natural paralinguistics in text-to-speech synthesis. It involves training one or more machine learning models to analyze audio data of speech containing words, phonemes, and non-words that are annotated with mark-up markers. These markers guide the generation of textual representations whose enunciation differs from plain speech, thereby conveying additional intended meanings or increasing comprehension of the spoken elements.
The problem addressed is that traditional dialogue systems and text-to-speech technologies cannot effectively interpret or produce paralinguistic effects in speech, such as emphasis, sentiment, or nuances conveyed through prosody changes. Many prior systems treat speech literally and ignore the acoustic cues that modify meaning, which results in robotic, monotone interactions that are less natural and harder to understand.
This invention solves this problem by employing a training system including speech processing detectors, a speech recognition module, and supervised machine learning models. The detectors analyze speech parameters indicative of paralinguistic effects. The speech recognition module produces time-aligned textual and phonetic representations. The machine learning models learn to associate specific mark-up markers with corresponding paralinguistic effects on words, phonemes, and non-words. A speech generation module uses these mark-up markers to produce speech with appropriate prosodic variations, better conveying the intended secondary meanings and improving naturalness and comprehension.
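As a rough illustration of what one such detector might compute, the sketch below flags tokens whose pitch and energy both rise well above the utterance's own baseline. The summary does not describe detector internals, so the `AlignedToken` fields, the z-score rule, and the `z_threshold` value are all assumptions made for illustration, not the patented method.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class AlignedToken:
    """Hypothetical time-aligned unit from the speech recognition module."""
    text: str              # word, phoneme, or non-word such as "um"
    start: float           # start time in seconds
    end: float             # end time in seconds
    mean_pitch_hz: float   # assumed per-token acoustic measurement
    mean_energy_db: float  # assumed per-token acoustic measurement


def detect_emphasis(tokens, z_threshold=1.0):
    """Return the text of tokens whose pitch AND energy z-scores exceed z_threshold.

    The baseline is the mean and population standard deviation over the
    utterance itself; this is an illustrative heuristic only.
    """
    if not tokens:
        return []
    pitches = [t.mean_pitch_hz for t in tokens]
    energies = [t.mean_energy_db for t in tokens]
    # Fall back to 1.0 when every token has the same value (zero deviation).
    p_mu, p_sd = mean(pitches), pstdev(pitches) or 1.0
    e_mu, e_sd = mean(energies), pstdev(energies) or 1.0

    flagged = []
    for t in tokens:
        pitch_z = (t.mean_pitch_hz - p_mu) / p_sd
        energy_z = (t.mean_energy_db - e_mu) / e_sd
        if pitch_z > z_threshold and energy_z > z_threshold:
            flagged.append(t.text)
    return flagged


utterance = [
    AlignedToken("I", 0.0, 0.2, 120.0, 60.0),
    AlignedToken("never", 0.2, 0.7, 180.0, 70.0),
    AlignedToken("said", 0.7, 1.0, 120.0, 60.0),
    AlignedToken("that", 1.0, 1.3, 120.0, 60.0),
]
emphasized = detect_emphasis(utterance)
```

A real detector would more likely work on frame-level features and a learned speaker baseline; the per-token averages here simply keep the example small.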
Claims Coverage
The patent includes three independent claims covering apparatus and methods for training machine learning models to recognize and generate paralinguistic effects in speech, annotating training data, and generating speech with enhanced prosody guided by mark-up markers.
Machine learning models trained to generate speech with paralinguistic effects
One or more machine learning models trained to examine audio data including words, phonemes, or non-words annotated with mark-up markers that guide generation of textual representations to cause different enunciations conveying additional intended meanings, where a speech generation module uses these representations to create speech with particular paralinguistic effects.
Automated annotation of training data using speech detectors and recognition modules
A set of speech processing detectors analyzes training data to detect speech parameters indicative of paralinguistic effects on words, phonemes, and non-words; a speech recognition module creates textual representations; and machine learning models trained via supervised learning associate mark-up markers with these representations, automating both the labeling of training data and the pre-deployment training of the models on various paralinguistic effects.
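One way to picture the automated labeling step is as a time-overlap join between the recognizer's time-aligned words and the detectors' events. The sketch below is a minimal illustration under that assumption; the `Word` and `DetectorEvent` types, the marker name `EMPH`, and the `min_fraction` cutoff are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical time-aligned word from the speech recognition module."""
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass(frozen=True)
class DetectorEvent:
    """Hypothetical detector output: a marker active over a time span."""
    marker: str   # illustrative marker name, e.g. "EMPH"
    start: float
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words, events, min_fraction=0.5):
    """Pair each word with the marker covering the largest share of it.

    A marker is assigned only if the event covers at least min_fraction of
    the word's duration; otherwise the label is None (plain enunciation).
    """
    labeled = []
    for w in words:
        duration = w.end - w.start
        best_marker, best_fraction = None, 0.0
        for ev in events:
            if duration <= 0:
                continue
            fraction = overlap(w.start, w.end, ev.start, ev.end) / duration
            if fraction >= min_fraction and fraction > best_fraction:
                best_marker, best_fraction = ev.marker, fraction
        labeled.append((w.text, best_marker))
    return labeled


words = [Word("I", 0.0, 0.2), Word("never", 0.2, 0.7), Word("said", 0.7, 1.0)]
events = [DetectorEvent("EMPH", 0.25, 0.75)]
training_pairs = label_words(words, events)
```

The resulting `(text, marker)` pairs are the kind of labeled examples a supervised model could train on without a human annotating each recording by hand.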
Natural language generation guided by a mark-up marker table for speech synthesis
A natural language generator module generates textual representations of words, phonemes, or non-words and references a table mapping mark-up markers to respective paralinguistic effects, marking up the textual representation to guide a speech generation module in producing speech with the corresponding paralinguistic effects differing from plain enunciation by threshold amounts.
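The table lookup and mark-up step could look something like the following sketch. The claim summary does not give a marker syntax, table contents, or threshold values, so the marker names, the prosody fields (`pitch_scale`, `duration_scale`), the angle-bracket tag format, and the 0.10 threshold are all assumptions chosen for illustration.

```python
# Hypothetical table mapping each mark-up marker to a paralinguistic effect
# and to prosody settings relative to plain enunciation (1.0 = unchanged).
MARKER_TABLE = {
    "EMPH": {"effect": "emphasis", "pitch_scale": 1.25, "duration_scale": 1.15},
    "HESIT": {"effect": "hesitation", "pitch_scale": 1.0, "duration_scale": 1.4},
}

# Plain enunciation is the reference point for the threshold check.
PLAIN = {"pitch_scale": 1.0, "duration_scale": 1.0}


def differs_from_plain(marker, threshold=0.10):
    """True if any prosody setting differs from plain by at least threshold."""
    settings = MARKER_TABLE[marker]
    return any(abs(settings[key] - PLAIN[key]) >= threshold for key in PLAIN)


def mark_up(tokens, markers_by_index):
    """Wrap selected tokens in marker tags for a downstream speech generator.

    markers_by_index maps a token position to a marker name from MARKER_TABLE.
    """
    pieces = []
    for i, token in enumerate(tokens):
        marker = markers_by_index.get(i)
        if marker is None:
            pieces.append(token)
            continue
        if marker not in MARKER_TABLE:
            raise KeyError(f"unknown marker: {marker}")
        pieces.append(f"<{marker}>{token}</{marker}>")
    return " ".join(pieces)


marked_text = mark_up(["I", "never", "said", "that"], {1: "EMPH"})
```

Keeping the marker-to-effect mapping in a single table means the generator and the synthesizer share one vocabulary of effects, which appears to be the role the claim assigns to the table.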
The claims collectively describe an integrated system wherein supervised machine learning models, speech detectors, recognition modules, and markup annotation cooperate to identify, represent, and reproduce paralinguistic effects in speech, enabling more natural and meaningful speech synthesis and understanding.
Stated Advantages
The system provides more human-like naturalness in text-to-speech output by enabling controlled production of paralinguistic cues.
It improves communication clarity and efficiency by conveying additional information beyond literal word meanings through prosodic variations.
The invention reduces development cost and time compared to existing systems for producing natural prosodic speech outputs.
It enables conversational systems to process and generate a fuller range of expressive speech, enhancing user satisfaction and willingness to use spoken language systems.
Documented Applications
Use in conversational engagement platforms such as dialogue systems and virtual digital assistants (e.g., Alexa, Siri) to produce speech with nuanced paralinguistic effects.
Enhancing natural language generation and text-to-speech synthesis modules to produce speech conveying additional intended meanings via prosody.
Training systems employing speech activity detectors, speech recognition modules, and machine learning models to analyze and annotate training data for paralinguistic effects.