Deep learning based video information extraction system
Inventors
Bishop, Morgan A. • Nagy, James M. • Pottenger, William M. • Hoogs, Anthony • Tong, Tuanjie
Assignees
United States Department of the Air Force
Publication Number
US-12211278-B2
Publication Date
2025-01-28
Expiration Date
2042-05-12
Interested in licensing this patent?
MTEC can help explore whether this patent might be available for licensing for your application.
Abstract
A video information extraction system includes a memory to store a video; a textual information extraction module to obtain information about terms, entities, relations, and events from a ground truth caption corresponding to the video; and a video captioning module including an encoder (i) to receive the information about the terms, entities, relations, and events from the textual information extraction module, and (ii) to extract video features from the video; and a decoder to generate a text caption based on the extracted video features.
Core Innovation
The invention provides a deep learning based video information extraction system that utilizes ground truth captions associated with videos to extract and assign types to terms, entities, relations, and events of interest within the video. The system comprises a memory to store the video; a textual information extraction module that extracts information about terms, entities, relations, and events from ground truth captions corresponding to the video; and a video captioning module including an encoder to receive this extracted information and to extract video features from the video, alongside a decoder to generate a text caption based on the extracted video features.
The background problem addressed is the difficulty of extracting information from video data, which stems from the cost of labeling large amounts of video for training classifiers. While information extraction has been studied extensively for textual data, video analytics has mostly focused on object detection rather than on extracting detailed informational elements such as entities, relations, and events. Expert labeling is expensive, and the available training data is often insufficient, which complicates accurate estimation of model parameters.
Three embodiments of video entity, relation, and event extraction are presented. The first is a pre-information extraction (pre-IE) approach that extracts terms, entities, relations, and events from ground truth captions before training a video captioning framework. The second utilizes a joint embedding approach that encodes both video features and vectors representing terms, entities, relations, and events into a common space for training. The third is a post-information extraction (post-IE) method that trains a video captioning framework directly with ground truth captions to generate descriptive sentences, with information extraction subsequently applied to these generated captions. These approaches are particularly useful where limited caption data is available and labeled datasets for object detection are either unavailable or costly to obtain.
Claims Coverage
The patent includes three independent claims corresponding to three distinct video information extraction system embodiments, each with main inventive features relating to the extraction and processing of video and textual data for generating captions describing entities, relations, and events.
Video information extraction system with encoder and decoder modules
A system comprising a memory storing a video; a textual information extraction module extracting information about entities, relations, and events from ground truth captions; and a video captioning module with an encoder receiving the extracted information and extracting video features, and a decoder generating text captions based on these video features.
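The pre-IE data flow described in this claim can be sketched as a toy pipeline: textual IE runs on the ground truth caption first, and its output conditions the captioning encoder alongside the video features. All function names, the toy vocabularies, and the template decoder below are illustrative assumptions, not the patent's actual models.

```python
# Toy sketch of the pre-IE embodiment (illustrative, not the patented method).
ENTITY_VOCAB = {"person", "car", "truck"}   # toy entity types
EVENT_VOCAB = {"enters", "exits", "stops"}  # toy event triggers

def extract_ie(caption):
    """Toy textual IE: tag caption tokens as entities or events."""
    tokens = caption.lower().split()
    return {
        "entities": [t for t in tokens if t in ENTITY_VOCAB],
        "events": [t for t in tokens if t in EVENT_VOCAB],
    }

def encode(video_features, ie_info):
    """Toy encoder: append IE-derived flags to the video feature vector."""
    flags = [float(bool(ie_info["entities"])), float(bool(ie_info["events"]))]
    return video_features + flags

def decode(context):
    """Toy decoder: emit a caption template from the context vector."""
    has_entity, has_event = context[-2], context[-1]
    if has_entity and has_event:
        return "an entity performs an event in the scene"
    return "no informational elements detected"

ie = extract_ie("A person enters the building")
context = encode([0.2, 0.7], ie)
print(decode(context))  # → an entity performs an event in the scene
```

In the real system the encoder and decoder would be trained neural networks; the sketch only shows where the IE output enters the pipeline relative to captioning.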
Joint embedding video information extraction system
A system with a memory storing a video; a textual information extraction module extracting information from ground truth captions; a first encoder receiving this extracted information; a second encoder extracting video features; a common embedding module encoding both extracted information and video features into vectors; and a decoder generating text captions from these vectors.
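The common-embedding idea in this claim can be illustrated with two toy encoders that project a textual IE vector and a video feature vector into one shared space, where a similarity score could drive joint training. The projection matrices, dimensions, and inputs below are assumptions for illustration only.

```python
# Toy sketch of the joint-embedding embodiment (illustrative assumptions).
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

# Hypothetical "learned" projections into a shared 2-d embedding space.
TEXT_PROJ = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]  # 3-d IE vector -> 2-d
VIDEO_PROJ = [[0.5, 0.5], [0.5, -0.5]]          # 2-d video vector -> 2-d

def embed_text(ie_vector):
    return matvec(TEXT_PROJ, ie_vector)

def embed_video(video_features):
    return matvec(VIDEO_PROJ, video_features)

def cosine(a, b):
    """Similarity in the common space, the quantity joint training aligns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

text_vec = embed_text([1.0, 0.0, 1.0])   # e.g. entity/event indicator flags
video_vec = embed_video([0.8, 0.2])
print(round(cosine(text_vec, video_vec), 3))
```

A decoder would then generate captions from vectors in this shared space; the sketch only shows the two encoders and the common embedding step.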
Video information extraction system generating full sentence captions and extracting information
A system with a memory storing a video; a video captioning module with an encoder receiving video and ground truth captions, and a decoder generating full sentence captions based on the ground truth captions; and a textual information extraction module obtaining entities, relations, and events from these full sentence captions.
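The post-IE ordering in this claim can be sketched as: generate a full-sentence caption first, then run textual IE over the generated text. The template decoder and the regular-expression relation pattern below are illustrative stand-ins, not the patent's trained components.

```python
# Toy sketch of the post-IE embodiment (illustrative assumptions).
import re

def generate_caption(video_features):
    """Toy decoder: pick a caption template from a mean feature score."""
    score = sum(video_features) / len(video_features)
    return ("a truck stops at the gate" if score > 0.5
            else "a person exits the building")

# Toy relation pattern over generated captions: <entity> <event-verb>.
PATTERN = re.compile(r"a (\w+) (stops|exits|enters)\b")

def extract_from_caption(caption):
    """Post-hoc textual IE applied to the generated full-sentence caption."""
    match = PATTERN.search(caption)
    if not match:
        return None
    return {"entity": match.group(1), "event": match.group(2)}

caption = generate_caption([0.9, 0.7])
print(extract_from_caption(caption))  # → {'entity': 'truck', 'event': 'stops'}
```

Because IE runs only on generated text, this embodiment needs no IE signal during captioning training, which matches its suitability when labeled object datasets are unavailable.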
Overall, the claims cover systems that extract and encode textual information from ground truth captions, extract video features, embed both into common vector spaces, and generate text captions describing video content. Optional features include object detection modules and techniques such as entity resolution, higher-order co-occurrence vectors, convolutional and recurrent neural networks, transfer learning, and generative adversarial learning, applied to enhance video information extraction.
Stated Advantages
Provides improved video analytics capability by extracting detailed terms, entities, relations, and events from videos using deep learning based on ground truth captions.
Enables video information extraction in scenarios with limited labeled data and absence of object class label datasets, reducing reliance on expensive expert labeling.
Facilitates richer descriptive caption generation for videos by leveraging higher-order co-occurrence information and common embedding spaces linking video features and textual data.
Supports the use of transfer learning and pretrained neural networks to enhance video feature extraction, improving accuracy for temporal and spatial aspects.
Offers flexible embodiments including pre-information extraction, joint embedding, and post-information extraction approaches to suit different data availability and application needs.
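The higher-order co-occurrence information mentioned among the advantages can be illustrated with a standard construction: if C is a first-order term co-occurrence matrix, then C² counts length-two paths, so terms that never co-occur directly but share a common neighbor still receive a nonzero association. The toy vocabulary and counts below are illustrative assumptions.

```python
# Toy second-order co-occurrence via squaring a first-order count matrix.
TERMS = ["person", "car", "gate"]
# First-order counts: "person" co-occurs with "car", "car" with "gate",
# but "person" and "gate" never appear together directly.
C1 = [
    [0, 2, 0],
    [2, 0, 3],
    [0, 3, 0],
]

def matmul(a, b):
    """Square matrix product; (C1 @ C1)[i][j] counts length-two paths."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

C2 = matmul(C1, C1)  # second-order co-occurrence counts
print(C2[0][2])  # → 6: "person"-"gate" linked through the shared term "car"
```

Such higher-order vectors give the captioning framework associations beyond directly observed pairs, which is useful when caption data is sparse.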
Documented Applications
Video analytics for surveillance cameras in various target domains.
Homeland security applications such as border protection.
Law enforcement, intelligence, security, and defense applications.