Light-weight pose estimation network with multi-scale heatmap fusion
Inventors
Fu, Yun • Jiang, Songyao • Sun, Bin
Assignees
Northeastern University • Northeastern University Boston
Publication Number
US-12205317-B2
Publication Date
2025-01-21
Expiration Date
2041-02-10
Interested in licensing this patent?
MTEC can help explore whether this patent might be available for licensing for your application.
Abstract
Embodiments identify joints of a multi-limb body in an image. One such embodiment unifies depth of a plurality of multi-scale feature maps generated from an image of a multi-limb body to create a plurality of feature maps each having a same depth. In turn, for each of the plurality of feature maps having the same depth, an initial indication of one or more joints in the image is generated. The one or more joints are located at an interconnection of a limb to the multi-limb body or at an interconnection of a limb to another limb. To continue, a final indication of the one or more joints in the image is generated using each generated initial indication of the one or more joints.
Core Innovation
The invention provides a light-weight, accurate, and fast pose estimation network that utilizes a multi-scale heatmap fusion mechanism for estimating 2D poses from a single RGB image. The approach unifies the depth of multiple multi-scale feature maps generated from an image of a multi-limb body, producing feature maps of the same depth but different scales. For each depth-unified feature map, the system generates initial indications of joint locations, particularly at interconnections of limbs or between limbs themselves. These initial indications at diverse scales are combined to give a final joint location estimation, enhancing the system’s accuracy in pose estimation tasks.
This invention addresses the challenge of pose estimation in images, where scale variation between people and between body parts reduces detection accuracy, and where traditional methods either incur high computational costs or lose accuracy when made light-weight. Existing backbone neural networks, designed primarily for image classification, are ill-suited to efficient deployment on mobile or embedded devices because of their model complexity and resource requirements. There is therefore a need for dedicated deep convolutional neural network modules that deliver high-accuracy, high-speed pose estimation with lower computational and storage demands.
The core innovation centers on a novel architecture that employs a Low-rank Pointwise Residual (LPR) module within a high-resolution network backbone. This module reduces parameters and computational demands by factorizing the convolutional weight matrices and using residual connections through depthwise convolution, thus maintaining feature expressivity. The multi-scale head structure generates and fuses heatmaps from features at different resolutions, explicitly solving the scaling problem in pose estimation and enabling efficient, accurate operation suitable for real-time use on constrained devices.
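The parameter savings of a low-rank pointwise layer can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not the patent's exact LPR formulation: the dense 1×1 convolution weight is factorized into two low-rank matrices, and the residual path is reduced here to a per-channel scale (the actual module uses a depthwise convolution); all names, shapes, and the rank choice are illustrative.

```python
import numpy as np

def lpr_block(x, U, V, dw):
    """Sketch of a low-rank pointwise layer with a depthwise-style residual:
    y = (U @ V) x + dw * x, applied independently at every pixel."""
    C, H, W = x.shape
    flat = x.reshape(C, -1)          # (C, H*W): each column is one pixel
    low_rank = U @ (V @ flat)        # factorized 1x1 convolution, rank r
    residual = dw[:, None] * flat    # per-channel residual (stand-in for depthwise conv)
    return (low_rank + residual).reshape(C, H, W)

C, rank = 64, 8
full_params = C * C                  # dense 1x1 conv: 4096 weights
lpr_params = C * rank + rank * C + C  # U, V, and residual weights: 1088

x = np.random.rand(C, 16, 16)
U = np.random.rand(C, rank) * 0.1
V = np.random.rand(rank, C) * 0.1
dw = np.ones(C)
y = lpr_block(x, U, V, dw)           # same shape as x, far fewer parameters
```

With these illustrative sizes, the factorized layer uses roughly a quarter of the dense layer's parameters while preserving the output shape, which is the mechanism behind the reported parameter and FLOP reductions.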
Claims Coverage
The patent introduces several inventive features relating to computer-implemented methods, systems, and program products for identifying joints of a multi-limb body in images using unified multi-scale feature maps and multi-scale heatmap fusion.
Unifying feature map depth across scales
A method of creating a plurality of feature maps each having a same depth by unifying the depth of feature maps generated from an image, where each feature map has a different scale. This enables consistent processing across multiple resolutions, facilitating multi-scale joint detection.
Multi-scale joint indication and fusion
For each depth-unified feature map, an initial indication of one or more joints in the image is generated, specifically at interconnected limb positions. A final indication of these joints is then created by combining each generated initial indication, achieving a comprehensive joint estimation across all scales.
Generation of limb and pose indications from fused joint estimations
From the final indication of the joints, an indication of one or more limbs in the image can be generated, and subsequently, an overall pose can be determined using both joint and limb indications.
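The step from fused heatmaps to joints and limbs can be sketched as a per-joint argmax followed by pairing joints along a fixed skeleton. This is a minimal illustration under assumed conventions; the skeleton pairs and heatmap shapes are hypothetical, not taken from the patent.

```python
import numpy as np

def joints_from_heatmaps(fused):
    """Recover (row, col) coordinates of each joint from a (J, H, W)
    fused heatmap stack via per-joint argmax."""
    J, H, W = fused.shape
    flat_idx = fused.reshape(J, -1).argmax(axis=1)
    return np.stack([flat_idx // W, flat_idx % W], axis=1)  # (J, 2)

# Illustrative joint-index pairs forming limbs (not the patent's skeleton).
SKELETON = [(0, 1), (1, 2)]

def limbs_from_joints(coords):
    """Form limbs as line segments between skeleton-connected joints."""
    return [(tuple(coords[a]), tuple(coords[b])) for a, b in SKELETON]

fused = np.zeros((3, 8, 8))
fused[0, 2, 3] = 1.0
fused[1, 4, 4] = 1.0
fused[2, 6, 1] = 1.0
coords = joints_from_heatmaps(fused)  # rows: [2,3], [4,4], [6,1]
limbs = limbs_from_joints(coords)     # two segments along the skeleton
```

The overall pose is then simply the set of joint coordinates together with the limb segments connecting them.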
Upsampling and addition for heatmap fusion
The process of generating the final joint indication includes upsampling at least one initial joint indication to match the largest scale among the initial indications, and then adding the upsampled and largest scale indications together to form the fused result.
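This upsample-and-add fusion can be sketched in a few lines of NumPy, assuming (as is typical for such backbones, though not stated here) that the scales are power-of-two multiples of one another; nearest-neighbor upsampling stands in for whatever interpolation the network actually uses, and the joint count and resolutions are illustrative.

```python
import numpy as np

def upsample_nearest(hm, factor):
    """Nearest-neighbor upsampling of a (J, H, W) heatmap stack."""
    return hm.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_heatmaps(heatmaps):
    """Upsample every initial heatmap to the largest scale and add them."""
    largest = max(heatmaps, key=lambda h: h.shape[1])
    fused = np.zeros_like(largest, dtype=float)
    for hm in heatmaps:
        factor = largest.shape[1] // hm.shape[1]  # assumes integer scale ratios
        fused += upsample_nearest(hm, factor)
    return fused

J = 17  # e.g. a COCO-style joint count (illustrative)
hms = [np.random.rand(J, 64, 48),
       np.random.rand(J, 32, 24),
       np.random.rand(J, 16, 12)]
fused = fuse_heatmaps(hms)  # fused at the largest scale, (17, 64, 48)
```

Summing at the largest scale lets evidence from coarse heatmaps (robust to large subjects) reinforce fine heatmaps (precise for small parts), which is how the fusion addresses scale variation.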
Depth unification via individual convolutional layers
Unifying the depth of multi-scale feature maps by applying a respective convolutional layer to each map, ensuring identical channel size for subsequent processing.
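Because a 1×1 convolution is just a channel-mixing matrix applied at every pixel, the depth-unification step can be sketched directly in NumPy. The per-scale channel counts and the unified depth below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def unify_depth(feature_maps, weights):
    """Apply a per-scale 1x1 convolution (a channel-mixing matrix) so that
    every feature map ends up with the same depth D."""
    unified = []
    for x, M in zip(feature_maps, weights):
        C, H, Wd = x.shape
        out = (M @ x.reshape(C, -1)).reshape(M.shape[0], H, Wd)
        unified.append(out)
    return unified

D = 32  # unified depth (illustrative)
maps = [np.random.rand(18, 64, 48),   # high resolution, few channels
        np.random.rand(36, 32, 24),
        np.random.rand(72, 16, 12)]   # low resolution, many channels
ws = [np.random.rand(D, m.shape[0]) for m in maps]  # one weight per scale
out = unify_depth(maps, ws)  # all maps now have depth 32, scales unchanged
```

Each scale keeps its own spatial resolution; only the channel dimension is made identical, so the subsequent heatmap-estimating layers can share a consistent input depth.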
Heatmap estimating layer formed by convolutional neural network
Application of a heatmap estimating layer, composed of a convolutional neural network, to each depth-unified feature map for generating the initial joint heatmaps, enabling accurate and scalable estimation.
Training based on multi-scale ground-truth indications
The system supports training wherein generated initial joint indications are compared with corresponding ground-truth indications at each scale, and losses are used for backpropagation in the neural network.
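A minimal sketch of this multi-scale supervision, under the assumption that the per-scale loss is mean-squared error against a ground-truth heatmap resized to each scale (the patent does not specify the loss here, so both the loss choice and the nearest-neighbor resizing are illustrative):

```python
import numpy as np

def downsample_nearest(hm, factor):
    """Nearest-neighbor downsampling of a (J, H, W) heatmap by strided slicing."""
    return hm[:, ::factor, ::factor]

def multi_scale_loss(pred_heatmaps, gt_full):
    """Compare each scale's initial heatmap against the ground truth resized
    to that scale, and sum the per-scale MSE losses for backpropagation."""
    total = 0.0
    for pred in pred_heatmaps:
        factor = gt_full.shape[1] // pred.shape[1]  # assumes integer ratios
        gt = downsample_nearest(gt_full, factor)
        total += np.mean((pred - gt) ** 2)
    return total

gt = np.random.rand(17, 64, 48)                      # full-scale ground truth
preds = [np.random.rand(17, 64, 48),
         np.random.rand(17, 32, 24)]                 # initial indications per scale
loss = multi_scale_loss(preds, gt)                   # scalar training loss
```

Supervising every scale directly, rather than only the fused output, gives each branch its own gradient signal and is what makes the per-scale heatmaps meaningful enough to fuse.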
Backbone network for generating multi-scale feature maps with fusion
A backbone neural network is employed to process the image, performing both multi-scale feature extraction and feature fusion to generate the feature maps used in pose estimation.
Computer system and program product implementation
The inventive method is extended to a computer system with processor and memory executing code to perform all outlined steps, and to a computer program product comprising program instructions and non-transitory storage to perform the claimed functions when executed.
In summary, the patent claims cover novel approaches for efficient and accurate joint and pose estimation using unified multi-scale feature processing, heatmap generation and fusion, specialized training procedures, and extend to hardware and software implementations of these methods.
Stated Advantages
The system reduces computational costs by more than 70% (in FLOPs) and reduces parameters by over 85% with minimal loss in accuracy.
Enables fast, accurate pose estimation that runs in real time on mobile devices with robust performance.
Solves the scaling problem in pose estimation by utilizing multi-scale feature extraction, fusion, and heatmap estimation/fusion mechanisms.
Maintains accuracy comparable to state-of-the-art methods while substantially lowering model complexity, making deployment feasible on end devices.
Modular and adaptable network design allows easy integration into different architectures.
Documented Applications
Used for detecting human behaviors in monitoring systems.
Applicable to human-computer interaction, such as video games using body movement as input (e.g., Xbox Kinect).
Employable in mobile applications that require human body movement as input, including personal fitting and training applications.
Supports pose estimation tasks for various objects such as humans, animals, machines, and robots.
Usable as a basis for applications in sports, security, autonomous self-driving cars, and robotics.