teemosliang
SDPose Wholebody
SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation (WholeBody - 133 Keypoints)

[Paper](https://arxiv.org/abs/2509.24980) | [Project Page](https://t-s-liang.github.io/SDPose) | [Demo](https://huggingface.co/spaces/teemosliang/SDPose-Body) | [License: MIT](https://opensource.org/licenses/MIT)

SDPose is a state-of-the-art human pose estimation model that leverages the visual priors of Stable Diffusion to achieve strong performance in out-of-distribution (OOD) scenarios. This model variant estimates 133 wholebody keypoints covering the body, hands, face, and feet.

SDPose employs a U-Net backbone initialized with Stable Diffusion v2 weights, combined with a specialized heatmap head for keypoint prediction. The model operates in a top-down manner:

1. Person Detection: Detect human bounding boxes using an object detector (e.g., YOLO11-x)
2. Pose Estimation: Crop each detected person and estimate 133 wholebody keypoints
3. Heatmap Generation: Produce confidence heatmaps for precise keypoint estimation

Model Specifications:

- Backbone: Stable Diffusion v2 U-Net (fine-tuned; minimal architectural changes)
- Head: Custom heatmap prediction head
- Input Resolution: 1024×768 (H×W)
- Output: 133 keypoint heatmaps + coordinates with confidence scores
- Framework: MMPose

The model predicts 133 wholebody keypoints following the COCO-WholeBody keypoint format.

Intended uses:

- Human pose estimation in natural images
- Pose estimation in artistic and stylized domains (paintings, anime, sketches)
- Animation and video pose tracking
- Cross-domain pose analysis and research
- Applications requiring robust pose estimation under distribution shifts

Trained exclusively on COCO-2017 train2017 (no extra data).

- COCO-Wholebody (Common Objects in Context): 200K+ images with 133 wholebody keypoints
- Images are resized and cropped to 1024×768 resolution
- Augmentation: random horizontal flip, half-body & bbox transforms, UDP affine; Albumentations (Gaussian/median blur, coarse dropout).
- Heatmaps: UDP codec (MMPose style).

SDPose significantly outperforms traditional pose estimation models (e.g., Sapiens) on out-of-distribution benchmarks while maintaining competitive performance on in-domain data. See our paper for comprehensive evaluation results.

If you use SDPose in your research, please cite our paper (arXiv:2509.24980).

- 🌐 Project Website: https://t-s-liang.github.io/SDPose
- 📄 Paper: arXiv:2509.24980
- 💻 Code Repository: GitHub
- 🤗 Demo: HuggingFace Space
- 📧 Contact: [email protected]
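
To make the 133-keypoint output easier to work with, the sketch below splits a wholebody prediction into body, feet, face, and hand groups. It assumes the output follows the standard COCO-WholeBody index order (17 body, 6 feet, 68 face, 21 left-hand, 21 right-hand keypoints) and that each keypoint arrives as an `(x, y, confidence)` tuple; the names `WHOLEBODY_PARTS`, `split_wholebody`, and `mean_confidence` are hypothetical helpers, not part of the SDPose or MMPose API.

```python
from typing import Dict, List, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, confidence); assumed layout

# Half-open index ranges in the standard COCO-WholeBody ordering (assumption:
# the model output uses this ordering, as the card says it follows the format).
WHOLEBODY_PARTS: Dict[str, Tuple[int, int]] = {
    "body": (0, 17),
    "feet": (17, 23),
    "face": (23, 91),
    "left_hand": (91, 112),
    "right_hand": (112, 133),
}


def split_wholebody(keypoints: List[Keypoint]) -> Dict[str, List[Keypoint]]:
    """Group a flat list of 133 keypoints by body part."""
    if len(keypoints) != 133:
        raise ValueError(f"expected 133 keypoints, got {len(keypoints)}")
    return {
        name: keypoints[start:end]
        for name, (start, end) in WHOLEBODY_PARTS.items()
    }


def mean_confidence(keypoints: List[Keypoint]) -> Dict[str, float]:
    """Average confidence per part, useful for spotting unreliable hands or faces."""
    parts = split_wholebody(keypoints)
    return {
        name: sum(conf for _, _, conf in pts) / len(pts)
        for name, pts in parts.items()
    }
```

A per-part confidence summary like this can help decide, for example, whether hand keypoints are trustworthy enough to draw on a stylized image.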
SDPose-Body
SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation (Body - 17 Keypoints)

[Paper](https://arxiv.org/abs/2509.24980) | [Project Page](https://t-s-liang.github.io/SDPose) | [Demo](https://huggingface.co/spaces/teemosliang/SDPose-Body) | [License: MIT](https://opensource.org/licenses/MIT)

SDPose is a state-of-the-art human pose estimation model that leverages the visual priors of Stable Diffusion to achieve strong performance in out-of-distribution (OOD) scenarios. This model variant estimates 17 COCO body keypoints: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles.

SDPose employs a U-Net backbone initialized with Stable Diffusion v2 weights, combined with a specialized heatmap head for keypoint prediction. The model operates in a top-down manner:

1. Person Detection: Detect human bounding boxes using an object detector (e.g., YOLO11-x)
2. Pose Estimation: Crop each detected person and estimate 17 body keypoints
3. Heatmap Generation: Produce confidence heatmaps for precise keypoint estimation

Model Specifications:

- Backbone: Stable Diffusion v2 U-Net (fine-tuned; minimal architectural changes)
- Head: Custom heatmap prediction head
- Input Resolution: 1024×768 (H×W)
- Output: 17 keypoint heatmaps + coordinates with confidence scores
- Framework: MMPose

The model predicts 17 body keypoints following the COCO keypoint format.

Intended uses:

- Human pose estimation in natural images
- Pose estimation in artistic and stylized domains (paintings, anime, sketches)
- Animation and video pose tracking
- Cross-domain pose analysis and research
- Applications requiring robust pose estimation under distribution shifts

Trained exclusively on COCO-2017 train2017 (no extra data).

- COCO (Common Objects in Context): 200K+ images with 17 body keypoints
- Images are resized and cropped to 1024×768 resolution
- Augmentation: random horizontal flip, half-body & bbox transforms, UDP affine; Albumentations (Gaussian/median blur, coarse dropout).
- Heatmaps: UDP codec (MMPose style).

SDPose significantly outperforms traditional pose estimation models (e.g., Sapiens, ViTPose++) on out-of-distribution benchmarks while maintaining competitive performance on in-domain data. See our paper for comprehensive evaluation results.

If you use SDPose in your research, please cite our paper (arXiv:2509.24980).

- 🌐 Project Website: https://t-s-liang.github.io/SDPose
- 📄 Paper: arXiv:2509.24980
- 💻 Code Repository: GitHub
- 🤗 Demo: HuggingFace Space
- 📧 Contact: [email protected]
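
As a rough illustration of the top-down flow this card describes (detect a person box, crop to the 1024×768 input, decode a heatmap), the sketch below shows the two coordinate steps in plain Python. It is not the SDPose implementation: `fit_bbox_to_aspect` and `decode_heatmap` are hypothetical helpers, the `padding=1.25` default is an assumption, and the decoding is a simple argmax with pixel-center mapping rather than the UDP codec the model actually uses.

```python
from typing import List, Tuple

INPUT_H, INPUT_W = 1024, 768  # input resolution (H×W) from the model specifications

Box = Tuple[float, float, float, float]  # (x, y, width, height) in image pixels


def fit_bbox_to_aspect(x1: float, y1: float, x2: float, y2: float,
                       padding: float = 1.25) -> Box:
    """Grow a detector box around its center so it matches the model aspect ratio.

    `padding` enlarges the box to keep some context; 1.25 is an assumed value.
    """
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = x2 - x1, y2 - y1
    aspect = INPUT_W / INPUT_H  # width / height = 0.75
    if w > h * aspect:
        h = w / aspect  # box too wide: extend height
    else:
        w = h * aspect  # box too tall: extend width
    w, h = w * padding, h * padding
    return cx - w / 2.0, cy - h / 2.0, w, h


def decode_heatmap(heatmap: List[List[float]], box: Box) -> Tuple[float, float, float]:
    """Argmax-decode one keypoint heatmap into (x, y, score) in image coordinates."""
    hm_h, hm_w = len(heatmap), len(heatmap[0])
    score, row, col = max(
        (value, r, c)
        for r, line in enumerate(heatmap)
        for c, value in enumerate(line)
    )
    bx, by, bw, bh = box
    # Map the center of the winning heatmap cell back into the cropped box.
    x = bx + (col + 0.5) / hm_w * bw
    y = by + (row + 0.5) / hm_h * bh
    return x, y, score
```

In a full pipeline, `decode_heatmap` would be applied to each of the 17 heatmaps for every detected person, using the box returned by `fit_bbox_to_aspect` for that person.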