VeryAladeen

2 models

Wan2_1-HuMo_17B-GGUF

**HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning**

[Paper (arXiv:2509.08519)](https://arxiv.org/abs/2509.08519) · [Project page](https://phantom-video.github.io/HuMo/)

> Liyang Chen*, Tianxiang Ma*, Jiawei Liu, Bingchuan Li†, Zhuowei Chen, Lijie Liu, Xu He, Gen Li, Qian He, Zhiyong Wu§
> *Equal contribution, †Project lead, §Corresponding author
> Tsinghua University | Intelligent Creation Team, ByteDance

✨ Key Features

HuMo is a unified, human-centric video generation framework designed to produce high-quality, fine-grained, and controllable human videos from multimodal inputs, including text, images, and audio. It supports strong text-prompt following, consistent subject preservation, and synchronized audio-driven motion.

> - **VideoGen from Text-Image** - Customize character appearance, clothing, makeup, props, and scenes using text prompts combined with reference images.
> - **VideoGen from Text-Audio** - Generate audio-synchronized videos from text and audio inputs alone, removing the need for image references and enabling greater creative freedom.
> - **VideoGen from Text-Image-Audio** - Achieve the highest level of customization and control by combining text, image, and audio guidance.
📑 Todo List

- [x] Release Paper
- [x] Checkpoint of HuMo-17B
- [x] Inference Codes
  - [ ] Text-Image Input
  - [x] Text-Audio Input
  - [x] Text-Image-Audio Input
- [x] Multi-GPU Inference
- [ ] Release Prompts to Generate Demo of Faceless Thrones
- [ ] HuMo-1.7B Model Preparation

| Models | Download Link | Notes |
|--------|---------------|-------|
| HuMo-17B | 🤗 Huggingface | Released before September 15 |
| HuMo-1.7B | 🤗 Huggingface | To be released soon |
| Wan-2.1 | 🤗 Huggingface | VAE & text encoder |
| Whisper-large-v3 | 🤗 Huggingface | Audio encoder |
| Audio separator | 🤗 Huggingface | Removes background noise (optional) |

Our model is compatible with both 480P and 720P resolutions; 720P inference achieves much better quality.

> Some tips
> - Prepare your text, reference images, and audio as described in testcase.json.
> - Multi-GPU inference is supported via FSDP + sequence parallel.
> - The model is trained on 97-frame videos at 25 FPS. Generating videos longer than 97 frames may degrade performance; a new checkpoint for longer generation will be provided.

HuMo's behavior and output can be customized by editing the generate.yaml configuration file, whose parameters control generation length, video resolution, and how text, image, and audio inputs are balanced.

Acknowledgements

Our work builds upon and is greatly inspired by several outstanding open-source projects, including Phantom, SeedVR, MEMO, Hallo3, OpenHumanVid, and Whisper. We sincerely thank the authors and contributors of these projects for generously sharing their excellent code and ideas. If you find this project useful for your research, please consider citing our paper.

📧 Contact

If you have any comments or questions regarding this open-source project, please open a new issue or contact Liyang Chen and Tianxiang Ma.
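As a rough illustration of how such a generate.yaml might be organized, here is a hedged sketch. The key names and values below are assumptions for illustration only, not taken from the HuMo repository; consult the generate.yaml that ships with the code for the real parameters.

```yaml
# Illustrative sketch only - these key names are hypothetical,
# not the actual HuMo configuration schema.
generation:
  frames: 97        # the model is trained on 97-frame clips at 25 FPS
  height: 720       # 720P inference yields much better quality than 480P
  width: 1280
guidance:           # hypothetical weights balancing the three modalities
  text_scale: 7.5
  image_scale: 1.0
  audio_scale: 1.0
```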

license:apache-2.0

SeC-4B

Single-file model formats for the SeC (Segment Concept) video object segmentation model, optimized for use with ComfyUI SeC Nodes.

| Format | Size | Description | GPU Requirements |
|--------|------|-------------|------------------|
| SeC-4B-fp16.safetensors | 7.35 GB | Recommended - best balance of quality and size | All CUDA GPUs |
| SeC-4B-fp8.safetensors | 3.97 GB | VRAM-constrained systems (saves 1.5-2 GB VRAM) | RTX 30 series or newer |
| SeC-4B-bf16.safetensors | 7.35 GB | Alternative to FP16 | All CUDA GPUs |
| SeC-4B-fp32.safetensors | 14.14 GB | Full precision | All CUDA GPUs |

SeC (Segment Concept) uses large vision-language models for video object segmentation, achieving a +11.8-point improvement over SAM 2.1 on complex semantic scenarios (SeCVOS benchmark).

Key features:
- Concept-driven tracking with semantic understanding
- Handles occlusions and appearance changes
- Bidirectional tracking support
- State-of-the-art performance on multiple benchmarks

These models are designed for use with the ComfyUI SeC Nodes custom nodes.

Installation:
1. Download your preferred model format
2. Place it in `ComfyUI/models/sams/`
3. Install ComfyUI SeC Nodes
4. The model will be automatically detected and available in the SeC Model Loader

These are converted single-file versions of the original model:
- Original repository: OpenIXCLab/SeC-4B
- Paper: arXiv:2507.15852
- Official implementation: github.com/OpenIXCLab/SeC

Original model: developed by OpenIXCLab
- Model architecture and weights: Apache 2.0 License
- Paper: Zhang et al., "SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction"

Single-file conversions: created for ComfyUI SeC Nodes
- Conversion script and ComfyUI integration: 9nate-drake
- FP8 quantization support via torchao

If you use this model in your research, please cite the original SeC paper.
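The format table above can be read as a simple decision rule. The helper below is a hypothetical sketch (it is not part of ComfyUI SeC Nodes, and the VRAM thresholds are my assumptions, not from the card): default to the recommended fp16 file, fall back to fp8 only on RTX 30-series-or-newer GPUs when VRAM is tight, and use fp32 only when memory is plentiful.

```python
# Hypothetical helper encoding the format table above; thresholds are
# illustrative assumptions, not official ComfyUI SeC Nodes guidance.
def pick_sec_checkpoint(vram_gb: float, rtx30_or_newer: bool) -> str:
    """Return a SeC-4B single-file name for the given GPU budget."""
    if vram_gb >= 16:
        return "SeC-4B-fp32.safetensors"  # full precision, 14.14 GB file
    if vram_gb >= 10 or not rtx30_or_newer:
        return "SeC-4B-fp16.safetensors"  # recommended balance, 7.35 GB
    return "SeC-4B-fp8.safetensors"       # saves 1.5-2 GB VRAM; needs RTX 30+

print(pick_sec_checkpoint(24, True))   # roomy GPU -> full precision
print(pick_sec_checkpoint(8, True))    # tight VRAM, fp8-capable -> fp8
print(pick_sec_checkpoint(8, False))   # tight VRAM, older GPU -> fp16
```

The chosen file would then be placed in `ComfyUI/models/sams/` per the installation steps above.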

license:apache-2.0