checkpointsdup98


HunyuanImage-2.1

HunyuanImage-2.1: An Efficient Diffusion Model for High-Resolution (2K) Text-to-Image Generation

This repo contains PyTorch model definitions, pretrained weights, and inference/sampling code for HunyuanImage-2.1. You can find more visualizations on our project page.

🔥🔥🔥 Latest Updates

- September 12, 2025: 🚀 Released FP8 quantized models, making it possible to generate 2K images with only 24 GB of GPU memory!
- September 8, 2025: 🚀 Released inference code and model weights for HunyuanImage-2.1.

Abstract

We present HunyuanImage-2.1, a highly efficient text-to-image model capable of generating 2K (2048 × 2048) images. Leveraging an extensive dataset and structured captions produced by multiple expert models, we significantly enhance text-image alignment. The model employs a highly expressive VAE with a 32 × 32 spatial compression ratio, substantially reducing computational costs. Our architecture consists of two stages:

1. Base text-to-image model: The first stage is a text-to-image model that uses two text encoders: a multimodal large language model (MLLM) to improve image-text alignment, and a multi-language, character-aware encoder to enhance text rendering across languages. This stage features a single- and dual-stream diffusion transformer with 17 billion parameters. To optimize aesthetics and structural coherence, we apply reinforcement learning from human feedback (RLHF).
2. Refiner model: The second stage introduces a refiner model that further enhances image quality and clarity while minimizing artifacts.

Additionally, we developed the PromptEnhancer module to further boost model performance and employed meanflow distillation for efficient inference. HunyuanImage-2.1 demonstrates robust semantic alignment and cross-scenario generalization, yielding improved consistency between text and image, enhanced control of scene details, character poses, and expressions, and the ability to generate multiple objects with distinct descriptions.

Training Data and Caption

Structured captions provide hierarchical semantic information at short, medium, long, and extra-long levels, significantly enhancing the model's responsiveness to complex semantics. Innovatively, an OCR agent and IP RAG are introduced to address the shortcomings of general VLM captioners in dense-text and world-knowledge descriptions, while a bidirectional verification strategy ensures caption accuracy.

Text-to-Image Model Architecture

Core components:
- High-compression VAE with REPA training acceleration: A VAE with a 32× compression rate drastically reduces the number of input tokens for the DiT model. By aligning its feature space with DINOv2 features, we facilitate the training of high-compression VAEs. As a result, our model generates 2K images with the same token length (and thus similar inference time) that other models require for 1K images, achieving superior inference efficiency; a worked token-count example follows this list. A multi-bucket, multi-resolution REPA loss aligns DiT features with a high-dimensional semantic feature space, accelerating model convergence.
- Dual text encoder: A vision-language multimodal encoder is employed to better understand scene descriptions, character actions, and detailed requirements, while a multilingual ByT5 text encoder is introduced to specialize in text generation and multilingual expression.
- Network: A single- and dual-stream diffusion transformer with 17 billion parameters.
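To make the efficiency claim concrete, here is a small illustrative calculation (not code from this repo; the assumption of one DiT token per latent-grid cell, i.e. patch size 1, is ours) showing why a 32× VAE at 2048 × 2048 yields the same sequence length as a typical 16× setup at 1024 × 1024:

```python
def dit_tokens(resolution: int, vae_compression: int, patch_size: int = 1) -> int:
    """Number of DiT input tokens for a square image.

    Assumes the DiT consumes one token per patch_size x patch_size block
    of the VAE latent grid (patch_size=1 is an assumption, not a repo value).
    """
    latent_side = resolution // vae_compression   # side length of the latent grid
    return (latent_side // patch_size) ** 2       # tokens in the sequence

# 32x-compression VAE at 2K vs. a typical 16x-compression VAE at 1K:
print(dit_tokens(2048, 32))  # 4096 tokens
print(dit_tokens(1024, 16))  # 4096 tokens -> same sequence length, similar cost
```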
Reinforcement Learning from Human Feedback

Two-stage post-training with reinforcement learning: supervised fine-tuning (SFT) and reinforcement learning (RL) are applied sequentially in two post-training stages. We introduce a Reward Distribution Alignment algorithm, which innovatively incorporates high-quality images as selected samples to ensure stable and improved reinforcement-learning outcomes.

Rewriting Model

The first systematic, industrial-level rewriting model. SFT training structurally rewrites user text instructions to enrich visual expression, while GRPO training employs a fine-grained semantic AlignEvaluator reward model to substantially improve the semantics of images generated from rewritten text. The AlignEvaluator covers 6 major categories and 24 fine-grained assessment points. PromptEnhancer supports both Chinese and English rewriting and demonstrates general applicability in enhancing semantics for both open-source and proprietary text-to-image models.

Model Distillation

We propose a novel distillation method based on meanflow that addresses the key challenges of instability and inefficiency inherent in standard meanflow training. This approach enables high-quality image generation with only a few sampling steps. To our knowledge, this is the first successful application of meanflow to an industrial-scale model.

🎉 HunyuanImage-2.1 Key Features

- High-quality generation: efficiently produces ultra-high-definition (2K) images with cinematic composition.
- Multilingual support: native support for both Chinese and English prompts.
- Advanced architecture: built on a multi-modal, single- and dual-stream combined DiT (Diffusion Transformer) backbone.
- Glyph-aware processing: utilizes ByT5's text-rendering capabilities for improved text-generation accuracy.
- Flexible aspect ratios: supports a variety of image aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3).
- Prompt enhancement: automatically rewrites prompts to improve descriptive accuracy and visual quality.

Prompt Enhanced Demo

To improve the quality and detail of generated images, we use a prompt-rewriting model that automatically enhances user-provided prompts with detailed, descriptive information.

📈 Comparisons

SSAE Evaluation

SSAE (Structured Semantic Alignment Evaluation) is an intelligent evaluation metric for image-text alignment based on advanced multimodal large language models (MLLMs). We extracted 3500 key points across 12 categories, then used MLLMs to automatically evaluate and score the generated images against these key points. Mean Image Accuracy is the image-wise average: each image's key-point scores are averaged first, then averaged across images. Global Accuracy averages directly over all key points, pooled across images. The sketch below illustrates the two aggregations.
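A minimal sketch of the two aggregate scores as just defined (the per-key-point scores and their grouping by image are assumptions about data layout, not the actual evaluation code):

```python
from statistics import mean

# scores[i][k] = MLLM score for key point k of image i (layout assumed)
scores = [
    [1.0, 0.0, 1.0],  # image 0: three key points
    [1.0, 1.0],       # image 1: two key points
]

# Mean Image Accuracy: average the per-image means.
mean_image_acc = mean(mean(img) for img in scores)

# Global Accuracy: average over all key points, pooled across images.
global_acc = mean(s for img in scores for s in img)

print(mean_image_acc)  # (0.667 + 1.0) / 2 ~= 0.833
print(global_acc)      # 4 / 5 = 0.8
```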
| Model | Open Source | Mean Image Accuracy | Global Accuracy | Primary Subject (Noun / Key Attr. / Other Attr. / Action) | Secondary Subject (Noun / Attr. / Action) | Scene (Noun / Attr.) | Other (Shot / Style / Composition) |
|---|---|---|---|---|---|---|---|
| FLUX-dev | ✅ | 0.7122 | 0.6995 | 0.7965 / 0.7824 / 0.5993 / 0.5777 | 0.7950 / 0.6826 / 0.6923 | 0.8453 / 0.8094 | 0.6452 / 0.7096 / 0.6190 |
| Seedream-3.0 | ❌ | 0.8827 | 0.8792 | 0.9490 / 0.9311 / 0.8242 / 0.8177 | 0.9747 / 0.9103 / 0.8400 | 0.9489 / 0.8848 | 0.7582 / 0.8726 / 0.7619 |
| Qwen-Image | ✅ | 0.8854 | 0.8828 | 0.9502 / 0.9231 / 0.8351 / 0.8161 | 0.9938 / 0.9043 / 0.8846 | 0.9613 / 0.8978 | 0.7634 / 0.8548 / 0.8095 |
| GPT-Image | ❌ | 0.8952 | 0.8929 | 0.9448 / 0.9289 / 0.8655 / 0.8445 | 0.9494 / 0.9283 / 0.8800 | 0.9432 / 0.9017 | 0.7253 / 0.8582 / 0.7143 |
| HunyuanImage 2.1 | ✅ | 0.8888 | 0.8832 | 0.9339 / 0.9341 / 0.8363 / 0.8342 | 0.9627 / 0.8870 / 0.9615 | 0.9448 / 0.9254 | 0.7527 / 0.8689 / 0.7619 |

From the SSAE evaluation results, our model currently achieves the best semantic-alignment performance among open-source models and comes very close to closed-source commercial models (GPT-Image).

GSB Evaluation

We adopted the GSB (Good/Same/Bad) evaluation method commonly used to assess the relative performance of two models from an overall image-perception perspective. In total, we used 1000 text prompts, generating an equal number of image samples for all compared models in a single run. For a fair comparison, we ran inference only once per prompt, avoiding any cherry-picking of results, and kept the default settings for all baseline models. The evaluation was performed by more than 100 professional evaluators. HunyuanImage 2.1 achieved a relative win rate of -1.36% against Seedream3.0 (closed-source) and +2.89% against Qwen-Image (open-source); a sketch of this computation appears at the end of this entry. These results demonstrate that HunyuanImage 2.1, as an open-source model, has reached an image-generation quality comparable to closed-source commercial models (Seedream3.0), while showing certain advantages over comparable open-source models (Qwen-Image). This validates the technical advancement and practical value of HunyuanImage 2.1 in text-to-image generation.

📜 System Requirements

Hardware and OS requirements:
- NVIDIA GPU with CUDA support. Minimum requirement for now: 24 GB of GPU memory for 2048x2048 image generation.
- Supported operating system: Linux.

> Note: The memory requirement above is measured with model CPU offloading and FP8 quantization enabled. If your GPU has sufficient memory, you may disable offloading for improved inference speed.

🧱 Download Pretrained Models

The details of downloading the pretrained models are shown here.

🔑 Usage

HunyuanImage-2.1 only supports 2K image generation (e.g., 2048x2048 for 1:1 images, 2560x1536 for 16:9 images, etc.); generating images at 1K resolution will produce artifacts. We also recommend using the full generation pipeline for better quality (i.e., enabling both prompt enhancement and the refiner).

🔗 BibTeX

If you find this project useful for your research and applications, please cite as:

Acknowledgements

We would like to thank the following open-source projects and communities for their contributions to open research and exploration: Qwen, FLUX, diffusers, and HuggingFace.
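The text does not spell out how the GSB relative win rate is computed; a common convention is (Good - Bad) / total votes, which the sketch below assumes. The vote split shown is hypothetical, not the study's actual tallies:

```python
def gsb_relative_win_rate(good: int, same: int, bad: int) -> float:
    """Relative win rate under the common (Good - Bad) / total convention.

    'good'/'bad' count pairwise comparisons our model won/lost; 'same'
    counts ties. The formula is an assumption; the source does not state it.
    """
    total = good + same + bad
    return (good - bad) / total

# Hypothetical vote split for illustration:
print(gsb_relative_win_rate(300, 414, 286))  # 0.014 -> reported as +1.4%
```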


Wan2.2 Remix

👉 Join our Telegram group for updates and feedback

Wan2.2-Remix is a research and creative model designed for producing short, imaginative video clips from text prompts. It explores human figure dynamics, body-movement expressiveness, and scene consistency, delivering smoother motion and more realistic character interactions. No additional LoRA setup is required.

Base models:
- Lightx2v Wan2.2 Lightning dyno version (quantized to fp8)
- Comfy-Org Wan 2.2 ComfyUI Repackaged (fp16, quantized to fp8)

(A generic sketch of this kind of fp8 weight cast appears after this entry.)

Enhancements: blended with open-source LoRA resources and customized training data to enrich motion, body posture, and conceptual fidelity.
Integration: already bundled with lightx2v; no extra setup required.
Recommended CLIP:

This release is currently beta. While functional, it is not yet perfect; feedback and testing are highly appreciated.

Updates:
- High-noise model updated to the Lightx2v dyno version, quantized to fp8
- Low-noise model updated to the Comfy-Org fp16 version, quantized to fp8
- Improved character motion coherence
- Enhanced natural human limb articulation
- Refined body-motion balance for smoother expression
- Improved anatomical rendering for both male and female figures
- Adjusted walking-animation weights for realism
- Enhanced visual clarity in intimate interaction scenarios
- Introduced basic body-dynamics simulation
- Added anatomical detail to human figures
- Improved facial rendering and expression consistency
- Added dynamic actions: walking, dancing, running
- Enhanced pose generation and overall gesture variety

This model is released for research and creative purposes only. It must not be used for illegal, unethical, or harmful activities. The author assumes no responsibility for any misuse or consequences arising from its use.
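Both checkpoints above are described as "quantized to fp8". As a rough illustration of what that involves, here is a minimal, generic per-tensor fp8 (e4m3) weight cast in PyTorch; this is an assumed generic recipe for illustration, not the quantization script used for this release:

```python
import torch

def quantize_fp8(w: torch.Tensor):
    """Generic per-tensor scaled cast to float8_e4m3fn (PyTorch >= 2.1).

    Illustrative only; not the script used to produce this release.
    """
    fp8_max = torch.finfo(torch.float8_e4m3fn).max        # 448.0 for e4m3fn
    scale = w.abs().max().float().clamp(min=1e-12) / fp8_max
    w_fp8 = (w.float() / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale                                    # store both on disk

def dequantize_fp8(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover a usable half-precision weight at load time."""
    return w_fp8.to(torch.float16) * scale

w = torch.randn(1024, 1024, dtype=torch.float16)           # stand-in weight tensor
w_q, s = quantize_fp8(w)
err = (w - dequantize_fp8(w_q, s)).abs().mean()
print(f"mean abs quantization error: {err.item():.5f}")
```

Per-tensor scaling like this roughly halves weight storage relative to fp16 at some cost in precision, which is why the fp8 variants fit on smaller GPUs.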


HoloCine

License: CC BY-NC-SA 4.0