tencent

309,328

864

Hunyuan3D-2

--- library_name: hunyuan3d-2 license: other license_name: tencent-hunyuan-community license_link: https://huggingface.co/tencent/Hunyuan3D-2/blob/main/LICENSE.txt language: - en - zh tags: - image-to-3d - text-to-3d pipeline_tag: image-to-3d extra_gated_eu_disallowed: true ---

100,232

1,653

HY-MT1.5-7B

HunyuanImage-3.0

🎨 HunyuanImage-3.0: A Powerful Native Multimodal Model for Image Generation 👏 Join our WeChat and Discord | 💻 Official website(官网) Try our model! &nbsp&nbsp 🔥🔥🔥 News - September 28, 2025: 📖 ...

43,181

726

Hunyuan3D-2.1

If you found this repository helpful, please cite our report: We would like to thank the contributors to the TripoSG, DINOv2, Stable Diffusion, FLUX, diffusers and HuggingFace repositories, for their open research and exploration.

38,773

714

HunyuanWorld-Mirror

HunyuanWorld-Mirror is a versatile feed-forward model for comprehensive 3D geometric prediction. It integrates diverse geometric priors (camera poses, calibrated intrinsics, depth maps) and simultaneously generates various 3D representations (point clouds, multi-view depths, camera parameters, surface normals, 3D Gaussians) in a single forward pass. Architecture HunyuanWorld-Mirror consists of two key components: (1) Multi-Modal Prior Prompting: A mechanism that embeds diverse prior modalities, including calibrated intrinsics, camera pose, and depth, into the feed-forward model. Given any subset of the available priors, we utilize several lightweight encoding layers to convert each modality into structured tokens. (2) Universal Geometric Prediction: A unified architecture capable of handling the full spectrum of 3D reconstruction tasks from camera and depth estimation to point map regression, surface normal estimation, and novel view synthesis. If you find HunyuanWorld-Mirror useful for your research and applications, please cite using this BibTeX: Acknowledgements We would like to thank HunyuanWorld. We also sincerely thank the authors and contributors of VGGT, Fast3R, CUT3R, and DUSt3R for their outstanding open-source work and pioneering research.

26,203

413

Hunyuan-MT-7B

🤗  Hugging Face   |   🕹️  Demo   |   🤖  ModelScope 🖥️  Official Website   |   GitHub   |   Te...

DepthCrafter

15,706

100

Hunyuan-MT-7B-fp8

🤗  Hugging Face   |   🤖  ModelScope   |   🪡  AngelSlim 🖥️  Official Website   |   🕹️  Demo      Hunyuan-MT-Chimera-7B-fp8 was produced by AngelSlim. The Hunyuan Translation Model comprises a translation model, Hunyuan-MT-7B, and an ensemble model, Hunyuan-MT-Chimera. The translation model is used to translate source text into the target language, while the ensemble model integrates multiple translation outputs to produce a higher-quality result. It primarily supports mutual translation among 33 languages, including five ethnic minority languages in China. - In the WMT25 competition, the model achieved first place in 30 out of the 31 language categories it participated in. - Hunyuan-MT-7B achieves industry-leading performance among models of comparable scale - Hunyuan-MT-Chimera-7B is the industry’s first open-source translation ensemble model, elevating translation quality to a new level - A comprehensive training framework for translation models has been proposed, spanning from pretrain → cross-lingual pretraining (CPT) → supervised fine-tuning (SFT) → translation enhancement → ensemble refinement, achieving state-of-the-art (SOTA) results for models of similar size Related News 2025.9.1 We have open-sourced Hunyuan-MT-7B , Hunyuan-MT-Chimera-7B on Hugging Face. 模型链接 | Model Name | Description | Download | | ----------- | ----------- |----------- | Hunyuan-MT-7B | Hunyuan 7B translation model |🤗 Model| | Hunyuan-MT-7B-fp8 | Hunyuan 7B translation model，fp8 quant | 🤗 Model| | Hunyuan-MT-Chimera | Hunyuan 7B translation ensemble model | 🤗 Model| | Hunyuan-MT-Chimera-fp8 | Hunyuan 7B translation ensemble model，fp8 quant | 🤗 Model| Prompt Template for XX XX Translation, excluding ZH XX. Use with transformers First, please install transformers, recommends v4.56.0 The following code snippet shows how to use the transformers library to load and apply the model. !!! If you want to load fp8 model with transformers, you need to change the name"ignoredlayers" in config.json to "ignore" and upgrade the compressed-tensors to compressed-tensors-0.11.0. We recommend using the following set of parameters for inference. Note that our model does not have the default systemprompt. Supported languages: | Languages | Abbr. | Chinese Names | |-------------------|---------|-----------------| | Chinese | zh | 中文 | | English | en | 英语 | | French | fr | 法语 | | Portuguese | pt | 葡萄牙语 | | Spanish | es | 西班牙语 | | Japanese | ja | 日语 | | Turkish | tr | 土耳其语 | | Russian | ru | 俄语 | | Arabic | ar | 阿拉伯语 | | Korean | ko | 韩语 | | Thai | th | 泰语 | | Italian | it | 意大利语 | | German | de | 德语 | | Vietnamese | vi | 越南语 | | Malay | ms | 马来语 | | Indonesian | id | 印尼语 | | Filipino | tl | 菲律宾语 | | Hindi | hi | 印地语 | | Traditional Chinese | zh-Hant| 繁体中文 | | Polish | pl | 波兰语 | | Czech | cs | 捷克语 | | Dutch | nl | 荷兰语 | | Khmer | km | 高棉语 | | Burmese | my | 缅甸语 | | Persian | fa | 波斯语 | | Gujarati | gu | 古吉拉特语 | | Urdu | ur | 乌尔都语 | | Telugu | te | 泰卢固语 | | Marathi | mr | 马拉地语 | | Hebrew | he | 希伯来语 | | Bengali | bn | 孟加拉语 | | Tamil | ta | 泰米尔语 | | Ukrainian | uk | 乌克兰语 | | Tibetan | bo | 藏语 | | Kazakh | kk | 哈萨克语 | | Mongolian | mn | 蒙古语 | | Uyghur | ug | 维吾尔语 | | Cantonese | yue | 粤语 |

9,699

HunyuanWorld-1

"To see a World in a Grain of Sand, and a Heaven in a Wild Flower" Acknowledgements We would like to thank the contributors to the Stable Diffusion, FLUX, diffusers, HuggingFace, Real-ESRGAN, ZIM, GroundingDINO, MoGe, Worldsheet, WorldGen repositories, for their open research.

5,942

588

HY-MT1.5-7B-GGUF

5,658

Hunyuan3D-2mini

“ Living out everyone’s imagination on creating and manipulating 3D assets.” This repository contains the models of the paper Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation. Hunyuan3D-2mini contains a 0.6B shape generator, which is smaller and faster than the previous 1.1B one. For code and more details on how to use it, refer to the Github repository. If you found this repository helpful, please cite our report: Thanks for the contributions of community members, here we have these great extensions of Hunyuan3D 2.0: - ComfyUI-Hunyuan3DWrapper - Hunyuan3D-2-for-windows - 📦 A bundle for running on Windows | 整合包 We would like to thank the contributors to the DINOv2, Stable Diffusion, FLUX, diffusers and HuggingFace repositories, for their open research and exploration.

5,655

101

KaLM-Embedding-Gemma3-12B-2511

KaLM-Embedding-Gemma3-12B-2511 is a versatile and compact embedding model, which achieves SOTA performance in MMTEB (due to 11-2025). | Rank (Borda) | Model | Mean (Task) | Mean (TaskType) | Bitext Mining | Classification | Clustering | Instruction Reranking | Multilabel Classification | Pair Classification | Reranking | Retrieval | STS | | :--- | :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | | 1 | KaLM-Embedding-Gemma3-12B-2511 | 72.32 | 62.51 | 83.76 | 77.88 | 55.77 | 5.49 | 33.03 | 84.73 | 67.27 | 75.66 | 79.02 | | 2 | llama-embed-nemotron-8b | 69.46 | 61.09 | 81.72 | 73.21 | 54.35 | 10.82 | 29.86 | 83.97 | 67.78 | 68.69 | 79.41 | | 3 | Qwen3-Embedding-8B | 70.58 | 61.69 | 80.89 | 74.00 | 57.65 | 10.06 | 28.66 | 86.40 | 65.63 | 70.88 | 81.08 | | 4 | gemini-embedding-001 | 68.37 | 59.59 | 79.28 | 71.82 | 54.59 | 5.18 | 29.16 | 83.63 | 65.58 | 67.71 | 79.40 | | 5 | Qwen3-Embedding-4B | 69.45 | 60.86 | 79.36 | 72.33 | 57.15 | 11.56 | 26.77 | 85.05 | 65.08 | 69.60 | 80.86 | | 6 | Qwen3-Embedding-0.6B | 64.34 | 56.01 | 72.23 | 66.83 | 52.33 | 5.09 | 24.59 | 80.83 | 61.41 | 64.65 | 76.17 | | 7 | gte-Qwen2-7B-instruct | 62.51 | 55.93 | 73.92 | 61.55 | 52.77 | 4.94 | 25.48 | 85.13 | 65.55 | 60.08 | 73.98 | | 8 | Linq-Embed-Mistral | 61.47 | 54.14 | 70.34 | 62.24 | 50.60 | 0.94 | 24.77 | 80.43 | 64.37 | 58.69 | 74.86 | | 9 | multilingual-e5-large-instruct | 63.22 | 55.08 | 80.13 | 64.94 | 50.75 | -0.40 | 22.91 | 80.86 | 62.61 | 57.12 | 76.81 | | 10 | embeddinggemma-300m | 61.15 | 54.31 | 64.40 | 60.90 | 51.17 | 5.61 | 24.82 | 81.40 | 63.25 | 62.49 | 74.73 | Model Details - Model Size: 11.76B - Embedding Dimension: 3840 - Max Input Tokens: 32k - MRL dimensions: 3840, 2048, 1024, 512, 256, 128, and 64 - Pooling: lasttoken pooling Usage sentence-transformers support Using this model becomes easy when you have sentence-transformers installed: Or you can use `encodequery` and `encodedocument` to automatically add the default prompt for queries (`"Instruct: Given a query, retrieve documents that answer the query \nQuery: "`) and documents (`""`), respectively. vllm support Note: Since vllm only supports the Gemma3ForCausalLM model class and not Gemma3TextModel, model parameters must be loaded by specifying the CausalLM branch via `revision="CausalLM"`. Citation If you find this model useful, please consider giving a star and citation. Contact If you encounter any issue, feel free to contact us via the email: ,

5,591

Hunyuan-MT-Chimera-7B

🤗  Hugging Face   |   🤖  ModelScope   |   🖥️  Official Website   |   🕹️  Demo      The Hunyuan Translation Model comprises a translation model, Hunyuan-MT-7B, and an ensemble model, Hunyuan-MT-Chimera. The translation model is used to translate source text into the target language, while the ensemble model integrates multiple translation outputs to produce a higher-quality result. It primarily supports mutual translation among 33 languages, including five ethnic minority languages in China. - In the WMT25 competition, the model achieved first place in 30 out of the 31 language categories it participated in. - Hunyuan-MT-7B achieves industry-leading performance among models of comparable scale - Hunyuan-MT-Chimera-7B is the industry’s first open-source translation ensemble model, elevating translation quality to a new level - A comprehensive training framework for translation models has been proposed, spanning from pretrain → cross-lingual pretraining (CPT) → supervised fine-tuning (SFT) → translation enhancement → ensemble refinement, achieving state-of-the-art (SOTA) results for models of similar size Related News 2025.9.1 We have open-sourced Hunyuan-MT-7B , Hunyuan-MT-Chimera-7B on Hugging Face. 模型链接 | Model Name | Description | Download | | ----------- | ----------- |----------- | Hunyuan-MT-7B | Hunyuan 7B translation model |🤗 Model| | Hunyuan-MT-7B-fp8 | Hunyuan 7B translation model，fp8 quant | 🤗 Model| | Hunyuan-MT-Chimera | Hunyuan 7B translation ensemble model | 🤗 Model| | Hunyuan-MT-Chimera-fp8 | Hunyuan 7B translation ensemble model，fp8 quant | 🤗 Model| Prompt Template for XX XX Translation, excluding ZH XX. Use with transformers First, please install transformers, recommends v4.56.0 The following code snippet shows how to use the transformers library to load and apply the model. !!! If you want to load fp8 model with transformers, you need to change the name"ignoredlayers" in config.json to "ignore" and upgrade the compressed-tensors to compressed-tensors-0.11.0. We recommend using the following set of parameters for inference. Note that our model does not have the default systemprompt. Supported languages: | Languages | Abbr. | Chinese Names | |-------------------|---------|-----------------| | Chinese | zh | 中文 | | English | en | 英语 | | French | fr | 法语 | | Portuguese | pt | 葡萄牙语 | | Spanish | es | 西班牙语 | | Japanese | ja | 日语 | | Turkish | tr | 土耳其语 | | Russian | ru | 俄语 | | Arabic | ar | 阿拉伯语 | | Korean | ko | 韩语 | | Thai | th | 泰语 | | Italian | it | 意大利语 | | German | de | 德语 | | Vietnamese | vi | 越南语 | | Malay | ms | 马来语 | | Indonesian | id | 印尼语 | | Filipino | tl | 菲律宾语 | | Hindi | hi | 印地语 | | Traditional Chinese | zh-Hant| 繁体中文 | | Polish | pl | 波兰语 | | Czech | cs | 捷克语 | | Dutch | nl | 荷兰语 | | Khmer | km | 高棉语 | | Burmese | my | 缅甸语 | | Persian | fa | 波斯语 | | Gujarati | gu | 古吉拉特语 | | Urdu | ur | 乌尔都语 | | Telugu | te | 泰卢固语 | | Marathi | mr | 马拉地语 | | Hebrew | he | 希伯来语 | | Bengali | bn | 孟加拉语 | | Tamil | ta | 泰米尔语 | | Ukrainian | uk | 乌克兰语 | | Tibetan | bo | 藏语 | | Kazakh | kk | 哈萨克语 | | Mongolian | mn | 蒙古语 | | Uyghur | ug | 维吾尔语 | | Cantonese | yue | 粤语 |

5,558

HY-MT1.5-1.8B-GGUF

5,547

Youtu-VL-4B-Instruct-GGUF

llama.cpp

4,370

SRPO

Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference Xiangwei Shen 1,2 , Zhimin Li 1 , Zhantao Yang 1 , Shiyi Zhang 3 , Yingfang Zhang 1 , Donghao Li 1 , 2 School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen 3 Shenzhen International Graduate School, Tsinghua University Abstract Recent studies have demonstrated the effectiveness of directly aligning diffusion models with human preferences using differentiable reward. However, they exhibit two primary challenges: (1) they rely on multistep denoising with gradient computation for reward scoring, which is computationally expensive, thus restricting optimization to only a few diffusion steps; (2) they often need continuous offline adaptation of reward models in order to achieve desired aesthetic quality, such as photorealism or precise lighting effects. To address the limitation of multistep denoising, we propose Direct-Align, a method that predefines a noise prior to effectively recover original images from any time steps via interpolation, leveraging the equation that diffusion states are interpolations between noise and target images, which effectively avoids over-optimization in late timesteps. Furthermore, we introduce Semantic Relative Preference Optimization (SRPO), in which rewards are formulated as text-conditioned signals. This approach enables online adjustment of rewards in response to positive and negative prompt augmentation, thereby reducing the reliance on offline reward fine-tuning. By fine-tuning the FLUX.1.dev model with optimized denoising and online reward adjustment, we improve its human-evaluated realism and aesthetic quality by over 3x. We sincerely appreciate contributions from the research community to this project. Below are quantized versions developed by fellow researchers. 1. 8bit(fp8e4m3fn/Q80) version by wikeeyang: https://huggingface.co/wikeeyang/SRPO-Refine-Quantized-v1.0 2. bf16 version by rockerBOO: https://huggingface.co/rockerBOO/flux.1-dev-SRPO 3. GGUF version by befox: https://huggingface.co/befox/SRPO-GGUF ⚠️ Note: When loading weights in ComfyUI, avoid direct conversion of FP32 weights to FP8 format, as this may result in incomplete denoising. For official weights in this repository, FP32/BF16 loading is recommended. Checkpoints The `diffusionpytorchmodel.safetensors` is online version of SRPO based on FLUX.1 Dev, trained on HPD dataset with HPSv2 🔑 Inference Load the following image in ComfyUI to get the workflow, or load the JSON file directly SRPO-workflow: Tip: The workflow JSON info was added to the image file. License SRPO is licensed under the License Terms of SRPO. See `./License.txt` for more details. Citation If you use SRPO for your research, please cite our paper:

4,229

957

Youtu-VL-4B-Instruct

Hunyuan-7B-Instruct

3,249

Hunyuan3D-2mv

“ Living out everyone’s imagination on creating and manipulating 3D assets.” This repository contains the models of the paper Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation. Hunyuan3D-2mv is finetuned from Hunyuan3D-2 to support multiview controlled shape generation. For code and more details on how to use it, refer to the Github repository. If you found this repository helpful, please cite our report: Thanks for the contributions of community members, here we have these great extensions of Hunyuan3D 2.0: - ComfyUI-Hunyuan3DWrapper - Hunyuan3D-2-for-windows - 📦 A bundle for running on Windows | 整合包 We would like to thank the contributors to the DINOv2, Stable Diffusion, FLUX, diffusers and HuggingFace repositories, for their open research and exploration.

2,179

392

Hunyuan3D-Omni

1,604

131

Youtu-LLM-2B-Base

1,571

Hunyuan3D-Part

- Pipeline of our image to 3D part generation. It contains two key components, P3-SAM and X-Part. The holistic mesh is fed to part detection module P3-SAM to obtain the semantic features, part segmentations and part bounding boxes. Then X-Part generate the complete parts. P3-SAM： Native 3DPart Segmentation. - Paper: https://arxiv.org/abs/2509.06784. - Code: https://github.com/Tencent-Hunyuan/Hunyuan3D-Part/tree/main/P3-SAM. - Project Page: https://murcherful.github.io/P3-SAM/ . - HuggingFace Demo: https://huggingface.co/spaces/tencent/Hunyuan3D-Part. X-Part： high-fidelity and structure-coherent shapede composition - Paper: https://arxiv.org/abs/2509.08643. - Code: https://github.com/Tencent-Hunyuan/Hunyuan3D-Part/tree/main/XPart. - Project Page: https://yanxinhao.github.io/Projects/X-Part/. - HuggingFace Demo: https://huggingface.co/spaces/tencent/Hunyuan3D-Part. Notice - The current release is a light version of X-Part. The full-blood version is available on [](https://3d.hunyuan.tencent.com/studio). - For X-Part, we recommend using scanned or AI-generated meshes (e.g., from Hunyuan3D V2.5 or V3.0) as input. - P3-SAM can handle any input mesh. 🔗 Citation If you found this repository helpful, please cite our reports:

1,551

501

HunyuanVideo

1,487

2,078

Youtu-LLM-2B

Penguin-VL-8B

1,340

HY-WorldPlay

1,239

316

Hunyuan3D-1

1,140

309

HY-MT1.5-1.8B

WeDLM-8B-Instruct

HY-Embodied-0.5

Youtu-Embedding

🤗  Hugging Face   |   🖥️  GitHub   |   🌎  Technical Report 💬  WeChat   |   🤖  Discord Youtu-Embedding is a state-of-the-art, general-purpose text embedding model developed by Tencent Youtu Lab. It delivers exceptional performance across a wide range of natural language processing tasks, including Information Retrieval (IR), Semantic Textual Similarity (STS), Clustering, Reranking, and Classification. - Top-Ranked Performance: Achieved the #1 score of 77.58 on the authoritative CMTEB (Chinese Massive Text Embedding Benchmark) as of September 2025, demonstrating its powerful and robust text representation capabilities. - Innovative Training Framework: Features a Collaborative-Discriminative Fine-tuning Framework designed to resolve the "negative transfer" problem in multi-task learning. This is accomplished through a unified data format, task-differentiated loss functions, and a dynamic single-task sampling mechanism. > Note: You can easily adapt and fine-tune the model on your own datasets for domain-specific tasks. For implementation details, please refer to the training code. | Model Name | Parameters | Dimensions | Sequence Length | Download | | :------------------- | :--------: | :--------: | :-----------------: | :------------------------------------------------------------------------------------------ | | Youtu-Embedding | 2B | 2048 | 8K | Model | 3. Using `LangChain` 🦜 Easily integrate the model into your LangChain applications, such as RAG pipelines. 4. Using `LlamaIndex` 🦙 This is perfect for integrating the model into your LlamaIndex search and retrieval systems. 📊 CMTEB | Model | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retr. | STS | | :------------------------ | :--------------| :----------------- | :----------------- | :----: | :----: | :---------: | :-----: | :----: | :---: | | bge-multilingual-gemma2 | 9B | 67.64 | 68.52 | 75.31 | 59.30 | 79.30 | 68.28 | 73.73 | 55.19 | | ritrieve\zh\v1 | 326M | 72.71 | 73.85 | 76.88 | 66.50 | 85.98 | 72.86 | 76.97 | 63.92 | | Qwen3-Embedding-4B | 4B | 72.27 | 73.51 | 75.46 | 77.89 | 83.34 | 66.05 | 77.03 | 61.26 | | Qwen3-Embedding-8B | 8B | 73.84 | 75.00 | 76.97 | 80.08 | 84.23 | 66.99 | 78.21 | 63.53 | | Conan-embedding-v2 | 1.4B | 74.24 | 75.99 | 76.47 | 68.84 | 92.44 | 74.41 | 78.31 | 65.48 | | Seed1.6-embedding | - | 75.63 | 76.68 | 77.98 | 73.11 | 88.71 | 71.65 | 79.69 | 68.94 | | QZhou-Embedding | 7B | 76.99 | 78.58 | 79.99 | 70.91 | 95.07 | 74.85 | 78.80 | 71.89 | | Youtu-Embedding | 2B | 77.58 | 78.86 | 78.65 | 84.27 | 86.12 | 75.10 | 80.21 | 68.82 | > Note: Comparative scores are from the MTEB leaderboard, recorded on September 28, 2025.

723

Hunyuan-0.5B-Pretrain

688

Hunyuan-A13B-Instruct-GGUF

560

Hunyuan 4B Instruct

🤗  HuggingFace  |  🤖  ModelScope  |  🪡  AngelSlim 🖥️  Official Website   |   🕖  HunyuanAPI   |   🕹️  Demo      Hunyuan is Tencent's open-source efficient large language model series, designed for versatile deployment across diverse computational environments. From edge devices to high-concurrency production systems, these models deliver optimal performance with advanced quantization support and ultra-long context capabilities. We have released a series of Hunyuan dense models, comprising both pre-trained and instruction-tuned variants, with parameter scales of 0.5B, 1.8B, 4B, and 7B. These models adopt training strategies similar to the Hunyuan-A13B, thereby inheriting its robust performance characteristics. This comprehensive model family enables flexible deployment optimization - from resource-constrained edge computing with smaller variants to high-throughput production environments with larger models, all while maintaining strong capabilities across diverse scenarios. - Hybrid Reasoning Support: Supports both fast and slow thinking modes, allowing users to flexibly choose according to their needs. - Ultra-Long Context Understanding: Natively supports a 256K context window, maintaining stable performance on long-text tasks. - Enhanced Agent Capabilities: Optimized for agent tasks, achieving leading results on benchmarks such as BFCL-v3, τ-Bench and C3-Bench. - Efficient Inference: Utilizes Grouped Query Attention (GQA) and supports multiple quantization formats, enabling highly efficient inference. Related News 2025.7.30 We have open-sourced Hunyuan-0.5B-Pretrain , Hunyuan-0.5B-Instruct , Hunyuan-1.8B-Pretrain , Hunyuan-1.8B-Instruct , Hunyuan-4B-Pretrain , Hunyuan-4B-Instruct , Hunyuan-7B-Pretrain ,Hunyuan-7B-Instruct on Hugging Face. Note: The following benchmarks are evaluated by TRT-LLM-backend on several base models. | Model | Hunyuan-0.5B-Pretrain | Hunyuan-1.8B-Pretrain | Hunyuan-4B-Pretrain | Hunyuan-7B-Pretrain| |:------------------:|:---------------:|:--------------:|:-------------:|:---------------:| | MMLU | 54.02 | 64.62 | 74.01 | 79.82 | | MMLU-Redux | 54.72 | 64.42 | 73.53 | 79 | | MMLU-Pro | 31.15 | 38.65 | 51.91 | 57.79 | | SuperGPQA | 17.23 | 24.98 | 27.28 | 30.47 | | BBH | 45.92 | 74.32 | 75.17 | 82.95 | | GPQA | 27.76 | 35.81 | 43.52 | 44.07 | | GSM8K | 55.64 | 77.26 | 87.49 | 88.25 | | MATH | 42.95 | 62.85 | 72.25 | 74.85 | | EvalPlus | 39.71 | 60.67 | 67.76 | 66.96 | | MultiPL-E | 21.83 | 45.92 | 59.87 | 60.41 | | MBPP | 43.38 | 66.14 | 76.46 | 76.19 | | CRUX-O | 30.75 | 36.88 | 56.5 | 60.75 | | Chinese SimpleQA | 12.51 | 22.31 | 30.53 | 38.86 | | simpleQA (5shot) | 2.38 | 3.61 | 4.21 | 5.69 | | Topic | Bench | Hunyuan-0.5B-Instruct | Hunyuan-1.8B-Instruct | Hunyuan-4B-Instruct | Hunyuan-7B-Instruct| |:-------------------:|:----------------------------------------------------:|:-------------:|:------------:|:-----------:|:---------------------:| | Mathematics | AIME 2024 AIME 2025 MATH | 17.2 20 48.5 | 56.7 53.9 86 | 78.3 66.5 92.6 | 81.1 75.3 93.7 | | Science | GPQA-Diamond OlympiadBench | 23.3 29.6 | 47.2 63.4 | 61.1 73.1 | 60.1 76.5 | | Coding | Livecodebench Fullstackbench | 11.1 20.9 | 31.5 42 | 49.4 54.6 | 57 56.3 | | Reasoning | BBH DROP ZebraLogic | 40.3 52.8 34.5 | 64.6 76.7 74.6 | 83 78.2 83.5 | 87.8 85.9 85.1 | | Instruction Following | IF-Eval SysBench | 49.7 28.1 | 67.6 55.5 | 76.6 68 | 79.3 72.7 | | Agent | BFCL v3 τ-Bench ComplexFuncBench C3-Bench | 49.8 14.4 13.9 45.3 | 58.3 18.2 22.3 54.6 | 67.9 30.1 26.3 64.3 | 70.8 35.3 29.2 68.5 | | Long Context | PenguinScrolls longbench-v2 FRAMES | 53.9 34.7 41.9 | 73.1 33.2 55.6 | 83.1 44.1 79.2 | 82 43 78.6 | Use with transformers First, please install transformers. Our model defaults to using slow-thinking reasoning, and there are two ways to disable CoT reasoning. 1. Pass "enablethinking=False" when calling applychattemplate. 2. Adding "/nothink" before the prompt will force the model not to use perform CoT reasoning. Similarly, adding "/think" before the prompt will force the model to perform CoT reasoning. The following code snippet shows how to use the transformers library to load and apply the model. It also demonstrates how to enable and disable the reasoning mode , and how to parse the reasoning process along with the final output. We recommend using the following set of parameters for inference. Note that our model does not have the default systemprompt. If you need to fine-tune our Instruct model, we recommend processing the data into the following format, corresponding to both slow-thinking and fast-thinking scenarios. In the following chapter, we will introduce how to use `LLaMA-Factory` to fine-tune the `Hunyuan` model. Verify installation of the following dependencies: - LLaMA-Factory: Follow official installation guide - DeepSpeed (optional): Follow official installation guide - Transformer Library: Use the companion branch (Hunyuan-submitted code is pending review) We need to prepare a custom dataset: 1. Organize your data in `json` format and place it in the `data` directory in `LLaMA-Factory`. The current implementation uses the `sharegpt` dataset format, which requires the following structure: Refer to the Data Format section mentioned earlier for details. 2. Define your dataset in the data/datasetinfo.json file using the following format: 1. Copy all files from the `train/llamafactorysupport/exampleconfigs` directory to the `example/hunyuan` directory in `LLaMA-Factory`. 2. Modify the model path and dataset name in the configuration file `hunyuanfull.yaml`. Adjust other configurations as needed: 3. Execute training commands: Single-node training Note: Set the environment variable DISABLEVERSIONCHECK to 1 to avoid version conflicts. Multi-node training Execute the following command on each node. Configure NNODES, NODERANK, MASTERADDR, and MASTERPORT according to your environment: Quantization Compression We used our own AngleSlim compression tool to produce FP8 and INT4 quantization models. `AngleSlim` is a toolset dedicated to creating a more user-friendly, comprehensive and efficient model compression solution. FP8 Quantization We use FP8-static quantization, FP8 quantization adopts 8-bit floating point format, through a small amount of calibration data (without training) to pre-determine the quantization scale, the model weights and activation values will be converted to FP8 format, to improve the inference efficiency and reduce the deployment threshold. We you can use AngleSlim quantization, you can also directly download our quantization completed open source model to use LINK. Int4 Quantization We use the GPTQ and AWQ algorithm to achieve W4A16 quantization. GPTQ processes the model weights layer by layer, uses a small amount of calibration data to minimize the reconfiguration error of the quantized weights, and adjusts the weights layer by layer by the optimization process of approximating the Hessian inverse matrix. The process eliminates the need to retrain the model and requires only a small amount of calibration data to quantize the weights, improving inference efficiency and lowering the deployment threshold. AWQ using a small amount of calibration data (without the need for training), the amplitude of the activation values is statistically calculated. For each weight channel, a scaling coefficient s is computed to expand the numerical range of important weights, allowing more information to be retained during quantization. You can use AngleSlim quantization, you can also directly download our quantization completed open source model to use LINK. Quantization Benchmark This subsection describes the Benchmark metrics for the Hunyuan quantitative model. | Bench | Quantization | Hunyuan-0.5B-Instruct | Hunyuan-1.8B-Instruct | Hunyuan-4B-Instruct | Hunyuan-7B-Instruct | |:-------------:|:---------------------------------:|:----------------------------:|:------------------------------:|:----------------------------:|:----------------------------:| | DROP | B16 FP8 Int4GPTQ Int4AWQ | 52.8 51.6 50.9 48.9 | 76.7 75.1 73.0 71.7 | 78.2 78.3 78.1 78.2 | 85.9 86.0 85.7 85.9 | | GPQA-Diamond | B16 FP8 Int4GPTQ Int4AWQ | 23.3 22.5 23.3 23.3 | 47.2 47.7 44.43 43.62 | 61.1 60.2 58.1 - | 60.1 60.1 60.0 60.1 | | OlympiadBench | B16 FP8 Int4GPTQ Int4AWQ | 29.6 29.6 26.8 26.3 | 63.4 62.5 60.9 61.7 | 73.1 73.1 71.1 71.2 | 76.5 76.6 76.2 76.4 | | AIME 2024 | B16 FP8 Int4GPTQ Int4AWQ | 17.2 17.2 - - | 56.7 55.17 - - | 78.3 76.6 - - | 81.1 80.9 81.0 80.9 | For deployment, you can use frameworks such as TensorRT-LLM, vLLM, or SGLang to serve the model and create an OpenAI-compatible API endpoint. image: https://hub.docker.com/r/hunyuaninfer/hunyuan-7B/tags We provide a pre-built Docker image based on the latest version of TensorRT-LLM. We use tencent/Hunyuan-7B-Instruct for example - To get started: - After running service script successfully, run the request script Quantitative model deployment This section describes the process of deploying a post-quantization model using vLLM. Int8 quantitative model deployment Deploying the Int8-weight-only version of the HunYuan-7B model only requires setting the environment variables Int4 quantitative model deployment Deploying the Int4-weight-only version of the HunYuan-7B model only requires setting the environment variables , using the GPTQ method FP8 quantitative model deployment Deploying the W8A8C8 version of the HunYuan-7B model only requires setting the environment variables We also provide a pre-built Docker image based on the latest version of SGLang. If you would like to leave a message for our R&D and product teams, Welcome to contact our open-source team . You can also contact us via email ([email protected]).

555

Hunyuan-0.5B-Instruct

482

HunyuanVideo Foley

Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation Professional-grade AI sound effect generation for video content creators Sizhe Shan 1,2 • Qiulin Li 1,3 • Yutao Cui 1 • Miles Yang 1 • Yuehai Wang 2 • Qun Yang 3 • Jin Zhou 1† • Zhao Zhong 1 🏢 1 Tencent Hunyuan • 🎓 2 Zhejiang University • ✈️ 3 Nanjing University of Aeronautics and Astronautics - [2025.9.29] 🚀 HunyuanVideo-Foley-XL Model Release - Release XL-sized model with offload inference support, significantly reducing VRAM requirements. - [2025.8.28] 🌟 HunyuanVideo-Foley Open Source Release - Inference code and model weights publicly available. Experience the magic of AI-generated Foley audio in perfect sync with video content! --> 🎬 Watch how HunyuanVideo-Foley generates immersive sound effects synchronized with video content --> 🎭 Multi-scenario Sync High-quality audio synchronized with complex video scenes 🧠 Multi-modal Balance Perfect harmony between visual and textual information 🎵 48kHz Hi-Fi Output Professional-grade audio generation with crystal clarity 🚀 Tencent Hunyuan open-sources HunyuanVideo-Foley an end-to-end video sound effect generation model! A professional-grade AI tool specifically designed for video content creators, widely applicable to diverse scenarios including short video creation, film production, advertising creativity, and game development. 🎬 Multi-scenario Audio-Visual Synchronization Supports generating high-quality audio that is synchronized and semantically aligned with complex video scenes, enhancing realism and immersive experience for film/TV and gaming applications. ⚖️ Multi-modal Semantic Balance Intelligently balances visual and textual information analysis, comprehensively orchestrates sound effect elements, avoids one-sided generation, and meets personalized dubbing requirements. 🎵 High-fidelity Audio Output Self-developed 48kHz audio VAE perfectly reconstructs sound effects, music, and vocals, achieving professional-grade audio generation quality. HunyuanVideo-Foley comprehensively leads the field across multiple evaluation benchmarks, achieving new state-of-the-art levels in audio fidelity, visual-semantic alignment, temporal alignment, and distribution matching - surpassing all open-source solutions! 📊 Performance comparison across different evaluation metrics - HunyuanVideo-Foley leads in all categories 🔄 Comprehensive data processing pipeline for high-quality text-video-audio datasets The TV2A (Text-Video-to-Audio) task presents a complex multimodal generation challenge requiring large-scale, high-quality datasets. Our comprehensive data pipeline systematically identifies and excludes unsuitable content to produce robust and generalizable audio generation capabilities. 🧠 HunyuanVideo-Foley hybrid architecture with multimodal and unimodal transformer blocks HunyuanVideo-Foley employs a sophisticated hybrid architecture: - 🔄 Multimodal Transformer Blocks: Process visual-audio streams simultaneously - 🎵 Unimodal Transformer Blocks: Focus on audio stream refinement - 👁️ Visual Encoding: Pre-trained encoder extracts visual features from video frames - 📝 Text Processing: Semantic features extracted via pre-trained text encoder - 🎧 Audio Encoding: Latent representations with Gaussian noise perturbation - ⏰ Temporal Alignment: Synchformer-based frame-level synchronization with gated modulation > Objective and Subjective evaluation results demonstrating superior performance across all metrics | 🏆 Method | PQ ↑ | PC ↓ | CE ↑ | CU ↑ | IB ↑ | DeSync ↓ | CLAP ↑ | MOS-Q ↑ | MOS-S ↑ | MOS-T ↑ | |:-------------:|:--------:|:--------:|:--------:|:--------:|:--------:|:-------------:|:-----------:|:------------:|:------------:|:------------:| | FoleyGrafter | 6.27 | 2.72 | 3.34 | 5.68 | 0.17 | 1.29 | 0.14 | 3.36±0.78 | 3.54±0.88 | 3.46±0.95 | | V-AURA | 5.82 | 4.30 | 3.63 | 5.11 | 0.23 | 1.38 | 0.14 | 2.55±0.97 | 2.60±1.20 | 2.70±1.37 | | Frieren | 5.71 | 2.81 | 3.47 | 5.31 | 0.18 | 1.39 | 0.16 | 2.92±0.95 | 2.76±1.20 | 2.94±1.26 | | MMAudio | 6.17 | 2.84 | 3.59 | 5.62 | 0.27 | 0.80 | 0.35 | 3.58±0.84 | 3.63±1.00 | 3.47±1.03 | | ThinkSound | 6.04 | 3.73 | 3.81 | 5.59 | 0.18 | 0.91 | 0.20 | 3.20±0.97 | 3.01±1.04 | 3.02±1.08 | | HunyuanVideo-Foley (ours) | 6.59 | 2.74 | 3.88 | 6.13 | 0.35 | 0.74 | 0.33 | 4.14±0.68 | 4.12±0.77 | 4.15±0.75 | > Comprehensive objective evaluation showcasing state-of-the-art performance | 🏆 Method | FDPANNs ↓ | FDPASST ↓ | KL ↓ | IS ↑ | PQ ↑ | PC ↓ | CE ↑ | CU ↑ | IB ↑ | DeSync ↓ | CLAP ↑ | |:-------------:|:--------------:|:--------------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:-------------:|:-----------:| | FoleyGrafter | 22.30 | 322.63 | 2.47 | 7.08 | 6.05 | 2.91 | 3.28 | 5.44 | 0.22 | 1.23 | 0.22 | | V-AURA | 33.15 | 474.56 | 3.24 | 5.80 | 5.69 | 3.98 | 3.13 | 4.83 | 0.25 | 0.86 | 0.13 | | Frieren | 16.86 | 293.57 | 2.95 | 7.32 | 5.72 | 2.55 | 2.88 | 5.10 | 0.21 | 0.86 | 0.16 | | MMAudio | 9.01 | 205.85 | 2.17 | 9.59 | 5.94 | 2.91 | 3.30 | 5.39 | 0.30 | 0.56 | 0.27 | | ThinkSound | 9.92 | 228.68 | 2.39 | 6.86 | 5.78 | 3.23 | 3.12 | 5.11 | 0.22 | 0.67 | 0.22 | | HunyuanVideo-Foley (ours) | 6.07 | 202.12 | 1.89 | 8.30 | 6.12 | 2.76 | 3.22 | 5.53 | 0.38 | 0.54 | 0.24 | 🎉 Outstanding Results! HunyuanVideo-Foley achieves the best scores across ALL evaluation metrics, demonstrating significant improvements in audio quality, synchronization, and semantic alignment. 🔧 System Requirements - CUDA: 12.4 or 11.8 recommended - Python: 3.8+ - OS: Linux (primary support) 💡 Tip: We recommend using Conda for Python environment management. Generate Foley audio for a single video file with text description: Process multiple videos using a CSV file with video paths and descriptions: Launch a user-friendly Gradio web interface for easy interaction: 🚀 Then open your browser and navigate to the provided local URL to start generating Foley audio! If you find HunyuanVideo-Foley useful for your research, please consider citing our paper: We extend our heartfelt gratitude to the open-source community! 🌟 Special thanks to all researchers and developers who contribute to the advancement of AI-generated audio and multimodal learning! [](https://github.com/Tencent-Hunyuan) [](https://twitter.com/TencentHunyuan) [](https://hunyuan.tencent.com/) © 2025 Tencent Hunyuan. All rights reserved. | Made with ❤️ for the AI community

461

145

POINTS-Reader

POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion We are delighted to announce that the WePOINTS family has welcomed a new member: POINTS-Reader, a vision-language model for end-to-end document conversion, as introduced in the paper POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion. - 2025.09.12: Quick Start with Transformers🤗: Colab Notebook - 2025.09.11: A live demo of POINTS-Reader is now available on Hugging Face Spaces, thanks to a wonderful contribution from @prithivsakthiur. - 2025.08.27: Support deploying POINTS-Reader with SGLang💪💪💪. - 2025.08.26: We released the weights of the most recent version of POINT-Reader🔥🔥🔥. - 2025.08.21: POINTS-Reader is accepted by EMNLP 2025 for presentation at the Main Conference🎉🎉🎉. 1. Simplicity: POINTS-Reader is a very streamlined model that fully follows the structure of POINTS1.5, except that we have replaced Qwen2.5-7B-Instruct with Qwen2.5-3B-Instruct. Moreover, the input and output of POINTS-Reader are extremely straightforward. The input consists of a fixed prompt and a document image, and the output contains only a string (text extracted from the document image). The model's output is the final result delivered to the user without any post-processing. 2. Performance: Currently, POINTS-Reader supports extraction from both Chinese and English documents, achieving impressive results, with scores of 0.133 for English and 0.212 for Chinese on OmniDocBench. 3. High Throughput: With current mainstream inference frameworks, such as SGLang and vLLM, optimization is predominantly focused on LLMs. Thus, a large ViT would significantly impact the model’s throughput, which is why we selected a ViT with a moderate number of parameters (600M NaViT used in POINTS1.5). Combined with our support for SGLang, we currently achieve a very satisfactory throughput. We will also provide support for vLLM in the future. 4. Open-source Technical Approach: In the POINTS-Reader paper, we propose a two-stage data augmentation strategy. The first stage leverages automated data to endow the model with basic document extraction capabilities. In the subsequent stage, continuous self-evolution improves the quality of data generated by the model. The self-evolution approach in the second stage is highly extensible and can be applied to virtually any model. For comparison, we use the results reported by OmniDocBench and POINTS-Reader. Compared with the version submitted to EMNLP 2025, the current release provides (1) improved performance and (2) support for Chinese documents. Both enhancements build upon the methods proposed in this paper. Method Type Methods Overall Edit ↓ Text Edit ↓ Formula Edit ↓ Formula CDM ↑ Table TEDS ↑ Table Edit ↓ Read Order Edit ↓ Pipeline Tools MinerU-pipeline-2.1.1 0.162 0.244 0.072 0.111 0.313 0.581 79.2 48.8 77.4 79.5 0.166 0.15 0.097 0.136 Marker-1.2.3 0.336 0.556 0.08 0.315 0.53 0.883 17.6 11.7 67.6 49.2 0.619 0.685 0.114 0.34 Marker-1.7.1 0.296 0.497 0.085 0.293 0.374 0.688 79.0 36.7 67.6 54.0 0.609 0.678 0.116 0.329 PaddleOCR PP-StructureV3 0.145 0.206 0.058 0.088 0.295 0.535 81.8 52.1 77.2 83.9 0.159 0.109 0.069 0.091 Mathpix 0.191 0.364 0.105 0.381 0.306 0.454 82.7 64.6 77.0 67.1 0.243 0.32 0.108 0.304 Docling-2.14.0 0.589 0.909 0.416 0.987 0.999 1 - - 61.3 25.0 0.627 0.810 0.313 0.837 Pix2Text-1.1.2.3 0.32 0.528 0.138 0.356 0.276 0.611 78.4 39.6 73.6 66.2 0.584 0.645 0.281 0.499 Unstructured-0.17.2 0.586 0.716 0.198 0.481 0.999 1 - - 0 0.064 1 0.998 0.145 0.387 OpenParse-0.7.0 0.646 0.814 0.681 0.974 0.996 1 0.106 0 64.8 27.5 0.284 0.639 0.595 0.641 Expert VLMs POINTS-Reader-3B 0.133 0.212 0.062 0.139 0.304 0.465 - - 83.7 85.0 0.128 0.136 0.036 0.106 MinerU2.0-2505-0.9B 0.133 0.238 0.045 0.115 0.273 0.506 79.0 50.8 82.1 83.4 0.15 0.209 0.066 0.122 MonkeyOCR-pro-1.2B 0.146 0.221 0.068 0.118 0.272 0.452 76.7 63.3 81.3 85.5 0.149 0.134 0.093 0.179 Dolphin 0.356 0.440 0.352 0.440 0.465 0.604 61.6 40.4 70.2 56.8 0.258 0.367 0.35 0.351 Nanonets-OCR-s 0.283 0.295 0.134 0.231 0.518 0.546 63.2 52.0 76.8 79.4 0.343 0.201 0.135 0.2 OCRFlux-3B 0.238 0.349 0.112 0.256 0.447 0.716 60.2 31.9 69.0 80.0 0.269 0.162 0.126 0.263 GOT-OCR 0.287 0.411 0.189 0.315 0.360 0.528 74.3 45.3 53.2 47.2 0.459 0.52 0.141 0.28 Nougat 0.452 0.973 0.365 0.998 0.488 0.941 15.1 16.8 39.9 0.0 0.572 1.000 0.382 0.954 Mistral OCR 0.268 0.439 0.072 0.325 0.318 0.495 64.6 45.9 75.8 63.6 0.6 0.65 0.083 0.284 OLMOCR-sglang 0.326 0.469 0.097 0.293 0.455 0.655 74.3 43.2 68.1 61.3 0.608 0.652 0.145 0.277 SmolDocling-256Mtransformer 0.493 0.816 0.262 0.838 0.753 0.997 32.1 0.551 44.9 16.5 0.729 0.907 0.227 0.522 Gemini2.0-flash 0.191 0.264 0.091 0.139 0.389 0.584 77.6 43.6 79.7 78.9 0.193 0.206 0.092 0.128 Gemini2.5-Pro 0.148 0.212 0.055 0.168 0.356 0.439 80.0 69.4 85.8 86.4 0.13 0.119 0.049 0.121 GPT4o 0.233 0.399 0.144 0.409 0.425 0.606 72.8 42.8 72.0 62.9 0.234 0.329 0.128 0.251 Qwen2-VL-72B 0.252 0.327 0.096 0.218 0.404 0.487 82.2 61.2 76.8 76.4 0.387 0.408 0.119 0.193 Qwen2.5-VL-7B 0.316 0.399 0.151 0.243 0.376 0.5 75.3 57.3 71.1 71.3 0.598 0.627 0.138 0.226 Qwen2.5-VL-72B 0.214 0.261 0.092 0.18 0.315 0.434 81.4 64.1 81.4 83.0 0.341 0.262 0.106 0.168 InternVL2-76B 0.44 0.443 0.353 0.290 0.543 0.701 67.4 44.1 63.0 60.2 0.547 0.555 0.317 0.228 InternVL3-78B 0.218 0.296 0.117 0.21 0.38 0.533 79.2 58.8 69.0 73.9 0.279 0.282 0.095 0.161 This following code snippet has been tested with following environment: If you encounter environment issues, please feel free to open an issue. If you encounter issues like repeation, please try to increase the resolution of the image to allievate the problem. We have created a Pull Request for SGLang. You can check out this branch and install SGLang in editable mode by following the official guide prior to the merging of this PR. You can deploy POINTS-Reader with SGLang using the following command: You can use the following code to obtain results from SGLang: - Complex Document Parsing: POINTS-Reader can struggle with complex layouts (e.g., newspapers), often producing repeated or missing content. - Handwritten Document Parsing: It also has difficulty handling handwritten inputs (e.g., receipts, notes), which can lead to recognition errors or omissions. - Multi-language Document Parsing: POINTS-Reader currently supports only English and Chinese, limiting its effectiveness on other languages. If you use this model in your work, please cite the following paper:

441

Hunyuan-1.8B-Instruct

Hunyuan-4B-Pretrain

351

HunyuanImage 2.1

HunyuanImage-2.1: An Efficient Diffusion Model for High-Resolution (2K) Text-to-Image Generation This repo contains PyTorch model definitions, pretrained weights and inference/sampling code for our HunyuanImage-2.1. You can find more visualizations on our project page. - September 12, 2025: 🚀 Released FP8 quantized models! Making it possible to generate 2K images with only 24GB GPU memory! - September 8, 2025: 🚀 Released inference code and model weights for HunyuanImage-2.1. Contents - HunyuanImage-2.1: An Efficient Diffusion Model for High-Resolution (2K) Text-to-Image Generation - 🔥🔥🔥 Latest Updates - 🎥 Demo - Contents - Abstract - HunyuanImage-2.1 Overall Pipeline - Training Data and Caption - Text-to-Image Model Architecture - Reinforcement Learning from Human Feedback - Rewriting Model - Model distillation - 🎉 HunyuanImage-2.1 Key Features - Prompt Enhanced Demo - 📈 Comparisons - SSAE Evaluation - GSB Evaluation - 📜 System Requirements - 🛠️ Dependencies and Installation - 🧱 Download Pretrained Models - 🔑 Usage - 🔗 BibTeX - Acknowledgements - Github Star History Abstract We present HunyuanImage-2.1, a highly efficient text-to-image model that is capable of generating 2K (2048 × 2048) resolution images. Leveraging an extensive dataset and structured captions involving multiple expert models, we significantly enhance text-image alignment capabilities. The model employs a highly expressive VAE with a (32 × 32) spatial compression ratio, substantially reducing computational costs. Our architecture consists of two stages: 1. Base text-to-image Model: The first stage is a text-to-image model that utilizes two text encoders: a multimodal large language model (MLLM) to improve image-text alignment, and a multi-language, character-aware encoder to enhance text rendering across various languages. This stage features a single- and dual-stream diffusion transformer with 17 billion parameters. To optimize aesthetics and structural coherence, we apply reinforcement learning from human feedback (RLHF). 2. Refiner Model: The second stage introduces a refiner model that further enhances image quality and clarity, while minimizing artifacts. Additionally, we developed the PromptEnhancer module to further boost model performance, and employed meanflow distillation for efficient inference. HunyuanImage-2.1 demonstrates robust semantic alignment and cross-scenario generalization, leading to improved consistency between text and image, enhanced control of scene details, character poses, and expressions, and the ability to generate multiple objects with distinct descriptions. Structured captions provide hierarchical semantic information at short, medium, long, and extra-long levels, significantly enhancing the model’s responsiveness to complex semantics. Innovatively, an OCR agent and IP RAG are introduced to address the shortcomings of general VLM captioners in dense text and world knowledge descriptions, while a bidirectional verification strategy ensures caption accuracy. Core Components: High-Compression VAE with REPA Training Acceleration: A VAE with a 32× compression rate drastically reduces the number of input tokens for the DiT model. By aligning its feature space with DINOv2 features, we facilitate the training of high-compression VAEs. As a result, our model generates 2K images with the same token length (and thus similar inference time) as other models require for 1K images, achieving superior inference efficiency. Multi-bucket, multi-resolution REPA loss aligns DiT features with a high-dimensional semantic feature space, accelerating model convergence. Dual Text Encoder: A vision-language multimodal encoder is employed to better understand scene descriptions, character actions, and detailed requirements. A multilingual ByT5 text encoder is introduced to specialize in text generation and multilingual expression. Network: A single- and dual-stream diffusion transformer with 17 billion parameters. Reinforcement Learning from Human Feedback Two-Stage Post-Training with Reinforcement Learning: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are applied sequentially in two post-training stages. We introduce a Reward Distribution Alignment algorithm, which innovatively incorporates high-quality images as selected samples to ensure stable and improved reinforcement learning outcomes. The first systematic industrial-level rewriting model. SFT training structurally rewrites user text instructions to enrich visual expression, while GRPO training employs a fine-grained semantic AlignEvaluator reward model to substantially improve the semantics of images generated from rewritten text. The AlignEvaluator covers 6 major categories and 24 fine-grained assessment points. PromptEnhancer supports both Chinese and English rewriting and demonstrates general applicability in enhancing semantics for both open-source and proprietary text-to-image models. Model distillation We propose a novel distillation method based on meanflow that addresses the key challenges of instability and inefficiency inherent in standard meanflow training. This approach enables high-quality image generation with only a few sampling steps. To our knowledge, this is the first successful application of meanflow to an industrial-scale model. - High-Quality Generation: Efficiently produces ultra-high-definition (2K) images with cinematic composition. - Multilingual Support: Provides native support for both Chinese and English prompts. - Advanced Architecture: Built on a multi-modal, single- and dual-stream combined DiT (Diffusion Transformer) backbone. - Glyph-Aware Processing: Utilizes ByT5's text rendering capabilities for improved text generation accuracy. - Flexible Aspect Ratios: Supports a variety of image aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3). - Prompt Enhancement: Automatically rewrites prompts to improve descriptive accuracy and visual quality. Prompt Enhanced Demo To improve the quality and detail of generated images, we use a prompt rewriting model. This model automatically enhances user-provided text prompts by adding detailed and descriptive information. SSAE Evaluation SSAE (Structured Semantic Alignment Evaluation) is an intelligent evaluation metric for image-text alignment based on advanced multimodal large language models (MLLMs). We extracted 3500 key points across 12 categories, then used multimodal large language models to automatically evaluate and score by comparing the generated images with these key points based on the visual content of the images. Mean Image Accuracy represents the image-wise average score across all key points, while Global Accuracy directly calculates the average score across all key points. Model Open Source Mean Image Accuracy Global Accuracy Primary Subject Secondary Subject Scene Other Noun Key Attributes Other Attributes Action Noun Attributes Action Noun Attributes Shot Style Composition FLUX-dev ✅ 0.7122 0.6995 0.7965 0.7824 0.5993 0.5777 0.7950 0.6826 0.6923 0.8453 0.8094 0.6452 0.7096 0.6190 Seedream-3.0 ❌ 0.8827 0.8792 0.9490 0.9311 0.8242 0.8177 0.9747 0.9103 0.8400 0.9489 0.8848 0.7582 0.8726 0.7619 Qwen-Image ✅ 0.8854 0.8828 0.9502 0.9231 0.8351 0.8161 0.9938 0.9043 0.8846 0.9613 0.8978 0.7634 0.8548 0.8095 GPT-Image ❌ 0.8952 0.8929 0.9448 0.9289 0.8655 0.8445 0.9494 0.9283 0.8800 0.9432 0.9017 0.7253 0.8582 0.7143 HunyuanImage 2.1 ✅ 0.8888 0.8832 0.9339 0.9341 0.8363 0.8342 0.9627 0.8870 0.9615 0.9448 0.9254 0.7527 0.8689 0.7619 From the SSAE evaluation results, our model has currently achieved the optimal performance among open-source models in terms of semantic alignment, and is very close to the performance of closed-source commercial models (GPT-Image). We adopted the GSB evaluation method commonly used to assess the relative performance between two models from an overall image perception perspective. In total, we utilized 1000 text prompts, generating an equal number of image samples for all compared models in a single run. For a fair comparison, we conducted inference only once for each prompt, avoiding any cherry-picking of results. When comparing with the baseline methods, we maintained the default settings for all selected models. The evaluation was performed by more than 100 professional evaluators. From the results, HunyuanImage 2.1 achieved a relative win rate of -1.36% against Seedream3.0 (closed-source) and 2.89% outperforming Qwen-Image (open-source). The GSB evaluation results demonstrate that HunyuanImage 2.1, as an open-source model, has reached a level of image generation quality comparable to closed-source commercial models (Seedream3.0), while showing certain advantages in comparison with similar open-source models (Qwen-Image). This fully validates the technical advancement and practical value of HunyuanImage 2.1 in text-to-image generation tasks. Hardware and OS Requirements: - NVIDIA GPU with CUDA support. Minimum requrement for now: 24 GB GPU memory for 2048x2048 image generation. > Note: The memory requirements above are measured with model CPU offloading and FP8 quantization enabled. If your GPU has sufficient memory, you may disable offloading for improved inference speed. - Supported operating system: Linux. The details of download pretrained models are shown here. 🔑 Usage HunyuanImage-2.1 only supports 2K image generation (e.g. 2048x2048 for 1:1 images, 2560x1536 for 16:9 images, etc.). Generating images with 1K resolution will result in artifacts. Additionally, we recommend using the full generation pipeline for better quality (i.e. enabling prompt enhancement and refinment). If you find this project useful for your research and applications, please cite as: We would like to thank the following open-source projects and communities for their contributions to open research and exploration: Qwen, FLUX, diffusers and HuggingFace.

333

643

Penguin-VL-2B

326

HunyuanVideo-I2V

320

342

Hunyuan-1.8B-Pretrain

285

SongGeneration

Demo  |  Paper  |  Code  |  Space Demo This repository is the official weight repository for LeVo: High-Quality Song Generation with Multi-Preference Alignment. In this repository, we provide the SongGeneration model, inference scripts, and the checkpoint that has been trained on the Million Song Dataset. | Model | Max Length | Language | GPU Menmory | RFT(A100) | Download Link | | ------------------------- | :--------: | :------------------: | :---------: | :-------: | ------------------------------------------------------------ | | SongGeneration-base | 2m30s | zh | 10G/16G | 1.26 | You were here | | SongGeneration-base-new | 2m30s | zh, en | 10G/16G | 1.26 | Huggingface | | SongGeneration-base-full | 4m30s | zh, en | 12G/18G | 1.30 | Huggingface | | SongGeneration-large | 4m30s | zh, en | 22G/28G | 1.51 | Huggingface | | SongGeneration-v1.5-small | 2m | zh, en, es, ja, etc. | - | - | Coming soon | | SongGeneration-v1.5-base | 4m30s | zh, en, es, ja, etc. | - | - | Coming soon | | SongGeneration-v1.5-large | 4m30s | zh, en, es, ja, etc. | - | - | Coming soon | We develop the SongGeneration model. It is an LM-based framework consisting of LeLM and a music codec. LeLM is capable of parallelly modeling two types of tokens: mixed tokens, which represent the combined audio of vocals and accompaniment to achieve vocal-instrument harmony, and dual-track tokens, which separately encode vocals and accompaniment for high-quality song generation. The music codec reconstructs the dual-track tokens into highfidelity music audio. SongGeneration significantly improves over the open-source music generation models and performs competitively with current state-of-the-art industry systems. For more details, please refer to our paper. The code and weights in this repository is released in the LICENSE file.

233

302

Hunyuan-7B-Pretrain

211

Tencent-Hunyuan-Large

196

614

HunyuanImage-3.0-Instruct

194

793

HY-MT1.5-7B-GPTQ-Int4

192

Hunyuan-A13B-Instruct-GPTQ-Int4

189

Hunyuan-7B-Instruct-0124

169

HunyuanWorld Voyager

We introduce HunyuanWorld-Voyager, a novel video diffusion framework that generates world-consistent 3D point-cloud sequences from a single image with user-defined camera path. Voyager can generate 3D-consistent scene videos for world exploration following custom camera trajectories. It can also jointly generate aligned depth and RGB video for effective and direct 3D reconstruction. If you find Voyager useful for your research and applications, please cite using this BibTeX: We would like to thank HunyuanWorld, Hunyuan3D-2, and HunyuanVideo-I2V. We also thank VGGT, MoGE, Metric3D, for their open research and exploration.

157

584

Hunyuan-A13B-Pretrain

150

HunyuanVideo-PromptRewrite

145

SongPrep 7B

144

Penguin-Encoder

132

Hunyuan-7B-Pretrain-0124

122

HY-MT1.5-1.8B-GPTQ-Int4

118

Hunyuan-4B-Instruct-GPTQ-Int4

117

HY-Motion-1.0

112

125

DeepSeek V3.1 Terminus W4AFP8

license:mit

Hunyuan-7B-Instruct-FP8

POINTS-GUI-G

Hunyuan-4B-Instruct-FP8

HY-MT1.5-1.8B-FP8

Hunyuan-0.5B-Instruct-FP8

Hunyuan-1.8B-Instruct-FP8

HunyuanImage-3.0-Instruct-Distil

Hunyuan-0.5B-Instruct-GPTQ-Int4

DOGR

Hunyuan-GameCraft-1.0

484

Hunyuan-7B-Instruct-GPTQ-Int4

Hunyuan-7B-Instruct-AWQ-Int4

HY-MT1.5-7B-FP8

Hunyuan-1.8B-Instruct-AWQ-Int4

Hunyuan-1.8B-Instruct-GPTQ-Int4

Hunyuan-4B-Instruct-AWQ-Int4

Hunyuan-0.5B-Instruct-AWQ-Int4

TCAndon-Router

Sequential-Hidden-Decoding-8B-n8

HunyuanVideo-Avatar

> HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters Recent years have witnessed significant progress in audio-driven human animation. However, critical challenges remain in (i) generating highly dynamic videos while preserving character consistency, (ii) achieving precise emotion alignment between characters and audio, and (iii) enabling multi-character audio-driven animation. To address these challenges, we propose HunyuanVideo-Avatar, a multimodal diffusion transformer (MM-DiT)-based model capable of simultaneously generating dynamic, emotion-controllable, and multi-character dialogue videos. Concretely, HunyuanVideo-Avatar introduces three key innovations: (i) A character image injection module is designed to replace the conventional addition-based character conditioning scheme, eliminating the inherent condition mismatch between training and inference. This ensures the dynamic motion and strong character consistency; (ii) An Audio Emotion Module (AEM) is introduced to extract and transfer the emotional cues from an emotion reference image to the target generated video, enabling fine-grained and accurate emotion style control; (iii) A Face-Aware Audio Adapter (FAA) is proposed to isolate the audio-driven character with latent-level face mask, enabling independent audio injection via cross-attention for multi-character scenarios. These innovations empower HunyuanVideo-Avatar to surpass state-of-the-art methods on benchmark datasets and a newly proposed wild dataset, generating realistic avatars in dynamic, immersive scenarios. The source code and model weights will be released publicly. We propose HunyuanVideo-Avatar, a multi-modal diffusion transformer(MM-DiT)-based model capable of generating dynamic, emotion-controllable, and multi-character dialogue videos. High-Dynamic and Emotion-Controllable Video Generation HunyuanVideo-Avatar supports animating any input avatar images to high-dynamic and emotion-controllable videos with simple audio conditions. Specifically, it takes as input multi-style avatar images at arbitrary scales and resolutions. The system supports multi-style avatars encompassing photorealistic, cartoon, 3D-rendered, and anthropomorphic characters. Multi-scale generation spanning portrait, upper-body and full-body. It generates videos with high-dynamic foreground and background, achieving superior realistic and naturalness. In addition, the system supports controlling facial emotions of the characters conditioned on input audio. HunyuanVideo-Avatar supports various downstream tasks and applications. For instance, the system generates talking avatar videos, which could be applied to e-commerce, online streaming, social media video production, etc. In addition, its multi-character animation feature enlarges the application such as video content creation, editing, etc. For example, to generate a video with 8 GPUs, you can use the following command: For example, to generate a video with 1 GPU, you can use the following command: If you find HunyuanVideo-Avatar useful for your research and applications, please cite using this BibTeX: We would like to thank the contributors to the HunyuanVideo, SD3, FLUX, Llama, LLaVA, Xtuner, diffusers and HuggingFace repositories, for their open research and exploration.

305

HY-OmniWeaving

250

HunyuanCustom

190

HY-World-2.0

111

HunyuanPortrait

HunyuanPortrait Implicit Condition Control for Enhanced Portrait Animation 📜 Requirements An NVIDIA 3090 GPU with CUDA support is required. The model is tested on a single 24G GPU. Tested operating system: Linux All models are stored in `pretrainedweights` by default: TL;DR: HunyuanPortrait is a diffusion-based framework for generating lifelike, temporally consistent portrait animations by decoupling identity and motion using pre-trained encoders. It encodes driving video expressions/poses into implicit control signals, injects them via attention-based adapters into a stabilized diffusion backbone, enabling detailed and style-flexible animation from a single reference image. The method outperforms existing approaches in controllability and coherence. Some results of portrait animation using HunyuanPortrait can be found on our Project page. The code is based on SVD, DiNOv2, Arc2Face, YoloFace. We thank the authors for their open-sourced code and encourage users to cite their works when applicable. Stable Video Diffusion is licensed under the Stable Video Diffusion Research License, Copyright (c) Stability AI Ltd. All Rights Reserved. This codebase is intended solely for academic purposes. 🎼 Citation If you think this project is helpful, please feel free to leave a star⭐️⭐️⭐️ and cite our paper: