chaitnya26

46 models

WanVideo_comfy-fork

1,785
0

Qwen-Image-Lightning

license:apache-2.0
789
1

qwen-image-edit-2509-fork

This repository contains Nunchaku-quantized versions of Qwen-Image-Edit-2509, an image-editing model built on Qwen-Image that advances complex text rendering. It is optimized for efficient inference while maintaining minimal loss in quality.

News
- [2025-09-25] 🔥 Released 4-bit 4/8-step Lightning Qwen-Image-Edit!
- [2025-09-24] 🚀 Released 4-bit SVDQuant-quantized Qwen-Image-Edit-2509 models with rank 32 and 128!

Model details
- Developed by: Nunchaku Team
- Model type: image-to-image
- License: apache-2.0
- Quantized from model: Qwen-Image-Edit-2509

Available files
- `svdq-int4r32-qwen-image-edit-2509.safetensors`: SVDQuant INT4 (rank 32) Qwen-Image-Edit-2509 model. For users with non-Blackwell GPUs (pre-50-series).
- `svdq-int4r128-qwen-image-edit-2509.safetensors`: SVDQuant INT4 (rank 128) Qwen-Image-Edit-2509 model. For users with non-Blackwell GPUs (pre-50-series). It offers better quality than the rank-32 model, but it is slower.
- `svdq-int4r32-qwen-image-edit-2509-lightningv2.0-4steps.safetensors`: SVDQuant INT4 (rank 32) 4-step model, built by fusing Qwen-Image-Lightning-4steps-V2.0-bf16.safetensors at LoRA strength 1.0. For users with non-Blackwell GPUs (pre-50-series).
- `svdq-int4r128-qwen-image-edit-2509-lightning`: SVDQuant INT4 (rank 128) 4-step model, built by fusing Qwen-Image-Lightning-4steps-V2.0-bf16.safetensors at LoRA strength 1.0. For users with non-Blackwell GPUs (pre-50-series).
- `svdq-int4r32-qwen-image-edit-2509-lightningv2.0-8steps.safetensors`: SVDQuant INT4 (rank 32) 8-step model, built by fusing Qwen-Image-Lightning-8steps-V2.0-bf16.safetensors at LoRA strength 1.0. For users with non-Blackwell GPUs (pre-50-series).
- `svdq-int4r128-qwen-image-edit-2509-lightningv2.0-8steps.safetensors`: SVDQuant INT4 (rank 128) 8-step model, built by fusing Qwen-Image-Lightning-8steps-V2.0-bf16.safetensors at LoRA strength 1.0. For users with non-Blackwell GPUs (pre-50-series).
- `svdq-fp4r32-qwen-image-edit-2509.safetensors`: SVDQuant NVFP4 (rank 32) Qwen-Image-Edit-2509 model. For users with Blackwell GPUs (50-series).
- `svdq-fp4r128-qwen-image-edit-2509.safetensors`: SVDQuant NVFP4 (rank 128) Qwen-Image-Edit-2509 model. For users with Blackwell GPUs (50-series). It offers better quality than the rank-32 model, but it is slower.
- `svdq-fp4r32-qwen-image-edit-2509-lightningv2.0-4steps.safetensors`: SVDQuant NVFP4 (rank 32) 4-step model, built by fusing Qwen-Image-Lightning-4steps-V2.0-bf16.safetensors at LoRA strength 1.0. For users with Blackwell GPUs (50-series).
- `svdq-fp4r128-qwen-image-edit-2509-lightningv2.0-4steps.safetensors`: SVDQuant NVFP4 (rank 128) 4-step model, built by fusing Qwen-Image-Lightning-4steps-V2.0-bf16.safetensors at LoRA strength 1.0. For users with Blackwell GPUs (50-series).
- `svdq-fp4r32-qwen-image-edit-2509-lightningv2.0-8steps.safetensors`: SVDQuant NVFP4 (rank 32) 8-step model, built by fusing Qwen-Image-Lightning-8steps-V2.0-bf16.safetensors at LoRA strength 1.0. For users with Blackwell GPUs (50-series).
- `svdq-fp4r128-qwen-image-edit-2509-lightningv2.0-8steps.safetensors`: SVDQuant NVFP4 (rank 128) 8-step model, built by fusing Qwen-Image-Lightning-8steps-V2.0-bf16.safetensors at LoRA strength 1.0. For users with Blackwell GPUs (50-series).

Resources
- Inference Engine: nunchaku
- Quantization Library: deepcompressor
- Paper: SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
- Demo: svdquant.mit.edu
- Diffusers Usage: See qwen-image-edit-2509.py. Check this tutorial for more advanced usage.
- ComfyUI Usage: See nunchaku-qwen-image-edit-2509.json.
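Because the file list above splits into INT4 checkpoints for pre-Blackwell GPUs and NVFP4 checkpoints for Blackwell (50-series) GPUs, a small helper that picks and downloads the matching file can be handy. Below is a minimal sketch using `huggingface_hub`; the repository id and the choice of the rank-32, non-Lightning variants are assumptions, not taken from this card:

```python
# Minimal sketch: pick the SVDQuant checkpoint that matches your GPU and download it.
# Assumptions: the repo id is a placeholder for this fork, and the rank-32 non-Lightning
# variants are chosen; swap in any filename listed above as needed.
import torch
from huggingface_hub import hf_hub_download

REPO_ID = "chaitnya26/qwen-image-edit-2509-fork"  # assumed repo id

def pick_filename() -> str:
    # Blackwell (50-series) GPUs report CUDA compute capability 12.x and use the NVFP4 files;
    # earlier GPUs use the INT4 files, per the file descriptions above.
    major, _ = torch.cuda.get_device_capability()
    if major >= 12:
        return "svdq-fp4r32-qwen-image-edit-2509.safetensors"
    return "svdq-int4r32-qwen-image-edit-2509.safetensors"

if __name__ == "__main__":
    path = hf_hub_download(repo_id=REPO_ID, filename=pick_filename())
    print("Downloaded checkpoint to:", path)
```

The downloaded file is then loaded through the nunchaku inference engine or the ComfyUI workflow referenced above.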

dataset:mit-han-lab/svdquant-datasets
471
0

Qwen-Image-Edit-GGUF-fork

license:apache-2.0
435
0

FLUX.1-Kontext-dev-GGUF-forked

`FLUX.1 Kontext [dev]` is a 12 billion parameter rectified flow transformer capable of editing images based on text instructions. For more information, please read our blog post and our technical report. You can find information about the `[pro]` version here.

Key Features
1. Change existing images based on an edit instruction.
2. Have character, style and object reference without any finetuning.
3. Robust consistency allows users to refine an image through multiple successive edits with minimal visual drift.
4. Trained using guidance distillation, making `FLUX.1 Kontext [dev]` more efficient.
5. Open weights to drive new scientific research, and empower artists to develop innovative workflows.
6. Generated outputs can be used for personal, scientific, and commercial purposes, as described in the [FLUX.1 \[dev\] Non-Commercial License](https://github.com/black-forest-labs/flux/blob/main/modellicenses/LICENSE-FLUX1-dev).

Usage
We provide a reference implementation of `FLUX.1 Kontext [dev]`, as well as sampling code, in a dedicated github repository. Developers and creatives looking to build on top of `FLUX.1 Kontext [dev]` are encouraged to use this as a starting point. `FLUX.1 Kontext [dev]` is also available in both ComfyUI and Diffusers.

API Endpoints
The FLUX.1 Kontext models are also available via API from the following sources:
- bfl.ai: https://docs.bfl.ai/
- DataCrunch: https://datacrunch.io/flux-kontext
- fal: https://fal.ai/flux-kontext
- Replicate: https://replicate.com/blog/flux-kontext, https://replicate.com/black-forest-labs/flux-kontext-dev, https://replicate.com/black-forest-labs/flux-kontext-pro, https://replicate.com/black-forest-labs/flux-kontext-max
- Runware: https://runware.ai/blog/introducing-flux1-kontext-instruction-based-image-editing-with-ai?utmsource=bfl
- TogetherAI: https://www.together.ai/models/flux-1-kontext-dev

Risks
Black Forest Labs is committed to the responsible development of generative AI technology. Prior to releasing FLUX.1 Kontext, we evaluated and mitigated a number of risks in our models and services, including the generation of unlawful content. We implemented a series of pre-release mitigations to help prevent misuse by third parties, with additional post-release mitigations to help address residual risks:
1. Pre-training mitigation. We filtered pre-training data for multiple categories of “not safe for work” (NSFW) content to help prevent a user generating unlawful content in response to text prompts or uploaded images.
2. Post-training mitigation. We have partnered with the Internet Watch Foundation, an independent nonprofit organization dedicated to preventing online abuse, to filter known child sexual abuse material (CSAM) from post-training data. Subsequently, we undertook multiple rounds of targeted fine-tuning to provide additional mitigation against potential abuse. By inhibiting certain behaviors and concepts in the trained model, these techniques can help to prevent a user generating synthetic CSAM or nonconsensual intimate imagery (NCII) from a text prompt, or transforming an uploaded image into synthetic CSAM or NCII.
3. Pre-release evaluation. Throughout this process, we conducted multiple internal and external third-party evaluations of model checkpoints to identify further opportunities for improvement. The third-party evaluations, which included 21 checkpoints of FLUX.1 Kontext [pro] and [dev], focused on eliciting CSAM and NCII through adversarial testing with text-only prompts, as well as uploaded images with text prompts. Next, we conducted a final third-party evaluation of the proposed release checkpoints, focused on text-to-image and image-to-image CSAM and NCII generation. The final FLUX.1 Kontext [pro] (as offered through the FLUX API only) and FLUX.1 Kontext [dev] (released as an open-weight model) checkpoints demonstrated very high resilience against violative inputs, and FLUX.1 Kontext [dev] demonstrated higher resilience than other similar open-weight models across these risk categories. Based on these findings, we approved the release of the FLUX.1 Kontext [pro] model via API, and the release of the FLUX.1 Kontext [dev] model as openly-available weights under a non-commercial license to support third-party research and development.
4. Inference filters. We are applying multiple filters to intercept text prompts, uploaded images, and output images on the FLUX API for FLUX.1 Kontext [pro]. Filters for CSAM and NCII are provided by Hive, a third-party provider, and cannot be adjusted or removed by developers. We provide filters for other categories of potentially harmful content, including gore, which can be adjusted by developers based on their specific risk profile. Additionally, the repository for the open FLUX.1 Kontext [dev] model includes filters for illegal or infringing content. Filters or manual review must be used with the model under the terms of the FLUX.1 [dev] Non-Commercial License. We may approach known deployers of the FLUX.1 Kontext [dev] model at random to verify that filters or manual review processes are in place.
5. Content provenance. The FLUX API applies cryptographically-signed metadata to output content to indicate that images were produced with our model. Our API implements the Coalition for Content Provenance and Authenticity (C2PA) standard for metadata.
6. Policies. Access to our API and use of our models are governed by our Developer Terms of Service, Usage Policy, and FLUX.1 [dev] Non-Commercial License, which prohibit the generation of unlawful content or the use of generated content for unlawful, defamatory, or abusive purposes. Developers and users must consent to these conditions to access the FLUX Kontext models.
7. Monitoring. We are monitoring for patterns of violative use after release, and may ban developers who we detect intentionally and repeatedly violate our policies via the FLUX API. Additionally, we provide a dedicated email address ([email protected]) to solicit feedback from the community. We maintain a reporting relationship with organizations such as the Internet Watch Foundation and the National Center for Missing and Exploited Children, and we welcome ongoing engagement with authorities, developers, and researchers to share intelligence about emerging risks and develop effective mitigations.

License
This model falls under the [FLUX.1 \[dev\] Non-Commercial License](https://github.com/black-forest-labs/flux/blob/main/modellicenses/LICENSE-FLUX1-dev).
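The card notes Diffusers support but the usage snippet did not survive this mirror. Below is a minimal sketch of instruction-based editing, assuming a recent `diffusers` release that ships `FluxKontextPipeline` and access to the base `black-forest-labs/FLUX.1-Kontext-dev` weights (this repo itself hosts GGUF conversions); treat the exact arguments as assumptions to verify against the Diffusers documentation:

```python
# Minimal sketch of instruction-based image editing with FLUX.1 Kontext [dev] via Diffusers.
# Assumes: diffusers with FluxKontextPipeline, a CUDA GPU, and an accepted model license on the Hub.
import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Load the image to edit and describe the edit as a text instruction.
source = load_image("https://example.com/input.png")  # placeholder URL
edited = pipe(
    image=source,
    prompt="Change the car color to red while keeping everything else the same",
    guidance_scale=2.5,
).images[0]
edited.save("edited.png")
```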

162
1

LTXV-fork

157
0

Wan2.2-Fun-A14B-Control-GGUF-fork

license:apache-2.0
139
0

Emerhyst-20B-GGUF-fork

TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z).

Emerhyst 20B - GGUF
- Model creator: Undi
- Original model: Emerhyst 20B

This repo contains GGUF format model files for Undi's Emerhyst 20B. GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. Here is an incomplete list of clients and libraries that are known to support GGUF:
- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for story telling.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy to use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- ctransformers, a Python library with GPU accel, LangChain support, and OpenAI-compatible AI server.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.

Repositories available:
- AWQ model(s) for GPU inference.
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference
- Undi's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions

The creator of the source model has listed its license as `cc-by-nc-4.0`, and this quantization has therefore used that same license. As this model is based on Llama 2, it is also subject to the Meta Llama 2 license terms, and the license files for that are additionally included. It should therefore be considered as being claimed to be licensed under both licenses. I contacted Hugging Face for clarification on dual licensing but they do not yet have an official position. Should this change, or should Meta provide any feedback on this situation, I will update this section accordingly. In the meantime, any questions regarding licensing, and in particular how these two licenses might interact, should be directed to the original model repository: Undi's Emerhyst 20B.

These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third party UIs and libraries - please see the list at the top of this README. The new methods available are:
- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw)
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K resulting in 5.5 bpw
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw
Refer to the Provided Files table below to see what files use which methods, and how.

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| emerhyst-20b.Q2_K.gguf | Q2_K | 2 | 8.31 GB | 10.81 GB | smallest, significant quality loss - not recommended for most purposes |
| emerhyst-20b.Q3_K_S.gguf | Q3_K_S | 3 | 8.66 GB | 11.16 GB | very small, high quality loss |
| emerhyst-20b.Q3_K_M.gguf | Q3_K_M | 3 | 9.70 GB | 12.20 GB | very small, high quality loss |
| emerhyst-20b.Q3_K_L.gguf | Q3_K_L | 3 | 10.63 GB | 13.13 GB | small, substantial quality loss |
| emerhyst-20b.Q4_0.gguf | Q4_0 | 4 | 11.29 GB | 13.79 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| emerhyst-20b.Q4_K_S.gguf | Q4_K_S | 4 | 11.34 GB | 13.84 GB | small, greater quality loss |
| emerhyst-20b.Q4_K_M.gguf | Q4_K_M | 4 | 12.04 GB | 14.54 GB | medium, balanced quality - recommended |
| emerhyst-20b.Q5_0.gguf | Q5_0 | 5 | 13.77 GB | 16.27 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| emerhyst-20b.Q5_K_S.gguf | Q5_K_S | 5 | 13.77 GB | 16.27 GB | large, low quality loss - recommended |
| emerhyst-20b.Q5_K_M.gguf | Q5_K_M | 5 | 14.16 GB | 16.66 GB | large, very low quality loss - recommended |
| emerhyst-20b.Q6_K.gguf | Q6_K | 6 | 16.40 GB | 18.90 GB | very large, extremely low quality loss |
| emerhyst-20b.Q8_0.gguf | Q8_0 | 8 | 21.25 GB | 23.75 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file. The following clients/libraries will automatically download models for you, providing a list of available models to choose from:
- LM Studio
- LoLLMS Web UI
- Faraday.dev

Under Download Model, you can enter the model repo: TheBloke/Emerhyst-20B-GGUF and below it, a specific filename to download, such as: emerhyst-20b.Q4_K_M.gguf.

On the command line, including when fetching multiple files at once, I recommend using the `huggingface-hub` Python library. You can then download any individual model file to the current directory at high speed, or download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows Command Line users: you can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.

To run with llama.cpp, make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU; remove it if you don't have GPU acceleration. Change `-c 4096` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p ` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation. Further instructions for text-generation-webui are here: text-generation-webui/docs/llama.cpp.md.
You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries. To load this model in Python code using ctransformers, run the install command appropriate for your system. Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python, LangChain + ctransformers.

For further support, and discussions on these models and AI in general, join us at:

I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.
- Patreon: https://patreon.com/TheBlokeAI
- Ko-Fi: https://ko-fi.com/TheBlokeAI

Patreon special mentions: Alicia Loh, Stephen Murray, K, Ajan Kanaga, RoA, Magnesian, Deo Leter, Olakabola, Eugene Pentland, zynix, Deep Realms, Raymond Fosdick, Elijah Stavena, Iucharbius, Erik Bjäreholt, Luis Javier Navarrete Lozano, Nicholas, theTransient, John Detwiler, alfiei, knownsqashed, Mano Prime, Willem Michiel, Enrico Ros, LangChain4j, OG, Michael Dempsey, Pierre Kircher, Pedro Madruga, James Bentley, Thomas Belote, Luke @flexchar, Leonard Tan, Johann-Peter Hartmann, Illia Dulskyi, Fen Risland, Chadd, SX, Jeff Scroggin, Ken Nordquist, Sean Connelly, Artur Olbinski, Swaroop Kallakuri, Jack West, Ai Maven, David Ziegler, Russ Johnson, transmissions 11, John Villwock, Alps Aficionado, Clay Pascal, Viktor Bowallius, Subspace Studios, Rainer Wilmers, Trenton Dambrowitz, vamX, Michael Levine, 준교 김, Brandon Frisco, Kalila, Trailburnt, Randy H, Talal Aujan, Nathan Dryer, Vadim, 阿明, ReadyPlayerEmma, Tiffany J. Kim, George Stoitzev, Spencer Kim, Jerry Meng, Gabriel Tamborski, Cory Kujawski, Jeffrey Morgan, Spiking Neurons AB, Edmond Seymore, Alexandros Triantafyllidis, Lone Striker, Cap'n Zoog, Nikolai Manek, danny, ya boyyy, Derek Yates, usrbinkat, Mandus, TL, Nathan LeClaire, subjectnull, Imad Khwaja, webtim, Raven Klaugh, Asp the Wyvern, Gabriel Puliatti, Caitlyn Gatomon, Joseph William Delisle, Jonathan Leane, Luke Pendergrass, SuperWojo, Sebastain Graf, Will Dee, Fred von Graf, Andrey, Dan Guido, Daniel P. Andersen, Nitin Borwankar, Elle, Vitor Caleffi, biorpg, jjj, NimbleBox.ai, Pieter, Matthew Berman, terasurfer, Michael Davis, Alex, Stanislav Ovsiannikov

And thank you again to a16z for their generous grant.

In addition, LimaRP v3 was used; it is recommended to read its documentation.
- PygmalionAI/pygmalion-2-13b
- Xwin-LM/Xwin-LM-13B-V0.1
- The-Face-Of-Goonery/Huginn-13b-FP16
- zattio770/120-Days-of-LORA-v2-13B
- lemonilia/LimaRP-Llama2-13B-v3-EXPERIMENT

You can follow these instruction format settings in SillyTavern. Replace tiny with your desired response length:
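The download and Python snippets referenced above did not survive this mirror, so here is a minimal hedged sketch of the typical workflow: fetching one quantised file with `huggingface_hub` and loading it with llama-cpp-python. The prompt format and sampling settings below are illustrative, not taken from this card:

```python
# Minimal sketch: download a single GGUF file and run it with llama-cpp-python.
# Assumes `pip install huggingface_hub llama-cpp-python` and enough RAM/VRAM for the chosen quant.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Grab the recommended medium quant (Q4_K_M) from the upstream repo.
model_path = hf_hub_download(
    repo_id="TheBloke/Emerhyst-20B-GGUF",
    filename="emerhyst-20b.Q4_K_M.gguf",
)

# n_gpu_layers mirrors llama.cpp's -ngl flag; n_ctx mirrors -c.
llm = Llama(model_path=model_path, n_gpu_layers=32, n_ctx=4096)

# Illustrative instruction-style prompt; adapt it to your preferred SillyTavern preset.
output = llm("### Instruction:\nWrite a short greeting.\n### Response:\n", max_tokens=128)
print(output["choices"][0]["text"])
```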

llama
93
0

Lumina-DiMOO-fork

Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

[📑 Technical Report (Coming Soon)] [💜 Project Page (Demo & Benchmark)] [🤗 Model]

¹Shanghai AI Laboratory, ²Shanghai Innovation Institute, ³Shanghai Jiao Tong University, ⁶The Chinese University of Hong Kong, ⁷Tsinghua University

📚 Introduction
We introduce Lumina-DiMOO, an omni foundational model for seamless multimodal generation and understanding. Lumina-DiMOO is distinguished by four key innovations:
- Unified Discrete Diffusion Architecture: Lumina-DiMOO sets itself apart from prior unified models by utilizing fully discrete diffusion modeling to handle inputs and outputs across various modalities.
- Versatile Multimodal Capabilities: Lumina-DiMOO supports a broad spectrum of multimodal tasks, including text-to-image generation (allowing for arbitrary and high resolutions) and image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting), alongside advanced image understanding.
- Higher Sampling Efficiency: Compared to previous AR or hybrid AR-diffusion paradigms, Lumina-DiMOO demonstrates remarkable sampling efficiency. Additionally, we design a bespoke caching method to further speed up sampling by 2x.
- Superior Performance: Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multimodal models and setting a new standard in the field.

📽️ Qualitative Results
Here we present some comparative generation results with other models, including a controllable and subject-driven generation comparison. For additional visualization results, please see our Project Page.
- Since text generation is performed in a block-wise manner, unlike image generation which uses a single global decoding step, its speed is influenced by both the number of blocks and the number of steps. Therefore, the speed improvement for image understanding is not as significant as that for image generation.
- Lumina-DiMOO settings: for image generation, we sample 64 steps; for image understanding, we set the block length to 256 and the number of sampling steps to 128.

💬 Discussion
You can reach us with this WeChat QR code!

📜 Acknowledgements
This work was also supported and implemented by MindSpeed MM, an open-source training framework for large-scale multimodal models designed for distributed training, developed and maintained by Huawei's Computing Product Line. Specifically optimized for Huawei's Ascend AI chips, MindSpeed MM offers comprehensive support for distributed training and is tailored for a wide range of multimodal tasks.

license:apache-2.0
88
0

Kontext Tryon7 Fork

This is a batch-run comparison against the banana model, used for practicing the Kontext mask-free outfit LoRA replacement effect. All the example results were achieved by directly combining two images without using a mask. Based on the test results, it has a greater advantage in terms of consistency compared with the banana model. The workflow for each image is similar, with only slight parameter adjustments. You can view the details by dragging an image into ComfyUI.

license:mit
40
2

Hunyuan3D-Part-fork

Pipeline of our image-to-3D part generation. It contains two key components, P3-SAM and X-Part. The holistic mesh is fed to the part detection module P3-SAM to obtain semantic features, part segmentations and part bounding boxes. Then X-Part generates the complete parts.

P3-SAM: Native 3D Part Segmentation
- Paper: https://arxiv.org/abs/2509.06784
- Code: https://github.com/Tencent-Hunyuan/Hunyuan3D-Part/tree/main/P3-SAM
- Project Page: https://murcherful.github.io/P3-SAM/
- HuggingFace Demo: https://huggingface.co/spaces/tencent/Hunyuan3D-Part

X-Part: high-fidelity and structure-coherent shape decomposition
- Paper: https://arxiv.org/abs/2509.08643
- Code: https://github.com/Tencent-Hunyuan/Hunyuan3D-Part/tree/main/XPart
- Project Page: https://yanxinhao.github.io/Projects/X-Part/
- HuggingFace Demo: https://huggingface.co/spaces/tencent/Hunyuan3D-Part

Notice
- The current release is a light version of X-Part. The full version is available at https://3d.hunyuan.tencent.com/studio.
- For X-Part, we recommend using scanned or AI-generated meshes (e.g., from Hunyuan3D V2.5 or V3.0) as input.
- P3-SAM can handle any input mesh.

🔗 Citation
If you found this repository helpful, please cite our reports:

38
0

gpt-oss-120b-fork

license:apache-2.0
32
0

kotoba-whisper-v2.0-fork

license:apache-2.0
30
0

whisper-large-v3-fork

license:apache-2.0
28
0

Tiger-Gemma-9B-v1-GGUF-fork

27
0

anime-whisper-fork

license:mit
27
0

whisper-large-v3-turbo-fork

license:mit
26
0

VibeVoice-1.5b-fork

license:mit
21
0

stable-diffusion-xl-refiner-1.0

18
0

LFM2-Audio-1.5B-fork

15
0

VibeVoice-Large-Q8-fork

license:mit
15
0

DeepSeek-V3.1-Base

license:mit
14
0

Depth-Anything-V2-Large

license:cc-by-nc-4.0
13
0

Step-Audio-2-mini-fork

Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation, presented in the paper Step-Audio 2 Technical Report.
- Advanced Speech and Audio Understanding: Promising performance in ASR and audio understanding by comprehending and reasoning over semantic information, para-linguistic and non-vocal information.
- Intelligent Speech Conversation: Achieving natural and intelligent interactions that are contextually appropriate for various conversational scenarios and paralinguistic information.
- Tool Calling and Multimodal RAG: By leveraging tool calling and RAG to access real-world knowledge (both textual and acoustic), Step-Audio 2 can generate responses with fewer hallucinations for diverse scenarios, while also having the ability to switch timbres based on retrieved speech.
- State-of-the-Art Performance: Achieving state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions (see Evaluation and Technical Report).
- Open-source: Step-Audio 2 mini and Step-Audio 2 mini Base are released under the Apache 2.0 license.

Model Download (Hugging Face)
| Models | 🤗 Hugging Face |
|-------|-------|
| Step-Audio 2 mini | stepfun-ai/Step-Audio-2-mini |
| Step-Audio 2 mini Base | stepfun-ai/Step-Audio-2-mini-Base |

Model Usage
🔧 Dependencies and Installation
- Python >= 3.10
- PyTorch >= 2.3-cu121
- CUDA Toolkit
- Both Step-Audio 2 and Step-Audio 2 mini are available in our StepFun realtime console with the web search tool enabled. You will need an API key from the StepFun Open Platform.
- Step-Audio 2 is also available in our StepFun AI Assistant mobile app with both web and audio search tools enabled. Please scan the following QR code to download it from your app store, then tap the phone icon in the top-right corner.

You can scan the following QR code to join our WeChat group for communication and discussion.

Evaluation

Automatic speech recognition (CER for Chinese, Cantonese and Japanese; WER for Arabic and English; N/A indicates that the language is not supported):
| Category | Test set | Doubao LLM ASR | GPT-4o Transcribe | Kimi-Audio | Qwen-Omni | Step-Audio 2 | Step-Audio 2 mini |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Multilingual | FLEURS Arabian | N/A | 11.72 | N/A | 25.13 | 14.22 | 16.46 |
| In-house | Anhui accent | 8.83 | 50.55 | 22.17 | 18.73 | 10.61 | 11.65 |
| In-house | Shanghai dialect | 47.49 | 89.58 | 82.90 | 58.74 | 17.77 | 19.30 |

Paralinguistic information understanding (StepEval-Audio-Paralinguistic):
| Model | Avg. | Gender | Age | Timbre | Scenario | Event | Emotion | Pitch | Rhythm | Speed | Style | Vocal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o Audio | 43.45 | 18 | 42 | 34 | 22 | 14 | 82 | 40 | 60 | 58 | 64 | 44 |
| Step-Audio-AQAA | 36.91 | 70 | 66 | 18 | 14 | 14 | 40 | 38 | 48 | 54 | 44 | 0 |
| Step-Audio 2 | 83.09 | 100 | 96 | 82 | 78 | 60 | 86 | 82 | 86 | 88 | 88 | 68 |
| Step-Audio 2 mini | 80.00 | 100 | 94 | 80 | 78 | 60 | 82 | 82 | 68 | 74 | 86 | 76 |

Tool calling (StepEval-Audio-Toolcall; date and time tools have no parameter):
| Model | Objective | Metric | Audio search | Date & Time | Weather | Web search |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-32B † | Trigger | Precision / Recall | 67.5 / 98.5 | 98.4 / 100.0 | 90.1 / 100.0 | 86.8 / 98.5 |
| Step-Audio 2 | Trigger | Precision / Recall | 86.8 / 99.5 | 96.9 / 98.4 | 92.2 / 100.0 | 88.4 / 95.5 |

Speech-to-speech conversation (URO-Bench; U., R., and O. stand for understanding, reasoning, and oral conversation, respectively):
| Model | Language | Scores |
| --- | --- | --- |
| GPT-4o Audio | Chinese | 78.59 / 89.40 / 65.48 / 85.24 / 67.10 / 70.60 / 57.22 / 70.20 |
| Kimi-Audio | Chinese | 73.59 / 79.34 / 64.66 / 79.75 / 66.07 / 60.44 / 59.29 / 76.21 |
| Qwen-Omni | Chinese | 68.98 / 59.66 / 69.74 / 77.27 / 59.11 / 59.01 / 59.82 / 58.74 |
| Step-Audio-AQAA | Chinese | 74.71 / 87.61 / 59.63 / 81.93 / 65.61 / 74.76 / 47.29 / 68.97 |
| Step-Audio 2 | Chinese | 83.32 / 91.05 / 75.45 / 86.08 / 68.25 / 74.78 / 63.18 / 65.10 |
| Step-Audio 2 mini | Chinese | 77.81 / 89.19 / 64.53 / 84.12 / 69.57 / 76.84 / 58.90 / 69.42 |
| GPT-4o Audio | English | 84.54 / 90.18 / 75.90 / 90.41 / 67.51 / 60.65 / 64.36 / 78.46 |
| Kimi-Audio | English | 60.04 / 83.36 / 42.31 / 60.36 / 49.79 / 50.32 / 40.59 / 56.04 |
| Qwen-Omni | English | 70.58 / 66.29 / 69.62 / 76.16 / 50.99 / 44.51 / 63.88 / 49.41 |
| Step-Audio-AQAA | English | 71.11 / 90.15 / 56.12 / 72.06 / 52.01 / 44.25 / 54.54 / 59.81 |
| Step-Audio 2 | English | 83.90 / 92.72 / 76.51 / 84.92 / 66.07 / 64.86 / 67.75 / 66.33 |
| Step-Audio 2 mini | English | 74.36 / 90.07 / 60.12 / 77.65 / 61.25 / 58.79 / 61.94 / 63.80 |

The model and code in this repository are licensed under the Apache 2.0 License.
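The download table above lists the Hugging Face repositories; a minimal sketch for fetching the weights locally with `huggingface_hub` (the target directory name is an assumption):

```python
# Minimal sketch: download the Step-Audio 2 mini weights listed in the table above.
# Assumes `pip install huggingface_hub`; the local directory name is arbitrary.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="stepfun-ai/Step-Audio-2-mini",
    local_dir="Step-Audio-2-mini",
)
print("Model downloaded to:", local_dir)
```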

license:apache-2.0
13
0

granite-speech-3.3-8b-fork

license:apache-2.0
13
0

gpt-oss-20b-fork

license:apache-2.0
11
0

stable-diffusion-v1-5-fork

11
0

HunyuanImage-3.0-fork

🎨 HunyuanImage-3.0: A Powerful Native Multimodal Model for Image Generation
👏 Join our WeChat and Discord | 💻 Official website (官网) - try our model!

🔥🔥🔥 News
- September 28, 2025: 📖 HunyuanImage-3.0 Technical Report released - comprehensive technical documentation now available
- September 28, 2025: 🚀 HunyuanImage-3.0 open-source release - inference code and model weights publicly available

If you develop/use HunyuanImage-3.0 in your projects, welcome to let us know.

Open-source plan - HunyuanImage-3.0 (Image Generation Model)
- [x] Inference
- [x] HunyuanImage-3.0 Checkpoints
- [ ] HunyuanImage-3.0-Instruct Checkpoints (with reasoning)
- [ ] VLLM Support
- [ ] Distilled Checkpoints
- [ ] Image-to-Image Generation
- [ ] Multi-turn Interaction

🗂️ Contents: News · Community Contributions · Open-source Plan · Introduction · Key Features · Dependencies and Installation · System Requirements · Environment Setup · Install Dependencies · Performance Optimizations · Usage · Quick Start with Transformers · Local Installation & Usage · Interactive Gradio Demo · Models Cards · Prompt Guide · Evaluation · Citation · Acknowledgements · Github Star History

HunyuanImage-3.0 is a groundbreaking native multimodal model that unifies multimodal understanding and generation within an autoregressive framework. Our text-to-image module achieves performance comparable to or surpassing leading closed-source models.
- 🧠 Unified Multimodal Architecture: Moving beyond the prevalent DiT-based architectures, HunyuanImage-3.0 employs a unified autoregressive framework. This design enables a more direct and integrated modeling of text and image modalities, leading to surprisingly effective and contextually rich image generation.
- 🏆 The Largest Image Generation MoE Model: This is the largest open-source image generation Mixture of Experts (MoE) model to date. It features 64 experts and a total of 80 billion parameters, with 13 billion activated per token, significantly enhancing its capacity and performance.
- 🎨 Superior Image Generation Performance: Through rigorous dataset curation and advanced reinforcement learning post-training, we've achieved an optimal balance between semantic accuracy and visual excellence. The model demonstrates exceptional prompt adherence while delivering photorealistic imagery with stunning aesthetic quality and fine-grained details.
- 💭 Intelligent World-Knowledge Reasoning: The unified multimodal architecture endows HunyuanImage-3.0 with powerful reasoning capabilities. It leverages its extensive world knowledge to intelligently interpret user intent, automatically elaborating on sparse prompts with contextually appropriate details to produce superior, more complete visual outputs.

System requirements
- 🖥️ Operating System: Linux
- 🎮 GPU: NVIDIA GPU with CUDA support
- 💾 Disk Space: 170GB for model weights
- 🧠 GPU Memory: ≥3×80GB (4×80GB recommended for better performance)
- 🐍 Python: 3.12+ (recommended and tested)
- 🔥 PyTorch: 2.7.1
- ⚡ CUDA: 12.8

For up to 3x faster inference, install these optimizations:
> 💡 Installation tips: It is critical that the CUDA version used by PyTorch matches the system's CUDA version; FlashInfer relies on this compatibility when compiling kernels at runtime. PyTorch 2.7.1+cu128 is tested. GCC version >= 9 is recommended for compiling FlashAttention and FlashInfer.
> ⚡ Performance tips: These optimizations can significantly speed up your inference!
> 💡 Note: When FlashInfer is enabled, the first inference may be slower (about 10 minutes) due to kernel compilation. Subsequent inferences on the same machine will be much faster.

3️⃣ Run the Demo
The Pretrain checkpoint does not automatically rewrite or enhance input prompts; for optimal results currently, we recommend community partners use DeepSeek to rewrite prompts. You can go to Tencent Cloud to apply for an API key.

| Arguments | Description | Default |
| ----------------------- | ------------------------------------------------------------ | ----------- |
| `--prompt` | Input prompt | (Required) |
| `--model-id` | Model path | (Required) |
| `--attn-impl` | Attention implementation. Either `sdpa` or `flashattention2`. | `sdpa` |
| `--moe-impl` | MoE implementation. Either `eager` or `flashinfer` | `eager` |
| `--seed` | Random seed for image generation | `None` |
| `--diff-infer-steps` | Diffusion infer steps | `50` |
| `--image-size` | Image resolution. Can be `auto`, like `1280x768` or `16:9` | `auto` |
| `--save` | Image save path. | `image.png` |
| `--verbose` | Verbose level. 0: No log; 1: log inference information. | `0` |
| `--rewrite` | Whether to enable rewriting | `1` |
| `--sys-deepseek-prompt` | Select sys-prompt from `universal` or `textrendering` | `universal` |

Launch an interactive web interface for easy text-to-image generation.
> 🌐 Web Interface: Open your browser and navigate to `http://localhost:443` (or your configured port)

| Model | Params | Download | Recommended VRAM | Supported |
|---------------------------| --- | --- | --- | --- |
| HunyuanImage-3.0 | 80B total (13B active) | HuggingFace | ≥ 3 × 80 GB | ✅ Text-to-Image |
| HunyuanImage-3.0-Instruct | 80B total (13B active) | HuggingFace | ≥ 3 × 80 GB | ✅ Text-to-Image ✅ Prompt Self-Rewrite ✅ CoT Think |

Notes:
- Install performance extras (FlashAttention, FlashInfer) for faster inference.
- Multi-GPU inference is recommended for the Base model.

Prompt guide

Manually writing prompts: the Pretrain checkpoint does not automatically rewrite or enhance input prompts, while the Instruct checkpoint can rewrite or enhance input prompts with thinking. For optimal results currently, we recommend community partners consult our official guide on how to write effective prompts.

We've included two system prompts in the PE folder of this repository that leverage DeepSeek to automatically enhance user inputs:
- system_prompt_universal: converts photographic-style and artistic prompts into detailed ones.
- system_prompt_text_rendering: converts UI/poster/text-rendering prompts into detailed ones that suit the model.

Note that these system prompts are in Chinese because DeepSeek works better with Chinese system prompts. If you want to use them for an English-oriented model, you may translate them into English or refer to the comments in the PE file as a guide. We also created a Yuanqi workflow to implement the universal one, which you can try directly.

Advanced tips
- Content priority: Focus on describing the main subject and action first, followed by details about the environment and style. A more general description framework is: main subject and scene + image quality and style + composition and perspective + lighting and atmosphere + technical parameters. Keywords can be added both before and after this structure.
- Image resolution: Our model not only supports multiple resolutions but also offers both automatic and specified resolution options.
In auto mode, the model automatically predicts the image resolution based on the input prompt. In specified mode (like traditional DiT), the model outputs an image resolution that strictly aligns with the user's chosen resolution. More Cases Our model can follow complex instructions to generate high‑quality, creative images. Our model can effectively process very long text inputs, enabling users to precisely control the finer details of generated images. Extended prompts allow for intricate elements to be accurately captured, making it ideal for complex projects requiring precision and creativity. Show prompt A cinematic medium shot captures a single Asian woman seated on a chair within a dimly lit room, creating an intimate and theatrical atmosphere. The composition is focused on the subject, rendered with rich colors and intricate textures that evoke a nostalgic and moody feeling. The primary subject is a young Asian woman with a thoughtful and expressive countenance, her gaze directed slightly away from the camera. She is seated in a relaxed yet elegant posture on an ornate, vintage armchair. The chair is upholstered in a deep red velvet, its fabric showing detailed, intricate textures and slight signs of wear. She wears a simple, elegant dress in a dark teal hue, the material catching the light in a way that reveals its fine-woven texture. Her skin has a soft, matte quality, and the light delicately models the contours of her face and arms. The surrounding room is characterized by its vintage decor, which contributes to the historic and evocative mood. In the immediate background, partially blurred due to a shallow depth of field consistent with a f/2.8 aperture, the wall is covered with wallpaper featuring a subtle, damask pattern. The overall color palette is a carefully balanced interplay of deep teal and rich red hues, creating a visually compelling and cohesive environment. The entire scene is detailed, from the fibers of the upholstery to the subtle patterns on the wall. The lighting is highly dramatic and artistic, defined by high contrast and pronounced shadow play. A single key light source, positioned off-camera, projects gobo lighting patterns onto the scene, casting intricate shapes of light and shadow across the woman and the back wall. These dramatic shadows create a strong sense of depth and a theatrical quality. While some shadows are deep and defined, others remain soft, gently wrapping around the subject and preventing the loss of detail in darker areas. The soft focus on the background enhances the intimate feeling, drawing all attention to the expressive subject. The overall image presents a cinematic, photorealistic photography style. Show prompt A cinematic, photorealistic medium shot captures a high-contrast urban street corner, defined by the sharp intersection of light and shadow. The primary subject is the exterior corner of a building, rendered in a low-saturation, realistic style. The building wall, which occupies the majority of the frame, is painted a warm orange with a finely detailed, rough stucco texture. Horizontal white stripes run across its surface. The base of the building is constructed from large, rough-hewn stone blocks, showing visible particles and texture. On the left, illuminated side of the building, there is a single window with closed, dark-colored shutters. Adjacent to the window, a simple black pendant lamp hangs from a thin, taut rope, casting a distinct, sharp-edged shadow onto the sunlit orange wall. 
The composition is split diagonally, with the right side of the building enveloped in a deep brown shadow. At the bottom of the frame, a smooth concrete sidewalk is visible, upon which the dynamic silhouette of a person is captured mid-stride, walking from right to left. In the shallow background, the faint, out-of-focus outlines of another building and the bare, skeletal branches of trees are softly visible, contributing to the quiet urban atmosphere and adding a sense of depth to the scene. These elements are rendered with minimal detail to keep the focus on the foreground architecture. The scene is illuminated by strong, natural sunlight originating from the upper left, creating a dramatic chiaroscuro effect. This hard light source casts deep, well-defined shadows, producing a sharp contrast between the brightly lit warm orange surfaces and the deep brown shadow areas. The lighting highlights the fine details in the wall texture and stone particles, emphasizing the photorealistic quality. The overall presentation reflects a high-quality photorealistic photography style, infused with a cinematic film noir aesthetic. Show prompt 一幅极具视觉张力的杂志封面风格人像特写。画面主体是一个身着古风汉服的人物,构图采用了从肩部以上的超级近距离特写,人物占据了画面的绝大部分,形成了强烈的视觉冲击力。 画面中的人物以一种慵懒的姿态出现,微微倾斜着头部,裸露的一侧肩膀线条流畅。她正用一种妩媚而直接的眼神凝视着镜头,双眼微张,眼神深邃,传递出一种神秘而勾人的气质。人物的面部特征精致,皮肤质感细腻,在特定的光线下,面部轮廓清晰分明,展现出一种古典与现代融合的时尚美感。 整个画面的背景被设定为一种简约而高级的纯红色。这种红色色调深沉,呈现出哑光质感,既纯粹又无任何杂质,为整个暗黑神秘的氛围奠定了沉稳而富有张力的基调。这个纯色的背景有效地突出了前景中的人物主体,使得所有视觉焦点都集中在其身上。 光线和氛围的营造是这幅杂志风海报的关键。一束暗橘色的柔和光线作为主光源,从人物的一侧斜上方投射下来,精准地勾勒出人物的脸颊、鼻梁和肩膀的轮廓,在皮肤上形成微妙的光影过渡。同时,人物的周身萦绕着一层暗淡且低饱和度的银白色辉光,如同清冷的月光,形成一道朦胧的轮廓光。这道银辉为人物增添了几分疏离的幽灵感,强化了整体暗黑风格的神秘气质。光影的强烈对比与色彩的独特搭配,共同塑造了这张充满故事感的特写画面。整体图像呈现出一种融合了古典元素的现代时尚摄影风格。 这道醒目的红色笔触运用了厚涂技法,颜料堆叠形成了强烈的物理厚度和三维立体感。它从画面的左上角附近延伸至右下角附近,构成一个动态的对角线。颜料表面可以清晰地看到画刀刮擦和笔刷拖曳留下的痕迹,边缘处的颜料层相对较薄,而中央部分则高高隆起,形成了不规则的起伏。 在这道立体的红色颜料之上,巧妙地构建了一处精致的微缩景观。景观的核心是一片模拟红海滩的区域,由细腻的深红色颜料点缀而成,与下方基底的鲜红色形成丰富的层次对比。紧邻着“红海滩”的是一小片湖泊,由一层平滑且带有光泽的蓝色与白色混合颜料构成,质感如同平静无波的水面。湖泊边缘,一小撮芦苇丛生,由几根纤细挺拔的、用淡黄色和棕色颜料勾勒出的线条来表现。一只小巧的白鹭立于芦苇旁,其形态由一小块纯白色的厚涂颜料塑造,仅用一抹精炼的黑色颜料点出其尖喙,姿态优雅宁静。 整个构图的背景是大面积的留白,呈现为一张带有细微凹凸纹理的白色纸质基底,这种极简处理极大地突出了中央的红色笔触及其上的微缩景观。 光线从画面一侧柔和地照射下来,在厚涂的颜料堆叠处投下淡淡的、轮廓分明的阴影,进一步增强了画面的三维立体感和油画质感。整幅画面呈现出一种结合了厚涂技法的现代极简主义油画风格。 Show prompt 整体画面采用一个二乘二的四宫格布局,以产品可视化的风格,展示了一只兔子在四种不同材质下的渲染效果。每个宫格内都有一只姿态完全相同的兔子模型,它呈坐姿,双耳竖立,面朝前方。所有宫格的背景均是统一的中性深灰色,这种简约背景旨在最大限度地突出每种材质的独特质感。 左上角的宫格中,兔子模型由哑光白色石膏材质构成。其表面平滑、均匀且无反射,在模型的耳朵根部、四肢交接处等凹陷区域呈现出柔和的环境光遮蔽阴影,这种微妙的阴影变化凸显了其纯粹的几何形态,整体感觉像一个用于美术研究的基础模型。 右上角的宫格中,兔子模型由晶莹剔透的无瑕疵玻璃制成。它展现了逼真的物理折射效果,透过其透明的身体看到的背景呈现出轻微的扭曲。清晰的镜面高光沿着其身体的曲线轮廓流动,表面上还能看到微弱而清晰的环境反射,赋予其一种精致而易碎的质感。 左下角的宫格中,兔子模型呈现为带有拉丝纹理的钛金属材质。金属表面具有明显的各向异性反射效果,呈现出冷峻的灰调金属光泽。锐利明亮的高光和深邃的阴影形成了强烈对比,精确地定义了其坚固的三维形态,展现了工业设计般的美感。 右下角的宫格中,兔子模型覆盖着一层柔软浓密的灰色毛绒。根根分明的绒毛清晰可见,创造出一种温暖、可触摸的质地。光线照射在绒毛的末梢,形成柔和的光晕效果,而毛绒内部的阴影则显得深邃而柔软,展现了高度写实的毛发渲染效果。 整个四宫格由来自多个方向的、柔和均匀的影棚灯光照亮,确保了每种材质的细节和特性都得到清晰的展现,没有任何刺眼的阴影或过曝的高光。这张图像以一种高度写实的3D渲染风格呈现,完美地诠释了产品可视化的精髓 Show prompt 由一个两行两列的网格构成,共包含四个独立的场景,每个场景都以不同的艺术风格描绘了一个小男孩(小明)一天中的不同活动。 左上角的第一个场景,以超写实摄影风格呈现。画面主体是一个大约8岁的东亚小男孩,他穿着整洁的小学制服——一件白色短袖衬衫和蓝色短裤,脖子上系着红领巾。他背着一个蓝色的双肩书包,正走在去上学的路上。他位于画面的前景偏右侧,面带微笑,步伐轻快。场景设定在清晨,柔和的阳光从左上方照射下来,在人行道上投下清晰而柔和的影子。背景是绿树成荫的街道和模糊可见的学校铁艺大门,营造出宁静的早晨氛围。这张图片的细节表现极为丰富,可以清晰地看到男孩头发的光泽、衣服的褶皱纹理以及书包的帆布材质,完全展现了专业摄影的质感。 右上角的第二个场景,采用日式赛璐璐动漫风格绘制。画面中,小男孩坐在家中的木质餐桌旁吃午饭。他的形象被动漫化,拥有大而明亮的眼睛和简洁的五官线条。他身穿一件简单的黄色T恤,正用筷子夹起碗里的米饭。桌上摆放着一碗汤和两盘家常菜。背景是一个温馨的室内环境,一扇明亮的窗户透进正午的阳光,窗外是蓝天白云。整个画面色彩鲜艳、饱和度高,角色轮廓线清晰明确,阴影部分采用平涂的色块处理,是典型的赛璐璐动漫风格。 
左下角的第三个场景,以细腻的铅笔素描风格呈现。画面描绘了下午在操场上踢足球的小男孩。整个图像由不同灰度的石墨色调构成,没有其他颜色。小男孩身穿运动短袖和短裤,身体呈前倾姿态,右脚正要踢向一个足球,动作充满动感。背景是空旷的操场和远处的球门,用简练的线条和排线勾勒。艺术家通过交叉排线和涂抹技巧来表现光影和体积感,足球上的阴影、人物身上的肌肉线条以及地面粗糙的质感都通过铅笔的笔触得到了充分的展现。这张铅笔画突出了素描的光影关系和线条美感。 右下角的第四个场景,以文森特·梵高的后印象派油画风格进行诠释。画面描绘了夜晚时分,小男孩独自在河边钓鱼的景象。他坐在一块岩石上,手持一根简易的钓鱼竿,身影在深蓝色的夜幕下显得很渺小。整个画面的视觉焦点是天空和水面,天空布满了旋转、卷曲的星云,星星和月亮被描绘成巨大、发光的光团,使用了厚涂的油画颜料(Impasto),笔触粗犷而充满能量。深蓝、亮黄和白色的颜料在画布上相互交织,形成强烈的视觉冲击力。水面倒映着天空中扭曲的光影,整个场景充满了梵高作品中特有的强烈情感和动荡不安的美感。这幅画作是对梵高风格的深度致敬。 Show prompt 以平视视角,呈现了一幅关于如何用素描技法绘制鹦鹉的九宫格教学图。整体构图规整,九个大小一致的方形画框以三行三列的形式均匀分布在浅灰色背景上,清晰地展示了从基本形状到最终成品的全过程。 第一行从左至右展示了绘画的初始步骤。左上角的第一个画框中,用简洁的铅笔线条勾勒出鹦鹉的基本几何形态:一个圆形代表头部,一个稍大的椭圆形代表身体。右上角有一个小号的无衬线字体数字“1”。中间的第二个画框中,在基础形态上添加了三角形的鸟喙轮廓和一条长长的弧线作为尾巴的雏形,头部和身体的连接处线条变得更加流畅;右上角标有数字“2”。右侧的第三个画框中,进一步精确了鹦鹉的整体轮廓,勾勒出头部顶端的羽冠和清晰的眼部圆形轮廓;右上角标有数字“3”。 第二行专注于结构与细节的添加,描绘了绘画的中期阶段。左侧的第四个画框里,鹦鹉的身体上添加了翅膀的基本形状,同时在身体下方画出了一根作为栖木的横向树枝,鹦鹉的爪子初步搭在树枝上;右上角标有数字“4”。中间的第五个画框中,开始细化翅膀和尾部的羽毛分组,用短促的线条表现出层次感,并清晰地画出爪子紧握树枝的细节;右上角标有数字“5”。右侧的第六个画框里,开始为鹦鹉添加初步的阴影,使用交叉排线的素描技法在腹部、翅膀下方和颈部制造出体积感;右上角标有数字“6”。 第三行则展示了最终的润色与完成阶段。左下角的第七个画框中,素描的排线更加密集,阴影层次更加丰富,羽毛的纹理细节被仔细刻画出来,眼珠也添加了高光点缀,显得炯炯有神;右上角标有数字“7”。中间的第八个画框里,描绘的重点转移到栖木上,增加了树枝的纹理和节疤细节,同时整体调整了鹦鹉身上的光影关系,使立体感更为突出;右上角标有数字“8”。右下角的第九个画框是最终完成图,所有线条都经过了精炼,光影对比强烈,鹦鹉的羽毛质感、木质栖木的粗糙感都表现得淋漓尽致,呈现出一幅完整且细节丰富的素描作品;右上角标有数字“9”。 整个画面的光线均匀而明亮,没有任何特定的光源方向,确保了每个教学步骤的视觉清晰度。整体呈现出一种清晰、有条理的数字插画教程风格。 海报的主体是位于画面正中央的一只腾讯QQ企鹅。这只企鹅采用了圆润可爱的3D卡通渲染风格,身体主要为饱满的黑色,腹部为纯白色。它的眼睛大而圆,眼神好奇地直视前方。黄色的嘴巴小巧而立体,双脚同样为鲜明的黄色,稳稳地站立着。一条标志性的红色围巾整齐地系在它的脖子上,围巾的材质带有轻微的布料质感,末端自然下垂。企鹅的整体造型干净利落,边缘光滑,呈现出一种精致的数字插画质感。 海报的背景是一种从上到下由浅蓝色平滑过渡到白色的柔和渐变,营造出一种开阔、明亮的空间感。在企鹅的身后,散布着一些淡淡的、模糊的圆形光斑和几道柔和的抽象光束,为这个简约的平面设计海报增添了微妙的深度和科技感。 画面的底部区域是文字部分,排版居中对齐。上半部分是一行稍大的黑色黑体字,内容为“Hunyuan Image 3.0”。紧随其下的是一行字号略小的深灰色黑体字,内容为“原生多模态大模型”。两行文字清晰易读,与整体的现代平面设计风格保持一致。 整体光线明亮、均匀,没有明显的阴影,突出了企鹅和文字信息,符合现代设计海报的视觉要求。这张图像呈现了现代、简洁的平面设计海报风格。 🤖 SSAE (Machine Evaluation) SSAE (Structured Semantic Alignment Evaluation) is an intelligent evaluation metric for image-text alignment based on advanced multimodal large language models (MLLMs). We extracted 3500 key points across 12 categories, then used multimodal large language models to automatically evaluate and score by comparing the generated images with these key points based on the visual content of the images. Mean Image Accuracy represents the image-wise average score across all key points, while Global Accuracy directly calculates the average score across all key points. We adopted the GSB (Good/Same/Bad) evaluation method commonly used to assess the relative performance between two models from an overall image perception perspective. In total, we utilized 1,000 text prompts, generating an equal number of image samples for all compared models in a single run. For a fair comparison, we conducted inference only once for each prompt, avoiding any cherry-picking of results. When comparing with the baseline methods, we maintained the default settings for all selected models. The evaluation was performed by more than 100 professional evaluators. 
If you find HunyuanImage-3.0 useful in your research, please cite our work.

We extend our heartfelt gratitude to the following open-source projects and communities for their invaluable contributions:
- 🤗 Transformers - State-of-the-art NLP library
- 🎨 Diffusers - Diffusion models library
- 🌐 HuggingFace - AI model hub and community
- ⚡ FlashAttention - Memory-efficient attention
- 🚀 FlashInfer - Optimized inference engine

GitHub: https://github.com/Tencent-Hunyuan/HunyuanImage-3.0
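The card documents the demo's command-line arguments, but the actual invocation was stripped in this mirror. Below is a minimal sketch of driving the demo from Python using only the flags in the arguments table; the script name `run_image_gen.py` and the local model path are placeholders, so check the upstream repository for the real entry point:

```python
# Minimal sketch: invoke the HunyuanImage-3.0 demo with the documented CLI flags.
# Assumptions: "run_image_gen.py" stands in for the real demo script, and the model
# weights have already been downloaded to ./HunyuanImage-3.0.
import subprocess

cmd = [
    "python", "run_image_gen.py",            # placeholder script name
    "--prompt", "A cinematic photo of a red fox in the snow",
    "--model-id", "./HunyuanImage-3.0",      # local model path
    "--attn-impl", "sdpa",                   # or flashattention2 if installed
    "--moe-impl", "eager",                   # or flashinfer if installed
    "--diff-infer-steps", "50",
    "--image-size", "auto",
    "--save", "image.png",
]
subprocess.run(cmd, check=True)
```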

10
0

HunyuanVideo-Foley-fork

Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation - professional-grade AI sound effect generation for video content creators.

Sizhe Shan¹˒², Qiulin Li¹˒³, Yutao Cui¹, Miles Yang¹, Yuehai Wang², Qun Yang³, Jin Zhou¹†, Zhao Zhong¹
🏢 ¹Tencent Hunyuan • 🎓 ²Zhejiang University • ✈️ ³Nanjing University of Aeronautics and Astronautics

News
- [2025.9.29] 🚀 HunyuanVideo-Foley-XL Model Release - XL-sized model released with offload inference support, significantly reducing VRAM requirements.
- [2025.8.28] 🌟 HunyuanVideo-Foley Open Source Release - inference code and model weights publicly available. Experience the magic of AI-generated Foley audio in perfect sync with video content!

🎬 Watch how HunyuanVideo-Foley generates immersive sound effects synchronized with video content.
- 🎭 Multi-scenario Sync: High-quality audio synchronized with complex video scenes
- 🧠 Multi-modal Balance: Perfect harmony between visual and textual information
- 🎵 48kHz Hi-Fi Output: Professional-grade audio generation with crystal clarity

🚀 Tencent Hunyuan open-sources HunyuanVideo-Foley, an end-to-end video sound effect generation model! It is a professional-grade AI tool specifically designed for video content creators, widely applicable to diverse scenarios including short video creation, film production, advertising creativity, and game development.
- 🎬 Multi-scenario Audio-Visual Synchronization: Supports generating high-quality audio that is synchronized and semantically aligned with complex video scenes, enhancing realism and immersive experience for film/TV and gaming applications.
- ⚖️ Multi-modal Semantic Balance: Intelligently balances visual and textual information analysis, comprehensively orchestrates sound effect elements, avoids one-sided generation, and meets personalized dubbing requirements.
- 🎵 High-fidelity Audio Output: A self-developed 48kHz audio VAE perfectly reconstructs sound effects, music, and vocals, achieving professional-grade audio generation quality.

HunyuanVideo-Foley comprehensively leads the field across multiple evaluation benchmarks, achieving new state-of-the-art levels in audio fidelity, visual-semantic alignment, temporal alignment, and distribution matching - surpassing all open-source solutions!
📊 Performance comparison across different evaluation metrics - HunyuanVideo-Foley leads in all categories.

🔄 Comprehensive data processing pipeline for high-quality text-video-audio datasets. The TV2A (Text-Video-to-Audio) task presents a complex multimodal generation challenge requiring large-scale, high-quality datasets. Our comprehensive data pipeline systematically identifies and excludes unsuitable content to produce robust and generalizable audio generation capabilities.

🧠 HunyuanVideo-Foley hybrid architecture with multimodal and unimodal transformer blocks. HunyuanVideo-Foley employs a sophisticated hybrid architecture:
- 🔄 Multimodal Transformer Blocks: Process visual-audio streams simultaneously
- 🎵 Unimodal Transformer Blocks: Focus on audio stream refinement
- 👁️ Visual Encoding: Pre-trained encoder extracts visual features from video frames
- 📝 Text Processing: Semantic features extracted via pre-trained text encoder
- 🎧 Audio Encoding: Latent representations with Gaussian noise perturbation
- ⏰ Temporal Alignment: Synchformer-based frame-level synchronization with gated modulation

Objective and subjective evaluation results demonstrating superior performance across all metrics:

| 🏆 Method | PQ ↑ | PC ↓ | CE ↑ | CU ↑ | IB ↑ | DeSync ↓ | CLAP ↑ | MOS-Q ↑ | MOS-S ↑ | MOS-T ↑ |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| FoleyGrafter | 6.27 | 2.72 | 3.34 | 5.68 | 0.17 | 1.29 | 0.14 | 3.36±0.78 | 3.54±0.88 | 3.46±0.95 |
| V-AURA | 5.82 | 4.30 | 3.63 | 5.11 | 0.23 | 1.38 | 0.14 | 2.55±0.97 | 2.60±1.20 | 2.70±1.37 |
| Frieren | 5.71 | 2.81 | 3.47 | 5.31 | 0.18 | 1.39 | 0.16 | 2.92±0.95 | 2.76±1.20 | 2.94±1.26 |
| MMAudio | 6.17 | 2.84 | 3.59 | 5.62 | 0.27 | 0.80 | 0.35 | 3.58±0.84 | 3.63±1.00 | 3.47±1.03 |
| ThinkSound | 6.04 | 3.73 | 3.81 | 5.59 | 0.18 | 0.91 | 0.20 | 3.20±0.97 | 3.01±1.04 | 3.02±1.08 |
| HunyuanVideo-Foley (ours) | 6.59 | 2.74 | 3.88 | 6.13 | 0.35 | 0.74 | 0.33 | 4.14±0.68 | 4.12±0.77 | 4.15±0.75 |

Comprehensive objective evaluation showcasing state-of-the-art performance:

| 🏆 Method | FD_PANNs ↓ | FD_PASST ↓ | KL ↓ | IS ↑ | PQ ↑ | PC ↓ | CE ↑ | CU ↑ | IB ↑ | DeSync ↓ | CLAP ↑ |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| FoleyGrafter | 22.30 | 322.63 | 2.47 | 7.08 | 6.05 | 2.91 | 3.28 | 5.44 | 0.22 | 1.23 | 0.22 |
| V-AURA | 33.15 | 474.56 | 3.24 | 5.80 | 5.69 | 3.98 | 3.13 | 4.83 | 0.25 | 0.86 | 0.13 |
| Frieren | 16.86 | 293.57 | 2.95 | 7.32 | 5.72 | 2.55 | 2.88 | 5.10 | 0.21 | 0.86 | 0.16 |
| MMAudio | 9.01 | 205.85 | 2.17 | 9.59 | 5.94 | 2.91 | 3.30 | 5.39 | 0.30 | 0.56 | 0.27 |
| ThinkSound | 9.92 | 228.68 | 2.39 | 6.86 | 5.78 | 3.23 | 3.12 | 5.11 | 0.22 | 0.67 | 0.22 |
| HunyuanVideo-Foley (ours) | 6.07 | 202.12 | 1.89 | 8.30 | 6.12 | 2.76 | 3.22 | 5.53 | 0.38 | 0.54 | 0.24 |

🎉 Outstanding results! HunyuanVideo-Foley achieves the best scores across ALL evaluation metrics, demonstrating significant improvements in audio quality, synchronization, and semantic alignment.

🔧 System Requirements
- CUDA: 12.4 or 11.8 recommended
- Python: 3.8+
- OS: Linux (primary support)
💡 Tip: We recommend using Conda for Python environment management.

Usage
- Generate Foley audio for a single video file with a text description.
- Process multiple videos using a CSV file with video paths and descriptions.
- Launch a user-friendly Gradio web interface for easy interaction. 🚀 Then open your browser and navigate to the provided local URL to start generating Foley audio!

If you find HunyuanVideo-Foley useful for your research, please consider citing our paper. We extend our heartfelt gratitude to the open-source community! 🌟 Special thanks to all researchers and developers who contribute to the advancement of AI-generated audio and multimodal learning!
GitHub: https://github.com/Tencent-Hunyuan • Twitter: https://twitter.com/TencentHunyuan • Website: https://hunyuan.tencent.com/
© 2025 Tencent Hunyuan. All rights reserved. | Made with ❤️ for the AI community
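The batch mode described above takes a CSV of video paths and text descriptions, though the exact column names were not preserved in this mirror. A small sketch of preparing such a file; the column headers and output filename are assumptions to adapt to the upstream batch script:

```python
# Minimal sketch: build a CSV of (video path, text description) pairs for batch Foley generation.
# Assumption: the column names "video" and "prompt" are placeholders; match them to the
# headers expected by the upstream batch script.
import csv

rows = [
    ("clips/waves.mp4", "gentle ocean waves rolling onto a pebble beach"),
    ("clips/street.mp4", "busy city street with passing cars and distant chatter"),
]

with open("foley_batch.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["video", "prompt"])  # assumed header names
    writer.writerows(rows)
```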

10
0

stable-diffusion-2-inpainting-fork

9
0

SongGeneration-fork

9
0

TRELLIS-text-xlarge-fork

The text-conditioned version of TRELLIS with model size XL, a large 3D generative model. It was introduced in the paper Structured 3D Latents for Scalable and Versatile 3D Generation.

license:mit
8
0

Qwen3-Omni-30B-A3B-Instruct-fork

Qwen3-Omni is a natively end-to-end multilingual omni-modal foundation model. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech. We introduce several architectural upgrades to improve performance and efficiency. Key features:
- State-of-the-art across modalities: Early text-first pretraining and mixed multimodal training provide native multimodal support. While achieving strong audio and audio-video results, unimodal text and image performance does not regress. Reaches SOTA on 22 of 36 audio/video benchmarks and open-source SOTA on 32 of 36; ASR, audio understanding, and voice conversation performance is comparable to Gemini 2.5 Pro.
- Multilingual: Supports 119 text languages, 19 speech input languages, and 10 speech output languages.
  - Speech input: English, Chinese, Korean, Japanese, German, Russian, Italian, French, Spanish, Portuguese, Malay, Dutch, Indonesian, Turkish, Vietnamese, Cantonese, Arabic, Urdu.
  - Speech output: English, Chinese, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean.
- Novel architecture: MoE-based Thinker-Talker design with AuT pretraining for strong general representations, plus a multi-codebook design that drives latency to a minimum.
- Real-time audio/video interaction: Low-latency streaming with natural turn-taking and immediate text or speech responses.
- Flexible control: Customize behavior via system prompts for fine-grained control and easy adaptation.
- Detailed audio captioner: Qwen3-Omni-30B-A3B-Captioner is now open source: a general-purpose, highly detailed, low-hallucination audio captioning model that fills a critical gap in the open-source community.

Qwen3-Omni supports a wide range of multimodal application scenarios, covering various domain tasks involving audio, image, video, and audio-visual modalities. Below are several cookbooks demonstrating usage of Qwen3-Omni; these cookbooks include our actual execution logs. You can first follow the QuickStart guide to download the model and install the necessary inference environment dependencies, then run and experiment locally - try modifying prompts or switching model types, and enjoy exploring the capabilities of Qwen3-Omni!
- Audio Speech Recognition: Speech recognition, supporting multiple languages and long audio.
- Speech Translation: Speech-to-Text / Speech-to-Speech translation.
- Music Analysis: Detailed analysis and appreciation of any music, including style, genre, rhythm, etc.
- Sound Analysis: Description and analysis of various sound effects and audio signals.
- Audio Caption: Audio captioning, detailed description of any audio input.
- Mixed Audio Analysis: Analysis of mixed audio content, such as speech, music, and environmental sounds.
- Image Question Answering: arbitrary questions about any image.
- Image Math: Solving complex mathematical problems in images, highlighting the capabilities of the Thinking model.
- Video Description: Detailed description of video content.
- Video Navigation: Generating navigation commands from first-person motion videos.
- Video Scene Transition: Analysis of scene transitions in videos.
- Audio Visual Question Answering: arbitrary questions in audio-visual scenarios, demonstrating the model's ability to model temporal alignment between audio and video.
- Audio Visual Interaction: Interactive communication with the model using audio-visual inputs, including task specification via audio.
Audio Visual Dialogue Conversational interaction with the model using audio-visual inputs, showcasing its capabilities in casual chat and assistant-like behavior. Agent Audio Function Call Using audio input to perform function calls, enabling agent-like behaviors. Downstream Task Fine-tuning Omni Captioner Introduction and capability demonstration of Qwen3-Omni-30B-A3B-Captioner , a downstream fine-tuned model based on Qwen3-Omni-30B-A3B-Instruct, illustrating the strong generalization ability of the Qwen3-Omni foundation model. Below is the description of all Qwen3-Omni models. Please select and download the model that fits your needs. | Model Name | Description | |------------------------------|-------------| | Qwen3-Omni-30B-A3B-Instruct | The Instruct model of Qwen3-Omni-30B-A3B, containing both thinker and talker, supporting audio, video, and text input, with audio and text output. For more information, please read the Qwen3-Omni Technical Report. | | Qwen3-Omni-30B-A3B-Thinking | The Thinking model of Qwen3-Omni-30B-A3B, containing the thinker component, equipped with chain-of-thought reasoning, supporting audio, video, and text input, with text output. For more information, please read the Qwen3-Omni Technical Report.| | Qwen3-Omni-30B-A3B-Captioner | A downstream audio fine-grained caption model fine-tuned from Qwen3-Omni-30B-A3B-Instruct, which produces detailed, low-hallucination captions for arbitrary audio inputs. It contains the thinker, supporting audio input and text output. For more information, you can refer to the model's cookbook. | During loading in Hugging Face Transformers or vLLM, model weights will be automatically downloaded based on the model name. However, if your runtime environment is not conducive to downloading weights during execution, you can refer to the following commands to manually download the model weights to a local directory: The Hugging Face Transformers code for Qwen3-Omni has been successfully merged, but the PyPI package has not yet been released. Therefore, you need to install it from source using the following command. We strongly recommend that you create a new Python environment to avoid environment runtime issues. We offer a toolkit to help you handle various types of audio and visual input more conveniently, providing an API-like experience. This includes support for base64, URLs, and interleaved audio, images, and videos. You can install it using the following command and make sure your system has `ffmpeg` installed: Additionally, we recommend using FlashAttention 2 when running with Hugging Face Transformers to reduce GPU memory usage. However, if you are primarily using vLLM for inference, this installation is not necessary, as vLLM includes FlashAttention 2 by default. Also, you should have hardware that is compatible with FlashAttention 2. Read more about it in the official documentation of the FlashAttention repository. FlashAttention 2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`. Here is a code snippet to show you how to use Qwen3-Omni with `transformers` and `qwenomniutils`: Here are some more advanced usage examples. You can expand the sections below to learn more. The model can batch inputs composed of mixed samples of various types such as text, images, audio, and videos as input when `returnaudio=False` is set. Here is an example. The model supports both text and audio outputs. If users do not need audio outputs, they can call `model.disabletalker()` after initializing the model. 
This option will save about `10GB` of GPU memory, but the `returnaudio` option for the `generate` function will only allow `False`. For a more flexible experience, we recommend that users decide whether to return audio when the `generate` function is called. If `returnaudio` is set to `False`, the model will only return text outputs, resulting in faster text responses. Qwen3-Omni supports changing the voice of the output audio. The `"Qwen/Qwen3-Omni-30B-A3B-Instruct"` checkpoint supports three voice types as follows: | Voice Type | Gender | Description | |------------|--------|-------------| | Ethan | Male | A bright, upbeat voice with infectious energy and a warm, approachable vibe. | | Chelsie | Female | A honeyed, velvety voice that carries a gentle warmth and luminous clarity. | | Aiden | Male | A warm, laid-back American voice with a gentle, boyish charm. | Users can use the `speaker` parameter of the `generate` function to specify the voice type. By default, if `speaker` is not specified, the voice type is `Ethan`. We strongly recommend using vLLM for inference and deployment of the Qwen3-Omni series models. Since our code is currently in the pull request stage, and audio output inference support for the Instruct model will be released in the near future, you can follow the commands below to install vLLM from source. Please note that we recommend you create a new Python environment to avoid runtime environment conflicts and incompatibilities. For more details on compiling vLLM from source, please refer to the vLLM official documentation. You can use the following code for vLLM inference. The `limitmmperprompt` parameter specifies the maximum number of each modality's data allowed per message. Since vLLM needs to pre-allocate GPU memory, larger values will require more GPU memory; if OOM issues occur, try reducing this value. Setting `tensorparallelsize` greater than one enables multi-GPU parallel inference, improving concurrency and throughput. In addition, `maxnumseqs` indicates the number of sequences that vLLM processes in parallel during each inference step. A larger value requires more GPU memory but enables higher batch inference speed. For more details, please refer to the vLLM official documentation. Below is a simple example of how to run Qwen3-Omni with vLLM: Here are some more advanced usage examples. You can expand the sections below to learn more. Using vLLM enables fast batch inference, which can help you efficiently process large volumes of data or conduct benchmarking. Refer to the following code example: vLLM serve for Qwen3-Omni currently only supports the thinker model. The `useaudioinvideo` parameter is not available in vLLM serve; you can handle this by separately passing video and audio inputs for processing. You can start vLLM serve through the following command: Then you can use the chat API as below (via curl, for example): | Model | Precision | 15s Video | 30s Video | 60s Video | 120s Video | |------------------------------|-----------| --------- | --------- | --------- | --------- | | Qwen3-Omni-30B-A3B-Instruct | BF16 | 78.85 GB | 88.52 GB | 107.74 GB | 144.81 GB | | Qwen3-Omni-30B-A3B-Thinking | BF16 | 68.74 GB | 77.79 GB | 95.76 GB | 131.65 GB | Note: The table above presents the theoretical minimum memory requirements for inference with `transformers` and `BF16` precision, tested with `attnimplementation="flashattention2"`. The Instruct model includes both the thinker and talker components, whereas the Thinking model includes only the thinker part. 
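The `transformers` snippet referenced earlier in this card was omitted in this fork. Below is a minimal sketch of the flow it describes, also illustrating the `return_audio` and `speaker` options discussed above. It assumes the `Qwen3OmniMoe*` class names match the upstream Qwen3-Omni card and that your `transformers` build already includes the merged Qwen3-Omni code; the video URL and output sample rate are placeholders.

```python
# Hedged sketch of Qwen3-Omni inference with transformers + qwen-omni-utils.
# Assumptions: Qwen3OmniMoe* class names match the upstream card; FlashAttention 2 is installed.
import soundfile as sf
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

MODEL_ID = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",  # optional; requires compatible hardware
)
processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_ID)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://example.com/clip.mp4"},  # hypothetical URL
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    }
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos,
    return_tensors="pt", padding=True, use_audio_in_video=True,
).to(model.device)

# return_audio=False skips the talker and returns text only (faster);
# speaker selects the output voice when audio is requested.
text_ids, audio = model.generate(**inputs, use_audio_in_video=True, return_audio=True, speaker="Ethan")
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
if audio is not None:
    # Sample rate is an assumption; check the upstream card for the exact value.
    sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```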
When using Qwen3-Omni for audio-visual multimodal interaction, where the input consists of a video and its corresponding audio (with the audio serving as a query), we recommend using the following system prompt. This setup helps the model maintain high reasoning capability while better assuming interactive roles such as a smart assistant. Additionally, the text generated by the thinker will be more readable, with a natural, conversational tone and without complex formatting that is difficult to vocalize, leading to more stable and fluent audio output from the talker. You can customize the `usersystemprompt` field in the system prompt to include character settings or other role-specific descriptions as needed. The `Qwen3-Omni-30B-A3B-Thinking` model is primarily designed for understanding and interacting with multimodal inputs, including text, audio, image, and video. To achieve optimal performance, we recommend that users include an explicit textual instruction or task description in each round of dialogue alongside the multimodal input. This helps clarify the intent and significantly enhances the model's ability to leverage its reasoning capabilities. For example: In multimodal interaction, user-provided videos are often accompanied by audio (such as spoken questions or sounds from events in the video). This information helps the model provide a better interactive experience. We provide the following options for users to decide whether to use the audio from a video. It is worth noting that during a multi-round conversation, the `useaudioinvideo` parameter must be set consistently across these steps; otherwise, unexpected results may occur. Qwen3-Omni maintains state-of-the-art performance on text and visual modalities without degradation relative to same-size single-model Qwen counterparts. Across 36 audio and audio-visual benchmarks, it achieves open-source SOTA on 32 and sets the SOTA on 22, outperforming strong closed-source systems such as Gemini 2.5 Pro and GPT-4o. 
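The recommended system prompt itself is given in the upstream Qwen3-Omni card; as a hedged sketch, the message layout for audio-visual interaction looks roughly like the following, with the `user_system_prompt` field standing in for your own role or character description.

```python
# Sketch of the message layout for audio-visual interaction (video + its audio as the query).
# The exact recommended system prompt wording lives in the upstream card; the text below is a placeholder.
user_system_prompt = "You are a helpful smart assistant."  # customize with role-specific descriptions

conversation = [
    {"role": "system", "content": [{"type": "text", "text": user_system_prompt}]},
    {
        "role": "user",
        "content": [
            # With use_audio_in_video=True (in both process_mm_info and generate),
            # the video's own audio track serves as the spoken query.
            {"type": "video", "video": "https://example.com/question.mp4"},  # hypothetical URL
        ],
    },
]
```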
GPT-4o-0327 Qwen3-235B-A22B Non Thinking Qwen3-30B-A3B-Instruct-2507 Qwen3-Omni-30B-A3B-Instruct Qwen3-Omni-Flash-Instruct Multilingual Tasks MultiIF 70.4 70.2 67.9 64.0 64.7 Gemini-2.5-Flash Thinking Qwen3-235B-A22B Thinking Qwen3-30B-A3B-Thinking-2507 Qwen3-Omni-30B-A3B-Thinking Qwen3-Omni-Flash-Thinking Multilingual Tasks MultiIF 74.4 71.9 76.4 72.9 73.2 Seed-ASR Voxtral-Mini Voxtral-Small GPT-4o-Transcribe Gemini-2.5-Pro Qwen2.5-Omni Qwen3-Omni-30B-A3B-Instruct Qwen3-Omni-Flash-Instruct Wenetspeech net | meeting 4.66 | 5.69 24.30 | 31.53 20.33 | 26.08 15.30 | 32.27 14.43 | 13.47 5.91 | 7.65 4.69 | 5.89 4.62 | 5.75 Librispeech clean | other 1.58 | 2.84 1.88 | 4.12 1.56 | 3.30 1.39 | 3.75 2.89 | 3.56 1.74 | 3.45 1.22 | 2.48 1.27 | 2.44 Fleurs-avg (19 lang) - 15.67 8.09 4.48 5.55 14.04 5.33 5.31 MIR-1K (vocal-only) 6.45 23.33 18.73 11.87 9.85 8.15 5.90 5.85 Opencpop-test 2.98 31.01 16.06 7.93 6.49 2.84 1.54 2.02 Fleurs-en2xx - 30.35 37.85 - 39.25 29.22 37.50 36.22 Fleurs-xx2en - 27.54 32.81 - 35.41 28.61 31.08 30.71 Fleurs-zh2xx - 17.03 22.05 - 26.63 17.97 25.17 25.10 Fleurs-xx2zh - 28.75 34.82 - 37.50 27.68 33.13 31.19 GPT-4o-Audio Gemini-2.5-Flash Gemini-2.5-Pro Qwen2.5-Omni Qwen3-Omni-30B-A3B-Instruct Qwen3-Omni-30B-A3B-Thinking Qwen3-Omni-Flash-Instruct Qwen3-Omni-Flash-Thinking MMAU-v05.15.25 62.5 71.8 77.4 65.5 77.5 75.4 77.6 76.5 Best Specialist Models GPT-4o-Audio Gemini-2.5-Pro Qwen2.5-Omni Qwen3-Omni-30B-A3B-Instruct Qwen3-Omni-Flash-Instruct RUL-MuchoMusic 47.6 (Audio Flamingo 3) 36.1 49.4 47.3 52.0 52.1 MTG Genre Micro F1 35.8 (MuQ-MuLan) 25.3 32.6 32.5 39.0 39.5 MTG Mood/Theme Micro F1 10.9 (MuQ-MuLan) 11.3 14.1 8.9 21.0 21.7 MTG Instrument Micro F1 39.8 (MuQ-MuLan) 34.2 33.0 22.6 40.5 40.7 MTG Top50 Micro F1 33.2 (MuQ-MuLan) 25.0 26.1 21.6 36.7 36.9 MagnaTagATune Micro F1 41.6 (MuQ) 29.2 28.1 30.1 44.3 46.8 Datasets GPT4-o Gemini-2.0-Flash Qwen2.5-VL 72B Qwen3-Omni-30B-A3B -Instruct Qwen3-Omni-Flash -Instruct Datasets Gemini-2.5-flash-thinking InternVL-3.5-241B-A28B Qwen3-Omni-30B-A3B-Thinking Qwen3-Omni-Flash-Thinking Datasets Previous Open-source SoTA Gemini-2.5-Flash Qwen2.5-Omni Qwen3-Omni-30B-A3B-Instruct Qwen3-Omni-Flash-Instruct Datasets Previous Open-source SoTA Gemini-2.5-Flash-Thinking Qwen3-Omni-30B-A3B-Thinking Qwen3-Omni-Flash-Thinking Qwen3-Omni-30B-A3B MiniMax ElevenLabs Qwen3-Omni-30B-A3B MiniMax ElevenLabs Decoding Strategy: For the Qwen3-Omni series across all evaluation benchmarks, `Instruct` models use greedy decoding during generation without sampling. For `Thinking` models, the decoding parameters should be taken from the `generationconfig.json` file in the checkpoint. Benchmark-Specific Formatting: For the majority of evaluation benchmarks, they come with their own ChatML formatting to embed the question or prompt. It should be noted that all video data are set to `fps=2` during evaluation. Default Prompts: For tasks in certain benchmarks that do not include a prompt, we use the following prompt settings: | Task Type | Prompt | | :--- | :--- | | Auto Speech Recognition (ASR) for Chinese | 请将这段中文语音转换为纯文本。 | | Auto Speech Recognition (ASR) for Other languages | Transcribe the audio into text. | | Speech-to-Text Translation (S2TT) | Listen to the provided speech and produce a translation in text. | | Song Lyrics Recognition | Transcribe the song lyrics into text without any punctuation, separate lines with line breaks, and output only the lyrics without additional explanations. 
System Prompt: No `system prompt` should be set for any evaluation benchmark. Input Sequence: The question or prompt should be input as user text. Unless otherwise specified by the benchmark, the text should come after multimodal data in the sequence. For example:
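A minimal sketch of that ordering, with the audio file path chosen purely for illustration:

```python
# The benchmark question is plain user text placed after the multimodal item; no system prompt is set.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "sample.wav"},                      # multimodal data first
            {"type": "text", "text": "Transcribe the audio into text."},   # prompt/question after
        ],
    }
]
```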

NaNK
8
0

Gemmasutra-Pro-27B-v1-fork

NaNK
license:cc-by-nc-4.0
4
0

dpt-large

license:apache-2.0
4
0

Janus-Pro-1B-fork

Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It addresses the limitations of previous approaches by decoupling visual encoding into separate pathways, while still utilizing a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder’s roles in understanding and generation, but also enhances the framework’s flexibility. Janus-Pro surpasses previous unified models and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus-Pro make it a strong candidate for next-generation unified multimodal models. Janus-Pro is a unified understanding-and-generation MLLM that decouples visual encoding for multimodal understanding and generation. It is built on DeepSeek-LLM-1.5b-base/DeepSeek-LLM-7b-base. For multimodal understanding, it uses SigLIP-L as the vision encoder, which supports 384 x 384 image input. For image generation, Janus-Pro uses the tokenizer from here with a downsample rate of 16. This code repository is licensed under the MIT License. The use of Janus-Pro models is subject to the DeepSeek Model License. If you have any questions, please raise an issue or contact us at [email protected].
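A minimal loading sketch, assuming the `janus` package from the deepseek-ai/Janus repository exposes a `VLChatProcessor` as in its README; treat the import path and class name as assumptions and defer to the upstream repo for actual inference code.

```python
# Hedged loading sketch for Janus-Pro-1B.
# pip install git+https://github.com/deepseek-ai/Janus.git   (assumed install path)
import torch
from transformers import AutoModelForCausalLM
from janus.models import VLChatProcessor  # assumption: class name as in the upstream Janus repo

MODEL_ID = "deepseek-ai/Janus-Pro-1B"
processor = VLChatProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True, torch_dtype=torch.bfloat16)
```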

NaNK
license:mit
4
0

Qwen-Image-fork

license:apache-2.0
3
0

sam2-hiera-large

license:apache-2.0
3
0

Qwen3-Omni-30B-A3B-Thinking-fork

Qwen3-Omni is the natively end-to-end multilingual omni-modal foundation models. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech. We introduce several architectural upgrades to improve performance and efficiency. Key features: State-of-the-art across modalities: Early text-first pretraining and mixed multimodal training provide native multimodal support. While achieving strong audio and audio-video results, unimodal text and image performance does not regress. Reaches SOTA on 22 of 36 audio/video benchmarks and open-source SOTA on 32 of 36; ASR, audio understanding, and voice conversation performance is comparable to Gemini 2.5 Pro. Multilingual: Supports 119 text languages, 19 speech input languages, and 10 speech output languages. - Speech Input: English, Chinese, Korean, Japanese, German, Russian, Italian, French, Spanish, Portuguese, Malay, Dutch, Indonesian, Turkish, Vietnamese, Cantonese, Arabic, Urdu. - Speech Output: English, Chinese, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean. Novel Architecture: MoE-based Thinker–Talker design with AuT pretraining for strong general representations, plus a multi-codebook design that drives latency to a minimum. Real-time Audio/Video Interaction: Low-latency streaming with natural turn-taking and immediate text or speech responses. Flexible Control: Customize behavior via system prompts for fine-grained control and easy adaptation. Detailed Audio Captioner: Qwen3-Omni-30B-A3B-Captioner is now open source: a general-purpose, highly detailed, low-hallucination audio captioning model that fills a critical gap in the open-source community. Qwen3-Omni supports a wide range of multimodal application scenarios, covering various domain tasks involving audio, image, video, and audio-visual modalities. Below are several cookbooks demonstrating the usage cases of Qwen3-Omni and these cookbooks include our actual execution logs. You can first follow the QuickStart guide to download the model and install the necessary inference environment dependencies, then run and experiment locally—try modifying prompts or switching model types, and enjoy exploring the capabilities of Qwen3-Omni! Audio Speech Recognition Speech recognition, supporting multiple languages and long audio. Speech Translation Speech-to-Text / Speech-to-Speech translation. Music Analysis Detailed analysis and appreciation of any music, including style, genre, rhythm, etc. Sound Analysis Description and analysis of various sound effects and audio signals. Audio Caption Audio captioning, detailed description of any audio input. Mixed Audio Analysis Analysis of mixed audio content, such as speech, music, and environmental sounds. Image Question Answering arbitrary questions about any image. Image Math Solving complex mathematical problems in images, highlighting the capabilities of the Thinking model. Video Description Detailed description of video content. Video Navigation Generating navigation commands from first-person motion videos. Video Scene Transition Analysis of scene transitions in videos. Audio-Visual Audio Visual Question Answering arbitrary questions in audio-visual scenarios, demonstrating the model's ability to model temporal alignment between audio and video. Audio Visual Interaction Interactive communication with the model using audio-visual inputs, including task specification via audio. 
Audio Visual Dialogue Conversational interaction with the model using audio-visual inputs, showcasing its capabilities in casual chat and assistant-like behavior. Agent Audio Function Call Using audio input to perform function calls, enabling agent-like behaviors. Downstream Task Fine-tuning Omni Captioner Introduction and capability demonstration of Qwen3-Omni-30B-A3B-Captioner , a downstream fine-tuned model based on Qwen3-Omni-30B-A3B-Instruct, illustrating the strong generalization ability of the Qwen3-Omni foundation model. Below is the description of all Qwen3-Omni models. Please select and download the model that fits your needs. | Model Name | Description | |------------------------------|-------------| | Qwen3-Omni-30B-A3B-Instruct | The Instruct model of Qwen3-Omni-30B-A3B, containing both thinker and talker, supporting audio, video, and text input, with audio and text output. For more information, please read the Qwen3-Omni Technical Report. | | Qwen3-Omni-30B-A3B-Thinking | The Thinking model of Qwen3-Omni-30B-A3B, containing the thinker component, equipped with chain-of-thought reasoning, supporting audio, video, and text input, with text output. For more information, please read the Qwen3-Omni Technical Report.| | Qwen3-Omni-30B-A3B-Captioner | A downstream audio fine-grained caption model fine-tuned from Qwen3-Omni-30B-A3B-Instruct, which produces detailed, low-hallucination captions for arbitrary audio inputs. It contains the thinker, supporting audio input and text output. For more information, you can refer to the model's cookbook. | During loading in Hugging Face Transformers or vLLM, model weights will be automatically downloaded based on the model name. However, if your runtime environment is not conducive to downloading weights during execution, you can refer to the following commands to manually download the model weights to a local directory: The Hugging Face Transformers code for Qwen3-Omni has been successfully merged, but the PyPI package has not yet been released. Therefore, you need to install it from source using the following command. We strongly recommend that you create a new Python environment to avoid environment runtime issues. We offer a toolkit to help you handle various types of audio and visual input more conveniently, providing an API-like experience. This includes support for base64, URLs, and interleaved audio, images, and videos. You can install it using the following command and make sure your system has `ffmpeg` installed: Additionally, we recommend using FlashAttention 2 when running with Hugging Face Transformers to reduce GPU memory usage. However, if you are primarily using vLLM for inference, this installation is not necessary, as vLLM includes FlashAttention 2 by default. Also, you should have hardware that is compatible with FlashAttention 2. Read more about it in the official documentation of the FlashAttention repository. FlashAttention 2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`. Here is a code snippet to show you how to use Qwen3-Omni with `transformers` and `qwenomniutils`: Here are some more advanced usage examples. You can expand the sections below to learn more. The model can batch inputs composed of mixed samples of various types such as text, images, audio, and videos as input when `returnaudio=False` is set. Here is an example. The model supports both text and audio outputs. If users do not need audio outputs, they can call `model.disabletalker()` after initializing the model. 
This option will save about `10GB` of GPU memory, but the `returnaudio` option for the `generate` function will only allow `False`. For a more flexible experience, we recommend that users decide whether to return audio when the `generate` function is called. If `returnaudio` is set to `False`, the model will only return text outputs, resulting in faster text responses. Qwen3-Omni supports changing the voice of the output audio. The `"Qwen/Qwen3-Omni-30B-A3B-Instruct"` checkpoint supports three voice types as follows: | Voice Type | Gender | Description | |------------|--------|-------------| | Ethan | Male | A bright, upbeat voice with infectious energy and a warm, approachable vibe. | | Chelsie | Female | A honeyed, velvety voice that carries a gentle warmth and luminous clarity. | | Aiden | Male | A warm, laid-back American voice with a gentle, boyish charm. | Users can use the `speaker` parameter of the `generate` function to specify the voice type. By default, if `speaker` is not specified, the voice type is `Ethan`. We strongly recommend using vLLM for inference and deployment of the Qwen3-Omni series models. Since our code is currently in the pull request stage, and audio output inference support for the Instruct model will be released in the near future, you can follow the commands below to install vLLM from source. Please note that we recommend you create a new Python environment to avoid runtime environment conflicts and incompatibilities. For more details on compiling vLLM from source, please refer to the vLLM official documentation. You can use the following code for vLLM inference. The `limitmmperprompt` parameter specifies the maximum number of each modality's data allowed per message. Since vLLM needs to pre-allocate GPU memory, larger values will require more GPU memory; if OOM issues occur, try reducing this value. Setting `tensorparallelsize` greater than one enables multi-GPU parallel inference, improving concurrency and throughput. In addition, `maxnumseqs` indicates the number of sequences that vLLM processes in parallel during each inference step. A larger value requires more GPU memory but enables higher batch inference speed. For more details, please refer to the vLLM official documentation. Below is a simple example of how to run Qwen3-Omni with vLLM: Here are some more advanced usage examples. You can expand the sections below to learn more. Using vLLM enables fast batch inference, which can help you efficiently process large volumes of data or conduct benchmarking. Refer to the following code example: vLLM serve for Qwen3-Omni currently only supports the thinker model. The `useaudioinvideo` parameter is not available in vLLM serve; you can handle this by separately passing video and audio inputs for processing. You can start vLLM serve through the following command: Then you can use the chat API as below (via curl, for example): | Model | Precision | 15s Video | 30s Video | 60s Video | 120s Video | |------------------------------|-----------| --------- | --------- | --------- | --------- | | Qwen3-Omni-30B-A3B-Instruct | BF16 | 78.85 GB | 88.52 GB | 107.74 GB | 144.81 GB | | Qwen3-Omni-30B-A3B-Thinking | BF16 | 68.74 GB | 77.79 GB | 95.76 GB | 131.65 GB | Note: The table above presents the theoretical minimum memory requirements for inference with `transformers` and `BF16` precision, tested with `attnimplementation="flashattention2"`. The Instruct model includes both the thinker and talker components, whereas the Thinking model includes only the thinker part. 
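The vLLM example referenced above was omitted in this fork. Below is a hedged offline-inference sketch for the Thinking model using the knobs discussed in this card (`limit_mm_per_prompt`, `tensor_parallel_size`, `max_num_seqs`); it assumes a vLLM build that already includes the Qwen3-Omni support mentioned above, and uses a text-only prompt for brevity.

```python
# Hedged vLLM offline-inference sketch (text-only prompt for brevity).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Omni-30B-A3B-Thinking",
    tensor_parallel_size=4,            # >1 enables multi-GPU parallel inference
    max_num_seqs=8,                    # sequences processed in parallel per step; higher needs more memory
    limit_mm_per_prompt={"image": 1, "video": 1, "audio": 1},  # per-message modality caps
    trust_remote_code=True,
)

sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=2048)
outputs = llm.chat(
    [{"role": "user", "content": "Summarize the key ideas of mixture-of-experts models."}],
    sampling_params=sampling,
)
print(outputs[0].outputs[0].text)
```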
When using Qwen3-Omni for audio-visual multimodal interaction, where the input consists of a video and its corresponding audio (with the audio serving as a query), we recommend using the following system prompt. This setup helps the model maintain high reasoning capability while better assuming interactive roles such as a smart assistant. Additionally, the text generated by the thinker will be more readable, with a natural, conversational tone and without complex formatting that is difficult to vocalize, leading to more stable and fluent audio output from the talker. You can customize the `usersystemprompt` field in the system prompt to include character settings or other role-specific descriptions as needed. The `Qwen3-Omni-30B-A3B-Thinking` model is primarily designed for understanding and interacting with multimodal inputs, including text, audio, image, and video. To achieve optimal performance, we recommend that users include an explicit textual instruction or task description in each round of dialogue alongside the multimodal input. This helps clarify the intent and significantly enhances the model's ability to leverage its reasoning capabilities. For example: In multimodal interaction, user-provided videos are often accompanied by audio (such as spoken questions or sounds from events in the video). This information helps the model provide a better interactive experience. We provide the following options for users to decide whether to use the audio from a video. It is worth noting that during a multi-round conversation, the `useaudioinvideo` parameter must be set consistently across these steps; otherwise, unexpected results may occur. Qwen3-Omni maintains state-of-the-art performance on text and visual modalities without degradation relative to same-size single-model Qwen counterparts. Across 36 audio and audio-visual benchmarks, it achieves open-source SOTA on 32 and sets the SOTA on 22, outperforming strong closed-source systems such as Gemini 2.5 Pro and GPT-4o. 
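The example that should follow "For example:" above was omitted; a sketch of pairing an explicit textual instruction with the multimodal input looks like this (the video URL is a placeholder):

```python
# Explicit task description alongside the multimodal input, as recommended for the Thinking model.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://example.com/lecture.mp4"},  # hypothetical URL
            {"type": "text", "text": "Watch the clip and explain, step by step, how the speaker derives the final equation."},
        ],
    }
]
```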
GPT-4o-0327 Qwen3-235B-A22B Non Thinking Qwen3-30B-A3B-Instruct-2507 Qwen3-Omni-30B-A3B-Instruct Qwen3-Omni-Flash-Instruct Multilingual Tasks MultiIF 70.4 70.2 67.9 64.0 64.7 Gemini-2.5-Flash Thinking Qwen3-235B-A22B Thinking Qwen3-30B-A3B-Thinking-2507 Qwen3-Omni-30B-A3B-Thinking Qwen3-Omni-Flash-Thinking Multilingual Tasks MultiIF 74.4 71.9 76.4 72.9 73.2 Seed-ASR Voxtral-Mini Voxtral-Small GPT-4o-Transcribe Gemini-2.5-Pro Qwen2.5-Omni Qwen3-Omni-30B-A3B-Instruct Qwen3-Omni-Flash-Instruct Wenetspeech net | meeting 4.66 | 5.69 24.30 | 31.53 20.33 | 26.08 15.30 | 32.27 14.43 | 13.47 5.91 | 7.65 4.69 | 5.89 4.62 | 5.75 Librispeech clean | other 1.58 | 2.84 1.88 | 4.12 1.56 | 3.30 1.39 | 3.75 2.89 | 3.56 1.74 | 3.45 1.22 | 2.48 1.27 | 2.44 Fleurs-avg (19 lang) - 15.67 8.09 4.48 5.55 14.04 5.33 5.31 MIR-1K (vocal-only) 6.45 23.33 18.73 11.87 9.85 8.15 5.90 5.85 Opencpop-test 2.98 31.01 16.06 7.93 6.49 2.84 1.54 2.02 Fleurs-en2xx - 30.35 37.85 - 39.25 29.22 37.50 36.22 Fleurs-xx2en - 27.54 32.81 - 35.41 28.61 31.08 30.71 Fleurs-zh2xx - 17.03 22.05 - 26.63 17.97 25.17 25.10 Fleurs-xx2zh - 28.75 34.82 - 37.50 27.68 33.13 31.19 GPT-4o-Audio Gemini-2.5-Flash Gemini-2.5-Pro Qwen2.5-Omni Qwen3-Omni-30B-A3B-Instruct Qwen3-Omni-30B-A3B-Thinking Qwen3-Omni-Flash-Instruct Qwen3-Omni-Flash-Thinking MMAU-v05.15.25 62.5 71.8 77.4 65.5 77.5 75.4 77.6 76.5 Best Specialist Models GPT-4o-Audio Gemini-2.5-Pro Qwen2.5-Omni Qwen3-Omni-30B-A3B-Instruct Qwen3-Omni-Flash-Instruct RUL-MuchoMusic 47.6 (Audio Flamingo 3) 36.1 49.4 47.3 52.0 52.1 MTG Genre Micro F1 35.8 (MuQ-MuLan) 25.3 32.6 32.5 39.0 39.5 MTG Mood/Theme Micro F1 10.9 (MuQ-MuLan) 11.3 14.1 8.9 21.0 21.7 MTG Instrument Micro F1 39.8 (MuQ-MuLan) 34.2 33.0 22.6 40.5 40.7 MTG Top50 Micro F1 33.2 (MuQ-MuLan) 25.0 26.1 21.6 36.7 36.9 MagnaTagATune Micro F1 41.6 (MuQ) 29.2 28.1 30.1 44.3 46.8 Datasets GPT4-o Gemini-2.0-Flash Qwen2.5-VL 72B Qwen3-Omni-30B-A3B -Instruct Qwen3-Omni-Flash -Instruct Datasets Gemini-2.5-flash-thinking InternVL-3.5-241B-A28B Qwen3-Omni-30B-A3B-Thinking Qwen3-Omni-Flash-Thinking Datasets Previous Open-source SoTA Gemini-2.5-Flash Qwen2.5-Omni Qwen3-Omni-30B-A3B-Instruct Qwen3-Omni-Flash-Instruct Datasets Previous Open-source SoTA Gemini-2.5-Flash-Thinking Qwen3-Omni-30B-A3B-Thinking Qwen3-Omni-Flash-Thinking Qwen3-Omni-30B-A3B MiniMax ElevenLabs Qwen3-Omni-30B-A3B MiniMax ElevenLabs Decoding Strategy: For the Qwen3-Omni series across all evaluation benchmarks, `Instruct` models use greedy decoding during generation without sampling. For `Thinking` models, the decoding parameters should be taken from the `generationconfig.json` file in the checkpoint. Benchmark-Specific Formatting: For the majority of evaluation benchmarks, they come with their own ChatML formatting to embed the question or prompt. It should be noted that all video data are set to `fps=2` during evaluation. Default Prompts: For tasks in certain benchmarks that do not include a prompt, we use the following prompt settings: | Task Type | Prompt | | :--- | :--- | | Auto Speech Recognition (ASR) for Chinese | 请将这段中文语音转换为纯文本。 | | Auto Speech Recognition (ASR) for Other languages | Transcribe the audio into text. | | Speech-to-Text Translation (S2TT) | Listen to the provided speech and produce a translation in text. | | Song Lyrics Recognition | Transcribe the song lyrics into text without any punctuation, separate lines with line breaks, and output only the lyrics without additional explanations. 
System Prompt: No `system prompt` should be set for any evaluation benchmark. Input Sequence: The question or prompt should be input as user text. Unless otherwise specified by the benchmark, the text should come after multimodal data in the sequence. For example:

NaNK
3
0

nomic-embed-multimodal-7b-fork

Nomic Embed Multimodal 7B: State-of-the-Art Visual Document Retrieval `nomic-embed-multimodal-7b` is a dense state-of-the-art multimodal embedding model that excels at visual document retrieval tasks: - High Performance: Achieves 58.8 NDCG@5 on Vidore-v2, outperforming all other dense multimodal embedding models. - Unified Text-Image Encoding: Directly encodes interleaved text and images without complex preprocessing - Advanced Architecture: 7B parameter multimodal embedding model - Fully Open-Source: Model weights, training data, and code available | Model | Avg. | ESG Restaurant Human | Econ Macro Multi. | AXA Multi. | MIT Bio | ESG Restaurant Synth. | ESG Restaurant Synth. Multi. | MIT Bio Multi. | AXA | Econ. Macro | |-------|------|----------------------|-------------------|------------|---------|----------------------|----------------------------|---------------|-----|------------| | ColNomic Embed Multimodal 7B | 62.7 | 73.9 | 54.7 | 61.3 | 66.1 | 57.3 | 56.7 | 64.2 | 68.3 | 61.6 | | ColNomic Embed Multimodal 3B | 61.2 | 65.8 | 55.4 | 61.0 | 63.5 | 56.6 | 57.2 | 62.5 | 68.8 | 60.2 | | T-Systems ColQwen2.5-3B | 59.9 | 72.1 | 51.2 | 60.0 | 65.3 | 51.7 | 53.3 | 61.7 | 69.3 | 54.8 | | Nomic Embed Multimodal 7B | 59.7 | 65.7 | 57.7 | 59.3 | 64.0 | 49.2 | 51.9 | 61.2 | 66.3 | 63.1 | | GME Qwen2 7B | 59.0 | 65.8 | 56.2 | 55.4 | 64.0 | 54.3 | 56.7 | 55.1 | 60.7 | 62.9 | | Nomic Embed Multimodal 3B | 58.8 | 59.8 | 57.5 | 58.8 | 62.5 | 49.4 | 49.4 | 58.6 | 69.6 | 63.5 | | Llama Index vdr-2b-multi-v1 | 58.4 | 63.1 | 52.8 | 61.0 | 60.6 | 50.3 | 51.2 | 56.9 | 68.8 | 61.2 | | Voyage Multimodal 3 | 55.0 | 56.1 | 55.0 | 59.5 | 56.4 | 47.2 | 46.2 | 51.5 | 64.1 | 58.8 | To use `nomic-embed-multimodal-7b`, please install `colpali` from source - Total Parameters: 7B - Training Approach: Fine-tuned from Qwen2.5-VL 7B Instruct - Architecture Type: Vision-Language Model with unified text and image input processing - Key Innovations: - Same-source sampling to create harder in-batch negatives - Hard negative mining with positive-aware techniques Nomic Embed Multimodal 7B seamlessly integrates with Retrieval Augmented Generation (RAG) workflows: 1. Direct Document Embedding: Skip OCR and complex processing by directly embedding document page images 2. Faster Processing: Eliminate preprocessing steps for quicker indexing 3. More Complete Information: Capture both textual and visual cues in a single embedding 4. Simple Implementation: Use the same API for both text and images The model excels at handling real-world document retrieval scenarios that challenge traditional text-only systems: - Research Papers: Capture equations, diagrams, and tables - Technical Documentation: Encode code blocks, flowcharts, and screenshots - Product Catalogs: Represent images, specifications, and pricing tables - Financial Reports: Embed charts, graphs, and numerical data - Visually Rich Content: Where layout and visual information are important - Multilingual Documents: Where visual context provides important cues Nomic Embed Multimodal 7B was developed through several key innovations: 1. Sampling From the Same Source: Forcing sampling from the same dataset source creates harder in-batch negatives, preventing the model from learning dataset artifacts. 2. Hard Negative Mining: Using an initial model to retrieve top-k nearest neighbors for each query, then incorporating these hard negatives into training. 3. Positive-aware Hard Negative Mining: Reducing false negatives using techniques introduced in NV-Retriever. 
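The source-install command for `colpali` mentioned above was omitted in this fork. The sketch below assumes the commonly used illuin-tech/colpali repository and `colpali_engine` naming conventions; the `BiQwen2_5`/`BiQwen2_5_Processor` class names are assumptions here, so defer to the model card's own snippet if your installed version differs.

```python
# Assumed install path for the colpali engine:
#   pip install git+https://github.com/illuin-tech/colpali.git
import torch
from colpali_engine.models import BiQwen2_5, BiQwen2_5_Processor  # assumption: dense (bi-encoder) Qwen2.5-VL classes

model = BiQwen2_5.from_pretrained(
    "nomic-ai/nomic-embed-multimodal-7b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = BiQwen2_5_Processor.from_pretrained("nomic-ai/nomic-embed-multimodal-7b")
```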
- Performance may vary when processing documents with unconventional layouts or unusual visual elements - While it handles multiple languages, performance is strongest on English content - Processing very large or complex documents may require dividing them into smaller chunks - Performance on documents with handwriting or heavily stylized fonts may be reduced - Nomic Embed Ecosystem: https://www.nomic.ai/embed - Website: https://nomic.ai - Twitter: https://twitter.com/nomicai - Discord: https://discord.gg/myY5YDR8z8 If you find this model useful in your research or applications, please consider citing:

NaNK
license:apache-2.0
3
0

Qwen3-Coder-480B-A35B-Instruct-fork

NaNK
license:apache-2.0
2
0

Wan2.2-TI2V-5B-fork

NaNK
license:apache-2.0
2
0

Qwen2.5-Omni-3B-Fork

Overview Introduction Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. Omni and Novel Architecture: We propose Thinker-Talker architecture, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. We propose a novel position embedding, named TMRoPE (Time-aligned Multimodal RoPE), to synchronize the timestamps of video inputs with audio. Real-Time Voice and Video Chat: Architecture designed for fully real-time interactions, supporting chunked input and immediate output. Natural and Robust Speech Generation: Surpassing many existing streaming and non-streaming alternatives, demonstrating superior robustness and naturalness in speech generation. Strong Performance Across Modalities: Exhibiting exceptional performance across all modalities when benchmarked against similarly sized single-modality models. Qwen2.5-Omni outperforms the similarly sized Qwen2-Audio in audio capabilities and achieves comparable performance to Qwen2.5-VL-7B. Excellent End-to-End Speech Instruction Following: Qwen2.5-Omni shows performance in end-to-end speech instruction following that rivals its effectiveness with text inputs, evidenced by benchmarks such as MMLU and GSM8K. We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates strong performance across all modalities when compared to similarly sized single-modality models and closed-source models like Qwen2.5-VL-7B, Qwen2-Audio, and Gemini-1.5-pro. In tasks requiring the integration of multiple modalities, such as OmniBench, Qwen2.5-Omni achieves state-of-the-art performance. Furthermore, in single-modality tasks, it excels in areas including speech recognition (Common Voice), translation (CoVoST2), audio understanding (MMAU), image reasoning (MMMU, MMStar), video understanding (MVBench), and speech generation (Seed-tts-eval and subjective naturalness). 
OmniBench Speech | Sound Event | Music | Avg Gemini-1.5-Pro 42.67%|42.26%|46.23%|42.91% Librispeech dev-clean | dev other | test-clean | test-other SALMONN -|-|2.1|4.9 Common Voice 15 en | zh | yue | fr Whisper-large-v3 9.3|12.8|10.9|10.8 Wenetspeech test-net | test-meeting Seed-ASR-Chinese 4.7|5.7 CoVoST2 en-de | de-en | en-zh | zh-en SALMONN 18.6|-|33.1|- MusicCaps LP-MusicCaps 0.291|0.149|0.089| 0.061 |0.129|0.130 Qwen2.5-Omni-3B 0.325| 0.163 | 0.093 |0.057| 0.132 | 0.229 Qwen2.5-Omni-7B 0.328 |0.162|0.090|0.055|0.127|0.225 MMAU Sound | Music | Speech | Avg Gemini-Pro-V1.5 56.75|49.40|58.55|54.90 VoiceBench AlpacaEval | CommonEval | SD-QA | MMSU Ultravox-v0.4.1-LLaMA-3.1-8B 4.55 |3.90|53.35|47.17 VoiceBench OpenBookQA | IFEval | AdvBench | Avg Ultravox-v0.4.1-LLaMA-3.1-8B 65.27| 66.88 |98.46|71.45 | Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini | |--------------------------------|--------------|------------|------------|---------------|-------------| | MMMU val | 59.2 | 53.1 | 53.9 | 58.6 | 60.0 | | MMMU-Pro overall | 36.6 | 29.7 | - | 38.3 | 37.6 | | MathVista testmini | 67.9 | 59.4 | 71.9 | 68.2 | 52.5 | | MathVision full | 25.0 | 20.8 | 23.1 | 25.1 | - | | MMBench-V1.1-EN test | 81.8 | 77.8 | 80.5 | 82.6 | 76.0 | | MMVet turbo | 66.8 | 62.1 | 67.5 | 67.1 | 66.9 | | MMStar | 64.0 | 55.7 | 64.0 | 63.9 | 54.8 | | MME sum | 2340 | 2117 | 2372 | 2347 | 2003 | | MuirBench | 59.2 | 48.0 | - | 59.2 | - | | CRPE relation | 76.5 | 73.7 | - | 76.4 | - | | RealWorldQA avg | 70.3 | 62.6 | 71.9 | 68.5 | - | | MME-RealWorld en | 61.6 | 55.6 | - | 57.4 | - | | MM-MT-Bench | 6.0 | 5.0 | - | 6.3 | - | | AI2D | 83.2 | 79.5 | 85.8 | 83.9 | - | | TextVQA val | 84.4 | 79.8 | 83.2 | 84.9 | - | | DocVQA test | 95.2 | 93.3 | 93.5 | 95.7 | - | | ChartQA test Avg | 85.3 | 82.8 | 84.9 | 87.3 | - | | OCRBenchV2 en | 57.8 | 51.7 | - | 56.3 | - | | Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-VL-7B | Grounding DINO | Gemini 1.5 Pro | |--------------------------|--------------|---------------|---------------|----------------|----------------| | Refcoco val | 90.5 | 88.7 | 90.0 | 90.6 | 73.2 | | Refcoco textA | 93.5 | 91.8 | 92.5 | 93.2 | 72.9 | | Refcoco textB | 86.6 | 84.0 | 85.4 | 88.2 | 74.6 | | Refcoco+ val | 85.4 | 81.1 | 84.2 | 88.2 | 62.5 | | Refcoco+ textA | 91.0 | 87.5 | 89.1 | 89.0 | 63.9 | | Refcoco+ textB | 79.3 | 73.2 | 76.9 | 75.9 | 65.0 | | Refcocog+ val | 87.4 | 85.0 | 87.2 | 86.1 | 75.2 | | Refcocog+ test | 87.9 | 85.1 | 87.2 | 87.0 | 76.2 | | ODinW | 42.4 | 39.2 | 37.3 | 55.0 | 36.7 | | PointGrounding | 66.5 | 46.2 | 67.3 | - | - | | Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini | |-----------------------------|--------------|------------|------------|---------------|-------------| | Video-MME w/o sub | 64.3 | 62.0 | 63.9 | 65.1 | 64.8 | | Video-MME w sub | 72.4 | 68.6 | 67.9 | 71.6 | - | | MVBench | 70.3 | 68.7 | 67.2 | 69.6 | - | | EgoSchema test | 68.6 | 61.4 | 63.2 | 65.0 | - | SEED test-zh | test-en | test-hard Seed-TTSICL 1.11 | 2.24 | 7.58 SEED test-zh | test-en | test-hard Seed-TTSICL 0.796 | 0.762 | 0.776 | Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-7B | Qwen2.5-3B | Qwen2-7B | Llama3.1-8B | Gemma2-9B | |-----------------------------------|-----------|------------|------------|------------|------------|-------------|-----------| | MMLU-Pro | 47.0 | 40.4 | 56.3 | 43.7 | 44.1 | 48.3 | 52.1 | | MMLU-redux | 71.0 | 60.9 | 75.4 | 64.4 | 67.3 | 67.2 | 72.8 | | LiveBench 0831 | 29.6 | 22.3 | 
35.9 | 26.8 | 29.2 | 26.7 | 30.6 | | GPQA | 30.8 | 34.3 | 36.4 | 30.3 | 34.3 | 32.8 | 32.8 | | MATH | 71.5 | 63.6 | 75.5 | 65.9 | 52.9 | 51.9 | 44.3 | | GSM8K | 88.7 | 82.6 | 91.6 | 86.7 | 85.7 | 84.5 | 76.7 | | HumanEval | 78.7 | 70.7 | 84.8 | 74.4 | 79.9 | 72.6 | 68.9 | | MBPP | 73.2 | 70.4 | 79.2 | 72.7 | 67.2 | 69.6 | 74.9 | | MultiPL-E | 65.8 | 57.6 | 70.4 | 60.2 | 59.1 | 50.7 | 53.4 | | LiveCodeBench 2305-2409 | 24.6 | 16.5 | 28.7 | 19.9 | 23.9 | 8.3 | 18.9 | Below, we provide simple examples to show how to use Qwen2.5-Omni with 🤗 Transformers. The codes of Qwen2.5-Omni has been in the latest Hugging face transformers and we advise you to build from source with command: We offer a toolkit to help you handle various types of audio and visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved audio, images and videos. You can install it using the following command and make sure your system has `ffmpeg` installed: If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-omni-utils -U` which will fall back to using torchvision for video processing. However, you can still install decord from source to get decord used when loading video. Here we show a code snippet to show you how to use the chat model with `transformers` and `qwenomniutils`: |Model | Precision | 15(s) Video | 30(s) Video | 60(s) Video | |--------------|-----------| ------------- | ------------- | ------------------ | | Qwen-Omni-3B | FP32 | 89.10 GB | Not Recommend | Not Recommend | | Qwen-Omni-3B | BF16 | 18.38 GB | 22.43 GB | 28.22 GB | | Qwen-Omni-7B | FP32 | 93.56 GB | Not Recommend | Not Recommend | | Qwen-Omni-7B | BF16 | 31.11 GB | 41.85 GB | 60.19 GB | Note: The table above presents the theoretical minimum memory requirements for inference with `transformers` and `BF16` is test with `attnimplementation="flashattention2"`; however, in practice, the actual memory usage is typically at least 1.2 times higher. For more information, see the linked resource here. Video URL compatibility largely depends on the third-party library version. The details are in the table below. Change the backend by `FORCEQWENVLVIDEOREADER=torchvision` or `FORCEQWENVLVIDEOREADER=decord` if you prefer not to use the default one. | Backend | HTTP | HTTPS | |-------------|------|-------| | torchvision >= 0.19.0 | ✅ | ✅ | | torchvision The model can batch inputs composed of mixed samples of various types such as text, images, audio and videos as input when `returnaudio=False` is set. Here is an example. Prompt for audio output If users need audio output, the system prompt must be set as "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.", otherwise the audio output may not work as expected. Use audio in video In the process of multimodal interaction, the videos provided by users are often accompanied by audio (such as questions about the content in the video, or sounds generated by certain events in the video). This information is conducive to the model providing a better interactive experience. So we provide the following options for users to decide whether to use audio in video. It is worth noting that during a multi-round conversation, the `useaudioinvideo` parameter in these places must be set to the same, otherwise unexpected results will occur. 
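The `transformers` + `qwen_omni_utils` snippet referenced above was omitted in this fork. Below is a hedged sketch of that flow, including the system prompt the card requires for audio output and the `use_audio_in_video` switch; it assumes a `transformers` build that ships the Qwen2.5-Omni classes, and the video URL and output sample rate are placeholders.

```python
# Hedged Qwen2.5-Omni-3B inference sketch with transformers + qwen_omni_utils.
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

MODEL_ID = "Qwen/Qwen2.5-Omni-3B"
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)

conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://example.com/demo.mp4"},  # hypothetical URL
            {"type": "text", "text": "What is happening in this video?"},
        ],
    },
]

USE_AUDIO_IN_VIDEO = True  # keep this consistent across all calls in a multi-round conversation
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text, audio=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO).to(model.device)

text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("output.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)  # sample rate is an assumption
```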
The model supports both text and audio outputs, if users do not need audio outputs, they can call `model.disabletalker()` after init the model. This option will save about `~2GB` of GPU memory but the `returnaudio` option for `generate` function will only allow to be set at `False`. In order to obtain a flexible experience, we recommend that users can decide whether to return audio when `generate` function is called. If `returnaudio` is set to `False`, the model will only return text outputs to get text responses faster. Change voice type of output audio Qwen2.5-Omni supports the ability to change the voice of the output audio. The `"Qwen/Qwen2.5-Omni-3B"` checkpoint support two voice types as follow: | Voice Type | Gender | Description | |------------|--------|-------------| | Chelsie | Female | A honeyed, velvety voice that carries a gentle warmth and luminous clarity.| | Ethan | Male | A bright, upbeat voice with infectious energy and a warm, approachable vibe.| Users can use the `speaker` parameter of `generate` function to specify the voice type. By default, if `speaker` is not specified, the default voice type is `Chelsie`. First, make sure to install the latest version of Flash Attention 2: Also, you should have hardware that is compatible with FlashAttention 2. Read more about it in the official documentation of the flash attention repository. FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`. To load and run a model using FlashAttention-2, add `attnimplementation="flashattention2"` when loading the model: If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)
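The FlashAttention-2 loading snippet referenced above was omitted; a minimal sketch, assuming compatible hardware and a half-precision load:

```python
# FlashAttention-2 requires float16/bfloat16 and compatible hardware.
# pip install -U flash-attn --no-build-isolation
import torch
from transformers import Qwen2_5OmniForConditionalGeneration

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
```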

NaNK
2
0

Janus-Pro-7B-Fork

Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It addresses the limitations of previous approaches by decoupling visual encoding into separate pathways, while still utilizing a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder’s roles in understanding and generation, but also enhances the framework’s flexibility. Janus-Pro surpasses previous unified models and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus-Pro make it a strong candidate for next-generation unified multimodal models. Janus-Pro is a unified understanding-and-generation MLLM that decouples visual encoding for multimodal understanding and generation. It is built on DeepSeek-LLM-1.5b-base/DeepSeek-LLM-7b-base. For multimodal understanding, it uses SigLIP-L as the vision encoder, which supports 384 x 384 image input. For image generation, Janus-Pro uses the tokenizer from here with a downsample rate of 16. This code repository is licensed under the MIT License. The use of Janus-Pro models is subject to the DeepSeek Model License. If you have any questions, please raise an issue or contact us at [email protected].

NaNK
license:mit
1
0

Qwen2.5-Omni-7B-fork

Overview Introduction Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. Omni and Novel Architecture: We propose Thinker-Talker architecture, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. We propose a novel position embedding, named TMRoPE (Time-aligned Multimodal RoPE), to synchronize the timestamps of video inputs with audio. Real-Time Voice and Video Chat: Architecture designed for fully real-time interactions, supporting chunked input and immediate output. Natural and Robust Speech Generation: Surpassing many existing streaming and non-streaming alternatives, demonstrating superior robustness and naturalness in speech generation. Strong Performance Across Modalities: Exhibiting exceptional performance across all modalities when benchmarked against similarly sized single-modality models. Qwen2.5-Omni outperforms the similarly sized Qwen2-Audio in audio capabilities and achieves comparable performance to Qwen2.5-VL-7B. Excellent End-to-End Speech Instruction Following: Qwen2.5-Omni shows performance in end-to-end speech instruction following that rivals its effectiveness with text inputs, evidenced by benchmarks such as MMLU and GSM8K. We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates strong performance across all modalities when compared to similarly sized single-modality models and closed-source models like Qwen2.5-VL-7B, Qwen2-Audio, and Gemini-1.5-pro. In tasks requiring the integration of multiple modalities, such as OmniBench, Qwen2.5-Omni achieves state-of-the-art performance. Furthermore, in single-modality tasks, it excels in areas including speech recognition (Common Voice), translation (CoVoST2), audio understanding (MMAU), image reasoning (MMMU, MMStar), video understanding (MVBench), and speech generation (Seed-tts-eval and subjective naturalness). 
OmniBench Speech | Sound Event | Music | Avg Gemini-1.5-Pro 42.67%|42.26%|46.23%|42.91% Librispeech dev-clean | dev other | test-clean | test-other SALMONN -|-|2.1|4.9 Common Voice 15 en | zh | yue | fr Whisper-large-v3 9.3|12.8|10.9|10.8 Wenetspeech test-net | test-meeting Seed-ASR-Chinese 4.7|5.7 CoVoST2 en-de | de-en | en-zh | zh-en SALMONN 18.6|-|33.1|- MusicCaps LP-MusicCaps 0.291|0.149|0.089| 0.061 |0.129|0.130 Qwen2.5-Omni-3B 0.325| 0.163 | 0.093 |0.057| 0.132 | 0.229 Qwen2.5-Omni-7B 0.328 |0.162|0.090|0.055|0.127|0.225 MMAU Sound | Music | Speech | Avg Gemini-Pro-V1.5 56.75|49.40|58.55|54.90 VoiceBench AlpacaEval | CommonEval | SD-QA | MMSU Ultravox-v0.4.1-LLaMA-3.1-8B 4.55 |3.90|53.35|47.17 VoiceBench OpenBookQA | IFEval | AdvBench | Avg Ultravox-v0.4.1-LLaMA-3.1-8B 65.27| 66.88 |98.46|71.45 | Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini | |--------------------------------|--------------|------------|------------|---------------|-------------| | MMMU val | 59.2 | 53.1 | 53.9 | 58.6 | 60.0 | | MMMU-Pro overall | 36.6 | 29.7 | - | 38.3 | 37.6 | | MathVista testmini | 67.9 | 59.4 | 71.9 | 68.2 | 52.5 | | MathVision full | 25.0 | 20.8 | 23.1 | 25.1 | - | | MMBench-V1.1-EN test | 81.8 | 77.8 | 80.5 | 82.6 | 76.0 | | MMVet turbo | 66.8 | 62.1 | 67.5 | 67.1 | 66.9 | | MMStar | 64.0 | 55.7 | 64.0 | 63.9 | 54.8 | | MME sum | 2340 | 2117 | 2372 | 2347 | 2003 | | MuirBench | 59.2 | 48.0 | - | 59.2 | - | | CRPE relation | 76.5 | 73.7 | - | 76.4 | - | | RealWorldQA avg | 70.3 | 62.6 | 71.9 | 68.5 | - | | MME-RealWorld en | 61.6 | 55.6 | - | 57.4 | - | | MM-MT-Bench | 6.0 | 5.0 | - | 6.3 | - | | AI2D | 83.2 | 79.5 | 85.8 | 83.9 | - | | TextVQA val | 84.4 | 79.8 | 83.2 | 84.9 | - | | DocVQA test | 95.2 | 93.3 | 93.5 | 95.7 | - | | ChartQA test Avg | 85.3 | 82.8 | 84.9 | 87.3 | - | | OCRBenchV2 en | 57.8 | 51.7 | - | 56.3 | - | | Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-VL-7B | Grounding DINO | Gemini 1.5 Pro | |--------------------------|--------------|---------------|---------------|----------------|----------------| | Refcoco val | 90.5 | 88.7 | 90.0 | 90.6 | 73.2 | | Refcoco textA | 93.5 | 91.8 | 92.5 | 93.2 | 72.9 | | Refcoco textB | 86.6 | 84.0 | 85.4 | 88.2 | 74.6 | | Refcoco+ val | 85.4 | 81.1 | 84.2 | 88.2 | 62.5 | | Refcoco+ textA | 91.0 | 87.5 | 89.1 | 89.0 | 63.9 | | Refcoco+ textB | 79.3 | 73.2 | 76.9 | 75.9 | 65.0 | | Refcocog+ val | 87.4 | 85.0 | 87.2 | 86.1 | 75.2 | | Refcocog+ test | 87.9 | 85.1 | 87.2 | 87.0 | 76.2 | | ODinW | 42.4 | 39.2 | 37.3 | 55.0 | 36.7 | | PointGrounding | 66.5 | 46.2 | 67.3 | - | - | | Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini | |-----------------------------|--------------|------------|------------|---------------|-------------| | Video-MME w/o sub | 64.3 | 62.0 | 63.9 | 65.1 | 64.8 | | Video-MME w sub | 72.4 | 68.6 | 67.9 | 71.6 | - | | MVBench | 70.3 | 68.7 | 67.2 | 69.6 | - | | EgoSchema test | 68.6 | 61.4 | 63.2 | 65.0 | - | SEED test-zh | test-en | test-hard Seed-TTSICL 1.11 | 2.24 | 7.58 SEED test-zh | test-en | test-hard Seed-TTSICL 0.796 | 0.762 | 0.776 | Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-7B | Qwen2.5-3B | Qwen2-7B | Llama3.1-8B | Gemma2-9B | |-----------------------------------|-----------|------------|------------|------------|------------|-------------|-----------| | MMLU-Pro | 47.0 | 40.4 | 56.3 | 43.7 | 44.1 | 48.3 | 52.1 | | MMLU-redux | 71.0 | 60.9 | 75.4 | 64.4 | 67.3 | 67.2 | 72.8 | | LiveBench 0831 | 29.6 | 22.3 | 
35.9 | 26.8 | 29.2 | 26.7 | 30.6 | | GPQA | 30.8 | 34.3 | 36.4 | 30.3 | 34.3 | 32.8 | 32.8 | | MATH | 71.5 | 63.6 | 75.5 | 65.9 | 52.9 | 51.9 | 44.3 | | GSM8K | 88.7 | 82.6 | 91.6 | 86.7 | 85.7 | 84.5 | 76.7 | | HumanEval | 78.7 | 70.7 | 84.8 | 74.4 | 79.9 | 72.6 | 68.9 | | MBPP | 73.2 | 70.4 | 79.2 | 72.7 | 67.2 | 69.6 | 74.9 | | MultiPL-E | 65.8 | 57.6 | 70.4 | 60.2 | 59.1 | 50.7 | 53.4 | | LiveCodeBench 2305-2409 | 24.6 | 16.5 | 28.7 | 19.9 | 23.9 | 8.3 | 18.9 | Below, we provide simple examples to show how to use Qwen2.5-Omni with 🤗 Transformers. The codes of Qwen2.5-Omni has been in the latest Hugging face transformers and we advise you to build from source with command: We offer a toolkit to help you handle various types of audio and visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved audio, images and videos. You can install it using the following command and make sure your system has `ffmpeg` installed: If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-omni-utils -U` which will fall back to using torchvision for video processing. However, you can still install decord from source to get decord used when loading video. Here we show a code snippet to show you how to use the chat model with `transformers` and `qwenomniutils`: |Model | Precision | 15(s) Video | 30(s) Video | 60(s) Video | |--------------|-----------| ------------- | ------------- | ------------------ | | Qwen-Omni-3B | FP32 | 89.10 GB | Not Recommend | Not Recommend | | Qwen-Omni-3B | BF16 | 18.38 GB | 22.43 GB | 28.22 GB | | Qwen-Omni-7B | FP32 | 93.56 GB | Not Recommend | Not Recommend | | Qwen-Omni-7B | BF16 | 31.11 GB | 41.85 GB | 60.19 GB | Note: The table above presents the theoretical minimum memory requirements for inference with `transformers` and `BF16` is test with `attnimplementation="flashattention2"`; however, in practice, the actual memory usage is typically at least 1.2 times higher. For more information, see the linked resource here. Video URL compatibility largely depends on the third-party library version. The details are in the table below. Change the backend by `FORCEQWENVLVIDEOREADER=torchvision` or `FORCEQWENVLVIDEOREADER=decord` if you prefer not to use the default one. | Backend | HTTP | HTTPS | |-------------|------|-------| | torchvision >= 0.19.0 | ✅ | ✅ | | torchvision The model can batch inputs composed of mixed samples of various types such as text, images, audio and videos as input when `returnaudio=False` is set. Here is an example. Prompt for audio output If users need audio output, the system prompt must be set as "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.", otherwise the audio output may not work as expected. Use audio in video In the process of multimodal interaction, the videos provided by users are often accompanied by audio (such as questions about the content in the video, or sounds generated by certain events in the video). This information is conducive to the model providing a better interactive experience. So we provide the following options for users to decide whether to use audio in video. It is worth noting that during a multi-round conversation, the `useaudioinvideo` parameter in these places must be set to the same, otherwise unexpected results will occur. 
The model supports both text and audio outputs, if users do not need audio outputs, they can call `model.disabletalker()` after init the model. This option will save about `~2GB` of GPU memory but the `returnaudio` option for `generate` function will only allow to be set at `False`. In order to obtain a flexible experience, we recommend that users can decide whether to return audio when `generate` function is called. If `returnaudio` is set to `False`, the model will only return text outputs to get text responses faster. Change voice type of output audio Qwen2.5-Omni supports the ability to change the voice of the output audio. The `"Qwen/Qwen2.5-Omni-7B"` checkpoint support two voice types as follow: | Voice Type | Gender | Description | |------------|--------|-------------| | Chelsie | Female | A honeyed, velvety voice that carries a gentle warmth and luminous clarity.| | Ethan | Male | A bright, upbeat voice with infectious energy and a warm, approachable vibe.| Users can use the `speaker` parameter of `generate` function to specify the voice type. By default, if `speaker` is not specified, the default voice type is `Chelsie`. First, make sure to install the latest version of Flash Attention 2: Also, you should have hardware that is compatible with FlashAttention 2. Read more about it in the official documentation of the flash attention repository. FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`. To load and run a model using FlashAttention-2, add `attnimplementation="flashattention2"` when loading the model: If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)
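As a compact sketch of the voice-selection behaviour described above for the 7B checkpoint (Chelsie is the default when `speaker` is not given), under the same assumptions as the 3B card; the text-only prompt and output sample rate are placeholders:

```python
# Hedged voice-selection sketch for Qwen2.5-Omni-7B.
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

MODEL_ID = "Qwen/Qwen2.5-Omni-7B"
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)

conversation = [
    {"role": "system", "content": [{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}]},
    {"role": "user", "content": [{"type": "text", "text": "Introduce yourself in one sentence."}]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, return_tensors="pt", padding=True).to(model.device)

# speaker picks the voice; return_audio=False would instead return text only, faster.
text_ids, audio = model.generate(**inputs, speaker="Ethan", return_audio=True)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("ethan_reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)  # sample rate is an assumption
```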

NaNK
1
0

NuMarkdown-8B-Thinking-fork

NaNK
license:mit
0
1