wikeeyang

25 models

SRPO-Refine-Quantized-v1.0

===================================================================================

This model is a refined and 8-bit/4-bit (fp8_e4m3fn / Q8_0 / Q4_1) quantized version of https://huggingface.co/tencent/SRPO. It mainly improves the clarity of generated images and the compatibility of the model. (In the first image, the SRPO-fp8 sample looks especially blurry; this is caused by loading the model in ComfyUI and quantizing it on the fly with the diffusion model loader node, not by the model's actual behavior at fp8 precision. See the second comparison image for the actual behavior; it is provided to avoid misunderstanding, and the model performs normally at the different precisions.)

For the FP16 version, please refer to https://civitai.com/models/1961797 or https://www.modelscope.cn/models/wikeeyang/SRPO-Refine-Quantized

Comparison of official SRPO and R&Q v1.0 at the same quantization precision:

This model falls under the SRPO license; refer to the license.txt file and to the FLUX.1 [dev] Non-Commercial License.

===================================================================================

Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference

Xiangwei Shen (1,2), Zhimin Li (1), Zhantao Yang (1), Shiyi Zhang (3), Yingfang Zhang (1), Donghao Li (1)
(2) School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen; (3) Shenzhen International Graduate School, Tsinghua University

Abstract

Recent studies have demonstrated the effectiveness of directly aligning diffusion models with human preferences using differentiable rewards. However, these methods face two primary challenges: (1) they rely on multistep denoising with gradient computation for reward scoring, which is computationally expensive and restricts optimization to only a few diffusion steps; (2) they often need continuous offline adaptation of reward models to achieve the desired aesthetic quality, such as photorealism or precise lighting effects. To address the limitation of multistep denoising, we propose Direct-Align, a method that predefines a noise prior to effectively recover original images from any timestep via interpolation, leveraging the fact that diffusion states are interpolations between noise and target images; this effectively avoids over-optimization at late timesteps. Furthermore, we introduce Semantic Relative Preference Optimization (SRPO), in which rewards are formulated as text-conditioned signals. This approach enables online adjustment of rewards in response to positive and negative prompt augmentation, thereby reducing the reliance on offline reward fine-tuning. By fine-tuning the FLUX.1-dev model with optimized denoising and online reward adjustment, we improve its human-evaluated realism and aesthetic quality by over 3x.

Checkpoints

`diffusion_pytorch_model.safetensors` is the online version of SRPO based on FLUX.1-dev, trained on the HPD dataset with HPSv2.

License

SRPO is licensed under the License Terms of SRPO. See `./License.txt` for more details.

Citation

If you use SRPO for your research, please cite our paper.
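As a rough numerical illustration of the Direct-Align idea in the abstract (diffusion states as interpolations between noise and the target image, so a predefined noise prior lets an image estimate be recovered from any timestep), here is a minimal sketch; the linear schedule and variable names are assumptions for illustration, not the paper's exact formulation.

```python
import torch

# Minimal sketch of the interpolation view behind Direct-Align (assumed linear schedule):
# a diffusion state is x_t = (1 - t) * x0 + t * noise, with the noise prior fixed in advance,
# so a clean-image estimate can be recovered from any timestep in a single step.
def recover_x0(x_t: torch.Tensor, noise: torch.Tensor, t: float) -> torch.Tensor:
    """Invert the interpolation to estimate the clean image x0 from the state x_t."""
    return (x_t - t * noise) / (1.0 - t)

x0 = torch.rand(1, 3, 64, 64)           # stand-in "target image"
noise = torch.randn_like(x0)            # predefined noise prior
t = 0.8                                 # a late timestep
x_t = (1.0 - t) * x0 + t * noise        # simulated diffusion state
print(torch.allclose(recover_x0(x_t, noise, t), x0, atol=1e-5))  # True: exact single-step recovery
```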

18,925
53

Flux2-Klein-9B-True-V2

10,408
44

Flux2-Klein-9B-True-V1

license:apache-2.0
10,168
80

SRPO-for-ComfyUI

===================================================================================

This model is a converted and 8-bit/4-bit (fp8_e4m3fn / Q8_0 / Q4_1) quantized version of https://huggingface.co/tencent/SRPO, adapted so that it loads and generates images normally in a ComfyUI environment while keeping the original model's output quality.

For the bf16 version, please download it from https://www.modelscope.cn/models/wikeeyang/SRPO-for-ComfyUI

This model falls under the SRPO license; refer to the license.txt file and to the FLUX.1 [dev] Non-Commercial License.

===================================================================================

Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference

Xiangwei Shen (1,2), Zhimin Li (1), Zhantao Yang (1), Shiyi Zhang (3), Yingfang Zhang (1), Donghao Li (1)
(2) School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen; (3) Shenzhen International Graduate School, Tsinghua University

Abstract

Recent studies have demonstrated the effectiveness of directly aligning diffusion models with human preferences using differentiable rewards. However, these methods face two primary challenges: (1) they rely on multistep denoising with gradient computation for reward scoring, which is computationally expensive and restricts optimization to only a few diffusion steps; (2) they often need continuous offline adaptation of reward models to achieve the desired aesthetic quality, such as photorealism or precise lighting effects. To address the limitation of multistep denoising, we propose Direct-Align, a method that predefines a noise prior to effectively recover original images from any timestep via interpolation, leveraging the fact that diffusion states are interpolations between noise and target images; this effectively avoids over-optimization at late timesteps. Furthermore, we introduce Semantic Relative Preference Optimization (SRPO), in which rewards are formulated as text-conditioned signals. This approach enables online adjustment of rewards in response to positive and negative prompt augmentation, thereby reducing the reliance on offline reward fine-tuning. By fine-tuning the FLUX.1-dev model with optimized denoising and online reward adjustment, we improve its human-evaluated realism and aesthetic quality by over 3x.

Checkpoints

`diffusion_pytorch_model.safetensors` is the online version of SRPO based on FLUX.1-dev, trained on the HPD dataset with HPSv2.

License

SRPO is licensed under the License Terms of SRPO. See `./License.txt` for more details.

Citation

If you use SRPO for your research, please cite our paper.

5,692
21

Flux1-Dev-DedistilledMixTuned-v4

1,470
4

Real-Qwen-Image-v1.0

This model is a QwenImage fine-tune that mainly improves the clarity and realism of generated images. For concrete results, see the example images, which also embed the ComfyUI workflow. The model is easy to use, generates images quickly, and has good LoRA compatibility.

Also on: https://www.modelscope.cn/models/wikeeyang/Real-Qwen-Image and https://civitai.com/models/1898752

Basic settings: euler + simple, CFG 1.0, 20-30 steps; you can try other combinations.
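For users outside ComfyUI, a minimal diffusers sketch along the lines of the settings above could look like the following; it assumes the standard Qwen-Image pipeline (with its `true_cfg_scale` argument) is available in your diffusers version and that this repo's fine-tuned weights can stand in for the base transformer.

```python
import torch
from diffusers import DiffusionPipeline

# Minimal sketch: load the base Qwen-Image pipeline; swapping in this repo's
# fine-tuned transformer weights (if converted to diffusers format) is assumed.
pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = pipe(
    prompt="a sunlit street market, photorealistic, rich detail",
    num_inference_steps=24,   # the card suggests 20-30 steps
    true_cfg_scale=1.0,       # matches the recommended CFG 1.0
).images[0]
image.save("real_qwen_image_sample.png")
```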

license:apache-2.0
1,285
17

Magic-Wan-Image-V2

license:apache-2.0
994
17

Magic-Wan-T2IV-V3

license:apache-2.0
931
15

Z-Image-Turbo-Art

license:apache-2.0
854
21

Magic-Wan-Image-v1.0

This is an experimental model: a mixed and fine-tuned version of the Wan2.2-T2V-14B text-to-video model, intended to let Wan 2.2 enthusiasts use the Wan2.2 T2V model to generate images as easily as they would use Flux. The Wan 2.2 model excels at generating realistic images while also covering a variety of styles; however, since it evolved from a video model, its generalization in pure image generation is slightly weaker. This model balances realism and stylistic variety while preserving as much detail as possible, essentially reaching creativity and expressiveness comparable to Flux.1-Dev. The mixing method splits the High-Noise and Low-Noise parts of Wan2.2-T2V-14B by layer, blends them with different weight ratios, and then applies a light fine-tune (see the sketch after this description). It is currently an experimental model that may still have shortcomings; feedback is welcome so future versions can improve.

Also on: https://civitai.com/models/1927692 and https://www.modelscope.cn/models/wikeeyang/Magic-Wan-Image

GGUF version: please refer to https://huggingface.co/befox/Magic-Wan-Image-v1.0-GGUF

See the example workflows: ComfyUI_example_workflow_00001.png, ComfyUI_example_workflow_00002.png, ComfyUI_example_workflow_00003.png

Sampler / scheduler: deis/simple or euler/beta, or any combination; you can experiment yourself.
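As a rough illustration of the blending recipe described above (per-layer weighted mixing of the High-Noise and Low-Noise experts before a light fine-tune), here is a minimal sketch; the file names, tensor-name pattern, and ratio schedule are assumptions for illustration only.

```python
from safetensors.torch import load_file, save_file

# Hypothetical file names for the two Wan2.2-T2V-14B expert checkpoints.
high = load_file("wan2.2_t2v_14b_high_noise.safetensors")
low = load_file("wan2.2_t2v_14b_low_noise.safetensors")

blended = {}
for name, t_high in high.items():
    t_low = low[name]
    alpha = 0.5  # default mix ratio (illustrative)
    if "blocks." in name:
        block_idx = int(name.split("blocks.")[1].split(".")[0])
        # Illustrative per-layer schedule: early blocks lean toward the
        # high-noise expert, later blocks toward the low-noise expert.
        alpha = 0.7 if block_idx < 20 else 0.3
    blended[name] = (alpha * t_high.float() + (1.0 - alpha) * t_low.float()).to(t_high.dtype)

save_file(blended, "magic_wan_image_blend.safetensors")
```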

license:apache-2.0
741
17

Qwen-Image-Pruning-for-ComfyUI

license:apache-2.0
493
9

Emu35-Image-NF4

===================================================================================

This model is the NF4 quantized version of https://huggingface.co/BAAI/Emu3.5-Image. It can be loaded directly with the official inference code; the bitsandbytes dependency must be installed in addition. With the whole model loaded on the GPU it occupies about 24 GB, and image generation needs up to about 32 GB of VRAM. (Based on my tests, installing the pre-built flash_attn==2.7.4 wheel also works.)

Prompt: "Live shot, close-up, full-body photo, a snow leopard standing on a rock, the body is standing sideways, standing slightly upward on the rock, the tail is slightly cocked, the head is twisted to face the camera, the eyes are looking directly at the camera, the expression is majestic, the background is slightly blurred in the distance, gray rocks and mountains."

Emu3.5-Image is the latest omni-modal model open-sourced by BAAI (Beijing Academy of Artificial Intelligence), with results on par with Google's Nano Banana. The introduction below is quoted from the official model page. This quantized model is intended for community research and learning; please comply with the official license terms.

===================================================================================

Emu3.5: Native Multimodal Models are World Learners

| 🔹 | Core Concept | Description |
| :-: | :--- | :--- |
| 🧠 | Unified World Modeling | Predicts the next state jointly across vision and language, enabling coherent world modeling and generation. |
| 🧩 | End-to-End Pretraining | Trained with a unified next-token prediction objective over interleaved vision–language sequences. |
| 📚 | Over 10T+ Multimodal Tokens | Pre-trained on over 10 trillion interleaved tokens from video frames and transcripts, capturing spatiotemporal structure. |
| 🔄 | Native Multimodal I/O | Processes and generates interleaved visual–text sequences without modality adapters or task-specific heads. |
| 🎯 | RL Post-Training | Large-scale reinforcement learning enhances reasoning, compositionality, and generation quality. |
| ⚡ | Discrete Diffusion Adaptation (DiDA) | Converts sequential decoding into bidirectional parallel prediction, achieving ≈20× faster inference without performance loss. |
| 🖼️ | Versatile Generation | Excels in long-horizon vision–language generation, any-to-image (X2I) synthesis, and text-rich image creation. |
| 🌐 | Generalizable World Modeling | Enables spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios. |
| 🏆 | Performance Benchmark | Matches Gemini 2.5 Flash Image (Nano Banana) on image generation/editing, and outperforms it on interleaved generation tasks. |

Contents: 1. Model & Weights  2. Quick Start  3. Schedule  4. Citation

| Model name | HF Weight |
| --- | --- |
| Emu3.5 | 🤗 HF link |
| Emu3.5-Image | 🤗 HF link |
| Emu3.5-VisionTokenizer | 🤗 HF link |

- Paths: `model_path`, `vq_path`
- Task template: `task_type` in {t2i, x2i, howto, story, explore, vla}; `use_image` controls ` ` usage (set to true when reference images are provided)
- Sampling: `sampling_params` (classifier_free_guidance, temperature, top_k/top_p, etc.)

Protobuf outputs are written to `outputs/ /proto/`. For better throughput, we recommend ≥2 GPUs.

- [x] Inference Code
- [ ] Advanced Image Decoder
- [ ] Discrete Diffusion Adaptation (DiDA)
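A minimal sketch of the bitsandbytes-based loading mentioned above; the model class, repo id, and arguments are assumptions (the official Emu3.5 inference code wraps its own loading), so treat this only as an outline of NF4 loading with transformers.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 loading outline (illustrative). If the checkpoint already embeds its
# quantization config, the explicit BitsAndBytesConfig below may be unnecessary.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "wikeeyang/Emu35-Image-NF4",     # assumed repo id for this NF4 checkpoint
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
print(sum(p.numel() for p in model.parameters()))  # quick sanity check after loading
```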

license:apache-2.0
472
10

Flux.1-Dedistilled-Mix-Tuned-fp8

387
34

Flux1-DedistilledMixTuned-V2

254
6

Flux1-Dev-DedistilledMixTuned-V3

170
6

Real-Qwen-Image-V2

license:apache-2.0
116
2

HunyuanImage-3.0-Qint4

==================================================================================

This model is a qint4 quantized version of https://huggingface.co/tencent/HunyuanImage-3.0, quantized with https://github.com/huggingface/optimum-quanto; the weight files are saved using a non-official technique.

The quantized model has so far been tested on a single H20 96GB GPU. Loading also uses non-official code; see load_quantized_model.py, which currently contains two loading methods for reference. Everyone is welcome to exchange ideas and study it together, thank you!

Loading method 1: model initialization needs roughly 160 GB of CPU RAM, with an initial GPU footprint of 50 GB; after inference starts, CPU usage drops to about 70 GB and GPU usage is about 55-60 GB. Warnings about model keys appear during loading, but they do not affect use.

Loading method 2: model initialization needs about 75 GB of CPU RAM, with an initial GPU footprint of 50 GB; after inference starts, CPU usage stays at 75 GB and GPU usage is about 55-60 GB. Because a key map is provided, no warnings appear during loading.

==================================================================================

HunyuanImage-3.0 is an outstanding omni-modal mixture-of-experts model! The introduction below is quoted from the original official model page. The model and code provided by this project are for community sharing and technical research/learning only; please comply with the official Tencent Hunyuan license terms.

==================================================================================

🎨 HunyuanImage-3.0: A Powerful Native Multimodal Model for Image Generation

👏 Join our WeChat and Discord | 💻 Official website. Try our model!

🔥🔥🔥 News
- September 28, 2025: 📖 HunyuanImage-3.0 Technical Report released; comprehensive technical documentation now available.
- September 28, 2025: 🚀 HunyuanImage-3.0 open-source release; inference code and model weights publicly available.

If you develop or use HunyuanImage-3.0 in your projects, please let us know.

- HunyuanImage-3.0 (Image Generation Model)
  - [x] Inference
  - [x] HunyuanImage-3.0 Checkpoints
  - [ ] HunyuanImage-3.0-Instruct Checkpoints (with reasoning)
  - [ ] VLLM Support
  - [ ] Distilled Checkpoints
  - [ ] Image-to-Image Generation
  - [ ] Multi-turn Interaction

🗂️ Contents: News, Community Contributions, Open-source Plan, Introduction, Key Features, Dependencies and Installation (System Requirements, Environment Setup, Install Dependencies, Performance Optimizations), Usage (Quick Start with Transformers, Local Installation & Usage, Interactive Gradio Demo), Model Cards, Prompt Guide (Manually Writing Prompts, System Prompt for Automatic Prompt Rewriting, Advanced Tips, More Cases), Evaluation, Citation, Acknowledgements, GitHub Star History

HunyuanImage-3.0 is a groundbreaking native multimodal model that unifies multimodal understanding and generation within an autoregressive framework. Our text-to-image module achieves performance comparable to or surpassing leading closed-source models.

- 🧠 Unified Multimodal Architecture: Moving beyond the prevalent DiT-based architectures, HunyuanImage-3.0 employs a unified autoregressive framework. This design enables a more direct and integrated modeling of text and image modalities, leading to surprisingly effective and contextually rich image generation.
- 🏆 The Largest Image Generation MoE Model: This is the largest open-source image generation Mixture of Experts (MoE) model to date. It features 64 experts and a total of 80 billion parameters, with 13 billion activated per token, significantly enhancing its capacity and performance.
- 🎨 Superior Image Generation Performance: Through rigorous dataset curation and advanced reinforcement learning post-training, we have achieved an optimal balance between semantic accuracy and visual excellence. The model demonstrates exceptional prompt adherence while delivering photorealistic imagery with stunning aesthetic quality and fine-grained details.
- 💭 Intelligent World-Knowledge Reasoning: The unified multimodal architecture endows HunyuanImage-3.0 with powerful reasoning capabilities. It leverages its extensive world knowledge to intelligently interpret user intent, automatically elaborating on sparse prompts with contextually appropriate details to produce superior, more complete visual outputs.

If you find HunyuanImage-3.0 useful in your research, please cite our work.

We extend our heartfelt gratitude to the following open-source projects and communities for their invaluable contributions:
- 🤗 Transformers - state-of-the-art NLP library
- 🎨 Diffusers - diffusion models library
- 🌐 HuggingFace - AI model hub and community
- ⚡ FlashAttention - memory-efficient attention
- 🚀 FlashInfer - optimized inference engine
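For reference, a minimal sketch of the optimum-quanto qint4 flow mentioned above (quantize, freeze, save); the model class and paths are placeholders, and the actual loading paths live in this repo's load_quantized_model.py rather than in this outline.

```python
import torch
from transformers import AutoModelForCausalLM
from optimum.quanto import quantize, freeze, qint4

# Illustrative qint4 quantization with optimum-quanto (placeholder paths).
model = AutoModelForCausalLM.from_pretrained(
    "tencent/HunyuanImage-3.0",               # official weights
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
quantize(model, weights=qint4)                # tag weights for int4 quantization
freeze(model)                                 # materialize the quantized weights
model.save_pretrained("./HunyuanImage-3.0-qint4")  # saved with a non-official layout, as noted above
```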

100
9

Flux1-Dev-DedistilledMixTuned-V3-PAP

78
3

GNER-T5-xxl-encoder-only

license:apache-2.0
73
8

Emu35-NF4

===================================================================================

This model is the NF4 quantized version of https://huggingface.co/BAAI/Emu3.5. It can be loaded directly with the official inference code; the bitsandbytes dependency must be installed in addition. With the whole model loaded on the GPU it occupies about 24 GB, and image generation needs up to about 32 GB of VRAM. (Based on my tests, installing the pre-built flash_attn==2.7.4 wheel also works.)

Prompt: "Live shot, close-up, full-body photo, a snow leopard standing on a rock, the body is standing sideways, standing slightly upward on the rock, the tail is slightly cocked, the head is twisted to face the camera, the eyes are looking directly at the camera, the expression is majestic, the background is slightly blurred in the distance, gray rocks and mountains."

Emu3.5 is the latest omni-modal model open-sourced by BAAI (Beijing Academy of Artificial Intelligence), with results on par with Google's Nano Banana. The introduction below is quoted from the official model page. This quantized model is intended for community research and learning; please comply with the official license terms.

===================================================================================

Emu3.5: Native Multimodal Models are World Learners

| 🔹 | Core Concept | Description |
| :-: | :--- | :--- |
| 🧠 | Unified World Modeling | Predicts the next state jointly across vision and language, enabling coherent world modeling and generation. |
| 🧩 | End-to-End Pretraining | Trained with a unified next-token prediction objective over interleaved vision–language sequences. |
| 📚 | Over 10T+ Multimodal Tokens | Pre-trained on over 10 trillion interleaved tokens from video frames and transcripts, capturing spatiotemporal structure. |
| 🔄 | Native Multimodal I/O | Processes and generates interleaved visual–text sequences without modality adapters or task-specific heads. |
| 🎯 | RL Post-Training | Large-scale reinforcement learning enhances reasoning, compositionality, and generation quality. |
| ⚡ | Discrete Diffusion Adaptation (DiDA) | Converts sequential decoding into bidirectional parallel prediction, achieving ≈20× faster inference without performance loss. |
| 🖼️ | Versatile Generation | Excels in long-horizon vision–language generation, any-to-image (X2I) synthesis, and text-rich image creation. |
| 🌐 | Generalizable World Modeling | Enables spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios. |
| 🏆 | Performance Benchmark | Matches Gemini 2.5 Flash Image (Nano Banana) on image generation/editing, and outperforms it on interleaved generation tasks. |

Contents: 1. Model & Weights  2. Quick Start  3. Schedule  4. Citation

| Model name | HF Weight |
| --- | --- |
| Emu3.5 | 🤗 HF link |
| Emu3.5-Image | 🤗 HF link |
| Emu3.5-VisionTokenizer | 🤗 HF link |

- Paths: `model_path`, `vq_path`
- Task template: `task_type` in {t2i, x2i, howto, story, explore, vla}; `use_image` controls ` ` usage (set to true when reference images are provided)
- Sampling: `sampling_params` (classifier_free_guidance, temperature, top_k/top_p, etc.)

Protobuf outputs are written to `outputs/ /proto/`. For better throughput, we recommend ≥2 GPUs.

- [x] Inference Code
- [ ] Advanced Image Decoder
- [ ] Discrete Diffusion Adaptation (DiDA)

license:apache-2.0
63
5

UniWorld-V1-NF4

This model is a BnB 4-bit pre-quantized version of the official https://huggingface.co/LanguageBind/UniWorld-V1 model, which greatly reduces the download size, storage footprint, and VRAM usage.

How to load: this repo is loaded in the same way as the official original FP32 model; first make sure the bitsandbytes dependency is installed in your Python environment.

UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

[2025.06.03] 🤗 We release UniWorld, a unified framework for understanding, generation, and editing. All data, models, training code, and evaluation code are open-sourced. Check our report for more details. Welcome to watch 👀 this repository for the latest updates.

UniWorld, trained on only 2.7M samples, consistently outperforms BAGEL (trained on 2665M samples) on ImgEdit-Bench for image manipulation. It also surpasses the specialized image editing model Step1X-Edit across multiple dimensions, including add, adjust, and extract on ImgEdit-Bench.

1. All Resources Fully Open-Sourced
- We fully open-source the models, data, training, and evaluation code to facilitate rapid community exploration of unified architectures.
- We curate 10+ CV downstream tasks, including canny, depth, sketch, MLSD, segmentation, and so on.
- We annotate 286K long-caption samples using Qwen2-VL-72B. We use GPT-4o to filter ImgEdit, resulting in 724K high-quality editing samples (all short edge ≥ 1024 px). Additionally, we organize and filter existing open-sourced datasets. The details can be found here.

2. Contrastive Semantic Encoders as Reference Control Signals
- Unlike prior approaches that use VAE-encoded reference images for low-level control, we advocate using contrastive visual encoders as control signals for reference images.
- For such encoders, we observe that as resolution increases, global features approach saturation and model capacity shifts toward preserving fine details, which is crucial for maintaining fidelity in non-edited regions.

3. Image Priors via VLM Encoding Without Learnable Tokens
- We find that multimodal features encoded by VLMs can interpret instructions while retaining image priors. Due to causal attention, the format ` ` is particularly important.

We highly recommend trying out our web demo with the following command.

1. Clone this repository and navigate to the UniWorld folder. Download the data from LanguageBind/UniWorld-V1; the dataset consists of two parts: source images and annotation JSON files.
2. The second column is the corresponding annotation JSON file.
3. The third column indicates whether to enable the region-weighting strategy. We recommend setting it to True for edited data and False for others.

We provide a simple online verification tool to check whether your paths are set correctly in `data.txt`.

- BLIP3o-60k: We add text-to-image instructions to half of the data. [108 GB storage usage.]
- OSP1024-286k: Sourced from internal data of the Open-Sora Plan, with captions generated using Qwen2-VL-72B. Images have an aspect ratio between 3:4 and 4:3, aesthetic score ≥ 6, and a short side ≥ 1024 pixels. [326 GB storage usage.]
- imgedit-724k: Data is filtered using GPT-4o, retaining approximately half. [2.1 TB storage usage.]
- OmniEdit-368k: For image editing data, samples with edited regions smaller than 1/100 were filtered out; images have a short side ≥ 1024 pixels. [204 GB storage usage.]
- SEED-Data-Edit-Part1-Openimages-65k: For image editing data, samples with edited regions smaller than 1/100 were filtered out. Images have a short side ≥ 1024 pixels. [10 GB storage usage.]
- SEED-Data-Edit-Part2-3-12k: For image editing data, samples with edited regions smaller than 1/100 were filtered out. Images have a short side ≥ 1024 pixels. [10 GB storage usage.]
- PromptfixData-18k: For image restoration data and some editing data, samples with edited regions smaller than 1/100 were filtered out. Images have a short side ≥ 1024 pixels. [9 GB storage usage.]
- StyleBooth-11k: For style transfer data, images have a short side ≥ 1024 pixels. [4 GB storage usage.]
- Ghibli-36k: For style transfer data, images have a short side ≥ 1024 pixels. Warning: this data has not been quality filtered. [170 GB storage usage.]
- vitonhd-23k: Converted from the source data into an instruction dataset for product extraction. [1 GB storage usage.]
- deepfashion-27k: Converted from the source data into an instruction dataset for product extraction. [1 GB storage usage.]
- shopproduct-23k: Sourced from internal data of the Open-Sora Plan, focusing on product extraction and virtual try-on, with images having a short side ≥ 1024 pixels. [12 GB storage usage.]
- coco2017captioncanny-236k: img->canny & canny->img [25 GB storage usage.]
- coco2017captiondepth-236k: img->depth & depth->img [8 GB storage usage.]
- coco2017captionhed-236k: img->hed & hed->img [13 GB storage usage.]
- coco2017captionmlsd-236k: img->mlsd & mlsd->img [ GB storage usage.]
- coco2017captionnormal-236k: img->normal & normal->img [10 GB storage usage.]
- coco2017captionopenpose-62k: img->pose & pose->img [2 GB storage usage.]
- coco2017captionsketch-236k: img->sketch & sketch->img [15 GB storage usage.]
- unsplashcanny-20k: img->canny & canny->img [2 GB storage usage.]
- openpose-40k: img->pose & pose->img [4 GB storage usage.]
- mscoco-controlnet-canny-less-colors-236k: img->canny & canny->img [13 GB storage usage.]
- coco2017segbox-448k: img->detection & img->segmentation (mask); instances with regions smaller than 1/100 were filtered out. We visualise masks on the original image as gt-image. [39 GB storage usage.]
- vitonhd-11k: img->pose [1 GB storage usage.]
- deepfashion-13k: img->pose [1 GB storage usage.]

Prepare pretrained weights: Download black-forest-labs/FLUX.1-dev to `$FLUXPATH`. Download Qwen/Qwen2.5-VL-7B-Instruct to `$QWENVLPATH`. We also support other sizes of Qwen2.5-VL. Download flux-redux-siglipv2-512.bin and set its path to `pretrainedsiglipmlppath` in `stage2.yaml`. The weight is sourced from ostris/Flex.1-alpha-Redux; we just re-organized the weights.

💡 How to Contribute: We greatly appreciate your contributions to the UniWorld open-source community and helping us make it even better than it is now! For more details, please refer to the Contribution Guidelines.

👍 Acknowledgement and Related Work
- ImgEdit: a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs.
- Open-Sora Plan: an open-source text-to-image/video foundation model, which provides a lot of caption data.
- SEED-Data-Edit: a hybrid dataset for instruction-guided image editing.
- Qwen2.5-VL: the new flagship vision-language model of Qwen.
- FLUX.1-Redux-dev: given an input image, FLUX.1 Redux can reproduce the image with slight variation, allowing refinement of a given image.
- SigLIP 2: new multilingual vision-language encoders.
- Step1X-Edit: a state-of-the-art image editing model.
- BLIP3-o: a unified multimodal model that combines the reasoning and instruction-following strength of autoregressive models with the generative power of diffusion models.
- BAGEL: an open-source multimodal foundation model with 7B active parameters (14B total) trained on large-scale interleaved multimodal data.

🔒 License: See LICENSE for details. The FLUX weights fall under the FLUX.1 [dev] Non-Commercial License (https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/LICENSE.md).

This model is presented in the paper: UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

license:mit
62
6

Flux1-Dev-DedistilledMixTuned-V3-Krea

V3.0-Krea Version (also https://civitai.com/models/941929): Flux.1-Dev-Krea improves the Dev model's artistic styles and realistic photography, but portrait clarity and aesthetics are weaker, and in particular its compatibility with LoRAs trained on the original Dev model is poor. This V3.0-Krea keeps the main strengths of the Krea model and improves image clarity and compatibility with original Dev LoRAs, but the LoRA compatibility gain is small and not ideal, which is the disappointing part of this version; please download with that in mind.

Recommended: use GNER-T5-XXL instead of T5-XXL for better prompt understanding.

Basic settings: deis + simple / euler + beta; you can try other combinations.

35
6

Ming-Lite-Omni-v1.5-NF4

===================================================================================

This model is the NF4 quantized version of https://huggingface.co/inclusionAI/Ming-Lite-Omni-1.5. It can be loaded directly with the official demo code, which lets the project be used for inference testing and research on a 24 GB consumer GPU.

Only the quantized weights of the main model are provided here; other parts such as the connector / mlp / talker / transformer / vae should be downloaded from the official repo, and the transformer part can also be quantized as needed.

===================================================================================

📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope

Ming-lite-omni v1.5 is a comprehensive upgrade to the full-modal capabilities of Ming-lite-omni. It significantly improves performance across tasks including image-text understanding, document understanding, video understanding, speech understanding and synthesis, and image generation and editing. Built upon Ling-lite-1.5, Ming-lite-omni v1.5 has a total of 20.3 billion parameters, with 3 billion active parameters in its MoE (Mixture-of-Experts) section. It demonstrates highly competitive results in various modal benchmarks compared to industry-leading models.

- [2025.07.15] 🔥 We release Ming-lite-omni v1.5 with significant improvements across all modalities.
- [2025.06.12] 🔥 Our Technical Report is public on arXiv.
- [2025.05.28] 🔥 The official version of Ming-lite-omni v1 is released, with better performance and image generation support.
- [2025.05.04] 🔥 We release the test version of Ming-lite-omni: Ming-lite-omni-Preview.

Key Features

Compared to Ming-lite-omni, Ming-lite-omni v1.5 features key optimizations in the following three areas:
- Enhanced Video Understanding (MRoPE & Curriculum Learning): Ming-lite-omni v1.5 significantly improves video understanding through MRoPE's 3D spatiotemporal encoding and a curriculum learning strategy for handling long videos, enabling precise comprehension of complex visual sequences.
- Refined Multi-modal Generation (Consistency & Perception Control): Ming-lite-omni v1.5 offers superior generation, featuring dual-branch image generation with ID & Scene Consistency Loss for coherent editing, and perception enhancement for detailed visual control. Its new audio decoder and BPE encoding also deliver high-quality, real-time speech synthesis.
- Comprehensive Data Upgrades (Broadened & Refined Fine-grained Data): Ming-lite-omni v1.5's capabilities are built on extensive data upgrades, including new structured text data, expanded high-quality product information, and refined fine-grained visual and speech perception data (including dialects). This provides a richer, more accurate foundation for all modalities.

Evaluation

In various modality benchmark tests, Ming-lite-omni v1.5 demonstrates highly competitive results compared to industry-leading models of similar scale.

Image-text Understanding

Ming-lite-omni v1.5 shows significant improvements in general image-text understanding, visual object localization, and universal object recognition capabilities, providing a more powerful base model for a wide range of visual applications.
| Task Type | Dataset | Qwen2.5-VL-7B | Ming-lite-omni v1.5 |
| --- | --- | --- | --- |
| OpenCompass | AI2D | 84.36 | 84.91 |
| | HallusionBench | 55.77 | 54.59 |
| | MMBench TEST V11 | 82.75 | 80.73 |
| | MMMU | 56.56 | 54.33 |
| | MMStar | 65.27 | 65.07 |
| | MMVet | 71.61 | 73.99 |
| | MathVista | 68.10 | 72.00 |
| | OCRBench | 87.80 | 88.90 |
| | Average | 71.5 | 71.8 |
| Localization | RefCOCO val/testA/testB | 90.00/92.5/85.4 | 91.40/93.2/87.1 |
| | RefCOCO+ val/testA/testB | 84.20/89.1/76.9 | 86.30/90.5/79.2 |
| | RefCOCOg val/test | 87.2/87.2 | 87.1/87.6 |
| Recognition | General Recognition | 92.42 | 92.53 |
| | Vertical domains for natural encyclopedias (animals, plants, ingredients, vehicles, dishes, etc.) | 47.79 | 54.27 |

Document Understanding

Ming-lite-omni v1.5 generally performs on par with Qwen2.5-VL-7B in complex document understanding tasks. Notably, it achieves SOTA results among models under 10B parameters on OCRBench, which focuses on text-visual understanding, and on ChartQA, which requires in-depth chart visual analysis and logical reasoning.

| Task Type | Dataset | Qwen2.5-VL-7B | Ming-lite-omni v1 | Ming-lite-omni v1.5 |
| --- | --- | --- | --- | --- |
| OCR Understanding | ChartQA test | 87.24 | 85.1 | 88.84 |
| | DocVQA test | 95.57 | 93.0 | 93.68 |
| | TextVQA val | 85.06 | 82.8 | 82.27 |
| | OCRBench | 87.8 | 88.4 | 88.90 |
| | Average | 88.91 | 87.32 | 88.42 |
| Document Analysis | OmniDocBench↓ en/zh | 30.8/39.8 | 34/34.4 | 34.9/34.9 |
| OCR Comprehensive Capability | OCRBenchV2 en/zh | 56.3/57.2 | 53.3/52 | 52.1/55.2 |

Video Understanding

Ming-lite-omni v1.5 achieves a leading position among models of its size in video understanding tasks.

| Benchmark | Qwen2.5-VL-7B | Qwen2.5-Omni-7B | InternVL3-8B | Ming-lite-omni v1.5 |
| --- | --- | --- | --- | --- |
| VideoMME (w/o subs) | 65.10 | 64.30 | 66.30 | 67.07 |
| VideoMME (w/ subs) | 71.60 | 72.40 | 68.90 | 72.59 |
| VideoMME (avg) | 68.35 | 68.35 | 67.60 | 69.83 |
| MVBench | 69.60 | 70.30 | 75.40 | 69.43 |
| LongVideoBench | 56.00 | 54.82 | 58.80 | 59.54 |
| OvOBench | 51.10 | 50.46 | 51.91 | 52.17 |

Ming-lite-omni v1.5 further improves upon Ming-lite-omni in speech understanding. It supports English, Mandarin, Cantonese, Sichuanese, Shanghainese, Minnan, and other dialects, maintaining an industry-leading position in open-source English and Mandarin ASR (Automatic Speech Recognition) and Audio QA (Question Answering) tasks.

| Model | Average on All/Open-source Benchmarks(↓) | aishell1 | aishell2testandroid | aishell2testios | cv15zh | fleurszh | wenetspeechtestmeeting | wenetspeechtestnet | librispeechtestclean | librispeechtestother | multilinguallibrispeech | cv15en | fleursen | voxpopuliv1.0en | speechioleaderboard | dialecthunan | dialectminnan | dialectguangyue | dialectchuanyu | dialectshanghai | noisyjrgj | zxbchat | zxbgovern | zxbhealth | zxbknowledge | zxblocallive |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ming-lite-omni-1.5 | 4.67(+0.15)/3.83(+0.05) | 1.3 | 2.47 | 2.46 | 5.66 | 2.87 | 6.19 | 5.24 | 1.25 | 2.61 | 4.14 | 6.95 | 3.28 | 6.43 | 2.81 | 6.96 | 12.74 | 3.7 | 3.8 | 9.95 | 10.9 | 2.6 | 1.77 | 2.97 | 3.41 | 1.88 |
| Ming-lite-omni | 4.82/3.88 | 1.47 | 2.55 | 2.52 | 6.31 | 2.96 | 5.95 | 5.46 | 1.44 | 2.80 | 4.15 | 6.89 | 3.39 | 5.80 | 2.65 | 7.88 | 13.84 | 4.36 | 4.33 | 10.49 | 11.62 | 2.34 | 1.77 | 3.31 | 3.69 | 2.44 |
| Qwen2.5-Omni | 8.81/4.37 | 1.18 | 2.75 | 2.63 | 5.2 | 3.0 | 5.9 | 7.7 | 1.8 | 3.4 | 7.56 | 7.6 | 4.1 | 5.8 | 2.54 | 29.31 | 53.43 | 10.39 | 7.61 | 32.05 | 11.11 | 3.68 | 2.23 | 4.02 | 3.17 | 2.03 |
| Qwen2-Audio | 12.34/5.41 | 1.53 | 2.92 | 2.92 | 6.9 | 7.5 | 7.16 | 8.42 | 1.6 | 3.6 | 5.40 | 8.6 | 6.90 | 6.84 | - | 25.88 | 123.78 | 7.59 | 7.77 | 31.73 | - | 4.29 | 2.70 | 4.18 | 3.33 | 2.34 |
| Kimi-Audio | 12.75/4.42 | 0.60 | 2.64 | 2.56 | 7.21 | 2.69 | 6.28 | 5.37 | 1.28 | 2.42 | 5.88 | 10.31 | 4.44 | 7.97 | 2.23 | 31.93 | 80.28 | 41.49 | 6.69 | 60.64 | 24.40 | 2.96 | 2.03 | 2.38 | 1.98 | 2.05 |

| Model | Average (Open-ended QA) | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ming-lite-omni v1.5 [omni] | 4.474(+0.134) | 4.648 | 4.3 | 61.16 | 45.77 | 65.934 | 55.599 | 98.076 |
| Ming-lite-omni [omni] | 4.34 | 4.63 | 4.06 | 58.84 | 47.53 | 61.98 | 58.36 | 99.04 |
| MiniCPM-o [omni] | 4.285 | 4.42 | 4.15 | 50.72 | 54.78 | 78.02 | 49.25 | 97.69 |
| Kimi-Audio [audio] | 4.215 | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
| Qwen2.5-Omni [omni] | 4.21 | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
| GLM-4-Voice [audio] | 3.77 | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |
| Qwen2-Audio-chat [audio] | 3.545 | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 |
| Step-Audio-chat [audio] | 3.49 | 3.99 | 2.99 | 46.84 | 31.87 | 29.19 | 65.77 | 86.73 |

Speech Generation

Ming-lite-omni v1.5 shows significant improvement over Ming-lite-omni in English and Mandarin voice cloning tasks.

| Model | seed-tts-eval zh wer | seed-tts-eval zh sim | seed-tts-eval en wer | seed-tts-eval en sim |
| --- | --- | --- | --- | --- |
| Seed-TTS | 1.11 | 0.796 | 2.24 | 0.762 |
| MaskGCT | 2.27 | 0.774 | 2.62 | 0.714 |
| E2 TTS | 1.97 | 0.730 | 2.19 | 0.710 |
| F5-TTS | 1.56 | 0.741 | 1.83 | 0.647 |
| CosyVoice 2 | 1.45 | 0.748 | 2.57 | 0.652 |
| Qwen2.5-Omni-7B | 1.70 | 0.752 | 2.72 | 0.632 |
| Ming-lite-omni | 1.69 | 0.68 | 4.31 | 0.509 |
| Ming-lite-omni v1.5 | 1.93 | 0.68 | 3.75 | 0.54 |

Image Generation

Ming-lite-omni v1.5 demonstrates significant advantages in maintaining scene and person ID consistency during human image editing. It also expands its support for perception tasks such as generative segmentation, depth prediction, object detection, and edge contour generation.

| Gen-eval | 1-Obj | 2-Obj | Counting | Colors | Position | Color Attr | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ming-lite-omni | 0.99 | 0.77 | 0.68 | 0.78 | 0.46 | 0.42 | 0.64 |
| Ming-lite-omni v1.5 | 0.99 | 0.93 | 0.86 | 0.87 | 0.90 | 0.66 | 0.87 |

(Image comparison examples from the official card are omitted here: an editing comparison of ours vs. Qwen-VLo for the prompt "Make the person in the image smile slightly without altering the original structure"; referring, semantic, and panoptic segmentation examples; and depth map, detection box, and contour examples.)

You can download our latest model from both Hugging Face and ModelScope. For previous versions such as Ming-Lite-Omni v1, please refer to this link.

| Model | Input modality | Output modality | Download |
| --- | --- | --- | --- |
| Ming-Lite-Omni-1.5 | Image, text, video, audio | Image, text, audio | 🤗 HuggingFace 🤖 ModelScope |

If you are in mainland China, we strongly recommend downloading our model from 🤖 ModelScope. Note: the download can take from several minutes to several hours, depending on your network conditions. Additional demonstration cases are available on our project page.

You can also initialize the environment by building the docker image. First clone this repository; then build the docker image with the provided Dockerfile in `docker/docker-py310-cu121` (this step might take a while); finally, start the container with the current repo directory mounted. You can run the model with the Python interface. You may download the Hugging Face model into the repo directory first (`.../Ming/`) or mount the downloaded model path when starting the container.
Step 2 - Download the model weights and create a soft link to the source code directory.

Step 3 - Enter the code directory; you can refer to the following code to run the Ming-Lite-Omni model. We also provide a simple example of how to use this repo; for detailed usage, please refer to cookbook.ipynb. Note: we test the examples on NVIDIA H800-80GB/H20-96G hardware with CUDA 12.4. Loading inclusionAI/Ming-Lite-Omni-1.5 in bfloat16 takes about 42 GB of GPU memory.

We provide a graphical user interface based on Gradio to facilitate the use of Ming-lite-omni. This code repository is licensed under the MIT License, and the Legal Disclaimer is located in the LEGAL.md file under the project's root directory. If you find our work helpful, feel free to cite us.

license:mit
12
3

Nexus-GenV2-nf4-fp8

license:apache-2.0
4
4

byt5-xxl-enc-nf4

This model is the text encoder part extracted from the official https://huggingface.co/google/byt5-xxl model, quantized with BnB 4-bit. This greatly reduces the model's size and makes it convenient to use downstream as an encoder, for example as the text encoder for the https://huggingface.co/OPPOer/MultilingualFLUX.1-adapter project.

================================================================================

ByT5 is a tokenizer-free version of Google's T5 and generally follows the architecture of MT5. ByT5 was only pre-trained on mC4, excluding any supervised training, with an average span-mask of 20 UTF-8 characters. Therefore, this model has to be fine-tuned before it is usable on a downstream task. ByT5 works especially well on noisy text data; e.g., `google/byt5-xxl` significantly outperforms mt5-xxl on TweetQA.

Paper: ByT5: Towards a token-free future with pre-trained byte-to-byte models
Authors: Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel

ByT5 works on raw UTF-8 bytes and can be used without a tokenizer; for batched inference and training, however, using a tokenizer class for padding is recommended.

Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.
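The code snippets from the original ByT5 card were stripped in the text above, so here is a minimal sketch of byte-level encoding with the encoder-only part; the repo id and the use of `T5EncoderModel` for this extracted checkpoint are assumptions.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# ByT5 can work directly on UTF-8 bytes (+3 offset reserved for special tokens),
# but the ByT5 tokenizer class is more convenient for batched, padded inputs.
tokenizer = AutoTokenizer.from_pretrained("google/byt5-xxl")
encoder = T5EncoderModel.from_pretrained("wikeeyang/byt5-xxl-enc-nf4", device_map="auto")  # assumed repo id

# Tokenizer-free route, shown only for illustration of the byte-level input format.
raw_ids = torch.tensor([list("Hello, world!".encode("utf-8"))]) + 3

batch = tokenizer(["Hello, world!", "ByT5 is byte-level."], padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**{k: v.to(encoder.device) for k, v in batch.items()}).last_hidden_state
print(hidden.shape)  # (batch, sequence_length, d_model)
```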

license:apache-2.0
2
1