Alpha-VLLM
Lumina-DiMOO
Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

[📑 Technical Report]   [💜 Project Page (Demo & Benchmark)]   [🌐 Code]

¹Shanghai Innovation Institute, ²Shanghai AI Laboratory, ³Shanghai Jiao Tong University, ⁶The Chinese University of Hong Kong, ⁷Tsinghua University

📚 Introduction

We introduce Lumina-DiMOO, an omni foundational model for seamless multimodal generation and understanding. Lumina-DiMOO is distinguished by four key innovations:

- Unified Discrete Diffusion Architecture: Lumina-DiMOO sets itself apart from prior unified models by using fully discrete diffusion modeling to handle inputs and outputs across modalities.
- Versatile Multimodal Capabilities: Lumina-DiMOO supports a broad spectrum of multimodal tasks, including text-to-image generation (supporting arbitrary and high resolutions), image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting), and advanced image understanding.
- Higher Sampling Efficiency: Compared with previous AR or hybrid AR-diffusion paradigms, Lumina-DiMOO demonstrates markedly higher sampling efficiency. Additionally, we design a bespoke caching method that further speeds up sampling by 2x.
- Superior Performance: Lumina-DiMOO achieves state-of-the-art results on multiple benchmarks, surpassing existing open-source unified multimodal models and setting a new standard in the field.

📽️ Qualitative Results

Here we present comparative generation results against other models. For additional visualization results, please see our Project Page.

[Figure: Controllable & Subject-Driven Generation Comparison]

- Because text is generated block by block, unlike image generation, which decodes all image tokens in one global pass, its speed depends on both the number of blocks and the number of sampling steps per block. The speedup for image understanding is therefore less pronounced than for image generation.
- Lumina-DiMOO settings: for image generation we sample 64 steps; for image understanding we set the block length to 256 and the number of sampling steps to 128. A minimal sketch of this decoding style appears at the end of this section.

💬 Discussion

You can reach us via this WeChat QR code!

📜 Acknowledgements

This work was also supported and implemented by MindSpeed MM, an open-source training framework for large-scale multimodal models designed for distributed training, developed and maintained by Huawei's Computing Product Line. Optimized specifically for Huawei's Ascend AI chips, MindSpeed MM offers comprehensive support for distributed training and is tailored to a wide range of multimodal tasks.
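To make the decoding distinction above concrete, here is a minimal, hypothetical sketch of confidence-based parallel unmasking, the standard sampling style for masked discrete diffusion models. It is not Lumina-DiMOO's actual sampler (that, along with the caching method, lives in the official code); `model`, `mask_id`, and the unmasking schedule are all illustrative assumptions.

```python
import torch

@torch.no_grad()
def discrete_diffusion_sample(model, seq_len, mask_id, num_steps=64, device="cuda"):
    """Confidence-based parallel unmasking (MaskGIT-style) over discrete tokens.

    Illustrative only: `model` is any denoiser mapping a partially masked
    token sequence (1, seq_len) to per-position logits (1, seq_len, vocab).
    Images are decoded in one global pass like this; text would instead run
    the same loop block by block (e.g., block length 256).
    """
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for step in range(num_steps):
        still_masked = tokens == mask_id
        num_masked = int(still_masked.sum())
        if num_masked == 0:
            break
        logits = model(tokens)                        # (1, seq_len, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf = conf.masked_fill(~still_masked, -1.0)  # only rank masked slots
        # Unmask the most confident positions; everything is revealed by the
        # final step because the divisor shrinks to 1.
        k = max(1, num_masked // (num_steps - step))
        top = conf.topk(k, dim=-1).indices
        tokens.scatter_(1, top, pred.gather(1, top))
    return tokens
```

The 2x caching speedup mentioned above would slot into the `model(tokens)` call, for instance by reusing transformer activations across adjacent steps; see the technical report for the actual method.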
Lumina-Image-2.0
Lumina-Image-2.0 is a 2-billion-parameter flow-based diffusion transformer that generates images from text descriptions. For more information, visit our GitHub. We also provide an official Gradio demo, which you can try via the provided link. This is a Hugging Face Diffusers implementation of the paper Lumina-Image 2.0: A Unified and Efficient Image Generative Framework; a minimal usage sketch follows below. If you find the provided code or models useful for your research, consider citing them as:
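As referenced above, here is a minimal usage sketch with the Diffusers auto-pipeline. The repository id `Alpha-VLLM/Lumina-Image-2.0` and the generation settings are assumptions for illustration; consult the model card for the exact pipeline class and recommended parameters.

```python
import torch
from diffusers import DiffusionPipeline

# Auto-resolve the Lumina-Image 2.0 pipeline from the Hub; the repo id and
# settings below are illustrative, so check the model card for specifics.
pipe = DiffusionPipeline.from_pretrained(
    "Alpha-VLLM/Lumina-Image-2.0", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = pipe(
    prompt="A close-up photo of a red fox in fresh snow",
    num_inference_steps=50,  # illustrative step count
    guidance_scale=4.0,      # illustrative classifier-free guidance weight
).images[0]
image.save("fox.png")
```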
Lumina-mGPT-7B-768
Lumina-mGPT-7B-768-Omni
Lumina-Next-SFT-diffusers
Lumina-mGPT-7B-512
Lumina-mGPT-2.0
Chameleon_7B_mGPT
Lumina-mGPT-7B-1024
Lumina-Next-T2I
Lumina-mGPT-2.0-Omni
📚 Introduction

Lumina-mGPT 2.0 is a stand-alone, decoder-only autoregressive model, trained from scratch, that unifies a broad spectrum of image generation tasks, including text-to-image generation, image-pair generation, subject-driven generation, multi-turn image editing, controllable generation, and dense prediction.

🚀 Usage

We provide the implementation of Lumina-mGPT 2.0 as well as sampling code; for details, visit our GitHub. A minimal sketch of the autoregressive decoding loop appears below. If you find the provided code or models useful for your research, consider citing them as:
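As noted above, here is a minimal sketch of the next-token decoding loop that a decoder-only autoregressive model like this relies on. It is not Lumina-mGPT 2.0's actual sampling code (see the GitHub repo for that); `model`, the detokenization step, and `temperature` are illustrative assumptions.

```python
import torch

@torch.no_grad()
def ar_sample(model, prompt_ids, num_new_tokens, temperature=1.0):
    """Plain next-token sampling for a decoder-only autoregressive model.

    Illustrative only: `model` maps a token sequence (1, T) to logits
    (1, T, vocab); generated image tokens would then be detokenized into
    pixels by a separate image tokenizer.
    """
    tokens = prompt_ids.clone()
    for _ in range(num_new_tokens):
        logits = model(tokens)[:, -1, :]              # next-token logits
        probs = (logits / temperature).softmax(dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens
```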