showlab
ShowUI-2B
magvitv2
OmniConsistency
OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data [[Official Code]](https://github.com/showlab/OmniConsistency) [[Paper]](https://huggingface.co/papers/2505.18445) [[Dataset]](https://huggingface.co/datasets/showlab/OmniConsistency)

Environment
We recommend using Python 3.10 and PyTorch with CUDA support to set up the environment.

Model Download
You can download the OmniConsistency model and trained LoRAs directly from Hugging Face, or download them with a Python script (a hedged sketch follows this card).

Usage
Here's a basic example of using OmniConsistency (see the sketch below).

Datasets
Our datasets have been uploaded to Hugging Face and are available for direct use via the datasets library. You can easily load any of the 22 style subsets as shown below.
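For the script-based download mentioned above, the following is a minimal sketch using huggingface_hub; the repo id showlab/OmniConsistency and the local directory are assumptions about the layout, so check the official repo for the exact paths.

```python
# Hedged sketch: fetch the OmniConsistency weights and trained LoRAs from Hugging Face.
# The repo id and local directory below are assumptions, not the repo's official script.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="showlab/OmniConsistency",  # assumed model repo id
    local_dir="./OmniConsistency",      # arbitrary local destination
)
print(f"Weights downloaded to {local_dir}")
```

The official usage relies on the pipeline code shipped in the GitHub repository; purely as an illustration of the general pattern (loading a style/consistency LoRA into a FLUX pipeline via diffusers), here is a sketch in which the base model id, LoRA file name, and prompt are all assumptions rather than the project's actual API.

```python
# Illustrative sketch only: loading a style LoRA into a FLUX pipeline with diffusers.
# OmniConsistency's official inference uses the code in the GitHub repo; the base
# model, LoRA path, and weight_name here are assumptions.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # assumed base model
    torch_dtype=torch.bfloat16,
).to("cuda")

# Load a downloaded OmniConsistency/style LoRA (file name is hypothetical).
pipe.load_lora_weights("./OmniConsistency", weight_name="omniconsistency_lora.safetensors")

image = pipe(
    prompt="a portrait rendered in a consistent illustration style",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("stylized.png")
```

To load one of the 22 style subsets with the datasets library, a minimal sketch is below; the config name "Ghibli" is a hypothetical example, so list the available configurations first to see the real subset names.

```python
# Hedged sketch: load a style subset of the OmniConsistency dataset.
# The subset name "Ghibli" is hypothetical; query the dataset for the real names.
from datasets import get_dataset_config_names, load_dataset

print(get_dataset_config_names("showlab/OmniConsistency"))  # list the 22 style subsets
ds = load_dataset("showlab/OmniConsistency", "Ghibli", split="train")  # config name assumed
print(ds[0])
```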
Show O2 7B
Show-o2: Improved Unified Multimodal Models

Jinheng Xie¹, Zhenheng Yang², Mike Zheng Shou¹
¹ Show Lab, National University of Singapore; ² Bytedance

[Paper](https://arxiv.org/abs/2506.15564) [Code](https://github.com/showlab/Show-o/tree/main/show-o2) [WeChat QA](https://github.com/showlab/Show-o/blob/main/docs/wechatqa3.jpg)

This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual path of spatial(-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o/tree/main/show-o2.

What is new about Show-o2?
- We perform unified learning of multimodal understanding and generation on the text token and 3D causal VAE space, which is scalable to text, image, and video modalities.
- A dual path of spatial(-temporal) fusion is proposed to accommodate the distinct feature dependencies of multimodal understanding and generation.
- We employ specific heads with autoregressive modeling and flow matching for the overall unified learning of multimodal understanding, image/video generation, and mixed-modality generation.

Pre-trained Model Weights
The Show-o2 checkpoints can be found on Hugging Face (a hedged download sketch appears at the end of this card):
- showlab/show-o2-1.5B
- showlab/show-o2-1.5B-HQ
- showlab/show-o2-7B
- showlab/show-o2-1.5B (further unified fine-tuning on video understanding data)
- showlab/show-o2-7B (further unified fine-tuning on video understanding data)

Log in to your wandb account on your machine or server. Download the Wan2.1 3D causal VAE model weight here and put it in the current directory.

Demo for Multimodal Understanding
Run the multimodal understanding demo; you can find the results on wandb.

Demo for Text-to-Image Generation
Run the text-to-image generation demo; you can find the results on wandb.

Citation
To cite the paper and model, please use the BibTeX entry provided in the repository.

Acknowledgments
This work is heavily based on Show-o.
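As a convenience for the checkpoints listed above, here is a minimal sketch of fetching a Show-o2 checkpoint with huggingface_hub; the repo ids come from the card, while the local directory is an arbitrary choice rather than part of the repo's official setup instructions.

```python
# Minimal sketch (not the official Show-o2 setup script): download one of the
# checkpoints listed above with huggingface_hub. Repo id is taken from the card;
# the local directory is an arbitrary choice.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="showlab/show-o2-7B",  # or showlab/show-o2-1.5B / showlab/show-o2-1.5B-HQ
    local_dir="./show-o2-7B",
)
print(f"Checkpoint downloaded to {ckpt_dir}")

# The Wan2.1 3D causal VAE weights must still be downloaded separately from the
# link in the original README and placed in the current directory.
```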
show-o
show-1-base
show-o2-1.5B
show-1-interpolation
Show O 512x512
show-o2-7B-w-video-und
show-1-sr1
show-1-sr2
show-o2-1.5B-HQ
show-o-w-clip-vit-512x512
show-o-w-clip-vit
show-o2-1.5B-w-video-und