bytedance-research

26 models

ChatTS-14B

license:apache-2.0
763
140

UMO

UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward

📖 Introduction

Recent advances in image customization show broad application prospects thanks to stronger customization capabilities. However, because humans are especially sensitive to faces, a significant challenge remains: preserving consistent identity while avoiding identity confusion across multi-reference images, which limits the identity scalability of customization models. To address this, we present UMO, a Unified Multi-identity Optimization framework designed to maintain high-fidelity identity preservation and alleviate identity confusion at scale. With its "multi-to-multi matching" paradigm, UMO reformulates multi-identity generation as a global assignment optimization problem and generally improves multi-identity consistency for existing image customization methods through reinforcement learning on diffusion models. To facilitate training, we develop a scalable customization dataset with multi-reference images, consisting of both synthesized and real parts. We also propose a new metric to measure identity confusion. Extensive experiments demonstrate that UMO not only significantly improves identity consistency but also reduces identity confusion across several image customization methods, setting a new state of the art among open-source methods along the dimension of identity preservation.

Please note that UNO produces unstable results on parts of OmniContext because its prompt format differs from its training data (UNO-1M), which leads to a similar issue for UMO models built on it. To get better results with these two models, we recommend using description prompts instead of instruction ones, and a resolution of 768~1024 instead of 512 (a sketch of these settings follows below).

We open-source this project for academic research. The vast majority of images used in this project are either generated or licensed. If you have any concerns, please contact us, and we will promptly remove any inappropriate content. Our code is released under the Apache 2.0 License. This research aims to advance the field of generative AI. Users are free to create images with this tool, provided they comply with local laws and exercise responsible usage. The developers are not liable for any misuse of the tool by users.

Citation

If UMO is helpful, please help to ⭐ the repo. If you find this project useful for your research, please consider citing our paper.
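As a rough illustration of the prompt and resolution recommendations above, here is a minimal sketch assuming a diffusers-style pipeline; the pipeline class, repo id, and call signature are placeholders rather than UMO's actual inference API (see the project repo for the real scripts).

```python
# Minimal sketch of the recommended settings: a description-style prompt and
# 768~1024 resolution. Pipeline class and repo id are illustrative placeholders.
import torch
from diffusers import DiffusionPipeline  # placeholder loading path

pipe = DiffusionPipeline.from_pretrained(
    "bytedance-research/UMO",  # illustrative repo id
    torch_dtype=torch.bfloat16,
).to("cuda")

# A description prompt (what the scene looks like), not an instruction prompt
# ("make the person on the left wear...").
prompt = "Two people standing in a sunlit park, both smiling at the camera."

image = pipe(
    prompt=prompt,
    height=1024,  # 768~1024 recommended over 512
    width=1024,
).images[0]
image.save("umo_sample.png")
```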

license:apache-2.0
694
58

ATI

ATI: Any Trajectory Instruction for Controllable Video Generation

[Paper](https://arxiv.org/pdf/2505.22944) · [Project page](https://anytraj.github.io/)

> ATI: Any Trajectory Instruction for Controllable Video Generation
> Angtian Wang, Haibin Huang, Jacob Zhiyuan Fang, Yiding Yang, Chongyang Ma
> Intelligent Creation Team, ByteDance

This is the repo for Wan2.1 ATI (Any Trajectory Instruction for Controllable Video Generation), a trajectory-based motion control framework that unifies object, local, and camera movements in video generation. This repo is based on the official Wan2.1 implementation.

Code: https://github.com/bytedance/ATI

ATI requires the same environment as official Wan 2.1; follow the instructions in INSTALL.md (Wan2.1). First, download the 14B original model of Wan2.1. Then download the ATI-Wan model from our Hugging Face repo. Finally, copy the VAE, T5, and other miscellaneous checkpoints from the original Wan2.1 folder to the ATI checkpoint location.

When launching inference, `-p` is the path to the config file, `-c` is the path to the checkpoint, `-o` is the path to the output directory, and `-g` is the number of GPUs to use (if unspecified, all available GPUs are used; if `1` is given, the script runs in single-process mode). A scripted invocation sketch appears at the end of this entry. Once finished, you can expect to find:

- `samples/outputs`: the raw output videos.
- `samples/images_tracks`: the input image together with the user-specified trajectories.
- `samples/outputs_vis`: the output videos together with the user-specified trajectories.

We provide an interactive tool that allows users to draw and edit trajectories on their images:

1. Launch the editor, then open localhost:5000 in the browser. Note: if you run the editor on a server, replace `localhost` with the server's IP address.
2. In the interface, click Choose File to open a local image.
3. Draw trajectories:
   a. Free Trajectory: click and drag the mouse directly on the image.
   b. Circular (Camera Control): place a circle on the image, then drag to set its size for frame 0; place a few (3–4 recommended) track points on the circle; drag the radius control to achieve zoom-in/zoom-out effects.
   c. Static Point: a point that remains stationary over time.
   Note: use the progress bar in the box to control motion speed.
4. Trajectory Editing: select a trajectory, then delete, edit, or copy it. In edit mode, drag the trajectory directly on the image. The selected trajectory is highlighted by color.
5. Camera Pan Control: enter horizontal (X) or vertical (Y) speed (pixels per frame). Positive X moves right; negative X moves left. Positive Y moves down; negative Y moves up. Click Add to Selected to apply to the current trajectory, or Add to All to apply to all trajectories. The selected points gain a constant pan motion on top of their existing movement.
6. Important: after editing, click Store Tracks to save. Each image (not each trajectory) must be saved separately after drawing all its trajectories.
7. Once all edits are complete, locate the `videos_example` folder in the Trajectory Editor.

Citation

Please cite our paper if you find our work useful.
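As a convenience, here is a hedged sketch of invoking the inference script with the flags documented above (`-p` config, `-c` checkpoint, `-o` output directory, `-g` GPU count). The script name and paths are placeholders; check the ATI repo for the actual entry point.

```python
# Hedged sketch of an ATI inference run; "generate.py" and the paths below
# are placeholders, only the flag meanings come from the README above.
import subprocess

subprocess.run(
    [
        "python", "generate.py",           # placeholder script name
        "-p", "configs/ati_wan_14b.yaml",  # illustrative config path
        "-c", "checkpoints/ATI-Wan-14B",   # path to the ATI checkpoint
        "-o", "samples",                   # outputs land in samples/outputs etc.
        "-g", "1",                         # single-process mode
    ],
    check=True,
)
```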

license:apache-2.0
589
27

Vidi-7B

license:cc-by-nc-4.0
565
9

ChatTS-8B

license:apache-2.0
541
6

HuMo

HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning

[Paper](https://arxiv.org/abs/2509.08519) · [Project page](https://phantom-video.github.io/HuMo/)

> HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
> Liyang Chen, Tianxiang Ma, Jiawei Liu, Bingchuan Li†, Zhuowei Chen, Lijie Liu, Xu He, Gen Li, Qian He, Zhiyong Wu§
> Equal contribution, † Project lead, § Corresponding author
> Tsinghua University | Intelligent Creation Te...

license:apache-2.0
461
238

LVFace

[ICCV 2025 Highlight] LVFace: Progressive Cluster Optimization for Large Vision Models in Face Recognition

If you like LVFace, please give us a star ⭐ on GitHub for the latest updates.

This is the official PyTorch implementation for the inference of LVFace. Drawing inspiration from the massive data support, multi-stage training paradigm, and Transformer architecture of large-model technology, this method, based on a Large Vision Transformer, progressively optimizes the face clustering space through multiple stages on massive datasets.

News
- 🔥🔥🔥 We have released the training weights of LVFace. Please click here to download them. (August 2025 UTC)
- 🎉🎉🎉 LVFace has been selected as an ICCV Highlight. (July 2025 UTC)
- 🎉🎉🎉 LVFace is accepted by ICCV 2025. (July 2025 UTC)
- 🔥🔥🔥 We have updated the arXiv report of LVFace. Please click here to view it. (March 2025 UTC)
- 🎉🎉🎉 LVFace secured 1st place in the ICCV 2021 Masked Face Recognition (MFR)-Ongoing Challenge (academic track). (December 2024 UTC)

Requirements

All required dependencies are listed in `requirements.txt`. Install them with a single command:

pip install -r requirements.txt

Datasets

Test datasets for inference validation can be downloaded from the following sources:
- IJB-C & IJB-B: Google Drive
- MFR-Ongoing: Challenge Page

LVFace Pretrained Models

Pretrained model weights for inference are available below in both ONNX and PyTorch (.pt) formats:

| Training Data | Model | IJB-C (1e-6) | IJB-C (1e-5) | IJB-C (1e-4) | IJB-B (1e-4) | Download |
|---|---|---|---|---|---|---|
| Glint360K | LVFace-T | 88.53 | 95.63 | 96.67 | 95.41 | HuggingFace |
| Glint360K | LVFace-S | 90.06 | 96.52 | 97.31 | 96.14 | HuggingFace |
| Glint360K | LVFace-B | 90.06 | 97.00 | 97.70 | 96.51 | HuggingFace |
| Glint360K | LVFace-L | 89.51 | 97.02 | 97.66 | 96.51 | HuggingFace |
| WebFace42M | LVFace-B | - | - | - | - | coming soon |

Inference

1. Clone the repository and navigate to the project directory, then install all required dependencies using the provided `requirements.txt`.
2. Download Pretrained Models: download the ONNX-format pretrained weights from the LVFace Pretrained Models section, then place them in a directory (e.g., `./LVFacemodel/`).
3. Run Inference: execute the `inference_onnx.py` script to perform feature extraction and similarity calculation (a standalone sketch of the same flow appears at the end of this entry).

Note: The `LVFaceONNXInferencer` class is defined in `inference_onnx.py`; it handles ONNX model loading, image preprocessing, feature extraction, and similarity calculation in a unified interface. Ensure the model path and image paths are correctly specified before running.

Evaluation Steps

1. Modify the test dataset path (e.g., IJB-C, IJB-B) in the corresponding evaluation script (`eval_ijbc.py`).
2. Run the evaluation with pretrained model weights.

License

The code of LVFace is released under the MIT License, with no limitation for either academic or commercial usage. The models downloaded from our repo follow the above license policy (which is for non-commercial research purposes only).

Citation

If you find this work useful, please cite our paper and give us a star ⭐.

Acknowledgments

We sincerely thank Professor Jiankang Deng for his valuable guidance and insights throughout the research. We also appreciate the InsightFace project for their excellent work and research support.
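For orientation, here is a hedged sketch of the flow the README attributes to `inference_onnx.py`: load the ONNX model, preprocess two face crops, extract embeddings, and compare them with cosine similarity. The 112×112 input size and [-1, 1] normalization are common ArcFace-style conventions assumed here, and the model filename is illustrative; the official `LVFaceONNXInferencer` class wraps all of this.

```python
# Hedged sketch of ONNX-based face feature extraction and comparison.
import numpy as np
import onnxruntime as ort
from PIL import Image

session = ort.InferenceSession("./LVFacemodel/LVFace-B.onnx")  # illustrative filename
input_name = session.get_inputs()[0].name

def embed(path: str) -> np.ndarray:
    """Preprocess one face crop and return its L2-normalized embedding."""
    img = Image.open(path).convert("RGB").resize((112, 112))
    x = np.asarray(img, dtype=np.float32)
    x = (x / 127.5) - 1.0            # assumed [-1, 1] normalization
    x = x.transpose(2, 0, 1)[None]   # HWC -> NCHW with batch dim
    feat = session.run(None, {input_name: x})[0][0]
    return feat / np.linalg.norm(feat)

a, b = embed("face_a.jpg"), embed("face_b.jpg")
print("cosine similarity:", float(a @ b))
```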

license:mit
402
12

Valley2.5

license:apache-2.0
273
4

Valley3

🎮️ GitHub | 🤗 Hugging Face | 🤖 ModelScope | 📑 Home Page | 📙 Paper

Introduction

Valley is a cutting-edge multimodal large model developed by ByteDance, designed to handle a variety of tasks involving text, image, and video data. Our model:
- Achieved the best results on in-house e-commerce and short-video benchmarks, much better than other SOTA open-source models.
- Demonstrated comparatively outstanding performance on the OpenCompass benchmark.

Release
- [2025/11/27] 🔥🔥🔥 We have released the technical report of Valley2.5! Check out the full paper here: Valley2.5 Technical Report.
- [2025/10/26] 🔥🔥🔥 Updated Valley2.5, significantly enhancing multimodal understanding and reasoning capabilities and achieving 74.3 on the OpenCompass multimodal academic leaderboard!
- [2025/02/15] 🔥 Updated Valley2-DPO, achieving 69.6 on the OpenCompass multimodal academic leaderboard, and updated AutoModel usage for checkpoints.
- [2025/01/13] 🔥 Released the tech report: Valley2: Exploring Multimodal Models with Scalable Vision-Language Design.
- [2024/12/23] 🔥 Announcing Valley2 (Valley-Eagle-7B)!

Architecture

For the LLM, we select Qwen3-8B-Base, chosen for its strong reasoning and language comprehension abilities. The vision encoder leverages Qwen2-VL-ViT, capable of processing dynamic-resolution inputs, a more robust alternative to the commonly used tiling approach when dealing with images of extreme aspect ratios. The projector applies a 2×2 pixel-shuffle downsampling to the visual tokens, followed by a two-layer MLP with a 64k hidden dimension, providing high alignment capacity between modalities (a minimal sketch of this projector appears at the end of this entry). This architectural design ensures that Valley2.5 achieves a balanced trade-off between representational power, computational efficiency, and multimodal adaptability.

License Agreement

All of our open-source models are licensed under the Apache-2.0 license.

Related Projects
- Valley: Video Assistant with Large Language model Enhanced abilitY
- LLaVA: Large Language and Vision Assistant
- Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
- LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
- Qwen3

We are Hiring

The Data-Ecommerce-Platform Governance-Basic Algorithms Team focuses on the research and development of multimodal large-model algorithms and foundational algorithms, continuously delving deep into this field. Our mission is to optimize algorithms and collaborate with business teams to comprehensively govern the quality and ecosystem of ByteDance's e-commerce products. The team currently has a strong demand for foundational algorithm expertise in NLP, CV, and multimodal technologies. We welcome inquiries and look forward to working on challenging projects with talented individuals like you! Contact & resume submission: [email protected]

> The TikTok e-commerce Basic Algorithms Team focuses on the R&D of multimodal large-model algorithms and foundational algorithms, and keeps digging deep in this direction. We look forward to working with outstanding candidates (intern/full-time) on challenging projects!
> Locations: Beijing / Shanghai / Singapore
> Inquiries & resume submission: [email protected]
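To make the projector description in the Architecture section concrete, here is a minimal PyTorch sketch of a 2×2 pixel-shuffle downsample followed by a two-layer MLP. The class name and all dimensions are illustrative; this is not Valley's actual implementation.

```python
# Minimal sketch of the described projector: a 2x2 pixel shuffle merges each
# 2x2 block of visual tokens into one token (4x fewer tokens, 4x wider
# features), then a two-layer MLP maps into the LLM embedding space.
import torch
import torch.nn as nn

class PixelShuffleProjector(nn.Module):  # hypothetical name
    def __init__(self, vit_dim=1280, hidden_dim=65536, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * vit_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, H, W, vit_dim) grid of visual tokens; H and W even
        b, h, w, c = x.shape
        x = x.reshape(b, h // 2, 2, w // 2, 2, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // 2) * (w // 2), 4 * c)
        return self.mlp(x)  # (batch, H*W/4, llm_dim)

# 64k hidden dim as described above; smaller here to keep the demo light
proj = PixelShuffleProjector(hidden_dim=4096)
tokens = torch.randn(1, 32, 32, 1280)
print(proj(tokens).shape)  # torch.Size([1, 256, 4096])
```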

license:apache-2.0
273
4

USO

USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning

Paper: USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning

Abstract

Existing literature typically treats style-driven and subject-driven generation as two disjoint tasks: the former prioritizes stylistic similarity, whereas the latter insists on subject consistency, resulting in an apparent antagonism. We argue that both objectives can be unified under a single framework, because they ultimately concern the disentanglement and re-composition of content and style, a long-standing theme in style-driven research. To this end, we present USO, a Unified Style-Subject Optimized customization model. First, we construct a large-scale triplet dataset consisting of content images, style images, and their corresponding stylized content images. Second, we introduce a disentangled learning scheme that simultaneously aligns style features and disentangles content from style through two complementary objectives: style-alignment training and content-style disentanglement training. Third, we incorporate a style reward-learning paradigm, denoted SRL, to further enhance the model's performance. Finally, we release USO-Bench, the first benchmark that jointly evaluates style similarity and subject fidelity across multiple metrics. Extensive experiments demonstrate that USO achieves state-of-the-art performance among open-source models along both dimensions of subject consistency and style similarity. Code and model: this https URL

✍️ Inference

Directly run the inference scripts; the checkpoints will be downloaded automatically by the `hf_hub_download` function in the code (a sketch of that call appears below). Start from the examples in the repo to explore and spark your creativity. ✨

We open-source this project for academic research. The vast majority of images used in this project are either generated or from open-source datasets. If you have any concerns, please contact us, and we will promptly remove any inappropriate content. Our project is released under the Apache 2.0 License. If you apply it to other base models, please ensure that you comply with the original licensing terms. This research aims to advance the field of generative AI. Users are free to create images with this tool, provided they comply with local laws and exercise responsible usage. The developers are not liable for any misuse of the tool by users.

Citation

We would also appreciate a star ⭐ for our GitHub repository. Thanks a lot! If you find this project useful for your research, please consider citing our paper.
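For reference, this is roughly what the automatic checkpoint download mentioned above looks like with `huggingface_hub`; the filename below is illustrative, and the inference scripts handle this step for you with the real filenames.

```python
# Hedged sketch of the automatic checkpoint download via hf_hub_download.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="bytedance-research/USO",
    filename="uso_flux_v1.0/dit_lora.safetensors",  # illustrative filename
)
print("checkpoint cached at:", ckpt_path)
```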

license:apache-2.0
228
174

Valley2-DPO

license:apache-2.0
205
3

Valley-Eagle-7B

license:apache-2.0
95
41

Timer-S1

license:apache-2.0
67
12

pasa-7b-selector

license:cc-by-nc-sa-4.0
51
6

MammothModa2 Preview

MammothModa2: Jointly Optimized Autoregressive-Diffusion Models for Unified Multimodal Understanding and Generation

[Code](https://github.com/bytedance/mammothmoda) · [Project page](https://ali-vilab.github.io/MammothModa-Page/) · [Model](https://huggingface.co/bytedance-research/MammothModa2-Preview)

MammothModa2 is a unified Autoregressive-Diffusion (AR-Diffusion) framework designed for comprehensive multimodal understanding and generation. The model adopts a novel serial architecture: the AR backbone utilizes Mamm...

35
14

pasa-7b-crawler

license:cc-by-nc-sa-4.0
20
11

DynamicCoT

Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models

Multi-modal keyphrase prediction (MMKP) aims to advance beyond text-only methods by incorporating multiple modalities of input information to produce a set of conclusive phrases. Traditional multi-modal approaches have been shown to have significant limitations in handling the challenging absence and unseen scenarios. Additionally, we identify shortcomings in existing benchmarks that overestimate model capability due to significant overlap between training and test data. In this work, we propose leveraging vision-language models (VLMs) for the MMKP task. First, we use two widely used strategies, i.e., zero-shot and supervised fine-tuning (SFT), to assess the lower-bound performance of VLMs. Next, to improve the complex reasoning capabilities of VLMs, we adopt Fine-tune-CoT, which leverages high-quality CoT reasoning data generated by a teacher model to fine-tune smaller models. Finally, to address the "overthinking" phenomenon, we propose a dynamic CoT strategy that adaptively injects CoT data during training, allowing the model to flexibly leverage its reasoning capabilities during the inference stage. We evaluate the proposed strategies on various datasets, and the experimental results demonstrate the effectiveness of the proposed approaches. Our open-source 3B model (based on Qwen2.5-VL-3B-Instruct) achieves a state-of-the-art result (61.90 on MMKP).

🧾 License

DynamicCoT models are derived from Qwen2.5-VL-3B-Instruct, which is subject to the Qwen RESEARCH LICENSE AGREEMENT. We retain ownership of all intellectual property rights in and to any derivative works and modifications that we made.

This project would not be possible without multiple great open-source codebases. We list some notable examples below:
- transformers
- LLaMA-Factory
- Qwen2.5-VL
- InternVL

If this work is helpful for your research, please consider citing the following BibTeX entry.
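To make the dynamic-injection idea above concrete, here is a toy sketch of adaptively giving each SFT example either a CoT target or a direct-answer target. The gating rule, tags, and field names are invented for illustration and are not the paper's actual recipe.

```python
# Toy sketch of "dynamic CoT": mix CoT and direct-answer supervision so the
# model learns when reasoning is worth the extra tokens (countering the
# "overthinking" phenomenon). Data format and gating rule are assumptions.
import random

def build_target(example: dict, cot_ratio: float = 0.5) -> str:
    """Return the supervision string for one MMKP training example."""
    keyphrases = ", ".join(example["keyphrases"])
    use_cot = example.get("is_hard", False) or random.random() < cot_ratio
    if use_cot and example.get("cot"):  # teacher-generated reasoning trace
        return f"<think>{example['cot']}</think> Keyphrases: {keyphrases}"
    return f"Keyphrases: {keyphrases}"   # direct answer, no reasoning trace

sample = {
    "keyphrases": ["street style", "denim jacket"],
    "cot": "The image shows an outfit photo; the caption mentions fashion...",
    "is_hard": False,
}
print(build_target(sample))
```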

14
5

EchoVideo

7
7

OneReward

Official checkpoint of OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning

[Paper](https://arxiv.org/abs/2508.21066) · [Code](https://github.com/bytedance/OneReward) · [Project page](https://one-reward.github.io/)

Introduction

We propose OneReward, a novel RLHF methodology for the visual domain that employs Qwen2.5-VL as a generative reward model to enhance multi-task reinforcement learning, significantly improving the policy model's generation ability across multiple subtasks. Building on OneReward, we develop Seedream 3.0 Fill, a unified SOTA image editing model capable of effectively handling diverse tasks including image fill, image extend, object removal, and text rendering. It surpasses several leading commercial and open-source systems, including Ideogram, Adobe Photoshop, and FLUX Fill [Pro]. Finally, based on FLUX Fill [dev], we are thrilled to release FLUX.1-Fill-dev-OneReward, which outperforms the closed-source FLUX Fill [Pro] in inpainting and outpainting tasks, serving as a powerful new baseline for future research in unified image editing.

Usage

Make sure your transformers>=4.51.3 (which supports Qwen2.5-VL). The model generates images from a text prompt and an input mask, supporting inpainting (image fill), outpainting (image extend), and erasing (object removal). As the model is fully trained, the FluxFillCFGPipeline with CFG is needed; you can find it in our GitHub repo (the sketch at the end of this entry shows the general call shape).

Models
- FLUX.1-Fill-dev[OneReward]: trained with Alg. 1 in the paper.
- FLUX.1-Fill-dev[OneRewardDynamic]: trained with Alg. 2 in the paper.

Object Removal with LoRA

Because the base FLUX Fill model has undergone heavy SFT for object generation, the improvement on removal is not obvious, so we release a separate LoRA for object removal that might be helpful for you.

License Agreement

Code is licensed under Apache 2.0. The model is licensed under CC BY-NC 4.0.
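Since the `FluxFillCFGPipeline` lives in the OneReward GitHub repo, here is a hedged sketch using diffusers' standard `FluxFillPipeline` to illustrate the prompt + image + mask interface only; for the OneReward checkpoints you would swap in their transformer and CFG pipeline per the repo README. Filenames and the prompt are illustrative.

```python
# General call shape for mask-guided generation with a FLUX Fill model.
import torch
from diffusers import FluxFillPipeline
from diffusers.utils import load_image

pipe = FluxFillPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Fill-dev",  # base model; the OneReward repo
    torch_dtype=torch.bfloat16,           # README shows how to swap weights
).to("cuda")

image = load_image("room.png")       # illustrative input image
mask = load_image("room_mask.png")   # white = region to fill/extend/erase

result = pipe(
    prompt="a modern bookshelf against the wall",
    image=image,
    mask_image=mask,
    guidance_scale=30.0,
    num_inference_steps=50,
).images[0]
result.save("onereward_fill.png")
```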

license:cc-by-nc-4.0
2
56

UNO

license:apache-2.0
0
182

Phantom

Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment

> Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Gen Li, Siyu Zhou, Qian He, Xinglong Wu
> Equal contribution, † Project lead
> Intelligent Creation Team, ByteDance

🔥 Latest News!
- May 27, 2025: 🎉 We have released the Phantom-Wan-14B model, a more powerful subject-to-video generation model.
- Apr 23, 2025: 😊 Thanks to ComfyUI-WanVideoWrapper for adapting ComfyUI to Phantom-Wan-1.3B. Everyone is welcome to use it!
- Apr 21, 2025: 👋 Phantom-Wan is coming! We adapted the Phantom framework to the Wan2.1 video generation model. The inference code and checkpoint have been released.
- Apr 10, 2025: We have updated the full version of the Phantom paper, which now includes more detailed descriptions of the model architecture and dataset pipeline.
- Feb 16, 2025: We proposed a novel subject-consistent video generation model, Phantom, and released the report publicly.

For more video demos, please visit the project page.

📑 Todo List
- [x] Inference code and checkpoint of Phantom-Wan-1.3B
- [x] Checkpoint of Phantom-Wan-14B
- [ ] Checkpoint of Phantom-Wan-14B Pro
- [ ] Open-source Phantom-Data
- [ ] Training code of Phantom-Wan

📖 Overview

Phantom is a unified video generation framework for single- and multi-subject references, built on existing text-to-video and image-to-video architectures. It achieves cross-modal alignment using text-image-video triplet data by redesigning the joint text-image injection model. Additionally, it emphasizes subject consistency in human generation while enhancing ID-preserving video generation.

Model Download

| Models | Download Link | Notes |
|---|---|---|
| Phantom-Wan-1.3B | 🤗 Huggingface | Supports both 480P and 720P |
| Phantom-Wan-14B | 🤗 Huggingface | Supports both 480P and 720P |

First, download the 1.3B original model of Wan2.1, since Phantom-Wan relies on the Wan2.1 VAE and text encoder. Download Wan2.1-1.3B using huggingface-cli, then download the Phantom-Wan-1.3B and Phantom-Wan-14B models (a scripted download sketch follows below). Alternatively, you can manually download the required models and place them in the `Phantom-Wan-Models` folder.

> 💡 Note:
> Changing `--ref_image` switches between single-reference and multi-reference subject-to-video generation. The number of reference images should be at most 4.
> To achieve the best generation results, we recommend describing the visual content of the reference image as accurately as possible in `--prompt`. For example, "examples/ref1.png" can be described as "a toy camera in yellow and red with blue buttons".
> When the generated video is unsatisfactory, the most straightforward fix is to change `--base_seed` and modify the description in `--prompt`.

For inference examples, please refer to "infer.sh".

> 💡 Note:
> The currently released Phantom-Wan-14B model was trained on 480P data but can also generate videos at 720P and higher resolutions, though the results may be less stable. We plan to release a version further trained on 720P data in the future.
> The Phantom-Wan-14B model was trained on 24fps data, but it can also generate 16fps videos, similar to the native Wan2.1, though quality may decline slightly.
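As a convenience, the downloads described above can be scripted with `huggingface_hub` instead of huggingface-cli. This is a hedged sketch: `Wan-AI/Wan2.1-T2V-1.3B` is the public Wan2.1 1.3B repo, while the Phantom repo id is illustrative.

```python
# Hedged sketch of the model downloads the README describes.
from huggingface_hub import snapshot_download

# Wan2.1 1.3B base (Phantom-Wan reuses its VAE and text encoder)
snapshot_download("Wan-AI/Wan2.1-T2V-1.3B", local_dir="./Wan2.1-T2V-1.3B")

# Phantom-Wan checkpoints into the folder the README expects
snapshot_download(
    "bytedance-research/Phantom",   # illustrative repo id
    local_dir="./Phantom-Wan-Models",
)
```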
Acknowledgements

We would like to express our gratitude to the SEED team for their support. Special thanks to Lu Jiang, Haoyuan Guo, Zhibei Ma, and Sen Wang for their assistance with the model and data. We are also very grateful to Siying Chen, Qingyang Li, and Wei Han for their help with the evaluation.

If you find this project useful for your research, please consider citing our paper.

📧 Contact

If you have any comments or questions regarding this open-source project, please open a new issue or contact Tianxiang Ma.

license:apache-2.0
0
174

HyperLoRA

license:cc-by-nc-4.0
0
40

Dreamfit

[AAAI 2025] DreamFit: Garment-Centric Human Generation via a Lightweight Anything-Dressing Encoder

Ente Lin†, Xujie Zhang†, Fuwei Zhao, Yuxuan Luo, Xin Dong, Long Zeng, Xiaodan Liang

Diffusion models for garment-centric human generation from text or image prompts have garnered emerging attention for their great application potential. However, existing methods often face a dilemma: lightweight approaches, such as adapters, are prone to generating inconsistent textures, while finetune-based methods involve high training costs and struggle to maintain the generalization capabilities of pretrained diffusion models, limiting their performance across diverse scenarios. To address these challenges, we propose DreamFit, which incorporates a lightweight Anything-Dressing Encoder specifically tailored for garment-centric human generation.

- Lightweight training: with the proposed adaptive attention and LoRA modules, DreamFit reduces model complexity to 83.4M trainable parameters.
- Anything-dressing: our model generalizes surprisingly well to a wide range of (non-)garments, creative styles, and prompt instructions, consistently delivering high-quality results across diverse scenarios.
- Plug-and-play: DreamFit is engineered for smooth integration with any community control plugins for diffusion models, ensuring easy compatibility and minimizing adoption barriers.

To further enhance generation quality, DreamFit leverages pretrained large multi-modal models (LMMs) to enrich the prompt with fine-grained garment descriptions, thereby reducing the prompt gap between training and inference. We conduct comprehensive experiments on both 768×512 high-resolution benchmarks and in-the-wild images. DreamFit surpasses all existing methods, highlighting its state-of-the-art capabilities in garment-centric human generation.

Our method constructs an Anything-Dressing Encoder utilizing LoRA layers. The reference image features are extracted by the Anything-Dressing Encoder and then passed into the denoising UNet via adaptive attention. Furthermore, we incorporate LMMs into the inference process to reduce the text prompt gap between training and testing. Install the dependencies with the command provided in the repository.

Models

1. Download the pretrained models here and place the checkpoints in the `pretrained_models` folder.
2. For inference with the Stable Diffusion 1.5 version, download stable-diffusion-v1-5 and sd-vae-ft-mse to `pretrained_models`. To generate images in different styles, download the corresponding stylized model, such as RealisticVision, to `pretrained_models`.
3. For inference with the Flux version, download flux-dev to the `pretrained_models` folder.
4. For inference with pose control, download the Annotators to the `pretrained_models` folder.

(See the download sketch at the end of this entry.)

Tips:
1. If you have multiple pieces of clothing, you can splice them onto one picture, as shown in the second row.
2. Use `--help` to check the meaning of each argument.

Example prompts:
- A woman wearing a white Bape T-shirt with a colorful ape graphic and bold text.
- A young woman with a casual yet stylish look, wearing a blue top, black skirt, and comfortable cream slip-on shoes.
- A woman wearing a white Bape T-shirt with a colorful ape graphic and bold text and blue jeans.

Keep-image tips:
1. The keep image is obtained by drawing the openpose skeleton on the garment-agnostic region.
2. The generation code for the keep image cannot be open-sourced for the time being. As an alternative, we have provided several keep images for testing.

Disclaimer

Most images used in this repository are sourced from the Internet. These images are solely intended to demonstrate the capabilities of our research. If you have any concerns, please contact us, and we will promptly remove any inappropriate content. This project aims to make a positive impact on the field of AI-driven image generation. Users are free to create images using this tool, but they must comply with local laws and use it responsibly. The developers do not assume any responsibility for potential misuse by users.

Acknowledgements

Thanks to the x-flux and Moore-AnimateAnyone repositories for their open research and exploration.

Contact

If you have any comments or questions, please open a new issue or feel free to contact Ente Lin and Xin Dong.
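For reference, here is a hedged sketch of assembling the `pretrained_models` folder described in the Models section above. The repo ids are believed to be the public Hugging Face repos for the named components, but verify them against the DreamFit README; the DreamFit checkpoint repo id is illustrative.

```python
# Hedged sketch of the checkpoint downloads for the Models section.
from huggingface_hub import snapshot_download

base = "./pretrained_models"
snapshot_download("stable-diffusion-v1-5/stable-diffusion-v1-5",
                  local_dir=f"{base}/stable-diffusion-v1-5")  # SD1.5 variant
snapshot_download("stabilityai/sd-vae-ft-mse",
                  local_dir=f"{base}/sd-vae-ft-mse")
snapshot_download("lllyasviel/Annotators",
                  local_dir=f"{base}/Annotators")              # pose control
snapshot_download("bytedance-research/Dreamfit",               # illustrative
                  local_dir=f"{base}/DreamFit")
```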

0
11

RealCustom

Existing text-to-image customization methods (i.e., subject-driven generation) face a fundamental challenge due to the entangled influence of visual and textual conditions. This inherent conflict forces a trade-off between subject fidelity and textual controllability, preventing simultaneous optimization of both objectives. We present RealCustom to disentangle subject similarity from text controllability, thereby allowing both to be optimized simultaneously without conflict. The core idea of RealCustom is to represent given subjects as real words that can be seamlessly integrated with given texts, and to further leverage the relevance between real words and image regions to disentangle the visual condition from the text condition.

Citation

If you find this project useful for your research, please consider citing our papers.

license:cc-by-nc-nd-4.0
0
10

MammothModa

license:apache-2.0
0
3

MammothModa2-Dev

0
1