Kwai-Kolors
Kolors-diffusers
# Kolors: Effective Training of Diffusion Model for Photorealistic Text-to-Image Synthesis

## Introduction

Kolors is a large-scale text-to-image generation model based on latent diffusion, developed by the Kuaishou Kolors team. Trained on billions of text-image pairs, Kolors shows significant advantages over both open-source and proprietary models in visual quality, complex semantic accuracy, and text rendering for both Chinese and English characters. Kolors also supports both Chinese and English inputs, and performs strongly when understanding and generating Chinese-specific content. For more details, please refer to this technical report.

## Quick Start

### Using with Diffusers

Make sure you have upgraded to the latest version of diffusers (`diffusers==0.30.0.dev0`).

Notes:
- The pipeline uses the `EulerDiscreteScheduler` by default. We recommend using this scheduler with `guidance_scale=5.0` and `num_inference_steps=50`.
- The pipeline also supports the `EDMDPMSolverMultistepScheduler`. `guidance_scale=5.0` and `num_inference_steps=25` is a good default for this scheduler.
- In addition to Text-to-Image, `KolorsImg2ImgPipeline` also supports Image-to-Image.

## License & Citation

### License

Kolors is fully open-sourced for academic research. For commercial use, please fill out this questionnaire and send it to [email protected] for registration.

We open-source Kolors to promote the development of large text-to-image models together with the open-source community. The code of this project is open-sourced under the Apache-2.0 license. We sincerely urge all developers and users to strictly adhere to the open-source license, and to avoid using the open-source model, code, and its derivatives for any purposes that may harm the country and society, or for any services that have not been evaluated and registered for safety.
Note that despite our best efforts to ensure the compliance, accuracy, and safety of the data during training, the diversity and combinability of generated content and the probabilistic randomness of the model mean that we cannot guarantee the accuracy and safety of the output content, and the model is susceptible to being misled. This project does not assume any legal responsibility for any data security issues, public opinion risks, or risks and liabilities arising from the model being misled, abused, misused, or improperly utilized through the use of the open-source model and code.

### Citation

If you find our work helpful, please cite it!

### Acknowledgments

- Thanks to Diffusers for providing the codebase.
- Thanks to ChatGLM3 for providing the powerful Chinese language model.

If you want to leave a message for our R&D team and product team, feel free to join our WeChat group. You can also contact us via email ([email protected]).
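A minimal text-to-image sketch following the scheduler notes in the Quick Start above. This assumes a CUDA GPU and that the `Kwai-Kolors/Kolors-diffusers` weights can be downloaded; the prompt, seed, and output filename are illustrative, not from the official examples:

```python
import torch
from diffusers import KolorsPipeline

# Load the Kolors pipeline; EulerDiscreteScheduler is used by default.
pipe = KolorsPipeline.from_pretrained(
    "Kwai-Kolors/Kolors-diffusers", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# Recommended settings for the default scheduler:
# guidance_scale=5.0 with num_inference_steps=50.
image = pipe(
    prompt="A ladybug on a leaf, macro photo, high quality",
    guidance_scale=5.0,
    num_inference_steps=50,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("kolors_sample.png")

# To try the alternative scheduler (num_inference_steps=25 is a good default):
# from diffusers import EDMDPMSolverMultistepScheduler
# pipe.scheduler = EDMDPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
```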
Kolors-IP-Adapter-Plus
Kolors-IP-Adapter-FaceID-Plus
# Kolors-IP-Adapter-FaceID-Plus weights and inference code

We provide Kolors-IP-Adapter-FaceID-Plus module weights and inference code based on Kolors-Basemodel. Examples of Kolors-IP-Adapter-FaceID-Plus results are shown below.

- Our Kolors-IP-Adapter-FaceID-Plus module is trained on a large-scale, high-quality face dataset. We use the face ID embeddings generated by insightface together with the CLIP features of the face area to preserve face identity and structure information.

## Evaluation

For evaluation, we constructed a test set of over 200 reference images and text prompts. We invited several image experts to provide fair ratings for the generated results of the different models. The experts assessed the generated images on five criteria: visual appeal, text faithfulness, face similarity, facial aesthetics, and overall satisfaction. Visual appeal and text faithfulness measure text-to-image generation capability, following the evaluation standards of the BaseModel, while face similarity and facial aesthetics evaluate the performance of the proposed Kolors-IP-Adapter-FaceID-Plus. The results are summarized in the table below, where Kolors-IP-Adapter-FaceID-Plus outperforms SDXL-IP-Adapter-FaceID-Plus on all metrics.

| Model | Average Text Faithfulness | Average Visual Appeal | Average Face Similarity | Average Facial Aesthetics | Average Overall Satisfaction |
| :--------------: | :--------: | :--------: | :--------: | :--------: | :--------: |
| SDXL-IP-Adapter-FaceID-Plus | 4.014 | 3.455 | 3.05 | 2.584 | 2.448 |
| Kolors-IP-Adapter-FaceID-Plus | 4.235 | 4.374 | 4.415 | 3.887 | 3.561 |

------

Kolors-IP-Adapter-FaceID-Plus uses Chinese prompts, while SDXL-IP-Adapter-FaceID-Plus uses English prompts.

The dependencies and installation are basically the same as for the Kolors-BaseModel.

## Acknowledgments

- Thanks to insightface for the face representations.
- Thanks to IP-Adapter for the codebase.
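An illustrative sketch (not the official inference code) of how the insightface face ID embedding and the face-area crop described above can be obtained. The `buffalo_l` model pack, CPU provider, and local `reference_face.jpg` are assumptions about the setup:

```python
import cv2
from insightface.app import FaceAnalysis

# Detect the face, extract the 512-d identity embedding used to condition
# the IP-Adapter, and crop the face region for CLIP image features.
app = FaceAnalysis(name="buffalo_l", providers=["CPUExecutionProvider"])
app.prepare(ctx_id=0, det_size=(640, 640))

img = cv2.imread("reference_face.jpg")
faces = app.get(img)
face = faces[0]

face_id_embed = face.normed_embedding        # (512,) identity vector
x1, y1, x2, y2 = face.bbox.astype(int)       # face bounding box
face_crop = img[y1:y2, x1:x2]                # region passed to the CLIP image encoder
```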
Kolors
# Kolors: Effective Training of Diffusion Model for Photorealistic Text-to-Image Synthesis

## Introduction

Kolors is a large-scale text-to-image generation model based on latent diffusion, developed by the Kuaishou Kolors team. Trained on billions of text-image pairs, Kolors shows significant advantages over both open-source and proprietary models in visual quality, complex semantic accuracy, and text rendering for both Chinese and English characters. Kolors also supports both Chinese and English inputs, and performs strongly when understanding and generating Chinese-specific content. For more details, please refer to this technical report.

## Requirements

- Python 3.8 or later
- PyTorch 1.13.1 or later
- Transformers 4.26.1 or later
- Recommended: CUDA 11.7 or later

## Using with Diffusers

Please refer to https://huggingface.co/Kwai-Kolors/Kolors-diffusers.

## License & Citation

### License

Kolors is fully open-sourced for academic research. For commercial use, please fill out this questionnaire and send it to [email protected] for registration.

We open-source Kolors to promote the development of large text-to-image models together with the open-source community. The code of this project is open-sourced under the Apache-2.0 license. We sincerely urge all developers and users to strictly adhere to the open-source license, and to avoid using the open-source model, code, and its derivatives for any purposes that may harm the country and society, or for any services that have not been evaluated and registered for safety.

Note that despite our best efforts to ensure the compliance, accuracy, and safety of the data during training, the diversity and combinability of generated content and the probabilistic randomness of the model mean that we cannot guarantee the accuracy and safety of the output content, and the model is susceptible to being misled.
This project does not assume any legal responsibility for any data security issues, public opinion risks, or risks and liabilities arising from the model being misled, abused, misused, or improperly utilized through the use of the open-source model and code.

### Citation

If you find our work helpful, please cite it!

### Acknowledgments

- Thanks to Diffusers for providing the codebase.
- Thanks to ChatGLM3 for providing the powerful Chinese language model.

If you want to leave a message for our R&D team and product team, feel free to join our WeChat group. You can also contact us via email ([email protected]).
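A hypothetical environment setup matching the requirements listed above. The version pins are illustrative minimums, not an official requirements file; if the repository ships its own requirements file, that takes precedence:

```shell
# Create an isolated environment and install the minimum versions listed above.
python -m venv kolors-env
source kolors-env/bin/activate
pip install "torch>=1.13.1" "transformers>=4.26.1" diffusers
```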
Kolors-ControlNet-Depth
Kolors-ControlNet-Canny
Kolors-ControlNet-Pose
Kolors-CoTyle
CoTyle
Kolors-Inpainting
We provide Kolors-Inpainting inference code and weights, initialized from Kolors-Basemodel. Examples of Kolors-Inpainting results are shown below.

- For inpainting, the UNet has 5 additional input channels (4 for the encoded masked image and 1 for the mask itself). The weights for the encoded masked-image channels were initialized from the non-inpainting checkpoint, while the weights for the mask channel were zero-initialized.
- To improve the robustness of the inpainting model, we adopt a more diverse strategy for generating masks, including random masks, subject segmentation masks, rectangular masks, and masks based on dilation operations.

## Evaluation

For evaluation, we created a test set of 200 masked images and text prompts. We invited several image experts to provide unbiased ratings for the generated results of the different models. The experts assessed the generated images on four criteria: visual appeal, text faithfulness, inpainting artifacts, and overall satisfaction. Inpainting artifacts measures perceptible boundaries in the inpainting results, while the other criteria follow the evaluation standards of the BaseModel. The results are summarized in the table below, where Kolors-Inpainting achieved the highest overall satisfaction score.

| Model | Average Overall Satisfaction | Average Inpainting Artifacts | Average Visual Appeal | Average Text Faithfulness |
| :-----------------: | :-----------: | :-----------: | :-----------: | :-----------: |
| SDXL-Inpainting | 2.573 | 1.205 | 3.000 | 4.299 |
| Kolors-Inpainting | 3.493 | 0.204 | 3.855 | 4.346 |

Higher scores are better for Average Overall Satisfaction, Average Visual Appeal, and Average Text Faithfulness; lower is better for Average Inpainting Artifacts. The comparison results of SDXL-Inpainting and Kolors-Inpainting are as follows:

Kolors-Inpainting uses Chinese prompts, while SDXL-Inpainting uses English prompts.
The dependencies and installation are basically the same as for the Kolors-BaseModel.
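Two of the mask-generation strategies described above can be sketched in plain NumPy. This is an illustrative reconstruction, not the official training code; the function names and the 3x3 structuring element are assumptions:

```python
import numpy as np

def random_rect_mask(h, w, rng=None):
    """Return an (h, w) float32 mask with a random rectangle set to 1."""
    if rng is None:
        rng = np.random.default_rng()
    y0, x0 = int(rng.integers(0, h // 2)), int(rng.integers(0, w // 2))
    y1, x1 = int(rng.integers(y0 + 1, h)), int(rng.integers(x0 + 1, w))
    mask = np.zeros((h, w), dtype=np.float32)
    mask[y0:y1, x0:x1] = 1.0
    return mask

def dilate_mask(mask, iterations=1):
    """Binary dilation with a 3x3 structuring element, via array shifts."""
    out = mask.astype(bool)
    h, w = mask.shape
    for _ in range(iterations):
        padded = np.pad(out, 1)  # zero-pad so shifts stay in bounds
        grown = np.zeros_like(out)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                grown |= padded[1 + dy : 1 + dy + h, 1 + dx : 1 + dx + w]
        out = grown
    return out.astype(np.float32)
```

During training, masks like these would be applied to the image before VAE encoding, yielding the 4 masked-image channels plus the 1 mask channel that the inpainting UNet consumes as extra input.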