Manojb



Qwen3-4B-toolcalling-gguf-codex

- ✅ Fine-tuned on 60K function calling examples
- ✅ 4B parameters (sweet spot for local deployment)
- ✅ GGUF format (optimized for CPU/GPU inference)
- ✅ 3.99GB download (fits on any modern system)
- ✅ Production-ready with 0.518 training loss

- Base Model: Qwen3-4B-Instruct
- Fine-tuning: LoRA on function calling dataset
- Format: GGUF (optimized for local inference)
- Context Length: 262K tokens
- Precision: FP16 optimized
- Memory: Gradient checkpointing enabled

- Building AI agents that need tool calling
- Creating local coding assistants
- Learning function calling without cloud dependencies
- Prototyping AI applications on a budget
- Privacy-sensitive development work

| Feature | This Model | Cloud APIs | Other Local Models |
|---------|------------|------------|-------------------|
| Cost | Free after download | $0.01-0.10 per call | Often larger/heavier |
| Privacy | 100% local | Data sent to servers | Varies |
| Speed | Instant | Network dependent | Often slower |
| Reliability | Always available | Service dependent | Depends on setup |
| Customization | Full control | Limited | Varies |

- GPU: 6GB+ VRAM (RTX 3060, RTX 4060, etc.)
- RAM: 8GB+ system RAM
- Storage: 5GB free space
- OS: Windows, macOS, Linux

- Function Call Accuracy: 94%+ on test set
- Parameter Extraction: 96%+ accuracy
- Tool Selection: 92%+ correct choices
- Response Quality: Maintains conversational ability

PERFECT for developers who want:

- Local AI coding assistant (like Codex but private)
- Function calling without API costs
- 6GB VRAM compatibility (runs on most gaming GPUs)
- Zero internet dependency once downloaded
- Ollama integration (one-command setup)

Apache 2.0 - Use freely for personal and commercial projects


ollama

2,977

Qwen3-4b-toolcall-gguf-llamacpp-codex

A specialized 4B parameter model fine-tuned for function calling and tool usage, optimized for local deployment with llama-cpp-python.

- 4B Parameters - Sweet spot for local deployment
- Function Calling - Fine-tuned on 60K function calling examples
- GGUF Format - Optimized for CPU/GPU inference
- 3.99GB Download - Fits on any modern system
- 262K Context - Large context window for complex tasks
- VRAM - Full context within 6GB!

- Base Model: Qwen3-4B-Instruct-2507
- Fine-tuning: LoRA on Salesforce xlam-function-calling-60k dataset
- Quantization: Q8_0 (8-bit) for optimal performance/size ratio
- Architecture: Qwen3 with specialized tool calling tokens
- License: Apache 2.0

- Python 3.8+
- 6GB+ RAM (8GB+ recommended)
- 5GB+ free disk space

Alternative: Install with specific llama-cpp-python build

For better performance, you can install llama-cpp-python with specific optimizations.

To use this model with Codex, you need to run a local server that Codex can connect to. In your Codex configuration, set:

- Server URL: `http://localhost:8000`
- API Key: (not required for local server)
- Model: `Qwen3-4B-Function-Calling-Pro`

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| RAM | 6GB | 8GB+ |
| Storage | 5GB | 10GB+ |
| CPU | 4 cores | 8+ cores |
| GPU | Optional | NVIDIA RTX 3060+ |

- Inference Speed: ~75-100 tokens/second (CPU)
- Memory Usage: ~4GB RAM
- Model Size: 3.99GB (Q8_0 quantized)
- Context Length: 262K tokens
- Function Call Accuracy: 94%+ on test set

- AI Agents - Building intelligent agents that can use tools
- Local Coding Assistants - Function calling without cloud dependencies
- API Integration - Seamless tool orchestration
- Privacy-Sensitive Development - 100% local processing
- Learning Function Calling - Educational purposes

The model includes specialized tokens for tool calling:

- `<tool_call>` - Start of tool call
- `</tool_call>` - End of tool call
- `<tool_response>` - Start of tool response
- `</tool_response>` - End of tool response

The model uses a custom chat template optimized for tool calling.

This project is licensed under the MIT License - see the LICENSE file for details.

- llama-cpp-python - Python bindings for llama.cpp
- Qwen3 - Base model
- xlam-function-calling-60k - Training dataset
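As a rough illustration of how an application might consume the tool-call markers this card describes, the sketch below extracts calls from raw model output and dispatches them to Python functions. It assumes Qwen-style `<tool_call>` and `</tool_call>` tags wrapping a JSON object with `name` and `arguments` keys; the `get_weather` tool and the helper names are hypothetical and are not part of the model card.

```python
import json
import re

# Assumption: each tool call is one JSON object between Qwen-style tags.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return the parsed tool-call dicts found in raw model output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of crashing the agent loop
        if isinstance(payload, dict) and "name" in payload:
            calls.append(payload)
    return calls


def get_weather(city):
    # Hypothetical tool, used only for this example.
    return f"Sunny in {city}"


TOOLS = {"get_weather": get_weather}


def run_tool_calls(text, tools):
    """Dispatch every extracted call to the matching Python function."""
    results = []
    for call in extract_tool_calls(text):
        func = tools.get(call["name"])
        if func is None:
            results.append({"name": call["name"], "error": "unknown tool"})
            continue
        args = call.get("arguments", {})
        results.append({"name": call["name"], "result": func(**args)})
    return results
```

In an agent loop, each result would typically be serialized and sent back to the model as a tool response before asking it to continue.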


llama-cpp

1,125

Qwen-7B-toolcalling-ReSearch-gguf-Q8_0-codex

This is a packaged Q8_0-only model from https://huggingface.co/mradermacher/ReSearch-Qwen-7B-GGUF that runs on 9-12GB VRAM without any quality loss. Weighted/imatrix quants are available at https://huggingface.co/mradermacher/ReSearch-Qwen-7B-i1-GGUF

ReSearch is a novel framework that trains LLMs to Reason with Search via reinforcement learning without using any supervised data on reasoning steps. Our approach treats search operations as integral components of the reasoning chain, where when and how to perform searches is guided by text-based thinking, and search results subsequently influence further reasoning.

If you are unsure how to use GGUF files, refer to one of TheBloke's READMEs for more details, including on how to concatenate multi-part files.

(Sorted by size, not necessarily quality. IQ-quants are often preferable over similar-sized non-IQ quants.)

| Link | Type | Size/GB | Notes |
|:-----|:-----|--------:|:------|
| GGUF | Q2_K | 3.1 | |
| GGUF | Q3_K_S | 3.6 | |
| GGUF | Q3_K_M | 3.9 | lower quality |
| GGUF | Q3_K_L | 4.2 | |
| GGUF | IQ4_XS | 4.4 | |
| GGUF | Q4_K_S | 4.6 | fast, recommended |
| GGUF | Q4_K_M | 4.8 | fast, recommended |
| GGUF | Q5_K_S | 5.4 | |
| GGUF | Q5_K_M | 5.5 | |
| GGUF | Q6_K | 6.4 | very good quality |
| GGUF | Q8_0 | 8.2 | fast, best quality |
| GGUF | f16 | 15.3 | 16 bpw, overkill |

The original card also includes a graph by ikawrakow comparing some lower-quality quant types (lower is better).
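The size table can be turned into a simple selection rule. The sketch below picks the largest quant whose file fits a memory budget; the sizes come from the table, written with the usual llama.cpp spellings of the quant names, while the 1.5 GB `overhead_gb` allowance for context and runtime buffers is an assumed placeholder, not a figure from the card.

```python
# File sizes in GB, copied from the quant table in this card.
QUANT_SIZES_GB = {
    "Q2_K": 3.1, "Q3_K_S": 3.6, "Q3_K_M": 3.9, "Q3_K_L": 4.2,
    "IQ4_XS": 4.4, "Q4_K_S": 4.6, "Q4_K_M": 4.8, "Q5_K_S": 5.4,
    "Q5_K_M": 5.5, "Q6_K": 6.4, "Q8_0": 8.2, "f16": 15.3,
}


def pick_quant(memory_gb, overhead_gb=1.5):
    """Return the largest quant that fits, or None if nothing fits.

    overhead_gb is an assumed allowance for the KV cache and runtime
    buffers; tune it for your context length and backend.
    """
    budget = memory_gb - overhead_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None
    # Largest file that fits; the table notes size is not a strict quality ranking.
    return max(fitting, key=fitting.get)
```

With the assumed overhead, a 12 GB budget selects `Q8_0`, which is consistent with the 9-12GB VRAM range the card quotes for that file.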


ollama

193

TRELLIS

The image-conditioned version of TRELLIS, a large 3D generative model. It was introduced in the paper Structured 3D Latents for Scalable and Versatile 3D Generation.

license:mit

control_v11f1p_sd15_depth

Controlnet v1.1 is the successor model of Controlnet v1.0 and was released in lllyasviel/ControlNet-v1-1 by Lvmin Zhang. This checkpoint is a conversion of the original checkpoint into `diffusers` format. It can be used in combination with Stable Diffusion, such as runwayml/stable-diffusion-v1-5. For more details, please also have a look at the 🧨 Diffusers docs. ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on depth images. Model Details - Developed by: Lvmin Zhang, Maneesh Agrawala - Model type: Diffusion-based text-to-image generation model - Language(s): English - License: The CreativeML OpenRAIL M license is an Open RAIL M license, adapted from the work that BigScience and the RAIL Initiative are jointly carrying in the area of responsible AI licensing. See also the article about the BLOOM Open RAIL license on which our license is based. - Resources for more information: GitHub Repository, Paper. - Cite as: @misc{zhang2023adding, title={Adding Conditional Control to Text-to-Image Diffusion Models}, author={Lvmin Zhang and Maneesh Agrawala}, year={2023}, eprint={2302.05543}, archivePrefix={arXiv}, primaryClass={cs.CV} } Controlnet was proposed in Adding Conditional Control to Text-to-Image Diffusion Models by Lvmin Zhang, Maneesh Agrawala. We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. 
The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (…

| Model | Trained with | Condition image |
|---|---|---|
| … | canny edge detection | A monochrome image with white edges on a black background. |
| lllyasviel/control_v11e_sd15_ip2p | pixel to pixel instruction | No condition. |
| lllyasviel/control_v11p_sd15_inpaint | image inpainting | No condition. |
| lllyasviel/control_v11p_sd15_mlsd | multi-level line segment detection | An image with annotated line segments. |
| lllyasviel/control_v11f1p_sd15_depth | depth estimation | An image with depth information, usually represented as a grayscale image. |
| lllyasviel/control_v11p_sd15_normalbae | surface normal estimation | An image with surface normal information, usually represented as a color-coded image. |
| lllyasviel/control_v11p_sd15_seg | image segmentation | An image with segmented regions, usually represented as a color-coded image. |
| lllyasviel/control_v11p_sd15_lineart | line art generation | An image with line art, usually black lines on a white background. |
| lllyasviel/control_v11p_sd15s2_lineart_anime | anime line art generation | An image with anime-style line art. |
| lllyasviel/control_v11p_sd15_openpose | human pose estimation | An image with human poses, usually represented as a set of keypoints or skeletons. |
| lllyasviel/control_v11p_sd15_scribble | scribble-based image generation | An image with scribbles, usually random or user-drawn strokes. |
| lllyasviel/control_v11p_sd15_softedge | soft edge image generation | An image with soft edges, usually to create a more painterly or artistic effect. |
| lllyasviel/control_v11e_sd15_shuffle | image shuffling | An image with shuffled patches or regions. |
| lllyasviel/control_v11f1e_sd15_tile | image tiling | A blurry image or part of an image. |

- The training dataset of the previous cnet 1.0 has several problems, including (1) a small group of greyscale human images are duplicated thousands of times (!!), making the previous model somewhat likely to generate grayscale human images; (2) some images have low quality, are very blurry, or show significant JPEG artifacts; (3) a small group of images has wrong paired prompts caused by a mistake in our data processing scripts. The new model fixed all problems of the training dataset and should be more reasonable in many cases.
- The new depth model is a relatively unbiased model. It is not trained with some specific type of depth by some specific depth estimation method. It is not over-fitted to one preprocessor. This means this model will work better with different depth estimation, different preprocessor resolutions, or even with real depth created by 3D engines.
- Some reasonable data augmentations are applied to training, like random left-right flipping.
- The model is resumed from depth 1.0, and it should work well in all cases where depth 1.0 works well. If not, please open an issue with an image, and we will take a look at your case. Depth 1.1 works well in many failure cases of depth 1.0.
- If you use Midas depth (the "depth" in the webui plugin) with 384 preprocessor resolution, the difference between depth 1.0 and 1.1 should be minimal. However, if you try other preprocessor resolutions or other preprocessors (like leres and zoe), depth 1.1 is expected to be a bit better than 1.0.

For more information, please also have a look at the Diffusers ControlNet Blog Post and the official docs.
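Since the card notes that the depth model also accepts real depth from 3D engines, a conditioning image can in principle be built from raw depth values. The sketch below is one minimal way to do that in plain Python; it assumes the input holds distances (larger means farther) and, as a further assumption, renders near surfaces white and far surfaces black. The function name and the nested-list input format are illustrative only.

```python
def depth_to_grayscale(depth, near_is_white=True):
    """Scale a 2D list of depth values to 0..255 integer grayscale.

    Assumption: larger input values are farther from the camera.
    """
    flat = [value for row in depth for value in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    image = []
    for row in depth:
        out_row = []
        for value in row:
            # t runs from 0.0 (nearest) to 1.0 (farthest); a flat map maps to 0.0.
            t = 0.0 if span == 0 else (value - lo) / span
            if near_is_white:
                t = 1.0 - t
            out_row.append(round(t * 255))
        image.append(out_row)
    return image
```

The resulting single-channel values would still need to be replicated to three channels and wrapped in an image object before being passed to a pipeline.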


stable-diffusion-v1-5

⚠️ This repository is a mirror of the now deprecated `runwayml/stable-diffusion-v1-5`. This repository and organization are not affiliated in any way with RunwayML. Modifications to the original model card are in red or green.

Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input. For more information about how Stable Diffusion functions, please have a look at 🤗's Stable Diffusion blog.

The Stable-Diffusion-v1-5 checkpoint was initialized with the weights of the Stable-Diffusion-v1-2 checkpoint and subsequently fine-tuned on 595k steps at resolution 512x512 on "laion-aesthetics v2 5+" and 10% dropping of the text-conditioning to improve classifier-free guidance sampling.

You can use this with the 🧨 Diffusers library, the RunwayML GitHub repository (now deprecated), ComfyUI, Automatic1111, SD.Next, and InvokeAI. For more detailed instructions, use-cases and examples in JAX, follow the instructions here.

Use with GitHub Repository (now deprecated), ComfyUI or Automatic1111

1. Download the weights
   - v1-5-pruned-emaonly.safetensors - ema-only weight. Uses less VRAM - suitable for inference
   - v1-5-pruned.safetensors - ema+non-ema weights. Uses more VRAM - suitable for fine-tuning
3. Use locally with ComfyUI, AUTOMATIC1111, SD.Next, InvokeAI

Model Details

- Developed by: Robin Rombach, Patrick Esser
- Model type: Diffusion-based text-to-image generation model
- Language(s): English
- License: The CreativeML OpenRAIL M license is an Open RAIL M license, adapted from the work that BigScience and the RAIL Initiative are jointly carrying in the area of responsible AI licensing. See also the article about the BLOOM Open RAIL license on which our license is based.
- Model Description: This is a model that can be used to generate and modify images based on text prompts. It is a Latent Diffusion Model that uses a fixed, pretrained text encoder (CLIP ViT-L/14) as suggested in the Imagen paper.
- Resources for more information: GitHub Repository, Paper. - Cite as: @InProceedings{Rombach2022CVPR, author = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn}, title = {High-Resolution Image Synthesis With Latent Diffusion Models}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2022}, pages = {10684-10695} } Direct Use The model is intended for research purposes only. Possible research areas and tasks include - Safe deployment of models which have the potential to generate harmful content. - Probing and understanding the limitations and biases of generative models. - Generation of artworks and use in design and other artistic processes. - Applications in educational or creative tools. - Research on generative models. ### Misuse, Malicious Use, and Out-of-Scope Use Note: This section is taken from the DALLE-MINI model card, but applies in the same way to Stable Diffusion v1. The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating images that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes. Out-of-Scope Use The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model. Misuse and Malicious Use Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to: - Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc. - Intentionally promoting or propagating discriminatory content or harmful stereotypes. - Impersonating individuals without their consent. 
- Sexual content without consent of the people who might see it. - Mis- and disinformation - Representations of egregious violence and gore - Sharing of copyrighted or licensed material in violation of its terms of use. - Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use. - The model does not achieve perfect photorealism - The model cannot render legible text - The model does not perform well on more difficult tasks which involve compositionality, such as rendering an image corresponding to “A red cube on top of a blue sphere” - Faces and people in general may not be generated properly. - The model was trained mainly with English captions and will not work as well in other languages. - The autoencoding part of the model is lossy - The model was trained on a large-scale dataset LAION-5B which contains adult material and is not fit for product use without additional safety mechanisms and considerations. - No additional measures were used to deduplicate the dataset. As a result, we observe some degree of memorization for images that are duplicated in the training data. The training data can be searched at https://rom1504.github.io/clip-retrieval/ to possibly assist in the detection of memorized images. While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases. Stable Diffusion v1 was trained on subsets of LAION-2B(en), which consists of images that are primarily limited to English descriptions. Texts and images from communities and cultures that use other languages are likely to be insufficiently accounted for. This affects the overall output of the model, as white and western cultures are often set as the default. Further, the ability of the model to generate content with non-English prompts is significantly worse than with English-language prompts. The intended use of this model is with the Safety Checker in Diffusers. 
This checker works by checking model outputs against known hard-coded NSFW concepts. The concepts are intentionally hidden to reduce the likelihood of reverse-engineering this filter. Specifically, the checker compares the class probability of harmful concepts in the embedding space of the `CLIPTextModel` after generation of the images. The concepts are passed into the model with the generated image and compared to a hand-engineered weight for each NSFW concept. Training Data The model developers used the following dataset for training the model: - LAION-2B (en) and subsets thereof (see next section) Training Procedure Stable Diffusion v1-5 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training, - Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4 - Text prompts are encoded through a ViT-L/14 text-encoder. - The non-pooled output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention. - The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet. Currently six Stable Diffusion checkpoints are provided, which were trained as follows. - `stable-diffusion-v1-1`: 237,000 steps at resolution `256x256` on laion2B-en. 194,000 steps at resolution `512x512` on laion-high-resolution (170M examples from LAION-5B with resolution `>= 1024x1024`). - `stable-diffusion-v1-2`: Resumed from `stable-diffusion-v1-1`. 515,000 steps at resolution `512x512` on "laion-improved-aesthetics" (a subset of laion2B-en, filtered to images with an original size `>= 512x512`, estimated aesthetics score `> 5.0`, and an estimated watermark probability `< 0.5`. 
The watermark estimate is from the LAION-5B metadata, the aesthetics score is estimated using an improved aesthetics estimator). - `stable-diffusion-v1-3`: Resumed from `stable-diffusion-v1-2` - 195,000 steps at resolution `512x512` on "laion-improved-aesthetics" and 10 % dropping of the text-conditioning to improve classifier-free guidance sampling. - `stable-diffusion-v1-4` Resumed from `stable-diffusion-v1-2` - 225,000 steps at resolution `512x512` on "laion-aesthetics v2 5+" and 10 % dropping of the text-conditioning to improve classifier-free guidance sampling. - `stable-diffusion-v1-5` Resumed from `stable-diffusion-v1-2` - 595,000 steps at resolution `512x512` on "laion-aesthetics v2 5+" and 10 % dropping of the text-conditioning to improve classifier-free guidance sampling. - `stable-diffusion-inpainting` Resumed from `stable-diffusion-v1-5` - then 440,000 steps of inpainting training at resolution 512x512 on “laion-aesthetics v2 5+” and 10% dropping of the text-conditioning. For inpainting, the UNet has 5 additional input channels (4 for the encoded masked-image and 1 for the mask itself) whose weights were zero-initialized after restoring the non-inpainting checkpoint. During training, we generate synthetic masks and in 25% mask everything. - Hardware: 32 x 8 x A100 GPUs - Optimizer: AdamW - Gradient Accumulations: 2 - Batch: 32 x 8 x 2 x 4 = 2048 - Learning rate: warmup to 0.0001 for 10,000 steps and then kept constant Evaluation Results Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0) and 50 PNDM/PLMS sampling steps show the relative improvements of the checkpoints: Evaluated using 50 PLMS steps and 10000 random prompts from the COCO2017 validation set, evaluated at 512x512 resolution. Not optimized for FID scores. 
Environmental Impact Stable Diffusion v1 Estimated Emissions Based on that information, we estimate the following CO2 emissions using the Machine Learning Impact calculator presented in Lacoste et al. (2019). The hardware, runtime, cloud provider, and compute region were utilized to estimate the carbon impact. - Hardware Type: A100 PCIe 40GB - Hours used: 150000 - Cloud Provider: AWS - Compute Region: US-east - Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid): 11250 kg CO2 eq. This model card was written by: Robin Rombach and Patrick Esser and is based on the DALL-E Mini model card.
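The numbers quoted in this card can be checked with a few lines of arithmetic. The sketch below recomputes the latent shape for a downsampling factor of 8, the effective batch size, and the emission factor implied by the reported totals. Reading `32 x 8 x 2 x 4` as nodes x GPUs per node x gradient accumulations x per-GPU batch is an interpretation, and the function names are illustrative.

```python
def latent_shape(height, width, factor=8, channels=4):
    """Map an H x W x 3 image to its H/f x W/f x 4 latent shape."""
    if height % factor or width % factor:
        raise ValueError("height and width must be divisible by the factor")
    return (height // factor, width // factor, channels)


def effective_batch(nodes, gpus_per_node, grad_accum, per_gpu_batch):
    """Total samples per optimizer step (assumed reading of the card's product)."""
    return nodes * gpus_per_node * grad_accum * per_gpu_batch


# Emission factor implied by the card: 11250 kg CO2 eq. over 150000 GPU-hours.
IMPLIED_KG_PER_GPU_HOUR = 11250 / 150000

print(latent_shape(512, 512))        # (64, 64, 4)
print(effective_batch(32, 8, 2, 4))  # 2048
```

A 512x512 training image therefore becomes a 64x64x4 latent, which is the tensor the UNet actually denoises.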


stable-diffusion-2-1-base

Stable Diffusion v2-1-base Model Card

This model card focuses on the model associated with the Stable Diffusion v2-1-base model. This `stable-diffusion-2-1-base` model fine-tunes stable-diffusion-2-base (`512-base-ema.ckpt`) with 220k extra steps taken, with `punsafe=0.98` on the same dataset.

- Use it with the `stablediffusion` repository: download the `v2-1_512-ema-pruned.ckpt` here.
- Use it with 🧨 `diffusers`

Model Details

- Developed by: Robin Rombach, Patrick Esser
- Model type: Diffusion-based text-to-image generation model
- Language(s): English
- License: CreativeML Open RAIL++-M License
- Model Description: This is a model that can be used to generate and modify images based on text prompts. It is a Latent Diffusion Model that uses a fixed, pretrained text encoder (OpenCLIP-ViT/H).
- Resources for more information: GitHub Repository.
- Cite as: @InProceedings{Rombach2022CVPR, author = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn}, title = {High-Resolution Image Synthesis With Latent Diffusion Models}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2022}, pages = {10684-10695} }

Use the 🤗 Diffusers library to run Stable Diffusion 2 in a simple and efficient manner. If you don't swap the scheduler, the pipeline runs with the default PNDM/PLMS scheduler; the card's example swaps it for EulerDiscreteScheduler.

Notes:

- Despite not being a dependency, we highly recommend installing xformers for memory-efficient attention (better performance)
- If you have low GPU RAM available, make sure to add a `pipe.enable_attention_slicing()` after sending it to `cuda` for less VRAM usage (at the cost of speed)

Direct Use

The model is intended for research purposes only. Possible research areas and tasks include

- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models. - Generation of artworks and use in design and other artistic processes. - Applications in educational or creative tools. - Research on generative models. ### Misuse, Malicious Use, and Out-of-Scope Use Note: This section is originally taken from the DALLE-MINI model card, was used for Stable Diffusion v1, but applies in the same way to Stable Diffusion v2. The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating images that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes. Out-of-Scope Use The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model. Misuse and Malicious Use Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to: - Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc. - Intentionally promoting or propagating discriminatory content or harmful stereotypes. - Impersonating individuals without their consent. - Sexual content without consent of the people who might see it. - Mis- and disinformation - Representations of egregious violence and gore - Sharing of copyrighted or licensed material in violation of its terms of use. - Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use. 
- The model does not achieve perfect photorealism
- The model cannot render legible text
- The model does not perform well on more difficult tasks which involve compositionality, such as rendering an image corresponding to “A red cube on top of a blue sphere”
- Faces and people in general may not be generated properly.
- The model was trained mainly with English captions and will not work as well in other languages.
- The autoencoding part of the model is lossy
- The model was trained on a subset of the large-scale dataset LAION-5B, which contains adult, violent and sexual content. To partially mitigate this, we have filtered the dataset using LAION's NSFW detector (see Training section).

Bias

While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases. Stable Diffusion v2 was primarily trained on subsets of LAION-2B(en), which consists of images that are limited to English descriptions. Texts and images from communities and cultures that use other languages are likely to be insufficiently accounted for. This affects the overall output of the model, as white and western cultures are often set as the default. Further, the ability of the model to generate content with non-English prompts is significantly worse than with English-language prompts. Stable Diffusion v2 mirrors and exacerbates biases to such a degree that viewer discretion must be advised irrespective of the input or its intent.

Training Data

The model developers used the following dataset for training the model:

- LAION-5B and subsets (details below). The training data is further filtered using LAION's NSFW detector, with a "punsafe" score of 0.1 (conservative). For more details, please refer to LAION-5B's NeurIPS 2022 paper and reviewer discussions on the topic.

Training Procedure

Stable Diffusion v2 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder.
During training,

- Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4
- Text prompts are encoded through the OpenCLIP-ViT/H text-encoder.
- The output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
- The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet. We also use the so-called v-objective, see https://arxiv.org/abs/2202.00512.

We currently provide the following checkpoints, for various versions:

Version 2.1

- `512-base-ema.ckpt`: Fine-tuned on `512-base-ema.ckpt` 2.0 with 220k extra steps taken, with `punsafe=0.98` on the same dataset.
- `768-v-ema.ckpt`: Resumed from `768-v-ema.ckpt` 2.0 with an additional 55k steps on the same dataset (`punsafe=0.1`), and then fine-tuned for another 155k extra steps with `punsafe=0.98`.

Version 2.0

- `512-base-ema.ckpt`: 550k steps at resolution `256x256` on a subset of LAION-5B filtered for explicit pornographic material, using the LAION-NSFW classifier with `punsafe=0.1` and an aesthetic score >= `4.5`. 850k steps at resolution `512x512` on the same dataset with resolution `>= 512x512`.
- `768-v-ema.ckpt`: Resumed from `512-base-ema.ckpt` and trained for 150k steps using a v-objective on the same dataset. Resumed for another 140k steps on a `768x768` subset of our dataset.
- `512-depth-ema.ckpt`: Resumed from `512-base-ema.ckpt` and finetuned for 200k steps. Added an extra input channel to process the (relative) depth prediction produced by MiDaS (`dpt_hybrid`) which is used as an additional conditioning. The additional input channels of the U-Net which process this extra information were zero-initialized.
- `512-inpainting-ema.ckpt`: Resumed from `512-base-ema.ckpt` and trained for another 200k steps. Follows the mask-generation strategy presented in LAMA which, in combination with the latent VAE representations of the masked image, is used as an additional conditioning. The additional input channels of the U-Net which process this extra information were zero-initialized. The same strategy was used to train the 1.5-inpainting checkpoint.
- `x4-upscaling-ema.ckpt`: Trained for 1.25M steps on a 10M subset of LAION containing images `>2048x2048`. The model was trained on crops of size `512x512` and is a text-guided latent upscaling diffusion model. In addition to the textual input, it receives a `noise_level` as an input parameter, which can be used to add noise to the low-resolution input according to a predefined diffusion schedule.

- Hardware: 32 x 8 x A100 GPUs
- Optimizer: AdamW
- Gradient Accumulations: 1
- Batch: 32 x 8 x 2 x 4 = 2048
- Learning rate: warmup to 0.0001 for 10,000 steps and then kept constant

Evaluation Results

Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0) and 50 DDIM sampling steps show the relative improvements of the checkpoints. Evaluated using 50 DDIM steps and 10000 random prompts from the COCO2017 validation set, evaluated at 512x512 resolution. Not optimized for FID scores.

Stable Diffusion v1 Estimated Emissions

Based on that information, we estimate the following CO2 emissions using the Machine Learning Impact calculator presented in Lacoste et al. (2019). The hardware, runtime, cloud provider, and compute region were utilized to estimate the carbon impact.

- Hardware Type: A100 PCIe 40GB
- Hours used: 200000
- Cloud Provider: AWS
- Compute Region: US-east
- Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid): 15000 kg CO2 eq.
Citation @InProceedings{Rombach2022CVPR, author = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn}, title = {High-Resolution Image Synthesis With Latent Diffusion Models}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2022}, pages = {10684-10695} } This model card was written by: Robin Rombach, Patrick Esser and David Ha and is based on the Stable Diffusion v1 and DALL-E Mini model card.
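The Diffusers usage this card describes (swapping the default scheduler for `EulerDiscreteScheduler` and optionally enabling attention slicing) might look roughly like the sketch below. The `model_id` default, the output path, and the `generate` wrapper are assumptions made for illustration; calling the function requires `torch`, `diffusers`, a CUDA GPU, and downloaded weights.

```python
def generate(prompt, model_id="stabilityai/stable-diffusion-2-1-base",
             out_path="output.png", low_vram=False):
    """Generate one image with the Euler scheduler and save it to out_path."""
    # Imported lazily so that defining this function needs no GPU stack.
    import torch
    from diffusers import EulerDiscreteScheduler, StableDiffusionPipeline

    # Load the Euler scheduler config from the model repo's scheduler subfolder.
    scheduler = EulerDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")
    pipe = StableDiffusionPipeline.from_pretrained(
        model_id, scheduler=scheduler, torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda")
    if low_vram:
        # Trades speed for lower VRAM use, as the card's notes suggest.
        pipe.enable_attention_slicing()

    image = pipe(prompt).images[0]
    image.save(out_path)
    return out_path
```

A call such as `generate("a photo of an astronaut riding a horse on mars")` would write `output.png` to the working directory.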


sd-controlnet-canny

ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on Canny edges. It can be used in combination with Stable Diffusion. Model Details - Developed by: Lvmin Zhang, Maneesh Agrawala - Model type: Diffusion-based text-to-image generation model - Language(s): English - License: The CreativeML OpenRAIL M license is an Open RAIL M license, adapted from the work that BigScience and the RAIL Initiative are jointly carrying in the area of responsible AI licensing. See also the article about the BLOOM Open RAIL license on which our license is based. - Resources for more information: GitHub Repository, Paper. - Cite as: @misc{zhang2023adding, title={Adding Conditional Control to Text-to-Image Diffusion Models}, author={Lvmin Zhang and Maneesh Agrawala}, year={2023}, eprint={2302.05543}, archivePrefix={arXiv}, primaryClass={cs.CV} } Controlnet was proposed in Adding Conditional Control to Text-to-Image Diffusion Models by Lvmin Zhang, Maneesh Agrawala. We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. 
The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k).

| Model | Description | Expected control image |
|-------|-------------|------------------------|
| lllyasviel/sd-controlnet-canny | Trained with canny edge detection | A monochrome image with white edges on a black background. |
| lllyasviel/sd-controlnet-depth | Trained with Midas depth estimation | A grayscale image with black representing deep areas and white representing shallow areas. |
| lllyasviel/sd-controlnet-hed | Trained with HED edge detection (soft edge) | A monochrome image with white soft edges on a black background. |
| lllyasviel/sd-controlnet-mlsd | Trained with M-LSD line detection | A monochrome image composed only of white straight lines on a black background. |
| lllyasviel/sd-controlnet-normal | Trained with normal map | A normal mapped image. |
| lllyasviel/sd-controlnet_openpose | Trained with OpenPose bone image | An OpenPose bone image. |
| lllyasviel/sd-controlnet_scribble | Trained with human scribbles | A hand-drawn monochrome image with white outlines on a black background. |
| lllyasviel/sd-controlnet_seg | Trained with semantic segmentation | An ADE20K's segmentation protocol image. |

It is recommended to use the checkpoint with Stable Diffusion v1-5 as the checkpoint has been trained on it. Experimentally, the checkpoint can be used with other diffusion models such as dreamboothed stable diffusion. Note: If you want to process an image to create the auxiliary conditioning, external dependencies are required.

The canny edge model was trained on 3M edge-image, caption pairs. The model was trained for 600 GPU-hours with Nvidia A100 80G using Stable Diffusion 1.5 as a base model. For more information, please also have a look at the official ControlNet Blog Post.
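
To make the expected conditioning format concrete (a monochrome image with white edges on a black background), here is a minimal pure-Python sketch. It is not the Canny algorithm: it swaps in a plain gradient threshold as a stand-in, and the names `edge_map` and `threshold` are hypothetical. A real pipeline would use a proper Canny detector from an image library.

```python
from typing import List

Image = List[List[int]]  # grayscale rows, pixel values in 0..255


def edge_map(gray: Image, threshold: int = 64) -> Image:
    """Return a black image with white (255) pixels where intensity changes sharply.

    Stand-in for Canny: uses forward differences and a single threshold,
    with no smoothing, non-maximum suppression, or hysteresis.
    """
    height, width = len(gray), len(gray[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Neighbours are clamped at the image border.
            right = gray[y][min(x + 1, width - 1)]
            down = gray[min(y + 1, height - 1)][x]
            gradient = abs(right - gray[y][x]) + abs(down - gray[y][x])
            if gradient >= threshold:
                out[y][x] = 255
    return out


if __name__ == "__main__":
    # A dark left half next to a bright right half yields one vertical edge.
    sample = [[0, 0, 255, 255] for _ in range(4)]
    for row in edge_map(sample):
        print(row)
```

The output contains only the values 0 and 255, which matches the control-image description in the table above.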


—

control_v11p_sd15_inpaint


—

control_v11e_sd15_ip2p

Controlnet v1.1 was released in lllyasviel/ControlNet-v1-1 by Lvmin Zhang. This checkpoint is a conversion of the original checkpoint into `diffusers` format. It can be used in combination with Stable Diffusion, such as runwayml/stable-diffusion-v1-5. For more details, please also have a look at the 🧨 Diffusers docs. ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on instruct pix2pix images. Model Details - Developed by: Lvmin Zhang, Maneesh Agrawala - Model type: Diffusion-based text-to-image generation model - Language(s): English - License: The CreativeML OpenRAIL M license is an Open RAIL M license, adapted from the work that BigScience and the RAIL Initiative are jointly carrying in the area of responsible AI licensing. See also the article about the BLOOM Open RAIL license on which our license is based. - Resources for more information: GitHub Repository, Paper. - Cite as: @misc{zhang2023adding, title={Adding Conditional Control to Text-to-Image Diffusion Models}, author={Lvmin Zhang and Maneesh Agrawala}, year={2023}, eprint={2302.05543}, archivePrefix={arXiv}, primaryClass={cs.CV} } Controlnet was proposed in Adding Conditional Control to Text-to-Image Diffusion Models by Lvmin Zhang, Maneesh Agrawala. We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. 
The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k).

| Model | Description | Expected control image |
|-------|-------------|------------------------|
| lllyasviel/control_v11p_sd15_canny | Trained with canny edge detection | A monochrome image with white edges on a black background. |
| lllyasviel/control_v11e_sd15_ip2p | Trained with pixel to pixel instruction | No condition. |
| lllyasviel/control_v11p_sd15_inpaint | Trained with image inpainting | No condition. |
| lllyasviel/control_v11p_sd15_mlsd | Trained with multi-level line segment detection | An image with annotated line segments. |
| lllyasviel/control_v11f1p_sd15_depth | Trained with depth estimation | An image with depth information, usually represented as a grayscale image. |
| lllyasviel/control_v11p_sd15_normalbae | Trained with surface normal estimation | An image with surface normal information, usually represented as a color-coded image. |
| lllyasviel/control_v11p_sd15_seg | Trained with image segmentation | An image with segmented regions, usually represented as a color-coded image. |
| lllyasviel/control_v11p_sd15_lineart | Trained with line art generation | An image with line art, usually black lines on a white background. |
| lllyasviel/control_v11p_sd15s2_lineart_anime | Trained with anime line art generation | An image with anime-style line art. |
| lllyasviel/control_v11p_sd15_openpose | Trained with human pose estimation | An image with human poses, usually represented as a set of keypoints or skeletons. |
| lllyasviel/control_v11p_sd15_scribble | Trained with scribble-based image generation | An image with scribbles, usually random or user-drawn strokes. |
| lllyasviel/control_v11p_sd15_softedge | Trained with soft edge image generation | An image with soft edges, usually to create a more painterly or artistic effect. |
| lllyasviel/control_v11e_sd15_shuffle | Trained with image shuffling | An image with shuffled patches or regions. |
| lllyasviel/control_v11f1e_sd15_tile | Trained with image tiling | A blurry image or part of an image. |

For more information, please also have a look at the Diffusers ControlNet Blog Post and have a look at the official docs.


—

control_v11p_sd15_openpose

Controlnet v1.1 is the successor model of Controlnet v1.0 and was released in lllyasviel/ControlNet-v1-1 by Lvmin Zhang. This checkpoint is a conversion of the original checkpoint into `diffusers` format. It can be used in combination with Stable Diffusion, such as runwayml/stable-diffusion-v1-5. For more details, please also have a look at the 🧨 Diffusers docs. ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on openpose images. Model Details - Developed by: Lvmin Zhang, Maneesh Agrawala - Model type: Diffusion-based text-to-image generation model - Language(s): English - License: The CreativeML OpenRAIL M license is an Open RAIL M license, adapted from the work that BigScience and the RAIL Initiative are jointly carrying in the area of responsible AI licensing. See also the article about the BLOOM Open RAIL license on which our license is based. - Resources for more information: GitHub Repository, Paper. - Cite as: @misc{zhang2023adding, title={Adding Conditional Control to Text-to-Image Diffusion Models}, author={Lvmin Zhang and Maneesh Agrawala}, year={2023}, eprint={2302.05543}, archivePrefix={arXiv}, primaryClass={cs.CV} } Controlnet was proposed in Adding Conditional Control to Text-to-Image Diffusion Models by Lvmin Zhang, Maneesh Agrawala. We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. 
The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k).

| Model | Description | Expected control image |
|-------|-------------|------------------------|
| lllyasviel/control_v11p_sd15_canny | Trained with canny edge detection | A monochrome image with white edges on a black background. |
| lllyasviel/control_v11e_sd15_ip2p | Trained with pixel to pixel instruction | No condition. |
| lllyasviel/control_v11p_sd15_inpaint | Trained with image inpainting | No condition. |
| lllyasviel/control_v11p_sd15_mlsd | Trained with multi-level line segment detection | An image with annotated line segments. |
| lllyasviel/control_v11f1p_sd15_depth | Trained with depth estimation | An image with depth information, usually represented as a grayscale image. |
| lllyasviel/control_v11p_sd15_normalbae | Trained with surface normal estimation | An image with surface normal information, usually represented as a color-coded image. |
| lllyasviel/control_v11p_sd15_seg | Trained with image segmentation | An image with segmented regions, usually represented as a color-coded image. |
| lllyasviel/control_v11p_sd15_lineart | Trained with line art generation | An image with line art, usually black lines on a white background. |
| lllyasviel/control_v11p_sd15s2_lineart_anime | Trained with anime line art generation | An image with anime-style line art. |
| lllyasviel/control_v11p_sd15_openpose | Trained with human pose estimation | An image with human poses, usually represented as a set of keypoints or skeletons. |
| lllyasviel/control_v11p_sd15_scribble | Trained with scribble-based image generation | An image with scribbles, usually random or user-drawn strokes. |
| lllyasviel/control_v11p_sd15_softedge | Trained with soft edge image generation | An image with soft edges, usually to create a more painterly or artistic effect. |
| lllyasviel/control_v11e_sd15_shuffle | Trained with image shuffling | An image with shuffled patches or regions. |

- The improvement of this model is mainly based on our improved implementation of OpenPose. We carefully reviewed the difference between the pytorch OpenPose and CMU's C++ OpenPose. Now the processor should be more accurate, especially for hands. The improvement of the processor leads to the improvement of Openpose 1.1.
- More inputs are supported (hand and face).
- The training dataset of the previous ControlNet 1.0 has several problems, including (1) a small group of greyscale human images are duplicated thousands of times (!!), making the previous model somewhat likely to generate grayscale human images; (2) some images have low quality, are very blurry, or have significant JPEG artifacts; (3) a small group of images have wrongly paired prompts caused by a mistake in our data processing scripts. The new model fixed all problems of the training dataset and should be more reasonable in many cases.

For more information, please also have a look at the Diffusers ControlNet Blog Post and have a look at the official docs.


—

control_v11p_sd15_normalbae

Controlnet v1.1 is the successor model of Controlnet v1.0 and was released in lllyasviel/ControlNet-v1-1 by Lvmin Zhang. This checkpoint is a conversion of the original checkpoint into `diffusers` format. It can be used in combination with Stable Diffusion, such as runwayml/stable-diffusion-v1-5. For more details, please also have a look at the 🧨 Diffusers docs. ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on normalbae images. Model Details - Developed by: Lvmin Zhang, Maneesh Agrawala - Model type: Diffusion-based text-to-image generation model - Language(s): English - License: The CreativeML OpenRAIL M license is an Open RAIL M license, adapted from the work that BigScience and the RAIL Initiative are jointly carrying in the area of responsible AI licensing. See also the article about the BLOOM Open RAIL license on which our license is based. - Resources for more information: GitHub Repository, Paper. - Cite as: @misc{zhang2023adding, title={Adding Conditional Control to Text-to-Image Diffusion Models}, author={Lvmin Zhang and Maneesh Agrawala}, year={2023}, eprint={2302.05543}, archivePrefix={arXiv}, primaryClass={cs.CV} } Controlnet was proposed in Adding Conditional Control to Text-to-Image Diffusion Models by Lvmin Zhang, Maneesh Agrawala. We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. 
The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k).

| Model | Description | Expected control image |
|-------|-------------|------------------------|
| lllyasviel/control_v11p_sd15_canny | Trained with canny edge detection | A monochrome image with white edges on a black background. |
| lllyasviel/control_v11e_sd15_ip2p | Trained with pixel to pixel instruction | No condition. |
| lllyasviel/control_v11p_sd15_inpaint | Trained with image inpainting | No condition. |
| lllyasviel/control_v11p_sd15_mlsd | Trained with multi-level line segment detection | An image with annotated line segments. |
| lllyasviel/control_v11f1p_sd15_depth | Trained with depth estimation | An image with depth information, usually represented as a grayscale image. |
| lllyasviel/control_v11p_sd15_normalbae | Trained with surface normal estimation | An image with surface normal information, usually represented as a color-coded image. |
| lllyasviel/control_v11p_sd15_seg | Trained with image segmentation | An image with segmented regions, usually represented as a color-coded image. |
| lllyasviel/control_v11p_sd15_lineart | Trained with line art generation | An image with line art, usually black lines on a white background. |
| lllyasviel/control_v11p_sd15s2_lineart_anime | Trained with anime line art generation | An image with anime-style line art. |
| lllyasviel/control_v11p_sd15_openpose | Trained with human pose estimation | An image with human poses, usually represented as a set of keypoints or skeletons. |
| lllyasviel/control_v11p_sd15_scribble | Trained with scribble-based image generation | An image with scribbles, usually random or user-drawn strokes. |
| lllyasviel/control_v11p_sd15_softedge | Trained with soft edge image generation | An image with soft edges, usually to create a more painterly or artistic effect. |
| lllyasviel/control_v11e_sd15_shuffle | Trained with image shuffling | An image with shuffled patches or regions. |
| lllyasviel/control_v11f1e_sd15_tile | Trained with image tiling | A blurry image or part of an image. |

- The normal-from-midas method in Normal 1.0 is neither reasonable nor physically correct. That method does not work very well on many images. The Normal 1.0 model cannot interpret real normal maps created by rendering engines.
- Normal 1.1 is much more reasonable because the preprocessor is trained to estimate normal maps with a relatively correct protocol (NYU-V2's visualization method). This means Normal 1.1 can interpret real normal maps from rendering engines as long as the colors are correct (blue is front, red is left, green is top).
- In our test, this model is robust and can achieve similar performance to the depth model. In the previous ControlNet 1.0, Normal 1.0 was not used very frequently. Normal 1.1 is much improved and has the potential to be used much more frequently.

For more information, please also have a look at the Diffusers ControlNet Blog Post and have a look at the official docs.
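
The color convention mentioned above (blue is front, red is left, green is top) can be illustrated with a small encoder from a surface normal to an RGB pixel. This is a sketch under assumptions: the card states only which color corresponds to which direction, so the linear mapping from [-1, 1] to [0, 255] and the function name `encode_normal` are illustrative choices, not the preprocessor's actual code.

```python
from typing import Tuple


def encode_normal(left: float, up: float, front: float) -> Tuple[int, int, int]:
    """Map a normal given by its (left, up, front) components to an RGB triple.

    Assumption: each component is normalized and mapped linearly from
    [-1, 1] to [0, 255], so red grows toward the left, green toward the
    top, and blue toward the viewer.
    """
    length = (left * left + up * up + front * front) ** 0.5
    if length == 0.0:
        raise ValueError("normal vector must be non-zero")

    def channel(component: float) -> int:
        # Round half up to the nearest integer in 0..255.
        return int((component / length + 1.0) / 2.0 * 255.0 + 0.5)

    return channel(left), channel(up), channel(front)


if __name__ == "__main__":
    print(encode_normal(0.0, 0.0, 1.0))  # surface facing the viewer
    print(encode_normal(1.0, 0.0, 0.0))  # surface facing left
    print(encode_normal(0.0, 1.0, 0.0))  # surface facing up
```

Under this mapping a surface facing the viewer is mostly blue, which is why typical normal maps look bluish overall.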


—

stable-diffusion-2-base

Stable Diffusion v2-base Model Card This model card focuses on the model associated with the Stable Diffusion v2-base model, available here. The model is trained from scratch 550k steps at resolution `256x256` on a subset of LAION-5B filtered for explicit pornographic material, using the LAION-NSFW classifier with `punsafe=0.1` and an aesthetic score >= `4.5`. Then it is further trained for 850k steps at resolution `512x512` on the same dataset on images with resolution `>= 512x512`. - Use it with the `stablediffusion` repository: download the `512-base-ema.ckpt` here. - Use it with 🧨 `diffusers` Model Details - Developed by: Robin Rombach, Patrick Esser - Model type: Diffusion-based text-to-image generation model - Language(s): English - License: CreativeML Open RAIL++-M License - Model Description: This is a model that can be used to generate and modify images based on text prompts. It is a Latent Diffusion Model that uses a fixed, pretrained text encoder (OpenCLIP-ViT/H). - Resources for more information: GitHub Repository. - Cite as: @InProceedings{Rombach2022CVPR, author = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn}, title = {High-Resolution Image Synthesis With Latent Diffusion Models}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2022}, pages = {10684-10695} } Using the 🤗's Diffusers library to run Stable Diffusion 2 in a simple and efficient manner. 
Running the pipeline: if you don't swap the scheduler, it will run with the default PNDM/PLMS scheduler; the model card's example swaps it for EulerDiscreteScheduler.

Notes:
- Despite not being a dependency, we highly recommend installing xformers for memory-efficient attention (better performance).
- If you have low GPU RAM available, make sure to add a `pipe.enable_attention_slicing()` after sending it to `cuda` for less VRAM usage (at the cost of speed).

Direct Use

The model is intended for research purposes only. Possible research areas and tasks include
- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models.

Misuse, Malicious Use, and Out-of-Scope Use

Note: This section is originally taken from the DALLE-MINI model card, was used for Stable Diffusion v1, but applies in the same way to Stable Diffusion v2. The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating images that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.

Out-of-Scope Use

The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model.

Misuse and Malicious Use

Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:
- Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc.
- Intentionally promoting or propagating discriminatory content or harmful stereotypes.
- Impersonating individuals without their consent.
- Sexual content without consent of the people who might see it.
- Mis- and disinformation
- Representations of egregious violence and gore
- Sharing of copyrighted or licensed material in violation of its terms of use.
- Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use.

Limitations

- The model does not achieve perfect photorealism
- The model cannot render legible text
- The model does not perform well on more difficult tasks which involve compositionality, such as rendering an image corresponding to “A red cube on top of a blue sphere”
- Faces and people in general may not be generated properly.
- The model was trained mainly with English captions and will not work as well in other languages.
- The autoencoding part of the model is lossy
- The model was trained on a subset of the large-scale dataset LAION-5B, which contains adult, violent and sexual content. To partially mitigate this, we have filtered the dataset using LAION's NSFW detector (see Training section).

Bias

While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases. Stable Diffusion v2 was primarily trained on subsets of LAION-2B(en), which consists of images that are limited to English descriptions. Texts and images from communities and cultures that use other languages are likely to be insufficiently accounted for. This affects the overall output of the model, as white and western cultures are often set as the default. Further, the ability of the model to generate content with non-English prompts is significantly worse than with English-language prompts. Stable Diffusion v2 mirrors and exacerbates biases to such a degree that viewer discretion must be advised irrespective of the input or its intent.

Training Data

The model developers used the following dataset for training the model:
- LAION-5B and subsets (details below).
The training data is further filtered using LAION's NSFW detector, with a "p_unsafe" score of 0.1 (conservative). For more details, please refer to LAION-5B's NeurIPS 2022 paper and reviewer discussions on the topic.

Training Procedure

Stable Diffusion v2 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training,
- Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4
- Text prompts are encoded through the OpenCLIP-ViT/H text-encoder.
- The output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
- The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet. We also use the so-called v-objective, see https://arxiv.org/abs/2202.00512.

- `512-base-ema.ckpt`: 550k steps at resolution `256x256` on a subset of LAION-5B filtered for explicit pornographic material, using the LAION-NSFW classifier with `punsafe=0.1` and an aesthetic score >= `4.5`. 850k steps at resolution `512x512` on the same dataset with resolution `>= 512x512`.
- `768-v-ema.ckpt`: Resumed from `512-base-ema.ckpt` and trained for 150k steps using a v-objective on the same dataset. Resumed for another 140k steps on a `768x768` subset of our dataset.
- `512-depth-ema.ckpt`: Resumed from `512-base-ema.ckpt` and finetuned for 200k steps. Added an extra input channel to process the (relative) depth prediction produced by MiDaS (`dpt_hybrid`) which is used as an additional conditioning. The additional input channels of the U-Net which process this extra information were zero-initialized.
- `512-inpainting-ema.ckpt`: Resumed from `512-base-ema.ckpt` and trained for another 200k steps.
Follows the mask-generation strategy presented in LAMA which, in combination with the latent VAE representations of the masked image, is used as an additional conditioning. The additional input channels of the U-Net which process this extra information were zero-initialized. The same strategy was used to train the 1.5-inpainting checkpoint.
- `x4-upscaling-ema.ckpt`: Trained for 1.25M steps on a 10M subset of LAION containing images `>2048x2048`. The model was trained on crops of size `512x512` and is a text-guided latent upscaling diffusion model. In addition to the textual input, it receives a `noise_level` as an input parameter, which can be used to add noise to the low-resolution input according to a predefined diffusion schedule.

- Hardware: 32 x 8 x A100 GPUs
- Optimizer: AdamW
- Gradient Accumulations: 1
- Batch: 32 x 8 x 2 x 4 = 2048
- Learning rate: warmup to 0.0001 for 10,000 steps and then kept constant

Evaluation Results

Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0) and 50 DDIM sampling steps show the relative improvements of the checkpoints. Evaluated using 50 DDIM steps and 10000 random prompts from the COCO2017 validation set, evaluated at 512x512 resolution. Not optimized for FID scores.

Stable Diffusion v1 Estimated Emissions

Based on that information, we estimate the following CO2 emissions using the Machine Learning Impact calculator presented in Lacoste et al. (2019). The hardware, runtime, cloud provider, and compute region were utilized to estimate the carbon impact.
- Hardware Type: A100 PCIe 40GB
- Hours used: 200000
- Cloud Provider: AWS
- Compute Region: US-east
- Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid): 15000 kg CO2 eq.
Citation @InProceedings{Rombach2022CVPR, author = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn}, title = {High-Resolution Image Synthesis With Latent Diffusion Models}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2022}, pages = {10684-10695} } This model card was written by: Robin Rombach, Patrick Esser and David Ha and is based on the Stable Diffusion v1 and DALL-E Mini model card.
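
Two pieces of arithmetic from the training description above can be checked with a few lines of Python: the latent shape produced by an autoencoder with downsampling factor 8 (H x W x 3 to H/f x W/f x 4), and the batch size given as 32 x 8 x 2 x 4. The helper names `latent_shape` and `global_batch` are hypothetical, and the sketch does not interpret what each batch factor stands for, since the card does not say.

```python
from math import prod
from typing import Tuple


def latent_shape(height: int, width: int, factor: int = 8, channels: int = 4) -> Tuple[int, int, int]:
    """Shape of the latent for an H x W x 3 image: (H/f, W/f, channels)."""
    if height % factor or width % factor:
        raise ValueError("height and width must be multiples of the downsampling factor")
    return height // factor, width // factor, channels


def global_batch(*factors: int) -> int:
    """Multiply the batch factors listed in the card (32 x 8 x 2 x 4)."""
    return prod(factors)


if __name__ == "__main__":
    print(latent_shape(512, 512))     # latent shape at the 512x512 training resolution
    print(latent_shape(768, 768))     # latent shape at the 768x768 training resolution
    print(global_batch(32, 8, 2, 4))  # batch size stated in the card
```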

—

Qwen3.5-9B-Q8_0.gguf


—

Qwen3.5-9B-Q4_K_M.gguf


—

TRELLIS-image-large

The image-conditioned version of TRELLIS, a large 3D generative model. It was introduced in the paper Structured 3D Latents for Scalable and Versatile 3D Generation.

license:mit

taesdxl

TAESDXL is a very tiny autoencoder which uses the same "latent API" as SDXL-VAE. TAESDXL is useful for real-time previewing of the SDXL generation process. This repo contains `.safetensors` versions of the TAESDXL weights. For SD1.x / SD2.x, use TAESD instead (the SD and SDXL VAEs are incompatible).

license:mit

control_v11f1e_sd15_tile

Controlnet v1.1 was released in lllyasviel/ControlNet-v1-1 by Lvmin Zhang. This checkpoint is a conversion of the original checkpoint into `diffusers` format. It can be used in combination with Stable Diffusion, such as runwayml/stable-diffusion-v1-5. For more details, please also have a look at the 🧨 Diffusers docs. ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on tiled images. Conceptually, it is similar to a super-resolution model, but its usage is not limited to that. It is also possible to generate details at the same size as the input (conditioning) image. Model Details - Developed by: Lvmin Zhang, Maneesh Agrawala - Model type: Diffusion-based text-to-image generation model - Language(s): English - License: The CreativeML OpenRAIL M license is an Open RAIL M license, adapted from the work that BigScience and the RAIL Initiative are jointly carrying in the area of responsible AI licensing. See also the article about the BLOOM Open RAIL license on which our license is based. - Resources for more information: GitHub Repository, Paper. - Cite as: @misc{zhang2023adding, title={Adding Conditional Control to Text-to-Image Diffusion Models}, author={Lvmin Zhang and Maneesh Agrawala}, year={2023}, eprint={2302.05543}, archivePrefix={arXiv}, primaryClass={cs.CV} } Introduction Controlnet was proposed in Adding Conditional Control to Text-to-Image Diffusion Models by Lvmin Zhang, Maneesh Agrawala. We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. 
The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k).

| Model | Description | Expected control image |
|-------|-------------|------------------------|
| lllyasviel/control_v11p_sd15_canny | Trained with canny edge detection | A monochrome image with white edges on a black background. |
| lllyasviel/control_v11e_sd15_ip2p | Trained with pixel to pixel instruction | No condition. |
| lllyasviel/control_v11p_sd15_inpaint | Trained with image inpainting | No condition. |
| lllyasviel/control_v11p_sd15_mlsd | Trained with multi-level line segment detection | An image with annotated line segments. |
| lllyasviel/control_v11f1p_sd15_depth | Trained with depth estimation | An image with depth information, usually represented as a grayscale image. |
| lllyasviel/control_v11p_sd15_normalbae | Trained with surface normal estimation | An image with surface normal information, usually represented as a color-coded image. |
| lllyasviel/control_v11p_sd15_seg | Trained with image segmentation | An image with segmented regions, usually represented as a color-coded image. |
| lllyasviel/control_v11p_sd15_lineart | Trained with line art generation | An image with line art, usually black lines on a white background. |
| lllyasviel/control_v11p_sd15s2_lineart_anime | Trained with anime line art generation | An image with anime-style line art. |
| lllyasviel/control_v11p_sd15_openpose | Trained with human pose estimation | An image with human poses, usually represented as a set of keypoints or skeletons. |
| lllyasviel/control_v11p_sd15_scribble | Trained with scribble-based image generation | An image with scribbles, usually random or user-drawn strokes. |
| lllyasviel/control_v11p_sd15_softedge | Trained with soft edge image generation | An image with soft edges, usually to create a more painterly or artistic effect. |
| lllyasviel/control_v11e_sd15_shuffle | Trained with image shuffling | An image with shuffled patches or regions. |
| lllyasviel/control_v11f1e_sd15_tile | Trained with image tiling | A blurry image or part of an image. |

For more information, please also have a look at the Diffusers ControlNet Blog Post and have a look at the official docs.


—

flux-game

license:apache-2.0

zero123-xl-diffusers

Uses Note: This section is originally taken from the Stable Diffusion v2 model card, but applies in the same way to Zero-1-to-3. Direct Use The model is intended for research purposes only. Possible research areas and tasks include: - Safe deployment of large-scale models. - Probing and understanding the limitations and biases of generative models. - Generation of artworks and use in design and other artistic processes. - Applications in educational or creative tools. - Research on generative models. Misuse, Malicious Use, and Out-of-Scope Use The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating images that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes. Out-of-Scope Use The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model. Misuse and Malicious Use Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to: - Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc. - Intentionally promoting or propagating discriminatory content or harmful stereotypes. - Impersonating individuals without their consent. - Sexual content without consent of the people who might see it. - Mis- and disinformation - Representations of egregious violence and gore - Sharing of copyrighted or licensed material in violation of its terms of use. - Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use. - The model does not achieve perfect photorealism. - The model cannot render legible text. - Faces and people in general may not be parsed or generated properly. 
- The autoencoding part of the model is lossy. - Stable Diffusion was trained on a subset of the large-scale dataset LAION-5B, which contains adult, violent and sexual content. To partially mitigate this, Stability AI has filtered the dataset using LAION's NSFW detector. - Zero-1-to-3 was subsequently finetuned on a subset of the large-scale dataset Objaverse, which might also potentially contain inappropriate content. To partially mitigate this, our demo applies a safety check to every uploaded image. Bias While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases. Stable Diffusion was primarily trained on subsets of LAION-2B(en), which consists of images that are limited to English descriptions. Images and concepts from communities and cultures that use other languages are likely to be insufficiently accounted for. This affects the overall output of the model, as Western cultures are often overrepresented. Stable Diffusion mirrors and exacerbates biases to such a degree that viewer discretion must be advised irrespective of the input or its intent. Safety Module The intended use of this model is with the Safety Checker in Diffusers. This checker works by checking model inputs against known hard-coded NSFW concepts. Specifically, the checker compares the class probability of harmful concepts in the embedding space of the uploaded input images. The concepts are passed into the model with the image and compared to a hand-engineered weight for each NSFW concept.
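
The comparison described in the Safety Module paragraph (an image embedding checked against concept embeddings, each with its own hand-set weight) can be sketched roughly as follows. This is a simplified illustration in plain Python, not the Diffusers implementation: the function names, the toy two-dimensional embeddings, and the rule "flag when cosine similarity exceeds the per-concept threshold" are assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def flagged_concepts(
    image_embedding: Sequence[float],
    concept_embeddings: Dict[str, Sequence[float]],
    thresholds: Dict[str, float],
) -> List[str]:
    """Return the concepts whose similarity to the image exceeds their threshold.

    The per-concept threshold plays the role of the hand-engineered weight
    mentioned in the text (hypothetical values here).
    """
    return [
        name
        for name, embedding in concept_embeddings.items()
        if cosine_similarity(image_embedding, embedding) - thresholds[name] > 0.0
    ]


if __name__ == "__main__":
    concepts = {"concept_a": [1.0, 0.0], "concept_b": [0.0, 1.0]}
    limits = {"concept_a": 0.5, "concept_b": 0.5}
    print(flagged_concepts([1.0, 0.0], concepts, limits))
```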

license:mit

3DTopia-XL

This repo contains the pretrained weights for 3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion.

Introduction

3DTopia-XL scales high-quality 3D asset generation using a Diffusion Transformer (DiT) built upon PrimX, an expressive and efficient 3D representation. The denoising process takes 5 seconds to generate a 3D PBR asset from text or image input, and the result is ready for use in the graphics pipeline.

Model Details

The model is trained on a ~256K subset of Objaverse. For more details, please refer to our paper. For details on loading and inference, please refer to our repo.

license:apache-2.0

dpt-large

license:apache-2.0

CLIP-ViT-L-14-DataComp.XL-s13B-b90K

1. Model Details
2. Uses
3. Training Details
4. Evaluation
5. Acknowledgements
6. Citation
7. How To Get Started With the Model

A CLIP ViT-L/14 model trained on DataComp-1B (https://github.com/mlfoundations/datacomp) using OpenCLIP (https://github.com/mlfoundations/openclip).

As per the original OpenAI CLIP model card, this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models. The OpenAI CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. Additionally, the DataComp paper (https://arxiv.org/abs/2304.14108) includes additional discussion as it relates specifically to the training dataset.

Direct uses include zero-shot image classification and image and text retrieval, among others. Downstream uses include image classification and other image task fine-tuning, linear probe image classification, and image generation guiding and conditioning, among others.

Any deployed use case of the model, whether commercial or not, is currently out of scope. Non-deployed use cases, such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task-specific testing, especially given the variability of CLIP’s performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful. Certain use cases which would fall under the domain of surveillance and facial recognition are always out of scope regardless of performance of the model. This is because the use of artificial intelligence for tasks such as these can be premature currently given the lack of testing norms and checks to ensure its fair use.

This model was trained on the 1.4 billion samples of the DataComp-1B dataset (https://arxiv.org/abs/2304.14108).

IMPORTANT NOTE: The motivation behind dataset creation is to democratize research and experimentation around large-scale multi-modal model training and handling of uncurated, large-scale datasets crawled from the publicly available internet. Our recommendation is therefore to use the dataset for research purposes. Be aware that this large-scale dataset is uncurated. Keep in mind that the uncurated nature of the dataset means that collected links may lead to strongly discomforting and disturbing content for a human viewer. Therefore, please use the demo links with caution and at your own risk. It is possible to extract a “safe” subset by filtering out samples based on the safety tags (using a customized trained NSFW classifier that we built). While this strongly reduces the chance of encountering potentially harmful content when viewing, we cannot entirely exclude the possibility of harmful content still being present in safe mode, so the warning holds there as well. We think that providing the dataset openly to broad research and other interested communities will allow for transparent investigation of the benefits that come with training large-scale models, as well as the pitfalls and dangers that may stay unreported or unnoticed when working with closed large datasets that remain restricted to a small community. Although we provide our dataset openly, we do not recommend using it for creating ready-to-go industrial products, as the basic research about general properties and safety of such large-scale models, which we would like to encourage with this release, is still in progress.

Evaluation was done on a suite of 38 datasets, using the DataComp repo and the LAION CLIP Benchmark. The model achieves a 79.2% zero-shot top-1 accuracy on ImageNet-1k. See our paper for more details and results (https://arxiv.org/abs/2304.14108).

We acknowledge stability.ai for the compute used to train this model.
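Zero-shot classification and the top-1 accuracy metric reported above can be sketched in a few lines. This is a minimal illustration with toy 3-dimensional embeddings, not the OpenCLIP API: in practice the image and text embeddings come from the model's encoders. The function names, the class names, the embedding values, and the logit scale of 100 are assumptions made for the example.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_probs(image_emb, text_embs, scale=100.0):
    """Softmax over scaled cosine similarities between one image and each class prompt."""
    img = normalize(image_emb)
    logits = [scale * sum(a * b for a, b in zip(img, normalize(t))) for t in text_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_accuracy(predictions, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Toy data (assumed): one text embedding per class prompt, three images.
class_names = ["cat", "dog", "car"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embs = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.2, 0.7, 0.1]]
labels = [0, 1, 2]

predictions = []
for emb in image_embs:
    probs = zero_shot_probs(emb, text_embs)
    predictions.append(probs.index(max(probs)))

accuracy = top1_accuracy(predictions, labels)
print([class_names[p] for p in predictions], accuracy)
```

No classifier is trained here: the class set is defined only by the text prompts, which is why results can vary with the chosen class taxonomy.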

license:mit

TRELLIS-text-large

The text-conditioned version of TRELLIS with model size L, a large 3D generative model. It was introduced in the paper Structured 3D Latents for Scalable and Versatile 3D Generation.

license:mit

TRELLIS-text-base

The text-conditioned version of TRELLIS with model size B, a large 3D generative model. It was introduced in the paper Structured 3D Latents for Scalable and Versatile 3D Generation.

license:mit

stable-zero123-diffusers

license:mit

Hunyuan3D-2

—

dinov2-small-imagenet1k-1-layer

Vision Transformer (small-sized model) trained using DINOv2

Vision Transformer (ViT) model trained using the DINOv2 method. It was introduced in the paper DINOv2: Learning Robust Visual Features without Supervision by Oquab et al. and first released in this repository.

Disclaimer: The team releasing DINOv2 did not write a model card for this model so this model card has been written by the Hugging Face team.

The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a self-supervised fashion. Images are presented to the model as a sequence of fixed-size patches, which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder.

Note that this model does not include any fine-tuned heads. By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image.

You can use the model for classifying an image among one of the 1000 ImageNet labels. See the model hub to look for other fine-tuned versions on a task that interests you.
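The patch sequence and the linear layer on the [CLS] token described above can be sketched as follows. This is a toy illustration in plain Python, not the Transformers API: the 224-pixel image size, the 14-pixel patch size, the 3-dimensional hidden state, and the weights are assumed values chosen for the example, and `num_tokens` and `linear_head` are hypothetical helpers.

```python
def num_tokens(image_size, patch_size):
    """Sequence length for a square image: one token per patch plus the [CLS] token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + 1


def linear_head(cls_state, weights, bias):
    """Apply a linear layer (one weight row per class) to the [CLS] hidden state."""
    return [
        sum(w * x for w, x in zip(row, cls_state)) + b
        for row, b in zip(weights, bias)
    ]


# Assumed sizes: a 224x224 image cut into 14x14 patches.
sequence_length = num_tokens(224, 14)

# Toy [CLS] state and a two-class head (assumed values).
cls_state = [0.5, -1.0, 2.0]
weights = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0]]
bias = [0.0, 0.1]

logits = linear_head(cls_state, weights, bias)
predicted_class = logits.index(max(logits))
print(sequence_length, logits, predicted_class)
```

A real ImageNet-1k head would have 1000 weight rows, one per label, with the row length equal to the encoder's hidden size.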

license:apache-2.0

CUA_benchmark_local_small_models

—

Qwen3-Coder-Next-f16-GGUF

license:apache-2.0

Qwen-Image-Edit-Rapid-AIO-MultipleAngle

license:apache-2.0
