KBlueLeaf

41 models

DanTagGen-delta-rev2

DanTagGen - delta (rev2)

DanTagGen (Danbooru Tag Generator) is inspired by p1atdev's dart project, but with a different architecture, dataset, format, and training strategy.

Difference between versions:
- alpha: pretrained on the 2M dataset, smaller batch size. Limited ability.
- beta: pretrained on the 5.3M dataset, larger batch size. More stable, better ability with only a little information provided.
- delta: pretrained on the 7.2M dataset, larger batch size. Slightly underfit but better diversity; quality tag introduced.
- rev2: resumed from delta, same dataset, 2 more epochs.

Model arch: this version of DTG is trained from scratch with a 400M-parameter LLaMA architecture (I personally call it NanoLLaMA). Since it is the LLaMA architecture, it should theoretically work in any LLaMA inference interface. This repo also provides a converted FP16 gguf model and quantized 8-bit/6-bit gguf models. It is recommended to use llama.cpp or llama-cpp-python to run this model, which is very fast.

Dataset and Training: I used the trainer I implemented in HakuPhi, for a total of 12 epochs on 7.2M data; the model has seen roughly 10~15B tokens. The dataset was exported by HakuBooru from my danbooru sqlite database, filtered by the percentile of favcount within each rating (2M = top 25%, 5.3M = top 75%).

Utilities:
- HF space: https://huggingface.co/spaces/KBlueLeaf/DTG-demo
- Demo for DTG + Kohaku XL Epsilon: https://huggingface.co/spaces/KBlueLeaf/This-Cute-Dragon-Girl-Doesnt-Exist
- SD-WebUI Extension: https://github.com/KohakuBlueleaf/z-a1111-sd-webui-dtg
- ComfyUI Node: https://github.com/toyxyz/a1111-sd-webui-dtgcomfyui
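Since the repo ships GGUF builds and recommends llama.cpp or llama-cpp-python, running one could look like the sketch below. The filename and the tag prompt are illustrative assumptions, not taken from the card; use whichever quantization you actually downloaded.

```python
def generate_tags(model_path: str, prompt: str, max_tokens: int = 128) -> str:
    """Run a DanTagGen GGUF file with llama-cpp-python (pip install llama-cpp-python)."""
    from llama_cpp import Llama

    # Load the quantized model; n_ctx is a reasonable guess for tag-length prompts.
    llm = Llama(model_path=model_path, n_ctx=1024, verbose=False)
    out = llm(prompt, max_tokens=max_tokens, temperature=1.0)
    return out["choices"][0]["text"]

# Example call (hypothetical filename):
# print(generate_tags("DanTagGen-delta-rev2-Q8_0.gguf", "1girl, solo, "))
```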

llama
14,338
37

DanTagGen-beta

DanTagGen - beta

DanTagGen (Danbooru Tag Generator) is inspired by p1atdev's dart project, but with a different architecture, dataset, format, and training strategy.

Difference between versions:
- alpha: pretrained on the 2M dataset, smaller batch size. Limited ability.
- beta: pretrained on the 5.3M dataset, larger batch size. More stable, better ability with only a little information provided.

| | Without DTG | DTG-Alpha | DTG-Beta |
| - | - | - | - |
| Prompts | Base prompt | Base prompt + "mole under eye, tail, twintails, open mouth, single ear cover, horse ears, breasts, looking at viewer, visor cap, streaked hair, long hair, horse tail, hair between eyes, cowboy shot, blue nails, purple eyes, covered navel, horse girl, competition swimsuit, blush, multicolored hair, collarbone, two-tone swimsuit, animal ears, mole, white hair, ear covers, smile, ear ornament, swimsuit, solo, blue eyes, brown hair, one-piece swimsuit, white headwear, medium breasts, white one-piece swimsuit, bare shoulders," | Base prompt + "blue bikini, tail, twintails, single ear cover, horse ears, striped clothes, ear piercing, cleavage, breasts, blue ribbon, looking at viewer, ribbon, streaked hair, long hair, horse tail, hair between eyes, :3, purple eyes, horse girl, blush, multicolored hair, hair ribbon, collarbone, bikini skirt, piercing, animal ears, striped bikini, sitting, white hair, ear covers, :d, smile, swimsuit, solo, brown hair, ocean, white headwear, medium breasts, bikini," |
| Result image | (not shown) | (not shown) | (not shown) |
| Performance | It can't even generate Vivlos | It can generate an image with correct character features, but not enough detail, and some features are wrong or missing | Way better than alpha: good character features, plus a lot more detail and better composition |

| | Without DTG | DTG-Alpha | DTG-Beta |
| - | - | - | - |
| Prompts | Base prompt | Base prompt + "plant, necktie, tail, indoors, skirt, looking at viewer, cup, lounge chair, green theme, book, alternate costume, potted plant, hair ornament, blue jacket, blush, medium hair, black necktie, green eyes, jacket, animal ears, black hair, round eyewear, bookshelf, adjusting eyewear, ahoge, smile, solo, window, brown hair, crossed legs, glasses, closed mouth, book stack," | Base prompt + "jacket, sitting on table, food, tail, collar, horse racing, black hair, boots, school bag, bag, full body, blue eyes, hair ornament, animal ears, ahoge, sitting, thighhighs, blurry background, looking at viewer, school uniform, long hair, blurry, cup, window, crossed legs, alternate costume, medium breasts, breasts, calendar \(object\), casual, door, solo, disposable cup," |
| Result image | (not shown) | (not shown) | (not shown) |
| Performance | | It can generate an image with more elements and details, but coherence with the character is not good | Way better than alpha: a lot more detail and better composition |

Model arch: this version of DTG is trained from scratch with a 400M-parameter LLaMA architecture (I personally call it NanoLLaMA). Since it is the LLaMA architecture, it should theoretically work in any LLaMA inference interface. This repo also provides a converted FP16 gguf model and quantized 8-bit/6-bit gguf models. It is recommended to use llama.cpp or llama-cpp-python to run this model, which is very fast.

Dataset and Training: I used the trainer I implemented in HakuPhi, for 10 epochs on 5.3M data; the model has seen roughly 6~12B tokens. The dataset was exported by HakuBooru from my danbooru sqlite database, filtered by the percentile of favcount within each rating (2M = top 25%, 5.3M = top 75%).

Utilities: I'm implementing a gradio UI for this, and other devs can use its API to build different apps. I'm also planning to make an sd-webui extension.

llama
10,920
71

TIPO-500M-ft

TIPO: Text to Image with text presampling for Prompt Optimization

500M LLaMA-arch model trained for TIPO. Tech report: https://arxiv.org/abs/2411.08127

In this project, we introduce "TIPO" (Text to Image with text presampling for Prompt Optimization), an innovative framework designed to significantly enhance the quality and usability of Text-to-Image (T2I) generative models. TIPO uses Large Language Models (LLMs) to perform "text presampling" within the inference pipeline of text-to-image generative modeling. By refining and extending user input prompts, TIPO enables generative models to produce superior results with minimal user effort, making T2I systems more accessible and effective for a wider range of users.

Use the updated version of the DTG extension (renamed to z-tipo-extension); the current version supports stable-diffusion-webui, stable-diffusion-webui-forge, and ComfyUI. SD-Next hasn't been tested. https://github.com/KohakuBlueleaf/z-tipo-extension

This model is LLaMA arch with 500M parameters; the training data is a combined version of Danbooru2023 and Coyo-HD-11M. The total tokens seen is around 50B. For more information, please refer to the tech report and the following table.

| | TIPO-200M | TIPO-500M-ft | TIPO-500M |
| --- | --- | --- | --- |
| Arch | LLaMA | LLaMA | LLaMA |
| Max ctx length | 1024 | 1024 | 1024 |
| Batch Size | 2048 | 3584 | 3584 |
| Training dataset | Danbooru, GBC10M, 5 epoch; then Danbooru, GBC10M, Coyo11M, 3 epoch | Danbooru (pixtral), GBC10M, Coyo11M, 2 epoch | Danbooru, GBC10M, Coyo11M, 5 epoch |
| Real Token Seen | 40B tokens | 42B (12B more from TIPO-500M) | 30B tokens |
| Training Hardware | RTX 3090 x 4 | RTX 3090 x 4 | H100 x 8 |
| Training Time | 420 hours | 290 hours | 100 hours |
| Huggingface | KBlueLeaf/TIPO-200M · Hugging Face | You are HERE | KBlueLeaf/TIPO-500M · Hugging Face |

Notes: token counts include only non-padding tokens, since the training data has a very large length range. Because the training samples are fairly short, reaching the same token count takes longer than in general LLM pretraining; for reference, with a 4096 max ctx length and almost all data reaching that length, you may only need 2 days to reach 10B tokens seen on RTX 3090 x 4 with a 200M model.

Evaluation

Evaluations are done on the TIPO-200M model. We have compared TIPO against other methods across several tests and metrics.

In this test we use a single "scenery" tag as input (with certain meta tags), to test whether each prompt-gen method can reach the desired output distribution while maintaining image quality:

| Scenery Tag Test | Original | GPT4o-mini | Prompt DB | Promptist | TIPO (ours) |
| ---- | ---- | ---- | ---- | ---- | ---- |
| FDD ↓ | 0.3558 | 0.5414 | 0.3247 | 0.2350 | 0.2282 |
| Aesthetic ↑ | 5.0569 | 6.3676 | 6.1609 | 5.9468 | 6.2571 |
| AI Corrupt ↑ | 0.4257 | 0.7490 | 0.5024 | 0.5669 | 0.9195 |

In this test we use short captions or manually truncated captions from GBC10M and CoyoHD11M. This test examines each method's ability to handle almost-complete prompts:

| Short | Original | GPT4o-mini | Prompt DB | Promptist | TIPO (ours) |
| ---- | ---- | ---- | ---- | ---- | ---- |
| FDD ↓ | 0.0957 | 0.1668 | 0.0980 | 0.1783 | 0.1168 |
| Aesthetic ↑ | 5.8370 | 6.0589 | 5.8213 | 5.7963 | 5.8531 |
| AI Corrupt ↑ | 0.7113 | 0.6985 | 0.7064 | 0.6314 | 0.7131 |

| Truncated Long | Original | GPT4o-mini | Prompt DB | Promptist | TIPO (ours) |
| ---- | ---- | ---- | ---- | ---- | ---- |
| FDD ↓ | 0.0955 | 0.1683 | 0.1247 | 0.2096 | 0.1210 |
| Aesthetic ↑ | 5.7497 | 6.0168 | 5.8191 | 5.7759 | 5.8364 |
| AI Corrupt ↑ | 0.6868 | 0.6712 | 0.6741 | 0.5925 | 0.7130 |

This model is released under Kohaku License 1.0. You can check the URL provided above or the LICENSE file in this repo.
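Since the model is a standard LLaMA architecture, it should load with plain transformers; a minimal sketch, assuming the repo id from the card title, is below. The prompt and sampling settings are illustrative assumptions; for real use, the z-tipo-extension linked above is the supported path.

```python
REPO_ID = "KBlueLeaf/TIPO-500M-ft"  # repo id assumed from the card title

def extend_prompt(prompt: str, max_new_tokens: int = 256) -> str:
    """Download TIPO and autoregressively extend a short tag/caption prompt."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(REPO_ID)
    model = AutoModelForCausalLM.from_pretrained(REPO_ID)
    inputs = tok(prompt, return_tensors="pt")
    # Sampling rather than greedy decoding, to get diverse prompt expansions.
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    return tok.decode(out[0], skip_special_tokens=True)

# extend_prompt("scenery")  # requires downloading the checkpoint
```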

llama
10,740
36

DanTagGen-alpha

llama
8,806
42

DanTagGen-delta

llama
8,648
8

TIPO-500M

TIPO: Text to Image with text presampling for Prompt Optimization

500M LLaMA-arch model trained for TIPO. Tech report: https://arxiv.org/abs/2411.08127

In this project, we introduce "TIPO" (Text to Image with text presampling for Prompt Optimization), an innovative framework designed to significantly enhance the quality and usability of Text-to-Image (T2I) generative models. TIPO uses Large Language Models (LLMs) to perform "text presampling" within the inference pipeline of text-to-image generative modeling. By refining and extending user input prompts, TIPO enables generative models to produce superior results with minimal user effort, making T2I systems more accessible and effective for a wider range of users.

Usage: use the updated version of the DTG extension (renamed to z-tipo-extension); the current version supports stable-diffusion-webui, stable-diffusion-webui-forge, and ComfyUI. SD-Next hasn't been tested. https://github.com/KohakuBlueleaf/z-tipo-extension

This model is LLaMA arch with 500M parameters; the training data is a combined version of Danbooru2023 and Coyo-HD-11M. The total tokens seen is around 50B. For more information, please refer to the tech report and the following table.

| | TIPO-200M | TIPO-200M-ft | TIPO-500M |
| --- | --- | --- | --- |
| Arch | LLaMA | LLaMA | LLaMA |
| Max ctx length | 1024 | 1024 | 1024 |
| Batch Size | 2048 | 2048 | 3584 |
| Training dataset | Danbooru, GBC10M, 5 epoch; then Danbooru, GBC10M, Coyo11M, 3 epoch | Danbooru (pixtral), Coyo11M, 2 epoch | Danbooru, GBC10M, Coyo11M, 5 epoch |
| Real Token Seen | 40B tokens | 50B (10B more from TIPO-200M) | 30B tokens |
| Training Hardware | RTX 3090 x 4 | RTX 3090 x 4 | H100 x 8 |
| Training Time | 420 hours | 120 hours | 100 hours |
| Huggingface | KBlueLeaf/TIPO-200M · Hugging Face | KBlueLeaf/TIPO-200M-ft · Hugging Face | You are HERE |

Notes: token counts include only non-padding tokens, since the training data has a very large length range. Because the training samples are fairly short, reaching the same token count takes longer than in general LLM pretraining; for reference, with a 4096 max ctx length and almost all data reaching that length, you may only need 2 days to reach 10B tokens seen on RTX 3090 x 4 with a 200M model.

Evaluation

Evaluations are done on the TIPO-200M model. We have compared TIPO against other methods across several tests and metrics.

In this test we use a single "scenery" tag as input (with certain meta tags), to test whether each prompt-gen method can reach the desired output distribution while maintaining image quality:

| Scenery Tag Test | Original | GPT4o-mini | Prompt DB | Promptist | TIPO (ours) |
| ---- | ---- | ---- | ---- | ---- | ---- |
| FDD ↓ | 0.3558 | 0.5414 | 0.3247 | 0.2350 | 0.2282 |
| Aesthetic ↑ | 5.0569 | 6.3676 | 6.1609 | 5.9468 | 6.2571 |
| AI Corrupt ↑ | 0.4257 | 0.7490 | 0.5024 | 0.5669 | 0.9195 |

In this test we use short captions or manually truncated captions from GBC10M and CoyoHD11M. This test examines each method's ability to handle almost-complete prompts:

| Short | Original | GPT4o-mini | Prompt DB | Promptist | TIPO (ours) |
| ---- | ---- | ---- | ---- | ---- | ---- |
| FDD ↓ | 0.0957 | 0.1668 | 0.0980 | 0.1783 | 0.1168 |
| Aesthetic ↑ | 5.8370 | 6.0589 | 5.8213 | 5.7963 | 5.8531 |
| AI Corrupt ↑ | 0.7113 | 0.6985 | 0.7064 | 0.6314 | 0.7131 |

| Truncated Long | Original | GPT4o-mini | Prompt DB | Promptist | TIPO (ours) |
| ---- | ---- | ---- | ---- | ---- | ---- |
| FDD ↓ | 0.0955 | 0.1683 | 0.1247 | 0.2096 | 0.1210 |
| Aesthetic ↑ | 5.7497 | 6.0168 | 5.8191 | 5.7759 | 5.8364 |
| AI Corrupt ↑ | 0.6868 | 0.6712 | 0.6741 | 0.5925 | 0.7130 |

LICENSE

This model is released under Kohaku License 1.0. You can check the URL provided above or the LICENSE file in this repo.

llama
8,059
54

kohaku-v2.1

license:cc-by-nc-nd-4.0
2,883
17

DanTagGen-gamma

llama
1,675
10

HDM-xut-340M-anime

1,380
129

TIPO-200M-ft2

llama
1,267
24

guanaco-7b-leh-v2

llama
758
38

llama3-llava-next-8b-gguf

647
8

Kohaku XL Zeta

474
81

Kohaku Xl Beta5

361
10

TIPO-100M

llama
306
7

TIPO-200M

llama
175
5

EQ-SDXL-VAE

EQ-SDXL-VAE: open-sourced reproduction of EQ-VAE on SDXL-VAE

Adv-FT is done and achieves better performance than the original SDXL-VAE!

Original paper: https://arxiv.org/abs/2502.09509
Source code of the reproduction: https://github.com/KohakuBlueleaf/HakuLatent

(Figure: left, original image; center, latent PCA to 3 dims as RGB; right, decoded image. Upper row is the original VAE, bottom row is the EQ-VAE-finetuned VAE.)

EQ-VAE, short for Equivariance Regularized VAE, is a novel technique introduced in the paper "Equivariance Regularized Latent Space for Improved Generative Image Modeling" to enhance the latent spaces of autoencoders used in generative image models. The core idea behind EQ-VAE is to address a critical limitation of standard autoencoders: their lack of equivariance to semantic-preserving transformations such as scaling and rotation. This non-equivariance results in unnecessarily complex latent spaces, making it harder for subsequent generative models (like diffusion models) to learn efficiently and achieve optimal performance.

This repository provides the model weights of the open-source reproduction of the EQ-VAE method, specifically applied to SDXL-VAE. SDXL-VAE is a powerful variational autoencoder known for its use in the popular Stable Diffusion XL (SDXL) image generation models. By fine-tuning the pre-trained SDXL-VAE with the EQ-VAE regularization, we aim to create a more structured and semantically meaningful latent space. This should bring benefits such as:

- Improved generative performance: a simpler, more equivariant latent space is expected to be easier for generative models to learn from, potentially leading to faster training and better image quality metrics such as FID.
- Enhanced latent space structure: EQ-VAE encourages latent representations to respect spatial transformations, resulting in a smoother and more interpretable latent manifold.
- Compatibility with existing models: EQ-VAE is a regularization technique that can be applied to pre-trained autoencoders without architectural changes or training from scratch, making it a practical and versatile enhancement.

This reproduction allows you to experiment with EQ-VAE on SDXL-VAE, replicate the findings of the original paper, and potentially leverage the benefits of equivariance regularization in your own generative modeling projects. For a deeper understanding of the theoretical background and experimental results, please refer to the original EQ-VAE paper linked above. The source code in the HakuLatent repository provides a straightforward implementation of the EQ-VAE fine-tuning process for any diffusers VAE model.

This model is heavily finetuned from SDXL-VAE and introduces a totally new latent space. YOU CAN'T USE THIS ON YOUR SDXL MODEL directly. You can finetune your SDXL model on this VAE and expect a better final result, but it may require a lot of time.

To use this model in your own code, load it with the `AutoencoderKL` class from the diffusers library.

Training details:
- Base model: SDXL-VAE-fp16-fix
- Dataset: ImageNet-1k-resized-256
- Batch size: 128 (bs 8, grad acc 16)
- Samples seen: 3.4M (26,500 optimizer steps on the VAE)
- Discriminator: HakuNLayerDiscriminator with nlayer=4
- Discriminator startup step: 10,000
- Reconstruction loss: MSE loss, LPIPS loss, ConvNeXt perceptual loss
- Loss weights: recon loss 1.0, adv (disc) loss 0.5, KL div loss 1e-7
- For Adv FT: recon loss 1.0 (MSE 1.5, LPIPS 0.5, ConvNeXt perceptual 2.0), adv loss 1.0, KL div loss 0.0, encoder frozen

Evaluation: we use the validation and test splits of ImageNet (150k images in total) at 256x256 resolution, with MSE loss, PSNR, LPIPS, and ConvNeXt perceptual loss as metrics.

| Metrics | SDXL-VAE | EQ-SDXL-VAE | EQ-SDXL-VAE Adv FT |
| -------- | --------- | ----------- | ------------------ |
| MSE Loss | 3.683e-3 | 3.723e-3 | 3.532e-3 |
| PSNR | 24.4698 | 24.4030 | 24.6364 |
| LPIPS | 0.1316 | 0.1409 | 0.1299 |
| ConvNeXt | 1.305e-3 | 1.548e-3 | 1.322e-3 |

After EQ-VAE training without adversarial loss, EQ-SDXL-VAE is slightly worse than the original VAE. After fine-tuning with adversarial loss enabled and the encoder frozen, PSNR and LPIPS even improve beyond the original VAE!

Note: this repo contains the weights of EQ-SDXL-VAE Adv FT. After training is done, I will try to train a small T2I model on it to check whether EQ-VAE helps the training of image-generation models. I will also try to train a simple approximation decoder with only 2x upscale or no upscale of the latent, for a fast preview experience (if needed).

[1] [[2502.09509] EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling](https://arxiv.org/abs/2502.09509)
[3] sypsyp97/convnextperceptualloss: This package introduces a perceptual loss implementation based on the modern ConvNeXt architecture.
[4] evanarlian/imagenet1kresized256 · Datasets at Hugging Face

Acknowledgement: AmericanPresidentJimmyCarter provided the implementation of the random affine transformation.

license:apache-2.0
161
61

TIPO-200M-ft

llama
153
19

Kohaku-XL-gamma

96
17

Kohaku-XL-Delta

91
81

Kohaku-XL-Epsilon-rev3

63
32

LatentMaid-F8C16-Nested4-alpha

59
1

Kohaku-XL-Epsilon

48
51

Kohaku-XL-Epsilon-rev2

48
24

guanaco-7B-leh

llama
36
40

TIPO-200M-dev

llama
28
32

kohaku-v4-rev1.2

21
1

kohaku-v3-rev2

20
0

guanaco-7B-lora-embed

llama
12
4

Kohaku Xl Beta7.1

4
6

kohaku-xl-alpha

3
4

kohaku-xl-beta7

1
1

Stable-Cascade-FP16-fixed

license:cc-by-nc-4.0
0
42

HunYuanDiT-V1.1-fp16-pruned

license:mit
0
13

onimai-locon-test

0
7

Laplace-scheduler

license:mit
0
3

guanaco-lora-head

0
2

dit-testing

0
1

guanaco-lora-embed-head

0
1

imagenet16-embeds

0
1

hyperkohaku-experiments

0
1