opendiffusionai
sdxl-longcliponly
BUGFIXES!!! (Latest update 2025/05/21) Please note that the initial release had a bug in the tokenizer config. Additionally, I padded out tokenizer2 and textencoder2 so that normal python code now works with this model.

This is the base SDXL model, but with the CLIP-L text encoder swapped out for "LongCLIP", and the values for CLIP-G zeroed so it has no effect. In theory, it should be possible to replace the CLIP-G model with a dummy placeholder, so that the model takes 3GB less VRAM overall. But my attempts at that failed.

SDXL's largest limitations are primarily due to the lousy CLIP(s) used. Not only are they of poor quality, but they have hidden token count limits, which make the effective token count closer to 10. It is believed that one of the reasons CLIP-G was added was to work around the limits of the original CLIP-L. But that makes the model harder to train, and needlessly takes up more memory and time. So I created this version to experimentally prove the better way. It allows use of up to 248 tokens with SDXL natively, without the layering hacks that some diffusion programs do.

This prompt is stolen from the LongCLIP demo prompts:

> The serene lake surface resembled a flawless mirror, reflecting the soft blue sky and the surrounding greenery.
> A gentle breeze played across its expanse, ruffling the surface into delicate ripples that gradually spread out,
> disappearing into the distance. Along the shore, weeping willows swayed gracefully in the light breeze,
> their long branches dipping into the water, creating a soothing sound as they gently brushed against the surface.
> In the midst of this serene scene, a pure white swan floated gracefully on the lake.
> Its elegant neck curved into a graceful arc, giving it an air of dignity.

The point here is that the mention of a "swan" is beyond the 77-token limit. So if you see a swan, LongCLIP is working.

I had originally expected to need finetuning after the modifications. I was pleasantly surprised, then, to see that the new raw model combination performs better than SDXL base, out of the box.

Sample image links:
Before = https://huggingface.co/opendiffusionai/sdxl-longcliponly/resolve/main/2025-05-1813-26-23-training-sample-0-0-0.png
After = https://huggingface.co/opendiffusionai/sdxl-longcliponly/resolve/main/2025-05-1814-49-43-training-sample-0-0-0.png
(The face is more realistic, the clothes are better, and there is no duplication of the coffee cup on the table.)

This raw version of the model seems to work great with 3-5 word prompts, but then decays after that. So I'm working on a finetuned version.

Some programs hardcode a CLIP-L token limit of "77". There isn't a valid reason to do this; it is possible to detect the actual token count limit. This is extra unfortunate, since some programs do not merely disallow more tokens: they SEE that the new model supports 248 tokens, then complain, "hey! this model supports 248 tokens! I'm not going to allow you to use it."

safetensors file is not known to work (so use the huggingface loader!)

I did a blind conversion of the diffusers format to safetensors format using the OneTrainer conversion tools. However, it is not known to work. Even OT itself does not load it. It is provided here in the hope that someone so inclined may use it as input to fix the problems with the relevant checkpoint loader.
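If your diffusers install picks up the padded tokenizer2/textencoder2 mentioned above, something along these lines should work. This is only a minimal sketch: the step count and output filename are arbitrary, and whether a prompt longer than 77 tokens gets through untruncated depends on your diffusers version.

```python
# Minimal sketch: load this repo with the stock diffusers SDXL pipeline and pass
# a prompt that runs past the usual 77-token limit (the swan prompt from above).
# Assumes a CUDA GPU; step count and filename are arbitrary.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "opendiffusionai/sdxl-longcliponly",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

prompt = (
    "The serene lake surface resembled a flawless mirror, reflecting the soft blue sky "
    "and the surrounding greenery. A gentle breeze played across its expanse, ruffling "
    "the surface into delicate ripples that gradually spread out, disappearing into the "
    "distance. Along the shore, weeping willows swayed gracefully in the light breeze, "
    "their long branches dipping into the water. In the midst of this serene scene, a "
    "pure white swan floated gracefully on the lake."
)

image = pipe(prompt, num_inference_steps=30).images[0]
image.save("swan-test.png")
```

If you see the swan, the LongCLIP encoder is actually being used past token 77.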
XLSD V0.0
This contains the merge of the SDXL VAE with the SD1.5 unet, in assorted formats. It's not particularly usable by itself. The colors are all wrong, etc. It is mostly here as an open-source, documented starting point for the XLsd model, which is currently in development. Feel free to try to train it yourself! :)

Note that there are THREE versions of this model: an fp32 model and a bf16 model, in single checkpoint format, and then the full huggingface diffusers format across the multiple subdirs.

Current advice, until proven otherwise, is that it is best to train on the fp32 model, in fp32, until it starts giving actual good output. If you are in a hurry, the traditional "mixed precision" training method is to load fp32, tell your training program to use mixed precision (bf16), but then STILL SAVE IN FP32. This is the critical bit.

The bf16 model is here basically just for the curious who want to poke at it on a small-VRAM machine. Do not train on it.

PS: technically, I manually and separately recreated the diffusers-format stuff after I had created the lone single-file versions. I would not think anyone would really care, but I believe in full disclosure of resources.
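For anyone curious what this kind of merge looks like in code, here is a rough sketch. It is not the exact procedure used to build this repo, and the SD1.5 and SDXL VAE repo ids shown are assumptions, not what was necessarily used here.

```python
# Rough illustration (not the exact procedure used for this repo) of grafting
# the SDXL VAE onto an SD1.5 pipeline and saving the result in diffusers format.
# The source repo ids below are assumptions.
import torch
from diffusers import AutoencoderKL, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float32,  # keep fp32 if you intend to train on the result
)
pipe.vae = AutoencoderKL.from_pretrained(
    "stabilityai/sdxl-vae",
    torch_dtype=torch.float32,
)

pipe.save_pretrained("./xlsd-merged")
```

As noted above, the raw result produces wrong colors; the unet has to be retrained before it is actually usable.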
xllsd-alpha0
sd-longclip-ko
This is the old stable diffusion v1.5 model, with the latest "LongCLIP" slapped on it, from https://huggingface.co/zer0int/LongCLIP-KO-LITE-TypoAttack-Attn-ViT-L-14/ Sadly, even though the model will no longer reject token lengths between 77-248 tokens, it does not seem to fully take advantage of them. Retraining is presumably needed.
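The general idea of the swap looks roughly like the sketch below. This is only a hedged illustration: the exact loading details depend on how the LongCLIP repo is packaged, so treat the classes and the tokenizer-length tweak as assumptions rather than the recipe actually used.

```python
# Hedged sketch of the general idea: load SD1.5, then swap in the LongCLIP
# text encoder and tokenizer. Exact loading details depend on how the LongCLIP
# repo is packaged; the classes and model_max_length tweak below are assumptions.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPTextModel, CLIPTokenizer

LONGCLIP = "zer0int/LongCLIP-KO-LITE-TypoAttack-Attn-ViT-L-14"

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float32,
)
pipe.text_encoder = CLIPTextModel.from_pretrained(LONGCLIP, torch_dtype=torch.float32)
pipe.tokenizer = CLIPTokenizer.from_pretrained(LONGCLIP)
pipe.tokenizer.model_max_length = 248  # LongCLIP accepts up to 248 tokens

pipe.save_pretrained("./sd-longclip-ko")
```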
stablediffusion_t5
Similar hacking to our opendiffusionai/stablediffusionxl_t5 model. Why do this? Because coming up with a usable finetuning script for SDXL is turning out to be a pain in the rear. So I thought I might regress to the theoretically easier experiment. Note that it will give you an image of SOMETHING... however, it is sort of random output at this point. The unet needs to be retrained to get things to match up. Here's how random the output looks. (It's equivalent to putting random strings into an SD1.5 prompt, I'd guess.)
sdxlone
stablediffusionxl_t5
The model files are almost identical to our t5sdxl-v0-bf16 model. However, it has had its model_config.json adjusted so that it will work with new code that will be going into the "community" diffusers pipeline area. Alternatively, there is now a "demo.py" script that can use diffusers pipeline styles relatively cleanly, AS-IS!

Precision: note that the unet is, sadly, only bf16 at this time, since we only have 4090s.

Hmm.. in retrospect... perhaps it would be better to use chatpig/t5-v1_1-xl-encoder-gguf instead of the xxl version. That is natively 2048 dim, so no need for a projection layer.
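To make the dimension point concrete, here is a minimal sketch of the kind of projection layer being discussed. The 4096-to-2048 mapping is inferred from the note above (T5-XXL hidden size down to SDXL's cross-attention width) and may not match the exact layer used in this repo.

```python
# Minimal sketch of the projection layer idea: map the T5-XXL encoder's
# 4096-dim hidden states down to the 2048-dim embeddings that the SDXL unet's
# cross-attention expects. A natively 2048-dim encoder would not need this.
import torch
import torch.nn as nn

class T5ToSDXLProjection(nn.Module):
    def __init__(self, t5_dim: int = 4096, sdxl_dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(t5_dim, sdxl_dim)

    def forward(self, t5_hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, 4096) -> (batch, seq_len, 2048)
        return self.proj(t5_hidden_states)
```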
xllsd16-v1
A BF16-only trained combination of SDXL VAE, LongCLIP-L (248 tokens!) and the SD1.5 base.

Note that this is "v1". Ideally, there will be more finetuning on this model, for improved image quality and also improved prompt following, since it uses LongCLIP. I am not sure of the state of this model. It was just uploaded for inter-project testing. This version was created May 19, 2025, by OneTrainer. LION, b24a10, 1e-5. Unsure about how many epochs.

This model was trained specifically with the goal of realism in mind. Ideally you should be able to use it to finetune decent quality real-world models, without needing negative prompts. Additionally, in theory, if your program supports it, you should be able to use up to 248-token-length prompts. However, that is primarily just an architecture-capable thing at this point. More training is required to better take advantage of this.

Usage: Use it like any other SD1.5 model. You should be able to specify "opendiffusionai/xllsd16-v1" as the model id, for programs or pipelines that support "diffusers" format model loading from huggingface. (See the sketch at the end of this card.)

This was trained on a single rtx 4090, using the OneTrainer program. As noted in the metadata here, the starting point was a raw merge of the 3 components: SDXL vae, LongCLIP, and SD1.5 base. The training runs:

- LION, LR 4.5e-05, linear scheduler with 10 epochs, bottom limit of 4.5e-06. Batch size 32, accum 8. Dataset size 200k (CC12M, 2mp, cleaned, with WD tagging)
- LION, LR 3e-06, CONST scheduler, 60 epochs, but I picked off epoch 49. Batch size 64, accum 1. Dataset size 22k (CC12M, 2mp, "woman" subset of above, WD tagging)
- LION, LR 4e-06, CONST scheduler with 10 epochs (stopped at 16,000 steps). Batch size 32, accum 8

I did many, many, many runs at other values. A key strategy I used was to set up "validation" set graphs. (I used 144 images from a completely separate dataset; specifically, some 1mp images from CC12M.) It would always reach a general floor after a few epochs. The specific floor value got lower and lower as I raised the LR, until it hit a particular magic number. After that point, raising the LR would only increase the rate at which it reached the floor. So I experimented with setting the initial LR to the lowest value that reached the floor, then used a linear scheduler to back off slowly. Initially, I was up in the Xe-05 range, and modified the value by one. Then I started playing around with adjusting by 0.1, aka ±1e-06, until I decided I liked 4.5e-05 best.

One interesting thing of note: for the LR ranges that converged on the "floor" value, the ones with smaller LR got there slower, but for my short test runs, they seemed to achieve a very slightly better value in the validation curve.

I wanted to do a cleanup run with the same dataset at low LR. So, I just experimented with a few "lowish" LR values to find the one I liked the most. (I also checked validation graphs.)

Initially I went back to comparing validation graphs, this time with a new dataset. The interesting thing here is that I initially started too low, and the validation curve got WORSE the longer it ran. But also, starting at 4e-05 was too high, and also got initially worse over time. I found it worked at 1e-05 and started tweaking from there. I eventually found 9e-05 to work best (with linear decay?)

However, there was another factor at play. I found that the "step 0" sample was actually BETTER to my eye than the samples for the next epoch. So I really wanted to preserve the good factors of that through the next round. I tried using ema, and that was kind of an improvement.
But not enough. I tried using warmup, even though I don't usually have to do that with LION, and that was also an improvement. But then I decided to do a bit more examination of the dataset, and discovered that this one had not been trimmed nearly as much as the first dataset had been. So after I pruned it some, I could then use LION, CONST with it and get better results than my prior stuff. One interesting thing is that now I took it down to 4e-06, and was getting nice facial details that way.

Generally speaking, training in FP32 precision takes 2-4 times as long as bf16. Additionally, it is possible to fit some things in bf16 that will not fit in fp32. You cannot do batchsize=64 in fp32 on a 4090. It is just barely possible, however, to do b32a8 with LION in bf16. Not with fp32. (The closest clean combination I have found so far is b24a10.)

Generally speaking, my FP32 experiments started looking human a lot faster than bf16. Perhaps in only half the epochs. However, I found it harder to get to where I really wanted to go, and lost patience. After perhaps some more finetuning with bf16, I may return once more to fp32 for attempted maximum quality.
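A minimal usage sketch for the "Usage" section above. The prompt, dtype, and step count are arbitrary, and whether prompts longer than 77 tokens pass through untruncated depends on the pipeline you use.

```python
# Minimal usage sketch: load this repo like any other diffusers-format SD1.5 model.
# Prompt, dtype, and step count are arbitrary choices.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "opendiffusionai/xllsd16-v1",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = pipe(
    "candid photo of a woman reading a book in a sunlit cafe",
    num_inference_steps=30,
).images[0]
image.save("xllsd16-sample.png")
```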
t5-v1_1-base-encoder-only
sd-flow-alpha
What is this?

This is an initial version of the Stable Diffusion 1.5 base model, with its noise scheduler/prediction replaced with FlowMatchEulerDiscrete. This model probably has a bunch of low quality stuff in it. The base SD model might give better output in many regards. The reason this model exists is to allow other people to take advantage of FlowMatch for their own finetunes and other experiments. For that reason, this is a FULL FP32 precision model. But the sample code below loads it as bf16.

The original diffusers module for stable diffusion has a hardcode that stops this from working. I have submitted a patch that was accepted.. but as far as I know, it has not been added to an official release yet. So, "diffusers 0.34.0" won't work with it. That means that to use this, you currently need to either use my tweaked code, imgsample-hacked.py, or manually install the main git version. eg:

pip install git+https://github.com/huggingface/diffusers

You should then be able to do the typical diffusers code. For example:

```python
from diffusers import DiffusionPipeline
import torch

MODEL = "opendiffusionai/sd-flow-alpha"
OUTDIR = "."

pipe = DiffusionPipeline.from_pretrained(
    MODEL,
    use_safetensors=True,
    safety_checker=None,
    requires_safety_checker=False,
    torch_dtype=torch.bfloat16,
)
pipe.enable_sequential_cpu_offload()

prompt = "Some pretty photo of something"
generator = torch.Generator("cpu").manual_seed(0)
images = pipe(prompt, num_inference_steps=30, generator=generator).images

for i, image in enumerate(images):
    fname = f"{OUTDIR}/sample{i}.png"
    print(f"saving to {fname}")
    image.save(fname)
```

> It works fine in comfy, just load the unet with the load diffusion model node and hook it to a
> ModelSamplingSD3 node.
>
> For the clip/vae you can just use the one from the SD1.5 checkpoint.

Doing the training itself did not take that long. Writing my own functional training code, and trying various pathways to find what works, took WEEKS. That, and putting together a 40k clean ALL-SQUARE IMAGE DATASET.

If you wanted to recreate your own from scratch, here are the details from one of my runs. (This only takes a few hours to complete, on a 4090.)

First, download the SD base model in diffusers format, and hand-edit the model_config.json and scheduler/scheduler_config.json files. (I was going to detail it here, but... just copy/look at the files in this repo. I linked them, after all!)

- time blocks only, 1e-5, 350 steps (results are very murky here; that's expected)
- up.0 and up.1, 1e-6, 75 steps
- mid, 1e-6, 60 steps
- up.2, 1e-6, 160 steps
- up.3, 1e-6, 120 steps

Sampling: during the first phase, maybe sample every 50 steps. After the first phase, you'll want to take samples every 10 steps. Make sure you use MULTIPLE samples, and ideally of different types. You should have at least one "single token" prompt, and then a few more complex ones.
xlsd32-beta1
SD1.5 model, with the SDXL vae grafted on, and then retrained to work properly.

Currently only in huggingface/diffusers format. May generate a "checkpoint" model in a later phase.

Phase 1: FP32, b32a8, optimi LION, LR 1e-5 const, for only 150 steps. Model locked except for the following layers: in, out, up.3, down.0. Note that the smaller set of trainable params lets us use b32 on a 4090 here. (See the sketch below for one way of setting that up.)

Phase 2: FP32, b16a16, optimi LION, initial LR 1e-5, linear over 6 epochs (1920 effective steps). Picked step 1800. Phase 2 took around 15 hours, so total time was maybe 16 hours.

In theory, phase 1 wasn't strictly necessary. However, in early retraining, it would most likely hit very large changes to the core model that aren't strictly necessary for vae retraining. So I picked minimal disruption.
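For anyone reproducing the phase 1 layer lock, here is a rough sketch. This is not the exact OneTrainer config used here; interpreting "in, out, up.3, down.0" as conv_in, conv_out, up_blocks.3, and down_blocks.0 on the diffusers unet is an assumption.

```python
# Rough sketch (an assumption, not the exact OneTrainer setup used for this repo):
# freeze the whole SD1.5 unet, then unfreeze only the layer groups named above,
# reading "in, out, up.3, down.0" as conv_in, conv_out, up_blocks.3, down_blocks.0.
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "opendiffusionai/xlsd32-beta1", subfolder="unet"
)

unet.requires_grad_(False)
for module in (unet.conv_in, unet.conv_out, unet.up_blocks[3], unet.down_blocks[0]):
    module.requires_grad_(True)

trainable = [p for p in unet.parameters() if p.requires_grad]
print(f"trainable params: {sum(p.numel() for p in trainable):,}")
```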