# zer0int

## CLIP-GmP-ViT-L-14
🔥 Update SUMMER 2025: 🔥 New and greatly improved version of the model; check out:
- 👉 https://huggingface.co/zer0int/CLIP-KO-LITE-TypoAttack-Attn-Dropout-ViT-L-14
## LongCLIP-GmP-ViT-L-14
🔥 Update SUMMER 2025: 🔥 New and greatly improved version of the model; check out:
- 👉 https://huggingface.co/zer0int/LongCLIP-KO-LITE-TypoAttack-Attn-ViT-L-14

A fine-tune of Long-CLIP; original model: BeichenZhang/LongCLIP-L.
- ❤️ this CLIP? Help feed it if you can. Besides data, CLIP eats time & expensive electricity of DE. TY! 🤗
- Want to feed it yourself? All code for fine-tuning and much more is on my GitHub.

# Note for using Long-CLIP as the Text Encoder with Flux.1, SDXL, Stable Diffusion:
- Get the ComfyUI Long-CLIP nodes here: https://github.com/SeaArtLab/ComfyUI-Long-CLIP
- If you don't use Comfy, it's at least a starting point for reverse engineering & applying it to your code! 🤗

🚨 IMPORTANT NOTE for loading with HuggingFace Transformers: 🚨
- ❌ Error due to mismatch with the 77 tokens defined in the Transformers library.
- Option 1 (simple & worse): Truncate to 77 tokens with `CLIPModel.from_pretrained(model_id, ignore_mismatched_sizes=True)`
- Solution for implementation of 248 tokens / thanks @kk3dmax 🤗
- Obtain a full example script using this solution for Flux.1 inference on my GitHub.

Update 12/AUG/2024: New BEST model, custom loss with label smoothing. Small gain for a diverse, large, good-quality dataset, but big relative gains are possible for an overfit-prone fine-tune (small batch size, 1 GPU, narrow dataset of e.g. 'sneakers', etc.)! Fine-tune your model with the provided code for GmP-Smooth: https://github.com/zer0int/Long-CLIP

The fine-tune has an improved ImageNet/ObjectNet accuracy of 0.89 (original Long-CLIP by the authors: ~0.81). Made possible with Geometric Parametrization (GmP).
- ❕ The model / state_dict I am sharing was converted back to .weight after fine-tuning. Alas, it can be used in the same manner as any state_dict, e.g. with ComfyUI as the SDXL / SD3 Text Encoder via the SeaArtLab/ComfyUI-Long-CLIP custom nodes!
🤗 For details on training, those numbers / the eval, or for just fine-tuning the model yourself, see: https://github.com/zer0int/Long-CLIP

Pre-trained CLIP model by OpenAI; License: MIT.
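Geometric Parametrization (GmP), as described in the linked repo, reparametrizes each linear weight into a radial component (the row norm) and an angular component (the unit direction); after fine-tuning, the pair is collapsed back into a plain `.weight`, which is why the shared state_dict behaves like any other. A minimal numpy sketch of that round-trip (my illustration with hypothetical function names, not the actual training code):

```python
import numpy as np

def to_gmp(weight):
    """Decompose each row of a weight matrix into a radial part r = ||w||
    and an angular (unit-direction) part theta = w / ||w||."""
    r = np.linalg.norm(weight, axis=1, keepdims=True)
    theta = weight / r
    return r, theta

def to_weight(r, theta):
    """Collapse the (r, theta) parametrization back into a plain .weight."""
    return r * theta

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 4))
r, theta = to_gmp(w)
w_back = to_weight(r, theta)
assert np.allclose(w, w_back)                            # round-trip is exact up to float error
assert np.allclose(np.linalg.norm(theta, axis=1), 1.0)   # directions are unit-norm
```

During GmP training, `r` and `theta` are optimized as separate parameters; the conversion back means downstream loaders never need to know GmP was used.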
## LongCLIP-L-Diffusers

## LongCLIP-KO-LITE-TypoAttack-Attn-ViT-L-14

## CLIP-KO-LITE-TypoAttack-Attn-Dropout-ViT-L-14
CLIP-KO: Knocking Out Typographic Attacks in CLIP 💪

Finally, a CLIP without a 'text obsession'! 🤗
- ❤️ this CLIP? Donate if you can / want. TY! 🌱
- CLIP-KO-LITE is slightly less robust, but the Text Encoder won't produce OOD embeddings.
- 👉 Read the paper (PDF) here.
- If you're looking for a Text Encoder, you'll probably want these:
  - 🖼️ Download the Text Encoder for generative AI
  - 🖼️ Download an alternative Text Encoder without adversarial training
- 🤗 Wanna fine-tune yourself? Get the code on my GitHub.
- Included: Code for fine-tuning and all benchmarks / claims (as per the paper)

👉 ALL: Flux.1-dev, NO T5, CLIP only! CFG=5, Heun, fixed seed. Prompts, in order:
1. "bumblewordoooooooo bumblefeelmbles blbeinbumbleghue" (weird CLIP words / text obsession / prompt injection)
2. "a photo of a disintegrimpressionism rag hermit" (one weird CLIP word only)
3. "a photo of a breakfast table with a highly detailed iridescent mandelbrot sitting on a plate that says 'maths for life!'" (note: "mandelbrot" literally means "almond bread" in German)
4. "mathematflake tessswirl psychedsphere zanziflake aluminmathematdeeply mathematzanzirender methylmathematrender detailed mandelmicroscopy mathematfluctucarved iridescent mandelsurface mandeltrippy mandelhallucinpossessed pbr" (complete CLIP gibberish math rant)
5. "spiderman in the moshpit, berlin fashion, wearing punk clothing, they are fighting very angry" (CLIP Interrogator / BLIP)
6.
"epstein mattypixelart crying epilepsy pixelart dannypixelart mattyteeth trippy talladepixelart retarphotomedit hallucincollage gopro destroyed mathematzanzirender mathematgopro" (CLIP rant)

Evaluation Results

| Section | Measurement / Task | Pre-Trained | KO-CLIP | KO-LITE |
|---|---|---|---|---|
| RTA 100 Typographic | Zero-Shot Acc | 0.4330 | 0.7210 🏆 | 0.6260 |
| BLISS / SCAM | NoSCAM | 0.9905 | 0.9897 | 0.9897 |
| | SCAM | 0.4165 | 0.7823 🏆 | 0.7367 |
| | SynthSCAM | 0.3219 | 0.7358 🏆 | 0.6790 |
| ILSVRC2012 Linear Probe | Top-1 | 69.86% | 70.58% | 72.65% |
| | Top-5 | 92.70% | 93.79% | 94.08% |
| ObjectNet (ZS) | Accuracy | 0.846 | 0.898 | 0.9029 🏆 |
| ImageNet 1k (ZS) | acc1 | 0.32696 | 0.43440 | 0.46882 |
| | acc5 | 0.52997 | 0.65297 | 0.68845 🏆 |
| | mean_per_class_recall | 0.32609 | 0.43252 | 0.46695 |
| VoC-2007 (ZS) | mAP | 0.7615 | 0.8579 | 0.8626 🏆 |
| mscoco ZS Retrieval | image_retrieval_recall@5 | 0.2196 | 0.3296 | 0.3385 |
| | text_retrieval_recall@5 | 0.3032 | 0.4396 | 0.4745 |
| xm3600 ZS Retrieval | image_retrieval_recall@5 | 0.30593 | 0.43338 | 0.43700 |
| | text_retrieval_recall@5 | 0.24293 | 0.38884 | 0.42324 |
| SugarCrepe (PT) | Add ATT: acc | 0.77745 | 0.84537 | 0.87427 |
| | Add OBJ: acc | 0.80358 | 0.84093 | 0.84772 |
| | Replace ATT: acc | 0.76903 | 0.81091 | 0.82106 |
| | Replace OBJ: acc | 0.87832 | 0.90617 | 0.91162 |
| | Replace REL: acc | 0.71550 | 0.73470 | 0.74253 |
| | Swap ATT: acc | 0.58558 | 0.62912 | 0.63363 |
| | Swap OBJ: acc | 0.57959 | 0.60816 | 0.62040 |
| Flickr-8k Cross-modal | Euclidean Gap ↓ | 0.8276 | 0.8657 | 0.8182 |
| | JSD ↓ | 0.5200 | 0.2863 | 0.1455 |
| | Wasserstein Distance ↓ | 0.4084 | 0.4166 | 0.3889 |
| | Img-Text Cos Sim (mean) ↑ | 0.2723 | 0.3077 | 0.3300 |
| | Img-Text Cos Sim (std) | 0.0362 | 0.0645 | 0.0690 |
| | Text-Text Cos Sim (mean) | 0.6807 | 0.7243 | 0.7189 |
| | Text-Text Cos Sim (std) | 0.1344 | 0.1377 | 0.1387 |
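The zero-shot numbers above (RTA-100, ObjectNet, etc.) all come from the standard CLIP recipe: embed the image and one caption per class, then softmax the scaled cosine similarities. A toy numpy sketch with dummy embeddings (illustrative only; the real evaluation uses the model's encoders and learned logit scale):

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """CLIP-style zero-shot scores: cosine similarity of the (normalized)
    image embedding against one text embedding per class, softmaxed."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * text_embs @ image_emb
    e = np.exp(logits - logits.max())  # stable softmax
    return e / e.sum()

# Toy embeddings: the image points mostly along class 0's direction.
img = np.array([1.0, 0.2, 0.0])
txt = np.array([[1.0, 0.0, 0.0],   # e.g. "a photo of a cat"
                [0.0, 1.0, 0.0],   # e.g. "a photo of a dog"
                [0.0, 0.0, 1.0]])  # e.g. "a photo of a text"
p = zero_shot_probs(img, txt)
assert abs(p.sum() - 1.0) < 1e-9
assert p.argmax() == 0
```

A typographic attack succeeds when pasted text drags the image embedding toward the "text" class direction; the robustness fine-tune aims to keep `argmax` on the depicted object.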
## LongCLIP-SAE-ViT-L-14
Long-CLIP ViT-L/14 fine-tune: SAE-informed adversarial training (SAE = Sparse Autoencoder).
- All training info & code: github.com/zer0int/CLIP-SAE-finetune
- This Long-CLIP (👉 direct download Text Encoder 👈) is also the best Long-CLIP to use with HunyuanVideo.
- Required: Use with my zer0int/ComfyUI-HunyuanVideo-Nyan node (changes the influence of LLM vs. CLIP; otherwise, the difference is very small).
- ☕ Buy me a coffee

The original CLIP model has a 77-token max input, but only ~20 tokens of effective length. See the original Long-CLIP paper for details.

HunyuanVideo demo: 69 tokens, normal scene:
- Lens: 16mm. Aperture: f/2.8. Color Grading: Blue-green monochrome. Lighting: Low-key with backlit silhouettes. Background: Gothic cathedral at night, stained glass windows breaking. Camera angle: Over the shoulder of a ninja, tracking her mid-air leap as she lands on a rooftop.

52 tokens, OOD (Out-of-Distribution) scene: superior handling for consistency and prompt-following despite the OOD concept.
- In this surreal nightmare documentary, a sizable spider with a human face is peacefully savoring her breakfast at a diner. The spider has a spider body, but a lady's face on the front, and regular human hands at the end of the spider legs.
## CLIP-SAE-ViT-L-14
CLIP ViT-L/14 fine-tune: SAE-informed adversarial training (SAE = Sparse Autoencoder).
- Accuracy, ImageNet/ObjectNet: my GmP: 91% > SAE (this): 89% > OpenAI pre-trained: 84.5%
- But it's fun to use with e.g. Flux.1: get the Text-Encoder-only (TE-only) version ⬇️ and try it!
- This SAE CLIP also has the best results for linear probe @ LAION-AI/CLIP_benchmark (see below).
- This CLIP (direct download) is also the best CLIP to use for HunyuanVideo.
- Required: Use with my zer0int/ComfyUI-HunyuanVideo-Nyan node (changes the influence of LLM vs. CLIP; otherwise, the difference is very small).

Interesting things to try with adversarial robustness:
- Right-click and download the individual images: Image 1 -- Image 2 -- Image 3
- Upload each into zero-shot [hopefully available soon on the right here ->]
- Try labels (class names): a photo of a cat, a photo of a dog, a photo of a text
- Repeat the same with e.g. my GmP models and see what happens. =)

I'm really hoping the HF-format .safetensors conversion didn't mess anything up (it happens!); just in case it did, or if there's no inference API available to use:
- I put a script that does the same thing (on the unconverted model) in my GitHub repo. Plus, you can just reproduce the fine-tune yourself, as that code is also available! 🤗
- 👉 All training info & code: github.com/zer0int/CLIP-SAE-finetune
- ☕ Buy me a coffee
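For context on the linear-probe claim: a linear probe trains only a linear classifier on frozen CLIP image features (CLIP_benchmark typically uses logistic regression; the closed-form ridge-regression variant below is my simplification so the sketch runs on synthetic "features"):

```python
import numpy as np

def fit_linear_probe(feats, labels, n_classes, lam=1e-2):
    """Ridge-regression linear probe on frozen features:
    solve (X^T X + lam*I) W = X^T Y for one-hot targets Y."""
    Y = np.eye(n_classes)[labels]
    d = feats.shape[1]
    W = np.linalg.solve(feats.T @ feats + lam * np.eye(d), feats.T @ Y)
    return W

def probe_accuracy(W, feats, labels):
    preds = (feats @ W).argmax(axis=1)
    return (preds == labels).mean()

# Toy "frozen features": two linearly separable Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 16)), rng.normal(2, 1, (50, 16))])
y = np.array([0] * 50 + [1] * 50)
W = fit_linear_probe(X, y, n_classes=2)
acc = probe_accuracy(W, X, y)
assert acc > 0.95
```

The point of the probe is that the backbone stays frozen, so the score reflects the quality of the learned features rather than any task-specific fine-tuning.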
## LongCLIP-Registers-Gated_MLP-ViT-L-14
Long-CLIP Needs Registers. And Gated MLPs. And +20M params. Fixing Long-CLIP's modality gap via happy little accidents.
- ❤️ this CLIP? Donate if you can / want. TY! (and enjoy, either way!) 🤗
- You can now load the model with HF 'transformers'. ✅
- Unfortunately, AutoModel produced nonsense / I couldn't get "trust_remote_code=True" to work properly (using that was suggested in response to my pull request on GitHub). 💡 Alas, you will need to:
  - Download the 'hf_model' folder
  - Use it to manually import the correct (my custom) CLIPModel code required as per the config.json
- Minimal example code: I just want a new Text Encoder...
  - ...for my Text-to-Image (Text-to-Video) AI! \o/
  - Here you go: 👉 direct download 👈
  - The model has a 248-token max input (instead of CLIP-L's 77).
  - Replace your CLIP-L with this Long-CLIP (e.g. ComfyUI natively supports Long-CLIP).
  - Enjoy! (You don't need to do anything else; they're just normal CLIP Text Encoders!)

⚠️ Full model (currently) not HuggingFace Transformers compatible. ⚠️
- The ViT (Vision Encoder) is basically a big mutant. Alas:
- The full model .safetensors have the 'import clip' (OpenAI) structure inside.
- Alas, currently it runs with 'import clip' code (I'm working on an HF implementation, though!).
- For more info, see also the not-Long CLIP-L version (77 tokens): zer0int/CLIP-Registers-Gated_MLP-ViT-L-14

❗ Info / Using the full model:
- Models available: FULL, TE-only (Text Encoder only), LongCLIP-L.safetensors
- LongCLIP-L is the original model from BeichenZhang/LongCLIP-L, as .safetensors. It's just so you don't have to download a danger-pickle. :)
- git clone my repo: github.com/zer0int/CLIP-fine-tune-registers-gated
- Put the FULL model and the LongCLIP-L from this HF in a 'models' subfolder
- You're all set! I made an entire playground for the models (+ safetensors loading)! 🎉
- PS: All code for fine-tuning it yourself is also included on my Git!
| Metric | LongCLIP-L Original | Long-ViT-L/14 Register-Tokens, X-GATED |
|---|---|---|
| VoC-2007 multilabel, mAP | 0.8221 | 0.8403 |
| MSCOCO Image Retrieval Recall@5 | 0.2761 | 0.3663 |
| MSCOCO Text Retrieval Recall@5 | 0.3314 | 0.5398 |
| CIFAR10 Linear Probe Acc@1 | 0.9809 | 0.9812 |
| CIFAR10 Linear Probe Acc@5 | 0.9998 | 0.9997 |
| CIFAR10 LP Mean Recall | 0.9809 | 0.9812 |
| ImageNet/ObjectNet MVT (Zero-Shot) | 0.8103 | 0.8724 |
| ILSVRC2012 LP, Top-1 | 66.95% | 66.84% |
| ILSVRC2012 LP, Top-5 | 91.87% | 91.70% |
| Modality Gap (Euclidean) | 1.0672 ⚠️ | 0.5781 ✅ |
| Img-Text Cosine (Mean) | 0.2666 | 0.4711 |
| Img-Text Cosine (Std Dev) | 0.0191 | 0.0726 |
| Txt-Text Cosine (Mean) | 0.8421 | 0.7046 |
| Txt-Text Cosine (Std Dev) | 0.0707 | 0.1498 |
| Jensen-Shannon Divergence (JSD) | 0.3847 | 0.1894 |
| Wasserstein Distance | 0.5755 | 0.2335 |
## CLIP-KO-ViT-L-14-336-TypoAttack

## CLIP-Registers-Gated_MLP-ViT-L-14
CLIP Needs Registers. And Gated MLPs. And +20M params. Fixing CLIP's modality gap via happy little accidents.
- ❤️ this CLIP? Donate if you can / want. TY! (and enjoy, either way!) 🤗
- You can now load the model with HF 'transformers'. ✅
- Unfortunately, AutoModel produced nonsense / I couldn't get "trust_remote_code=True" to work properly (using that was suggested in response to my pull request on GitHub). 💡 Alas, you will need to:
  - Download the 'hf_model' folder
  - Use it to manually import the correct (my custom) CLIPModel code required as per the config.json
- Minimal example code:
  - ...for my Text-to-Image (Text-to-Video) AI! \o/
  - I recommend this one, the 'sweet spot' ckpt12: 👉 direct download 👈
  - Even lower modality gap (text 'more alike' to image, but less accurate): direct download
  - Enjoy! (You don't need to do anything else; they're just normal CLIP Text Encoders!)

⚠️ Full model (currently) not HuggingFace Transformers compatible. ⚠️
- The ViT (Vision Encoder) is basically a big mutant. Alas:
- The full model .safetensors have the 'import clip' (OpenAI) structure inside. It's just so you don't need to load any 'danger pickles'. :)
- Alas, currently it runs with 'import clip' code (I'm working on an HF implementation, though!).
- However, for now, I made an entire playground for the CLIP models (+ safetensors loading)! 🎉
- 👉 https://github.com/zer0int/CLIP-fine-tune-registers-gated ✨
- All code for fine-tuning it yourself is also included on my Git! 🤗

Wait, but what is this?!
- The Vision Transformer has +4 tokens (Register Tokens).
- ...And gated ReLU MLPs inside each layer + a final Fusion MLP.
- +20M parameters (~430M -> now: ~450M)
- It's now a CLIP with an extremely low modality gap.
- See the table below for details.
- And if you want to know more about modality gaps & all the details, please check out the GitHub!
Attention Heatmap, pre-trained OpenAI CLIP ViT-L/14:

Text-To-Image examples, Flux.1-dev, pure CLIP (no T5) guidance:

| Task / Dataset | Metric | ViT-L/14 OpenAI (Pre-trained) | X-GATED (ckpt20 xtreme) | X-GATED (ckpt12 balanced) | X-GATED (ckpt12 balanced, ablated) |
|---|---|---|---|---|---|
| VoC-2007 (Multilabel) | mAP | 0.7615 | 0.8140 | 0.8471 | 0.8247 |
| MSCOCO Retrieval | Image Recall@5 | 0.2194 | 0.3565 | 0.3532 | 0.3349 |
| | Text Recall@5 | 0.3034 | 0.5425 | 0.5278 | 0.5086 |
| Linear Probe CIFAR-10 | Acc@1 | 0.9535 | 0.9813 | 0.9813 | 0.9811 |
| | Acc@5 | 0.9966 | 0.9997 | 0.9997 | 0.9997 |
| | Mean Class Recall | 0.9535 | 0.9813 | 0.9813 | 0.9811 |
| MVT ImageNet/ObjectNet (Zero-Shot) | Accuracy | 0.8453 | 0.8686 | 0.8830 | 0.8815 |
| Linear Probe ILSVRC2012 | Top-1 | 69.86% | 66.43% | 67.10% | 68.99% |
| | Top-5 | 92.70% | 91.52% | 91.83% | 92.64% |
| Modality Gap Metrics | Euclidean Gap ↓ | 0.8276 | 0.4740 | 0.5395 | 0.7486 |
| | JSD ↓ | 0.5200 | 0.1601 | 0.1303 | 0.3310 |
| | Wasserstein Distance ↓ | 0.4084 | 0.1742 | 0.2102 | 0.3262 |
| | Img-Text Cos Sim (mean) ↑ | 0.2723 | 0.4926 | 0.4794 | 0.3634 |
| | Img-Text Cos Sim (std) | 0.0362 | 0.0814 | 0.0758 | 0.0537 |
| | Text-Text Cos Sim (mean) | 0.6807 | 0.6657 | 0.6896 | 0.6896 |
| | Text-Text Cos Sim (std) | 0.1344 | 0.1671 | 0.1535 | 0.1535 |

Bolded values represent the best performance for each metric.
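The modality-gap rows above measure how far apart the image and text embedding clusters sit on the unit sphere. A numpy sketch of the Euclidean-gap and pairwise-cosine metrics on synthetic embeddings (my illustration of the metric definitions, not the eval code):

```python
import numpy as np

def modality_gap(img_embs, txt_embs):
    """Euclidean modality gap: distance between the centroids of the
    L2-normalized image and text embedding clusters."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    return np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))

def mean_pair_cos(img_embs, txt_embs):
    """Mean cosine similarity of matched image-text pairs."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    return (img * txt).sum(axis=1).mean()

rng = np.random.default_rng(0)
base = rng.standard_normal((32, 8))
# Shifting the two modalities by opposite constant offsets mimics the
# "two cones" separation that creates the modality gap.
img, txt = base + 3.0, base - 3.0
assert modality_gap(img, txt) > modality_gap(base, base + 0.1)
assert abs(mean_pair_cos(img, img) - 1.0) < 1e-9
```

A lower gap (and a higher matched-pair cosine) means text embeddings land closer to their images, which is exactly what the X-GATED checkpoints improve.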
## CLIP-KO-TypoAttack-Attn-Dropout-ViT-L-14

## CLIP-KO-ViT-B-32-TypoAttack

## CLIP-Regression-ViT-L-14

## CLIP-KO-ViT-B-16-TypoAttack

## clip-vit-large-patch14-336-text-encoder
This is NOT a fine-tune. This is the original OpenAI CLIP ViT-L/14@336 Text Encoder, converted to HuggingFace 'transformers' format. All credits to the original authors.

Why?
- It's a normal "CLIP-L" Text Encoder and can be used as such.
- See below (Flux.1-dev, CLIP-only guidance, CFG 3.5, Heun).
- For my fine-tuned KO-CLIP ViT-L/14@336 -> see here
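A text-encoder-only checkpoint like this one can be derived from a full CLIP state dict by keeping only the text-tower keys. A pure-Python sketch of the idea (key names follow the HF 'transformers' CLIPModel layout; the card's actual conversion script is not shown here, so treat this as an assumption about the process):

```python
def extract_text_encoder(state_dict, prefixes=("text_model.", "text_projection")):
    """Keep only the text-tower entries of a full CLIP state dict."""
    return {k: v for k, v in state_dict.items() if k.startswith(prefixes)}

# Dummy full-model state dict with HF-transformers-style key names.
full = {
    "vision_model.embeddings.patch_embedding.weight": None,
    "text_model.embeddings.token_embedding.weight": None,
    "text_model.encoder.layers.0.self_attn.q_proj.weight": None,
    "text_projection.weight": None,
    "logit_scale": None,
}
te_only = extract_text_encoder(full)
assert set(te_only) == {
    "text_model.embeddings.token_embedding.weight",
    "text_model.encoder.layers.0.self_attn.q_proj.weight",
    "text_projection.weight",
}
```

The resulting file is much smaller than the full model and is all a text-to-image pipeline needs when it only consumes CLIP text embeddings.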