zer0int

15 models

CLIP-GmP-ViT-L-14

🔥 Update SUMMER 2025: 🔥 🤖 New and greatly improved version of the model, check out: 🌑 https://huggingface.co/zer0int/CLIP-KO-LITE-TypoAttack-Attn-Dropout-ViT-L-14

license:mit
7,787
509

LongCLIP-GmP-ViT-L-14

🔥 Update SUMMER 2025: 🔥 🤖 New and greatly improved version of the model, check out: 🌑 https://huggingface.co/zer0int/LongCLIP-KO-LITE-TypoAttack-Attn-ViT-L-14

A fine-tune of Long-CLIP; original model: BeichenZhang/LongCLIP-L.

- ❤️ this CLIP? Help feed it if you can. Besides data, CLIP eats time & expensive electricity of DE. TY! 🤗
- Want to feed it yourself? All code for fine-tuning and much more is on my GitHub.

# Note for using Long-CLIP as the Text Encoder with Flux.1, SDXL, Stable Diffusion:

- Get the ComfyUI Long-CLIP nodes here: https://github.com/SeaArtLab/ComfyUI-Long-CLIP
- If you don't use Comfy, it's at least a starting point for reverse engineering & applying it to your code! 🤗

🚨 IMPORTANT NOTE for loading with HuggingFace Transformers: 👀 ❌ You'll get an error due to a mismatch with the 77 tokens defined in the Transformers library. 👇

- Option 1 (simple & worse): Truncate to 77 tokens: `CLIPModel.from_pretrained(model_id, ignore_mismatched_sizes=True)`

### Solution for a proper implementation of 248 tokens / thanks @kk3dmax 🤗

- A full example script using this solution for Flux.1 inference is on my GitHub.

Update 12/AUG/2024: New BEST model, custom loss with label smoothing. Small gain for a diverse, large, good-quality dataset, but big relative gains are possible for an overfit-prone fine-tune (small batch size, 1 GPU, narrow dataset of e.g. 'sneakers', etc.)! Fine-tune your model with the provided code for GmP-Smooth: https://github.com/zer0int/Long-CLIP

The fine-tune has an improved ImageNet/ObjectNet accuracy of 0.89 (original Long-CLIP by the authors: ~0.81). Made possible with Geometric Parametrization (GmP).

✅ The model / state_dict I am sharing was converted back to .weight after fine-tuning, so it can be used in the same manner as any state_dict, e.g. with ComfyUI as the SDXL / SD3 Text Encoder via the SeaArtLab/ComfyUI-Long-CLIP custom nodes!

🤗 For details on training, those numbers / the eval, or for just fine-tuning the model yourself, see: https://github.com/zer0int/Long-CLIP

Pre-trained CLIP model by OpenAI, License: MIT License
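For background on the 77-vs-248 mismatch: Long-CLIP obtains its 248 positions by stretching the original 77-entry position-embedding table. Below is a minimal pure-Python sketch of such linear interpolation on a toy table; it is illustrative only, and both Long-CLIP's actual "knowledge-preserved stretching" and the @kk3dmax loading fix referenced above may differ in detail.

```python
# Illustrative only: stretch a 77-row position-embedding table to 248 rows
# by linear interpolation. Toy data; not the exact script from the card.

def interpolate_positions(table, new_len):
    """table: list of embedding rows (lists of floats); returns new_len rows."""
    old_len = len(table)
    dim = len(table[0])
    out = []
    for i in range(new_len):
        # Map new index i onto the old [0, old_len - 1] range.
        pos = i * (old_len - 1) / (new_len - 1)
        lo, hi = int(pos), min(int(pos) + 1, old_len - 1)
        frac = pos - lo
        out.append([(1 - frac) * table[lo][d] + frac * table[hi][d]
                    for d in range(dim)])
    return out

old = [[float(i), float(-i)] for i in range(77)]  # toy 77x2 "embedding table"
new = interpolate_positions(old, 248)
print(len(new), len(new[0]))  # 248 2
```

The endpoints of the stretched table coincide with the original first and last rows, so the token-position semantics at the boundaries are preserved.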

3,113
79

LongCLIP-L-Diffusers

license:mit
2,641
6

LongCLIP-KO-LITE-TypoAttack-Attn-ViT-L-14

license:mit
1,389
16

CLIP-KO-LITE-TypoAttack-Attn-Dropout-ViT-L-14

CLIP-KO: Knocking Out Typographic Attacks in CLIP 💪🤖

Finally, a CLIP without a 'text obsession'! 🤗

- ❤️ this CLIP? Donate if you can / want. TY! 🌱
- CLIP-KO-LITE is slightly less robust, but the Text Encoder won't produce OOD embeddings.
- 📝 Read the paper (PDF) here.
- If you're looking for a Text Encoder, you'll probably want these:
  - 🖼️ Download the Text Encoder for generative AI
  - 🖼️ Download an alternative Text Encoder without adversarial training
- 🤓 Wanna fine-tune yourself? Get the code on my GitHub.
- Included: Code for fine-tuning and all benchmarks / claims (as per the paper)

ALL: Flux.1-dev, NO T5 (CLIP only!). CFG=5, Heun, fixed seed. Prompts, in order:

1. "bumblewordoooooooo bumblefeelmbles blbeinbumbleghue" (weird CLIP words / text obsession / prompt injection)
2. "a photo of a disintegrimpressionism rag hermit" (one weird CLIP word only)
3. "a photo of a breakfast table with a highly detailed iridescent mandelbrot sitting on a plate that says 'maths for life!'" (note: "mandelbrot" literally means "almond bread" in German)
4. "mathematflake tessswirl psychedsphere zanziflake aluminmathematdeeply mathematzanzirender methylmathematrender detailed mandelmicroscopy mathematfluctucarved iridescent mandelsurface mandeltrippy mandelhallucinpossessed pbr" (complete CLIP gibberish math rant)
5. "spiderman in the moshpit, berlin fashion, wearing punk clothing, they are fighting very angry" (CLIP Interrogator / BLIP)
6. "epstein mattypixelart crying epilepsy pixelart dannypixelart mattyteeth trippy talladepixelart retarphotomedit hallucincollage gopro destroyed mathematzanzirender mathematgopro" (CLIP rant)

Evaluation Results

| Section | Measurement / Task | Pre-Trained | KO-CLIP | KO-LITE |
|---|---|---|---|---|
| RTA-100 Typographic | Zero-Shot Acc | 0.4330 | 0.7210 🎖️ | 0.6260 |
| BLISS / SCAM | NoSCAM | 0.9905 | 0.9897 | 0.9897 |
| | SCAM | 0.4165 | 0.7823 🎖️ | 0.7367 |
| | SynthSCAM | 0.3219 | 0.7358 🎖️ | 0.6790 |
| ILSVRC2012 Linear Probe | Top-1 | 69.86% | 70.58% | 72.65% |
| | Top-5 | 92.70% | 93.79% | 94.08% |
| ObjectNet (ZS) | Accuracy | 0.846 | 0.898 | 0.9029 🎖️ |
| ImageNet-1k (ZS) | acc1 | 0.32696 | 0.43440 | 0.46882 |
| | acc5 | 0.52997 | 0.65297 | 0.68845 🎖️ |
| | mean_per_class_recall | 0.32609 | 0.43252 | 0.46695 |
| VoC-2007 (ZS) | mAP | 0.7615 | 0.8579 | 0.8626 🎖️ |
| MSCOCO ZS Retrieval | image_retrieval_recall@5 | 0.2196 | 0.3296 | 0.3385 |
| | text_retrieval_recall@5 | 0.3032 | 0.4396 | 0.4745 |
| XM3600 ZS Retrieval | image_retrieval_recall@5 | 0.30593 | 0.43338 | 0.43700 |
| | text_retrieval_recall@5 | 0.24293 | 0.38884 | 0.42324 |
| SugarCrepe (PT) | Add ATT: acc | 0.77745 | 0.84537 | 0.87427 |
| | Add OBJ: acc | 0.80358 | 0.84093 | 0.84772 |
| | Replace ATT: acc | 0.76903 | 0.81091 | 0.82106 |
| | Replace OBJ: acc | 0.87832 | 0.90617 | 0.91162 |
| | Replace REL: acc | 0.71550 | 0.73470 | 0.74253 |
| | Swap ATT: acc | 0.58558 | 0.62912 | 0.63363 |
| | Swap OBJ: acc | 0.57959 | 0.60816 | 0.62040 |
| Flickr-8k Cross-modal | Euclidean Gap ↓ | 0.8276 | 0.8657 | 0.8182 |
| | JSD ↓ | 0.5200 | 0.2863 | 0.1455 |
| | Wasserstein Distance ↓ | 0.4084 | 0.4166 | 0.3889 |
| | Img-Text Cos Sim (mean) ↑ | 0.2723 | 0.3077 | 0.3300 |
| | Img-Text Cos Sim (std) | 0.0362 | 0.0645 | 0.0690 |
| | Text-Text Cos Sim (mean) | 0.6807 | 0.7243 | 0.7189 |
| | Text-Text Cos Sim (std) | 0.1344 | 0.1377 | 0.1387 |
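For context on how a zero-shot number like the RTA-100 typographic accuracy above is produced: each image embedding is assigned the class whose text embedding has the highest cosine similarity, and accuracy is the fraction of correct assignments. A toy sketch with made-up vectors (not real CLIP embeddings):

```python
# Toy zero-shot classification: cosine similarity between image embeddings
# and class-text embeddings, argmax prediction, accuracy over the set.
from math import sqrt

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def zero_shot_accuracy(image_embs, labels, class_embs):
    correct = 0
    for emb, label in zip(image_embs, labels):
        pred = max(range(len(class_embs)), key=lambda c: cos(emb, class_embs[c]))
        correct += pred == label
    return correct / len(labels)

class_embs = [[1.0, 0.0], [0.0, 1.0]]            # toy "cat" / "dog" text embeddings
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.6]]
labels     = [0, 1, 1]                            # ground truth
print(zero_shot_accuracy(image_embs, labels, class_embs))  # 2 of 3 correct
```

A typographic-attack benchmark simply runs this on images with misleading text rendered on them; a robust model keeps predicting the depicted object rather than the written word.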

license:mit
29
24

LongCLIP-SAE-ViT-L-14

Long-CLIP ViT-L/14 fine-tune: SAE-informed adversarial training (SAE = Sparse Autoencoder).

- All training info & code: github.com/zer0int/CLIP-SAE-finetune
- This Long-CLIP, 👉 direct download Text Encoder 👈, is also the best Long-CLIP to use with HunyuanVideo.
- Required: Use with my zer0int/ComfyUI-HunyuanVideo-Nyan node (changes the influence of LLM vs. CLIP; otherwise, the difference is very small).
- ☕ Buy me a coffee

The original CLIP model has a 77-token max input, but only ~20 tokens of effective length. See the original Long-CLIP paper for details.

HunyuanVideo demo: 69 tokens, normal scene:

- Lens: 16mm. Aperture: f/2.8. Color Grading: Blue-green monochrome. Lighting: Low-key with backlit silhouettes. Background: Gothic cathedral at night, stained glass windows breaking. Camera angle: Over the shoulder of a ninja, tracking her mid-air leap as she lands on a rooftop.

52 tokens, OOD (Out-of-Distribution) scene: superior handling of consistency and prompt-following despite the OOD concept.

- In this surreal nightmare documentary, a sizable spider with a human face is peacefully savoring her breakfast at a diner. The spider has a spider body, but a lady's face on the front, and regular human hands at the end of the spider legs.

11
26

CLIP-SAE-ViT-L-14

CLIP ViT-L/14 fine-tune: SAE-informed adversarial training (SAE = Sparse Autoencoder).

- Accuracy, ImageNet/ObjectNet: my GmP: 91% > SAE (this): 89% > OpenAI pre-trained: 84.5%
- But it's fun to use with e.g. Flux.1: get the Text-Encoder-only (TE) version ⬇️ and try it!
- And this SAE CLIP has the best results for a linear probe @ LAION-AI/CLIP_benchmark (see below)
- This CLIP, direct download, is also the best CLIP to use for HunyuanVideo.
- Required: Use with my zer0int/ComfyUI-HunyuanVideo-Nyan node (changes the influence of LLM vs. CLIP; otherwise, the difference is very small).
- Interesting things to try with adversarial robustness: Right-click and download the individual images: Image 1 -- Image 2 -- Image 3
- Upload each into zero-shot [hopefully available soon on the right here ->]
- Try labels (class names): a photo of a cat, a photo of a dog, a photo of a text
- Repeat the same with e.g. my GmP models and see what happens. =)
- I'm really hoping the HF-format .safetensors conversion didn't mess anything up (it happens!); just in case it did, or if there's no inference API available to use:
- I put a script that does the same thing (on the not-converted model) in my GitHub repo. Plus, you can just reproduce the fine-tune yourself, as that code is also available! 🤗
- 👉 All training info & code: github.com/zer0int/CLIP-SAE-finetune
- ☕ Buy me a coffee

license:mit
10
32

LongCLIP-Registers-Gated_MLP-ViT-L-14

Long-CLIP Needs Registers. And Gated MLPs. And +20M params. Fixing Long-CLIP's modality gap via happy little accidents.

- ❤️ this CLIP? Donate if you can / want. TY! (and enjoy, either way!) 🤗
- You can now load the model with HF 'transformers'. ✅
- Unfortunately, AutoModel produced nonsense / I couldn't get `trust_remote_code=True` to work properly (using that was suggested in response to my pull request on GitHub). 💡 Alas, you will need to:
  - Download the 'hfmodel' folder
  - Use it to manually import the correct (my custom) CLIPModel code required as per the config.json
- Minimal example code:

I just want a new Text Encoder... for my Text-to-Image (Text-to-Video) AI! \o/

- Here you go: 👉 direct download 👈
- The model has a 248-token max input (instead of CLIP-L's 77).
- Replace your CLIP-L with this Long-CLIP (e.g. ComfyUI natively supports Long-CLIP).
- Enjoy! (You don't need to do anything else; they're just normal CLIP Text Encoders!)

⚠️ Full model (currently) not HuggingFace Transformers compatible. ⚠️

- The ViT (Vision Encoder) is basically a big mutant. Alas:
- The full model .safetensors have the 'import clip' (OpenAI) structure inside.
- Alas, currently it runs with 'import clip' code (I'm working on an HF implementation, though!).
- For more info, see also (the not-Long CLIP-L version, 77 tokens): zer0int/CLIP-Registers-Gated_MLP-ViT-L-14

✅ Info / Using the full model

- Models available: FULL, TE-only (Text Encoder only), LongCLIP-L.safetensors
- LongCLIP-L is the original model from BeichenZhang/LongCLIP-L. It's just so you don't have to download a danger-pickle. :)
- git clone my repo: github.com/zer0int/CLIP-fine-tune-registers-gated
- Put the FULL model and the LongCLIP-L from this HF in a 'models' subfolder
- You're all set! I made an entire playground for the models (+ safetensors loading)! 🎉
- PS: All code for fine-tuning it yourself is also included on my Git! 🤗

| Metric | LongCLIP-L Original | Long-ViT-L/14 Register-Tokens, X-GATED |
|---|---|---|
| VoC-2007 multilabel, mAP | 0.8221 | 0.8403 |
| MSCOCO Image Retrieval Recall@5 | 0.2761 | 0.3663 |
| MSCOCO Text Retrieval Recall@5 | 0.3314 | 0.5398 |
| CIFAR10 Linear Probe Acc@1 | 0.9809 | 0.9812 |
| CIFAR10 Linear Probe Acc@5 | 0.9998 | 0.9997 |
| CIFAR10 LP Mean Recall | 0.9809 | 0.9812 |
| ImageNet/ObjectNet MVT (Zero-Shot) | 0.8103 | 0.8724 |
| ILSVRC2012 LP, Top-1 | 66.95% | 66.84% |
| ILSVRC2012 LP, Top-5 | 91.87% | 91.70% |
| Modality Gap (Euclidean) | 1.0672 ⚠️ | 0.5781 ✅ |
| Img-Text Cosine (Mean) | 0.2666 | 0.4711 |
| Img-Text Cosine (Std Dev) | 0.0191 | 0.0726 |
| Txt-Text Cosine (Mean) | 0.8421 | 0.7046 |
| Txt-Text Cosine (Std Dev) | 0.0707 | 0.1498 |
| Jensen-Shannon Divergence (JSD) | 0.3847 | 0.1894 |
| Wasserstein Distance | 0.5755 | 0.2335 |

license:mit
4
22

CLIP-KO-ViT-L-14-336-TypoAttack

license:mit
4
12

CLIP-Registers-Gated_MLP-ViT-L-14

CLIP Needs Registers. And Gated MLPs. And +20M params. Fixing CLIP's modality gap via happy little accidents.

- ❤️ this CLIP? Donate if you can / want. TY! (and enjoy, either way!) 🤗
- You can now load the model with HF 'transformers'. ✅
- Unfortunately, AutoModel produced nonsense / I couldn't get `trust_remote_code=True` to work properly (using that was suggested in response to my pull request on GitHub). 💡 Alas, you will need to:
  - Download the 'hfmodel' folder
  - Use it to manually import the correct (my custom) CLIPModel code required as per the config.json
- Minimal example code:

...for my Text-to-Image (Text-to-Video) AI! \o/

- I recommend this one, the 'sweet spot' ckpt12: 👉 direct download 👈
- Even lower modality gap (text 'more alike' to image, but less accurate): direct download
- Enjoy! (You don't need to do anything else; they're just normal CLIP Text Encoders!)

⚠️ Full model (currently) not HuggingFace Transformers compatible. ⚠️

- The ViT (Vision Encoder) is basically a big mutant. Alas:
- The full model .safetensors have the 'import clip' (OpenAI) structure inside. It's just so you don't need to load any 'danger pickles'. :)
- Alas, currently it runs with 'import clip' code (I'm working on an HF implementation, though!).
- However, for now, I made an entire playground for the CLIP models (+ safetensors loading)! 🎉
- 🌟 https://github.com/zer0int/CLIP-fine-tune-registers-gated ✨
- All code for fine-tuning it yourself is also included on my Git! 🤗

Wait, but what is this?!

- The Vision Transformer has +4 tokens (Register Tokens).
- ...And gated ReLU MLPs inside each layer + a final Fusion MLP.
- +20M parameters (~430M -> now ~450M)
- It's now a CLIP with an extremely low modality gap. See the table below for details.
- And if you want to know more about modality gaps & all the details, please check out the GitHub!

Attention Heatmap, pre-trained OpenAI CLIP ViT-L/14: Text-To-Image examples, Flux.1-dev, pure CLIP (no T5) guidance:

| Task / Dataset | Metric | ViT-L/14 OpenAI (Pre-trained) | X-GATED (ckpt20 xtreme) | X-GATED (ckpt12 balanced) | X-GATED (ckpt12 balanced, ablated) |
|---|---|---|---|---|---|
| VoC-2007 (Multilabel) | mAP | 0.7615 | 0.8140 | 0.8471 | 0.8247 |
| MSCOCO Retrieval | Image Recall@5 | 0.2194 | 0.3565 | 0.3532 | 0.3349 |
| | Text Recall@5 | 0.3034 | 0.5425 | 0.5278 | 0.5086 |
| Linear Probe CIFAR-10 | Acc@1 | 0.9535 | 0.9813 | 0.9813 | 0.9811 |
| | Acc@5 | 0.9966 | 0.9997 | 0.9997 | 0.9997 |
| | Mean Class Recall | 0.9535 | 0.9813 | 0.9813 | 0.9811 |
| MVT ImageNet/ObjectNet (Zero-Shot) | Accuracy | 0.8453 | 0.8686 | 0.8830 | 0.8815 |
| Linear Probe ILSVRC2012 | Top-1 | 69.86% | 66.43% | 67.10% | 68.99% |
| | Top-5 | 92.70% | 91.52% | 91.83% | 92.64% |
| Modality Gap Metrics | Euclidean Gap ↓ | 0.8276 | 0.4740 | 0.5395 | 0.7486 |
| | JSD ↓ | 0.5200 | 0.1601 | 0.1303 | 0.3310 |
| | Wasserstein Distance ↓ | 0.4084 | 0.1742 | 0.2102 | 0.3262 |
| | Img-Text Cos Sim (mean) ↑ | 0.2723 | 0.4926 | 0.4794 | 0.3634 |
| | Img-Text Cos Sim (std) | 0.0362 | 0.0814 | 0.0758 | 0.0537 |
| | Text-Text Cos Sim (mean) | 0.6807 | 0.6657 | 0.6896 | 0.6896 |
| | Text-Text Cos Sim (std) | 0.1344 | 0.1671 | 0.1535 | 0.1535 |

Bolded values represent the best performance for each metric.
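The modality-gap rows in the table follow the common definition: the Euclidean distance between the centroid of the L2-normalized image embeddings and the centroid of the L2-normalized text embeddings. A toy sketch of that computation (made-up vectors; the card's actual evaluation code lives on the linked GitHub):

```python
# Toy "Euclidean Gap" modality-gap computation: distance between the
# centroids of normalized image and text embeddings. Not real CLIP output.
from math import sqrt

def normalize(v):
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def euclidean_gap(image_embs, text_embs):
    ci = centroid([normalize(v) for v in image_embs])
    ct = centroid([normalize(v) for v in text_embs])
    return sqrt(sum((a - b) ** 2 for a, b in zip(ci, ct)))

imgs  = [[1.0, 0.2], [0.9, 0.3]]   # toy image embeddings, clustered together
texts = [[0.2, 1.0], [0.1, 0.9]]   # toy text embeddings, a separate cluster
print(round(euclidean_gap(imgs, texts), 4))
```

A large gap means image and text embeddings occupy separate cones of the embedding space; the register/gated-MLP fine-tune above reduces it (0.8276 down to ~0.47-0.54 in the table).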

license:mit
3
45

CLIP-KO-TypoAttack-Attn-Dropout-ViT-L-14

license:mit
3
4

CLIP-KO-ViT-B-32-TypoAttack

license:mit
3
1

CLIP-Regression-ViT-L-14

license:mit
1
0

CLIP-KO-ViT-B-16-TypoAttack

license:mit
1
0

clip-vit-large-patch14-336-text-encoder

This is NOT a fine-tune. This is the original OpenAI CLIP ViT-L/14@336 Text Encoder, converted to HuggingFace 'transformers' format. All credits to the original authors. Why?

- It's a normal "CLIP-L" Text Encoder and can be used as such.
- See below (Flux.1-dev, CLIP-only guidance, CFG 3.5, Heun).
- For my fine-tuned KO-CLIP ViT-L/14@336 -> see here
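A note on how a CLIP text encoder in HF 'transformers' derives its sentence embedding: the pooled output is the hidden state at the EOS token, which (in the original CLIP vocabulary, where EOS has the highest token id) is located via argmax over the input ids. A toy sketch of just that pooling step, with made-up ids and hidden states:

```python
# Toy sketch of CLIP's EOS-token pooling: pick the per-token hidden state
# at the position of the EOS token (highest id in the original CLIP vocab).

def pooled_output(input_ids, hidden_states):
    """input_ids: list[int]; hidden_states: one vector per token position."""
    eos_pos = max(range(len(input_ids)), key=lambda i: input_ids[i])
    return hidden_states[eos_pos]

ids    = [49406, 320, 1125, 49407, 0, 0]      # BOS, "a", "photo", EOS, pad, pad
hidden = [[float(i)] * 2 for i in range(6)]   # toy per-token hidden states
print(pooled_output(ids, hidden))             # [3.0, 3.0]  (state at EOS, index 3)
```

This is why padding tokens after EOS don't affect the embedding: only the EOS position is read out (diffusion pipelines that consume the full last-hidden-state sequence behave differently).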

license:mit
0
5