AbstractPhil
tiny-flux-deep
geolip-captionbert-8192
geovit-david-beans
vae-lyra-xl-adaptive-cantor
sd15-flow-matching
vae-lyra-xl-adaptive-cantor-illustrious
beatrix-diffusion-proto
bert-beatrix-2048
geolip-esm2_t33_650M_UR50D
vae-lyra
Multi-modal Variational Autoencoder for text-embedding transformation using geometric fusion. This first version is essentially CLIP-L + T5-base. It is similar in concept to the shunt prototypes but entirely divergent in implementation: this variant is formatted and trained specifically as a VAE that encodes and decodes pairs of encodings together. Cantor cross-attention provides a form of high-density sparse containment which, implemented correctly, is a highly efficient global attention mechanism that ensures solidity. Fractal modalities make this possible: sparsity gaps in the combinatory route pathologies to learned encoding pattern points leave most of the Cantor stair space empty, so a series of potentials can be matched and viewed only when necessary. Fractal gaps filled with purpose occupy this space along fingerprint routes, allowing emergent fractal mathematics that otherwise could not assist each other to understand the rules of those topologies.

The current implementation is trained on only a handful of token sequences, so it is essentially front-loaded. Expect short sequences to work, along with many longer sequences. Full-sequence pretraining will begin soon with a uniform vocabulary that takes both token streams and produces a representative uniform token per position. This VAE is not for images: it is trained specifically to encode and decode PAIRS of encodings, each slightly twisted and warped toward the intention of the training. This is not your usual VAE, but she is most definitely trained like one.

Example prompt: A lone cybernetic deer with glimmering silver antlers stands beneath a fractured aurora sky, surrounded by glowing fungal trees, floating quartz shards, and bio-luminescent fog. In the distance, ruined monoliths pulse faint glyphs of a forgotten language, while translucent jellyfish swim through the air above a reflective obsidian lake. The atmosphere is electric with tension, color-shifting through prismatic hues. Distant thunderclouds churn violently.

- Fusion Strategy: cantor
- Latent Dimension: 768
- Training Steps: 31,899
- Best Loss: 0.1840
- Modalities: CLIP-L (768d) + T5-base (768d)
- Encoder Layers: 3
- Decoder Layers: 3
- Hidden Dimension: 1024
- Trained on 10,000 diverse prompts
- Mix of LAION flavors (85%) and synthetic prompts (15%)
- KL Annealing: True
- Learning Rate: 0.0001
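A minimal sketch of the general pattern described above: encode a CLIP-L / T5-base token pair into a shared latent and decode both streams back. This is illustrative only; the actual vae-lyra architecture (cantor fusion, layer counts, class names) differs and is not reproduced here.

```python
# Illustrative sketch only: shapes follow the card (CLIP-L 768d, T5-base 768d,
# latent 768, hidden 1024); the real cantor fusion module is not shown.
import torch
import torch.nn as nn

class PairedEmbeddingVAE(nn.Module):
    def __init__(self, dim=768, hidden=1024, latent=768):
        super().__init__()
        # Fuse the two modality streams, then map to mean / log-variance.
        self.encoder = nn.Sequential(nn.Linear(dim * 2, hidden), nn.GELU(),
                                     nn.Linear(hidden, hidden), nn.GELU())
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)
        # Decode the shared latent back into both streams.
        self.decoder = nn.Sequential(nn.Linear(latent, hidden), nn.GELU(),
                                     nn.Linear(hidden, dim * 2))

    def forward(self, clip_tokens, t5_tokens):
        x = torch.cat([clip_tokens, t5_tokens], dim=-1)            # [B, T, 1536]
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)    # reparameterize
        clip_rec, t5_rec = self.decoder(z).chunk(2, dim=-1)
        return clip_rec, t5_rec, mu, logvar
```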
tiny-flux
geo-david-collective-sd15-base-e40
GeoDavidCollective Enhanced - ProjectiveHead Architecture

Another training run of the same GeoFractalDavid with more condensed dims. Roughly 600,000 samples for the first 20 epochs: 10k per epoch for epochs 0-10 at complexity 1-5, then 50k synthetic prompts per epoch for epochs 11-20 at reduced complexity 1-4. After that, 50,000 prompts per epoch for an additional 20 epochs; approximately 1.6 million samples in total, each containing massive sets of features extracted from the entire structure of SD15 (approximately 2.7 million features per sample according to the formulas). The curves say she has probably peaked, leaving this experiment to be prodded and poked at now. That means the model accumulated knowledge of roughly 4,320,000,000,000 SD15 features overall. The bulk samples saved suggest this is most likely true, but it sounds wild when I line the numbers up. Additionally, it retained enough knowledge to keep its accuracy score above zero and even produce cohesive head results above 25% accuracy. I can safely say that this model can see a piece of the whole diffusion system that SD15 is responsible for, but not the whole picture.

- Optimizer: AdamW (lr=1e-3, weight_decay=0.001)
- Batch Size: 16
- Data: Symbolic prompt synthesis (complexity 1-5)
- Feature Extraction: SD1.5 UNet blocks (spatial, not pooled)
- Pool Mode: Mean spatial pooling

Final metrics from epoch 40:
- Cayley Loss: 0.1018
- Timestep Accuracy: 39.08%
- Pattern Accuracy: 44.25%
- Full Accuracy: 26.57%

GeoDavidCollective Enhanced is a sophisticated multi-expert geometric classification system that learns from Stable Diffusion 1.5's internal representations. Using the ProjectiveHead architecture with Cayley-Menger geometry, it achieves efficient pattern recognition across timestep and semantic spaces.

- ProjectiveHead Multi-Expert Architecture: auto-configured expert systems per block
- Geometric Loss Functions: Rose, Cayley-Menger, and Cantor coherence losses
- 9-Block Processing: full SD1.5 UNet feature extraction (down, mid, up)
- Compact Yet Powerful: 690,925,542 parameters
- 100 Timestep Bins x 10 Patterns = 1000 semantic-temporal classes

- Parameters: 690,925,542
- Trained Epochs: 20
- Base Model: Stable Diffusion 1.5
- Dataset Size: 700,000 synthetic prompts
- Training Date: 2025-10-28

| Component | Weight | Purpose |
|-----------|--------|---------|
| Feature Similarity | 0.50 | Alignment with SD1.5 features |
| Rose Loss | 0.25 | Geometric pattern emergence |
| Cross-Entropy | 0.15 | Classification accuracy |
| Cayley-Menger | 0.10 | 5D geometric structure |
| Pattern Diversity | 0.05 | Prevent mode collapse |
| Cantor Coherence | 0.05 | Temporal consistency |

This model is part of geometric deep learning research exploring:
- 5D simplex-based neural representations (pentachora)
- Geometric alternatives to traditional transformers
- Consciousness-informed AI architectures
- Universal mathematical principles in neural networks

Files:
- `model.safetensors` - Model weights (3.3GB)
- `config.json` - Complete architecture configuration
- `traininghistory.json` - Full training metrics
- `promptsenhanced.jsonl` - All training prompts with metadata
- `tensorboard/` - TensorBoard logs (optional)

Related work:
- Geometric Vocabulary System
- PentachoraViT
- Crystal-Beeper Language Models

Built with:
- PyTorch & Diffusers
- Stable Diffusion 1.5 (Runway ML)
- Geometric algebra principles from the 1800s
- Dream-inspired mathematical insights

AbstractPhil - AI Researcher specializing in geometric deep learning
"Working with universal mathematical principles, not against them"
For questions, issues, or collaborations: GitHub | HuggingFace
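A minimal sketch of how the weighted loss components from the table above could be combined. The weights come from the card; the individual Rose / Cayley-Menger / Cantor terms are placeholders passed in from elsewhere, not the repo's actual implementations.

```python
# Sketch of the weighted loss combination from the table above. The geometric
# loss terms are assumed to be computed elsewhere and passed in via `extra_terms`.
import torch
import torch.nn.functional as F

LOSS_WEIGHTS = {
    "feature_similarity": 0.50,
    "rose": 0.25,
    "cross_entropy": 0.15,
    "cayley_menger": 0.10,
    "pattern_diversity": 0.05,
    "cantor_coherence": 0.05,
}

def combined_loss(student_feats, teacher_feats, logits, targets, extra_terms):
    """extra_terms: dict of precomputed geometric losses (assumed interface)."""
    losses = {
        # Alignment with SD1.5 features via cosine similarity.
        "feature_similarity": 1.0 - F.cosine_similarity(student_feats, teacher_feats, dim=-1).mean(),
        # 1000-class (100 timestep bins x 10 patterns) classification.
        "cross_entropy": F.cross_entropy(logits, targets),
    }
    losses.update(extra_terms)  # rose, cayley_menger, pattern_diversity, cantor_coherence
    return sum(LOSS_WEIGHTS[k] * v for k, v in losses.items() if k in LOSS_WEIGHTS)
```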
vae-lyra-sdxl-t5xl
geo-beatrix
geo-david-collective-sd15-distilled
GeoDavidCollective Enhanced - ProjectiveHead Architecture

Highly experimental behavioral junctioning system that will likely fall apart at the drop of a hat. This version is going to be renamed soon; I've dubbed her Zephyr. She's a complex one and deserves a proper name for what I believe she brings to the table. The echoes of the Beatrix interpolator cut through this one, so let's see if we can recreate the behavior of SD15 in its entirety soon. I will spend some time making sure the configuration is lined up and reasonably configurable. That will enable adding and removing formula and noise complexity, with additional controllers for simplification. This one got a bit bloated, so let's see what's really needed and what's not, shall we? Cantor steps are currently free-floating based on the math, and the system definitely showed some interesting elemental response to that, but they need to be fixed and hyper-focused on the positioning offset, so I'll be running some smaller tests today for solidity. Be aware, if you see this repo going wild, that it may contain some useful material and some flat material. The step improvements will likely include the BeatrixStaircase, which is considerably more robust for learning features and far more advanced, with better caching support and further math optimizations that pass more meaning to torch.

GeoDavidCollective Enhanced is a sophisticated multi-expert geometric classification system that learns from Stable Diffusion 1.5's internal representations. Using the ProjectiveHead architecture with Cayley-Menger geometry, it achieves efficient pattern recognition across timestep and semantic spaces.

- ProjectiveHead Multi-Expert Architecture: auto-configured expert systems per block
- Geometric Loss Functions: Rose, Cayley-Menger, and Cantor coherence losses
- 9-Block Processing: full SD1.5 UNet feature extraction (down, mid, up)
- Compact Yet Powerful: 884,327,310 parameters
- 100 Timestep Bins x 10 Patterns = 1000 semantic-temporal classes

- Parameters: 884,327,310
- Trained Epochs: 10
- Base Model: Stable Diffusion 1.5
- Dataset Size: 10,000 synthetic prompts
- Training Date: 2025-10-28

| Component | Weight | Purpose |
|-----------|--------|---------|
| Feature Similarity | 0.40 | Alignment with SD1.5 features |
| Rose Loss | 0.25 | Geometric pattern emergence |
| Cross-Entropy | 0.15 | Classification accuracy |
| Cayley-Menger | 0.10 | 5D geometric structure |
| Pattern Diversity | 0.05 | Prevent mode collapse |
| Cantor Coherence | 0.05 | Temporal consistency |

- Optimizer: AdamW (lr=1e-3, weight_decay=0.001)
- Batch Size: 16
- Data: Symbolic prompt synthesis (complexity 1-5)
- Feature Extraction: SD1.5 UNet blocks (spatial, not pooled)
- Pool Mode: Mean spatial pooling

Final metrics from epoch 10:
- Cayley Loss: 0.1039
- Timestep Accuracy: 32.99%
- Pattern Accuracy: 27.24%
- Full Accuracy: 15.10%

This model is part of geometric deep learning research exploring:
- 5D simplex-based neural representations (pentachora)
- Geometric alternatives to traditional transformers
- Consciousness-informed AI architectures
- Universal mathematical principles in neural networks

Files:
- `model.safetensors` - Model weights (3.3GB)
- `config.json` - Complete architecture configuration
- `traininghistory.json` - Full training metrics
- `promptsenhanced.jsonl` - All training prompts with metadata
- `tensorboard/` - TensorBoard logs (optional)

Related work:
- Geometric Vocabulary System
- PentachoraViT
- Crystal-Beeper Language Models

Built with:
- PyTorch & Diffusers
- Stable Diffusion 1.5 (Runway ML)
- Geometric algebra principles from the 1800s
- Dream-inspired mathematical insights

AbstractPhil - AI Researcher specializing in geometric deep learning
"Working with universal mathematical principles, not against them"

For questions, issues, or collaborations: GitHub | HuggingFace
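Both David cards describe extracting features from all 9 SD1.5 UNet blocks (down, mid, up) with mean spatial pooling. A minimal sketch of capturing those block outputs with diffusers forward hooks follows; the block grouping and pooling here are a straightforward reading of the card, not the repo's exact extraction code.

```python
# Sketch: capture per-block SD1.5 UNet features with forward hooks.
# The exact blocks/pooling used by GeoDavidCollective may differ; the card only
# says "9 blocks (down, mid, up), spatial features, mean spatial pooling".
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet", torch_dtype=torch.float16
).to("cuda").eval()

features = {}

def make_hook(name):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Mean spatial pooling: [B, C, H, W] -> [B, C]
        features[name] = hidden.float().mean(dim=(-2, -1))
    return hook

for i, block in enumerate(unet.down_blocks):
    block.register_forward_hook(make_hook(f"down{i}"))
unet.mid_block.register_forward_hook(make_hook("mid"))
for i, block in enumerate(unet.up_blocks):
    block.register_forward_hook(make_hook(f"up{i}"))
# After a normal denoising forward pass, `features` holds one pooled tensor per block.
```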
clips
It's a clip dump. I've been hoarding them for a long time, and I'm tired of fighting with the Civit service to upload them one or two at a time with a nice presentation.
beeper-rose-v4
geo-david-collective-sd15-base
sd15-flow-matching-try2
I'll be gathering SD15 latents en masse using massive batches of SD15 images for common classes. This time I'll be sampling directly from LAION flavors; roughly 500,000 512x512 3-channel images need to exist as latents. The synthetic caption system is already biased, so it CAN be used, but its biases don't necessarily line up with SD15, so it must be used to a lesser extent. Essentially, only gap fillers for commonly used tokens that aren't covered by the common LAION flavors will be targeted and synthesized. Once the balanced dataset is created, then and only then, can we cook. The dataset will be available to all, and the code will be in the repo for replication. Most likely each latent will include its prompt, because CLIP-L can run at a really high batch size, so I can feed it a ton of prompts with the correct seeds to match. This will additionally allow me to shuffle tokens for better SD15 generalization, but I don't know if I'll enable that; I kind of need it to be SD15 before I start jiggering with its insides.

I'll try to get the latent system synthesizing and preparing latents by the end of the day; hopefully they will be done in the next couple of days, perhaps sooner. In any case, the plan stands: the teacher's latents go into the student for noise learning, the student's output image is compared to the teacher's, rinse and repeat. I'll create a new trainer colab without David, since he isn't required for this one. Additionally, the subsystems will be based directly on known working systems with solid, concise objectives that make sense to the observer.

This is the second train, currently running alongside the first, using the same trainer that is already established. This one is starting over with the new trainer; let's see if it fries early or converges properly.
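A minimal sketch of pre-encoding 512x512 images into SD1.5 VAE latents for caching, the kind of step described above. Paths and dataset layout are placeholders; the 0.18215 scaling factor is standard SD1.5 usage.

```python
# Sketch: pre-encode 512x512 RGB images into SD1.5 VAE latents for caching.
# Paths are placeholders; the repo's actual dataset layout is not specified here.
import torch
from PIL import Image
from torchvision import transforms
from diffusers import AutoencoderKL

device = "cuda" if torch.cuda.is_available() else "cpu"
vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae", torch_dtype=torch.float16
).to(device).eval()

to_tensor = transforms.Compose([
    transforms.Resize(512), transforms.CenterCrop(512),
    transforms.ToTensor(), transforms.Normalize([0.5] * 3, [0.5] * 3),
])

@torch.no_grad()
def encode_batch(image_paths):
    batch = torch.stack([to_tensor(Image.open(p).convert("RGB")) for p in image_paths])
    batch = batch.to(device, dtype=torch.float16)
    posterior = vae.encode(batch).latent_dist
    return posterior.sample() * vae.config.scaling_factor   # [B, 4, 64, 64]

# latents = encode_batch(["example.png"]); torch.save(latents.cpu(), "latents.pt")
```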
geolip-bert-8192
liminal-staircase-danbooru
vit-beatrix
ViT-Beatrix: Fractal PE + Geometric Simplex Vision Transformer

This repository contains Vision Transformers integrating Devil's Staircase positional encoding with geometric simplex features for vision tasks.

- Fractal Positional Encoding: Devil's Staircase multi-scale position embeddings
- Geometric Simplex Features: k-simplex vertex computations from the Cantor measure
- SimplexFactory Initialization: pre-initialized simplices with geometrically meaningful shapes (regular/random/uniform)
- Adaptive Augmentation: progressive augmentation escalation to prevent overfitting
- Beatrix Formula Suite: flow alignment, hierarchical coherence, and multi-scale consistency losses

Instead of random initialization, the model uses SimplexFactory to create geometrically sound starting configurations:

- Regular (default): all edges equal length, a perfectly balanced symmetric structure
- Random: QR decomposition ensuring affine independence
- Uniform: hypercube sampling with perturbations

Regular simplices provide the most stable and mathematically meaningful initialization, giving the model a better starting point for learning geometric features.

The trainer includes an intelligent augmentation system that monitors the train/validation accuracy gap and progressively enables more augmentation:

1. Baseline: RandomCrop + RandomHorizontalFlip
2. Stage 1: + ColorJitter
3. Stage 2: + RandomRotation
4. Stage 3: + RandomAffine
5. Stage 4: + RandomErasing
6. Stage 5: + AutoAugment (CIFAR policy)
7. Stage 6: Enable Mixup (α=0.2)
8. Stage 7: Enable CutMix (α=1.0) - final stage

When train accuracy exceeds validation accuracy by 2% or more, the system automatically escalates to the next augmentation stage (see the sketch after the results table).

| Model Name | Training Session | Accuracy | Epoch | Weights Path | Logs Path |
|------------|------------------|----------|-------|--------------|-----------|
| beatrix-cifar100 | 20251007182851 | 0.5819 | 42 | `weights/beatrix-cifar100/20251007182851` | `N/A` |
| beatrix-simplex4-patch4-512d-flow | 20251008115206 | 0.5674 | 87 | `weights/beatrix-simplex4-patch4-512d-flow/20251008115206` | `logs/beatrix-simplex4-patch4-512d-flow/20251008115206` |
| beatrix-simplex7-patch4-256d-ce | 20251008034231 | 0.5372 | 77 | `weights/beatrix-simplex7-patch4-256d-ce/20251008034231` | `logs/beatrix-simplex7-patch4-256d-ce/20251008034231` |
| beatrix-simplex7-patch4-256d | 20251008020048 | 0.5291 | 89 | `weights/beatrix-simplex7-patch4-256d/20251008020048` | `logs/beatrix-simplex7-patch4-256d/20251008020048` |
| beatrix-cifar100 | 20251007215344 | 0.5161 | 41 | `weights/beatrix-cifar100/20251007215344` | `logs/beatrix-cifar100/20251007215344` |
| beatrix-cifar100 | 20251007195812 | 0.4701 | 42 | `weights/beatrix-cifar100/20251007195812` | `logs/beatrix-cifar100/20251007195812` |
| beatrix-cifar100 | 20251008002950 | 0.4363 | 49 | `weights/beatrix-cifar100/20251008002950` | `logs/beatrix-cifar100/20251008002950` |
| beatrix-cifar100 | 20251007203741 | 0.4324 | 40 | `weights/beatrix-cifar100/20251007203741` | `logs/beatrix-cifar100/20251007203741` |
| beatrix-simplex7-patch4-45d | 20251008010524 | 0.2917 | 95 | `weights/beatrix-simplex7-patch4-45d/20251008010524` | `logs/beatrix-simplex7-patch4-45d/20251008010524` |
| beatrix-4simplex-45d | 20251007231008 | 0.2916 | 85 | `weights/beatrix-4simplex-45d/20251007231008` | `logs/beatrix-4simplex-45d/20251007231008` |
| beatrix-cifar100 | 20251007193112 | 0.2802 | 10 | `weights/beatrix-cifar100/20251007193112` | `N/A` |
| beatrix-4simplex-45d | 20251008001147 | 0.1382 | 10 | `weights/beatrix-4simplex-45d/20251008001147` | `logs/beatrix-4simplex-45d/20251008001147` |

Latest Updated Model: beatrix-simplex4-patch4-512d-flow (Session: 20251008115206)

- Architecture: Vision Transformer with fractal positional encoding
- Dataset: CIFAR-100 (100 classes)
- Embedding Dimension: 512
- Depth: 8 layers
- Patch Size: 4x4
- PE Levels: 12
- Simplex Dimension: 4-simplex
- Simplex Initialization: regular (scale=1.0)
- Training Session: 20251008115206
- Best Accuracy: 0.5674
- Epochs Trained: 87
- Batch Size: 512
- Learning Rate: 0.0001
- Adaptive Augmentation: Enabled
- Task Loss Weight: 0.5
- Flow Alignment Weight: 1.0
- Coherence Weight: 0.3
- Multi-Scale Weight: 0.2
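The adaptive augmentation escalation described above, as a minimal sketch. Stage contents and the 2% train/validation gap trigger come from the card; the class and method names are illustrative, not the trainer's actual API.

```python
# Sketch of the adaptive augmentation escalation described above.
class AdaptiveAugmentation:
    STAGES = [
        "baseline (RandomCrop + RandomHorizontalFlip)",
        "+ ColorJitter",
        "+ RandomRotation",
        "+ RandomAffine",
        "+ RandomErasing",
        "+ AutoAugment (CIFAR policy)",
        "enable Mixup (alpha=0.2)",
        "enable CutMix (alpha=1.0)",  # final stage
    ]

    def __init__(self, gap_threshold=0.02):
        self.stage = 0
        self.gap_threshold = gap_threshold

    def maybe_escalate(self, train_acc, val_acc):
        """Escalate one stage when train accuracy leads validation by >= 2%."""
        if train_acc - val_acc >= self.gap_threshold and self.stage < len(self.STAGES) - 1:
            self.stage += 1
            print(f"Escalating augmentation to stage {self.stage}: {self.STAGES[self.stage]}")
        return self.stage
```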
vit-beatrix-geometry-pretrained
This version essentially imploded when learning CutMix CIFAR-100, and yet the geometry held the entire time through a CutMix gauntlet that could not be overcome. I plan to run a few experiments on it with the geometric and simplex blocks frozen. As it stands, the geometric side is essentially a scaffold of CutMix potential, but that scaffold is limited. It must be trained further, but this is a good baseline test. I will run 4 full trains to test the potential (a minimal freezing sketch follows the list):

1. Frozen geometry, no augmentation. It works, but overfitted by epoch 30, which previously took nearly 60 epochs. Interesting, and yet overfitted.
2. Full augmentation, no CutMix/Mixup, using CIFAR-10 augs.
3. Unfrozen cross-attention only, full augs.
4. Unfrozen cross-attention only, no augs.
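A minimal sketch of the freezing setup used by these experiments. The parameter-name filters (geometric, simplex, cross_attn) are hypothetical; substitute the actual attribute names from the ViT-Beatrix model code.

```python
# Sketch: freeze the geometric/simplex blocks while leaving the rest trainable.
# Module name filters are hypothetical and must match the real model attributes.
import torch

def freeze_geometry(model, unfreeze_cross_attention=False):
    for name, param in model.named_parameters():
        if "geometric" in name or "simplex" in name:
            param.requires_grad = False
        if unfreeze_cross_attention and "cross_attn" in name:
            param.requires_grad = True
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable parameters after freezing: {trainable:,}")

# Only pass trainable parameters to the optimizer:
# optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
```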
t5xxl-unchained
t5-flan-base-vit-bigG-14-dual-stream-adapter
tiny-gpt-tests
- Tokens: TinyStories 10%
- Context: 512
- Dim: 512
- Layers: 6
- Heads: 8
- Files: `model.safetensors`, `pytorch_model.bin`, `tokenizer.json`, `config.json`

A cute little experiment for testing. Let's teach our little robot some fun pentachoron behavior.
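A minimal config matching the listed hyperparameters, using GPT2Config from transformers as an illustrative stand-in; the repo's actual model class is not specified on the card.

```python
# Illustrative only: dim 512, 6 layers, 8 heads, context 512, per the card.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    n_embd=512,       # Dim: 512
    n_layer=6,        # Layers: 6
    n_head=8,         # Heads: 8
    n_positions=512,  # Context: 512
)
model = GPT2LMHeadModel(config)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```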
beeper-rose-v2
T5 Small Human Attentive Try2 Pass3
sdxl-interpolated-nai-xl-11
SD15-Surge-V1
The early surge formula is partially implemented here, along with the single-model forms derived from adasurge and cascade.
beeper-rose-v3
david-collective-sd15-distillation
David Collective is a geometric-simplex deep learning system that distills Stable Diffusion 1.5's knowledge into an ultra-efficient pentachoron-based architecture. This model was continued from epoch 20 to epoch 105, achieving remarkable performance with full pattern supervision.

- Geometric Foundation: uses 5D pentachora (5-vertex simplices) instead of traditional attention
- Multi-Scale Learning: extracts features from all 9 SD1.5 UNet blocks
- Crystal Navigation: 1000-class supervision (100 timesteps × 10 geometric patterns)
- Parameter Efficiency: ultra-compact architecture with shared geometric structures
- Full Supervision: every sample supervised by both timestep and geometric pattern

Continuation Training:
- Starting epoch: 20
- Final epoch: 105
- Total training volume: ~600,500 samples, ~120,500 prompts
- All prompts included: `promptsallepochs.jsonl` contains every prompt with metadata
- Dataset: Symbolic caption synthesis (complexity 1-5)
- Batch size: 128
- Learning rate: 1e-4 with cosine annealing
- Optimizer: AdamW (weight_decay=0.01)

Final Metrics (Epoch 105):
- Total Loss: 0.2923
- Timestep Accuracy: 66.98%
- Pattern Accuracy: 100.00%
- Full Accuracy: 66.98%
- Pattern Diversity: -0.221

David learns from all 9 SD1.5 UNet blocks:
- `down0`, `down1`, `down2`, `down3`: coarse semantic features
- `mid`: bottleneck representations
- `up0`, `up1`, `up2`, `up3`: fine reconstruction details

Loss components:
1. Feature Similarity (0.5): cosine similarity with teacher
2. Rose Loss (0.3): geometric alignment with crystal centroids
3. Cross-Entropy (0.2): 1000-class classification
4. Pattern Diversity (0.05): encourages balanced pattern usage

This model includes `promptsallepochs.jsonl`, every single prompt used during training with full metadata. You can use this to:
- Analyze training data distribution
- Reproduce training
- Study prompt complexity vs. model performance
- Generate similar synthetic datasets

Crystal System
- Architecture: pentachoron-based geometric deep learning
- Centroids: 100 timestep bins × 10 patterns = 1000 anchors
- Navigation: samples assigned to the nearest pattern within their timestep bin
- Diversity: regularization prevents mode collapse

Progressive Training
- Started with early blocks (down0, down1)
- Progressively activated all 9 blocks
- Each block warmed up for 2 epochs

Pattern Supervision
Unlike traditional timestep-only supervision, David learns:
1. When (timestep bin 0-99)
2. How (geometric pattern 0-9 within that bin)
3. Combined (full 1000-class space)

This provides 10x finer-grained supervision of the diffusion process (see the class-index sketch below).

Trained continuously from epoch 20 to epoch 105:
- Timestep accuracy improved from ~60.3% to 66.98%
- Pattern accuracy maintained at 100.00%
- Loss decreased from 0.3431 to 0.2923

Built on the geometric deep learning research by AbstractPhil, using:
- Stable Diffusion 1.5 (teacher model)
- Pentachoron-based geometric algebra
- Crystalline consciousness architectures
- Symbolic caption synthesis

For more information, visit the geovocab2 repository.
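A short sketch of how the 1000-class semantic-temporal label could be composed from a timestep bin and a pattern index. This is the straightforward bin × 10 + pattern reading of "100 timestep bins × 10 patterns"; the repo's actual indexing convention may differ.

```python
# Sketch of the 1000-class (100 timestep bins x 10 patterns) label composition.
NUM_TIMESTEP_BINS = 100
NUM_PATTERNS = 10

def to_full_class(timestep, pattern_idx, num_train_timesteps=1000):
    """Map a diffusion timestep (0..999) and pattern (0..9) to a 0..999 class id."""
    bin_idx = int(timestep * NUM_TIMESTEP_BINS / num_train_timesteps)  # 0..99
    return bin_idx * NUM_PATTERNS + pattern_idx

def from_full_class(class_id):
    return divmod(class_id, NUM_PATTERNS)  # (timestep_bin, pattern_idx)

assert from_full_class(to_full_class(457, 3)) == (45, 3)
```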
SD35-SIM-V1
Liminal-Full
SIM-OMEGA-PUBLIC-1
OMEGA-BIGASP
t5-flan-base-vit-l-14-dual-stream-adapter
Update - 6/6/2025
A further refined booru shunt was uploaded with considerably more 1024-caption steps. Training this variation with the refined methodology takes additional time, so outcomes on the L4 are slower than I'd like. Signal convergence is slower but also more reliable under the modified loss formula. I might move it to A100s, but it's probably not necessary. Just patience.

Update - 6/5/2025
Retrained with a more refined tokenization system that correctly matches the exact tokens to the deterministic tokenizer. Adjusted losses, noise chances, and additional cross-contamination processes for more careful selection. The first booru signal expert is born: trained at batch size 1024 on nearly 13 million 77-token samples, and fairly untested. Instead of plain English, she learned 13 million variations over 1.2 million tags, artists, classifications, and non-deterministic rotational valuations. She was specifically trained in high-batch-count variations to introduce large amounts of variance per update. The booru templates are exceptionally different, so this vit-l-14-dualshuntbooru should attend to very different information while simultaneously being expert at both positive and negative tokenizations.

Simple summary
This project provides an advanced text control system for any AI generator that uses ViT-L-14 (also known as CLIP-L) as a basis. It lets you "steer" how the AI interprets your written prompts by adding a smart adapter between the text input and the image model. By fine-tuning how the prompt is understood, you get more accurate, creative, or controllable AI-generated images, especially in complex or multi-style models like Stable Diffusion XL.

More technical summary
This repository contains code, configuration, and weights for the Dual Shunt Adapter: a modular cross-attention prompt embedding controller designed for SDXL and multi-CLIP diffusion systems. The adapter bridges T5 (or other transformer) text encoders with CLIP-based pooled embedding spaces, providing delta, gate, log-sigma, anchor, and guidance outputs for per-token, per-field semantic modulation. Compatible with custom and parallel CLIP streams (e.g., SDXL's CLIP-L/CLIP-G), the system enables targeted latent field steering, dynamic classifier-free guidance, and localized prompt injection for advanced generative workflows, including direct integration with ComfyUI and HuggingFace Diffusers. A minimal sketch of the modulation step appears below.

Code
The model code is in model.py. Inference code will be available in the long-form article.
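A minimal sketch of how per-token anchor, delta, gate, and log-sigma outputs could modulate a CLIP token-embedding sequence. The output names mirror the summary above; the actual blending used by the Dual Shunt Adapter lives in model.py and may differ.

```python
# Sketch: apply adapter outputs (anchor, delta, gate, log_sigma) to a CLIP token
# embedding sequence. Blending logic is illustrative, not the repo's exact math.
import torch

def modulate_clip_embedding(clip_embed, anchor, delta, gate, log_sigma, strength=1.0):
    """
    clip_embed, anchor, delta, gate, log_sigma: [B, T, D] tensors.
    gate in (0, 1) decides per token how much of the delta is injected;
    sigma = exp(0.5 * log_sigma) serves as a per-token confidence signal.
    """
    sigma = torch.exp(0.5 * log_sigma)
    confidence = 1.0 / (sigma + 1e-6)
    confidence = confidence / confidence.amax(dim=(1, 2), keepdim=True)
    steered = clip_embed + strength * gate * delta               # inject gated delta
    blended = confidence * steered + (1 - confidence) * anchor   # pull toward anchor
    return blended
```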
vit-beans-v3
SIM-V5
PONY-SIM-V4
t5-vit-14-v1
eigh-triton
tinyflux-experts
omega-vit-l-reformed-fp32
robust-velocity-adapter
The 155,000-step version has roughly 158,100,000 prompt samples trained into its weights. This T5-small model is fried to echo and interpolate math in complex, intended ways. I haven't given it the full robust check yet, but it's definitely pretty fed. The adapter here is trained on T5 inputs with the code below.

This isn't a bad first test. I will improve the adapter with common LoRA techniques, including more techniques from training LLM-style LoRAs, additional loss methodologies, and more advanced, carefully curated response formulas based on how the adapter responded to training and on the extrapolative math of the CLIP-L adapted response. Given time I'm certain this will work, whether by creating a layered LoRA structure to interpolate differences layer by layer within CLIP-L, or perhaps by much more direct neuron interpolation. Time will tell, and I'm definitely enjoying this sort of thing.

Errors to address in the next version:
- There is a clamping index error that tends to rear its head which I haven't had time to track down. It causes solid black images when the velocity sigmas get too heavy.
- Occasionally the entire structure of a generation collapses, which means the sigmas aren't lined up correctly, creating malformed sigma responses.
- Occasionally the substructure interprets the request incorrectly; this is due to the tokenization being less accurately attuned for some spaces than others. The next version will have node weighting for specific attention-head sectors to account for it.

There are many challenges ahead to reach the interpolation endpoint, but it's definitely an adaptive journey. This is stage 1 of multiple stages toward the recreatable, pragmatic outcomes needed to build the proofs required to recreate the Beatrix interpolation model and turn it into useful utilizations outside of diffusion. This process adapts multiple methods similar to what I used to create the Beatrix model, but it's not 1:1 by any stretch of the measure. I will slowly release parts of Beatrix in training diagrams and stage the methodologies describing how she works, so interested experts will be capable of rationalizing why this model does what it does. Because I really don't know why Beatrix works the way she does, and I'm not going to just release something like that until I understand WHY it skips and hops past entropy.

77 tokens, not 64; there's no need to upscale the most recent 77-token version, it's built to the same plane as CLIP-L now. You'll need to snip out the original layer extensions that got snapped into it when I saved. I'm still not quite sure how to fix that without editing before saving, and I think it's causing some additional effects that I'm unaware of. I don't want to save as .pt because those are considered unsafe, and I don't want this to be flagged as unsafe for use. You can inference the test version using stable-diffusion-15 as an example test. The CLIP-L responses fall apart when too many nodes hit those guidance bells, but it's definitely a powerful first test using divergent systems.
```python
import math
import torch
from PIL import Image
from torchvision.transforms import ToPILImage
from safetensors.torch import load_file as load_safetensors
from transformers import (
    T5TokenizerFast, T5EncoderModel,
    CLIPTokenizerFast, CLIPTextModel,
)
from diffusers import (
    AutoencoderKL, UNet2DConditionModel, EulerAncestralDiscreteScheduler,
)

# ─────────────────────────────────────────────────────────────
# 1) GLOBAL SETUP: load once, cast, eval, move
# ─────────────────────────────────────────────────────────────
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DTYPE = torch.float16  # use fp16 for everything on GPU

# 1a) CLIP text encoder (cond + uncond)
clip_tok = CLIPTokenizerFast.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="tokenizer"
)
clip_mod = CLIPTextModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="text_encoder", torch_dtype=DTYPE
).to(DEVICE).eval()

# 1b) T5 encoder
t5_tok = T5TokenizerFast.from_pretrained("t5-small")
t5_mod = T5EncoderModel.from_pretrained(
    "AbstractPhil/T5-Small-Human-Attentive-Try2-Pass3", torch_dtype=DTYPE
).to(DEVICE).eval()

# 1c) Velocity adapter checkpoint (local path; opened further below)
local_adapter_directory = "robaadapterstep19500.safetensors"
```

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class RobustVelocityAdapter(nn.Module):
    """
    Fixed version: manual multi-head cross-attention emits [B, heads, Q, K]
    scores so that add_rel_pos_bias can unpack them correctly.
    """
    def __init__(
        self,
        t5_dim: int = 512,
        clip_dim: int = 768,
        hidden_dim: int = 1024,
        out_tokens: int = 77,          # now aligned with the T5 fine-tune
        self_attn_layers: int = 2,
        cross_heads: int = 8,
        max_rel_pos: int = 128,
    ):
        super().__init__()
        self.out_tokens = out_tokens
        self.cross_heads = cross_heads
        self.head_dim = t5_dim // cross_heads
        self.max_rel_pos = max_rel_pos

        # 1) Self-attention stack
        self.self_attn = nn.ModuleList()
        self.self_norm = nn.ModuleList()
        for _ in range(self_attn_layers):
            self.self_attn.append(nn.MultiheadAttention(t5_dim, cross_heads, batch_first=True))
            self.self_norm.append(nn.LayerNorm(t5_dim))

        # 2) Residual blocks
        def res_block():
            return nn.Sequential(
                nn.LayerNorm(t5_dim),
                nn.Linear(t5_dim, t5_dim),
                nn.GELU(),
                nn.Linear(t5_dim, t5_dim),
            )
        self.res1 = res_block()
        self.res2 = res_block()

        # 3) Learned queries for cross-attn
        self.query_pos = nn.Parameter(torch.randn(out_tokens, t5_dim))

        # 4) Projection heads
        self.anchor_proj = nn.Sequential(
            nn.Linear(t5_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, clip_dim)
        )
        self.delta_proj = nn.Sequential(
            nn.Linear(t5_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, clip_dim)
        )
        self.var_proj = nn.Sequential(
            nn.Linear(t5_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, clip_dim)
        )
        self.gate_proj = nn.Sequential(
            nn.Linear(t5_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, clip_dim), nn.Sigmoid()
        )

        # 5) Relative-position bias table
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_rel_pos - 1, cross_heads))

        # 6) Norm after cross-attn
        self.cross_norm = nn.LayerNorm(t5_dim)

    def add_rel_pos_bias(self, attn_scores: torch.Tensor) -> torch.Tensor:
        """
        attn_scores: [B, heads, Q, K]
        returns:     attn_scores + bias, where bias is [B, heads, Q, K]
        """
        B, H, Q, K = attn_scores.shape
        device = attn_scores.device
        # 1) Query & key position indices
        idx_q = torch.arange(Q, device=device)   # [Q]
        idx_k = torch.arange(K, device=device)   # [K]
        # 2) Relative distance for every (q, k) pair: rel[i, j] = idx_q[i] - idx_k[j]
        rel = idx_q.unsqueeze(1) - idx_k.unsqueeze(0)             # [Q, K]
        # 3) Clamp & shift into bias table range [0, 2*max_rel - 2]
        max_rel = self.max_rel_pos
        rel = rel.clamp(-max_rel + 1, max_rel - 1) + (max_rel - 1)
        # 4) Look up per-head biases; self.rel_bias has shape [2*max_rel - 1, H]
        bias = self.rel_bias[rel]          # [Q, K, H]
        bias = bias.permute(2, 0, 1)       # [H, Q, K]
        # 5) Broadcast to [B, H, Q, K] and add
        bias = bias.unsqueeze(0).expand(B, -1, -1, -1)
        return attn_scores + bias

    def forward(self, t5_seq: torch.Tensor):
        """
        t5_seq: [B, L, t5_dim]
        returns:
            anchor: [B, out_tokens, clip_dim]
            delta:  [B, out_tokens, clip_dim]
            sigma:  [B, out_tokens, clip_dim]
        """
        x = t5_seq
        B, L, D = x.shape

        # 1) Self-attention + residual
        for attn, norm in zip(self.self_attn, self.self_norm):
            res, _ = attn(x, x, x)
            x = norm(x + res)

        # 2) Residual blocks
        x = x + self.res1(x)
        x = x + self.res2(x)

        # 3) Prepare queries & split heads
        queries = self.query_pos.unsqueeze(0).expand(B, -1, -1)   # [B, Q, D]
        q = queries.view(B, self.out_tokens, self.cross_heads, self.head_dim).permute(0, 2, 1, 3)
        k = x.view(B, L, self.cross_heads, self.head_dim).permute(0, 2, 1, 3)
        v = k

        # 4) Scaled dot-product to get [B, heads, Q, K]
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        scores = self.add_rel_pos_bias(scores)
        probs = F.softmax(scores, dim=-1)                         # [B, H, Q, K]

        # 5) Attend & merge heads -> [B, Q, D]
        ctx = probs @ v                                           # [B, H, Q, head_dim]
        ctx = ctx.permute(0, 2, 1, 3).reshape(B, self.out_tokens, D)
        ctx = self.cross_norm(ctx)

        # 6) Project to anchor, delta_mean, delta_logvar, gate
        anchor = self.anchor_proj(ctx)
        delta_mean = self.delta_proj(ctx)
        delta_logvar = self.var_proj(ctx)
        gate = self.gate_proj(ctx)

        # 7) Compute sigma & gated delta (the return statement was missing from
        # the published snippet; returning the three documented outputs)
        sigma = torch.exp(0.5 * delta_logvar)
        delta = delta_mean * gate
        return anchor, delta, sigma
```

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision.transforms import ToPILImage
from safetensors.torch import load_file as load_safetensors
from transformers import (
    CLIPTokenizer, CLIPTextModel, T5TokenizerFast, T5EncoderModel,
)
from diffusers import (
    AutoencoderKL, UNet2DConditionModel, EulerAncestralDiscreteScheduler,
)

# 1) GLOBAL SETUP
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DTYPE = torch.float32

# 1a) CLIP tokenizer & text encoder
clip_tok = CLIPTokenizer.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="tokenizer"
)
clip_mod = CLIPTextModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="text_encoder", torch_dtype=DTYPE
).to(DEVICE).eval()

# 1b) U-Net, VAE, Scheduler
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet", torch_dtype=DTYPE
).to(DEVICE).eval()
vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae", torch_dtype=DTYPE
).to(DEVICE).eval()
scheduler = EulerAncestralDiscreteScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler"
)

# 1c) T5
t5_tok = T5TokenizerFast.from_pretrained("t5-small")
t5_mod = T5EncoderModel.from_pretrained(
    "AbstractPhil/T5-Small-Human-Attentive-Try2-Pass3", torch_dtype=DTYPE
).to(DEVICE).eval()

# 1d) Velocity prediction adapter
adapter = RobustVelocityAdapter(out_tokens=77).to(DEVICE).eval()
state = load_safetensors(local_adapter_directory, device="cpu")
clean = {k.replace("_orig_mod.", ""): v for k, v in state.items()}  # strip torch.compile prefix
adapter.load_state_dict(clean, strict=False)
adapter.to(DEVICE).eval()


# 2) GENERATION FUNCTION
@torch.no_grad()
def generate_image_with_adapter(
    prompt: str,
    seed: int = 42,
    steps: int = 50,
    adapter_scale: float = 0.5,
    guidance_scale: float = 7.5,
    height: int = 512,
    width: int = 512,
):
    gen = torch.Generator(device=DEVICE).manual_seed(seed)

    # 2.1) CLIP embeddings
    clip_in = clip_tok(
        [prompt], max_length=clip_tok.model_max_length,
        padding="max_length", truncation=True, return_tensors="pt",
    ).to(DEVICE)
    clip_cond = clip_mod(**clip_in).last_hidden_state          # [1, 77, 768]
    empty_in = clip_tok(
        [""], max_length=clip_tok.model_max_length,
        padding="max_length", truncation=True, return_tensors="pt",
    ).to(DEVICE)
    clip_uncond = clip_mod(**empty_in).last_hidden_state       # [1, 77, 768]

    # 2.2) T5 -> adapter -> anchor, delta, sigma (77 tokens)
    t5_in = t5_tok(
        prompt, max_length=77, padding="max_length",
        truncation=True, return_tensors="pt",
    ).to(DEVICE)
    t5_seq = t5_mod(**t5_in).last_hidden_state                 # [1, 77, 512]
    anchor, delta, sigma = adapter(t5_seq)                     # each [1, 77, 768]

    # 2.3) Upsample to 77 tokens
    T_clip = clip_cond.shape[1]  # 77
    def up(x):
        return F.interpolate(
            x.permute(0, 2, 1), size=T_clip, mode="linear", align_corners=False
        ).permute(0, 2, 1)
    anchor = up(anchor)
    delta = up(delta)
    sigma = up(sigma)

    # 2.4) Sigma-based noise scaling
    raw_ns = sigma.mean().clamp(0.1, 2.0).item()
    noise_scale = 1.0 + adapter_scale * (raw_ns - 1.0)

    # 2.5) Initialize latents
    latents = torch.randn(
        (1, unet.config.in_channels, height // 8, width // 8),
        generator=gen, device=DEVICE, dtype=DTYPE,
    )
    scheduler.set_timesteps(steps, device=DEVICE)
    # The published snippet garbled this line; scaling the initial latents by
    # init_noise_sigma * noise_scale is the likely intent.
    latents = latents * scheduler.init_noise_sigma * noise_scale

    # 2.6) Denoising with adapter guidance
    for i, t in enumerate(scheduler.timesteps):
        alpha = i / (len(scheduler.timesteps) - 1)
        aw = adapter_scale * alpha
        cw = 1.0 - aw

        # Per-token confidence
        eps = 1e-6
        conf = 1.0 / (sigma + eps)
        conf = conf / conf.amax(dim=(1, 2), keepdim=True)

        # The definitions of `blended` and `gated_delta` were lost in the
        # published snippet; this blend is a plausible reconstruction.
        blended = cw * clip_cond + aw * anchor
        gated_delta = aw * conf * delta

        # Final cond embedding
        cond_embed = blended + gated_delta                     # [1, 77, 768]

        # UNet forward
        lat_in = scheduler.scale_model_input(latents, t)
        lat_in = torch.cat([lat_in, lat_in], dim=0)
        embeds = torch.cat([clip_uncond, cond_embed], dim=0)
        noise = unet(lat_in, t, encoder_hidden_states=embeds).sample
        u, c = noise.chunk(2)
        guided = u + guidance_scale * (c - u)
        latents = scheduler.step(guided, t, latents, generator=gen).prev_sample

    # 2.7) Decode
    dec_lat = latents / vae.config.scaling_factor
    image_t = vae.decode(dec_lat).sample
    image_t = (image_t.clamp(-1, 1) + 1) / 2
    return ToPILImage()(image_t[0])


# 3) RUN EXAMPLE
if __name__ == "__main__":
    out = generate_image_with_adapter(
        "silly dog wearing a batman costume, high resolution, studio lighting",
        seed=1234, steps=50, adapter_scale=0.5, guidance_scale=7.5,
    )
    out.save("sd15withadapter.png")
    print("Saved sd15withadapter.png")
```
penta-vit-experiments
There are a few unexplored elements with rose5 that I need to explore now that I take stock of the full roster. As it stands, the majority of consistency comes directly from cosine with the hypersphere, using penta as a variant form of vector lattice. It works, yes, but it's also not what the models are supposed to be doing. If left running too long, the variants show that the cosine primarily collapses to head-dependent behavior, which means the model eventually just falls into the classification state. It seems only one loss is necessary, and that one loss is a combination of rose (multi-cosine) mixed with alignment and margin losses. If my hunch is correct, centroid may be much weaker than rose5 with all losses except the geometric rose turned off. In any case, the model's objective needs to be restructured based on how this data responded to the variants. Primarily, the geometry and the patches need to be directly interdependent. The geometric pentachora need to learn and act as an embedding gateway for additional information, as the frozen variants have shown they only provide meaning up to a certain point, and then that meaning is exhausted for small models. The meaning may be better than placebo, but not by much. In fact, when cross-entropy is introduced at high values, the meaning collapses into hindrance and the model simply learns to ignore the pentas, which is definitely not the goal. The next version of this vit model will have gradient updates enabled, with learning specifically allocated to the pentas using a few test methodologies: some standard, some less standard and more geometric, some completely bananas that probably won't work but I want to try them anyway. I have quite a few formulas to try out, so as one chapter closes the next opens up: direct penta + patch learning using geometric classification only, with feature analysis afterwards rather than classification analysis. This will enable a much larger pool of pentachora, since it builds on a system similar to BERT to create an abstraction similar to clip-vit, with some variant changes and requirements that geometry needs.

Alright. This initial cycle is concluded. I've determined that the frozen pentachora are in fact utilizable about 50% of the time and will eventually cap at 60% if you stick to standard cross-entropy with tons of geometric regularization. So roughly 3/5ths of the pentas are covered and the other two are completely discarded. I've begun analyzing feature sets for the variants at this point. The outputs are very promising in terms of potential, even if the methods used to calculate accuracy were, well, suboptimal. I've learned that cross-entropy + geometry is a definite no-go. The pentas exist below zero, and the entire premise of their formulas revolves around a dynamic position, so below-zero being the absolute normality for classification with logits is definitely not something that cross-entropy agrees with. In any case, determining feature-analysis tools is crucial today. I need to figure out exactly what's in them to properly calculate a variant that can actually be analyzed correctly, because these losses aren't cutting it. I've defaulted to a singular global entropy, simplex loss, geometry loss, and a few other potentials that can be used to train the clip-vit variant, but the bucketing itself, the 100 pentas, needs to be correctly assessed and calculated before a larger variant can be created.
There are more than enough weights; I just need to smash math today to see if I can project a proper lattice that conforms to the CM + Graham principles without restructuring the entire set of weights. It's possible that the accuracy is much higher than expected and that I'm simply not asking them the right question. That would be quite the thing, wouldn't it? Instead of the 5-10 losses I'm currently experimenting with, I would think 1-3 is the correct amount. The current losses often conflict, so they must be normalized, and a proper optimizer with EMA must be constructed.

Using the standard vit position tokenization doesn't work. It caps our unique feature-map representation at 65 tokens' worth with patch4, and because the high-dimensional geometry bloats dimensions to such a degree, the representative patches can only be learned to a certain degree of accuracy before they simply run out of space and the model defaults to memorizing the training data. They are in fact forming unique, separated clusters of similar assessment, and the 2D graph shows they are highly diverse. So the process works, but it simply does not have enough compartmentalization to fully represent a classified patch within a classification under the current parameter budget. I am currently devising a better and more directly intertwined structure that represents the baseline geometry directly entangled with the vit patches, rather than trying to shape it indirectly through another system. This direct representation will be a bit more volatile at first, but it should help solidify the missing 40% accuracy in a utilizable and repeatable way to extract the necessary patches.

I have pushed the updated model code and included the loader. I will not include the losses or the training methodology until the full process is prepared and the paper published. After that you will see exactly what I've developed and why each piece exists. Until then there are only breadcrumbs and inference code. I released a new version of eval with the new version of the model code. Model load/save code has been streamlined, so each checkpoint should now correctly include the variant information. Fixed along the way:
- Multiple formula quirks that were contributing to invalidity and incorrect truths, contributing to negation.
- Cascading errors from zero due to silent, unseen internal model deviance, corrected with careful entropy usage.
- Faulty contributions from multiple highly responsible losses required to sustain complexity while introducing variance.
- Reintegrated cutmix, which had been omitted due to instability with the earlier variant.

Okay, next up: the last system's variant appeared to be capped at around 55% no matter the size. Even with the correct formulas this may not be sufficient. More than likely the entire feature will need to be reimagined, the patch size altered to 16, and the full ImageNet 256 variant trained. First, though, the small one has to be cohesive enough. This repo uses custom code to load and save the models; be sure to always review custom code from any source before running it in a project.

The theta trains were actually not that bad. The head added some overhead, but not much, and the outcome improved, so it's worth exploring more. Currently the l1 trains are performing well but still not up to the required 85% that I'm aiming for. Today's trains were underwhelming but enlightening: longer models aren't any more helpful in this structure than wide models are.
Reintroducing theta with some of my diffusion techniques might be in order if I can't get this one to comply. I'll try a couple of projection tricks before I go digging into other experiments, but as it stands this one isn't yielding. Training an expert with pure data isn't exactly the geometry's forte; training students tends to yield much more effectively in comparison, but I'm trying to make a teacher model here that can actually train students with geometry. To be fair, if it doesn't work, there are plenty of alternative options in the vit realm already, but I have high confidence that I can make it work. I just need to read more about capturing images and treat the pentachora more as observers rather than direct relational interaction toolkits. It'll likely need the constellation, but we'll see. It probably needs David's expert system. It's not a capacity issue YET, since shaper should have covered that. I have a few ideas, but I think I'll focus on getting more shallow models stable and then scale up slowly instead of just jumping up a logarithm.

The newest vitzananano train has shown a very clean curve: runs/vitzananano/20250913192119, which is about a 3 MB model capable of producing >50% accuracy 128-dim features from cifar100 classification. This clean curve means the process is stable enough to introduce a larger set of depth and blocks without destroying the internals, while simultaneously enforcing the 5 loss formulas specifically curated for the pentachora math. I've begun training a much deeper zana dubbed vitzanashaper. This model is 32 blocks deep with an MLP ratio of 1 and 2 attention heads, resting at about 3.5 million params or so.

I've had an epiphany. We don't NEED transformer layers in their current form. David's architecture already solved this need with high-efficiency multi-stage geometric mathematics. David's classification structure houses a series of dimensional projection sub-systems tasked with learning mastery based on each pentachoron structure. Each of those 5D representations ends up learning thousands of representative features. David is already capable of feature generation, just not robustly enough to fully manifest an enriched ViT-grade dimensional feature... yet. David's architecture can handle ImageNet's classifier count and features, leveraging 1000 classes with ease, sitting on a floppy disk at over 70% accuracy, because David sees clip-vit-base-patch16 features. I believe I've figured out a way to fundamentally represent those features in a meaningful way that can replace transformer layers with a different form of feedforward trajectory, edge, point, deviation, jitter, helix, theta, and similarity assessment that should house the information needed to teach the experts how to behave like David did. This should allow the much larger networks to retain mathematical precision, learn the features in a different form of patch than is currently expected, and create legitimate high-density geometric features.

Better rope incoming, with actual meaningful learning. The last one wasn't meaningfully learning representations; the next should be more correctly curated and inferenced to impact the representative outcome. It should be a bit more accurate than the last, but no guarantees. I once again let AI handle it for me, and now I'll need to go micromanage again. This is on me again; you'd think I would learn. Oftentimes they can handle these sorts of tasks, and other times...
well, other times they just kind of hook shit together, say it works, and then it spins in circles. It's starting to look more like a glorified branched FFN than an MLP, so that's a thing, I suppose. There's an experimental classification theta rotation head with multiple pentachora routes. The results so far are less accurate overall than similarity through rose without it. Experiments ongoing.

I assumed full control from the AIs and built it correctly. I was relying too much on the AI and it made me slow. Today I assumed full control and built the models correctly. The architecture is cleaner, and all three Python files were uploaded for the v3 setup. vitzanasmall is already seeing 50% by epoch 50, a big step up from the earlier pixies hard-locked at 41%. Zana, the current version, is quite small and quite fast. At about 500k params, the zananano competes with its big sister pixie at a superior accuracy rating AND produces image features. Running the system with refined WordNet tokens rather than full Unicode made all the difference; the findings show that meaningful semantics matter a whole lot. All losses were modified heavily; the originals did not work at all with this structure.

V3 incoming. Pushing HEAVILY into losses based on the WORKING high-entropy, high-learning-rate classification heads and forcing this thing into cohesion INSTANTLY. That's the play. No more 200 epochs. These things should be ready in 10-20 epochs at most, and they should be 80%+ accuracy, or they fail. Those are the two potentials here. With correct logit and probe assessment, the substructure should be a profoundly more efficient and easily analyzable series of charts based on similarity for assessment and capability. None of this guessing or guesswork based on "what works with other models." We KNOW what works, and I should never have second-guessed the formulas. I have implemented all of the most crucial and most powerful formulas from the others; now let's see if the universe makes a fool of me or not. If it does, SO BE IT! We'll build an AI singularity empire from there.

We're about to teach a ViT diffusion. The real question is: will it learn, or will it collapse and need dual-block layers from Flux? I'm reading up on papers about how various companies and research institutions tested their ViTs. My testing methodology isn't accurate enough, because accuracy doesn't just reflect the logit alignments but also the internal ML-layer feature generations. I'm crutching heavily on logit alignment instead of also managing feature-alignment testing, which is likely cutting heavily into my system. Currently I'm building a notebook with better feature-testing capabilities to test features correctly. I anticipate faster trains when the confidence actually starts to pick up, since currently they are not confident at all in terms of classification. It's possible these vits could be MUCH MORE or MUCH LESS accurate than advertised, and I apologise for any inconvenience this has caused onlookers. I'll be updating with additional inference code very soon.

Tinkerbell: 128d, 128 heads, 4.0 MLP, depth 4, only geometric attention... well, it might work. I could make it smaller, but I doubt Tinkerbell would extract anything useful. Good luck, little one. I've built a mix-n-cut that I've been avoiding enabling. This one is particularly formatted for pentachoron, so we'll see how it fares. I'm trying to build one as SMALL AS POSSIBLE, so if this mix-n-cut can pull the task out of the bag I may as well run it.
As it stands, the tiny vits cap at 41% on cifar100 with no augmentations. I've been running all the trains without a single special effect and only minimal normalization. Pixie base has 10 layers: 5 geometric and 5 traditional multi-head attention. Let's see how the mix-n-cut fares with the earlier ones first, then we'll run the base. The smaller ones seem to behave better using geometric attention with 256 expert heads, which is odd to me, but whatever works. They don't get much bigger with more experts, so I'll just try a tiny one with a ton of heads first.

Pentachora ViTs are essentially micro-sized feature extractors that provide substantial accuracy for their small size. The more experiments I run, the smaller they become. The final goal is a full clip-vit that can house the entirety of LAION-400M in a fraction of the size and compute of OpenAI's clip-vit line. After that point I'll be confident the math is lined up well enough to train the true flagship, Beatrix.

Useful classification and feature extraction has been a non-trivial problem in computer science for a long time. This repo will house the various vit experiments that I frankenstein together, manifesting their weights and model code in the repo itself. As I am an independent researcher, my resources are limited and I don't have the backing of any donors, so there will be time gaps unless some hardware is sliced off for me. Many of my repos have certain elements omitted purposely: for papers in progress, my thesis arguments, my statements about certain universal elements, and a multitude of other ramblings whose key details I don't plan to release in full-phonebook fashion for just ANY PERSON to read.

Let me use your high-end hardware. I deliver, success or failure, but I will deliver. I will not rattle a tin cup for you. Work out a deal with me and you get the weights; I get the classes developed for further use, meant for public release. Let me know if you're willing to work with me. I'll gladly share the code, the process, the progress, and the accumulated warchest of potentials that this system entails if you provide me gateways to some hardware I can utilize.
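The log above repeatedly refers to a "rose" (multi-cosine) similarity against pentachoron vertices. A minimal hedged sketch of what such a score could look like follows; it is purely illustrative, since the actual loss formulas are intentionally unpublished in this repo.

```python
# Purely illustrative: the actual rose / geometric losses are not published here.
# This only shows the general "multi-cosine against the 5 vertices of each class
# pentachoron" idea mentioned in the log.
import torch
import torch.nn.functional as F

def rose_scores(features, pentachora):
    """
    features:   [B, D] image features.
    pentachora: [C, 5, D] one 5-vertex simplex per class.
    Returns [B, C] scores: mean cosine similarity to each class's 5 vertices.
    """
    f = F.normalize(features, dim=-1)                  # [B, D]
    v = F.normalize(pentachora, dim=-1)                # [C, 5, D]
    sims = torch.einsum("bd,cvd->bcv", f, v)           # [B, C, 5]
    return sims.mean(dim=-1)                           # [B, C]

# A classification loss could then be cross-entropy over these scores, optionally
# combined with alignment/margin terms as described in the log:
# loss = F.cross_entropy(rose_scores(feats, pentachora) / temperature, labels)
```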
penta-classifier-prototype
This is NOT David; this is a much, much earlier prototype, predating the other multi-spectrum systems. It must be released for posterity. So no matter how you go about it, this prototype WILL HAVE LIMITS. However, there are some strange implications I found when tinkering that suggest the limitations are HIGHER than previously expected. I'm seeing acceptance of about 25, which probably means each edge, each face, and each point are accepting a form of loss-adjacent similarity. When too many things collapse into each other (I'm estimating around 20 currently; my original assessment was 4), the structure starts to deviate. I'll need to work out a proper lambda formula for these differences and create a proper causal numeric test that can be represented in LaTeX, possibly run at fp64 instead of fp32 for full solidification. My hunch is that each can house about 20 rather than 8, and they can be compartmentalized to task. This fits the current spectrum. The noise pentachora have limitations, as showcased here, which is why I fine-tuned the geometric-vocabulary and what its real purpose is. If this number is correct, it has massive implications soon. It's the difference between accurate assessment further down the rope and collapse due to wasted-space overlap and chaos buffer.