Aurore-Reveil

3 models

Koto-Small-7B-IT

Koto-Small-7B-IT is an instruct-tuned version of Koto-Small-7B-PT, which was trained on top of MiMo-7B-Base with almost a billion tokens of creative-writing data. The model is meant for roleplay and instruct use cases. It was trained with ChatML formatting; a typical input would look like the sketch below. We found that a temperature of 1.25 and a min-p of 0.05 worked best, but YMMV!

Credits:
- Thank you very much to Delta-Vector/Mango for providing the compute used to train this model.
- Fizz for the pretrain.
- Pocketdoc/Anthracite for da cool datasets.
- Hensen chat.
- Thank you to the illustrator of WataNare for drawing the art used in the model card!
- Thanks to Curse for testing and ideas.
- Thanks to Toasty for some data and ideas.
- Thanks to everyone else in allura!

Same as before, it was trained over the course of 12 hours for just over 2 epochs on an 8xA100 DGX node, using the AdEMAMix optimizer and a REX LR scheduler. High grad clipping was used for regularization, with no weight decay (because it sucks).
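A minimal sketch of such an input, assuming the standard ChatML turn markers (the example prompt here is illustrative, not quoted from the original card):

```
<|im_start|>system
You are Koto, narrating a collaborative roleplay.<|im_end|>
<|im_start|>user
The tavern door creaks open and a stranger steps in from the rain.<|im_end|>
<|im_start|>assistant
```

Generation then continues from the open assistant turn, with the temperature 1.25 / min-p 0.05 settings suggested above.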

license:mit
288
11

Olmo-3-1025-7B-stage1-step1413814

license:apache-2.0
14
0

Austral-Qwen3-235B

It's an SFT on top of the largest Qwen, which nobody seems to have done yet, trained with a collection of the usual Austral datasets (books, RP logs, LNs, etc.). I do not totally endorse the model yet, and I think there's much work to be done in making a decensored, well-writing finetune of this model, but I released this to give everyone a slight taste of a Qwen3 finetune. It was also a way for us to test out some optims to actually get this model to train. Thanks to Intervitens.

Prompting
The model uses ChatML-style turns with system, user, and assistant roles (see the sketch after this entry).

Training config (torchtune):

```yaml
output_dir: ./qwen3_235B_A22B_austral/full

tokenizer:
  _component_: torchtune.models.qwen3.qwen3_tokenizer
  path: ./Qwen3-235B-A22B-tt/vocab.json
  merges_file: ./Qwen3-235B-A22B-tt/merges.txt
  max_seq_len: 32768

dataset:
  _component_: torchtune.datasets.pretokenized_dataset
  source: IntervitensInc/test235B2-pack
  split: train
  packed: true
seed: 42
shuffle: false

model:
  _component_: torchtune.models.qwen3.qwen3_moe_235b_a22b

checkpointer:
  _component_: torchtune.training.FullModelTorchTuneCheckpointer
  checkpoint_dir: ./Qwen3-235B-A22B-tt
  checkpoint_files:
    - model-00001-of-00001.bin
  recipe_checkpoint: null
  output_dir: ${output_dir}
  model_type: QWEN3_MOE
resume_from_checkpoint: false
enable_async_checkpointing: false

batch_size: 1
epochs: 4

optimizer:
  _component_: torchao.optim.AdamW8bit
  lr: 3.0e-06
lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_rex_scheduler
  num_warmup_steps: 100
loss:
  _component_: torchtune.modules.loss.LinearCrossEntropyLoss
max_steps_per_epoch: null
gradient_accumulation_steps: 1
clip_grad_norm: null

compile:
  model: true
  loss: true
  scale_grads: true
  optimizer_step: false
optimizer_in_bwd: true

device: cuda
enable_activation_checkpointing: true
enable_activation_offloading: true
custom_sharded_layers:
  - tok_embeddings
  - output
fsdp_cpu_offload: false
dtype: bf16

metric_logger:
  _component_: torchtune.training.metric_logging.WandBLogger
  project: qwen3-235-a22b-austral
log_every_n_steps: 1
log_peak_memory_stats: true
log_level: INFO
```

Credits
Thank you to Lucy Knada, Auri, Intervitens, Deepinfra, Cognitive Computations and the rest of Anthracite.

Training
The training was done for 4 epochs. We used 8 x B200 GPUs, graciously provided by Deepinfra, for the full-parameter fine-tuning of the model. Tuning was done all thanks to Intervitens.

Safety
It's still aligned to the beliefs of the Chinese Communist Party.
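As a rough sketch of the prompting section above, here is how one might render the system/user/assistant turns with the transformers chat template; the repo id and the example message contents are placeholders assumed for illustration, not taken from the card:

```python
from transformers import AutoTokenizer

# Placeholder repo id; substitute the actual location of the finetune.
MODEL_ID = "Aurore-Reveil/Austral-Qwen3-235B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

messages = [
    {"role": "system", "content": "system-prompt"},
    {"role": "user", "content": "user-prompt"},
]

# Render the turns into the model's ChatML-style format, leaving the
# prompt open for the assistant's reply.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```

Only the tokenizer is loaded here, so the snippet stays cheap even though the model itself is a 235B MoE.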

4
4