tomg-group-umd

111 models

DynaGuard-8B

license:apache-2.0
2,868
13

huginn-0125

Huginn-0125

This is Huginn, version 01/25, a latent recurrent-depth model with 3.5B parameters, trained for 800B tokens on AMD MI250X machines. This is a proof-of-concept model, but it is surprisingly capable in reasoning and code given its training budget and size. All details on this model can be found in the tech report, "Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach" (https://www.arxiv.org/abs/2502.05171). For more information, see the paper page: https://huggingface.co/papers/2502.05171. 8 intermediate checkpoints of the model can be found in its collection. Additional intermediate checkpoints are available upon request while we find a place to host all ~350 of them. The data used to train this model is publicly available (entirely on Hugging Face), and scripts provided with the pretraining code at https://github.com/seal-rg/recurrent-pretraining can be used to repeat our preprocessing and our entire training run.

1. How to Use
2. Advanced Usage
3. Model Summary
4. Limitations
5. Technical Details
6. License
7. Citation

Downloading and Using the Model: Load the model like this:

Modifying the Model's Depth at Test Time: By providing the argument `num_steps`, the model will execute a forward pass with that amount of compute. The model has about 1.5B parameters in its non-recurrent layers (prelude + coda), 0.5B parameters in the embedding, and 1.5B recurrent parameters, so, as a guideline, the number of materialized parameters is `num_steps * 1.5B + 2B`. Playing with this parameter is what makes this model interesting and different from fixed-depth transformers! The model is trained to accept an arbitrary number of steps, but using fewer than 4 steps will result in very coarse answers. If given enough context to reason about, benchmarks show the model improving up to around `num_steps=64`. Beyond that, more steps generally do not hurt, but we see no further improvements.
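The loading step and the materialized-parameter guideline above can be sketched as follows. The repository id and the `num_steps` guideline come from the card; the helper function is purely illustrative arithmetic, and the loading function is a minimal sketch (imports are local so the snippet runs without the heavy dependencies installed).

```python
def materialized_params_billion(num_steps: int) -> float:
    """Card guideline: ~1.5B recurrent parameters are materialized per
    recurrent step, plus ~2B fixed (prelude/coda layers + embedding)."""
    return num_steps * 1.5 + 2.0

def load_huginn():
    # Loading sketch; trust_remote_code is needed because the checkpoint
    # ships its own model code.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        "tomg-group-umd/huginn-0125",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained("tomg-group-umd/huginn-0125")
    return model, tokenizer
```

For example, at the recommended upper end of `num_steps=64`, the guideline gives `64 * 1.5 + 2 = 98` billion materialized parameters' worth of compute.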
Note: Due to an upload issue, the model is currently stored on HF with 2 copies of the tied embedding instead of just one. This will be fixed in a future release.

Inference: The model was trained with bfloat16-mixed precision, so we recommend using `bfloat16` to run inference (or AMP bfloat16-mixed precision, if you really want). All benchmarks were evaluated in pure `bfloat16`.

Sampling: The model can be used like a normal HF model to generate text, with KV-caching working as expected. You can provide `num_steps` directly to the `generate` call, for example: Note: `num_steps` and other model arguments CANNOT be included in the `GenerationConfig`, as they would shadow model args at runtime. The model was not finetuned or post-trained, but due to the inclusion of instruction data during pretraining, it natively understands its chat template. You can chat with the model like so:

KV-cache Details: The model requires its own KV-cache implementation, `HuginnDynamicCache`; otherwise the KV-caches of later calls to the recurrent block will overwrite the earlier ones. The current implementation will always try to inject this cache implementation, but that may break with huggingface updates. If you do not use `generate`, but implement your own generation, use a pattern like this:

Per-Token Adaptive Compute: When generating, you can use a variable amount of compute per token. The model is not trained for this, so this is a proof of concept that it can do this task zero-shot. You can pick between a few sane stopping rules, `entropy-diff`, `latent-diff`, `kl`, and `argmax-stability`, via `criterion=...`. The exit threshold can be modified via `exit_threshold=5e-4`. We suggest using `kl` for interesting exits and `argmax-stability` for conservative exits. Note that using these variables overrides the default generation function. Not all arguments that are valid for the normal `generate` call are valid here.
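A minimal chat-and-generate sketch under the card's description: the `num_steps` keyword to `generate` and the native chat template are stated on the card, while the function and variable names here are illustrative. Nothing in this snippet downloads weights by itself.

```python
def build_chat(user_message: str):
    # Minimal message list in the standard HF chat format.
    return [{"role": "user", "content": user_message}]

def chat_once(model, tokenizer, user_message: str, num_steps: int = 32):
    # Per the card, num_steps is passed directly to generate() and must
    # NOT be placed in the GenerationConfig.
    input_ids = tokenizer.apply_chat_template(
        build_chat(user_message), add_generation_prompt=True, return_tensors="pt"
    )
    outputs = model.generate(input_ids, max_new_tokens=256, num_steps=num_steps)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```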
To make this more explicit, you can also directly call `generate_with_adaptive_compute`: Your cache strategy should be set to `"latest-m4"` if using adaptive compute.

KV-cache Sharing: To reduce KV-cache memory requirements, the model can be run with fewer KV-caches, with later iterations in the recurrence overwriting earlier caches. To use this feature, set the cache argument `lookup_strategy` to include `compress-s16` (where the last number determines the size of the cache). You can combine this with per-token adaptive compute; in that case your lookup strategy should be `latest-m4-compress-s16`.

Warmstart / Continuous CoT: At each generation step, the recurrence can be warmstarted with the final state from the previous token by setting `continuous_compute=True`, like so:

Model Summary: The model is primarily structured around decoder-only transformer blocks. However, these blocks are organized into three functional groups: the prelude \\(P\\), which embeds the input data into a latent space using multiple transformer layers; the core recurrent block \\(R\\), which is the central unit of recurrent computation, modifying states \\(\mathbf{s} \in \mathbb{R}^{n \times h}\\); and finally the coda \\(C\\), which un-embeds from latent space using several layers and also contains the prediction head of the model. Given a number of recurrent iterations \\(r\\) and a sequence of input tokens \\(\mathbf{x} \in V^n\\), these groups are used in the following way to produce output probabilities \\(\mathbf{p} \in \mathbb{R}^{n \times |V|}\\):

$$\mathbf{e} = P(\mathbf{x})$$
$$\mathbf{s}_0 \sim \mathcal{N}(\mathbf{0}, \sigma^2 I_{n\cdot h})$$
$$\mathbf{s}_i = R(\mathbf{e}, \mathbf{s}_{i-1}) \; \textnormal{for} \; i \in \lbrace 1, \dots, r \rbrace$$
$$\mathbf{p} = C(\mathbf{s}_r)$$

where \\(\sigma\\) is the standard deviation of the initial random state.
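The adaptive-compute options above can be sketched as follows. The method name `generate_with_adaptive_compute`, the four stopping criteria, and the `exit_threshold` value are taken from the card; the exact call shape, including the keyword used here for the cache strategy, is an assumption for illustration.

```python
# Stopping criteria named on the card for per-token adaptive compute.
ADAPTIVE_CRITERIA = ("entropy-diff", "latent-diff", "kl", "argmax-stability")

def adaptive_generate(model, input_ids, criterion: str = "kl",
                      exit_threshold: float = 5e-4):
    # "kl" gives interesting exits, "argmax-stability" conservative ones.
    assert criterion in ADAPTIVE_CRITERIA
    # The cache-strategy keyword below is an assumed name; per the card,
    # the strategy should be "latest-m4" with adaptive compute, or
    # "latest-m4-compress-s16" when also compressing the KV cache.
    return model.generate_with_adaptive_compute(
        input_ids,
        criterion=criterion,
        exit_threshold=exit_threshold,
        lookup_strategy="latest-m4",
    )
```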
Given an initial random state \\(\mathbf{s}_0\\), the model repeatedly applies the core recurrent block \\(R\\), which accepts the latent state \\(\mathbf{s}_{i-1}\\) and the embedded input \\(\mathbf{e}\\) and outputs a new latent state \\(\mathbf{s}_i\\). After finishing all iterations, the coda block processes the last state and produces the probabilities of the next token. Please refer to the paper for performance on standard benchmarks.

Limitations: Our checkpoint is trained for only 47000 steps on a broadly untested data mixture with a constant learning rate. As an academic project, the model is trained only on publicly available data, and the 800B token count, while large in comparison to older fully open-source models such as the Pythia series, is small in comparison to modern open-source efforts such as OLMo, and tiny in comparison to the datasets used to train industrial open-weight models.

Technical Specifications: This model was trained on 21 segments of 4096 AMD MI-250X GPUs on the OLCF Frontier Supercomputer in early December 2024. The model was trained using ROCm 6.2.0 and a PyTorch 2.6 nightly pre-release (24/11/02). The code used to train the model can be found at https://github.com/seal-rg/recurrent-pretraining.

License: This model is released under the apache-2.0 license.

Contact: Please feel free to contact us with any questions, or open a discussion thread on Hugging Face.

license:apache-2.0
1,416
288

Gemstone-512x13

Gemstone-512x13 is part of the Gemstone Suite of Models, a set of models trained with varying widths and depths.

Training: We train using litgpt and AxoNN on AMD MI250X GPUs on Frontier at Oak Ridge National Laboratory, with a global batch size of 2048.

Data: Training and validation data are taken from non-overlapping subsets of dolma; as such, it is not an instruction model. This model is trained for 350 billion tokens, and we upload checkpoints every 2 billion tokens (477 steps).

Using Gemstone-512x13: The Gemstones are based on the gemma-2b architecture and use modeling_gemma.py to run with the transformers library.

License: This model is released under the apache-2.0 license.

Contact: Please feel free to contact us with any questions, or open a discussion thread.
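The checkpoint cadence above implies a fixed number of uploaded checkpoints, which the small helper below computes; the loading function is a hypothetical sketch (the repo naming follows this listing, and `trust_remote_code` is assumed to be required for the custom modeling_gemma.py). Imports are local so the snippet runs without transformers installed.

```python
def num_uploaded_checkpoints(total_tokens_b: int = 350, every_b: int = 2) -> int:
    # 350B training tokens with a checkpoint every 2B tokens -> 175 checkpoints.
    return total_tokens_b // every_b

def load_gemstone(width: int = 512, depth: int = 13):
    # Hypothetical loading sketch for a Gemstone-<width>x<depth> model.
    from transformers import AutoModelForCausalLM
    return AutoModelForCausalLM.from_pretrained(
        f"tomg-group-umd/Gemstone-{width}x{depth}", trust_remote_code=True
    )
```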

license:apache-2.0
983
0

Gemstone-384x13

license:apache-2.0
496
0

Gemstone-2560x8

license:apache-2.0
171
1

Gemstone-1280x36

Gemstone-1280x36 is part of the Gemstone Suite of Models, a set of models trained with varying widths and depths.

Training: We train using litgpt and AxoNN on AMD MI250X GPUs on Frontier at Oak Ridge National Laboratory, with a global batch size of 2048.

Data: Training and validation data are taken from non-overlapping subsets of dolma; as such, it is not an instruction model. This model is trained for 350 billion tokens, and we upload checkpoints every 2 billion tokens (477 steps).

Using Gemstone-1280x36: The Gemstones are based on the gemma-2b architecture and use modeling_gemma.py to run with the transformers library.

License: This model is released under the apache-2.0 license.

Contact: Please feel free to contact us with any questions, or open a discussion thread.

license:apache-2.0
102
0

zephyr-llama3-8b-sft-refusal-n-contrast-multiple-tokens

llama
49
0

huginn_swa_75_7_ema_0.9_merge

license:apache-2.0
45
1

DynaGuard-4B

license:apache-2.0
35
2

DynaGuard-1.7B

license:apache-2.0
34
3

Gemstone-768x45

Gemstone-768x45 is part of the Gemstone Suite of Models, a set of models trained with varying widths and depths.

Training: We train using litgpt and AxoNN on AMD MI250X GPUs on Frontier at Oak Ridge National Laboratory, with a global batch size of 2048.

Data: Training and validation data are taken from non-overlapping subsets of dolma; as such, it is not an instruction model. This model is trained for 350 billion tokens, and we upload checkpoints every 2 billion tokens (477 steps).

Using Gemstone-768x45: The Gemstones are based on the gemma-2b architecture and use modeling_gemma.py to run with the transformers library.

License: This model is released under the apache-2.0 license.

Contact: Please feel free to contact us with any questions, or open a discussion thread.

license:apache-2.0
33
0

CSD-ViT-L

license:cc-by-4.0
27
5

Gemstone-256x23_cooldown

license:apache-2.0
25
0

Gemstone-768x45_cooldown

license:apache-2.0
20
0

Gemstone-384x36

license:apache-2.0
18
0

Gemstone-384x36_lr_ablation

license:apache-2.0
17
0

Gemstone-256x80_cooldown

license:apache-2.0
16
0

Gemstone-512x12_cooldown

license:apache-2.0
15
0

Gemstone-1792x7

Gemstone-1792x7 is part of the Gemstone Suite of Models, a set of models trained with varying widths and depths.

Training: We train using litgpt and AxoNN on AMD MI250X GPUs on Frontier at Oak Ridge National Laboratory, with a global batch size of 2048.

Data: Training and validation data are taken from non-overlapping subsets of dolma; as such, it is not an instruction model. This model is trained for 350 billion tokens, and we upload checkpoints every 2 billion tokens (477 steps).

Using Gemstone-1792x7: The Gemstones are based on the gemma-2b architecture and use modeling_gemma.py to run with the transformers library.

License: This model is released under the apache-2.0 license.

Contact: Please feel free to contact us with any questions, or open a discussion thread.

license:apache-2.0
12
0

Gemstone-1280x15

Gemstone-1280x15 is part of the Gemstone Suite of Models, a set of models trained with varying widths and depths.

Training: We train using litgpt and AxoNN on AMD MI250X GPUs on Frontier at Oak Ridge National Laboratory, with a global batch size of 2048.

Data: Training and validation data are taken from non-overlapping subsets of dolma; as such, it is not an instruction model. This model is trained for 350 billion tokens, and we upload checkpoints every 2 billion tokens (477 steps).

Using Gemstone-1280x15: The Gemstones are based on the gemma-2b architecture and use modeling_gemma.py to run with the transformers library.

License: This model is released under the apache-2.0 license.

Contact: Please feel free to contact us with any questions, or open a discussion thread.

license:apache-2.0
11
0

step-00006144-recurrence_full_512_0

Huginn-0125 intermediate checkpoints

This is an intermediate checkpoint from our large-scale training run. Additional intermediate checkpoints are available upon request. All other information can be found at the main checkpoint.

1. How to Use
2. Advanced Usage
3. Model Summary
4. Limitations
5. Technical Details
6. License
7. Citation

Downloading and Using the Model: Load the model like this:

Modifying the Model's Depth at Test Time: By providing the argument `num_steps`, the model will execute a forward pass with that amount of compute. The model has about 1.5B parameters in its non-recurrent layers, 0.5B parameters in the embedding, and 1.5B recurrent parameters, so, as a guideline, the number of materialized parameters is `num_steps * 1.5B + 2B`. Playing with this parameter is what makes this model interesting and different from fixed-depth transformers! The model is trained to accept an arbitrary number of steps, but using fewer than 4 steps will result in very coarse answers. If given enough context to reason about, benchmarks show the model improving up to around `num_steps=64`. Beyond that, more steps generally do not hurt, but we see no further improvements.

Note: Due to an upload issue, the model is currently stored on HF with 2 copies of the tied embedding instead of just one. This will be fixed in a future release.

Inference: The model was trained with bfloat16-mixed precision, so we recommend using `bfloat16` to run inference (or AMP bfloat16-mixed precision, if you really want). All benchmarks were evaluated in pure `bfloat16`.

Sampling: The model can be used like a normal HF model to generate text, with KV-caching working as expected. You can provide `num_steps` directly to the `generate` call, for example: Note: `num_steps` and other model arguments CANNOT be included in the `GenerationConfig`, as they would shadow model args at runtime.
The model was not finetuned or post-trained, but due to the inclusion of instruction data during pretraining, it natively understands its chat template. You can chat with the model like so:

KV-cache Details: The model requires its own KV-cache implementation, `HuginnDynamicCache`; otherwise the KV-caches of later calls to the recurrent block will overwrite the earlier ones. The current implementation will always try to inject this cache implementation, but that may break with huggingface updates. If you do not use `generate`, but implement your own generation, use a pattern like this:

Per-Token Adaptive Compute: When generating, you can use a variable amount of compute per token. The model is not trained for this, so this is a proof of concept that it can do this task zero-shot. You can pick between a few sane stopping rules, `entropy-diff`, `latent-diff`, `kl`, and `argmax-stability`, via `criterion=...`. The exit threshold can be modified via `exit_threshold=5e-4`. We suggest using `kl` for interesting exits and `argmax-stability` for conservative exits. Note that using these variables overrides the default generation function. Not all arguments that are valid for the normal `generate` call are valid here. To make this more explicit, you can also directly call `generate_with_adaptive_compute`: Your cache strategy should be set to `"latest-m4"` if using adaptive compute.

KV-cache Sharing: To reduce KV-cache memory requirements, the model can be run with fewer KV-caches, with later iterations in the recurrence overwriting earlier caches. To use this feature, set the cache argument `lookup_strategy` to include `compress-s16` (where the last number determines the size of the cache). You can combine this with per-token adaptive compute; in that case your lookup strategy should be `latest-m4-compress-s16`.
Warmstart / Continuous CoT: At each generation step, the recurrence can be warmstarted with the final state from the previous token by setting `continuous_compute=True`, like so:

Model Summary: The model is primarily structured around decoder-only transformer blocks. However, these blocks are organized into three functional groups: the prelude \\(P\\), which embeds the input data into a latent space using multiple transformer layers; the core recurrent block \\(R\\), which is the central unit of recurrent computation, modifying states \\(\mathbf{s} \in \mathbb{R}^{n \times h}\\); and finally the coda \\(C\\), which un-embeds from latent space using several layers and also contains the prediction head of the model. Given a number of recurrent iterations \\(r\\) and a sequence of input tokens \\(\mathbf{x} \in V^n\\), these groups are used in the following way to produce output probabilities \\(\mathbf{p} \in \mathbb{R}^{n \times |V|}\\):

$$\mathbf{e} = P(\mathbf{x})$$
$$\mathbf{s}_0 \sim \mathcal{N}(\mathbf{0}, \sigma^2 I_{n\cdot h})$$
$$\mathbf{s}_i = R(\mathbf{e}, \mathbf{s}_{i-1}) \; \textnormal{for} \; i \in \lbrace 1, \dots, r \rbrace$$
$$\mathbf{p} = C(\mathbf{s}_r)$$

where \\(\sigma\\) is the standard deviation of the initial random state. Given an initial random state \\(\mathbf{s}_0\\), the model repeatedly applies the core recurrent block \\(R\\), which accepts the latent state \\(\mathbf{s}_{i-1}\\) and the embedded input \\(\mathbf{e}\\) and outputs a new latent state \\(\mathbf{s}_i\\). After finishing all iterations, the coda block processes the last state and produces the probabilities of the next token. Please refer to the paper for performance on standard benchmarks.

Limitations: Our checkpoint is trained for only 47000 steps on a broadly untested data mixture with a constant learning rate.
As an academic project, the model is trained only on publicly available data, and the 800B token count, while large in comparison to older fully open-source models such as the Pythia series, is small in comparison to modern open-source efforts such as OLMo, and tiny in comparison to the datasets used to train industrial open-weight models.

Technical Specifications: This model was trained on 21 segments of 4096 AMD MI-250X GPUs on the OLCF Frontier Supercomputer in early December 2024. The model was trained using ROCm 6.2.0 and a PyTorch 2.6 nightly pre-release (24/11/02). The code used to train the model can be found at https://github.com/seal-rg/recurrent-pretraining.

License: This model is released under the apache-2.0 license. You can also find the paper at https://huggingface.co/papers/2502.05171.

Contact: Please feel free to contact us with any questions, or open a discussion thread on Hugging Face.

license:apache-2.0
10
0

Gemstone-3072x12

license:apache-2.0
10
0

Gemstone-256x23

license:apache-2.0
9
0

Gemstone-512x14

license:apache-2.0
9
0

Gemstone-256x27_cooldown

license:apache-2.0
9
0

Gemstone-1280x15_cooldown

license:apache-2.0
9
0

step-00011904-recurrence_full_512_0

license:apache-2.0
8
0

Gemstone-256x71

license:apache-2.0
8
0

Gemstone-512x16

license:apache-2.0
8
0

Gemstone-256x27

license:apache-2.0
8
0

Gemstone-256x71_cooldown

license:apache-2.0
8
0

Gemstone-768x3_cooldown

license:apache-2.0
8
0

Gemstone-512x13_cooldown

license:apache-2.0
8
0

LoRI-D_code_mistral7b_rank_32

8
0

Gemstone-2048x27

license:apache-2.0
7
0

Gemstone-512x12

Gemstone-512x12 is part of the Gemstone Suite of Models, a set of models trained with varying widths and depths.

Training: We train using litgpt and AxoNN on AMD MI250X GPUs on Frontier at Oak Ridge National Laboratory, with a global batch size of 2048.

Data: Training and validation data are taken from non-overlapping subsets of dolma; as such, it is not an instruction model. This model is trained for 350 billion tokens, and we upload checkpoints every 2 billion tokens (477 steps).

Using Gemstone-512x12: The Gemstones are based on the gemma-2b architecture and use modeling_gemma.py to run with the transformers library.

License: This model is released under the apache-2.0 license.

Contact: Please feel free to contact us with any questions, or open a discussion thread.

license:apache-2.0
7
0

Gemstone-512x11

license:apache-2.0
7
0

Gemstone-768x3

license:apache-2.0
7
0

step-00041728-recurrence_full_512_0

license:apache-2.0
6
0

Gemstone-256x80

license:apache-2.0
6
0

Gemstone-512x16_cooldown

license:apache-2.0
6
0

LoRI-S_math_llama3_rank_64

base_model:meta-llama/Meta-Llama-3-8B
6
0

zephyr-llama3-8b-sft-refusal-n-contrast-single-token

llama
5
1

step-00010720-baseline_2_0

license:apache-2.0
5
0

LoRI-S_nlu_mistral7b_rank_32

5
0

zephyr-llama3-8b-sft-refusal-n-contrast

llama
5
0

zephyr-llama3-8b-sft-no-refusal-messages

llama
5
0

step-00023808-recurrence_full_512_0

license:apache-2.0
4
0

step-00035840-recurrence_full_512_0

license:apache-2.0
4
0

Gemstone-1792x7_cooldown

license:apache-2.0
4
0

LoRI-D_math_mistral7b_rank_32

4
0

4-goldfish-loss-llama-1B

llama
3
0

step-00017920-recurrence_full_512_0

license:apache-2.0
3
0

step-00029824-recurrence_full_512_0

license:apache-2.0
3
0

step-00006144-baseline_2_0

license:apache-2.0
3
0

step-00010752-recurrence_full_512_0

license:apache-2.0
3
0

huginn_swa_100_10_avg_0.9_merge

license:apache-2.0
3
0

Gemstone-1536x50

Gemstone-1536x50 is part of the Gemstone Suite of Models, a set of models trained with varying widths and depths.

Training: We train using litgpt and AxoNN on AMD MI250X GPUs on Frontier at Oak Ridge National Laboratory, with a global batch size of 2048.

Data: Training and validation data are taken from non-overlapping subsets of dolma; as such, it is not an instruction model. This model is trained for 350 billion tokens, and we upload checkpoints every 2 billion tokens (477 steps).

Using Gemstone-1536x50: The Gemstones are based on the gemma-2b architecture and use modeling_gemma.py to run with the transformers library.

License: This model is released under the apache-2.0 license.

Contact: Please feel free to contact us with any questions, or open a discussion thread.

license:apache-2.0
3
0

Gemstone-1792x18

Gemstone-1792x18 is part of the Gemstone Suite of Models, a set of models trained with varying widths and depths.

Training: We train using litgpt and AxoNN on AMD MI250X GPUs on Frontier at Oak Ridge National Laboratory, with a global batch size of 2048.

Data: Training and validation data are taken from non-overlapping subsets of dolma; as such, it is not an instruction model. This model is trained for 350 billion tokens, and we upload checkpoints every 2 billion tokens (477 steps).

Using Gemstone-1792x18: The Gemstones are based on the gemma-2b architecture and use modeling_gemma.py to run with the transformers library.

License: This model is released under the apache-2.0 license.

Contact: Please feel free to contact us with any questions, or open a discussion thread.

license:apache-2.0
3
0

Gemstone-1792x18_cooldown

license:apache-2.0
3
0

Gemstone-2560x8_cooldown

license:apache-2.0
3
0

Gemstone-3072x12_cooldown

license:apache-2.0
3
0

LoRI-D_continual_safety_code

base_model:meta-llama/Meta-Llama-3-8B
3
0

LoRI-D_nlu_mistral7b_rank_64

3
0

128-goldfish-loss-llama-1B

llama
2
0

32-goldfish-loss-llama-1B

llama
2
0

Gemstone-1024x28

license:apache-2.0
2
0

Gemstone-1536x50_cooldown

license:apache-2.0
2
0

Gemstone-2048x27_cooldown

license:apache-2.0
2
0

Gemstone-256x23_lr_ablation

license:apache-2.0
2
0

Gemstone-1280x36_lr_ablation

Gemstone-1280x36_lr_ablation is part of the Gemstone Suite of Models, a set of models trained with varying widths and depths. This particular version, denoted by the `_lr_ablation` postfix, corresponds to an ablation detailed in the paper in which we train the same suite of models with a learning rate that is half of the original.

Training: We train using litgpt and AxoNN on AMD MI250X GPUs on Frontier at Oak Ridge National Laboratory, with a global batch size of 2048.

Data: Training and validation data are taken from non-overlapping subsets of dolma; as such, it is not an instruction model. This model is trained for 100 billion tokens (in contrast to the main suite, which is trained for 350 billion tokens), and we upload checkpoints every 2 billion tokens (477 steps).

Using Gemstone-1280x36_lr_ablation: The Gemstones are based on the gemma-2b architecture and use modeling_gemma.py to run with the transformers library.

License: This model is released under the apache-2.0 license.

Contact: Please feel free to contact us with any questions, or open a discussion thread.

license:apache-2.0
2
0

Gemstone-384x13_lr_ablation

Gemstone-384x13_lr_ablation is part of the Gemstone Suite of Models, a set of models trained with varying widths and depths. This particular version, denoted by the `_lr_ablation` postfix, corresponds to an ablation detailed in the paper in which we train the same suite of models with a learning rate that is half of the original.

Training: We train using litgpt and AxoNN on AMD MI250X GPUs on Frontier at Oak Ridge National Laboratory, with a global batch size of 2048.

Data: Training and validation data are taken from non-overlapping subsets of dolma; as such, it is not an instruction model. This model is trained for 100 billion tokens (in contrast to the main suite, which is trained for 350 billion tokens), and we upload checkpoints every 2 billion tokens (477 steps).

Using Gemstone-384x13_lr_ablation: The Gemstones are based on the gemma-2b architecture and use modeling_gemma.py to run with the transformers library.

License: This model is released under the apache-2.0 license.

Contact: Please feel free to contact us with any questions, or open a discussion thread.

license:apache-2.0
2
0

Gemstone-512x16_lr_ablation

license:apache-2.0
2
0

LoRI-D_code_llama3_rank_32

llama
2
0

LoRI-D_code_llama3_rank_64

base_model:meta-llama/Meta-Llama-3-8B
2
0

LoRI-D_continual_safety_nlu

base_model:meta-llama/Meta-Llama-3-8B
2
0

LoRI-D_nlu_llama3_rank_32

base_model:meta-llama/Meta-Llama-3-8B
2
0

LoRI-D_nlu_llama3_rank_64

base_model:meta-llama/Meta-Llama-3-8B
2
0

LoRI-D_nlu_mistral7b_rank_32

2
0

LoRI-D_safety_mistral7b_rank_64

2
0

LoRI-S_code_llama3_rank_64

llama
2
0

LoRI-S_continual_safety_code

base_model:meta-llama/Meta-Llama-3-8B
2
0

LoRI-S_math_llama3_rank_32

base_model:meta-llama/Meta-Llama-3-8B
2
0

LoRI-S_nlu_llama3_rank_64

base_model:meta-llama/Meta-Llama-3-8B
2
0

LoRI-S_nlu_mistral7b_rank_64

2
0

LoRI-S_safety_mistral7b_rank_32

2
0

LoRI-S_safety_llama3_rank_32

base_model:meta-llama/Meta-Llama-3-8B
1
1

3-goldfish-loss-llama-1B

llama
1
0

control-llama-1B

llama
1
0

GenQA-math-llama-3

llama
1
0

llama-2-7b-lora_r32_step32

llama
1
0

llama-2-7b-lora_r32_step16

llama
1
0

Gemstone-1280x36_cooldown

license:apache-2.0
1
0

Gemstone-1792x18_lr_ablation

license:apache-2.0
1
0

Gemstone-3072x12_lr_ablation

license:apache-2.0
1
0

Gemstone-2048x27_lr_ablation

license:apache-2.0
1
0

Gemstone-384x13_cooldown

license:apache-2.0
1
0

LoRI-D_code_mistral7b_rank_64

1
0

LoRI-D_continual_safety_math

base_model:meta-llama/Meta-Llama-3-8B
1
0

LoRI-D_math_llama3_rank_64

base_model:meta-llama/Meta-Llama-3-8B
1
0

LoRI-D_safety_mistral7b_rank_32

1
0

LoRI-S_code_llama3_rank_32

base_model:meta-llama/Meta-Llama-3-8B
1
0

LoRI-S_code_mistral7b_rank_32

1
0

LoRI-S_code_mistral7b_rank_64

1
0

LoRI-S_continual_safety_math

base_model:meta-llama/Meta-Llama-3-8B
1
0

LoRI-S_continual_safety_nlu

base_model:meta-llama/Meta-Llama-3-8B
1
0

LoRI-S_math_mistral7b_rank_32

1
0

LoRI-S_nlu_llama3_rank_32

base_model:meta-llama/Meta-Llama-3-8B
1
0

LoRI-S_safety_llama3_rank_64

base_model:meta-llama/Meta-Llama-3-8B
1
0

zero-model-checkpoints

llama-3
0
2

LoRI-S_safety_mistral7b_rank_64

0
1