Downtown-Case

35 models

GLM-4.6-128GB-RAM-IK-GGUF

Quantized for 128GB RAM + single-GPU setups, with `IQ_K` quants for better quality/performance at this size than mainline llama.cpp. Requires ik_llama.cpp. I can hit ~6.8 tokens/second text generation on 128GB dual-channel DDR5 with a single-CCD Ryzen 7000 and a single 3090. See ubergarm's model card for more info on running these quants.

- The first 6 layers are `IQ3_KT` (a less lossy and more GPU-optimal 3bpw trellis quant), under the assumption they will be offloaded to GPU.
- Instead of quantizing `ffn_down` asymmetrically, it's quantized the same as `ffn_up`/`ffn_gate`, but the beginning/end layers are `IQ3_KS`. Targeting this more finely is a WIP.

117.3GB, for ~11GB-16GB VRAM + 128GB RAM (or longer context).

- Dense parts are `IQ4_KT` instead of `IQ5_KS` to save VRAM.
- More layers are `IQ2_KL` instead of `IQ3_KS` to avoid CPU swapping, and layer 92 was also 'trimmed' since it's not used.
- Uses ubergarm's ik_llama.cpp imatrix (which should be less lossy without a .gguf -> .dat conversion).
- Unsloth bf16 weights used as a base, including its tokenizer bugfixes.
- Expert quantization follows Unsloth's IQ2_XXS layer scheme, with perplexity 'bumps' boosted.
See the quantization dump here: https://huggingface.co/ubergarm/GLM-4.6-GGUF/discussions/2#68dd8ca9cb29272d402f3062

```shell
taskset -c 8-15 nice --20 build/bin/llama-server --cache-type-k q8_0 --cache-type-v q5_1 --batch-size 4096 --ubatch-size 4096 --ctx-size 20480 --host 0.0.0.0 --port 5000 -fa -fmoe -ngl 999 -ngld 999 -ot "blk\.([0-6])\.ffn.=CUDA0" -ot exps=CPU --parallel 1 --threads 8 --no-mmap --path examples/server/public_mikupad --sql-save-file /home/alpha/FastStorage/SQLSave/sqlite-save.sql --model /path/to/GLM-4.6/24GB+128GBV3/GLM-4.6-IQ2KL-BIG-00001-of-00003.gguf
```

6 MoE layers on GPU; adjust via the '6' in `"blk\.([0-6])\.ffn.=CUDA0"`.

```shell
taskset -c 8-15 ./build/bin/llama-perplexity --ctx-size 2048 -fa -fmoe -ngl 999 -ngld 999 -ot "blk\.([0-9])\.ffn.=CUDA0" -ot exps=CPU --no-mmap --file /home/alpha/Models/GGUF/ddh0-imat-calibration-data-v2.txt --kl-divergence --kl-divergence-base /home/alpha/Models/GGUF/GLM-4.6-KLD-ref-logits-Q8_0-ddh0-imat-calibration-data-v2.bin --model /home/alpha/Models/GGUF/GLM-4.6/24GB+128GBV3/GLM-4.6-unsloth.gguf-00001-of-00003.gguf
```

126.8GB, for 24GB VRAM + 128GB RAM. Slower, higher quality than V3: 0.081 KLD.

- All `ffn_down` layers are 3-bit. The same 'sensitive' up/gate FFNs as V3 are still 3-bit.
- `IQ3_KT` instead of `IQ3_KS`, for smaller size and less loss.
- The cost: ~15% slower TG than V3 (on my Ryzen 7800).
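The `-ot` overrides above are plain regexes over tensor names; a quick Python check (the tensor names below are illustrative, in llama.cpp's `blk.N.ffn_*` style, not dumped from a real GGUF) shows what a pattern like `blk\.([0-6])\.ffn` actually captures:

```python
import re

# Pattern from the launch command: send ffn tensors of blocks 0-6 to CUDA0.
pattern = re.compile(r"blk\.([0-6])\.ffn")

# Hypothetical expert-FFN tensor names in llama.cpp's naming scheme.
tensors = [f"blk.{i}.ffn_down_exps.weight" for i in range(10)]

offloaded = [t for t in tensors if pattern.search(t)]
print(offloaded)  # blocks 0-6 match; 7-9 fall through to the `exps=CPU` rule
```

Widening or narrowing the character class (e.g. `[0-3]`, `[0-9]`) is how the "adjust via the '6'" advice changes how many block FFNs land on the GPU.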
```shell
taskset -c 8-15 nice --20 build/bin/llama-server --cache-type-k q8_0 --cache-type-v q5_1 --batch-size 4096 --ubatch-size 4096 --ctx-size 20480 --host 0.0.0.0 --port 5000 -fa -fmoe -ngl 999 -ngld 999 -ot "blk\.([0-6])\.ffn.=CUDA0" -ot exps=CPU --parallel 1 --threads 8 --no-mmap --path examples/server/public_mikupad --sql-save-file /home/alpha/FastStorage/SQLSave/sqlite-save.sql --model /path/to/GLM-4.6/24GB+128GBV3/GLM-4.6-IQ2KL-BIG-00001-of-00003.gguf
```

3 MoE layers on GPU; adjust via the '6' in `"blk\.([0-6])\.ffn.=CUDA0"`.

```shell
taskset -c 8-15 ./build/bin/llama-perplexity --ctx-size 2048 -fa -fmoe -ngl 999 -ngld 999 -ot "blk\.([0-9])\.ffn.=CUDA0" -ot exps=CPU --no-mmap --file /home/alpha/Models/GGUF/ddh0-imat-calibration-data-v2.txt --kl-divergence --kl-divergence-base /home/alpha/Models/GGUF/GLM-4.6-KLD-ref-logits-Q8_0-ddh0-imat-calibration-data-v2.bin --model /home/alpha/Models/GGUF/GLM-4.6/24GB+128GBV4/GLM-4.6-slow.gguf-00001-of-00003.gguf
```

For reference, Unsloth's (130.8GB) Q2_K_XL has a KL divergence of ~0.12, and bartowski's 128GB Q2_K_XL is ~0.155, per AesSedai's benchmarks. ik quants make a massive difference in this range. Ubergarm's IQ2_KL mix has a KLD of 0.088 at 127.5GB; I'd recommend that as well!

With all the hearsay about the effects of context-cache quantization (`--cache-type-k`, `--cache-type-v`), I tested the V4 GGUF at different levels:

- q8_0/q8_0 is within the margin of error (+0.001 KLD); seemingly very little loss for the huge VRAM savings.
- Some other configurations also appear to be reasonably low loss: q8_0/q5_1 (for instance) is within the margin of error, and q5_1/iq4_nl (at +0.0045) is quite reasonable for squeezing in a lot of context. Personally, I use q8_0/q5_1 now.
- Take this with a grain of salt: due to the way the test uses the K/V cache, I haven't confirmed that KV-cache-quantization KLD correlates with actual long-context inference quality.
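For intuition on the KLD numbers quoted here: KL divergence measures how far the quantized model's next-token distribution drifts from the full-precision reference, averaged over test tokens. A toy computation with invented probabilities (not real model outputs):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats over a discrete next-token distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

reference = [0.70, 0.20, 0.10]  # hypothetical full-precision next-token probs
quantized = [0.65, 0.23, 0.12]  # hypothetical quantized-model probs

print(round(kl_divergence(reference, quantized), 4))
```

Identical distributions give exactly 0, so differences like 0.081 vs 0.12 between mixes are directly comparable as "distance from the bf16 model".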
KL-divergence/perplexity tests are done with AesSedai's wonderful testing data: https://huggingface.co/AesSedai/GLM-4.6-GGUF/discussions/1#68dcb412ae30ad1405dacd9a

MoE experts are generally `IQ2_KL`/`IQ3_KS` on CPU, or `IQ3_KT` if destined for the GPU, with dense layers at higher quant levels like `IQ5_KS` for less loss. My hardware is an undervolted 3090, dual-channel DDR5-6000, an AMD 7800 CPU, and Linux, though dual-CCD Ryzens (or tweaked systems) should be notably faster due to the single-CCD bandwidth limit. See the example scripts for quantizing, launching the server, and so on. KLD results are not necessarily comparable to other repos (as they were run at 2048 context instead of the default 512), but they will be once I rerun them.

To do:
- ~~Check perplexity of expert FFNs in each layer.~~
- Make more optimal mixes using Thireus's perplexity data, as seen in `ExampleScripts/GLM-4.6-expert-sorted-perplexity.txt`.
- Find the 'point of diminishing returns' for dense-layer quantization (`Q6_K`?).
- Test the KLD impact of different `token_embd` quantizations.

Derived from ubergarm's GLM-4.5 (Instruct) quantizations: https://huggingface.co/ubergarm/GLM-4.5-GGUF
And GGUF-Tool-Suite: https://github.com/Thireus/GGUF-Tool-Suite

ik_llama.cpp
202
19

GLM-4.5-Base-128GB-RAM-IQ2_KL-GGUF

GLM-4.5-Base, quantized down to 124GB (V2) and 118GB (V1), specifically for 128GB RAM + small-GPU setups, with the following mix, derived from ubergarm's GLM-4.5 (Instruct) quantizations:

- Mostly `IQ5_KS` GPU layers, to minimize loss cheaply, keep it fast (the IQx_KS quantizations are very fast), and minimize the number of quantization types.
- `IQ3_KS` shared experts near the beginning and end, as this seems to be where the perplexity 'bumps' are.

Works well on 128GB RAM, with room for 24K F16 context in 24GB VRAM and RAM to spare for the system. It's awesome for story continuation. Do NOT load with mmap! Requires ik_llama.cpp; see ubergarm's GLM-4.5 page. And let me know if you want a different mix (such as one more optimal for 8-11GB GPUs).

ik_llama.cpp
74
1

ByteDance-Seed_Seed-OSS-36B-Instruct-exl3-4.02bpw-hb8

Custom exl3 quantization, with 5bpw KV heads, 4bpw for all other layers, and an 8bpw `lm_head`.

> [!NOTE]
> This model card is dedicated to the `Seed-OSS-36B-Instruct` model.

News

- [2025/08/20] 🔥 We release `Seed-OSS-36B-Base` (both with and without synthetic-data versions) and `Seed-OSS-36B-Instruct`.

Introduction

Seed-OSS is a series of open-source large language models developed by ByteDance's Seed Team, designed for powerful long-context, reasoning, agentic, and general capabilities, with versatile developer-friendly features. Although trained with only 12T tokens, Seed-OSS achieves excellent performance on several popular open benchmarks. We release this series of models to the open-source community under the Apache-2.0 license.

> [!NOTE]
> Seed-OSS is primarily optimized for international (i18n) use cases.

Key Features

- Flexible Control of Thinking Budget: Allows users to flexibly adjust the reasoning length as needed. Dynamically controlling the reasoning length improves inference efficiency in practical application scenarios.
- Enhanced Reasoning Capability: Specifically optimized for reasoning tasks while maintaining balanced, excellent general capabilities.
- Agentic Intelligence: Performs exceptionally well in agentic tasks such as tool use and issue resolution.
- Research-Friendly: Because including synthetic instruction data in pre-training may affect post-training research, we released pre-trained models both with and without instruction data, providing the research community with more diverse options.
- Native Long Context: Trained with up to 512K context natively.

Seed-OSS adopts the popular causal language model architecture with RoPE, GQA attention, RMSNorm, and SwiGLU activation.
| | Seed-OSS-36B |
|:---:|:---:|
| Parameters | 36B |
| Attention | GQA |
| Activation Function | SwiGLU |
| Number of Layers | 64 |
| Number of QKV Heads | 80 / 8 / 8 |
| Head Size | 128 |
| Hidden Size | 5120 |
| Vocabulary Size | 155K |
| Context Length | 512K |
| RoPE Base Frequency | 1e7 |

Incorporating synthetic instruction data into pretraining leads to improved performance on most benchmarks. We adopt the version augmented with synthetic instruction data (i.e., w/ syn.) as `Seed-OSS-36B-Base`. We also release `Seed-OSS-36B-Base-woSyn`, trained without such data (i.e., w/o syn.), offering the community a high-performance foundation model unaffected by synthetic instruction data.

The base models were benchmarked against Seed1.6-Base, Qwen3-30B-A3B-Base-2507, and Qwen2.5-32B-Base, with results presented in the format "reproduced results (reported results, if any)".

| Benchmark | Seed1.6-Thinking-0715 | OAI-OSS-20B | Qwen3-30B-A3B-Thinking-2507 | Qwen3-32B | Gemma3-27B | Seed-OSS-36B-Instruct |
|---|---|---|---|---|---|---|
| GPQA-D | 80.7 | 72.2 (71.5) | 71.4 (73.4) | 66.7 (68.4) | 42.4 | 71.4 |
| LiveCodeBench v6 (02/2025-05/2025) | 66.8 | 63.8 | 60.3 (66) | 53.4 | - | 67.4 |
| SWE-Bench Verified (OpenHands) | 41.8 | (60.7) | 31 | 23.4 | - | 56 |
| SWE-Bench Verified (AgentLess 4*10) | 48.4 | - | 33.5 | 39.7 | - | 47 |

- Bold denotes open-source SOTA; underlined indicates second place among open-source models.
- Results are presented in the format "reproduced results (reported results, if any)". Some results have been omitted due to failed evaluation runs.
- The results of Gemma3-27B are sourced directly from its technical report.
- Generation configs for Seed-OSS-36B-Instruct: temperature=1.1, top_p=0.95. Specifically, for Tau-Bench, temperature=1, top_p=0.7.

> [!NOTE]
> We recommend sampling with `temperature=1.1` and `top_p=0.95`.

Users can flexibly specify the model's thinking budget.
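As a back-of-envelope from the GQA spec above (64 layers, 8 KV heads, head size 128), the KV-cache footprint per token follows directly; the fp16 cache width below is an assumption, not something stated in the card:

```python
layers, kv_heads, head_size = 64, 8, 128
bytes_per_value = 2  # assumed fp16 KV cache

# K and V each store (kv_heads * head_size) values per layer per token.
kv_bytes_per_token = 2 * layers * kv_heads * head_size * bytes_per_value
print(kv_bytes_per_token // 1024, "KiB per token")  # 256 KiB

# At the full 512K context this reaches 128 GiB, which is why long-context
# use in practice leans on cache quantization or shorter windows.
print(kv_bytes_per_token * 512 * 1024 // 2**30, "GiB at 512K tokens")
```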
The figure below shows the performance curves across different tasks as the thinking budget varies. For simpler tasks (such as IFEval), the model's chain of thought (CoT) is shorter, and the score fluctuates as the thinking budget increases. For more challenging tasks (such as AIME and LiveCodeBench), the CoT is longer, and the score improves with an increase in the thinking budget.

Here is an example with a thinking budget set to 512: during the reasoning process, the model periodically triggers self-reflection to estimate the consumed and remaining budget, and delivers the final response once the budget is exhausted or the reasoning concludes. If no thinking budget is set (the default mode), Seed-OSS will initiate thinking with unlimited length. If a thinking budget is specified, users are advised to prefer values that are integer multiples of 512 (e.g., 512, 1K, 2K, 4K, 8K, or 16K), as the model has been extensively trained on these intervals. Models are instructed to output a direct response when the thinking budget is 0, and we recommend setting any budget below 512 to this value.

Download the Seed-OSS checkpoint to `./Seed-OSS-36B-Instruct`.

Transformers: the `generate.py` script provides a simple interface for model inference with configurable options.

Key Parameters

| Parameter | Description |
|-----------|-------------|
| `--model_path` | Path to the pretrained model directory (required) |
| `--prompts` | Input prompts (default: sample cooking/code questions) |
| `--max_new_tokens` | Maximum tokens to generate (default: 4096) |
| `--attn_implementation` | Attention mechanism: `flash_attention_2` (default) or `eager` |
| `--load_in_4bit/8bit` | Enable 4-bit/8-bit quantization (reduces memory usage) |
| `--thinking_budget` | Thinking budget in tokens (default: -1 for unlimited budget) |

vLLM: first install a vLLM build with Seed-OSS support.

License: This project is licensed under Apache-2.0. See the LICENSE file for details.
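The budget rules above (prefer integer multiples of 512, force budgets below 512 to 0, -1 for unlimited) can be sketched as a small helper; `snap_budget` is a hypothetical name for illustration, not part of the Seed-OSS tooling:

```python
def snap_budget(requested: int) -> int:
    """Snap a requested thinking budget to values Seed-OSS was trained on.

    -1 means unlimited thinking; budgets below 512 are forced to 0
    (direct response); otherwise round to the nearest multiple of 512.
    """
    if requested < 0:
        return -1
    if requested < 512:
        return 0
    return round(requested / 512) * 512

print(snap_budget(-1), snap_budget(300), snap_budget(1000), snap_budget(4096))
```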
Founded in 2023, ByteDance Seed Team is dedicated to crafting the industry's most advanced AI foundation models. The team aspires to become a world-class research team and make significant contributions to the advancement of science and society.

license:apache-2.0
25
0

Seed-OSS-36B-Base-Instruct-Karcher-Merge

This is a merge of ByteDance's Seed-OSS-36B Base and Instruct, using the Karcher mean method in mergekit, the idea being to get ByteDance's Instruct model to 'feel' and write more like a raw continuation model. Karcher was tested because it and SLERP are seemingly the only viable ways to merge an instruct and a base model.

Quantized, it gets an MMLU score (via the exllamav3 eval script) of `11853/14042 = 84.41% correct (80.41% prob.)`. For reference, ByteDance's Instruct model (with the exact same quantization settings) gets `11680/14042 = 83.18% correct (80.96% prob.)`, and the base model by itself: `11851/14042 = 84.40% correct (76.96% prob.)`.

This model was merged using the Karcher Mean merge method, with /home/alpha/Models/Raw/ByteDance-Seed_Seed-OSS-36B-Instruct as a base. The following models were included in the merge: /home/alpha/Models/Raw/ByteDance-Seed_Seed-OSS-36B-Base. The following YAML configuration was used to produce this model:
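As a sanity check, the quoted percentages follow directly from the raw counts:

```python
scores = {
    "Karcher merge": 11853,
    "Instruct":      11680,
    "Base":          11851,
}
total = 14042  # MMLU questions in the exllamav3 eval

for name, correct in scores.items():
    print(f"{name}: {correct / total:.2%}")
# The merge edges out Instruct by ~1.2 points while matching Base.
```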

license:apache-2.0
18
3

ByteDance-Seed_Seed-OSS-36B-Instruct-exl3-4.22bpw-hb8

license:apache-2.0
14
1

OpenBuddy_SimpleChat-32B-V1-exl3-4.3bpw-hb8

Custom exl3 quantization, with 5bpw attention layers, 4bpw MLP layers, and an 8bpw `lm_head`.

The SimpleChat series represents our new exploration into non-chain-of-thought (non-CoT) models. Its main features are:

- Distinct Chat Style: Designed to be concise, rational, and empathetic; built specifically for casual, everyday conversations.
- Enhanced Creativity: Boosts the creativity of generated content and the capacity for emotional understanding, achieved by distilling knowledge from advanced models, including K2.
- Efficient Reasoning within a Non-CoT Framework: Delivers the faster response times of a non-CoT model while preserving strong reasoning skills. It retains this ability because it was trained on CoT models before being transitioned to a non-CoT framework, allowing it to think through complex problems.
- Known Trade-off: Compared to models that specialize in chain-of-thought, it may not perform as strongly on mathematical tasks.

GitHub and usage guide: https://github.com/OpenBuddy/OpenBuddy

This model supports a Qwen3-like prompt format, with the following system prompt recommended:

You may want to use `vllm` to deploy an OpenAI-like API service. For more information, please refer to the vLLM documentation.

All OpenBuddy models have inherent limitations and may potentially produce outputs that are erroneous, harmful, offensive, or otherwise undesirable. Users should not use these models in critical or high-stakes situations that may lead to personal injury, property damage, or significant losses. Examples of such scenarios include, but are not limited to, the medical field, controlling software and hardware systems that may cause harm, and making important financial or legal decisions. OpenBuddy is provided "as-is" without any warranty of any kind, either express or implied, including, but not limited to, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement.
In no event shall the authors, contributors, or copyright holders be liable for any claim, damages, or other liabilities, whether in an action of contract, tort, or otherwise, arising from, out of, or in connection with the software or the use of or other dealings in the software. By using OpenBuddy, you agree to these terms and conditions, and acknowledge that you understand the potential risks associated with its use. You also agree to indemnify and hold harmless the authors, contributors, and copyright holders from any claims, damages, or liabilities arising from your use of OpenBuddy.

14
0

Star-Command-R-Lite-32B-v1-exl2-4bpw

13
1

ByteDance-Seed_Seed-OSS-36B-Instruct-exl3-3.22bpw-hb6

license:apache-2.0
13
0

ByteDance-Seed_Seed-OSS-36B-Base-woSyn-exl3-3.22bpw-hb6

license:apache-2.0
13
0

ByteDance-Seed_Seed-OSS-36B-Base-exl3-4.22bpw-hb8

license:apache-2.0
12
0

jukofyork_command-r-35b-writer-v3-exl3-3.75bpw-hb6

EXL3 quant with 3bpw MLP projection layer and 4bpw for all other layers, to fit in 24GB cards with 16K context. Original description: Merged jukofyork/command-r-35b-writer-v3-multiplicative-lora into CohereLabs/c4ai-command-r-v01 using jukofyork/merge-lora.

license:cc-by-nc-4.0
11
1

ByteDance-Seed_Seed-OSS-36B-Base-woSyn-exl3-4.22bpw-hb8

license:apache-2.0
11
0

Seed-OSS-36B-Base-Instruct-Karcher-Merge-exl3-4.22bpw-hb8

This is a merge of ByteDance's Seed-OSS-36B Base and Instruct, using the Karcher mean method in mergekit, the idea being to get ByteDance's Instruct model to 'feel' and write more like a raw continuation model. Karcher was tested because it and SLERP are seemingly the only viable ways to merge an instruct and a base model.

Quantized, it gets an MMLU score (via the exllamav3 eval script) of `11853/14042 = 84.41% correct (80.41% prob.)`. For reference, ByteDance's Instruct model (with the exact same quantization settings) gets `11680/14042 = 83.18% correct (80.96% prob.)`, and the base model by itself: `11851/14042 = 84.40% correct (76.96% prob.)`.

This upload is a custom ~4.22bpw exl3 quantization, with 5bpw attention heads and 4bpw MLP layers. If you want a different size quantization, just ask.

This model was merged using the Karcher Mean merge method, with /home/alpha/Models/Raw/ByteDance-Seed_Seed-OSS-36B-Instruct as a base. The following models were included in the merge: /home/alpha/Models/Raw/ByteDance-Seed_Seed-OSS-36B-Base. The following YAML configuration was used to produce this model:

license:apache-2.0
10
1

CohereForAI_c4ai-command-r-08-2024-exl2-3.75bpw

license:cc-by-nc-4.0
4
2

c4ai-command-a-03-2025-exl3-3.12bpw-hb6

license:cc-by-nc-4.0
4
2

internlm2_5-7b-chat-1m-llamafied-Q6K-GGUF

4
0

Tifa-Deepsex-14b-CoT-Chat-HF

license:apache-2.0
2
2

Qwen_Qwen2.5-32B-Base-exl2-3.75bpw

license:apache-2.0
2
1

Qwen2.5-32B-EVA-Instruct-Merge-0.1

This is a merge of EVA 32B v0.1 with Qwen's 32B Instruct model and EVA v0.0 at low weights, using mergekit. Also see: https://huggingface.co/ParasiticRogue/EVA-Instruct-32B

This model was merged using the DELLA merge method, with /home/a/Models/Raw/Qwen_Qwen2.5-32B as a base. The following models were included in the merge:

- /home/a/Models/Raw/EVA-UNIT-01_EVA-Qwen2.5-32B-v0.1
- /home/a/Models/Raw/Qwen_Qwen2.5-32B-Instruct
- /home/a/Models/Raw/EVA-UNIT-01_EVA-Qwen2.5-32B-v0.0

The following YAML configuration was used to produce this model:

2
1

meta-llama_Meta-Llama-3.1-8B-Instruct-exl2-8bpw

llama
2
0

Qwen_Qwen2.5-32B-Base-exl2-3.62bpw

license:apache-2.0
2
0

deepseek-ai_DeepSeek-R1-Distill-Qwen-32B-exl2-4.5bpw-8K-Cal

2
0

Tifa-Deepsex-14b-CoT-Crazy-HF

license:apache-2.0
1
1

EVA-UNIT-01_EVA-Qwen2.5-32B-v0.1-exl2-4.1bpw

4.1bpw quantization of EVA 0.1, using default exllamav2 parameters.

A RP/storywriting specialist model: a full-parameter finetune of Qwen2.5-32B on a mixture of synthetic and natural data. It uses the Celeste 70B 0.1 data mixture, greatly expanding it to improve the versatility, creativity, and "flavor" of the resulting model.

Version notes for 0.1: an additional round of cleaning for the datasets; new subsets of 4o-WritingPrompts and Charcards, picking the most diverse samples from them; a small added subset of SystemChat 2.0 to improve instruction following; and a slightly increased sequence length. Additionally, the training-config mistake from 32B 0.0 is fixed: layernorm layers stay frozen this time. Unfreezing them caused a positivity bias to appear in 32B 0.0 for some reason.

Prompt format is ChatML. Recommended sampler values:

- Temperature: 1
- Typical-P: 0.9
- Min-P: 0.05
- Top-A: 0.2
- Repetition Penalty: 1.03

Recommended SillyTavern presets (via CalamitousFelicitousness).

Training data:

- Celeste 70B 0.1 data mixture minus the Opus Instruct subset; see that model's card for details.
- Kalomaze's Opus_Instruct_25k dataset, filtered for refusals.
- A subset (1k rows) of ChatGPT-4o-WritingPrompts by Gryphe.
- A subset (2k rows) of Sonnet3.5-Charcards-Roleplay by Gryphe.
- Synthstruct and SynthRP datasets by Epiculous.
- A subset from Dolphin-2.9.3, including a filtered version of not_samantha and a small subset of systemchat.

The model was trained by Kearm and Auri. Special thanks:

- to FeatherlessAI for generously providing an 8xH100 SXM node for training this model,
- to Gryphe, Lemmy, Kalomaze, Nopm, Epiculous, and Cognitive Computations for the data,
- and to Allura-org for support, feedback, beta-testing, and quality control of EVA models.
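A toy sketch of what the recommended Min-P value (0.05) does, independent of any particular inference engine: tokens whose probability falls below 5% of the top token's probability are dropped before sampling, and the rest are renormalized.

```python
def min_p_filter(probs, min_p=0.05):
    """Keep tokens whose probability is at least min_p * max(probs)."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}  # renormalize

# Hypothetical next-token distribution, not real model output.
probs = {"the": 0.60, "a": 0.25, "an": 0.10, "zebra": 0.01}
print(min_p_filter(probs))  # "zebra" (0.01 < 0.05 * 0.60 = 0.03) is dropped
```

The cutoff scales with the model's confidence, which is why Min-P pairs well with the higher-variance samplers (Temperature 1, Top-A) listed above.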

license:apache-2.0
1
0

nbeerbower_EVA-Gutenberg3-Qwen2.5-32B-exl3-4.0bpw-hb8

EVA-UNIT-01/EVA-Qwen2.5-32B-v0.2 finetuned on jondurbin/gutenberg-dpo-v0.1, nbeerbower/gutenberg2-dpo, and nbeerbower/gutenberg-moderne-dpo.

license:apache-2.0
1
0

OpenBuddy_CoTGen-32B-V1-exl3-4.3bpw-hb8

Custom exl3 quantization, with 5bpw attention layers, 4bpw for the MLP layers, and an 8bpw `lm_head`.

1
0

internlm2_5-7b-chat-1m-llamafied

llama
0
4

internlm_internlm2_5-20b-llamafied-hacked-rope

llama
0
2

internlm2_5-7b-chat-1m-llamafied-6bpw-exl2

llama
0
1

Tess-2.0-RPMerge-SlerpMerge

llama
0
1

aws-prototyping_MegaBeam-Mistral-7B-512K-exl2-8.0bpw

license:apache-2.0
0
1

Qwen_Qwen2.5-32B-Base-exl2-3.92bpw

license:apache-2.0
0
1

nbeerbower_EVA-Gutenberg3-Qwen2.5-32B-exl2-5bpw-8K-Cal

Quantized using the default exllamav2 quantization script/dataset, with the following changes:

- Context length for the calibration/quantization phases was forced to 8192 for both, as the script does not respect CLI changes by default and simply uses 512/2048 as context lengths.
- Fewer rows, but ultimately much more data, were used.
- A few rows of an "extra" dataset, with some examples of long, coherent text and this model's chat tokens, were added to the dataset.

The goal is less degradation from quantization at long context. But I tried to stay as close to default exl2 quantization parameters as possible, as straying too far from them only seems to degrade performance.

EVA-UNIT-01/EVA-Qwen2.5-32B-v0.2 finetuned on jondurbin/gutenberg-dpo-v0.1, nbeerbower/gutenberg2-dpo, and nbeerbower/gutenberg-moderne-dpo.

license:apache-2.0
0
1

Deepseek-EVA-32B-SCE-v1

0
1

Deepseek-EVA-32B-DELLA-v1

0
1