TheBloke

✓ Verified Community

Prolific quantizer, GGUF format pioneer (now retired)

500 models • 29 total models in database

Noromaid-20B-v0.1.1-GGUF

--- base_model: NeverSleep/Noromaid-20b-v0.1.1 inference: false license: cc-by-nc-4.0 model_creator: IkariDev and Undi model_name: Noromaid 20B v0.1.1 model_type: llama prompt_template: 'Below is an instruction that describes a task. Write a response that appropriately completes the request.

llama
375,382
30

deepseek-coder-6.7B-instruct-AWQ

--- base_model: deepseek-ai/deepseek-coder-6.7b-instruct inference: false license: other license_link: LICENSE license_name: deepseek model_creator: DeepSeek model_name: Deepseek Coder 6.7B Instruct model_type: deepseek prompt_template: 'You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science qu

llama
204,159
19

TinyLlama-1.1B-Chat-v0.3-GPTQ

--- base_model: PY007/TinyLlama-1.1B-Chat-v0.3 datasets: - cerebras/SlimPajama-627B - bigcode/starcoderdata - OpenAssistant/oasst_top1_2023-08-25 inference: false language: - en license: apache-2.0 model_creator: Zhang Peiyuan model_name: TinyLlama 1.1B Chat v0.3 model_type: tinyllama prompt_template: 'system

llama
199,392
9

TinyLlama-1.1B-Chat-v1.0-GGUF

--- base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0 datasets: - cerebras/SlimPajama-627B - bigcode/starcoderdata - OpenAssistant/oasst_top1_2023-08-25 inference: false language: - en license: apache-2.0 model_creator: TinyLlama model_name: Tinyllama 1.1B Chat v1.0 model_type: tinyllama prompt_template: '

tinyllama
100,001
195

Mistral-7B-Instruct-v0.2-GPTQ

license:apache-2.0
95,261
54

Mixtral-8x7B-Instruct-v0.1-GPTQ

license:apache-2.0
91,082
139

Llama-2-7B-GPTQ

llama
73,456
81

DiscoLM_German_7b_v1-AWQ

license:apache-2.0
63,544
4

Mistral-7B-Instruct-v0.2-GGUF

--- base_model: mistralai/Mistral-7B-Instruct-v0.2 inference: false license: apache-2.0 model_creator: Mistral AI_ model_name: Mistral 7B Instruct v0.2 model_type: mistral pipeline_tag: text-generation prompt_template: '[INST] {prompt} [/INST]

license:apache-2.0
59,663
479

MythoMax-L2-13B-GGUF

TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z)

MythoMax L2 13B - GGUF
- Model creator: Gryphe
- Original model: MythoMax L2 13B

This repo contains GGUF format model files for Gryphe's MythoMax L2 13B.

GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. GGUF offers numerous advantages over GGML, such as better tokenisation and support for special tokens. It also supports metadata, and is designed to be extensible.

Here is an incomplete list of clients and libraries that are known to support GGUF:
- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy-to-use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- ctransformers, a Python library with GPU accel, LangChain support, and an OpenAI-compatible AI server.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.

Repositories available:
- AWQ model(s) for GPU inference.
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference
- Gryphe's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions

Licensing: The creator of the source model has listed its license as `other`, and this quantization has therefore used that same license. As this model is based on Llama 2, it is also subject to the Meta Llama 2 license terms, and the license files for that are additionally included. It should therefore be considered as being claimed to be licensed under both licenses. I contacted Hugging Face for clarification on dual licensing, but they do not yet have an official position. Should this change, or should Meta provide any feedback on this situation, I will update this section accordingly. In the meantime, any questions regarding licensing, and in particular how these two licenses might interact, should be directed to the original model repository: Gryphe's MythoMax L2 13B.

These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d36d5be95a0d9088b674dbb27354107221. They are also compatible with many third-party UIs and libraries - please see the list at the top of this README.

The new methods available are:
- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights.
Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| mythomax-l2-13b.Q2_K.gguf | Q2_K | 2 | 5.43 GB | 7.93 GB | smallest, significant quality loss - not recommended for most purposes |
| mythomax-l2-13b.Q3_K_S.gguf | Q3_K_S | 3 | 5.66 GB | 8.16 GB | very small, high quality loss |
| mythomax-l2-13b.Q3_K_M.gguf | Q3_K_M | 3 | 6.34 GB | 8.84 GB | very small, high quality loss |
| mythomax-l2-13b.Q3_K_L.gguf | Q3_K_L | 3 | 6.93 GB | 9.43 GB | small, substantial quality loss |
| mythomax-l2-13b.Q4_0.gguf | Q4_0 | 4 | 7.37 GB | 9.87 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| mythomax-l2-13b.Q4_K_S.gguf | Q4_K_S | 4 | 7.41 GB | 9.91 GB | small, greater quality loss |
| mythomax-l2-13b.Q4_K_M.gguf | Q4_K_M | 4 | 7.87 GB | 10.37 GB | medium, balanced quality - recommended |
| mythomax-l2-13b.Q5_0.gguf | Q5_0 | 5 | 8.97 GB | 11.47 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| mythomax-l2-13b.Q5_K_S.gguf | Q5_K_S | 5 | 8.97 GB | 11.47 GB | large, low quality loss - recommended |
| mythomax-l2-13b.Q5_K_M.gguf | Q5_K_M | 5 | 9.23 GB | 11.73 GB | large, very low quality loss - recommended |
| mythomax-l2-13b.Q6_K.gguf | Q6_K | 6 | 10.68 GB | 13.18 GB | very large, extremely low quality loss |
| mythomax-l2-13b.Q8_0.gguf | Q8_0 | 8 | 13.83 GB | 16.33 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.
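The bits-per-weight figures quoted for the k-quant methods above can be sanity-checked from their stated block structure: raw quant bits per weight, plus per-block scale/min bits, plus a per-super-block fp16 header. A quick sketch of that arithmetic (illustrative only, not llama.cpp source; Q2_K packs its metadata differently and is omitted):

```python
def k_quant_bpw(quant_bits, n_blocks, scale_bits_per_block, header_bits):
    """Effective bits per weight for a 256-weight k-quant super-block."""
    weights = 256  # weights per super-block (QK_K)
    total_bits = weights * quant_bits + n_blocks * scale_bits_per_block + header_bits
    return total_bits / weights

# (quant bits, blocks per super-block, scale+min bits per block, fp16 header bits)
print(k_quant_bpw(3, 16, 6, 16))   # Q3_K: 16 blocks, 6-bit scales, one fp16 scale -> 3.4375
print(k_quant_bpw(4, 8, 12, 32))   # Q4_K: 8 blocks, 6-bit scales+mins, fp16 scale+min -> 4.5
print(k_quant_bpw(5, 8, 12, 32))   # Q5_K: same metadata layout as Q4_K -> 5.5
print(k_quant_bpw(6, 16, 8, 16))   # Q6_K: 16 blocks, 8-bit scales, one fp16 scale -> 6.5625
```

The results match the 3.4375, 4.5, 5.5 and 6.5625 bpw figures quoted in the method descriptions.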
The following clients/libraries will automatically download models for you, providing a list of available models to choose from:
- LM Studio
- LoLLMS Web UI
- Faraday.dev

Under Download Model, you can enter the model repo: TheBloke/MythoMax-L2-13B-GGUF and below it, a specific filename to download, such as: mythomax-l2-13b.Q4_K_M.gguf.

On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library. Then you can download any individual model file to the current directory, at high speed. You can also download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI.

To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows CLI users: use `set HF_HUB_ENABLE_HF_TRANSFER=1` before running the download command.

Make sure you are using `llama.cpp` from commit d0cee0d36d5be95a0d9088b674dbb27354107221 or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 4096` to the desired sequence length. For extended sequence models - e.g. 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation. Further instructions here: text-generation-webui/docs/llama.cpp.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries.
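The code fences for the download and run steps described above were stripped from this excerpt; a reconstructed sketch of that flow follows. The Q4_K_M filename and the prompt are illustrative examples, and the flags mirror the ones named in the text (`-ngl`, `-c`, `-p`):

```shell
# Install the recommended download tooling
pip3 install huggingface-hub

# Download a single quant file to the current directory (don't clone the whole repo)
huggingface-cli download TheBloke/MythoMax-L2-13B-GGUF mythomax-l2-13b.Q4_K_M.gguf \
    --local-dir . --local-dir-use-symlinks False

# Or download several files at once with a pattern
huggingface-cli download TheBloke/MythoMax-L2-13B-GGUF \
    --local-dir . --local-dir-use-symlinks False --include "*Q4_K*gguf"

# Optional: faster downloads on 1Gbit/s+ connections
pip3 install hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download TheBloke/MythoMax-L2-13B-GGUF \
    mythomax-l2-13b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False

# Example llama.cpp invocation (commit d0cee0d... or later); adjust -ngl and -c as described
./main -ngl 32 -m mythomax-l2-13b.Q4_K_M.gguf --color -c 4096 \
    --temp 0.7 --repeat_penalty 1.1 -n -1 \
    -p "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{prompt}\n\n### Response:"
```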
How to load this model from Python using ctransformers: simple example code to load one of these GGUF models. Here are guides on using llama-cpp-python or ctransformers with LangChain: LangChain + llama-cpp-python, LangChain + ctransformers.

For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute, it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI

Patreon special mentions: Alicia Loh, Stephen Murray, K, Ajan Kanaga, RoA, Magnesian, Deo Leter, Olakabola, Eugene Pentland, zynix, Deep Realms, Raymond Fosdick, Elijah Stavena, Iucharbius, Erik Bjäreholt, Luis Javier Navarrete Lozano, Nicholas, theTransient, John Detwiler, alfiei, knownsqashed, Mano Prime, Willem Michiel, Enrico Ros, LangChain4j, OG, Michael Dempsey, Pierre Kircher, Pedro Madruga, James Bentley, Thomas Belote, Luke @flexchar, Leonard Tan, Johann-Peter Hartmann, Illia Dulskyi, Fen Risland, Chadd, SX, Jeff Scroggin, Ken Nordquist, Sean Connelly, Artur Olbinski, Swaroop Kallakuri, Jack West, Ai Maven, David Ziegler, Russ Johnson, transmissions 11, John Villwock, Alps Aficionado, Clay Pascal, Viktor Bowallius, Subspace Studios, Rainer Wilmers, Trenton Dambrowitz, vamX, Michael Levine, 준교 김, Brandon Frisco, Kalila, Trailburnt, Randy H, Talal Aujan, Nathan Dryer, Vadim, 阿明, ReadyPlayerEmma, Tiffany J. Kim, George Stoitzev, Spencer Kim, Jerry Meng, Gabriel Tamborski, Cory Kujawski, Jeffrey Morgan, Spiking Neurons AB, Edmond Seymore, Alexandros Triantafyllidis, Lone Striker, Cap'n Zoog, Nikolai Manek, danny, ya boyyy, Derek Yates, usrbinkat, Mandus, TL, Nathan LeClaire, subjectnull, Imad Khwaja, webtim, Raven Klaugh, Asp the Wyvern, Gabriel Puliatti, Caitlyn Gatomon, Joseph William Delisle, Jonathan Leane, Luke Pendergrass, SuperWojo, Sebastain Graf, Will Dee, Fred von Graf, Andrey, Dan Guido, Daniel P. Andersen, Nitin Borwankar, Elle, Vitor Caleffi, biorpg, jjj, NimbleBox.ai, Pieter, Matthew Berman, terasurfer, Michael Davis, Alex, Stanislav Ovsiannikov

And thank you again to a16z for their generous grant.

An improved, potentially even perfected variant of MythoMix, my MythoLogic-L2 and Huginn merge using a highly experimental tensor-type merge technique. The main difference with MythoMix is that I allowed more of Huginn to intermingle with the single tensors located at the front and end of a model, resulting in increased coherency across the entire structure. The script and the accompanying templates I used to produce both can be found here. This model is proficient at both roleplaying and storywriting due to its unique nature. Quantized models are available from TheBloke: GGML - GPTQ (You're the best!)

The idea behind this merge is that each layer is composed of several tensors, which are in turn responsible for specific functions. Using MythoLogic-L2's robust understanding as its input and Huginn's extensive writing capability as its output seems to have resulted in a model that excels at both, confirming my theory. (More details to be released at a later time.) This type of merge is incapable of being illustrated, as each of its 363 tensors had a unique ratio applied to it. As with my prior merges, gradients were part of these ratios to further fine-tune its behaviour.

This model primarily uses Alpaca formatting, so for optimal model performance, use:
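The Alpaca template referred to here (matching the `prompt_template` shown in the Noromaid card metadata near the top of this page) is:

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{prompt}

### Response:
```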

llama
57,148
200

Mistral-7B-Instruct-v0.2-AWQ

--- base_model: mistralai/Mistral-7B-Instruct-v0.2 inference: false license: apache-2.0 model_creator: Mistral AI_ model_name: Mistral 7B Instruct v0.2 model_type: mistral pipeline_tag: text-generation prompt_template: '[INST] {prompt} [/INST]

license:apache-2.0
55,230
51

Llama-2-7B-Chat-GGUF

--- language: - en license: llama2 tags: - facebook - meta - pytorch - llama - llama-2 model_name: Llama 2 7B Chat arxiv: 2307.09288 base_model: meta-llama/Llama-2-7b-chat-hf inference: false model_creator: Meta Llama 2 model_type: llama pipeline_tag: text-generation prompt_template: '[INST] >

llama
55,028
500

TinyLlama-1.1B-Chat-v0.3-AWQ

llama
45,045
3

phi-2-GGUF

Phi 2 - GGUF
- Model creator: Microsoft
- Original model: Phi 2

This repo contains GGUF format model files for Microsoft's Phi 2.

GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp.

Here is an incomplete list of clients and libraries that are known to support GGUF:
- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
- GPT4All, a free and open source local running GUI, supporting Windows, Linux and macOS with full GPU accel.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration. Linux available, in beta as of 27/11/2023.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy-to-use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.
- ctransformers, a Python library with GPU accel, LangChain support, and an OpenAI-compatible AI server. Note: as of time of writing (November 27th 2023), ctransformers has not been updated in a long time and does not support many recent models.

Repositories available:
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference
- Microsoft's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions

These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third-party UIs and libraries - please see the list at the top of this README.

The new methods available are:
- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| phi-2.Q2_K.gguf | Q2_K | 2 | 1.17 GB | 3.67 GB | smallest, significant quality loss - not recommended for most purposes |
| phi-2.Q3_K_S.gguf | Q3_K_S | 3 | 1.25 GB | 3.75 GB | very small, high quality loss |
| phi-2.Q3_K_M.gguf | Q3_K_M | 3 | 1.48 GB | 3.98 GB | very small, high quality loss |
| phi-2.Q4_0.gguf | Q4_0 | 4 | 1.60 GB | 4.10 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| phi-2.Q3_K_L.gguf | Q3_K_L | 3 | 1.60 GB | 4.10 GB | small, substantial quality loss |
| phi-2.Q4_K_S.gguf | Q4_K_S | 4 | 1.62 GB | 4.12 GB | small, greater quality loss |
| phi-2.Q4_K_M.gguf | Q4_K_M | 4 | 1.79 GB | 4.29 GB | medium, balanced quality - recommended |
| phi-2.Q5_0.gguf | Q5_0 | 5 | 1.93 GB | 4.43 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| phi-2.Q5_K_S.gguf | Q5_K_S | 5 | 1.93 GB | 4.43 GB | large, low quality loss - recommended |
| phi-2.Q5_K_M.gguf | Q5_K_M | 5 | 2.07 GB | 4.57 GB | large, very low quality loss - recommended |
| phi-2.Q6_K.gguf | Q6_K | 6 | 2.29 GB | 4.79 GB | very large, extremely low quality loss |
| phi-2.Q8_0.gguf | Q8_0 | 8 | 2.96 GB | 5.46 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file. The following clients/libraries will automatically download models for you, providing a list of available models to choose from. Under Download Model, you can enter the model repo: TheBloke/phi-2-GGUF and below it, a specific filename to download, such as: phi-2.Q4_K_M.gguf.
On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library. Then you can download any individual model file to the current directory, at high speed. You can also download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI.

To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows Command Line users: you can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.

Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 2048` to the desired sequence length. For extended sequence models - e.g. 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. Note that longer sequence lengths require much more resources, so you may need to reduce this value. If you want to have a chat-style conversation, replace the `-p` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation.

Further instructions can be found in the text-generation-webui documentation, here: text-generation-webui/docs/04 ‐ Model Tab.md.

You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries. Note that at the time of writing (Nov 27th 2023), ctransformers has not been updated for some time and is not compatible with some recent models. Therefore I recommend you use llama-cpp-python. How to load this model in Python code, using llama-cpp-python: for full documentation, please see the llama-cpp-python docs.
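The original llama-cpp-python example code was stripped from this excerpt; a minimal sketch follows, assuming `llama-cpp-python` is installed and `phi-2.Q4_K_M.gguf` has already been downloaded to the current directory (both the filename and the prompt are illustrative):

```python
from llama_cpp import Llama

# Load the downloaded GGUF quant file
llm = Llama(
    model_path="./phi-2.Q4_K_M.gguf",
    n_ctx=2048,       # sequence length; longer values need much more resources
    n_gpu_layers=32,  # layers to offload to GPU; set 0 without GPU acceleration
)

# Phi-2's "Instruct:/Output:" QA format
output = llm(
    "Instruct: Explain what a GGUF file is in one sentence.\nOutput:",
    max_tokens=128,
    stop=["Instruct:"],
)
print(output["choices"][0]["text"])
```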
Run one of the following commands, according to your system: Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python LangChain + ctransformers For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Michael Levine, 阿明, Trailburnt, Nikolai Manek, John Detwiler, Randy H, Will Dee, Sebastain Graf, NimbleBox.ai, Eugene Pentland, Emad Mostaque, Ai Maven, Jim Angel, Jeff Scroggin, Michael Davis, Manuel Alberto Morcote, Stephen Murray, Robert, Justin Joy, Luke @flexchar, Brandon Frisco, Elijah Stavena, SX, Dan Guido, Undi ., Komninos Chatzipapas, Shadi, theTransient, Lone Striker, Raven Klaugh, jjj, Cap'n Zoog, Michel-Marie MAUDET (LINAGORA), Matthew Berman, David, Fen Risland, Omer Bin Jawed, Luke Pendergrass, Kalila, OG, Erik Bjäreholt, Rooh Singh, Joseph William Delisle, Dan Lewis, TL, John Villwock, AzureBlack, Brad, Pedro Madruga, Caitlyn Gatomon, K, jinyuan sun, Mano Prime, Alex, Jeffrey Morgan, Alicia Loh, Illia Dulskyi, Chadd, transmissions 11, fincy, Rainer Wilmers, ReadyPlayerEmma, knownsqashed, Mandus, biorpg, Deo Leter, Brandon Phillips, SuperWojo, Sean Connelly, Iucharbius, Jack West, Harry Royden McLaughlin, Nicholas, terasurfer, Vitor Caleffi, Duane Dunston, Johann-Peter Hartmann, David Ziegler, Olakabola, Ken Nordquist, Trenton Dambrowitz, Tom X Nguyen, Vadim, 
Ajan Kanaga, Leonard Tan, Clay Pascal, Alexandros Triantafyllidis, JM33133, Xule, vamX, ya boyyy, subjectnull, Talal Aujan, Alps Aficionado, wassieverse, Ari Malik, James Bentley, Woland, Spencer Kim, Michael Dempsey, Fred von Graf, Elle, zynix, William Richards, Stanislav Ovsiannikov, Edmond Seymore, Jonathan Leane, Martin Kemka, usrbinkat, Enrico Ros

And thank you again to a16z for their generous grant.

Phi-2 is a Transformer with 2.7 billion parameters. It was trained using the same data sources as Phi-1.5, augmented with a new data source that consists of various NLP synthetic texts and filtered websites (for safety and educational value). When assessed against benchmarks testing common sense, language understanding, and logical reasoning, Phi-2 showcased nearly state-of-the-art performance among models with less than 13 billion parameters. Our model hasn't been fine-tuned through reinforcement learning from human feedback. The intention behind crafting this open-source model is to provide the research community with a non-restricted small model to explore vital safety challenges, such as reducing toxicity, understanding societal biases, enhancing controllability, and more. Phi-2 is intended for research purposes only.

Given the nature of the training data, the Phi-2 model is best suited for prompts using the QA format, the chat format, and the code format. You can provide the prompt as a standalone question, where the model generates the text after "." To encourage the model to write more concise answers, you can also try the QA format using `Instruct: <prompt>\nOutput:`, where the model generates the text after "Output:". In the chat format, the model generates the text after the first "Bob:"; in the code format, it generates the text after the comments.

Notes: Phi-2 is intended for research purposes. The model-generated text/code should be treated as a starting point rather than a definitive solution for potential use cases.
Users should be cautious when employing these models in their applications. Direct adoption for production tasks is out of the scope of this research project. As a result, the Phi-2 model has not been tested to ensure that it performs adequately for any production-level application. Please refer to the limitation sections of this document for more details.

If you are using `transformers>=4.36.0`, always load the model with `trust_remote_code=True` to prevent side-effects. To ensure maximum compatibility, we recommend using the second execution mode (FP16 / CUDA), as follows. Remark: in the generation function, our model currently does not support beam search (`num_beams > 1`). Furthermore, in the forward pass of the model, we currently do not support outputting hidden states or attention values, or using custom input embeddings.

Limitations:
- Generate Inaccurate Code and Facts: The model may produce incorrect code snippets and statements. Users should treat these outputs as suggestions or starting points, not as definitive or accurate solutions.
- Limited Scope for code: The majority of Phi-2 training data is based in Python and uses common packages such as "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages, or scripts in other languages, we strongly recommend users manually verify all API uses.
- Unreliable Responses to Instruction: The model has not undergone instruction fine-tuning. As a result, it may struggle or fail to adhere to intricate or nuanced instructions provided by users.
- Language Limitations: The model is primarily designed to understand standard English. Informal English, slang, or any other languages might pose challenges to its comprehension, leading to potential misinterpretations or errors in response.
- Potential Societal Biases: Phi-2 is not entirely free from societal biases despite efforts in assuring training data safety.
There's a possibility it may generate content that mirrors these societal biases, particularly if prompted or instructed to do so. We urge users to be aware of this and to exercise caution and critical thinking when interpreting model outputs.
- Toxicity: Despite being trained with carefully selected data, the model can still produce harmful content if explicitly prompted or instructed to do so. We chose to release the model for research purposes only; we hope to help the open-source community develop the most effective ways to reduce the toxicity of a model directly after pretraining.
- Verbosity: Phi-2, being a base model, often produces irrelevant or extra text and responses following its first answer to user prompts within a single turn. This is due to its training dataset being primarily textbooks, which results in textbook-like responses.

Architecture: a Transformer-based model with next-word prediction objective. Dataset size: 250B tokens, a combination of NLP synthetic data created by AOAI GPT-3.5 and filtered web data from Falcon RefinedWeb and SlimPajama, which was assessed by AOAI GPT-4.

The model is licensed under the microsoft-research-license. This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
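The QA, chat, and code prompt formats described in this card can be sketched as simple string builders. The `Instruct:/Output:` layout follows the format quoted above; the helper names and example turns are illustrative:

```python
def qa_prompt(question: str) -> str:
    # QA format: model generates the text after "Output:"
    return f"Instruct: {question}\nOutput:"

def chat_prompt(turns: list[tuple[str, str]]) -> str:
    # Chat format: model continues after the trailing "Bob:" line
    lines = [f"{who}: {text}" for who, text in turns]
    lines.append("Bob:")
    return "\n".join(lines)

def code_prompt(stub: str) -> str:
    # Code format: model completes the body after the comments/docstring
    return stub

print(qa_prompt("What is a GGUF file?"))
```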

41,339
230

Mistral-7B-Instruct-v0.1-GGUF

Mistral 7B Instruct v0.1 - GGUF - Model creator: Mistral AI - Original model: Mistral 7B Instruct v0.1 This re...

license:apache-2.0
35,449
607

TinyLlama-1.1B-Chat-v1.0-GPTQ

llama
32,342
14

dolphin-2.6-mistral-7B-AWQ

license:apache-2.0
31,135
8

Mixtral-8x7B-Instruct-v0.1-GGUF

license:apache-2.0
26,317
646

deepseek-coder-6.7B-instruct-GGUF

22,717
227

OpenHermes-2.5-Mistral-7B-GGUF

license:apache-2.0
16,391
270

Llama-2-13B-chat-GGUF

llama
16,222
203

Mistral-7B-v0.1-GGUF

license:apache-2.0
14,879
269

Wizard-Vicuna-13B-Uncensored-GGUF

Wizard Vicuna 13B Uncensored - GGUF
- Model creator: Eric Hartford
- Original model: Wizard Vicuna 13B Uncensored

This repo contains GGUF format model files for Eric Hartford's Wizard Vicuna 13B Uncensored.

GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp.

Here is an incomplete list of clients and libraries that are known to support GGUF:
- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy-to-use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- ctransformers, a Python library with GPU accel, LangChain support, and an OpenAI-compatible AI server.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.

Repositories available:
- AWQ model(s) for GPU inference.
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference
- Eric Hartford's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions

These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third-party UIs and libraries - please see the list at the top of this README.

The new methods available are:
- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| Wizard-Vicuna-13B-Uncensored.Q2_K.gguf | Q2_K | 2 | 5.43 GB | 7.93 GB | smallest, significant quality loss - not recommended for most purposes |
| Wizard-Vicuna-13B-Uncensored.Q3_K_S.gguf | Q3_K_S | 3 | 5.66 GB | 8.16 GB | very small, high quality loss |
| Wizard-Vicuna-13B-Uncensored.Q3_K_M.gguf | Q3_K_M | 3 | 6.34 GB | 8.84 GB | very small, high quality loss |
| Wizard-Vicuna-13B-Uncensored.Q3_K_L.gguf | Q3_K_L | 3 | 6.93 GB | 9.43 GB | small, substantial quality loss |
| Wizard-Vicuna-13B-Uncensored.Q4_0.gguf | Q4_0 | 4 | 7.37 GB | 9.87 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| Wizard-Vicuna-13B-Uncensored.Q4_K_S.gguf | Q4_K_S | 4 | 7.41 GB | 9.91 GB | small, greater quality loss |
| Wizard-Vicuna-13B-Uncensored.Q4_K_M.gguf | Q4_K_M | 4 | 7.87 GB | 10.37 GB | medium, balanced quality - recommended |
| Wizard-Vicuna-13B-Uncensored.Q5_0.gguf | Q5_0 | 5 | 8.97 GB | 11.47 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| Wizard-Vicuna-13B-Uncensored.Q5_K_S.gguf | Q5_K_S | 5 | 8.97 GB | 11.47 GB | large, low quality loss - recommended |
| Wizard-Vicuna-13B-Uncensored.Q5_K_M.gguf | Q5_K_M | 5 | 9.23 GB | 11.73 GB | large, very low quality loss - recommended |
| Wizard-Vicuna-13B-Uncensored.Q6_K.gguf | Q6_K | 6 | 10.68 GB | 13.18 GB | very large, extremely low quality loss |
| Wizard-Vicuna-13B-Uncensored.Q8_0.gguf | Q8_0 | 8 | 13.83 GB | 16.33 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.
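In the table above, "Max RAM required" is consistently the file size plus about 2.50 GB of overhead, so the RAM needed for any quant can be estimated from its file size alone. A sketch (the 2.50 GB constant is inferred from the table rows; treat it as a rule of thumb, not an exact figure):

```python
# Estimate "Max RAM required" (no GPU offload) from a GGUF file size in GB.
# The ~2.50 GB constant is inferred from the table above, e.g. the 7.87 GB
# Q4_K_M file -> 10.37 GB RAM.

OVERHEAD_GB = 2.50

def max_ram_gb(file_size_gb: float) -> float:
    return round(file_size_gb + OVERHEAD_GB, 2)

print(max_ram_gb(7.87))   # 10.37, matches the Q4_K_M row
print(max_ram_gb(13.83))  # 16.33, matches the Q8_0 row
```

Offloading layers to the GPU with `-ngl` moves part of this requirement from system RAM into VRAM.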
The following clients/libraries will automatically download models for you, providing a list of available models to choose from:

- LM Studio
- LoLLMS Web UI
- Faraday.dev

Under Download Model, you can enter the model repo: TheBloke/Wizard-Vicuna-13B-Uncensored-GGUF and below it, a specific filename to download, such as: Wizard-Vicuna-13B-Uncensored.Q4_K_M.gguf.

On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library. Then you can download any individual model file to the current directory, at high speed, with a single command. You can also download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI.

To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows Command Line users: you can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.

Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 2048` to the desired sequence length. For extended-sequence models - e.g. 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation.

Further instructions here: text-generation-webui/docs/llama.cpp.md.

You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries.
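A minimal sketch of loading one of these GGUF files from Python with ctransformers (assumes `pip install ctransformers` and that the quant file from the table above is available; `gpu_layers` plays the role of llama.cpp's `-ngl`):

```python
# Sketch: load a GGUF quant of this model with ctransformers.
# Repo and filename are the ones from this README; everything else is
# illustrative rather than a definitive recipe.

def load_llm(gpu_layers: int = 32):
    # Imported lazily so the sketch can be read without ctransformers installed.
    from ctransformers import AutoModelForCausalLM

    # Set gpu_layers=0 for CPU-only inference.
    return AutoModelForCausalLM.from_pretrained(
        "TheBloke/Wizard-Vicuna-13B-Uncensored-GGUF",
        model_file="Wizard-Vicuna-13B-Uncensored.Q4_K_M.gguf",
        model_type="llama",
        gpu_layers=gpu_layers,
    )

# Usage (fetches ~8 GB on first call):
#   llm = load_llm()
#   print(llm("USER: Write a haiku about quantization. ASSISTANT:"))
```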
How to load this model in Python code, using ctransformers Run one of the following commands, according to your system: Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python LangChain + ctransformers For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Alicia Loh, Stephen Murray, K, Ajan Kanaga, RoA, Magnesian, Deo Leter, Olakabola, Eugene Pentland, zynix, Deep Realms, Raymond Fosdick, Elijah Stavena, Iucharbius, Erik Bjäreholt, Luis Javier Navarrete Lozano, Nicholas, theTransient, John Detwiler, alfiei, knownsqashed, Mano Prime, Willem Michiel, Enrico Ros, LangChain4j, OG, Michael Dempsey, Pierre Kircher, Pedro Madruga, James Bentley, Thomas Belote, Luke @flexchar, Leonard Tan, Johann-Peter Hartmann, Illia Dulskyi, Fen Risland, Chadd, SX, Jeff Scroggin, Ken Nordquist, Sean Connelly, Artur Olbinski, Swaroop Kallakuri, Jack West, Ai Maven, David Ziegler, Russ Johnson, transmissions 11, John Villwock, Alps Aficionado, Clay Pascal, Viktor Bowallius, Subspace Studios, Rainer Wilmers, Trenton Dambrowitz, vamX, Michael Levine, 준교 김, Brandon Frisco, Kalila, Trailburnt, Randy H, Talal Aujan, Nathan Dryer, Vadim, 阿明, ReadyPlayerEmma, Tiffany J. 
Kim, George Stoitzev, Spencer Kim, Jerry Meng, Gabriel Tamborski, Cory Kujawski, Jeffrey Morgan, Spiking Neurons AB, Edmond Seymore, Alexandros Triantafyllidis, Lone Striker, Cap'n Zoog, Nikolai Manek, danny, ya boyyy, Derek Yates, usrbinkat, Mandus, TL, Nathan LeClaire, subjectnull, Imad Khwaja, webtim, Raven Klaugh, Asp the Wyvern, Gabriel Puliatti, Caitlyn Gatomon, Joseph William Delisle, Jonathan Leane, Luke Pendergrass, SuperWojo, Sebastain Graf, Will Dee, Fred von Graf, Andrey, Dan Guido, Daniel P. Andersen, Nitin Borwankar, Elle, Vitor Caleffi, biorpg, jjj, NimbleBox.ai, Pieter, Matthew Berman, terasurfer, Michael Davis, Alex, Stanislav Ovsiannikov. And thank you again to a16z for their generous grant.

Original model card: Eric Hartford's Wizard Vicuna 13B Uncensored

This is wizard-vicuna-13b trained with a subset of the dataset - responses that contained alignment / moralizing were removed. The intent is to train a WizardLM that doesn't have alignment built-in, so that alignment (of any sort) can be added separately, for example with an RLHF LoRA. Shout out to the open source AI/ML community, and everyone who helped me out. You are responsible for anything you do with the model, just as you are responsible for anything you do with any dangerous object such as a knife, gun, lighter, or car. Publishing anything this model generates is the same as publishing it yourself. You are responsible for the content you publish, and you cannot blame the model any more than you can blame the knife, gun, lighter, or car for what you do with it.

llama
13,897
86

deepseek-coder-33B-instruct-GGUF

13,150
185

Llama-2-7B-GGUF

TheBloke's LLM work is generously supported by a grant from Andreessen Horowitz (a16z)

Llama 2 7B - GGUF
- Model creator: Meta
- Original model: Llama 2 7B

This repo contains GGUF format model files for Meta's Llama 2 7B. GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. GGUF offers numerous advantages over GGML, such as better tokenisation and support for special tokens. It also supports metadata, and is designed to be extensible.

Here is an incomplete list of clients and libraries that are known to support GGUF:

- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy-to-use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- ctransformers, a Python library with GPU accel, LangChain support, and an OpenAI-compatible AI server.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.

Repositories available:

- AWQ model(s) for GPU inference.
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference
- Meta's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions

These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d36d5be95a0d9088b674dbb27354107221. They are also compatible with many third-party UIs and libraries - please see the list at the top of this README.

The new methods available are:

- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| llama-2-7b.Q2_K.gguf | Q2_K | 2 | 2.83 GB | 5.33 GB | smallest, significant quality loss - not recommended for most purposes |
| llama-2-7b.Q3_K_S.gguf | Q3_K_S | 3 | 2.95 GB | 5.45 GB | very small, high quality loss |
| llama-2-7b.Q3_K_M.gguf | Q3_K_M | 3 | 3.30 GB | 5.80 GB | very small, high quality loss |
| llama-2-7b.Q3_K_L.gguf | Q3_K_L | 3 | 3.60 GB | 6.10 GB | small, substantial quality loss |
| llama-2-7b.Q4_0.gguf | Q4_0 | 4 | 3.83 GB | 6.33 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| llama-2-7b.Q4_K_S.gguf | Q4_K_S | 4 | 3.86 GB | 6.36 GB | small, greater quality loss |
| llama-2-7b.Q4_K_M.gguf | Q4_K_M | 4 | 4.08 GB | 6.58 GB | medium, balanced quality - recommended |
| llama-2-7b.Q5_0.gguf | Q5_0 | 5 | 4.65 GB | 7.15 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| llama-2-7b.Q5_K_S.gguf | Q5_K_S | 5 | 4.65 GB | 7.15 GB | large, low quality loss - recommended |
| llama-2-7b.Q5_K_M.gguf | Q5_K_M | 5 | 4.78 GB | 7.28 GB | large, very low quality loss - recommended |
| llama-2-7b.Q6_K.gguf | Q6_K | 6 | 5.53 GB | 8.03 GB | very large, extremely low quality loss |
| llama-2-7b.Q8_0.gguf | Q8_0 | 8 | 7.16 GB | 9.66 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.

The following clients/libraries will automatically download models for you, providing a list of available models to choose from:

- LM Studio
- LoLLMS Web UI
- Faraday.dev

Under Download Model, you can enter the model repo: TheBloke/Llama-2-7B-GGUF and below it, a specific filename to download, such as: llama-2-7b.Q4_K_M.gguf.
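The same per-file download can be scripted from Python with `huggingface_hub` (a sketch; assumes `pip install huggingface-hub`, with the repo and filename taken from this README):

```python
# Sketch: fetch a single quant file rather than cloning the whole repo.

def download_quant(filename: str = "llama-2-7b.Q4_K_M.gguf") -> str:
    # Imported lazily so the sketch can be read without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    # Downloads just the one GGUF file and returns its local path.
    return hf_hub_download(
        repo_id="TheBloke/Llama-2-7B-GGUF",
        filename=filename,
        local_dir=".",
    )

# Usage: path = download_quant()  # ~4 GB download
```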
On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library. Then you can download any individual model file to the current directory, at high speed, with a single command. You can also download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI.

To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows CLI users: use `set HF_HUB_ENABLE_HF_TRANSFER=1` before running the download command.

Make sure you are using `llama.cpp` from commit d0cee0d36d5be95a0d9088b674dbb27354107221 or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 4096` to the desired sequence length. For extended-sequence models - e.g. 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation.

Further instructions here: text-generation-webui/docs/llama.cpp.md.

You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries. Here are guides on using llama-cpp-python or ctransformers with LangChain: LangChain + llama-cpp-python; LangChain + ctransformers.

For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.
If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Alicia Loh, Stephen Murray, K, Ajan Kanaga, RoA, Magnesian, Deo Leter, Olakabola, Eugene Pentland, zynix, Deep Realms, Raymond Fosdick, Elijah Stavena, Iucharbius, Erik Bjäreholt, Luis Javier Navarrete Lozano, Nicholas, theTransient, John Detwiler, alfiei, knownsqashed, Mano Prime, Willem Michiel, Enrico Ros, LangChain4j, OG, Michael Dempsey, Pierre Kircher, Pedro Madruga, James Bentley, Thomas Belote, Luke @flexchar, Leonard Tan, Johann-Peter Hartmann, Illia Dulskyi, Fen Risland, Chadd, SX, Jeff Scroggin, Ken Nordquist, Sean Connelly, Artur Olbinski, Swaroop Kallakuri, Jack West, Ai Maven, David Ziegler, Russ Johnson, transmissions 11, John Villwock, Alps Aficionado, Clay Pascal, Viktor Bowallius, Subspace Studios, Rainer Wilmers, Trenton Dambrowitz, vamX, Michael Levine, 준교 김, Brandon Frisco, Kalila, Trailburnt, Randy H, Talal Aujan, Nathan Dryer, Vadim, 阿明, ReadyPlayerEmma, Tiffany J. Kim, George Stoitzev, Spencer Kim, Jerry Meng, Gabriel Tamborski, Cory Kujawski, Jeffrey Morgan, Spiking Neurons AB, Edmond Seymore, Alexandros Triantafyllidis, Lone Striker, Cap'n Zoog, Nikolai Manek, danny, ya boyyy, Derek Yates, usrbinkat, Mandus, TL, Nathan LeClaire, subjectnull, Imad Khwaja, webtim, Raven Klaugh, Asp the Wyvern, Gabriel Puliatti, Caitlyn Gatomon, Joseph William Delisle, Jonathan Leane, Luke Pendergrass, SuperWojo, Sebastain Graf, Will Dee, Fred von Graf, Andrey, Dan Guido, Daniel P. 
Andersen, Nitin Borwankar, Elle, Vitor Caleffi, biorpg, jjj, NimbleBox.ai, Pieter, Matthew Berman, terasurfer, Michael Davis, Alex, Stanislav Ovsiannikov. And thank you again to a16z for their generous grant.

Llama 2

Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository for the 7B pretrained model, converted for the Hugging Face Transformers format. Links to other models can be found in the index at the bottom.

Model Details

Note: Use of this model is governed by the Meta license. In order to download the model weights and tokenizer, please visit the website and accept our License before requesting access here.

Meta developed and publicly released the Llama 2 family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama-2-Chat, are optimized for dialogue use cases. Llama-2-Chat models outperform open-source chat models on most benchmarks we tested, and in our human evaluations for helpfulness and safety, are on par with some popular closed-source models like ChatGPT and PaLM.

Variations: Llama 2 comes in a range of parameter sizes - 7B, 13B, and 70B - as well as pretrained and fine-tuned variations.

Model Architecture: Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety.

||Training Data|Params|Content Length|GQA|Tokens|LR|
|---|---|---|---|---|---|---|
|Llama 2|A new mix of publicly available online data|7B|4k|✗|2.0T|3.0 x 10^-4|
|Llama 2|A new mix of publicly available online data|13B|4k|✗|2.0T|3.0 x 10^-4|
|Llama 2|A new mix of publicly available online data|70B|4k|✔|2.0T|1.5 x 10^-4|

Llama 2 family of models.
Token counts refer to pretraining data only. All models are trained with a global batch-size of 4M tokens. Bigger models - 70B - use Grouped-Query Attention (GQA) for improved inference scalability.

Model Dates: Llama 2 was trained between January 2023 and July 2023.

Status: This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback.

License: A custom commercial license is available at: https://ai.meta.com/resources/models-and-libraries/llama-downloads/

Research Paper: "Llama 2: Open Foundation and Fine-Tuned Chat Models"

Intended Use

Intended Use Cases: Llama 2 is intended for commercial and research use in English. Tuned models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks. To get the expected features and performance for the chat versions, a specific formatting needs to be followed, including the `INST` and `<<SYS>>` tags, `BOS` and `EOS` tokens, and the whitespaces and breaklines in between (we recommend calling `strip()` on inputs to avoid double-spaces). See our reference code in github for details: `chat_completion`.

Out-of-scope Uses: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English. Use in any other way that is prohibited by the Acceptable Use Policy and Licensing Agreement for Llama 2.

Hardware and Software

Training Factors: We used custom training libraries, Meta's Research Super Cluster, and production clusters for pretraining. Fine-tuning, annotation, and evaluation were also performed on third-party cloud compute.

Carbon Footprint: Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB (TDP of 350-400W). Estimated total emissions were 539 tCO2eq, 100% of which were offset by Meta's sustainability program.
||Time (GPU hours)|Power Consumption (W)|Carbon Emitted (tCO2eq)|
|---|---|---|---|
|Llama 2 7B|184320|400|31.22|
|Llama 2 13B|368640|400|62.44|
|Llama 2 70B|1720320|400|291.42|
|Total|3311616||539.00|

CO2 emissions during pretraining. Time: total GPU time required for training each model. Power Consumption: peak power capacity per GPU device for the GPUs used, adjusted for power usage efficiency. 100% of the emissions are directly offset by Meta's sustainability program, and because we are openly releasing these models, the pretraining costs do not need to be incurred by others.

Training Data

Overview: Llama 2 was pretrained on 2 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over one million new human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data.

Data Freshness: The pretraining data has a cutoff of September 2022, but some tuning data is more recent, up to July 2023.

In this section, we report the results for the Llama 1 and Llama 2 models on standard academic benchmarks. For all the evaluations, we use our internal evaluations library.

|Model|Size|Code|Commonsense Reasoning|World Knowledge|Reading Comprehension|Math|MMLU|BBH|AGI Eval|
|---|---|---|---|---|---|---|---|---|---|
|Llama 1|7B|14.1|60.8|46.2|58.5|6.95|35.1|30.3|23.9|
|Llama 1|13B|18.9|66.1|52.6|62.3|10.9|46.9|37.0|33.9|
|Llama 1|33B|26.0|70.0|58.4|67.6|21.4|57.8|39.8|41.7|
|Llama 1|65B|30.7|70.7|60.5|68.6|30.8|63.4|43.5|47.6|
|Llama 2|7B|16.8|63.9|48.9|61.3|14.6|45.3|32.6|29.3|
|Llama 2|13B|24.5|66.9|55.4|65.8|28.7|54.8|39.4|39.1|
|Llama 2|70B|37.5|71.9|63.6|69.4|35.2|68.9|51.2|54.2|

Overall performance on grouped academic benchmarks. Code: We report the average pass@1 scores of our models on HumanEval and MBPP. Commonsense Reasoning: We report the average of PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA, and CommonsenseQA.
We report 7-shot results for CommonSenseQA and 0-shot results for all other benchmarks. World Knowledge: We evaluate the 5-shot performance on NaturalQuestions and TriviaQA and report the average. Reading Comprehension: For reading comprehension, we report the 0-shot average on SQuAD, QuAC, and BoolQ. Math: We report the average of the GSM8K (8-shot) and MATH (4-shot) benchmarks at top 1.

|||TruthfulQA|Toxigen|
|---|---|---|---|
|Llama 1|7B|27.42|23.00|
|Llama 1|13B|41.74|23.08|
|Llama 1|33B|44.19|22.57|
|Llama 1|65B|48.71|21.77|
|Llama 2|7B|33.29|21.25|
|Llama 2|13B|41.86|26.10|
|Llama 2|70B|50.18|24.60|

Evaluation of pretrained LLMs on automatic safety benchmarks. For TruthfulQA, we present the percentage of generations that are both truthful and informative (the higher the better). For ToxiGen, we present the percentage of toxic generations (the smaller the better).

|||TruthfulQA|Toxigen|
|---|---|---|---|
|Llama-2-Chat|7B|57.04|0.00|
|Llama-2-Chat|13B|62.18|0.00|
|Llama-2-Chat|70B|64.14|0.01|

Evaluation of fine-tuned LLMs on different safety datasets. Same metric definitions as above.

Ethical Considerations and Limitations: Llama 2 is a new technology that carries risks with use. Testing conducted to date has been in English, and has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Llama 2's potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Llama 2, developers should perform safety testing and tuning tailored to their specific applications of the model.
Please see the Responsible Use Guide available at https://ai.meta.com/llama/responsible-use-guide/

Reporting Issues: Please report any software "bug," or other problems with the models through one of the following means:

- Reporting issues with the model: github.com/facebookresearch/llama
- Reporting problematic content generated by the model: developers.facebook.com/llama_output_feedback
- Reporting bugs and security concerns: facebook.com/whitehat/info

Llama Model Index

|Model|Llama2|Llama2-hf|Llama2-chat|Llama2-chat-hf|
|---|---|---|---|---|
|7B|Link|Link|Link|Link|
|13B|Link|Link|Link|Link|
|70B|Link|Link|Link|Link|

llama
13,013
206

dolphin-2.5-mixtral-8x7b-GGUF

license:apache-2.0
12,872
303

MythoMax-L2-Kimiko-v2-13B-GGUF

TheBloke's LLM work is generously supported by a grant from Andreessen Horowitz (a16z)

MythoMax L2 Kimiko v2 13B - GGUF
- Model creator: Undi95
- Original model: MythoMax L2 Kimiko v2 13B

This repo contains GGUF format model files for Undi95's MythoMax L2 Kimiko v2 13B. GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. GGUF offers numerous advantages over GGML, such as better tokenisation and support for special tokens. It also supports metadata, and is designed to be extensible.

Here is an incomplete list of clients and libraries that are known to support GGUF:

- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy-to-use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- ctransformers, a Python library with GPU accel, LangChain support, and an OpenAI-compatible AI server.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.

Repositories available:

- AWQ model(s) for GPU inference.
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference
- Undi95's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions

The creator of the source model has listed its license as `cc-by-nc-4.0`, and this quantization has therefore used that same license. As this model is based on Llama 2, it is also subject to the Meta Llama 2 license terms, and the license files for that are additionally included. It should therefore be considered as being claimed to be licensed under both licenses. I contacted Hugging Face for clarification on dual licensing but they do not yet have an official position. Should this change, or should Meta provide any feedback on this situation, I will update this section accordingly. In the meantime, any questions regarding licensing, and in particular how these two licenses might interact, should be directed to the original model repository: Undi95's MythoMax L2 Kimiko v2 13B.

These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d36d5be95a0d9088b674dbb27354107221. They are also compatible with many third-party UIs and libraries - please see the list at the top of this README.

The new methods available are:

- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization.
Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| mythomax-l2-kimiko-v2-13b.Q2_K.gguf | Q2_K | 2 | 5.43 GB | 7.93 GB | smallest, significant quality loss - not recommended for most purposes |
| mythomax-l2-kimiko-v2-13b.Q3_K_S.gguf | Q3_K_S | 3 | 5.66 GB | 8.16 GB | very small, high quality loss |
| mythomax-l2-kimiko-v2-13b.Q3_K_M.gguf | Q3_K_M | 3 | 6.34 GB | 8.84 GB | very small, high quality loss |
| mythomax-l2-kimiko-v2-13b.Q3_K_L.gguf | Q3_K_L | 3 | 6.93 GB | 9.43 GB | small, substantial quality loss |
| mythomax-l2-kimiko-v2-13b.Q4_0.gguf | Q4_0 | 4 | 7.37 GB | 9.87 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| mythomax-l2-kimiko-v2-13b.Q4_K_S.gguf | Q4_K_S | 4 | 7.41 GB | 9.91 GB | small, greater quality loss |
| mythomax-l2-kimiko-v2-13b.Q4_K_M.gguf | Q4_K_M | 4 | 7.87 GB | 10.37 GB | medium, balanced quality - recommended |
| mythomax-l2-kimiko-v2-13b.Q5_0.gguf | Q5_0 | 5 | 8.97 GB | 11.47 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| mythomax-l2-kimiko-v2-13b.Q5_K_S.gguf | Q5_K_S | 5 | 8.97 GB | 11.47 GB | large, low quality loss - recommended |
| mythomax-l2-kimiko-v2-13b.Q5_K_M.gguf | Q5_K_M | 5 | 9.23 GB | 11.73 GB | large, very low quality loss - recommended |
| mythomax-l2-kimiko-v2-13b.Q6_K.gguf | Q6_K | 6 | 10.68 GB | 13.18 GB | very large, extremely low quality loss |
| mythomax-l2-kimiko-v2-13b.Q8_0.gguf | Q8_0 | 8 | 13.83 GB | 16.33 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo!
Multiple different quantisation formats are provided, and most users only want to pick and download a single file.

The following clients/libraries will automatically download models for you, providing a list of available models to choose from:

- LM Studio
- LoLLMS Web UI
- Faraday.dev

Under Download Model, you can enter the model repo: TheBloke/MythoMax-L2-Kimiko-v2-13B-GGUF and below it, a specific filename to download, such as: mythomax-l2-kimiko-v2-13b.Q4_K_M.gguf.

On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library. Then you can download any individual model file to the current directory, at high speed, with a single command. You can also download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI.

To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows CLI users: use `set HF_HUB_ENABLE_HF_TRANSFER=1` before running the download command.

Make sure you are using `llama.cpp` from commit d0cee0d36d5be95a0d9088b674dbb27354107221 or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 4096` to the desired sequence length. For extended-sequence models - e.g. 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation.

Further instructions here: text-generation-webui/docs/llama.cpp.md.

You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries.
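A minimal sketch of loading one of these files with llama-cpp-python (assumes `pip install llama-cpp-python` and a downloaded quant; `n_gpu_layers` corresponds to llama.cpp's `-ngl` and `n_ctx` to `-c`):

```python
# Sketch: load a GGUF quant of this model with llama-cpp-python.
# The filename is from the Provided Files table above; the rest is
# illustrative rather than a definitive recipe.

def load_llm(n_gpu_layers: int = 32):
    # Imported lazily so the sketch can be read without llama-cpp-python installed.
    from llama_cpp import Llama

    # Set n_gpu_layers=0 for CPU-only inference.
    return Llama(
        model_path="./mythomax-l2-kimiko-v2-13b.Q4_K_M.gguf",
        n_ctx=4096,
        n_gpu_layers=n_gpu_layers,
    )

# Usage:
#   llm = load_llm()
#   out = llm("Below is an instruction that describes a task. ...", max_tokens=256)
#   print(out["choices"][0]["text"])
```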
How to load this model from Python using ctransformers Simple example code to load one of these GGUF models Here's guides on using llama-cpp-python or ctransformers with LangChain: LangChain + llama-cpp-python LangChain + ctransformers For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Alicia Loh, Stephen Murray, K, Ajan Kanaga, RoA, Magnesian, Deo Leter, Olakabola, Eugene Pentland, zynix, Deep Realms, Raymond Fosdick, Elijah Stavena, Iucharbius, Erik Bjäreholt, Luis Javier Navarrete Lozano, Nicholas, theTransient, John Detwiler, alfiei, knownsqashed, Mano Prime, Willem Michiel, Enrico Ros, LangChain4j, OG, Michael Dempsey, Pierre Kircher, Pedro Madruga, James Bentley, Thomas Belote, Luke @flexchar, Leonard Tan, Johann-Peter Hartmann, Illia Dulskyi, Fen Risland, Chadd, SX, Jeff Scroggin, Ken Nordquist, Sean Connelly, Artur Olbinski, Swaroop Kallakuri, Jack West, Ai Maven, David Ziegler, Russ Johnson, transmissions 11, John Villwock, Alps Aficionado, Clay Pascal, Viktor Bowallius, Subspace Studios, Rainer Wilmers, Trenton Dambrowitz, vamX, Michael Levine, 준교 김, Brandon Frisco, Kalila, Trailburnt, Randy H, Talal Aujan, Nathan Dryer, Vadim, 阿明, ReadyPlayerEmma, Tiffany J. 
Kim, George Stoitzev, Spencer Kim, Jerry Meng, Gabriel Tamborski, Cory Kujawski, Jeffrey Morgan, Spiking Neurons AB, Edmond Seymore, Alexandros Triantafyllidis, Lone Striker, Cap'n Zoog, Nikolai Manek, danny, ya boyyy, Derek Yates, usrbinkat, Mandus, TL, Nathan LeClaire, subjectnull, Imad Khwaja, webtim, Raven Klaugh, Asp the Wyvern, Gabriel Puliatti, Caitlyn Gatomon, Joseph William Delisle, Jonathan Leane, Luke Pendergrass, SuperWojo, Sebastain Graf, Will Dee, Fred von Graf, Andrey, Dan Guido, Daniel P. Andersen, Nitin Borwankar, Elle, Vitor Caleffi, biorpg, jjj, NimbleBox.ai, Pieter, Matthew Berman, terasurfer, Michael Davis, Alex, Stanislav Ovsiannikov And thank you again to a16z for their generous grant. Original model card: Undi95's MythoMax L2 Kimiko v2 13B Model : https://huggingface.co/Gryphe/MythoMax-L2-13b

NaNK
llama
12,295
40

CodeLlama-7B-Instruct-GGUF

NaNK
llama
12,124
143

dolphin-2.7-mixtral-8x7b-GGUF

NaNK
license:apache-2.0
11,606
159

deepseek-coder-1.3b-instruct-GGUF

NaNK
11,382
42

TinyLlama-1.1B-Chat-v0.3-GGUF

NaNK
tinyllama
11,098
48

Mixtral-8x7B-Instruct-v0.1-AWQ

NaNK
license:apache-2.0
10,866
58

Yi-34B-200K-AEZAKMI-v2-AWQ

NaNK
llama
9,632
2

CodeLlama-13B-Instruct-GGUF

NaNK
llama
8,829
130

Llama-2-7B-Chat-GPTQ

NaNK
llama
8,706
266

LLaMA2-13B-Tiefighter-AWQ

NaNK
llama
8,694
39

deepsex-34b-GGUF

NaNK
license:mit
8,509
126

WizardLM-7B-uncensored-GGUF

TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z) Wizardlm 7B Uncensored - GGUF - Model creator: Eric Hartford - Original model: Wizardlm 7B Uncensored This repo contains GGUF format model files for Eric Hartford's Wizardlm 7B Uncensored. GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. Here is an incomplete list of clients and libraries that are known to support GGUF: llama.cpp, the source project for GGUF, offering a CLI and a server option. text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration. KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling. LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration. LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection. Faraday.dev, an attractive and easy to use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration. ctransformers, a Python library with GPU accel, LangChain support, and OpenAI-compatible API server. llama-cpp-python, a Python library with GPU accel, LangChain support, and OpenAI-compatible API server. candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use. AWQ model(s) for GPU inference. GPTQ models for GPU inference, with multiple quantisation parameter options.
2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference. Eric Hartford's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions. These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third party UIs and libraries - please see the list at the top of this README. The new methods available are: GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw). GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw. GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw. GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw. GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw. Refer to the Provided Files table below to see what files use which methods, and how.
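The per-weight costs quoted above can be reproduced from the super-block layouts. A sketch of the arithmetic, under the stated assumptions: 256 weights per super-block, "type-1" formats carry an fp16 scale plus fp16 min per super-block (32 bits) while "type-0" formats carry a single fp16 scale (16 bits). Q2_K's packing has extra subtleties and is omitted here.

```python
# Sketch: effective bits-per-weight for the k-quant layouts described above.
def bpw(bits_per_weight, n_blocks, block_meta_bits, superblock_header_bits):
    weights = 256  # weights per super-block
    total_bits = (weights * bits_per_weight
                  + n_blocks * block_meta_bits
                  + superblock_header_bits)
    return total_bits / weights

print(bpw(3, 16, 6, 16))       # Q3_K: 6-bit scale per block -> 3.4375 bpw
print(bpw(4, 8, 6 + 6, 32))    # Q4_K: 6-bit scale + 6-bit min -> 4.5 bpw
print(bpw(5, 8, 6 + 6, 32))    # Q5_K: same structure as Q4_K -> 5.5 bpw
print(bpw(6, 16, 8, 16))       # Q6_K: 8-bit scale per block -> 6.5625 bpw
```

The results match the bpw figures quoted in the format descriptions above.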
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| WizardLM-7B-uncensored.Q2_K.gguf | Q2_K | 2 | 2.83 GB | 5.33 GB | smallest, significant quality loss - not recommended for most purposes |
| WizardLM-7B-uncensored.Q3_K_S.gguf | Q3_K_S | 3 | 2.95 GB | 5.45 GB | very small, high quality loss |
| WizardLM-7B-uncensored.Q3_K_M.gguf | Q3_K_M | 3 | 3.30 GB | 5.80 GB | very small, high quality loss |
| WizardLM-7B-uncensored.Q3_K_L.gguf | Q3_K_L | 3 | 3.60 GB | 6.10 GB | small, substantial quality loss |
| WizardLM-7B-uncensored.Q4_0.gguf | Q4_0 | 4 | 3.83 GB | 6.33 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| WizardLM-7B-uncensored.Q4_K_S.gguf | Q4_K_S | 4 | 3.86 GB | 6.36 GB | small, greater quality loss |
| WizardLM-7B-uncensored.Q4_K_M.gguf | Q4_K_M | 4 | 4.08 GB | 6.58 GB | medium, balanced quality - recommended |
| WizardLM-7B-uncensored.Q5_0.gguf | Q5_0 | 5 | 4.65 GB | 7.15 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| WizardLM-7B-uncensored.Q5_K_S.gguf | Q5_K_S | 5 | 4.65 GB | 7.15 GB | large, low quality loss - recommended |
| WizardLM-7B-uncensored.Q5_K_M.gguf | Q5_K_M | 5 | 4.78 GB | 7.28 GB | large, very low quality loss - recommended |
| WizardLM-7B-uncensored.Q6_K.gguf | Q6_K | 6 | 5.53 GB | 8.03 GB | very large, extremely low quality loss |
| WizardLM-7B-uncensored.Q8_0.gguf | Q8_0 | 8 | 7.16 GB | 9.66 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead. Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.
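The "Max RAM required" column tracks file size closely: in this table each entry is the file size plus roughly 2.5 GB of overhead. A rough estimator under that assumption (the 2.5 GB figure is inferred from this table, not an official llama.cpp constant):

```python
# Rough sketch: estimated max RAM for a GGUF file with no GPU offloading,
# based on the ~2.5 GB gap between "Size" and "Max RAM required" above.
OVERHEAD_GB = 2.5  # inferred from the table; not an official figure

def est_max_ram_gb(file_size_gb):
    return file_size_gb + OVERHEAD_GB

print(est_max_ram_gb(2.83))  # Q2_K file: ~5.33 GB, matching the table
print(est_max_ram_gb(4.08))  # Q4_K_M file: ~6.58 GB
```

Useful for a quick sanity check before downloading a quantisation that won't fit in memory.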
The following clients/libraries will automatically download models for you, providing a list of available models to choose from: LM Studio, LoLLMS Web UI, and Faraday.dev. Under Download Model, you can enter the model repo, TheBloke/WizardLM-7B-uncensored-GGUF, and below it a specific filename to download, such as: WizardLM-7B-uncensored.Q4_K_M.gguf. On the command line, including downloading multiple files at once, I recommend using the `huggingface-hub` Python library. You can then download any individual model file to the current directory, at high speed, with a single `huggingface-cli download` command, or download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows Command Line users: run `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command. Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU; remove it if you don't have GPU acceleration. Change `-c 2048` to the desired sequence length. For extended sequence models - e.g. 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation. Further instructions are in text-generation-webui/docs/llama.cpp.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries.
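Choosing a value for `-ngl` comes down to how many layers fit in VRAM. A hypothetical back-of-the-envelope helper; the layer count and per-layer size below are illustrative assumptions, not measured values:

```python
# Hypothetical sketch: pick a layer count for llama.cpp's -ngl flag by
# dividing available VRAM by an assumed per-layer memory cost.
def layers_to_offload(vram_gb, per_layer_gb, total_layers):
    if per_layer_gb <= 0:
        raise ValueError("per_layer_gb must be positive")
    return min(total_layers, int(vram_gb // per_layer_gb))

# Illustrative numbers only: a 7B LLaMA has 32 transformer layers;
# assume ~0.13 GB per quantised layer at Q4_K_M.
print(layers_to_offload(8.0, 0.13, 32))  # plenty of VRAM: offload all 32
print(layers_to_offload(2.0, 0.13, 32))  # limited VRAM: offload 15
```

In practice you would tune `-ngl` empirically, watching VRAM usage, since the real per-layer cost depends on the model and quantisation.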
How to load this model in Python code, using ctransformers: run one of the following commands, according to your system. Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python, LangChain + ctransformers.
Original model card: Eric Hartford's Wizardlm 7B Uncensored This is WizardLM trained with a subset of the dataset - responses that contained alignment / moralizing were removed. The intent is to train a WizardLM that doesn't have alignment built-in, so that alignment (of any sort) can be added separately, for example with an RLHF LoRA. Shout out to the open source AI/ML community, and everyone who helped me out. You are responsible for anything you do with the model, just as you are responsible for anything you do with any dangerous object such as a knife, gun, lighter, or car. Publishing anything this model generates is the same as publishing it yourself. You are responsible for the content you publish, and you cannot blame the model any more than you can blame the knife, gun, lighter, or car for what you do with it.

NaNK
llama
8,238
42

SOLAR-10.7B-Instruct-v1.0-uncensored-GGUF

NaNK
license:apache-2.0
8,050
101

Luna-AI-Llama2-Uncensored-GGUF

llama
7,672
58

WizardLM-1.0-Uncensored-Llama2-13B-GGUF

NaNK
llama
6,417
66

dolphin-2.6-mistral-7B-GGUF

NaNK
license:apache-2.0
6,265
88

OpenHermes-2.5-Mistral-7B-AWQ

NaNK
license:apache-2.0
6,257
21

CodeLlama-7B-GGUF

NaNK
llama
6,058
125

dolphin-2.2.1-mistral-7B-GGUF

NaNK
license:apache-2.0
5,888
121

Mixtral-8x7B-v0.1-GGUF

NaNK
license:apache-2.0
5,841
431

Mistral-7B-OpenOrca-GGUF

NaNK
license:apache-2.0
5,772
242

Llama-2-7B-fp16

NaNK
llama
5,659
44

deepseek-coder-33B-instruct-GPTQ

NaNK
llama
5,351
25

Synthia-v3.0-11B-AWQ

NaNK
llama
4,999
2

Phind-CodeLlama-34B-v2-GGUF

NaNK
llama
4,741
169

CodeLlama-34B-Instruct-GGUF

NaNK
llama
4,713
106

WizardCoder-Python-34B-V1.0-GGUF

NaNK
llama
4,700
88

dolphin-2.1-mistral-7B-GGUF

NaNK
license:apache-2.0
4,670
105

deepseek-llm-7B-chat-GGUF

NaNK
4,664
33

Open_Gpt4_8x7B-GGUF

NaNK
license:apache-2.0
4,418
23

rocket-3B-GGUF

NaNK
license:cc-by-sa-4.0
4,314
38

dolphin-2.7-mixtral-8x7b-AWQ

NaNK
license:apache-2.0
4,266
23

Nous-Hermes-2-SOLAR-10.7B-GGUF

NaNK
license:apache-2.0
4,041
113

claude2-alpaca-13B-GGUF

NaNK
llama
3,996
36

CapybaraHermes-2.5-Mistral-7B-GGUF

NaNK
license:apache-2.0
3,796
124

vicuna-7B-v1.5-GGUF

NaNK
llama
3,499
16

Mistral-7B-Claude-Chat-GGUF

NaNK
license:cc-by-nc-4.0
3,421
30

CodeLlama-13B-GGUF

NaNK
llama
3,308
61

Llama-2-7B-Chat-AWQ

TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z) Llama 2 7B Chat - AWQ - Model creator: Meta Llama 2 - Original model: Llama 2 7B Chat This repo contains AWQ model files for Meta Llama 2's Llama 2 7B Chat. AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference. It is also now supported by continuous batching server vLLM, allowing use of AWQ models for high-throughput concurrent inference in multi-user server scenarios. Note that, at the time of writing, overall throughput is still lower than running vLLM with unquantised models, however using AWQ enables using much smaller GPUs which can lead to easier deployment and overall cost savings. For example, a 70B model can be run on 1 x 48GB GPU instead of 2 x 80GB. AWQ model(s) for GPU inference. GPTQ models for GPU inference, with multiple quantisation parameter options. 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference Meta Llama 2's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions For my first release of AWQ models, I am releasing 128g models only. I will consider adding 32g as well if there is interest, and once I have done perplexity and evaluation comparisons, but at this time 32g models are still not fully tested with AutoAWQ and vLLM. | Branch | Bits | GS | AWQ Dataset | Seq Len | Size | | ------ | ---- | -- | ----------- | ------- | ---- | | main | 4 | 128 | wikitext | 4096 | 3.89 GB Documentation on installing and using vLLM can be found here. - When using vLLM as a server, pass the `--quantization awq` parameter, for example: When using vLLM from Python code, pass the `quantization=awq` parameter, for example: If you have problems installing AutoAWQ using the pre-built wheels, install it from source instead: The files provided are tested to work with AutoAWQ, and vLLM. 
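The GPU-sizing claim above (a 70B model on 1 x 48GB instead of 2 x 80GB) follows from simple weight arithmetic. A sketch, deliberately ignoring activations, KV cache, and quantisation metadata, so real memory needs are somewhat higher:

```python
# Sketch: approximate weight memory for a 70B-parameter model,
# fp16 vs 4-bit AWQ. Overheads (KV cache, activations) are ignored.
def weight_gb(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 1e9

params_70b = 70e9
print(weight_gb(params_70b, 16))  # fp16: 140 GB -> needs 2 x 80GB GPUs
print(weight_gb(params_70b, 4))   # 4-bit AWQ: 35 GB -> fits a 48GB GPU
```

The same arithmetic explains why AWQ enables much smaller GPUs even though per-request throughput may be lower than unquantised vLLM.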
Huggingface Text Generation Inference (TGI) is not yet compatible with AWQ, but a PR is open which should bring support soon: TGI PR #781.
Original model card: Meta Llama 2's Llama 2 7B Chat Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository for the 7B fine-tuned model, optimized for dialogue use cases and converted for the Hugging Face Transformers format. Links to other models can be found in the index at the bottom. Model Details Note: Use of this model is governed by the Meta license. In order to download the model weights and tokenizer, please visit the website and accept our License before requesting access here. Meta developed and publicly released the Llama 2 family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama-2-Chat, are optimized for dialogue use cases. Llama-2-Chat models outperform open-source chat models on most benchmarks we tested, and in our human evaluations for helpfulness and safety, are on par with some popular closed-source models like ChatGPT and PaLM. Variations Llama 2 comes in a range of parameter sizes — 7B, 13B, and 70B — as well as pretrained and fine-tuned variations.
Model Architecture Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety.

||Training Data|Params|Content Length|GQA|Tokens|LR|
|---|---|---|---|---|---|---|
|Llama 2|A new mix of publicly available online data|7B|4k|✗|2.0T|3.0 x 10⁻⁴|
|Llama 2|A new mix of publicly available online data|13B|4k|✗|2.0T|3.0 x 10⁻⁴|
|Llama 2|A new mix of publicly available online data|70B|4k|✔|2.0T|1.5 x 10⁻⁴|

Llama 2 family of models. Token counts refer to pretraining data only. All models are trained with a global batch-size of 4M tokens. Bigger models - 70B - use Grouped-Query Attention (GQA) for improved inference scalability. Model Dates Llama 2 was trained between January 2023 and July 2023. Status This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback. License A custom commercial license is available at: https://ai.meta.com/resources/models-and-libraries/llama-downloads/ Research Paper "Llama-2: Open Foundation and Fine-tuned Chat Models" Intended Use Intended Use Cases Llama 2 is intended for commercial and research use in English. Tuned models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks. To get the expected features and performance for the chat versions, a specific formatting needs to be followed, including the `[INST]` and `<<SYS>>` tags, `BOS` and `EOS` tokens, and the whitespaces and breaklines in between (we recommend calling `strip()` on inputs to avoid double-spaces). See our reference code in github for details: `chat_completion`. Out-of-scope Uses Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
Use in any other way that is prohibited by the Acceptable Use Policy and Licensing Agreement for Llama 2. Hardware and Software Training Factors We used custom training libraries, Meta's Research Super Cluster, and production clusters for pretraining. Fine-tuning, annotation, and evaluation were also performed on third-party cloud compute. Carbon Footprint Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB (TDP of 350-400W). Estimated total emissions were 539 tCO₂eq, 100% of which were offset by Meta's sustainability program.

||Time (GPU hours)|Power Consumption (W)|Carbon Emitted (tCO₂eq)|
|---|---|---|---|
|Llama 2 7B|184320|400|31.22|
|Llama 2 13B|368640|400|62.44|
|Llama 2 70B|1720320|400|291.42|
|Total|3311616||539.00|

CO₂ emissions during pretraining. Time: total GPU time required for training each model. Power Consumption: peak power capacity per GPU device for the GPUs used, adjusted for power usage efficiency. 100% of the emissions are directly offset by Meta's sustainability program, and because we are openly releasing these models, the pretraining costs do not need to be incurred by others. Training Data Overview Llama 2 was pretrained on 2 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over one million new human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data. Data Freshness The pretraining data has a cutoff of September 2022, but some tuning data is more recent, up to July 2023. In this section, we report the results for the Llama 1 and Llama 2 models on standard academic benchmarks. For all the evaluations, we use our internal evaluations library.
|Model|Size|Code|Commonsense Reasoning|World Knowledge|Reading Comprehension|Math|MMLU|BBH|AGI Eval|
|---|---|---|---|---|---|---|---|---|---|
|Llama 1|7B|14.1|60.8|46.2|58.5|6.95|35.1|30.3|23.9|
|Llama 1|13B|18.9|66.1|52.6|62.3|10.9|46.9|37.0|33.9|
|Llama 1|33B|26.0|70.0|58.4|67.6|21.4|57.8|39.8|41.7|
|Llama 1|65B|30.7|70.7|60.5|68.6|30.8|63.4|43.5|47.6|
|Llama 2|7B|16.8|63.9|48.9|61.3|14.6|45.3|32.6|29.3|
|Llama 2|13B|24.5|66.9|55.4|65.8|28.7|54.8|39.4|39.1|
|Llama 2|70B|37.5|71.9|63.6|69.4|35.2|68.9|51.2|54.2|

Overall performance on grouped academic benchmarks. Code: We report the average pass@1 scores of our models on HumanEval and MBPP. Commonsense Reasoning: We report the average of PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA, and CommonsenseQA. We report 7-shot results for CommonSenseQA and 0-shot results for all other benchmarks. World Knowledge: We evaluate the 5-shot performance on NaturalQuestions and TriviaQA and report the average. Reading Comprehension: For reading comprehension, we report the 0-shot average on SQuAD, QuAC, and BoolQ. MATH: We report the average of the GSM8K (8 shot) and MATH (4 shot) benchmarks at top 1.

|||TruthfulQA|Toxigen|
|---|---|---|---|
|Llama 1|7B|27.42|23.00|
|Llama 1|13B|41.74|23.08|
|Llama 1|33B|44.19|22.57|
|Llama 1|65B|48.71|21.77|
|Llama 2|7B|33.29|21.25|
|Llama 2|13B|41.86|26.10|
|Llama 2|70B|50.18|24.60|

Evaluation of pretrained LLMs on automatic safety benchmarks. For TruthfulQA, we present the percentage of generations that are both truthful and informative (the higher the better). For ToxiGen, we present the percentage of toxic generations (the smaller the better).

|||TruthfulQA|Toxigen|
|---|---|---|---|
|Llama-2-Chat|7B|57.04|0.00|
|Llama-2-Chat|13B|62.18|0.00|
|Llama-2-Chat|70B|64.14|0.01|

Evaluation of fine-tuned LLMs on different safety datasets. Same metric definitions as above. Ethical Considerations and Limitations Llama 2 is a new technology that carries risks with use.
Testing conducted to date has been in English, and has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Llama 2's potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Llama 2, developers should perform safety testing and tuning tailored to their specific applications of the model. Please see the Responsible Use Guide available at https://ai.meta.com/llama/responsible-use-guide/ Reporting Issues Please report any software "bug," or other problems with the models through one of the following means: - Reporting issues with the model: github.com/facebookresearch/llama - Reporting problematic content generated by the model: developers.facebook.com/llama_output_feedback - Reporting bugs and security concerns: facebook.com/whitehat/info

Llama Model Index

|Model|Llama2|Llama2-hf|Llama2-chat|Llama2-chat-hf|
|---|---|---|---|---|
|7B|Link|Link|Link|Link|
|13B|Link|Link|Link|Link|
|70B|Link|Link|Link|Link|
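The `[INST]`/`<<SYS>>` prompt formatting mentioned under Intended Use can be sketched as a small string builder. This is a simplified single-turn version; Meta's reference `chat_completion` code also handles multi-turn history and adds BOS/EOS at the token level, which is omitted here:

```python
# Simplified sketch of the single-turn Llama-2-Chat prompt format.
# The reference chat_completion also handles multi-turn dialogue and
# token-level BOS/EOS, omitted here for clarity.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def build_prompt(system_msg, user_msg):
    # strip() on the user input avoids the double-space issue the card warns about
    return f"{B_INST} {B_SYS}{system_msg}{E_SYS}{user_msg.strip()} {E_INST}"

prompt = build_prompt("You are a helpful assistant.", "What is GQA?")
print(prompt)
```

Getting this formatting exactly right, including the whitespace, matters for output quality with the chat-tuned variants.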

NaNK
llama
3,281
24

WizardLM-Uncensored-SuperCOT-StoryTelling-30B-GGUF

TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z) WizardLM Uncensored SuperCOT Storytelling 30B - GGUF - Model creator: YellowRoseCx - Original model: WizardLM Uncensored SuperCOT Storytelling 30B This repo contains GGUF format model files for Monero's WizardLM-Uncensored-SuperCOT-Storytelling-30B. GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. Here is an incomplete list of clients and libraries that are known to support GGUF: llama.cpp, the source project for GGUF, offering a CLI and a server option. text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration. KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling. LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration. LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection. Faraday.dev, an attractive and easy to use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration. ctransformers, a Python library with GPU accel, LangChain support, and OpenAI-compatible API server. llama-cpp-python, a Python library with GPU accel, LangChain support, and OpenAI-compatible API server. candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use. AWQ model(s) for GPU inference. GPTQ models for GPU inference, with multiple quantisation parameter options.
2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference. YellowRoseCx's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions. These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third party UIs and libraries - please see the list at the top of this README. The new methods available are: GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw). GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw. GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw. GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw. GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw. Refer to the Provided Files table below to see what files use which methods, and how.
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q2_K.gguf | Q2_K | 2 | 13.50 GB | 16.00 GB | smallest, significant quality loss - not recommended for most purposes |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q3_K_S.gguf | Q3_K_S | 3 | 14.06 GB | 16.56 GB | very small, high quality loss |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q3_K_M.gguf | Q3_K_M | 3 | 15.76 GB | 18.26 GB | very small, high quality loss |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q3_K_L.gguf | Q3_K_L | 3 | 17.28 GB | 19.78 GB | small, substantial quality loss |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q4_0.gguf | Q4_0 | 4 | 18.36 GB | 20.86 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q4_K_S.gguf | Q4_K_S | 4 | 18.44 GB | 20.94 GB | small, greater quality loss |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q4_K_M.gguf | Q4_K_M | 4 | 19.62 GB | 22.12 GB | medium, balanced quality - recommended |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q5_0.gguf | Q5_0 | 5 | 22.40 GB | 24.90 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q5_K_S.gguf | Q5_K_S | 5 | 22.40 GB | 24.90 GB | large, low quality loss - recommended |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q5_K_M.gguf | Q5_K_M | 5 | 23.05 GB | 25.55 GB | large, very low quality loss - recommended |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q6_K.gguf | Q6_K | 6 | 26.69 GB | 29.19 GB | very large, extremely low quality loss |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q8_0.gguf | Q8_0 | 8 | 34.57 GB | 37.07 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead. Note for manual downloaders: You almost never want to clone the entire repo!
Multiple different quantisation formats are provided, and most users only want to pick and download a single file. The following clients/libraries will automatically download models for you, providing a list of available models to choose from: LM Studio, LoLLMS Web UI, and Faraday.dev. Under Download Model, you can enter the model repo, TheBloke/WizardLM-Uncensored-SuperCOT-StoryTelling-30B-GGUF, and below it a specific filename to download, such as: WizardLM-Uncensored-SuperCOT-Storytelling.Q4_K_M.gguf. On the command line, including downloading multiple files at once, I recommend using the `huggingface-hub` Python library. You can then download any individual model file to the current directory, at high speed, with a single `huggingface-cli download` command, or download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows Command Line users: run `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command. Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU; remove it if you don't have GPU acceleration. Change `-c 2048` to the desired sequence length. For extended sequence models - e.g. 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation. Further instructions are in text-generation-webui/docs/llama.cpp.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries.
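A file size in the Provided Files table above can be roughly predicted from parameter count and average bits per weight. A back-of-the-envelope sketch; the parameter count below is an illustrative assumption for a 30B-class LLaMA, and real GGUF files deviate somewhat because llama.cpp mixes quant types across tensors:

```python
# Rough sketch: GGUF file size from parameter count and average bits/weight.
# Real files differ because different tensors use different quant types.
def approx_size_gb(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 1e9

# Assumed ~32.5e9 parameters for a 30B-class LLaMA (illustrative figure)
print(round(approx_size_gb(32.5e9, 4.5), 2))     # Q4_K average: ~18.3 GB
print(round(approx_size_gb(32.5e9, 6.5625), 2))  # Q6_K: ~26.7 GB, near the table's 26.69 GB
```

This kind of estimate is handy for judging whether a given quantisation of a larger model will fit on your disk or in memory before downloading.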
How to load this model in Python code, using ctransformers Run one of the following commands, according to your system: Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python LangChain + ctransformers For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Alicia Loh, Stephen Murray, K, Ajan Kanaga, RoA, Magnesian, Deo Leter, Olakabola, Eugene Pentland, zynix, Deep Realms, Raymond Fosdick, Elijah Stavena, Iucharbius, Erik Bjäreholt, Luis Javier Navarrete Lozano, Nicholas, theTransient, John Detwiler, alfiei, knownsqashed, Mano Prime, Willem Michiel, Enrico Ros, LangChain4j, OG, Michael Dempsey, Pierre Kircher, Pedro Madruga, James Bentley, Thomas Belote, Luke @flexchar, Leonard Tan, Johann-Peter Hartmann, Illia Dulskyi, Fen Risland, Chadd, SX, Jeff Scroggin, Ken Nordquist, Sean Connelly, Artur Olbinski, Swaroop Kallakuri, Jack West, Ai Maven, David Ziegler, Russ Johnson, transmissions 11, John Villwock, Alps Aficionado, Clay Pascal, Viktor Bowallius, Subspace Studios, Rainer Wilmers, Trenton Dambrowitz, vamX, Michael Levine, 준교 김, Brandon Frisco, Kalila, Trailburnt, Randy H, Talal Aujan, Nathan Dryer, Vadim, 阿明, ReadyPlayerEmma, Tiffany J. 
Kim, George Stoitzev, Spencer Kim, Jerry Meng, Gabriel Tamborski, Cory Kujawski, Jeffrey Morgan, Spiking Neurons AB, Edmond Seymore, Alexandros Triantafyllidis, Lone Striker, Cap'n Zoog, Nikolai Manek, danny, ya boyyy, Derek Yates, usrbinkat, Mandus, TL, Nathan LeClaire, subjectnull, Imad Khwaja, webtim, Raven Klaugh, Asp the Wyvern, Gabriel Puliatti, Caitlyn Gatomon, Joseph William Delisle, Jonathan Leane, Luke Pendergrass, SuperWojo, Sebastain Graf, Will Dee, Fred von Graf, Andrey, Dan Guido, Daniel P. Andersen, Nitin Borwankar, Elle, Vitor Caleffi, biorpg, jjj, NimbleBox.ai, Pieter, Matthew Berman, terasurfer, Michael Davis, Alex, Stanislav Ovsiannikov And thank you again to a16z for their generous grant. Original model card: Monero's WizardLM-Uncensored-SuperCOT-Storytelling-30B This model is a triple model merge of WizardLM Uncensored+CoT+Storytelling, resulting in a comprehensive boost in reasoning and story writing capabilities. You've become a compendium of knowledge on a vast array of topics. Lore Mastery is an arcane tradition fixated on understanding the underlying mechanics of magic. It is the most academic of all arcane traditions. The promise of uncovering new knowledge or proving (or discrediting) a theory of magic is usually required to rouse its practitioners from their laboratories, academies, and archives to pursue a life of adventure. Known as savants, followers of this tradition are a bookish lot who see beauty and mystery in the application of magic. The results of a spell are less interesting to them than the process that creates it. Some savants take a haughty attitude toward those who follow a tradition focused on a single school of magic, seeing them as provincial and lacking the sophistication needed to master true magic. Other savants are generous teachers, countering ignorance and deception with deep knowledge and good humor.

NaNK
llama
3,231
32

Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF

NaNK
license:apache-2.0
3,227
66

KafkaLM-70B-German-V0.1-GGUF

NaNK
llama
3,030
57

WhiteRabbitNeo-13B-AWQ

NaNK
llama
2,961
4

Pygmalion-2-13B-GGUF

NaNK
llama
2,845
32

openchat_3.5-GGUF

license:apache-2.0
2,838
128

Llama-2-13B-GPTQ

NaNK
llama
2,764
120

WizardLM-13B-Uncensored-GGUF

TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z) Wizardlm 13B Uncensored - GGUF - Model creator: Eric Hartford - Original model: Wizardlm 13B Uncensored This repo contains GGUF format model files for Eric Hartford's Wizardlm 13B Uncensored. GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. Here is an incomplete list of clients and libraries that are known to support GGUF: llama.cpp. The source project for GGUF. Offers a CLI and a server option. text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration. KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling. LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration. LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection. Faraday.dev, an attractive and easy-to-use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration. ctransformers, a Python library with GPU accel, LangChain support, and OpenAI-compatible AI server. llama-cpp-python, a Python library with GPU accel, LangChain support, and OpenAI-compatible API server. candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use. AWQ model(s) for GPU inference. 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference. Eric Hartford's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions. These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third-party UIs and libraries - please see the list at the top of this README.
The new methods available are: GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw). GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw. GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw. GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw. GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw. Refer to the Provided Files table below to see what files use which methods, and how.

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| WizardLM-13B-Uncensored.Q2_K.gguf | Q2_K | 2 | 5.43 GB | 7.93 GB | smallest, significant quality loss - not recommended for most purposes |
| WizardLM-13B-Uncensored.Q3_K_S.gguf | Q3_K_S | 3 | 5.66 GB | 8.16 GB | very small, high quality loss |
| WizardLM-13B-Uncensored.Q3_K_M.gguf | Q3_K_M | 3 | 6.34 GB | 8.84 GB | very small, high quality loss |
| WizardLM-13B-Uncensored.Q3_K_L.gguf | Q3_K_L | 3 | 6.93 GB | 9.43 GB | small, substantial quality loss |
| WizardLM-13B-Uncensored.Q4_0.gguf | Q4_0 | 4 | 7.37 GB | 9.87 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| WizardLM-13B-Uncensored.Q4_K_S.gguf | Q4_K_S | 4 | 7.41 GB | 9.91 GB | small, greater quality loss |
| WizardLM-13B-Uncensored.Q4_K_M.gguf | Q4_K_M | 4 | 7.87 GB | 10.37 GB | medium, balanced quality - recommended |
| WizardLM-13B-Uncensored.Q5_0.gguf | Q5_0 | 5 | 8.97 GB | 11.47 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| WizardLM-13B-Uncensored.Q5_K_S.gguf | Q5_K_S | 5 | 8.97 GB | 11.47 GB | large, low quality loss - recommended |
| WizardLM-13B-Uncensored.Q5_K_M.gguf | Q5_K_M | 5 | 9.23 GB | 11.73 GB | large, very low quality loss - recommended |
| WizardLM-13B-Uncensored.Q6_K.gguf | Q6_K | 6 | 10.68 GB | 13.18 GB | very large, extremely low quality loss |
| WizardLM-13B-Uncensored.Q8_0.gguf | Q8_0 | 8 | 13.83 GB | 16.33 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead. Note for manual downloaders: you almost never want to clone the entire repo! Multiple quantisation formats are provided, and most users only want to pick and download a single file. The following clients/libraries will automatically download models for you, providing a list of available models to choose from: - LM Studio - LoLLMS Web UI - Faraday.dev Under Download Model, you can enter the model repo: TheBloke/WizardLM-13B-Uncensored-GGUF and, below it, a specific filename to download, such as: WizardLM-13B-Uncensored.Q4_K_M.gguf. On the command line, including when downloading multiple files at once, I recommend using the `huggingface-hub` Python library. You can then download any individual model file to the current directory at high speed, or download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1 Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows Command Line users: you can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command. Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU.
Remove it if you don't have GPU acceleration. Change `-c 2048` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p ` argument with `-i -ins` For other parameters and how to use them, please refer to the llama.cpp documentation Further instructions here: text-generation-webui/docs/llama.cpp.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries. How to load this model in Python code, using ctransformers Run one of the following commands, according to your system: Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python LangChain + ctransformers For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. 
Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Alicia Loh, Stephen Murray, K, Ajan Kanaga, RoA, Magnesian, Deo Leter, Olakabola, Eugene Pentland, zynix, Deep Realms, Raymond Fosdick, Elijah Stavena, Iucharbius, Erik Bjäreholt, Luis Javier Navarrete Lozano, Nicholas, theTransient, John Detwiler, alfiei, knownsqashed, Mano Prime, Willem Michiel, Enrico Ros, LangChain4j, OG, Michael Dempsey, Pierre Kircher, Pedro Madruga, James Bentley, Thomas Belote, Luke @flexchar, Leonard Tan, Johann-Peter Hartmann, Illia Dulskyi, Fen Risland, Chadd, SX, Jeff Scroggin, Ken Nordquist, Sean Connelly, Artur Olbinski, Swaroop Kallakuri, Jack West, Ai Maven, David Ziegler, Russ Johnson, transmissions 11, John Villwock, Alps Aficionado, Clay Pascal, Viktor Bowallius, Subspace Studios, Rainer Wilmers, Trenton Dambrowitz, vamX, Michael Levine, 준교 김, Brandon Frisco, Kalila, Trailburnt, Randy H, Talal Aujan, Nathan Dryer, Vadim, 阿明, ReadyPlayerEmma, Tiffany J. Kim, George Stoitzev, Spencer Kim, Jerry Meng, Gabriel Tamborski, Cory Kujawski, Jeffrey Morgan, Spiking Neurons AB, Edmond Seymore, Alexandros Triantafyllidis, Lone Striker, Cap'n Zoog, Nikolai Manek, danny, ya boyyy, Derek Yates, usrbinkat, Mandus, TL, Nathan LeClaire, subjectnull, Imad Khwaja, webtim, Raven Klaugh, Asp the Wyvern, Gabriel Puliatti, Caitlyn Gatomon, Joseph William Delisle, Jonathan Leane, Luke Pendergrass, SuperWojo, Sebastain Graf, Will Dee, Fred von Graf, Andrey, Dan Guido, Daniel P. Andersen, Nitin Borwankar, Elle, Vitor Caleffi, biorpg, jjj, NimbleBox.ai, Pieter, Matthew Berman, terasurfer, Michael Davis, Alex, Stanislav Ovsiannikov And thank you again to a16z for their generous grant. Original model card: Eric Hartford's Wizardlm 13B Uncensored This is WizardLM trained with a subset of the dataset - responses that contained alignment / moralizing were removed. 
The intent is to train a WizardLM that doesn't have alignment built-in, so that alignment (of any sort) can be added separately, for example with an RLHF LoRA. Shout out to the open source AI/ML community, and everyone who helped me out. Note: an uncensored model has no guardrails. You are responsible for anything you do with the model, just as you are responsible for anything you do with any dangerous object such as a knife, gun, lighter, or car. Publishing anything this model generates is the same as publishing it yourself. You are responsible for the content you publish, and you cannot blame the model any more than you can blame the knife, gun, lighter, or car for what you do with it.

NaNK
llama
2,733
18

zephyr-7B-beta-GGUF

TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z) Zephyr 7B Beta - GGUF - Model creator: Hugging Face H4 - Original model: Zephyr 7B Beta This repo contains GGUF format model files for Hugging Face H4's Zephyr 7B Beta. These files were quantised using hardware kindly provided by Massed Compute. GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. Here is an incomplete list of clients and libraries that are known to support GGUF: llama.cpp. The source project for GGUF. Offers a CLI and a server option. text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration. KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling. LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration. LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection. Faraday.dev, an attractive and easy-to-use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration. ctransformers, a Python library with GPU accel, LangChain support, and OpenAI-compatible AI server. llama-cpp-python, a Python library with GPU accel, LangChain support, and OpenAI-compatible API server. candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use. AWQ model(s) for GPU inference. GPTQ models for GPU inference, with multiple quantisation parameter options.
2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference. Hugging Face H4's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions. These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third-party UIs and libraries - please see the list at the top of this README. GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw). GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw. GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw. GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw. GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw. Refer to the Provided Files table below to see what files use which methods, and how.
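As a sanity check on the bpw figures quoted above, here is my own arithmetic (not from the card): each figure decomposes into the quantized weight bits, per-block scale/min bits, and fp16 super-block scale(s) amortised over a 256-weight super-block. The exact split between per-block and super-block overhead is my reading of the descriptions; llama.cpp's on-disk structs differ slightly in detail.

```python
def bpw(quant_bits, block_size, scale_bits, min_bits, fp16_super_scales):
    """Effective bits per weight: quantized weights, plus per-block scale/min
    overhead, plus fp16 super-block scale(s) spread over 256 weights."""
    return (quant_bits
            + (scale_bits + min_bits) / block_size
            + fp16_super_scales * 16 / 256)

assert bpw(2, 16, 4, 4, 1) == 2.5625   # Q2_K: 4-bit scales and mins, 16-weight blocks
assert bpw(3, 16, 6, 0, 1) == 3.4375   # Q3_K: 6-bit scales only
assert bpw(4, 32, 6, 6, 2) == 4.5      # Q4_K: 6-bit scales and mins, 32-weight blocks
assert bpw(5, 32, 6, 6, 2) == 5.5      # Q5_K: same structure as Q4_K
assert bpw(6, 16, 8, 0, 1) == 6.5625   # Q6_K: 8-bit scales only
```

Multiplying bpw by the parameter count gives a rough file size, which is why the Q4_K_M file is roughly half the size of Q8_0.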
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| zephyr-7b-beta.Q2_K.gguf | Q2_K | 2 | 3.08 GB | 5.58 GB | smallest, significant quality loss - not recommended for most purposes |
| zephyr-7b-beta.Q3_K_S.gguf | Q3_K_S | 3 | 3.16 GB | 5.66 GB | very small, high quality loss |
| zephyr-7b-beta.Q3_K_M.gguf | Q3_K_M | 3 | 3.52 GB | 6.02 GB | very small, high quality loss |
| zephyr-7b-beta.Q3_K_L.gguf | Q3_K_L | 3 | 3.82 GB | 6.32 GB | small, substantial quality loss |
| zephyr-7b-beta.Q4_0.gguf | Q4_0 | 4 | 4.11 GB | 6.61 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| zephyr-7b-beta.Q4_K_S.gguf | Q4_K_S | 4 | 4.14 GB | 6.64 GB | small, greater quality loss |
| zephyr-7b-beta.Q4_K_M.gguf | Q4_K_M | 4 | 4.37 GB | 6.87 GB | medium, balanced quality - recommended |
| zephyr-7b-beta.Q5_0.gguf | Q5_0 | 5 | 5.00 GB | 7.50 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| zephyr-7b-beta.Q5_K_S.gguf | Q5_K_S | 5 | 5.00 GB | 7.50 GB | large, low quality loss - recommended |
| zephyr-7b-beta.Q5_K_M.gguf | Q5_K_M | 5 | 5.13 GB | 7.63 GB | large, very low quality loss - recommended |
| zephyr-7b-beta.Q6_K.gguf | Q6_K | 6 | 5.94 GB | 8.44 GB | very large, extremely low quality loss |
| zephyr-7b-beta.Q8_0.gguf | Q8_0 | 8 | 7.70 GB | 10.20 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead. Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file. The following clients/libraries will automatically download models for you, providing a list of available models to choose from: Under Download Model, you can enter the model repo: TheBloke/zephyr-7B-beta-GGUF and below it, a specific filename to download, such as: zephyr-7b-beta.Q4_K_M.gguf.
On the command line, including when downloading multiple files at once, I recommend using the `huggingface-hub` Python library. You can then download any individual model file to the current directory at high speed, or download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1 Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows Command Line users: you can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command. Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 2048` to the desired sequence length. For extended-sequence models - e.g. 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation. Further instructions are in text-generation-webui/docs/llama.cpp.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries. How to load this model in Python code, using ctransformers Run one of the following commands, according to your system: Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python LangChain + ctransformers For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine-tuning/training.
If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Pierre Kircher, Stanislav Ovsiannikov, Michael Levine, Eugene Pentland, Andrey, 준교 김, Randy H, Fred von Graf, Artur Olbinski, Caitlyn Gatomon, terasurfer, Jeff Scroggin, James Bentley, Vadim, Gabriel Puliatti, Harry Royden McLaughlin, Sean Connelly, Dan Guido, Edmond Seymore, Alicia Loh, subjectnull, AzureBlack, Manuel Alberto Morcote, Thomas Belote, Lone Striker, Chris Smitley, Vitor Caleffi, Johann-Peter Hartmann, Clay Pascal, biorpg, Brandon Frisco, sidney chen, transmissions 11, Pedro Madruga, jinyuan sun, Ajan Kanaga, Emad Mostaque, Trenton Dambrowitz, Jonathan Leane, Iucharbius, usrbinkat, vamX, George Stoitzev, Luke Pendergrass, theTransient, Olakabola, Swaroop Kallakuri, Cap'n Zoog, Brandon Phillips, Michael Dempsey, Nikolai Manek, danny, Matthew Berman, Gabriel Tamborski, alfiei, Raymond Fosdick, Tom X Nguyen, Raven Klaugh, LangChain4j, Magnesian, Illia Dulskyi, David Ziegler, Mano Prime, Luis Javier Navarrete Lozano, Erik Bjäreholt, 阿明, Nathan Dryer, Alex, Rainer Wilmers, zynix, TL, Joseph William Delisle, John Villwock, Nathan LeClaire, Willem Michiel, Joguhyik, GodLy, OG, Alps Aficionado, Jeffrey Morgan, ReadyPlayerEmma, Tiffany J. 
Kim, Sebastain Graf, Spencer Kim, Michael Davis, webtim, Talal Aujan, knownsqashed, John Detwiler, Imad Khwaja, Deo Leter, Jerry Meng, Elijah Stavena, Rooh Singh, Pieter, SuperWojo, Alexandros Triantafyllidis, Stephen Murray, Ai Maven, ya boyyy, Enrico Ros, Ken Nordquist, Deep Realms, Nicholas, Spiking Neurons AB, Elle, Will Dee, Jack West, RoA, Luke @flexchar, Viktor Bowallius, Derek Yates, Subspace Studios, jjj, Toran Billups, Asp the Wyvern, Fen Risland, Ilya, NimbleBox.ai, Chadd, Nitin Borwankar, Emre, Mandus, Leonard Tan, Kalila, K, Trailburnt, SX, Cory Kujawski And thank you again to a16z for their generous grant. Original model card: Hugging Face H4's Zephyr 7B Beta Zephyr is a series of language models that are trained to act as helpful assistants. Zephyr-7B-β is the second model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO). We found that removing the in-built alignment of these datasets boosted performance on MT Bench and made the model more helpful. However, this means the model is likely to generate problematic text when prompted to do so, and it should only be used for educational and research purposes. You can find more details in the technical report. - Model type: A 7B parameter GPT-like model fine-tuned on a mix of publicly available, synthetic datasets.
- Language(s) (NLP): Primarily English - License: MIT - Finetuned from model: mistralai/Mistral-7B-v0.1 - Repository: https://github.com/huggingface/alignment-handbook - Demo: https://huggingface.co/spaces/HuggingFaceH4/zephyr-chat - Chatbot Arena: Evaluate Zephyr 7B against 10+ LLMs in the LMSYS arena: http://arena.lmsys.org At the time of release, Zephyr-7B-β is the highest ranked 7B chat model on the MT-Bench and AlpacaEval benchmarks:

| Model | Size | Alignment | MT-Bench (score) | AlpacaEval (win rate %) |
|-------|------|-----------|------------------|-------------------------|
| StableLM-Tuned-α | 7B | dSFT | 2.75 | - |
| MPT-Chat | 7B | dSFT | 5.42 | - |
| Xwin-LM v0.1 | 7B | dPPO | 6.19 | 87.83 |
| Mistral-Instruct v0.1 | 7B | - | 6.84 | - |
| Zephyr-7b-α | 7B | dDPO | 6.88 | - |
| Zephyr-7b-β 🪁 | 7B | dDPO | 7.34 | 90.60 |
| Falcon-Instruct | 40B | dSFT | 5.17 | 45.71 |
| Guanaco | 65B | SFT | 6.41 | 71.80 |
| Llama2-Chat | 70B | RLHF | 6.86 | 92.66 |
| Vicuna v1.3 | 33B | dSFT | 7.12 | 88.99 |
| WizardLM v1.0 | 70B | dSFT | 7.71 | - |
| Xwin-LM v0.1 | 70B | dPPO | - | 95.57 |
| GPT-3.5-turbo | - | RLHF | 7.94 | 89.37 |
| Claude 2 | - | RLHF | 8.06 | 91.36 |
| GPT-4 | - | RLHF | 8.99 | 95.28 |

In particular, on several categories of MT-Bench, Zephyr-7B-β has strong performance compared to larger open models like Llama2-Chat-70B. However, on more complex tasks like coding and mathematics, Zephyr-7B-β lags behind proprietary models and more research is needed to close the gap. The model was initially fine-tuned on a filtered and preprocessed version of the `UltraChat` dataset, which contains a diverse range of synthetic dialogues generated by ChatGPT. We then further aligned the model with 🤗 TRL's `DPOTrainer` on the openbmb/UltraFeedback dataset, which contains 64k prompts and model completions that are ranked by GPT-4. As a result, the model can be used for chat and you can check out our demo to test its capabilities.
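The Transformers snippet for running the unquantised Zephyr model was stripped from this page. Below is a hedged sketch of the usual `pipeline()` pattern with Zephyr's chat template; the system/user messages and sampling settings are illustrative, and the wrapper function is my addition (actually calling it downloads the full fp16 weights):

```python
def chat_with_zephyr(user_msg, system_msg="You are a friendly chatbot.", max_new_tokens=256):
    """Run Zephyr-7B-beta via transformers' pipeline(), using its chat template."""
    import torch
    from transformers import pipeline  # lazy import: heavy dependency, large download

    pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta",
                    torch_dtype=torch.bfloat16, device_map="auto")
    messages = [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ]
    # Render the messages into Zephyr's <|system|>/<|user|>/<|assistant|> format.
    prompt = pipe.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True)
    out = pipe(prompt, max_new_tokens=max_new_tokens,
               do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
    return out[0]["generated_text"]
```

For the GGUF quantisations in this repo, use llama.cpp or ctransformers instead; this path is for the original fp16 checkpoint.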
You can find the datasets used for training Zephyr-7B-β here Here's how you can run the model using the `pipeline()` function from 🤗 Transformers. Zephyr-7B-β has not been aligned to human preferences with techniques like RLHF or deployed with in-the-loop filtering of responses like ChatGPT, so the model can produce problematic outputs (especially when prompted to do so). It is also unknown what the size and composition of the corpus used to train the base model (`mistralai/Mistral-7B-v0.1`) were, however it is likely to have included a mix of Web data and technical sources like books and code. See the Falcon 180B model card for an example of this. During DPO training, this model achieves the following results on the evaluation set: - Loss: 0.7496 - Rewards/chosen: -4.5221 - Rewards/rejected: -8.3184 - Rewards/accuracies: 0.7812 - Rewards/margins: 3.7963 - Logps/rejected: -340.1541 - Logps/chosen: -299.4561 - Logits/rejected: -2.3081 - Logits/chosen: -2.3531 The following hyperparameters were used during training: - learning_rate: 5e-07 - train_batch_size: 2 - eval_batch_size: 4 - seed: 42 - distributed_type: multi-GPU - num_devices: 16 - total_train_batch_size: 32 - total_eval_batch_size: 64 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 - lr_scheduler_type: linear - lr_scheduler_warmup_ratio: 0.1 - num_epochs: 3.0 The table below shows the full set of DPO training metrics: | Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen | |:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:| | 0.6284 | 0.05 | 100 | 0.6098 | 0.0425 | -0.1872 | 0.7344 | 0.2297 | -258.8416 | -253.8099 | -2.7976 | -2.8234 | | 0.4908 | 0.1 | 200 | 0.5426 | -0.0279 | -0.6842 | 0.75 | 0.6563 | -263.8124 | -254.5145 | -2.7719 | -2.7960 |
| 0.5264 | 0.15 | 300 | 0.5324 | 0.0414 | -0.9793 | 0.7656 | 1.0207 | -266.7627 | -253.8209 | -2.7892 | -2.8122 | | 0.5536 | 0.21 | 400 | 0.4957 | -0.0185 | -1.5276 | 0.7969 | 1.5091 | -272.2460 | -254.4203 | -2.8542 | -2.8764 | | 0.5362 | 0.26 | 500 | 0.5031 | -0.2630 | -1.5917 | 0.7812 | 1.3287 | -272.8869 | -256.8653 | -2.8702 | -2.8958 | | 0.5966 | 0.31 | 600 | 0.5963 | -0.2993 | -1.6491 | 0.7812 | 1.3499 | -273.4614 | -257.2279 | -2.8778 | -2.8986 | | 0.5014 | 0.36 | 700 | 0.5382 | -0.2859 | -1.4750 | 0.75 | 1.1891 | -271.7204 | -257.0942 | -2.7659 | -2.7869 | | 0.5334 | 0.41 | 800 | 0.5677 | -0.4289 | -1.8968 | 0.7969 | 1.4679 | -275.9378 | -258.5242 | -2.7053 | -2.7265 | | 0.5251 | 0.46 | 900 | 0.5772 | -0.2116 | -1.3107 | 0.7344 | 1.0991 | -270.0768 | -256.3507 | -2.8463 | -2.8662 | | 0.5205 | 0.52 | 1000 | 0.5262 | -0.3792 | -1.8585 | 0.7188 | 1.4793 | -275.5552 | -258.0276 | -2.7893 | -2.7979 | | 0.5094 | 0.57 | 1100 | 0.5433 | -0.6279 | -1.9368 | 0.7969 | 1.3089 | -276.3377 | -260.5136 | -2.7453 | -2.7536 | | 0.5837 | 0.62 | 1200 | 0.5349 | -0.3780 | -1.9584 | 0.7656 | 1.5804 | -276.5542 | -258.0154 | -2.7643 | -2.7756 | | 0.5214 | 0.67 | 1300 | 0.5732 | -1.0055 | -2.2306 | 0.7656 | 1.2251 | -279.2761 | -264.2903 | -2.6986 | -2.7113 | | 0.6914 | 0.72 | 1400 | 0.5137 | -0.6912 | -2.1775 | 0.7969 | 1.4863 | -278.7448 | -261.1467 | -2.7166 | -2.7275 | | 0.4655 | 0.77 | 1500 | 0.5090 | -0.7987 | -2.2930 | 0.7031 | 1.4943 | -279.8999 | -262.2220 | -2.6651 | -2.6838 | | 0.5731 | 0.83 | 1600 | 0.5312 | -0.8253 | -2.3520 | 0.7812 | 1.5268 | -280.4902 | -262.4876 | -2.6543 | -2.6728 | | 0.5233 | 0.88 | 1700 | 0.5206 | -0.4573 | -2.0951 | 0.7812 | 1.6377 | -277.9205 | -258.8084 | -2.6870 | -2.7097 | | 0.5593 | 0.93 | 1800 | 0.5231 | -0.5508 | -2.2000 | 0.7969 | 1.6492 | -278.9703 | -259.7433 | -2.6221 | -2.6519 | | 0.4967 | 0.98 | 1900 | 0.5290 | -0.5340 | -1.9570 | 0.8281 | 1.4230 | -276.5395 | -259.5749 | -2.6564 | -2.6878 | | 0.0921 | 1.03 | 2000 | 0.5368 | 
-1.1376 | -3.1615 | 0.7812 | 2.0239 | -288.5854 | -265.6111 | -2.6040 | -2.6345 | | 0.0733 | 1.08 | 2100 | 0.5453 | -1.1045 | -3.4451 | 0.7656 | 2.3406 | -291.4208 | -265.2799 | -2.6289 | -2.6595 | | 0.0972 | 1.14 | 2200 | 0.5571 | -1.6915 | -3.9823 | 0.8125 | 2.2908 | -296.7934 | -271.1505 | -2.6471 | -2.6709 | | 0.1058 | 1.19 | 2300 | 0.5789 | -1.0621 | -3.8941 | 0.7969 | 2.8319 | -295.9106 | -264.8563 | -2.5527 | -2.5798 | | 0.2423 | 1.24 | 2400 | 0.5455 | -1.1963 | -3.5590 | 0.7812 | 2.3627 | -292.5599 | -266.1981 | -2.5414 | -2.5784 | | 0.1177 | 1.29 | 2500 | 0.5889 | -1.8141 | -4.3942 | 0.7969 | 2.5801 | -300.9120 | -272.3761 | -2.4802 | -2.5189 | | 0.1213 | 1.34 | 2600 | 0.5683 | -1.4608 | -3.8420 | 0.8125 | 2.3812 | -295.3901 | -268.8436 | -2.4774 | -2.5207 | | 0.0889 | 1.39 | 2700 | 0.5890 | -1.6007 | -3.7337 | 0.7812 | 2.1330 | -294.3068 | -270.2423 | -2.4123 | -2.4522 | | 0.0995 | 1.45 | 2800 | 0.6073 | -1.5519 | -3.8362 | 0.8281 | 2.2843 | -295.3315 | -269.7538 | -2.4685 | -2.5050 | | 0.1145 | 1.5 | 2900 | 0.5790 | -1.7939 | -4.2876 | 0.8438 | 2.4937 | -299.8461 | -272.1744 | -2.4272 | -2.4674 | | 0.0644 | 1.55 | 3000 | 0.5735 | -1.7285 | -4.2051 | 0.8125 | 2.4766 | -299.0209 | -271.5201 | -2.4193 | -2.4574 | | 0.0798 | 1.6 | 3100 | 0.5537 | -1.7226 | -4.2850 | 0.8438 | 2.5624 | -299.8200 | -271.4610 | -2.5367 | -2.5696 | | 0.1013 | 1.65 | 3200 | 0.5575 | -1.5715 | -3.9813 | 0.875 | 2.4098 | -296.7825 | -269.9498 | -2.4926 | -2.5267 | | 0.1254 | 1.7 | 3300 | 0.5905 | -1.6412 | -4.4703 | 0.8594 | 2.8291 | -301.6730 | -270.6473 | -2.5017 | -2.5340 | | 0.085 | 1.76 | 3400 | 0.6133 | -1.9159 | -4.6760 | 0.8438 | 2.7601 | -303.7296 | -273.3941 | -2.4614 | -2.4960 | | 0.065 | 1.81 | 3500 | 0.6074 | -1.8237 | -4.3525 | 0.8594 | 2.5288 | -300.4951 | -272.4724 | -2.4597 | -2.5004 | | 0.0755 | 1.86 | 3600 | 0.5836 | -1.9252 | -4.4005 | 0.8125 | 2.4753 | -300.9748 | -273.4872 | -2.4327 | -2.4716 | | 0.0746 | 1.91 | 3700 | 0.5789 | -1.9280 | -4.4906 | 0.8125 | 
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| … | … | … | … | … | … | … | 2.5626 | -301.8762 | -273.5149 | -2.4686 | -2.5115 |
| 0.1348 | 1.96 | 3800 | 0.6015 | -1.8658 | -4.2428 | 0.8281 | 2.3769 | -299.3976 | -272.8936 | -2.4943 | -2.5393 |
| 0.0217 | 2.01 | 3900 | 0.6122 | -2.3335 | -4.9229 | 0.8281 | 2.5894 | -306.1988 | -277.5699 | -2.4841 | -2.5272 |
| 0.0219 | 2.07 | 4000 | 0.6522 | -2.9890 | -6.0164 | 0.8281 | 3.0274 | -317.1334 | -284.1248 | -2.4105 | -2.4545 |
| 0.0119 | 2.12 | 4100 | 0.6922 | -3.4777 | -6.6749 | 0.7969 | 3.1972 | -323.7187 | -289.0121 | -2.4272 | -2.4699 |
| 0.0153 | 2.17 | 4200 | 0.6993 | -3.2406 | -6.6775 | 0.7969 | 3.4369 | -323.7453 | -286.6413 | -2.4047 | -2.4465 |
| 0.011 | 2.22 | 4300 | 0.7178 | -3.7991 | -7.4397 | 0.7656 | 3.6406 | -331.3667 | -292.2260 | -2.3843 | -2.4290 |
| 0.0072 | 2.27 | 4400 | 0.6840 | -3.3269 | -6.8021 | 0.8125 | 3.4752 | -324.9908 | -287.5042 | -2.4095 | -2.4536 |
| 0.0197 | 2.32 | 4500 | 0.7013 | -3.6890 | -7.3014 | 0.8125 | 3.6124 | -329.9841 | -291.1250 | -2.4118 | -2.4543 |
| 0.0182 | 2.37 | 4600 | 0.7476 | -3.8994 | -7.5366 | 0.8281 | 3.6372 | -332.3356 | -293.2291 | -2.4163 | -2.4565 |
| 0.0125 | 2.43 | 4700 | 0.7199 | -4.0560 | -7.5765 | 0.8438 | 3.5204 | -332.7345 | -294.7952 | -2.3699 | -2.4100 |
| 0.0082 | 2.48 | 4800 | 0.7048 | -3.6613 | -7.1356 | 0.875 | 3.4743 | -328.3255 | -290.8477 | -2.3925 | -2.4303 |
| 0.0118 | 2.53 | 4900 | 0.6976 | -3.7908 | -7.3152 | 0.8125 | 3.5244 | -330.1224 | -292.1431 | -2.3633 | -2.4047 |
| 0.0118 | 2.58 | 5000 | 0.7198 | -3.9049 | -7.5557 | 0.8281 | 3.6508 | -332.5271 | -293.2844 | -2.3764 | -2.4194 |
| 0.006 | 2.63 | 5100 | 0.7506 | -4.2118 | -7.9149 | 0.8125 | 3.7032 | -336.1194 | -296.3530 | -2.3407 | -2.3860 |
| 0.0143 | 2.68 | 5200 | 0.7408 | -4.2433 | -7.9802 | 0.8125 | 3.7369 | -336.7721 | -296.6682 | -2.3509 | -2.3946 |
| 0.0057 | 2.74 | 5300 | 0.7552 | -4.3392 | -8.0831 | 0.7969 | 3.7439 | -337.8013 | -297.6275 | -2.3388 | -2.3842 |
| 0.0138 | 2.79 | 5400 | 0.7404 | -4.2395 | -7.9762 | 0.8125 | 3.7367 | -336.7322 | -296.6304 | -2.3286 | -2.3737 |
| 0.0079 | 2.84 | 5500 | 0.7525 | -4.4466 | -8.2196 | 0.7812 | 3.7731 | -339.1662 | -298.7007 | -2.3200 | -2.3641 |
| 0.0077 | 2.89 | 5600 | 0.7520 | -4.5586 | -8.3485 | 0.7969 | 3.7899 | -340.4545 | -299.8206 | -2.3078 | -2.3517 |
| 0.0094 | 2.94 | 5700 | 0.7527 | -4.5542 | -8.3509 | 0.7812 | 3.7967 | -340.4790 | -299.7773 | -2.3062 | -2.3510 |
| 0.0054 | 2.99 | 5800 | 0.7520 | -4.5169 | -8.3079 | 0.7812 | 3.7911 | -340.0493 | -299.4038 | -2.3081 | -2.3530 |

Framework versions:
- Transformers 4.35.0.dev0
- Pytorch 2.0.1+cu118
- Datasets 2.12.0
- Tokenizers 0.14.0

If you find Zephyr-7B-β useful in your work, please cite it with:
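The reward columns in the DPO evaluation log above are related by a simple identity: the margin at each step is the chosen reward minus the rejected reward (up to rounding in the logged figures). A quick sanity check over a few rows:

```python
# Sanity-check the DPO eval log above:
# Rewards/margins = Rewards/chosen - Rewards/rejected, up to logging precision.
rows = [
    # (chosen, rejected, logged_margin) -- taken from the table
    (-1.8658, -4.2428, 2.3769),
    (-2.3335, -4.9229, 2.5894),
    (-4.5169, -8.3079, 3.7911),
]
for chosen, rejected, logged in rows:
    margin = chosen - rejected
    assert abs(margin - logged) < 1e-3, (margin, logged)
print("all margins consistent")
```

The accuracies column (~0.78-0.87) is the fraction of evaluation pairs where the chosen completion receives the higher reward.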

NaNK
license:mit
2,715
229

CodeLlama-13B-oasst-sft-v10-GGUF

NaNK
llama
2,652
14

deepseek-coder-6.7B-base-GGUF

NaNK
2,638
16

Mixtral-8x7B-MoE-RP-Story-GGUF

NaNK
license:cc-by-nc-4.0
2,623
50

dolphin-2.0-mistral-7B-GGUF

TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z)

Dolphin 2.0 Mistral 7B - GGUF - Model creator: Eric Hartford - Original model: Dolphin 2.0 Mistral 7B

This repo contains GGUF format model files for Eric Hartford's Dolphin 2.0 Mistral 7B. GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. Here is an incomplete list of clients and libraries that are known to support GGUF:

- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy to use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- ctransformers, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.

AWQ model(s) for GPU inference. GPTQ models for GPU inference, with multiple quantisation parameter options.
2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference. Eric Hartford's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions.

These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third party UIs and libraries - please see the list at the top of this README. The new methods available are:

- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw)
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.
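The per-weight costs quoted above follow directly from the super-block layout. As a rough sketch (assuming 256-weight super-blocks with fp16 super-block scale, plus an fp16 min for the "type-1" formats, which is how llama.cpp lays these types out), the arithmetic for Q3_K through Q6_K works out exactly to the figures in the text:

```python
# Bits-per-weight arithmetic for the k-quant super-block layouts described above.
# Each super-block holds 256 weights; block scales (and mins, for "type-1")
# plus the fp16 super-block scale (and min) are the overhead.
def bpw(weight_bits, n_blocks, scale_bits, has_mins):
    n_weights = 256
    payload = n_weights * weight_bits
    overhead = n_blocks * scale_bits * (2 if has_mins else 1)  # block scales (+ mins)
    overhead += 16 * (2 if has_mins else 1)                    # fp16 d (+ dmin)
    return (payload + overhead) / n_weights

assert bpw(3, 16, 6, has_mins=False) == 3.4375   # GGML_TYPE_Q3_K
assert bpw(4, 8, 6, has_mins=True) == 4.5        # GGML_TYPE_Q4_K
assert bpw(5, 8, 6, has_mins=True) == 5.5        # GGML_TYPE_Q5_K
assert bpw(6, 16, 8, has_mins=False) == 6.5625   # GGML_TYPE_Q6_K
```

Q2_K's "effective" 2.5625 bpw is quoted differently in the llama.cpp README and does not fall out of this simple accounting, so it is omitted here.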
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| dolphin-2.0-mistral-7b.Q2_K.gguf | Q2_K | 2 | 3.08 GB | 5.58 GB | smallest, significant quality loss - not recommended for most purposes |
| dolphin-2.0-mistral-7b.Q3_K_S.gguf | Q3_K_S | 3 | 3.16 GB | 5.66 GB | very small, high quality loss |
| dolphin-2.0-mistral-7b.Q3_K_M.gguf | Q3_K_M | 3 | 3.52 GB | 6.02 GB | very small, high quality loss |
| dolphin-2.0-mistral-7b.Q3_K_L.gguf | Q3_K_L | 3 | 3.82 GB | 6.32 GB | small, substantial quality loss |
| dolphin-2.0-mistral-7b.Q4_0.gguf | Q4_0 | 4 | 4.11 GB | 6.61 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| dolphin-2.0-mistral-7b.Q4_K_S.gguf | Q4_K_S | 4 | 4.14 GB | 6.64 GB | small, greater quality loss |
| dolphin-2.0-mistral-7b.Q4_K_M.gguf | Q4_K_M | 4 | 4.37 GB | 6.87 GB | medium, balanced quality - recommended |
| dolphin-2.0-mistral-7b.Q5_0.gguf | Q5_0 | 5 | 5.00 GB | 7.50 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| dolphin-2.0-mistral-7b.Q5_K_S.gguf | Q5_K_S | 5 | 5.00 GB | 7.50 GB | large, low quality loss - recommended |
| dolphin-2.0-mistral-7b.Q5_K_M.gguf | Q5_K_M | 5 | 5.13 GB | 7.63 GB | large, very low quality loss - recommended |
| dolphin-2.0-mistral-7b.Q6_K.gguf | Q6_K | 6 | 5.94 GB | 8.44 GB | very large, extremely low quality loss |
| dolphin-2.0-mistral-7b.Q8_0.gguf | Q8_0 | 8 | 7.70 GB | 10.20 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.
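The "Max RAM required" column is simply the file size plus a fixed allowance (about 2.50 GB in this table) for the KV cache and runtime buffers, assuming no GPU offload. A hypothetical helper reproducing the table's figures (the overhead constant is inferred from the table, not a llama.cpp parameter):

```python
# "Max RAM required" above = file size + ~2.50 GB runtime overhead
# (KV cache, scratch buffers), assuming no GPU offload.
OVERHEAD_GB = 2.50  # assumption inferred from the table rows

def max_ram_gb(file_size_gb):
    return round(file_size_gb + OVERHEAD_GB, 2)

assert max_ram_gb(3.08) == 5.58   # Q2_K row
assert max_ram_gb(4.37) == 6.87   # Q4_K_M row
assert max_ram_gb(7.70) == 10.20  # Q8_0 row
```

Offloading layers to the GPU moves a proportional share of this requirement from RAM to VRAM.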
The following clients/libraries will automatically download models for you, providing a list of available models to choose from: - LM Studio - LoLLMS Web UI - Faraday.dev

Under Download Model, you can enter the model repo: TheBloke/dolphin-2.0-mistral-7B-GGUF and below it, a specific filename to download, such as: dolphin-2.0-mistral-7b.Q4_K_M.gguf.

On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library. You can then download any individual model file to the current directory, at high speed, or download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows Command Line users: You can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.

Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 2048` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p ` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation. Further instructions here: text-generation-webui/docs/llama.cpp.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries.
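Putting the download instructions above into code: a minimal sketch using the `huggingface_hub` Python library to fetch a single quant file rather than cloning the whole repo (the repo and filename are from this model card). The network call is guarded behind a main check:

```python
# Minimal single-file download sketch, per the instructions above:
# fetch one quant file instead of cloning the entire repo.
REPO_ID = "TheBloke/dolphin-2.0-mistral-7B-GGUF"
FILENAME = "dolphin-2.0-mistral-7b.Q4_K_M.gguf"

if __name__ == "__main__":
    # pip install huggingface-hub  (and optionally hf_transfer, with
    # HF_HUB_ENABLE_HF_TRANSFER=1 in the environment, for fast downloads)
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
    print(path)  # local cache path of the downloaded GGUF file
```

The equivalent CLI command is `huggingface-cli download` with the same repo id and filename.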
How to load this model in Python code, using ctransformers Run one of the following commands, according to your system: Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python LangChain + ctransformers For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Pierre Kircher, Stanislav Ovsiannikov, Michael Levine, Eugene Pentland, Andrey, 준교 김, Randy H, Fred von Graf, Artur Olbinski, Caitlyn Gatomon, terasurfer, Jeff Scroggin, James Bentley, Vadim, Gabriel Puliatti, Harry Royden McLaughlin, Sean Connelly, Dan Guido, Edmond Seymore, Alicia Loh, subjectnull, AzureBlack, Manuel Alberto Morcote, Thomas Belote, Lone Striker, Chris Smitley, Vitor Caleffi, Johann-Peter Hartmann, Clay Pascal, biorpg, Brandon Frisco, sidney chen, transmissions 11, Pedro Madruga, jinyuan sun, Ajan Kanaga, Emad Mostaque, Trenton Dambrowitz, Jonathan Leane, Iucharbius, usrbinkat, vamX, George Stoitzev, Luke Pendergrass, theTransient, Olakabola, Swaroop Kallakuri, Cap'n Zoog, Brandon Phillips, Michael Dempsey, Nikolai Manek, danny, Matthew Berman, Gabriel Tamborski, alfiei, Raymond Fosdick, Tom X Nguyen, Raven Klaugh, LangChain4j, Magnesian, Illia Dulskyi, David Ziegler, Mano Prime, Luis Javier Navarrete Lozano, Erik Bjäreholt, 阿明, Nathan Dryer, Alex, Rainer Wilmers, zynix, TL, Joseph William Delisle, 
John Villwock, Nathan LeClaire, Willem Michiel, Joguhyik, GodLy, OG, Alps Aficionado, Jeffrey Morgan, ReadyPlayerEmma, Tiffany J. Kim, Sebastain Graf, Spencer Kim, Michael Davis, webtim, Talal Aujan, knownsqashed, John Detwiler, Imad Khwaja, Deo Leter, Jerry Meng, Elijah Stavena, Rooh Singh, Pieter, SuperWojo, Alexandros Triantafyllidis, Stephen Murray, Ai Maven, ya boyyy, Enrico Ros, Ken Nordquist, Deep Realms, Nicholas, Spiking Neurons AB, Elle, Will Dee, Jack West, RoA, Luke @flexchar, Viktor Bowallius, Derek Yates, Subspace Studios, jjj, Toran Billups, Asp the Wyvern, Fen Risland, Ilya, NimbleBox.ai, Chadd, Nitin Borwankar, Emre, Mandus, Leonard Tan, Kalila, K, Trailburnt, SX, Cory Kujawski And thank you again to a16z for their generous grant. Original model card: Eric Hartford's Dolphin 2.0 Mistral 7B Dolphin-2.0-mistral-7b's training was sponsored by a16z. This model is based on mistralAI, so it is suitable for commercial or non-commercial use. This model is uncensored. I have filtered the dataset to remove alignment and bias. This makes the model more compliant. You are advised to implement your own alignment layer before exposing the model as a service. It will be highly compliant to any requests, even unethical ones. Please read my blog post about uncensored models. https://erichartford.com/uncensored-models You are responsible for any content you create using this model. Enjoy responsibly. This dataset is Dolphin, an open-source implementation of Microsoft's Orca I modified the dataset for uncensoring, deduping, cleaning, and quality. I added Jon Durbin's excellent Airoboros dataset to increase creativity. Training It took 48 hours to train 10 epochs on 4x A100s. Prompt format: This model (and all my future releases) use ChatML prompt format. Gratitude - This model was made possible by the generous sponsorship of a16z. - Thank you to Microsoft for authoring the Orca paper and inspiring this work. 
- Special thanks to WingLian, and TheBloke for helpful advice - Thank you to all the other people in the Open Source AI community who have taught me and helped me along the way.
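As noted above, Dolphin (and all of Eric Hartford's subsequent releases) uses the ChatML prompt format. A small helper showing what a single-turn ChatML prompt looks like (the system message is only an example):

```python
# Build a single-turn ChatML prompt, the format Dolphin models expect.
def chatml_prompt(system, user):
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = chatml_prompt("You are Dolphin, a helpful AI assistant.", "Hello!")
assert prompt.startswith("<|im_start|>system")
assert prompt.endswith("<|im_start|>assistant\n")
```

The trailing `<|im_start|>assistant\n` cues the model to generate the assistant turn.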

NaNK
license:apache-2.0
2,618
65

WhiteRabbitNeo-13B-GGUF

NaNK
llama
2,602
55

CodeLlama-7B-Python-GGUF

NaNK
llama
2,584
58

deepseek-coder-33B-instruct-AWQ

NaNK
llama
2,533
39

CodeLlama-34B-GGUF

NaNK
llama
2,502
56

CausalLM-14B-GGUF

NaNK
llama
2,484
189

Wizard-Vicuna-30B-Uncensored-GGUF

NaNK
llama
2,471
54

guanaco-7B-HF

NaNK
llama
2,452
13

dolphin-2_6-phi-2-GGUF

NaNK
2,428
72

Everyone-Coder-33B-Base-GPTQ

NaNK
llama
2,413
3

dolphin-2.6-mistral-7B-GPTQ

NaNK
license:apache-2.0
2,412
9

Llama-2-70B-Chat-GPTQ

NaNK
llama
2,395
259

CodeLlama-70B-Instruct-GGUF

NaNK
llama
2,346
62

Yarn-Mistral-7B-128k-GGUF

NaNK
license:apache-2.0
2,341
130

llama2_7b_chat_uncensored-GGUF

NaNK
llama
2,330
33

Speechless-Llama2-Hermes-Orca-Platypus-WizardLM-13B-GGUF

NaNK
llama
2,323
49

WizardLM-13B-V1.2-GGUF

NaNK
llama
2,318
21

Mistral-7B-v0.1-AWQ

NaNK
license:apache-2.0
2,306
33

em_german_mistral_v01-GGUF

NaNK
license:apache-2.0
2,284
11

Llama-2-70B-Chat-GGUF

NaNK
llama
2,271
122

Wizard-Vicuna-7B-Uncensored-GGUF

NaNK
llama
2,243
36

CodeLlama-34B-Python-GGUF

NaNK
llama
2,190
38

SynthIA-70B-v1.5-AWQ

NaNK
llama
2,189
2

Psyfighter-13B-GGUF

NaNK
llama
2,186
13

Starling-LM-7B-alpha-GGUF

NaNK
license:cc-by-nc-4.0
2,158
94

Pygmalion-2-7B-GGUF

TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z)

Pygmalion 2 7B - GGUF - Model creator: PygmalionAI - Original model: Pygmalion 2 7B

This repo contains GGUF format model files for PygmalionAI's Pygmalion 2 7B. GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. GGUF offers numerous advantages over GGML, such as better tokenisation, and support for special tokens. It also supports metadata, and is designed to be extensible. Here is an incomplete list of clients and libraries that are known to support GGUF:

- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy to use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- ctransformers, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.

AWQ model(s) for GPU inference. GPTQ models for GPU inference, with multiple quantisation parameter options.
2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference. PygmalionAI's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions.

The model has been trained on prompts using three different roles, which are denoted by the following tokens: `<|system|>`, `<|user|>` and `<|model|>`. The `<|system|>` prompt can be used to inject out-of-channel information behind the scenes, while the `<|user|>` prompt should be used to indicate user input. The `<|model|>` token should then be used to indicate that the model should generate a response. These tokens can happen multiple times and be chained up to form a conversation history. The system prompt has been designed to allow the model to "enter" various modes and dictate the reply length.

These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d36d5be95a0d9088b674dbb27354107221. They are also compatible with many third party UIs and libraries - please see the list at the top of this README. The new methods available are:

- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw)
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| pygmalion-2-7b.Q2_K.gguf | Q2_K | 2 | 2.83 GB | 5.33 GB | smallest, significant quality loss - not recommended for most purposes |
| pygmalion-2-7b.Q3_K_S.gguf | Q3_K_S | 3 | 2.95 GB | 5.45 GB | very small, high quality loss |
| pygmalion-2-7b.Q3_K_M.gguf | Q3_K_M | 3 | 3.30 GB | 5.80 GB | very small, high quality loss |
| pygmalion-2-7b.Q3_K_L.gguf | Q3_K_L | 3 | 3.60 GB | 6.10 GB | small, substantial quality loss |
| pygmalion-2-7b.Q4_0.gguf | Q4_0 | 4 | 3.83 GB | 6.33 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| pygmalion-2-7b.Q4_K_S.gguf | Q4_K_S | 4 | 3.86 GB | 6.36 GB | small, greater quality loss |
| pygmalion-2-7b.Q4_K_M.gguf | Q4_K_M | 4 | 4.08 GB | 6.58 GB | medium, balanced quality - recommended |
| pygmalion-2-7b.Q5_0.gguf | Q5_0 | 5 | 4.65 GB | 7.15 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| pygmalion-2-7b.Q5_K_S.gguf | Q5_K_S | 5 | 4.65 GB | 7.15 GB | large, low quality loss - recommended |
| pygmalion-2-7b.Q5_K_M.gguf | Q5_K_M | 5 | 4.78 GB | 7.28 GB | large, very low quality loss - recommended |
| pygmalion-2-7b.Q6_K.gguf | Q6_K | 6 | 5.53 GB | 8.03 GB | very large, extremely low quality loss |
| pygmalion-2-7b.Q8_0.gguf | Q8_0 | 8 | 7.16 GB | 9.66 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.
The following clients/libraries will automatically download models for you, providing a list of available models to choose from: - LM Studio - LoLLMS Web UI - Faraday.dev

Under Download Model, you can enter the model repo: TheBloke/Pygmalion-2-7B-GGUF and below it, a specific filename to download, such as: pygmalion-2-7b.Q4_K_M.gguf.

On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library. You can then download any individual model file to the current directory, at high speed, or download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows CLI users: Use `set HF_HUB_ENABLE_HF_TRANSFER=1` before running the download command.

Make sure you are using `llama.cpp` from commit d0cee0d36d5be95a0d9088b674dbb27354107221 or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 4096` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p ` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation. Further instructions here: text-generation-webui/docs/llama.cpp.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries.
How to load this model from Python using ctransformers Simple example code to load one of these GGUF models Here's guides on using llama-cpp-python or ctransformers with LangChain: LangChain + llama-cpp-python LangChain + ctransformers For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Alicia Loh, Stephen Murray, K, Ajan Kanaga, RoA, Magnesian, Deo Leter, Olakabola, Eugene Pentland, zynix, Deep Realms, Raymond Fosdick, Elijah Stavena, Iucharbius, Erik Bjäreholt, Luis Javier Navarrete Lozano, Nicholas, theTransient, John Detwiler, alfiei, knownsqashed, Mano Prime, Willem Michiel, Enrico Ros, LangChain4j, OG, Michael Dempsey, Pierre Kircher, Pedro Madruga, James Bentley, Thomas Belote, Luke @flexchar, Leonard Tan, Johann-Peter Hartmann, Illia Dulskyi, Fen Risland, Chadd, SX, Jeff Scroggin, Ken Nordquist, Sean Connelly, Artur Olbinski, Swaroop Kallakuri, Jack West, Ai Maven, David Ziegler, Russ Johnson, transmissions 11, John Villwock, Alps Aficionado, Clay Pascal, Viktor Bowallius, Subspace Studios, Rainer Wilmers, Trenton Dambrowitz, vamX, Michael Levine, 준교 김, Brandon Frisco, Kalila, Trailburnt, Randy H, Talal Aujan, Nathan Dryer, Vadim, 阿明, ReadyPlayerEmma, Tiffany J. 
Kim, George Stoitzev, Spencer Kim, Jerry Meng, Gabriel Tamborski, Cory Kujawski, Jeffrey Morgan, Spiking Neurons AB, Edmond Seymore, Alexandros Triantafyllidis, Lone Striker, Cap'n Zoog, Nikolai Manek, danny, ya boyyy, Derek Yates, usrbinkat, Mandus, TL, Nathan LeClaire, subjectnull, Imad Khwaja, webtim, Raven Klaugh, Asp the Wyvern, Gabriel Puliatti, Caitlyn Gatomon, Joseph William Delisle, Jonathan Leane, Luke Pendergrass, SuperWojo, Sebastain Graf, Will Dee, Fred von Graf, Andrey, Dan Guido, Daniel P. Andersen, Nitin Borwankar, Elle, Vitor Caleffi, biorpg, jjj, NimbleBox.ai, Pieter, Matthew Berman, terasurfer, Michael Davis, Alex, Stanislav Ovsiannikov And thank you again to a16z for their generous grant. Pygmalion-2 7B An instruction-tuned Llama-2 biased towards fiction writing and conversation. The long-awaited release of our new models based on Llama-2 is finally here. Pygmalion-2 7B (formerly known as Metharme) is based on Llama-2 7B released by Meta AI. The Metharme models were an experiment to try and get a model that is usable for conversation, roleplaying and storywriting, but which can be guided using natural language like other instruct models. After much deliberation, we reached the conclusion that the Metharme prompting format is superior (and easier to use) compared to the classic Pygmalion. This model was trained by doing supervised fine-tuning over a mixture of regular instruction data alongside roleplay, fictional stories and conversations with synthetically generated instructions attached. This model is freely available for both commercial and non-commercial use, as per the Llama-2 license. The model has been trained on prompts using three different roles, which are denoted by the following tokens: ` `, ` ` and ` `. The ` ` prompt can be used to inject out-of-channel information behind the scenes, while the ` ` prompt should be used to indicate user input. The ` ` token should then be used to indicate that the model should generate a response. 
These tokens can happen multiple times and be chained up to form a conversation history. The system prompt has been designed to allow the model to "enter" various modes and dictate the reply length. Here's an example: Dataset The dataset used to fine-tune this model includes our own PIPPA, along with several other instruction datasets, and datasets acquired from various RP forums. The intended use-case for this model is fictional writing for entertainment purposes. Any other sort of usage is out of scope. As such, it was not fine-tuned to be safe and harmless: the base model and this fine-tune have been trained on data known to contain profanity and texts that are lewd or otherwise offensive. It may produce socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive. Outputs might often be factually wrong or misleading. Acknowledgements We would like to thank SpicyChat for sponsoring the training for this model.
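The role tokens referenced above (stripped by the page scrape) are `<|system|>`, `<|user|>` and `<|model|>` in the Metharme format. A sketch of how a conversation history chains them, with a trailing `<|model|>` to cue generation:

```python
# Chain Metharme-format role tokens into a prompt, as described above.
# <|system|> injects out-of-channel info, <|user|> marks user input,
# and a trailing <|model|> asks the model to generate a response.
def metharme_prompt(system, turns):
    parts = [f"<|system|>{system}"]
    for user_msg, model_msg in turns:
        parts.append(f"<|user|>{user_msg}")
        parts.append(f"<|model|>{model_msg}")
    parts.append("<|model|>")  # cue the model to respond
    return "".join(parts)

p = metharme_prompt("Enter RP mode.", [("Hi!", "Hello there.")])
assert p.startswith("<|system|>")
assert p.endswith("<|model|>")
```

The system message is where the card's "modes" and reply-length instructions would go.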

NaNK
llama
2,144
30

japanese-stablelm-instruct-gamma-7B-GGUF

TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z)

Japanese StableLM Instruct Gamma 7B - GGUF - Model creator: Stability AI - Original model: Japanese StableLM Instruct Gamma 7B

This repo contains GGUF format model files for Stability AI's Japanese StableLM Instruct Gamma 7B. These files were quantised using hardware kindly provided by Massed Compute. GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. Here is an incomplete list of clients and libraries that are known to support GGUF:

- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy to use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- ctransformers, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.

AWQ model(s) for GPU inference. GPTQ models for GPU inference, with multiple quantisation parameter options.
2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference. Stability AI's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions.

These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third party UIs and libraries - please see the list at the top of this README.

- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw)
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| japanese-stablelm-instruct-gamma-7b.Q2_K.gguf | Q2_K | 2 | 3.08 GB | 5.58 GB | smallest, significant quality loss - not recommended for most purposes |
| japanese-stablelm-instruct-gamma-7b.Q3_K_S.gguf | Q3_K_S | 3 | 3.16 GB | 5.66 GB | very small, high quality loss |
| japanese-stablelm-instruct-gamma-7b.Q3_K_M.gguf | Q3_K_M | 3 | 3.52 GB | 6.02 GB | very small, high quality loss |
| japanese-stablelm-instruct-gamma-7b.Q3_K_L.gguf | Q3_K_L | 3 | 3.82 GB | 6.32 GB | small, substantial quality loss |
| japanese-stablelm-instruct-gamma-7b.Q4_0.gguf | Q4_0 | 4 | 4.11 GB | 6.61 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| japanese-stablelm-instruct-gamma-7b.Q4_K_S.gguf | Q4_K_S | 4 | 4.14 GB | 6.64 GB | small, greater quality loss |
| japanese-stablelm-instruct-gamma-7b.Q4_K_M.gguf | Q4_K_M | 4 | 4.37 GB | 6.87 GB | medium, balanced quality - recommended |
| japanese-stablelm-instruct-gamma-7b.Q5_0.gguf | Q5_0 | 5 | 5.00 GB | 7.50 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| japanese-stablelm-instruct-gamma-7b.Q5_K_S.gguf | Q5_K_S | 5 | 5.00 GB | 7.50 GB | large, low quality loss - recommended |
| japanese-stablelm-instruct-gamma-7b.Q5_K_M.gguf | Q5_K_M | 5 | 5.13 GB | 7.63 GB | large, very low quality loss - recommended |
| japanese-stablelm-instruct-gamma-7b.Q6_K.gguf | Q6_K | 6 | 5.94 GB | 8.44 GB | very large, extremely low quality loss |
| japanese-stablelm-instruct-gamma-7b.Q8_0.gguf | Q8_0 | 8 | 7.70 GB | 10.20 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.
The following clients/libraries will automatically download models for you, providing a list of available models to choose from. Under Download Model, you can enter the model repo: TheBloke/japanese-stablelm-instruct-gamma-7B-GGUF and below it, a specific filename to download, such as: japanese-stablelm-instruct-gamma-7b.Q4_K_M.gguf.

On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library. You can then download any individual model file to the current directory, at high speed, or download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows Command Line users: You can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.

Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 2048` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p ` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation. Further instructions here: text-generation-webui/docs/llama.cpp.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries.
How to load this model in Python code, using ctransformers Run one of the following commands, according to your system: Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python LangChain + ctransformers For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Brandon Frisco, LangChain4j, Spiking Neurons AB, transmissions 11, Joseph William Delisle, Nitin Borwankar, Willem Michiel, Michael Dempsey, vamX, Jeffrey Morgan, zynix, jjj, Omer Bin Jawed, Sean Connelly, jinyuan sun, Jeromy Smith, Shadi, Pawan Osman, Chadd, Elijah Stavena, Illia Dulskyi, Sebastain Graf, Stephen Murray, terasurfer, Edmond Seymore, Celu Ramasamy, Mandus, Alex, biorpg, Ajan Kanaga, Clay Pascal, Raven Klaugh, 阿明, K, ya boyyy, usrbinkat, Alicia Loh, John Villwock, ReadyPlayerEmma, Chris Smitley, Cap'n Zoog, fincy, GodLy, SX, sidney chen, Cory Kujawski, OG, Mano Prime, AzureBlack, Pieter, Kalila, Spencer Kim, Tom X Nguyen, Stanislav Ovsiannikov, Michael Levine, Andrey, Trailburnt, Vadim, Enrico Ros, Talal Aujan, Brandon Phillips, Jack West, Eugene Pentland, Michael Davis, Will Dee, webtim, Jonathan Leane, Alps Aficionado, Rooh Singh, Tiffany J. 
Kim, theTransient, Luke @flexchar, Elle, Caitlyn Gatomon, Ari Malik, subjectnull, Johann-Peter Hartmann, Trenton Dambrowitz, Imad Khwaja, Asp the Wyvern, Emad Mostaque, Rainer Wilmers, Alexandros Triantafyllidis, Nicholas, Pedro Madruga, SuperWojo, Harry Royden McLaughlin, James Bentley, Olakabola, David Ziegler, Ai Maven, Jeff Scroggin, Nikolai Manek, Deo Leter, Matthew Berman, Fen Risland, Ken Nordquist, Manuel Alberto Morcote, Luke Pendergrass, TL, Fred von Graf, Randy H, Dan Guido, NimbleBox.ai, Vitor Caleffi, Gabriel Tamborski, knownsqashed, Lone Striker, Erik Bjäreholt, John Detwiler, Leonard Tan, Iucharbius And thank you again to a16z for their generous grant. Original model card: Stability AI's Japanese StableLM Instruct Gamma 7B This is a 7B-parameter decoder-only Japanese language model fine-tuned on instruction-following datasets, built on top of the base model Japanese Stable LM Base Gamma 7B. If you are in search of a smaller model, please check Japanese StableLM-3B-4E1T Instruct. Developed by: Stability AI Model type: `Japanese Stable LM Instruct Gamma 7B` model is an auto-regressive language model based on the transformer decoder architecture. Language(s): Japanese License: This model is licensed under Apache License, Version 2.0. Contact: For questions and comments about the model, please join Stable Community Japan. For future announcements / information about Stability AI models, research, and events, please follow https://twitter.com/StabilityAIJP. For details, please see Mistral AI's paper and release blog post. - Japanese translation of the Databricks Dolly-15k dataset - Japanese translation of the subset of the Anthropic HH dataset - Wikinews subset of the izumi-lab/llm-japanese-dataset The model is intended to be used by all individuals as a foundational model for application-specific fine-tuning without strict limitations on commercial use. 
The pre-training dataset may have contained offensive or inappropriate content even after applying data cleansing filters, which can be reflected in the model-generated text. We recommend users exercise reasonable caution when using these models in production systems. Do not use the model for any applications that may cause harm or distress to individuals or groups. The fine-tuning was carried out by Fujiki Nakamura. Other aspects, including data preparation and evaluation, were handled by the Language Team of Stability AI Japan, notably Meng Lee, Makoto Shing, Paul McCann, Naoki Orii, and Takuya Akiba. This model is based on Mistral-7B-v0.1 released by the Mistral AI team. We are grateful to the Mistral AI team for providing such an excellent base model. We are grateful for the contributions of the EleutherAI Polyglot-JA team in helping us to collect a large amount of pre-training data in Japanese. Polyglot-JA members include Hyunwoong Ko (Project Lead), Fujiki Nakamura (who originally started this project when he committed to the Polyglot team), Yunho Mo, Minji Jung, KeunSeok Im, and Su-Kyeong Jang. We are also appreciative of AI Novelist/Sta (Bit192, Inc.) and the numerous contributors from Stable Community Japan for assisting us in gathering a large amount of high-quality Japanese textual data for model training.

NaNK
license:apache-2.0
2,117
10

CodeLlama-13B-Instruct-AWQ

NaNK
llama
2,097
9

stablelm-zephyr-3b-GGUF

TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z) StableLM Zephyr 3B - GGUF - Model creator: Stability AI - Original model: StableLM Zephyr 3B This repo contains GGUF format model files for Stability AI's StableLM Zephyr 3B. These files were quantised using hardware kindly provided by Massed Compute. GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. Here is an incomplete list of clients and libraries that are known to support GGUF: llama.cpp. The source project for GGUF. Offers a CLI and a server option. text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration. KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for story telling. GPT4All, a free and open source local running GUI, supporting Windows, Linux and macOS with full GPU accel. LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration. Linux available, in beta as of 27/11/2023. LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection. Faraday.dev, an attractive and easy to use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration. llama-cpp-python, a Python library with GPU accel, LangChain support, and OpenAI-compatible API server. candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use. ctransformers, a Python library with GPU accel, LangChain support, and OpenAI-compatible AI server. Note, as of time of writing (November 27th 2023), ctransformers has not been updated in a long time and does not support many recent models. GPTQ models for GPU inference, with multiple quantisation parameter options. 
2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference Stability AI's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d They are also compatible with many third party UIs and libraries - please see the list at the top of this README.

- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.
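The bits-per-weight figures above follow directly from the block layout. A quick sanity check for the 4.5 bpw 4-bit case, under the stated structure of 8 blocks of 32 weights with 6-bit scales and mins (the fp16 super-block scale/min pair is an assumption from the GGML layout, not stated in the card):

```python
# Q4_K super-block: 8 blocks x 32 weights = 256 weights.
weights = 8 * 32
bits = weights * 4        # 4-bit quantised weights
bits += 8 * 6 + 8 * 6     # 6-bit scale and 6-bit min per block
bits += 16 + 16           # assumed fp16 super-block scale and min
print(bits / weights)     # -> 4.5 bpw, matching the figure above
```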
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| stablelm-zephyr-3b.Q2_K.gguf | Q2_K | 2 | 1.20 GB | 3.70 GB | smallest, significant quality loss - not recommended for most purposes |
| stablelm-zephyr-3b.Q3_K_S.gguf | Q3_K_S | 3 | 1.25 GB | 3.75 GB | very small, high quality loss |
| stablelm-zephyr-3b.Q3_K_M.gguf | Q3_K_M | 3 | 1.39 GB | 3.89 GB | very small, high quality loss |
| stablelm-zephyr-3b.Q3_K_L.gguf | Q3_K_L | 3 | 1.51 GB | 4.01 GB | small, substantial quality loss |
| stablelm-zephyr-3b.Q4_0.gguf | Q4_0 | 4 | 1.61 GB | 4.11 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| stablelm-zephyr-3b.Q4_K_S.gguf | Q4_K_S | 4 | 1.62 GB | 4.12 GB | small, greater quality loss |
| stablelm-zephyr-3b.Q4_K_M.gguf | Q4_K_M | 4 | 1.71 GB | 4.21 GB | medium, balanced quality - recommended |
| stablelm-zephyr-3b.Q5_0.gguf | Q5_0 | 5 | 1.94 GB | 4.44 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| stablelm-zephyr-3b.Q5_K_S.gguf | Q5_K_S | 5 | 1.94 GB | 4.44 GB | large, low quality loss - recommended |
| stablelm-zephyr-3b.Q5_K_M.gguf | Q5_K_M | 5 | 1.99 GB | 4.49 GB | large, very low quality loss - recommended |
| stablelm-zephyr-3b.Q6_K.gguf | Q6_K | 6 | 2.30 GB | 4.80 GB | very large, extremely low quality loss |
| stablelm-zephyr-3b.Q8_0.gguf | Q8_0 | 8 | 2.97 GB | 5.47 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead. Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.
The following clients/libraries will automatically download models for you, providing a list of available models to choose from: Under Download Model, you can enter the model repo: TheBloke/stablelm-zephyr-3b-GGUF and below it, a specific filename to download, such as: stablelm-zephyr-3b.Q4_K_M.gguf. On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library: Then you can download any individual model file to the current directory, at high speed, with a command like this: More advanced huggingface-cli download usage (click to read) You can also download multiple files at once with a pattern: For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer`: And set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`: Windows Command Line users: You can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command. Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 4096` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. Note that longer sequence lengths require much more resources, so you may need to reduce this value. If you want to have a chat-style conversation, replace the `-p` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation. Further instructions can be found in the text-generation-webui documentation, here: text-generation-webui/docs/04 ‐ Model Tab.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries.
Note that at the time of writing (Nov 27th 2023), ctransformers has not been updated for some time and is not compatible with some recent models. Therefore I recommend you use llama-cpp-python. How to load this model in Python code, using llama-cpp-python For full documentation, please see: llama-cpp-python docs. Run one of the following commands, according to your system: Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python LangChain + ctransformers For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. 
Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Michael Levine, 阿明, Trailburnt, Nikolai Manek, John Detwiler, Randy H, Will Dee, Sebastain Graf, NimbleBox.ai, Eugene Pentland, Emad Mostaque, Ai Maven, Jim Angel, Jeff Scroggin, Michael Davis, Manuel Alberto Morcote, Stephen Murray, Robert, Justin Joy, Luke @flexchar, Brandon Frisco, Elijah Stavena, SX, Dan Guido, Undi ., Komninos Chatzipapas, Shadi, theTransient, Lone Striker, Raven Klaugh, jjj, Cap'n Zoog, Michel-Marie MAUDET (LINAGORA), Matthew Berman, David, Fen Risland, Omer Bin Jawed, Luke Pendergrass, Kalila, OG, Erik Bjäreholt, Rooh Singh, Joseph William Delisle, Dan Lewis, TL, John Villwock, AzureBlack, Brad, Pedro Madruga, Caitlyn Gatomon, K, jinyuan sun, Mano Prime, Alex, Jeffrey Morgan, Alicia Loh, Illia Dulskyi, Chadd, transmissions 11, fincy, Rainer Wilmers, ReadyPlayerEmma, knownsqashed, Mandus, biorpg, Deo Leter, Brandon Phillips, SuperWojo, Sean Connelly, Iucharbius, Jack West, Harry Royden McLaughlin, Nicholas, terasurfer, Vitor Caleffi, Duane Dunston, Johann-Peter Hartmann, David Ziegler, Olakabola, Ken Nordquist, Trenton Dambrowitz, Tom X Nguyen, Vadim, Ajan Kanaga, Leonard Tan, Clay Pascal, Alexandros Triantafyllidis, JM33133, Xule, vamX, ya boyyy, subjectnull, Talal Aujan, Alps Aficionado, wassieverse, Ari Malik, James Bentley, Woland, Spencer Kim, Michael Dempsey, Fred von Graf, Elle, zynix, William Richards, Stanislav Ovsiannikov, Edmond Seymore, Jonathan Leane, Martin Kemka, usrbinkat, Enrico Ros And thank you again to a16z for their generous grant. 
Original model card: Stability AI's StableLM Zephyr 3B `StableLM Zephyr 3B` is a 3 billion parameter instruction-tuned model inspired by HuggingFaceH4's Zephyr 7B training pipeline. The model was trained on a mix of publicly available and synthetic datasets using Direct Preference Optimization (DPO); evaluation for this model is based on MT Bench and the Alpaca Benchmark. `StableLM Zephyr 3B` uses the following instruction format: This format is also available through the tokenizer's `apply_chat_template` method: You can also see how to run a performance-optimized version of this model using OpenVINO from Intel here. Developed by: Stability AI Model type: `StableLM Zephyr 3B` is an auto-regressive language model based on the transformer decoder architecture. Language(s): English Library: Alignment Handbook Finetuned from model: stabilityai/stablelm-3b-4e1t License: StabilityAI Non-Commercial Research Community License Contact: For questions and comments about the model, please email `lm@stability.ai` The dataset is comprised of a mixture of open large-scale datasets available on the HuggingFace Hub: 1. SFT Datasets - HuggingFaceH4/ultrachat_200k - meta-math/MetaMathQA - WizardLM/WizardLM_evol_instruct_V2_196k - Open-Orca/SlimOrca 2.
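The instruction-format block referenced above was lost in extraction. Assuming the `<|user|> ... <|endoftext|> <|assistant|>` format shown on the original card, a minimal helper that builds the prompt string by hand, equivalent in spirit to the tokenizer's `apply_chat_template` with `add_generation_prompt=True`:

```python
def zephyr3b_prompt(user_message: str) -> str:
    # StableLM Zephyr 3B chat format (assumed from the original model card).
    return f"<|user|>\n{user_message}<|endoftext|>\n<|assistant|>\n"

print(zephyr3b_prompt("What is Direct Preference Optimization?"))
```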
Preference Datasets: - HuggingFaceH4/ultrafeedback_binarized - Intel/orca_dpo_pairs

| Model | Size | Alignment | MT-Bench (score) | AlpacaEval (win rate %) |
|-------------|-----|----|---------------|--------------|
| StableLM Zephyr 3B 🪁 | 3B | DPO | 6.64 | 76.00 |
| StableLM Zephyr (SFT only) | 3B | SFT | 6.04 | 71.15 |
| Capybara v1.9 | 3B | dSFT | 5.94 | - |
| MPT-Chat | 7B | dSFT | 5.42 | - |
| Xwin-LM v0.1 | 7B | dPPO | 6.19 | 87.83 |
| Mistral-Instruct v0.1 | 7B | - | 6.84 | - |
| Zephyr-7b-α | 7B | dDPO | 6.88 | - |
| Zephyr-7b-β | 7B | dDPO | 7.34 | 90.60 |
| Falcon-Instruct | 40B | dSFT | 5.17 | 45.71 |
| Guanaco | 65B | SFT | 6.41 | 71.80 |
| Llama2-Chat | 70B | RLHF | 6.86 | 92.66 |
| Vicuna v1.3 | 33B | dSFT | 7.12 | 88.99 |
| WizardLM v1.0 | 70B | dSFT | 7.71 | - |
| Xwin-LM v0.1 | 70B | dPPO | - | 95.57 |
| GPT-3.5-turbo | - | RLHF | 7.94 | 89.37 |
| Claude 2 | - | RLHF | 8.06 | 91.36 |
| GPT-4 | - | RLHF | 8.99 | 95.28 |

Other benchmarks:

| Task | Value |
|-----------------------|-------|
| ARC (25-shot) | 47.0 |
| HellaSwag (10-shot) | 74.2 |
| MMLU (5-shot) | 46.3 |
| TruthfulQA (0-shot) | 46.5 |
| Winogrande (5-shot) | 65.5 |
| GSM8K (5-shot) | 42.3 |
| BigBench (Avg) | 35.26 |
| AGI Benchmark (Avg) | 33.23 |

Hardware: `StableLM Zephyr 3B` was trained on the Stability AI cluster across 8 nodes with 8 A100 80GB GPUs per node. Code Base: We used our internal scripts for the SFT steps and the HuggingFace Alignment Handbook scripts for DPO training. Commitment to Ethical AI In line with our responsibility towards ethical AI development, `StableLM Zephyr 3B` is released with a focus on ensuring safety, reliability, and appropriateness in its applications. To this end, we have evaluated `StableLM Zephyr 3B` on 488 malicious prompts and used standard protocols to assess the harmfulness of its outputs. Compared to Zephyr-7b-β, `StableLM Zephyr 3B` reduces the number of harmful outputs as assessed by GPT-4 by 55.
Additionally, we performed an internal red teaming event targeting the following abuse areas: Self-Harm Methods: (Suicide Methods, Encouragement of Self-Harm, Methods and encouragement of Eating Disorders) Misinformation: (Health, Conspiracy Theories, Social Unrest/Conflict, Political Misinformation, & Climate change) Hate Speech: (Race, Stereotypes, Immigrants, Gender, Personally Identifiable Information such as Social security numbers, Full names, ID numbers, Email addresses, and telephone numbers) We have incorporated the findings of our malicious prompts evaluation and red teaming event into our release. Users are encouraged to fine-tune and evaluate the model to suit their specific needs, considering the potential biases and limitations found in `StableLM Zephyr 3B` and inherent in other LLM models. The model is intended to be used as a foundational base model for application-specific fine-tuning. Developers must evaluate and fine-tune the model for safe performance in downstream applications. Limitations and Bias ​ This model is not trained against adversarial inputs. We strongly recommend pairing this model with an input and output classifier to prevent harmful responses. Through our internal red teaming, we discovered that while the model will not output harmful information if not prompted to do so, it is willing to output potentially harmful outputs or misinformation when the user requests it. Using this model will require guardrails around your inputs and outputs to ensure that any outputs returned are not misinformation or harmful. Additionally, as each use case is unique, we recommend running your own suite of tests to ensure proper performance of this model. Finally, do not use the models if they are unsuitable for your application, or for any applications that may cause deliberate or unintentional harm to others.

NaNK
2,031
101

WizardCoder-Python-13B-V1.0-GGUF

NaNK
llama
2,003
59

ReMM-SLERP-L2-13B-GGUF

NaNK
llama
1,929
6

deepseek-llm-67b-chat-GGUF

NaNK
1,922
43

Silicon-Maid-7B-GGUF

NaNK
license:cc-by-4.0
1,910
61

Stheno-L2-13B-GGUF

NaNK
llama
1,906
6

Open_Gpt4_8x7B_v0.2-GGUF

NaNK
license:apache-2.0
1,902
22

Chronos-Hermes-13b-v2-GGUF

NaNK
llama
1,850
23

OpenHermes-2-Mistral-7B-GGUF

NaNK
license:apache-2.0
1,816
77

Llama-2-13B-GGUF

NaNK
llama
1,811
67

koala-13B-HF

NaNK
llama
1,800
41

Mythalion-13B-GGUF

NaNK
llama
1,790
69

CodeLlama-70B-Python-GGUF

NaNK
llama
1,757
44

openchat-3.5-1210-GGUF

license:apache-2.0
1,709
51

Nous-Capybara-34B-GGUF

NaNK
license:mit
1,693
167

meditron-7B-GGUF

NaNK
llama
1,682
23

FusionNet_34Bx2_MoE-GGUF

NaNK
license:mit
1,671
5

llama2_70b_chat_uncensored-GGUF

NaNK
llama
1,670
46

laser-dolphin-mixtral-2x7b-dpo-GGUF

NaNK
license:apache-2.0
1,652
50

ReMM-SLERP-L2-13B-AWQ

NaNK
llama
1,641
2

CodeLlama-34B-Python-fp16

NaNK
llama
1,628
13

CodeLlama-34B-Instruct-fp16

NaNK
llama
1,628
6

CodeLlama-13B-Python-fp16

NaNK
llama
1,626
25

CodeLlama-13B-Instruct-fp16

NaNK
llama
1,624
28

deepseek-coder-1.3b-base-GGUF

NaNK
1,609
9

HornyEchidna-13B-v0.1-GGUF

NaNK
llama
1,605
14

openchat-3.5-0106-GGUF

license:apache-2.0
1,596
76

Xwin-MLewd-13B-v0.2-GGUF

NaNK
llama
1,575
46

dolphin-2.6-mixtral-8x7b-GGUF

NaNK
license:apache-2.0
1,570
47

CodeLlama-13B-Python-GGUF

NaNK
llama
1,543
36

medalpaca-13B-GGUF

NaNK
llama
1,489
6

LLaMA-7b-GGUF

NaNK
llama
1,457
9

Guanaco-7B-Uncensored-GGUF

NaNK
llama
1,425
16

CausalLM-7B-GGUF

NaNK
llama
1,408
62

WizardLM-1.0-Uncensored-CodeLlama-34B-GGUF

NaNK
llama
1,377
25

stable-code-3b-GGUF

NaNK
dataset:bigcode/commitpackft
1,360
30

dolphin-2.6-mistral-7B-dpo-laser-GGUF

NaNK
license:apache-2.0
1,358
41

WizardCoder-Python-7B-V1.0-GGUF

NaNK
llama
1,355
24

Nous-Hermes-2-Yi-34B-GGUF

NaNK
license:apache-2.0
1,350
40

agentlm-7B-GGUF

NaNK
llama
1,345
8

deepseek-coder-33B-base-GGUF

NaNK
1,344
8

Emerhyst-20B-GGUF

NaNK
llama
1,334
37

zephyr-7B-alpha-GGUF

TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z) Zephyr 7B Alpha - GGUF - Model creator: Hugging Face H4 - Original model: Zephyr 7B Alpha This repo contains GGUF format model files for Hugging Face H4's Zephyr 7B Alpha. GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. Here is an incomplete list of clients and libraries that are known to support GGUF: llama.cpp. The source project for GGUF. Offers a CLI and a server option. text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration. KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for story telling. LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration. LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection. Faraday.dev, an attractive and easy to use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration. ctransformers, a Python library with GPU accel, LangChain support, and OpenAI-compatible AI server. llama-cpp-python, a Python library with GPU accel, LangChain support, and OpenAI-compatible API server. candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use. AWQ model(s) for GPU inference. GPTQ models for GPU inference, with multiple quantisation parameter options. 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference Hugging Face H4's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d They are also compatible with many third party UIs and libraries - please see the list at the top of this README.
The new methods available are:

- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| zephyr-7b-alpha.Q2_K.gguf | Q2_K | 2 | 3.08 GB | 5.58 GB | smallest, significant quality loss - not recommended for most purposes |
| zephyr-7b-alpha.Q3_K_S.gguf | Q3_K_S | 3 | 3.16 GB | 5.66 GB | very small, high quality loss |
| zephyr-7b-alpha.Q3_K_M.gguf | Q3_K_M | 3 | 3.52 GB | 6.02 GB | very small, high quality loss |
| zephyr-7b-alpha.Q3_K_L.gguf | Q3_K_L | 3 | 3.82 GB | 6.32 GB | small, substantial quality loss |
| zephyr-7b-alpha.Q4_0.gguf | Q4_0 | 4 | 4.11 GB | 6.61 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| zephyr-7b-alpha.Q4_K_S.gguf | Q4_K_S | 4 | 4.14 GB | 6.64 GB | small, greater quality loss |
| zephyr-7b-alpha.Q4_K_M.gguf | Q4_K_M | 4 | 4.37 GB | 6.87 GB | medium, balanced quality - recommended |
| zephyr-7b-alpha.Q5_0.gguf | Q5_0 | 5 | 5.00 GB | 7.50 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| zephyr-7b-alpha.Q5_K_S.gguf | Q5_K_S | 5 | 5.00 GB | 7.50 GB | large, low quality loss - recommended |
| zephyr-7b-alpha.Q5_K_M.gguf | Q5_K_M | 5 | 5.13 GB | 7.63 GB | large, very low quality loss - recommended |
| zephyr-7b-alpha.Q6_K.gguf | Q6_K | 6 | 5.94 GB | 8.44 GB | very large, extremely low quality loss |
| zephyr-7b-alpha.Q8_0.gguf | Q8_0 | 8 | 7.70 GB | 10.20 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead. Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file. The following clients/libraries will automatically download models for you, providing a list of available models to choose from: - LM Studio - LoLLMS Web UI - Faraday.dev Under Download Model, you can enter the model repo: TheBloke/zephyr-7B-alpha-GGUF and below it, a specific filename to download, such as: zephyr-7b-alpha.Q4_K_M.gguf. On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library: Then you can download any individual model file to the current directory, at high speed, with a command like this: You can also download multiple files at once with a pattern: For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer`: And set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`: Windows Command Line users: You can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command. Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 2048` to the desired sequence length.
For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p ` argument with `-i -ins` For other parameters and how to use them, please refer to the llama.cpp documentation Further instructions here: text-generation-webui/docs/llama.cpp.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries. How to load this model in Python code, using ctransformers Run one of the following commands, according to your system: Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python LangChain + ctransformers For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. 
Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Pierre Kircher, Stanislav Ovsiannikov, Michael Levine, Eugene Pentland, Andrey, 준교 김, Randy H, Fred von Graf, Artur Olbinski, Caitlyn Gatomon, terasurfer, Jeff Scroggin, James Bentley, Vadim, Gabriel Puliatti, Harry Royden McLaughlin, Sean Connelly, Dan Guido, Edmond Seymore, Alicia Loh, subjectnull, AzureBlack, Manuel Alberto Morcote, Thomas Belote, Lone Striker, Chris Smitley, Vitor Caleffi, Johann-Peter Hartmann, Clay Pascal, biorpg, Brandon Frisco, sidney chen, transmissions 11, Pedro Madruga, jinyuan sun, Ajan Kanaga, Emad Mostaque, Trenton Dambrowitz, Jonathan Leane, Iucharbius, usrbinkat, vamX, George Stoitzev, Luke Pendergrass, theTransient, Olakabola, Swaroop Kallakuri, Cap'n Zoog, Brandon Phillips, Michael Dempsey, Nikolai Manek, danny, Matthew Berman, Gabriel Tamborski, alfiei, Raymond Fosdick, Tom X Nguyen, Raven Klaugh, LangChain4j, Magnesian, Illia Dulskyi, David Ziegler, Mano Prime, Luis Javier Navarrete Lozano, Erik Bjäreholt, 阿明, Nathan Dryer, Alex, Rainer Wilmers, zynix, TL, Joseph William Delisle, John Villwock, Nathan LeClaire, Willem Michiel, Joguhyik, GodLy, OG, Alps Aficionado, Jeffrey Morgan, ReadyPlayerEmma, Tiffany J. Kim, Sebastain Graf, Spencer Kim, Michael Davis, webtim, Talal Aujan, knownsqashed, John Detwiler, Imad Khwaja, Deo Leter, Jerry Meng, Elijah Stavena, Rooh Singh, Pieter, SuperWojo, Alexandros Triantafyllidis, Stephen Murray, Ai Maven, ya boyyy, Enrico Ros, Ken Nordquist, Deep Realms, Nicholas, Spiking Neurons AB, Elle, Will Dee, Jack West, RoA, Luke @flexchar, Viktor Bowallius, Derek Yates, Subspace Studios, jjj, Toran Billups, Asp the Wyvern, Fen Risland, Ilya, NimbleBox.ai, Chadd, Nitin Borwankar, Emre, Mandus, Leonard Tan, Kalila, K, Trailburnt, SX, Cory Kujawski And thank you again to a16z for their generous grant. 
Original model card: Hugging Face H4's Zephyr 7B Alpha Zephyr is a series of language models that are trained to act as helpful assistants. Zephyr-7B-α is the first model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO). We found that removing the in-built alignment of these datasets boosted performance on MT Bench and made the model more helpful. However, this means that the model is likely to generate problematic text when prompted to do so and should only be used for educational and research purposes. - Model type: A 7B parameter GPT-like model fine-tuned on a mix of publicly available, synthetic datasets. - Language(s) (NLP): Primarily English - License: MIT - Finetuned from model: mistralai/Mistral-7B-v0.1 - Repository: https://github.com/huggingface/alignment-handbook - Demo: https://huggingface.co/spaces/HuggingFaceH4/zephyr-chat The model was initially fine-tuned on a variant of the `UltraChat` dataset, which contains a diverse range of synthetic dialogues generated by ChatGPT. We then further aligned the model with 🤗 TRL's `DPOTrainer` on the openbmb/UltraFeedback dataset, which contains 64k prompts and model completions that are ranked by GPT-4. As a result, the model can be used for chat and you can check out our demo to test its capabilities. Here's how you can run the model using the `pipeline()` function from 🤗 Transformers: Zephyr-7B-α has not been aligned to human preferences with techniques like RLHF or deployed with in-the-loop filtering of responses like ChatGPT, so the model can produce problematic outputs (especially when prompted to do so). It is also unknown what the size and composition of the corpus used to train the base model (`mistralai/Mistral-7B-v0.1`) were; however, it is likely to have included a mix of Web data and technical sources like books and code.
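The `pipeline()` example the card points to is missing from this copy; a reconstruction close to what the original Zephyr card showed, wrapped in a function so nothing heavy runs on import (sampling parameters are illustrative):

```python
import torch
from transformers import pipeline

def chat(question: str) -> str:
    # device_map="auto" spreads layers across available GPUs/CPU.
    pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-alpha",
                    torch_dtype=torch.bfloat16, device_map="auto")
    messages = [
        {"role": "system",
         "content": "You are a friendly chatbot who always responds in the style of a pirate"},
        {"role": "user", "content": question},
    ]
    prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False,
                                                add_generation_prompt=True)
    out = pipe(prompt, max_new_tokens=256, do_sample=True,
               temperature=0.7, top_k=50, top_p=0.95)
    return out[0]["generated_text"]

# Downloads ~14 GB of weights on first call:
# print(chat("How many helicopters can a human eat in one sitting?"))
```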
See the Falcon 180B model card for an example of this.

Zephyr 7B Alpha achieves the following results on the evaluation set:

- Loss: 0.4605
- Rewards/chosen: -0.5053
- Rewards/rejected: -1.8752
- Rewards/accuracies: 0.7812
- Rewards/margins: 1.3699
- Logps/rejected: -327.4286
- Logps/chosen: -297.1040
- Logits/rejected: -2.7153
- Logits/chosen: -2.7447

The following hyperparameters were used during training:

- learning_rate: 5e-07
- train_batch_size: 2
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 16
- total_train_batch_size: 32
- total_eval_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.5602 | 0.05 | 100 | 0.5589 | -0.3359 | -0.8168 | 0.7188 | 0.4809 | -306.2607 | -293.7161 | -2.6554 | -2.6797 |
| 0.4852 | 0.1 | 200 | 0.5136 | -0.5310 | -1.4994 | 0.8125 | 0.9684 | -319.9124 | -297.6181 | -2.5762 | -2.5957 |
| 0.5212 | 0.15 | 300 | 0.5168 | -0.1686 | -1.1760 | 0.7812 | 1.0074 | -313.4444 | -290.3699 | -2.6865 | -2.7125 |
| 0.5496 | 0.21 | 400 | 0.4835 | -0.1617 | -1.7170 | 0.8281 | 1.5552 | -324.2635 | -290.2326 | -2.7947 | -2.8218 |
| 0.5209 | 0.26 | 500 | 0.5054 | -0.4778 | -1.6604 | 0.7344 | 1.1826 | -323.1325 | -296.5546 | -2.8388 | -2.8667 |
| 0.4617 | 0.31 | 600 | 0.4910 | -0.3738 | -1.5180 | 0.7656 | 1.1442 | -320.2848 | -294.4741 | -2.8234 | -2.8521 |
| 0.4452 | 0.36 | 700 | 0.4838 | -0.4591 | -1.6576 | 0.7031 | 1.1986 | -323.0770 | -296.1796 | -2.7401 | -2.7653 |
| 0.4674 | 0.41 | 800 | 0.5077 | -0.5692 | -1.8659 | 0.7656 | 1.2967 | -327.2416 | -298.3818 | -2.6740 | -2.6945 |
| 0.4656 | 0.46 | 900 | 0.4927 | -0.5279 | -1.6614 | 0.7656 | 1.1335 | -323.1518 | -297.5553 | -2.7817 | -2.8015 |
| 0.4102 | 0.52 | 1000 | 0.4772 | -0.5767 | -2.0667 | 0.7656 | 1.4900 | -331.2578 | -298.5311 | -2.7160 | -2.7455 |
| 0.4663 | 0.57 | 1100 | 0.4740 | -0.8038 | -2.1018 | 0.7656 | 1.2980 | -331.9604 | -303.0741 | -2.6994 | -2.7257 |
| 0.4737 | 0.62 | 1200 | 0.4716 | -0.3783 | -1.7015 | 0.7969 | 1.3232 | -323.9545 | -294.5634 | -2.6842 | -2.7135 |
| 0.4259 | 0.67 | 1300 | 0.4866 | -0.6239 | -1.9703 | 0.7812 | 1.3464 | -329.3312 | -299.4761 | -2.7046 | -2.7356 |
| 0.4935 | 0.72 | 1400 | 0.4747 | -0.5626 | -1.7600 | 0.7812 | 1.1974 | -325.1243 | -298.2491 | -2.7153 | -2.7444 |
| 0.4211 | 0.77 | 1500 | 0.4645 | -0.6099 | -1.9993 | 0.7656 | 1.3894 | -329.9109 | -299.1959 | -2.6944 | -2.7236 |
| 0.4931 | 0.83 | 1600 | 0.4684 | -0.6798 | -2.1082 | 0.7656 | 1.4285 | -332.0890 | -300.5934 | -2.7006 | -2.7305 |
| 0.5029 | 0.88 | 1700 | 0.4595 | -0.5063 | -1.8951 | 0.7812 | 1.3889 | -327.8267 | -297.1233 | -2.7108 | -2.7403 |
| 0.4965 | 0.93 | 1800 | 0.4613 | -0.5561 | -1.9079 | 0.7812 | 1.3518 | -328.0831 | -298.1203 | -2.7226 | -2.7523 |
| 0.4337 | 0.98 | 1900 | 0.4608 | -0.5066 | -1.8718 | 0.7656 | 1.3652 | -327.3599 | -297.1296 | -2.7175 | -2.7469 |

- Transformers 4.34.0
- Pytorch 2.0.1+cu118
- Datasets 2.12.0
- Tokenizers 0.14.0
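The card above mentions running Zephyr via 🤗 Transformers' `pipeline()`, but the snippet was lost in formatting. A minimal sketch follows; the sampling settings and example prompt are illustrative assumptions, not quoted from the card.

```python
# Hedged sketch: chat with Zephyr-7B-α through transformers' pipeline().
def build_messages(system: str, user: str) -> list:
    """Role/content messages consumed by the tokenizer's chat template."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

if __name__ == "__main__":
    # Heavy: downloads ~14 GB of weights on first use.
    import torch
    from transformers import pipeline

    pipe = pipeline(
        "text-generation",
        model="HuggingFaceH4/zephyr-7b-alpha",
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    messages = build_messages(
        "You are a friendly chatbot.", "Explain DPO in one sentence."
    )
    prompt = pipe.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    out = pipe(prompt, max_new_tokens=128, do_sample=True, temperature=0.7)
    print(out[0]["generated_text"])
```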

NaNK
license:mit
1,329
139

medicine-LLM-GGUF

llama
1,312
23

koala-7B-HF

NaNK
llama
1,297
21

Yi-34B-Chat-GGUF

NaNK
1,282
75

Yi-34B-GGUF

NaNK
1,281
77

Ziya-Coding-34B-v1.0-GGUF

NaNK
llama
1,272
11

Llama-2-70B-Chat-AWQ

NaNK
llama
1,267
23

airoboros-mistral2.2-7B-GGUF

NaNK
llama-2
1,258
23

LLaMA-Pro-8B-Instruct-GGUF

NaNK
llama
1,248
23

SOLAR-10.7B-Instruct-v1.0-GGUF

NaNK
license:apache-2.0
1,237
82

storytime-13B-GGUF

NaNK
llama
1,231
19

Dr_Samantha-7B-GGUF

NaNK
llama
1,229
23

Nous-Hermes-Llama2-GGUF

NaNK
llama
1,229
18

Guanaco-13B-Uncensored-GGUF

NaNK
llama
1,227
41

em_german_leo_mistral-GGUF

license:apache-2.0
1,220
40

DaringMaid-13B-GGUF

NaNK
llama
1,177
6

Yarn-Llama-2-7B-128K-GGUF

NaNK
llama
1,174
21

LongAlpaca-70B-GGUF

NaNK
llama
1,160
8

Nous-Hermes-Llama2-70B-GGUF

NaNK
llama
1,155
26

Magicoder-S-DS-6.7B-GGUF

NaNK
1,153
76

Mistral-7B-Instruct-v0.2-code-ft-GGUF

NaNK
license:cc-by-nc-nd-4.0
1,136
30

sqlcoder-7B-GGUF

NaNK
license:cc-by-sa-4.0
1,133
18

EstopianMaid-13B-GGUF

NaNK
llama
1,132
55

MythoMist-7B-GGUF

NaNK
1,121
28

airoboros-l2-13B-gpt4-1.4.1-GGUF

NaNK
llama
1,117
7

OpenHermes-2.5-Mistral-7B-16k-GGUF

NaNK
license:apache-2.0
1,100
58

LLaMA2-13B-Tiefighter-GGUF

NaNK
llama
1,089
25

Llama-2-70B-GPTQ

NaNK
llama
1,085
83

TinyLlama-1.1B-1T-OpenOrca-GGUF

NaNK
llama
1,077
17

Wizard-Vicuna-13B-Uncensored-HF

NaNK
llama
1,065
213

Mistral-7B-Instruct-v0.1-AWQ

NaNK
license:apache-2.0
1,056
38

phi-2-electrical-engineering-GGUF

1,039
18

neural-chat-7B-v3-1-GGUF

NaNK
license:apache-2.0
1,027
61

Nous-Hermes-13B-GGUF

NaNK
llama
1,009
14

Chronomaid-Storytelling-13B-GGUF

NaNK
llama
1,004
27

Mistral-7B-OpenOrca-GPTQ

NaNK
license:apache-2.0
989
99

Nous-Hermes-Llama-2-7B-GGUF

NaNK
llama
987
7

Leo-Mistral-Hessianai-7B-Chat-GGUF

NaNK
license:apache-2.0
982
13

LLaMA-30b-GGUF

NaNK
llama
979
5

DiscoLM_German_7b_v1-GGUF

NaNK
license:apache-2.0
976
31

docsgpt-7B-mistral-GGUF

NaNK
license:apache-2.0
969
9

saiga_mistral_7b-GGUF

NaNK
968
18

Toppy-M-7B-GGUF

NaNK
license:cc-by-nc-4.0
966
26

Orca-2-13B-GGUF

NaNK
llama
961
66

finance-LLM-GGUF

TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z)

Finance LLM - GGUF
- Model creator: AdaptLLM
- Original model: Finance LLM

This repo contains GGUF format model files for AdaptLLM's Finance LLM. These files were quantised using hardware kindly provided by Massed Compute. GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. Here is an incomplete list of clients and libraries that are known to support GGUF:

- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
- GPT4All, a free and open source local running GUI, supporting Windows, Linux and macOS with full GPU accel.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration. Linux available, in beta as of 27/11/2023.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy to use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.
- ctransformers, a Python library with GPU accel, LangChain support, and OpenAI-compatible API server. Note, as of time of writing (November 27th 2023), ctransformers has not been updated in a long time and does not support many recent models.

AWQ model(s) for GPU inference. GPTQ models for GPU inference, with multiple quantisation parameter options.
2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference. AdaptLLM's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions. These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third party UIs and libraries - please see the list at the top of this README.

- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw)
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw

Refer to the Provided Files table below to see what files use which methods, and how.
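The bpw figures above follow from the block layout. A worked check for Q4_K, assuming one fp16 scale and one fp16 min per super-block (the standard k-quant layout in llama.cpp; the per-type header details are an assumption, not stated in this card):

```python
# Worked check of the 4.5 bpw figure for GGML_TYPE_Q4_K.
weights = 8 * 32          # 8 blocks of 32 weights per super-block
qbits = 4 * weights       # 4-bit quantized weights
scales = 6 * 8 * 2        # 6-bit scale + 6-bit min per block
supers = 16 * 2           # fp16 super-block scale and min
bpw = (qbits + scales + supers) / weights
print(bpw)  # 4.5
```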
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| finance-llm.Q2_K.gguf | Q2_K | 2 | 2.83 GB | 5.33 GB | smallest, significant quality loss - not recommended for most purposes |
| finance-llm.Q3_K_S.gguf | Q3_K_S | 3 | 2.95 GB | 5.45 GB | very small, high quality loss |
| finance-llm.Q3_K_M.gguf | Q3_K_M | 3 | 3.30 GB | 5.80 GB | very small, high quality loss |
| finance-llm.Q3_K_L.gguf | Q3_K_L | 3 | 3.60 GB | 6.10 GB | small, substantial quality loss |
| finance-llm.Q4_0.gguf | Q4_0 | 4 | 3.83 GB | 6.33 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| finance-llm.Q4_K_S.gguf | Q4_K_S | 4 | 3.86 GB | 6.36 GB | small, greater quality loss |
| finance-llm.Q4_K_M.gguf | Q4_K_M | 4 | 4.08 GB | 6.58 GB | medium, balanced quality - recommended |
| finance-llm.Q5_0.gguf | Q5_0 | 5 | 4.65 GB | 7.15 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| finance-llm.Q5_K_S.gguf | Q5_K_S | 5 | 4.65 GB | 7.15 GB | large, low quality loss - recommended |
| finance-llm.Q5_K_M.gguf | Q5_K_M | 5 | 4.78 GB | 7.28 GB | large, very low quality loss - recommended |
| finance-llm.Q6_K.gguf | Q6_K | 6 | 5.53 GB | 8.03 GB | very large, extremely low quality loss |
| finance-llm.Q8_0.gguf | Q8_0 | 8 | 7.16 GB | 9.66 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file. The following clients/libraries will automatically download models for you, providing a list of available models to choose from: Under Download Model, you can enter the model repo: TheBloke/finance-LLM-GGUF and below it, a specific filename to download, such as: finance-llm.Q4_K_M.gguf.
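The "Max RAM required" column can be sanity-checked against the file sizes: each entry is the file size plus a roughly constant ~2.5 GB overhead. This is an observation from the table, not an official formula:

```python
# Max RAM ~= file size + 2.5 GB for every quant in the table above.
FILES = {  # name: (file size GB, max RAM GB), copied from the table
    "Q2_K": (2.83, 5.33),
    "Q4_K_M": (4.08, 6.58),
    "Q8_0": (7.16, 9.66),
}
for name, (size_gb, ram_gb) in FILES.items():
    overhead = round(ram_gb - size_gb, 2)
    assert overhead == 2.5
    print(f"{name}: {size_gb} GB file -> ~{ram_gb} GB RAM (+{overhead} GB)")
```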
On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library: Then you can download any individual model file to the current directory, at high speed, with a command like this: More advanced huggingface-cli download usage (click to read) You can also download multiple files at once with a pattern: For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer`: And set environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`: Windows Command Line users: You can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.

Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 2048` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. Note that longer sequence lengths require much more resources, so you may need to reduce this value. If you want to have a chat-style conversation, replace the `-p ` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation. Further instructions can be found in the text-generation-webui documentation, here: text-generation-webui/docs/04 ‐ Model Tab.md.

You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries. Note that at the time of writing (Nov 27th 2023), ctransformers has not been updated for some time and is not compatible with some recent models. Therefore I recommend you use llama-cpp-python. How to load this model in Python code, using llama-cpp-python. For full documentation, please see: llama-cpp-python docs.
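The download and Python-loading steps described above can be sketched together. This is a hedged reconstruction (the snippets were lost in formatting): the filename comes from the Provided Files table, while the layer count, context length, and prompt are illustrative assumptions.

```python
# Hedged sketch: fetch one GGUF file with huggingface-hub, then load it
# with llama-cpp-python.
def gguf_filename(quant: str = "Q4_K_M") -> str:
    """Filenames follow the pattern in the Provided Files table."""
    return f"finance-llm.{quant}.gguf"

if __name__ == "__main__":
    from huggingface_hub import hf_hub_download  # pip3 install huggingface-hub
    from llama_cpp import Llama                  # pip3 install llama-cpp-python

    # Tip from the README: set HF_HUB_ENABLE_HF_TRANSFER=1 (with hf_transfer
    # installed) to speed up downloads on >=1 Gbit/s connections.
    path = hf_hub_download(
        repo_id="TheBloke/finance-LLM-GGUF",
        filename=gguf_filename(),
        local_dir=".",
    )
    # n_gpu_layers=32 mirrors `-ngl 32`; n_ctx=2048 mirrors `-c 2048`.
    llm = Llama(model_path=path, n_gpu_layers=32, n_ctx=2048)
    out = llm("### Instruction:\nWhat is EBITDA?\n\n### Response:", max_tokens=256)
    print(out["choices"][0]["text"])
```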
Run one of the following commands, according to your system: Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python LangChain + ctransformers For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Michael Levine, 阿明, Trailburnt, Nikolai Manek, John Detwiler, Randy H, Will Dee, Sebastain Graf, NimbleBox.ai, Eugene Pentland, Emad Mostaque, Ai Maven, Jim Angel, Jeff Scroggin, Michael Davis, Manuel Alberto Morcote, Stephen Murray, Robert, Justin Joy, Luke @flexchar, Brandon Frisco, Elijah Stavena, SX, Dan Guido, Undi ., Komninos Chatzipapas, Shadi, theTransient, Lone Striker, Raven Klaugh, jjj, Cap'n Zoog, Michel-Marie MAUDET (LINAGORA), Matthew Berman, David, Fen Risland, Omer Bin Jawed, Luke Pendergrass, Kalila, OG, Erik Bjäreholt, Rooh Singh, Joseph William Delisle, Dan Lewis, TL, John Villwock, AzureBlack, Brad, Pedro Madruga, Caitlyn Gatomon, K, jinyuan sun, Mano Prime, Alex, Jeffrey Morgan, Alicia Loh, Illia Dulskyi, Chadd, transmissions 11, fincy, Rainer Wilmers, ReadyPlayerEmma, knownsqashed, Mandus, biorpg, Deo Leter, Brandon Phillips, SuperWojo, Sean Connelly, Iucharbius, Jack West, Harry Royden McLaughlin, Nicholas, terasurfer, Vitor Caleffi, Duane Dunston, Johann-Peter Hartmann, David Ziegler, Olakabola, Ken Nordquist, Trenton Dambrowitz, Tom X Nguyen, Vadim, 
Ajan Kanaga, Leonard Tan, Clay Pascal, Alexandros Triantafyllidis, JM33133, Xule, vamX, ya boyyy, subjectnull, Talal Aujan, Alps Aficionado, wassieverse, Ari Malik, James Bentley, Woland, Spencer Kim, Michael Dempsey, Fred von Graf, Elle, zynix, William Richards, Stanislav Ovsiannikov, Edmond Seymore, Jonathan Leane, Martin Kemka, usrbinkat, Enrico Ros And thank you again to a16z for their generous grant.

Adapt (Large) Language Models to Domains

This repo contains the domain-specific base model developed from LLaMA-1-7B, using the method in our paper Adapting Large Language Models via Reading Comprehension. We explore continued pre-training on domain-specific corpora for large language models. While this approach enriches LLMs with domain knowledge, it significantly hurts their prompting ability for question answering. Inspired by human learning via reading comprehension, we propose a simple method to transform large-scale pre-training corpora into reading comprehension texts, consistently improving prompting performance across tasks in biomedicine, finance, and law domains. Our 7B model competes with much larger domain-specific models like BloombergGPT-50B. 🤗 We are currently working hard on developing models across different domains, scales and architectures! Please stay tuned! 🤗

Updates
- 12/19: Released our 13B base models developed from LLaMA-1-13B.
- 12/8: Released our chat models developed from LLaMA-2-Chat-7B.
- 9/18: Released our paper, code, data, and base models developed from LLaMA-1-7B.
Domain-Specific LLaMA-1

LLaMA-1-7B: In our paper, we develop three domain-specific models from LLaMA-1-7B, which are also available on Huggingface: Biomedicine-LLM, Finance-LLM and Law-LLM. The performance of our AdaptLLM models compared to other domain-specific LLMs is:

LLaMA-1-13B: Moreover, we scale up our base model to LLaMA-1-13B to see if our method is similarly effective for larger-scale models, and the results are consistently positive too: Biomedicine-LLM-13B, Finance-LLM-13B and Law-LLM-13B.

Domain-Specific LLaMA-2-Chat

Our method is also effective for aligned models! LLaMA-2-Chat requires a specific data format, and our reading comprehension texts can perfectly fit the data format by transforming the reading comprehension into a multi-turn conversation. We have also open-sourced chat models in different domains: Biomedicine-Chat, Finance-Chat and Law-Chat.

Domain-Specific Tasks

To easily reproduce our results, we have uploaded the filled-in zero/few-shot input instructions and output completions of each domain-specific task: biomedicine-tasks, finance-tasks, and law-tasks. Note: those filled-in instructions are specifically tailored for models before alignment and do NOT fit the specific data format required for chat models.

Citation

If you find our work helpful, please cite us:

NaNK
llama
960
25

SOLAR-10.7B-v1.0-GGUF

NaNK
license:apache-2.0
952
14

wizardLM-7B-HF

NaNK
llama
951
95

zephyr-7B-beta-GPTQ

TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z)

Zephyr 7B Beta - GPTQ
- Model creator: Hugging Face H4
- Original model: Zephyr 7B Beta

This repo contains GPTQ model files for Hugging Face H4's Zephyr 7B Beta. Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them. These files were quantised using hardware kindly provided by Massed Compute.

AWQ model(s) for GPU inference. GPTQ models for GPU inference, with multiple quantisation parameter options. 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference. Hugging Face H4's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions.

These GPTQ models are known to work in the following inference servers/webuis.
- text-generation-webui
- KoboldAI United
- LoLLMS Web UI
- Hugging Face Text Generation Inference (TGI)

This may not be a complete list; if you know of others, please let me know! Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements. Each separate quant is in a different branch. See below for instructions on fetching from different branches. Most GPTQ files are made with AutoGPTQ. Mistral models are currently made with Transformers.

- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The calibration dataset used during quantisation.
Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ calibration dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
- ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama and Mistral models in 4-bit.

| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
| ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
| main | 4 | 128 | Yes | 0.1 | wikitext | 4096 | 4.16 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
| gptq-4bit-32g-actorder_True | 4 | 32 | Yes | 0.1 | wikitext | 4096 | 4.57 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. |
| gptq-8bit--1g-actorder_True | 8 | None | Yes | 0.1 | wikitext | 4096 | 7.52 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements. |
| gptq-8bit-128g-actorder_True | 8 | 128 | Yes | 0.1 | wikitext | 4096 | 7.68 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. |
| gptq-8bit-32g-actorder_True | 8 | 32 | Yes | 0.1 | wikitext | 4096 | 8.17 GB | No | 8-bit, with group size 32g and Act Order for maximum inference quality. |
| gptq-4bit-64g-actorder_True | 4 | 64 | Yes | 0.1 | wikitext | 4096 | 4.29 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. |

To download from the `main` branch, enter `TheBloke/zephyr-7B-beta-GPTQ` in the "Download model" box. To download from another branch, add `:branchname` to the end of the download name, eg `TheBloke/zephyr-7B-beta-GPTQ:gptq-4bit-32g-actorder_True`.

I recommend using the `huggingface-hub` Python library: To download the `main` branch to a folder called `zephyr-7B-beta-GPTQ`: To download from a different branch, add the `--revision` parameter: If you remove the `--local-dir-use-symlinks False` parameter, the files will instead be stored in the central Hugging Face cache directory (default location on Linux is: `~/.cache/huggingface`), and symlinks will be added to the specified `--local-dir`, pointing to their real location in the cache. This allows for interrupted downloads to be resumed, and allows you to quickly clone the repo to multiple places on disk without triggering a download again. The downside, and the reason why I don't list that as the default option, is that the files are then hidden away in a cache folder and it's harder to know where your disk space is being used, and to clear it up if/when you want to remove a downloaded model. The cache location can be changed with the `HF_HOME` environment variable, and/or the `--cache-dir` parameter to `huggingface-cli`. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI.

To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer`: And set environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`: Windows Command Line users: You can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.

To clone a specific branch with `git`, use a command like this: Note that using Git with HF repos is strongly discouraged.
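The `huggingface-hub` branch download described above can be sketched as follows; the branch name is taken from the Provided Files table, and the target folder is an example:

```python
# Hedged sketch: download a specific GPTQ quant branch with huggingface-hub.
def branch(bits: int = 4, gs: int = 32) -> str:
    """Branch names follow the pattern in the Provided Files table."""
    return f"gptq-{bits}bit-{gs}g-actorder_True"

if __name__ == "__main__":
    from huggingface_hub import snapshot_download  # pip3 install huggingface-hub

    # Omit `revision` to fetch the `main` branch instead.
    snapshot_download(
        repo_id="TheBloke/zephyr-7B-beta-GPTQ",
        revision=branch(),
        local_dir="zephyr-7B-beta-GPTQ",
    )
```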
It will be much slower than using `huggingface-hub`, and will use twice as much disk space, as it has to store the model files twice (it stores every byte both in the intended target folder, and again in the `.git` folder as a blob.)

How to easily download and use this model in text-generation-webui

Please make sure you're using the latest version of text-generation-webui. It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.

1. Click the Model tab.
2. Under Download custom model or LoRA, enter `TheBloke/zephyr-7B-beta-GPTQ`.
   - To download from a specific branch, enter for example `TheBloke/zephyr-7B-beta-GPTQ:gptq-4bit-32g-actorder_True` - see Provided Files above for the list of branches for each option.
3. Click Download.
4. The model will start downloading. Once it's finished it will say "Done".
5. In the top left, click the refresh icon next to Model.
6. In the Model dropdown, choose the model you just downloaded: `zephyr-7B-beta-GPTQ`
7. The model will automatically load, and is now ready for use!
8. If you want any custom settings, set them and then click Save settings for this model followed by Reload the Model in the top right.
   - Note that you do not need to and should not set manual GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
9. Once you're ready, click the Text Generation tab and enter a prompt to get started!

Serving this model from Text Generation Inference (TGI)

It's recommended to use TGI version 1.1.0 or later. The official Docker container is: `ghcr.io/huggingface/text-generation-inference:1.1.0`. Example Python code for interfacing with TGI (requires huggingface-hub 0.17.0 or later): Requires: Transformers 4.33.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.
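The TGI client code mentioned above was lost in formatting; a hedged sketch follows, assuming a TGI server is already running locally on the default port and using Zephyr's chat token format (a reconstruction, not quoted from this page):

```python
# Hedged sketch: query a running TGI server with huggingface_hub's
# InferenceClient (requires huggingface-hub >= 0.17.0).
def tgi_url(host: str = "127.0.0.1", port: int = 8080) -> str:
    return f"http://{host}:{port}"

if __name__ == "__main__":
    from huggingface_hub import InferenceClient

    client = InferenceClient(model=tgi_url())
    # Assumed Zephyr chat format: <|system|> / <|user|> / <|assistant|> tokens.
    prompt = "<|system|>\n</s>\n<|user|>\nTell me about AI</s>\n<|assistant|>\n"
    print(client.text_generation(prompt, max_new_tokens=128))
```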
If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead: The files provided are tested to work with Transformers. For non-Mistral models, AutoGPTQ can also be used directly. ExLlama is compatible with Llama and Mistral models in 4-bit. Please see the Provided Files table above for per-file compatibility. For a list of clients/servers, please see "Known compatible clients / servers", above. For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. 
Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Pierre Kircher, Stanislav Ovsiannikov, Michael Levine, Eugene Pentland, Andrey, 준교 김, Randy H, Fred von Graf, Artur Olbinski, Caitlyn Gatomon, terasurfer, Jeff Scroggin, James Bentley, Vadim, Gabriel Puliatti, Harry Royden McLaughlin, Sean Connelly, Dan Guido, Edmond Seymore, Alicia Loh, subjectnull, AzureBlack, Manuel Alberto Morcote, Thomas Belote, Lone Striker, Chris Smitley, Vitor Caleffi, Johann-Peter Hartmann, Clay Pascal, biorpg, Brandon Frisco, sidney chen, transmissions 11, Pedro Madruga, jinyuan sun, Ajan Kanaga, Emad Mostaque, Trenton Dambrowitz, Jonathan Leane, Iucharbius, usrbinkat, vamX, George Stoitzev, Luke Pendergrass, theTransient, Olakabola, Swaroop Kallakuri, Cap'n Zoog, Brandon Phillips, Michael Dempsey, Nikolai Manek, danny, Matthew Berman, Gabriel Tamborski, alfiei, Raymond Fosdick, Tom X Nguyen, Raven Klaugh, LangChain4j, Magnesian, Illia Dulskyi, David Ziegler, Mano Prime, Luis Javier Navarrete Lozano, Erik Bjäreholt, 阿明, Nathan Dryer, Alex, Rainer Wilmers, zynix, TL, Joseph William Delisle, John Villwock, Nathan LeClaire, Willem Michiel, Joguhyik, GodLy, OG, Alps Aficionado, Jeffrey Morgan, ReadyPlayerEmma, Tiffany J. Kim, Sebastain Graf, Spencer Kim, Michael Davis, webtim, Talal Aujan, knownsqashed, John Detwiler, Imad Khwaja, Deo Leter, Jerry Meng, Elijah Stavena, Rooh Singh, Pieter, SuperWojo, Alexandros Triantafyllidis, Stephen Murray, Ai Maven, ya boyyy, Enrico Ros, Ken Nordquist, Deep Realms, Nicholas, Spiking Neurons AB, Elle, Will Dee, Jack West, RoA, Luke @flexchar, Viktor Bowallius, Derek Yates, Subspace Studios, jjj, Toran Billups, Asp the Wyvern, Fen Risland, Ilya, NimbleBox.ai, Chadd, Nitin Borwankar, Emre, Mandus, Leonard Tan, Kalila, K, Trailburnt, SX, Cory Kujawski And thank you again to a16z for their generous grant. 
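The Transformers + AutoGPTQ loading route described earlier in this README can be sketched as below; the revision-picking helper is purely illustrative, based on the ExLlama column of the Provided Files table:

```python
# Hedged sketch: load the GPTQ model with Transformers (>= 4.33), with
# Optimum and AutoGPTQ installed.
REPO = "TheBloke/zephyr-7B-beta-GPTQ"

def pick_revision(exllama: bool) -> str:
    """Illustrative helper: per the table, 4-bit branches are ExLlama-
    compatible while 8-bit branches are not."""
    if exllama:
        return "gptq-4bit-32g-actorder_True"
    return "gptq-8bit-128g-actorder_True"

if __name__ == "__main__":
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        REPO, device_map="auto", revision=pick_revision(exllama=True)
    )
    tokenizer = AutoTokenizer.from_pretrained(REPO)
```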
Original model card: Hugging Face H4's Zephyr 7B Beta

Zephyr is a series of language models that are trained to act as helpful assistants. Zephyr-7B-β is the second model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO). We found that removing the in-built alignment of these datasets boosted performance on MT Bench and made the model more helpful. However, this means that the model is likely to generate problematic text when prompted to do so, and it should only be used for educational and research purposes. You can find more details in the technical report.

- Model type: A 7B parameter GPT-like model fine-tuned on a mix of publicly available, synthetic datasets.
- Language(s) (NLP): Primarily English
- License: MIT
- Finetuned from model: mistralai/Mistral-7B-v0.1
- Repository: https://github.com/huggingface/alignment-handbook
- Demo: https://huggingface.co/spaces/HuggingFaceH4/zephyr-chat
- Chatbot Arena: Evaluate Zephyr 7B against 10+ LLMs in the LMSYS arena: http://arena.lmsys.org

At the time of release, Zephyr-7B-β is the highest ranked 7B chat model on the MT-Bench and AlpacaEval benchmarks:

| Model | Size | Alignment | MT-Bench (score) | AlpacaEval (win rate %) |
|-------------|-----|----|---------------|--------------|
| StableLM-Tuned-α | 7B | dSFT | 2.75 | - |
| MPT-Chat | 7B | dSFT | 5.42 | - |
| Xwin-LM v0.1 | 7B | dPPO | 6.19 | 87.83 |
| Mistral-Instruct v0.1 | 7B | - | 6.84 | - |
| Zephyr-7b-α | 7B | dDPO | 6.88 | - |
| Zephyr-7b-β 🪁 | 7B | dDPO | 7.34 | 90.60 |
| Falcon-Instruct | 40B | dSFT | 5.17 | 45.71 |
| Guanaco | 65B | SFT | 6.41 | 71.80 |
| Llama2-Chat | 70B | RLHF | 6.86 | 92.66 |
| Vicuna v1.3 | 33B | dSFT | 7.12 | 88.99 |
| WizardLM v1.0 | 70B | dSFT | 7.71 | - |
| Xwin-LM v0.1 | 70B | dPPO | - | 95.57 |
| GPT-3.5-turbo | - | RLHF | 7.94 | 89.37 |
| Claude 2 | - | RLHF | 8.06 | 91.36 |
| GPT-4 | - | RLHF | 8.99 | 95.28 |

In particular, on several categories of MT-Bench, Zephyr-7B-β has strong performance compared to larger open models like Llama2-Chat-70B. However, on more complex tasks like coding and mathematics, Zephyr-7B-β lags behind proprietary models, and more research is needed to close the gap.

The model was initially fine-tuned on a filtered and preprocessed version of the `UltraChat` dataset, which contains a diverse range of synthetic dialogues generated by ChatGPT. We then further aligned the model with 🤗 TRL's `DPOTrainer` on the openbmb/UltraFeedback dataset, which contains 64k prompts and model completions that are ranked by GPT-4. As a result, the model can be used for chat, and you can check out our demo to test its capabilities. You can find the datasets used for training Zephyr-7B-β here. Here's how you can run the model using the `pipeline()` function from 🤗 Transformers: Zephyr-7B-β has not been aligned to human preferences with techniques like RLHF or deployed with in-the-loop filtering of responses like ChatGPT, so the model can produce problematic outputs (especially when prompted to do so). It is also unknown what the size and composition of the corpus used to train the base model (`mistralai/Mistral-7B-v0.1`) were; however, it is likely to have included a mix of Web data and technical sources like books and code. See the Falcon 180B model card for an example of this.
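For readers constructing prompts by hand (e.g. for the GGUF/GPTQ quants of this model), Zephyr's chat template can be built as a plain string. This format is a reconstruction of the model's documented template, not quoted from this page:

```python
# Hedged sketch of Zephyr's chat prompt format.
def zephyr_prompt(system: str, user: str) -> str:
    """Assumed template: <|system|>, <|user|>, <|assistant|> headers,
    each turn terminated with </s>."""
    return f"<|system|>\n{system}</s>\n<|user|>\n{user}</s>\n<|assistant|>\n"

print(zephyr_prompt("You are a helpful assistant.", "Hello!"))
```

In practice, `tokenizer.apply_chat_template` produces the same string from role/content messages, so prefer it when the tokenizer is available.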
During DPO training, this model achieves the following results on the evaluation set:

- Loss: 0.7496
- Rewards/chosen: -4.5221
- Rewards/rejected: -8.3184
- Rewards/accuracies: 0.7812
- Rewards/margins: 3.7963
- Logps/rejected: -340.1541
- Logps/chosen: -299.4561
- Logits/rejected: -2.3081
- Logits/chosen: -2.3531

The following hyperparameters were used during training:

- learning_rate: 5e-07
- train_batch_size: 2
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 16
- total_train_batch_size: 32
- total_eval_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3.0

The table below shows the full set of DPO training metrics:

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6284 | 0.05 | 100 | 0.6098 | 0.0425 | -0.1872 | 0.7344 | 0.2297 | -258.8416 | -253.8099 | -2.7976 | -2.8234 |
| 0.4908 | 0.1 | 200 | 0.5426 | -0.0279 | -0.6842 | 0.75 | 0.6563 | -263.8124 | -254.5145 | -2.7719 | -2.7960 |
| 0.5264 | 0.15 | 300 | 0.5324 | 0.0414 | -0.9793 | 0.7656 | 1.0207 | -266.7627 | -253.8209 | -2.7892 | -2.8122 |
| 0.5536 | 0.21 | 400 | 0.4957 | -0.0185 | -1.5276 | 0.7969 | 1.5091 | -272.2460 | -254.4203 | -2.8542 | -2.8764 |
| 0.5362 | 0.26 | 500 | 0.5031 | -0.2630 | -1.5917 | 0.7812 | 1.3287 | -272.8869 | -256.8653 | -2.8702 | -2.8958 |
| 0.5966 | 0.31 | 600 | 0.5963 | -0.2993 | -1.6491 | 0.7812 | 1.3499 | -273.4614 | -257.2279 | -2.8778 | -2.8986 |
| 0.5014 | 0.36 | 700 | 0.5382 | -0.2859 | -1.4750 | 0.75 | 1.1891 | -271.7204 | -257.0942 | -2.7659 | -2.7869 |
| 0.5334 | 0.41 | 800 | 0.5677 | -0.4289 | -1.8968 | 0.7969 | 1.4679 | -275.9378 | -258.5242 | -2.7053 | -2.7265 |
| 0.5251 | 0.46 | 900 | 0.5772 | -0.2116 | -1.3107 | 0.7344 | 1.0991 | -270.0768 | -256.3507 | -2.8463 | -2.8662 |
| 0.5205 | 0.52 | 1000 | 0.5262 | -0.3792 | -1.8585 | 0.7188 | 1.4793 | -275.5552 | -258.0276 | -2.7893 | -2.7979 |
| 0.5094 | 0.57 | 1100 | 0.5433 | -0.6279 | -1.9368 | 0.7969 | 1.3089 | -276.3377 | -260.5136 | -2.7453 | -2.7536 |
| 0.5837 | 0.62 | 1200 | 0.5349 | -0.3780 | -1.9584 | 0.7656 | 1.5804 | -276.5542 | -258.0154 | -2.7643 | -2.7756 |
| 0.5214 | 0.67 | 1300 | 0.5732 | -1.0055 | -2.2306 | 0.7656 | 1.2251 | -279.2761 | -264.2903 | -2.6986 | -2.7113 |
| 0.6914 | 0.72 | 1400 | 0.5137 | -0.6912 | -2.1775 | 0.7969 | 1.4863 | -278.7448 | -261.1467 | -2.7166 | -2.7275 |
| 0.4655 | 0.77 | 1500 | 0.5090 | -0.7987 | -2.2930 | 0.7031 | 1.4943 | -279.8999 | -262.2220 | -2.6651 | -2.6838 |
| 0.5731 | 0.83 | 1600 | 0.5312 | -0.8253 | -2.3520 | 0.7812 | 1.5268 | -280.4902 | -262.4876 | -2.6543 | -2.6728 |
| 0.5233 | 0.88 | 1700 | 0.5206 | -0.4573 | -2.0951 | 0.7812 | 1.6377 | -277.9205 | -258.8084 | -2.6870 | -2.7097 |
| 0.5593 | 0.93 | 1800 | 0.5231 | -0.5508 | -2.2000 | 0.7969 | 1.6492 | -278.9703 | -259.7433 | -2.6221 | -2.6519 |
| 0.4967 | 0.98 | 1900 | 0.5290 | -0.5340 | -1.9570 | 0.8281 | 1.4230 | -276.5395 | -259.5749 | -2.6564 | -2.6878 |
| 0.0921 | 1.03 | 2000 | 0.5368 | -1.1376 | -3.1615 | 0.7812 | 2.0239 | -288.5854 | -265.6111 | -2.6040 | -2.6345 |
| 0.0733 | 1.08 | 2100 | 0.5453 | -1.1045 | -3.4451 | 0.7656 | 2.3406 | -291.4208 | -265.2799 | -2.6289 | -2.6595 |
| 0.0972 | 1.14 | 2200 | 0.5571 | -1.6915 | -3.9823 | 0.8125 | 2.2908 | -296.7934 | -271.1505 | -2.6471 | -2.6709 |
| 0.1058 | 1.19 | 2300 | 0.5789 | -1.0621 | -3.8941 | 0.7969 | 2.8319 | -295.9106 | -264.8563 | -2.5527 | -2.5798 |
| 0.2423 | 1.24 | 2400 | 0.5455 | -1.1963 | -3.5590 | 0.7812 | 2.3627 | -292.5599 | -266.1981 | -2.5414 | -2.5784 |
| 0.1177 | 1.29 | 2500 | 0.5889 | -1.8141 | -4.3942 | 0.7969 | 2.5801 | -300.9120 | -272.3761 | -2.4802 | -2.5189 |
| 0.1213 | 1.34 | 2600 | 0.5683 | -1.4608 | -3.8420 | 0.8125 | 2.3812 | -295.3901 | -268.8436 | -2.4774 | -2.5207 |
| 0.0889 | 1.39 | 2700 | 0.5890 | -1.6007 | -3.7337 | 0.7812 | 2.1330 | -294.3068 | -270.2423 | -2.4123 | -2.4522 |
| 0.0995 | 1.45 | 2800 | 0.6073 | -1.5519 | -3.8362 | 0.8281 | 2.2843 | -295.3315 | -269.7538 | -2.4685 | -2.5050 |
| 0.1145 | 1.5 | 2900 | 0.5790 | -1.7939 | -4.2876 | 0.8438 | 2.4937 | -299.8461 | -272.1744 | -2.4272 | -2.4674 |
| 0.0644 | 1.55 | 3000 | 0.5735 | -1.7285 | -4.2051 | 0.8125 | 2.4766 | -299.0209 | -271.5201 | -2.4193 | -2.4574 |
| 0.0798 | 1.6 | 3100 | 0.5537 | -1.7226 | -4.2850 | 0.8438 | 2.5624 | -299.8200 | -271.4610 | -2.5367 | -2.5696 |
| 0.1013 | 1.65 | 3200 | 0.5575 | -1.5715 | -3.9813 | 0.875 | 2.4098 | -296.7825 | -269.9498 | -2.4926 | -2.5267 |
| 0.1254 | 1.7 | 3300 | 0.5905 | -1.6412 | -4.4703 | 0.8594 | 2.8291 | -301.6730 | -270.6473 | -2.5017 | -2.5340 |
| 0.085 | 1.76 | 3400 | 0.6133 | -1.9159 | -4.6760 | 0.8438 | 2.7601 | -303.7296 | -273.3941 | -2.4614 | -2.4960 |
| 0.065 | 1.81 | 3500 | 0.6074 | -1.8237 | -4.3525 | 0.8594 | 2.5288 | -300.4951 | -272.4724 | -2.4597 | -2.5004 |
| 0.0755 | 1.86 | 3600 | 0.5836 | -1.9252 | -4.4005 | 0.8125 | 2.4753 | -300.9748 | -273.4872 | -2.4327 | -2.4716 |
| 0.0746 | 1.91 | 3700 | 0.5789 | -1.9280 | -4.4906 | 0.8125 | 2.5626 | -301.8762 | -273.5149 | -2.4686 | -2.5115 |
| 0.1348 | 1.96 | 3800 | 0.6015 | -1.8658 | -4.2428 | 0.8281 | 2.3769 | -299.3976 | -272.8936 | -2.4943 | -2.5393 |
| 0.0217 | 2.01 | 3900 | 0.6122 | -2.3335 | -4.9229 | 0.8281 | 2.5894 | -306.1988 | -277.5699 | -2.4841 | -2.5272 |
| 0.0219 | 2.07 | 4000 | 0.6522 | -2.9890 | -6.0164 | 0.8281 | 3.0274 | -317.1334 | -284.1248 | -2.4105 | -2.4545 |
| 0.0119 | 2.12 | 4100 | 0.6922 | -3.4777 | -6.6749 | 0.7969 | 3.1972 | -323.7187 | -289.0121 | -2.4272 | -2.4699 |
| 0.0153 | 2.17 | 4200 | 0.6993 | -3.2406 | -6.6775 | 0.7969 | 3.4369 | -323.7453 | -286.6413 | -2.4047 | -2.4465 |
| 0.011 | 2.22 | 4300 |
0.7178 | -3.7991 | -7.4397 | 0.7656 | 3.6406 | -331.3667 | -292.2260 | -2.3843 | -2.4290 | | 0.0072 | 2.27 | 4400 | 0.6840 | -3.3269 | -6.8021 | 0.8125 | 3.4752 | -324.9908 | -287.5042 | -2.4095 | -2.4536 | | 0.0197 | 2.32 | 4500 | 0.7013 | -3.6890 | -7.3014 | 0.8125 | 3.6124 | -329.9841 | -291.1250 | -2.4118 | -2.4543 | | 0.0182 | 2.37 | 4600 | 0.7476 | -3.8994 | -7.5366 | 0.8281 | 3.6372 | -332.3356 | -293.2291 | -2.4163 | -2.4565 | | 0.0125 | 2.43 | 4700 | 0.7199 | -4.0560 | -7.5765 | 0.8438 | 3.5204 | -332.7345 | -294.7952 | -2.3699 | -2.4100 | | 0.0082 | 2.48 | 4800 | 0.7048 | -3.6613 | -7.1356 | 0.875 | 3.4743 | -328.3255 | -290.8477 | -2.3925 | -2.4303 | | 0.0118 | 2.53 | 4900 | 0.6976 | -3.7908 | -7.3152 | 0.8125 | 3.5244 | -330.1224 | -292.1431 | -2.3633 | -2.4047 | | 0.0118 | 2.58 | 5000 | 0.7198 | -3.9049 | -7.5557 | 0.8281 | 3.6508 | -332.5271 | -293.2844 | -2.3764 | -2.4194 | | 0.006 | 2.63 | 5100 | 0.7506 | -4.2118 | -7.9149 | 0.8125 | 3.7032 | -336.1194 | -296.3530 | -2.3407 | -2.3860 | | 0.0143 | 2.68 | 5200 | 0.7408 | -4.2433 | -7.9802 | 0.8125 | 3.7369 | -336.7721 | -296.6682 | -2.3509 | -2.3946 | | 0.0057 | 2.74 | 5300 | 0.7552 | -4.3392 | -8.0831 | 0.7969 | 3.7439 | -337.8013 | -297.6275 | -2.3388 | -2.3842 | | 0.0138 | 2.79 | 5400 | 0.7404 | -4.2395 | -7.9762 | 0.8125 | 3.7367 | -336.7322 | -296.6304 | -2.3286 | -2.3737 | | 0.0079 | 2.84 | 5500 | 0.7525 | -4.4466 | -8.2196 | 0.7812 | 3.7731 | -339.1662 | -298.7007 | -2.3200 | -2.3641 | | 0.0077 | 2.89 | 5600 | 0.7520 | -4.5586 | -8.3485 | 0.7969 | 3.7899 | -340.4545 | -299.8206 | -2.3078 | -2.3517 | | 0.0094 | 2.94 | 5700 | 0.7527 | -4.5542 | -8.3509 | 0.7812 | 3.7967 | -340.4790 | -299.7773 | -2.3062 | -2.3510 | | 0.0054 | 2.99 | 5800 | 0.7520 | -4.5169 | -8.3079 | 0.7812 | 3.7911 | -340.0493 | -299.4038 | -2.3081 | -2.3530 | - Transformers 4.35.0.dev0 - Pytorch 2.0.1+cu118 - Datasets 2.12.0 - Tokenizers 0.14.0 If you find Zephyr-7B-β is useful in your work, please cite it with:
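The distributed batch-size figures above can be checked directly. A minimal sketch (a gradient accumulation of 1 is an assumption, consistent with 2 × 16 = 32):

```python
# Effective batch sizes implied by the hyperparameters above.
per_device_train_batch_size = 2
per_device_eval_batch_size = 4
num_devices = 16
gradient_accumulation_steps = 1  # assumption: not stated in the card

total_train_batch_size = per_device_train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = per_device_eval_batch_size * num_devices
print(total_train_batch_size, total_eval_batch_size)  # 32 64

# The reported DPO reward margin is (chosen - rejected) at the final step:
margin = -4.5221 - (-8.3184)
print(round(margin, 4))  # 3.7963
```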

NaNK
license:mit
944
58

claude2-alpaca-7B-GGUF

NaNK
llama
943
15

bun_mistral_7b_v2-GGUF

NaNK
942
1

Llama-2-13B-fp16

NaNK
llama
931
63

Kunoichi-7B-GGUF

NaNK
license:cc-by-nc-4.0
930
35

SciPhi-Self-RAG-Mistral-7B-32k-GGUF

NaNK
license:mit
919
22

Synthia-7B-v1.3-GGUF

NaNK
license:apache-2.0
911
46

MistRP-Airoboros-7B-GGUF

NaNK
license:cc-by-nc-4.0
911
6

Tinyllama-2-1b-miniguanaco-GGUF

NaNK
llama
904
16

CodeLlama-70B-hf-GGUF

NaNK
llama
902
42

orca_mini_v3_7B-GGUF

NaNK
llama
901
11

Nethena-MLewd-Xwin-23B-GGUF

NaNK
llama
892
36

Llama-2-7B-32K-Instruct-GGUF

NaNK
llama
888
55

Llama-2-70B-fp16

NaNK
llama
888
47

MythoMax-L2-13B-GPTQ

NaNK
llama
880
215

phi-2-orange-GGUF

license:mit
880
20

Dolphin-Llama-13B-GGUF

NaNK
llama
880
4

CAMEL-13B-Role-Playing-Data-GGUF

NaNK
llama
871
5

LLaMA-13b-GGUF

NaNK
llama
867
4

WizardLM-30B-Uncensored-GGUF

NaNK
llama
863
15

Wizard-Vicuna-7B-Uncensored-HF

NaNK
llama
849
24

WhiteRabbitNeo-33B-v1-GGUF

NaNK
835
31

CollectiveCognition-v1-Mistral-7B-GGUF

NaNK
license:apache-2.0
834
4

Mistral-Trismegistus-7B-GGUF

NaNK
license:apache-2.0
830
15

Nous-Hermes-2-Mixtral-8x7B-SFT-GGUF

NaNK
license:apache-2.0
827
25

Mixtral_7Bx2_MoE-GGUF

NaNK
license:cc-by-nc-4.0
827
24

dolphin-2_2-yi-34b-GGUF

NaNK
824
46

Llama-2-13B-chat-GPTQ

TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z)

Llama 2 13B Chat - GPTQ
- Model creator: Meta Llama 2
- Original model: Llama 2 13B Chat

This repo contains GPTQ model files for Meta's Llama 2 13B-chat. Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.

- AWQ model(s) for GPU inference.
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference
- Meta Llama 2's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions

Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements. Each separate quant is in a different branch. See below for instructions on fetching from different branches. All recent GPTQ files are made with AutoGPTQ, and all files in non-main branches are made with AutoGPTQ. Files in the `main` branch which were uploaded before August 2023 were made with GPTQ-for-LLaMa.

- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
- ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama models in 4-bit.

| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
| ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
| main | 4 | 128 | No | 0.01 | wikitext | 4096 | 7.26 GB | Yes | 4-bit, without Act Order and group size 128g. |
| gptq-4bit-32g-actorder_True | 4 | 32 | Yes | 0.01 | wikitext | 4096 | 8.00 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. |
| gptq-4bit-64g-actorder_True | 4 | 64 | Yes | 0.01 | wikitext | 4096 | 7.51 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. |
| gptq-4bit-128g-actorder_True | 4 | 128 | Yes | 0.01 | wikitext | 4096 | 7.26 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
| gptq-8bit-128g-actorder_True | 8 | 128 | Yes | 0.01 | wikitext | 4096 | 13.65 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. |
| gptq-8bit-64g-actorder_True | 8 | 64 | Yes | 0.01 | wikitext | 4096 | 13.95 GB | No | 8-bit, with group size 64g and Act Order for even higher inference quality. Poor AutoGPTQ CUDA speed. |
| gptq-8bit-128g-actorder_False | 8 | 128 | No | 0.01 | wikitext | 4096 | 13.65 GB | No | 8-bit, with group size 128g for higher inference quality and without Act Order to improve AutoGPTQ speed. |
| gptq-8bit--1g-actorder_True | 8 | None | Yes | 0.01 | wikitext | 4096 | 13.36 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements. |

- In text-generation-webui, you can add `:branch` to the end of the download name, eg `TheBloke/Llama-2-13B-chat-GPTQ:main`
- With Git, you can clone a branch with:
- In Python Transformers code, the branch is the `revision` parameter; see below.

How to easily download and use this model in text-generation-webui. Please make sure you're using the latest version of text-generation-webui. It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.

1. Click the Model tab.
2. Under Download custom model or LoRA, enter `TheBloke/Llama-2-13B-chat-GPTQ`.
   - To download from a specific branch, enter for example `TheBloke/Llama-2-13B-chat-GPTQ:main` - see Provided Files above for the list of branches for each option.
3. Click Download.
4. The model will start downloading. Once it's finished it will say "Done".
5. In the top left, click the refresh icon next to Model.
6. In the Model dropdown, choose the model you just downloaded: `Llama-2-13B-chat-GPTQ`
7. The model will automatically load, and is now ready for use!
8. If you want any custom settings, set them and then click Save settings for this model followed by Reload the Model in the top right. Note that you do not need to and should not set manual GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
9. Once you're ready, click the Text Generation tab and enter a prompt to get started!

Requires: Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later. If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead: For CodeLlama models only: you must use Transformers 4.33.0 or later.
If 4.33.0 is not yet released when you read this, you will need to install Transformers from source: The files provided are tested to work with AutoGPTQ, both via Transformers and using AutoGPTQ directly. They should also work with Occ4m's GPTQ-for-LLaMa fork. ExLlama is compatible with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility. Huggingface Text Generation Inference (TGI) is compatible with all GPTQ models. For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. 
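As noted above, in Python Transformers code the branch goes in the `revision` parameter. A minimal sketch (the `branch_kwargs` helper is hypothetical; the actual load requires `transformers`, Optimum, and AutoGPTQ installed, per the versions listed above):

```python
# Hypothetical helper: build from_pretrained() kwargs for a chosen
# quantisation branch from the Provided Files table.
def branch_kwargs(repo_id: str, branch: str) -> dict:
    return {
        "pretrained_model_name_or_path": repo_id,
        "revision": branch,  # e.g. "main", or a branch from the table above
        "device_map": "auto",
    }

kwargs = branch_kwargs("TheBloke/Llama-2-13B-chat-GPTQ", "gptq-4bit-32g-actorder_True")

# Then (requires transformers>=4.32, optimum>=1.12, auto-gptq>=0.4.2):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(**kwargs)
```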
Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Alicia Loh, Stephen Murray, K, Ajan Kanaga, RoA, Magnesian, Deo Leter, Olakabola, Eugene Pentland, zynix, Deep Realms, Raymond Fosdick, Elijah Stavena, Iucharbius, Erik Bjäreholt, Luis Javier Navarrete Lozano, Nicholas, theTransient, John Detwiler, alfiei, knownsqashed, Mano Prime, Willem Michiel, Enrico Ros, LangChain4j, OG, Michael Dempsey, Pierre Kircher, Pedro Madruga, James Bentley, Thomas Belote, Luke @flexchar, Leonard Tan, Johann-Peter Hartmann, Illia Dulskyi, Fen Risland, Chadd, SX, Jeff Scroggin, Ken Nordquist, Sean Connelly, Artur Olbinski, Swaroop Kallakuri, Jack West, Ai Maven, David Ziegler, Russ Johnson, transmissions 11, John Villwock, Alps Aficionado, Clay Pascal, Viktor Bowallius, Subspace Studios, Rainer Wilmers, Trenton Dambrowitz, vamX, Michael Levine, 준교 김, Brandon Frisco, Kalila, Trailburnt, Randy H, Talal Aujan, Nathan Dryer, Vadim, 阿明, ReadyPlayerEmma, Tiffany J. Kim, George Stoitzev, Spencer Kim, Jerry Meng, Gabriel Tamborski, Cory Kujawski, Jeffrey Morgan, Spiking Neurons AB, Edmond Seymore, Alexandros Triantafyllidis, Lone Striker, Cap'n Zoog, Nikolai Manek, danny, ya boyyy, Derek Yates, usrbinkat, Mandus, TL, Nathan LeClaire, subjectnull, Imad Khwaja, webtim, Raven Klaugh, Asp the Wyvern, Gabriel Puliatti, Caitlyn Gatomon, Joseph William Delisle, Jonathan Leane, Luke Pendergrass, SuperWojo, Sebastain Graf, Will Dee, Fred von Graf, Andrey, Dan Guido, Daniel P. Andersen, Nitin Borwankar, Elle, Vitor Caleffi, biorpg, jjj, NimbleBox.ai, Pieter, Matthew Berman, terasurfer, Michael Davis, Alex, Stanislav Ovsiannikov And thank you again to a16z for their generous grant. Llama 2 Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. 
This is the repository for the 13B fine-tuned model, optimized for dialogue use cases and converted for the Hugging Face Transformers format. Links to other models can be found in the index at the bottom. Model Details Note: Use of this model is governed by the Meta license. In order to download the model weights and tokenizer, please visit the website and accept our License before requesting access here. Meta developed and publicly released the Llama 2 family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama-2-Chat, are optimized for dialogue use cases. Llama-2-Chat models outperform open-source chat models on most benchmarks we tested, and in our human evaluations for helpfulness and safety, are on par with some popular closed-source models like ChatGPT and PaLM. Variations Llama 2 comes in a range of parameter sizes — 7B, 13B, and 70B — as well as pretrained and fine-tuned variations. Model Architecture Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety. ||Training Data|Params|Content Length|GQA|Tokens|LR| |---|---|---|---|---|---|---| |Llama 2|A new mix of publicly available online data|7B|4k|✗|2.0T|3.0 x 10 -4 | |Llama 2|A new mix of publicly available online data|13B|4k|✗|2.0T|3.0 x 10 -4 | |Llama 2|A new mix of publicly available online data|70B|4k|✔|2.0T|1.5 x 10 -4 | Llama 2 family of models. Token counts refer to pretraining data only. All models are trained with a global batch-size of 4M tokens. Bigger models - 70B -- use Grouped-Query Attention (GQA) for improved inference scalability. Model Dates Llama 2 was trained between January 2023 and July 2023. Status This is a static model trained on an offline dataset. 
Future versions of the tuned models will be released as we improve model safety with community feedback. License A custom commercial license is available at: https://ai.meta.com/resources/models-and-libraries/llama-downloads/ Research Paper "Llama 2: Open Foundation and Fine-Tuned Chat Models" Intended Use Intended Use Cases Llama 2 is intended for commercial and research use in English. Tuned models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks. To get the expected features and performance for the chat versions, a specific formatting needs to be followed, including the `[INST]` and `<<SYS>>` tags, `BOS` and `EOS` tokens, and the whitespaces and line breaks in between (we recommend calling `strip()` on inputs to avoid double-spaces). See our reference code in github for details: `chat_completion`. Out-of-scope Uses Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English. Use in any other way that is prohibited by the Acceptable Use Policy and Licensing Agreement for Llama 2. Hardware and Software Training Factors We used custom training libraries, Meta's Research Super Cluster, and production clusters for pretraining. Fine-tuning, annotation, and evaluation were also performed on third-party cloud compute. Carbon Footprint Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB (TDP of 350-400W). Estimated total emissions were 539 tCO2eq, 100% of which were offset by Meta's sustainability program.

||Time (GPU hours)|Power Consumption (W)|Carbon Emitted (tCO2eq)|
|---|---|---|---|
|Llama 2 7B|184320|400|31.22|
|Llama 2 13B|368640|400|62.44|
|Llama 2 70B|1720320|400|291.42|
|Total|3311616||539.00|

CO2 emissions during pretraining. Time: total GPU time required for training each model.
Power Consumption: peak power capacity per GPU device for the GPUs used adjusted for power usage efficiency. 100% of the emissions are directly offset by Meta's sustainability program, and because we are openly releasing these models, the pretraining costs do not need to be incurred by others. Training Data Overview Llama 2 was pretrained on 2 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over one million new human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data. Data Freshness The pretraining data has a cutoff of September 2022, but some tuning data is more recent, up to July 2023. In this section, we report the results for the Llama 1 and Llama 2 models on standard academic benchmarks. For all the evaluations, we use our internal evaluations library.

|Model|Size|Code|Commonsense Reasoning|World Knowledge|Reading Comprehension|Math|MMLU|BBH|AGI Eval|
|---|---|---|---|---|---|---|---|---|---|
|Llama 1|7B|14.1|60.8|46.2|58.5|6.95|35.1|30.3|23.9|
|Llama 1|13B|18.9|66.1|52.6|62.3|10.9|46.9|37.0|33.9|
|Llama 1|33B|26.0|70.0|58.4|67.6|21.4|57.8|39.8|41.7|
|Llama 1|65B|30.7|70.7|60.5|68.6|30.8|63.4|43.5|47.6|
|Llama 2|7B|16.8|63.9|48.9|61.3|14.6|45.3|32.6|29.3|
|Llama 2|13B|24.5|66.9|55.4|65.8|28.7|54.8|39.4|39.1|
|Llama 2|70B|37.5|71.9|63.6|69.4|35.2|68.9|51.2|54.2|

Overall performance on grouped academic benchmarks. Code: We report the average pass@1 scores of our models on HumanEval and MBPP. Commonsense Reasoning: We report the average of PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA, and CommonsenseQA. We report 7-shot results for CommonSenseQA and 0-shot results for all other benchmarks. World Knowledge: We evaluate the 5-shot performance on NaturalQuestions and TriviaQA and report the average. Reading Comprehension: For reading comprehension, we report the 0-shot average on SQuAD, QuAC, and BoolQ.
MATH: We report the average of the GSM8K (8 shot) and MATH (4 shot) benchmarks at top 1.

|||TruthfulQA|Toxigen|
|---|---|---|---|
|Llama 1|7B|27.42|23.00|
|Llama 1|13B|41.74|23.08|
|Llama 1|33B|44.19|22.57|
|Llama 1|65B|48.71|21.77|
|Llama 2|7B|33.29|21.25|
|Llama 2|13B|41.86|26.10|
|Llama 2|70B|50.18|24.60|

Evaluation of pretrained LLMs on automatic safety benchmarks. For TruthfulQA, we present the percentage of generations that are both truthful and informative (the higher the better). For ToxiGen, we present the percentage of toxic generations (the smaller the better).

|||TruthfulQA|Toxigen|
|---|---|---|---|
|Llama-2-Chat|7B|57.04|0.00|
|Llama-2-Chat|13B|62.18|0.00|
|Llama-2-Chat|70B|64.14|0.01|

Evaluation of fine-tuned LLMs on different safety datasets. Same metric definitions as above. Ethical Considerations and Limitations Llama 2 is a new technology that carries risks with use. Testing conducted to date has been in English, and has not covered, nor could it cover all scenarios. For these reasons, as with all LLMs, Llama 2's potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Llama 2, developers should perform safety testing and tuning tailored to their specific applications of the model.
Please see the Responsible Use Guide available at https://ai.meta.com/llama/responsible-use-guide/

Reporting Issues Please report any software "bug," or other problems with the models through one of the following means:
- Reporting issues with the model: github.com/facebookresearch/llama
- Reporting problematic content generated by the model: developers.facebook.com/llama_output_feedback
- Reporting bugs and security concerns: facebook.com/whitehat/info

Llama Model Index

|Model|Llama2|Llama2-hf|Llama2-chat|Llama2-chat-hf|
|---|---|---|---|---|
|7B| Link | Link | Link | Link|
|13B| Link | Link | Link | Link|
|70B| Link | Link | Link | Link|
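The chat formatting requirements from the Intended Use section above (`[INST]`/`<<SYS>>` tags, stripped inputs) can be sketched for a single turn as follows. Meta's `chat_completion` reference code is authoritative; treat this as an approximation, with BOS/EOS tokens left to the tokenizer:

```python
# Approximate single-turn Llama-2-Chat prompt builder (BOS/EOS tokens
# are normally added by the tokenizer and are omitted here).
def llama2_chat_prompt(system_prompt: str, user_message: str) -> str:
    return (
        f"[INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n"
        f"{user_message.strip()} [/INST]"
    )

prompt = llama2_chat_prompt("You are a helpful assistant.", " Hello! ")
print(prompt)
```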

NaNK
llama
814
364

Mixtral_34Bx2_MoE_60B-GGUF

NaNK
license:cc-by-nc-4.0
814
35

Rose-20B-GGUF

NaNK
llama
814
29

OpenHermes-2.5-neural-chat-7B-v3-2-7B-GGUF

NaNK
license:apache-2.0
813
25

Hermes-Trismegistus-Mistral-7B-GGUF

NaNK
license:apache-2.0
808
25

sqlcoder2-GGUF

NaNK
804
29

MythoMax-Kimiko-Mix-GGUF

llama
803
10

meditron-70B-GGUF

NaNK
llama
801
20

Nous-Capybara-limarpv3-34B-GGUF

NaNK
llama
797
29

stable-vicuna-13B-HF

NaNK
llama
795
96

Unholy-v2-13B-GGUF

NaNK
llama
794
48

WizardLM-13B-V1-1-SuperHOT-8K-GPTQ

NaNK
llama
790
46

orca_mini_13B-GPTQ

NaNK
llama
790
44

LongChat-13B-GPTQ

NaNK
llama
787
25

lzlv_70B-AWQ

NaNK
llama
787
2

llama-2-70b-Guanaco-QLoRA-fp16

NaNK
llama
785
55

Wizard-Vicuna-30B-Superhot-8K-fp16

NaNK
llama
784
7

Llama-2-7B-vietnamese-20k-GGUF

NaNK
llama
783
6

VicUnlocked-30B-LoRA-HF

NaNK
llama
782
1

vicuna-13b-v1.3.0-GPTQ

NaNK
llama
779
21

WizardLM-30B-GPTQ

NaNK
llama
779
18

Yi-6B-200K-GGUF

NaNK
778
28

Vicuna-33B-1-3-SuperHOT-8K-fp16

NaNK
llama
778
6

guanaco-65B-HF

NaNK
llama
776
27

gpt4-alpaca-lora-30b-HF

NaNK
llama
776
13

Project-Baize-v2-13B-GPTQ

NaNK
llama
776
11

robin-33B-v2-fp16

NaNK
llama
776
3

BigTranslate-13B-GPTQ

NaNK
llama
775
19

wizard-vicuna-13B-HF

NaNK
llama
774
49

tulu-30B-fp16

NaNK
llama
774
5

WizardLM-30B-fp16

NaNK
llama
773
10

CAMEL-33B-Combined-Data-SuperHOT-8K-fp16

NaNK
llama
773
1

OpenAssistant-SFT-7-Llama-30B-HF

NaNK
llama
772
14

law-LLM-13B-GGUF

NaNK
llama
772
9

airoboros-33B-gpt4-1-4-SuperHOT-8K-fp16

NaNK
llama
772
5

Platypus-30B-SuperHOT-8K-fp16

NaNK
llama
772
2

law-chat-GGUF

NaNK
llama
771
21

openchat_v2_openorca_preview-GPTQ

llama
770
15

VicUnlocked-alpaca-65B-QLoRA-fp16

NaNK
llama
770
10

UltraLM-13B-fp16

NaNK
llama
770
4

alpaca-lora-65B-HF

NaNK
llama
770
3

MAmmoTH-Coder-34B-GGUF

NaNK
llama
770
2

OpenAssistant-SFT-7-Llama-30B-GPTQ

NaNK
llama
769
35

dromedary-65b-lora-HF

NaNK
llama
769
20

Wizard-Vicuna-30B-Uncensored-fp16

NaNK
llama
769
17

gpt4-alpaca-lora_mlp-65B-HF

NaNK
llama
769
7

gpt4-alpaca-lora-13B-HF

NaNK
llama
769
4

tulu-13B-fp16

NaNK
llama
769
2

Manticore-13B-Chat-Pyg-Guanaco-SuperHOT-8K-GPTQ

NaNK
llama
767
19

WizardLM-13B-V1-1-SuperHOT-8K-fp16

NaNK
llama
767
4

GPlatty-30B-SuperHOT-8K-fp16

NaNK
llama
767
1

airoboros-13B-HF

NaNK
llama
765
12

robin-13B-v2-fp16

NaNK
llama
765
4

Planner-7B-fp16

NaNK
llama
765
1

Yi-34B-200K-DARE-megamerge-v8-GGUF

NaNK
764
14

Project-Baize-v2-7B-GPTQ

NaNK
llama
764
4

Vicuna-13B-CoT-fp16

NaNK
llama
764
3

robin-33B-v2-GPTQ

NaNK
llama
763
13

tulu-7B-fp16

NaNK
llama
763
4

Llama-2-Coder-7B-GGUF

NaNK
llama
762
14

guanaco-13B-HF

NaNK
llama
762
7

Nous-Hermes-13B-SuperHOT-8K-fp16

NaNK
llama
762
4

robin-65b-v2-fp16

NaNK
llama
762
3

Llama-2-7B-Chat-GGML

NaNK
llama
761
871

Chinese-Alpaca-33B-SuperHOT-8K-fp16

NaNK
llama
761
7

deepseek-llm-7B-base-GGUF

NaNK
760
6

llama-30b-supercot-SuperHOT-8K-fp16

NaNK
llama
760
4

airoboros-7b-gpt4-fp16

NaNK
llama
759
4

h2ogpt-oasst1-512-30B-HF

NaNK
llama
754
2

UNA-TheBeagle-7B-v1-GGUF

NaNK
license:cc-by-nc-nd-4.0
753
20

Generate_Question_Mistral_7B-GGUF

NaNK
llama
753
5

Yi-34B-200K-GGUF

NaNK
750
29

Nethena-13B-GGUF

NaNK
llama
748
5

WizardLM-33B-V1.0-Uncensored-GGUF

NaNK
llama
747
14

MythoLogic-Mini-7B-GGUF

NaNK
llama
746
7

CodeFuse-CodeLlama-34B-GGUF

NaNK
llama
745
20

vicuna-13B-v1.5-16K-GGUF

NaNK
llama
744
43

Rose-20B-AWQ

NaNK
llama
742
1

leo-hessianai-13B-chat-bilingual-GGUF

NaNK
llama
740
7

Noromaid-13B-v0.3-GGUF

NaNK
llama
739
11

WizardLM-13B-V1.0-Uncensored-GGUF

NaNK
llama
731
5

Yi-6B-GGUF

NaNK
730
14

WizardLM-70B-V1.0-GGUF

NaNK
llama
729
17

leo-hessianai-13B-chat-GGUF

NaNK
llama
722
7

MXLewd-L2-20B-GGUF

NaNK
llama
718
29

llama-2-13B-Guanaco-QLoRA-GGUF

NaNK
llama
716
6

guanaco-65B-GPTQ

NaNK
llama
714
263

Llama-2-70B-GGUF

NaNK
llama
713
31

Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss-GGUF

TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z)

Noromaid V0.4 Mixtral Instruct 8X7B ZLoss - GGUF
- Model creator: NeverSleep
- Original model: Noromaid V0.4 Mixtral Instruct 8X7B ZLoss

This repo contains GGUF format model files for NeverSleep's Noromaid V0.4 Mixtral Instruct 8X7B ZLoss. These files were quantised using hardware kindly provided by Massed Compute. GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. Here is an incomplete list of clients and libraries that are known to support GGUF:

- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
- GPT4All, a free and open source local running GUI, supporting Windows, Linux and macOS with full GPU accel.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration. Linux available, in beta as of 27/11/2023.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy to use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.
- ctransformers, a Python library with GPU accel, LangChain support, and OpenAI-compatible AI server. Note, as of time of writing (November 27th 2023), ctransformers has not been updated in a long time and does not support many recent models.

AWQ model(s) for GPU inference.
GPTQ models for GPU inference, with multiple quantisation parameter options. 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference. NeverSleep's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions. These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third party UIs and libraries - please see the list at the top of this README.

- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw)
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K resulting in 5.5 bpw
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw

Refer to the Provided Files table below to see what files use which methods, and how.
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| noromaid-v0.4-mixtral-instruct-8x7b-zloss.Q2_K.gguf | Q2_K | 2 | 17.17 GB| 19.67 GB | smallest, significant quality loss - not recommended for most purposes |
| noromaid-v0.4-mixtral-instruct-8x7b-zloss.Q3_K_M.gguf | Q3_K_M | 3 | 22.48 GB| 24.98 GB | very small, high quality loss |
| noromaid-v0.4-mixtral-instruct-8x7b-zloss.Q4_0.gguf | Q4_0 | 4 | 26.44 GB| 28.94 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| noromaid-v0.4-mixtral-instruct-8x7b-zloss.Q4_K_M.gguf | Q4_K_M | 4 | 28.38 GB| 30.88 GB | medium, balanced quality - recommended |
| noromaid-v0.4-mixtral-instruct-8x7b-zloss.Q5_0.gguf | Q5_0 | 5 | 32.23 GB| 34.73 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| noromaid-v0.4-mixtral-instruct-8x7b-zloss.Q5_K_M.gguf | Q5_K_M | 5 | 33.23 GB| 35.73 GB | large, very low quality loss - recommended |
| noromaid-v0.4-mixtral-instruct-8x7b-zloss.Q6_K.gguf | Q6_K | 6 | 38.38 GB| 40.88 GB | very large, extremely low quality loss |
| noromaid-v0.4-mixtral-instruct-8x7b-zloss.Q8_0.gguf | Q8_0 | 8 | 49.62 GB| 52.12 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file. The following clients/libraries will automatically download models for you, providing a list of available models to choose from: Under Download Model, you can enter the model repo: TheBloke/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss-GGUF and below it, a specific filename to download, such as: noromaid-v0.4-mixtral-instruct-8x7b-zloss.Q4_K_M.gguf.
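As a sanity check, the bits-per-weight figures quoted for the k-quants follow directly from the super-block layouts described above. A sketch, assuming 256 weights per super-block and fp16 (16-bit) super-block scale fields, which matches llama.cpp's layout:

```python
QK_K = 256  # weights per super-block

# Q4_K ("type-1"): 8 blocks x 32 weights, 4-bit quants, 6-bit scales and
# mins per block, plus two fp16 super-block scales (d and dmin).
q4_k_bits = QK_K * 4 + 8 * (6 + 6) + 2 * 16

# Q6_K ("type-0"): 16 blocks x 16 weights, 6-bit quants, 8-bit scales
# per block, plus one fp16 super-block scale (type-0 has no mins).
q6_k_bits = QK_K * 6 + 16 * 8 + 16

print(q4_k_bits / QK_K)  # 4.5
print(q6_k_bits / QK_K)  # 6.5625
```

Both results match the bpw figures quoted in the quant-method descriptions.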
On the command line, including multiple files at once I recommend using the `huggingface-hub` Python library: Then you can download any individual model file to the current directory, at high speed, with a command like this: More advanced huggingface-cli download usage (click to read) You can also download multiple files at once with a pattern: For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer`: And set environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`: Windows Command Line users: You can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command. Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 32768` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. Note that longer sequence lengths require much more resources, so you may need to reduce this value. If you want to have a chat-style conversation, replace the `-p ` argument with `-i -ins` For other parameters and how to use them, please refer to the llama.cpp documentation Further instructions can be found in the text-generation-webui documentation, here: text-generation-webui/docs/04 ‐ Model Tab.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries. Note that at the time of writing (Nov 27th 2023), ctransformers has not been updated for some time and is not compatible with some recent models. Therefore I recommend you use llama-cpp-python. How to load this model in Python code, using llama-cpp-python For full documentation, please see: llama-cpp-python docs.
Run one of the following commands, according to your system. Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python, and LangChain + ctransformers.

For further support, and discussions on these models and AI in general, join us at:

I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

Patreon: https://patreon.com/TheBlokeAI
Ko-Fi: https://ko-fi.com/TheBlokeAI

Patreon special mentions: Michael Levine, 阿明, Trailburnt, Nikolai Manek, John Detwiler, Randy H, Will Dee, Sebastain Graf, NimbleBox.ai, Eugene Pentland, Emad Mostaque, Ai Maven, Jim Angel, Jeff Scroggin, Michael Davis, Manuel Alberto Morcote, Stephen Murray, Robert, Justin Joy, Luke @flexchar, Brandon Frisco, Elijah Stavena, SX, Dan Guido, Undi ., Komninos Chatzipapas, Shadi, theTransient, Lone Striker, Raven Klaugh, jjj, Cap'n Zoog, Michel-Marie MAUDET (LINAGORA), Matthew Berman, David, Fen Risland, Omer Bin Jawed, Luke Pendergrass, Kalila, OG, Erik Bjäreholt, Rooh Singh, Joseph William Delisle, Dan Lewis, TL, John Villwock, AzureBlack, Brad, Pedro Madruga, Caitlyn Gatomon, K, jinyuan sun, Mano Prime, Alex, Jeffrey Morgan, Alicia Loh, Illia Dulskyi, Chadd, transmissions 11, fincy, Rainer Wilmers, ReadyPlayerEmma, knownsqashed, Mandus, biorpg, Deo Leter, Brandon Phillips, SuperWojo, Sean Connelly, Iucharbius, Jack West, Harry Royden McLaughlin, Nicholas, terasurfer, Vitor Caleffi, Duane Dunston, Johann-Peter Hartmann, David Ziegler, Olakabola, Ken Nordquist, Trenton Dambrowitz, Tom X Nguyen, Vadim,
Ajan Kanaga, Leonard Tan, Clay Pascal, Alexandros Triantafyllidis, JM33133, Xule, vamX, ya boyyy, subjectnull, Talal Aujan, Alps Aficionado, wassieverse, Ari Malik, James Bentley, Woland, Spencer Kim, Michael Dempsey, Fred von Graf, Elle, zynix, William Richards, Stanislav Ovsiannikov, Edmond Seymore, Jonathan Leane, Martin Kemka, usrbinkat, Enrico Ros

And thank you again to a16z for their generous grant.

Original model card: NeverSleep's Noromaid V0.4 Mixtral Instruct 8X7B ZLoss

Disclaimer: this model is experimental, do not expect everything to work. This model was trained on Charles's ZLoss fork, which should fix the issues the previous model had.

Use the ChatML prompt format, but without the special tokens. The reason: Axolotl merges the finetune with the base model at essentially 1.0 weight, which is too much, so another script (available HERE) was used to merge at a lower weight; unfortunately, that script does not carry over the special ChatML tokens. Orca 2 is the same in that regard.

This repo contains FP16 files of Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss.

Note: we have permission from all users to upload their ratings; we DON'T screenshot random reviews without asking if we can put them here! If you want your rating to be here, send us a message over on DC and we'll put up a screenshot of it. DC names are "ikaridev" and "undi".

- Aesir 1, 2 & 3 - modified by us, credit to MinervaAI / Gryphe
- LimaRP-20231109 (Lemonilia)
- ToxicQAFinal (NobodyExistsOnTheInternet)
- No-robots-ShareGPT (Doctor-Shotgun)

IkariDev: Visit my retro/neocities style website please kek
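Since the card calls for ChatML formatting without the special tokens, here is a minimal sketch of building such a prompt as plain strings (the system and user messages are placeholders, not from the card):

```python
def chatml_prompt(system: str, user: str) -> str:
    # ChatML layout written as plain text; per the card, the merged model
    # does not carry the dedicated ChatML special tokens, so the markers
    # are treated as ordinary strings rather than single tokens.
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(chatml_prompt("You are Noromaid.", "Hello!"))
```

The generation should then be stopped on the literal string `<|im_end|>` rather than on a special end-of-turn token.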

NaNK
license:cc-by-nc-4.0
713
19

LLaMA2-13B-Psyfighter2-GGUF

NaNK
llama
711
16

fin-llama-33B-GGUF

NaNK
llama
710
3

Kimiko-Mistral-7B-GGUF

NaNK
license:apache-2.0
708
13

meditron-7B-chat-GGUF

NaNK
llama
706
15

Code-13B-GGUF

NaNK
llama
705
7

Manticore-13B-GGUF

NaNK
llama
702
4

MistralLite-7B-GGUF

NaNK
license:apache-2.0
695
41

LLaMA2-13B-TiefighterLR-GGUF

NaNK
llama
693
5

deepseek-coder-1.3b-instruct-AWQ

NaNK
llama
686
3

GodziLLa2-70B-GGUF

NaNK
llama
683
10

chronos-hermes-13B-GGUF

NaNK
llama
683
8

NeuralHermes-2.5-Mistral-7B-GGUF

NaNK
license:apache-2.0
681
52

Mixtral-SlimOrca-8x7B-GGUF

NaNK
license:apache-2.0
680
24

Airoboros-L2-70b-2.2-GGUF

NaNK
llama
667
13

Xwin-LM-70B-V0.1-GGUF

NaNK
llama
664
53

Mixtral-8x7B-v0.1-GPTQ

NaNK
license:apache-2.0
662
127

chronos007-70B-GGUF

NaNK
llama
660
5

TinyLlama-1.1B-intermediate-step-480k-1T-GGUF

NaNK
tinyllama
654
10

xDAN-L1-Chat-RL-v1-GGUF

NaNK
license:cc-by-4.0
652
12

deepseek-coder-5.7bmqa-base-GGUF

NaNK
650
4

fiction.live-Kimiko-V2-70B-GGUF

NaNK
llama
648
13

AquilaChat2-34B-16K-GGUF

NaNK
643
12

mistral-ft-optimized-1227-GGUF

NaNK
license:apache-2.0
642
14

Xwin-LM-13B-v0.2-GGUF

NaNK
llama
639
20

Writing_Partner_Mistral_7B-GGUF

NaNK
license:apache-2.0
638
12

Ferret_7B-GGUF

NaNK
636
7

StellarBright-GGUF

NaNK
llama
634
7

WestLake-7B-v2-GGUF

NaNK
license:apache-2.0
629
20

Marcoroni-7B-v3-GGUF

NaNK
license:apache-2.0
626
22

juanako-7B-UNA-GGUF

NaNK
license:apache-2.0
626
12

CollectiveCognition-v1.1-Mistral-7B-GGUF

NaNK
license:apache-2.0
625
34

Phind-CodeLlama-34B-v1-GPTQ

NaNK
llama
625
11

med42-70B-GGUF

NaNK
llama
624
24

CodeLlama-7B-Instruct-GPTQ

NaNK
llama
621
46

Sensualize-Mixtral-GGUF

NaNK
license:cc-by-nc-4.0
620
20

vicuna-13B-v1.5-GGUF

NaNK
llama
620
16

gorilla-7B-GGUF

NaNK
llama
619
3

WizardLM-70B-V1.0-GPTQ

NaNK
llama
617
37

Llama-2-70B-Orca-200k-GGUF

NaNK
llama
615
22

Orca-2-7B-GGUF

NaNK
llama
612
56

Llama2-70B-OASST-SFT-v10-GPTQ

NaNK
llama
612
4

Sarah_StoryTeller_13b-GGUF

NaNK
llama
609
9

CodeBooga-34B-v0.1-GGUF

NaNK
llama
608
55

TinyLlama-1.1B-python-v0.1-GGUF

NaNK
tinyllama
606
12

Mistral-7B-Instruct-v0.1-GPTQ

NaNK
license:apache-2.0
604
84

DiscoLM-70B-GGUF

NaNK
llama
604
4

finance-LLM-13B-GGUF

NaNK
llama
600
20

law-LLM-GGUF

NaNK
llama
597
18

wizardLM-7B-GGUF

NaNK
llama
597
2

TinyLlama-1.1B-Chat-v1.0-AWQ

NaNK
llama
596
6

calm2-7B-chat-GGUF

NaNK
llama
590
11

Python-Code-33B-GGUF

NaNK
llama
588
3

goliath-120b-GGUF

NaNK
llama
583
139

alfred-40B-1023-GGUF

NaNK
license:apache-2.0
583
5

Chinese-Llama-2-7B-GGUF

NaNK
llama
582
22

MLewd-ReMM-L2-Chat-20B-GGUF

NaNK
llama
580
39

MetaMath-13B-V1.0-GGUF

NaNK
llama
578
3

Naberius-7B-GGUF

NaNK
llama
572
12

speechless-mistral-dolphin-orca-platypus-samantha-7B-GGUF

NaNK
llama-2
572
10

opus-v0-7B-GGUF

NaNK
570
14

Phind-CodeLlama-34B-Python-v1-GGUF

NaNK
llama
569
13

NeuralBeagle14-7B-GGUF

NaNK
license:apache-2.0
568
26

DPOpenHermes-7B-GGUF

NaNK
license:apache-2.0
566
5

DaringMaid-20B-GGUF

NaNK
llama
561
19

Wizard Vicuna 30B Uncensored GPTQ

TheBloke's LLM work is generously supported by a grant from Andreessen Horowitz (a16z).

Wizard Vicuna 30B Uncensored - GPTQ
- Model creator: Eric Hartford
- Original model: Wizard Vicuna 30B Uncensored

This repo contains GPTQ model files for Eric Hartford's Wizard-Vicuna-30B-Uncensored. Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.

- AWQ model(s) for GPU inference.
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference.
- Eric Hartford's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions.

Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements. Each separate quant is in a different branch; see below for instructions on fetching from different branches. All recent GPTQ files are made with AutoGPTQ, and all files in non-main branches are made with AutoGPTQ. Files in the `main` branch which were uploaded before August 2023 were made with GPTQ-for-LLaMa.

- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy.
Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model; it only impacts the quantisation accuracy on longer inference sequences.
- ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama models in 4-bit.

| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
| ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
| main | 4 | None | Yes | 0.01 | wikitext | 2048 | 16.94 GB | Yes | 4-bit, with Act Order. No group size, to lower VRAM requirements. |
| gptq-4bit-32g-actorder_True | 4 | 32 | Yes | 0.01 | wikitext | 2048 | 19.44 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. |
| gptq-4bit-64g-actorder_True | 4 | 64 | Yes | 0.01 | wikitext | 2048 | 18.18 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. |
| gptq-4bit-128g-actorder_True | 4 | 128 | Yes | 0.01 | wikitext | 2048 | 17.55 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
| gptq-8bit--1g-actorder_True | 8 | None | Yes | 0.01 | wikitext | 2048 | 32.99 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements. |
| gptq-8bit-128g-actorder_False | 8 | 128 | No | 0.01 | wikitext | 2048 | 33.73 GB | No | 8-bit, with group size 128g for higher inference quality and without Act Order to improve AutoGPTQ speed. |
| gptq-3bit--1g-actorder_True | 3 | None | Yes | 0.01 | wikitext | 2048 | 12.92 GB | No | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
| gptq-3bit-128g-actorder_False | 3 | 128 | No | 0.01 | wikitext | 2048 | 13.51 GB | No | 3-bit, with group size 128g but no act-order. Slightly higher VRAM requirements than 3-bit None. |

- In text-generation-webui, you can add `:branch` to the end of the download name, e.g. `TheBloke/Wizard-Vicuna-30B-Uncensored-GPTQ:main`.
- With Git, you can clone a specific branch directly.
- In Python Transformers code, the branch is the `revision` parameter; see below.

How to easily download and use this model in text-generation-webui. Please make sure you're using the latest version of text-generation-webui. It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.

1. Click the Model tab.
2. Under Download custom model or LoRA, enter `TheBloke/Wizard-Vicuna-30B-Uncensored-GPTQ`.
   - To download from a specific branch, enter for example `TheBloke/Wizard-Vicuna-30B-Uncensored-GPTQ:main` - see Provided Files above for the list of branches for each option.
3. Click Download.
4. The model will start downloading. Once it's finished it will say "Done".
5. In the top left, click the refresh icon next to Model.
6. In the Model dropdown, choose the model you just downloaded: `Wizard-Vicuna-30B-Uncensored-GPTQ`.
7. The model will automatically load, and is now ready for use!
8. If you want any custom settings, set them and then click Save settings for this model followed by Reload the Model in the top right. Note that you do not need to, and should not, set manual GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
9. Once you're ready, click the Text Generation tab and enter a prompt to get started!
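To illustrate the branch naming scheme, here is a small, purely illustrative helper whose names are taken from the Provided Files table above (the `pick_branch` function itself is not part of any library):

```python
from typing import Optional

# Branch names from the Provided Files table above, keyed by
# (bits, group size); a group size of None means "no group size".
GPTQ_BRANCHES = {
    (4, None): "main",
    (4, 32): "gptq-4bit-32g-actorder_True",
    (4, 64): "gptq-4bit-64g-actorder_True",
    (4, 128): "gptq-4bit-128g-actorder_True",
    (8, None): "gptq-8bit--1g-actorder_True",
    (8, 128): "gptq-8bit-128g-actorder_False",
    (3, None): "gptq-3bit--1g-actorder_True",
    (3, 128): "gptq-3bit-128g-actorder_False",
}

def pick_branch(bits: int, group_size: Optional[int]) -> str:
    """Return the repo branch for a given quantisation choice."""
    return GPTQ_BRANCHES[(bits, group_size)]

print(pick_branch(4, 128))
```

The returned branch name is what you would append after `:` in text-generation-webui, or pass as the `revision` parameter in Transformers.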
Requires: Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later. If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead. For CodeLlama models only: you must use Transformers 4.33.0 or later. If 4.33.0 is not yet released when you read this, you will need to install Transformers from source.

The files provided are tested to work with AutoGPTQ, both via Transformers and using AutoGPTQ directly. They should also work with Occ4m's GPTQ-for-LLaMa fork. ExLlama is compatible with Llama models in 4-bit; please see the Provided Files table above for per-file compatibility. Huggingface Text Generation Inference (TGI) is compatible with all GPTQ models.

For further support, and discussions on these models and AI in general, join us at:

I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.
Patreon: https://patreon.com/TheBlokeAI
Ko-Fi: https://ko-fi.com/TheBlokeAI

Patreon special mentions: Alicia Loh, Stephen Murray, K, Ajan Kanaga, RoA, Magnesian, Deo Leter, Olakabola, Eugene Pentland, zynix, Deep Realms, Raymond Fosdick, Elijah Stavena, Iucharbius, Erik Bjäreholt, Luis Javier Navarrete Lozano, Nicholas, theTransient, John Detwiler, alfiei, knownsqashed, Mano Prime, Willem Michiel, Enrico Ros, LangChain4j, OG, Michael Dempsey, Pierre Kircher, Pedro Madruga, James Bentley, Thomas Belote, Luke @flexchar, Leonard Tan, Johann-Peter Hartmann, Illia Dulskyi, Fen Risland, Chadd, SX, Jeff Scroggin, Ken Nordquist, Sean Connelly, Artur Olbinski, Swaroop Kallakuri, Jack West, Ai Maven, David Ziegler, Russ Johnson, transmissions 11, John Villwock, Alps Aficionado, Clay Pascal, Viktor Bowallius, Subspace Studios, Rainer Wilmers, Trenton Dambrowitz, vamX, Michael Levine, 준교 김, Brandon Frisco, Kalila, Trailburnt, Randy H, Talal Aujan, Nathan Dryer, Vadim, 阿明, ReadyPlayerEmma, Tiffany J. Kim, George Stoitzev, Spencer Kim, Jerry Meng, Gabriel Tamborski, Cory Kujawski, Jeffrey Morgan, Spiking Neurons AB, Edmond Seymore, Alexandros Triantafyllidis, Lone Striker, Cap'n Zoog, Nikolai Manek, danny, ya boyyy, Derek Yates, usrbinkat, Mandus, TL, Nathan LeClaire, subjectnull, Imad Khwaja, webtim, Raven Klaugh, Asp the Wyvern, Gabriel Puliatti, Caitlyn Gatomon, Joseph William Delisle, Jonathan Leane, Luke Pendergrass, SuperWojo, Sebastain Graf, Will Dee, Fred von Graf, Andrey, Dan Guido, Daniel P. Andersen, Nitin Borwankar, Elle, Vitor Caleffi, biorpg, jjj, NimbleBox.ai, Pieter, Matthew Berman, terasurfer, Michael Davis, Alex, Stanislav Ovsiannikov

And thank you again to a16z for their generous grant.

Original model card: Eric Hartford's Wizard-Vicuna-30B-Uncensored

This is an fp16 model of Eric Hartford's Wizard-Vicuna 30B. It is the result of converting Eric's original fp32 upload to fp16.

- 4bit GPTQ models for GPU inference.
- 4bit and 5bit GGML models for CPU inference.
- float16 HF format model for GPU inference and further conversions.

For further support, and discussions on these models and AI in general, join us at:

I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

Patreon: https://patreon.com/TheBlokeAI
Ko-Fi: https://ko-fi.com/TheBlokeAI

Patreon special mentions: Aemon Algiz, Dmitriy Samsonov, Nathan LeClaire, Trenton Dambrowitz, Mano Prime, David Flickinger, vamX, Nikolai Manek, senxiiz, Khalefa Al-Ahmad, Illia Dulskyi, Jonathan Leane, Talal Aujan, V. Lukas, Joseph William Delisle, Pyrater, Oscar Rangel, Lone Striker, Luke Pendergrass, Eugene Pentland, Sebastain Graf, Johann-Peter Hartman.

This is wizard-vicuna-30b trained with a subset of the dataset - responses that contained alignment / moralizing were removed. The intent is to train a WizardLM that doesn't have alignment built in, so that alignment (of any sort) can be added separately, for example with an RLHF LoRA.

Shout out to the open source AI/ML community, and everyone who helped me out.

You are responsible for anything you do with the model, just as you are responsible for anything you do with any dangerous object such as a knife, gun, lighter, or car. Publishing anything this model generates is the same as publishing it yourself. You are responsible for the content you publish, and you cannot blame the model any more than you can blame the knife, gun, lighter, or car for what you do with it.

NaNK
llama
558
586

sheep-duck-llama-2-70B-v1.1-GGUF

NaNK
llama
556
11

Merged-AGI-7B-GGUF

NaNK
license:cc-by-nc-4.0
556
3

Mistral-ClaudeLimaRP-v3-7B-GGUF

NaNK
license:apache-2.0
555
15

airoboros-l2-13b-gpt4-m2.0-GGUF

NaNK
llama
551
5

evolvedSeeker_1_3-GGUF

NaNK
550
2

phi-2-dpo-GGUF

547
18

dolphin-2.2-70B-GGUF

NaNK
llama
543
17

Mistral-7B-Code-16K-qlora-GGUF

NaNK
license:apache-2.0
542
21

MythoMax-L2-13B-AWQ

NaNK
llama
542
12

WizardLM-7B-V1.0-Uncensored-GGUF

NaNK
llama
540
6

MythoMix-L2-13B-GGUF

NaNK
llama
540
3

airoboros-l2-70B-GPT4-2.0-GGUF

NaNK
llama
538
5

sqlcoder-GGUF

536
18

Mixtral-8x7B-Instruct-v0.1-LimaRP-ZLoss-GGUF

NaNK
license:apache-2.0
536
12

open-llama-3b-v2-wizard-evol-instuct-v2-196k-GGUF

NaNK
llama
536
11

llama2_7b_merge_orcafamily-GGUF

NaNK
llama
536
0

GOAT-70B-Storytelling-GGUF

NaNK
llama
535
13

Iambe-20B-DARE-GGUF

NaNK
llama
535
4

Marx-3B-v3-GGUF

NaNK
license:cc-by-sa-4.0
534
8

L2-MythoMax22b-Instruct-Falseblock-GGUF

NaNK
llama
532
7

13B-Ouroboros-GGUF

NaNK
llama
532
2

airoboros-l2-70B-gpt4-1.4.1-GGUF

NaNK
llama
519
3

mistral-ft-optimized-1218-GGUF

NaNK
license:apache-2.0
518
22

WizardMath-7B-V1.1-GGUF

NaNK
516
7

Swallow-7B-GGUF

NaNK
llama
515
4

upstage-llama-30b-instruct-2048-GGUF

NaNK
llama
514
4

Airoboros-L2-70B-2.1-GGUF

NaNK
llama
513
23

Swallow-70B-instruct-GGUF

NaNK
llama
510
9

Llama-2-13B-Chat-Dutch-GGUF

NaNK
llama
510
7

CapybaraHermes-2.5-Mistral-7B-GPTQ

NaNK
license:apache-2.0
509
59

Mythalion-13B-GPTQ

NaNK
llama
509
52

Mistral-7B-codealpaca-lora-GGUF

NaNK
license:apache-2.0
508
10

una-cybertron-7B-v2-GGUF

NaNK
license:apache-2.0
503
33

MLewd-L2-Chat-13B-GGUF

NaNK
llama
503
27

japanese-stablelm-instruct-beta-70B-GGUF

NaNK
llama
502
12

Llama2-70B-OASST-SFT-v10-GGUF

NaNK
llama
502
10

WizardCoder-33B-V1.1-GGUF

NaNK
501
46

Yarn-Llama-2-13B-128K-GGUF

NaNK
llama
500
38

Phind-CodeLlama-34B-v1-GGUF

NaNK
llama
499
15

echidna-tiefigther-25-GGUF

NaNK
llama
499
5

Chronolima-Airo-Grad-L2-13B-GGUF

NaNK
llama
499
2

OpenAssistant-Llama2-13B-Orca-8K-3319-GGUF

NaNK
llama
498
9

13B-Thorns-L2-GGUF

NaNK
llama
498
2

TinyLlama-1.1B-python-v0.1-GPTQ

NaNK
llama
497
2

notus-7B-v1-GGUF

NaNK
license:mit
494
23

airoboros-l2-13b-gpt4-2.0-GGUF

NaNK
llama
491
2

Loyal-Macaroni-Maid-7B-GGUF

NaNK
license:cc-by-nc-4.0
489
32

TowerInstruct-7B-v0.1-GGUF

NaNK
llama
488
18

Garrulus-GGUF

NaNK
license:apache-2.0
487
8

WizardLM-30B-GGUF

NaNK
llama
483
2

WizardLM-7B-uncensored-GPTQ

NaNK
llama
482
195

Skywork-13B-base-GGUF

NaNK
481
6

airoboros-l2-13B-2.2.1-GGUF

NaNK
llama
481
3

OrcaMaidXL-17B-32k-GGUF

NaNK
llama
476
6

vietnamese-llama2-7B-40GB-GGUF

NaNK
llama
476
3

phi-2-GPTQ

NaNK
475
30

finance-chat-GGUF

NaNK
llama
475
14

Mythalion-Kimiko-v2-GGUF

NaNK
llama
475
12

Iambe-Storyteller-20B-GGUF

NaNK
llama
475
7

Dolphin-Llama2-7B-GGUF

NaNK
llama
473
2

Amethyst-13B-Mistral-GGUF

NaNK
llama
472
25

LLaMA-Pro-8B-GGUF

NaNK
llama
471
15

manticore-13b-chat-pyg-GGUF

NaNK
llama
471
8

Mistral-11B-OmniMix-GGUF

NaNK
license:cc-by-nc-4.0
470
14

OpenHermes-2.5-neural-chat-7B-v3-1-7B-GGUF

NaNK
license:apache-2.0
469
52

Chinese-Alpaca-2-7B-GGUF

NaNK
llama
467
9

TimeCrystal-L2-13B-GGUF

NaNK
llama
467
8

Synthia-MoE-v3-Mixtral-8x7B-GGUF

NaNK
license:apache-2.0
466
28

medicine-chat-GGUF

NaNK
llama
466
16

CapyTessBorosYi-34B-200K-DARE-Ties-GGUF

NaNK
466
5

Llama-2-13B-German-Assistant-v4-GGUF

NaNK
llama
466
4

GEITje-7B-chat-GGUF

NaNK
license:apache-2.0
463
4

una-xaberius-34b-v1beta-GGUF

NaNK
license:cc-by-4.0
460
7

dolphin-2.6-mistral-7B-dpo-GGUF

NaNK
license:apache-2.0
456
21

LlamaGuard-7B-GGUF

NaNK
llama
454
6

MixtralOrochi8x7B-GGUF

NaNK
license:cc-by-nc-4.0
452
15

neural-chat-7B-v3-3-GGUF

NaNK
license:apache-2.0
452
14

Everyone-Coder-33B-Base-GGUF

NaNK
451
12

Zarafusionex-1.1-L2-7B-GGUF

NaNK
llama
451
5

TinyLlama-1.1B-intermediate-step-715k-1.5T-GGUF

NaNK
tinyllama
450
6

nontoxic-bagel-34b-v0.2-GGUF

NaNK
450
3

Spring-Dragon-GGUF

llama
446
6

Sonya-7B-GGUF

NaNK
license:cc-by-4.0
445
12

lzlv_70B-GGUF

NaNK
llama
444
49

openbuddy-deepseek-67b-v15-base-GGUF

NaNK
442
3

Xwin-LM-7B-V0.1-GGUF

NaNK
llama
439
12

una-cybertron-7B-v3-OMA-GGUF

NaNK
license:apache-2.0
438
9

llama-2-13B-German-Assistant-v2-GGUF

NaNK
llama
436
2

PuddleJumper-13B-GGUF

NaNK
llama
433
17

Kimiko-v2-13B-GGUF

NaNK
llama
433
12

Mistral-7B-OpenOrca-oasst_top1_2023-08-25-v2-GGUF

NaNK
license:apache-2.0
433
6

airoboros-l2-13B-3.0-GGUF

NaNK
llama
432
7

llama-2-7B-Arguments-GGUF

NaNK
llama
432
5

Etheria-55b-v0.1-GGUF

NaNK
431
8

Dolphin2.1-OpenOrca-7B-GGUF

NaNK
license:cc-by-nc-4.0
431
6

SauerkrautLM-Mixtral-8x7B-Instruct-GGUF

NaNK
license:apache-2.0
430
12

deepmoney-34b-200k-chat-evaluator-GGUF

NaNK
license:apache-2.0
430
11

MadMix-v0.2-GGUF

NaNK
license:apache-2.0
427
3

MonadGPT-GGUF

license:apache-2.0
426
12

30B-Epsilon-GGUF

NaNK
llama
423
5

openchat_3.5-16k-GGUF

license:apache-2.0
422
22

MetaMath-Mistral-7B-GGUF

NaNK
license:apache-2.0
422
10

CodeUp-Llama-2-13B-Chat-HF-GGUF

NaNK
llama
420
9

Pygmalion-2-13B-SuperCOT-weighed-GGUF

NaNK
llama
419
11

tora-code-34b-v1.0-GGUF

NaNK
llama
419
5

airoboros-m-7B-3.0-GGUF

NaNK
license:apache-2.0
418
4

PsyMedRP-v1-20B-GGUF

NaNK
llama
416
18

Samantha-1.11-CodeLlama-34B-GGUF

NaNK
llama
415
18

OpenOrca_Stx-GGUF

llama
415
4

SlimOrca-13B-GGUF

NaNK
llama
415
4

Velara-11B-V2-GGUF

NaNK
llama-2
413
9

Karen_TheEditor_V2_STRICT_Mistral_7B-GGUF

NaNK
llama
412
10

Uncensored-Jordan-33B-GGUF

NaNK
llama
411
8

jackalope-7B-GGUF

NaNK
license:apache-2.0
411
7

PiVoT-0.1-Evil-a-GGUF

license:cc-by-sa-4.0
410
28

30B-Lazarus-GGUF

NaNK
llama
410
1

Xwin-MLewd-7B-V0.2-GGUF

NaNK
llama
409
13

Nous-Capybara-7B-v1.9-GGUF

NaNK
license:mit
408
28

Arithmo-Mistral-7B-GGUF

NaNK
license:apache-2.0
407
12

vicuna-33B-GGUF

NaNK
llama
406
16

SUS-Chat-34B-GGUF

NaNK
405
15

deepmoney-34b-200k-base-GGUF

NaNK
license:apache-2.0
404
15

Python-Code-13B-GGUF

NaNK
llama
404
6

OpenHermes-2.5-neural-chat-v3-3-Slerp-GGUF

license:apache-2.0
401
29

Sydney_Overthinker_13B-GGUF

NaNK
llama
399
2