TheBloke
Verified community account. Prolific quantizer and GGUF format pioneer (now retired).
Noromaid-20B-v0.1.1-GGUF
---
base_model: NeverSleep/Noromaid-20b-v0.1.1
inference: false
license: cc-by-nc-4.0
model_creator: IkariDev and Undi
model_name: Noromaid 20B v0.1.1
model_type: llama
prompt_template: 'Below is an instruction that describes a task. Write a response that appropriately completes the request.
deepseek-coder-6.7B-instruct-AWQ
---
base_model: deepseek-ai/deepseek-coder-6.7b-instruct
inference: false
license: other
license_link: LICENSE
license_name: deepseek
model_creator: DeepSeek
model_name: Deepseek Coder 6.7B Instruct
model_type: deepseek
prompt_template: 'You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science qu
TinyLlama-1.1B-Chat-v0.3-GPTQ
---
base_model: PY007/TinyLlama-1.1B-Chat-v0.3
datasets:
- cerebras/SlimPajama-627B
- bigcode/starcoderdata
- OpenAssistant/oasst_top1_2023-08-25
inference: false
language:
- en
license: apache-2.0
model_creator: Zhang Peiyuan
model_name: TinyLlama 1.1B Chat v0.3
model_type: tinyllama
prompt_template: 'system
TinyLlama-1.1B-Chat-v1.0-GGUF
---
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
datasets:
- cerebras/SlimPajama-627B
- bigcode/starcoderdata
- OpenAssistant/oasst_top1_2023-08-25
inference: false
language:
- en
license: apache-2.0
model_creator: TinyLlama
model_name: Tinyllama 1.1B Chat v1.0
model_type: tinyllama
prompt_template: '
Mistral-7B-Instruct-v0.2-GPTQ
Mixtral-8x7B-Instruct-v0.1-GPTQ
Llama-2-7B-GPTQ
DiscoLM_German_7b_v1-AWQ
Mistral-7B-Instruct-v0.2-GGUF
---
base_model: mistralai/Mistral-7B-Instruct-v0.2
inference: false
license: apache-2.0
model_creator: Mistral AI_
model_name: Mistral 7B Instruct v0.2
model_type: mistral
pipeline_tag: text-generation
prompt_template: '[INST] {prompt} [/INST]
MythoMax-L2-13B-GGUF
TheBloke's LLM work is generously supported by a grant from Andreessen Horowitz (a16z).

MythoMax L2 13B - GGUF
- Model creator: Gryphe
- Original model: MythoMax L2 13B

This repo contains GGUF format model files for Gryphe's MythoMax L2 13B.

GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. GGUF offers numerous advantages over GGML, such as better tokenisation and support for special tokens. It also supports metadata, and is designed to be extensible.

Here is an incomplete list of clients and libraries that are known to support GGUF:
- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy-to-use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- ctransformers, a Python library with GPU accel, LangChain support, and an OpenAI-compatible AI server.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.

- AWQ model(s) for GPU inference.
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference
- Gryphe's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions

The creator of the source model has listed its license as `other`, and this quantization has therefore used that same license. As this model is based on Llama 2, it is also subject to the Meta Llama 2 license terms, and the license files for that are additionally included. It should therefore be considered as being claimed to be licensed under both licenses. I contacted Hugging Face for clarification on dual licensing but they do not yet have an official position. Should this change, or should Meta provide any feedback on this situation, I will update this section accordingly. In the meantime, any questions regarding licensing, and in particular how these two licenses might interact, should be directed to the original model repository: Gryphe's MythoMax L2 13B.

These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d36d5be95a0d9088b674dbb27354107221. They are also compatible with many third party UIs and libraries - please see the list at the top of this README.

The new methods available are:
- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights.
Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| mythomax-l2-13b.Q2_K.gguf | Q2_K | 2 | 5.43 GB | 7.93 GB | smallest, significant quality loss - not recommended for most purposes |
| mythomax-l2-13b.Q3_K_S.gguf | Q3_K_S | 3 | 5.66 GB | 8.16 GB | very small, high quality loss |
| mythomax-l2-13b.Q3_K_M.gguf | Q3_K_M | 3 | 6.34 GB | 8.84 GB | very small, high quality loss |
| mythomax-l2-13b.Q3_K_L.gguf | Q3_K_L | 3 | 6.93 GB | 9.43 GB | small, substantial quality loss |
| mythomax-l2-13b.Q4_0.gguf | Q4_0 | 4 | 7.37 GB | 9.87 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| mythomax-l2-13b.Q4_K_S.gguf | Q4_K_S | 4 | 7.41 GB | 9.91 GB | small, greater quality loss |
| mythomax-l2-13b.Q4_K_M.gguf | Q4_K_M | 4 | 7.87 GB | 10.37 GB | medium, balanced quality - recommended |
| mythomax-l2-13b.Q5_0.gguf | Q5_0 | 5 | 8.97 GB | 11.47 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| mythomax-l2-13b.Q5_K_S.gguf | Q5_K_S | 5 | 8.97 GB | 11.47 GB | large, low quality loss - recommended |
| mythomax-l2-13b.Q5_K_M.gguf | Q5_K_M | 5 | 9.23 GB | 11.73 GB | large, very low quality loss - recommended |
| mythomax-l2-13b.Q6_K.gguf | Q6_K | 6 | 10.68 GB | 13.18 GB | very large, extremely low quality loss |
| mythomax-l2-13b.Q8_0.gguf | Q8_0 | 8 | 13.83 GB | 16.33 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.
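The table's "Max RAM required" column can be used to pick a quant programmatically. Below is a minimal sketch with the figures hard-coded from the table above; the helper name is illustrative, not part of any library:

```python
from typing import Optional

# Max RAM required (GB) per quant, taken from the Provided Files table above
MAX_RAM_GB = {
    "Q2_K": 7.93, "Q3_K_S": 8.16, "Q3_K_M": 8.84, "Q3_K_L": 9.43,
    "Q4_0": 9.87, "Q4_K_S": 9.91, "Q4_K_M": 10.37, "Q5_0": 11.47,
    "Q5_K_S": 11.47, "Q5_K_M": 11.73, "Q6_K": 13.18, "Q8_0": 16.33,
}

def best_quant(ram_budget_gb: float) -> Optional[str]:
    """Return the largest quant (by max RAM footprint) that fits the budget."""
    fitting = [(ram, name) for name, ram in MAX_RAM_GB.items() if ram <= ram_budget_gb]
    return max(fitting)[1] if fitting else None

print(best_quant(12.0))  # -> Q5_K_M, the biggest quant that fits in 12 GB
```

Note that this ranks purely by memory footprint; per the table's "Use case" column, the legacy Q4_0 and Q5_0 quants should still be avoided in favour of their K-quant neighbours.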
The following clients/libraries will automatically download models for you, providing a list of available models to choose from:
- LM Studio
- LoLLMS Web UI
- Faraday.dev

Under Download Model, you can enter the model repo: TheBloke/MythoMax-L2-13B-GGUF and below it, a specific filename to download, such as: mythomax-l2-13b.Q4_K_M.gguf.

On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library. Then you can download any individual model file to the current directory, at high speed. You can also download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI.

To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows CLI users: use `set HF_HUB_ENABLE_HF_TRANSFER=1` before running the download command.

Make sure you are using `llama.cpp` from commit d0cee0d36d5be95a0d9088b674dbb27354107221 or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 4096` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation.

Further instructions here: text-generation-webui/docs/llama.cpp.md.

You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries.
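The concrete commands did not survive extraction; a minimal sketch following the usual pattern of these cards (the chosen filename and the `{prompt}` placeholder are illustrative):

```shell
# Install the Hugging Face download tooling
pip3 install huggingface-hub

# Download a single quant file to the current directory, at high speed
huggingface-cli download TheBloke/MythoMax-L2-13B-GGUF mythomax-l2-13b.Q4_K_M.gguf \
  --local-dir . --local-dir-use-symlinks False

# Run with llama.cpp; drop -ngl 32 if you have no GPU acceleration
./main -ngl 32 -m mythomax-l2-13b.Q4_K_M.gguf --color -c 4096 --temp 0.7 \
  -p "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{prompt}\n\n### Response:"
```

These are CLI fragments: they require network access and roughly 8 GB of disk for the Q4_K_M file, and the `./main` binary comes from building llama.cpp at the commit noted above.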
How to load this model from Python using ctransformers: simple example code to load one of these GGUF models.

Here are guides on using llama-cpp-python or ctransformers with LangChain:
- LangChain + llama-cpp-python
- LangChain + ctransformers

For further support, and discussions on these models and AI in general, join us at TheBloke AI's Discord server.

I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

Patreon: https://patreon.com/TheBlokeAI
Ko-Fi: https://ko-fi.com/TheBlokeAI

Patreon special mentions: Alicia Loh, Stephen Murray, K, Ajan Kanaga, RoA, Magnesian, Deo Leter, Olakabola, Eugene Pentland, zynix, Deep Realms, Raymond Fosdick, Elijah Stavena, Iucharbius, Erik Bjäreholt, Luis Javier Navarrete Lozano, Nicholas, theTransient, John Detwiler, alfiei, knownsqashed, Mano Prime, Willem Michiel, Enrico Ros, LangChain4j, OG, Michael Dempsey, Pierre Kircher, Pedro Madruga, James Bentley, Thomas Belote, Luke @flexchar, Leonard Tan, Johann-Peter Hartmann, Illia Dulskyi, Fen Risland, Chadd, SX, Jeff Scroggin, Ken Nordquist, Sean Connelly, Artur Olbinski, Swaroop Kallakuri, Jack West, Ai Maven, David Ziegler, Russ Johnson, transmissions 11, John Villwock, Alps Aficionado, Clay Pascal, Viktor Bowallius, Subspace Studios, Rainer Wilmers, Trenton Dambrowitz, vamX, Michael Levine, 준교 김, Brandon Frisco, Kalila, Trailburnt, Randy H, Talal Aujan, Nathan Dryer, Vadim, 阿明, ReadyPlayerEmma, Tiffany J.
Kim, George Stoitzev, Spencer Kim, Jerry Meng, Gabriel Tamborski, Cory Kujawski, Jeffrey Morgan, Spiking Neurons AB, Edmond Seymore, Alexandros Triantafyllidis, Lone Striker, Cap'n Zoog, Nikolai Manek, danny, ya boyyy, Derek Yates, usrbinkat, Mandus, TL, Nathan LeClaire, subjectnull, Imad Khwaja, webtim, Raven Klaugh, Asp the Wyvern, Gabriel Puliatti, Caitlyn Gatomon, Joseph William Delisle, Jonathan Leane, Luke Pendergrass, SuperWojo, Sebastain Graf, Will Dee, Fred von Graf, Andrey, Dan Guido, Daniel P. Andersen, Nitin Borwankar, Elle, Vitor Caleffi, biorpg, jjj, NimbleBox.ai, Pieter, Matthew Berman, terasurfer, Michael Davis, Alex, Stanislav Ovsiannikov

And thank you again to a16z for their generous grant.

An improved, potentially even perfected variant of MythoMix, my MythoLogic-L2 and Huginn merge using a highly experimental tensor type merge technique. The main difference with MythoMix is that I allowed more of Huginn to intermingle with the single tensors located at the front and end of a model, resulting in increased coherency across the entire structure. The script and the accompanying templates I used to produce both can be found here. This model is proficient at both roleplaying and storywriting due to its unique nature.

Quantized models are available from TheBloke: GGML - GPTQ (You're the best!)

The idea behind this merge is that each layer is composed of several tensors, which are in turn responsible for specific functions. Using MythoLogic-L2's robust understanding as its input and Huginn's extensive writing capability as its output seems to have resulted in a model that excels at both, confirming my theory. (More details to be released at a later time.)

This type of merge is incapable of being illustrated, as each of its 363 tensors had a unique ratio applied to it. As with my prior merges, gradients were part of these ratios to further finetune its behaviour.

This model primarily uses Alpaca formatting, so for optimal model performance, use the standard Alpaca instruction format.
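The template block itself was not preserved, but the standard Alpaca format (which matches the `prompt_template` metadata in this uploader's other Alpaca-based cards) can be sketched as a small helper; the function name is illustrative:

```python
def alpaca_prompt(instruction: str) -> str:
    """Wrap an instruction in the standard Alpaca template.

    The template text here is the common Alpaca format, assumed because
    the card's own template block was truncated.
    """
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n### Response:\n"
    )

print(alpaca_prompt("Write a short story about a lighthouse keeper."))
```

The model then generates its reply after the `### Response:` marker.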
Mistral-7B-Instruct-v0.2-AWQ
---
base_model: mistralai/Mistral-7B-Instruct-v0.2
inference: false
license: apache-2.0
model_creator: Mistral AI_
model_name: Mistral 7B Instruct v0.2
model_type: mistral
pipeline_tag: text-generation
prompt_template: '[INST] {prompt} [/INST]
Llama-2-7B-Chat-GGUF
---
language:
- en
license: llama2
tags:
- facebook
- meta
- pytorch
- llama
- llama-2
model_name: Llama 2 7B Chat
arxiv: 2307.09288
base_model: meta-llama/Llama-2-7b-chat-hf
inference: false
model_creator: Meta Llama 2
model_type: llama
pipeline_tag: text-generation
prompt_template: '[INST] >
TinyLlama-1.1B-Chat-v0.3-AWQ
phi-2-GGUF
TheBloke's LLM work is generously supported by a grant from Andreessen Horowitz (a16z).

Phi 2 - GGUF
- Model creator: Microsoft
- Original model: Phi 2

This repo contains GGUF format model files for Microsoft's Phi 2.

GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp.

Here is an incomplete list of clients and libraries that are known to support GGUF:
- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
- GPT4All, a free and open source local running GUI, supporting Windows, Linux and macOS with full GPU accel.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration. Linux available, in beta as of 27/11/2023.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy-to-use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.
- ctransformers, a Python library with GPU accel, LangChain support, and an OpenAI-compatible AI server. Note: as of time of writing (November 27th 2023), ctransformers has not been updated in a long time and does not support many recent models.

- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference
- Microsoft's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions

These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third party UIs and libraries - please see the list at the top of this README.

- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| phi-2.Q2_K.gguf | Q2_K | 2 | 1.17 GB | 3.67 GB | smallest, significant quality loss - not recommended for most purposes |
| phi-2.Q3_K_S.gguf | Q3_K_S | 3 | 1.25 GB | 3.75 GB | very small, high quality loss |
| phi-2.Q3_K_M.gguf | Q3_K_M | 3 | 1.48 GB | 3.98 GB | very small, high quality loss |
| phi-2.Q4_0.gguf | Q4_0 | 4 | 1.60 GB | 4.10 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| phi-2.Q3_K_L.gguf | Q3_K_L | 3 | 1.60 GB | 4.10 GB | small, substantial quality loss |
| phi-2.Q4_K_S.gguf | Q4_K_S | 4 | 1.62 GB | 4.12 GB | small, greater quality loss |
| phi-2.Q4_K_M.gguf | Q4_K_M | 4 | 1.79 GB | 4.29 GB | medium, balanced quality - recommended |
| phi-2.Q5_0.gguf | Q5_0 | 5 | 1.93 GB | 4.43 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| phi-2.Q5_K_S.gguf | Q5_K_S | 5 | 1.93 GB | 4.43 GB | large, low quality loss - recommended |
| phi-2.Q5_K_M.gguf | Q5_K_M | 5 | 2.07 GB | 4.57 GB | large, very low quality loss - recommended |
| phi-2.Q6_K.gguf | Q6_K | 6 | 2.29 GB | 4.79 GB | very large, extremely low quality loss |
| phi-2.Q8_0.gguf | Q8_0 | 8 | 2.96 GB | 5.46 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.

The following clients/libraries will automatically download models for you, providing a list of available models to choose from.

Under Download Model, you can enter the model repo: TheBloke/phi-2-GGUF and below it, a specific filename to download, such as: phi-2.Q4_K_M.gguf.
On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library. Then you can download any individual model file to the current directory, at high speed. You can also download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI.

To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows Command Line users: you can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.

Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 2048` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. Note that longer sequence lengths require much more resources, so you may need to reduce this value. If you want to have a chat-style conversation, replace the `-p` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation.

Further instructions can be found in the text-generation-webui documentation, here: text-generation-webui/docs/04 ‐ Model Tab.md.

You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries. Note that at the time of writing (Nov 27th 2023), ctransformers has not been updated for some time and is not compatible with some recent models. Therefore I recommend you use llama-cpp-python.

How to load this model in Python code, using llama-cpp-python. For full documentation, please see: llama-cpp-python docs.
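The llama-cpp-python example code did not survive extraction; a minimal sketch, assuming the filename from the Provided Files table and the standard `llama_cpp.Llama` API (the `qa_prompt` helper is illustrative, and the model load is guarded so it only runs when the file is present):

```python
import os

def qa_prompt(question: str) -> str:
    """Phi-2's concise QA format, per the model card: 'Instruct: ...\\nOutput:'."""
    return f"Instruct: {question}\nOutput:"

MODEL_PATH = "./phi-2.Q4_K_M.gguf"  # illustrative choice from the table above

if os.path.exists(MODEL_PATH):
    from llama_cpp import Llama

    llm = Llama(
        model_path=MODEL_PATH,
        n_ctx=2048,       # matches the -c 2048 default noted above
        n_gpu_layers=32,  # set to 0 if you have no GPU acceleration
    )
    out = llm(qa_prompt("What is gravity?"), max_tokens=128, stop=["Instruct:"])
    print(out["choices"][0]["text"])
```

Requires `pip install llama-cpp-python` and a downloaded GGUF file; the `stop` sequence simply cuts generation before the model starts a new "Instruct:" turn.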
First, install the llama-cpp-python package by running the command appropriate to your system.

Here are guides on using llama-cpp-python and ctransformers with LangChain:
- LangChain + llama-cpp-python
- LangChain + ctransformers

For further support, and discussions on these models and AI in general, join us at TheBloke AI's Discord server.

I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

Patreon: https://patreon.com/TheBlokeAI
Ko-Fi: https://ko-fi.com/TheBlokeAI

Patreon special mentions: Michael Levine, 阿明, Trailburnt, Nikolai Manek, John Detwiler, Randy H, Will Dee, Sebastain Graf, NimbleBox.ai, Eugene Pentland, Emad Mostaque, Ai Maven, Jim Angel, Jeff Scroggin, Michael Davis, Manuel Alberto Morcote, Stephen Murray, Robert, Justin Joy, Luke @flexchar, Brandon Frisco, Elijah Stavena, SX, Dan Guido, Undi ., Komninos Chatzipapas, Shadi, theTransient, Lone Striker, Raven Klaugh, jjj, Cap'n Zoog, Michel-Marie MAUDET (LINAGORA), Matthew Berman, David, Fen Risland, Omer Bin Jawed, Luke Pendergrass, Kalila, OG, Erik Bjäreholt, Rooh Singh, Joseph William Delisle, Dan Lewis, TL, John Villwock, AzureBlack, Brad, Pedro Madruga, Caitlyn Gatomon, K, jinyuan sun, Mano Prime, Alex, Jeffrey Morgan, Alicia Loh, Illia Dulskyi, Chadd, transmissions 11, fincy, Rainer Wilmers, ReadyPlayerEmma, knownsqashed, Mandus, biorpg, Deo Leter, Brandon Phillips, SuperWojo, Sean Connelly, Iucharbius, Jack West, Harry Royden McLaughlin, Nicholas, terasurfer, Vitor Caleffi, Duane Dunston, Johann-Peter Hartmann, David Ziegler, Olakabola, Ken Nordquist, Trenton Dambrowitz, Tom X Nguyen, Vadim,
Ajan Kanaga, Leonard Tan, Clay Pascal, Alexandros Triantafyllidis, JM33133, Xule, vamX, ya boyyy, subjectnull, Talal Aujan, Alps Aficionado, wassieverse, Ari Malik, James Bentley, Woland, Spencer Kim, Michael Dempsey, Fred von Graf, Elle, zynix, William Richards, Stanislav Ovsiannikov, Edmond Seymore, Jonathan Leane, Martin Kemka, usrbinkat, Enrico Ros

And thank you again to a16z for their generous grant.

Phi-2 is a Transformer with 2.7 billion parameters. It was trained using the same data sources as Phi-1.5, augmented with a new data source that consists of various NLP synthetic texts and filtered websites (for safety and educational value). When assessed against benchmarks testing common sense, language understanding, and logical reasoning, Phi-2 showcased nearly state-of-the-art performance among models with less than 13 billion parameters.

Our model hasn't been fine-tuned through reinforcement learning from human feedback. The intention behind crafting this open-source model is to provide the research community with a non-restricted small model to explore vital safety challenges, such as reducing toxicity, understanding societal biases, enhancing controllability, and more. Phi-2 is intended for research purposes only.

Given the nature of the training data, the Phi-2 model is best suited for prompts using the QA format, the chat format, and the code format. You can provide the prompt as a standalone question, where the model generates the text that follows it. To encourage the model to write more concise answers, you can also try the QA format "Instruct: <prompt>\nOutput:", where the model generates the text after "Output:". In the chat format, the model generates the text after the first "Bob:"; in the code format, it generates the text after the comments.

Notes: Phi-2 is intended for research purposes. The model-generated text/code should be treated as a starting point rather than a definitive solution for potential use cases.
Users should be cautious when employing these models in their applications. Direct adoption for production tasks is out of the scope of this research project. As a result, the Phi-2 model has not been tested to ensure that it performs adequately for any production-level application. Please refer to the limitation sections of this document for more details.

If you are using `transformers>=4.36.0`, always load the model with `trust_remote_code=True` to prevent side-effects. To ensure maximum compatibility, we recommend using the second execution mode (FP16 / CUDA). Remark: in the generation function, our model currently does not support beam search (`num_beams > 1`). Furthermore, in the forward pass of the model, we currently do not support outputting hidden states or attention values, or using custom input embeddings.

- Generate Inaccurate Code and Facts: The model may produce incorrect code snippets and statements. Users should treat these outputs as suggestions or starting points, not as definitive or accurate solutions.
- Limited Scope for code: The majority of Phi-2 training data is based in Python and uses common packages such as "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages or scripts in other languages, we strongly recommend users manually verify all API uses.
- Unreliable Responses to Instruction: The model has not undergone instruction fine-tuning. As a result, it may struggle or fail to adhere to intricate or nuanced instructions provided by users.
- Language Limitations: The model is primarily designed to understand standard English. Informal English, slang, or any other languages might pose challenges to its comprehension, leading to potential misinterpretations or errors in response.
- Potential Societal Biases: Phi-2 is not entirely free from societal biases despite efforts in assuring training data safety.
There's a possibility it may generate content that mirrors these societal biases, particularly if prompted or instructed to do so. We urge users to be aware of this and to exercise caution and critical thinking when interpreting model outputs.

- Toxicity: Despite being trained with carefully selected data, the model can still produce harmful content if explicitly prompted or instructed to do so. We chose to release the model for research purposes only -- we hope to help the open-source community develop the most effective ways to reduce the toxicity of a model directly after pretraining.
- Verbosity: Phi-2, being a base model, often produces irrelevant or extra text and responses following its first answer to user prompts within a single turn. This is due to its training dataset being primarily textbooks, which results in textbook-like responses.

Architecture: a Transformer-based model with a next-word prediction objective.
Dataset size: 250B tokens, a combination of NLP synthetic data created by AOAI GPT-3.5 and filtered web data from Falcon RefinedWeb and SlimPajama, which was assessed by AOAI GPT-4.

The model is licensed under the microsoft-research-license.

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
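The recommended FP16 / CUDA execution mode described above can be sketched as follows. This is a minimal, hypothetical example assuming the standard transformers API (the helper name and prompt are illustrative, and the load is guarded so it only runs when torch, transformers, and a CUDA device are actually available):

```python
import importlib.util

def phi2_fp16_cuda_kwargs() -> dict:
    """Keyword arguments for the FP16 / CUDA mode the card recommends."""
    return dict(torch_dtype="float16", device_map="cuda", trust_remote_code=True)

if importlib.util.find_spec("torch") and importlib.util.find_spec("transformers"):
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    if torch.cuda.is_available():
        tok = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", **phi2_fp16_cuda_kwargs())
        inputs = tok("Instruct: Write a haiku about rain.\nOutput:", return_tensors="pt").to("cuda")
        # Beam search is unsupported (num_beams > 1), per the remark above
        out = model.generate(**inputs, max_new_tokens=64)
        print(tok.batch_decode(out)[0])
```

Downloading microsoft/phi-2 (~5.5 GB in fp16) happens on first use; as the card notes, generation here is greedy because beam search is not supported.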
Mistral-7B-Instruct-v0.1-GGUF
TheBloke's LLM work is generously supported by a grant from Andreessen Horowitz (a16z).

Mistral 7B Instruct v0.1 - GGUF
- Model creator: Mistral AI
- Original model: Mistral 7B Instruct v0.1

This re...
TinyLlama-1.1B-Chat-v1.0-GPTQ
dolphin-2.6-mistral-7B-AWQ
Mixtral-8x7B-Instruct-v0.1-GGUF
deepseek-coder-6.7B-instruct-GGUF
OpenHermes-2.5-Mistral-7B-GGUF
Llama-2-13B-chat-GGUF
Mistral-7B-v0.1-GGUF
Wizard-Vicuna-13B-Uncensored-GGUF
TheBloke's LLM work is generously supported by a grant from Andreessen Horowitz (a16z).

Wizard Vicuna 13B Uncensored - GGUF
- Model creator: Eric Hartford
- Original model: Wizard Vicuna 13B Uncensored

This repo contains GGUF format model files for Eric Hartford's Wizard Vicuna 13B Uncensored.

GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp.

Here is an incomplete list of clients and libraries that are known to support GGUF:
- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy-to-use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- ctransformers, a Python library with GPU accel, LangChain support, and an OpenAI-compatible AI server.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.

- AWQ model(s) for GPU inference.
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference
- Eric Hartford's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions

These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third party UIs and libraries - please see the list at the top of this README.

The new methods available are:
- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| Wizard-Vicuna-13B-Uncensored.Q2_K.gguf | Q2_K | 2 | 5.43 GB | 7.93 GB | smallest, significant quality loss - not recommended for most purposes |
| Wizard-Vicuna-13B-Uncensored.Q3_K_S.gguf | Q3_K_S | 3 | 5.66 GB | 8.16 GB | very small, high quality loss |
| Wizard-Vicuna-13B-Uncensored.Q3_K_M.gguf | Q3_K_M | 3 | 6.34 GB | 8.84 GB | very small, high quality loss |
| Wizard-Vicuna-13B-Uncensored.Q3_K_L.gguf | Q3_K_L | 3 | 6.93 GB | 9.43 GB | small, substantial quality loss |
| Wizard-Vicuna-13B-Uncensored.Q4_0.gguf | Q4_0 | 4 | 7.37 GB | 9.87 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| Wizard-Vicuna-13B-Uncensored.Q4_K_S.gguf | Q4_K_S | 4 | 7.41 GB | 9.91 GB | small, greater quality loss |
| Wizard-Vicuna-13B-Uncensored.Q4_K_M.gguf | Q4_K_M | 4 | 7.87 GB | 10.37 GB | medium, balanced quality - recommended |
| Wizard-Vicuna-13B-Uncensored.Q5_0.gguf | Q5_0 | 5 | 8.97 GB | 11.47 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| Wizard-Vicuna-13B-Uncensored.Q5_K_S.gguf | Q5_K_S | 5 | 8.97 GB | 11.47 GB | large, low quality loss - recommended |
| Wizard-Vicuna-13B-Uncensored.Q5_K_M.gguf | Q5_K_M | 5 | 9.23 GB | 11.73 GB | large, very low quality loss - recommended |
| Wizard-Vicuna-13B-Uncensored.Q6_K.gguf | Q6_K | 6 | 10.68 GB | 13.18 GB | very large, extremely low quality loss |
| Wizard-Vicuna-13B-Uncensored.Q8_0.gguf | Q8_0 | 8 | 13.83 GB | 16.33 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.
The following clients/libraries will automatically download models for you, providing a list of available models to choose from: - LM Studio - LoLLMS Web UI - Faraday.dev Under Download Model, you can enter the model repo: TheBloke/Wizard-Vicuna-13B-Uncensored-GGUF and below it, a specific filename to download, such as: Wizard-Vicuna-13B-Uncensored.Q4_K_M.gguf. On the command line, including multiple files at once I recommend using the `huggingface-hub` Python library: Then you can download any individual model file to the current directory, at high speed, with a command like this: You can also download multiple files at once with a pattern: For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer`: And set environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`: Windows Command Line users: You can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command. Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 2048` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p ` argument with `-i -ins` For other parameters and how to use them, please refer to the llama.cpp documentation Further instructions here: text-generation-webui/docs/llama.cpp.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries.
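The `-ngl` and `-c` flags described above map onto `n_gpu_layers` and `n_ctx` in llama-cpp-python. A minimal sketch, assuming `pip install llama-cpp-python` and a locally downloaded GGUF file (the path below is illustrative):

```python
# Sketch: llama-cpp-python equivalents of the llama.cpp CLI flags above.
# -ngl 32 -> n_gpu_layers=32, -c 2048 -> n_ctx=2048. Path is illustrative.

def run_prompt(model_path: str, prompt: str,
               n_gpu_layers: int = 32, n_ctx: int = 2048) -> str:
    # Third-party dependency imported inside the function so the sketch
    # can be read (and the function defined) without the library installed.
    from llama_cpp import Llama
    llm = Llama(model_path=model_path, n_ctx=n_ctx, n_gpu_layers=n_gpu_layers)
    out = llm(prompt, max_tokens=128)
    return out["choices"][0]["text"]

# Example (not executed here; requires a downloaded GGUF file):
# print(run_prompt("./Wizard-Vicuna-13B-Uncensored.Q4_K_M.gguf",
#                  "Write a story about llamas"))
```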
How to load this model in Python code, using ctransformers Run one of the following commands, according to your system: Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python LangChain + ctransformers For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donators will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Alicia Loh, Stephen Murray, K, Ajan Kanaga, RoA, Magnesian, Deo Leter, Olakabola, Eugene Pentland, zynix, Deep Realms, Raymond Fosdick, Elijah Stavena, Iucharbius, Erik Bjäreholt, Luis Javier Navarrete Lozano, Nicholas, theTransient, John Detwiler, alfiei, knownsqashed, Mano Prime, Willem Michiel, Enrico Ros, LangChain4j, OG, Michael Dempsey, Pierre Kircher, Pedro Madruga, James Bentley, Thomas Belote, Luke @flexchar, Leonard Tan, Johann-Peter Hartmann, Illia Dulskyi, Fen Risland, Chadd, SX, Jeff Scroggin, Ken Nordquist, Sean Connelly, Artur Olbinski, Swaroop Kallakuri, Jack West, Ai Maven, David Ziegler, Russ Johnson, transmissions 11, John Villwock, Alps Aficionado, Clay Pascal, Viktor Bowallius, Subspace Studios, Rainer Wilmers, Trenton Dambrowitz, vamX, Michael Levine, 준교 김, Brandon Frisco, Kalila, Trailburnt, Randy H, Talal Aujan, Nathan Dryer, Vadim, 阿明, ReadyPlayerEmma, Tiffany J.
Kim, George Stoitzev, Spencer Kim, Jerry Meng, Gabriel Tamborski, Cory Kujawski, Jeffrey Morgan, Spiking Neurons AB, Edmond Seymore, Alexandros Triantafyllidis, Lone Striker, Cap'n Zoog, Nikolai Manek, danny, ya boyyy, Derek Yates, usrbinkat, Mandus, TL, Nathan LeClaire, subjectnull, Imad Khwaja, webtim, Raven Klaugh, Asp the Wyvern, Gabriel Puliatti, Caitlyn Gatomon, Joseph William Delisle, Jonathan Leane, Luke Pendergrass, SuperWojo, Sebastain Graf, Will Dee, Fred von Graf, Andrey, Dan Guido, Daniel P. Andersen, Nitin Borwankar, Elle, Vitor Caleffi, biorpg, jjj, NimbleBox.ai, Pieter, Matthew Berman, terasurfer, Michael Davis, Alex, Stanislav Ovsiannikov And thank you again to a16z for their generous grant. Original model card: Eric Hartford's Wizard Vicuna 13B Uncensored This is wizard-vicuna-13b trained with a subset of the dataset - responses that contained alignment / moralizing were removed. The intent is to train a WizardLM that doesn't have alignment built-in, so that alignment (of any sort) can be added separately, for example with an RLHF LoRA. Shout out to the open source AI/ML community, and everyone who helped me out. You are responsible for anything you do with the model, just as you are responsible for anything you do with any dangerous object such as a knife, gun, lighter, or car. Publishing anything this model generates is the same as publishing it yourself. You are responsible for the content you publish, and you cannot blame the model any more than you can blame the knife, gun, lighter, or car for what you do with it.
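The ctransformers loading route described in this card ("How to load this model in Python code, using ctransformers") can be sketched as follows. The repo, file name, and `gpu_layers` value are illustrative; assumes `pip install ctransformers`:

```python
# Sketch: loading a GGUF file with ctransformers. Names are illustrative.

def load_gguf_model(repo_id: str, model_file: str, gpu_layers: int = 0):
    # Third-party dependency imported inside the function so the sketch
    # can be defined without the library installed.
    from ctransformers import AutoModelForCausalLM
    # model_type="llama" matches the Llama-family models in this README.
    return AutoModelForCausalLM.from_pretrained(
        repo_id, model_file=model_file, model_type="llama",
        gpu_layers=gpu_layers)

# Example (not executed here; fetches the model file on first use):
# llm = load_gguf_model("TheBloke/Wizard-Vicuna-13B-Uncensored-GGUF",
#                       "Wizard-Vicuna-13B-Uncensored.Q4_K_M.gguf",
#                       gpu_layers=50)
# print(llm("AI is going to"))
```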
deepseek-coder-33B-instruct-GGUF
Llama-2-7B-GGUF
TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z) Llama 2 7B - GGUF - Model creator: Meta - Original model: Llama 2 7B This repo contains GGUF format model files for Meta's Llama 2 7B. GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. GGUF offers numerous advantages over GGML, such as better tokenisation, and support for special tokens. It also supports metadata, and is designed to be extensible. Here is an incomplete list of clients and libraries that are known to support GGUF:

- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for story telling.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy to use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- ctransformers, a Python library with GPU accel, LangChain support, and OpenAI-compatible AI server.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.

AWQ model(s) for GPU inference. GPTQ models for GPU inference, with multiple quantisation parameter options.
2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference Meta's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d36d5be95a0d9088b674dbb27354107221 They are also compatible with many third party UIs and libraries - please see the list at the top of this README. The new methods available are:

- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw)
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| llama-2-7b.Q2_K.gguf | Q2_K | 2 | 2.83 GB | 5.33 GB | smallest, significant quality loss - not recommended for most purposes |
| llama-2-7b.Q3_K_S.gguf | Q3_K_S | 3 | 2.95 GB | 5.45 GB | very small, high quality loss |
| llama-2-7b.Q3_K_M.gguf | Q3_K_M | 3 | 3.30 GB | 5.80 GB | very small, high quality loss |
| llama-2-7b.Q3_K_L.gguf | Q3_K_L | 3 | 3.60 GB | 6.10 GB | small, substantial quality loss |
| llama-2-7b.Q4_0.gguf | Q4_0 | 4 | 3.83 GB | 6.33 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| llama-2-7b.Q4_K_S.gguf | Q4_K_S | 4 | 3.86 GB | 6.36 GB | small, greater quality loss |
| llama-2-7b.Q4_K_M.gguf | Q4_K_M | 4 | 4.08 GB | 6.58 GB | medium, balanced quality - recommended |
| llama-2-7b.Q5_0.gguf | Q5_0 | 5 | 4.65 GB | 7.15 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| llama-2-7b.Q5_K_S.gguf | Q5_K_S | 5 | 4.65 GB | 7.15 GB | large, low quality loss - recommended |
| llama-2-7b.Q5_K_M.gguf | Q5_K_M | 5 | 4.78 GB | 7.28 GB | large, very low quality loss - recommended |
| llama-2-7b.Q6_K.gguf | Q6_K | 6 | 5.53 GB | 8.03 GB | very large, extremely low quality loss |
| llama-2-7b.Q8_0.gguf | Q8_0 | 8 | 7.16 GB | 9.66 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead. Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file. The following clients/libraries will automatically download models for you, providing a list of available models to choose from: - LM Studio - LoLLMS Web UI - Faraday.dev Under Download Model, you can enter the model repo: TheBloke/Llama-2-7B-GGUF and below it, a specific filename to download, such as: llama-2-7b.Q4_K_M.gguf.
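A programmatic alternative to the client download flow above is the `huggingface-hub` Python library (recommended below for command-line downloads). A minimal sketch; the function only needs network access when actually called:

```python
# Sketch of a single-file GGUF download via the huggingface-hub library
# (pip3 install huggingface-hub). Repo and filename follow the pattern
# described in this card.

def download_gguf(repo_id: str, filename: str, local_dir: str = ".") -> str:
    """Download one GGUF file from the Hub and return its local path."""
    # Third-party dependency imported inside the function so the sketch
    # can be defined without the library installed.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=repo_id, filename=filename,
                           local_dir=local_dir)

# Example (not executed here; downloads several GB):
# path = download_gguf("TheBloke/Llama-2-7B-GGUF", "llama-2-7b.Q4_K_M.gguf")
```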
On the command line, including multiple files at once I recommend using the `huggingface-hub` Python library: Then you can download any individual model file to the current directory, at high speed, with a command like this: You can also download multiple files at once with a pattern: For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer`: And set environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`: Windows CLI users: Use `set HF_HUB_ENABLE_HF_TRANSFER=1` before running the download command. Make sure you are using `llama.cpp` from commit d0cee0d36d5be95a0d9088b674dbb27354107221 or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 4096` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p ` argument with `-i -ins` For other parameters and how to use them, please refer to the llama.cpp documentation Further instructions here: text-generation-webui/docs/llama.cpp.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries. How to load this model from Python using ctransformers Simple example code to load one of these GGUF models Here's guides on using llama-cpp-python or ctransformers with LangChain: LangChain + llama-cpp-python LangChain + ctransformers For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.
If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donators will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: as listed above. And thank you again to a16z for their generous grant. Llama 2 Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository for the 7B pretrained model, converted for the Hugging Face Transformers format. Links to other models can be found in the index at the bottom. Model Details Note: Use of this model is governed by the Meta license. In order to download the model weights and tokenizer, please visit the website and accept our License before requesting access here. Meta developed and publicly released the Llama 2 family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama-2-Chat, are optimized for dialogue use cases. Llama-2-Chat models outperform open-source chat models on most benchmarks we tested, and in our human evaluations for helpfulness and safety, are on par with some popular closed-source models like ChatGPT and PaLM. Variations Llama 2 comes in a range of parameter sizes — 7B, 13B, and 70B — as well as pretrained and fine-tuned variations. Model Architecture Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety.

||Training Data|Params|Content Length|GQA|Tokens|LR|
|---|---|---|---|---|---|---|
|Llama 2|A new mix of publicly available online data|7B|4k|✗|2.0T|3.0 x 10⁻⁴|
|Llama 2|A new mix of publicly available online data|13B|4k|✗|2.0T|3.0 x 10⁻⁴|
|Llama 2|A new mix of publicly available online data|70B|4k|✔|2.0T|1.5 x 10⁻⁴|

Llama 2 family of models.
Token counts refer to pretraining data only. All models are trained with a global batch-size of 4M tokens. Bigger models - 70B - use Grouped-Query Attention (GQA) for improved inference scalability. Model Dates Llama 2 was trained between January 2023 and July 2023. Status This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback. License A custom commercial license is available at: https://ai.meta.com/resources/models-and-libraries/llama-downloads/ Research Paper "Llama-2: Open Foundation and Fine-tuned Chat Models" Intended Use Intended Use Cases Llama 2 is intended for commercial and research use in English. Tuned models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks. To get the expected features and performance for the chat versions, a specific formatting needs to be followed, including the `[INST]` and `<<SYS>>` tags, `BOS` and `EOS` tokens, and the whitespaces and line breaks in between (we recommend calling `strip()` on inputs to avoid double-spaces). See our reference code in github for details: `chat_completion`. Out-of-scope Uses Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English. Use in any other way that is prohibited by the Acceptable Use Policy and Licensing Agreement for Llama 2. Hardware and Software Training Factors We used custom training libraries, Meta's Research Super Cluster, and production clusters for pretraining. Fine-tuning, annotation, and evaluation were also performed on third-party cloud compute. Carbon Footprint Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB (TDP of 350-400W). Estimated total emissions were 539 tCO2eq, 100% of which were offset by Meta’s sustainability program.
||Time (GPU hours)|Power Consumption (W)|Carbon Emitted (tCO₂eq)|
|---|---|---|---|
|Llama 2 7B|184320|400|31.22|
|Llama 2 13B|368640|400|62.44|
|Llama 2 70B|1720320|400|291.42|
|Total|3311616||539.00|

CO₂ emissions during pretraining. Time: total GPU time required for training each model. Power Consumption: peak power capacity per GPU device for the GPUs used adjusted for power usage efficiency. 100% of the emissions are directly offset by Meta's sustainability program, and because we are openly releasing these models, the pretraining costs do not need to be incurred by others. Training Data Overview Llama 2 was pretrained on 2 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over one million new human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data. Data Freshness The pretraining data has a cutoff of September 2022, but some tuning data is more recent, up to July 2023. In this section, we report the results for the Llama 1 and Llama 2 models on standard academic benchmarks. For all the evaluations, we use our internal evaluations library.

|Model|Size|Code|Commonsense Reasoning|World Knowledge|Reading Comprehension|Math|MMLU|BBH|AGI Eval|
|---|---|---|---|---|---|---|---|---|---|
|Llama 1|7B|14.1|60.8|46.2|58.5|6.95|35.1|30.3|23.9|
|Llama 1|13B|18.9|66.1|52.6|62.3|10.9|46.9|37.0|33.9|
|Llama 1|33B|26.0|70.0|58.4|67.6|21.4|57.8|39.8|41.7|
|Llama 1|65B|30.7|70.7|60.5|68.6|30.8|63.4|43.5|47.6|
|Llama 2|7B|16.8|63.9|48.9|61.3|14.6|45.3|32.6|29.3|
|Llama 2|13B|24.5|66.9|55.4|65.8|28.7|54.8|39.4|39.1|
|Llama 2|70B|37.5|71.9|63.6|69.4|35.2|68.9|51.2|54.2|

Overall performance on grouped academic benchmarks. Code: We report the average pass@1 scores of our models on HumanEval and MBPP. Commonsense Reasoning: We report the average of PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA, and CommonsenseQA.
We report 7-shot results for CommonSenseQA and 0-shot results for all other benchmarks. World Knowledge: We evaluate the 5-shot performance on NaturalQuestions and TriviaQA and report the average. Reading Comprehension: For reading comprehension, we report the 0-shot average on SQuAD, QuAC, and BoolQ. MATH: We report the average of the GSM8K (8 shot) and MATH (4 shot) benchmarks at top 1.

|||TruthfulQA|Toxigen|
|---|---|---|---|
|Llama 1|7B|27.42|23.00|
|Llama 1|13B|41.74|23.08|
|Llama 1|33B|44.19|22.57|
|Llama 1|65B|48.71|21.77|
|Llama 2|7B|33.29|21.25|
|Llama 2|13B|41.86|26.10|
|Llama 2|70B|50.18|24.60|

Evaluation of pretrained LLMs on automatic safety benchmarks. For TruthfulQA, we present the percentage of generations that are both truthful and informative (the higher the better). For ToxiGen, we present the percentage of toxic generations (the smaller the better).

|||TruthfulQA|Toxigen|
|---|---|---|---|
|Llama-2-Chat|7B|57.04|0.00|
|Llama-2-Chat|13B|62.18|0.00|
|Llama-2-Chat|70B|64.14|0.01|

Evaluation of fine-tuned LLMs on different safety datasets. Same metric definitions as above. Ethical Considerations and Limitations Llama 2 is a new technology that carries risks with use. Testing conducted to date has been in English, and has not covered, nor could it cover all scenarios. For these reasons, as with all LLMs, Llama 2’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Llama 2, developers should perform safety testing and tuning tailored to their specific applications of the model.
Please see the Responsible Use Guide available at https://ai.meta.com/llama/responsible-use-guide/ Reporting Issues Please report any software “bug,” or other problems with the models through one of the following means: - Reporting issues with the model: github.com/facebookresearch/llama - Reporting problematic content generated by the model: developers.facebook.com/llamaoutputfeedback - Reporting bugs and security concerns: facebook.com/whitehat/info Llama Model Index

|Model|Llama2|Llama2-hf|Llama2-chat|Llama2-chat-hf|
|---|---|---|---|---|
|7B|Link|Link|Link|Link|
|13B|Link|Link|Link|Link|
|70B|Link|Link|Link|Link|
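The `[INST]`/`<<SYS>>` chat formatting noted in the Intended Use section above can be sketched as a single-turn prompt builder. This is a simplified version of Meta's `chat_completion` reference; `BOS`/`EOS` tokens are normally added by the tokenizer, so they are omitted from the string here:

```python
# Simplified single-turn Llama-2-Chat prompt formatting, following the
# [INST] / <<SYS>> convention described above. BOS/EOS tokens are added
# by the tokenizer and therefore omitted here.

def build_llama2_prompt(user_msg: str, system_msg: str = "") -> str:
    user_msg = user_msg.strip()  # avoid double spaces, as recommended above
    if system_msg:
        return (f"[INST] <<SYS>>\n{system_msg.strip()}\n<</SYS>>\n\n"
                f"{user_msg} [/INST]")
    return f"[INST] {user_msg} [/INST]"

print(build_llama2_prompt("What is GGUF?", "You are a helpful assistant."))
```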
dolphin-2.5-mixtral-8x7b-GGUF
MythoMax-L2-Kimiko-v2-13B-GGUF
TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z) MythoMax L2 Kimiko v2 13B - GGUF - Model creator: Undi95 - Original model: MythoMax L2 Kimiko v2 13B This repo contains GGUF format model files for Undi95's MythoMax L2 Kimiko v2 13B. GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. GGUF offers numerous advantages over GGML, such as better tokenisation, and support for special tokens. It also supports metadata, and is designed to be extensible. Here is an incomplete list of clients and libraries that are known to support GGUF:

- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for story telling.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy to use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- ctransformers, a Python library with GPU accel, LangChain support, and OpenAI-compatible AI server.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.

AWQ model(s) for GPU inference. GPTQ models for GPU inference, with multiple quantisation parameter options.
2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference Undi95's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions The creator of the source model has listed its license as `cc-by-nc-4.0`, and this quantization has therefore used that same license. As this model is based on Llama 2, it is also subject to the Meta Llama 2 license terms, and the license files for that are additionally included. It should therefore be considered as being claimed to be licensed under both licenses. I contacted Hugging Face for clarification on dual licensing but they do not yet have an official position. Should this change, or should Meta provide any feedback on this situation, I will update this section accordingly. In the meantime, any questions regarding licensing, and in particular how these two licenses might interact, should be directed to the original model repository: Undi95's MythoMax L2 Kimiko v2 13B. These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d36d5be95a0d9088b674dbb27354107221 They are also compatible with many third party UIs and libraries - please see the list at the top of this README. The new methods available are:

- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw)
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization.
Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw. Refer to the Provided Files table below to see what files use which methods, and how.

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| mythomax-l2-kimiko-v2-13b.Q2_K.gguf | Q2_K | 2 | 5.43 GB | 7.93 GB | smallest, significant quality loss - not recommended for most purposes |
| mythomax-l2-kimiko-v2-13b.Q3_K_S.gguf | Q3_K_S | 3 | 5.66 GB | 8.16 GB | very small, high quality loss |
| mythomax-l2-kimiko-v2-13b.Q3_K_M.gguf | Q3_K_M | 3 | 6.34 GB | 8.84 GB | very small, high quality loss |
| mythomax-l2-kimiko-v2-13b.Q3_K_L.gguf | Q3_K_L | 3 | 6.93 GB | 9.43 GB | small, substantial quality loss |
| mythomax-l2-kimiko-v2-13b.Q4_0.gguf | Q4_0 | 4 | 7.37 GB | 9.87 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| mythomax-l2-kimiko-v2-13b.Q4_K_S.gguf | Q4_K_S | 4 | 7.41 GB | 9.91 GB | small, greater quality loss |
| mythomax-l2-kimiko-v2-13b.Q4_K_M.gguf | Q4_K_M | 4 | 7.87 GB | 10.37 GB | medium, balanced quality - recommended |
| mythomax-l2-kimiko-v2-13b.Q5_0.gguf | Q5_0 | 5 | 8.97 GB | 11.47 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| mythomax-l2-kimiko-v2-13b.Q5_K_S.gguf | Q5_K_S | 5 | 8.97 GB | 11.47 GB | large, low quality loss - recommended |
| mythomax-l2-kimiko-v2-13b.Q5_K_M.gguf | Q5_K_M | 5 | 9.23 GB | 11.73 GB | large, very low quality loss - recommended |
| mythomax-l2-kimiko-v2-13b.Q6_K.gguf | Q6_K | 6 | 10.68 GB | 13.18 GB | very large, extremely low quality loss |
| mythomax-l2-kimiko-v2-13b.Q8_0.gguf | Q8_0 | 8 | 13.83 GB | 16.33 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead. Note for manual downloaders: You almost never want to clone the entire repo!
Multiple different quantisation formats are provided, and most users only want to pick and download a single file. The following clients/libraries will automatically download models for you, providing a list of available models to choose from: - LM Studio - LoLLMS Web UI - Faraday.dev Under Download Model, you can enter the model repo: TheBloke/MythoMax-L2-Kimiko-v2-13B-GGUF and below it, a specific filename to download, such as: mythomax-l2-kimiko-v2-13b.Q4_K_M.gguf. On the command line, including multiple files at once I recommend using the `huggingface-hub` Python library: Then you can download any individual model file to the current directory, at high speed, with a command like this: You can also download multiple files at once with a pattern: For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer`: And set environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`: Windows CLI users: Use `set HF_HUB_ENABLE_HF_TRANSFER=1` before running the download command. Make sure you are using `llama.cpp` from commit d0cee0d36d5be95a0d9088b674dbb27354107221 or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 4096` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p ` argument with `-i -ins` For other parameters and how to use them, please refer to the llama.cpp documentation Further instructions here: text-generation-webui/docs/llama.cpp.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries.
How to load this model from Python using ctransformers Simple example code to load one of these GGUF models Here's guides on using llama-cpp-python or ctransformers with LangChain: LangChain + llama-cpp-python LangChain + ctransformers For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donators will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: as listed above. And thank you again to a16z for their generous grant. Original model card: Undi95's MythoMax L2 Kimiko v2 13B Model : https://huggingface.co/Gryphe/MythoMax-L2-13b
CodeLlama-7B-Instruct-GGUF
dolphin-2.7-mixtral-8x7b-GGUF
deepseek-coder-1.3b-instruct-GGUF
TinyLlama-1.1B-Chat-v0.3-GGUF
Mixtral-8x7B-Instruct-v0.1-AWQ
Yi-34B-200K-AEZAKMI-v2-AWQ
CodeLlama-13B-Instruct-GGUF
Llama-2-7B-Chat-GPTQ
LLaMA2-13B-Tiefighter-AWQ
deepsex-34b-GGUF
WizardLM-7B-uncensored-GGUF
TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z)

Wizardlm 7B Uncensored - GGUF
- Model creator: Eric Hartford
- Original model: Wizardlm 7B Uncensored

This repo contains GGUF format model files for Eric Hartford's Wizardlm 7B Uncensored. GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. Here is an incomplete list of clients and libraries that are known to support GGUF:
- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy-to-use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- ctransformers, a Python library with GPU accel, LangChain support, and an OpenAI-compatible AI server.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.

Repositories available:
- AWQ model(s) for GPU inference.
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference
- Eric Hartford's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions

These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third-party UIs and libraries - please see the list at the top of this README. The new methods available are:
- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.
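The bpw figures quoted above follow directly from the super-block layout. Below is a small sanity check; it is a sketch, and the fp16 super-block constants (one for "type-0", a scale and a min for "type-1") are an assumption based on the k-quants layout. Q2_K's quoted 2.5625 bpw uses a slightly different accounting, so it is omitted here.

```python
# Reproduce the bits-per-weight (bpw) figures from the super-block structure:
# quant bits per weight, plus per-block scale (and, for "type-1", min) bits,
# plus fp16 super-block constants.
def bpw(bits, n_blocks, block_size, scale_bits, type1):
    weights = n_blocks * block_size
    per_block = scale_bits * (2 if type1 else 1)   # scale (+ min for type-1)
    super_fp16 = 32 if type1 else 16               # fp16 scale (+ min for type-1)
    total = weights * bits + n_blocks * per_block + super_fp16
    return total / weights

assert bpw(3, 16, 16, 6, type1=False) == 3.4375   # GGML_TYPE_Q3_K
assert bpw(4, 8, 32, 6, type1=True) == 4.5        # GGML_TYPE_Q4_K
assert bpw(5, 8, 32, 6, type1=True) == 5.5        # GGML_TYPE_Q5_K
assert bpw(6, 16, 16, 8, type1=False) == 6.5625   # GGML_TYPE_Q6_K
```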
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| WizardLM-7B-uncensored.Q2_K.gguf | Q2_K | 2 | 2.83 GB | 5.33 GB | smallest, significant quality loss - not recommended for most purposes |
| WizardLM-7B-uncensored.Q3_K_S.gguf | Q3_K_S | 3 | 2.95 GB | 5.45 GB | very small, high quality loss |
| WizardLM-7B-uncensored.Q3_K_M.gguf | Q3_K_M | 3 | 3.30 GB | 5.80 GB | very small, high quality loss |
| WizardLM-7B-uncensored.Q3_K_L.gguf | Q3_K_L | 3 | 3.60 GB | 6.10 GB | small, substantial quality loss |
| WizardLM-7B-uncensored.Q4_0.gguf | Q4_0 | 4 | 3.83 GB | 6.33 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| WizardLM-7B-uncensored.Q4_K_S.gguf | Q4_K_S | 4 | 3.86 GB | 6.36 GB | small, greater quality loss |
| WizardLM-7B-uncensored.Q4_K_M.gguf | Q4_K_M | 4 | 4.08 GB | 6.58 GB | medium, balanced quality - recommended |
| WizardLM-7B-uncensored.Q5_0.gguf | Q5_0 | 5 | 4.65 GB | 7.15 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| WizardLM-7B-uncensored.Q5_K_S.gguf | Q5_K_S | 5 | 4.65 GB | 7.15 GB | large, low quality loss - recommended |
| WizardLM-7B-uncensored.Q5_K_M.gguf | Q5_K_M | 5 | 4.78 GB | 7.28 GB | large, very low quality loss - recommended |
| WizardLM-7B-uncensored.Q6_K.gguf | Q6_K | 6 | 5.53 GB | 8.03 GB | very large, extremely low quality loss |
| WizardLM-7B-uncensored.Q8_0.gguf | Q8_0 | 8 | 7.16 GB | 9.66 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.
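In the table above, every Max RAM figure is exactly the file size plus 2.5 GB of working overhead, so the requirement for any quant can be estimated the same way. A sketch; the 2.5 GB constant is read off this table, not a documented llama.cpp figure:

```python
# Estimate max RAM for a GGUF quant as file size plus a fixed overhead.
# Figures assume no GPU offloading; offloaded layers use VRAM instead of RAM.
OVERHEAD_GB = 2.50  # inferred from the Provided Files table above

def max_ram_gb(file_size_gb: float, overhead_gb: float = OVERHEAD_GB) -> float:
    return round(file_size_gb + overhead_gb, 2)

assert max_ram_gb(2.83) == 5.33   # Q2_K
assert max_ram_gb(4.08) == 6.58   # Q4_K_M
assert max_ram_gb(7.16) == 9.66   # Q8_0
```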
The following clients/libraries will automatically download models for you, providing a list of available models to choose from:
- LM Studio
- LoLLMS Web UI
- Faraday.dev

Under Download Model, you can enter the model repo: TheBloke/WizardLM-7B-uncensored-GGUF and below it, a specific filename to download, such as: WizardLM-7B-uncensored.Q4_K_M.gguf. On the command line, including downloading multiple files at once, I recommend using the `huggingface-hub` Python library. You can then download any individual model file to the current directory at high speed, and you can also download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows Command Line users: you can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.

Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU; remove it if you don't have GPU acceleration. Change `-c 2048` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation. Further instructions here: text-generation-webui/docs/llama.cpp.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries.
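The actual download and run commands referenced above were not captured on this page; the flow can be sketched in Python as follows. The `./main` binary path and the prompt are placeholders, and the heavy download-and-run step is gated behind an environment flag so the sketch is safe to import:

```python
import os
from typing import List

def llama_cpp_cmd(model_path: str, n_gpu_layers: int = 32, ctx: int = 2048,
                  prompt: str = "Write a story about llamas",
                  chat: bool = False) -> List[str]:
    """Build the llama.cpp invocation described above (-ngl, -c, -p / -i -ins)."""
    cmd = ["./main", "-m", model_path, "-c", str(ctx)]
    if n_gpu_layers > 0:          # drop -ngl entirely without GPU acceleration
        cmd += ["-ngl", str(n_gpu_layers)]
    cmd += ["-i", "-ins"] if chat else ["-p", prompt]
    return cmd

if os.environ.get("RUN_LLAMA_CPP") == "1":  # set to actually download and run
    # Accelerated download (requires `pip install huggingface-hub hf_transfer`).
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import hf_hub_download
    import subprocess
    path = hf_hub_download(repo_id="TheBloke/WizardLM-7B-uncensored-GGUF",
                           filename="WizardLM-7B-uncensored.Q4_K_M.gguf")
    subprocess.run(llama_cpp_cmd(path))
```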
How to load this model in Python code, using ctransformers: run one of the following commands, according to your system. Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python, LangChain + ctransformers. For further support, and for contribution and donation details, see the first section of this document.

Original model card: Eric Hartford's Wizardlm 7B Uncensored

This is WizardLM trained with a subset of the dataset - responses that contained alignment / moralizing were removed. The intent is to train a WizardLM that doesn't have alignment built in, so that alignment (of any sort) can be added separately, for example with an RLHF LoRA. Shout out to the open source AI/ML community, and everyone who helped me out. You are responsible for anything you do with the model, just as you are responsible for anything you do with any dangerous object such as a knife, gun, lighter, or car. Publishing anything this model generates is the same as publishing it yourself. You are responsible for the content you publish, and you cannot blame the model any more than you can blame the knife, gun, lighter, or car for what you do with it.
SOLAR-10.7B-Instruct-v1.0-uncensored-GGUF
Luna-AI-Llama2-Uncensored-GGUF
WizardLM-1.0-Uncensored-Llama2-13B-GGUF
dolphin-2.6-mistral-7B-GGUF
OpenHermes-2.5-Mistral-7B-AWQ
CodeLlama-7B-GGUF
dolphin-2.2.1-mistral-7B-GGUF
Mixtral-8x7B-v0.1-GGUF
Mistral-7B-OpenOrca-GGUF
Llama-2-7B-fp16
deepseek-coder-33B-instruct-GPTQ
Synthia-v3.0-11B-AWQ
Phind-CodeLlama-34B-v2-GGUF
CodeLlama-34B-Instruct-GGUF
WizardCoder-Python-34B-V1.0-GGUF
dolphin-2.1-mistral-7B-GGUF
deepseek-llm-7B-chat-GGUF
Open_Gpt4_8x7B-GGUF
rocket-3B-GGUF
dolphin-2.7-mixtral-8x7b-AWQ
Nous-Hermes-2-SOLAR-10.7B-GGUF
claude2-alpaca-13B-GGUF
CapybaraHermes-2.5-Mistral-7B-GGUF
vicuna-7B-v1.5-GGUF
Mistral-7B-Claude-Chat-GGUF
CodeLlama-13B-GGUF
Llama-2-7B-Chat-AWQ
Llama 2 7B Chat - AWQ
- Model creator: Meta Llama 2
- Original model: Llama 2 7B Chat

This repo contains AWQ model files for Meta Llama 2's Llama 2 7B Chat. AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference. It is also now supported by the continuous batching server vLLM, allowing use of AWQ models for high-throughput concurrent inference in multi-user server scenarios. Note that, at the time of writing, overall throughput is still lower than running vLLM with unquantised models; however, using AWQ enables much smaller GPUs, which can lead to easier deployment and overall cost savings. For example, a 70B model can be run on 1 x 48GB GPU instead of 2 x 80GB.

Repositories available:
- AWQ model(s) for GPU inference.
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference
- Meta Llama 2's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions

For my first release of AWQ models, I am releasing 128g models only. I will consider adding 32g as well if there is interest, and once I have done perplexity and evaluation comparisons, but at this time 32g models are still not fully tested with AutoAWQ and vLLM.

| Branch | Bits | GS | AWQ Dataset | Seq Len | Size |
| ------ | ---- | -- | ----------- | ------- | ---- |
| main | 4 | 128 | wikitext | 4096 | 3.89 GB |

Documentation on installing and using vLLM can be found here. When using vLLM as a server, pass the `--quantization awq` parameter. When using vLLM from Python code, pass the `quantization="awq"` parameter. If you have problems installing AutoAWQ using the pre-built wheels, install it from source instead. The files provided are tested to work with AutoAWQ and vLLM.
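The vLLM snippets referred to above were not captured on this page; a sketch of the Python usage follows. The prompt is illustrative, and the model load is gated behind an environment flag since it downloads several GB:

```python
import os

def vllm_awq_kwargs(model: str = "TheBloke/Llama-2-7B-Chat-AWQ") -> dict:
    # quantization="awq" is the Python-API counterpart of passing
    # `--quantization awq` to the vLLM server command line.
    return {"model": model, "quantization": "awq"}

if os.environ.get("RUN_VLLM") == "1":  # set to actually load the model
    from vllm import LLM, SamplingParams
    llm = LLM(**vllm_awq_kwargs())
    outputs = llm.generate(["Tell me about AI"],
                           SamplingParams(temperature=0.8, max_tokens=128))
    print(outputs[0].outputs[0].text)
```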
Huggingface Text Generation Inference (TGI) is not yet compatible with AWQ, but a PR is open which should bring support soon: TGI PR #781. For further support, and for contribution and donation details, see the first section of this document.

Original model card: Meta Llama 2's Llama 2 7B Chat

Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository for the 7B fine-tuned model, optimized for dialogue use cases and converted for the Hugging Face Transformers format. Links to other models can be found in the index at the bottom.

Model Details Note: Use of this model is governed by the Meta license. In order to download the model weights and tokenizer, please visit the website and accept our License before requesting access here. Meta developed and publicly released the Llama 2 family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama-2-Chat, are optimized for dialogue use cases. Llama-2-Chat models outperform open-source chat models on most benchmarks we tested, and in our human evaluations for helpfulness and safety, are on par with some popular closed-source models like ChatGPT and PaLM.

Variations Llama 2 comes in a range of parameter sizes — 7B, 13B, and 70B — as well as pretrained and fine-tuned variations.
Model Architecture Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety.

||Training Data|Params|Content Length|GQA|Tokens|LR|
|---|---|---|---|---|---|---|
|Llama 2|A new mix of publicly available online data|7B|4k|✗|2.0T|3.0 x 10^-4|
|Llama 2|A new mix of publicly available online data|13B|4k|✗|2.0T|3.0 x 10^-4|
|Llama 2|A new mix of publicly available online data|70B|4k|✔|2.0T|1.5 x 10^-4|

Llama 2 family of models. Token counts refer to pretraining data only. All models are trained with a global batch size of 4M tokens. The bigger 70B model uses Grouped-Query Attention (GQA) for improved inference scalability.

Model Dates Llama 2 was trained between January 2023 and July 2023.

Status This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback.

License A custom commercial license is available at: https://ai.meta.com/resources/models-and-libraries/llama-downloads/

Research Paper "Llama 2: Open Foundation and Fine-Tuned Chat Models"

Intended Use Cases Llama 2 is intended for commercial and research use in English. Tuned models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks. To get the expected features and performance for the chat versions, a specific formatting needs to be followed, including the `INST` and `<<SYS>>` tags, `BOS` and `EOS` tokens, and the whitespace and line breaks in between (we recommend calling `strip()` on inputs to avoid double-spaces). See our reference code on GitHub for details: `chat_completion`.

Out-of-scope Uses Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
Use in any other way that is prohibited by the Acceptable Use Policy and Licensing Agreement for Llama 2.

Hardware and Software Training Factors We used custom training libraries, Meta's Research Super Cluster, and production clusters for pretraining. Fine-tuning, annotation, and evaluation were also performed on third-party cloud compute.

Carbon Footprint Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB (TDP of 350-400W). Estimated total emissions were 539 tCO2eq, 100% of which were offset by Meta's sustainability program.

||Time (GPU hours)|Power Consumption (W)|Carbon Emitted (tCO2eq)|
|---|---|---|---|
|Llama 2 7B|184320|400|31.22|
|Llama 2 13B|368640|400|62.44|
|Llama 2 70B|1720320|400|291.42|
|Total|3311616||539.00|

CO2 emissions during pretraining. Time: total GPU time required for training each model. Power Consumption: peak power capacity per GPU device for the GPUs used, adjusted for power usage efficiency. 100% of the emissions are directly offset by Meta's sustainability program, and because we are openly releasing these models, the pretraining costs do not need to be incurred by others.

Training Data Overview Llama 2 was pretrained on 2 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over one million new human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data.

Data Freshness The pretraining data has a cutoff of September 2022, but some tuning data is more recent, up to July 2023.

In this section, we report the results for the Llama 1 and Llama 2 models on standard academic benchmarks. For all the evaluations, we use our internal evaluations library.
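The per-model Carbon Emitted figures above are consistent with GPU-hours x 400 W times a fixed grid carbon intensity of roughly 0.4235 kgCO2eq/kWh. A sketch; the intensity is inferred from the table itself, not a figure Meta states:

```python
# Reconstruct the Carbon Emitted column: GPU-hours x 400 W gives energy in kWh;
# multiplying by an (inferred) carbon intensity gives tonnes of CO2-equivalent.
POWER_KW = 0.400  # per-GPU power from the table above

def tco2eq(gpu_hours: float, intensity_kg_per_kwh: float = 0.4235) -> float:
    return gpu_hours * POWER_KW * intensity_kg_per_kwh / 1000.0  # tonnes

assert abs(tco2eq(184320) - 31.22) < 0.1     # Llama 2 7B
assert abs(tco2eq(368640) - 62.44) < 0.1     # Llama 2 13B
assert abs(tco2eq(1720320) - 291.42) < 0.2   # Llama 2 70B
```

(The Total row's 3311616 GPU hours exceeds the sum of the three listed rows, as in Meta's original card.)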
|Model|Size|Code|Commonsense Reasoning|World Knowledge|Reading Comprehension|Math|MMLU|BBH|AGI Eval|
|---|---|---|---|---|---|---|---|---|---|
|Llama 1|7B|14.1|60.8|46.2|58.5|6.95|35.1|30.3|23.9|
|Llama 1|13B|18.9|66.1|52.6|62.3|10.9|46.9|37.0|33.9|
|Llama 1|33B|26.0|70.0|58.4|67.6|21.4|57.8|39.8|41.7|
|Llama 1|65B|30.7|70.7|60.5|68.6|30.8|63.4|43.5|47.6|
|Llama 2|7B|16.8|63.9|48.9|61.3|14.6|45.3|32.6|29.3|
|Llama 2|13B|24.5|66.9|55.4|65.8|28.7|54.8|39.4|39.1|
|Llama 2|70B|37.5|71.9|63.6|69.4|35.2|68.9|51.2|54.2|

Overall performance on grouped academic benchmarks. Code: We report the average pass@1 scores of our models on HumanEval and MBPP. Commonsense Reasoning: We report the average of PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA, and CommonsenseQA. We report 7-shot results for CommonSenseQA and 0-shot results for all other benchmarks. World Knowledge: We evaluate the 5-shot performance on NaturalQuestions and TriviaQA and report the average. Reading Comprehension: For reading comprehension, we report the 0-shot average on SQuAD, QuAC, and BoolQ. Math: We report the average of the GSM8K (8-shot) and MATH (4-shot) benchmarks at top 1.

|||TruthfulQA|Toxigen|
|---|---|---|---|
|Llama 1|7B|27.42|23.00|
|Llama 1|13B|41.74|23.08|
|Llama 1|33B|44.19|22.57|
|Llama 1|65B|48.71|21.77|
|Llama 2|7B|33.29|21.25|
|Llama 2|13B|41.86|26.10|
|Llama 2|70B|50.18|24.60|

Evaluation of pretrained LLMs on automatic safety benchmarks. For TruthfulQA, we present the percentage of generations that are both truthful and informative (the higher the better). For ToxiGen, we present the percentage of toxic generations (the smaller the better).

|||TruthfulQA|Toxigen|
|---|---|---|---|
|Llama-2-Chat|7B|57.04|0.00|
|Llama-2-Chat|13B|62.18|0.00|
|Llama-2-Chat|70B|64.14|0.01|

Evaluation of fine-tuned LLMs on different safety datasets. Same metric definitions as above.

Ethical Considerations and Limitations Llama 2 is a new technology that carries risks with use.
Testing conducted to date has been in English, and has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Llama 2's potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Llama 2, developers should perform safety testing and tuning tailored to their specific applications of the model. Please see the Responsible Use Guide available at https://ai.meta.com/llama/responsible-use-guide/

Reporting Issues Please report any software "bug," or other problems with the models through one of the following means:
- Reporting issues with the model: github.com/facebookresearch/llama
- Reporting problematic content generated by the model: developers.facebook.com/llama_output_feedback
- Reporting bugs and security concerns: facebook.com/whitehat/info

Llama Model Index
|Model|Llama2|Llama2-hf|Llama2-chat|Llama2-chat-hf|
|---|---|---|---|---|
|7B|Link|Link|Link|Link|
|13B|Link|Link|Link|Link|
|70B|Link|Link|Link|Link|
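The `INST`/`<<SYS>>` chat formatting that the Intended Use section above requires can be sketched for a single turn as follows. This is a simplified sketch; Meta's `chat_completion` reference code handles multi-turn dialogues and token-level details (`BOS`/`EOS`):

```python
# Single-turn Llama-2-Chat prompt formatting, per the Intended Use notes above.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def llama2_chat_prompt(user_msg: str, system_msg: str = "") -> str:
    """strip() on inputs avoids the double-space issue the card warns about."""
    sys_part = f"{B_SYS}{system_msg.strip()}{E_SYS}" if system_msg else ""
    return f"{B_INST} {sys_part}{user_msg.strip()} {E_INST}"

assert llama2_chat_prompt("Hi") == "[INST] Hi [/INST]"
```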
WizardLM-Uncensored-SuperCOT-StoryTelling-30B-GGUF
WizardLM Uncensored SuperCOT Storytelling 30B - GGUF
- Model creator: YellowRoseCx
- Original model: WizardLM Uncensored SuperCOT Storytelling 30B

This repo contains GGUF format model files for Monero's WizardLM-Uncensored-SuperCOT-Storytelling-30B. GGUF is a new format introduced by the llama.cpp team on August 21st 2023, replacing GGML, which is no longer supported by llama.cpp; the list of clients and libraries known to support GGUF given in the Wizardlm 7B Uncensored section above applies here as well.

Repositories available:
- AWQ model(s) for GPU inference.
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference
- YellowRoseCx's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions

These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third-party UIs and libraries - please see the list at the top of this README. The quantisation methods used (GGML_TYPE_Q2_K through GGML_TYPE_Q6_K) are described in the Wizardlm 7B Uncensored section above. Refer to the Provided Files table below to see what files use which methods, and how.
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q2_K.gguf | Q2_K | 2 | 13.50 GB | 16.00 GB | smallest, significant quality loss - not recommended for most purposes |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q3_K_S.gguf | Q3_K_S | 3 | 14.06 GB | 16.56 GB | very small, high quality loss |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q3_K_M.gguf | Q3_K_M | 3 | 15.76 GB | 18.26 GB | very small, high quality loss |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q3_K_L.gguf | Q3_K_L | 3 | 17.28 GB | 19.78 GB | small, substantial quality loss |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q4_0.gguf | Q4_0 | 4 | 18.36 GB | 20.86 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q4_K_S.gguf | Q4_K_S | 4 | 18.44 GB | 20.94 GB | small, greater quality loss |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q4_K_M.gguf | Q4_K_M | 4 | 19.62 GB | 22.12 GB | medium, balanced quality - recommended |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q5_0.gguf | Q5_0 | 5 | 22.40 GB | 24.90 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q5_K_S.gguf | Q5_K_S | 5 | 22.40 GB | 24.90 GB | large, low quality loss - recommended |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q5_K_M.gguf | Q5_K_M | 5 | 23.05 GB | 25.55 GB | large, very low quality loss - recommended |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q6_K.gguf | Q6_K | 6 | 26.69 GB | 29.19 GB | very large, extremely low quality loss |
| WizardLM-Uncensored-SuperCOT-Storytelling.Q8_0.gguf | Q8_0 | 8 | 34.57 GB | 37.07 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo!
Multiple different quantisation formats are provided, and most users only want to pick and download a single file.

The following clients/libraries will automatically download models for you, providing a list of available models to choose from:
- LM Studio
- LoLLMS Web UI
- Faraday.dev

Under Download Model, you can enter the model repo: TheBloke/WizardLM-Uncensored-SuperCOT-StoryTelling-30B-GGUF and below it, a specific filename to download, such as: WizardLM-Uncensored-SuperCOT-Storytelling.Q4_K_M.gguf. On the command line, including downloading multiple files at once, I recommend using the `huggingface-hub` Python library. You can then download any individual model file to the current directory at high speed, and you can also download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows Command Line users: you can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.

Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU; remove it if you don't have GPU acceleration. Change `-c 2048` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation. Further instructions here: text-generation-webui/docs/llama.cpp.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries.
How to load this model in Python code, using ctransformers: run one of the following commands, according to your system. Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python, LangChain + ctransformers. For further support, and for contribution and donation details, see the first section of this document.

Original model card: Monero's WizardLM-Uncensored-SuperCOT-Storytelling-30B

This model is a triple model merge of WizardLM Uncensored + CoT + Storytelling, resulting in a comprehensive boost in reasoning and story-writing capabilities. You've become a compendium of knowledge on a vast array of topics. Lore Mastery is an arcane tradition fixated on understanding the underlying mechanics of magic. It is the most academic of all arcane traditions. The promise of uncovering new knowledge or proving (or discrediting) a theory of magic is usually required to rouse its practitioners from their laboratories, academies, and archives to pursue a life of adventure. Known as savants, followers of this tradition are a bookish lot who see beauty and mystery in the application of magic. The results of a spell are less interesting to them than the process that creates it. Some savants take a haughty attitude toward those who follow a tradition focused on a single school of magic, seeing them as provincial and lacking the sophistication needed to master true magic. Other savants are generous teachers, countering ignorance and deception with deep knowledge and good humor.
Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF
KafkaLM-70B-German-V0.1-GGUF
WhiteRabbitNeo-13B-AWQ
Pygmalion-2-13B-GGUF
openchat_3.5-GGUF
Llama-2-13B-GPTQ
WizardLM-13B-Uncensored-GGUF
TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z) Wizardlm 13B Uncensored - GGUF - Model creator: Eric Hartford - Original model: Wizardlm 13B Uncensored This repo contains GGUF format model files for Eric Hartford's Wizardlm 13B Uncensored. GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. Here is an incomplete list of clients and libraries that are known to support GGUF: llama.cpp. The source project for GGUF. Offers a CLI and a server option. text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration. KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling. LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration. LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection. Faraday.dev, an attractive and easy to use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration. ctransformers, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server. llama-cpp-python, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server. candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use. AWQ model(s) for GPU inference. 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference Eric Hartford's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third party UIs and libraries - please see the list at the top of this README.
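As a quick sanity check after a manual download, a GGUF file can be identified by the 4-byte magic `GGUF` at offset 0 (per the GGUF specification, the magic is followed by the format version). The helper below is a minimal illustrative sketch, not part of the original card:

```python
def is_gguf(path: str) -> bool:
    """Return True if the file starts with the 4-byte GGUF magic."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"
```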
The new methods available are: GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw). GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw. GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw. GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw. GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw. Refer to the Provided Files table below to see what files use which methods, and how.

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| WizardLM-13B-Uncensored.Q2_K.gguf | Q2_K | 2 | 5.43 GB | 7.93 GB | smallest, significant quality loss - not recommended for most purposes |
| WizardLM-13B-Uncensored.Q3_K_S.gguf | Q3_K_S | 3 | 5.66 GB | 8.16 GB | very small, high quality loss |
| WizardLM-13B-Uncensored.Q3_K_M.gguf | Q3_K_M | 3 | 6.34 GB | 8.84 GB | very small, high quality loss |
| WizardLM-13B-Uncensored.Q3_K_L.gguf | Q3_K_L | 3 | 6.93 GB | 9.43 GB | small, substantial quality loss |
| WizardLM-13B-Uncensored.Q4_0.gguf | Q4_0 | 4 | 7.37 GB | 9.87 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| WizardLM-13B-Uncensored.Q4_K_S.gguf | Q4_K_S | 4 | 7.41 GB | 9.91 GB | small, greater quality loss |
| WizardLM-13B-Uncensored.Q4_K_M.gguf | Q4_K_M | 4 | 7.87 GB | 10.37 GB | medium, balanced quality - recommended |
| WizardLM-13B-Uncensored.Q5_0.gguf | Q5_0 | 5 | 8.97 GB | 11.47 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| WizardLM-13B-Uncensored.Q5_K_S.gguf | Q5_K_S | 5 | 8.97 GB | 11.47 GB | large, low quality loss - recommended |
| WizardLM-13B-Uncensored.Q5_K_M.gguf | Q5_K_M | 5 | 9.23 GB | 11.73 GB | large, very low quality loss - recommended |
| WizardLM-13B-Uncensored.Q6_K.gguf | Q6_K | 6 | 10.68 GB | 13.18 GB | very large, extremely low quality loss |
| WizardLM-13B-Uncensored.Q8_0.gguf | Q8_0 | 8 | 13.83 GB | 16.33 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead. Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file. The following clients/libraries will automatically download models for you, providing a list of available models to choose from: - LM Studio - LoLLMS Web UI - Faraday.dev Under Download Model, you can enter the model repo: TheBloke/WizardLM-13B-Uncensored-GGUF and below it, a specific filename to download, such as: WizardLM-13B-Uncensored.Q4_K_M.gguf. On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library. Then you can download any individual model file to the current directory, at high speed, with a command like this: You can also download multiple files at once with a pattern: For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer`: And set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`: Windows Command Line users: You can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command. Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU.
Remove it if you don't have GPU acceleration. Change `-c 2048` to the desired sequence length. For extended sequence models - e.g. 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p ` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation. Further instructions here: text-generation-webui/docs/llama.cpp.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries. How to load this model in Python code, using ctransformers Run one of the following commands, according to your system: Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python LangChain + ctransformers For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute, it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.
Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Alicia Loh, Stephen Murray, K, Ajan Kanaga, RoA, Magnesian, Deo Leter, Olakabola, Eugene Pentland, zynix, Deep Realms, Raymond Fosdick, Elijah Stavena, Iucharbius, Erik Bjäreholt, Luis Javier Navarrete Lozano, Nicholas, theTransient, John Detwiler, alfiei, knownsqashed, Mano Prime, Willem Michiel, Enrico Ros, LangChain4j, OG, Michael Dempsey, Pierre Kircher, Pedro Madruga, James Bentley, Thomas Belote, Luke @flexchar, Leonard Tan, Johann-Peter Hartmann, Illia Dulskyi, Fen Risland, Chadd, SX, Jeff Scroggin, Ken Nordquist, Sean Connelly, Artur Olbinski, Swaroop Kallakuri, Jack West, Ai Maven, David Ziegler, Russ Johnson, transmissions 11, John Villwock, Alps Aficionado, Clay Pascal, Viktor Bowallius, Subspace Studios, Rainer Wilmers, Trenton Dambrowitz, vamX, Michael Levine, 준교 김, Brandon Frisco, Kalila, Trailburnt, Randy H, Talal Aujan, Nathan Dryer, Vadim, 阿明, ReadyPlayerEmma, Tiffany J. Kim, George Stoitzev, Spencer Kim, Jerry Meng, Gabriel Tamborski, Cory Kujawski, Jeffrey Morgan, Spiking Neurons AB, Edmond Seymore, Alexandros Triantafyllidis, Lone Striker, Cap'n Zoog, Nikolai Manek, danny, ya boyyy, Derek Yates, usrbinkat, Mandus, TL, Nathan LeClaire, subjectnull, Imad Khwaja, webtim, Raven Klaugh, Asp the Wyvern, Gabriel Puliatti, Caitlyn Gatomon, Joseph William Delisle, Jonathan Leane, Luke Pendergrass, SuperWojo, Sebastain Graf, Will Dee, Fred von Graf, Andrey, Dan Guido, Daniel P. Andersen, Nitin Borwankar, Elle, Vitor Caleffi, biorpg, jjj, NimbleBox.ai, Pieter, Matthew Berman, terasurfer, Michael Davis, Alex, Stanislav Ovsiannikov And thank you again to a16z for their generous grant. Original model card: Eric Hartford's Wizardlm 13B Uncensored This is WizardLM trained with a subset of the dataset - responses that contained alignment / moralizing were removed. 
The intent is to train a WizardLM that doesn't have alignment built-in, so that alignment (of any sort) can be added separately, for example with an RLHF LoRA. Shout out to the open source AI/ML community, and everyone who helped me out. Note: An uncensored model has no guardrails. You are responsible for anything you do with the model, just as you are responsible for anything you do with any dangerous object such as a knife, gun, lighter, or car. Publishing anything this model generates is the same as publishing it yourself. You are responsible for the content you publish, and you cannot blame the model any more than you can blame the knife, gun, lighter, or car for what you do with it.
zephyr-7B-beta-GGUF
TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z) Zephyr 7B Beta - GGUF - Model creator: Hugging Face H4 - Original model: Zephyr 7B Beta This repo contains GGUF format model files for Hugging Face H4's Zephyr 7B Beta. These files were quantised using hardware kindly provided by Massed Compute. GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. Here is an incomplete list of clients and libraries that are known to support GGUF: llama.cpp. The source project for GGUF. Offers a CLI and a server option. text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration. KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling. LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration. LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection. Faraday.dev, an attractive and easy to use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration. ctransformers, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server. llama-cpp-python, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server. candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use. AWQ model(s) for GPU inference. GPTQ models for GPU inference, with multiple quantisation parameter options.
2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference Hugging Face H4's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third party UIs and libraries - please see the list at the top of this README. GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw). GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw. GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw. GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw. GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw. Refer to the Provided Files table below to see what files use which methods, and how.
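The bpw figures quoted above follow from the super-block layout: 256 weights per super-block, per-block scales (plus, for "type-1" variants, per-block mins), and one fp16 super-block scale (plus min). The sketch below is illustrative arithmetic added in this edit, not from the card; it reproduces the Q3_K through Q6_K numbers, while Q2_K packs its metadata differently, so its quoted 2.5625 bpw does not fall out of this simple formula:

```python
def kquant_bpw(bits: int, n_blocks: int, scale_bits: int, has_mins: bool) -> float:
    """Effective bits per weight for a 256-weight k-quant super-block."""
    total = 256 * bits                # the quantized weights themselves
    total += n_blocks * scale_bits    # per-block scales
    total += 16                       # fp16 super-block scale
    if has_mins:                      # "type-1" variants also store mins
        total += n_blocks * scale_bits + 16
    return total / 256

print(kquant_bpw(3, 16, 6, False))  # Q3_K: 3.4375
print(kquant_bpw(4, 8, 6, True))    # Q4_K: 4.5
```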
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| zephyr-7b-beta.Q2_K.gguf | Q2_K | 2 | 3.08 GB | 5.58 GB | smallest, significant quality loss - not recommended for most purposes |
| zephyr-7b-beta.Q3_K_S.gguf | Q3_K_S | 3 | 3.16 GB | 5.66 GB | very small, high quality loss |
| zephyr-7b-beta.Q3_K_M.gguf | Q3_K_M | 3 | 3.52 GB | 6.02 GB | very small, high quality loss |
| zephyr-7b-beta.Q3_K_L.gguf | Q3_K_L | 3 | 3.82 GB | 6.32 GB | small, substantial quality loss |
| zephyr-7b-beta.Q4_0.gguf | Q4_0 | 4 | 4.11 GB | 6.61 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| zephyr-7b-beta.Q4_K_S.gguf | Q4_K_S | 4 | 4.14 GB | 6.64 GB | small, greater quality loss |
| zephyr-7b-beta.Q4_K_M.gguf | Q4_K_M | 4 | 4.37 GB | 6.87 GB | medium, balanced quality - recommended |
| zephyr-7b-beta.Q5_0.gguf | Q5_0 | 5 | 5.00 GB | 7.50 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| zephyr-7b-beta.Q5_K_S.gguf | Q5_K_S | 5 | 5.00 GB | 7.50 GB | large, low quality loss - recommended |
| zephyr-7b-beta.Q5_K_M.gguf | Q5_K_M | 5 | 5.13 GB | 7.63 GB | large, very low quality loss - recommended |
| zephyr-7b-beta.Q6_K.gguf | Q6_K | 6 | 5.94 GB | 8.44 GB | very large, extremely low quality loss |
| zephyr-7b-beta.Q8_0.gguf | Q8_0 | 8 | 7.70 GB | 10.20 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead. Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file. The following clients/libraries will automatically download models for you, providing a list of available models to choose from: Under Download Model, you can enter the model repo: TheBloke/zephyr-7B-beta-GGUF and below it, a specific filename to download, such as: zephyr-7b-beta.Q4_K_M.gguf.
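As the note says, the Max RAM column is roughly file size plus ~2.5 GB, and GPU offloading shifts part of that into VRAM. The helper below is an illustrative sketch added in this edit (the name and selection logic are not from the README) for picking the highest-quality zephyr file whose Max RAM figure fits a given budget:

```python
# (size GB, max RAM GB) pairs taken from the Provided Files table above
FILES = {
    "Q2_K":   (3.08, 5.58),
    "Q3_K_M": (3.52, 6.02),
    "Q4_K_M": (4.37, 6.87),
    "Q5_K_M": (5.13, 7.63),
    "Q6_K":   (5.94, 8.44),
    "Q8_0":   (7.70, 10.20),
}

def largest_quant_that_fits(ram_gb: float) -> str:
    """Pick the biggest (highest-quality) quant whose Max RAM figure fits."""
    fitting = [(ram, name) for name, (_, ram) in FILES.items() if ram <= ram_gb]
    if not fitting:
        raise ValueError("not enough RAM for any provided file")
    return max(fitting)[1]
```

For example, with 8 GB free this picks Q5_K_M; GPU offloading (`-ngl`) lets a larger file fit the same RAM budget.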
On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library. Then you can download any individual model file to the current directory, at high speed, with a command like this: You can also download multiple files at once with a pattern: For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer`: And set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`: Windows Command Line users: You can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command. Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 2048` to the desired sequence length. For extended sequence models - e.g. 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p ` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation. Further instructions here: text-generation-webui/docs/llama.cpp.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries. How to load this model in Python code, using ctransformers Run one of the following commands, according to your system: Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python LangChain + ctransformers For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.
If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Pierre Kircher, Stanislav Ovsiannikov, Michael Levine, Eugene Pentland, Andrey, 준교 김, Randy H, Fred von Graf, Artur Olbinski, Caitlyn Gatomon, terasurfer, Jeff Scroggin, James Bentley, Vadim, Gabriel Puliatti, Harry Royden McLaughlin, Sean Connelly, Dan Guido, Edmond Seymore, Alicia Loh, subjectnull, AzureBlack, Manuel Alberto Morcote, Thomas Belote, Lone Striker, Chris Smitley, Vitor Caleffi, Johann-Peter Hartmann, Clay Pascal, biorpg, Brandon Frisco, sidney chen, transmissions 11, Pedro Madruga, jinyuan sun, Ajan Kanaga, Emad Mostaque, Trenton Dambrowitz, Jonathan Leane, Iucharbius, usrbinkat, vamX, George Stoitzev, Luke Pendergrass, theTransient, Olakabola, Swaroop Kallakuri, Cap'n Zoog, Brandon Phillips, Michael Dempsey, Nikolai Manek, danny, Matthew Berman, Gabriel Tamborski, alfiei, Raymond Fosdick, Tom X Nguyen, Raven Klaugh, LangChain4j, Magnesian, Illia Dulskyi, David Ziegler, Mano Prime, Luis Javier Navarrete Lozano, Erik Bjäreholt, 阿明, Nathan Dryer, Alex, Rainer Wilmers, zynix, TL, Joseph William Delisle, John Villwock, Nathan LeClaire, Willem Michiel, Joguhyik, GodLy, OG, Alps Aficionado, Jeffrey Morgan, ReadyPlayerEmma, Tiffany J. 
Kim, Sebastain Graf, Spencer Kim, Michael Davis, webtim, Talal Aujan, knownsqashed, John Detwiler, Imad Khwaja, Deo Leter, Jerry Meng, Elijah Stavena, Rooh Singh, Pieter, SuperWojo, Alexandros Triantafyllidis, Stephen Murray, Ai Maven, ya boyyy, Enrico Ros, Ken Nordquist, Deep Realms, Nicholas, Spiking Neurons AB, Elle, Will Dee, Jack West, RoA, Luke @flexchar, Viktor Bowallius, Derek Yates, Subspace Studios, jjj, Toran Billups, Asp the Wyvern, Fen Risland, Ilya, NimbleBox.ai, Chadd, Nitin Borwankar, Emre, Mandus, Leonard Tan, Kalila, K, Trailburnt, SX, Cory Kujawski And thank you again to a16z for their generous grant. Original model card: Hugging Face H4's Zephyr 7B Beta Zephyr is a series of language models that are trained to act as helpful assistants. Zephyr-7B-β is the second model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO). We found that removing the in-built alignment of these datasets boosted performance on MT Bench and made the model more helpful. However, this means the model is likely to generate problematic text when prompted to do so, and should only be used for educational and research purposes. You can find more details in the technical report. - Model type: A 7B parameter GPT-like model fine-tuned on a mix of publicly available, synthetic datasets.
- Language(s) (NLP): Primarily English - License: MIT - Finetuned from model: mistralai/Mistral-7B-v0.1 - Repository: https://github.com/huggingface/alignment-handbook - Demo: https://huggingface.co/spaces/HuggingFaceH4/zephyr-chat - Chatbot Arena: Evaluate Zephyr 7B against 10+ LLMs in the LMSYS arena: http://arena.lmsys.org At the time of release, Zephyr-7B-β is the highest-ranked 7B chat model on the MT-Bench and AlpacaEval benchmarks:

| Model | Size | Alignment | MT-Bench (score) | AlpacaEval (win rate %) |
|-------------|-----|----|---------------|--------------|
| StableLM-Tuned-α | 7B | dSFT | 2.75 | - |
| MPT-Chat | 7B | dSFT | 5.42 | - |
| Xwin-LM v0.1 | 7B | dPPO | 6.19 | 87.83 |
| Mistral-Instruct v0.1 | 7B | - | 6.84 | - |
| Zephyr-7b-α | 7B | dDPO | 6.88 | - |
| Zephyr-7b-β 🪁 | 7B | dDPO | 7.34 | 90.60 |
| Falcon-Instruct | 40B | dSFT | 5.17 | 45.71 |
| Guanaco | 65B | SFT | 6.41 | 71.80 |
| Llama2-Chat | 70B | RLHF | 6.86 | 92.66 |
| Vicuna v1.3 | 33B | dSFT | 7.12 | 88.99 |
| WizardLM v1.0 | 70B | dSFT | 7.71 | - |
| Xwin-LM v0.1 | 70B | dPPO | - | 95.57 |
| GPT-3.5-turbo | - | RLHF | 7.94 | 89.37 |
| Claude 2 | - | RLHF | 8.06 | 91.36 |
| GPT-4 | - | RLHF | 8.99 | 95.28 |

In particular, on several categories of MT-Bench, Zephyr-7B-β has strong performance compared to larger open models like Llama2-Chat-70B. However, on more complex tasks like coding and mathematics, Zephyr-7B-β lags behind proprietary models, and more research is needed to close the gap. The model was initially fine-tuned on a filtered and preprocessed version of the `UltraChat` dataset, which contains a diverse range of synthetic dialogues generated by ChatGPT. We then further aligned the model with 🤗 TRL's `DPOTrainer` on the openbmb/UltraFeedback dataset, which contains 64k prompts and model completions that are ranked by GPT-4. As a result, the model can be used for chat and you can check out our demo to test its capabilities.
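Zephyr's published chat format wraps each turn in `<|system|>`, `<|user|>` and `<|assistant|>` tags separated by `</s>`. A minimal single-turn builder, as an illustrative sketch (`zephyr_prompt` is this edit's name, not the model card's):

```python
def zephyr_prompt(user: str, system: str = "") -> str:
    """Build a single-turn prompt in the Zephyr chat format."""
    return (
        f"<|system|>\n{system}</s>\n"
        f"<|user|>\n{user}</s>\n"
        f"<|assistant|>\n"
    )
```

With recent versions of 🤗 Transformers, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` produces this format from a list of role/content messages, so it is usually preferable to hand-built strings.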
You can find the datasets used for training Zephyr-7B-β here. Here's how you can run the model using the `pipeline()` function from 🤗 Transformers: Zephyr-7B-β has not been aligned to human preferences with techniques like RLHF or deployed with in-the-loop filtering of responses like ChatGPT, so the model can produce problematic outputs (especially when prompted to do so). It is also unknown what the size and composition of the corpus used to train the base model (`mistralai/Mistral-7B-v0.1`) were; however, it is likely to have included a mix of Web data and technical sources like books and code. See the Falcon 180B model card for an example of this. During DPO training, this model achieves the following results on the evaluation set: - Loss: 0.7496 - Rewards/chosen: -4.5221 - Rewards/rejected: -8.3184 - Rewards/accuracies: 0.7812 - Rewards/margins: 3.7963 - Logps/rejected: -340.1541 - Logps/chosen: -299.4561 - Logits/rejected: -2.3081 - Logits/chosen: -2.3531 The following hyperparameters were used during training: - learning_rate: 5e-07 - train_batch_size: 2 - eval_batch_size: 4 - seed: 42 - distributed_type: multi-GPU - num_devices: 16 - total_train_batch_size: 32 - total_eval_batch_size: 64 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 - lr_scheduler_type: linear - lr_scheduler_warmup_ratio: 0.1 - num_epochs: 3.0 The table below shows the full set of DPO training metrics:

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6284 | 0.05 | 100 | 0.6098 | 0.0425 | -0.1872 | 0.7344 | 0.2297 | -258.8416 | -253.8099 | -2.7976 | -2.8234 |
| 0.4908 | 0.1 | 200 | 0.5426 | -0.0279 | -0.6842 | 0.75 | 0.6563 | -263.8124 | -254.5145 | -2.7719 | -2.7960 |
| 0.5264 | 0.15 | 300 | 0.5324 | 0.0414 | -0.9793 | 0.7656 | 1.0207 | -266.7627 | -253.8209 | -2.7892 | -2.8122 |
| 0.5536 | 0.21 | 400 | 0.4957 | -0.0185 | -1.5276 | 0.7969 | 1.5091 | -272.2460 | -254.4203 | -2.8542 | -2.8764 |
| 0.5362 | 0.26 | 500 | 0.5031 | -0.2630 | -1.5917 | 0.7812 | 1.3287 | -272.8869 | -256.8653 | -2.8702 | -2.8958 |
| 0.5966 | 0.31 | 600 | 0.5963 | -0.2993 | -1.6491 | 0.7812 | 1.3499 | -273.4614 | -257.2279 | -2.8778 | -2.8986 |
| 0.5014 | 0.36 | 700 | 0.5382 | -0.2859 | -1.4750 | 0.75 | 1.1891 | -271.7204 | -257.0942 | -2.7659 | -2.7869 |
| 0.5334 | 0.41 | 800 | 0.5677 | -0.4289 | -1.8968 | 0.7969 | 1.4679 | -275.9378 | -258.5242 | -2.7053 | -2.7265 |
| 0.5251 | 0.46 | 900 | 0.5772 | -0.2116 | -1.3107 | 0.7344 | 1.0991 | -270.0768 | -256.3507 | -2.8463 | -2.8662 |
| 0.5205 | 0.52 | 1000 | 0.5262 | -0.3792 | -1.8585 | 0.7188 | 1.4793 | -275.5552 | -258.0276 | -2.7893 | -2.7979 |
| 0.5094 | 0.57 | 1100 | 0.5433 | -0.6279 | -1.9368 | 0.7969 | 1.3089 | -276.3377 | -260.5136 | -2.7453 | -2.7536 |
| 0.5837 | 0.62 | 1200 | 0.5349 | -0.3780 | -1.9584 | 0.7656 | 1.5804 | -276.5542 | -258.0154 | -2.7643 | -2.7756 |
| 0.5214 | 0.67 | 1300 | 0.5732 | -1.0055 | -2.2306 | 0.7656 | 1.2251 | -279.2761 | -264.2903 | -2.6986 | -2.7113 |
| 0.6914 | 0.72 | 1400 | 0.5137 | -0.6912 | -2.1775 | 0.7969 | 1.4863 | -278.7448 | -261.1467 | -2.7166 | -2.7275 |
| 0.4655 | 0.77 | 1500 | 0.5090 | -0.7987 | -2.2930 | 0.7031 | 1.4943 | -279.8999 | -262.2220 | -2.6651 | -2.6838 |
| 0.5731 | 0.83 | 1600 | 0.5312 | -0.8253 | -2.3520 | 0.7812 | 1.5268 | -280.4902 | -262.4876 | -2.6543 | -2.6728 |
| 0.5233 | 0.88 | 1700 | 0.5206 | -0.4573 | -2.0951 | 0.7812 | 1.6377 | -277.9205 | -258.8084 | -2.6870 | -2.7097 |
| 0.5593 | 0.93 | 1800 | 0.5231 | -0.5508 | -2.2000 | 0.7969 | 1.6492 | -278.9703 | -259.7433 | -2.6221 | -2.6519 |
| 0.4967 | 0.98 | 1900 | 0.5290 | -0.5340 | -1.9570 | 0.8281 | 1.4230 | -276.5395 | -259.5749 | -2.6564 | -2.6878 |
| 0.0921 | 1.03 | 2000 | 0.5368 | -1.1376 | -3.1615 | 0.7812 | 2.0239 | -288.5854 | -265.6111 | -2.6040 | -2.6345 |
| 0.0733 | 1.08 | 2100 | 0.5453 | -1.1045 | -3.4451 | 0.7656 | 2.3406 | -291.4208 | -265.2799 | -2.6289 | -2.6595 |
| 0.0972 | 1.14 | 2200 | 0.5571 | -1.6915 | -3.9823 | 0.8125 | 2.2908 | -296.7934 | -271.1505 | -2.6471 | -2.6709 |
| 0.1058 | 1.19 | 2300 | 0.5789 | -1.0621 | -3.8941 | 0.7969 | 2.8319 | -295.9106 | -264.8563 | -2.5527 | -2.5798 |
| 0.2423 | 1.24 | 2400 | 0.5455 | -1.1963 | -3.5590 | 0.7812 | 2.3627 | -292.5599 | -266.1981 | -2.5414 | -2.5784 |
| 0.1177 | 1.29 | 2500 | 0.5889 | -1.8141 | -4.3942 | 0.7969 | 2.5801 | -300.9120 | -272.3761 | -2.4802 | -2.5189 |
| 0.1213 | 1.34 | 2600 | 0.5683 | -1.4608 | -3.8420 | 0.8125 | 2.3812 | -295.3901 | -268.8436 | -2.4774 | -2.5207 |
| 0.0889 | 1.39 | 2700 | 0.5890 | -1.6007 | -3.7337 | 0.7812 | 2.1330 | -294.3068 | -270.2423 | -2.4123 | -2.4522 |
| 0.0995 | 1.45 | 2800 | 0.6073 | -1.5519 | -3.8362 | 0.8281 | 2.2843 | -295.3315 | -269.7538 | -2.4685 | -2.5050 |
| 0.1145 | 1.5 | 2900 | 0.5790 | -1.7939 | -4.2876 | 0.8438 | 2.4937 | -299.8461 | -272.1744 | -2.4272 | -2.4674 |
| 0.0644 | 1.55 | 3000 | 0.5735 | -1.7285 | -4.2051 | 0.8125 | 2.4766 | -299.0209 | -271.5201 | -2.4193 | -2.4574 |
| 0.0798 | 1.6 | 3100 | 0.5537 | -1.7226 | -4.2850 | 0.8438 | 2.5624 | -299.8200 | -271.4610 | -2.5367 | -2.5696 |
| 0.1013 | 1.65 | 3200 | 0.5575 | -1.5715 | -3.9813 | 0.875 | 2.4098 | -296.7825 | -269.9498 | -2.4926 | -2.5267 |
| 0.1254 | 1.7 | 3300 | 0.5905 | -1.6412 | -4.4703 | 0.8594 | 2.8291 | -301.6730 | -270.6473 | -2.5017 | -2.5340 |
| 0.085 | 1.76 | 3400 | 0.6133 | -1.9159 | -4.6760 | 0.8438 | 2.7601 | -303.7296 | -273.3941 | -2.4614 | -2.4960 |
| 0.065 | 1.81 | 3500 | 0.6074 | -1.8237 | -4.3525 | 0.8594 | 2.5288 | -300.4951 | -272.4724 | -2.4597 | -2.5004 |
| 0.0755 | 1.86 | 3600 | 0.5836 | -1.9252 | -4.4005 | 0.8125 | 2.4753 | -300.9748 | -273.4872 | -2.4327 | -2.4716 |
| 0.0746 | 1.91 | 3700 | 0.5789 | -1.9280 | -4.4906 | 0.8125 | 2.5626 | -301.8762 | -273.5149 | -2.4686 | -2.5115 |
| 0.1348 | 1.96 | 3800 | 0.6015 | -1.8658 | -4.2428 | 0.8281 | 2.3769 | -299.3976 | -272.8936 | -2.4943 | -2.5393 |
| 0.0217 | 2.01 | 3900 | 0.6122 | -2.3335 | -4.9229 | 0.8281 | 2.5894 | -306.1988 | -277.5699 | -2.4841 | -2.5272 |
| 0.0219 | 2.07 | 4000 | 0.6522 | -2.9890 | -6.0164 | 0.8281 | 3.0274 | -317.1334 | -284.1248 | -2.4105 | -2.4545 |
| 0.0119 | 2.12 | 4100 | 0.6922 | -3.4777 | -6.6749 | 0.7969 | 3.1972 | -323.7187 | -289.0121 | -2.4272 | -2.4699 |
| 0.0153 | 2.17 | 4200 | 0.6993 | -3.2406 | -6.6775 | 0.7969 | 3.4369 | -323.7453 | -286.6413 | -2.4047 | -2.4465 |
| 0.011 | 2.22 | 4300 | 0.7178 | -3.7991 | -7.4397 | 0.7656 | 3.6406 | -331.3667 | -292.2260 | -2.3843 | -2.4290 |
| 0.0072 | 2.27 | 4400 | 0.6840 | -3.3269 | -6.8021 | 0.8125 | 3.4752 | -324.9908 | -287.5042 | -2.4095 | -2.4536 |
| 0.0197 | 2.32 | 4500 | 0.7013 | -3.6890 | -7.3014 | 0.8125 | 3.6124 | -329.9841 | -291.1250 | -2.4118 | -2.4543 |
| 0.0182 | 2.37 | 4600 | 0.7476 | -3.8994 | -7.5366 | 0.8281 | 3.6372 | -332.3356 | -293.2291 | -2.4163 | -2.4565 |
| 0.0125 | 2.43 | 4700 | 0.7199 | -4.0560 | -7.5765 | 0.8438 | 3.5204 | -332.7345 | -294.7952 | -2.3699 | -2.4100 |
| 0.0082 | 2.48 | 4800 | 0.7048 | -3.6613 | -7.1356 | 0.875 | 3.4743 | -328.3255 | -290.8477 | -2.3925 | -2.4303 |
| 0.0118 | 2.53 | 4900 | 0.6976 | -3.7908 | -7.3152 | 0.8125 | 3.5244 | -330.1224 | -292.1431 | -2.3633 | -2.4047 |
| 0.0118 | 2.58 | 5000 | 0.7198 | -3.9049 | -7.5557 | 0.8281 | 3.6508 | -332.5271 | -293.2844 | -2.3764 | -2.4194 |
| 0.006 | 2.63 | 5100 | 0.7506 | -4.2118 | -7.9149 | 0.8125 | 3.7032 | -336.1194 | -296.3530 | -2.3407 | -2.3860 |
| 0.0143 | 2.68 | 5200 | 0.7408 | -4.2433 | -7.9802 | 0.8125 | 3.7369 | -336.7721 | -296.6682 | -2.3509 | -2.3946 |
| 0.0057 | 2.74 | 5300 | 0.7552 | -4.3392 | -8.0831 | 0.7969 | 3.7439 | -337.8013 | -297.6275 | -2.3388 | -2.3842 |
| 0.0138 | 2.79 | 5400 | 0.7404 | -4.2395 | -7.9762 | 0.8125 | 3.7367 | -336.7322 | -296.6304 | -2.3286 | -2.3737 |
| 0.0079 | 2.84 | 5500 | 0.7525 | -4.4466 | -8.2196 | 0.7812 | 3.7731 | -339.1662 | -298.7007 | -2.3200 | -2.3641 |
| 0.0077 | 2.89 | 5600 | 0.7520 | -4.5586 | -8.3485 | 0.7969 | 3.7899 | -340.4545 | -299.8206 | -2.3078 | -2.3517 |
| 0.0094 | 2.94 | 5700 | 0.7527 | -4.5542 | -8.3509 | 0.7812 | 3.7967 | -340.4790 | -299.7773 | -2.3062 | -2.3510 |
| 0.0054 | 2.99 | 5800 | 0.7520 | -4.5169 | -8.3079 | 0.7812 | 3.7911 | -340.0493 | -299.4038 | -2.3081 | -2.3530 |

- Transformers 4.35.0.dev0 - Pytorch 2.0.1+cu118 - Datasets 2.12.0 - Tokenizers 0.14.0 If you find Zephyr-7B-β useful in your work, please cite it with:
CodeLlama-13B-oasst-sft-v10-GGUF
deepseek-coder-6.7B-base-GGUF
Mixtral-8x7B-MoE-RP-Story-GGUF
dolphin-2.0-mistral-7B-GGUF
TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z)

Dolphin 2.0 Mistral 7B - GGUF

- Model creator: Eric Hartford
- Original model: Dolphin 2.0 Mistral 7B

This repo contains GGUF format model files for Eric Hartford's Dolphin 2.0 Mistral 7B.

GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp.

Here is an incomplete list of clients and libraries that are known to support GGUF:

- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy-to-use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- ctransformers, a Python library with GPU accel, LangChain support, and an OpenAI-compatible AI server.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.

Repositories available:

- AWQ model(s) for GPU inference.
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference
- Eric Hartford's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions

These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third party UIs and libraries - please see the list at the top of this README.

The new methods available are:

- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.
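The bpw figures above map directly to file sizes. A quick sanity check in plain arithmetic (the ~7.24B parameter count for a Mistral-7B model is an assumption, not stated in this README; real files also mix tensor types and carry metadata, so actual sizes run a little higher than pure bpw maths):

```python
def gguf_size_gb(n_params: float, bpw: float) -> float:
    """Approximate GGUF file size in GB: parameters x bits-per-weight / 8 bits per byte."""
    return n_params * bpw / 8 / 1e9

# Assumed weight count for Mistral-7B (~7.24e9); not from this README.
for name, bpw in [("Q2_K", 2.5625), ("Q3_K", 3.4375), ("Q4_K", 4.5),
                  ("Q5_K", 5.5), ("Q6_K", 6.5625)]:
    print(f"{name}: ~{gguf_size_gb(7.24e9, bpw):.2f} GB")
```

The Q6_K estimate (~5.94 GB) lines up with the Q6_K file listed in the table below; Q4_K_M comes out a little under its listed 4.37 GB because that file stores some tensors at higher precision.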
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| dolphin-2.0-mistral-7b.Q2_K.gguf | Q2_K | 2 | 3.08 GB | 5.58 GB | smallest, significant quality loss - not recommended for most purposes |
| dolphin-2.0-mistral-7b.Q3_K_S.gguf | Q3_K_S | 3 | 3.16 GB | 5.66 GB | very small, high quality loss |
| dolphin-2.0-mistral-7b.Q3_K_M.gguf | Q3_K_M | 3 | 3.52 GB | 6.02 GB | very small, high quality loss |
| dolphin-2.0-mistral-7b.Q3_K_L.gguf | Q3_K_L | 3 | 3.82 GB | 6.32 GB | small, substantial quality loss |
| dolphin-2.0-mistral-7b.Q4_0.gguf | Q4_0 | 4 | 4.11 GB | 6.61 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| dolphin-2.0-mistral-7b.Q4_K_S.gguf | Q4_K_S | 4 | 4.14 GB | 6.64 GB | small, greater quality loss |
| dolphin-2.0-mistral-7b.Q4_K_M.gguf | Q4_K_M | 4 | 4.37 GB | 6.87 GB | medium, balanced quality - recommended |
| dolphin-2.0-mistral-7b.Q5_0.gguf | Q5_0 | 5 | 5.00 GB | 7.50 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| dolphin-2.0-mistral-7b.Q5_K_S.gguf | Q5_K_S | 5 | 5.00 GB | 7.50 GB | large, low quality loss - recommended |
| dolphin-2.0-mistral-7b.Q5_K_M.gguf | Q5_K_M | 5 | 5.13 GB | 7.63 GB | large, very low quality loss - recommended |
| dolphin-2.0-mistral-7b.Q6_K.gguf | Q6_K | 6 | 5.94 GB | 8.44 GB | very large, extremely low quality loss |
| dolphin-2.0-mistral-7b.Q8_0.gguf | Q8_0 | 8 | 7.70 GB | 10.20 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.
The following clients/libraries will automatically download models for you, providing a list of available models to choose from:

- LM Studio
- LoLLMS Web UI
- Faraday.dev

Under Download Model, you can enter the model repo: TheBloke/dolphin-2.0-mistral-7B-GGUF and below it, a specific filename to download, such as: dolphin-2.0-mistral-7b.Q4_K_M.gguf.

On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library. You can then download any individual model file to the current directory, at high speed, or download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI.

To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows Command Line users: you can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.

Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 2048` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation.

Further instructions here: text-generation-webui/docs/llama.cpp.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries.
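The download commands referenced above did not survive the page conversion. As a sketch of the same flow via the `huggingface_hub` Python API (`hf_hub_download` is the library's real entry point; the filename-derivation helper below is just an observed naming convention in these repos, not an API guarantee):

```python
def gguf_filename(repo_id: str, quant: str) -> str:
    """Derive the single-file name used in TheBloke's GGUF repos, e.g.
    TheBloke/dolphin-2.0-mistral-7B-GGUF -> dolphin-2.0-mistral-7b.Q4_K_M.gguf.
    This is an observed naming convention, not guaranteed for every repo."""
    base = repo_id.split("/")[-1].lower()
    if base.endswith("-gguf"):
        base = base[: -len("-gguf")]
    return f"{base}.{quant}.gguf"

if __name__ == "__main__":
    # Needs `pip install huggingface-hub`; this fetches a ~4 GB file.
    from huggingface_hub import hf_hub_download

    repo = "TheBloke/dolphin-2.0-mistral-7B-GGUF"
    path = hf_hub_download(repo_id=repo, filename=gguf_filename(repo, "Q4_K_M"))
    print(path)
```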
How to load this model in Python code, using ctransformers Run one of the following commands, according to your system: Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python LangChain + ctransformers For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Pierre Kircher, Stanislav Ovsiannikov, Michael Levine, Eugene Pentland, Andrey, 준교 김, Randy H, Fred von Graf, Artur Olbinski, Caitlyn Gatomon, terasurfer, Jeff Scroggin, James Bentley, Vadim, Gabriel Puliatti, Harry Royden McLaughlin, Sean Connelly, Dan Guido, Edmond Seymore, Alicia Loh, subjectnull, AzureBlack, Manuel Alberto Morcote, Thomas Belote, Lone Striker, Chris Smitley, Vitor Caleffi, Johann-Peter Hartmann, Clay Pascal, biorpg, Brandon Frisco, sidney chen, transmissions 11, Pedro Madruga, jinyuan sun, Ajan Kanaga, Emad Mostaque, Trenton Dambrowitz, Jonathan Leane, Iucharbius, usrbinkat, vamX, George Stoitzev, Luke Pendergrass, theTransient, Olakabola, Swaroop Kallakuri, Cap'n Zoog, Brandon Phillips, Michael Dempsey, Nikolai Manek, danny, Matthew Berman, Gabriel Tamborski, alfiei, Raymond Fosdick, Tom X Nguyen, Raven Klaugh, LangChain4j, Magnesian, Illia Dulskyi, David Ziegler, Mano Prime, Luis Javier Navarrete Lozano, Erik Bjäreholt, 阿明, Nathan Dryer, Alex, Rainer Wilmers, zynix, TL, Joseph William Delisle, 
John Villwock, Nathan LeClaire, Willem Michiel, Joguhyik, GodLy, OG, Alps Aficionado, Jeffrey Morgan, ReadyPlayerEmma, Tiffany J. Kim, Sebastain Graf, Spencer Kim, Michael Davis, webtim, Talal Aujan, knownsqashed, John Detwiler, Imad Khwaja, Deo Leter, Jerry Meng, Elijah Stavena, Rooh Singh, Pieter, SuperWojo, Alexandros Triantafyllidis, Stephen Murray, Ai Maven, ya boyyy, Enrico Ros, Ken Nordquist, Deep Realms, Nicholas, Spiking Neurons AB, Elle, Will Dee, Jack West, RoA, Luke @flexchar, Viktor Bowallius, Derek Yates, Subspace Studios, jjj, Toran Billups, Asp the Wyvern, Fen Risland, Ilya, NimbleBox.ai, Chadd, Nitin Borwankar, Emre, Mandus, Leonard Tan, Kalila, K, Trailburnt, SX, Cory Kujawski

And thank you again to a16z for their generous grant.

Original model card: Eric Hartford's Dolphin 2.0 Mistral 7B

Dolphin-2.0-mistral-7b's training was sponsored by a16z. This model is based on Mistral AI's Mistral-7B, so it is suitable for commercial or non-commercial use.

This model is uncensored. I have filtered the dataset to remove alignment and bias. This makes the model more compliant. You are advised to implement your own alignment layer before exposing the model as a service. It will be highly compliant with any requests, even unethical ones. Please read my blog post about uncensored models: https://erichartford.com/uncensored-models. You are responsible for any content you create using this model. Enjoy responsibly.

Dataset

This dataset is Dolphin, an open-source implementation of Microsoft's Orca. I modified the dataset for uncensoring, deduping, cleaning, and quality. I added Jon Durbin's excellent Airoboros dataset to increase creativity.

Training

It took 48 hours to train 10 epochs on 4x A100s.

Prompt format

This model (and all my future releases) uses the ChatML prompt format.

Gratitude

- This model was made possible by the generous sponsorship of a16z.
- Thank you to Microsoft for authoring the Orca paper and inspiring this work.
- Special thanks to WingLian and TheBloke for helpful advice
- Thank you to all the other people in the Open Source AI community who have taught me and helped me along the way.
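Since the card above says Dolphin uses the ChatML prompt format, here is a minimal ChatML builder (the system message below is only an illustration, not the model's official one):

```python
def chatml(system: str, user: str) -> str:
    """Assemble a single-turn ChatML prompt; a multi-turn history chains
    more <|im_start|>...<|im_end|> blocks in the same way."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(chatml("You are Dolphin, a helpful AI assistant.", "Why is the sky blue?"))
```

The trailing `<|im_start|>assistant\n` leaves the cursor where the model is expected to generate its reply.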
WhiteRabbitNeo-13B-GGUF
CodeLlama-7B-Python-GGUF
deepseek-coder-33B-instruct-AWQ
CodeLlama-34B-GGUF
CausalLM-14B-GGUF
Wizard-Vicuna-30B-Uncensored-GGUF
guanaco-7B-HF
dolphin-2_6-phi-2-GGUF
Everyone-Coder-33B-Base-GPTQ
dolphin-2.6-mistral-7B-GPTQ
Llama-2-70B-Chat-GPTQ
CodeLlama-70B-Instruct-GGUF
Yarn-Mistral-7B-128k-GGUF
llama2_7b_chat_uncensored-GGUF
Speechless-Llama2-Hermes-Orca-Platypus-WizardLM-13B-GGUF
WizardLM-13B-V1.2-GGUF
Mistral-7B-v0.1-AWQ
em_german_mistral_v01-GGUF
Llama-2-70B-Chat-GGUF
Wizard-Vicuna-7B-Uncensored-GGUF
CodeLlama-34B-Python-GGUF
SynthIA-70B-v1.5-AWQ
Psyfighter-13B-GGUF
Starling-LM-7B-alpha-GGUF
Pygmalion-2-7B-GGUF
Pygmalion 2 7B - GGUF

- Model creator: PygmalionAI
- Original model: Pygmalion 2 7B

This repo contains GGUF format model files for PygmalionAI's Pygmalion 2 7B.

GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. GGUF offers numerous advantages over GGML, such as better tokenisation and support for special tokens. It also supports metadata, and is designed to be extensible.

Here is an incomplete list of clients and libraries that are known to support GGUF:

- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy-to-use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- ctransformers, a Python library with GPU accel, LangChain support, and an OpenAI-compatible AI server.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.

Repositories available:

- AWQ model(s) for GPU inference.
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference
- PygmalionAI's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions

The model has been trained on prompts using three different roles, which are denoted by the following tokens: `<|system|>`, `<|user|>` and `<|model|>`. The `<|system|>` prompt can be used to inject out-of-channel information behind the scenes, while the `<|user|>` prompt should be used to indicate user input. The `<|model|>` token should then be used to indicate that the model should generate a response. These tokens can happen multiple times and be chained up to form a conversation history. The system prompt has been designed to allow the model to "enter" various modes and dictate the reply length. Here's an example:

These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d36d5be95a0d9088b674dbb27354107221. They are also compatible with many third party UIs and libraries - please see the list at the top of this README.

The new methods available are:

- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| pygmalion-2-7b.Q2_K.gguf | Q2_K | 2 | 2.83 GB | 5.33 GB | smallest, significant quality loss - not recommended for most purposes |
| pygmalion-2-7b.Q3_K_S.gguf | Q3_K_S | 3 | 2.95 GB | 5.45 GB | very small, high quality loss |
| pygmalion-2-7b.Q3_K_M.gguf | Q3_K_M | 3 | 3.30 GB | 5.80 GB | very small, high quality loss |
| pygmalion-2-7b.Q3_K_L.gguf | Q3_K_L | 3 | 3.60 GB | 6.10 GB | small, substantial quality loss |
| pygmalion-2-7b.Q4_0.gguf | Q4_0 | 4 | 3.83 GB | 6.33 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| pygmalion-2-7b.Q4_K_S.gguf | Q4_K_S | 4 | 3.86 GB | 6.36 GB | small, greater quality loss |
| pygmalion-2-7b.Q4_K_M.gguf | Q4_K_M | 4 | 4.08 GB | 6.58 GB | medium, balanced quality - recommended |
| pygmalion-2-7b.Q5_0.gguf | Q5_0 | 5 | 4.65 GB | 7.15 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| pygmalion-2-7b.Q5_K_S.gguf | Q5_K_S | 5 | 4.65 GB | 7.15 GB | large, low quality loss - recommended |
| pygmalion-2-7b.Q5_K_M.gguf | Q5_K_M | 5 | 4.78 GB | 7.28 GB | large, very low quality loss - recommended |
| pygmalion-2-7b.Q6_K.gguf | Q6_K | 6 | 5.53 GB | 8.03 GB | very large, extremely low quality loss |
| pygmalion-2-7b.Q8_0.gguf | Q8_0 | 8 | 7.16 GB | 9.66 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.
The following clients/libraries will automatically download models for you, providing a list of available models to choose from:

- LM Studio
- LoLLMS Web UI
- Faraday.dev

Under Download Model, you can enter the model repo: TheBloke/Pygmalion-2-7B-GGUF and below it, a specific filename to download, such as: pygmalion-2-7b.Q4_K_M.gguf.

On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library. You can then download any individual model file to the current directory, at high speed, or download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI.

To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows CLI users: use `set HF_HUB_ENABLE_HF_TRANSFER=1` before running the download command.

Make sure you are using `llama.cpp` from commit d0cee0d36d5be95a0d9088b674dbb27354107221 or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 4096` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation.

Further instructions here: text-generation-webui/docs/llama.cpp.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries.
How to load this model from Python using ctransformers: simple example code to load one of these GGUF models.

Here are guides on using llama-cpp-python or ctransformers with LangChain:

- LangChain + llama-cpp-python
- LangChain + ctransformers

For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI

Patreon special mentions: Alicia Loh, Stephen Murray, K, Ajan Kanaga, RoA, Magnesian, Deo Leter, Olakabola, Eugene Pentland, zynix, Deep Realms, Raymond Fosdick, Elijah Stavena, Iucharbius, Erik Bjäreholt, Luis Javier Navarrete Lozano, Nicholas, theTransient, John Detwiler, alfiei, knownsqashed, Mano Prime, Willem Michiel, Enrico Ros, LangChain4j, OG, Michael Dempsey, Pierre Kircher, Pedro Madruga, James Bentley, Thomas Belote, Luke @flexchar, Leonard Tan, Johann-Peter Hartmann, Illia Dulskyi, Fen Risland, Chadd, SX, Jeff Scroggin, Ken Nordquist, Sean Connelly, Artur Olbinski, Swaroop Kallakuri, Jack West, Ai Maven, David Ziegler, Russ Johnson, transmissions 11, John Villwock, Alps Aficionado, Clay Pascal, Viktor Bowallius, Subspace Studios, Rainer Wilmers, Trenton Dambrowitz, vamX, Michael Levine, 준교 김, Brandon Frisco, Kalila, Trailburnt, Randy H, Talal Aujan, Nathan Dryer, Vadim, 阿明, ReadyPlayerEmma, Tiffany J. Kim, George Stoitzev, Spencer Kim, Jerry Meng, Gabriel Tamborski, Cory Kujawski, Jeffrey Morgan, Spiking Neurons AB, Edmond Seymore, Alexandros Triantafyllidis, Lone Striker, Cap'n Zoog, Nikolai Manek, danny, ya boyyy, Derek Yates, usrbinkat, Mandus, TL, Nathan LeClaire, subjectnull, Imad Khwaja, webtim, Raven Klaugh, Asp the Wyvern, Gabriel Puliatti, Caitlyn Gatomon, Joseph William Delisle, Jonathan Leane, Luke Pendergrass, SuperWojo, Sebastain Graf, Will Dee, Fred von Graf, Andrey, Dan Guido, Daniel P. Andersen, Nitin Borwankar, Elle, Vitor Caleffi, biorpg, jjj, NimbleBox.ai, Pieter, Matthew Berman, terasurfer, Michael Davis, Alex, Stanislav Ovsiannikov

And thank you again to a16z for their generous grant.

Pygmalion-2 7B

An instruction-tuned Llama-2 biased towards fiction writing and conversation.

The long-awaited release of our new models based on Llama-2 is finally here. Pygmalion-2 7B (formerly known as Metharme) is based on Llama-2 7B released by Meta AI.

The Metharme models were an experiment to try and get a model that is usable for conversation, roleplaying and storywriting, but which can be guided using natural language like other instruct models. After much deliberation, we reached the conclusion that the Metharme prompting format is superior to (and easier to use than) the classic Pygmalion format.

This model was trained by doing supervised fine-tuning over a mixture of regular instruction data alongside roleplay, fictional stories and conversations with synthetically generated instructions attached. This model is freely available for both commercial and non-commercial use, as per the Llama-2 license.

The model has been trained on prompts using three different roles, which are denoted by the following tokens: `<|system|>`, `<|user|>` and `<|model|>`. The `<|system|>` prompt can be used to inject out-of-channel information behind the scenes, while the `<|user|>` prompt should be used to indicate user input. The `<|model|>` token should then be used to indicate that the model should generate a response.
These tokens can happen multiple times and be chained up to form a conversation history. The system prompt has been designed to allow the model to "enter" various modes and dictate the reply length. Here's an example:

Dataset

The dataset used to fine-tune this model includes our own PIPPA, along with several other instruction datasets, and datasets acquired from various RP forums.

The intended use-case for this model is fictional writing for entertainment purposes. Any other sort of usage is out of scope. As such, it was not fine-tuned to be safe and harmless: the base model and this fine-tune have been trained on data known to contain profanity and texts that are lewd or otherwise offensive. It may produce socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive. Outputs might often be factually wrong or misleading.

Acknowledgements

We would like to thank SpicyChat for sponsoring the training for this model.
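The worked prompt example mentioned in the card did not survive the page conversion. The three role tokens it describes are `<|system|>`, `<|user|>` and `<|model|>` in PygmalionAI's release; a minimal sketch of chaining them into a conversation history (the system message is illustrative only):

```python
def metharme_prompt(system: str, turns: list[tuple[str, str]], user: str) -> str:
    """Build a Metharme-style prompt: <|system|> injects out-of-channel setup,
    <|user|> marks each user message, and a trailing <|model|> marks where
    the model should start generating."""
    prompt = f"<|system|>{system}"
    for past_user, past_model in turns:  # chained conversation history
        prompt += f"<|user|>{past_user}<|model|>{past_model}"
    return prompt + f"<|user|>{user}<|model|>"

print(metharme_prompt(
    "Enter RP mode. You shall reply to the user while staying in character.",
    [], "Hello there!"))
```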
japanese-stablelm-instruct-gamma-7B-GGUF
Japanese StableLM Instruct Gamma 7B - GGUF

- Model creator: Stability AI
- Original model: Japanese StableLM Instruct Gamma 7B

This repo contains GGUF format model files for Stability AI's Japanese StableLM Instruct Gamma 7B. These files were quantised using hardware kindly provided by Massed Compute.

GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp.

Here is an incomplete list of clients and libraries that are known to support GGUF:

- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy-to-use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- ctransformers, a Python library with GPU accel, LangChain support, and an OpenAI-compatible AI server.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.

Repositories available:

- AWQ model(s) for GPU inference.
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference
- Stability AI's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions

These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third party UIs and libraries - please see the list at the top of this README.

- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| japanese-stablelm-instruct-gamma-7b.Q2_K.gguf | Q2_K | 2 | 3.08 GB | 5.58 GB | smallest, significant quality loss - not recommended for most purposes |
| japanese-stablelm-instruct-gamma-7b.Q3_K_S.gguf | Q3_K_S | 3 | 3.16 GB | 5.66 GB | very small, high quality loss |
| japanese-stablelm-instruct-gamma-7b.Q3_K_M.gguf | Q3_K_M | 3 | 3.52 GB | 6.02 GB | very small, high quality loss |
| japanese-stablelm-instruct-gamma-7b.Q3_K_L.gguf | Q3_K_L | 3 | 3.82 GB | 6.32 GB | small, substantial quality loss |
| japanese-stablelm-instruct-gamma-7b.Q4_0.gguf | Q4_0 | 4 | 4.11 GB | 6.61 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| japanese-stablelm-instruct-gamma-7b.Q4_K_S.gguf | Q4_K_S | 4 | 4.14 GB | 6.64 GB | small, greater quality loss |
| japanese-stablelm-instruct-gamma-7b.Q4_K_M.gguf | Q4_K_M | 4 | 4.37 GB | 6.87 GB | medium, balanced quality - recommended |
| japanese-stablelm-instruct-gamma-7b.Q5_0.gguf | Q5_0 | 5 | 5.00 GB | 7.50 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| japanese-stablelm-instruct-gamma-7b.Q5_K_S.gguf | Q5_K_S | 5 | 5.00 GB | 7.50 GB | large, low quality loss - recommended |
| japanese-stablelm-instruct-gamma-7b.Q5_K_M.gguf | Q5_K_M | 5 | 5.13 GB | 7.63 GB | large, very low quality loss - recommended |
| japanese-stablelm-instruct-gamma-7b.Q6_K.gguf | Q6_K | 6 | 5.94 GB | 8.44 GB | very large, extremely low quality loss |
| japanese-stablelm-instruct-gamma-7b.Q8_0.gguf | Q8_0 | 8 | 7.70 GB | 10.20 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.
The following clients/libraries will automatically download models for you, providing a list of available models to choose from:

- LM Studio
- LoLLMS Web UI
- Faraday.dev

Under Download Model, you can enter the model repo: TheBloke/japanese-stablelm-instruct-gamma-7B-GGUF and below it, a specific filename to download, such as: japanese-stablelm-instruct-gamma-7b.Q4_K_M.gguf.

On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library. You can then download any individual model file to the current directory, at high speed, or download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI.

To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows Command Line users: you can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.

Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 2048` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation.

Further instructions here: text-generation-webui/docs/llama.cpp.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries.
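The Python loading code referenced in this README did not survive the page conversion. Here is a minimal sketch using ctransformers (`AutoModelForCausalLM.from_pretrained` with `model_file`, `model_type` and `gpu_layers` is the library's documented API; the repo and filename come from this README, while `model_type="mistral"` is an assumption based on the Mistral-7B base model):

```python
def ctransformers_kwargs(gpu_layers: int = 50) -> dict:
    """Keyword arguments for ctransformers' AutoModelForCausalLM.from_pretrained().
    Set gpu_layers=0 if you have no GPU acceleration."""
    return {
        "model_file": "japanese-stablelm-instruct-gamma-7b.Q4_K_M.gguf",
        "model_type": "mistral",  # assumption: Gamma 7B is Mistral-7B-based
        "gpu_layers": gpu_layers,
    }

if __name__ == "__main__":
    # Requires `pip install ctransformers` (or ctransformers[cuda] for GPU).
    from ctransformers import AutoModelForCausalLM

    llm = AutoModelForCausalLM.from_pretrained(
        "TheBloke/japanese-stablelm-instruct-gamma-7B-GGUF",
        **ctransformers_kwargs(),
    )
    print(llm("AI is going to", max_new_tokens=64))
```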
How to load this model in Python code, using ctransformers Run one of the following commands, according to your system: Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python LangChain + ctransformers For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Brandon Frisco, LangChain4j, Spiking Neurons AB, transmissions 11, Joseph William Delisle, Nitin Borwankar, Willem Michiel, Michael Dempsey, vamX, Jeffrey Morgan, zynix, jjj, Omer Bin Jawed, Sean Connelly, jinyuan sun, Jeromy Smith, Shadi, Pawan Osman, Chadd, Elijah Stavena, Illia Dulskyi, Sebastain Graf, Stephen Murray, terasurfer, Edmond Seymore, Celu Ramasamy, Mandus, Alex, biorpg, Ajan Kanaga, Clay Pascal, Raven Klaugh, 阿明, K, ya boyyy, usrbinkat, Alicia Loh, John Villwock, ReadyPlayerEmma, Chris Smitley, Cap'n Zoog, fincy, GodLy, SX, sidney chen, Cory Kujawski, OG, Mano Prime, AzureBlack, Pieter, Kalila, Spencer Kim, Tom X Nguyen, Stanislav Ovsiannikov, Michael Levine, Andrey, Trailburnt, Vadim, Enrico Ros, Talal Aujan, Brandon Phillips, Jack West, Eugene Pentland, Michael Davis, Will Dee, webtim, Jonathan Leane, Alps Aficionado, Rooh Singh, Tiffany J. 
Kim, theTransient, Luke @flexchar, Elle, Caitlyn Gatomon, Ari Malik, subjectnull, Johann-Peter Hartmann, Trenton Dambrowitz, Imad Khwaja, Asp the Wyvern, Emad Mostaque, Rainer Wilmers, Alexandros Triantafyllidis, Nicholas, Pedro Madruga, SuperWojo, Harry Royden McLaughlin, James Bentley, Olakabola, David Ziegler, Ai Maven, Jeff Scroggin, Nikolai Manek, Deo Leter, Matthew Berman, Fen Risland, Ken Nordquist, Manuel Alberto Morcote, Luke Pendergrass, TL, Fred von Graf, Randy H, Dan Guido, NimbleBox.ai, Vitor Caleffi, Gabriel Tamborski, knownsqashed, Lone Striker, Erik Bjäreholt, John Detwiler, Leonard Tan, Iucharbius And thank you again to a16z for their generous grant. Original model card: Stability AI's Japanese StableLM Instruct Gamma 7B This is a 7B-parameter decoder-only Japanese language model fine-tuned on instruction-following datasets, built on top of the base model Japanese Stable LM Base Gamma 7B. If you are in search of a smaller model, please check Japanese StableLM-3B-4E1T Instruct. Developed by: Stability AI Model type: `Japanese Stable LM Instruct Gamma 7B` model is an auto-regressive language model based on the transformer decoder architecture. Language(s): Japanese License: This model is licensed under Apache License, Version 2.0. Contact: For questions and comments about the model, please join Stable Community Japan. For future announcements / information about Stability AI models, research, and events, please follow https://twitter.com/StabilityAIJP. For details, please see Mistral AI's paper and release blog post. - Japanese translation of the Databricks Dolly-15k dataset - Japanese translation of the subset of the Anthropic HH dataset - Wikinews subset of the izumi-lab/llm-japanese-dataset The model is intended to be used by all individuals as a foundational model for application-specific fine-tuning without strict limitations on commercial use. 
The pre-training dataset may have contained offensive or inappropriate content even after applying data cleansing filters, which can be reflected in the model-generated text. We recommend users exercise reasonable caution when using these models in production systems. Do not use the model for any applications that may cause harm or distress to individuals or groups. The fine-tuning was carried out by Fujiki Nakamura. Other aspects, including data preparation and evaluation, were handled by the Language Team of Stability AI Japan, notably Meng Lee, Makoto Shing, Paul McCann, Naoki Orii, and Takuya Akiba. This model is based on Mistral-7B-v0.1 released by the Mistral AI team. We are grateful to the Mistral AI team for providing such an excellent base model. We are grateful for the contributions of the EleutherAI Polyglot-JA team in helping us to collect a large amount of pre-training data in Japanese. Polyglot-JA members include Hyunwoong Ko (Project Lead), Fujiki Nakamura (who originally started this project when he committed to the Polyglot team), Yunho Mo, Minji Jung, KeunSeok Im, and Su-Kyeong Jang. We are also appreciative of AI Novelist/Sta (Bit192, Inc.) and the numerous contributors from Stable Community Japan for assisting us in gathering a large amount of high-quality Japanese textual data for model training.
CodeLlama-13B-Instruct-AWQ
stablelm-zephyr-3b-GGUF
TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z)

StableLM Zephyr 3B - GGUF
- Model creator: Stability AI
- Original model: StableLM Zephyr 3B

This repo contains GGUF format model files for Stability AI's StableLM Zephyr 3B. These files were quantised using hardware kindly provided by Massed Compute.

GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp.

Here is an incomplete list of clients and libraries that are known to support GGUF:
- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
- GPT4All, a free and open source local running GUI, supporting Windows, Linux and macOS with full GPU accel.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration. Linux available, in beta as of 27/11/2023.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy to use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.
- ctransformers, a Python library with GPU accel, LangChain support, and OpenAI-compatible AI server. Note: as of time of writing (November 27th 2023), ctransformers has not been updated in a long time and does not support many recent models.

GPTQ models for GPU inference, with multiple quantisation parameter options.
2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference

Stability AI's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions

These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third party UIs and libraries - please see the list at the top of this README.

- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.
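The fractional bits-per-weight figures quoted above can be checked with a little arithmetic. Each k-quant super-block covers 256 weights; the per-block scale/min bits plus one or two fp16 super-block scales account for the overhead. The accounting below is a simplification of the actual ggml layouts, but it reproduces the quoted numbers:

```python
# Worked check of the k-quant bpw figures (simplified accounting).
def bpw(weight_bits, blocks, block_meta_bits, super_block_bits):
    weights = 256  # weights per super-block for all k-quants
    total_bits = weights * weight_bits + blocks * block_meta_bits + super_block_bits
    return total_bits / weights

# Q2_K: 16 blocks, 4-bit scale + 4-bit min per block, fp16 super-scale
assert bpw(2, 16, 4 + 4, 16) == 2.5625
# Q3_K: 16 blocks, 6-bit scales, fp16 super-scale
assert bpw(3, 16, 6, 16) == 3.4375
# Q4_K: 8 blocks of 32 weights, 6-bit scale + 6-bit min, fp16 super scale + min
assert bpw(4, 8, 6 + 6, 32) == 4.5
# Q5_K: same super-block structure as Q4_K, 5-bit weights
assert bpw(5, 8, 6 + 6, 32) == 5.5
# Q6_K: 16 blocks, 8-bit scales, fp16 super-scale
assert bpw(6, 16, 8, 16) == 6.5625
```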
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| stablelm-zephyr-3b.Q2_K.gguf | Q2_K | 2 | 1.20 GB | 3.70 GB | smallest, significant quality loss - not recommended for most purposes |
| stablelm-zephyr-3b.Q3_K_S.gguf | Q3_K_S | 3 | 1.25 GB | 3.75 GB | very small, high quality loss |
| stablelm-zephyr-3b.Q3_K_M.gguf | Q3_K_M | 3 | 1.39 GB | 3.89 GB | very small, high quality loss |
| stablelm-zephyr-3b.Q3_K_L.gguf | Q3_K_L | 3 | 1.51 GB | 4.01 GB | small, substantial quality loss |
| stablelm-zephyr-3b.Q4_0.gguf | Q4_0 | 4 | 1.61 GB | 4.11 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| stablelm-zephyr-3b.Q4_K_S.gguf | Q4_K_S | 4 | 1.62 GB | 4.12 GB | small, greater quality loss |
| stablelm-zephyr-3b.Q4_K_M.gguf | Q4_K_M | 4 | 1.71 GB | 4.21 GB | medium, balanced quality - recommended |
| stablelm-zephyr-3b.Q5_0.gguf | Q5_0 | 5 | 1.94 GB | 4.44 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| stablelm-zephyr-3b.Q5_K_S.gguf | Q5_K_S | 5 | 1.94 GB | 4.44 GB | large, low quality loss - recommended |
| stablelm-zephyr-3b.Q5_K_M.gguf | Q5_K_M | 5 | 1.99 GB | 4.49 GB | large, very low quality loss - recommended |
| stablelm-zephyr-3b.Q6_K.gguf | Q6_K | 6 | 2.30 GB | 4.80 GB | very large, extremely low quality loss |
| stablelm-zephyr-3b.Q8_0.gguf | Q8_0 | 8 | 2.97 GB | 5.47 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.
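The "Max RAM required" column above follows a simple pattern: each figure is the file size plus a flat 2.50 GB allowance for non-weight overhead (KV cache, buffers) at the default context. A rough sketch of that estimate:

```python
# Rough no-offload RAM estimate matching the table above.
def estimated_max_ram_gb(file_size_gb, overhead_gb=2.50):
    """File size plus a flat overhead allowance, as in the Provided Files table."""
    return round(file_size_gb + overhead_gb, 2)

# Spot-check against two rows of the table:
assert estimated_max_ram_gb(1.71) == 4.21   # Q4_K_M
assert estimated_max_ram_gb(2.97) == 5.47   # Q8_0
```

As the note above says, offloading layers to the GPU shifts part of this from RAM to VRAM, so treat the estimate as an upper bound for CPU-only use.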
The following clients/libraries will automatically download models for you, providing a list of available models to choose from:

Under Download Model, you can enter the model repo: TheBloke/stablelm-zephyr-3b-GGUF and below it, a specific filename to download, such as: stablelm-zephyr-3b.Q4_K_M.gguf.

On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library. Then you can download any individual model file to the current directory, at high speed, with a command like this.

More advanced huggingface-cli download usage

You can also download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI.

To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows Command Line users: you can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.

Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU; remove it if you don't have GPU acceleration. Change `-c 4096` to the desired sequence length. For extended sequence models - e.g. 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. Note that longer sequence lengths require much more resources, so you may need to reduce this value. If you want to have a chat-style conversation, replace the `-p` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation. Further instructions can be found in the text-generation-webui documentation, here: text-generation-webui/docs/04 ‐ Model Tab.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries.
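The llama.cpp parameters discussed above (`-ngl`, `-c`, `-p`) fit together as in the sketch below. The model filename is this repo's Q4_K_M quant; the sampling flags are illustrative defaults, and the command is printed rather than run:

```python
# Assemble an example llama.cpp invocation with the flags discussed above.
import shlex

cmd = [
    "./main",
    "-ngl", "32",                       # layers to offload to GPU; drop if CPU-only
    "-m", "stablelm-zephyr-3b.Q4_K_M.gguf",
    "-c", "4096",                       # context / sequence length
    "--temp", "0.7",
    "--repeat_penalty", "1.1",
    "-n", "-1",                         # generate until end-of-sequence
    "-p", "<|user|>\nWrite a haiku about llamas<|endoftext|>\n<|assistant|>",
]
print(shlex.join(cmd))
```

For interactive chat, replace the `-p` entry with `-i -ins` as noted above.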
Note that at the time of writing (Nov 27th 2023), ctransformers has not been updated for some time and is not compatible with some recent models. Therefore I recommend you use llama-cpp-python.

How to load this model in Python code, using llama-cpp-python

For full documentation, please see: llama-cpp-python docs.

Run one of the following commands, according to your system:

Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python, LangChain + ctransformers.

For further support, and discussions on these models and AI in general, join us at:

I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.
Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Michael Levine, 阿明, Trailburnt, Nikolai Manek, John Detwiler, Randy H, Will Dee, Sebastain Graf, NimbleBox.ai, Eugene Pentland, Emad Mostaque, Ai Maven, Jim Angel, Jeff Scroggin, Michael Davis, Manuel Alberto Morcote, Stephen Murray, Robert, Justin Joy, Luke @flexchar, Brandon Frisco, Elijah Stavena, SX, Dan Guido, Undi ., Komninos Chatzipapas, Shadi, theTransient, Lone Striker, Raven Klaugh, jjj, Cap'n Zoog, Michel-Marie MAUDET (LINAGORA), Matthew Berman, David, Fen Risland, Omer Bin Jawed, Luke Pendergrass, Kalila, OG, Erik Bjäreholt, Rooh Singh, Joseph William Delisle, Dan Lewis, TL, John Villwock, AzureBlack, Brad, Pedro Madruga, Caitlyn Gatomon, K, jinyuan sun, Mano Prime, Alex, Jeffrey Morgan, Alicia Loh, Illia Dulskyi, Chadd, transmissions 11, fincy, Rainer Wilmers, ReadyPlayerEmma, knownsqashed, Mandus, biorpg, Deo Leter, Brandon Phillips, SuperWojo, Sean Connelly, Iucharbius, Jack West, Harry Royden McLaughlin, Nicholas, terasurfer, Vitor Caleffi, Duane Dunston, Johann-Peter Hartmann, David Ziegler, Olakabola, Ken Nordquist, Trenton Dambrowitz, Tom X Nguyen, Vadim, Ajan Kanaga, Leonard Tan, Clay Pascal, Alexandros Triantafyllidis, JM33133, Xule, vamX, ya boyyy, subjectnull, Talal Aujan, Alps Aficionado, wassieverse, Ari Malik, James Bentley, Woland, Spencer Kim, Michael Dempsey, Fred von Graf, Elle, zynix, William Richards, Stanislav Ovsiannikov, Edmond Seymore, Jonathan Leane, Martin Kemka, usrbinkat, Enrico Ros And thank you again to a16z for their generous grant. 
Original model card: Stability AI's StableLM Zephyr 3B

`StableLM Zephyr 3B` is a 3 billion parameter instruction-tuned model inspired by HuggingFaceH4's Zephyr 7B training pipeline. The model was trained on a mix of publicly available and synthetic datasets using Direct Preference Optimization (DPO), and evaluation for this model is based on MT Bench and the Alpaca Benchmark.

`StableLM Zephyr 3B` uses the following instruction format: This format is also available through the tokenizer's `apply_chat_template` method. You can also see how to run a performance optimized version of this model here using OpenVINO from Intel.

- Developed by: Stability AI
- Model type: `StableLM Zephyr 3B` model is an auto-regressive language model based on the transformer decoder architecture.
- Language(s): English
- Library: Alignment Handbook
- Finetuned from model: stabilityai/stablelm-3b-4e1t
- License: StabilityAI Non-Commercial Research Community License
- Contact: For questions and comments about the model, please email `[email protected]`

The dataset is comprised of a mixture of open large-scale datasets available on the HuggingFace Hub:
1. SFT Datasets
- HuggingFaceH4/ultrachat_200k
- meta-math/MetaMathQA
- WizardLM/WizardLM_evol_instruct_V2_196k
- Open-Orca/SlimOrca
2.
Preference Datasets:
- HuggingFaceH4/ultrafeedback_binarized
- Intel/orca_dpo_pairs

| Model | Size | Alignment | MT-Bench (score) | AlpacaEval (win rate %) |
|-------------|-----|----|---------------|--------------|
| StableLM Zephyr 3B 🪁 | 3B | DPO | 6.64 | 76.00 |
| StableLM Zephyr (SFT only) | 3B | SFT | 6.04 | 71.15 |
| Capybara v1.9 | 3B | dSFT | 5.94 | - |
| MPT-Chat | 7B | dSFT | 5.42 | - |
| Xwin-LM v0.1 | 7B | dPPO | 6.19 | 87.83 |
| Mistral-Instruct v0.1 | 7B | - | 6.84 | - |
| Zephyr-7b-α | 7B | dDPO | 6.88 | - |
| Zephyr-7b-β | 7B | dDPO | 7.34 | 90.60 |
| Falcon-Instruct | 40B | dSFT | 5.17 | 45.71 |
| Guanaco | 65B | SFT | 6.41 | 71.80 |
| Llama2-Chat | 70B | RLHF | 6.86 | 92.66 |
| Vicuna v1.3 | 33B | dSFT | 7.12 | 88.99 |
| WizardLM v1.0 | 70B | dSFT | 7.71 | - |
| Xwin-LM v0.1 | 70B | dPPO | - | 95.57 |
| GPT-3.5-turbo | - | RLHF | 7.94 | 89.37 |
| Claude 2 | - | RLHF | 8.06 | 91.36 |
| GPT-4 | - | RLHF | 8.99 | 95.28 |

Other benchmarks:

| Task | Value |
|-----------------------|-------|
| ARC (25-shot) | 47.0 |
| HellaSwag (10-shot) | 74.2 |
| MMLU (5-shot) | 46.3 |
| TruthfulQA (0-shot) | 46.5 |
| Winogrande (5-shot) | 65.5 |
| GSM8K (5-shot) | 42.3 |
| BigBench (Avg) | 35.26 |
| AGI Benchmark (Avg) | 33.23 |

Hardware: `StableLM Zephyr 3B` was trained on the Stability AI cluster across 8 nodes, each with 8 A100 80GB GPUs.

Code Base: We use our internal script for SFT steps and used the HuggingFace Alignment Handbook script for DPO training.

Commitment to Ethical AI

In line with our responsibility towards ethical AI development, `StableLM Zephyr 3B` is released with a focus on ensuring safety, reliability, and appropriateness in its applications. To this end, we have evaluated `StableLM Zephyr 3B` on 488 malicious prompts and used standard protocols to assess the harmfulness of its outputs. Compared to Zephyr-7b-β, `StableLM Zephyr 3B` reduces the number of harmful outputs as assessed by GPT-4 by 55.
Additionally, we performed an internal red teaming event targeting the following abuse areas:
- Self-Harm Methods: (Suicide Methods, Encouragement of Self-Harm, Methods and encouragement of Eating Disorders)
- Misinformation: (Health, Conspiracy Theories, Social Unrest/Conflict, Political Misinformation, & Climate change)
- Hate Speech: (Race, Stereotypes, Immigrants, Gender, Personally Identifiable Information such as Social security numbers, Full names, ID numbers, Email addresses, and telephone numbers)

We have incorporated the findings of our malicious prompts evaluation and red teaming event into our release. Users are encouraged to fine-tune and evaluate the model to suit their specific needs, considering the potential biases and limitations found in `StableLM Zephyr 3B` and inherent in other LLM models. The model is intended to be used as a foundational base model for application-specific fine-tuning. Developers must evaluate and fine-tune the model for safe performance in downstream applications.

Limitations and Bias

This model is not trained against adversarial inputs. We strongly recommend pairing this model with an input and output classifier to prevent harmful responses. Through our internal red teaming, we discovered that while the model will not output harmful information if not prompted to do so, it is willing to output potentially harmful outputs or misinformation when the user requests it. Using this model will require guardrails around your inputs and outputs to ensure that any outputs returned are not misinformation or harmful. Additionally, as each use case is unique, we recommend running your own suite of tests to ensure proper performance of this model. Finally, do not use the models if they are unsuitable for your application, or for any applications that may cause deliberate or unintentional harm to others.
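The instruction format referenced in this card (also produced by the tokenizer's `apply_chat_template` method) is the Zephyr-style `<|user|>` / `<|assistant|>` layout terminated by `<|endoftext|>`. The helper below is only an illustration of that layout; in practice the tokenizer's chat template is the source of truth:

```python
# Illustrative builder for the StableLM Zephyr single-turn prompt layout
# (assumed Zephyr-style format; verify against the tokenizer's chat template).
def format_zephyr_prompt(user_message: str) -> str:
    return f"<|user|>\n{user_message}<|endoftext|>\n<|assistant|>\n"

prompt = format_zephyr_prompt("List three uses of a 3B parameter model.")
print(prompt)
```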
WizardCoder-Python-13B-V1.0-GGUF
ReMM-SLERP-L2-13B-GGUF
deepseek-llm-67b-chat-GGUF
Silicon-Maid-7B-GGUF
Stheno-L2-13B-GGUF
Open_Gpt4_8x7B_v0.2-GGUF
Chronos-Hermes-13b-v2-GGUF
OpenHermes-2-Mistral-7B-GGUF
Llama-2-13B-GGUF
koala-13B-HF
Mythalion-13B-GGUF
CodeLlama-70B-Python-GGUF
openchat-3.5-1210-GGUF
Nous-Capybara-34B-GGUF
meditron-7B-GGUF
FusionNet_34Bx2_MoE-GGUF
llama2_70b_chat_uncensored-GGUF
laser-dolphin-mixtral-2x7b-dpo-GGUF
ReMM-SLERP-L2-13B-AWQ
CodeLlama-34B-Python-fp16
CodeLlama-34B-Instruct-fp16
CodeLlama-13B-Python-fp16
CodeLlama-13B-Instruct-fp16
deepseek-coder-1.3b-base-GGUF
HornyEchidna-13B-v0.1-GGUF
openchat-3.5-0106-GGUF
Xwin-MLewd-13B-v0.2-GGUF
dolphin-2.6-mixtral-8x7b-GGUF
CodeLlama-13B-Python-GGUF
medalpaca-13B-GGUF
LLaMA-7b-GGUF
Guanaco-7B-Uncensored-GGUF
CausalLM-7B-GGUF
WizardLM-1.0-Uncensored-CodeLlama-34B-GGUF
stable-code-3b-GGUF
dolphin-2.6-mistral-7B-dpo-laser-GGUF
WizardCoder-Python-7B-V1.0-GGUF
Nous-Hermes-2-Yi-34B-GGUF
agentlm-7B-GGUF
deepseek-coder-33B-base-GGUF
Emerhyst-20B-GGUF
zephyr-7B-alpha-GGUF
TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z)

Zephyr 7B Alpha - GGUF
- Model creator: Hugging Face H4
- Original model: Zephyr 7B Alpha

This repo contains GGUF format model files for Hugging Face H4's Zephyr 7B Alpha.

GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp.

Here is an incomplete list of clients and libraries that are known to support GGUF:
- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy to use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- ctransformers, a Python library with GPU accel, LangChain support, and OpenAI-compatible AI server.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.

AWQ model(s) for GPU inference. GPTQ models for GPU inference, with multiple quantisation parameter options. 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference. Hugging Face H4's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions.

These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third party UIs and libraries - please see the list at the top of this README.
The new methods available are:
- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| zephyr-7b-alpha.Q2_K.gguf | Q2_K | 2 | 3.08 GB | 5.58 GB | smallest, significant quality loss - not recommended for most purposes |
| zephyr-7b-alpha.Q3_K_S.gguf | Q3_K_S | 3 | 3.16 GB | 5.66 GB | very small, high quality loss |
| zephyr-7b-alpha.Q3_K_M.gguf | Q3_K_M | 3 | 3.52 GB | 6.02 GB | very small, high quality loss |
| zephyr-7b-alpha.Q3_K_L.gguf | Q3_K_L | 3 | 3.82 GB | 6.32 GB | small, substantial quality loss |
| zephyr-7b-alpha.Q4_0.gguf | Q4_0 | 4 | 4.11 GB | 6.61 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| zephyr-7b-alpha.Q4_K_S.gguf | Q4_K_S | 4 | 4.14 GB | 6.64 GB | small, greater quality loss |
| zephyr-7b-alpha.Q4_K_M.gguf | Q4_K_M | 4 | 4.37 GB | 6.87 GB | medium, balanced quality - recommended |
| zephyr-7b-alpha.Q5_0.gguf | Q5_0 | 5 | 5.00 GB | 7.50 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| zephyr-7b-alpha.Q5_K_S.gguf | Q5_K_S | 5 | 5.00 GB | 7.50 GB | large, low quality loss - recommended |
| zephyr-7b-alpha.Q5_K_M.gguf | Q5_K_M | 5 | 5.13 GB | 7.63 GB | large, very low quality loss - recommended |
| zephyr-7b-alpha.Q6_K.gguf | Q6_K | 6 | 5.94 GB | 8.44 GB | very large, extremely low quality loss |
| zephyr-7b-alpha.Q8_0.gguf | Q8_0 | 8 | 7.70 GB | 10.20 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.

The following clients/libraries will automatically download models for you, providing a list of available models to choose from:
- LM Studio
- LoLLMS Web UI
- Faraday.dev

Under Download Model, you can enter the model repo: TheBloke/zephyr-7B-alpha-GGUF and below it, a specific filename to download, such as: zephyr-7b-alpha.Q4_K_M.gguf.

On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library. Then you can download any individual model file to the current directory, at high speed, with a command like this. You can also download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI.

To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows Command Line users: you can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.

Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 2048` to the desired sequence length.
For extended sequence models - e.g. 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. If you want to have a chat-style conversation, replace the `-p` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation. Further instructions here: text-generation-webui/docs/llama.cpp.md.

You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries.

How to load this model in Python code, using ctransformers

Run one of the following commands, according to your system:

Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python, LangChain + ctransformers.

For further support, and discussions on these models and AI in general, join us at:

I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.
Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Pierre Kircher, Stanislav Ovsiannikov, Michael Levine, Eugene Pentland, Andrey, 준교 김, Randy H, Fred von Graf, Artur Olbinski, Caitlyn Gatomon, terasurfer, Jeff Scroggin, James Bentley, Vadim, Gabriel Puliatti, Harry Royden McLaughlin, Sean Connelly, Dan Guido, Edmond Seymore, Alicia Loh, subjectnull, AzureBlack, Manuel Alberto Morcote, Thomas Belote, Lone Striker, Chris Smitley, Vitor Caleffi, Johann-Peter Hartmann, Clay Pascal, biorpg, Brandon Frisco, sidney chen, transmissions 11, Pedro Madruga, jinyuan sun, Ajan Kanaga, Emad Mostaque, Trenton Dambrowitz, Jonathan Leane, Iucharbius, usrbinkat, vamX, George Stoitzev, Luke Pendergrass, theTransient, Olakabola, Swaroop Kallakuri, Cap'n Zoog, Brandon Phillips, Michael Dempsey, Nikolai Manek, danny, Matthew Berman, Gabriel Tamborski, alfiei, Raymond Fosdick, Tom X Nguyen, Raven Klaugh, LangChain4j, Magnesian, Illia Dulskyi, David Ziegler, Mano Prime, Luis Javier Navarrete Lozano, Erik Bjäreholt, 阿明, Nathan Dryer, Alex, Rainer Wilmers, zynix, TL, Joseph William Delisle, John Villwock, Nathan LeClaire, Willem Michiel, Joguhyik, GodLy, OG, Alps Aficionado, Jeffrey Morgan, ReadyPlayerEmma, Tiffany J. Kim, Sebastain Graf, Spencer Kim, Michael Davis, webtim, Talal Aujan, knownsqashed, John Detwiler, Imad Khwaja, Deo Leter, Jerry Meng, Elijah Stavena, Rooh Singh, Pieter, SuperWojo, Alexandros Triantafyllidis, Stephen Murray, Ai Maven, ya boyyy, Enrico Ros, Ken Nordquist, Deep Realms, Nicholas, Spiking Neurons AB, Elle, Will Dee, Jack West, RoA, Luke @flexchar, Viktor Bowallius, Derek Yates, Subspace Studios, jjj, Toran Billups, Asp the Wyvern, Fen Risland, Ilya, NimbleBox.ai, Chadd, Nitin Borwankar, Emre, Mandus, Leonard Tan, Kalila, K, Trailburnt, SX, Cory Kujawski And thank you again to a16z for their generous grant. 
Original model card: Hugging Face H4's Zephyr 7B Alpha

Zephyr is a series of language models that are trained to act as helpful assistants. Zephyr-7B-α is the first model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO). We found that removing the in-built alignment of these datasets boosted performance on MT Bench and made the model more helpful. However, this means that the model is likely to generate problematic text when prompted to do so and should only be used for educational and research purposes.

- Model type: A 7B parameter GPT-like model fine-tuned on a mix of publicly available, synthetic datasets.
- Language(s) (NLP): Primarily English
- License: MIT
- Finetuned from model: mistralai/Mistral-7B-v0.1
- Repository: https://github.com/huggingface/alignment-handbook
- Demo: https://huggingface.co/spaces/HuggingFaceH4/zephyr-chat

The model was initially fine-tuned on a variant of the `UltraChat` dataset, which contains a diverse range of synthetic dialogues generated by ChatGPT. We then further aligned the model with 🤗 TRL's `DPOTrainer` on the openbmb/UltraFeedback dataset, which contains 64k prompts and model completions that are ranked by GPT-4. As a result, the model can be used for chat and you can check out our demo to test its capabilities. Here's how you can run the model using the `pipeline()` function from 🤗 Transformers:

Zephyr-7B-α has not been aligned to human preferences with techniques like RLHF or deployed with in-the-loop filtering of responses like ChatGPT, so the model can produce problematic outputs (especially when prompted to do so). It is also unknown what the size and composition of the corpus used to train the base model (`mistralai/Mistral-7B-v0.1`) were; however, it is likely to have included a mix of Web data and technical sources like books and code.
See the Falcon 180B model card for an example of this.

Zephyr 7B Alpha achieves the following results on the evaluation set:
- Loss: 0.4605
- Rewards/chosen: -0.5053
- Rewards/rejected: -1.8752
- Rewards/accuracies: 0.7812
- Rewards/margins: 1.3699
- Logps/rejected: -327.4286
- Logps/chosen: -297.1040
- Logits/rejected: -2.7153
- Logits/chosen: -2.7447

The following hyperparameters were used during training:
- learning_rate: 5e-07
- train_batch_size: 2
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 16
- total_train_batch_size: 32
- total_eval_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.5602 | 0.05 | 100 | 0.5589 | -0.3359 | -0.8168 | 0.7188 | 0.4809 | -306.2607 | -293.7161 | -2.6554 | -2.6797 |
| 0.4852 | 0.1 | 200 | 0.5136 | -0.5310 | -1.4994 | 0.8125 | 0.9684 | -319.9124 | -297.6181 | -2.5762 | -2.5957 |
| 0.5212 | 0.15 | 300 | 0.5168 | -0.1686 | -1.1760 | 0.7812 | 1.0074 | -313.4444 | -290.3699 | -2.6865 | -2.7125 |
| 0.5496 | 0.21 | 400 | 0.4835 | -0.1617 | -1.7170 | 0.8281 | 1.5552 | -324.2635 | -290.2326 | -2.7947 | -2.8218 |
| 0.5209 | 0.26 | 500 | 0.5054 | -0.4778 | -1.6604 | 0.7344 | 1.1826 | -323.1325 | -296.5546 | -2.8388 | -2.8667 |
| 0.4617 | 0.31 | 600 | 0.4910 | -0.3738 | -1.5180 | 0.7656 | 1.1442 | -320.2848 | -294.4741 | -2.8234 | -2.8521 |
| 0.4452 | 0.36 | 700 | 0.4838 | -0.4591 | -1.6576 | 0.7031 | 1.1986 | -323.0770 | -296.1796 | -2.7401 | -2.7653 |
| 0.4674 | 0.41 | 800 | 0.5077 | -0.5692 | -1.8659 | 0.7656 | 1.2967 | -327.2416 | -298.3818 | -2.6740 | -2.6945 |
| 0.4656 | 0.46 | 900 | 0.4927 | -0.5279 | -1.6614 | 0.7656 | 1.1335 | -323.1518 | -297.5553 | -2.7817 | -2.8015 |
| 0.4102 | 0.52 | 1000 | 0.4772 | -0.5767 | -2.0667 | 0.7656 | 1.4900 | -331.2578 | -298.5311 | -2.7160 | -2.7455 |
| 0.4663 | 0.57 | 1100 | 0.4740 | -0.8038 | -2.1018 | 0.7656 | 1.2980 | -331.9604 | -303.0741 | -2.6994 | -2.7257 |
| 0.4737 | 0.62 | 1200 | 0.4716 | -0.3783 | -1.7015 | 0.7969 | 1.3232 | -323.9545 | -294.5634 | -2.6842 | -2.7135 |
| 0.4259 | 0.67 | 1300 | 0.4866 | -0.6239 | -1.9703 | 0.7812 | 1.3464 | -329.3312 | -299.4761 | -2.7046 | -2.7356 |
| 0.4935 | 0.72 | 1400 | 0.4747 | -0.5626 | -1.7600 | 0.7812 | 1.1974 | -325.1243 | -298.2491 | -2.7153 | -2.7444 |
| 0.4211 | 0.77 | 1500 | 0.4645 | -0.6099 | -1.9993 | 0.7656 | 1.3894 | -329.9109 | -299.1959 | -2.6944 | -2.7236 |
| 0.4931 | 0.83 | 1600 | 0.4684 | -0.6798 | -2.1082 | 0.7656 | 1.4285 | -332.0890 | -300.5934 | -2.7006 | -2.7305 |
| 0.5029 | 0.88 | 1700 | 0.4595 | -0.5063 | -1.8951 | 0.7812 | 1.3889 | -327.8267 | -297.1233 | -2.7108 | -2.7403 |
| 0.4965 | 0.93 | 1800 | 0.4613 | -0.5561 | -1.9079 | 0.7812 | 1.3518 | -328.0831 | -298.1203 | -2.7226 | -2.7523 |
| 0.4337 | 0.98 | 1900 | 0.4608 | -0.5066 | -1.8718 | 0.7656 | 1.3652 | -327.3599 | -297.1296 | -2.7175 | -2.7469 |

- Transformers 4.34.0
- Pytorch 2.0.1+cu118
- Datasets 2.12.0
- Tokenizers 0.14.0
medicine-LLM-GGUF
koala-7B-HF
Yi-34B-Chat-GGUF
Yi-34B-GGUF
Ziya-Coding-34B-v1.0-GGUF
Llama-2-70B-Chat-AWQ
airoboros-mistral2.2-7B-GGUF
LLaMA-Pro-8B-Instruct-GGUF
SOLAR-10.7B-Instruct-v1.0-GGUF
storytime-13B-GGUF
Dr_Samantha-7B-GGUF
Nous-Hermes-Llama2-GGUF
Guanaco-13B-Uncensored-GGUF
em_german_leo_mistral-GGUF
DaringMaid-13B-GGUF
Yarn-Llama-2-7B-128K-GGUF
LongAlpaca-70B-GGUF
Nous-Hermes-Llama2-70B-GGUF
Magicoder-S-DS-6.7B-GGUF
Mistral-7B-Instruct-v0.2-code-ft-GGUF
sqlcoder-7B-GGUF
EstopianMaid-13B-GGUF
MythoMist-7B-GGUF
airoboros-l2-13B-gpt4-1.4.1-GGUF
OpenHermes-2.5-Mistral-7B-16k-GGUF
LLaMA2-13B-Tiefighter-GGUF
Llama-2-70B-GPTQ
TinyLlama-1.1B-1T-OpenOrca-GGUF
Wizard-Vicuna-13B-Uncensored-HF
Mistral-7B-Instruct-v0.1-AWQ
phi-2-electrical-engineering-GGUF
neural-chat-7B-v3-1-GGUF
Nous-Hermes-13B-GGUF
Chronomaid-Storytelling-13B-GGUF
Mistral-7B-OpenOrca-GPTQ
Nous-Hermes-Llama-2-7B-GGUF
Leo-Mistral-Hessianai-7B-Chat-GGUF
LLaMA-30b-GGUF
DiscoLM_German_7b_v1-GGUF
docsgpt-7B-mistral-GGUF
saiga_mistral_7b-GGUF
Toppy-M-7B-GGUF
Orca-2-13B-GGUF
finance-LLM-GGUF
TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z)

Finance LLM - GGUF
- Model creator: AdaptLLM
- Original model: Finance LLM

This repo contains GGUF format model files for AdaptLLM's Finance LLM. These files were quantised using hardware kindly provided by Massed Compute. GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. Here is an incomplete list of clients and libraries that are known to support GGUF:
- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
- GPT4All, a free and open source locally running GUI, supporting Windows, Linux and macOS with full GPU accel.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration. Linux available, in beta as of 27/11/2023.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy to use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.
- ctransformers, a Python library with GPU accel, LangChain support, and OpenAI-compatible API server. Note: as of the time of writing (November 27th 2023), ctransformers has not been updated in a long time and does not support many recent models.

AWQ model(s) for GPU inference. GPTQ models for GPU inference, with multiple quantisation parameter options.
2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference. AdaptLLM's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions. These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third party UIs and libraries - please see the list at the top of this README.
- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
Refer to the Provided Files table below to see what files use which methods, and how.
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| finance-llm.Q2_K.gguf | Q2_K | 2 | 2.83 GB | 5.33 GB | smallest, significant quality loss - not recommended for most purposes |
| finance-llm.Q3_K_S.gguf | Q3_K_S | 3 | 2.95 GB | 5.45 GB | very small, high quality loss |
| finance-llm.Q3_K_M.gguf | Q3_K_M | 3 | 3.30 GB | 5.80 GB | very small, high quality loss |
| finance-llm.Q3_K_L.gguf | Q3_K_L | 3 | 3.60 GB | 6.10 GB | small, substantial quality loss |
| finance-llm.Q4_0.gguf | Q4_0 | 4 | 3.83 GB | 6.33 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| finance-llm.Q4_K_S.gguf | Q4_K_S | 4 | 3.86 GB | 6.36 GB | small, greater quality loss |
| finance-llm.Q4_K_M.gguf | Q4_K_M | 4 | 4.08 GB | 6.58 GB | medium, balanced quality - recommended |
| finance-llm.Q5_0.gguf | Q5_0 | 5 | 4.65 GB | 7.15 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| finance-llm.Q5_K_S.gguf | Q5_K_S | 5 | 4.65 GB | 7.15 GB | large, low quality loss - recommended |
| finance-llm.Q5_K_M.gguf | Q5_K_M | 5 | 4.78 GB | 7.28 GB | large, very low quality loss - recommended |
| finance-llm.Q6_K.gguf | Q6_K | 6 | 5.53 GB | 8.03 GB | very large, extremely low quality loss |
| finance-llm.Q8_0.gguf | Q8_0 | 8 | 7.16 GB | 9.66 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead. Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file. The following clients/libraries will automatically download models for you, providing a list of available models to choose from: Under Download Model, you can enter the model repo: TheBloke/finance-LLM-GGUF and below it, a specific filename to download, such as: finance-llm.Q4_K_M.gguf.
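For scripted downloads, a minimal sketch with the `huggingface_hub` Python API is shown below; the repo ID and filename pattern come from the table above, while the `local_dir` and default quant choice are illustrative assumptions:

```python
# Sketch: downloading a single GGUF quant with huggingface_hub.
def quant_filename(quant: str) -> str:
    """Filename pattern used by this repo's GGUF files."""
    return f"finance-llm.{quant}.gguf"

def download(quant: str = "Q4_K_M", local_dir: str = ".") -> str:
    # Import kept local so the helper above works without the package.
    from huggingface_hub import hf_hub_download  # pip install huggingface-hub
    return hf_hub_download(
        repo_id="TheBloke/finance-LLM-GGUF",
        filename=quant_filename(quant),
        local_dir=local_dir,
    )

if __name__ == "__main__":
    print(download())  # prints the local path of the downloaded file
```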
On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library: Then you can download any individual model file to the current directory, at high speed, with a command like this: More advanced huggingface-cli download usage (click to read) You can also download multiple files at once with a pattern: For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer`: And set environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`: Windows Command Line users: You can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command. Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 2048` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. Note that longer sequence lengths require much more resources, so you may need to reduce this value. If you want to have a chat-style conversation, replace the `-p ` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation. Further instructions can be found in the text-generation-webui documentation, here: text-generation-webui/docs/04 ‐ Model Tab.md. You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries. Note that at the time of writing (Nov 27th 2023), ctransformers has not been updated for some time and is not compatible with some recent models. Therefore I recommend you use llama-cpp-python. How to load this model in Python code, using llama-cpp-python: For full documentation, please see: llama-cpp-python docs.
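The llama-cpp-python loading example referenced above was not preserved in this copy. A minimal sketch follows; the model path, prompt, and generation settings are illustrative assumptions, and the `n_gpu_layers`/`n_ctx` arguments correspond to the `-ngl`/`-c` llama.cpp flags described earlier:

```python
# Sketch: loading a GGUF file with llama-cpp-python (path and settings
# are illustrative; adjust n_gpu_layers/n_ctx for your hardware).
def llama_kwargs(n_gpu_layers: int = 32, n_ctx: int = 2048) -> dict:
    """Mirror the llama.cpp -ngl and -c flags as Llama() keyword arguments."""
    return {"n_gpu_layers": n_gpu_layers, "n_ctx": n_ctx}

def run(model_path: str = "./finance-llm.Q4_K_M.gguf") -> str:
    # Import kept local so the helper above works without the package.
    from llama_cpp import Llama  # pip install llama-cpp-python
    llm = Llama(model_path=model_path, **llama_kwargs())
    out = llm("What is a leveraged buyout?", max_tokens=256, stop=["</s>"])
    return out["choices"][0]["text"]

if __name__ == "__main__":
    print(run())
```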
Run one of the following commands, according to your system: Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python LangChain + ctransformers For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Michael Levine, 阿明, Trailburnt, Nikolai Manek, John Detwiler, Randy H, Will Dee, Sebastain Graf, NimbleBox.ai, Eugene Pentland, Emad Mostaque, Ai Maven, Jim Angel, Jeff Scroggin, Michael Davis, Manuel Alberto Morcote, Stephen Murray, Robert, Justin Joy, Luke @flexchar, Brandon Frisco, Elijah Stavena, SX, Dan Guido, Undi ., Komninos Chatzipapas, Shadi, theTransient, Lone Striker, Raven Klaugh, jjj, Cap'n Zoog, Michel-Marie MAUDET (LINAGORA), Matthew Berman, David, Fen Risland, Omer Bin Jawed, Luke Pendergrass, Kalila, OG, Erik Bjäreholt, Rooh Singh, Joseph William Delisle, Dan Lewis, TL, John Villwock, AzureBlack, Brad, Pedro Madruga, Caitlyn Gatomon, K, jinyuan sun, Mano Prime, Alex, Jeffrey Morgan, Alicia Loh, Illia Dulskyi, Chadd, transmissions 11, fincy, Rainer Wilmers, ReadyPlayerEmma, knownsqashed, Mandus, biorpg, Deo Leter, Brandon Phillips, SuperWojo, Sean Connelly, Iucharbius, Jack West, Harry Royden McLaughlin, Nicholas, terasurfer, Vitor Caleffi, Duane Dunston, Johann-Peter Hartmann, David Ziegler, Olakabola, Ken Nordquist, Trenton Dambrowitz, Tom X Nguyen, Vadim, 
Ajan Kanaga, Leonard Tan, Clay Pascal, Alexandros Triantafyllidis, JM33133, Xule, vamX, ya boyyy, subjectnull, Talal Aujan, Alps Aficionado, wassieverse, Ari Malik, James Bentley, Woland, Spencer Kim, Michael Dempsey, Fred von Graf, Elle, zynix, William Richards, Stanislav Ovsiannikov, Edmond Seymore, Jonathan Leane, Martin Kemka, usrbinkat, Enrico Ros And thank you again to a16z for their generous grant. Adapt (Large) Language Models to Domains This repo contains the domain-specific base model developed from LLaMA-1-7B, using the method in our paper Adapting Large Language Models via Reading Comprehension. We explore continued pre-training on domain-specific corpora for large language models. While this approach enriches LLMs with domain knowledge, it significantly hurts their prompting ability for question answering. Inspired by human learning via reading comprehension, we propose a simple method to transform large-scale pre-training corpora into reading comprehension texts, consistently improving prompting performance across tasks in biomedicine, finance, and law domains. Our 7B model competes with much larger domain-specific models like BloombergGPT-50B. 🤗 We are currently working hard on developing models across different domains, scales and architectures! Please stay tuned! 🤗 Updates 12/19: Released our 13B base models developed from LLaMA-1-13B. 12/8: Released our chat models developed from LLaMA-2-Chat-7B. 9/18: Released our paper, code, data, and base models developed from LLaMA-1-7B. 
Domain-Specific LLaMA-1

LLaMA-1-7B: In our paper, we develop three domain-specific models from LLaMA-1-7B, which are also available on Hugging Face: Biomedicine-LLM, Finance-LLM and Law-LLM. The performance of our AdaptLLM models compared to other domain-specific LLMs is:

LLaMA-1-13B: Moreover, we scale up our base model to LLaMA-1-13B to see if our method is similarly effective for larger-scale models, and the results are consistently positive too: Biomedicine-LLM-13B, Finance-LLM-13B and Law-LLM-13B.

Domain-Specific LLaMA-2-Chat: Our method is also effective for aligned models! LLaMA-2-Chat requires a specific data format, and our reading comprehension texts can perfectly fit the data format by transforming the reading comprehension into a multi-turn conversation. We have also open-sourced chat models in different domains: Biomedicine-Chat, Finance-Chat and Law-Chat.

Domain-Specific Tasks: To easily reproduce our results, we have uploaded the filled-in zero/few-shot input instructions and output completions of each domain-specific task: biomedicine-tasks, finance-tasks, and law-tasks. Note: those filled-in instructions are specifically tailored for models before alignment and do NOT fit the specific data format required for chat models.

Citation: If you find our work helpful, please cite us:
SOLAR-10.7B-v1.0-GGUF
wizardLM-7B-HF
zephyr-7B-beta-GPTQ
Zephyr 7B Beta - GPTQ
- Model creator: Hugging Face H4
- Original model: Zephyr 7B Beta

This repo contains GPTQ model files for Hugging Face H4's Zephyr 7B Beta. Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them. These files were quantised using hardware kindly provided by Massed Compute. AWQ model(s) for GPU inference. GPTQ models for GPU inference, with multiple quantisation parameter options. 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference. Hugging Face H4's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions. These GPTQ models are known to work in the following inference servers/webuis:
- text-generation-webui
- KoboldAI United
- LoLLMS Web UI
- Hugging Face Text Generation Inference (TGI)
This may not be a complete list; if you know of others, please let me know! Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements. Each separate quant is in a different branch. See below for instructions on fetching from different branches. Most GPTQ files are made with AutoGPTQ. Mistral models are currently made with Transformers.
- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The calibration dataset used during quantisation.
Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ calibration dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
- ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama and Mistral models in 4-bit.

| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
| ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
| main | 4 | 128 | Yes | 0.1 | wikitext | 4096 | 4.16 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
| gptq-4bit-32g-actorder_True | 4 | 32 | Yes | 0.1 | wikitext | 4096 | 4.57 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. |
| gptq-8bit--1g-actorder_True | 8 | None | Yes | 0.1 | wikitext | 4096 | 7.52 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements. |
| gptq-8bit-128g-actorder_True | 8 | 128 | Yes | 0.1 | wikitext | 4096 | 7.68 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. |
| gptq-8bit-32g-actorder_True | 8 | 32 | Yes | 0.1 | wikitext | 4096 | 8.17 GB | No | 8-bit, with group size 32g and Act Order for maximum inference quality. |
| gptq-4bit-64g-actorder_True | 4 | 64 | Yes | 0.1 | wikitext | 4096 | 4.29 GB | Yes | 4-bit, with Act Order and group size 64g.
Uses less VRAM than 32g, but with slightly lower accuracy. | To download from the `main` branch, enter `TheBloke/zephyr-7B-beta-GPTQ` in the "Download model" box. To download from another branch, add `:branchname` to the end of the download name, eg `TheBloke/zephyr-7B-beta-GPTQ:gptq-4bit-32g-actorder_True`. I recommend using the `huggingface-hub` Python library: To download the `main` branch to a folder called `zephyr-7B-beta-GPTQ`: To download from a different branch, add the `--revision` parameter: If you remove the `--local-dir-use-symlinks False` parameter, the files will instead be stored in the central Hugging Face cache directory (default location on Linux is: `~/.cache/huggingface`), and symlinks will be added to the specified `--local-dir`, pointing to their real location in the cache. This allows for interrupted downloads to be resumed, and allows you to quickly clone the repo to multiple places on disk without triggering a download again. The downside, and the reason why I don't list that as the default option, is that the files are then hidden away in a cache folder and it's harder to know where your disk space is being used, and to clear it up if/when you want to remove a downloaded model. The cache location can be changed with the `HF_HOME` environment variable, and/or the `--cache-dir` parameter to `huggingface-cli`. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer`: And set environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`: Windows Command Line users: You can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command. To clone a specific branch with `git`, use a command like this: Note that using Git with HF repos is strongly discouraged.
It will be much slower than using `huggingface-hub`, and will use twice as much disk space as it has to store the model files twice (it stores every byte both in the intended target folder, and again in the `.git` folder as a blob.) How to easily download and use this model in text-generation-webui: Please make sure you're using the latest version of text-generation-webui. It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.
1. Click the Model tab.
2. Under Download custom model or LoRA, enter `TheBloke/zephyr-7B-beta-GPTQ`.
   - To download from a specific branch, enter for example `TheBloke/zephyr-7B-beta-GPTQ:gptq-4bit-32g-actorder_True` - see Provided Files above for the list of branches for each option.
3. Click Download.
4. The model will start downloading. Once it's finished it will say "Done".
5. In the top left, click the refresh icon next to Model.
6. In the Model dropdown, choose the model you just downloaded: `zephyr-7B-beta-GPTQ`
7. The model will automatically load, and is now ready for use!
8. If you want any custom settings, set them and then click Save settings for this model followed by Reload the Model in the top right.
   - Note that you do not need to and should not set manual GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
9. Once you're ready, click the Text Generation tab and enter a prompt to get started!
Serving this model from Text Generation Inference (TGI): It's recommended to use TGI version 1.1.0 or later. The official Docker container is: `ghcr.io/huggingface/text-generation-inference:1.1.0` Example Python code for interfacing with TGI (requires huggingface-hub 0.17.0 or later): Requires: Transformers 4.33.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.
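The TGI client snippet referenced above did not survive extraction. A hedged sketch using `huggingface_hub.InferenceClient` follows; the endpoint URL, prompt template, and generation settings are illustrative assumptions:

```python
# Sketch: querying a running TGI container with huggingface_hub's
# InferenceClient (endpoint and generation settings are illustrative).
def zephyr_prompt(user: str) -> str:
    """Zephyr chat template, written out by hand (empty system message)."""
    return f"<|system|>\n</s>\n<|user|>\n{user}</s>\n<|assistant|>\n"

def query_tgi(user_msg: str, endpoint: str = "http://127.0.0.1:8080") -> str:
    # Import kept local so the helper above works without the package.
    from huggingface_hub import InferenceClient  # huggingface-hub >= 0.17.0
    client = InferenceClient(model=endpoint)
    return client.text_generation(
        zephyr_prompt(user_msg),
        max_new_tokens=128,
        temperature=0.7,
        top_p=0.95,
        repetition_penalty=1.1,
    )

if __name__ == "__main__":
    print(query_tgi("Tell me about AI."))
```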
If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead: The files provided are tested to work with Transformers. For non-Mistral models, AutoGPTQ can also be used directly. ExLlama is compatible with Llama and Mistral models in 4-bit. Please see the Provided Files table above for per-file compatibility. For a list of clients/servers, please see "Known compatible clients / servers", above. For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. 
Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Pierre Kircher, Stanislav Ovsiannikov, Michael Levine, Eugene Pentland, Andrey, 준교 김, Randy H, Fred von Graf, Artur Olbinski, Caitlyn Gatomon, terasurfer, Jeff Scroggin, James Bentley, Vadim, Gabriel Puliatti, Harry Royden McLaughlin, Sean Connelly, Dan Guido, Edmond Seymore, Alicia Loh, subjectnull, AzureBlack, Manuel Alberto Morcote, Thomas Belote, Lone Striker, Chris Smitley, Vitor Caleffi, Johann-Peter Hartmann, Clay Pascal, biorpg, Brandon Frisco, sidney chen, transmissions 11, Pedro Madruga, jinyuan sun, Ajan Kanaga, Emad Mostaque, Trenton Dambrowitz, Jonathan Leane, Iucharbius, usrbinkat, vamX, George Stoitzev, Luke Pendergrass, theTransient, Olakabola, Swaroop Kallakuri, Cap'n Zoog, Brandon Phillips, Michael Dempsey, Nikolai Manek, danny, Matthew Berman, Gabriel Tamborski, alfiei, Raymond Fosdick, Tom X Nguyen, Raven Klaugh, LangChain4j, Magnesian, Illia Dulskyi, David Ziegler, Mano Prime, Luis Javier Navarrete Lozano, Erik Bjäreholt, 阿明, Nathan Dryer, Alex, Rainer Wilmers, zynix, TL, Joseph William Delisle, John Villwock, Nathan LeClaire, Willem Michiel, Joguhyik, GodLy, OG, Alps Aficionado, Jeffrey Morgan, ReadyPlayerEmma, Tiffany J. Kim, Sebastain Graf, Spencer Kim, Michael Davis, webtim, Talal Aujan, knownsqashed, John Detwiler, Imad Khwaja, Deo Leter, Jerry Meng, Elijah Stavena, Rooh Singh, Pieter, SuperWojo, Alexandros Triantafyllidis, Stephen Murray, Ai Maven, ya boyyy, Enrico Ros, Ken Nordquist, Deep Realms, Nicholas, Spiking Neurons AB, Elle, Will Dee, Jack West, RoA, Luke @flexchar, Viktor Bowallius, Derek Yates, Subspace Studios, jjj, Toran Billups, Asp the Wyvern, Fen Risland, Ilya, NimbleBox.ai, Chadd, Nitin Borwankar, Emre, Mandus, Leonard Tan, Kalila, K, Trailburnt, SX, Cory Kujawski And thank you again to a16z for their generous grant. 
Original model card: Hugging Face H4's Zephyr 7B Beta

Zephyr is a series of language models that are trained to act as helpful assistants. Zephyr-7B-β is the second model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO). We found that removing the in-built alignment of these datasets boosted performance on MT Bench and made the model more helpful. However, this means that the model is likely to generate problematic text when prompted to do so and should only be used for educational and research purposes. You can find more details in the technical report.
- Model type: A 7B parameter GPT-like model fine-tuned on a mix of publicly available, synthetic datasets.
- Language(s) (NLP): Primarily English
- License: MIT
- Finetuned from model: mistralai/Mistral-7B-v0.1
- Repository: https://github.com/huggingface/alignment-handbook
- Demo: https://huggingface.co/spaces/HuggingFaceH4/zephyr-chat
- Chatbot Arena: Evaluate Zephyr 7B against 10+ LLMs in the LMSYS arena: http://arena.lmsys.org

At the time of release, Zephyr-7B-β is the highest ranked 7B chat model on the MT-Bench and AlpacaEval benchmarks:

| Model | Size | Alignment | MT-Bench (score) | AlpacaEval (win rate %) |
|-------|------|-----------|------------------|-------------------------|
| StableLM-Tuned-α | 7B | dSFT | 2.75 | - |
| MPT-Chat | 7B | dSFT | 5.42 | - |
| Xwin-LM v0.1 | 7B | dPPO | 6.19 | 87.83 |
| Mistral-Instruct v0.1 | 7B | - | 6.84 | - |
| Zephyr-7b-α | 7B | dDPO | 6.88 | - |
| Zephyr-7b-β 🪁 | 7B | dDPO | 7.34 | 90.60 |
| Falcon-Instruct | 40B | dSFT | 5.17 | 45.71 |
| Guanaco | 65B | SFT | 6.41 | 71.80 |
| Llama2-Chat | 70B | RLHF | 6.86 | 92.66 |
| Vicuna v1.3 | 33B | dSFT | 7.12 | 88.99 |
| WizardLM v1.0 | 70B | dSFT | 7.71 | - |
| Xwin-LM v0.1 | 70B | dPPO | - | 95.57 |
| GPT-3.5-turbo | - | RLHF | 7.94 | 89.37 |
| Claude 2 | - | RLHF | 8.06 | 91.36 |
| GPT-4 | - | RLHF | 8.99 | 95.28 |

In particular, on several categories of
MT-Bench, Zephyr-7B-β has strong performance compared to larger open models like Llama2-Chat-70B. However, on more complex tasks like coding and mathematics, Zephyr-7B-β lags behind proprietary models and more research is needed to close the gap. The model was initially fine-tuned on a filtered and preprocessed version of the `UltraChat` dataset, which contains a diverse range of synthetic dialogues generated by ChatGPT. We then further aligned the model with 🤗 TRL's `DPOTrainer` on the openbmb/UltraFeedback dataset, which contains 64k prompts and model completions that are ranked by GPT-4. As a result, the model can be used for chat and you can check out our demo to test its capabilities. You can find the datasets used for training Zephyr-7B-β here. Here's how you can run the model using the `pipeline()` function from 🤗 Transformers: Zephyr-7B-β has not been aligned to human preferences with techniques like RLHF or deployed with in-the-loop filtering of responses like ChatGPT, so the model can produce problematic outputs (especially when prompted to do so). It is also unknown what the size and composition of the corpus used to train the base model (`mistralai/Mistral-7B-v0.1`) were; however, it is likely to have included a mix of web data and technical sources like books and code. See the Falcon 180B model card for an example of this.
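The `pipeline()` snippet referenced above was not preserved. A minimal sketch follows; the model ID comes from the Zephyr repo, the chat template is applied by the tokenizer, and the sampling settings are illustrative assumptions:

```python
# Sketch of the pipeline() usage described above; sampling settings
# are illustrative, not the card's exact values.
def build_messages(system: str, user: str) -> list:
    """Chat messages in the role/content format expected by chat templates."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

def chat(user_msg: str) -> str:
    # Heavy imports kept local so the helper above stays importable
    # without torch/transformers installed.
    import torch
    from transformers import pipeline

    pipe = pipeline(
        "text-generation",
        model="HuggingFaceH4/zephyr-7b-beta",
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    messages = build_messages("You are a friendly chatbot.", user_msg)
    # Let the tokenizer apply Zephyr's chat template for us.
    prompt = pipe.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    out = pipe(prompt, max_new_tokens=256, do_sample=True,
               temperature=0.7, top_k=50, top_p=0.95)
    return out[0]["generated_text"]

if __name__ == "__main__":
    print(chat("How many helicopters can a human eat in one sitting?"))
```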
During DPO training, this model achieves the following results on the evaluation set:
- Loss: 0.7496
- Rewards/chosen: -4.5221
- Rewards/rejected: -8.3184
- Rewards/accuracies: 0.7812
- Rewards/margins: 3.7963
- Logps/rejected: -340.1541
- Logps/chosen: -299.4561
- Logits/rejected: -2.3081
- Logits/chosen: -2.3531

The following hyperparameters were used during training:
- learning_rate: 5e-07
- train_batch_size: 2
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 16
- total_train_batch_size: 32
- total_eval_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3.0

The table below shows the full set of DPO training metrics:

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6284 | 0.05 | 100 | 0.6098 | 0.0425 | -0.1872 | 0.7344 | 0.2297 | -258.8416 | -253.8099 | -2.7976 | -2.8234 |
| 0.4908 | 0.1 | 200 | 0.5426 | -0.0279 | -0.6842 | 0.75 | 0.6563 | -263.8124 | -254.5145 | -2.7719 | -2.7960 |
| 0.5264 | 0.15 | 300 | 0.5324 | 0.0414 | -0.9793 | 0.7656 | 1.0207 | -266.7627 | -253.8209 | -2.7892 | -2.8122 |
| 0.5536 | 0.21 | 400 | 0.4957 | -0.0185 | -1.5276 | 0.7969 | 1.5091 | -272.2460 | -254.4203 | -2.8542 | -2.8764 |
| 0.5362 | 0.26 | 500 | 0.5031 | -0.2630 | -1.5917 | 0.7812 | 1.3287 | -272.8869 | -256.8653 | -2.8702 | -2.8958 |
| 0.5966 | 0.31 | 600 | 0.5963 | -0.2993 | -1.6491 | 0.7812 | 1.3499 | -273.4614 | -257.2279 | -2.8778 | -2.8986 |
| 0.5014 | 0.36 | 700 | 0.5382 | -0.2859 | -1.4750 | 0.75 | 1.1891 | -271.7204 | -257.0942 | -2.7659 | -2.7869 |
| 0.5334 | 0.41 | 800 | 0.5677 | -0.4289 | -1.8968 | 0.7969 | 1.4679 | -275.9378 |
-258.5242 | -2.7053 | -2.7265 | | 0.5251 | 0.46 | 900 | 0.5772 | -0.2116 | -1.3107 | 0.7344 | 1.0991 | -270.0768 | -256.3507 | -2.8463 | -2.8662 | | 0.5205 | 0.52 | 1000 | 0.5262 | -0.3792 | -1.8585 | 0.7188 | 1.4793 | -275.5552 | -258.0276 | -2.7893 | -2.7979 | | 0.5094 | 0.57 | 1100 | 0.5433 | -0.6279 | -1.9368 | 0.7969 | 1.3089 | -276.3377 | -260.5136 | -2.7453 | -2.7536 | | 0.5837 | 0.62 | 1200 | 0.5349 | -0.3780 | -1.9584 | 0.7656 | 1.5804 | -276.5542 | -258.0154 | -2.7643 | -2.7756 | | 0.5214 | 0.67 | 1300 | 0.5732 | -1.0055 | -2.2306 | 0.7656 | 1.2251 | -279.2761 | -264.2903 | -2.6986 | -2.7113 | | 0.6914 | 0.72 | 1400 | 0.5137 | -0.6912 | -2.1775 | 0.7969 | 1.4863 | -278.7448 | -261.1467 | -2.7166 | -2.7275 | | 0.4655 | 0.77 | 1500 | 0.5090 | -0.7987 | -2.2930 | 0.7031 | 1.4943 | -279.8999 | -262.2220 | -2.6651 | -2.6838 | | 0.5731 | 0.83 | 1600 | 0.5312 | -0.8253 | -2.3520 | 0.7812 | 1.5268 | -280.4902 | -262.4876 | -2.6543 | -2.6728 | | 0.5233 | 0.88 | 1700 | 0.5206 | -0.4573 | -2.0951 | 0.7812 | 1.6377 | -277.9205 | -258.8084 | -2.6870 | -2.7097 | | 0.5593 | 0.93 | 1800 | 0.5231 | -0.5508 | -2.2000 | 0.7969 | 1.6492 | -278.9703 | -259.7433 | -2.6221 | -2.6519 | | 0.4967 | 0.98 | 1900 | 0.5290 | -0.5340 | -1.9570 | 0.8281 | 1.4230 | -276.5395 | -259.5749 | -2.6564 | -2.6878 | | 0.0921 | 1.03 | 2000 | 0.5368 | -1.1376 | -3.1615 | 0.7812 | 2.0239 | -288.5854 | -265.6111 | -2.6040 | -2.6345 | | 0.0733 | 1.08 | 2100 | 0.5453 | -1.1045 | -3.4451 | 0.7656 | 2.3406 | -291.4208 | -265.2799 | -2.6289 | -2.6595 | | 0.0972 | 1.14 | 2200 | 0.5571 | -1.6915 | -3.9823 | 0.8125 | 2.2908 | -296.7934 | -271.1505 | -2.6471 | -2.6709 | | 0.1058 | 1.19 | 2300 | 0.5789 | -1.0621 | -3.8941 | 0.7969 | 2.8319 | -295.9106 | -264.8563 | -2.5527 | -2.5798 | | 0.2423 | 1.24 | 2400 | 0.5455 | -1.1963 | -3.5590 | 0.7812 | 2.3627 | -292.5599 | -266.1981 | -2.5414 | -2.5784 | | 0.1177 | 1.29 | 2500 | 0.5889 | -1.8141 | -4.3942 | 0.7969 | 2.5801 | -300.9120 | -272.3761 | -2.4802 | 
-2.5189 | | 0.1213 | 1.34 | 2600 | 0.5683 | -1.4608 | -3.8420 | 0.8125 | 2.3812 | -295.3901 | -268.8436 | -2.4774 | -2.5207 | | 0.0889 | 1.39 | 2700 | 0.5890 | -1.6007 | -3.7337 | 0.7812 | 2.1330 | -294.3068 | -270.2423 | -2.4123 | -2.4522 | | 0.0995 | 1.45 | 2800 | 0.6073 | -1.5519 | -3.8362 | 0.8281 | 2.2843 | -295.3315 | -269.7538 | -2.4685 | -2.5050 | | 0.1145 | 1.5 | 2900 | 0.5790 | -1.7939 | -4.2876 | 0.8438 | 2.4937 | -299.8461 | -272.1744 | -2.4272 | -2.4674 | | 0.0644 | 1.55 | 3000 | 0.5735 | -1.7285 | -4.2051 | 0.8125 | 2.4766 | -299.0209 | -271.5201 | -2.4193 | -2.4574 | | 0.0798 | 1.6 | 3100 | 0.5537 | -1.7226 | -4.2850 | 0.8438 | 2.5624 | -299.8200 | -271.4610 | -2.5367 | -2.5696 | | 0.1013 | 1.65 | 3200 | 0.5575 | -1.5715 | -3.9813 | 0.875 | 2.4098 | -296.7825 | -269.9498 | -2.4926 | -2.5267 | | 0.1254 | 1.7 | 3300 | 0.5905 | -1.6412 | -4.4703 | 0.8594 | 2.8291 | -301.6730 | -270.6473 | -2.5017 | -2.5340 | | 0.085 | 1.76 | 3400 | 0.6133 | -1.9159 | -4.6760 | 0.8438 | 2.7601 | -303.7296 | -273.3941 | -2.4614 | -2.4960 | | 0.065 | 1.81 | 3500 | 0.6074 | -1.8237 | -4.3525 | 0.8594 | 2.5288 | -300.4951 | -272.4724 | -2.4597 | -2.5004 | | 0.0755 | 1.86 | 3600 | 0.5836 | -1.9252 | -4.4005 | 0.8125 | 2.4753 | -300.9748 | -273.4872 | -2.4327 | -2.4716 | | 0.0746 | 1.91 | 3700 | 0.5789 | -1.9280 | -4.4906 | 0.8125 | 2.5626 | -301.8762 | -273.5149 | -2.4686 | -2.5115 | | 0.1348 | 1.96 | 3800 | 0.6015 | -1.8658 | -4.2428 | 0.8281 | 2.3769 | -299.3976 | -272.8936 | -2.4943 | -2.5393 | | 0.0217 | 2.01 | 3900 | 0.6122 | -2.3335 | -4.9229 | 0.8281 | 2.5894 | -306.1988 | -277.5699 | -2.4841 | -2.5272 | | 0.0219 | 2.07 | 4000 | 0.6522 | -2.9890 | -6.0164 | 0.8281 | 3.0274 | -317.1334 | -284.1248 | -2.4105 | -2.4545 | | 0.0119 | 2.12 | 4100 | 0.6922 | -3.4777 | -6.6749 | 0.7969 | 3.1972 | -323.7187 | -289.0121 | -2.4272 | -2.4699 | | 0.0153 | 2.17 | 4200 | 0.6993 | -3.2406 | -6.6775 | 0.7969 | 3.4369 | -323.7453 | -286.6413 | -2.4047 | -2.4465 | | 0.011 | 2.22 | 4300 | 
0.7178 | -3.7991 | -7.4397 | 0.7656 | 3.6406 | -331.3667 | -292.2260 | -2.3843 | -2.4290 | | 0.0072 | 2.27 | 4400 | 0.6840 | -3.3269 | -6.8021 | 0.8125 | 3.4752 | -324.9908 | -287.5042 | -2.4095 | -2.4536 | | 0.0197 | 2.32 | 4500 | 0.7013 | -3.6890 | -7.3014 | 0.8125 | 3.6124 | -329.9841 | -291.1250 | -2.4118 | -2.4543 | | 0.0182 | 2.37 | 4600 | 0.7476 | -3.8994 | -7.5366 | 0.8281 | 3.6372 | -332.3356 | -293.2291 | -2.4163 | -2.4565 | | 0.0125 | 2.43 | 4700 | 0.7199 | -4.0560 | -7.5765 | 0.8438 | 3.5204 | -332.7345 | -294.7952 | -2.3699 | -2.4100 | | 0.0082 | 2.48 | 4800 | 0.7048 | -3.6613 | -7.1356 | 0.875 | 3.4743 | -328.3255 | -290.8477 | -2.3925 | -2.4303 | | 0.0118 | 2.53 | 4900 | 0.6976 | -3.7908 | -7.3152 | 0.8125 | 3.5244 | -330.1224 | -292.1431 | -2.3633 | -2.4047 | | 0.0118 | 2.58 | 5000 | 0.7198 | -3.9049 | -7.5557 | 0.8281 | 3.6508 | -332.5271 | -293.2844 | -2.3764 | -2.4194 | | 0.006 | 2.63 | 5100 | 0.7506 | -4.2118 | -7.9149 | 0.8125 | 3.7032 | -336.1194 | -296.3530 | -2.3407 | -2.3860 | | 0.0143 | 2.68 | 5200 | 0.7408 | -4.2433 | -7.9802 | 0.8125 | 3.7369 | -336.7721 | -296.6682 | -2.3509 | -2.3946 | | 0.0057 | 2.74 | 5300 | 0.7552 | -4.3392 | -8.0831 | 0.7969 | 3.7439 | -337.8013 | -297.6275 | -2.3388 | -2.3842 | | 0.0138 | 2.79 | 5400 | 0.7404 | -4.2395 | -7.9762 | 0.8125 | 3.7367 | -336.7322 | -296.6304 | -2.3286 | -2.3737 | | 0.0079 | 2.84 | 5500 | 0.7525 | -4.4466 | -8.2196 | 0.7812 | 3.7731 | -339.1662 | -298.7007 | -2.3200 | -2.3641 | | 0.0077 | 2.89 | 5600 | 0.7520 | -4.5586 | -8.3485 | 0.7969 | 3.7899 | -340.4545 | -299.8206 | -2.3078 | -2.3517 | | 0.0094 | 2.94 | 5700 | 0.7527 | -4.5542 | -8.3509 | 0.7812 | 3.7967 | -340.4790 | -299.7773 | -2.3062 | -2.3510 | | 0.0054 | 2.99 | 5800 | 0.7520 | -4.5169 | -8.3079 | 0.7812 | 3.7911 | -340.0493 | -299.4038 | -2.3081 | -2.3530 | - Transformers 4.35.0.dev0 - Pytorch 2.0.1+cu118 - Datasets 2.12.0 - Tokenizers 0.14.0 If you find Zephyr-7B-β is useful in your work, please cite it with:
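As context for the metrics table above: the reward and loss columns are related by the standard DPO formula. The rewards are the β-scaled log-probability ratios of the policy against the reference model, and the loss is the negative log-sigmoid of their margin. A minimal sketch (assuming the usual DPO definitions; β = 0.1 here is illustrative, not necessarily the value used in this training run):

```python
import math

def dpo_metrics(policy_chosen_logp, policy_rejected_logp,
                ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss plus the reward quantities logged in the table."""
    # "Rewards" are beta-scaled log-prob ratios vs the reference model.
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected            # Rewards/margins
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log(sigmoid(margin))
    return loss, reward_chosen, reward_rejected, margin
```

When the policy prefers the chosen completion more strongly than the reference does, the margin is positive and the per-pair loss falls below log 2, which matches the trajectory in the table (margins grow while accuracies stay around 0.78 to 0.85).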
claude2-alpaca-7B-GGUF
bun_mistral_7b_v2-GGUF
Llama-2-13B-fp16
Kunoichi-7B-GGUF
SciPhi-Self-RAG-Mistral-7B-32k-GGUF
Synthia-7B-v1.3-GGUF
MistRP-Airoboros-7B-GGUF
Tinyllama-2-1b-miniguanaco-GGUF
CodeLlama-70B-hf-GGUF
orca_mini_v3_7B-GGUF
Nethena-MLewd-Xwin-23B-GGUF
Llama-2-7B-32K-Instruct-GGUF
Llama-2-70B-fp16
MythoMax-L2-13B-GPTQ
phi-2-orange-GGUF
Dolphin-Llama-13B-GGUF
CAMEL-13B-Role-Playing-Data-GGUF
LLaMA-13b-GGUF
WizardLM-30B-Uncensored-GGUF
Wizard-Vicuna-7B-Uncensored-HF
WhiteRabbitNeo-33B-v1-GGUF
CollectiveCognition-v1-Mistral-7B-GGUF
Mistral-Trismegistus-7B-GGUF
Nous-Hermes-2-Mixtral-8x7B-SFT-GGUF
Mixtral_7Bx2_MoE-GGUF
dolphin-2_2-yi-34b-GGUF
Llama-2-13B-chat-GPTQ
TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z)

Llama 2 13B Chat - GPTQ
- Model creator: Meta Llama 2
- Original model: Llama 2 13B Chat

This repo contains GPTQ model files for Meta's Llama 2 13B-chat. Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.

Repositories available:
- AWQ model(s) for GPU inference.
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference
- Meta Llama 2's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions

Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements. Each separate quant is in a different branch. See below for instructions on fetching from different branches. All recent GPTQ files are made with AutoGPTQ, and all files in non-main branches are made with AutoGPTQ. Files in the `main` branch which were uploaded before August 2023 were made with GPTQ-for-LLaMa.

Explanation of GPTQ parameters:
- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is the default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16K+), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
- ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama models in 4-bit.

| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
| ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
| main | 4 | 128 | No | 0.01 | wikitext | 4096 | 7.26 GB | Yes | 4-bit, without Act Order and group size 128g. |
| gptq-4bit-32g-actorder_True | 4 | 32 | Yes | 0.01 | wikitext | 4096 | 8.00 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. |
| gptq-4bit-64g-actorder_True | 4 | 64 | Yes | 0.01 | wikitext | 4096 | 7.51 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. |
| gptq-4bit-128g-actorder_True | 4 | 128 | Yes | 0.01 | wikitext | 4096 | 7.26 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
| gptq-8bit-128g-actorder_True | 8 | 128 | Yes | 0.01 | wikitext | 4096 | 13.65 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. |
| gptq-8bit-64g-actorder_True | 8 | 64 | Yes | 0.01 | wikitext | 4096 | 13.95 GB | No | 8-bit, with group size 64g and Act Order for even higher inference quality. Poor AutoGPTQ CUDA speed. |
| gptq-8bit-128g-actorder_False | 8 | 128 | No | 0.01 | wikitext | 4096 | 13.65 GB | No | 8-bit, with group size 128g for higher inference quality and without Act Order to improve AutoGPTQ speed. |
| gptq-8bit--1g-actorder_True | 8 | None | Yes | 0.01 | wikitext | 4096 | 13.36 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements. |

How to download, including from branches:
- In text-generation-webui, you can add `:branch` to the end of the download name, eg `TheBloke/Llama-2-13B-chat-GPTQ:main`
- With Git, you can clone a branch with:
- In Python Transformers code, the branch is the `revision` parameter; see below.

How to easily download and use this model in text-generation-webui. Please make sure you're using the latest version of text-generation-webui. It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.

1. Click the Model tab.
2. Under Download custom model or LoRA, enter `TheBloke/Llama-2-13B-chat-GPTQ`.
   - To download from a specific branch, enter for example `TheBloke/Llama-2-13B-chat-GPTQ:main` - see Provided Files above for the list of branches for each option.
3. Click Download.
4. The model will start downloading. Once it's finished it will say "Done".
5. In the top left, click the refresh icon next to Model.
6. In the Model dropdown, choose the model you just downloaded: `Llama-2-13B-chat-GPTQ`
7. The model will automatically load, and is now ready for use!
8. If you want any custom settings, set them and then click Save settings for this model followed by Reload the Model in the top right. Note that you do not need to and should not set manual GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
9. Once you're ready, click the Text Generation tab and enter a prompt to get started!

Requires: Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later. If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead. For CodeLlama models only: you must use Transformers 4.33.0 or later.
If 4.33.0 is not yet released when you read this, you will need to install Transformers from source: The files provided are tested to work with AutoGPTQ, both via Transformers and using AutoGPTQ directly. They should also work with Occ4m's GPTQ-for-LLaMa fork. ExLlama is compatible with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility. Huggingface Text Generation Inference (TGI) is compatible with all GPTQ models. For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. 
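Selecting one of the quantisation branches from Python goes through the `revision` argument of `from_pretrained`. A minimal sketch, assuming Transformers 4.32+ with the GPTQ integration installed (the branch name is one from the Provided Files table; the demo section is gated so it only runs when explicitly requested):

```python
import os

MODEL_ID = "TheBloke/Llama-2-13B-chat-GPTQ"

def load_kwargs(branch: str = "main") -> dict:
    """Kwargs for from_pretrained; the quant branch is passed as `revision`."""
    return {"device_map": "auto", "revision": branch}

# Set RUN_GPTQ_DEMO=1 to actually download and run the model
# (requires transformers>=4.32, optimum, auto-gptq, a GPU, and network access).
if os.environ.get("RUN_GPTQ_DEMO"):
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, **load_kwargs("gptq-4bit-32g-actorder_True")
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    inputs = tokenizer("[INST] Tell me about AI [/INST]",
                       return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```

Omitting `revision` fetches the `main` branch, i.e. the 4-bit group-size-128 quant without Act Order.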
Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Alicia Loh, Stephen Murray, K, Ajan Kanaga, RoA, Magnesian, Deo Leter, Olakabola, Eugene Pentland, zynix, Deep Realms, Raymond Fosdick, Elijah Stavena, Iucharbius, Erik Bjäreholt, Luis Javier Navarrete Lozano, Nicholas, theTransient, John Detwiler, alfiei, knownsqashed, Mano Prime, Willem Michiel, Enrico Ros, LangChain4j, OG, Michael Dempsey, Pierre Kircher, Pedro Madruga, James Bentley, Thomas Belote, Luke @flexchar, Leonard Tan, Johann-Peter Hartmann, Illia Dulskyi, Fen Risland, Chadd, SX, Jeff Scroggin, Ken Nordquist, Sean Connelly, Artur Olbinski, Swaroop Kallakuri, Jack West, Ai Maven, David Ziegler, Russ Johnson, transmissions 11, John Villwock, Alps Aficionado, Clay Pascal, Viktor Bowallius, Subspace Studios, Rainer Wilmers, Trenton Dambrowitz, vamX, Michael Levine, 준교 김, Brandon Frisco, Kalila, Trailburnt, Randy H, Talal Aujan, Nathan Dryer, Vadim, 阿明, ReadyPlayerEmma, Tiffany J. Kim, George Stoitzev, Spencer Kim, Jerry Meng, Gabriel Tamborski, Cory Kujawski, Jeffrey Morgan, Spiking Neurons AB, Edmond Seymore, Alexandros Triantafyllidis, Lone Striker, Cap'n Zoog, Nikolai Manek, danny, ya boyyy, Derek Yates, usrbinkat, Mandus, TL, Nathan LeClaire, subjectnull, Imad Khwaja, webtim, Raven Klaugh, Asp the Wyvern, Gabriel Puliatti, Caitlyn Gatomon, Joseph William Delisle, Jonathan Leane, Luke Pendergrass, SuperWojo, Sebastain Graf, Will Dee, Fred von Graf, Andrey, Dan Guido, Daniel P. Andersen, Nitin Borwankar, Elle, Vitor Caleffi, biorpg, jjj, NimbleBox.ai, Pieter, Matthew Berman, terasurfer, Michael Davis, Alex, Stanislav Ovsiannikov And thank you again to a16z for their generous grant. Llama 2 Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. 
This is the repository for the 13B fine-tuned model, optimized for dialogue use cases and converted for the Hugging Face Transformers format. Links to other models can be found in the index at the bottom.

Model Details

Note: Use of this model is governed by the Meta license. In order to download the model weights and tokenizer, please visit the website and accept our License before requesting access here.

Meta developed and publicly released the Llama 2 family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama-2-Chat, are optimized for dialogue use cases. Llama-2-Chat models outperform open-source chat models on most benchmarks we tested, and in our human evaluations for helpfulness and safety, are on par with some popular closed-source models like ChatGPT and PaLM.

Variations: Llama 2 comes in a range of parameter sizes (7B, 13B, and 70B) as well as pretrained and fine-tuned variations.

Model Architecture: Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety.

||Training Data|Params|Content Length|GQA|Tokens|LR|
|---|---|---|---|---|---|---|
|Llama 2|A new mix of publicly available online data|7B|4k|✗|2.0T|3.0 × 10⁻⁴|
|Llama 2|A new mix of publicly available online data|13B|4k|✗|2.0T|3.0 × 10⁻⁴|
|Llama 2|A new mix of publicly available online data|70B|4k|✔|2.0T|1.5 × 10⁻⁴|

Llama 2 family of models. Token counts refer to pretraining data only. All models are trained with a global batch size of 4M tokens. The bigger 70B model uses Grouped-Query Attention (GQA) for improved inference scalability.

Model Dates: Llama 2 was trained between January 2023 and July 2023.

Status: This is a static model trained on an offline dataset.
Future versions of the tuned models will be released as we improve model safety with community feedback.

License: A custom commercial license is available at: https://ai.meta.com/resources/models-and-libraries/llama-downloads/

Research Paper: "Llama 2: Open Foundation and Fine-Tuned Chat Models"

Intended Use

Intended Use Cases: Llama 2 is intended for commercial and research use in English. Tuned models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks. To get the expected features and performance for the chat versions, a specific formatting needs to be followed, including the `[INST]` and `<<SYS>>` tags, `BOS` and `EOS` tokens, and the whitespaces and line breaks in between (we recommend calling `strip()` on inputs to avoid double-spaces). See our reference code in GitHub for details: `chat_completion`.

Out-of-scope Uses: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English. Use in any other way that is prohibited by the Acceptable Use Policy and Licensing Agreement for Llama 2.

Hardware and Software

Training Factors: We used custom training libraries, Meta's Research Super Cluster, and production clusters for pretraining. Fine-tuning, annotation, and evaluation were also performed on third-party cloud compute.

Carbon Footprint: Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB (TDP of 350-400W). Estimated total emissions were 539 tCO₂eq, 100% of which were offset by Meta's sustainability program.

||Time (GPU hours)|Power Consumption (W)|Carbon Emitted (tCO₂eq)|
|---|---|---|---|
|Llama 2 7B|184320|400|31.22|
|Llama 2 13B|368640|400|62.44|
|Llama 2 70B|1720320|400|291.42|
|Total|3311616||539.00|

CO₂ emissions during pretraining. Time: total GPU time required for training each model.
Power Consumption: peak power capacity per GPU device for the GPUs used, adjusted for power usage efficiency. 100% of the emissions are directly offset by Meta's sustainability program, and because we are openly releasing these models, the pretraining costs do not need to be incurred by others.

Training Data

Overview: Llama 2 was pretrained on 2 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over one million new human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data.

Data Freshness: The pretraining data has a cutoff of September 2022, but some tuning data is more recent, up to July 2023.

In this section, we report the results for the Llama 1 and Llama 2 models on standard academic benchmarks. For all the evaluations, we use our internal evaluations library.

|Model|Size|Code|Commonsense Reasoning|World Knowledge|Reading Comprehension|Math|MMLU|BBH|AGI Eval|
|---|---|---|---|---|---|---|---|---|---|
|Llama 1|7B|14.1|60.8|46.2|58.5|6.95|35.1|30.3|23.9|
|Llama 1|13B|18.9|66.1|52.6|62.3|10.9|46.9|37.0|33.9|
|Llama 1|33B|26.0|70.0|58.4|67.6|21.4|57.8|39.8|41.7|
|Llama 1|65B|30.7|70.7|60.5|68.6|30.8|63.4|43.5|47.6|
|Llama 2|7B|16.8|63.9|48.9|61.3|14.6|45.3|32.6|29.3|
|Llama 2|13B|24.5|66.9|55.4|65.8|28.7|54.8|39.4|39.1|
|Llama 2|70B|37.5|71.9|63.6|69.4|35.2|68.9|51.2|54.2|

Overall performance on grouped academic benchmarks. Code: We report the average pass@1 scores of our models on HumanEval and MBPP. Commonsense Reasoning: We report the average of PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA, and CommonsenseQA. We report 7-shot results for CommonSenseQA and 0-shot results for all other benchmarks. World Knowledge: We evaluate the 5-shot performance on NaturalQuestions and TriviaQA and report the average. Reading Comprehension: For reading comprehension, we report the 0-shot average on SQuAD, QuAC, and BoolQ.
MATH: We report the average of the GSM8K (8-shot) and MATH (4-shot) benchmarks at top 1.

|||TruthfulQA|Toxigen|
|---|---|---|---|
|Llama 1|7B|27.42|23.00|
|Llama 1|13B|41.74|23.08|
|Llama 1|33B|44.19|22.57|
|Llama 1|65B|48.71|21.77|
|Llama 2|7B|33.29|21.25|
|Llama 2|13B|41.86|26.10|
|Llama 2|70B|50.18|24.60|

Evaluation of pretrained LLMs on automatic safety benchmarks. For TruthfulQA, we present the percentage of generations that are both truthful and informative (the higher the better). For ToxiGen, we present the percentage of toxic generations (the smaller the better).

|||TruthfulQA|Toxigen|
|---|---|---|---|
|Llama-2-Chat|7B|57.04|0.00|
|Llama-2-Chat|13B|62.18|0.00|
|Llama-2-Chat|70B|64.14|0.01|

Evaluation of fine-tuned LLMs on different safety datasets. Same metric definitions as above.

Ethical Considerations and Limitations

Llama 2 is a new technology that carries risks with use. Testing conducted to date has been in English, and has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Llama 2's potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Llama 2, developers should perform safety testing and tuning tailored to their specific applications of the model.
Please see the Responsible Use Guide available at https://ai.meta.com/llama/responsible-use-guide/

Reporting Issues: Please report any software "bug", or other problems with the models, through one of the following means:
- Reporting issues with the model: github.com/facebookresearch/llama
- Reporting problematic content generated by the model: developers.facebook.com/llama_output_feedback
- Reporting bugs and security concerns: facebook.com/whitehat/info

Llama Model Index

|Model|Llama2|Llama2-hf|Llama2-chat|Llama2-chat-hf|
|---|---|---|---|---|
|7B|Link|Link|Link|Link|
|13B|Link|Link|Link|Link|
|70B|Link|Link|Link|Link|
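The chat-format requirement mentioned under Intended Use can be sketched as a small helper. This mirrors the single-turn format used by Meta's reference `chat_completion` code ([INST] wrapping plus an optional <<SYS>> system block); the helper itself is illustrative, not Meta's implementation:

```python
def llama2_chat_prompt(user_msg, system_msg=None):
    """Build a single-turn Llama-2-Chat prompt (illustrative helper)."""
    # strip() on inputs avoids the double-space issue the model card warns about.
    if system_msg is not None:
        content = f"<<SYS>>\n{system_msg.strip()}\n<</SYS>>\n\n{user_msg.strip()}"
    else:
        content = user_msg.strip()
    return f"[INST] {content} [/INST]"

print(llama2_chat_prompt("Tell me about AI", "You are a helpful assistant."))
```

The BOS/EOS tokens are normally added by the tokenizer, so they are deliberately not part of the string here.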
Mixtral_34Bx2_MoE_60B-GGUF
Rose-20B-GGUF
OpenHermes-2.5-neural-chat-7B-v3-2-7B-GGUF
Hermes-Trismegistus-Mistral-7B-GGUF
sqlcoder2-GGUF
MythoMax-Kimiko-Mix-GGUF
meditron-70B-GGUF
Nous-Capybara-limarpv3-34B-GGUF
stable-vicuna-13B-HF
Unholy-v2-13B-GGUF
WizardLM-13B-V1-1-SuperHOT-8K-GPTQ
orca_mini_13B-GPTQ
LongChat-13B-GPTQ
lzlv_70B-AWQ
llama-2-70b-Guanaco-QLoRA-fp16
Wizard-Vicuna-30B-Superhot-8K-fp16
Llama-2-7B-vietnamese-20k-GGUF
VicUnlocked-30B-LoRA-HF
vicuna-13b-v1.3.0-GPTQ
WizardLM-30B-GPTQ
Yi-6B-200K-GGUF
Vicuna-33B-1-3-SuperHOT-8K-fp16
guanaco-65B-HF
gpt4-alpaca-lora-30b-HF
Project-Baize-v2-13B-GPTQ
robin-33B-v2-fp16
BigTranslate-13B-GPTQ
wizard-vicuna-13B-HF
tulu-30B-fp16
WizardLM-30B-fp16
CAMEL-33B-Combined-Data-SuperHOT-8K-fp16
OpenAssistant-SFT-7-Llama-30B-HF
law-LLM-13B-GGUF
airoboros-33B-gpt4-1-4-SuperHOT-8K-fp16
Platypus-30B-SuperHOT-8K-fp16
law-chat-GGUF
openchat_v2_openorca_preview-GPTQ
VicUnlocked-alpaca-65B-QLoRA-fp16
UltraLM-13B-fp16
alpaca-lora-65B-HF
MAmmoTH-Coder-34B-GGUF
OpenAssistant-SFT-7-Llama-30B-GPTQ
dromedary-65b-lora-HF
Wizard-Vicuna-30B-Uncensored-fp16
gpt4-alpaca-lora_mlp-65B-HF
gpt4-alpaca-lora-13B-HF
tulu-13B-fp16
Manticore-13B-Chat-Pyg-Guanaco-SuperHOT-8K-GPTQ
WizardLM-13B-V1-1-SuperHOT-8K-fp16
GPlatty-30B-SuperHOT-8K-fp16
airoboros-13B-HF
robin-13B-v2-fp16
Planner-7B-fp16
Yi-34B-200K-DARE-megamerge-v8-GGUF
Project-Baize-v2-7B-GPTQ
Vicuna-13B-CoT-fp16
robin-33B-v2-GPTQ
tulu-7B-fp16
Llama-2-Coder-7B-GGUF
guanaco-13B-HF
Nous-Hermes-13B-SuperHOT-8K-fp16
robin-65b-v2-fp16
Llama-2-7B-Chat-GGML
Chinese-Alpaca-33B-SuperHOT-8K-fp16
deepseek-llm-7B-base-GGUF
llama-30b-supercot-SuperHOT-8K-fp16
airoboros-7b-gpt4-fp16
h2ogpt-oasst1-512-30B-HF
UNA-TheBeagle-7B-v1-GGUF
Generate_Question_Mistral_7B-GGUF
Yi-34B-200K-GGUF
Nethena-13B-GGUF
WizardLM-33B-V1.0-Uncensored-GGUF
MythoLogic-Mini-7B-GGUF
CodeFuse-CodeLlama-34B-GGUF
vicuna-13B-v1.5-16K-GGUF
Rose-20B-AWQ
leo-hessianai-13B-chat-bilingual-GGUF
Noromaid-13B-v0.3-GGUF
WizardLM-13B-V1.0-Uncensored-GGUF
Yi-6B-GGUF
WizardLM-70B-V1.0-GGUF
leo-hessianai-13B-chat-GGUF
MXLewd-L2-20B-GGUF
llama-2-13B-Guanaco-QLoRA-GGUF
guanaco-65B-GPTQ
Llama-2-70B-GGUF
Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss-GGUF
TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z)

Noromaid V0.4 Mixtral Instruct 8X7B ZLoss - GGUF
- Model creator: NeverSleep
- Original model: Noromaid V0.4 Mixtral Instruct 8X7B ZLoss

This repo contains GGUF format model files for NeverSleep's Noromaid V0.4 Mixtral Instruct 8X7B ZLoss. These files were quantised using hardware kindly provided by Massed Compute.

GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. Here is an incomplete list of clients and libraries that are known to support GGUF:
- llama.cpp. The source project for GGUF. Offers a CLI and a server option.
- text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
- KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for storytelling.
- GPT4All, a free and open source local running GUI, supporting Windows, Linux and macOS with full GPU accel.
- LM Studio, an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration. Linux available, in beta as of 27/11/2023.
- LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection.
- Faraday.dev, an attractive and easy to use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
- llama-cpp-python, a Python library with GPU accel, LangChain support, and OpenAI-compatible API server.
- candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use.
- ctransformers, a Python library with GPU accel, LangChain support, and OpenAI-compatible AI server. Note, as of time of writing (November 27th 2023), ctransformers has not been updated in a long time and does not support many recent models.

Repositories available:
- AWQ model(s) for GPU inference.
GPTQ models for GPU inference, with multiple quantisation parameter options. 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference. NeverSleep's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions.

These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third party UIs and libraries - please see the list at the top of this README.

Explanation of quantisation methods:
- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.
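The bits-per-weight figures quoted above follow directly from the block layouts: total bits stored per 256-weight super-block divided by 256. A worked sketch for two of the quoted sizes (based on the block structures described in the llama.cpp k-quants code; treated here as illustrative arithmetic):

```python
SUPER_BLOCK = 256  # weights per super-block (QK_K in llama.cpp)

def q4_k_bpw():
    # "type-1": 8 blocks of 32 weights, 4-bit quants,
    # 6-bit scale + 6-bit min per block, fp16 d + dmin per super-block.
    quant_bits = SUPER_BLOCK * 4   # 1024
    scale_bits = 8 * (6 + 6)       # 96
    header_bits = 2 * 16           # fp16 scale and min for the super-block
    return (quant_bits + scale_bits + header_bits) / SUPER_BLOCK

def q6_k_bpw():
    # "type-0": 16 blocks of 16 weights, 6-bit quants,
    # 8-bit scale per block, fp16 d per super-block.
    quant_bits = SUPER_BLOCK * 6   # 1536
    scale_bits = 16 * 8            # 128
    header_bits = 16               # fp16 scale for the super-block
    return (quant_bits + scale_bits + header_bits) / SUPER_BLOCK

print(q4_k_bpw())  # 4.5
print(q6_k_bpw())  # 6.5625
```

Both results match the 4.5 and 6.5625 bpw figures quoted in the list above.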
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| noromaid-v0.4-mixtral-instruct-8x7b-zloss.Q2_K.gguf | Q2_K | 2 | 17.17 GB | 19.67 GB | smallest, significant quality loss - not recommended for most purposes |
| noromaid-v0.4-mixtral-instruct-8x7b-zloss.Q3_K_M.gguf | Q3_K_M | 3 | 22.48 GB | 24.98 GB | very small, high quality loss |
| noromaid-v0.4-mixtral-instruct-8x7b-zloss.Q4_0.gguf | Q4_0 | 4 | 26.44 GB | 28.94 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| noromaid-v0.4-mixtral-instruct-8x7b-zloss.Q4_K_M.gguf | Q4_K_M | 4 | 28.38 GB | 30.88 GB | medium, balanced quality - recommended |
| noromaid-v0.4-mixtral-instruct-8x7b-zloss.Q5_0.gguf | Q5_0 | 5 | 32.23 GB | 34.73 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| noromaid-v0.4-mixtral-instruct-8x7b-zloss.Q5_K_M.gguf | Q5_K_M | 5 | 33.23 GB | 35.73 GB | large, very low quality loss - recommended |
| noromaid-v0.4-mixtral-instruct-8x7b-zloss.Q6_K.gguf | Q6_K | 6 | 38.38 GB | 40.88 GB | very large, extremely low quality loss |
| noromaid-v0.4-mixtral-instruct-8x7b-zloss.Q8_0.gguf | Q8_0 | 8 | 49.62 GB | 52.12 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.

The following clients/libraries will automatically download models for you, providing a list of available models to choose from. Under Download Model, you can enter the model repo: TheBloke/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss-GGUF and below it, a specific filename to download, such as: noromaid-v0.4-mixtral-instruct-8x7b-zloss.Q4_K_M.gguf.
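A single file from the table above can also be fetched from the command line with `huggingface-cli` (a sketch of the usual pattern; requires the `huggingface-hub` package and network access; the filename here is the recommended Q4_K_M quant):

```shell
pip3 install huggingface-hub

# Fetch just the Q4_K_M quant into the current directory:
huggingface-cli download TheBloke/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss-GGUF \
  noromaid-v0.4-mixtral-instruct-8x7b-zloss.Q4_K_M.gguf \
  --local-dir . --local-dir-use-symlinks False
```

Swap the filename for any other row of the Provided Files table to download a different quant.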
On the command line, including multiple files at once, I recommend using the `huggingface-hub` Python library. Then you can download any individual model file to the current directory, at high speed, with a command like this: More advanced huggingface-cli download usage (click to read). You can also download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI.

To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows Command Line users: You can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.

Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. Change `-c 32768` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. Note that longer sequence lengths require much more resources, so you may need to reduce this value. If you want to have a chat-style conversation, replace the `-p` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation.

Further instructions can be found in the text-generation-webui documentation, here: text-generation-webui/docs/04 ‐ Model Tab.md.

You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries. Note that at the time of writing (Nov 27th 2023), ctransformers has not been updated for some time and is not compatible with some recent models. Therefore I recommend you use llama-cpp-python.

How to load this model in Python code, using llama-cpp-python. For full documentation, please see: llama-cpp-python docs.
Run one of the following commands, according to your system: Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python LangChain + ctransformers For further support, and discussions on these models and AI in general, join us at: I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training. If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects. Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits. Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI Patreon special mentions: Michael Levine, 阿明, Trailburnt, Nikolai Manek, John Detwiler, Randy H, Will Dee, Sebastain Graf, NimbleBox.ai, Eugene Pentland, Emad Mostaque, Ai Maven, Jim Angel, Jeff Scroggin, Michael Davis, Manuel Alberto Morcote, Stephen Murray, Robert, Justin Joy, Luke @flexchar, Brandon Frisco, Elijah Stavena, SX, Dan Guido, Undi ., Komninos Chatzipapas, Shadi, theTransient, Lone Striker, Raven Klaugh, jjj, Cap'n Zoog, Michel-Marie MAUDET (LINAGORA), Matthew Berman, David, Fen Risland, Omer Bin Jawed, Luke Pendergrass, Kalila, OG, Erik Bjäreholt, Rooh Singh, Joseph William Delisle, Dan Lewis, TL, John Villwock, AzureBlack, Brad, Pedro Madruga, Caitlyn Gatomon, K, jinyuan sun, Mano Prime, Alex, Jeffrey Morgan, Alicia Loh, Illia Dulskyi, Chadd, transmissions 11, fincy, Rainer Wilmers, ReadyPlayerEmma, knownsqashed, Mandus, biorpg, Deo Leter, Brandon Phillips, SuperWojo, Sean Connelly, Iucharbius, Jack West, Harry Royden McLaughlin, Nicholas, terasurfer, Vitor Caleffi, Duane Dunston, Johann-Peter Hartmann, David Ziegler, Olakabola, Ken Nordquist, Trenton Dambrowitz, Tom X Nguyen, Vadim, 
Ajan Kanaga, Leonard Tan, Clay Pascal, Alexandros Triantafyllidis, JM33133, Xule, vamX, ya boyyy, subjectnull, Talal Aujan, Alps Aficionado, wassieverse, Ari Malik, James Bentley, Woland, Spencer Kim, Michael Dempsey, Fred von Graf, Elle, zynix, William Richards, Stanislav Ovsiannikov, Edmond Seymore, Jonathan Leane, Martin Kemka, usrbinkat, Enrico Ros. And thank you again to a16z for their generous grant.

Original model card: NeverSleep's Noromaid V0.4 Mixtral Instruct 8X7B ZLoss

Disclaimer: This model is experimental, do not expect everything to work. This model was trained on the ZLoss fork of Charles, and should fix the issues the model had. Use the ChatML prompt format, but not the special token. The reason is that Axolotl merges the finetune with the base model at essentially 1.0 weight, which is too much, so I used another script, available HERE, to merge with less weight; sadly, it doesn't take the special ChatML token. It's like Orca2 in that regard. This repo contains FP16 files of Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss.

Note: We have permission from all users to upload their ratings; we DON'T screenshot random reviews without asking if we can put them here! If you want your rating to be here, send us a message over on DC and we'll put up a screenshot of it here. DC names are "ikaridev" and "undi".

- Aesir 1, 2 & 3, modified by us; credit to MinervaAI / Gryphe
- LimaRP-20231109 (Lemonilia)
- ToxicQAFinal (NobodyExistsOnTheInternet)
- No-robots-ShareGPT (Doctor-Shotgun)

IkariDev: Visit my retro/neocities-style website please kek
LLaMA2-13B-Psyfighter2-GGUF
fin-llama-33B-GGUF
Kimiko-Mistral-7B-GGUF
meditron-7B-chat-GGUF
Code-13B-GGUF
Manticore-13B-GGUF
MistralLite-7B-GGUF
LLaMA2-13B-TiefighterLR-GGUF
deepseek-coder-1.3b-instruct-AWQ
GodziLLa2-70B-GGUF
chronos-hermes-13B-GGUF
NeuralHermes-2.5-Mistral-7B-GGUF
Mixtral-SlimOrca-8x7B-GGUF
Airoboros-L2-70b-2.2-GGUF
Xwin-LM-70B-V0.1-GGUF
Mixtral-8x7B-v0.1-GPTQ
chronos007-70B-GGUF
TinyLlama-1.1B-intermediate-step-480k-1T-GGUF
xDAN-L1-Chat-RL-v1-GGUF
deepseek-coder-5.7bmqa-base-GGUF
fiction.live-Kimiko-V2-70B-GGUF
AquilaChat2-34B-16K-GGUF
mistral-ft-optimized-1227-GGUF
Xwin-LM-13B-v0.2-GGUF
Writing_Partner_Mistral_7B-GGUF
Ferret_7B-GGUF
StellarBright-GGUF
WestLake-7B-v2-GGUF
Marcoroni-7B-v3-GGUF
juanako-7B-UNA-GGUF
CollectiveCognition-v1.1-Mistral-7B-GGUF
Phind-CodeLlama-34B-v1-GPTQ
med42-70B-GGUF
CodeLlama-7B-Instruct-GPTQ
Sensualize-Mixtral-GGUF
vicuna-13B-v1.5-GGUF
gorilla-7B-GGUF
WizardLM-70B-V1.0-GPTQ
Llama-2-70B-Orca-200k-GGUF
Orca-2-7B-GGUF
Llama2-70B-OASST-SFT-v10-GPTQ
Sarah_StoryTeller_13b-GGUF
CodeBooga-34B-v0.1-GGUF
TinyLlama-1.1B-python-v0.1-GGUF
Mistral-7B-Instruct-v0.1-GPTQ
DiscoLM-70B-GGUF
finance-LLM-13B-GGUF
law-LLM-GGUF
wizardLM-7B-GGUF
TinyLlama-1.1B-Chat-v1.0-AWQ
calm2-7B-chat-GGUF
Python-Code-33B-GGUF
goliath-120b-GGUF
alfred-40B-1023-GGUF
Chinese-Llama-2-7B-GGUF
MLewd-ReMM-L2-Chat-20B-GGUF
MetaMath-13B-V1.0-GGUF
Naberius-7B-GGUF
speechless-mistral-dolphin-orca-platypus-samantha-7B-GGUF
opus-v0-7B-GGUF
Phind-CodeLlama-34B-Python-v1-GGUF
NeuralBeagle14-7B-GGUF
DPOpenHermes-7B-GGUF
DaringMaid-20B-GGUF
Wizard Vicuna 30B Uncensored GPTQ
TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z)

Wizard Vicuna 30B Uncensored - GPTQ
- Model creator: Eric Hartford
- Original model: Wizard Vicuna 30B Uncensored

This repo contains GPTQ model files for Eric Hartford's Wizard-Vicuna-30B-Uncensored. Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.

- AWQ model(s) for GPU inference.
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference.
- Eric Hartford's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions.

Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements. Each separate quant is in a different branch. See below for instructions on fetching from different branches. All recent GPTQ files are made with AutoGPTQ, and all files in non-main branches are made with AutoGPTQ. Files in the `main` branch which were uploaded before August 2023 were made with GPTQ-for-LLaMa.

- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is the default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy.
Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).

- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
- ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama models in 4-bit.

| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
| ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
| main | 4 | None | Yes | 0.01 | wikitext | 2048 | 16.94 GB | Yes | 4-bit, with Act Order. No group size, to lower VRAM requirements. |
| gptq-4bit-32g-actorder_True | 4 | 32 | Yes | 0.01 | wikitext | 2048 | 19.44 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. |
| gptq-4bit-64g-actorder_True | 4 | 64 | Yes | 0.01 | wikitext | 2048 | 18.18 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. |
| gptq-4bit-128g-actorder_True | 4 | 128 | Yes | 0.01 | wikitext | 2048 | 17.55 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
| gptq-8bit--1g-actorder_True | 8 | None | Yes | 0.01 | wikitext | 2048 | 32.99 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements. |
| gptq-8bit-128g-actorder_False | 8 | 128 | No | 0.01 | wikitext | 2048 | 33.73 GB | No | 8-bit, with group size 128g for higher inference quality and without Act Order to improve AutoGPTQ speed. |
| gptq-3bit--1g-actorder_True | 3 | None | Yes | 0.01 | wikitext | 2048 | 12.92 GB | No | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
| gptq-3bit-128g-actorder_False | 3 | 128 | No | 0.01 | wikitext | 2048 | 13.51 GB | No | 3-bit, with group size 128g but no act-order. Slightly higher VRAM requirements than 3-bit None. |

- In text-generation-webui, you can add `:branch` to the end of the download name, eg `TheBloke/Wizard-Vicuna-30B-Uncensored-GPTQ:main`
- With Git, you can clone a branch with:
- In Python Transformers code, the branch is the `revision` parameter; see below.

How to easily download and use this model in text-generation-webui. Please make sure you're using the latest version of text-generation-webui. It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.

1. Click the Model tab.
2. Under Download custom model or LoRA, enter `TheBloke/Wizard-Vicuna-30B-Uncensored-GPTQ`.
   - To download from a specific branch, enter for example `TheBloke/Wizard-Vicuna-30B-Uncensored-GPTQ:main` - see Provided Files above for the list of branches for each option.
3. Click Download.
4. The model will start downloading. Once it's finished it will say "Done".
5. In the top left, click the refresh icon next to Model.
6. In the Model dropdown, choose the model you just downloaded: `Wizard-Vicuna-30B-Uncensored-GPTQ`
7. The model will automatically load, and is now ready for use!
8. If you want any custom settings, set them and then click Save settings for this model followed by Reload the Model in the top right. Note that you do not need to and should not set manual GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
9. Once you're ready, click the Text Generation tab and enter a prompt to get started!
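Outside text-generation-webui, the same branch selection can be done from Python with `huggingface_hub`. A sketch; the branch name below is one of those listed in the Provided Files table:

```python
def download_branch(repo_id: str, branch: str, local_dir: str = "."):
    # Requires: pip install huggingface-hub
    from huggingface_hub import snapshot_download
    # revision accepts a branch name, tag, or commit hash
    return snapshot_download(repo_id=repo_id, revision=branch, local_dir=local_dir)

if __name__ == "__main__":
    download_branch("TheBloke/Wizard-Vicuna-30B-Uncensored-GPTQ",
                    "gptq-4bit-32g-actorder_True")
```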
Requires: Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later. If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead. For CodeLlama models only: you must use Transformers 4.33.0 or later. If 4.33.0 is not yet released when you read this, you will need to install Transformers from source. The files provided are tested to work with AutoGPTQ, both via Transformers and using AutoGPTQ directly. They should also work with Occ4m's GPTQ-for-LLaMa fork. ExLlama is compatible with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility. Huggingface Text Generation Inference (TGI) is compatible with all GPTQ models.
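In Transformers the quant branch is passed as the `revision` parameter. A sketch under the version requirements above; the `device_map` value is an illustrative choice, not mandated by the repo:

```python
def load_gptq(repo_id: str, revision: str = "main"):
    # Requires: transformers>=4.32.0, optimum>=1.12.0, auto-gptq>=0.4.2
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=True)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        revision=revision,   # a quant branch, e.g. "gptq-4bit-32g-actorder_True"
        device_map="auto",   # spread layers across available GPUs/CPU
    )
    return tokenizer, model

if __name__ == "__main__":
    tokenizer, model = load_gptq("TheBloke/Wizard-Vicuna-30B-Uncensored-GPTQ")
```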
Patreon: https://patreon.com/TheBlokeAI Ko-Fi: https://ko-fi.com/TheBlokeAI

Patreon special mentions: Alicia Loh, Stephen Murray, K, Ajan Kanaga, RoA, Magnesian, Deo Leter, Olakabola, Eugene Pentland, zynix, Deep Realms, Raymond Fosdick, Elijah Stavena, Iucharbius, Erik Bjäreholt, Luis Javier Navarrete Lozano, Nicholas, theTransient, John Detwiler, alfiei, knownsqashed, Mano Prime, Willem Michiel, Enrico Ros, LangChain4j, OG, Michael Dempsey, Pierre Kircher, Pedro Madruga, James Bentley, Thomas Belote, Luke @flexchar, Leonard Tan, Johann-Peter Hartmann, Illia Dulskyi, Fen Risland, Chadd, SX, Jeff Scroggin, Ken Nordquist, Sean Connelly, Artur Olbinski, Swaroop Kallakuri, Jack West, Ai Maven, David Ziegler, Russ Johnson, transmissions 11, John Villwock, Alps Aficionado, Clay Pascal, Viktor Bowallius, Subspace Studios, Rainer Wilmers, Trenton Dambrowitz, vamX, Michael Levine, 준교 김, Brandon Frisco, Kalila, Trailburnt, Randy H, Talal Aujan, Nathan Dryer, Vadim, 阿明, ReadyPlayerEmma, Tiffany J. Kim, George Stoitzev, Spencer Kim, Jerry Meng, Gabriel Tamborski, Cory Kujawski, Jeffrey Morgan, Spiking Neurons AB, Edmond Seymore, Alexandros Triantafyllidis, Lone Striker, Cap'n Zoog, Nikolai Manek, danny, ya boyyy, Derek Yates, usrbinkat, Mandus, TL, Nathan LeClaire, subjectnull, Imad Khwaja, webtim, Raven Klaugh, Asp the Wyvern, Gabriel Puliatti, Caitlyn Gatomon, Joseph William Delisle, Jonathan Leane, Luke Pendergrass, SuperWojo, Sebastain Graf, Will Dee, Fred von Graf, Andrey, Dan Guido, Daniel P. Andersen, Nitin Borwankar, Elle, Vitor Caleffi, biorpg, jjj, NimbleBox.ai, Pieter, Matthew Berman, terasurfer, Michael Davis, Alex, Stanislav Ovsiannikov. And thank you again to a16z for their generous grant.

Original model card: Eric Hartford's Wizard-Vicuna-30B-Uncensored

This is an fp16 model of Eric Hartford's Wizard-Vicuna 30B. It is the result of converting Eric's original fp32 upload to fp16.

- 4bit GPTQ models for GPU inference.
- 4bit and 5bit GGML models for CPU inference.
- float16 HF format model for GPU inference and further conversions.

Patreon special mentions: Aemon Algiz, Dmitriy Samsonov, Nathan LeClaire, Trenton Dambrowitz, Mano Prime, David Flickinger, vamX, Nikolai Manek, senxiiz, Khalefa Al-Ahmad, Illia Dulskyi, Jonathan Leane, Talal Aujan, V. Lukas, Joseph William Delisle, Pyrater, Oscar Rangel, Lone Striker, Luke Pendergrass, Eugene Pentland, Sebastain Graf, Johann-Peter Hartman.

This is wizard-vicuna-13b trained with a subset of the dataset - responses that contained alignment / moralizing were removed. The intent is to train a WizardLM that doesn't have alignment built in, so that alignment (of any sort) can be added separately, for example with an RLHF LoRA. Shout out to the open source AI/ML community, and everyone who helped me out. You are responsible for anything you do with the model, just as you are responsible for anything you do with any dangerous object such as a knife, gun, lighter, or car. Publishing anything this model generates is the same as publishing it yourself. You are responsible for the content you publish, and you cannot blame the model any more than you can blame the knife, gun, lighter, or car for what you do with it.