speakleash
Bielik-11B-v3.0-Instruct
Bielik-11B-v2.3-Instruct
Bielik-11B-v2.3-Instruct is a generative text model featuring 11 billion parameters. It is a linear merge of the Bielik-11B-v2.0-Instruct, Bielik-11B-v2.1-Instruct, and Bielik-11B-v2.2-Instruct models, which are instruct fine-tuned versions of the Bielik-11B-v2. Forementioned model stands as a testament to the unique collaboration between the open-science/open-souce project SpeakLeash and the High Performance Computing (HPC) center: ACK Cyfronet AGH. Developed and trained on Polish text corpora, which has been cherry-picked and processed by the SpeakLeash team, this endeavor leverages Polish large-scale computing infrastructure, specifically within the PLGrid environment, and more precisely, the HPC centers: ACK Cyfronet AGH. The creation and training of the Bielik-11B-v2.3-Instruct was propelled by the support of computational grant number PLG/2024/016951, conducted on the Athena and Helios supercomputer, enabling the use of cutting-edge technology and computational resources essential for large-scale machine learning processes. As a result, the model exhibits an exceptional ability to understand and process the Polish language, providing accurate responses and performing a variety of linguistic tasks with high precision. 📚 Technical report: https://arxiv.org/abs/2505.02410 Chat Arena is a platform for testing and comparing different AI language models, allowing users to evaluate their performance and quality. The SpeakLeash team is working on their own set of instructions in Polish, which is continuously being expanded and refined by annotators. A portion of these instructions, which had been manually verified and corrected, has been utilized for training purposes. Moreover, due to the limited availability of high-quality instructions in Polish, synthetic instructions were generated with Mixtral 8x22B and used in training. The dataset used for training comprised over 20 million instructions, consisting of more than 10 billion tokens. The instructions varied in quality, leading to a deterioration in the model’s performance. To counteract this while still allowing ourselves to utilize the aforementioned datasets, several improvements were introduced: Weighted tokens level loss - a strategy inspired by offline reinforcement learning and C-RLFT Adaptive learning rate inspired by the study on Learning Rates as a Function of Batch Size Masked prompt tokens To align the model with user preferences we tested many different techniques: DPO, PPO, KTO, SiMPO. Finally the DPO-Positive method was employed, utilizing both generated and manually corrected examples, which were scored by a metamodel. A dataset comprising over 66,000 examples of varying lengths to address different aspects of response style. It was filtered and evaluated by the reward model to select instructions with the right level of difference between chosen and rejected. The novelty introduced in DPO-P was multi-turn conversations introduction. Bielik instruct models have been trained with the use of an original open source framework called ALLaMo implemented by Krzysztof Ociepa. This framework allows users to train language models with architecture similar to LLaMA and Mistral in fast and efficient way. Bielik-11B-v2.3-Instruct is a merge of the Bielik-11B-v2.0-Instruct, Bielik-11B-v2.1-Instruct, and Bielik-11B-v2.2-Instruct models. The merge was performed in float16 precision by Remigiusz Kinas using mergekit. Developed by: SpeakLeash & ACK Cyfronet AGH Language: Polish Model type: causal decoder-only Merged from: Bielik-11B-v2.0-Instruct, Bielik-11B-v2.1-Instruct, Bielik-11B-v2.2-Instruct License: Apache 2.0 and Terms of Use Quantized models: We know that some people want to explore smaller models or don't have the resources to run a full model. Therefore, we have prepared quantized versions of the Bielik-11B-v2.3-Instruct model in separate repositories: - GGUF - Q4KM, Q5KM, Q6K, Q80 - GPTQ - 4bit - FP8 (vLLM, SGLang - Ada Lovelace, Hopper optimized) - GGUF - experimental - IQ imatrix IQ1M, IQ2XXS, IQ3XXS, IQ4XS and calibrated Q4KM, Q5KM, Q6K, Q80 Please note that quantized models may offer lower quality of generated answers compared to full sized variatns. Bielik-11B-v2.3-Instruct uses ChatML as the prompt format. This format is available as a chat template via the `applychattemplate()` method: Fully formated input conversation by applychattemplate from previous example: Bielik-11B-v2.3-Instruct has been evaluated on several benchmarks to assess its performance across various tasks and languages. These benchmarks include: 1. Open PL LLM Leaderboard 2. Open LLM Leaderboard 3. Polish MT-Bench 4. Polish EQ-Bench (Emotional Intelligence Benchmark) 5. MixEval The following sections provide detailed results for each of these benchmarks, demonstrating the model's capabilities in both Polish and English language tasks. Models have been evaluated on Open PL LLM Leaderboard 5-shot. The benchmark evaluates models in NLP tasks like sentiment analysis, categorization, text classification but does not test chatting skills. Average column is an average score among all tasks normalized by baseline scores. | Model | Parameters (B)| Average | |---------------------------------|------------|---------| | Meta-Llama-3.1-405B-Instruct-FP8,API | 405 | 69.44 | | Mistral-Large-Instruct-2407 | 123 | 69.11 | | Qwen2-72B-Instruct | 72 | 65.87 | | Bielik-11B-v2.3-Instruct | 11 | 65.71 | | Bielik-11B-v2.2-Instruct | 11 | 65.57 | | Meta-Llama-3.1-70B-Instruct | 70 | 65.49 | | Bielik-11B-v2.1-Instruct | 11 | 65.45 | | Mixtral-8x22B-Instruct-v0.1 | 141 | 65.23 | | Bielik-11B-v2.0-Instruct | 11 | 64.98 | | Meta-Llama-3-70B-Instruct | 70 | 64.45 | | Athene-70B | 70 | 63.65 | | WizardLM-2-8x22B | 141 | 62.35 | | Qwen1.5-72B-Chat | 72 | 58.67 | | Qwen2-57B-A14B-Instruct | 57 | 56.89 | | glm-4-9b-chat | 9 | 56.61 | | aya-23-35B | 35 | 56.37 | | Phi-3.5-MoE-instruct | 41.9 | 56.34 | | openchat-3.5-0106-gemma | 7 | 55.69 | | Mistral-Nemo-Instruct-2407 | 12 | 55.27 | | SOLAR-10.7B-Instruct-v1.0 | 10.7 | 55.24 | | Mixtral-8x7B-Instruct-v0.1 | 46.7 | 55.07 | | Bielik-7B-Instruct-v0.1 | 7 | 44.70 | | trurl-2-13b-academic | 13 | 36.28 | | trurl-2-7b | 7 | 26.93 | The results from the Open PL LLM Leaderboard demonstrate the exceptional performance of Bielik-11B-v2.3-Instruct: 1. Superior performance in its class: Bielik-11B-v2.3-Instruct outperforms all other models with less than 70B parameters. This is a significant achievement, showcasing its efficiency and effectiveness despite having fewer parameters than many competitors. 2. Competitive with larger models: with a score of 65.71, Bielik-11B-v2.3-Instruct performs on par with models in the 70B parameter range. This indicates that it achieves comparable results to much larger models, demonstrating its advanced architecture and training methodology. 3. Substantial improvement over previous version: the model shows a marked improvement over its predecessor, Bielik-7B-Instruct-v0.1, which scored 43.64. This leap in performance highlights the successful enhancements and optimizations implemented in this newer version. 4. Leading position for Polish language models: in the context of Polish language models, Bielik-11B-v2.3-Instruct stands out as a leader. There are no other competitive models specifically tailored for the Polish language that match its performance, making it a crucial resource for Polish NLP tasks. These results underscore Bielik-11B-v2.3-Instruct's position as a state-of-the-art model for Polish language processing, offering high performance with relatively modest computational requirements. Open PL LLM Leaderboard - Generative Tasks Performance This section presents a focused comparison of generative Polish language task performance between Bielik models and GPT-3.5. The evaluation is limited to generative tasks due to the constraints of assessing OpenAI models. The comprehensive nature and associated costs of the benchmark explain the limited number of models evaluated. | Model | Parameters (B) | Average g | |-------------------------------|----------------|---------------| | Bielik-11B-v2.3-Instruct | 11 | 67.47 | Bielik-11B-v2.1-Instruct | 11 | 66.58 | | Bielik-11B-v2.2-Instruct | 11 | 66.11 | | Bielik-11B-v2.0-Instruct | 11 | 65.58 | | gpt-3.5-turbo-instruct | Unknown | 55.65 | The performance variation among Bielik versions is minimal, indicating consistent quality across iterations. Bielik-11B-v2.3-Instruct demonstrates an impressive 21.2% performance advantage over GPT-3.5. The Open LLM Leaderboard evaluates models on various English language tasks, providing insights into the model's performance across different linguistic challenges. | Model | AVG | arcchallenge | hellaswag | truthfulqamc2 | mmlu | winogrande | gsm8k | |--------------------------|-------|---------------|-----------|----------------|-------|------------|-------| | Bielik-11B-v2.2-Instruct | 69.86 | 59.90 | 80.16 | 58.34 | 64.34 | 75.30 | 81.12 | | Bielik-11B-v2.3-Instruct | 69.82 | 59.30 | 80.11 | 57.42 | 64.57 | 76.24 | 81.27 | | Bielik-11B-v2.1-Instruct | 69.82 | 59.56 | 80.20 | 59.35 | 64.18 | 75.06 | 80.59 | | Bielik-11B-v2.0-Instruct | 68.04 | 58.62 | 78.65 | 54.65 | 63.71 | 76.32 | 76.27 | | Bielik-11B-v2 | 65.87 | 60.58 | 79.84 | 46.13 | 63.06 | 77.82 | 67.78 | | Mistral-7B-Instruct-v0.2 | 65.71 | 63.14 | 84.88 | 68.26 | 60.78 | 77.19 | 40.03 | | Bielik-7B-Instruct-v0.1 | 51.26 | 47.53 | 68.91 | 49.47 | 46.18 | 65.51 | 29.95 | Bielik-11B-v2.3-Instruct shows impressive performance on English language tasks: 1. Significant improvement over its base model (4-point increase). 2. Substantial 18-point improvement over Bielik-7B-Instruct-v0.1. These results demonstrate Bielik-11B-v2.3-Instruct's versatility in both Polish and English, highlighting the effectiveness of its instruction tuning process. Polish MT-Bench The Bielik-11B-v2.3-Instruct (16 bit) model was also evaluated using the MT-Bench benchmark. The quality of the model was evaluated using the English version (original version without modifications) and the Polish version created by Speakleash (tasks and evaluation in Polish, the content of the tasks was also changed to take into account the context of the Polish language). MT-Bench English | Model | Score | |-----------------|----------| | Bielik-11B-v2.1 | 8.537500 | | Bielik-11B-v2.3 | 8.531250 | | Bielik-11B-v2.2 | 8.390625 | | Bielik-11B-v2.0 | 8.159375 | MT-Bench Polish | Model | Parameters (B) | Score | |-------------------------------------|----------------|----------| | Qwen2-72B-Instruct | 72 | 8.775000 | | Mistral-Large-Instruct-2407 (123B) | 123 | 8.662500 | | gemma-2-27b-it | 27 | 8.618750 | | Bielik-11B-v2.3-Instruct | 11 | 8.556250 | | Mixtral-8x22b | 141 | 8.231250 | | Meta-Llama-3.1-405B-Instruct | 405 | 8.168750 | | Meta-Llama-3.1-70B-Instruct | 70 | 8.150000 | | Bielik-11B-v2.2-Instruct | 11 | 8.115625 | | Bielik-11B-v2.1-Instruct | 11 | 7.996875 | | gpt-3.5-turbo | Unknown | 7.868750 | | Mixtral-8x7b | 46.7 | 7.637500 | | Bielik-11B-v2.0-Instruct | 11 | 7.562500 | | Mistral-Nemo-Instruct-2407 | 12 | 7.368750 | | openchat-3.5-0106-gemma | 7 | 6.812500 | | Mistral-7B-Instruct-v0.2 | 7 | 6.556250 | | Meta-Llama-3.1-8B-Instruct | 8 | 6.556250 | | Bielik-7B-Instruct-v0.1 | 7 | 6.081250 | | Mistral-7B-Instruct-v0.3 | 7 | 5.818750 | | Polka-Mistral-7B-SFT | 7 | 4.518750 | | trurl-2-7b | 7 | 2.762500 | 1. Strong performance among mid-sized models: Bielik-11B-v2.3-Instruct scored 8.556250, placing it ahead of several well-known models like GPT-3.5-turbo (7.868750) and Mixtral-8x7b (7.637500). This indicates that Bielik-11B-v2.3-Instruct is competitive among mid-sized models, particularly those in the 11B-70B parameter range. 2. Competitive against larger models: Bielik-11B-v2.3-Instruct performs close to Meta-Llama-3.1-70B-Instruct (8.150000), Meta-Llama-3.1-405B-Instruct (8.168750) and even Mixtral-8x22b (8.231250), which have significantly more parameters. This efficiency in performance relative to size could make it an attractive option for tasks where resource constraints are a consideration. Bielik 100% generated answers in Polish, while other models (not typically trained for Polish) can answer Polish questions in English. 3. Significant improvement over previous versions: compared to its predecessor, Bielik-7B-Instruct-v0.1, which scored 6.081250, the Bielik-11B-v2.3-Instruct shows a significant improvement. The score increased by almost 2.5 points, highlighting substantial advancements in model quality, optimization and training methodology. For more information - answers to test tasks and values in each category, visit the MT-Bench PL website. | Model | Parameters (B) | Score | |-------------------------------|--------|-------| | Mistral-Large-Instruct-2407 | 123 | 78.07 | | Meta-Llama-3.1-405B-Instruct-FP8 | 405 | 77.23 | | gpt-4o-2024-08-06 | ? | 75.15 | | gpt-4-turbo-2024-04-09 | ? | 74.59 | | Meta-Llama-3.1-70B-Instruct | 70 | 72.53 | | Qwen2-72B-Instruct | 72 | 71.23 | | Meta-Llama-3-70B-Instruct | 70 | 71.21 | | gpt-4o-mini-2024-07-18 | ? | 71.15 | | Bielik-11B-v2.3-Instruct | 11 | 70.86 | | WizardLM-2-8x22B | 141 | 69.56 | | Bielik-11B-v2.2-Instruct | 11 | 69.05 | | Bielik-11B-v2.0-Instruct | 11 | 68.24 | | Qwen1.5-72B-Chat | 72 | 68.03 | | Mixtral-8x22B-Instruct-v0.1 | 141 | 67.63 | | Bielik-11B-v2.1-Instruct | 11 | 60.07 | | Qwen1.5-32B-Chat | 32 | 59.63 | | openchat-3.5-0106-gemma | 7 | 59.58 | | aya-23-35B | 35 | 58.41 | | gpt-3.5-turbo | ? | 57.7 | | Qwen2-57B-A14B-Instruct | 57 | 57.64 | | Mixtral-8x7B-Instruct-v0.1 | 47 | 57.61 | | SOLAR-10.7B-Instruct-v1.0 | 10.7 | 55.21 | | Mistral-7B-Instruct-v0.2 | 7 | 47.02 | MixEval is a ground-truth-based English benchmark designed to evaluate Large Language Models (LLMs) efficiently and effectively. Key features of MixEval include: 1. Derived from off-the-shelf benchmark mixtures 2. Highly capable model ranking with a 0.96 correlation to Chatbot Arena 3. Local and quick execution, requiring only 6% of the time and cost compared to running MMLU This benchmark provides a robust and time-efficient method for assessing LLM performance, making it a valuable tool for ongoing model evaluation and comparison. | Model | MixEval | MixEval-Hard | |-------------------------------|---------|--------------| | Bielik-11B-v2.1-Instruct | 74.55 | 45.00 | | Bielik-11B-v2.3-Instruct | 72.95 | 43.20 | | Bielik-11B-v2.2-Instruct | 72.35 | 39.65 | | Bielik-11B-v2.0-Instruct | 72.10 | 40.20 | | Mistral-7B-Instruct-v0.2 | 70.00 | 36.20 | The results show that Bielik-11B-v2.3-Instruct performs well on the MixEval benchmark, achieving a score of 72.95 on the standard MixEval and 43.20 on MixEval-Hard. Notably, Bielik-11B-v2.3-Instruct significantly outperforms Mistral-7B-Instruct-v0.2 on both metrics, demonstrating its improved capabilities despite being based on a similar architecture. Bielik-11B-v2.3-Instruct is a quick demonstration that the base model can be easily fine-tuned to achieve compelling and promising performance. It does not have any moderation mechanisms. We're looking forward to engaging with the community in ways to make the model respect guardrails, allowing for deployment in environments requiring moderated outputs. Bielik-11B-v2.3-Instruct can produce factually incorrect output, and should not be relied on to produce factually accurate data. Bielik-11B-v2.3-Instruct was trained on various public datasets. While great efforts have been taken to clear the training data, it is possible that this model can generate lewd, false, biased or otherwise offensive outputs. Citation Please cite this model using the following format: Krzysztof Ociepa SpeakLeash - team leadership, conceptualizing, data preparation, process optimization and oversight of training Łukasz Flis Cyfronet AGH - coordinating and supervising the training Remigiusz Kinas SpeakLeash - conceptualizing and coordinating DPO training, data preparation Adrian Gwoździej SpeakLeash - data preparation and ensuring data quality Krzysztof Wróbel SpeakLeash - benchmarks The model could not have been created without the commitment and work of the entire SpeakLeash team, whose contribution is invaluable. Thanks to the hard work of many individuals, it was possible to gather a large amount of content in Polish and establish collaboration between the open-science SpeakLeash project and the HPC center: ACK Cyfronet AGH. Individuals who contributed to the creation of the model: Sebastian Kondracki, Igor Ciuciura, Paweł Kiszczak, Szymon Baczyński, Jacek Chwiła, Maria Filipkowska, Jan Maria Kowalski, Karol Jezierski, Kacper Milan, Jan Sowa, Len Krawczyk, Marta Seidler, Agnieszka Ratajska, Krzysztof Koziarek, Szymon Pepliński, Zuzanna Dabić, Filip Bogacz, Agnieszka Kosiak, Izabela Babis, Nina Babis. Members of the ACK Cyfronet AGH team providing valuable support and expertise: Szymon Mazurek, Marek Magryś, Mieszko Cholewa . If you have any questions or suggestions, please use the discussion tab. If you want to contact us directly, join our Discord SpeakLeash.
Bielik-11B-v2.6-Instruct
Bielik-11B-v3.0-Instruct-GGUF
Bielik-11B-v3.0-Instruct-FP8-Dynamic
Bielik-Minitron-7B-v3.0-Instruct
Bielik-11B-v2.3-Instruct-GGUF
Bielik-4.5B-v3.0-Instruct
Bielik-7B-Instruct-v0.1
Bielik-4.5B-v3.0-Instruct-GGUF
Bielik-1.5B-v3.0-Instruct
Bielik-11B-v2.6-Instruct-GGUF
Bielik-11B-v2.6-Instruct-FP8-Dynamic
This model was obtained by quantizing the weights and activations of Bielik-11B-v2.6-Instruct to FP8 data type, ready for inference with vLLM >= 0.5.0 or SGLang. AutoFP8 is used for quantization. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scaling maps the FP8 representations of the quantized weights and activations. FP8 compuation is supported on Nvidia GPUs with compute capability > 8.9 (Ada Lovelace, Hopper). DISCLAIMER: Be aware that quantised models show reduced response quality and possible hallucinations! This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM aslo supports OpenAI-compatible serving. See the documentation for more details. Use with SGLang Runtime Launch a server of SGLang Runtime: Then you can send http request or use OpenAI Compatible API. Developed by: SpeakLeash & ACK Cyfronet AGH Language: Polish Model type: causal decoder-only Quant from: Bielik-11B-v2.6-Instruct Finetuned from: Bielik-11B-v2 License: Apache 2.0 and Terms of Use Responsible for model quantization Remigiusz Kinas SpeakLeash - team leadership, conceptualizing, calibration data preparation, process creation and quantized model delivery. If you have any questions or suggestions, please use the discussion tab. If you want to contact us directly, join our Discord SpeakLeash.
Bielik-Guard-0.1B-v1.0
Bielik-7B-Instruct-v0.1-GGUF
Bielik-7B-Instruct-v0.1-AWQ
Bielik-11B-v2
Bielik-11B-v2 is a generative text model featuring 11 billion parameters. It is initialized from its predecessor, Mistral-7B-v0.2, and trained on 400 billion tokens. The aforementioned model stands as a testament to the unique collaboration between the open-science/open-source project SpeakLeash and the High Performance Computing (HPC) center: ACK Cyfronet AGH. Developed and trained on Polish text corpora, which have been cherry-picked and processed by the SpeakLeash team, this endeavor leverages Polish large-scale computing infrastructure, specifically within the PLGrid environment, and more precisely, the HPC center: ACK Cyfronet AGH. The creation and training of the Bielik-11B-v2 was propelled by the support of computational grant number PLG/2024/016951, conducted on the Athena and Helios supercomputer, enabling the use of cutting-edge technology and computational resources essential for large-scale machine learning processes. As a result, the model exhibits an exceptional ability to understand and process the Polish language, providing accurate responses and performing a variety of linguistic tasks with high precision. ⚠️ This is a base model intended for further fine-tuning across most use cases. If you're looking for a model ready for chatting or following instructions out-of-the-box, please use Bielik-11B-v.2.2-Instruct. Chat Arena is a platform for testing and comparing different AI language models, allowing users to evaluate their performance and quality. Bielik-11B-v2 has been trained with Megatron-LM using different parallelization techniques. The model training was conducted on the Helios Supercomputer at the ACK Cyfronet AGH, utilizing 256 NVidia GH200 cards. The training dataset was composed of Polish texts collected and made available through the SpeakLeash project, as well as a subset of CommonCrawl data. We used 200 billion tokens (over 700 GB of plain text) for two epochs of training. Developed by: SpeakLeash & ACK Cyfronet AGH Language: Polish Model type: causal decoder-only Initialized from: Mistral-7B-v0.2 License: Apache 2.0 and Terms of Use Model ref: speakleash:45b6efdb701991181a05968fc53d2a8e An XGBoost classification model was prepared and created to evaluate the quality of texts in native Polish language. It is based on 93 features, such as the ratio of out-of-vocabulary words to all words (OOVs), the number of nouns, verbs, average sentence length etc. The model outputs the category of a given document (either HIGH, MEDIUM or LOW) along with the probability. This approach allows implementation of a dedicated pipeline to choose documents, from which we've used entries with HIGH quality index and probability exceeding 90%. This filtration and appropriate selection of texts enable the provision of a condensed and high-quality database of texts in Polish for training purposes. This model can be easily loaded using the AutoModelForCausalLM functionality. In order to reduce the memory usage, you can use smaller precision (`bfloat16`). And then you can use HuggingFace Pipelines to generate text: Generated output: > Najważniejszym celem człowieka na ziemi jest życie w pokoju, harmonii i miłości. Dla każdego z nas bardzo ważne jest, aby otaczać się kochanymi osobami. Models have been evaluated on two leaderboards: Open PL LLM Leaderboard and Open LLM Leaderboard. The Open PL LLM Leaderboard uses a 5-shot evaluation and focuses on NLP tasks in Polish, while the Open LLM Leaderboard evaluates models on various English language tasks. The benchmark evaluates models in NLP tasks like sentiment analysis, categorization, text classification but does not test chatting skills. Average column is an average score among all tasks normalized by baseline scores. | Model | Parameters (B) | Average | |------------------------|------------|---------| | Meta-Llama-3-70B | 70 | 62.07 | | Qwen1.5-72B | 72 | 61.11 | | Meta-Llama-3.1-70B | 70 | 60.87 | | Mixtral-8x22B-v0.1 | 141 | 60.75 | | Qwen1.5-32B | 32 | 58.71 | | Bielik-11B-v2 | 11 | 58.14 | | Qwen2-7B | 7 | 49.39 | | SOLAR-10.7B-v1.0 | 10.7 | 47.54 | | Mistral-Nemo-Base-2407 | 12 | 47.28 | | internlm2-20b | 20 | 47.15 | | Meta-Llama-3.1-8B | 8 | 43.77 | | Meta-Llama-3-8B | 8 | 43.30 | | Mistral-7B-v0.2 | 7 | 38.81 | | Bielik-7B-v0.1 | 7 | 34.34 | | Qra-13b | 13 | 33.90 | | Qra-7b | 7 | 16.60 | The results from the Open PL LLM Leaderboard show that the Bielik-11B-v2 model, with 11 billion parameters, achieved an average score of 58.14. This makes it the best performing model among those under 20B parameters, outperforming the second-best model in this category by an impressive 8.75 percentage points. This significant lead not only places it ahead of its predecessor, the Bielik-7B-v0.1 (which scored 34.34), but also demonstrates its superiority over other larger models. The substantial improvement highlights the remarkable advancements and optimizations made in this newer version. Other Polish models listed include Qra-13b and Qra-7b, scoring 33.90 and 16.60 respectively, indicating that Bielik-11B-v2 outperforms these models by a considerable margin. Additionally, the Bielik-11B-v2 was initialized from the weights of Mistral-7B-v0.2, which itself scored 38.81, further demonstrating the effective enhancements incorporated into the Bielik-11B-v2 model. The Open LLM Leaderboard evaluates models on various English language tasks, providing insights into the model's performance across different linguistic challenges. | Model | AVG | arcchallenge | hellaswag | truthfulqamc2 | mmlu | winogrande | gsm8k | |-------------------------|-------|---------------|-----------|----------------|-------|------------|-------| | Bielik-11B-v2 | 65.87 | 60.58 | 79.84 | 46.13 | 63.06 | 77.82 | 67.78 | | Mistral-7B-v0.2 | 60.37 | 60.84 | 83.08 | 41.76 | 63.62 | 78.22 | 34.72 | | Bielik-7B-v0.1 | 49.98 | 45.22 | 67.92 | 47.16 | 43.20 | 66.85 | 29.49 | The results from the Open LLM Leaderboard demonstrate the impressive performance of Bielik-11B-v2 across various NLP tasks. With an average score of 65.87, it significantly outperforms its predecessor, Bielik-7B-v0.1, and even surpasses Mistral-7B-v0.2, which served as its initial weight basis. Key observations: 1. Bielik-11B-v2 shows substantial improvements in most categories compared to Bielik-7B-v0.1, highlighting the effectiveness of the model's enhancements. 2. It performs exceptionally well in tasks like hellaswag (common sense reasoning), winogrande (commonsense reasoning), and gsm8k (mathematical problem-solving), indicating its versatility across different types of language understanding and generation tasks. 3. While Mistral-7B-v0.2 outperforms in truthfulqamc2, Bielik-11B-v2 maintains competitive performance in this truth-discernment task. Although Bielik-11B-v2 was primarily trained on Polish data, it has retained and even improved its ability to understand and operate in English, as evidenced by its strong performance across these English-language benchmarks. This suggests that the model has effectively leveraged cross-lingual transfer learning, maintaining its Polish language expertise while enhancing its English language capabilities. Bielik-11B-v2 is not intended for deployment without fine-tuning. It should not be used for human-facing interactions without further guardrails and user consent. Bielik-11B-v2 can produce factually incorrect output, and should not be relied on to produce factually accurate data. Bielik-11B-v2 was trained on various public datasets. While great efforts have been taken to clear the training data, it is possible that this model can generate lewd, false, biased or otherwise offensive outputs. Citation Please cite this model using the following format: Krzysztof Ociepa SpeakLeash - team leadership, conceptualizing, data preparation, process optimization and oversight of training Łukasz Flis Cyfronet AGH - coordinating and supervising the training Adrian Gwoździej SpeakLeash - data cleaning and quality Krzysztof Wróbel SpeakLeash - benchmarks The model could not have been created without the commitment and work of the entire SpeakLeash team, whose contribution is invaluable. Thanks to the hard work of many individuals, it was possible to gather a large amount of content in Polish and establish collaboration between the open-science SpeakLeash project and the HPC center: ACK Cyfronet AGH. Individuals who contributed to the creation of the model: Grzegorz Urbanowicz, Igor Ciuciura, Jacek Chwiła, Szymon Baczyński, Paweł Kiszczak, Aleksander Smywiński-Pohl. Members of the ACK Cyfronet AGH team providing valuable support and expertise: Szymon Mazurek, Marek Magryś. If you have any questions or suggestions, please use the discussion tab. If you want to contact us directly, join our Discord SpeakLeash.
Bielik-4.5B-v3.0-Instruct-FP8-Dynamic
Bielik-11B-v2.6-Instruct-AWQ
Bielik-4.5B-v3
Bielik-Minitron-7B-v3.0-Instruct-GGUF
Bielik-11B-v2.2-Instruct-GGUF
Bielik-11B-v2.6-Instruct-bnb-4bit
Bielik-1.5B-v3.0-Instruct-GGUF
Bielik-11B-v3-Base-20250730
Bielik-11B-v3.0-Instruct-MLX-8bit
Bielik-11B-v2.3-Instruct-FP8
Bielik-Guard-0.1B-v1.1
Bielik-Guard-0.5B-v1.1
Bielik-1.5B-v3.0-Instruct-FP8-Dynamic
Bielik-11B-v3.0-Instruct-MLX-4bit
Bielik-1.5B-v3
Bielik-7B-v0.1
Bielik-11B-v2.2-Instruct-GPTQ
Bielik-11B-v2.0-Instruct-GGUF
Bielik-11B-v2.2-Instruct-GGUF-IQ-Imatrix
Bielik-11B-v2.2-Instruct-MLX-8bit
Bielik-11B-v2.5-Instruct-GGUF
This repo contains GGUF format model files for SpeakLeash's Bielik-11B-v2.5-Instruct. DISCLAIMER: Be aware that quantised models show reduced response quality and possible hallucinations! Available quantization formats: q4km: Uses Q6K for half of the attention.wv and feedforward.w2 tensors, else Q4K q5km: Uses Q6K for half of the attention.wv and feedforward.w2 tensors, else Q5K q6k: Uses Q8K for all tensors q80: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. 16bit: Converted fp16 to GGUF format. Ollama Modfile The GGUF file can be used with Ollama. To do this, you need to import the model using the configuration defined in the Modfile. For model eg. Bielik-11B-v2.5-Instruct.Q4KM.gguf (full path to model location) Modfile looks like: Developed by: SpeakLeash & ACK Cyfronet AGH Language: Polish Model type: causal decoder-only Quant from: Bielik-11B-v2.5-Instruct Finetuned from: Bielik-11B-v2 License: Apache 2.0 and Terms of Use GGUF is a new format introduced by the llama.cpp team on August 21st 2023. Here is an incomplete list of clients and libraries that are known to support GGUF: llama.cpp. The source project for GGUF. Offers a CLI and a server option. text-generation-webui, the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration. KoboldCpp, a fully featured web UI, with GPU accel across all platforms and GPU architectures. Especially good for story telling. GPT4All, a free and open source local running GUI, supporting Windows, Linux and macOS with full GPU accel. LM Studio, an easy-to-use and powerful local GUI for Windows, macOS (Silicon) and Linux, with GPU acceleration LoLLMS Web UI, a great web UI with many interesting and unique features, including a full model library for easy model selection. Faraday.dev, an attractive and easy to use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration. llama-cpp-python, a Python library with GPU accel, LangChain support, and OpenAI-compatible API server. candle, a Rust ML framework with a focus on performance, including GPU support, and ease of use. ctransformers, a Python library with GPU accel, LangChain support, and OpenAI-compatible AI server. Note ctransformers has not been updated in a long time and does not support many recent models. Responsible for model quantization Remigiusz Kinas SpeakLeash - team leadership, conceptualizing, calibration data preparation, process creation and quantized model delivery. If you have any questions or suggestions, please use the discussion tab. If you want to contact us directly, join our Discord SpeakLeash.
Bielik-11B-v2.3-Instruct-GGUF-IQ-Imatrix
Bielik-11B-v2.2-Instruct
Bielik-11B-v2.6-Instruct-MLX-8bit
Bielik-11B-v2.5-Instruct-FP8-Dynamic
Bielik-11B-v2.0-Instruct-GGUF-IQ-Imatrix
Bielik-11B-v2.2-Instruct-EXL2-4.5bit
Bielik-11B-v2.6-Instruct-MLX-4bit
Bielik-11B-v2.2-Instruct-W8A8
Bielik-11B-v2.5-Instruct-AWQ
This repo contains AWQ format model files for SpeakLeash's Bielik-11B-v2.5-Instruct. DISCLAIMER: Be aware that quantised models show reduced response quality and possible hallucinations! Developed by: SpeakLeash & ACK Cyfronet AGH Language: Polish Model type: causal decoder-only Quant from: Bielik-11B-v2.5-Instruct Finetuned from: Bielik-11B-v2 License: Apache 2.0 and Terms of Use Responsible for model quantization Remigiusz Kinas SpeakLeash - team leadership, conceptualizing, calibration data preparation, process creation and quantized model delivery. If you have any questions or suggestions, please use the discussion tab. If you want to contact us directly, join our Discord SpeakLeash.
Bielik-Minitron-7B-v3.0-Instruct-FP8-Dynamic
Bielik-11B-v2.2-Instruct-Quanto-8bit
Bielik-7B-Instruct-v0.1-MLX
Bielik-11B-v2.1-Instruct-GGUF-IQ-Imatrix
Bielik-11B-v2.2-Instruct-FP8
Bielik-11B-v2.1-Instruct-GGUF
Bielik-11B-v2.2-Instruct-MLX-4bit
Bielik-11B-v2.2-Instruct-AWQ
Bielik-11B-v2.0-Instruct
Bielik-11B-v2.0-Instruct is a generative text model featuring 11 billion parameters. It is an instruct fine-tuned version of the Bielik-11B-v2. Forementioned model stands as a testament to the unique collaboration between the open-science/open-souce project SpeakLeash and the High Performance Computing (HPC) center: ACK Cyfronet AGH. Developed and trained on Polish text corpora, which has been cherry-picked and processed by the SpeakLeash team, this endeavor leverages Polish large-scale computing infrastructure, specifically within the PLGrid environment, and more precisely, the HPC centers: ACK Cyfronet AGH. The creation and training of the Bielik-11B-v2.0-Instruct was propelled by the support of computational grant number PLG/2024/016951, conducted on the Athena and Helios supercomputer, enabling the use of cutting-edge technology and computational resources essential for large-scale machine learning processes. As a result, the model exhibits an exceptional ability to understand and process the Polish language, providing accurate responses and performing a variety of linguistic tasks with high precision. Chat Arena is a platform for testing and comparing different AI language models, allowing users to evaluate their performance and quality. The SpeakLeash team is working on their own set of instructions in Polish, which is continuously being expanded and refined by annotators. A portion of these instructions, which had been manually verified and corrected, has been utilized for training purposes. Moreover, due to the limited availability of high-quality instructions in Polish, synthetic instructions were generated with Mixtral 8x22B and used in training. The dataset used for training comprised over 16 million instructions, consisting of more than 8 billion tokens. The instructions varied in quality, leading to a deterioration in the model’s performance. To counteract this while still allowing ourselves to utilize the aforementioned datasets, several improvements were introduced: Weighted tokens level loss - a strategy inspired by offline reinforcement learning and C-RLFT Adaptive learning rate inspired by the study on Learning Rates as a Function of Batch Size Masked prompt tokens Bielik-11B-v2.0-Instruct has been trained with the use of an original open source framework called ALLaMo implemented by Krzysztof Ociepa. This framework allows users to train language models with architecture similar to LLaMA and Mistral in fast and efficient way. Developed by: SpeakLeash & ACK Cyfronet AGH Language: Polish Model type: causal decoder-only Finetuned from: Bielik-11B-v2 License: Apache 2.0 and Terms of Use Model ref: speakleash:16d24fc7821149765826d22f335eee5f Quantized models: We know that some people want to explore smaller models or don't have the resources to run a full model. Therefore, we have prepared quantized versions of the Bielik-11B-v2.0-Instruct model in separate repositories: - GGUF - Q4KM, Q5KM, Q6K, Q80 - GPTQ - 4bit - FP8 (vLLM, SGLang - Ada Lovelace, Hopper optimized) - GGUF - experimental - IQ imatrix IQ1M, IQ2XXS, IQ3XXS, IQ4XS and calibrated Q4KM, Q5KM, Q6K, Q80 Please note that quantized models may offer lower quality of generated answers compared to full sized variatns. Bielik-11B-v2.0-Instruct uses ChatML as the prompt format. This format is available as a chat template via the `applychattemplate()` method: Fully formated input conversation by applychattemplate from previous example: Bielik-11B-v2.0-Instruct has been evaluated on several benchmarks to assess its performance across various tasks and languages. These benchmarks include: 1. Open PL LLM Leaderboard 2. Open LLM Leaderboard 3. Polish MT-Bench 4. Polish EQ-Bench (Emotional Intelligence Benchmark) 5. MixEval The following sections provide detailed results for each of these benchmarks, demonstrating the model's capabilities in both Polish and English language tasks. Models have been evaluated on Open PL LLM Leaderboard 5-shot. The benchmark evaluates models in NLP tasks like sentiment analysis, categorization, text classification but does not test chatting skills. Average column is an average score among all tasks normalized by baseline scores. | Model | Parameters (B)| Average | |---------------------------------|------------|---------| | Meta-Llama-3.1-405B-Instruct-FP8,API | 405 | 69.44 | | Mistral-Large-Instruct-2407 | 123 | 69.11 | | Qwen2-72B-Instruct | 72 | 65.87 | | Bielik-11B-v2.2-Instruct | 11 | 65.57 | | Meta-Llama-3.1-70B-Instruct | 70 | 65.49 | | Bielik-11B-v2.1-Instruct | 11 | 65.45 | | Mixtral-8x22B-Instruct-v0.1 | 141 | 65.23 | | Bielik-11B-v2.0-Instruct | 11 | 64.98 | | Meta-Llama-3-70B-Instruct | 70 | 64.45 | | Athene-70B | 70 | 63.65 | | WizardLM-2-8x22B | 141 | 62.35 | | Qwen1.5-72B-Chat | 72 | 58.67 | | Qwen2-57B-A14B-Instruct | 57 | 56.89 | | glm-4-9b-chat | 9 | 56.61 | | aya-23-35B | 35 | 56.37 | | Phi-3.5-MoE-instruct | 41.9 | 56.34 | | openchat-3.5-0106-gemma | 7 | 55.69 | | Mistral-Nemo-Instruct-2407 | 12 | 55.27 | | SOLAR-10.7B-Instruct-v1.0 | 10.7 | 55.24 | | Mixtral-8x7B-Instruct-v0.1 | 46.7 | 55.07 | | Bielik-7B-Instruct-v0.1 | 7 | 44.70 | | trurl-2-13b-academic | 13 | 36.28 | | trurl-2-7b | 7 | 26.93 | The results from the Open PL LLM Leaderboard demonstrate the exceptional performance of Bielik-11B-v2.0-Instruct: 1. Superior performance in its class: Bielik-11B-v2.0-Instruct outperforms all other models with less than 70B parameters. This is a significant achievement, showcasing its efficiency and effectiveness despite having fewer parameters than many competitors. 2. Competitive with larger models: with a score of 64.98, Bielik-11B-v2.0-Instruct performs on par with models in the 70B parameter range. This indicates that it achieves comparable results to much larger models, demonstrating its advanced architecture and training methodology. 3. Substantial improvement over previous version: the model shows a marked improvement over its predecessor, Bielik-7B-Instruct-v0.1, which scored 43.64. This leap in performance highlights the successful enhancements and optimizations implemented in this newer version. 4. Leading position for Polish language models: in the context of Polish language models, Bielik-11B-v2 Instruct stands out as a leader. There are no other competitive models specifically tailored for the Polish language that match its performance, making it a crucial resource for Polish NLP tasks. These results underscore Bielik-11B-v2.0-Instruct's position as a state-of-the-art model for Polish language processing, offering high performance with relatively modest computational requirements. Open PL LLM Leaderboard - Generative Tasks Performance This section presents a focused comparison of generative Polish language task performance between Bielik models and GPT-3.5. The evaluation is limited to generative tasks due to the constraints of assessing OpenAI models. The comprehensive nature and associated costs of the benchmark explain the limited number of models evaluated. | Model | Parameters (B) | Average g | |-------------------------------|----------------|---------------| | Bielik-11B-v2.1-Instruct | 11 | 66.58 | | Bielik-11B-v2.2-Instruct | 11 | 66.11 | | Bielik-11B-v2.0-Instruct | 11 | 65.58 | | gpt-3.5-turbo-instruct | Unknown | 55.65 | The performance variation among Bielik versions is minimal, indicating consistent quality across iterations. Bielik-11B-v2.1-Instruct demonstrates an impressive 17.8% performance advantage over GPT-3.5. The Open LLM Leaderboard evaluates models on various English language tasks, providing insights into the model's performance across different linguistic challenges. | Model | AVG | arcchallenge | hellaswag | truthfulqamc2 | mmlu | winogrande | gsm8k | |--------------------------|-------|---------------|-----------|----------------|-------|------------|-------| | Bielik-11B-v2.2-Instruct | 69.86 | 59.90 | 80.16 | 58.34 | 64.34 | 75.30 | 81.12 | | Bielik-11B-v2.1-Instruct | 69.82 | 59.56 | 80.20 | 59.35 | 64.18 | 75.06 | 80.59 | | Bielik-11B-v2.0-Instruct | 68.04 | 58.62 | 78.65 | 54.65 | 63.71 | 76.32 | 76.27 | | Bielik-11B-v2 | 65.87 | 60.58 | 79.84 | 46.13 | 63.06 | 77.82 | 67.78 | | Mistral-7B-Instruct-v0.2 | 65.71 | 63.14 | 84.88 | 68.26 | 60.78 | 77.19 | 40.03 | | Bielik-7B-Instruct-v0.1 | 51.26 | 47.53 | 68.91 | 49.47 | 46.18 | 65.51 | 29.95 | Bielik-11B-v2.0-Instruct shows impressive performance on English language tasks: 1. Improvement over its base model (2-point increase). 2. Substantial 16-point improvement over Bielik-7B-Instruct-v0.1. These results demonstrate Bielik-11B-v2.0-Instruct's versatility in both Polish and English, highlighting the effectiveness of its instruction tuning process. Polish MT-Bench The Bielik-11B-v2.0-Instruct (16 bit) model was also evaluated using the MT-Bench benchmark. The quality of the model was evaluated using the English version (original version without modifications) and the Polish version created by Speakleash (tasks and evaluation in Polish, the content of the tasks was also changed to take into account the context of the Polish language). MT-Bench English | Model | Score | |-----------------|----------| | Bielik-11B-v2.1 | 8.537500 | | Bielik-11B-v2.2 | 8.390625 | | Bielik-11B-v2.0 | 8.159375 | MT-Bench Polish | Model | Parameters (B) | Score | |-------------------------------------|----------------|----------| | Qwen2-72B-Instruct | 72 | 8.775000 | | Mistral-Large-Instruct-2407 (123B) | 123 | 8.662500 | | gemma-2-27b-it | 27 | 8.618750 | | Mixtral-8x22b | 141 | 8.231250 | | Meta-Llama-3.1-405B-Instruct | 405 | 8.168750 | | Meta-Llama-3.1-70B-Instruct | 70 | 8.150000 | | Bielik-11B-v2.2-Instruct | 11 | 8.115625 | | Bielik-11B-v2.1-Instruct | 11 | 7.996875 | | gpt-3.5-turbo | Unknown | 7.868750 | | Mixtral-8x7b | 46.7 | 7.637500 | | Bielik-11B-v2.0-Instruct | 11 | 7.562500 | | Mistral-Nemo-Instruct-2407 | 12 | 7.368750 | | openchat-3.5-0106-gemma | 7 | 6.812500 | | Mistral-7B-Instruct-v0.2 | 7 | 6.556250 | | Meta-Llama-3.1-8B-Instruct | 8 | 6.556250 | | Bielik-7B-Instruct-v0.1 | 7 | 6.081250 | | Mistral-7B-Instruct-v0.3 | 7 | 5.818750 | | Polka-Mistral-7B-SFT | 7 | 4.518750 | | trurl-2-7b | 7 | 2.762500 | For more information - answers to test tasks and values in each category, visit the MT-Bench PL website. | Model | Parameters (B) | Score | |-------------------------------|--------|-------| | Mistral-Large-Instruct-2407 | 123 | 78.07 | | Meta-Llama-3.1-405B-Instruct-FP8 | 405 | 77.23 | | gpt-4o-2024-08-06 | ? | 75.15 | | gpt-4-turbo-2024-04-09 | ? | 74.59 | | Meta-Llama-3.1-70B-Instruct | 70 | 72.53 | | Qwen2-72B-Instruct | 72 | 71.23 | | Meta-Llama-3-70B-Instruct | 70 | 71.21 | | gpt-4o-mini-2024-07-18 | ? | 71.15 | | WizardLM-2-8x22B | 141 | 69.56 | | Bielik-11B-v2.2-Instruct | 11 | 69.05 | | Bielik-11B-v2.0-Instruct | 11 | 68.24 | | Qwen1.5-72B-Chat | 72 | 68.03 | | Mixtral-8x22B-Instruct-v0.1 | 141 | 67.63 | | Bielik-11B-v2.1-Instruct | 11 | 60.07 | | Qwen1.5-32B-Chat | 32 | 59.63 | | openchat-3.5-0106-gemma | 7 | 59.58 | | aya-23-35B | 35 | 58.41 | | gpt-3.5-turbo | ? | 57.7 | | Qwen2-57B-A14B-Instruct | 57 | 57.64 | | Mixtral-8x7B-Instruct-v0.1 | 47 | 57.61 | | SOLAR-10.7B-Instruct-v1.0 | 10.7 | 55.21 | | Mistral-7B-Instruct-v0.2 | 7 | 47.02 | MixEval is a ground-truth-based English benchmark designed to evaluate Large Language Models (LLMs) efficiently and effectively. Key features of MixEval include: 1. Derived from off-the-shelf benchmark mixtures 2. Highly capable model ranking with a 0.96 correlation to Chatbot Arena 3. Local and quick execution, requiring only 6% of the time and cost compared to running MMLU This benchmark provides a robust and time-efficient method for assessing LLM performance, making it a valuable tool for ongoing model evaluation and comparison. | Model | MixEval | MixEval-Hard | |-------------------------------|---------|--------------| | Bielik-11B-v2.1-Instruct | 74.55 | 45.00 | | Bielik-11B-v2.2-Instruct | 72.35 | 39.65 | | Bielik-11B-v2.0-Instruct | 72.10 | 40.20 | | Mistral-7B-Instruct-v0.2 | 70.00 | 36.20 | The results show that Bielik-11B-v2.0-Instruct performs well on the MixEval benchmark, achieving a score of 72.10 on the standard MixEval and 40.20 on MixEval-Hard. Notably, Bielik-11B-v2.0-Instruct significantly outperforms Mistral-7B-Instruct-v0.2 on both metrics, demonstrating its improved capabilities despite being based on a similar architecture. Chat Arena PL is a human-evaluated benchmark that provides a direct comparison of model performance through head-to-head battles. Unlike the automated benchmarks mentioned above, this evaluation relies on human judgment to assess the quality and effectiveness of model responses. The results offer valuable insights into how different models perform in real-world, conversational scenarios as perceived by human evaluators. | # | Model | Battles | Won | Lost | Draws | Win % | ELO | |---|-------|-------|---------|-----------|--------|-------------|-----| | 1 | Bielik-11B-v2.2-Instruct | 92 | 72 | 14 | 6 | 83.72% | 1234 | | 2 | Bielik-11B-v2.1-Instruct | 240 | 171 | 50 | 19 | 77.38% | 1174 | | 3 | gpt-4o-mini | 639 | 402 | 117 | 120 | 77.46% | 1141 | | 4 | Mistral Large 2 (2024-07) | 324 | 188 | 69 | 67 | 73.15% | 1125 | | 5 | Llama-3.1-405B | 548 | 297 | 144 | 107 | 67.35% | 1090 | | 6 | Bielik-11B-v2.0-Instruct | 1289 | 695 | 352 | 242 | 66.38% | 1059 | | 7 | Llama-3.1-70B | 498 | 221 | 187 | 90 | 54.17% | 1033 | | 8 | Bielik-1-7B | 2041 | 1029 | 638 | 374 | 61.73% | 1020 | | 9 | Mixtral-8x22B-v0.1 | 432 | 166 | 167 | 99 | 49.85% | 1018 | | 10 | Qwen2-72B | 451 | 179 | 177 | 95 | 50.28% | 1011 | | 11 | gpt-3.5-turbo | 2186 | 1007 | 731 | 448 | 57.94% | 1008 | | 12 | Llama-3.1-8B | 440 | 155 | 227 | 58 | 40.58% | 975 | | 13 | Mixtral-8x7B-v0.1 | 1997 | 794 | 804 | 399 | 49.69% | 973 | | 14 | Llama-3-70b | 2008 | 733 | 909 | 366 | 44.64% | 956 | | 15 | Mistral Nemo (2024-07) | 301 | 84 | 164 | 53 | 33.87% | 954 | | 16 | Llama-3-8b | 1911 | 473 | 1091 | 347 | 30.24% | 909 | | 17 | gemma-7b-it | 1928 | 418 | 1221 | 289 | 25.5% | 888 | Bielik-11B-v2.0-Instruct is a quick demonstration that the base model can be easily fine-tuned to achieve compelling and promising performance. It does not have any moderation mechanisms. We're looking forward to engaging with the community in ways to make the model respect guardrails, allowing for deployment in environments requiring moderated outputs. Bielik-11B-v2.0-Instruct can produce factually incorrect output, and should not be relied on to produce factually accurate data. Bielik-11B-v2.0-Instruct was trained on various public datasets. While great efforts have been taken to clear the training data, it is possible that this model can generate lewd, false, biased or otherwise offensive outputs. Citation Please cite this model using the following format: Krzysztof Ociepa SpeakLeash - team leadership, conceptualizing, data preparation, process optimization and oversight of training Łukasz Flis Cyfronet AGH - coordinating and supervising the training Remigiusz Kinas SpeakLeash - conceptualizing and coordinating DPO training, data preparation Adrian Gwoździej SpeakLeash - data preparation and ensuring data quality Krzysztof Wróbel SpeakLeash - benchmarks The model could not have been created without the commitment and work of the entire SpeakLeash team, whose contribution is invaluable. Thanks to the hard work of many individuals, it was possible to gather a large amount of content in Polish and establish collaboration between the open-science SpeakLeash project and the HPC center: ACK Cyfronet AGH. Individuals who contributed to the creation of the model: Sebastian Kondracki, Igor Ciuciura, Paweł Kiszczak, Szymon Baczyński, Jacek Chwiła, Maria Filipkowska, Jan Maria Kowalski, Karol Jezierski, Kacper Milan, Jan Sowa, Len Krawczyk, Marta Seidler, Agnieszka Ratajska, Krzysztof Koziarek, Szymon Pepliński, Zuzanna Dabić, Filip Bogacz, Agnieszka Kosiak, Izabela Babis, Nina Babis. Members of the ACK Cyfronet AGH team providing valuable support and expertise: Szymon Mazurek, Marek Magryś, Mieszko Cholewa . If you have any questions or suggestions, please use the discussion tab. If you want to contact us directly, join our Discord SpeakLeash.
Bielik-11B-v2.2-Instruct-HQQ-8bit-128gs
Bielik-11B-v2.2-Instruct-HQQ-4bit-128gs
Bielik-4.5B-v3.0-Instruct-MLX-8bit
Bielik-11B-v2.2-Instruct-Quanto-4bit
Bielik-11B-v2.2-Instruct-EXL2-6.5bit
Bielik-1.5B-v3.0-Instruct-MLX-8bit
Bielik-11B-v2.1-Instruct-FP8
Bielik-11B-v2.0-Instruct-FP8
Bielik-11B-v2.0-Instruct-GPTQ
Bielik-11B-v2.3-Instruct-4bit-ov
Bielik-11B-v2.1-Instruct-GPTQ
Bielik-PL-Minitron-7B-v3.0-Instruct
Bielik-11B-v2.1-Instruct
Bielik-11B-v2.1-Instruct is a generative text model featuring 11 billion parameters. It is an instruct fine-tuned version of the Bielik-11B-v2. Forementioned model stands as a testament to the unique collaboration between the open-science/open-souce project SpeakLeash and the High Performance Computing (HPC) center: ACK Cyfronet AGH. Developed and trained on Polish text corpora, which has been cherry-picked and processed by the SpeakLeash team, this endeavor leverages Polish large-scale computing infrastructure, specifically within the PLGrid environment, and more precisely, the HPC centers: ACK Cyfronet AGH. The creation and training of the Bielik-11B-v2.1-Instruct was propelled by the support of computational grant number PLG/2024/016951, conducted on the Athena and Helios supercomputer, enabling the use of cutting-edge technology and computational resources essential for large-scale machine learning processes. As a result, the model exhibits an exceptional ability to understand and process the Polish language, providing accurate responses and performing a variety of linguistic tasks with high precision. Chat Arena is a platform for testing and comparing different AI language models, allowing users to evaluate their performance and quality. The SpeakLeash team is working on their own set of instructions in Polish, which is continuously being expanded and refined by annotators. A portion of these instructions, which had been manually verified and corrected, has been utilized for training purposes. Moreover, due to the limited availability of high-quality instructions in Polish, synthetic instructions were generated with Mixtral 8x22B and used in training. The dataset used for training comprised over 20 million instructions, consisting of more than 10 billion tokens. The instructions varied in quality, leading to a deterioration in the model’s performance. To counteract this while still allowing ourselves to utilize the aforementioned datasets, several improvements were introduced: Weighted tokens level loss - a strategy inspired by offline reinforcement learning and C-RLFT Adaptive learning rate inspired by the study on Learning Rates as a Function of Batch Size Masked prompt tokens To align the model with user preferences we tested many different techniques: DPO, PPO, KTO, SiMPO. Finally the DPO-Positive method was employed, utilizing both generated and manually corrected examples, which were scored by a metamodel. A dataset comprising over 60,000 examples of varying lengths to address different aspects of response style. It was filtered and evaluated by the reward model to select instructions with the right level of difference between chosen and rejected. The novelty introduced in DPO-P was multi-turn conversations introduction. Bielik-11B-v2.1-Instruct has been trained with the use of an original open source framework called ALLaMo implemented by Krzysztof Ociepa. This framework allows users to train language models with architecture similar to LLaMA and Mistral in fast and efficient way. Developed by: SpeakLeash & ACK Cyfronet AGH Language: Polish Model type: causal decoder-only Finetuned from: Bielik-11B-v2 License: Apache 2.0 and Terms of Use Model ref: speakleash:a05d7fe0995e191985a863b48a39259b Quantized models: We know that some people want to explore smaller models or don't have the resources to run a full model. Therefore, we have prepared quantized versions of the Bielik-11B-v2.1-Instruct model in separate repositories: - GGUF - Q4KM, Q5KM, Q6K, Q80 - GPTQ - 4bit - FP8 (vLLM, SGLang - Ada Lovelace, Hopper optimized) - GGUF - experimental - IQ imatrix IQ1M, IQ2XXS, IQ3XXS, IQ4XS and calibrated Q4KM, Q5KM, Q6K, Q80 Please note that quantized models may offer lower quality of generated answers compared to full sized variatns. Bielik-11B-v2.1-Instruct uses ChatML as the prompt format. This format is available as a chat template via the `applychattemplate()` method: Fully formated input conversation by applychattemplate from previous example: Bielik-11B-v2.1-Instruct has been evaluated on several benchmarks to assess its performance across various tasks and languages. These benchmarks include: 1. Open PL LLM Leaderboard 2. Open LLM Leaderboard 3. Polish MT-Bench 4. Polish EQ-Bench (Emotional Intelligence Benchmark) 5. MixEval The following sections provide detailed results for each of these benchmarks, demonstrating the model's capabilities in both Polish and English language tasks. Models have been evaluated on Open PL LLM Leaderboard 5-shot. The benchmark evaluates models in NLP tasks like sentiment analysis, categorization, text classification but does not test chatting skills. Average column is an average score among all tasks normalized by baseline scores. | Model | Parameters (B)| Average | |---------------------------------|------------|---------| | Meta-Llama-3.1-405B-Instruct-FP8,API | 405 | 69.44 | | Mistral-Large-Instruct-2407 | 123 | 69.11 | | Qwen2-72B-Instruct | 72 | 65.87 | | Bielik-11B-v2.2-Instruct | 11 | 65.57 | | Meta-Llama-3.1-70B-Instruct | 70 | 65.49 | | Bielik-11B-v2.1-Instruct | 11 | 65.45 | | Mixtral-8x22B-Instruct-v0.1 | 141 | 65.23 | | Bielik-11B-v2.0-Instruct | 11 | 64.98 | | Meta-Llama-3-70B-Instruct | 70 | 64.45 | | Athene-70B | 70 | 63.65 | | WizardLM-2-8x22B | 141 | 62.35 | | Qwen1.5-72B-Chat | 72 | 58.67 | | Qwen2-57B-A14B-Instruct | 57 | 56.89 | | glm-4-9b-chat | 9 | 56.61 | | aya-23-35B | 35 | 56.37 | | Phi-3.5-MoE-instruct | 41.9 | 56.34 | | openchat-3.5-0106-gemma | 7 | 55.69 | | Mistral-Nemo-Instruct-2407 | 12 | 55.27 | | SOLAR-10.7B-Instruct-v1.0 | 10.7 | 55.24 | | Mixtral-8x7B-Instruct-v0.1 | 46.7 | 55.07 | | Bielik-7B-Instruct-v0.1 | 7 | 44.70 | | trurl-2-13b-academic | 13 | 36.28 | | trurl-2-7b | 7 | 26.93 | The results from the Open PL LLM Leaderboard demonstrate the exceptional performance of Bielik-11B-v2.1-Instruct: 1. Superior performance in its class: Bielik-11B-v2.1-Instruct outperforms all other models with less than 70B parameters. This is a significant achievement, showcasing its efficiency and effectiveness despite having fewer parameters than many competitors. 2. Competitive with larger models: with a score of 65.45, Bielik-11B-v2.1-Instruct performs on par with models in the 70B parameter range. This indicates that it achieves comparable results to much larger models, demonstrating its advanced architecture and training methodology. 3. Substantial improvement over previous version: the model shows a marked improvement over its predecessor, Bielik-7B-Instruct-v0.1, which scored 43.64. This leap in performance highlights the successful enhancements and optimizations implemented in this newer version. 4. Leading position for Polish language models: in the context of Polish language models, Bielik-11B-v2.1-Instruct stands out as a leader. There are no other competitive models specifically tailored for the Polish language that match its performance, making it a crucial resource for Polish NLP tasks. These results underscore Bielik-11B-v2.1-Instruct's position as a state-of-the-art model for Polish language processing, offering high performance with relatively modest computational requirements. Open PL LLM Leaderboard - Generative Tasks Performance This section presents a focused comparison of generative Polish language task performance between Bielik models and GPT-3.5. The evaluation is limited to generative tasks due to the constraints of assessing OpenAI models. The comprehensive nature and associated costs of the benchmark explain the limited number of models evaluated. | Model | Parameters (B) | Average g | |-------------------------------|----------------|---------------| | Bielik-11B-v2.1-Instruct | 11 | 66.58 | | Bielik-11B-v2.2-Instruct | 11 | 66.11 | | Bielik-11B-v2.0-Instruct | 11 | 65.58 | | gpt-3.5-turbo-instruct | Unknown | 55.65 | The performance variation among Bielik versions is minimal, indicating consistent quality across iterations. Bielik-11B-v2.1-Instruct demonstrates an impressive 19.6% performance advantage over GPT-3.5. The Open LLM Leaderboard evaluates models on various English language tasks, providing insights into the model's performance across different linguistic challenges. | Model | AVG | arcchallenge | hellaswag | truthfulqamc2 | mmlu | winogrande | gsm8k | |--------------------------|-------|---------------|-----------|----------------|-------|------------|-------| | Bielik-11B-v2.2-Instruct | 69.86 | 59.90 | 80.16 | 58.34 | 64.34 | 75.30 | 81.12 | | Bielik-11B-v2.1-Instruct | 69.82 | 59.56 | 80.20 | 59.35 | 64.18 | 75.06 | 80.59 | | Bielik-11B-v2.0-Instruct | 68.04 | 58.62 | 78.65 | 54.65 | 63.71 | 76.32 | 76.27 | | Bielik-11B-v2 | 65.87 | 60.58 | 79.84 | 46.13 | 63.06 | 77.82 | 67.78 | | Mistral-7B-Instruct-v0.2 | 65.71 | 63.14 | 84.88 | 68.26 | 60.78 | 77.19 | 40.03 | | Bielik-7B-Instruct-v0.1 | 51.26 | 47.53 | 68.91 | 49.47 | 46.18 | 65.51 | 29.95 | Bielik-11B-v2.1-Instruct shows impressive performance on English language tasks: 1. Significant improvement over its base model (4-point increase). 2. Substantial 18-point improvement over Bielik-7B-Instruct-v0.1. These results demonstrate Bielik-11B-v2.1-Instruct's versatility in both Polish and English, highlighting the effectiveness of its instruction tuning process. Polish MT-Bench The Bielik-11B-v2.1-Instruct (16 bit) model was also evaluated using the MT-Bench benchmark. The quality of the model was evaluated using the English version (original version without modifications) and the Polish version created by Speakleash (tasks and evaluation in Polish, the content of the tasks was also changed to take into account the context of the Polish language). MT-Bench English | Model | Score | |-----------------|----------| | Bielik-11B-v2.1 | 8.537500 | | Bielik-11B-v2.2 | 8.390625 | | Bielik-11B-v2.0 | 8.159375 | MT-Bench Polish | Model | Parameters (B) | Score | |-------------------------------------|----------------|----------| | Qwen2-72B-Instruct | 72 | 8.775000 | | Mistral-Large-Instruct-2407 (123B) | 123 | 8.662500 | | gemma-2-27b-it | 27 | 8.618750 | | Mixtral-8x22b | 141 | 8.231250 | | Meta-Llama-3.1-405B-Instruct | 405 | 8.168750 | | Meta-Llama-3.1-70B-Instruct | 70 | 8.150000 | | Bielik-11B-v2.2-Instruct | 11 | 8.115625 | | Bielik-11B-v2.1-Instruct | 11 | 7.996875 | | gpt-3.5-turbo | Unknown | 7.868750 | | Mixtral-8x7b | 46.7 | 7.637500 | | Bielik-11B-v2.0-Instruct | 11 | 7.562500 | | Mistral-Nemo-Instruct-2407 | 12 | 7.368750 | | openchat-3.5-0106-gemma | 7 | 6.812500 | | Mistral-7B-Instruct-v0.2 | 7 | 6.556250 | | Meta-Llama-3.1-8B-Instruct | 8 | 6.556250 | | Bielik-7B-Instruct-v0.1 | 7 | 6.081250 | | Mistral-7B-Instruct-v0.3 | 7 | 5.818750 | | Polka-Mistral-7B-SFT | 7 | 4.518750 | | trurl-2-7b | 7 | 2.762500 | 1. Strong performance among mid-sized models: Bielik-11B-v2.1-Instruct scored 7.996875, placing it ahead of several well-known models like GPT-3.5-turbo (7.868750) and Mixtral-8x7b (7.637500). This indicates that Bielik-11B-v2.1-Instruct is competitive among mid-sized models, particularly those in the 11B-70B parameter range. 2. Competitive against larger models: Bielik-11B-v2.1-Instruct performs close to Meta-Llama-3.1-70B-Instruct (8.150000), Meta-Llama-3.1-405B-Instruct (8.168750) and even Mixtral-8x22b (8.231250), which have significantly more parameters. This efficiency in performance relative to size could make it an attractive option for tasks where resource constraints are a consideration. Bielik 100% generated answers in Polish, while other models (not typically trained for Polish) can answer Polish questions in English. 3. Significant improvement over previous versions: compared to its predecessor, Bielik-7B-Instruct-v0.1, which scored 6.081250, the Bielik-11B-v2.1-Instruct shows a significant improvement. The score increased by almost 2 points, highlighting substantial advancements in model quality, optimization and training methodology. For more information - answers to test tasks and values in each category, visit the MT-Bench PL website. | Model | Parameters (B) | Score | |-------------------------------|--------|-------| | Mistral-Large-Instruct-2407 | 123 | 78.07 | | Meta-Llama-3.1-405B-Instruct-FP8 | 405 | 77.23 | | gpt-4o-2024-08-06 | ? | 75.15 | | gpt-4-turbo-2024-04-09 | ? | 74.59 | | Meta-Llama-3.1-70B-Instruct | 70 | 72.53 | | Qwen2-72B-Instruct | 72 | 71.23 | | Meta-Llama-3-70B-Instruct | 70 | 71.21 | | gpt-4o-mini-2024-07-18 | ? | 71.15 | | WizardLM-2-8x22B | 141 | 69.56 | | Bielik-11B-v2.2-Instruct | 11 | 69.05 | | Bielik-11B-v2.0-Instruct | 11 | 68.24 | | Qwen1.5-72B-Chat | 72 | 68.03 | | Mixtral-8x22B-Instruct-v0.1 | 141 | 67.63 | | Bielik-11B-v2.1-Instruct | 11 | 60.07 | | Qwen1.5-32B-Chat | 32 | 59.63 | | openchat-3.5-0106-gemma | 7 | 59.58 | | aya-23-35B | 35 | 58.41 | | gpt-3.5-turbo | ? | 57.7 | | Qwen2-57B-A14B-Instruct | 57 | 57.64 | | Mixtral-8x7B-Instruct-v0.1 | 47 | 57.61 | | SOLAR-10.7B-Instruct-v1.0 | 10.7 | 55.21 | | Mistral-7B-Instruct-v0.2 | 7 | 47.02 | MixEval is a ground-truth-based English benchmark designed to evaluate Large Language Models (LLMs) efficiently and effectively. Key features of MixEval include: 1. Derived from off-the-shelf benchmark mixtures 2. Highly capable model ranking with a 0.96 correlation to Chatbot Arena 3. Local and quick execution, requiring only 6% of the time and cost compared to running MMLU This benchmark provides a robust and time-efficient method for assessing LLM performance, making it a valuable tool for ongoing model evaluation and comparison. | Model | MixEval | MixEval-Hard | |-------------------------------|---------|--------------| | Bielik-11B-v2.1-Instruct | 74.55 | 45.00 | | Bielik-11B-v2.2-Instruct | 72.35 | 39.65 | | Bielik-11B-v2.0-Instruct | 72.10 | 40.20 | | Mistral-7B-Instruct-v0.2 | 70.00 | 36.20 | The results show that Bielik-11B-v2.1-Instruct performs well on the MixEval benchmark, achieving a score of 74.55 on the standard MixEval and 45.00 on MixEval-Hard. Notably, Bielik-11B-v2.1-Instruct significantly outperforms Mistral-7B-Instruct-v0.2 on both metrics, demonstrating its improved capabilities despite being based on a similar architecture. Chat Arena PL is a human-evaluated benchmark that provides a direct comparison of model performance through head-to-head battles. Unlike the automated benchmarks mentioned above, this evaluation relies on human judgment to assess the quality and effectiveness of model responses. The results offer valuable insights into how different models perform in real-world, conversational scenarios as perceived by human evaluators. | # | Model | Battles | Won | Lost | Draws | Win % | ELO | |---|-------|-------|---------|-----------|--------|-------------|-----| | 1 | Bielik-11B-v2.2-Instruct | 92 | 72 | 14 | 6 | 83.72% | 1234 | | 2 | Bielik-11B-v2.1-Instruct | 240 | 171 | 50 | 19 | 77.38% | 1174 | | 3 | gpt-4o-mini | 639 | 402 | 117 | 120 | 77.46% | 1141 | | 4 | Mistral Large 2 (2024-07) | 324 | 188 | 69 | 67 | 73.15% | 1125 | | 5 | Llama-3.1-405B | 548 | 297 | 144 | 107 | 67.35% | 1090 | | 6 | Bielik-11B-v2.0-Instruct | 1289 | 695 | 352 | 242 | 66.38% | 1059 | | 7 | Llama-3.1-70B | 498 | 221 | 187 | 90 | 54.17% | 1033 | | 8 | Bielik-1-7B | 2041 | 1029 | 638 | 374 | 61.73% | 1020 | | 9 | Mixtral-8x22B-v0.1 | 432 | 166 | 167 | 99 | 49.85% | 1018 | | 10 | Qwen2-72B | 451 | 179 | 177 | 95 | 50.28% | 1011 | | 11 | gpt-3.5-turbo | 2186 | 1007 | 731 | 448 | 57.94% | 1008 | | 12 | Llama-3.1-8B | 440 | 155 | 227 | 58 | 40.58% | 975 | | 13 | Mixtral-8x7B-v0.1 | 1997 | 794 | 804 | 399 | 49.69% | 973 | | 14 | Llama-3-70b | 2008 | 733 | 909 | 366 | 44.64% | 956 | | 15 | Mistral Nemo (2024-07) | 301 | 84 | 164 | 53 | 33.87% | 954 | | 16 | Llama-3-8b | 1911 | 473 | 1091 | 347 | 30.24% | 909 | | 17 | gemma-7b-it | 1928 | 418 | 1221 | 289 | 25.5% | 888 | The results show that Bielik-11B-v2.1-Instruct outperforms almost all other models in this benchmark. This impressive performance demonstrates its effectiveness in real-world conversational scenarios, as judged by human evaluators. Bielik-11B-v2.1-Instruct is a quick demonstration that the base model can be easily fine-tuned to achieve compelling and promising performance. It does not have any moderation mechanisms. We're looking forward to engaging with the community in ways to make the model respect guardrails, allowing for deployment in environments requiring moderated outputs. Bielik-11B-v2.1-Instruct can produce factually incorrect output, and should not be relied on to produce factually accurate data. Bielik-11B-v2.1-Instruct was trained on various public datasets. While great efforts have been taken to clear the training data, it is possible that this model can generate lewd, false, biased or otherwise offensive outputs. Citation Please cite this model using the following format: Krzysztof Ociepa SpeakLeash - team leadership, conceptualizing, data preparation, process optimization and oversight of training Łukasz Flis Cyfronet AGH - coordinating and supervising the training Remigiusz Kinas SpeakLeash - conceptualizing and coordinating DPO training, data preparation Adrian Gwoździej SpeakLeash - data preparation and ensuring data quality Krzysztof Wróbel SpeakLeash - benchmarks The model could not have been created without the commitment and work of the entire SpeakLeash team, whose contribution is invaluable. Thanks to the hard work of many individuals, it was possible to gather a large amount of content in Polish and establish collaboration between the open-science SpeakLeash project and the HPC center: ACK Cyfronet AGH. Individuals who contributed to the creation of the model: Sebastian Kondracki, Igor Ciuciura, Paweł Kiszczak, Szymon Baczyński, Jacek Chwiła, Maria Filipkowska, Jan Maria Kowalski, Karol Jezierski, Kacper Milan, Jan Sowa, Len Krawczyk, Marta Seidler, Agnieszka Ratajska, Krzysztof Koziarek, Szymon Pepliński, Zuzanna Dabić, Filip Bogacz, Agnieszka Kosiak, Izabela Babis, Nina Babis. Members of the ACK Cyfronet AGH team providing valuable support and expertise: Szymon Mazurek, Marek Magryś, Mieszko Cholewa . If you have any questions or suggestions, please use the discussion tab. If you want to contact us directly, join our Discord SpeakLeash.