Mistral Syndicate is not a state-of-the-art model; it is a fine-tuning experiment for exploring training dynamics specific to large language models. The fine-tuning dataset was generated by a "syndicate" of other open language models, some of similar parameter count and some larger. Each model generated a response to a given instruction, and the group then voted on which model's response was best.
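A minimal sketch of that generate-and-vote loop is below. The member models, prompt wording, and vote parsing are illustrative assumptions; the actual syndicate roster and voting protocol are not documented here.

```python
# Hedged sketch of the syndicate scheme: each model drafts a response,
# then every model votes for the candidate it judges best.
from collections import Counter
from transformers import pipeline

MEMBERS = [
    "mistralai/Mistral-7B-Instruct-v0.2",  # assumed members; the real
    "HuggingFaceH4/zephyr-7b-beta",        # roster is not specified here
]

def syndicate_label(instruction: str) -> str:
    generators = {name: pipeline("text-generation", model=name) for name in MEMBERS}

    # 1. Every member drafts a response to the instruction.
    candidates = {
        name: gen(instruction, max_new_tokens=256, return_full_text=False)[0]["generated_text"]
        for name, gen in generators.items()
    }

    # 2. Every member votes for the best candidate (vote format is assumed).
    ballot = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(candidates.values()))
    votes = Counter()
    for name, gen in generators.items():
        prompt = (
            f"Instruction: {instruction}\n\nCandidate responses:\n{ballot}\n\n"
            "Reply with only the number of the best response."
        )
        reply = gen(prompt, max_new_tokens=4, return_full_text=False)[0]["generated_text"]
        digits = "".join(ch for ch in reply if ch.isdigit())
        if digits:
            votes[int(digits) % len(candidates)] += 1

    # 3. The majority-voted response becomes the training label.
    winner = votes.most_common(1)[0][0] if votes else 0
    return list(candidates.values())[winner]
```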
The instruction inputs used for output-label synthesis were a curated subset of VMWare/open-instruct, supplemented with additional instructions synthesized from scratch.
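For illustration, here is a hedged sketch of assembling those inputs with the `datasets` library, assuming the Hugging Face dataset id `VMware/open-instruct` and an `instruction` column; the curation rule and the synthesized instructions shown are placeholders, as the actual criteria are not documented here.

```python
# Hedged sketch of building the instruction pool: a curated slice of
# open-instruct plus instructions synthesized from scratch.
from datasets import load_dataset

open_instruct = load_dataset("VMware/open-instruct", split="train")

# Placeholder curation rule: keep reasonably sized instructions.
curated = open_instruct.filter(lambda row: 20 <= len(row["instruction"]) <= 2000)

# Synthesized-from-scratch instructions would be appended here (placeholder example).
extra_instructions = ["Explain the difference between a process and a thread."]
instruction_pool = list(curated["instruction"]) + extra_instructions
```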
# Evaluation Results (12.30.23)

| Benchmark  | Result |
|------------|-------:|
| ARC        |  60.84 |
| HellaSwag  |  82.91 |
| MMLU       |  60.83 |
| TruthfulQA |  43.71 |
| Winogrande |  78.61 |
| GSM8K      |  44.50 |
# Open LLM Leaderboard Evaluation Results

Detailed results can be found here.
| Metric                            | Value |
|-----------------------------------|------:|
| Avg.                              | 61.90 |
| AI2 Reasoning Challenge (25-shot) | 60.84 |
| HellaSwag (10-shot)               | 82.91 |
| MMLU (5-shot)                     | 60.83 |
| TruthfulQA (0-shot)               | 43.71 |
| Winogrande (5-shot)               | 78.61 |
| GSM8K (5-shot)                    | 44.50 |
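For reproduction, below is a hedged sketch of re-running one of these benchmarks locally with EleutherAI's lm-evaluation-harness (which the leaderboard uses), assuming the v0.4 `simple_evaluate` API; the model id is a placeholder, and exact leaderboard settings (harness version, prompt formatting) may differ.

```python
# Hedged sketch: reproduce the 25-shot ARC number with lm-evaluation-harness.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/mistral-syndicate",  # placeholder model id
    tasks=["arc_challenge"],
    num_fewshot=25,  # matches the 25-shot ARC setting reported above
)
print(results["results"]["arc_challenge"])
```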
# Open LLM Leaderboard (v2) Evaluation Results

Detailed results can be found here.
| Metric              | Value |
|---------------------|------:|
| Avg.                | 13.85 |
| IFEval (0-shot)     | 24.96 |
| BBH (3-shot)        | 20.51 |
| MATH Lvl 5 (4-shot) |  2.42 |
| GPQA (0-shot)       |  3.47 |
| MuSR (0-shot)       | 13.62 |
| MMLU-PRO (5-shot)   | 18.13 |