mispeech
midashenglm-7b-0804-fp32
dasheng-base
ced-base
dasheng-1.2B
midashenglm-7b-1021-bf16
The bfloat16 (bf16) weights for mispeech/midashenglm-7b-1021-fp32. Recommended for most general-purpose scenarios, including inference and fine-tuning: it delivers quality comparable to FP32 while being significantly faster on modern GPUs (e.g., A100, H100, RTX 4090). The original FP32 model is intended only for strict numerical reproduction of benchmark results.
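A minimal loading sketch for this BF16 checkpoint, assuming the standard `transformers` auto classes apply and that the repository ships custom modeling code (hence `trust_remote_code=True`); see the model card for the full usage example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = "mispeech/midashenglm-7b-1021-bf16"

# The processor handles audio/text preprocessing; trust_remote_code is
# assumed to be required because the repo ships custom architecture code.
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # keep weights in bf16 for speed on A100/H100/RTX 4090
    device_map="auto",
    trust_remote_code=True,
).eval()
```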
ced-small
ced-mini
ced-tiny
dashengtokenizer
midashenglm-7b-0804-bf16
midashenglm-7b-1021-fp32
MiDashengLM is an efficient audio-language model that achieves holistic audio understanding through caption-based alignment. It reaches state-of-the-art performance on multiple audio understanding benchmarks while maintaining high inference efficiency, delivering a 3.2× throughput speedup and supporting batch sizes of up to 512. 📖 For a more detailed introduction and the technical report, please visit our GitHub repository.

For most applications, we strongly recommend the BF16 version (mispeech/midashenglm-7b-1021-bf16) for the best balance of performance and efficiency. All evaluation results below were obtained with `mispeech/midashenglm-7b-1021-fp32`.

Audio captioning:

| Domain | Dataset | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:------:|:-------------:|:-----------:|:---------------:|:-------------------:|
| Music | MusicCaps | 59.11 | 43.71 | 35.43 |
| Music | Songdescriber | 46.42 | 45.31 | 44.63 |
| Sound | AudioCaps | 62.13 | 60.79 | 49.00 |
| Sound | ClothoV2 | 49.35 | 47.55 | 48.01 |
| Sound | AutoACD | 67.13 | 55.93 | 44.76 |

Audio and paralinguistic classification:

| Dataset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:---------------:|:------:|:-----------:|:---------------:|:-------------------:|
| VoxCeleb1 | ACC↑ | 92.66 | 59.71 | 82.72 |
| VoxLingua107 | ACC↑ | 93.72 | 51.03 | 73.65 |
| VoxCeleb-Gender | ACC↑ | 97.72 | 99.82 | 99.69 |
| VGGSound | ACC↑ | 52.19 | 0.97 | 2.20 |
| Cochlscene | ACC↑ | 75.81 | 23.88 | 18.34 |
| NSynth | ACC↑ | 80.32 | 60.45 | 38.09 |
| FMA | ACC↑ | 62.94 | 66.77 | 27.91 |
| FSDKaggle2018 | ACC↑ | 73.38 | 31.38 | 24.75 |
| AudioSet | mAP↑ | 9.90 | 6.48 | 3.47 |
| FSD50K | mAP↑ | 38.10 | 23.87 | 27.23 |

Automatic speech recognition (error rate, lower is better):

| Dataset | Language | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:----------------------:|:----------:|:-----------:|:---------------:|:-------------------:|
| LibriSpeech test-clean | English | 3.6 | 1.7 | 1.3 |
| LibriSpeech test-other | English | 5.9 | 3.4 | 2.4 |
| People's Speech | English | 26.12 | 28.6 | 22.3 |
| AISHELL2 Mic | Chinese | 3.2 | 2.5 | 2.7 |
| AISHELL2 iOS | Chinese | 2.9 | 2.6 | 2.6 |
| AISHELL2 Android | Chinese | 3.1 | 2.7 | 2.6 |
| GigaSpeech2 | Indonesian | 22.3 | 21.2 | >100 |
| GigaSpeech2 | Thai | 38.4 | 53.8 | >100 |
| GigaSpeech2 | Vietnamese | 17.7 | 18.6 | >100 |

Audio question answering and reasoning:

| Dataset | Subset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:--------------:|:------------------:|:------:|:-----------:|:---------------:|:-------------------:|
| MMAU-Pro | IF | ACC↑ | 37.93 | 61.30 | 42.30 |
| MMAU-Pro | Multi-Audio | ACC↑ | 42.33 | 24.30 | 17.20 |
| MMAU-Pro | Music | ACC↑ | 62.20 | 61.50 | 57.60 |
| MMAU-Pro | Open-ended | ACC↑ | 63.21 | 52.30 | 34.50 |
| MMAU-Pro | Sound | ACC↑ | 58.36 | 47.60 | 46.00 |
| MMAU-Pro | Sound–Music | ACC↑ | 42.00 | 40.00 | 46.00 |
| MMAU-Pro | Sound–Music–Speech | ACC↑ | 71.43 | 28.50 | 42.80 |
| MMAU-Pro | Spatial | ACC↑ | 18.77 | 41.20 | 43.70 |
| MMAU-Pro | Speech | ACC↑ | 61.17 | 57.40 | 52.20 |
| MMAU-Pro | Speech–Music | ACC↑ | 58.70 | 53.20 | 54.30 |
| MMAU-Pro | Speech–Sound | ACC↑ | 51.14 | 60.20 | 48.90 |
| MMAU-Pro | Voice | ACC↑ | 54.83 | 60.00 | 50.60 |
| MMAU-Pro | Average | ACC↑ | 55.92 | 52.20 | 46.60 |
| MMAU-v05.15.25 | Sound | ACC↑ | 77.48 | 78.10 | 75.68 |
| MMAU-v05.15.25 | Music | ACC↑ | 70.96 | 65.90 | 66.77 |
| MMAU-v05.15.25 | Speech | ACC↑ | 76.28 | 70.60 | 62.16 |
| MMAU-v05.15.25 | Average | ACC↑ | 74.90 | 71.50 | 68.20 |
| MuChoMusic | | ACC↑ | 73.04 | 64.79 | 67.40 |
| MusicQA | | FENSE↑ | 61.56 | 60.60 | 40.00 |
| AudioCaps-QA | | FENSE↑ | 54.20 | 53.28 | 47.34 |

MiDashengLM is under the Apache License 2.0, and we encourage its use in both research and business applications. If you find MiDashengLM useful in your research, please consider citing our work.
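As a usage sketch (not the official example: the chat-content schema, the audio key, and the generation settings below are assumptions borrowed from the generic multimodal chat-template convention in `transformers`; the model card and GitHub repository have the canonical snippet), captioning a local clip with the recommended BF16 checkpoint might look like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = "mispeech/midashenglm-7b-1021-bf16"  # BF16 variant recommended above

processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
).eval()

# Chat-style request with one audio clip; the content keys ("type", "path",
# "text") are assumptions -- check the model card for the exact format.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "sample.wav"},
            {"type": "text", "text": "Describe this audio clip."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=128)

print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```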
dasheng-0.6B
midashenglm-7b-0804-w4a16-gptq
The 4-bit (w4a16) weights for mispeech/midashenglm-7b-0804-fp32, quantized with GPTQ. An ideal choice for resource-constrained environments: it offers broad GPU compatibility and a smaller memory footprint, making it suitable for deployment where VRAM, memory, or storage is limited, provided a slight trade-off in quality is acceptable.
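A loading sketch for the GPTQ checkpoint, assuming the quantization config ships inside the repository and that a GPTQ backend (e.g., gptqmodel or auto-gptq together with optimum) is installed; `get_memory_footprint()` gives a quick view of the memory savings:

```python
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = "mispeech/midashenglm-7b-0804-w4a16-gptq"

processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
# The w4a16 quantization config is assumed to be stored in the checkpoint,
# so no extra quantization arguments are passed at load time.
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    device_map="auto",
    trust_remote_code=True,
)

# Rough check of how much memory the 4-bit weights occupy.
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
```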
r1-aqa
midashenglm-0.6b-fp32
midashenglm-7b-0804-fp8
The FP8 weights for mispeech/midashenglm-7b-0804-fp32. Optimized for Hopper-class (H100 and newer) GPUs, leveraging native hardware support for better performance and memory savings. While older GPUs may see limited speedups, FP8 can still be used to conserve VRAM and storage.
midashenglm-7b-1021-w4a16-gptq
midashenglm-7b-1021-fp8
The FP8 weights for mispeech/midashenglm-7b-1021-fp32. Optimized for Hopper-class (H100 and newer) GPUs, leveraging native hardware support for better performance and memory savings. While older GPUs may see limited speedups, FP8 can still be used to conserve VRAM and storage.
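Since the FP8 checkpoints target Hopper-class GPUs, a quick capability check can help when picking a variant; the helper below is illustrative (the threshold and fallback choice are assumptions, not part of the release) and falls back to the recommended BF16 weights on older hardware:

```python
import torch

def pick_checkpoint() -> str:
    """Illustrative helper: prefer the FP8 weights only on GPUs with native
    FP8 support (Hopper, compute capability 9.0, or newer); otherwise fall
    back to the BF16 checkpoint recommended for general use."""
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability()
        if (major, minor) >= (9, 0):  # H100-class or newer
            return "mispeech/midashenglm-7b-1021-fp8"
    return "mispeech/midashenglm-7b-1021-bf16"

print(pick_checkpoint())
```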
midashenglm-7b-0804-4bit-bnb
The bnb-4bit weights for mispeech/midashenglm-7b-0804-fp32. Note: this is a basic 4-bit quantization using bitsandbytes. For better performance and accuracy, we recommend our GPTQ-quantized version, which maintains higher quality while still providing significant memory savings.
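A minimal sketch of the two ways such weights are typically used, assuming the bitsandbytes quantization config is saved with this checkpoint and the `bitsandbytes` package is installed (pick one route; the on-the-fly variant is an assumption about how the 4-bit weights were produced, not an official recipe):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Route 1: load the pre-quantized bnb-4bit checkpoint directly.
model = AutoModelForCausalLM.from_pretrained(
    "mispeech/midashenglm-7b-0804-4bit-bnb",
    device_map="auto",
    trust_remote_code=True,
)

# Route 2 (roughly equivalent): quantize the FP32 repo to 4 bits at load time.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_otf = AutoModelForCausalLM.from_pretrained(
    "mispeech/midashenglm-7b-0804-fp32",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```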