mispeech
midashenglm-7b-0804-fp32
dasheng-base
ced-base
dasheng-1.2B
midashenglm-7b-1021-bf16
The bfloat16 (bf16) weights for mispeech/midashenglm-7b-1021-fp32. Recommended for most general-purpose scenarios, including inference and fine-tuning: it delivers quality comparable to FP32 while being significantly faster on modern GPUs (e.g., A100, H100, RTX 4090). The original FP32 model is intended only for strict numerical reproduction of benchmark results.
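A minimal loading sketch for this BF16 checkpoint, assuming the standard `transformers` auto classes apply and that the repository ships custom modeling code (hence `trust_remote_code=True`); see the model card for the full usage example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = "mispeech/midashenglm-7b-1021-bf16"

# The processor handles audio/text preprocessing; trust_remote_code is
# assumed to be required because the repo ships custom architecture code.
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # keep weights in bf16 for speed on A100/H100/RTX 4090
    device_map="auto",
    trust_remote_code=True,
).eval()
```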
ced-small
ced-mini
ced-tiny
dashengtokenizer
midashenglm-7b-0804-bf16
midashenglm-7b-1021-fp32
MiDashengLM is an efficient audio-language model that achieves holistic audio understanding through caption-based alignment. It reaches state-of-the-art performance on multiple audio understanding benchmarks while maintaining high inference efficiency, delivering a 3.2× throughput speedup and supporting batch sizes of up to 512. 📖 For a more detailed introduction and the technical report, please visit our GitHub repository.

For most applications, we strongly recommend the BF16 version (mispeech/midashenglm-7b-1021-bf16) for the best balance of performance and efficiency. All evaluation results below were obtained with `mispeech/midashenglm-7b-1021-fp32`.

Audio captioning:

| Domain | Dataset | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:------:|:-------------:|:-----------:|:---------------:|:-------------------:|
| Music | MusicCaps | 59.11 | 43.71 | 35.43 |
| Music | Songdescriber | 46.42 | 45.31 | 44.63 |
| Sound | AudioCaps | 62.13 | 60.79 | 49.00 |
| Sound | ClothoV2 | 49.35 | 47.55 | 48.01 |
| Sound | AutoACD | 67.13 | 55.93 | 44.76 |

Audio and paralinguistic classification:

| Dataset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:---------------:|:------:|:-----------:|:---------------:|:-------------------:|
| VoxCeleb1 | ACC↑ | 92.66 | 59.71 | 82.72 |
| VoxLingua107 | ACC↑ | 93.72 | 51.03 | 73.65 |
| VoxCeleb-Gender | ACC↑ | 97.72 | 99.82 | 99.69 |
| VGGSound | ACC↑ | 52.19 | 0.97 | 2.20 |
| Cochlscene | ACC↑ | 75.81 | 23.88 | 18.34 |
| NSynth | ACC↑ | 80.32 | 60.45 | 38.09 |
| FMA | ACC↑ | 62.94 | 66.77 | 27.91 |
| FSDKaggle2018 | ACC↑ | 73.38 | 31.38 | 24.75 |
| AudioSet | mAP↑ | 9.90 | 6.48 | 3.47 |
| FSD50K | mAP↑ | 38.10 | 23.87 | 27.23 |

Automatic speech recognition (error rate, lower is better):

| Dataset | Language | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:----------------------:|:----------:|:-----------:|:---------------:|:-------------------:|
| LibriSpeech test-clean | English | 3.6 | 1.7 | 1.3 |
| LibriSpeech test-other | English | 5.9 | 3.4 | 2.4 |
| People's Speech | English | 26.12 | 28.6 | 22.3 |
| AISHELL2 Mic | Chinese | 3.2 | 2.5 | 2.7 |
| AISHELL2 iOS | Chinese | 2.9 | 2.6 | 2.6 |
| AISHELL2 Android | Chinese | 3.1 | 2.7 | 2.6 |
| GigaSpeech2 | Indonesian | 22.3 | 21.2 | >100 |
| GigaSpeech2 | Thai | 38.4 | 53.8 | >100 |
| GigaSpeech2 | Vietnamese | 17.7 | 18.6 | >100 |

Audio question answering and reasoning:

| Dataset | Subset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:--------------:|:------------------:|:------:|:-----------:|:---------------:|:-------------------:|
| MMAU-Pro | IF | ACC↑ | 37.93 | 61.30 | 42.30 |
| MMAU-Pro | Multi-Audio | ACC↑ | 42.33 | 24.30 | 17.20 |
| MMAU-Pro | Music | ACC↑ | 62.20 | 61.50 | 57.60 |
| MMAU-Pro | Open-ended | ACC↑ | 63.21 | 52.30 | 34.50 |
| MMAU-Pro | Sound | ACC↑ | 58.36 | 47.60 | 46.00 |
| MMAU-Pro | Sound–Music | ACC↑ | 42.00 | 40.00 | 46.00 |
| MMAU-Pro | Sound–Music–Speech | ACC↑ | 71.43 | 28.50 | 42.80 |
| MMAU-Pro | Spatial | ACC↑ | 18.77 | 41.20 | 43.70 |
| MMAU-Pro | Speech | ACC↑ | 61.17 | 57.40 | 52.20 |
| MMAU-Pro | Speech–Music | ACC↑ | 58.70 | 53.20 | 54.30 |
| MMAU-Pro | Speech–Sound | ACC↑ | 51.14 | 60.20 | 48.90 |
| MMAU-Pro | Voice | ACC↑ | 54.83 | 60.00 | 50.60 |
| MMAU-Pro | Average | ACC↑ | 55.92 | 52.20 | 46.60 |
| MMAU-v05.15.25 | Sound | ACC↑ | 77.48 | 78.10 | 75.68 |
| MMAU-v05.15.25 | Music | ACC↑ | 70.96 | 65.90 | 66.77 |
| MMAU-v05.15.25 | Speech | ACC↑ | 76.28 | 70.60 | 62.16 |
| MMAU-v05.15.25 | Average | ACC↑ | 74.90 | 71.50 | 68.20 |
| MuChoMusic | | ACC↑ | 73.04 | 64.79 | 67.40 |
| MusicQA | | FENSE↑ | 61.56 | 60.60 | 40.00 |
| AudioCaps-QA | | FENSE↑ | 54.20 | 53.28 | 47.34 |

MiDashengLM is under the Apache License 2.0, and we encourage its use in both research and business applications. If you find MiDashengLM useful in your research, please consider citing our work.
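As a usage sketch (not the official example: the chat-content schema, the audio key, and the generation settings below are assumptions borrowed from the generic multimodal chat-template convention in `transformers`; the model card and GitHub repository have the canonical snippet), captioning a local clip with the recommended BF16 checkpoint might look like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = "mispeech/midashenglm-7b-1021-bf16"  # BF16 variant recommended above

processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
).eval()

# Chat-style request with one audio clip; the content keys ("type", "path",
# "text") are assumptions -- check the model card for the exact format.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "sample.wav"},
            {"type": "text", "text": "Describe this audio clip."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=128)

print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```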
dasheng-0.6B
midashenglm-7b-0804-w4a16-gptq
The 4-bit (w4a16) weights for mispeech/midashenglm-7b-0804-fp32, quantized with GPTQ. An ideal choice for resource-constrained environments: it offers broad GPU compatibility and a smaller memory footprint, making it suitable for deployment where VRAM, memory, or storage is limited, provided a slight trade-off in quality is acceptable.
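A loading sketch for the GPTQ checkpoint, assuming the quantization config ships inside the repository and that a GPTQ backend (e.g., gptqmodel or auto-gptq together with optimum) is installed; `get_memory_footprint()` gives a quick view of the memory savings:

```python
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = "mispeech/midashenglm-7b-0804-w4a16-gptq"

processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
# The w4a16 quantization config is assumed to be stored in the checkpoint,
# so no extra quantization arguments are passed at load time.
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    device_map="auto",
    trust_remote_code=True,
)

# Rough check of how much memory the 4-bit weights occupy.
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
```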
r1-aqa
midashenglm-0.6b-fp32
midashenglm-7b-0804-fp8
The FP8 weights for mispeech/midashenglm-7b-0804-fp32. Optimized for Hopper-class (H100 and newer) GPUs, leveraging native hardware support for better performance and memory savings. While older GPUs may see limited speedups, FP8 can still be used to conserve VRAM and storage.
midashenglm-7b-1021-w4a16-gptq
midashenglm-7b-1021-fp8
The FP8 weights for mispeech/midashenglm-7b-1021-fp32. Optimized for Hopper-class (H100 and newer) GPUs, leveraging native hardware support for better performance and memory savings. While older GPUs may see limited speedups, FP8 can still be used to conserve VRAM and storage.
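Since the FP8 checkpoints target Hopper-class GPUs, a quick capability check can help when picking a variant; the helper below is illustrative (the threshold and fallback choice are assumptions, not part of the release) and falls back to the recommended BF16 weights on older hardware:

```python
import torch

def pick_checkpoint() -> str:
    """Illustrative helper: prefer the FP8 weights only on GPUs with native
    FP8 support (Hopper, compute capability 9.0, or newer); otherwise fall
    back to the BF16 checkpoint recommended for general use."""
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability()
        if (major, minor) >= (9, 0):  # H100-class or newer
            return "mispeech/midashenglm-7b-1021-fp8"
    return "mispeech/midashenglm-7b-1021-bf16"

print(pick_checkpoint())
```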
midashenglm-7b-0804-4bit-bnb
The bnb-4bit weights for mispeech/midashenglm-7b-0804-fp32. Note: this is a basic 4-bit quantization using bitsandbytes. For better performance and accuracy, we recommend our GPTQ-quantized version, which maintains higher quality while still providing significant memory savings.
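A minimal sketch of the two ways such weights are typically used, assuming the bitsandbytes quantization config is saved with this checkpoint and the `bitsandbytes` package is installed (pick one route; the on-the-fly variant is an assumption about how the 4-bit weights were produced, not an official recipe):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Route 1: load the pre-quantized bnb-4bit checkpoint directly.
model = AutoModelForCausalLM.from_pretrained(
    "mispeech/midashenglm-7b-0804-4bit-bnb",
    device_map="auto",
    trust_remote_code=True,
)

# Route 2 (roughly equivalent): quantize the FP32 repo to 4 bits at load time.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_otf = AutoModelForCausalLM.from_pretrained(
    "mispeech/midashenglm-7b-0804-fp32",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```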