mispeech

21 models

midashenglm-7b-0804-fp32

license:apache-2.0 · 40,831 downloads · 77 likes

dasheng-base

license:apache-2.0 · 4,127 downloads · 9 likes

ced-base

license:apache-2.0 · 2,292 downloads · 12 likes

dasheng-1.2B

license:apache-2.0 · 706 downloads · 4 likes

midashenglm-7b-1021-bf16

The bfloat16 (bf16) weights for mispeech/midashenglm-7b-1021-fp32. Recommended for most general-purpose scenarios, including inference and fine-tuning: it delivers quality comparable to FP32 while being significantly faster on modern GPUs (e.g., A100, H100, RTX 4090). The original FP32 model is intended only for strict numerical reproduction of benchmark results. MiDashengLM is under the Apache License 2.0, and we encourage its use in both research and business applications. If you find MiDashengLM useful in your research, please consider citing our work.

license:apache-2.0 · 446 downloads · 2 likes
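For intuition on why bf16 is the recommended format: bfloat16 keeps float32's sign bit and all 8 exponent bits and truncates the mantissa from 23 to 7 bits, so it halves memory while preserving FP32's dynamic range and giving up only precision. A minimal pure-Python sketch of the conversion (illustrative only, not tied to any MiDashengLM code):

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Convert an IEEE-754 float32 to a bfloat16 bit pattern using
    round-to-nearest-even on the 16 low bits being dropped."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def bf16_bits_to_fp32(b: int) -> float:
    """Widen a bfloat16 bit pattern back to float32 (exact, no rounding)."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

# Large magnitudes survive (the 8-bit exponent is preserved) ...
big = bf16_bits_to_fp32(fp32_to_bf16_bits(3.0e38))
# ... while fine mantissa detail is rounded away (precision reduced).
approx_pi = bf16_bits_to_fp32(fp32_to_bf16_bits(3.14159265))
```

The same truncation logic is why bf16 inference rarely overflows where fp16 would: the representable range matches fp32's.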

ced-small

license:apache-2.0 · 396 downloads · 0 likes

ced-mini

license:apache-2.0 · 272 downloads · 1 like

ced-tiny

license:apache-2.0 · 245 downloads · 3 likes

dashengtokenizer

license:apache-2.0 · 198 downloads · 5 likes

midashenglm-7b-0804-bf16

license:apache-2.0 · 153 downloads · 0 likes

midashenglm-7b-1021-fp32

MiDashengLM is an efficient audio-language model that achieves holistic audio understanding through caption-based alignment. It achieves state-of-the-art performance on multiple audio understanding benchmarks while maintaining high inference efficiency, delivering a 3.2× throughput speedup and supporting batch sizes up to 512. 📖 For a more detailed introduction and the technical report, please visit our GitHub repository. Note that for most applications, we strongly recommend using the BF16 version (mispeech/midashenglm-7b-1021-bf16) for optimal performance and efficiency. The following evaluation results are based on the model version `mispeech/midashenglm-7b-1021-fp32`.

Audio captioning:

| Domain | Dataset | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:------:|:-------:|:-----------:|:---------------:|:-------------------:|
| Music | MusicCaps | 59.11 | 43.71 | 35.43 |
| Music | Songdescriber | 46.42 | 45.31 | 44.63 |
| Sound | AudioCaps | 62.13 | 60.79 | 49.00 |
| Sound | ClothoV2 | 49.35 | 47.55 | 48.01 |
| Sound | AutoACD | 67.13 | 55.93 | 44.76 |

Audio, music, and speaker classification:

| Dataset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:-------:|:------:|:-----------:|:---------------:|:-------------------:|
| VoxCeleb1 | ACC↑ | 92.66 | 59.71 | 82.72 |
| VoxLingua107 | ACC↑ | 93.72 | 51.03 | 73.65 |
| VoxCeleb-Gender | ACC↑ | 97.72 | 99.82 | 99.69 |
| VGGSound | ACC↑ | 52.19 | 0.97 | 2.20 |
| Cochlscene | ACC↑ | 75.81 | 23.88 | 18.34 |
| NSynth | ACC↑ | 80.32 | 60.45 | 38.09 |
| FMA | ACC↑ | 62.94 | 66.77 | 27.91 |
| FSDKaggle2018 | ACC↑ | 73.38 | 31.38 | 24.75 |
| AudioSet | mAP↑ | 9.90 | 6.48 | 3.47 |
| FSD50K | mAP↑ | 38.10 | 23.87 | 27.23 |

Speech recognition (error rate, lower is better):

| Dataset | Language | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:-------:|:--------:|:-----------:|:---------------:|:-------------------:|
| LibriSpeech test-clean | English | 3.6 | 1.7 | 1.3 |
| LibriSpeech test-other | English | 5.9 | 3.4 | 2.4 |
| People's Speech | English | 26.12 | 28.6 | 22.3 |
| AISHELL2 Mic | Chinese | 3.2 | 2.5 | 2.7 |
| AISHELL2 iOS | Chinese | 2.9 | 2.6 | 2.6 |
| AISHELL2 Android | Chinese | 3.1 | 2.7 | 2.6 |
| GigaSpeech2 | Indonesian | 22.3 | 21.2 | >100 |
| GigaSpeech2 | Thai | 38.4 | 53.8 | >100 |
| GigaSpeech2 | Viet | 17.7 | 18.6 | >100 |

Audio question answering:

| Dataset | Subset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:-------:|:------:|:------:|:-----------:|:---------------:|:-------------------:|
| MMAU-Pro | IF | ACC↑ | 37.93 | 61.30 | 42.30 |
| MMAU-Pro | Multi-Audio | ACC↑ | 42.33 | 24.30 | 17.20 |
| MMAU-Pro | Music | ACC↑ | 62.20 | 61.50 | 57.60 |
| MMAU-Pro | Open-ended | ACC↑ | 63.21 | 52.30 | 34.50 |
| MMAU-Pro | Sound | ACC↑ | 58.36 | 47.60 | 46.00 |
| MMAU-Pro | Sound–Music | ACC↑ | 42.00 | 40.00 | 46.00 |
| MMAU-Pro | Sound–Music–Speech | ACC↑ | 71.43 | 28.50 | 42.80 |
| MMAU-Pro | Spatial | ACC↑ | 18.77 | 41.20 | 43.70 |
| MMAU-Pro | Speech | ACC↑ | 61.17 | 57.40 | 52.20 |
| MMAU-Pro | Speech–Music | ACC↑ | 58.70 | 53.20 | 54.30 |
| MMAU-Pro | Speech–Sound | ACC↑ | 51.14 | 60.20 | 48.90 |
| MMAU-Pro | Voice | ACC↑ | 54.83 | 60.00 | 50.60 |
| MMAU-Pro | Average | ACC↑ | 55.92 | 52.20 | 46.60 |
| MMAU-v05.15.25 | Sound | ACC↑ | 77.48 | 78.10 | 75.68 |
| MMAU-v05.15.25 | Music | ACC↑ | 70.96 | 65.90 | 66.77 |
| MMAU-v05.15.25 | Speech | ACC↑ | 76.28 | 70.60 | 62.16 |
| MMAU-v05.15.25 | Average | ACC↑ | 74.90 | 71.50 | 68.20 |
| MuChoMusic | | ACC↑ | 73.04 | 64.79 | 67.40 |
| MusicQA | | FENSE↑ | 61.56 | 60.60 | 40.00 |
| AudioCaps-QA | | FENSE↑ | 54.20 | 53.28 | 47.34 |

MiDashengLM is under the Apache License 2.0, and we encourage its use in both research and business applications. If you find MiDashengLM useful in your research, please consider citing our work.

license:apache-2.0 · 134 downloads · 0 likes
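The speech-recognition numbers above are error rates of the usual WER form: word-level edit distance between hypothesis and reference, divided by reference length. A self-contained sketch of the metric (illustrative, not taken from any mispeech evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution (0 if match)
        prev = cur
    return prev[-1] / max(len(ref), 1)
```

Note that WER can exceed 1.0 (reported above as >100 in percent) when the hypothesis contains many insertions, which is why the GigaSpeech2 rows for an unsupported language can go past 100.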

dasheng-0.6B

license:apache-2.0 · 110 downloads · 3 likes

midashenglm-7b-0804-w4a16-gptq

The 4-bit (w4a16) weights for mispeech/midashenglm-7b-0804-fp32, quantized with GPTQ. An ideal choice for resource-constrained environments: it offers broad GPU compatibility and a smaller memory footprint, making it suitable for deployment where VRAM, system memory, or storage is limited, provided a slight trade-off in quality is acceptable. MiDashengLM is under the Apache License 2.0, and we encourage its use in both research and business applications. If you find MiDashengLM useful in your research, please consider citing our work.

license:apache-2.0 · 109 downloads · 0 likes
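As background on the w4a16 format: weights are stored as 4-bit integers in the range [-8, 7] with one higher-precision scale per small group of weights, while activations stay 16-bit. GPTQ additionally chooses the integers to minimize layer output error; the sketch below shows only the simpler round-to-nearest baseline, with an illustrative group size:

```python
def quantize_w4a16(weights, group_size=4):
    """Symmetric 4-bit quantization with a per-group float scale.
    Round-to-nearest only; real GPTQ solves for the codes that best
    preserve the layer's outputs."""
    q, scales = [], []
    for g in range(0, len(weights), group_size):
        group = weights[g:g + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0  # guard all-zero groups
        scales.append(scale)
        q.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q, scales

def dequantize(q, scales, group_size=4):
    """Recover approximate weights: code * its group's scale."""
    return [q[i] * scales[i // group_size] for i in range(len(q))]
```

Smaller groups cost more scale storage but track outliers better, which is the main knob behind the quality/size trade-off the card describes.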

r1-aqa

license:apache-2.0 · 106 downloads · 17 likes

midashenglm-0.6b-fp32

license:apache-2.0 · 76 downloads · 1 like

midashenglm-7b-0804-fp8

The FP8 weights for mispeech/midashenglm-7b-0804-fp32. Optimized for Hopper-class (H100 and newer) GPUs, leveraging hardware support for enhanced performance and memory savings. While older GPUs may see limited performance gains, FP8 can still be used to conserve VRAM and storage. MiDashengLM is under the Apache License 2.0, and we encourage its use in both research and business applications. If you find MiDashengLM useful in your research, please consider citing our work.

license:apache-2.0 · 56 downloads · 0 likes
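FP8 weight formats like the common e4m3 layout (1 sign, 4 exponent, 3 mantissa bits, largest finite value 448) are what Hopper-class GPUs accelerate natively. Assuming e4m3 is the format in use here, a rough pure-Python sketch of rounding a value to its nearest representable e4m3 neighbor (NaN encodings ignored):

```python
import math

E4M3_MAX = 448.0  # largest finite e4m3 value

def round_to_e4m3(x: float) -> float:
    """Round to the nearest float8 e4m3 value. With 3 mantissa bits,
    representable values in the binade [2**e, 2**(e+1)) are spaced 2**(e-3)."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    mag = min(abs(x), E4M3_MAX)          # saturate instead of overflowing
    exp = max(math.floor(math.log2(mag)), -6)  # clamp into the subnormal step
    step = 2.0 ** (exp - 3)
    return sign * round(mag / step) * step
```

The coarse spacing (only 8 steps per power of two) is why FP8 usually needs per-tensor or per-channel scaling in practice, and why the card positions it as a memory saver rather than a drop-in replacement on pre-Hopper hardware.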

midashenglm-7b-1021-w4a16-gptq

license:apache-2.0 · 43 downloads · 0 likes

midashenglm-7b-1021-fp8

The FP8 weights for mispeech/midashenglm-7b-1021-fp32. Optimized for Hopper-class (H100 and newer) GPUs, leveraging hardware support for enhanced performance and memory savings. While older GPUs may see limited performance gains, FP8 can still be used to conserve VRAM and storage. MiDashengLM is under the Apache License 2.0, and we encourage its use in both research and business applications. If you find MiDashengLM useful in your research, please consider citing our work.

license:apache-2.0 · 41 downloads · 1 like

midashenglm-7b-0804-4bit-bnb

The bnb-4bit weights for mispeech/midashenglm-7b-0804-fp32. Note: this is a basic 4-bit quantization using bitsandbytes. For better performance and accuracy, we recommend our GPTQ-quantized version, which maintains higher quality while still providing significant memory savings. MiDashengLM is under the Apache License 2.0, and we encourage its use in both research and business applications. If you find MiDashengLM useful in your research, please consider citing our work.

license:apache-2.0 · 13 downloads · 0 likes
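For context on "basic 4-bit quantization": bitsandbytes' default NF4 scheme (from the QLoRA paper) stores each weight as an index into 16 fixed levels plus a per-block absmax scale, with no output-error optimization, which is why the GPTQ variant above tends to be more accurate. A pure-Python sketch; the level constants below are rounded to four decimals and the exact values live in bitsandbytes:

```python
# Approximate NF4 code values (normalized to [-1, 1]), per the QLoRA paper.
NF4_LEVELS = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848,
              -0.0911, 0.0, 0.0796, 0.1609, 0.2461, 0.3379,
              0.4407, 0.5626, 0.7230, 1.0]

def nf4_quantize(block):
    """Blockwise NF4: normalize by the block's absmax, then map each
    value to the index of the nearest fixed level."""
    absmax = max(abs(w) for w in block) or 1.0
    idx = [min(range(16), key=lambda i: abs(w / absmax - NF4_LEVELS[i]))
           for w in block]
    return idx, absmax

def nf4_dequantize(idx, absmax):
    """Recover approximate weights: level * block scale."""
    return [NF4_LEVELS[i] * absmax for i in idx]
```

Because the codebook is fixed and only the scale adapts per block, NF4 is cheap to apply at load time but cannot correct for how quantization error propagates through a layer, the gap GPTQ is designed to close.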

GLAP

license:apache-2.0 · 0 downloads · 4 likes

dasheng-denoiser

license:apache-2.0 · 0 downloads · 1 like