# simcse-roberta-large-zh
## Data List

The following datasets are all in Chinese.

| Data  | size(train) | size(valid) | size(test) |
|:-----:|:-----------:|:-----------:|:----------:|
| ATEC  | 62477  | 20000 | 20000 |
| BQ    | 100000 | 10000 | 10000 |
| LCQMC | 238766 | 8802  | 12500 |
| PAWSX | 49401  | 2000  | 2000  |
| STS-B | 5231   | 1458  | 1361  |
| SNLI  | 146828 | 2699  | 2618  |
| MNLI  | 122547 | 2932  | 2397  |

## Model List

The evaluation datasets are in Chinese, and we used the same base language model, RoBERTa, across the different methods. In addition, because the test sets of some datasets are small, which can lead to large variance in the evaluation scores, the evaluation here uses the train, valid, and test splits together, and the final result is a weighted average (w-avg).

| Model | STS-B(w-avg) | ATEC | BQ | LCQMC | PAWSX | Avg. |
|:------------------------:|:-----:|:---:|:---:|:-----:|:-----:|:----:|
| BAAI/bge-large-zh        | 78.61 | -   | -   | -     | -     | -    |
| BAAI/bge-large-zh-v1.5   | 79.07 | -   | -   | -     | -     | -    |
| hellonlp/simcse-large-zh | 81.32 | -   | -   | -     | -     | -    |

## Uses

You can use our model to encode sentences into embeddings, and to compute the cosine similarity between two sentences.
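A minimal sketch of both uses, assuming the standard `transformers` `AutoModel`/`AutoTokenizer` API. The model id `hellonlp/simcse-roberta-large-zh` and the `[CLS]`-token pooling strategy are assumptions not confirmed by this card; adjust them to match the actual checkpoint.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (plain Python lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def encode(sentences, model_name="hellonlp/simcse-roberta-large-zh"):
    """Encode a list of sentences into embeddings (one list of floats per sentence).

    The model id and the [CLS] pooling below are assumptions for illustration.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer  # lazy import: needs transformers

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Take the [CLS] token's last hidden state as the sentence embedding.
    return outputs.last_hidden_state[:, 0].tolist()
```

Typical usage: `embs = encode(["今天天气真好", "今天天气不错"])`, then `cosine_similarity(embs[0], embs[1])` gives a similarity score in [-1, 1].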