HIT-TMG
KaLM-embedding-multilingual-mini-v1
KaLM-Embedding is a series of embedding models adapted from auto-regressive LLMs with superior training data. KaLM-embedding-multilingual-mini is trained from Qwen/Qwen2-0.5B with massive weakly-supervised pre-training and supervised fine-tuning data.

- [x] Model Checkpoint
  - [x] KaLM-embedding-multilingual-mini-v1
  - [x] KaLM-embedding-multilingual-mini-instruct-v1
  - [x] KaLM-embedding-multilingual-mini-instruct-v1.5
  - [ ] KaLM-embedding-multilingual-max-v1
- [x] Training and Evaluation Code: HITsz-TMG/KaLM-Embedding
- [x] Technical Report: KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model
- [ ] Training Data

| Model Name | Model Size | C-MTEB(35) | MTEB(56) | avg |
|:----:|:---:|:---:|:---:|:---:|
| multilingual-e5-large | 560M | 58.81 | 61.5 | 60.16 |
| bge-m3 (dense) | 560M | 60.80 | 59.84 | 60.32 |
| gte-multilingual-base (dense) | 305M | 62.72 | 61.40 | 62.06 |
| KaLM-embedding-multilingual-mini-v1 | 494M | 62.31 | 61.87 | 62.09 |
| KaLM-embedding-multilingual-mini-instruct-v1 | 494M | 63.57 | 64.74 | 64.16 |
| KaLM-embedding-multilingual-mini-instruct-v1.5 | 494M | 64.13 | 64.94 | 64.53 |

## Requirements
Since this model is based on Qwen2, we advise you to install `transformers>=4.37.0`; otherwise loading the model may fail.

Using this model is easy when you have sentence-transformers installed. We add an instruction for classification and clustering tasks. To add the instruction to the query (the corpus gets no instruction), prepend it to the query text before encoding.

## Contact
If you encounter any issue, feel free to contact us via email: [email protected]
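The card stops short of a usage snippet. As a minimal sketch, here is one way to prepend a task instruction to a query; the `Instruct: ... \n Query: ...` template is an assumption for illustration, not the card's documented format, so check the official repo for the exact prompt convention:

```python
# Hypothetical sketch: the instruction template below is an assumed convention,
# not confirmed by this card.
def with_instruction(instruction: str, query: str) -> str:
    """Prepend a task instruction to a query; corpus texts get no instruction."""
    return f"Instruct: {instruction} \n Query: {query}"

queries = [with_instruction("Classify the sentiment of the review", "The food was great!")]
corpus = ["A review praising the restaurant."]  # encoded as-is, no instruction

# Encoding (requires `pip install sentence-transformers` and a model download):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("HIT-TMG/KaLM-embedding-multilingual-mini-v1")
# q_emb = model.encode(queries, normalize_embeddings=True)
# c_emb = model.encode(corpus, normalize_embeddings=True)

print(queries[0])
```

The asymmetry (instruction on queries only) matches the card's note that no instruction is added to the corpus.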
Uni-MoE-2.0-Omni
Uni-MoE 2.0 is a fully open-source omnimodal model that substantially advances the capabilities of Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generation. It is powered by an Omnimodality 3D RoPE and a Dynamic-Capacity Mixture-of-Experts architecture. Uni-MoE 2.0-Omni is the version of the Uni-MoE 2.0 series that integrates full-modality understanding as well as audio and image generation capabilities. If you enjoy our work or want timely updates, please give us a like and follow us.

## Open-source Plan
- [x] Model Checkpoint
  - [x] Uni-MoE 2.0-Omni
  - [x] Uni-MoE 2.0-Base
  - [x] Uni-MoE 2.0-Thinking
  - [x] Uni-MoE 2.0-Image
  - [x] Uni-MoE 2.0-MoE-TTS
- [x] Inference Code: HITsz-TMG/Uni-MoE-2.0
- [x] Training Code: HITsz-TMG/Uni-MoE-2.0
- [x] Technical Report: arxiv

1. Clone this repository and navigate to the Uni-MoE 2.0 folder.
2. Set up the environment: install the evaluation environment according to the requirements.

## Example Usage
We provide a simple example of how to use this repo. For detailed usage, please refer to the cookbook.
KaLM Embedding Multilingual Mini Instruct V1.5
KaLM-Embedding is a series of embedding models adapted from auto-regressive LLMs with superior training data. KaLM-embedding-multilingual-mini is trained from Qwen/Qwen2-0.5B with massive weakly-supervised pre-training and supervised fine-tuning data.

- [x] Model Checkpoint
  - [x] KaLM-embedding-multilingual-mini-v1
  - [x] KaLM-embedding-multilingual-mini-instruct-v1
  - [x] KaLM-embedding-multilingual-mini-instruct-v1.5
  - [ ] KaLM-embedding-multilingual-max-v1
- [x] Training and Evaluation Code: HITsz-TMG/KaLM-Embedding
- [x] Technical Report: KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model
- [ ] Training Data

| Model Name | Model Size | C-MTEB(35) | MTEB(56) | avg |
|:----:|:---:|:---:|:---:|:---:|
| multilingual-e5-large | 560M | 58.81 | 61.5 | 60.16 |
| bge-m3 (dense) | 560M | 60.80 | 59.84 | 60.32 |
| gte-multilingual-base (dense) | 305M | 62.72 | 61.40 | 62.06 |
| KaLM-embedding-multilingual-mini-v1 | 494M | 62.31 | 61.87 | 62.09 |
| KaLM-embedding-multilingual-mini-instruct-v1 | 494M | 63.57 | 64.74 | 64.16 |
| KaLM-embedding-multilingual-mini-instruct-v1.5 | 494M | 64.13 | 64.94 | 64.53 |

## Requirements
Since this model is based on Qwen2, we advise you to install `transformers>=4.37.0`; otherwise loading the model may fail.

Using this model is easy when you have sentence-transformers installed. We add instructions for asymmetric tasks: retrieval, reranking, classification, and clustering. To add the instruction to the query (the corpus gets no instruction), prepend it to the query text before encoding.

## Citation
Please cite the repo if you use the model or code in this repo.

## Contact
If you encounter any issue, feel free to contact us via email: [email protected]
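For the asymmetric tasks above, instructed query embeddings are matched against a plain-encoded corpus by cosine similarity. The helper below is a generic ranking sketch in plain Python, not code from the KaLM repo; the toy vectors stand in for `model.encode(...)` output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_by_cosine(query, corpus):
    """Return corpus indices sorted by descending cosine similarity to the query."""
    return sorted(range(len(corpus)), key=lambda i: cosine(query, corpus[i]), reverse=True)

# Toy 2-d embeddings standing in for real model output:
query = [1.0, 0.0]
corpus = [[0.9, 0.1], [0.0, 1.0]]
print(rank_by_cosine(query, corpus))  # passage 0 ranks first
```

With real embeddings the dimensionality would be the model's (e.g. 896 for this series), but the ranking logic is unchanged.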
KaLM Embedding Multilingual Mini Instruct V2
KaLM-Embedding-V2 is a versatile and compact embedding model which achieves impressive performance in general-purpose text embedding tasks by leveraging superior training techniques and data. KaLM-embedding-multilingual-mini-instruct-v2 is trained from Qwen/Qwen2-0.5B with massive weakly-supervised pre-training and high-quality supervised fine-tuning data.

The model incorporates several innovative designs:
- Architectural Design: integration of bidirectional attention, enhancing representation learning.
- Training Recipe: a multi-stage training strategy, progressively improving generalization and performance.
- Training Objective: a focal-style reweighting mechanism and an online hard-negative mixing strategy to improve the efficiency and continuity of embedding training.
- Training Data: 20 categories of data for pre-training and 100 categories of data for fine-tuning, as well as comprehensive recipes for curating training datasets.

## Model Information
- Model Size: 0.5B
- Embedding Dimension: 896
- Max Input Tokens: 32k
- MRL Dimensions: 896, 512, 256, 128, 64

- [x] Model Checkpoint
  - [x] KaLM-embedding-multilingual-mini-v1
  - [x] KaLM-embedding-multilingual-mini-instruct-v1
  - [x] KaLM-embedding-multilingual-mini-instruct-v1.5
  - [x] KaLM-embedding-multilingual-mini-instruct-v2
  - [x] KaLM-embedding-multilingual-mini-instruct-v2.5
- [x] Training and Evaluation Code: HITsz-TMG/KaLM-Embedding
- [x] Technical Report: KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model
- [x] Pre-training Data: Pre-training Data
- [x] Fine-tuning Data: Fine-tuning Data

## Evaluation
Overall results on MTEB (cmn, v1) and MTEB (eng, v1).
## Requirements
Since this model is based on Qwen2, we advise you to install `transformers>=4.37.0`; otherwise loading the model may fail.

## Usage
Using this model is easy when you have sentence-transformers installed. We add task instructions for queries in asymmetric tasks (retrieval, reranking, classification, and clustering), and for both queries and passages in symmetric tasks (STS and pair classification). To add a task instruction to the query, prepend it to the query text before encoding.

## Citation
If you find this model useful, please consider giving a star and a citation.

## Contact
If you encounter any issue, feel free to contact us via email.
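The card lists MRL (Matryoshka) dimensions of 896, 512, 256, 128, and 64, meaning an 896-d embedding can be truncated to a smaller prefix and re-normalized. The sketch below shows this generic truncate-then-renormalize step; it is an illustration of Matryoshka handling, not code from the KaLM repo:

```python
import math

MRL_DIMS = (896, 512, 256, 128, 64)  # supported dimensions, per the card

def truncate_embedding(emb, dim):
    """Keep the first `dim` components of a Matryoshka embedding and re-normalize."""
    if dim not in MRL_DIMS:
        raise ValueError(f"unsupported MRL dimension: {dim}")
    head = emb[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

vec = [0.5] * 896                 # stand-in for a model-produced embedding
small = truncate_embedding(vec, 64)
print(len(small))                 # 64
```

Smaller prefixes trade a little accuracy for much cheaper storage and similarity search.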
KaLM-embedding-multilingual-mini-instruct-v1
KaLM-embedding-multilingual-mini-instruct-v1-GGUF
KaLM-embedding-multilingual-mini-instruct-v1.5-GGUF
UniMoE Audio Preview
UniMoE-Audio is a unified framework that seamlessly combines speech and music generation, powered by a novel Dynamic-Capacity Mixture-of-Experts architecture. If you enjoy our work or want timely updates, please give us a like and follow us.

## Open-source Plan
- [x] Model Checkpoint
  - [x] UniMoE-Audio-preview
  - [ ] UniMoE-Audio
- [x] Training and Inference Code: HITsz-TMG/UniMoE-Audio
- [x] Technical Report: UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE

Since this model builds on Qwen2.5-VL, we advise you to install `transformers>=4.53.1`; otherwise loading the model may fail. We use `qwen-vl-utils` to handle various types of visual input; you can install it with `pip install qwen-vl-utils`. We use the Descript Audio Codec (DAC) for audio compression, which can likewise be installed from pip.

The model weights are downloaded automatically on the first run; a code snippet showing how to use UniMoE-Audio with `transformers` is provided in the repository. Please cite the repo if you use the model or code in this repo. If you encounter any issue, feel free to contact us via email: [email protected]
Uni-MoE-2.0-Base
Uni-MoE 2.0 is a fully open-source omnimodal model that substantially advances the capabilities of Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generation. Uni-MoE 2.0-Base is the version of the Uni-MoE 2.0 series that supports only all-modality understanding and does not include the audio and image generation modules. If you enjoy our work or want timely updates, please give us a like and follow us.

## Open-source Plan
- [x] Model Checkpoint
  - [x] Uni-MoE 2.0-Omni
  - [x] Uni-MoE 2.0-Base
  - [x] Uni-MoE 2.0-Thinking
  - [x] Uni-MoE 2.0-Image
  - [x] Uni-MoE 2.0-MoE-TTS
- [x] Inference Code: HITsz-TMG/Uni-MoE-2.0
- [x] Training Code: HITsz-TMG/Uni-MoE-2.0
- [x] Technical Report: arxiv

1. Clone this repository and navigate to the Uni-MoE 2.0 folder.
2. Set up the environment: install the evaluation environment according to the requirements.

## Example Usage
We provide a simple example of how to use this repo. For detailed usage, please refer to the cookbook.
EviOmni-nq_train-1.5B
EviOmni is a rational evidence extraction model. Compared to vanilla evidence extraction models, EviOmni demonstrates superiority in terms of performance, generalization, efficiency, and robustness. The code of EviOmni has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`; older versions may fail to load the model.

```python
import re

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
)


class MultiTokenStoppingCriteria(StoppingCriteria):
    """Stop generation once the last generated ids match a multi-token stop sequence."""

    def __init__(self, stop_ids, device):
        self.stop_ids = stop_ids
        self.stop_len = len(stop_ids)
        self.device = device

    def __call__(self, input_ids, scores, **kwargs):
        if input_ids.shape[1] >= self.stop_len:
            last_tokens = input_ids[0][-self.stop_len:].tolist()
            return last_tokens == self.stop_ids
        return False


model_name = "HIT-TMG/EviOmni-nq_train-1.5B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = open("eviomni_prompt", "r").read()
question = "..."
passages = "..."
instruction = prompt.format(question=question, passages=passages)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": instruction}
]

stop_token = " \n\n"  # note: part of the stop string appears truncated in the rendered card
stop_ids = tokenizer.encode(stop_token, add_special_tokens=False)
stopping_criteria = StoppingCriteriaList([
    MultiTokenStoppingCriteria(stop_ids, model.device)
])

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    stopping_criteria=stopping_criteria
)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()

# note: the delimiter tags around the capture group appear truncated in the rendered card
match = re.search(r" (.*?) ", response, re.DOTALL)
evidence = match.group(1).strip()
```

If you find our work helpful, feel free to give us a cite.
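The stopping criterion in the snippet above fires once the tail of the generated token ids equals the tokenized stop sequence. The core check can be sketched in plain Python with toy token ids (no torch needed), which makes the logic easy to verify in isolation:

```python
def hits_stop_sequence(generated_ids, stop_ids):
    """True once the tail of generated_ids equals the multi-token stop sequence."""
    if len(generated_ids) >= len(stop_ids):
        return generated_ids[-len(stop_ids):] == stop_ids
    return False

stop_ids = [13, 13]  # toy ids standing in for the tokenized stop string
print(hits_stop_sequence([5, 8, 13, 13], stop_ids))  # True
print(hits_stop_sequence([5, 8, 13, 7], stop_ids))   # False
```

This is why the stop string is tokenized with `add_special_tokens=False`: the comparison is over raw content ids, and any extra special tokens would prevent the tail from ever matching.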
```bibtex
@misc{EviOmni,
      title={Learning to Extract Rational Evidence via Reinforcement Learning for Retrieval-Augmented Generation},
      author={Xinping Zhao and Shouzheng Huang and Yan Zhong and Xinshuo Hu and Meishan Zhang and Baotian Hu and Min Zhang},
      year={2025},
      eprint={2507.15586},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.15586},
}
```
EviOmni-nq_train-7B
EviOmni is a rational evidence extraction model. Compared to vanilla evidence extraction models, EviOmni demonstrates superiority in terms of performance, generalization, efficiency, and robustness. The code of EviOmni has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`; older versions may fail to load the model.

```python
import re

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
)


class MultiTokenStoppingCriteria(StoppingCriteria):
    """Stop generation once the last generated ids match a multi-token stop sequence."""

    def __init__(self, stop_ids, device):
        self.stop_ids = stop_ids
        self.stop_len = len(stop_ids)
        self.device = device

    def __call__(self, input_ids, scores, **kwargs):
        if input_ids.shape[1] >= self.stop_len:
            last_tokens = input_ids[0][-self.stop_len:].tolist()
            return last_tokens == self.stop_ids
        return False


model_name = "HIT-TMG/EviOmni-nq_train-7B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = open("eviomni_prompt", "r").read()
question = "..."
passages = "..."
instruction = prompt.format(question=question, passages=passages)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": instruction}
]

stop_token = " \n\n"  # note: part of the stop string appears truncated in the rendered card
stop_ids = tokenizer.encode(stop_token, add_special_tokens=False)
stopping_criteria = StoppingCriteriaList([
    MultiTokenStoppingCriteria(stop_ids, model.device)
])

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    stopping_criteria=stopping_criteria
)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()

# note: the delimiter tags around the capture group appear truncated in the rendered card
match = re.search(r" (.*?) ", response, re.DOTALL)
evidence = match.group(1).strip()
```

If you find our work helpful, feel free to give us a cite.
```bibtex
@misc{EviOmni,
      title={Learning to Extract Rational Evidence via Reinforcement Learning for Retrieval-Augmented Generation},
      author={Xinping Zhao and Shouzheng Huang and Yan Zhong and Xinshuo Hu and Meishan Zhang and Baotian Hu and Min Zhang},
      year={2025},
      eprint={2507.15586},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.15586},
}
```
CIGEval-Qwen2.5-VL-7B-Instruct-sft
KaLM-embedding-multilingual-mini-unsupervised
dialogue-bart-large-chinese
dialogue-bart-base-chinese
GlyphBERT
yizhao-fin-en-scorer
bge-m3_RAG-conversational-IR
CIGEval-Qwen2-VL-7B-Instruct-sft
dialogue-bart-large-chinese-DuSinc
yizhao-risk-en-scorer
yizhao-risk-en-scorer

## Introduction
This is a BERT model fine-tuned on a high-quality English financial dataset. It generates a security risk score, which helps to identify and remove data with security risks from financial datasets, thereby reducing the proportion of illegal or undesirable data. For the complete data cleaning process, please refer to YiZhao.

## Quickstart
Here is an example code snippet for generating security risk scores using this model.
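The Quickstart mentions a snippet without showing one. Below is a hedged sketch for the yizhao risk scorers (en and zh): the two-class head and the softmax-to-risk-probability mapping are assumptions for illustration, not documented behavior of this model, so verify the actual head layout before relying on it:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def risk_score(logits):
    """Map classifier logits to a risk probability (assumes index 1 = 'risky')."""
    return softmax(logits)[1]

# Actual scoring (requires transformers and a model download):
# from transformers import AutoTokenizer, AutoModelForSequenceClassification
# tok = AutoTokenizer.from_pretrained("HIT-TMG/yizhao-risk-en-scorer")
# mdl = AutoModelForSequenceClassification.from_pretrained("HIT-TMG/yizhao-risk-en-scorer")
# logits = mdl(**tok(text, return_tensors="pt", truncation=True)).logits[0].tolist()

print(round(risk_score([0.0, 0.0]), 2))  # 0.5 for uninformative logits
```

Texts whose score exceeds a chosen threshold would then be dropped from the corpus, matching the filtering workflow the card describes.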
yizhao-fin-zh-scorer
yizhao-fin-zh-scorer

## Introduction
This is a BERT model fine-tuned on a high-quality Chinese financial dataset. It generates a financial relevance score for each piece of text; based on this score, financial data of different quality levels can be filtered by strategically setting thresholds. For the complete data cleaning process, please refer to YiZhao.

To collect training samples, we used the Qwen-72B model to thoroughly annotate small batches of samples extracted from Chinese datasets, scoring them from 0 to 5 based on financial relevance. Given the uneven class distribution in the labeled samples, we applied undersampling to ensure class balance; the final Chinese training dataset contains nearly 50,000 samples. During training, we froze the embedding and encoder layers and saved the model parameters that achieved the best F1 score.

## Quickstart
Here is an example code snippet for generating financial relevance scores using this model.
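Since the card's annotations run from 0 to 5, one plausible way to turn a 6-class classifier head into a single relevance score is a softmax-weighted expected value. This is a hedged sketch only: the 6-class head layout and the expected-score mapping are assumptions, not documented behavior of this scorer:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def relevance_score(logits):
    """Softmax-weighted expected score over classes 0..5 (assumed head layout)."""
    probs = softmax(logits)
    return sum(i * p for i, p in enumerate(probs))

# Uniform logits give the midpoint of the 0..5 scale:
print(relevance_score([0.0] * 6))  # 2.5
```

A threshold on this continuous score would then implement the quality filtering the card describes, e.g. keeping only texts scoring above 3.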
yizhao-risk-zh-scorer
yizhao-risk-zh-scorer

## Introduction
This is a BERT model fine-tuned on a high-quality Chinese financial dataset. It generates a security risk score, which helps to identify and remove data with security risks from financial datasets, thereby reducing the proportion of illegal or undesirable data. For the complete data cleaning process, please refer to YiZhao.

## Quickstart
Here is an example code snippet for generating security risk scores using this model.