mirth
chonky_distilbert_base_uncased_1
chonky_modernbert_base_1
chonky_modernbert_large_1
chonky_mmbert_small_multilingual_1
Chonky is a transformer model that segments text into semantically coherent chunks. It is intended for RAG systems: the chunks can be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.

⚠️ This model was fine-tuned on sequences of length 1024 (by default, mmBERT supports sequence lengths up to 8192).

I've made a small Python library for this model: chonky. You can also use the model through the standard NER pipeline.

The model was trained to split paragraphs from the minipile, bookcorpus and Project Gutenberg datasets.

Project Gutenberg validation:

| Model | de | en | es | fr | it | nl | pl | pt | ru | sv | zh |
|---|---|---|---|---|---|---|---|---|---|---|---|
| chonky_mmbert_small_multilingual_1 🆕 | 0.88 | 0.78 | 0.91 | 0.93 | 0.86 | 0.81 | 0.81 | 0.88 | 0.97 | 0.91 | 0.11 |
| chonky_modernbert_large_1 | 0.53 | 0.43 | 0.48 | 0.51 | 0.56 | 0.21 | 0.65 | 0.53 | 0.87 | 0.51 | 0.33 |
| chonky_modernbert_base_1 | 0.42 | 0.38 | 0.34 | 0.4 | 0.33 | 0.22 | 0.41 | 0.35 | 0.27 | 0.31 | 0.26 |
| chonky_distilbert_base_uncased_1 | 0.19 | 0.3 | 0.17 | 0.2 | 0.18 | 0.04 | 0.27 | 0.21 | 0.22 | 0.19 | 0.15 |
| Number of val tokens | 1M | 1M | 1M | 1M | 1M | 1M | 38k | 1M | 24k | 1M | 132k |

Various English datasets:

| Model | bookcorpus | enjudgements | paulgraham | 20newsgroups |
|---|---|---|---|---|
| chonky_modernbert_large_1 | 0.79 | 0.29 | 0.69 | 0.17 |
| chonky_modernbert_base_1 | 0.72 | 0.08 | 0.63 | 0.15 |
| chonky_distilbert_base_uncased_1 | 0.69 | 0.05 | 0.52 | 0.15 |
| chonky_mmbert_small_multilingual_1 🆕 | 0.72 | 0.2 | 0.56 | 0.13 |

The model was fine-tuned on a single H100 for several hours.
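
The NER-pipeline usage mentioned above could look roughly like the sketch below. It is a minimal illustration under assumptions, not the author's reference code: the Hub ID `mirth/chonky_mmbert_small_multilingual_1` is inferred from the names on this page, the helper names `split_on_separators` and `chunk_with_pipeline` are hypothetical, and it assumes each span the model predicts marks the end of a chunk. Inputs longer than the 1024-token fine-tuning length would need windowing, which this sketch does not handle.

```python
from typing import Dict, List


def split_on_separators(text: str, entities: List[Dict]) -> List[str]:
    """Cut `text` after the `end` offset of every predicted span.

    `entities` is expected to look like the output of a Hugging Face
    token-classification pipeline: dicts carrying character offsets
    under the "start" and "end" keys.
    """
    chunks: List[str] = []
    start = 0
    for ent in sorted(entities, key=lambda e: e["end"]):
        end = ent["end"]
        if end <= start:
            continue  # skip spans that do not move the cut point forward
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        start = end
    tail = text[start:].strip()
    if tail:
        chunks.append(tail)
    return chunks


def chunk_with_pipeline(
    text: str,
    # Assumed Hub ID; adjust to the checkpoint you actually use.
    model_name: str = "mirth/chonky_mmbert_small_multilingual_1",
) -> List[str]:
    """Run the token-classification pipeline and split on its predictions."""
    # Imported lazily so the splitting helper works without transformers.
    from transformers import pipeline

    pipe = pipeline(
        "token-classification",
        model=model_name,
        aggregation_strategy="simple",  # merge sub-word tokens into spans
    )
    return split_on_separators(text, pipe(text))
```

Calling `chunk_with_pipeline(long_text)` downloads the checkpoint on first use. The splitting helper is kept separate so it can be checked offline against hand-written offsets.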