mirth

4 models • 1 total models in database


chonky_mmbert_small_multilingual_1

Chonky is a transformer model that segments text into semantically coherent chunks. It is intended for RAG systems: the chunks it produces can be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.

⚠️ This model was fine-tuned on sequences of length 1024 (by default mmBERT supports sequence lengths up to 8192).

I've made a small Python library for this model: chonky. You can also use the model through the standard NER pipeline.

The model was trained to split paragraphs from the minipile, bookcorpus and Project Gutenberg datasets.

Project Gutenberg validation:

| Model | de | en | es | fr | it | nl | pl | pt | ru | sv | zh |
|---|---|---|---|---|---|---|---|---|---|---|---|
| chonky_mmbert_small_multi_1 🆕 | 0.88 | 0.78 | 0.91 | 0.93 | 0.86 | 0.81 | 0.81 | 0.88 | 0.97 | 0.91 | 0.11 |
| chonky_modernbert_large_1 | 0.53 | 0.43 | 0.48 | 0.51 | 0.56 | 0.21 | 0.65 | 0.53 | 0.87 | 0.51 | 0.33 |
| chonky_modernbert_base_1 | 0.42 | 0.38 | 0.34 | 0.4 | 0.33 | 0.22 | 0.41 | 0.35 | 0.27 | 0.31 | 0.26 |
| chonky_distilbert_base_uncased_1 | 0.19 | 0.3 | 0.17 | 0.2 | 0.18 | 0.04 | 0.27 | 0.21 | 0.22 | 0.19 | 0.15 |
| Number of val tokens | 1m | 1m | 1m | 1m | 1m | 1m | 38k | 1m | 24k | 1m | 132k |

Various English datasets:

| Model | bookcorpus | enjudgements | paulgraham | 20newsgroups |
|---|---|---|---|---|
| chonky_modernbert_large_1 | 0.79 | 0.29 | 0.69 | 0.17 |
| chonky_modernbert_base_1 | 0.72 | 0.08 | 0.63 | 0.15 |
| chonky_distilbert_base_uncased_1 | 0.69 | 0.05 | 0.52 | 0.15 |
| chonky_mmbert_small_multilingual_1 🆕 | 0.72 | 0.2 | 0.56 | 0.13 |

The model was fine-tuned on a single H100 for several hours.
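The NER-pipeline usage mentioned above could look roughly like the sketch below. It is an illustration under assumptions, not the card's original snippet: the Hub repo id `mirth/chonky_mmbert_small_multilingual_1` is inferred from the author and model names on this page, the helper names `build_pipeline` and `split_at_separators` are hypothetical, and the sketch assumes that each span the model predicts marks the end of a chunk, so the text is cut at the span's `end` character offset.

```python
from typing import Dict, List


def split_at_separators(text: str, entities: List[Dict]) -> List[str]:
    """Cut `text` right after each predicted span and return non-empty chunks.

    `entities` is expected to look like the output of a Hugging Face
    token-classification pipeline: dicts carrying an `end` character offset.
    Assumption: every predicted span marks the end of a chunk.
    """
    chunks = []
    start = 0
    for entity in sorted(entities, key=lambda e: e["end"]):
        chunk = text[start:entity["end"]].strip()
        if chunk:
            chunks.append(chunk)
        start = entity["end"]
    tail = text[start:].strip()
    if tail:
        chunks.append(tail)
    return chunks


def build_pipeline(model_name: str = "mirth/chonky_mmbert_small_multilingual_1"):
    """Load the model behind a token-classification ("ner") pipeline.

    The default repo id is an assumption based on the names on this page.
    """
    # Imported lazily so split_at_separators works without transformers installed.
    from transformers import (
        AutoModelForTokenClassification,
        AutoTokenizer,
        pipeline,
    )

    # The model was fine-tuned on sequences of length 1024, so cap the tokenizer there.
    tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=1024)
    model = AutoModelForTokenClassification.from_pretrained(model_name)
    return pipeline(
        "token-classification",
        model=model,
        tokenizer=tokenizer,
        aggregation_strategy="simple",  # merge sub-word tokens into spans with offsets
    )
```

With this in place, `split_at_separators(text, build_pipeline()(text))` would return the chunks for a single input string. Since the tokenizer is capped at 1024 tokens, longer documents would likely need to be split into windows before being passed in.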

license:mit

1,097


chonky_distilbert_base_uncased_1

chonky_modernbert_base_1

chonky_modernbert_large_1

chonky_mmbert_small_multilingual_1