jxm
cde-small-v2
gtr__nq__32
gtr__nq__32__correct
vec2text__openai_ada002__msmarco__msl128__hypothesizer
vec2text__openai_ada002__msmarco__msl128__corrector
t5-base__llama-7b__one-million-instructions__emb
gte-32-noise-0.01
gtr__msmarco__128
cde-small-v1
The cde-small-v1 model has been deprecated. We highly recommend transitioning to the improved cde-small-v2 model for better performance and support. For more details and to access the latest version, please visit the cde-small-v2 model page.

cde-small-v1 is our new model that naturally integrates "context tokens" into the embedding process. As of October 1st, 2024, `cde-small-v1` is the best small model (under 400M params) on the MTEB leaderboard for text embedding models, with an average score of 65.00.

Our embedding model is used in two stages. The first stage gathers dataset information by embedding a subset of the corpus with our "first-stage" model. The second stage actually embeds queries and documents, conditioning on the corpus information from the first stage. Note that the first stage can be run offline, so only the second-stage weights are needed at inference time.

Click to learn how to use cde-small-v1 with Transformers

Our model can be loaded with `transformers` out of the box with "trust remote code" enabled. We use the default BERT uncased tokenizer.

Nota bene: like all state-of-the-art embedding models, our model was trained with task-specific prefixes. To do retrieval, prepend the task-specific prefix strings to queries and documents, respectively.

Once we have obtained "dataset embeddings," we can embed documents and queries as usual. Remember to use the document prefix for documents. These embeddings can be compared with a dot product, since they are normalized.

What if I don't know what my corpus will be ahead of time?

If you can't obtain corpus information ahead of time, you still have to pass something as the dataset embeddings. Our model will work in this case, just not quite as well: without corpus information, performance drops from 65.0 to 63.8 on MTEB. We provide some random strings that worked well for us and can be used as a substitute for corpus sampling.
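The original card's code blocks were lost in extraction. As a hedged sketch of the two-stage flow described above: the prefix strings (`search_query: ` / `search_document: `) and the `first_stage_model` / `second_stage_model` attributes are recalled from the original cde-small-v1 card and should be treated as assumptions, not a verified API.

```python
# Sketch of the two-stage cde-small-v1 flow with `transformers`.
# Prefixes and model attribute names are assumptions recalled from the
# original model card, not verified here.
import random

QUERY_PREFIX = "search_query: "        # assumed retrieval prefix for queries
DOCUMENT_PREFIX = "search_document: "  # assumed retrieval prefix for documents


def sample_minicorpus(corpus, k, seed=0):
    """Stage-one input: a small random subset of the corpus whose embeddings
    become the "dataset embeddings" that stage two conditions on."""
    rng = random.Random(seed)
    return rng.sample(corpus, min(k, len(corpus)))


def embed_two_stage(model, tokenizer, corpus, texts, prefix, device="cpu"):
    """Sketch only (never executed in this snippet).

    `model` is assumed to come from
    AutoModel.from_pretrained("jxm/cde-small-v1", trust_remote_code=True)
    and `tokenizer` is the default BERT uncased tokenizer."""
    import torch

    # Stage one: embed a prefixed corpus subset to get dataset embeddings.
    minicorpus = [DOCUMENT_PREFIX + d for d in sample_minicorpus(corpus, k=512)]
    stage1 = tokenizer(minicorpus, truncation=True, padding=True,
                       return_tensors="pt").to(device)
    with torch.no_grad():
        dataset_embeddings = model.first_stage_model(**stage1)

    # Stage two: embed the actual texts, conditioning on the dataset embeddings.
    batch = tokenizer([prefix + t for t in texts], truncation=True,
                      padding=True, return_tensors="pt").to(device)
    with torch.no_grad():
        emb = model.second_stage_model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            dataset_embeddings=dataset_embeddings,
        )
    # Normalize, so a plain dot product gives cosine similarity.
    return emb / emb.norm(p=2, dim=1, keepdim=True)
```

You would call `embed_two_stage(..., prefix=QUERY_PREFIX)` for queries and `prefix=DOCUMENT_PREFIX` for documents; the corpus subsampling step can run once, offline, and be reused for every later query.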
Click to learn how to use cde-small-v1 with Sentence Transformers

Our model can be loaded with `sentence-transformers` out of the box with "trust remote code" enabled.

Nota bene: like all state-of-the-art embedding models, our model was trained with task-specific prefixes. To do retrieval, pass `prompt_name="query"` and `prompt_name="document"` to the model's `encode` method when embedding queries and documents, respectively.

Once we have obtained "dataset embeddings," we can embed documents and queries as usual. Remember to use the document prompt for documents. These embeddings can be compared using cosine similarity via `model.similarity`.

We've set up a short demo in a Colab notebook showing how you might use our model. Try our model in Colab.

Early experiments on CDE were done with support from Nomic and Hyperbolic. We're especially indebted to Nomic for open-sourcing their efficient BERT implementation and contrastive pre-training data, which proved vital in the development of CDE.

Used our model, method, or architecture? Want to cite us? Here's the ArXiv citation information:
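The Sentence Transformers code was likewise stripped. The sketch below follows the flow described above; the `dataset_embeddings` keyword to `encode` is recalled from the original card and is an assumption, as is the exact demo wiring. The pure-Python `cosine` helper just illustrates why unit-norm embeddings can be compared with a plain dot product.

```python
# Hedged sketch of the Sentence Transformers flow for cde-small-v1.
# `prompt_name` is documented above; the `dataset_embeddings` kwarg is an
# assumption recalled from the original card. Defined but not executed here,
# since it downloads the model.
def cde_retrieval_scores(queries, docs, minicorpus):
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("jxm/cde-small-v1", trust_remote_code=True)

    # Stage one: embed a corpus subset to obtain the dataset embeddings.
    dataset_embeddings = model.encode(
        minicorpus, prompt_name="document", convert_to_tensor=True)

    # Stage two: embed queries and documents, conditioning on the corpus.
    q = model.encode(queries, prompt_name="query",
                     dataset_embeddings=dataset_embeddings,
                     convert_to_tensor=True)
    d = model.encode(docs, prompt_name="document",
                     dataset_embeddings=dataset_embeddings,
                     convert_to_tensor=True)
    return model.similarity(q, d)  # cosine-similarity matrix


def cosine(u, v):
    """Cosine similarity; for unit-norm vectors this is just the dot product."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)
```

In practice the stage-one `dataset_embeddings` would be computed once per corpus and cached, so serving a query costs only the stage-two `encode` call plus a similarity lookup.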
t5-base___llama-7b___one-million-instructions__correct
shieldgemma-2b
This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- Developed by: [More Information Needed]
- Funded by [optional]: [More Information Needed]
- Shared by [optional]: [More Information Needed]
- Model type: [More Information Needed]
- Language(s) (NLP): [More Information Needed]
- License: [More Information Needed]
- Finetuned from model [optional]: [More Information Needed]
- Repository: [More Information Needed]
- Paper [optional]: [More Information Needed]
- Demo [optional]: [More Information Needed]

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information needed for further recommendations.

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

- Hardware Type: [More Information Needed]
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]