vidore
colqwen2.5-v0.2
---
license: mit
library_name: colpali
base_model: vidore/colqwen2.5-base
language:
- en
tags:
- colpali
- vidore
- vidore-experimental
pipeline_tag: visual-document-retrieval
---
colqwen2-v1.0
colSmol-500M
ColSmolVLM-Instruct-500M: Visual Retriever based on SmolVLM-Instruct-500M with ColBERT strategy

This version was trained with a batch size of 32 for 3 epochs.

ColSmolVLM is a model built on a novel architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents from their visual features. It is a SmolVLM extension that generates ColBERT-style multi-vector representations of text and images. It was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.

This version is trained with commit b983e40 of the ColPali repository (main branch from the repo). Data is the same as the ColPali data described in the paper.

Dataset

Our training dataset of 127,460 query-page pairs is comprised of train sets of openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents and augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%). Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used both in ViDoRe and in the train set to prevent evaluation contamination. A validation set is created with 2% of the samples to tune hyperparameters.

Note: Multilingual data is present in the pretraining corpus of the language model and most probably in the multimodal training.

Unless specified otherwise, we train models in `bfloat16` format, use low-rank adapters (LoRA) with `alpha=32` and `r=32` on the transformer layers from the language model, as well as the final randomly initialized projection layer, and use a `paged_adamw_8bit` optimizer. We train on a 4 GPU setup with data parallelism, a learning rate of 5e-4 with linear decay with 2.5% warmup steps, and a batch size of 8.

Make sure `colpali-engine` is installed from source or with a version greater than 0.3.5 (main branch from the repo currently). The `transformers` version must be > 4.46.2.

- Focus: The model primarily focuses on PDF-type documents and high-resource languages, potentially limiting its generalization to other document types or less represented languages.
- Support: The model relies on multi-vector retrieval derived from the ColBERT late interaction mechanism, which may require engineering efforts to adapt to widely used vector retrieval frameworks that lack native multi-vector support.

ColQwen2's vision language backbone model (Qwen2-VL) is under `apache2.0` license. The adapters attached to the model are under MIT license.

- Manuel Faysse: [email protected]
- Hugues Sibille: [email protected]
- Tony Wu: [email protected]

If you use any datasets or models from this organization in your research, please cite the original dataset as follows:
colpali-v1.2
ColPali: Visual Retriever based on PaliGemma-3B with ColBERT strategy

ColPali is a model built on a novel architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents from their visual features. It is a PaliGemma-3B extension that generates ColBERT-style multi-vector representations of text and images. It was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.

This version is trained with `colpali-engine==0.2.0` but can be loaded with any version `>=0.2.0`. Compared to `vidore/colpali`, this version is trained with right padding for queries to fix unwanted tokens in the query encoding. It also stems from the fixed `vidore/colpaligemma-3b-pt-448-base` to guarantee deterministic projection layer initialization. It was trained for 5 epochs, with in-batch negatives and hard mined negatives and a warmup of 1000 steps (10x longer) to help reduce non-English language collapse. Data is the same as the ColPali data described in the paper.

This model is built iteratively starting from an off-the-shelf SigLIP model. We finetuned it to create BiSigLIP and fed the patch embeddings output by SigLIP to an LLM, PaliGemma-3B, to create BiPali. One benefit of passing image patch embeddings through a language model is that they are natively mapped to a latent space similar to that of the textual input (query). This makes it possible to use the ColBERT strategy to compute interactions between text tokens and image patches, which yields a step-change improvement in performance compared to BiPali.

Dataset

Our training dataset of 127,460 query-page pairs is comprised of train sets of openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents and augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%).
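
The ColBERT strategy mentioned above scores a query against a page by late interaction: each query token embedding is matched with its most similar image patch embedding, and those maxima are summed. The following is a minimal pure-Python sketch of that scoring rule on toy 2-dimensional vectors. The function names (`maxsim_score`, `rank_pages`) and the toy data are illustrative assumptions, not part of `colpali-engine`, and real embeddings are much higher-dimensional tensors.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product between two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Late-interaction score: for every query token keep the best-matching
    page patch similarity, then sum those maxima over the query tokens."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs: List[Vector], pages: List[List[Vector]]) -> List[int]:
    """Return page indices sorted from highest to lowest late-interaction score."""
    scores = [maxsim_score(query_vecs, page) for page in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


# Toy data (hypothetical): 2 query tokens, two pages with 3 and 2 patches.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.1, 0.1], [0.3, 0.2]]

if __name__ == "__main__":
    print(maxsim_score(query, page_a), maxsim_score(query, page_b))
    print(rank_pages(query, [page_b, page_a]))
```

Because every page keeps one vector per patch rather than a single pooled vector, an index has to store and compare many vectors per page, which is the engineering cost noted in the limitations of this card.
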
Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used both in ViDoRe and in the train set to prevent evaluation contamination. A validation set is created with 2% of the samples to tune hyperparameters.

Note: Multilingual data is present in the pretraining corpus of the language model (Gemma-2B) and potentially occurs during PaliGemma-3B's multimodal training.

All models are trained for 1 epoch on the train set. Unless specified otherwise, we train models in `bfloat16` format, use low-rank adapters (LoRA) with `alpha=32` and `r=32` on the transformer layers from the language model, as well as the final randomly initialized projection layer, and use a `paged_adamw_8bit` optimizer. We train on an 8 GPU setup with data parallelism, a learning rate of 5e-5 with linear decay with 2.5% warmup steps, and a batch size of 32.

- Focus: The model primarily focuses on PDF-type documents and high-resource languages, potentially limiting its generalization to other document types or less represented languages.
- Support: The model relies on multi-vector retrieval derived from the ColBERT late interaction mechanism, which may require engineering efforts to adapt to widely used vector retrieval frameworks that lack native multi-vector support.

ColPali's vision language backbone model (PaliGemma) is under `gemma` license as specified in its model card. The adapters attached to the model are under MIT license.

- Manuel Faysse: [email protected]
- Hugues Sibille: [email protected]
- Tony Wu: [email protected]

If you use any datasets or models from this organization in your research, please cite the original dataset as follows:
colpali-v1.3
ColPali: Visual Retriever based on PaliGemma-3B with ColBERT strategy

This version is trained with a batch size of 256 for 3 epochs on the same data as the original ColPali model.

ColPali is a model built on a novel architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents from their visual features. It is a PaliGemma-3B extension that generates ColBERT-style multi-vector representations of text and images. It was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.

This version is trained with `colpali-engine==0.2.0` but can be loaded with any version `>=0.2.0`. Compared to `vidore/colpali`, this version is trained with right padding for queries to fix unwanted tokens in the query encoding. It also stems from the fixed `vidore/colpaligemma-3b-pt-448-base` to guarantee deterministic projection layer initialization. It was trained for 5 epochs, with in-batch negatives and hard mined negatives and a warmup of 1000 steps (10x longer) to help reduce non-English language collapse. Data is the same as the ColPali data described in the paper.

This model is built iteratively starting from an off-the-shelf SigLIP model. We finetuned it to create BiSigLIP and fed the patch embeddings output by SigLIP to an LLM, PaliGemma-3B, to create BiPali. One benefit of passing image patch embeddings through a language model is that they are natively mapped to a latent space similar to that of the textual input (query). This makes it possible to use the ColBERT strategy to compute interactions between text tokens and image patches, which yields a step-change improvement in performance compared to BiPali.

Dataset

Our training dataset of 127,460 query-page pairs is comprised of train sets of openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents and augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%).
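
To make the dataset proportions concrete, the sketch below turns the percentages reported in this card into approximate pair counts. The card only gives rounded percentages, so the counts are estimates. Treating the 2% validation share as a fraction of the same 127,460 pairs is an assumption, and the function name `approximate_count` is hypothetical.

```python
TOTAL_PAIRS = 127_460    # query-page pairs reported in the card
ACADEMIC_PERCENT = 63    # openly available academic train sets
SYNTHETIC_PERCENT = 37   # web-crawled PDF pages with VLM-generated pseudo-questions
VALIDATION_PERCENT = 2   # assumption: taken from the same pool of pairs


def approximate_count(total: int, percent: int) -> int:
    """Estimate the number of pairs behind a rounded percentage."""
    return round(total * percent / 100)


academic_pairs = approximate_count(TOTAL_PAIRS, ACADEMIC_PERCENT)
synthetic_pairs = approximate_count(TOTAL_PAIRS, SYNTHETIC_PERCENT)
validation_pairs = approximate_count(TOTAL_PAIRS, VALIDATION_PERCENT)

if __name__ == "__main__":
    print(f"academic  ~ {academic_pairs}")
    print(f"synthetic ~ {synthetic_pairs}")
    print(f"validation ~ {validation_pairs}")
```
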
Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used both in ViDoRe and in the train set to prevent evaluation contamination. A validation set is created with 2% of the samples to tune hyperparameters.

Note: Multilingual data is present in the pretraining corpus of the language model (Gemma-2B) and potentially occurs during PaliGemma-3B's multimodal training.

All models are trained for 1 epoch on the train set. Unless specified otherwise, we train models in `bfloat16` format, use low-rank adapters (LoRA) with `alpha=32` and `r=32` on the transformer layers from the language model, as well as the final randomly initialized projection layer, and use a `paged_adamw_8bit` optimizer. We train on an 8 GPU setup with data parallelism, a learning rate of 5e-5 with linear decay with 2.5% warmup steps, and a batch size of 32.

- Focus: The model primarily focuses on PDF-type documents and high-resource languages, potentially limiting its generalization to other document types or less represented languages.
- Support: The model relies on multi-vector retrieval derived from the ColBERT late interaction mechanism, which may require engineering efforts to adapt to widely used vector retrieval frameworks that lack native multi-vector support.

ColPali's vision language backbone model (PaliGemma) is under `gemma` license as specified in its model card. The adapters attached to the model are under MIT license.

- Manuel Faysse: [email protected]
- Hugues Sibille: [email protected]
- Tony Wu: [email protected]

If you use any datasets or models from this organization in your research, please cite the original dataset as follows:
colqwen2-v1.0-hf
colqwen2-v0.1
colqwen-omni-v0.1
ColQwen2.5-Omni: Visual+Audio Retriever based on Qwen2.5-Omni-3B-Instruct with ColBERT strategy

Check out the release blog post for in-depth explanations and tutorials!

ColQwen-Omni is a model built on a novel architecture and training strategy based on Omnimodal Language Models to efficiently index documents from their visual features. It is a Qwen2.5-Omni-3B extension that generates ColBERT-style multi-vector representations of text and images. It was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.

This model takes images at dynamic resolutions as input and does not resize them, so their aspect ratio is not changed as it is in ColPali. The maximal resolution is set so that at most 1024 image patches are created. Experiments show clear improvements with larger numbers of image patches, at the cost of higher memory requirements.

This version is trained with `colpali-engine==0.3.11`. Data is the same as the ColPali data described in the paper.

The audio retrieval capabilities are acquired in a zero-shot manner, as the entire training data is purely image-text matching. The audio and vision towers are frozen during training.

Our training dataset of 127,460 query-page pairs is comprised of train sets of openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents and augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%). Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used both in ViDoRe and in the train set to prevent evaluation contamination. A validation set is created with 2% of the samples to tune hyperparameters.

Note: Multilingual data is present in the pretraining corpus of the language model and most probably in the multimodal training.

Make sure `colpali-engine` is installed from source or with a version greater than 0.3.11.
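
A version requirement like the one above is easy to get wrong with plain string comparison, since `"0.3.9" > "0.3.11"` as strings. The sketch below compares numeric version components instead, using only the standard library. The helper names are hypothetical, and accepting a version equal to the stated minimum is an assumption made because the card also says the model was trained with `colpali-engine==0.3.11`.

```python
from importlib import metadata
from typing import Optional, Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Keep the leading numeric components, e.g. '0.3.11.dev0' -> (0, 3, 11)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def meets_requirement(installed: Optional[str], minimum: str) -> bool:
    """True when a version is installed and is numerically >= the minimum."""
    if installed is None:
        return False
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    found = installed_version("colpali-engine")
    print("colpali-engine:", found, "ok:", meets_requirement(found, "0.3.11"))
```
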
- Manuel Faysse: [email protected]
- Antonio Loison: [email protected]

If you use any datasets or models from this organization in your research, please cite the original dataset as follows:
colpali
ColPali: Visual Retriever based on PaliGemma-3B with ColBERT strategy

ColPali is a model built on a novel architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents from their visual features. It is a PaliGemma-3B extension that generates ColBERT-style multi-vector representations of text and images. It was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.

This model is built iteratively starting from an off-the-shelf SigLIP model. We finetuned it to create BiSigLIP and fed the patch embeddings output by SigLIP to an LLM, PaliGemma-3B, to create BiPali. One benefit of passing image patch embeddings through a language model is that they are natively mapped to a latent space similar to that of the textual input (query). This makes it possible to use the ColBERT strategy to compute interactions between text tokens and image patches, which yields a step-change improvement in performance compared to BiPali.

Dataset

Our training dataset of 127,460 query-page pairs is comprised of train sets of openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents and augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%). Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used both in ViDoRe and in the train set to prevent evaluation contamination. A validation set is created with 2% of the samples to tune hyperparameters.

Note: Multilingual data is present in the pretraining corpus of the language model (Gemma-2B) and potentially occurs during PaliGemma-3B's multimodal training.

All models are trained for 1 epoch on the train set.
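
The training setup reported for this model uses a learning rate of 5e-5 with linear decay and 2.5% warmup steps. As a rough illustration of what such a schedule looks like, the sketch below ramps the rate linearly from zero to the peak during warmup and then decays it linearly to zero. It is a generic formulation written for this example; the function name and the rounding of the warmup step count are assumptions, not the exact scheduler used for training.

```python
def linear_warmup_decay_lr(
    step: int,
    total_steps: int,
    peak_lr: float = 5e-5,
    warmup_fraction: float = 0.025,
) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = round(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Warmup phase: grow linearly from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: shrink linearly from the peak down to 0 at the last step.
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for s in (0, 10, 25, 500, 1000):
        print(s, linear_warmup_decay_lr(s, total))
```

With 1000 total steps, warmup covers the first 25 steps, so the peak rate is reached at step 25 and the rate returns to zero at step 1000.
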
Unless specified otherwise, we train models in `bfloat16` format, use low-rank adapters (LoRA) with `alpha=32` and `r=32` on the transformer layers from the language model, as well as the final randomly initialized projection layer, and use a `paged_adamw_8bit` optimizer. We train on an 8 GPU setup with data parallelism, a learning rate of 5e-5 with linear decay with 2.5% warmup steps, and a batch size of 32.

For best performance, newer models are available (vidore/colpali-v1.2).

- Focus: The model primarily focuses on PDF-type documents and high-resource languages, potentially limiting its generalization to other document types or less represented languages.
- Support: The model relies on multi-vector retrieval derived from the ColBERT late interaction mechanism, which may require engineering efforts to adapt to widely used vector retrieval frameworks that lack native multi-vector support.

ColPali's vision language backbone model (PaliGemma) is under `gemma` license as specified in its model card. The adapters attached to the model are under MIT license.

- Manuel Faysse: [email protected]
- Hugues Sibille: [email protected]
- Tony Wu: [email protected]

If you use any datasets or models from this organization in your research, please cite the original dataset as follows:
colpali-v1.3-hf
> [!IMPORTANT]
> This version of ColPali should be loaded with the `transformers 🤗` release, not with `colpali-engine`.
> It was converted using the `convert_colpali_weights_to_hf.py` script
> from the `vidore/colpali-v1.3-merged` checkpoint.

ColPali: Visual Retriever based on PaliGemma-3B with ColBERT strategy

ColPali is a model built on a novel architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents from their visual features. It is a PaliGemma-3B extension that generates ColBERT-style multi-vector representations of text and images. It was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.

The HuggingFace `transformers` 🤗 implementation was contributed by Tony Wu (@tonywu71) and Yoni Gozlan (@yonigozlan). Read the `transformers` 🤗 model card: https://huggingface.co/docs/transformers/en/model_doc/colpali.

Dataset

Our training dataset of 127,460 query-page pairs is comprised of train sets of openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents and augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%). Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used both in ViDoRe and in the train set to prevent evaluation contamination. A validation set is created with 2% of the samples to tune hyperparameters.

Note: Multilingual data is present in the pretraining corpus of the language model (Gemma-2B) and potentially occurs during PaliGemma-3B's multimodal training.

All models are trained for 1 epoch on the train set. Unless specified otherwise, we train models in `bfloat16` format, use low-rank adapters (LoRA) with `alpha=32` and `r=32` on the transformer layers from the language model, as well as the final randomly initialized projection layer, and use a `paged_adamw_8bit` optimizer. We train on an 8 GPU setup with data parallelism, a learning rate of 5e-5 with linear decay with 2.5% warmup steps, and a batch size of 32.

- The ColPali arXiv paper can be found here. 📄
- The official blog post detailing ColPali can be found here. 📝
- The original model implementation code for the ColPali model and for the `colpali-engine` package can be found here. 🌎
- Cookbooks for learning to use the transformers-native version of ColPali, fine-tuning, and similarity maps generation can be found here. 📚

- Focus: The model primarily focuses on PDF-type documents and high-resource languages, potentially limiting its generalization to other document types or less represented languages.
- Support: The model relies on multi-vector retrieval derived from the ColBERT late interaction mechanism, which may require engineering efforts to adapt to widely used vector retrieval frameworks that lack native multi-vector support.

ColPali's vision language backbone model (PaliGemma) is under `gemma` license as specified in its model card. ColPali inherits from this `gemma` license.

- Manuel Faysse: [email protected]
- Hugues Sibille: [email protected]
- Tony Wu: [email protected]

If you use any datasets or models from this organization in your research, please cite the original dataset as follows:
colSmol-256M
colpali-v1.2-hf
colpali-v1.1
colsmolvlm-v0.1
colpaligemma2-3b-pt-448-base
colqwen2.5-v0.1
ColSmolVLM-256M-Base
colSmol-500M-base
colpali2-3b-pt-448
colqwen2-v1.0-merged
colpali-3b-pt-448
colpali-v1.3-merged
colpali-v1.2-merged
colpaligemma-3b-pt-448-base
colqwen2-base
bisiglip
colidefics
colqwen2.5-base
bipali
colpali-hard-v1.1
colpaligemma-3b-mix-448-base
colqwen2-v0.1-merged
ColSmolVLM-base
ColSmolVLM: Visual Retriever based on SmolVLM with ColBERT strategy

ColSmolVLM is a model built on a novel architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents from their visual features. It is a SmolVLM extension that generates ColBERT-style multi-vector representations of text and images. It was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.

This version is the untrained base version, provided to guarantee deterministic projection layer initialization.

> [!WARNING]
> This version should not be used: it is solely the base version useful for deterministic LoRA initialization.

- Manuel Faysse: [email protected]
- Hugues Sibille: [email protected]
- Tony Wu: [email protected]

If you use any datasets or models from this organization in your research, please cite the original dataset as follows: