vidore

36 models • 1 total models in database

colqwen2.5-v0.2

---
license: mit
library_name: colpali
base_model: vidore/colqwen2.5-base
language:
- en
tags:
- colpali
- vidore
- vidore-experimental
pipeline_tag: visual-document-retrieval
---

license:mit

686,354

colqwen2-v1.0

license:apache-2.0

43,690

113

colSmol-500M

ColSmolVLM-Instruct-500M: Visual Retriever based on SmolVLM-Instruct-500M with ColBERT strategy

This is a version trained with batch size 32 for 3 epochs.

ColSmolVLM uses a novel model architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents from their visual features. It is a SmolVLM extension that generates ColBERT-style multi-vector representations of text and images. It was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.

This version is trained with commit b983e40 of the ColPali repository (main branch of the repo). Data is the same as the ColPali data described in the paper.

Dataset: Our training dataset of 127,460 query-page pairs comprises train sets of openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents, augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%). Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used both in ViDoRe and in the train set, to prevent evaluation contamination. A validation set is created with 2% of the samples to tune hyperparameters.

Note: Multilingual data is present in the pretraining corpus of the language model and most probably in the multimodal training.

Unless specified otherwise, we train models in `bfloat16` format, use low-rank adapters (LoRA) with `alpha=32` and `r=32` on the transformer layers from the language model, as well as the final randomly initialized projection layer, and use a `paged_adamw_8bit` optimizer. We train on a 4 GPU setup with data parallelism, a learning rate of 5e-4 with linear decay and 2.5% warmup steps, and a batch size of 8.

Make sure `colpali-engine` is installed from source or with a version later than 0.3.5 (currently the main branch of the repo). The `transformers` version must be > 4.46.2.

- Focus: The model primarily focuses on PDF-type documents and high-resource languages, potentially limiting its generalization to other document types or less represented languages.
- Support: The model relies on multi-vector retrieval derived from the ColBERT late interaction mechanism, which may require engineering effort to adapt to widely used vector retrieval frameworks that lack native multi-vector support.

ColQwen2's vision language backbone model (Qwen2-VL) is under `apache2.0` license. The adapters attached to the model are under MIT license.

- Manuel Faysse: [email protected]
- Hugues Sibille: [email protected]
- Tony Wu: [email protected]

If you use any datasets or models from this organization in your research, please cite the original dataset as follows:

license:mit

36,933
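
The colSmol-500M card above asks for a `colpali-engine` version later than 0.3.5 and a `transformers` version above 4.46.2. A minimal sketch of how such a check could look with only the standard library follows. The helper names (`parse_version`, `is_newer`, `check_environment`) are hypothetical and not part of either package, and the parser only reads leading integers of each dotted component, so a dedicated version library would be more robust.

```python
import re


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '4.46.2' or '0.3.6.dev0' into a tuple of its leading integer parts."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def is_newer(installed: str, minimum: str) -> bool:
    """True when `installed` is strictly greater than `minimum`."""
    return parse_version(installed) > parse_version(minimum)


# Thresholds quoted in the model card.
REQUIREMENTS = {"colpali-engine": "0.3.5", "transformers": "4.46.2"}


def check_environment() -> dict[str, bool]:
    """Compare installed package versions against the card's thresholds."""
    from importlib.metadata import PackageNotFoundError, version

    results = {}
    for name, minimum in REQUIREMENTS.items():
        try:
            results[name] = is_newer(version(name), minimum)
        except PackageNotFoundError:
            results[name] = False  # package is not installed at all
    return results
```

Calling `check_environment()` returns one boolean per package; a `False` entry means the package is missing or not newer than the stated threshold.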

colpali-v1.2

ColPali: Visual Retriever based on PaliGemma-3B with ColBERT strategy

ColPali uses a novel model architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents from their visual features. It is a PaliGemma-3B extension that generates ColBERT-style multi-vector representations of text and images. It was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.

This version is trained with `colpali-engine==0.2.0` but can be loaded with any version `>=0.2.0`. Compared to `vidore/colpali`, this version is trained with right padding for queries to fix unwanted tokens in the query encoding. It also stems from the fixed `vidore/colpaligemma-3b-pt-448-base` to guarantee deterministic projection layer initialization. It was trained for 5 epochs, with in-batch negatives and hard mined negatives and a warmup of 1000 steps (10x longer) to help reduce non-English language collapse. Data is the same as the ColPali data described in the paper.

This model is built iteratively starting from an off-the-shelf SigLIP model. We finetuned it to create BiSigLIP and fed the patch embeddings output by SigLIP to an LLM, PaliGemma-3B, to create BiPali. One benefit of passing image patch embeddings through a language model is that they are natively mapped to a latent space similar to that of the textual input (query). This makes it possible to use the ColBERT strategy to compute interactions between text tokens and image patches, which yields a step-change improvement in performance compared to BiPali.

Dataset: Our training dataset of 127,460 query-page pairs comprises train sets of openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents, augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%). Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used both in ViDoRe and in the train set, to prevent evaluation contamination. A validation set is created with 2% of the samples to tune hyperparameters.

Note: Multilingual data is present in the pretraining corpus of the language model (Gemma-2B) and potentially occurs during PaliGemma-3B's multimodal training.

All models are trained for 1 epoch on the train set. Unless specified otherwise, we train models in `bfloat16` format, use low-rank adapters (LoRA) with `alpha=32` and `r=32` on the transformer layers from the language model, as well as the final randomly initialized projection layer, and use a `paged_adamw_8bit` optimizer. We train on an 8 GPU setup with data parallelism, a learning rate of 5e-5 with linear decay and 2.5% warmup steps, and a batch size of 32.

- Focus: The model primarily focuses on PDF-type documents and high-resource languages, potentially limiting its generalization to other document types or less represented languages.
- Support: The model relies on multi-vector retrieval derived from the ColBERT late interaction mechanism, which may require engineering effort to adapt to widely used vector retrieval frameworks that lack native multi-vector support.

ColPali's vision language backbone model (PaliGemma) is under `gemma` license as specified in its model card. The adapters attached to the model are under MIT license.

- Manuel Faysse: [email protected]
- Hugues Sibille: [email protected]
- Tony Wu: [email protected]

If you use any datasets or models from this organization in your research, please cite the original dataset as follows:

license:mit

35,093

112
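
The colpali-v1.2 card above refers to the ColBERT strategy for computing interactions between text tokens and image patches. The sketch below illustrates the usual late-interaction (MaxSim) scoring rule in plain Python: each query token vector is matched with its best-scoring page patch vector, and the maxima are summed. The function names and the toy 2-dimensional vectors are illustrative assumptions, not the `colpali-engine` API, and a real system would use batched tensor operations.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, page_vecs):
    """Sum, over query tokens, of the best dot product with any page patch."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs, pages):
    """Return page indices ordered from highest to lowest MaxSim score."""
    scores = [maxsim_score(query_vecs, page) for page in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


# Toy data: two query token vectors and two pages with a few patch vectors each.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.0, 1.0], [0.0, 1.0]]

print(maxsim_score(query, page_a))         # 2.0
print(maxsim_score(query, page_b))         # 1.0
print(rank_pages(query, [page_a, page_b])) # [0, 1]
```

Because every page keeps one vector per patch, the index stores a matrix per page rather than a single vector, which is the engineering cost the card's "Support" note points to.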

colpali-v1.3

ColPali: Visual Retriever based on PaliGemma-3B with ColBERT strategy

This version is trained with a batch size of 256 for 3 epochs on the same data as the original ColPali model.

ColPali uses a novel model architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents from their visual features. It is a PaliGemma-3B extension that generates ColBERT-style multi-vector representations of text and images. It was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.

This version is trained with `colpali-engine==0.2.0` but can be loaded with any version `>=0.2.0`. Compared to `vidore/colpali`, this version is trained with right padding for queries to fix unwanted tokens in the query encoding. It also stems from the fixed `vidore/colpaligemma-3b-pt-448-base` to guarantee deterministic projection layer initialization. It was trained for 5 epochs, with in-batch negatives and hard mined negatives and a warmup of 1000 steps (10x longer) to help reduce non-English language collapse. Data is the same as the ColPali data described in the paper.

This model is built iteratively starting from an off-the-shelf SigLIP model. We finetuned it to create BiSigLIP and fed the patch embeddings output by SigLIP to an LLM, PaliGemma-3B, to create BiPali. One benefit of passing image patch embeddings through a language model is that they are natively mapped to a latent space similar to that of the textual input (query). This makes it possible to use the ColBERT strategy to compute interactions between text tokens and image patches, which yields a step-change improvement in performance compared to BiPali.

Dataset: Our training dataset of 127,460 query-page pairs comprises train sets of openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents, augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%). Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used both in ViDoRe and in the train set, to prevent evaluation contamination. A validation set is created with 2% of the samples to tune hyperparameters.

Note: Multilingual data is present in the pretraining corpus of the language model (Gemma-2B) and potentially occurs during PaliGemma-3B's multimodal training.

All models are trained for 1 epoch on the train set. Unless specified otherwise, we train models in `bfloat16` format, use low-rank adapters (LoRA) with `alpha=32` and `r=32` on the transformer layers from the language model, as well as the final randomly initialized projection layer, and use a `paged_adamw_8bit` optimizer. We train on an 8 GPU setup with data parallelism, a learning rate of 5e-5 with linear decay and 2.5% warmup steps, and a batch size of 32.

- Focus: The model primarily focuses on PDF-type documents and high-resource languages, potentially limiting its generalization to other document types or less represented languages.
- Support: The model relies on multi-vector retrieval derived from the ColBERT late interaction mechanism, which may require engineering effort to adapt to widely used vector retrieval frameworks that lack native multi-vector support.

ColPali's vision language backbone model (PaliGemma) is under `gemma` license as specified in its model card. The adapters attached to the model are under MIT license.

- Manuel Faysse: [email protected]
- Hugues Sibille: [email protected]
- Tony Wu: [email protected]

If you use any datasets or models from this organization in your research, please cite the original dataset as follows:

license:mit

31,227
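
The colpali-v1.3 card above mentions a learning rate of 5e-5 with linear decay and 2.5% warmup steps. As a rough illustration of what such a schedule looks like, the sketch below ramps the rate linearly from zero to the peak during warmup and then decays it linearly. The function name, the `total_steps` value in the usage lines, and the choice to decay all the way to zero are assumptions made for the example; the card does not state them.

```python
def linear_warmup_decay(step, total_steps, peak_lr=5e-5, warmup_ratio=0.025):
    """Learning rate at `step` for linear warmup followed by linear decay.

    peak_lr and warmup_ratio follow the values quoted in the card;
    decaying to exactly zero at the last step is an assumption.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp up from 0 to peak_lr over the warmup steps.
        return peak_lr * step / warmup_steps
    # Decay from peak_lr at the end of warmup down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run length of 4000 optimizer steps, so warmup lasts 100 steps.
for step in (0, 50, 100, 2050, 4000):
    print(step, linear_warmup_decay(step, total_steps=4000))
```

With these numbers the rate reaches its peak at step 100 and is half the peak at step 2050, midway through the decay phase.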

colqwen2-v1.0-hf

license:apache-2.0

14,474

colqwen2-v0.1

license:apache-2.0

8,801

191

colqwen-omni-v0.1

ColQwen2.5-Omni: Visual+Audio Retriever based on Qwen2.5-Omni-3B-Instruct with ColBERT strategy

Check out the release blogpost for in-depth explanations and tutorials!

ColQwen-Omni uses a novel model architecture and training strategy based on Omnimodal Language Models to efficiently index documents from their visual features. It is a Qwen2.5-Omni-3B extension that generates ColBERT-style multi-vector representations of text and images. It was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.

This model accepts images at dynamic resolutions and does not resize them to a fixed shape, which would change their aspect ratio as in ColPali. The maximal resolution is set so that at most 1024 image patches are created. Experiments show clear improvements with larger numbers of image patches, at the cost of higher memory requirements.

This version is trained with `colpali-engine==0.3.11`. Data is the same as the ColPali data described in the paper.

The audio retrieval capabilities are acquired zero-shot, as the entire training data is purely image-text matching. The audio and vision towers are frozen during training.

Our training dataset of 127,460 query-page pairs comprises train sets of openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents, augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%). Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used both in ViDoRe and in the train set, to prevent evaluation contamination. A validation set is created with 2% of the samples to tune hyperparameters.

Note: Multilingual data is present in the pretraining corpus of the language model and most probably in the multimodal training.

Make sure `colpali-engine` is installed from source or with a version later than 0.3.11.

- Manuel Faysse: [email protected]
- Antonio Loison: [email protected]

If you use any datasets or models from this organization in your research, please cite the original dataset as follows:

license:mit

5,010
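
The colqwen-omni-v0.1 card above says the maximal resolution is chosen so that at most 1024 image patches are created, while the aspect ratio is preserved. The sketch below shows one way such a patch budget could be enforced. It is not the model's actual preprocessing: the 28-pixel patch size, the rounding rules, and the function name `fit_to_patch_budget` are all assumptions made for illustration.

```python
import math


def fit_to_patch_budget(width, height, patch_px=28, max_patches=1024):
    """Return (new_width, new_height, n_patches), shrinking only if needed.

    patch_px is a hypothetical patch edge in pixels; the card does not state it.
    """
    # Size of the patch grid at the original resolution.
    cols = max(1, round(width / patch_px))
    rows = max(1, round(height / patch_px))
    if cols * rows > max_patches:
        # Scale both sides by the same factor to roughly keep the aspect ratio.
        scale = math.sqrt(max_patches / (cols * rows))
        cols = max(1, math.floor(cols * scale))
        rows = max(1, math.floor(rows * scale))
        # Guard against extreme aspect ratios where one side was clamped to 1.
        rows = min(rows, max_patches)
        cols = max(1, min(cols, max_patches // rows))
    return cols * patch_px, rows * patch_px, cols * rows


print(fit_to_patch_budget(560, 280))     # (560, 280, 200): already within budget
print(fit_to_patch_budget(1240, 1754))   # (728, 1064, 988): shrunk to fit
```

A larger `max_patches` keeps more page detail per image but increases the number of vectors stored and compared per page, which matches the memory trade-off the card describes.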

colpali

ColPali: Visual Retriever based on PaliGemma-3B with ColBERT strategy

ColPali uses a novel model architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents from their visual features. It is a PaliGemma-3B extension that generates ColBERT-style multi-vector representations of text and images. It was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.

This model is built iteratively starting from an off-the-shelf SigLIP model. We finetuned it to create BiSigLIP and fed the patch embeddings output by SigLIP to an LLM, PaliGemma-3B, to create BiPali. One benefit of passing image patch embeddings through a language model is that they are natively mapped to a latent space similar to that of the textual input (query). This makes it possible to use the ColBERT strategy to compute interactions between text tokens and image patches, which yields a step-change improvement in performance compared to BiPali.

Dataset: Our training dataset of 127,460 query-page pairs comprises train sets of openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents, augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%). Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used both in ViDoRe and in the train set, to prevent evaluation contamination. A validation set is created with 2% of the samples to tune hyperparameters.

Note: Multilingual data is present in the pretraining corpus of the language model (Gemma-2B) and potentially occurs during PaliGemma-3B's multimodal training.

All models are trained for 1 epoch on the train set. Unless specified otherwise, we train models in `bfloat16` format, use low-rank adapters (LoRA) with `alpha=32` and `r=32` on the transformer layers from the language model, as well as the final randomly initialized projection layer, and use a `paged_adamw_8bit` optimizer. We train on an 8 GPU setup with data parallelism, a learning rate of 5e-5 with linear decay and 2.5% warmup steps, and a batch size of 32.

For best performance, newer models are available (vidore/colpali-v1.2).

- Focus: The model primarily focuses on PDF-type documents and high-resource languages, potentially limiting its generalization to other document types or less represented languages.
- Support: The model relies on multi-vector retrieval derived from the ColBERT late interaction mechanism, which may require engineering effort to adapt to widely used vector retrieval frameworks that lack native multi-vector support.

ColPali's vision language backbone model (PaliGemma) is under `gemma` license as specified in its model card. The adapters attached to the model are under MIT license.

- Manuel Faysse: [email protected]
- Hugues Sibille: [email protected]
- Tony Wu: [email protected]

If you use any datasets or models from this organization in your research, please cite the original dataset as follows:

license:mit

3,852

463
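
The colpali card above describes a training set of 127,460 query-page pairs split 63% / 37% between academic and synthetic data, with 2% of samples held out for validation. The short sketch below turns those percentages into approximate counts. Since the card's percentages are rounded, the results are estimates only, and treating the validation set as drawn from the same 127,460 pairs is an assumption.

```python
def approximate_split(total_pairs=127_460, academic_share=0.63, validation_share=0.02):
    """Estimate subset sizes from the rounded percentages quoted in the card."""
    academic = round(total_pairs * academic_share)
    synthetic = total_pairs - academic  # the remaining ~37%
    validation = round(total_pairs * validation_share)
    return {
        "academic": academic,
        "synthetic": synthetic,
        "validation": validation,
        # Assumes validation samples are taken out of the same pool.
        "train_after_validation": total_pairs - validation,
    }


print(approximate_split())
# {'academic': 80300, 'synthetic': 47160, 'validation': 2549, 'train_after_validation': 124911}
```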

colpali-v1.3-hf

> [!IMPORTANT]
> This version of ColPali should be loaded with the `transformers 🤗` release, not with `colpali-engine`.
> It was converted using the `convert_colpali_weights_to_hf.py` script
> from the `vidore/colpali-v1.3-merged` checkpoint.

ColPali: Visual Retriever based on PaliGemma-3B with ColBERT strategy

ColPali uses a novel model architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents from their visual features. It is a PaliGemma-3B extension that generates ColBERT-style multi-vector representations of text and images. It was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.

The HuggingFace `transformers` 🤗 implementation was contributed by Tony Wu (@tonywu71) and Yoni Gozlan (@yonigozlan). Read the `transformers` 🤗 model card: https://huggingface.co/docs/transformers/en/model_doc/colpali.

Dataset: Our training dataset of 127,460 query-page pairs comprises train sets of openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents, augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%). Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used both in ViDoRe and in the train set, to prevent evaluation contamination. A validation set is created with 2% of the samples to tune hyperparameters.

Note: Multilingual data is present in the pretraining corpus of the language model (Gemma-2B) and potentially occurs during PaliGemma-3B's multimodal training.

All models are trained for 1 epoch on the train set. Unless specified otherwise, we train models in `bfloat16` format, use low-rank adapters (LoRA) with `alpha=32` and `r=32` on the transformer layers from the language model, as well as the final randomly initialized projection layer, and use a `paged_adamw_8bit` optimizer. We train on an 8 GPU setup with data parallelism, a learning rate of 5e-5 with linear decay and 2.5% warmup steps, and a batch size of 32.

- The ColPali arXiv paper can be found here. 📄
- The official blog post detailing ColPali can be found here. 📝
- The original model implementation code for the ColPali model and for the `colpali-engine` package can be found here. 🌎
- Cookbooks for learning to use the transformers-native version of ColPali, fine-tuning, and similarity maps generation can be found here. 📚

- Focus: The model primarily focuses on PDF-type documents and high-resource languages, potentially limiting its generalization to other document types or less represented languages.
- Support: The model relies on multi-vector retrieval derived from the ColBERT late interaction mechanism, which may require engineering effort to adapt to widely used vector retrieval frameworks that lack native multi-vector support.

ColPali's vision language backbone model (PaliGemma) is under `gemma` license as specified in its model card. ColPali inherits from this `gemma` license.

- Manuel Faysse: [email protected]
- Hugues Sibille: [email protected]
- Tony Wu: [email protected]

If you use any datasets or models from this organization in your research, please cite the original dataset as follows:

—

2,263
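
The colpali-v1.3-hf card above mentions low-rank adapters (LoRA) with `alpha=32` and `r=32`. To give a sense of what these settings imply, the sketch below computes the scaling factor applied to the adapter update and the number of extra trainable parameters for one linear layer, using the standard LoRA formulation (two matrices of shapes `d_in x r` and `r x d_out`, scaled by `alpha / r`). The 2048 x 2048 layer size is a hypothetical example, not a dimension taken from the model.

```python
def lora_scaling(alpha=32, r=32):
    """Factor applied to the low-rank update in standard LoRA (alpha / r)."""
    return alpha / r


def lora_extra_params(d_in, d_out, r=32):
    """Trainable parameters added by one adapter: A is d_in x r, B is r x d_out."""
    return r * (d_in + d_out)


# Hypothetical square projection layer, used only to illustrate the ratio.
d_in, d_out = 2048, 2048
full = d_in * d_out
extra = lora_extra_params(d_in, d_out)

print(lora_scaling())   # 1.0
print(extra)            # 131072
print(extra / full)     # 0.03125
```

With `alpha` equal to `r`, the adapter update is applied unscaled, and for a layer of this assumed size the adapter trains about 3% as many parameters as the frozen weight matrix holds.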

colSmol-256M

license:mit

2,162

vidore

colqwen2.5-v0.2

colqwen2-v1.0

colSmol-500M

colpali-v1.2

colpali-v1.3

colqwen2-v1.0-hf

colqwen2-v0.1

colqwen-omni-v0.1

colpali

colpali-v1.3-hf

colSmol-256M

colpali-v1.2-hf

colpali-v1.1

colsmolvlm-v0.1

colpaligemma2-3b-pt-448-base

colqwen2.5-v0.1

ColSmolVLM-256M-Base

colSmol-500M-base

colpali2-3b-pt-448

colqwen2-v1.0-merged

colpali-3b-pt-448

colpali-v1.3-merged

colpali-v1.2-merged

colpaligemma-3b-pt-448-base

colqwen2-base

bisiglip

colidefics

colqwen2.5-base

bipali

colpali-hard-v1.1

colpaligemma-3b-mix-448-base

colqwen2-v0.1-merged

ColSmolVLM-base

ColSmolVLM-Instruct-256M-base

colSmol-256M-base

ColSmolVLM-Instruct-500M-base