Salesforce
✓ VerifiedEnterpriseSalesforce AI Research, enterprise CRM AI solutions
blip-image-captioning-base
--- pipeline_tag: image-to-text tags: - image-captioning languages: - en license: bsd-3-clause ---
blip-image-captioning-large
--- pipeline_tag: image-to-text tags: - image-captioning languages: - en license: bsd-3-clause ---
blip2-opt-2.7b
--- language: en license: mit tags: - vision - image-to-text - image-captioning - visual-question-answering pipeline_tag: image-text-to-text ---
blip2-opt-2.7b-coco
--- language: en license: mit tags: - vision - image-to-text - image-captioning - visual-question-answering pipeline_tag: image-to-text inference: false ---
blip-vqa-base
--- pipeline_tag: 'visual-question-answering' tags: - visual-question-answering inference: false languages: - en license: bsd-3-clause ---
moirai-moe-1.0-R-base
--- license: cc-by-nc-4.0 pipeline_tag: time-series-forecasting tags: - time series - forecasting - pretrained models - foundation models - time series foundation models - time-series ---
codegen-350M-mono
moirai-2.0-R-small
This model is designed for time series forecasting. It is licensed under CC BY-NC 4.0 and is categorized as a pretrained foundation model for time series applications.
codet5p-110m-embedding
blip2-flan-t5-xl
moirai-1.1-R-large
This is new updated version of Moirai-1.0-R (https://huggingface.co/Salesforce/moirai-1.0-R-large). The Moirai-1.1-R model achieved significant improvements (~20%) for low-frequency cases like Yearly and Quarterly data in Normalised Mean Absolute Error (NMAE) for 40 datasets on the Monash repository. This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
moirai-1.1-R-base
This is new updated version of Moirai-1.0-R (https://huggingface.co/Salesforce/moirai-1.0-R-base). The new Moirai model achieved significant improvements (~20%) for low-frequency cases like Yearly and Quarterly data in Normalised Mean Absolute Error (NMAE) for 40 datasets on the Monash repository. This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
moirai-1.1-R-small
This is new updated version of Moirai-1.0-R (https://huggingface.co/Salesforce/moirai-1.0-R-base). The Moirai-1.1-R model achieved significant improvements (~20%) for low-frequency cases like Yearly and Quarterly data in Normalised Mean Absolute Error (NMAE) for 40 datasets on the Monash repository. This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
SFR-Embedding-Mistral
blip2-opt-6.7b-coco
blip-itm-base-coco
codet5-small
Pre-trained CodeT5 model. It was introduced in the paper CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation by Yue Wang, Weishi Wang, Shafiq Joty, Steven C.H. Hoi and first released in this repository. Disclaimer: The team releasing CodeT5 did not write a model card for this model so this model card has been written by the Hugging Face team (more specifically, nielsr). "We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers. Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning. Besides, we propose a novel identifier-aware pre-training task that enables the model to distinguish which code tokens are identifiers and to recover them when they are masked. Furthermore, we propose to exploit the user-written code comments with a bimodal dual generation task for better NL-PL alignment. Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including PL-NL, NL-PL, and PL-PL. Further analysis reveals that our model can better capture semantic information from code." This repository contains the pre-trained model only, so you can use this model for masked span prediction, as shown in the code example below. However, the main use of this model is to fine-tune it for a downstream task of interest, such as: code summarization code generation code translation code refinement code defect detection code clone detection. See the model hub to look for fine-tuned versions on a task that interests you. The CodeT5 model was pretrained on CodeSearchNet Husain et al., 2019. Additionally, the authors collected two datasets of C/CSharp from BigQuery1 to ensure that all downstream tasks have overlapped programming languages with the pre-training data. In total, around 8.35 million instances are used for pretraining. This model uses a code-specific BPE (Byte-Pair Encoding) tokenizer. One can prepare text (or code) for the model using RobertaTokenizer, with the files from this repository. For evaluation results on several downstream benchmarks, we refer to the paper. This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
blip-vqa-capfilt-large
codet5-base
Pre-trained CodeT5 model. It was introduced in the paper CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation by Yue Wang, Weishi Wang, Shafiq Joty, Steven C.H. Hoi and first released in this repository. Disclaimer: The team releasing CodeT5 did not write a model card for this model so this model card has been written by the Hugging Face team (more specifically, nielsr). "We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers. Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning. Besides, we propose a novel identifier-aware pre-training task that enables the model to distinguish which code tokens are identifiers and to recover them when they are masked. Furthermore, we propose to exploit the user-written code comments with a bimodal dual generation task for better NL-PL alignment. Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including PL-NL, NL-PL, and PL-PL. Further analysis reveals that our model can better capture semantic information from code." This repository contains the pre-trained model only, so you can use this model for (among other tasks) masked span prediction, as shown in the code example below. However, the main use of this model is to fine-tune it for a downstream task of interest, such as: code summarization code generation code translation code refinement code defect detection code clone detection. Supervised datasets for code can be found here. See the model hub to look for fine-tuned versions on a task that interests you. The CodeT5 model was pretrained on CodeSearchNet Husain et al., 2019. Additionally, the authors collected two datasets of C/CSharp from BigQuery1 to ensure that all downstream tasks have overlapped programming languages with the pre-training data. In total, around 8.35 million instances are used for pretraining. This model uses a code-specific BPE (Byte-Pair Encoding) tokenizer trained using the HuggingFace Tokenizers library. One can prepare text (or code) for the model using RobertaTokenizer, with the files from this repository. For evaluation results on several downstream benchmarks, we refer to the paper. This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
blip2-opt-6.7b
BLIP-2 model, leveraging OPT-6.7b (a large language model with 6.7 billion parameters). It was introduced in the paper BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders a...
SFR-Embedding-2_R
blip-itm-large-coco
instructblip-vicuna-7b
blip2-itm-vit-g
SFR-Embedding-Code-400M_R
codet5p-220m
moirai-1.0-R-small
instructblip-flan-t5-xl
Llama-xLAM-2-8b-fc-r-gguf
This repo provides the GGUF format for the Llama-xLAM-2-8b-fc-r model. Here's a link to original model Llama-xLAM-2-8b-fc-r. Large Action Models (LAMs) are advanced language models designed to enhance decision-making by translating user intentions into executable actions. As the brains of AI agents, LAMs autonomously plan and execute tasks to achieve specific goals, making them invaluable for automating workflows across diverse domains. This model release is for research purposes only. The new xLAM-2 series, built on our most advanced data synthesis, processing, and training pipelines, marks a significant leap in multi-turn conversation and tool usage. Trained using our novel APIGen-MT framework, which generates high-quality training data through simulated agent-human interactions. Our models achieve state-of-the-art performance on BFCL and τ-bench benchmarks, outperforming frontier models like GPT-4o and Claude 3.5. Notably, even our smaller models demonstrate superior capabilities in multi-turn scenarios while maintaining exceptional consistency across trials. We've also refined the chat template and vLLM integration, making it easier to build advanced AI agents. Compared to previous xLAM models, xLAM-2 offers superior performance and seamless deployment across applications. Comparative performance of larger xLAM-2-fc-r models (8B-70B, trained with APIGen-MT data) against state-of-the-art baselines on function-calling (BFCL v3, as of date 04/02/2025) and agentic (τ-bench) capabilities. Table of Contents - Model Series - Using GGUF Files - Benchmark Results - Citation xLAM series are significantly better at many things including general tasks and function calling. For the same number of parameters, the model have been fine-tuned across a wide range of agent tasks and scenarios, all while preserving the capabilities of the original model. | Model | # Total Params | Context Length | Category | Download Model | Download GGUF files | |------------------------|----------------|------------|-------|----------------|----------| | Llama-xLAM-2-70b-fc-r | 70B | 128k | Multi-turn Conversation, Function-calling | 🤗 Link | NA | | Llama-xLAM-2-8b-fc-r | 8B | 128k | Multi-turn Conversation, Function-calling | 🤗 Link | 🤗 Link | | xLAM-2-32b-fc-r | 32B | 32k (max 128k) | Multi-turn Conversation, Function-calling | 🤗 Link | NA | | xLAM-2-3b-fc-r | 3B | 32k (max 128k) | Multi-turn Conversation, Function-calling | 🤗 Link | 🤗 Link | | xLAM-2-1b-fc-r | 1B | 32k (max 128k) | Multi-turn Conversation, Function-calling | 🤗 Link | 🤗 Link | Note: The default context length for Qwen-2.5-based models is 32k, but you can use techniques like YaRN (Yet Another Recursive Network) to achieve maximum 128k context length. Please refer to here for more details. You can also explore our previous xLAM series here. The `-fc` suffix indicates that the models are fine-tuned for function calling tasks, while the `-r` suffix signifies a research release. ✅ All models are fully compatible with vLLM and Transformers-based inference frameworks. For scenarios requiring more efficient inference or deployment on resource-constrained devices, we provide GGUF versions of our models, which are compatible with llama.cpp and similar frameworks. 1. Install llama.cpp framework from the source here 2. Run the inference task as shown below. For configuration of generation-related parameters, refer to llama.cpp documentation Performance comparison of different models on BFCL leaderboard. The rank is based on the overall accuracy, which is a weighted average of different evaluation categories. "FC" stands for function-calling mode in contrast to using a customized "prompt" to extract the function calls. Success Rate (pass@1) on τ-bench benchmark averaged across at least 5 trials. Our xLAM-2-70b-fc-r model achieves an overall success rate of 56.2% on τ-bench, significantly outperforming the base Llama 3.1 70B Instruct model (38.2%) and other open-source models like DeepSeek v3 (40.6%). Notably, our best model even outperforms proprietary models such as GPT-4o (52.9%) and approaches the performance of more recent models like Claude 3.5 Sonnet (new) (60.1%). Pass^k curves measuring the probability that all 5 independent trials succeed for a given task, averaged across all tasks for τ-retail (left) and τ-airline (right) domains. Higher values indicate better consistency of the models. This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people's lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP. For all Llama relevant models, please also follow corresponding Llama license and terms. Meta Llama 3 is licensed under the Meta Llama 3 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. If you use our model or dataset in your work, please cite our paper: Additionally, please check our other amazing works regarding xLAM series and consider citing them as well:
blip2-itm-vit-g-coco
blip2-flan-t5-xxl
moirai-moe-1.0-R-small
codet5-base-multi-sum
ctrl
moirai-1.0-R-large
Llama-xLAM-2-8b-fc-r
Large Action Models (LAMs) are advanced language models designed to enhance decision-making by translating user intentions into executable actions. As the brains of AI agents, LAMs autonomously plan and execute tasks to achieve specific goals, making them invaluable for automating workflows across diverse domains. This model release is for research purposes only. The new xLAM-2 series, built on our most advanced data synthesis, processing, and training pipelines, marks a significant leap in multi-turn conversation and tool usage. Trained using our novel APIGen-MT framework, which generates high-quality training data through simulated agent-human interactions. Our models achieve state-of-the-art performance on BFCL and τ-bench benchmarks, outperforming frontier models like GPT-4o and Claude 3.5. Notably, even our smaller models demonstrate superior capabilities in multi-turn scenarios while maintaining exceptional consistency across trials. We've also refined the chat template and vLLM integration, making it easier to build advanced AI agents. Compared to previous xLAM models, xLAM-2 offers superior performance and seamless deployment across applications. Comparative performance of larger xLAM-2-fc-r models (8B-70B, trained with APIGen-MT data) against state-of-the-art baselines on function-calling (BFCL v3, as of date 04/02/2025) and agentic (τ-bench) capabilities. Table of Contents - Usage - Basic Usage with Huggingface Chat Template - Using vLLM for Inference - Setup and Serving - Testing with OpenAI API - Benchmark Results - Citation xLAM series are significant better at many things including general tasks and function calling. For the same number of parameters, the model have been fine-tuned across a wide range of agent tasks and scenarios, all while preserving the capabilities of the original model. | Model | # Total Params | Context Length | Category | Download Model | Download GGUF files | |------------------------|----------------|------------|-------|----------------|----------| | Llama-xLAM-2-70b-fc-r | 70B | 128k | Multi-turn Conversation, Function-calling | 🤗 Link | NA | | Llama-xLAM-2-8b-fc-r | 8B | 128k | Multi-turn Conversation, Function-calling | 🤗 Link | 🤗 Link | | xLAM-2-32b-fc-r | 32B | 32k (max 128k) | Multi-turn Conversation, Function-calling | 🤗 Link | NA | | xLAM-2-3b-fc-r | 3B | 32k (max 128k) | Multi-turn Conversation, Function-calling | 🤗 Link | 🤗 Link | | xLAM-2-1b-fc-r | 1B | 32k (max 128k) | Multi-turn Conversation, Function-calling | 🤗 Link | 🤗 Link | Note: The default context length for Qwen-2.5-based models is 32k, but you can use techniques like YaRN (Yet Another Recursive Network) to achieve maximum 128k context length. Please refer to here for more details. You can also explore our previous xLAM series here. The `-fc` suffix indicates that the models are fine-tuned for function calling tasks, while the `-r` suffix signifies a research release. ✅ All models are fully compatible with vLLM and Transformers-based inference frameworks. - Transformers 4.46.1 (or later) - PyTorch 2.5.1+cu124 (or later) - Datasets 3.1.0 (or later) - Tokenizers 0.20.3 (or later) The new xLAM models are designed to work seamlessly with the Hugging Face Transformers library and utilize natural chat templates for an easy and intuitive conversational experience. Below are examples of how to use these models. The xLAM models can also be efficiently served using vLLM for high-throughput inference. Please use `vllm>=0.6.5` since earlier versions will cause degraded performance for Qwen-based models. 2. Download the tool parser plugin to your local path: Note: Ensure that the tool parser plugin file is downloaded and that the path specified in `--tool-parser-plugin` correctly points to your local copy of the file. The xLAM series models all utilize the same tool call parser, so you only need to download it once for all models. Here's a minimal example to test tool usage with the served endpoint: For more advanced configurations and deployment options, please refer to the vLLM documentation. Performance comparison of different models on BFCL leaderboard. The rank is based on the overall accuracy, which is a weighted average of different evaluation categories. "FC" stands for function-calling mode in contrast to using a customized "prompt" to extract the function calls. Success Rate (pass@1) on τ-bench benchmark averaged across at least 5 trials. Our xLAM-2-70b-fc-r model achieves an overall success rate of 56.2% on τ-bench, significantly outperforming the base Llama 3.1 70B Instruct model (38.2%) and other open-source models like DeepSeek v3 (40.6%). Notably, our best model even outperforms proprietary models such as GPT-4o (52.9%) and approaches the performance of more recent models like Claude 3.5 Sonnet (new) (60.1%). Pass^k curves measuring the probability that all 5 independent trials succeed for a given task, averaged across all tasks for τ-retail (left) and τ-airline (right) domains. Higher values indicate better consistency of the models. This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people's lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP. For all Llama relevant models, please also follow corresponding Llama license and terms. Meta Llama 3 is licensed under the Meta Llama 3 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. If you use our model or dataset in your work, please cite our paper: Additionally, please check our other awesome related works regarding xLAM series and consider citing them as well:
codegen-350M-multi
blip-itm-large-flickr
codet5p-770m
moirai-1.0-R-base
SFR-Embedding-Code-2B_R
GTA1-32B
Reinforcement learning (RL) (e.g., GRPO) helps with grounding because of its inherent objective alignment—rewarding successful clicks—rather than encouraging long textual Chain-of-Thought (CoT) reasoning. Unlike approaches that rely heavily on verbose CoT reasoning, GRPO directly incentivizes actionable and grounded responses. Based on findings from our blog, we share state-of-the-art GUI grounding models trained using GRPO. We follow the standard evaluation protocol and benchmark our model on three challenging datasets. Our method consistently achieves the best results among all open-source model families. Below are the comparative results: | Model | Size | Open Source | ScreenSpot-V2 | ScreenSpotPro | OSWORLD-G | OSWORLD-G-Refined | |-------------------|:--------:|:---------------:|:-----------------:|:-----------------:|:-----------------:|:-----------------:| | OpenAI CUA | — | ❌ | 87.9 | 23.4 | — | — | | Claude 3.7 | — | ❌ | 87.6 | 27.7 | — | — | | JEDI-7B | 7B | ✅ | 91.7 | 39.5 | 54.1 | — | | SE-GUI | 7B | ✅ | 90.3 | 47.0 | — | — | | UI-TARS | 7B | ✅ | 91.6 | 35.7 | 47.5 | — | | UI-TARS-1.5 | 7B | ✅ | 89.7 | 42.0 | 52.8 | 64.2 | | UGround-v1-7B | 7B | ✅ | — | 31.1 | — | 36.4 | | Qwen2.5-VL-32B-Instruct | 32B | ✅ | 91.9 | 48.0 | 46.5 | 59.6 | | UGround-v1-72B | 72B | ✅ | — | 34.5 | — | — | | Qwen2.5-VL-72B-Instruct | 72B | ✅ | 94.00 | 53.3 | — | 62.2 | | UI-TARS | 72B | ✅ | 90.3 | 38.1 | — | — | | OpenCUA | 7B | ✅ | 92.3 | 50.0 | 55.3 | 68.3 | | OpenCUA | 32B | ✅ | 93.4 | 55.3 | 59.6 | 70.2 | | GTA1-2507 (Ours) | 7B | ✅ | 92.4 (∆ +2.7) | 50.1 (∆ +8.1) | 55.1 (∆ +2.3) | 67.7 (∆ +3.5) | | GTA1 (Ours) | 7B | ✅ | 93.4 (∆ +0.1) | 55.5 (∆ +5.5) | 60.1 (∆ +4.8) | 68.8 (∆ +0.5) | | GTA1 (Ours) | 32B | ✅ | 95.2 (∆ +1.8) | 63.6 (∆ +8.3) | 65.2 (∆ +5.6) | 72.2 (∆ +2.0) | > Note: > - Model size is indicated in billions (B) of parameters. > - A dash (—) denotes results that are currently unavailable. > - A superscript asterisk (﹡) denotes our evaluated result. > - UI-TARS-1.5 7B, OpenCUA-7B, and OpenCUA-32B are applied as our baseline models. > - ∆ indicates the performance improvement (∆) of our model compared to its baseline. We evaluate our models on the OSWorld and OSWorld-Verified benchmarks following the standard evaluation protocol. The results demonstrate strong performance across both datasets. | Agent Model | Step | OSWorld | OSWorld-Verified | |-----------------|:--------:|:-----------:|:-------------------:| | Proprietary Models | | Claude 3.7 Sonnet | 100 | 28.0 | — | | OpenAI CUA 4o | 200 | 38.1 | — | | UI-TARS-1.5 | 100 | 42.5 | 41.8 | | OpenAI CUA o3 | 200 | 42.9 | — | | Open-Source Models | | Aria-UI w/ GPT-4o | 15 | 15.2 | — | | Aguvis-72B w/ GPT-4o | 15 | 17.0 | — | | UI-TARS-72B-SFT | 50 | 18.8 | — | | Agent S w/ Claude-3.5-Sonnet | 15 | 20.5 | — | | Agent S w/ GPT-4o | 15 | 20.6 | — | | UI-TARS-72B-DPO | 15 | 22.7 | — | | UI-TARS-72B-DPO | 50 | 24.6 | — | | UI-TARS-1.5-7B | 100 | 26.9 | 27.4 | | Jedi-7B w/ o3 | 100 | — | 51.0 | | Jedi-7B w/ GPT-4o | 100 | 27.0 | — | | Agent S2 w/ Claude-3.7-Sonnet | 50 | 34.5 | — | | Agent S2 w/ Gemini-2.5-Pro | 50 | 41.4 | 45.8 | | Agent S2.5 w/ o3 | 100 | — | 56.0 | | Agent S2.5 w/ GPT-5 | 100 | — | 58.4 | | CoAct-1 w/o3 & o4mini & OpenAI CUA 4o | 150 | — | 60.8 | | GTA1-7B-2507 w/ o3 | 100 | 45.2 | 53.1 | | GTA1-7B-2507 w/ GPT-5 | 100 | — | 61.0 | | GTA1-32B w/ o3 | 100 | — | 55.4 | | GTA1-32B w/ GPT-5 | 100 | — | 63.4 | We also evaluate our models on the WindowsAgentArena benchmark, demonstrating strong performance in Windows-specific GUI automation tasks. | Agent Model | Step | Success Rate | |-----------------|:--------:|:---------------:| | Kimi-VL | 15 | 10.4 | | WAA | — | 19.5 | | Jedi w/ GPT-4o | 100 | 33.7 | | GTA1-7B-2507 w/ o3 | 100 | 47.9 | | GTA1-7B-2507 w/ GPT-5 | 100 | 49.2 | | GTA1-32B w/ o3 | 100 | 51.2 | | GTA1-32B w/ GPT-5 | 100 | 50.6 | Inference Below is a code snippet demonstrating how to run inference using a trained model. This model is released for research and educational purposes. While our model demonstrates strong performance on GUI benchmarks, users should carefully evaluate its suitability for their specific use cases. Important Considerations: - Accuracy Limitations: Like all AI systems, this model may produce incorrect outputs or fail to accurately identify GUI elements in certain scenarios. - Safety and Security: Exercise caution when deploying GUI automation agents, especially in production environments where incorrect actions could affect system integrity or data security. - Human Oversight: We recommend maintaining appropriate human supervision when using this model for automated GUI interactions. - Compliance: Users are responsible for ensuring their use of this model complies with applicable laws, regulations, and organizational policies. Recommended Best Practices: - Thoroughly test the model in controlled environments before production deployment - Implement safeguards and error handling mechanisms - Consider the potential impact of automated actions on user systems and data - Regularly monitor and validate model performance in your specific domain For further guidance on use cases, refer to our AUP and AI AUP. If you're using any GTA model or find it helpful in your research, please cite it as follows:
codet5p-770m-py
codet5-large
xLAM-2-3b-fc-r
SweRankEmbed-Small
xLAM-2-3b-fc-r-gguf
safety-flan-t5-base
xgen-mm-phi3-mini-instruct-interleave-r-v1.5
codegen-16B-multi
blip2-flan-t5-xl-coco
xgen-7b-8k-inst
Official research release for the family of XGen models (`7B`) by Salesforce AI Research: Title: Long Sequence Modeling with XGen: A 7B LLM Trained on 8K Input Sequence Length Authors: Erik Nijkamp\, Tian Xie\, Hiroaki Hayashi\, Bo Pang\, Congying Xia\, Chen Xing, Jesse Vig, Semih Yavuz, Philippe Laban, Ben Krause, Senthil Purushwalkam, Tong Niu, Wojciech Kryscinski, Lidiya Murakhovs'ka, Prafulla Kumar Choubey, Alex Fabbri, Ye Liu, Rui Meng, Lifu Tu, Meghana Bhat, Chien-Sheng Wu, Silvio Savarese, Yingbo Zhou, Shafiq Rayhan Joty, Caiming Xiong. Correspondence to: Shafiq Rayhan Joty, Caiming Xiong Base models XGen-7B-4K-Base: XGen-7B model pre-trained under 4K sequence length. License: Apache-2.0 XGen-7B-8K-Base: XGen-7B model pre-trained under 8K sequence length. License: Apache-2.0 Supervised finetuned model on public domain instructional data. Released for research purpose only. The training data for the models are tokenized with OpenAI Tiktoken library. To use this model, install the package via `pip`: The models can be used as auto-regressive samplers as follows: This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
codegen-6B-multi
codegen-2B-multi
codegen-2B-mono
xgen-7b-8k-base
Official research release for the family of XGen models (`7B`) by Salesforce AI Research: Title: Long Sequence Modeling with XGen: A 7B LLM Trained on 8K Input Sequence Length Authors: Erik Nijkamp\, Tian Xie\, Hiroaki Hayashi\, Bo Pang\, Congying Xia\, Chen Xing, Jesse Vig, Semih Yavuz, Philippe Laban, Ben Krause, Senthil Purushwalkam, Tong Niu, Wojciech Kryscinski, Lidiya Murakhovs'ka, Prafulla Kumar Choubey, Alex Fabbri, Ye Liu, Rui Meng, Lifu Tu, Meghana Bhat, Chien-Sheng Wu, Silvio Savarese, Yingbo Zhou, Shafiq Rayhan Joty, Caiming Xiong. Correspondence to: Shafiq Rayhan Joty, Caiming Xiong Base models XGen-7B-4K-Base: XGen-7B model pre-trained under 4K sequence length. License: Apache-2.0 XGen-7B-8K-Base: XGen-7B model pre-trained under 8K sequence length. License: Apache-2.0 Supervised finetuned model on public domain instructional data. Released for research purpose only. The training data for the models are tokenized with OpenAI Tiktoken library. To use this model, install the package via `pip`: The models can be used as auto-regressive samplers as follows: This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
CoDA-v0-Instruct
Try CoDA · Paper · Model Collection · GitHub Repository Welcome to CoDA, Salesforce AI Research's diffusion-based language model designed for powerful code generation and bidirectional context unde...
XLAM 2 1b Fc R
Large Action Models (LAMs) are advanced language models designed to enhance decision-making by translating user intentions into executable actions. As the brains of AI agents, LAMs autonomously plan and execute tasks to achieve specific goals, making them invaluable for automating workflows across diverse domains. This model release is for research purposes only. The new xLAM-2 series, built on our most advanced data synthesis, processing, and training pipelines, marks a significant leap in multi-turn conversation and tool usage. Trained using our novel APIGen-MT framework, which generates high-quality training data through simulated agent-human interactions. Our models achieve state-of-the-art performance on BFCL and τ-bench benchmarks, outperforming frontier models like GPT-4o and Claude 3.5. Notably, even our smaller models demonstrate superior capabilities in multi-turn scenarios while maintaining exceptional consistency across trials. We've also refined the chat template and vLLM integration, making it easier to build advanced AI agents. Compared to previous xLAM models, xLAM-2 offers superior performance and seamless deployment across applications. Comparative performance of larger xLAM-2-fc-r models (8B-70B, trained with APIGen-MT data) against state-of-the-art baselines on function-calling (BFCL v3, as of date 04/02/2025) and agentic (τ-bench) capabilities. Table of Contents - Usage - Basic Usage with Huggingface Chat Template - Using vLLM for Inference - Setup and Serving - Testing with OpenAI API - Benchmark Results - Citation xLAM series are significant better at many things including general tasks and function calling. For the same number of parameters, the model have been fine-tuned across a wide range of agent tasks and scenarios, all while preserving the capabilities of the original model. | Model | # Total Params | Context Length | Category | Download Model | Download GGUF files | |------------------------|----------------|------------|-------|----------------|----------| | Llama-xLAM-2-70b-fc-r | 70B | 128k | Multi-turn Conversation, Function-calling | 🤗 Link | NA | | Llama-xLAM-2-8b-fc-r | 8B | 128k | Multi-turn Conversation, Function-calling | 🤗 Link | 🤗 Link | | xLAM-2-32b-fc-r | 32B | 32k (max 128k) | Multi-turn Conversation, Function-calling | 🤗 Link | NA | | xLAM-2-3b-fc-r | 3B | 32k (max 128k) | Multi-turn Conversation, Function-calling | 🤗 Link | 🤗 Link | | xLAM-2-1b-fc-r | 1B | 32k (max 128k) | Multi-turn Conversation, Function-calling | 🤗 Link | 🤗 Link | Note: The default context length for Qwen-2.5-based models is 32k, but you can use techniques like YaRN (Yet Another Recursive Network) to achieve maximum 128k context length. Please refer to here for more details. You can also explore our previous xLAM series here. The `-fc` suffix indicates that the models are fine-tuned for function calling tasks, while the `-r` suffix signifies a research release. ✅ All models are fully compatible with vLLM and Transformers-based inference frameworks. - Transformers 4.46.1 (or later) - PyTorch 2.5.1+cu124 (or later) - Datasets 3.1.0 (or later) - Tokenizers 0.20.3 (or later) The new xLAM models are designed to work seamlessly with the Hugging Face Transformers library and utilize natural chat templates for an easy and intuitive conversational experience. Below are examples of how to use these models. The xLAM models can also be efficiently served using vLLM for high-throughput inference. Please use `vllm>=0.6.5` since earlier versions will cause degraded performance for Qwen-based models. 2. Download the tool parser plugin to your local path: Note: Ensure that the tool parser plugin file is downloaded and that the path specified in `--tool-parser-plugin` correctly points to your local copy of the file. The xLAM series models all utilize the same tool call parser, so you only need to download it once for all models. Here's a minimal example to test tool usage with the served endpoint: For more advanced configurations and deployment options, please refer to the vLLM documentation. Performance comparison of different models on BFCL leaderboard. The rank is based on the overall accuracy, which is a weighted average of different evaluation categories. "FC" stands for function-calling mode in contrast to using a customized "prompt" to extract the function calls. Success Rate (pass@1) on τ-bench benchmark averaged across at least 5 trials. Our xLAM-2-70b-fc-r model achieves an overall success rate of 56.2% on τ-bench, significantly outperforming the base Llama 3.1 70B Instruct model (38.2%) and other open-source models like DeepSeek v3 (40.6%). Notably, our best model even outperforms proprietary models such as GPT-4o (52.9%) and approaches the performance of more recent models like Claude 3.5 Sonnet (new) (60.1%). Pass^k curves measuring the probability that all 5 independent trials succeed for a given task, averaged across all tasks for τ-retail (left) and τ-airline (right) domains. Higher values indicate better consistency of the models. This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people's lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP. For all Llama relevant models, please also follow corresponding Llama license and terms. Meta Llama 3 is licensed under the Meta Llama 3 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. If you use our model or dataset in your work, please cite our paper: Additionally, please check our other awesome related works regarding xLAM series and consider citing them as well:
codegen-6B-nl
codegen-16B-nl
codegen-350M-nl
xgen-mm-phi3-mini-instruct-r-v1
instructblip-vicuna-13b
instructblip-flan-t5-xxl
codet5p-220m-py
xgen-small-4B-instruct-r
GTA1-7B
Reinforcement learning (RL) (e.g., GRPO) helps with grounding because of its inherent objective alignment—rewarding successful clicks—rather than encouraging long textual Chain-of-Thought (CoT) reasoning. Unlike approaches that rely heavily on verbose CoT reasoning, GRPO directly incentivizes actionable and grounded responses. Based on findings from our blog, we share state-of-the-art GUI grounding models trained using GRPO. We follow the standard evaluation protocol and benchmark our model on three challenging datasets. Our method consistently achieves the best results among all open-source model families. Below are the comparative results: | Model | Size | Open Source | ScreenSpot-V2 | ScreenSpotPro | OSWORLD-G | OSWORLD-G-Refined | |-------------------|:--------:|:---------------:|:-----------------:|:-----------------:|:-----------------:|:-----------------:| | OpenAI CUA | — | ❌ | 87.9 | 23.4 | — | — | | Claude 3.7 | — | ❌ | 87.6 | 27.7 | — | — | | JEDI-7B | 7B | ✅ | 91.7 | 39.5 | 54.1 | — | | SE-GUI | 7B | ✅ | 90.3 | 47.0 | — | — | | UI-TARS | 7B | ✅ | 91.6 | 35.7 | 47.5 | — | | UI-TARS-1.5 | 7B | ✅ | 89.7 | 42.0 | 52.8 | 64.2 | | UGround-v1-7B | 7B | ✅ | — | 31.1 | — | 36.4 | | Qwen2.5-VL-32B-Instruct | 32B | ✅ | 91.9 | 48.0 | 46.5 | 59.6 | | UGround-v1-72B | 72B | ✅ | — | 34.5 | — | — | | Qwen2.5-VL-72B-Instruct | 72B | ✅ | 94.00 | 53.3 | — | 62.2 | | UI-TARS | 72B | ✅ | 90.3 | 38.1 | — | — | | OpenCUA | 7B | ✅ | 92.3 | 50.0 | 55.3 | 68.3 | | OpenCUA | 32B | ✅ | 93.4 | 55.3 | 59.6 | 70.2 | | GTA1-2507 (Ours) | 7B | ✅ | 92.4 (∆ +2.7) | 50.1 (∆ +8.1) | 55.1 (∆ +2.3) | 67.7 (∆ +3.5) | | GTA1 (Ours) | 7B | ✅ | 93.4 (∆ +0.1) | 55.5 (∆ +5.5) | 60.1 (∆ +4.8) | 68.8 (∆ +0.5) | | GTA1 (Ours) | 32B | ✅ | 95.2 (∆ +1.8) | 63.6 (∆ +8.3) | 65.2 (∆ +5.6) | 72.2 (∆ +2.0) | > Note: > - Model size is indicated in billions (B) of parameters. > - A dash (—) denotes results that are currently unavailable. > - A superscript asterisk (﹡) denotes our evaluated result. > - UI-TARS-1.5 7B, OpenCUA-7B, and OpenCUA-32B are applied as our baseline models. > - ∆ indicates the performance improvement (∆) of our model compared to its baseline. We evaluate our models on the OSWorld and OSWorld-Verified benchmarks following the standard evaluation protocol. The results demonstrate strong performance across both datasets. | Agent Model | Step | OSWorld | OSWorld-Verified | |-----------------|:--------:|:-----------:|:-------------------:| | Proprietary Models | | Claude 3.7 Sonnet | 100 | 28.0 | — | | OpenAI CUA 4o | 200 | 38.1 | — | | UI-TARS-1.5 | 100 | 42.5 | 41.8 | | OpenAI CUA o3 | 200 | 42.9 | — | | Open-Source Models | | Aria-UI w/ GPT-4o | 15 | 15.2 | — | | Aguvis-72B w/ GPT-4o | 15 | 17.0 | — | | UI-TARS-72B-SFT | 50 | 18.8 | — | | Agent S w/ Claude-3.5-Sonnet | 15 | 20.5 | — | | Agent S w/ GPT-4o | 15 | 20.6 | — | | UI-TARS-72B-DPO | 15 | 22.7 | — | | UI-TARS-72B-DPO | 50 | 24.6 | — | | UI-TARS-1.5-7B | 100 | 26.9 | 27.4 | | Jedi-7B w/ o3 | 100 | — | 51.0 | | Jedi-7B w/ GPT-4o | 100 | 27.0 | — | | Agent S2 w/ Claude-3.7-Sonnet | 50 | 34.5 | — | | Agent S2 w/ Gemini-2.5-Pro | 50 | 41.4 | 45.8 | | Agent S2.5 w/ o3 | 100 | — | 56.0 | | Agent S2.5 w/ GPT-5 | 100 | — | 58.4 | | CoAct-1 w/o3 & o4mini & OpenAI CUA 4o | 150 | — | 60.8 | | GTA1-7B-2507 w/ o3 | 100 | 45.2 | 53.1 | | GTA1-7B-2507 w/ GPT-5 | 100 | — | 61.0 | | GTA1-32B w/ o3 | 100 | — | 55.4 | | GTA1-32B w/ GPT-5 | 100 | — | 63.4 | We also evaluate our models on the WindowsAgentArena benchmark, demonstrating strong performance in Windows-specific GUI automation tasks. | Agent Model | Step | Success Rate | |-----------------|:--------:|:---------------:| | Kimi-VL | 15 | 10.4 | | WAA | — | 19.5 | | Jedi w/ GPT-4o | 100 | 33.7 | | GTA1-7B-2507 w/ o3 | 100 | 47.9 | | GTA1-7B-2507 w/ GPT-5 | 100 | 49.2 | | GTA1-32B w/ o3 | 100 | 51.2 | | GTA1-32B w/ GPT-5 | 100 | 50.6 | Inference Below is a code snippet demonstrating how to run inference using a trained model. This model is released for research and educational purposes. While our model demonstrates strong performance on GUI benchmarks, users should carefully evaluate its suitability for their specific use cases. Important Considerations: - Accuracy Limitations: Like all AI systems, this model may produce incorrect outputs or fail to accurately identify GUI elements in certain scenarios. - Safety and Security: Exercise caution when deploying GUI automation agents, especially in production environments where incorrect actions could affect system integrity or data security. - Human Oversight: We recommend maintaining appropriate human supervision when using this model for automated GUI interactions. - Compliance: Users are responsible for ensuring their use of this model complies with applicable laws, regulations, and organizational policies. Recommended Best Practices: - Thoroughly test the model in controlled environments before production deployment - Implement safeguards and error handling mechanisms - Consider the potential impact of automated actions on user systems and data - Regularly monitor and validate model performance in your specific domain For further guidance on use cases, refer to our AUP and AI AUP. If you're using any GTA model or find it helpful in your research, please cite it as follows:
xLAM-2-1b-fc-r-gguf
codegen-6B-mono
CoDA-v0-Base
Try CoDA · Paper · Model Collection · GitHub Repository Welcome to CoDA, Salesforce AI Research's diffusion-based language model designed for powerful code generation and bidirectional context understanding. We're releasing CoDA as a lightweight yet capable model: - `CoDA-1.7B-Base` — diffusion foundation model with bidirectional diffusion architecture, ideal for further fine-tuning and RL training - `CoDA-1.7B-Instruct` — optimized for code generation tasks with bidirectional diffusion modeling (1.7B parameters) CoDA leverages discrete diffusion processes to enable understanding of both past and future tokens, making it uniquely suited for code completion and generation tasks where context flows in both directions. > [!NOTE] > This model card is dedicated to the `CoDA-1.7B-Base` model. Check out our model collection for other variants. Bidirectional Context Understanding: Leverage discrete diffusion processes to understand both past and future tokens, enabling superior code completion. Confidence-Guided Sampling: Maintain competitive inference latency through intelligent sampling strategies that balance quality and speed. Lightweight Architecture: Achieve strong performance with only 1.7B parameters, making it accessible for researchers with limited computational resources. Full Training Pipeline: Complete reproducible training pipeline from pre-training to fine-tuning, enabling customization for specific domains. Optimized for Code: Specifically designed and trained for code generation tasks, with strong performance on HumanEval, MBPP, and other coding benchmarks. - Model Size: 1.7B parameters - Architecture: Diffusion-based language model - Training: TPU-based pre-training with GPU fine-tuning - Primary Use: Code generation and completion tasks - Bidirectional Context: Diffusion modeling enables understanding of both past and future tokens - Confidence-Guided Sampling: Maintains competitive inference latency through intelligent sampling - Lightweight Design: Achieves strong performance with fewer parameters than comparable models - Open Training Pipeline: Fully reproducible training from pre-training to fine-tuning CoDA-1.7B-Instruct demonstrates competitive performance on standard code generation benchmarks: | Model | HumanEval | HumanEval+ | MBPP | MBPP+ | EvalPlus | |-------|-----------|------------|------|-------|----------| | CoDA-Base | 29.3 | 23.8 | 35.2 | 46.0 | 34.9 | | CoDA-Instruct | 54.3 | 47.6 | 47.2 | 63.2 | 55.4 | | Dream-Base | 56.7 | 50.0 | 68.7 | 57.4 | 53.7 | | Dream-7B-Instruct | 57.9 | 53.7 | 68.3 | 56.1 | 54.9 | | LLaDA-8B-Instruct | 35.4 | 31.7 | 31.5 | 28.6 | 30.2 | 🎯 Key Finding: CoDA-1.7B-Instruct matches or surpasses diffusion models up to 7B parameters while maintaining significantly lower computational requirements. CoDA offers an advantageous balance between inference speed and accuracy compared to larger diffusion models. Three-stage training: (1) Pre-training with bidirectional masking, (2) Post-training with instruction format, (3) Inference with progressive denoising. For production deployment, we provide serving with OpenAI-compatible APIs: Customize generation behavior with environment variables: Recommended Settings: - Fast inference: `STEPS=64`, `TEMPERATURE=0.0` - Quality generation: `STEPS=128`, `TEMPERATURE=0.7`, `TOPP=0.9` - High quality: `STEPS=256`, `TEMPERATURE=0.5`, `TOPP=0.95` The complete training pipeline is available in our repository: Technical report coming soon. For now, please cite: - 📄 Technical Report: technicalreport.pdf - 💻 Code Repository: github.com/SalesforceAIResearch/CoDA - 🤗 Model Hub: Salesforce CoDA collection We thank Lingpeng Kong for insightful discussions and Jialei Chen for technical support with TPU infrastructure.
codet5p-6b
xgen-mm-vid-phi3-mini-r-v1.5-32tokens-8frames
codet5p-2b
GTA1-7B-2507
Reinforcement learning (RL) (e.g., GRPO) helps with grounding because of its inherent objective alignment—rewarding successful clicks—rather than encouraging long textual Chain-of-Thought (CoT) reasoning. Unlike approaches that rely heavily on verbose CoT reasoning, GRPO directly incentivizes actionable and grounded responses. Based on findings from our blog, we share state-of-the-art GUI grounding models trained using GRPO. We follow the standard evaluation protocol and benchmark our model on three challenging datasets. Our method consistently achieves the best results among all open-source model families. Below are the comparative results: | Model | Size | Open Source | ScreenSpot-V2 | ScreenSpotPro | OSWORLD-G | OSWORLD-G-Refined | |-------------------|:--------:|:---------------:|:-----------------:|:-----------------:|:-----------------:|:-----------------:| | OpenAI CUA | — | ❌ | 87.9 | 23.4 | — | — | | Claude 3.7 | — | ❌ | 87.6 | 27.7 | — | — | | JEDI-7B | 7B | ✅ | 91.7 | 39.5 | 54.1 | — | | SE-GUI | 7B | ✅ | 90.3 | 47.0 | — | — | | UI-TARS | 7B | ✅ | 91.6 | 35.7 | 47.5 | — | | UI-TARS-1.5 | 7B | ✅ | 89.7 | 42.0 | 52.8 | 64.2 | | UGround-v1-7B | 7B | ✅ | — | 31.1 | — | 36.4 | | Qwen2.5-VL-32B-Instruct | 32B | ✅ | 91.9 | 48.0 | 46.5 | 59.6 | | UGround-v1-72B | 72B | ✅ | — | 34.5 | — | — | | Qwen2.5-VL-72B-Instruct | 72B | ✅ | 94.00 | 53.3 | — | 62.2 | | UI-TARS | 72B | ✅ | 90.3 | 38.1 | — | — | | OpenCUA | 7B | ✅ | 92.3 | 50.0 | 55.3 | 68.3 | | OpenCUA | 32B | ✅ | 93.4 | 55.3 | 59.6 | 70.2 | | GTA1-2507 (Ours) | 7B | ✅ | 92.4 (∆ +2.7) | 50.1 (∆ +8.1) | 55.1 (∆ +2.3) | 67.7 (∆ +3.5) | | GTA1 (Ours) | 7B | ✅ | 93.4 (∆ +0.1) | 55.5 (∆ +5.5) | 60.1 (∆ +4.8) | 68.8 (∆ +0.5) | | GTA1 (Ours) | 32B | ✅ | 95.2 (∆ +1.8) | 63.6 (∆ +8.3) | 65.2 (∆ +5.6) | 72.2 (∆ +2.0) | > Note: > - Model size is indicated in billions (B) of parameters. > - A dash (—) denotes results that are currently unavailable. > - A superscript asterisk (﹡) denotes our evaluated result. > - UI-TARS-1.5 7B, OpenCUA-7B, and OpenCUA-32B are applied as our baseline models. > - ∆ indicates the performance improvement (∆) of our model compared to its baseline. We evaluate our models on the OSWorld and OSWorld-Verified benchmarks following the standard evaluation protocol. The results demonstrate strong performance across both datasets. | Agent Model | Step | OSWorld | OSWorld-Verified | |-----------------|:--------:|:-----------:|:-------------------:| | Proprietary Models | | Claude 3.7 Sonnet | 100 | 28.0 | — | | OpenAI CUA 4o | 200 | 38.1 | — | | UI-TARS-1.5 | 100 | 42.5 | 41.8 | | OpenAI CUA o3 | 200 | 42.9 | — | | Open-Source Models | | Aria-UI w/ GPT-4o | 15 | 15.2 | — | | Aguvis-72B w/ GPT-4o | 15 | 17.0 | — | | UI-TARS-72B-SFT | 50 | 18.8 | — | | Agent S w/ Claude-3.5-Sonnet | 15 | 20.5 | — | | Agent S w/ GPT-4o | 15 | 20.6 | — | | UI-TARS-72B-DPO | 15 | 22.7 | — | | UI-TARS-72B-DPO | 50 | 24.6 | — | | UI-TARS-1.5-7B | 100 | 26.9 | 27.4 | | Jedi-7B w/ o3 | 100 | — | 51.0 | | Jedi-7B w/ GPT-4o | 100 | 27.0 | — | | Agent S2 w/ Claude-3.7-Sonnet | 50 | 34.5 | — | | Agent S2 w/ Gemini-2.5-Pro | 50 | 41.4 | 45.8 | | Agent S2.5 w/ o3 | 100 | — | 56.0 | | Agent S2.5 w/ GPT-5 | 100 | — | 58.4 | | CoAct-1 w/o3 & o4mini & OpenAI CUA 4o | 150 | — | 60.8 | | GTA1-7B-2507 w/ o3 | 100 | 45.2 | 53.1 | | GTA1-7B-2507 w/ GPT-5 | 100 | — | 61.0 | | GTA1-32B w/ o3 | 100 | — | 55.4 | | GTA1-32B w/ GPT-5 | 100 | — | 63.4 | We also evaluate our models on the WindowsAgentArena benchmark, demonstrating strong performance in Windows-specific GUI automation tasks. | Agent Model | Step | Success Rate | |-----------------|:--------:|:---------------:| | Kimi-VL | 15 | 10.4 | | WAA | — | 19.5 | | Jedi w/ GPT-4o | 100 | 33.7 | | GTA1-7B-2507 w/ o3 | 100 | 47.9 | | GTA1-7B-2507 w/ GPT-5 | 100 | 49.2 | | GTA1-32B w/ o3 | 100 | 51.2 | | GTA1-32B w/ GPT-5 | 100 | 50.6 | Inference Below is a code snippet demonstrating how to run inference using a trained model. This model is released for research and educational purposes. While our model demonstrates strong performance on GUI benchmarks, users should carefully evaluate its suitability for their specific use cases. Important Considerations: - Accuracy Limitations: Like all AI systems, this model may produce incorrect outputs or fail to accurately identify GUI elements in certain scenarios. - Safety and Security: Exercise caution when deploying GUI automation agents, especially in production environments where incorrect actions could affect system integrity or data security. - Human Oversight: We recommend maintaining appropriate human supervision when using this model for automated GUI interactions. - Compliance: Users are responsible for ensuring their use of this model complies with applicable laws, regulations, and organizational policies. Recommended Best Practices: - Thoroughly test the model in controlled environments before production deployment - Implement safeguards and error handling mechanisms - Consider the potential impact of automated actions on user systems and data - Regularly monitor and validate model performance in your specific domain For further guidance on use cases, refer to our AUP and AI AUP. If you're using any GTA model or find it helpful in your research, please cite it as follows:
Llama-xLAM-2-70b-fc-r
codegen2-1B_P
xLAM-7b-r
mixqg-3b
blip-itm-base-flickr
xgen-mm-vid-phi3-mini-r-v1.5-128tokens-8frames
codet5-base-codexglue-sum-java
codet5p-16b
xgen-small-9B-instruct-r
XLAM 1b Fc R
[Homepage] | [APIGen Paper] | [ActionStudio Paper] | [Discord] | [Dataset] | [Github] Welcome to the xLAM model family! Large Action Models (LAMs) are advanced large language models designed to enhance decision-making and translate user intentions into executable actions that interact with the world. LAMs autonomously plan and execute tasks to achieve specific goals, serving as the brains of AI agents. They have the potential to automate workflow processes across various domains, making them invaluable for a wide range of applications. Table of Contents - Model Series - Repository Overview - Benchmark Results - Usage - Basic Usage with Huggingface - Usage with vLLM - License - Citation We provide a series of xLAMs in different sizes to cater to various applications, including those optimized for function-calling and general agent applications: | Model | # Total Params | Context Length |Release Date | Category | Download Model | Download GGUF files | |------------------------|----------------|----------------|----|----|----------------|----------| | xLAM-7b-r | 7.24B | 32k | Sep. 5, 2024|General, Function-calling | 🤗 Link | -- | | xLAM-8x7b-r | 46.7B | 32k | Sep. 5, 2024|General, Function-calling | 🤗 Link | -- | | xLAM-8x22b-r | 141B | 64k | Sep. 5, 2024|General, Function-calling | 🤗 Link | -- | | xLAM-1b-fc-r | 1.35B | 16k | July 17, 2024 | Function-calling| 🤗 Link | 🤗 Link | | xLAM-7b-fc-r | 6.91B | 4k | July 17, 2024| Function-calling| 🤗 Link | 🤗 Link | | xLAM-v0.1-r | 46.7B | 32k | Mar. 18, 2024 |General, Function-calling | 🤗 Link | -- | The `fc` series of models are optimized for function-calling capability, providing fast, accurate, and structured responses based on input queries and available APIs. These models are fine-tuned based on the deepseek-coder models and are designed to be small enough for deployment on personal devices like phones or computers. We also provide their quantized GGUF files for efficient deployment and execution. GGUF is a file format designed to efficiently store and load large language models, making GGUF ideal for running AI models on local devices with limited resources, enabling offline functionality and enhanced privacy. This repository is focused on our tiny `xLAM-1b-fc-r` model, which is optimized for function-calling and can be easily deployed on personal devices. Function-calling, or tool use, is one of the key capabilities for AI agents. It requires the model not only understand and generate human-like text but also to execute functional API calls based on natural language instructions. This extends the utility of LLMs beyond simple conversation tasks to dynamic interactions with a variety of digital services and applications, such as retrieving weather information, managing social media platforms, and handling financial services. The instructions will guide you through the setup, usage, and integration of `xLAM-1b-fc-r` with HuggingFace and vLLM. We will first introduce the basic usage, and then walk through the provided tutorial and example scripts in the examples folder. - Transformers 4.41.0 - Pytorch 2.3.0+cu121 - Datasets 2.19.1 - Tokenizers 0.19.1 We mainly test our function-calling models on the Berkeley Function-Calling Leaderboard (BFCL), which offers a comprehensive evaluation framework for assessing LLMs' function-calling capabilities across various programming languages and application domains like Java, JavaScript, and Python. Performance comparison on the BFCL benchmark as of date 07/18/2024. Evaluated with temperature=0.001 and topp=1 Our xLAM-7b-fc-r secures the 3rd place with an overall accuracy of 88.24% on the leaderboard, outperforming many strong models. Notably, our xLAM-1b-fc-r model is the only tiny model with less than 2B parameters on the leaderboard, but still achieves a competitive overall accuracy of 78.94% and outperforming GPT3-Turbo and many larger models. Both models exhibit balanced performance across various categories, showing their strong function-calling capabilities despite their small sizes. See our paper and Github repo for more detailed analysis. To use the `xLAM-1b-fc-r` model from Huggingface, please first install the `transformers` library: We use the following example to illustrate how to use our model to perform function-calling tasks. Please note that, our model works best with our provided prompt format. It allows us to extract JSON output that is similar to the function-calling mode of ChatGPT. { "toolcalls": [ {"name": "funcname1", "arguments": {"argument1": "value1", "argument2": "value2"}}, ... (more tool calls as required) ] } ` Then you should be able to see the following output string in JSON format: We highly recommend to use our provided prompt format and helper functions to yield the best function-calling performance of our model. We provide example scripts to deploy our model with `vllm` and run inferences. First, install the required packages: The example scripts are located in the examples folder. To build prompts using the chat template and output formatted prompts ready for various test cases, run: Options: - `--temperature`: Default 0.3 - `--topp`: Default 1.0 - `--maxtokens`: Default 512 This test script provides a handler implementation that can be easily applied to your customized function-calling applications. To test the xLAM model directly with the vLLM library, run: Options are the same as for the endpoint test. This test script also provides a handler implementation that can be easily applied to your customized function-calling applications. These examples are designed to be flexible and easily integrated into your own projects. Feel free to modify the scripts to suit your specific needs and applications. You can adjust test queries or API definitions in each script to test different scenarios or model capabilities. Additional customization tips: - Modify the `--dtype` parameter when serving the model based on your GPU capacity. - Refer to the vLLM documentation for more detailed configuration options. - Explore the `demo.ipynb` file for a comprehensive description of the entire workflow, including how to execute APIs. These resources provide a robust foundation for integrating xLAM models into your applications, allowing for tailored and efficient deployment. `xLAM-1b-fc-r` is distributed under the CC-BY-NC-4.0 license, with additional terms specified in the Deepseek license. This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP. If you find this repo helpful, please cite our paper:
codegen25-7b-multi_P
codegen-16B-mono
xgen-mm-phi3-mini-instruct-dpo-r-v1.5
xLAM-7b-fc-r-gguf
xLAM-2-32b-fc-r
xgen-small-4B-base-r
codet5p-220m-bimodal
xLAM-1b-fc-r-gguf
codegen2-7B_P
dialogstudio-t5-base-v1.0
Llama Fin 8b
💰 Demystifying Domain-adaptive Post-training for Financial LLMs This is the finance-specific large language model trained using the recipe described in our paper: 📄 Demystifying Domain-adaptive Post-training for Financial LLMs For more details, please check the following resources: - 🌐 Project Page: https://vincent950129.github.io/adapt-llm/ - 📚 Training Data: https://huggingface.co/datasets/Salesforce/FinTrain - 🧠 Evaluation Data: https://huggingface.co/datasets/Salesforce/FinEval - 💻 Code Repository: https://github.com/SalesforceAIResearch/FinDAP Ethical Considerations Users need to make their own assessment regarding any obligations or responsibilities under the corresponding licenses or terms and conditions pertaining to the original datasets and data. This release is for research purposes only in support of an academic paper. If you find our project helpful, please consider citing our paper 😊
xgen-mm-phi3-mini-base-r-v1.5
cogalign-internvl2_5-mpo-1b
mixqg-base
FARE-20B
Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains Paper: arXiv link Authors: Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, Shafiq Joty FARE-20B is a multi-task evaluator model finetuned from gpt-oss-20B. It is trained on a large-scale multi-task, multi-domain data mixture using rejection-sampling SFT to perform the following evaluation tasks: Pairwise comparisons, step-level evaluation, reference-based verification, reference-free verification, and single-rating assessment. Usage > [!IMPORTANT] > The FARE family of evaluators has been trained with specific system and user prompt templates. We provide examples below for two evaluation tasks: Pairwise comparisons and step-level error identification evaluation. For other tasks, we provide prompt templates in our paper (Appendix E). Example inference with SGLang For FARE-20B (gpt-oss variant), our evaluations were conducted with SGLang. We provide a minimal working example below with pairwise evaluation. For example usage with vLLM, see FARE-8B Ethics disclaimer for Salesforce AI models, data, code This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our standard AUP and AI AUP.
codet5-large-ntp-py
FARE 8B
Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains Paper: arXiv link Authors: Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, Shafiq Joty FARE-8B is a multi-task evaluator model finetuned from Qwen-8B. It is trained on a large-scale multi-task, multi-domain data mixture using rejection-sampling SFT to perform the following evaluation tasks: Pairwise comparisons, step-level evaluation, reference-based verification, reference-free verification, and single-rating assessment. Usage > [!IMPORTANT] > The FARE family of evaluators has been trained with specific system and user prompt templates. We provide examples below for two evaluation tasks: Pairwise comparisons and step-level error identification evaluation. For other tasks, we provide prompt templates in our paper (Appendix E). Example inference with vLLM For FARE-8B (Qwen-3 variant), our evaluations were conducted with vLLM. We provide a minimal working example below with pairwise evaluation. For example usage with SGLang, see FARE-20B Ethics disclaimer for Salesforce AI models, data, code This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our standard AUP and AI AUP.
SweRankLLM-Small
codegen2-3_7B_P
instructcodet5p-16b
xgen-mm-phi3-mini-instruct-singleimg-r-v1.5
xLAM-v0.1-r
xgen-mm-phi3-mini-base-r-v1
LLaMA-3-8B-SFR-Iterative-DPO-R
We release a state-of-the-art instruct model of its class, Llama-3-8B-SFR-Iterative-DPO-R. On all three widely-used instruct model benchmarks: Alpaca-Eval-V2, MT-Bench, Chat-Arena-Hard, our model outperforms all models of similar size, e.g., LLaMA-3-8B-it, and most large open-sourced models, e.g., Mixtral-8x7B-it.
codegen-2B-nl
xLAM-7b-fc-r
grappa_large_jnt
xLAM-8x7b-r
codegen25-7b-instruct_P
WQRM-PRE
dialogstudio-t5-large-v1.0
qa_consolidation
E1-Math-1.5B
cogalign-internvl2_5-mpo-4b
xgen-7b-4k-base
SweRankEmbed-Large
codegen2-16B_P
WQRM
BLIP3o-NEXT-GRPO-TexT-3B
This is BLIP3o-NEXT-GRPO-TexT checkpoint trained on the BLIP3o-NEXT-SFT. Clone the repo (if you haven’t already) and install the environment: ``` git clone https://github.com/JiuhaiChen/BLIP3o.git
codet5-base-codexglue-sum-python
bart-large-xsum-samsum
codet5-base-codexglue-defect
LLaMA-3-8B-SFR-SFT-R
codet5-base-codexglue-translate-java-cs
codegen25-7b-mono_P
codet5-base-codexglue-translate-cs-java
LLaMA-3-8B-SFR-RM-R
discord_qg
codet5-base-codexglue-refine-medium
BLIP3o-NEXT-edit-VAE
This is BLIP3o-NEXT-edit-VAE checkpoint trained on the BLIP3o-NEXT-SFT and use VAE as condition. Clone the repo (if you haven’t already) and install the environment: and switch to BLIP3o-NEXT-edit branch to do the inference.
qaconv-roberta-large-squad2
SweRankLLM-Large
qaconv-unifiedqa-t5-large
codet5-base-codexglue-concode
xLAM-8x22b-r
llama3-siglip-mantis-taco-8b
codet5-base-codexglue-sum-php
E1-Code-14B
dialogstudio-t5-3b-v1.0
codet5-base-codexglue-refine-small
qwen2-siglip-llava-ov-taco-7b
cogalign-llava-ov-0_5b
codet5-base-codexglue-sum-go
bic_simple_edit_id
squality-socratic-books-30M
socratic-books-30M
qaconv-unifiedqa-t5-3b
codet5-base-codexglue-sum-ruby
E1-AceReason-14B
mixqg-large
LLaMA-3-8B-SFR-Iterative-DPO-Concise-R
llama3-clip-pretrained-mantis-taco-8b
codet5-base-codexglue-sum-javascript
bart-large-swipe-clean
xgen-small-r
Elastic-Reasoning
discord_qa
BLIP
E1-Math-7B
xgen-small-9B-base-r
BLIP3o-NEXT-Pretrain-3B
BLIP3o-NEXT-GRPO-Geneval-3B
This is BLIP3o-NEXT-GRPO-Geneval checkpoint trained on the BLIP3o-NEXT-SFT. Clone the repo (if you haven’t already) and install the environment: ``` git clone https://github.com/JiuhaiChen/BLIP3o.git
cods-bart-large-xsum-samsum
qaconv-unifiedqa-t5-base
safety-flan-t5-small
BLIP3o-NEXT-SFT-3B
This is BLIP3o-NEXT-SFT checkpoint trained on BLIP3o-NEXT-Pretrain. Clone the repo (if you haven’t already) and install the environment: ``` git clone https://github.com/JiuhaiChen/BLIP3o.git