TencentARC

83 models

InstantMesh

Model card for InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models. We present InstantMesh, a feed-forward framework for instant 3D mesh generation from a single image, featuring state-of-the-art generation quality and significant training scalability. By synergizing the strengths of an off-the-shelf multiview diffusion model and a sparse-view reconstruction model based on the LRM architecture, InstantMesh is able to create diverse 3D assets within 10 seconds. To enhance training efficiency and exploit more geometric supervision, e.g., depths and normals, we integrate a differentiable iso-surface extraction module into our framework and directly optimize on the mesh representation. Experimental results on public datasets demonstrate that InstantMesh significantly outperforms other recent image-to-3D baselines, both qualitatively and quantitatively. We release all the code, weights, and demo of InstantMesh, with the intention that it can make substantial contributions to the community of 3D generative AI and empower both researchers and content creators.
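The released pipeline is run from the GitHub repository rather than through a library integration; as a minimal, hedged sketch (the repo id is taken from this page, while the script and config names are assumptions about the repo layout), the weights can be fetched like this:

```python
# Hedged sketch: fetch the InstantMesh release with huggingface_hub, then run
# the official inference script from the GitHub repo. The script/config names
# in the trailing comment are assumptions, not verified against the repository.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="TencentARC/InstantMesh")
print("InstantMesh weights downloaded to:", local_dir)
# e.g. python run.py configs/instant-mesh-large.yaml path/to/image.png (per the repo README)
```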

license:apache-2.0
79,270
310

PhotoMaker

🤗 Gradio demo (Realistic) | 🤗 Gradio demo (Stylization)

Users can input one or a few face photos, along with a text prompt, to receive a customized photo or painting within seconds (no training required!). Additionally, this model can be adapted to any base model built on SDXL or used in conjunction with other LoRA modules. It mainly contains two parts, corresponding to two keys in the loaded state dict:
1. `id_encoder` includes a finetuned OpenCLIP-ViT-H-14 and a few fuse layers.
2. `lora_weights` applies to all attention layers in the UNet, with the rank set to 64.

You can download the model directly from this repository, or download it with a Python script. Then, please follow the instructions in our GitHub repository.

Known limitations:
- The model's customization performance degrades on Asian male faces.
- The model still struggles with accurately rendering human hands.

While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.
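As a sketch of the scripted download mentioned above (the checkpoint filename `photomaker-v1.bin` is an assumption about this repository's contents):

```python
# Hedged sketch of downloading the PhotoMaker checkpoint in a Python script;
# the filename is an assumption about the files hosted in this repo.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="TencentARC/PhotoMaker", filename="photomaker-v1.bin")
print("checkpoint saved at:", ckpt_path)
```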

license:apache-2.0
23,089
435

PhotoMaker-V2

license:apache-2.0
12,367
151

t2i-adapter-sketch-sdxl-1.0

license:apache-2.0
4,009
75

t2i-adapter-lineart-sdxl-1.0

license:apache-2.0
3,830
77

t2i-adapter-canny-sdxl-1.0

license:apache-2.0
3,138
52

t2i-adapter-depth-midas-sdxl-1.0

license:apache-2.0
2,784
33

t2i-adapter-openpose-sdxl-1.0

license:apache-2.0
2,480
50

t2i-adapter-depth-zoe-sdxl-1.0

T2I-Adapter is a network providing additional conditioning to Stable Diffusion. Each T2I-Adapter checkpoint takes a different type of conditioning as input and is used with a specific base Stable Diffusion checkpoint. This checkpoint provides depth conditioning for the Stable Diffusion XL (SDXL) base checkpoint. This was a collaboration between Tencent ARC and Hugging Face.

Model Details
- Developed by: T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models
- Model type: Diffusion-based text-to-image generation model
- Language(s): English
- License: Apache 2.0
- Resources for more information: GitHub Repository, Paper.
- Model complexity:

| | SD-V1.4/1.5 | SD-XL | T2I-Adapter | T2I-Adapter-SDXL |
| --- | --- | --- | --- | --- |
| Parameters | 860M | 2.6B | 77M | 77/79M |

- Cite as:

@misc{mou2023t2iadapter,
  title={T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models},
  author={Chong Mou and Xintao Wang and Liangbin Xie and Yanze Wu and Jian Zhang and Zhongang Qi and Ying Shan and Xiaohu Qie},
  year={2023},
  eprint={2302.08453},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

| Model Name | Control Image Overview |
| --- | --- |
| TencentARC/t2i-adapter-canny-sdxl-1.0 (trained with Canny edge detection) | A monochrome image with white edges on a black background. |
| TencentARC/t2i-adapter-sketch-sdxl-1.0 (trained with PidiNet edge detection) | A hand-drawn monochrome image with white outlines on a black background. |
| TencentARC/t2i-adapter-lineart-sdxl-1.0 (trained with lineart edge detection) | A hand-drawn monochrome image with white outlines on a black background. |
| TencentARC/t2i-adapter-depth-midas-sdxl-1.0 (trained with MiDaS depth estimation) | A grayscale image with black representing deep areas and white representing shallow areas. |
| TencentARC/t2i-adapter-depth-zoe-sdxl-1.0 (trained with Zoe depth estimation) | A grayscale image with black representing deep areas and white representing shallow areas. |
| TencentARC/t2i-adapter-openpose-sdxl-1.0 (trained with OpenPose bone images) | An OpenPose bone image. |

To get started, first install the required dependencies.
1. Images are first converted into the appropriate control image format.
2. The control image and prompt are passed to the `StableDiffusionXLAdapterPipeline`.
Let's have a look at a simple example using the Depth-Zoe adapter.

Our training script was built on top of the official training script that we provide here. The model is trained on 3M high-resolution image-text pairs from LAION-Aesthetics V2 with:
- Training steps: 25,000
- Batch size: data parallel with a single-GPU batch size of `16` for a total batch size of `256`.
- Learning rate: constant learning rate of `1e-5`.
- Mixed precision: fp16
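The Depth-Zoe example referenced above lost its code block during extraction; a minimal sketch with the public diffusers API (the control image path and prompt are placeholders, and the control image is assumed to already be a Zoe depth map):

```python
# Minimal, hedged sketch of using the Depth-Zoe adapter with SDXL via diffusers.
import torch
from diffusers import StableDiffusionXLAdapterPipeline, T2IAdapter
from diffusers.utils import load_image

adapter = T2IAdapter.from_pretrained(
    "TencentARC/t2i-adapter-depth-zoe-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", adapter=adapter, torch_dtype=torch.float16
).to("cuda")

depth_map = load_image("zoe_depth_control.png")  # placeholder: a precomputed Zoe depth map
image = pipe(
    prompt="a photo of a cozy living room, best quality",
    image=depth_map,
    num_inference_steps=30,
    adapter_conditioning_scale=1.0,
).images[0]
image.save("result.png")
```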

license:apache-2.0
2,418
27

t2iadapter_canny_sd15v2

license:apache-2.0
1,493
3

t2iadapter_depth_sd15v2

license:apache-2.0
1,459
3

t2iadapter_sketch_sd15v2

license:apache-2.0
1,455
6

t2iadapter_zoedepth_sd15v1

license:apache-2.0
1,122
1

ARC-Hunyuan-Video-7B

[Paper](https://arxiv.org/abs/2507.20939) | [Demo](https://arc.tencent.com/en/ai-demos/multimodal) | [GitHub](https://github.com/TencentARC/ARC-Hunyuan-Video-7B) | [Model](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B) | [Blog](https://tencentarc.github.io/posts/arc-video-announcement/) | [ShortVid-Bench](https://huggingface.co/datasets/TencentARC/ShortVid-Bench)

Please note that in our demo, ARC-Hunyuan-Video-7B is the model consistent with the released checkpoint and the one described in the paper, while ARC-Hunyuan-Video-7B-V0 only supports video description and summarization in Chinese. Due to API file-size limits, our demo uses compressed input video resolutions, which may cause slight performance differences from the paper. For the original results, please run the model locally.

We introduce ARC-Hunyuan-Video-7B, a powerful multimodal model designed for understanding real-world short videos. Understanding user-generated videos is challenging due to their complex visual elements, high information density in both visuals and audio, and fast pacing that focuses on emotional expression and viewpoint delivery. To address this challenge, ARC-Hunyuan-Video-7B processes visual, audio, and textual signals end-to-end for a deep, structured understanding of video by integrating and reasoning over multimodal cues. Stress tests show an inference time of just 10 seconds for a one-minute video on an H20 GPU, yielding an average of 500 tokens, with inference accelerated by the vLLM framework. Compared to prior art, we introduce a new paradigm of Structured Video Comprehension, with capabilities including:

- Deep Understanding of Real-World Short Videos: ARC-Hunyuan-Video-7B excels at analyzing user-generated content from platforms like WeChat Channels and TikTok. It goes beyond surface-level descriptions to grasp the creator's intent, emotional expression, and core message by processing complex visual elements, dense audio cues, and rapid pacing.
- Synchronized Audio-Visual Reasoning: The synchronization of raw visual and audio signals allows our model to answer complex questions that are impossible to solve with only one modality, such as understanding humor in a skit or details in a product review.
- Precise Temporal Awareness: ARC-Hunyuan-Video-7B knows not just what happens, but when it happens. It supports multi-granularity timestamped captioning, temporal video grounding, and detailed event summarization, making it well suited to applications like video search, highlight generation, and content analysis.
- Advanced Reasoning and Application Versatility: Leveraging a comprehensive multi-stage training regimen including Reinforcement Learning (RL), ARC-Hunyuan-Video-7B demonstrates strong reasoning capabilities. It supports zero-shot or few-shot fine-tuning for diverse downstream applications like video tagging, recommendation, and retrieval.
The model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning. Specifically, ARC-Hunyuan-Video-7B is built on top of the Hunyuan-7B vision-language model with the following key designs to meet the requirements of effective structured video comprehension:

- An extra audio encoder with fine-grained visual-audio synchronization for temporally aligned visual-audio inputs
- A timestamp overlay mechanism on visual frames that explicitly provides the model with temporal awareness
- Millions of real-world videos with a fully automated, bootstrapped annotation pipeline
- A comprehensive training regimen based on the finding that grounding the model in objective tasks with RL is key to unlocking high-quality, subjective understanding

ARC-Qwen-Video-7B: In this version, we have switched the base model from the Hunyuan VLM to Qwen2.5-VL-7B-Instruct and introduce ARC-Qwen-Video-7B. We used the same training data and training stages. Please refer to the `arc-qwen-video` branch for details. We are also introducing a new model, ARC-Qwen-Video-7B-Narrator. It can output timestamped video descriptions, speaker identities, and the specific ASR (Automatic Speech Recognition) content. By processing its output with an external LLM, you can obtain more comprehensive structured information as follows (click to watch the video):

This is a comedy short about a husband whose secret savings, hidden in a padded coat, are accidentally discovered by his wife, who mistakes them for a "surprise" gift he prepared. Through a single phone call between the couple, the video vividly captures the husband's progression from carefree ease to stunned disbelief to helpless despair, full of dramatic reversals and humor.

0:05 - 0:10 Scene change: the wife, beaming in a clothing store, calls her husband. Wife: "Hey, honey, honey, I love you, I love you, love you to death, mwah mwah mwah."
0:10 - 0:18 The husband answers, puzzled by his wife's enthusiasm, while she excitedly reveals the "surprise". Husband: "Hey, what's going on with you, why so happy?"
0:18 - 0:27 On hearing "ten thousand yuan", the husband's expression freezes, shifting from confusion to shock and regret, though he forces himself to stay calm. Husband: "Huh? Great, as... as long as you're happy."
0:27 - 0:34 The wife happily explains what the money was used for; the husband's face goes completely rigid as his shock deepens. Wife: "Of course I'm happy, I used it to buy a new outfit. I'll wear it for you tonight when I get home."
0:34 - 0:46 The husband confirms the money has been spent and breaks down; the wife assumes he had approved it, and he can't help swearing. Husband: "You already spent it on clothes?"
0:46 - 0:59 The wife senses something off in his tone; the husband immediately backpedals to cover it up and urges her to come home early. Wife: "What? Honey, what did you say?"

- Husband: hides secret savings and, once they are discovered, does his best to mask his true feelings (heartache, regret). Psychological arc: carefree -> puzzled -> shocked -> devastated -> resigned acceptance. Traits: keeps up appearances; both loving and helpless toward his wife, a classic henpecked-husband figure.
- Wife: on finding the money, takes it as an expression of her husband's love and promptly spends it. Psychological arc: stays immersed in the joy of discovering the "surprise" throughout. Traits: naive, decisive in spending, full of trust in and love for her husband.

Husband's perspective: the 10,000 yuan he painstakingly saved is accidentally discovered and spent, a "shock". Wife's perspective: a 10,000 yuan romance fund carefully prepared by her husband, a huge "surprise". This misunderstanding drives the whole story; the husband swallowing his pain and the wife's taken-for-granted happiness form a sharp comedic contrast and a dense stream of laughs.

Through a familiar domestic scenario about secret savings, the video skillfully builds a story full of reversals and humor. It uses dramatic irony (the audience and the husband know the truth while the wife is kept in the dark) to capture the husband's complex inner state under sudden pressure. The result is not only comic but also an understated look at communication, trust, and attitudes toward money between spouses, making it easy for viewers to relate and discuss.

News
- 2025.09.19: We release ARC-Qwen-Video-7B, which switches the base model from the Hunyuan VLM to Qwen2.5-VL-7B-Instruct. We also release ARC-Qwen-Video-7B-Narrator, which can output timestamped video descriptions, speaker identities, and the specific ASR (Automatic Speech Recognition) content. Please refer to the `arc-qwen-video` branch for details.
- 2025.08.05: We release ShortVid-Bench, a specialized, human-annotated benchmark with multiple-choice questions for evaluating short-video understanding.
- 2025.07.29: We release the training code for instruction tuning.
- 2025.07.25: We release the model checkpoint and inference code of ARC-Hunyuan-Video-7B, including the vLLM version.
- 2025.07.25: We release the API service of ARC-Hunyuan-Video-7B, which is supported by vLLM. We release two versions: one is V0, which only supports video description and summarization in Chinese; the other is the version consistent with the model checkpoint and the one described in the paper.

Usage

Dependencies
- Our inference can be performed on a single NVIDIA A100 40GB GPU.
- For the vLLM deployment version, we recommend using two NVIDIA A100 40GB GPUs.

Installation
- Download ARC-Hunyuan-Video-7B (including the ViT and LLM) and the original whisper-large-v3.

We also provide access to the model via an API, which is served with vLLM. For details, please refer to the documentation. We release two versions: one is V0, which only supports video description and summarization in Chinese; the other is consistent with the model checkpoint and the one described in the paper, and is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning (it supports Chinese and English videos and particularly excels at Chinese). For videos longer than 5 minutes, we only support structured descriptions: we process these videos in 5-minute segments and use an LLM to integrate the inference results. If you only need to understand and summarize short Chinese videos, we recommend using the V0 version.

Due to video file-size limitations imposed by the deployment API, we compressed input video resolutions for our online demo and API services. Consequently, model performance in these interfaces may deviate slightly from the results reported in the paper. To reproduce the original performance, we recommend local inference.

We observe that incorporating generic video datasets during training may inadvertently compromise the model's capacity for real-world video understanding, potentially due to domain shift or noise introduced by non-real-world samples. To address this limitation, we plan to develop a dedicated model trained exclusively on rigorously curated real-world video data. If you find the work helpful, please consider citing:
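As a sketch of the download step above (repo ids are assumptions based on the links on this page; whisper-large-v3 is assumed to be the OpenAI release on the Hub):

```python
# Hedged sketch of fetching the checkpoints named in the installation step;
# repo ids and target directories are assumptions.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="TencentARC/ARC-Hunyuan-Video-7B", local_dir="./ARC-Hunyuan-Video-7B")
snapshot_download(repo_id="openai/whisper-large-v3", local_dir="./whisper-large-v3")
```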

1,041
28

QA-CLIP-ViT-L-14

license:apache-2.0
509
3

TimeLens-8B

312
9

GeometryCrafter

299
11

StereoCrafter

94
19

NVComposer

79
7

ARC-Qwen-Video-7B

[Paper](https://arxiv.org/abs/2507.20939) | [Demo](https://arc.tencent.com/en/ai-demos/multimodal) | [GitHub](https://github.com/TencentARC/ARC-Hunyuan-Video-7B/tree/arc-qwen-video) | [ARC-Hunyuan-Video-7B](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B) | [ARC-Qwen-Video-7B](https://huggingface.co/TencentARC/ARC-Qwen-Video-7B) | [ARC-Qwen-Video-7B-Narrator](https://huggingface.co/TencentARC/ARC-Qwen-Video-7B-Narrator) | [Blog](https://tencentarc.github.io/posts/arc-video-announcement/) | [ShortVid-Bench](https://huggingface.co/datasets/TencentARC/ShortVid-Bench)

In this version, we have switched the base model from the Hunyuan VLM used in ARC-Hunyuan-Video-7B to Qwen2.5-VL-7B-Instruct and introduce ARC-Qwen-Video-7B for understanding real-world short videos. We used the same training data and training stages. For a detailed introduction, please refer to ARC-Hunyuan-Video-7B. The main distinctions are listed below:

| Feature | `ARC-Hunyuan-Video-7B` | `ARC-Qwen-Video-7B` |
| --- | --- | --- |
| Base VLM | Hunyuan-VL-7B-Pretrain | Qwen2.5-VL-7B-Instruct |
| Frame resolution (fixed per model to maintain audio-video synchronization) | 640 x 640 | 392 x 292 |
| Frame sampling | > 150 s: uniformly sample 150 frames | > 300 s: uniformly sample 300 frames |
| Audio-video synchronization | 150-300 s: sum tokens from the corresponding audio segment and video frame; > 300 s: split audio into 300 segments, use the first 2 s of each | > 300 s: split audio into 300 segments, use the middle 1 s of each |

We are also introducing a new model, ARC-Qwen-Video-7B-Narrator. It can output timestamped video descriptions, speaker identities, and the specific ASR (Automatic Speech Recognition) content.

By processing its output with an external LLM, you can obtain more comprehensive structured information as follows (click to watch the video):

This is a comedy short about a husband whose secret savings, hidden in a padded coat, are accidentally discovered by his wife, who mistakes them for a "surprise" gift he prepared. Through a single phone call between the couple, the video vividly captures the husband's progression from carefree ease to stunned disbelief to helpless despair, full of dramatic reversals and humor.

0:05 - 0:10 Scene change: the wife, beaming in a clothing store, calls her husband. Wife: "Hey, honey, honey, I love you, I love you, love you to death, mwah mwah mwah."
0:10 - 0:18 The husband answers, puzzled by his wife's enthusiasm, while she excitedly reveals the "surprise". Husband: "Hey, what's going on with you, why so happy?"
0:18 - 0:27 On hearing "ten thousand yuan", the husband's expression freezes, shifting from confusion to shock and regret, though he forces himself to stay calm. Husband: "Huh? Great, as... as long as you're happy."
0:27 - 0:34 The wife happily explains what the money was used for; the husband's face goes completely rigid as his shock deepens. Wife: "Of course I'm happy, I used it to buy a new outfit. I'll wear it for you tonight when I get home."
0:34 - 0:46 The husband confirms the money has been spent and breaks down; the wife assumes he had approved it, and he can't help swearing. Husband: "You already spent it on clothes?"
0:46 - 0:59 The wife senses something off in his tone; the husband immediately backpedals to cover it up and urges her to come home early. Wife: "What? Honey, what did you say?"

- Husband: hides secret savings and, once they are discovered, does his best to mask his true feelings (heartache, regret). Psychological arc: carefree -> puzzled -> shocked -> devastated -> resigned acceptance. Traits: keeps up appearances; both loving and helpless toward his wife, a classic henpecked-husband figure.
- Wife: on finding the money, takes it as an expression of her husband's love and promptly spends it. Psychological arc: stays immersed in the joy of discovering the "surprise" throughout. Traits: naive, decisive in spending, full of trust in and love for her husband.

Husband's perspective: the 10,000 yuan he painstakingly saved is accidentally discovered and spent, a "shock". Wife's perspective: a 10,000 yuan romance fund carefully prepared by her husband, a huge "surprise". This misunderstanding drives the whole story; the husband swallowing his pain and the wife's taken-for-granted happiness form a sharp comedic contrast and a dense stream of laughs.

Through a familiar domestic scenario about secret savings, the video skillfully builds a story full of reversals and humor. It uses dramatic irony (the audience and the husband know the truth while the wife is kept in the dark) to capture the husband's complex inner state under sudden pressure. The result is not only comic but also an understated look at communication, trust, and attitudes toward money between spouses, making it easy for viewers to relate and discuss.

Dependencies

The installation has been tested and verified on the following environments:
- NVIDIA H20 with CUDA 12.4
- NVIDIA A100 with CUDA 12.1

An "Ugly" Workaround for vLLM Installation

If you are unable to install our provided vLLM package, we offer an alternative "ugly" method:
2. Modify `config.json`. In your model weights directory, open `config.json` and change the `architectures` field to "Qwen2_5_VLForConditionalGeneration".
3. Patch the vLLM source code. Locate the file `vllm/model_executor/models/qwen2_5_vl.py` in your vLLM installation path. Add the following code inside the `__init__` method of the `Qwen2_5_VLForConditionalGeneration` class:

Why this works: our model is based on the Qwen2.5-VL architecture, with the addition of an audio encoder and a corresponding MLP. During vLLM inference, the multimodal encoder processes inputs sequentially, while the LLM performs batch inference. Since we only need to pass the final multimodal embeddings to the LLM, we can reuse the existing code for Qwen2.5-VL.

To quickly verify that your environment is set up correctly and that video and audio information are being processed as expected, you can run the following test case with ARC-Qwen-Video-7B. Expected result: if the model's output contains the phrase "So thin", your installation is successful.

Benchmark Performance

| | Video-MMMU | MMVU | Temp-Compass | Video-Holmes | Video-MME | VCR-Bench | MV-Bench | ShortVid-Bench | Charades-STA |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ARC-Hunyuan-Video-7B | 31.1 | 49.1 | 66.0 | 40.9 | 58.7 | 50.5 | 62.6 | 73.0 | 54.8 |
| ARC-Qwen-Video-7B | 41.3 | 55.5 | 68.7 | 51.1 | 61.0 | 52.3 | 60.8 | 72.6 | 52.8 |

Quantitative evaluation is performed on the different benchmarks using accuracy as the metric, except for the grounding task on Charades-STA, which uses mIoU. For all benchmarks other than Video-MMMU and Charades-STA, we evaluated only the multiple-choice questions. If you find the work helpful, please consider citing:
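A small sketch of the `config.json` edit in step 2 of the workaround (the weights directory path is a placeholder):

```python
# Hedged sketch of step 2 of the vLLM workaround: point the architectures field
# at the stock Qwen2.5-VL class so vLLM reuses its existing model code.
import json
from pathlib import Path

cfg_path = Path("./ARC-Qwen-Video-7B/config.json")  # placeholder weights directory
cfg = json.loads(cfg_path.read_text())
cfg["architectures"] = ["Qwen2_5_VLForConditionalGeneration"]
cfg_path.write_text(json.dumps(cfg, indent=2))
```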

license:apache-2.0
62
5

VerseCrafter

61
3

TimeLens-7B

52
5

t2iadapter_color_sd14v1

license:apache-2.0
50
11

SEED-Story

45
28

t2iadapter_depth_sd14v1

license:apache-2.0
36
0

t2iadapter_sketch_sd14v1

license:apache-2.0
26
1

flux-mini

25
104

t2iadapter_seg_sd14v1

license:apache-2.0
25
0

t2iadapter_openpose_sd14v1

license:apache-2.0
22
1

t2iadapter_canny_sd14v1

license:apache-2.0
21
1

AudioStory-3B

AudioStory: Generating Long-Form Narrative Audio with Large Language Models [[github]](https://github.com/TencentARC/AudioStory/)

✨ TL;DR: We propose a model for long-form narrative audio generation built upon a unified understanding–generation framework, capable of handling video dubbing, audio continuation, and long-form narrative audio synthesis.

📖 Release
- [2025/09/02] 🔥🔥 Text-to-long-audio checkpoint released!

Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following reasoning generation capabilities. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and emotional tone consistency. AudioStory has two appealing features: 1) Decoupled bridging mechanism: AudioStory disentangles LLM-diffuser collaboration into two specialized components, a bridging query for intra-event semantic alignment and a consistency query for cross-event coherence preservation. 2) End-to-end training: by unifying instruction comprehension and audio generation within a single end-to-end framework, AudioStory eliminates the need for modular training pipelines while enhancing synergy between components. Furthermore, we establish a benchmark, AudioStory-10K, encompassing diverse domains such as animated soundscapes and natural sound narratives. Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity.

1. Video Dubbing (Tom & Jerry style): dubbing is achieved using AudioStory (trained on Tom & Jerry) with visual captions extracted from the videos.
- Instruction: "Develop a comprehensive audio that fully represents Jake Shimabukuro performs a complex ukulele piece in a studio, receives applause, and discusses his career in an interview. The total duration is 49.9 seconds."
- Instruction: "Develop a comprehensive audio that fully represents a fire truck leaves the station with sirens blaring, signaling an emergency response, and drives away. The total duration is 35.1 seconds."
- Instruction: "Understand the input audio, infer the subsequent events, and generate the continued audio of the coach giving basketball lessons to the players. The total duration is 36.6 seconds."

To achieve effective instruction-following audio generation, the ability to understand the input instruction or audio stream and to reason about the relevant audio sub-events is essential. To this end, AudioStory adopts a unified understanding-generation framework (see the framework figure). Specifically, given a textual instruction or audio input, the LLM analyzes and decomposes it into structured audio sub-events with context. Based on the inferred sub-events, the LLM performs interleaved reasoning generation, sequentially producing captions, semantic tokens, and residual tokens for each audio clip. These two types of tokens are fused and passed to the DiT, effectively bridging the LLM with the audio generator. Through progressive training, AudioStory ultimately achieves both strong instruction comprehension and high-quality audio generation.

Dependencies
- Python >= 3.10 (Anaconda is recommended)
- PyTorch >= 2.1.0
- NVIDIA GPU + CUDA

When building the codebase of continuous denoisers, we refer to SEED-X and TangoFlux. Thanks for their wonderful projects.

- [ ] Release our Gradio demo.
- [x] 💾 Release AudioStory model checkpoints.
- [ ] Release the AudioStory-10K dataset.
- [ ] Release training code for all three stages.

If you have further questions, feel free to contact me: [email protected]. Discussions and potential collaborations are also welcome.

17
7

IC-Custom

15
15

ARC-Qwen-Video-7B-Narrator

[Paper](https://arxiv.org/abs/2507.20939) | [Demo](https://arc.tencent.com/en/ai-demos/multimodal) | [GitHub](https://github.com/TencentARC/ARC-Hunyuan-Video-7B/tree/arc-qwen-video) | [ARC-Hunyuan-Video-7B](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B) | [ARC-Qwen-Video-7B](https://huggingface.co/TencentARC/ARC-Qwen-Video-7B) | [ARC-Qwen-Video-7B-Narrator](https://huggingface.co/TencentARC/ARC-Qwen-Video-7B-Narrator) | [Blog](https://tencentarc.github.io/posts/arc-video-announcement/) | [ShortVid-Bench](https://huggingface.co/datasets/TencentARC/ShortVid-Bench)

In this version, we have switched the base model from the Hunyuan VLM used in ARC-Hunyuan-Video-7B to Qwen2.5-VL-7B-Instruct and introduce ARC-Qwen-Video-7B for understanding real-world short videos. We used the same training data and training stages. For a detailed introduction, please refer to ARC-Hunyuan-Video-7B. The main distinctions are listed below:

| Feature | `ARC-Hunyuan-Video-7B` | `ARC-Qwen-Video-7B` |
| --- | --- | --- |
| Base VLM | Hunyuan-VL-7B-Pretrain | Qwen2.5-VL-7B-Instruct |
| Frame resolution (fixed per model to maintain audio-video synchronization) | 640 x 640 | 392 x 292 |
| Frame sampling | > 150 s: uniformly sample 150 frames | > 300 s: uniformly sample 300 frames |
| Audio-video synchronization | 150-300 s: sum tokens from the corresponding audio segment and video frame; > 300 s: split audio into 300 segments, use the first 2 s of each | > 300 s: split audio into 300 segments, use the middle 1 s of each |

We are also introducing a new model, ARC-Qwen-Video-7B-Narrator. It can output timestamped video descriptions, speaker identities, and the specific ASR (Automatic Speech Recognition) content.

By processing its output with an external LLM, you can obtain more comprehensive structured information as follows (click to watch the video):

This is a comedy short about a husband whose secret savings, hidden in a padded coat, are accidentally discovered by his wife, who mistakes them for a "surprise" gift he prepared. Through a single phone call between the couple, the video vividly captures the husband's progression from carefree ease to stunned disbelief to helpless despair, full of dramatic reversals and humor.

0:05 - 0:10 Scene change: the wife, beaming in a clothing store, calls her husband. Wife: "Hey, honey, honey, I love you, I love you, love you to death, mwah mwah mwah."
0:10 - 0:18 The husband answers, puzzled by his wife's enthusiasm, while she excitedly reveals the "surprise". Husband: "Hey, what's going on with you, why so happy?"
0:18 - 0:27 On hearing "ten thousand yuan", the husband's expression freezes, shifting from confusion to shock and regret, though he forces himself to stay calm. Husband: "Huh? Great, as... as long as you're happy."
0:27 - 0:34 The wife happily explains what the money was used for; the husband's face goes completely rigid as his shock deepens. Wife: "Of course I'm happy, I used it to buy a new outfit. I'll wear it for you tonight when I get home."
0:34 - 0:46 The husband confirms the money has been spent and breaks down; the wife assumes he had approved it, and he can't help swearing. Husband: "You already spent it on clothes?"
0:46 - 0:59 The wife senses something off in his tone; the husband immediately backpedals to cover it up and urges her to come home early. Wife: "What? Honey, what did you say?"

- Husband: hides secret savings and, once they are discovered, does his best to mask his true feelings (heartache, regret). Psychological arc: carefree -> puzzled -> shocked -> devastated -> resigned acceptance. Traits: keeps up appearances; both loving and helpless toward his wife, a classic henpecked-husband figure.
- Wife: on finding the money, takes it as an expression of her husband's love and promptly spends it. Psychological arc: stays immersed in the joy of discovering the "surprise" throughout. Traits: naive, decisive in spending, full of trust in and love for her husband.

Husband's perspective: the 10,000 yuan he painstakingly saved is accidentally discovered and spent, a "shock". Wife's perspective: a 10,000 yuan romance fund carefully prepared by her husband, a huge "surprise". This misunderstanding drives the whole story; the husband swallowing his pain and the wife's taken-for-granted happiness form a sharp comedic contrast and a dense stream of laughs.

Through a familiar domestic scenario about secret savings, the video skillfully builds a story full of reversals and humor. It uses dramatic irony (the audience and the husband know the truth while the wife is kept in the dark) to capture the husband's complex inner state under sudden pressure. The result is not only comic but also an understated look at communication, trust, and attitudes toward money between spouses, making it easy for viewers to relate and discuss.

Dependencies

The installation has been tested and verified on the following environments:
- NVIDIA H20 with CUDA 12.4
- NVIDIA A100 with CUDA 12.1

An "Ugly" Workaround for vLLM Installation

If you are unable to install our provided vLLM package, we offer an alternative "ugly" method:
2. Modify `config.json`. In your model weights directory, open `config.json` and change the `architectures` field to "Qwen2_5_VLForConditionalGeneration".
3. Patch the vLLM source code. Locate the file `vllm/model_executor/models/qwen2_5_vl.py` in your vLLM installation path. Add the following code inside the `__init__` method of the `Qwen2_5_VLForConditionalGeneration` class:

Why this works: our model is based on the Qwen2.5-VL architecture, with the addition of an audio encoder and a corresponding MLP. During vLLM inference, the multimodal encoder processes inputs sequentially, while the LLM performs batch inference. Since we only need to pass the final multimodal embeddings to the LLM, we can reuse the existing code for Qwen2.5-VL.

To quickly verify that your environment is set up correctly and that video and audio information are being processed as expected, you can run the following test case with ARC-Qwen-Video-7B. Expected result: if the model's output contains the phrase "So thin", your installation is successful.

Benchmark Performance

| | Video-MMMU | MMVU | Temp-Compass | Video-Holmes | Video-MME | VCR-Bench | MV-Bench | ShortVid-Bench | Charades-STA |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ARC-Hunyuan-Video-7B | 31.1 | 49.1 | 66.0 | 40.9 | 58.7 | 50.5 | 62.6 | 73.0 | 54.8 |
| ARC-Qwen-Video-7B | 41.3 | 55.5 | 68.7 | 51.1 | 61.0 | 52.3 | 60.8 | 72.6 | 52.8 |

Quantitative evaluation is performed on the different benchmarks using accuracy as the metric, except for the grounding task on Charades-STA, which uses mIoU. For all benchmarks other than Video-MMMU and Charades-STA, we evaluated only the multiple-choice questions. If you find the work helpful, please consider citing:
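For context on the Charades-STA column above, the grounding metric averages the temporal IoU between predicted and ground-truth segments; a generic sketch (not the official evaluation script):

```python
# Generic sketch of the temporal IoU underlying mIoU on Charades-STA-style
# grounding; illustrative only, not the official evaluation code.
def temporal_iou(pred, gt):
    # pred, gt: (start, end) timestamps in seconds
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts):
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```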

license:apache-2.0
14
5

GRPO-CARE

This repository contains the GRPO-CARE model, presented in the paper GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning.

license:apache-2.0
9
1

TokLIP

8
13

t2iadapter_keypose_sd14v1

8
2

Moto

license:apache-2.0
6
7

QA-CLIP-ViT-B-16

license:apache-2.0
6
6

Open-MAGVIT2-Tokenizer-256-resolution

license:apache-2.0
6
1

DSR_Suite-Model

license:apache-2.0
5
2

IBQ-Tokenizer-1024

Taming Scalable Visual Tokenizer for Autoregressive Image Generation

Code: https://github.com/TencentARC/SEED-Voken

Introduction: We propose Index Backpropagation Quantization (IBQ), a new vector quantization method for the joint optimization of all codebook embeddings and the visual encoder, ensuring a consistent latent space. IBQ enables scalable training of visual tokenizers and, for the first time, achieves a large-scale codebook (2^18) with high dimension (256) and high utilization.
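As a conceptual illustration of the idea (a straight-through one-hot over the full codebook so that every embedding receives gradients), here is a hedged PyTorch sketch; the dimensions, losses, and other details are assumptions rather than the released SEED-Voken implementation:

```python
# Conceptual sketch of Index Backpropagation Quantization (IBQ); not the
# official SEED-Voken code. All hyperparameters here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IBQQuantizer(nn.Module):
    def __init__(self, codebook_size=2**18, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                          # z: (batch, dim) encoder features
        logits = z @ self.codebook.weight.t()      # similarity to every code
        soft = F.softmax(logits, dim=-1)           # gradients flow to all embeddings
        index = soft.argmax(dim=-1)
        hard = F.one_hot(index, soft.shape[-1]).type_as(soft)
        one_hot = hard + soft - soft.detach()      # straight-through estimator
        z_q = one_hot @ self.codebook.weight       # quantized feature
        return z_q, index
```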

license:apache-2.0
4
1

AnimeGamer

3
41

Open-MAGVIT2-Tokenizer-128-resolution

license:apache-2.0
3
1

Open-MAGVIT2-Tokenizer-16384-Pretrain

license:apache-2.0
2
2

Open-MAGVIT2-Tokenizer-262144-Pretrain

Open-MAGVIT2: Democratizing Autoregressive Visual Generation

Code: https://github.com/TencentARC/SEED-Voken

Introduction: To this day, VQGAN, the original visual tokenizer, still plays an indispensable role in mainstream tasks, especially autoregressive visual generation. Constrained by its codebook size and code utilization, however, the capability of AR generation with VQGAN has been underestimated. MAGVIT-2 therefore proposed a powerful tokenizer for visual generation that introduces a novel lookup-free quantization technique and extends the codebook size to $2^{18}$, exhibiting promising performance in both image and video generation tasks; it also plays an important role in the recent state-of-the-art AR video generation model VideoPoet. However, this strong tokenizer has not been publicly released so far. ☹️ In this codebase, we follow the key insights of the tokenizer design in MAGVIT-2 and re-implement it in PyTorch, achieving the closest results to the original so far. We hope that our effort can foster innovation and creativity within the field of autoregressive visual generation. 😄
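For intuition, the lookup-free quantization at the heart of MAGVIT-2-style tokenizers binarizes each latent channel and reads the bit pattern as the token index (an 18-dimensional latent gives the $2^{18}$ codebook mentioned above); a hedged sketch, not the Open-MAGVIT2 implementation:

```python
# Illustrative sketch of lookup-free quantization (LFQ); the latent dimension
# and the straight-through trick follow common practice and are assumptions,
# not the Open-MAGVIT2 source.
import torch

def lfq(z):                                        # z: (..., 18) continuous latents
    q = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))
    q = z + (q - z).detach()                       # straight-through estimator
    bits = (q > 0).long()
    weights = 2 ** torch.arange(z.shape[-1], device=z.device)
    index = (bits * weights).sum(dim=-1)           # integer token id in [0, 2^18)
    return q, index
```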

license:apache-2.0
2
1

Open-MAGVIT2-Tokenizer-262144-Video

license:apache-2.0
2
1

Divot

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

[Paper](https://arxiv.org/abs/2412.04432) | [GitHub](https://github.com/TencentARC/Divot)

We introduce Divot, a Diffusion-Powered Video Tokenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations. Building upon the Divot tokenizer, we present Divot-LLM through video-to-text autoregression and text-to-video generation by modeling the distributions of continuous-valued Divot features with a Gaussian Mixture Model. All models, training code, and inference code are released!

TODOs
- [x] Release the pretrained tokenizer and de-tokenizer of Divot.
- [x] Release the pretrained and instruction-tuned model of Divot-LLM.
- [x] Release inference code of Divot.
- [x] Release training and inference code of Divot-LLM.
- [ ] Release training code of Divot.
- [ ] Release de-tokenizer adaptation training code.

We utilize the diffusion procedure to learn a video tokenizer in a self-supervised manner for unified comprehension and generation, where the spatiotemporal representations serve as the condition of a diffusion model to de-noise video clips. Additionally, the proxy diffusion model functions as a de-tokenizer to decode realistic video clips from the video representations. After training the Divot tokenizer, video features from the tokenizer are fed into the LLM to perform next-word prediction for video comprehension, while learnable queries are input into the LLM to model the distributions of Divot features using a Gaussian Mixture Model (GMM) for video generation. During inference, video features are sampled from the predicted GMM distribution to decode videos using the de-tokenizer.

Dependencies
- Python >= 3.8 (Anaconda is recommended)
- PyTorch >= 2.1.0
- NVIDIA GPU + CUDA

Installation
Clone the repo and install the dependent packages.

Model Weights
We release the pretrained tokenizer and de-tokenizer, and the pre-trained and instruction-tuned Divot-LLM. Please download the checkpoints and save them under the folder `./pretrained`, for example `./pretrained/Divot_tokenizer_detokenizer`. You also need to download Mistral-7B-Instruct-v0.1 and CLIP-ViT-H-14-laion2B-s32B-b79K, and save them under the folder `./pretrained`.

Training

Pre-training
1. Download the checkpoints of pre-trained Mistral-7B-Instruct-v0.1 and CLIP-ViT-H-14-laion2B-s32B-b79K, and save them under the folder `./pretrained`.
2. Prepare the training data in webdataset format.
3. Run the following script.

Instruction-tuning
1. Download the checkpoints of the pre-trained Divot tokenizer and Divot-LLM, and save them under the folder `./pretrained`.
2. Prepare the instruction data in webdataset format (for generation) and jsonl format (for comprehension, where each line stores a dictionary specifying the video_path, question, and answer).
3. Run the following script.

Inference with your own model
1. Obtain "pytorch_model.bin" with the following script.
2. Merge your trained LoRA with the original LLM model using the following script.
3. Load your merged model in "mistral7bmergedxxx" and the corresponding "agent" path. For example,

License
`Divot` is licensed under the Apache License Version 2.0 for academic purposes only, except for the third-party components listed in the License.

Citation
If you find the work helpful, please consider citing:

Acknowledgements
Our code for the Divot tokenizer and de-tokenizer is built upon DynamiCrafter. Thanks for their excellent work!
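As a loose illustration of the GMM-based generation path described above (sampling a continuous Divot feature from predicted mixture parameters), here is a hedged sketch; the parameterization is an assumption, not the released Divot-LLM code:

```python
# Hedged sketch: sample a continuous feature vector from a Gaussian Mixture
# Model whose parameters an LLM head might predict. Not the official code.
import torch
from torch.distributions import Categorical, Normal

def sample_gmm(mix_logits, means, log_stds):
    # mix_logits: (K,), means/log_stds: (K, D) per-component parameters
    k = Categorical(logits=mix_logits).sample()
    return Normal(means[k], log_stds[k].exp()).sample()   # (D,) sampled feature
```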

license:apache-2.0
1
7

IBQ-Tokenizer-16384

Taming Scalable Visual Tokenizer for Autoregressive Image Generation

Code: https://github.com/TencentARC/SEED-Voken

Introduction: We propose Index Backpropagation Quantization (IBQ), a new vector quantization method for the joint optimization of all codebook embeddings and the visual encoder, ensuring a consistent latent space. IBQ enables scalable training of visual tokenizers and, for the first time, achieves a large-scale codebook (2^18) with high dimension (256) and high utilization.

license:apache-2.0
1
1

IBQ-Tokenizer-262144

Taming Scalable Visual Tokenizer for Autoregressive Image Generation

Code: https://github.com/TencentARC/SEED-Voken

Introduction: We propose Index Backpropagation Quantization (IBQ), a new vector quantization method for the joint optimization of all codebook embeddings and the visual encoder, ensuring a consistent latent space. IBQ enables scalable training of visual tokenizers and, for the first time, achieves a large-scale codebook (2^18) with high dimension (256) and high utilization.

license:apache-2.0
1
1

IBQ-AR-B

Taming Scalable Visual Tokenizer for Autoregressive Image Generation

Code: https://github.com/TencentARC/SEED-Voken

Introduction: We propose Index Backpropagation Quantization (IBQ), a new vector quantization method for the joint optimization of all codebook embeddings and the visual encoder, ensuring a consistent latent space. IBQ enables scalable training of visual tokenizers and, for the first time, achieves a large-scale codebook (2^18) with high dimension (256) and high utilization.

license:apache-2.0
1
1

IBQ-AR-L

Taming Scalable Visual Tokenizer for Autoregressive Image Generation

Code: https://github.com/TencentARC/SEED-Voken

Introduction: We propose Index Backpropagation Quantization (IBQ), a new vector quantization method for the joint optimization of all codebook embeddings and the visual encoder, ensuring a consistent latent space. IBQ enables scalable training of visual tokenizers and, for the first time, achieves a large-scale codebook (2^18) with high dimension (256) and high utilization.

license:apache-2.0
1
1

IBQ-AR-XXL

license:apache-2.0
1
1

Open-MAGVIT2-AR-XL-256-resolution

license:apache-2.0
1
1

IBQ-Tokenizer-16384-Pretrain

license:apache-2.0
1
1

IBQ-Tokenizer-8192

Taming Scalable Visual Tokenizer for Autoregressive Image Generation

Code: https://github.com/TencentARC/SEED-Voken

Introduction: We propose Index Backpropagation Quantization (IBQ), a new vector quantization method for the joint optimization of all codebook embeddings and the visual encoder, ensuring a consistent latent space. IBQ enables scalable training of visual tokenizers and, for the first time, achieves a large-scale codebook (2^18) with high dimension (256) and high utilization.

license:apache-2.0
1
0

IBQ-AR-XL

license:apache-2.0
1
0

Open-MAGVIT2-AR-B-256-resolution

license:apache-2.0
1
0

IBQ-Tokenizer-262144-Pretrain

license:apache-2.0
1
0

T2I-Adapter

[Demo](https://huggingface.co/spaces/Adapter/T2I-Adapter) T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models. Please find the model information at https://github.com/TencentARC/T2I-Adapter/blob/main/docs/AdapterZoo.md

0
840

MotionCtrl

license:apache-2.0
0
56

GFPGANv1

0
53

ColorFlow

license:mit
0
37

ToonComposer

ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing. Traditional cartoon/anime production is time-consuming, requiring skilled artists for keyframing, inbetweening, and colorization. ToonComposer streamlines this with generative AI, turning hours of manual inbetweening and colorization work into a single, seamless process. Visit our project page and read our paper for more details. This HF model repo provides the model weights of ToonComposer; code is available at our GitHub repo. If you find ToonComposer useful, please consider citing:

license:mit
0
34

VideoPainter

0
30

BrushEdit

0
29

Open-MAGVIT2

license:apache-2.0
0
14

MotionCrafter

0
11

RollingForcing

license:mit
0
11

ViT-Lens

license:apache-2.0
0
9

QA-CLIP

license:apache-2.0
0
7

CubeComposer

0
6

CustomNet

license:apache-2.0
0
6

Mira-v0

license:apache-2.0
0
5

ImageConductor

license:apache-2.0
0
5

FreeSplatter

license:apache-2.0
0
5

MasaCtrl

0
4

DI-PCG

0
4

ViSFT

license:apache-2.0
0
3

Mira-v1

license:apache-2.0
0
3

GenCompositor

license:mit
0
3

Assembler

# Assembler: Scalable 3D Part Assembly via Anchor Point Diffusion

Wang Zhao¹, Yan-Pei Cao², Jiale Xu¹, Yuejiang Dong¹,³, Ying Shan¹
¹ ARC Lab, Tencent PCG   ² VAST   ³ Tsinghua University

🚩 Overview
This repository contains the code release for our SIGGRAPH Asia 2025 paper "Assembler: Scalable 3D Part Assembly via Anchor Point Diffusion".

⚙️ Installation
We recommend using Anaconda to install the dependencies.

🚀 Usage
Inference: to run the inference demo, simply use the provided script. It runs on the example data from the Toys4k dataset inside `./examples`. You can also assemble your own data: put all the part meshes (in GLB format) and a reference image (in PNG format) into a single folder, and set the `input_dir` argument to that folder.

0
2

Track4World

0
1

mllm-npu-llama2-qwenvl-vit

license:cc-by-2.0
0
1