stepfun-ai
Step-3.5-Flash-FP8
Step-3.5-Flash
Step3-VL-10B
step3
📰 Step3 Model Blog | 📄 Step3 System Blog

Step3 is our cutting-edge multimodal reasoning model, built on a Mixture-of-Experts architecture with 321B total parameters and 38B active. It is designed end-to-end to minimize decoding costs while delivering top-tier performance in vision–language reasoning. Through the co-design of Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD), Step3 maintains exceptional efficiency across both flagship and low-end accelerators.

| Config | Value |
|------------------------|---------|
| Number of Layers (Dense layers included) | 61 |
| Number of Dense Layers | 5 |
| Hidden Dimension | 7168 |
| Attention Mechanism | MFA |
| Low-Rank Query Dimension | 2048 |
| Number of Query Heads | 64 |
| Head Dimension | 256 |
| Number of Experts | 48 |
| Selected Experts per Token | 3 |
| Number of Shared Experts | 1 |
| Max Context Length | 65536 |
| Tokenizer | DeepSeek V3 |
| Total Parameters (LLM) | 316B |
| Activated Params per Token | 38B |
| Total Parameters (VLM) | 321B |

> [!Note]
> Step3's API is accessible at https://platform.stepfun.com/, where we offer an OpenAI-compatible API (a request example appears at the end of this card).

Below we describe how to run inference with the transformers library. We recommend python=3.10, torch>=2.1.0, and transformers=4.54.0 as the development environment. We currently only support bf16 inference, and multi-patch image preprocessing is enabled by default; this behavior is aligned with vLLM and SGLang. Our model checkpoints are stored in bf16 and block-fp8 formats, and you can find them on Hugging Face.

Currently, we recommend running Step3 on vLLM or SGLang; deployment and request examples for both can be found in the Model Deployment Guide.

Contact Us
If you have any questions, please reach out at [email protected].

License
Both the code repository and the model weights are released under the Apache License (Version 2.0).
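As referenced in the note above, here is a hedged example of calling the OpenAI-compatible API with the official `openai` Python client. The base URL and the model identifier `step3` are assumptions; see https://platform.stepfun.com/ for the authoritative endpoint, model names, and API keys.

```python
# Hedged sketch of a chat completion request against the OpenAI-compatible API.
# The base_url and model name below are assumptions; consult the platform docs.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_STEPFUN_API_KEY",
    base_url="https://api.stepfun.com/v1",  # assumed endpoint
)

response = client.chat.completions.create(
    model="step3",  # assumed model identifier
    messages=[
        {"role": "user", "content": "In one sentence, what is Multi-Matrix Factorization Attention?"}
    ],
)
print(response.choices[0].message.content)
```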
GOT-OCR2_0
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang

Usage
Inference using Hugging Face transformers on NVIDIA GPUs (a minimal sketch appears at the end of this card). Requirements tested on python 3.10:

More details about 'ocr_type', 'ocr_box', 'ocr_color', and 'render' can be found at our GitHub. Our training codes are available at our GitHub.

👏 Welcome to explore more multimodal projects of our team!

If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!
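The sketch below illustrates the transformers usage mentioned above. It assumes the repository's custom code path (loaded via `trust_remote_code`) exposes a `chat` method with the `ocr_type` keyword; treat it as a sketch and follow the GitHub instructions for the authoritative invocation.

```python
# Minimal GPU inference sketch for GOT-OCR2.0 via transformers remote code.
# Assumptions: the stepfun-ai/GOT-OCR2_0 checkpoint ships a remote-code model with a
# `chat(tokenizer, image_file, ocr_type=...)` interface; see the GitHub repo for details.
from transformers import AutoModel, AutoTokenizer

repo = "stepfun-ai/GOT-OCR2_0"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map="cuda",
    use_safetensors=True,
    pad_token_id=tokenizer.eos_token_id,
).eval()

# Plain OCR of one image; formatted output, fine-grained boxes/colors and rendering
# are selected via the ocr_type / ocr_box / ocr_color / render arguments documented on GitHub.
result = model.chat(tokenizer, "your_image.png", ocr_type="ocr")
print(result)
```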
GOT-OCR-2.0-hf
---
pipeline_tag: image-text-to-text
library_name: transformers
language:
  - multilingual
tags:
  - got
  - vision-language
  - ocr2.0
license: apache-2.0
---
Step-3.5-Flash-GGUF-Q4_K_S
Step-Audio-EditX
Step3-VL-10B-Base
Step-Audio-R1.1
Step-3.5-Flash-Int4
Step-Audio-2-mini
GELab-Zero-4B-preview
Step1X-Edit-v1p2-preview
Step1X-Edit-v1p2
PaCoRe-8B
Step1X-Edit-v1p1-diffusers
Step-3.5-Flash-Base
Step-3.5-Flash-GGUF-Q8_0
Step-Audio-R1
Step1X-Edit
Step-Audio-TTS-3B
NextStep-1-Large-Pretrain
NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale

We introduce NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, trained on discrete text tokens and continuous image tokens with a next-token prediction objective. NextStep-1 achieves state-of-the-art performance among autoregressive models on text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis.

To avoid potential errors when loading and running the model, we recommend using the following settings:

If you find NextStep useful for your research and applications, please consider starring this repository and citing:
Step-3.5-Flash-Int8
Step-3.5-Flash-Base-Midtrain
Step-Audio-2-mini-Base
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.

- Advanced Speech and Audio Understanding: Promising performance in ASR and audio understanding by comprehending and reasoning over semantic, para-linguistic, and non-vocal information.
- Intelligent Speech Conversation: Achieving natural and intelligent interactions that are contextually appropriate for various conversational scenarios and paralinguistic information.
- Tool Calling and Multimodal RAG: By leveraging tool calling and RAG to access real-world knowledge (both textual and acoustic), Step-Audio 2 can generate responses with fewer hallucinations across diverse scenarios, and can switch timbres based on retrieved speech.
- State-of-the-Art Performance: Achieving state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions (see Evaluation and the Technical Report).
- Open-source: Step-Audio 2 mini and Step-Audio 2 mini Base are released under the Apache 2.0 license.

Model Download

| Models | 🤗 Hugging Face |
|-------|-------|
| Step-Audio 2 mini | stepfun-ai/Step-Audio-2-mini |
| Step-Audio 2 mini Base | stepfun-ai/Step-Audio-2-mini-Base |

Model Usage

🔧 Dependencies and Installation
- Python >= 3.10
- PyTorch >= 2.3-cu121
- CUDA Toolkit

- Both Step-Audio 2 and Step-Audio 2 mini are available in our StepFun realtime console with the web search tool enabled.
- You will need an API key from the StepFun Open Platform.
- Step-Audio 2 is also available in our StepFun AI Assistant mobile app with both web and audio search tools enabled.
- Please scan the following QR code to download it from your app store, then tap the phone icon in the top-right corner.

You can scan the following QR code to join our WeChat group for communication and discussion.

Automatic speech recognition

CER for Chinese, Cantonese, and Japanese; WER for Arabic and English. N/A indicates that the language is not supported.

| Category | Test set | Doubao LLM ASR | GPT-4o Transcribe | Kimi-Audio | Qwen-Omni | Step-Audio 2 | Step-Audio 2 mini |
|---|---|---|---|---|---|---|---|
| Multilingual | FLEURS Arabic | N/A | 11.72 | N/A | 25.13 | 14.22 | 16.46 |
| In-house | Anhui accent | 8.83 | 50.55 | 22.17 | 18.73 | 10.61 | 11.65 |
| In-house | Shanghai dialect | 47.49 | 89.58 | 82.90 | 58.74 | 17.77 | 19.30 |

Paralinguistic information understanding

StepEval-Audio-Paralinguistic

| Model | Avg. | Gender | Age | Timbre | Scenario | Event | Emotion | Pitch | Rhythm | Speed | Style | Vocal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o Audio | 43.45 | 18 | 42 | 34 | 22 | 14 | 82 | 40 | 60 | 58 | 64 | 44 |
| Step-Audio-AQAA | 36.91 | 70 | 66 | 18 | 14 | 14 | 40 | 38 | 48 | 54 | 44 | 0 |
| Step-Audio 2 | 83.09 | 100 | 96 | 82 | 78 | 60 | 86 | 82 | 86 | 88 | 88 | 68 |
| Step-Audio 2 mini | 80.00 | 100 | 94 | 80 | 78 | 60 | 82 | 82 | 68 | 74 | 86 | 76 |

Tool calling

StepEval-Audio-Toolcall. The date and time tools take no parameters. (A small illustration of the trigger precision/recall metric appears at the end of this card.)

| Model | Objective | Metric | Audio search | Date & Time | Weather | Web search |
|---|---|---|---|---|---|---|
| Qwen3-32B † | Trigger | Precision / Recall | 67.5 / 98.5 | 98.4 / 100.0 | 90.1 / 100.0 | 86.8 / 98.5 |
| Step-Audio 2 | Trigger | Precision / Recall | 86.8 / 99.5 | 96.9 / 98.4 | 92.2 / 100.0 | 88.4 / 95.5 |

Speech-to-speech conversation

URO-Bench. U., R., and O. stand for understanding, reasoning, and oral conversation, respectively.

| Model | Language | Basic Avg. | Basic U. | Basic R. | Basic O. | Pro Avg. | Pro U. | Pro R. | Pro O. |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o Audio | Chinese | 78.59 | 89.40 | 65.48 | 85.24 | 67.10 | 70.60 | 57.22 | 70.20 |
| Kimi-Audio | Chinese | 73.59 | 79.34 | 64.66 | 79.75 | 66.07 | 60.44 | 59.29 | 76.21 |
| Qwen-Omni | Chinese | 68.98 | 59.66 | 69.74 | 77.27 | 59.11 | 59.01 | 59.82 | 58.74 |
| Step-Audio-AQAA | Chinese | 74.71 | 87.61 | 59.63 | 81.93 | 65.61 | 74.76 | 47.29 | 68.97 |
| Step-Audio 2 | Chinese | 83.32 | 91.05 | 75.45 | 86.08 | 68.25 | 74.78 | 63.18 | 65.10 |
| Step-Audio 2 mini | Chinese | 77.81 | 89.19 | 64.53 | 84.12 | 69.57 | 76.84 | 58.90 | 69.42 |
| GPT-4o Audio | English | 84.54 | 90.18 | 75.90 | 90.41 | 67.51 | 60.65 | 64.36 | 78.46 |
| Kimi-Audio | English | 60.04 | 83.36 | 42.31 | 60.36 | 49.79 | 50.32 | 40.59 | 56.04 |
| Qwen-Omni | English | 70.58 | 66.29 | 69.62 | 76.16 | 50.99 | 44.51 | 63.88 | 49.41 |
| Step-Audio-AQAA | English | 71.11 | 90.15 | 56.12 | 72.06 | 52.01 | 44.25 | 54.54 | 59.81 |
| Step-Audio 2 | English | 83.90 | 92.72 | 76.51 | 84.92 | 66.07 | 64.86 | 67.75 | 66.33 |
| Step-Audio 2 mini | English | 74.36 | 90.07 | 60.12 | 77.65 | 61.25 | 58.79 | 61.94 | 63.80 |

License
The model and code in this repository are licensed under the Apache 2.0 License.
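For readers unfamiliar with the "Trigger Precision / Recall" metric in the tool-calling table above, the snippet below is a generic illustration of how such a metric can be computed from per-turn tool-invocation decisions. It is not the StepEval-Audio-Toolcall evaluation code.

```python
# Generic sketch of trigger precision/recall: compare, per dialogue turn, whether the
# model decided to invoke a tool against the ground-truth label.
def trigger_precision_recall(predicted: list[bool], reference: list[bool]):
    tp = sum(p and r for p, r in zip(predicted, reference))          # fired and should fire
    fp = sum(p and not r for p, r in zip(predicted, reference))      # fired but should not
    fn = sum((not p) and r for p, r in zip(predicted, reference))    # should fire but did not
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Example: 4 turns, the model fires on turns 0, 1 and 3; ground truth is turns 0, 1 and 2.
print(trigger_precision_recall([True, True, False, True], [True, True, True, False]))
# -> (0.666..., 0.666...)
```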
step3-fp8
📰 Step3 Model Blog | 📄 Paper

Step3 is our cutting-edge multimodal reasoning model, built on a Mixture-of-Experts architecture with 321B total parameters and 38B active. It is designed end-to-end to minimize decoding costs while delivering top-tier performance in vision–language reasoning. Through the co-design of Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD), Step3 maintains exceptional efficiency across both flagship and low-end accelerators.

| Config | Value |
|------------------------|---------|
| Number of Layers (Dense layers included) | 61 |
| Number of Dense Layers | 5 |
| Hidden Dimension | 7168 |
| Attention Mechanism | MFA |
| Low-Rank Query Dimension | 2048 |
| Number of Query Heads | 64 |
| Head Dimension | 256 |
| Number of Experts | 48 |
| Selected Experts per Token | 3 |
| Number of Shared Experts | 1 |
| Max Context Length | 65536 |
| Tokenizer | DeepSeek V3 |
| Total Parameters (LLM) | 316B |
| Activated Params per Token | 38B |
| Total Parameters (VLM) | 321B |

> [!Note]
> Step3's API is accessible at https://platform.stepfun.com/, where we offer an OpenAI-compatible API for you.

Below we describe how to run inference with the transformers library (a minimal loading sketch appears at the end of this card). We recommend python=3.10, torch>=2.1.0, and transformers=4.54.0 as the development environment. We currently only support bf16 inference, and multi-patch image preprocessing is enabled by default; this behavior is aligned with vLLM and SGLang. Our model checkpoints are stored in bf16 and block-fp8 formats, and you can find them on Hugging Face.

Currently, we recommend running Step3 on vLLM or SGLang; deployment and request examples for both can be found in the Model Deployment Guide.

Contact Us
If you have any questions, please reach out at [email protected].

License
Both the code repository and the model weights are released under the Apache License (Version 2.0).
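Below is a minimal local-inference sketch with the transformers library under the environment recommended above (python=3.10, torch>=2.1.0, transformers=4.54.0, bf16). The `AutoProcessor`/`AutoModelForCausalLM` entry points, the use of `trust_remote_code`, and the processor call shown here are assumptions; the checkpoint's own files and the Model Deployment Guide are authoritative.

```python
# Hedged sketch of local bf16 inference with transformers; the entry points used here
# are assumptions, not the confirmed interface of the checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

repo = "stepfun-ai/step3"  # bf16 checkpoint; a block-fp8 variant is also published
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,   # only bf16 inference is currently supported
    device_map="auto",
    trust_remote_code=True,
)

# Text-only prompt for brevity; image inputs (multi-patch preprocessing is enabled
# by default) would go through the same processor.
inputs = processor(text="Explain Attention-FFN Disaggregation briefly.", return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```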
Step-Audio-Chat
This repository contains the multimodal Large Language Model (LLM) component of Step-Audio. It is a 130-billion-parameter multimodal LLM responsible for understanding and generating human speech. The model is specifically designed to seamlessly integrate functions such as speech recognition, semantic understanding, dialogue management, voice cloning, and speech generation.

2. Evaluation

2.1 LLM-judge metrics (GPT-4o) on StepEval-Audio-360
Comparison of fundamental voice-chat capabilities on StepEval-Audio-360; the table reports Factuality (% ↑), Relevance (% ↑), and Chat Score (↑) for each model.
Note: Results for Moshi are marked with "\" and should be considered for reference only.

A second table reports results per model on Llama Question, Web Questions, TriviaQA, ComplexBench, and HSK-6.
Note: Results marked with "\" on the TriviaQA dataset are for reference only.

For more information, please refer to our repository: Step-Audio.
NextStep-1-Large
NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale

We introduce NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, trained on discrete text tokens and continuous image tokens with a next-token prediction objective. NextStep-1 achieves state-of-the-art performance among autoregressive models on text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis.

To avoid potential errors when loading and running the model, we recommend using the following settings:

If you find NextStep useful for your research and applications, please consider starring this repository and citing:
StepFun-Formalizer-32B
StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs through Knowledge-Reasoning Fusion

We introduce StepFun-Formalizer, a family of large language models designed to translate natural-language mathematical problems into formal statements in Lean 4. Through the fusion of formal knowledge and informal-to-formal reasoning capability, StepFun-Formalizer achieves strong performance on autoformalization tasks. Evaluated with BEq verification on mainstream benchmarks including FormalMATH-Lite, ProverBench, and CombiBench, StepFun-Formalizer matches or exceeds all prior general-purpose and specialized autoformalization models of comparable scale. Please refer to our paper and code for more details.

| Model | Download |
| -------- | -------- |
| StepFun-Formalizer-7B | 🤗 HuggingFace |
| StepFun-Formalizer-32B | 🤗 HuggingFace |

License
Both the code repository and the model weights are released under the Apache License (Version 2.0).
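To make the autoformalization task concrete, here is a hypothetical input/output pair (illustrative only, not an actual StepFun-Formalizer generation): the natural-language statement "for all natural numbers a and b, a + b = b + a" rendered as a Lean 4 theorem.

```lean
-- Illustrative autoformalization target (not a model output):
-- "For all natural numbers a and b, a + b = b + a."
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```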
NextStep-1-Large-Edit
StepFun-Prover-Preview-7B
StepFun-Prover-Preview-32B
Step-Audio-2-mini-Think
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.

- Advanced Speech and Audio Understanding: Promising performance in ASR and audio understanding by comprehending and reasoning over semantic, para-linguistic, and non-vocal information.
- Intelligent Speech Conversation: Achieving natural and intelligent interactions that are contextually appropriate for various conversational scenarios and paralinguistic information.
- Tool Calling and Multimodal RAG: By leveraging tool calling and RAG to access real-world knowledge (both textual and acoustic), Step-Audio 2 can generate responses with fewer hallucinations across diverse scenarios, and can switch timbres based on retrieved speech.
- State-of-the-Art Performance: Achieving state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions (see Evaluation and the Technical Report).
- Open-source: Step-Audio 2 mini and Step-Audio 2 mini Base are released under the Apache 2.0 license.

Model Download

| Models | 🤗 Hugging Face |
|-------|-------|
| Step-Audio 2 mini | stepfun-ai/Step-Audio-2-mini |
| Step-Audio 2 mini Base | stepfun-ai/Step-Audio-2-mini-Base |

Model Usage

🔧 Dependencies and Installation
- Python >= 3.10
- PyTorch >= 2.3-cu121
- CUDA Toolkit

- Both Step-Audio 2 and Step-Audio 2 mini are available in our StepFun realtime console with the web search tool enabled.
- You will need an API key from the StepFun Open Platform.
- Step-Audio 2 is also available in our StepFun AI Assistant mobile app with both web and audio search tools enabled.
- Please scan the following QR code to download it from your app store, then tap the phone icon in the top-right corner.

You can scan the following QR code to join our WeChat group for communication and discussion.

Automatic speech recognition

CER for Chinese, Cantonese, and Japanese; WER for Arabic and English. N/A indicates that the language is not supported.

| Category | Test set | Doubao LLM ASR | GPT-4o Transcribe | Kimi-Audio | Qwen-Omni | Step-Audio 2 | Step-Audio 2 mini |
|---|---|---|---|---|---|---|---|
| Multilingual | FLEURS Arabic | N/A | 11.72 | N/A | 25.13 | 14.22 | 16.46 |
| In-house | Anhui accent | 8.83 | 50.55 | 22.17 | 18.73 | 10.61 | 11.65 |
| In-house | Shanghai dialect | 47.49 | 89.58 | 82.90 | 58.74 | 17.77 | 19.30 |

Paralinguistic information understanding

StepEval-Audio-Paralinguistic

| Model | Avg. | Gender | Age | Timbre | Scenario | Event | Emotion | Pitch | Rhythm | Speed | Style | Vocal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o Audio | 43.45 | 18 | 42 | 34 | 22 | 14 | 82 | 40 | 60 | 58 | 64 | 44 |
| Step-Audio-AQAA | 36.91 | 70 | 66 | 18 | 14 | 14 | 40 | 38 | 48 | 54 | 44 | 0 |
| Step-Audio 2 | 83.09 | 100 | 96 | 82 | 78 | 60 | 86 | 82 | 86 | 88 | 88 | 68 |
| Step-Audio 2 mini | 80.00 | 100 | 94 | 80 | 78 | 60 | 82 | 82 | 68 | 74 | 86 | 76 |

Tool calling

StepEval-Audio-Toolcall. The date and time tools take no parameters.

| Model | Objective | Metric | Audio search | Date & Time | Weather | Web search |
|---|---|---|---|---|---|---|
| Qwen3-32B † | Trigger | Precision / Recall | 67.5 / 98.5 | 98.4 / 100.0 | 90.1 / 100.0 | 86.8 / 98.5 |
| Step-Audio 2 | Trigger | Precision / Recall | 86.8 / 99.5 | 96.9 / 98.4 | 92.2 / 100.0 | 88.4 / 95.5 |

Speech-to-speech conversation

URO-Bench. U., R., and O. stand for understanding, reasoning, and oral conversation, respectively.

| Model | Language | Basic Avg. | Basic U. | Basic R. | Basic O. | Pro Avg. | Pro U. | Pro R. | Pro O. |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o Audio | Chinese | 78.59 | 89.40 | 65.48 | 85.24 | 67.10 | 70.60 | 57.22 | 70.20 |
| Kimi-Audio | Chinese | 73.59 | 79.34 | 64.66 | 79.75 | 66.07 | 60.44 | 59.29 | 76.21 |
| Qwen-Omni | Chinese | 68.98 | 59.66 | 69.74 | 77.27 | 59.11 | 59.01 | 59.82 | 58.74 |
| Step-Audio-AQAA | Chinese | 74.71 | 87.61 | 59.63 | 81.93 | 65.61 | 74.76 | 47.29 | 68.97 |
| Step-Audio 2 | Chinese | 83.32 | 91.05 | 75.45 | 86.08 | 68.25 | 74.78 | 63.18 | 65.10 |
| Step-Audio 2 mini | Chinese | 77.81 | 89.19 | 64.53 | 84.12 | 69.57 | 76.84 | 58.90 | 69.42 |
| GPT-4o Audio | English | 84.54 | 90.18 | 75.90 | 90.41 | 67.51 | 60.65 | 64.36 | 78.46 |
| Kimi-Audio | English | 60.04 | 83.36 | 42.31 | 60.36 | 49.79 | 50.32 | 40.59 | 56.04 |
| Qwen-Omni | English | 70.58 | 66.29 | 69.62 | 76.16 | 50.99 | 44.51 | 63.88 | 49.41 |
| Step-Audio-AQAA | English | 71.11 | 90.15 | 56.12 | 72.06 | 52.01 | 44.25 | 54.54 | 59.81 |
| Step-Audio 2 | English | 83.90 | 92.72 | 76.51 | 84.92 | 66.07 | 64.86 | 67.75 | 66.33 |
| Step-Audio 2 mini | English | 74.36 | 90.07 | 60.12 | 77.65 | 61.25 | 58.79 | 61.94 | 63.80 |

License
The model and code in this repository are licensed under the Apache 2.0 License.
Qwen2.5-32B-DialogueReason
RLVR-8B-0926
StepFun Formalizer 7B
stepvideo-t2v
🔥🔥🔥 News!!
- Feb 17, 2025: 👋 We release the inference code and model weights of Step-Video-T2V. Download
- Feb 17, 2025: 👋 We release the inference code and model weights of Step-Video-T2V-Turbo. Download
- Feb 17, 2025: 🎉 We have made our technical report available as open source. Read

1. Introduction
2. Model Summary
3. Model Download
4. Model Usage
5. Benchmark
6. Online Engine
7. Citation
8. Acknowledgement

1. Introduction
We present Step-Video-T2V, a state-of-the-art (SoTA) text-to-video pre-trained model with 30 billion parameters and the capability to generate videos up to 204 frames. To enhance both training and inference efficiency, we propose a deep compression VAE for videos, achieving 16x16 spatial and 8x temporal compression ratios. Direct Preference Optimization (DPO) is applied in the final stage to further enhance the visual quality of the generated videos. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its SoTA text-to-video quality compared to both open-source and commercial engines.

2. Model Summary
In Step-Video-T2V, videos are represented by a high-compression Video-VAE, achieving 16x16 spatial and 8x temporal compression ratios. User prompts are encoded using two bilingual pre-trained text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames, with text embeddings and timesteps serving as conditioning factors. To further enhance the visual quality of the generated videos, a video-based DPO approach is applied, which effectively reduces artifacts and ensures smoother, more realistic video outputs.

2.1. Video-VAE
A deep compression Variational Autoencoder (Video-VAE) is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios while maintaining exceptional video reconstruction quality. This compression not only accelerates training and inference but also aligns with the diffusion process's preference for condensed representations.

2.2. DiT w/ 3D Full Attention
Step-Video-T2V is built on the DiT architecture, which has 48 layers, each containing 48 attention heads, with each head's dimension set to 128. AdaLN-Single is leveraged to incorporate the timestep condition, while QK-Norm in the self-attention mechanism is introduced to ensure training stability. Additionally, 3D RoPE is employed, playing a critical role in handling sequences of varying video lengths and resolutions.

2.3. Video-DPO
In Step-Video-T2V, we incorporate human feedback through Direct Preference Optimization (DPO) to further enhance the visual quality of the generated videos. DPO leverages human preference data to fine-tune the model, ensuring that the generated content aligns more closely with human expectations. The overall DPO pipeline is shown below, highlighting its critical role in improving both the consistency and quality of the video generation process.
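To make the compression ratios in Sections 2 and 2.1 concrete, the sketch below estimates the latent grid produced by 16x16 spatial and 8x temporal compression for the 544x992, 204-frame setting used later in this card. The exact temporal handling (padding, boundary frames) depends on the VAE implementation, so treat the latent frame count as approximate.

```python
# Rough latent-shape arithmetic for the Video-VAE compression ratios described above.
# This illustrates the 16x16 spatial / 8x temporal ratios only; it is not the actual
# implementation, and real padding or boundary-frame handling may shift the counts.
import math

def approx_latent_shape(num_frames: int, height: int, width: int,
                        t_ratio: int = 8, s_ratio: int = 16) -> tuple[int, int, int]:
    return (math.ceil(num_frames / t_ratio), height // s_ratio, width // s_ratio)

# 204 frames at 544x992 -> roughly a 26 x 34 x 62 latent grid.
print(approx_latent_shape(204, 544, 992))  # (26, 34, 62)
```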
3. Model Download

| Models | 🤗 Huggingface | 🤖 Modelscope |
|:-------:|:-------:|:-------:|
| Step-Video-T2V | download | download |
| Step-Video-T2V-Turbo (Inference Step Distillation) | download | download |

The following table shows the requirements for running the Step-Video-T2V model (batch size = 1, w/o cfg distillation) to generate videos:

| Model | height/width/frames | Peak GPU Memory | 50 steps w/ flash-attn | 50 steps w/o flash-attn |
|:------------:|:------------:|:------------:|:------------:|:------------:|
| Step-Video-T2V | 544px x 992px x 204f | 77.64 GB | 743 s | 1232 s |
| Step-Video-T2V | 544px x 992px x 136f | 72.48 GB | 408 s | 605 s |

4. Model Usage
- An NVIDIA GPU with CUDA support is required. The model is tested on four GPUs.
- Recommended: we recommend GPUs with 80 GB of memory for better generation quality.
- Tested operating system: Linux
- The self-attention in the text encoder (step_llm) only supports CUDA capabilities sm_80, sm_86, and sm_90.

🔧 4.2 Dependencies and Installation
- Python >= 3.10.0 (Anaconda or Miniconda recommended)
- PyTorch >= 2.3-cu121
- CUDA Toolkit
- FFmpeg

🚀 4.3 Inference Scripts
- We employed a decoupling strategy for the text encoder, VAE decoding, and DiT to optimize GPU resource utilization by the DiT. As a result, a dedicated GPU is needed to handle the API services for the text encoder's embeddings and VAE decoding.

🚀 4.4 Best-of-Practice Inference Settings
Step-Video-T2V exhibits robust performance in inference settings, consistently generating high-fidelity and dynamic videos. However, our experiments reveal that variations in inference hyperparameters can have a substantial effect on the trade-off between video fidelity and dynamics. To achieve optimal results, we recommend the following best practices for tuning inference parameters:

| Models | infer_steps | cfg_scale | time_shift | num_frames |
|:-------:|:-------:|:-------:|:-------:|:-------:|
| Step-Video-T2V | 30-50 | 9.0 | 13.0 | 204 |
| Step-Video-T2V-Turbo (Inference Step Distillation) | 10-15 | 5.0 | 17.0 | 204 |

5. Benchmark
We are releasing Step-Video-T2V-Eval as a new benchmark, featuring 128 Chinese prompts sourced from real users. This benchmark is designed to evaluate the quality of generated videos across 11 distinct categories: Sports, Food, Scenery, Animals, Festivals, Combination Concepts, Surreal, People, 3D Animation, Cinematography, and Style.

6. Online Engine
The online version of Step-Video-T2V is available on 跃问视频 (Yuewen Video), where you can also explore some impressive examples.

8. Acknowledgement
- We would like to express our sincere thanks to the xDiT team for their invaluable support and parallelization strategy.
- Our code will be integrated into the official repository of Hugging Face Diffusers.
- We thank the FastVideo team for their continued collaboration and look forward to launching inference acceleration solutions together in the near future.
NextStep-1-f8ch16-Tokenizer
Step-Audio-AQAA
Step-Audio-AQAA: A Fully End-to-End Expressive Large Audio Language Model

📚 Paper: Step-Audio-AQAA: A Fully End-to-End Expressive Large Audio Language Model
🚀 Live Demo: https://www.stepfun.com/docs/zh/step-audio-aqaa?studiocode=step-audio-aqaa&studioid=121368403356246016&studiotype=1

Model Overview
Step-Audio-AQAA is a fully end-to-end Large Audio-Language Model (LALM) designed for Audio Query-Audio Answer (AQAA) tasks. It directly processes audio inputs and generates natural, accurate speech responses without relying on traditional ASR and TTS modules, eliminating cascading errors and simplifying the system architecture.

Key Capabilities
- Fully End-to-End Audio Interaction: Generates speech outputs directly from raw audio inputs without ASR/TTS intermediates.
- Fine-Grained Voice Control: Supports sentence-level adjustments of emotional tone, speech rate, and other vocal features.
- Multilingual & Dialect Support: Covers Chinese (including Sichuanese and Cantonese), English, Japanese, etc.
- Complex Task Handling: Excels in speech emotion control, role-playing, logical reasoning, and other complex audio interactions.

Model Architecture
Step-Audio-AQAA consists of three core modules:

Dual-Codebook Audio Tokenizer
- Linguistic Tokenizer: Based on the Paraformer encoder; extracts phonemic and linguistic attributes with a 1,024-entry codebook at 16.7 Hz.
- Semantic Tokenizer: References CosyVoice 1.0; captures acoustic features with a 4,096-entry codebook at 25 Hz.
- Temporal Alignment: Uses a 2:3 interleaving ratio to ensure temporal consistency between the two token types (an illustrative interleaving sketch appears at the end of this card).

Backbone LLM
- Parameter Scale: 130-billion-parameter multi-modal LLM (Step-Omni).
- Architecture: Decoder-only with Transformer blocks, RMSNorm layers, and grouped-query attention.
- Vocabulary Expansion: Incorporates 5,120 audio tokens into the text vocabulary for text-audio interleaved output.

Neural Vocoder
- Architecture: Flow-matching model based on CosyVoice, using U-Net and ResNet-1D layers.
- Conditional Generation: Generates high-fidelity speech waveforms conditioned solely on audio tokens.

Training Approach

Multi-Stage Training Pipeline
1. Pretraining: Multi-modal pretraining on text, audio, and image data.
2. Supervised Fine-Tuning (SFT):
   - Stage 1: Full-parameter update on AQTA and AQTAA datasets.
   - Stage 2: Optimizes specific capabilities with high-quality AQTAA data.
3. Direct Preference Optimization (DPO): Uses audio token masking to avoid degradation of speech generation.
4. Model Merging: Weighted combination of SFT and DPO models to enhance overall performance.

Training Data
- Multi-Modal Pretraining Data: 800 billion text tokens and audio-text interleaved data.
- AQTA Dataset: Audio query-text answer pairs.
- AQTAA Dataset: Audio query-text answer-audio answer triplets generated from AQTA.

Team & Contributions
Step-Audio-AQAA is developed by the StepFun team, with contributions from multiple researchers and engineers. For technical support or collaboration, contact the corresponding authors: Daxin Jiang ([email protected]), Shuchang Zhou ([email protected]), Chen Hu ([email protected]).

License
This model is released under the Apache 2.0 license. For more details, please refer to the license file.
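As a concrete illustration of the 2:3 temporal alignment described in the Dual-Codebook Audio Tokenizer section above, the sketch below interleaves two dummy token streams at that ratio. The within-group ordering (linguistic tokens before semantic tokens) is an assumption made for illustration; the actual tokenizer code is authoritative.

```python
# Illustrative sketch (not the actual tokenizer code) of 2:3 interleaving: for every
# 2 linguistic tokens (16.7 Hz stream) the output carries 3 semantic tokens (25 Hz
# stream), keeping the two token types temporally aligned.
def interleave_2_3(linguistic: list[int], semantic: list[int]) -> list[int]:
    out = []
    li, si = 0, 0
    while li < len(linguistic) or si < len(semantic):
        out.extend(linguistic[li:li + 2])  # 2 linguistic tokens ...
        out.extend(semantic[si:si + 3])    # ... followed by 3 semantic tokens
        li += 2
        si += 3
    return out

# Example with small dummy codebook ids:
print(interleave_2_3([1, 2, 3, 4], [101, 102, 103, 104, 105, 106]))
# -> [1, 2, 101, 102, 103, 3, 4, 104, 105, 106]
```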
stepvideo-ti2v
Step-Audio-EditX-AWQ-4bit
Step1X-3D
Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets

Step1X-3D demonstrates the capability to generate 3D assets with high-fidelity geometry and versatile texture maps, while maintaining exceptional alignment between surface geometry and texture mapping. From left to right, we sequentially present: the base geometry (untextured), followed by cartoon-style, sketch-style, and photorealistic 3D asset generation results.

🔥🔥🔥 Latest News!!
- May 13, 2025: 👋 The Step1X-3D online demo is available on Hugging Face; enjoy yourself with the generated 3D assets! Huggingface web live
- May 13, 2025: 👋 We release the 800K UIDs of high-quality 3D assets (excluding self-collected assets) obtained with our rigorous data curation pipeline, for training both 3D geometry generation and texture synthesis. Huggingface dataset
- May 13, 2025: 👋 We have also released the training code for both Step1X-3D geometry generation and texture synthesis.
- May 13, 2025: 👋 We have released the inference code and model weights of Step1X-3D geometry and Step1X-3D texture.
- May 13, 2025: 👋 We have released the Step1X-3D technical report as open source.

Introduction
While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline processing >5M assets to create a 2M high-quality dataset with standardized geometric and textural properties; (2) a two-stage 3D-native architecture combining a hybrid VAE-DiT geometry generator with an SD-XL-based texture synthesis module; and (3) the full open-source release of models, training code, and adaptation modules. For geometry generation, the hybrid VAE-DiT component produces watertight TSDF representations by employing perceiver-based latent encoding with sharp edge sampling for detail preservation. The SD-XL-based texture synthesis module then ensures cross-view consistency through geometric conditioning and latent-space synchronization. Benchmark results demonstrate state-of-the-art performance that exceeds existing open-source methods, while also achieving competitive quality with proprietary solutions. Notably, the framework uniquely bridges 2D and 3D generation paradigms by supporting direct transfer of 2D control techniques (e.g., LoRA) to 3D synthesis. By simultaneously advancing data quality, algorithmic fidelity, and reproducibility, Step1X-3D aims to establish new standards for open research in controllable 3D asset generation.

Citation
If you find our work helpful, please cite us.