stepfun-ai

46 models

Step-3.5-Flash-FP8

license:apache-2.0
168,307
48

Step-3.5-Flash

license:apache-2.0
83,765
722

Step3-VL-10B

license:apache-2.0
69,634
379

step3

📰 Step3 Model Blog | 📄 Step3 System Blog

Step3 is our cutting-edge multimodal reasoning model, built on a Mixture-of-Experts architecture with 321B total parameters and 38B active. It is designed end-to-end to minimize decoding costs while delivering top-tier performance in vision–language reasoning. Through the co-design of Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD), Step3 maintains exceptional efficiency across both flagship and low-end accelerators.

| Config | Value |
|------------------------|---------|
| Number of Layers (dense layers included) | 61 |
| Number of Dense Layers | 5 |
| Hidden Dimension | 7168 |
| Attention Mechanism | MFA |
| Low-rank Query Dimension | 2048 |
| Number of Query Heads | 64 |
| Head Dimension | 256 |
| Number of Experts | 48 |
| Selected Experts per Token | 3 |
| Number of Shared Experts | 1 |
| Max Context Length | 65536 |
| Tokenizer | DeepSeek V3 |
| Total Parameters (LLM) | 316B |
| Activated Params per Token | 38B |
| Total Parameters (VLM) | 321B |

> [!Note]
> Step3's API is accessible at https://platform.stepfun.com/, where we offer an OpenAI-compatible API.

We describe how to run inference with the transformers library. We recommend python=3.10, torch>=2.1.0, and transformers=4.54.0 as the development environment. Only bf16 inference is currently supported, and multi-patch image preprocessing is enabled by default; this behavior is aligned with vLLM and SGLang. Our model checkpoints are stored in bf16 and block-fp8 format and can be found on Hugging Face. Currently, we recommend running Step3 on vLLM or SGLang; deployment and request examples can be found in the Model Deployment Guide.

Contact Us: If you have any questions, please reach out at [email protected].

License: Both the code repository and the model weights are released under the Apache License (Version 2.0).
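The config table makes the MFA parameter savings easy to check with back-of-envelope arithmetic: a full-rank query projection from the 7168-dim hidden state directly to 64 heads × 256 dims needs far more parameters than factoring it through the 2048-dim low-rank query bottleneck. A minimal sketch (illustrative arithmetic only; the exact MFA parameterization in Step3 may differ):

```python
# Dimensions from the Step3 config table above.
hidden = 7168
q_rank = 2048            # low-rank query dimension
num_heads, head_dim = 64, 256
q_out = num_heads * head_dim  # 16384

# Full-rank query projection vs. the two-matrix factorization.
dense_params = hidden * q_out
factored_params = hidden * q_rank + q_rank * q_out

print(dense_params, factored_params)  # factorization uses ~41% of the dense count
```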

license:apache-2.0
47,147
166

GOT-OCR2_0

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang

Usage: inference with Hugging Face transformers on NVIDIA GPUs; requirements tested on python 3.10. More details about `ocr_type`, `ocr_box`, `ocr_color`, and `render` can be found on our GitHub, where our training code is also available.

👏 Welcome to explore more multimodal projects of our team. If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!
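A minimal usage sketch following the card's transformers instructions. The helper below only assembles the documented `chat` arguments (`ocr_type`, `render`); the actual model load, which needs a CUDA GPU and the downloaded weights, is left commented out, and the repo id is assumed from this listing:

```python
# Hedged sketch of GOT-OCR2.0 inference via transformers (GPU + weights required).
OCR_TYPES = ["ocr", "format"]  # plain-text vs. formatted (markdown/LaTeX) output

def build_chat_args(image_file: str, ocr_type: str = "ocr", render: bool = False) -> dict:
    """Collect keyword arguments for model.chat(tokenizer, image_file, ...)."""
    if ocr_type not in OCR_TYPES:
        raise ValueError(f"ocr_type must be one of {OCR_TYPES}")
    return {"image_file": image_file, "ocr_type": ocr_type, "render": render}

args = build_chat_args("sample.png", ocr_type="format")
print(args)

# Actual inference (uncomment on a CUDA machine):
# from transformers import AutoModel, AutoTokenizer
# tok = AutoTokenizer.from_pretrained("stepfun-ai/GOT-OCR2_0", trust_remote_code=True)
# model = AutoModel.from_pretrained("stepfun-ai/GOT-OCR2_0", trust_remote_code=True,
#                                   low_cpu_mem_usage=True, device_map="cuda").eval()
# result = model.chat(tok, **args)
```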

license:apache-2.0
42,550
1,522

GOT-OCR-2.0-hf

---
pipeline_tag: image-text-to-text
library_name: transformers
language:
- multilingual
tags:
- got
- vision-language
- ocr2.0
license: apache-2.0
---

license:apache-2.0
40,756
218

Step-3.5-Flash-GGUF-Q4_K_S

license:apache-2.0
11,065
128

Step-Audio-EditX

10,535
111

Step3-VL-10B-Base

license:apache-2.0
8,863
44

Step-Audio-R1.1

license:apache-2.0
7,484
147

Step-3.5-Flash-Int4

license:apache-2.0
6,054
110

Step-Audio-2-mini

license:apache-2.0
2,127
247

GELab-Zero-4B-preview

license:apache-2.0
1,385
123

Step1X-Edit-v1p2-preview

license:apache-2.0
856
16

Step1X-Edit-v1p2

license:apache-2.0
719
53

PaCoRe-8B

license:mit
540
39

Step1X-Edit-v1p1-diffusers

license:apache-2.0
501
3

Step-3.5-Flash-Base

license:apache-2.0
462
72

Step-3.5-Flash-GGUF-Q8_0

license:apache-2.0
325
3

Step-Audio-R1

license:apache-2.0
265
100

Step1X-Edit

license:apache-2.0
164
322

Step-Audio-TTS-3B

license:apache-2.0
154
191

NextStep-1-Large-Pretrain

NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale

We introduce NextStep-1, a 14B autoregressive model paired with a 157M flow-matching head, trained on discrete text tokens and continuous image tokens with a next-token prediction objective. NextStep-1 achieves state-of-the-art performance among autoregressive models on text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. To avoid potential errors when loading and running the model, please follow the recommended settings in the repository. If you find NextStep useful for your research and applications, please consider starring this repository and citing our paper.
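The card's one-line training recipe — cross-entropy on discrete text tokens, a flow-matching head on continuous image tokens, one next-token objective over the interleaved sequence — can be illustrated with a toy computation. Everything below is made up for illustration (tiny vocab, MSE as a crude stand-in for the flow-matching loss); it is not the NextStep-1 training code:

```python
import math

# Toy interleaved sequence: ("text", token_id) pairs get cross-entropy over a
# 3-word vocab; ("image", vector) pairs get a squared-error surrogate loss.
sequence = [("text", 2), ("image", [0.1, -0.3]), ("text", 0)]

def text_loss(logits, target):
    """Cross-entropy of the target id under softmax(logits)."""
    z = max(logits)
    log_norm = z + math.log(sum(math.exp(x - z) for x in logits))
    return log_norm - logits[target]

def image_loss(pred, target):
    """MSE stand-in for the continuous-token (flow-matching) objective."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

preds = [[0.2, 0.1, 1.5], [0.0, -0.2], [1.0, 0.3, -0.1]]
total = 0.0
for (kind, tgt), pred in zip(sequence, preds):
    total += text_loss(pred, tgt) if kind == "text" else image_loss(pred, tgt)
print(round(total, 4))  # one scalar loss over both token types
```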

license:apache-2.0
151
16

Step-3.5-Flash-Int8

license:apache-2.0
135
2

Step-3.5-Flash-Base-Midtrain

license:apache-2.0
126
34

Step-Audio-2-mini-Base

Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.

- Advanced Speech and Audio Understanding: promising performance in ASR and audio understanding by comprehending and reasoning over semantic, paralinguistic, and non-vocal information.
- Intelligent Speech Conversation: natural and intelligent interactions that are contextually appropriate for various conversational scenarios and paralinguistic information.
- Tool Calling and Multimodal RAG: by leveraging tool calling and RAG to access real-world knowledge (both textual and acoustic), Step-Audio 2 generates responses with fewer hallucinations for diverse scenarios, and can switch timbres based on retrieved speech.
- State-of-the-Art Performance: state-of-the-art results on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions (see Evaluation and the Technical Report).
- Open-source: Step-Audio 2 mini and Step-Audio 2 mini Base are released under the Apache 2.0 license.

Model Download

| Models | 🤗 Hugging Face |
|--------|-----------------|
| Step-Audio 2 mini | stepfun-ai/Step-Audio-2-mini |
| Step-Audio 2 mini Base | stepfun-ai/Step-Audio-2-mini-Base |

Model Usage

🔧 Dependencies and Installation
- Python >= 3.10
- PyTorch >= 2.3-cu121
- CUDA Toolkit

- Both Step-Audio 2 and Step-Audio 2 mini are available in our StepFun realtime console with the web search tool enabled. You will need an API key from the StepFun Open Platform.
- Step-Audio 2 is also available in our StepFun AI Assistant mobile app with both web and audio search tools enabled. Please scan the QR code to download it from your app store, then tap the phone icon in the top-right corner.

You can scan the QR code to join our WeChat group for communication and discussion.

Evaluation

Automatic speech recognition (CER for Chinese, Cantonese, and Japanese; WER for Arabic and English; N/A indicates the language is not supported):

| Category | Test set | Doubao LLM ASR | GPT-4o Transcribe | Kimi-Audio | Qwen-Omni | Step-Audio 2 | Step-Audio 2 mini |
|----------|----------|----------------|-------------------|------------|-----------|--------------|-------------------|
| Multilingual | FLEURS Arabic | N/A | 11.72 | N/A | 25.13 | 14.22 | 16.46 |
| In-house | Anhui accent | 8.83 | 50.55 | 22.17 | 18.73 | 10.61 | 11.65 |
| In-house | Shanghai dialect | 47.49 | 89.58 | 82.90 | 58.74 | 17.77 | 19.30 |

Paralinguistic information understanding (StepEval-Audio-Paralinguistic):

| Model | Avg. | Gender | Age | Timbre | Scenario | Event | Emotion | Pitch | Rhythm | Speed | Style | Vocal |
|-------|------|--------|-----|--------|----------|-------|---------|-------|--------|-------|-------|-------|
| GPT-4o Audio | 43.45 | 18 | 42 | 34 | 22 | 14 | 82 | 40 | 60 | 58 | 64 | 44 |
| Step-Audio-AQAA | 36.91 | 70 | 66 | 18 | 14 | 14 | 40 | 38 | 48 | 54 | 44 | 0 |
| Step-Audio 2 | 83.09 | 100 | 96 | 82 | 78 | 60 | 86 | 82 | 86 | 88 | 88 | 68 |
| Step-Audio 2 mini | 80.00 | 100 | 94 | 80 | 78 | 60 | 82 | 82 | 68 | 74 | 86 | 76 |

Tool calling (StepEval-Audio-Toolcall; the date & time tool takes no parameters):

| Model | Metric | Audio search | Date & Time | Weather | Web search |
|-------|--------|--------------|-------------|---------|------------|
| Qwen3-32B † | Trigger Precision / Recall | 67.5 / 98.5 | 98.4 / 100.0 | 90.1 / 100.0 | 86.8 / 98.5 |
| Step-Audio 2 | Trigger Precision / Recall | 86.8 / 99.5 | 96.9 / 98.4 | 92.2 / 100.0 | 88.4 / 95.5 |

Speech-to-speech conversation (URO-Bench; U., R., and O. stand for understanding, reasoning, and oral conversation; per-model scores listed in the original column order):

| Model | Language | Scores |
|-------|----------|--------|
| GPT-4o Audio | Chinese | 78.59, 89.40, 65.48, 85.24, 67.10, 70.60, 57.22, 70.20 |
| Kimi-Audio | Chinese | 73.59, 79.34, 64.66, 79.75, 66.07, 60.44, 59.29, 76.21 |
| Qwen-Omni | Chinese | 68.98, 59.66, 69.74, 77.27, 59.11, 59.01, 59.82, 58.74 |
| Step-Audio-AQAA | Chinese | 74.71, 87.61, 59.63, 81.93, 65.61, 74.76, 47.29, 68.97 |
| Step-Audio 2 | Chinese | 83.32, 91.05, 75.45, 86.08, 68.25, 74.78, 63.18, 65.10 |
| Step-Audio 2 mini | Chinese | 77.81, 89.19, 64.53, 84.12, 69.57, 76.84, 58.90, 69.42 |
| GPT-4o Audio | English | 84.54, 90.18, 75.90, 90.41, 67.51, 60.65, 64.36, 78.46 |
| Kimi-Audio | English | 60.04, 83.36, 42.31, 60.36, 49.79, 50.32, 40.59, 56.04 |
| Qwen-Omni | English | 70.58, 66.29, 69.62, 76.16, 50.99, 44.51, 63.88, 49.41 |
| Step-Audio-AQAA | English | 71.11, 90.15, 56.12, 72.06, 52.01, 44.25, 54.54, 59.81 |
| Step-Audio 2 | English | 83.90, 92.72, 76.51, 84.92, 66.07, 64.86, 67.75, 66.33 |
| Step-Audio 2 mini | English | 74.36, 90.07, 60.12, 77.65, 61.25, 58.79, 61.94, 63.80 |

The model and code in this repository are licensed under the Apache 2.0 License.
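Given the trigger precision/recall pairs in the tool-calling table, a single F1 figure is sometimes handier for comparing models. The table reports precision and recall only; F1 below is our derived convenience metric:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Trigger scores from the StepEval-Audio-Toolcall table (Step-Audio 2, audio search)
print(round(f1(86.8, 99.5), 2))  # → 92.72
```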

license:apache-2.0
120
22

step3-fp8

📰 Step3 Model Blog | 📄 Paper

Block-FP8 checkpoint of Step3, our cutting-edge multimodal reasoning model built on a Mixture-of-Experts architecture with 321B total parameters and 38B active, co-designing Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD) to minimize decoding costs while delivering top-tier vision–language reasoning. The architecture configuration, transformers usage notes, contact information ([email protected]), and Apache 2.0 license are identical to the step3 model card above.

license:apache-2.0
118
20

Step-Audio-Chat

This repository contains the Multimodal Large Language Model (LLM) component of Step-Audio: a 130-billion-parameter multimodal LLM responsible for understanding and generating human speech. The model is specifically designed to seamlessly integrate functions such as speech recognition, semantic understanding, dialogue management, voice cloning, and speech generation.

Evaluation: LLM-judge metrics (GPT-4o) on StepEval-Audio-360 compare the fundamental voice-chat capabilities of models in terms of Factuality (%), Relevance (%), and Chat Score, alongside results on Llama Question, Web Questions, TriviaQA, ComplexBench, and HSK-6. Results marked with "\" (Moshi, and some TriviaQA entries) are for reference only. For more information, please refer to our repository: Step-Audio.

license:apache-2.0
113
458

NextStep-1-Large

NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale

We introduce NextStep-1, a 14B autoregressive model paired with a 157M flow-matching head, trained on discrete text tokens and continuous image tokens with a next-token prediction objective. NextStep-1 achieves state-of-the-art performance among autoregressive models on text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. To avoid potential errors when loading and running the model, please follow the recommended settings in the repository. If you find NextStep useful for your research and applications, please consider starring this repository and citing our paper.

license:apache-2.0
108
94

StepFun-Formalizer-32B

StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs through Knowledge-Reasoning Fusion

We introduce StepFun-Formalizer, a family of large language models designed to translate natural-language mathematical problems into formal statements in Lean 4. Through the fusion of formal knowledge and informal-to-formal reasoning capability, StepFun-Formalizer achieves strong performance on autoformalization tasks. Evaluated with BEq verification on mainstream benchmarks including FormalMATH-Lite, ProverBench, and CombiBench, StepFun-Formalizer matches or exceeds all prior general-purpose and specialized autoformalization models of comparable scale. Please refer to our paper and code for more details.

| Model | Download |
| -------- | -------- |
| StepFun-Formalizer-7B | 🤗 HuggingFace |
| StepFun-Formalizer-32B | 🤗 HuggingFace |

License: Both the code repository and the model weights are released under the Apache License (Version 2.0).
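As an illustration of the task (our own toy example, not drawn from the paper or its benchmarks), an autoformalizer maps an informal statement to a Lean 4 theorem such as:

```lean
-- Informal: "The sum of two even natural numbers is even."
-- A Mathlib-style Lean 4 formalization the model might be expected to emit
-- (autoformalization targets the statement; the proof term is for completeness):
theorem sum_of_evens_is_even (a b : ℕ) (ha : Even a) (hb : Even b) :
    Even (a + b) :=
  Even.add ha hb
```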

license:apache-2.0
108
9

NextStep-1-Large-Edit

license:apache-2.0
100
47

StepFun-Prover-Preview-7B

license:apache-2.0
96
3

StepFun-Prover-Preview-32B

license:apache-2.0
93
11

Step-Audio-2-mini-Think

Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. This model card, including the feature list, download links, dependencies, evaluation tables (ASR, paralinguistic understanding, tool calling, and URO-Bench speech-to-speech results), and the Apache 2.0 license, is identical to the Step-Audio-2-mini-Base card above.

license:apache-2.0
78
13

Qwen2.5-32B-DialogueReason

license:apache-2.0
75
12

RLVR-8B-0926

license:mit
55
6

StepFun Formalizer 7B

license:apache-2.0
50
5

stepvideo-t2v

🔥🔥🔥 News!!
- Feb 17, 2025: 👋 We release the inference code and model weights of Step-Video-T2V. Download
- Feb 17, 2025: 👋 We release the inference code and model weights of Step-Video-T2V-Turbo. Download
- Feb 17, 2025: 🎉 We have made our technical report available as open source. Read

Contents: 1. Introduction · 2. Model Summary · 3. Model Download · 4. Model Usage · 5. Benchmark · 6. Online Engine · 7. Citation · 8. Acknowledgement

1. Introduction

We present Step-Video-T2V, a state-of-the-art (SoTA) text-to-video pre-trained model with 30 billion parameters and the capability to generate videos up to 204 frames. To enhance both training and inference efficiency, we propose a deep-compression VAE for videos, achieving 16x16 spatial and 8x temporal compression ratios. Direct Preference Optimization (DPO) is applied in the final stage to further enhance the visual quality of the generated videos. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its SoTA text-to-video quality compared to both open-source and commercial engines.

2. Model Summary

In Step-Video-T2V, videos are represented by a high-compression Video-VAE, achieving 16x16 spatial and 8x temporal compression ratios. User prompts are encoded using two bilingual pre-trained text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames, with text embeddings and timesteps serving as conditioning factors. To further enhance the visual quality of the generated videos, a video-based DPO approach is applied, which effectively reduces artifacts and ensures smoother, more realistic video outputs.

2.1. Video-VAE

A deep-compression Variational Autoencoder (Video-VAE) is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios while maintaining exceptional video reconstruction quality. This compression not only accelerates training and inference but also aligns with the diffusion process's preference for condensed representations.

2.2. DiT w/ 3D Full Attention

Step-Video-T2V is built on the DiT architecture, which has 48 layers, each containing 48 attention heads, with each head's dimension set to 128. AdaLN-Single is leveraged to incorporate the timestep condition, while QK-Norm in the self-attention mechanism is introduced to ensure training stability. Additionally, 3D RoPE is employed, playing a critical role in handling sequences of varying video lengths and resolutions.

2.3. Video-DPO

In Step-Video-T2V, we incorporate human feedback through Direct Preference Optimization (DPO) to further enhance the visual quality of the generated videos. DPO leverages human preference data to fine-tune the model, ensuring that the generated content aligns more closely with human expectations. The overall DPO pipeline is shown below, highlighting its critical role in improving both the consistency and quality of the video generation process.

3. Model Download

| Models | 🤗 Huggingface | 🤖 Modelscope |
|:-------:|:-------:|:-------:|
| Step-Video-T2V | download | download |
| Step-Video-T2V-Turbo (Inference Step Distillation) | download | download |

4. Model Usage

The following table shows the requirements for running the Step-Video-T2V model (batch size = 1, w/o cfg distillation) to generate videos:

| Model | height/width/frames | Peak GPU Memory | 50 steps w/ flash-attn | 50 steps w/o flash-attn |
|:------------:|:------------:|:------------:|:------------:|:------------:|
| Step-Video-T2V | 544px × 992px × 204f | 77.64 GB | 743 s | 1232 s |
| Step-Video-T2V | 544px × 992px × 136f | 72.48 GB | 408 s | 605 s |

An NVIDIA GPU with CUDA support is required. The model is tested on four GPUs; we recommend GPUs with 80 GB of memory for better generation quality. Tested operating system: Linux. The self-attention in the text encoder (stepllm) only supports CUDA capabilities sm80, sm86, and sm90.

🔧 4.2 Dependencies and Installation
- Python >= 3.10.0 (recommend using Anaconda or Miniconda)
- PyTorch >= 2.3-cu121
- CUDA Toolkit
- FFmpeg

🚀 4.3 Inference Scripts

We employed a decoupling strategy for the text encoder, VAE decoding, and DiT to optimize GPU resource utilization by the DiT. As a result, a dedicated GPU is needed to handle the API services for the text encoder's embeddings and VAE decoding.

🚀 4.4 Best-of-Practice Inference Settings

Step-Video-T2V exhibits robust performance in inference settings, consistently generating high-fidelity and dynamic videos. However, our experiments reveal that variations in inference hyperparameters can have a substantial effect on the trade-off between video fidelity and dynamics. To achieve optimal results, we recommend the following best practices for tuning inference parameters:

| Models | infer_steps | cfg_scale | time_shift | num_frames |
|:-------:|:-------:|:-------:|:-------:|:-------:|
| Step-Video-T2V | 30-50 | 9.0 | 13.0 | 204 |
| Step-Video-T2V-Turbo (Inference Step Distillation) | 10-15 | 5.0 | 17.0 | 204 |

5. Benchmark

We are releasing Step-Video-T2V-Eval as a new benchmark, featuring 128 Chinese prompts sourced from real users. This benchmark is designed to evaluate the quality of generated videos across 11 distinct categories: Sports, Food, Scenery, Animals, Festivals, Combination Concepts, Surreal, People, 3D Animation, Cinematography, and Style.

6. Online Engine

The online version of Step-Video-T2V is available on 跃问视频 (Yuewen Video), where you can also explore some impressive examples.

8. Acknowledgement

- We would like to express our sincere thanks to the xDiT team for their invaluable support and parallelization strategy.
- Our code will be integrated into the official repository of Huggingface/Diffusers.
- We thank the FastVideo team for their continued collaboration and look forward to launching inference acceleration solutions together in the near future.
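The Video-VAE's 16x16 spatial and 8x temporal compression determines the latent size the DiT actually denoises. A quick sanity check of the arithmetic for the 544×992×204f setting (illustrative only; the exact temporal handling of the first frame may differ in the real VAE):

```python
# Latent sizing under 16x16 spatial and 8x temporal compression.
def latent_shape(height: int, width: int, frames: int) -> tuple:
    assert height % 16 == 0 and width % 16 == 0, "spatial dims must be divisible by 16"
    return (frames // 8, height // 16, width // 16)  # (T', H', W')

print(latent_shape(544, 992, 204))  # 204 frames compress to ~25 latent frames
```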

license:mit
43
471

NextStep-1-f8ch16-Tokenizer

license:apache-2.0
43
14

Step-Audio-AQAA

Step-Audio-AQAA: A Fully End-to-End Expressive Large Audio Language Model

📚 Paper: Step-Audio-AQAA: A Fully End-to-End Expressive Large Audio Language Model
🚀 [Live Demo](https://www.stepfun.com/docs/zh/step-audio-aqaa?studiocode=step-audio-aqaa&studioid=121368403356246016&studiotype=1)

Model Overview

Step-Audio-AQAA is a fully end-to-end Large Audio-Language Model (LALM) designed for Audio Query-Audio Answer (AQAA) tasks. It directly processes audio inputs and generates natural, accurate speech responses without relying on traditional ASR and TTS modules, eliminating cascading errors and simplifying the system architecture.

Key Capabilities
- Fully End-to-End Audio Interaction: generates speech outputs directly from raw audio inputs without ASR/TTS intermediates.
- Fine-Grained Voice Control: supports sentence-level adjustments of emotional tone, speech rate, and other vocal features.
- Multilingual & Dialect Support: covers Chinese (including Sichuanese and Cantonese), English, Japanese, etc.
- Complex Task Handling: excels in speech emotion control, role-playing, logical reasoning, and other complex audio interactions.

Model Architecture

Step-Audio-AQAA consists of three core modules:

Dual-Codebook Audio Tokenizer
- Linguistic Tokenizer: based on the Paraformer encoder; extracts phonemic and linguistic attributes with a 1,024-entry codebook at 16.7 Hz.
- Semantic Tokenizer: references CosyVoice 1.0; captures acoustic features with a 4,096-entry codebook at 25 Hz.
- Temporal Alignment: uses a 2:3 interleaving ratio to ensure temporal consistency between token types.

Backbone LLM
- Parameter Scale: 130-billion-parameter multi-modal LLM (Step-Omni).
- Architecture: decoder-only with Transformer blocks, RMSNorm layers, and grouped query attention.
- Vocabulary Expansion: incorporates 5,120 audio tokens into the text vocabulary for text-audio interleaved output.

Neural Vocoder
- Architecture: flow-matching model based on CosyVoice, using U-Net and ResNet-1D layers.
- Conditional Generation: generates high-fidelity speech waveforms conditioned solely on audio tokens.

Training Approach

Multi-Stage Training Pipeline
1. Pretraining: multi-modal pretraining on text, audio, and image data.
2. Supervised Fine-Tuning (SFT):
   - Stage 1: full-parameter update on AQTA and AQTAA datasets.
   - Stage 2: optimizes specific capabilities with high-quality AQTAA data.
3. Direct Preference Optimization (DPO): uses audio token masking to avoid degradation of speech generation.
4. Model Merging: weighted combination of SFT and DPO models to enhance overall performance.

Training Data
- Multi-Modal Pretraining Data: 800 billion text tokens and audio-text interleaved data.
- AQTA Dataset: audio query-text answer pairs.
- AQTAA Dataset: audio query-text answer-audio answer triplets generated from AQTA.

Team & Contributions

Step-Audio-AQAA is developed by the StepFun team, with contributions from multiple researchers and engineers. For technical support or collaboration, contact the corresponding authors: Daxin Jiang ([email protected]), Shuchang Zhou ([email protected]), Chen Hu ([email protected]).

License

This model is released under the Apache 2.0 license. For more details, please refer to the license file.
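The 2:3 temporal alignment above follows from the token rates: per unit of time, 2 linguistic tokens (16.7 Hz) span the same audio as 3 semantic tokens (25 Hz). A toy sketch of that interleaving pattern (token values and the exact serialization are made up; only the 2:3 pattern is the point):

```python
# Interleave linguistic (16.7 Hz, 1024-codebook) and semantic (25 Hz,
# 4096-codebook) token streams in a 2:3 ratio so aligned spans stay adjacent.
def interleave(linguistic, semantic, ratio=(2, 3)):
    nl, ns = ratio
    assert len(linguistic) * ns == len(semantic) * nl, "streams must cover equal time"
    out = []
    for i in range(len(linguistic) // nl):
        out += linguistic[i * nl:(i + 1) * nl]   # 2 linguistic tokens...
        out += semantic[i * ns:(i + 1) * ns]     # ...then the 3 aligned semantic tokens
    return out

ling = ["L0", "L1", "L2", "L3"]                  # hypothetical linguistic tokens
sem = ["S0", "S1", "S2", "S3", "S4", "S5"]       # hypothetical semantic tokens
print(interleave(ling, sem))
```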

license:apache-2.0
38
46

stepvideo-ti2v

license:mit
22
83

Step-Audio-EditX-AWQ-4bit

2
0

Step1X-3D

Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets

Step1X-3D demonstrates the capability to generate 3D assets with high-fidelity geometry and versatile texture maps, while maintaining exceptional alignment between surface geometry and texture mapping. From left to right, we sequentially present: the base geometry (untextured), followed by cartoon-style, sketch-style, and photorealistic 3D asset generation results.

🔥🔥🔥 Latest News!!
- May 13, 2025: 👋 The Step1X-3D online demo is available on Hugging Face; enjoy yourself with generated 3D assets! Huggingface web live
- May 13, 2025: 👋 We release the 800K UIDs of high-quality 3D assets (excluding self-collected assets) obtained with our rigorous data curation pipeline, for training both 3D geometry and texture synthesis. Huggingface dataset
- May 13, 2025: 👋 We have also released the training code for both Step1X-3D geometry generation and texture synthesis.
- May 13, 2025: 👋 We have released the inference code and model weights of Step1X-3D geometry and Step1X-3D texture.
- May 13, 2025: 👋 We have released the Step1X-3D technical report as open source.

Introduction

While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline processing >5M assets to create a 2M high-quality dataset with standardized geometric and textural properties; (2) a two-stage 3D-native architecture combining a hybrid VAE-DiT geometry generator with an SD-XL-based texture synthesis module; and (3) the full open-source release of models, training code, and adaptation modules.

For geometry generation, the hybrid VAE-DiT component produces watertight TSDF representations by employing perceiver-based latent encoding with sharp-edge sampling for detail preservation. The SD-XL-based texture synthesis module then ensures cross-view consistency through geometric conditioning and latent-space synchronization. Benchmark results demonstrate state-of-the-art performance that exceeds existing open-source methods, while also achieving competitive quality with proprietary solutions. Notably, the framework uniquely bridges 2D and 3D generation paradigms by supporting direct transfer of 2D control techniques (e.g., LoRA) to 3D synthesis. By simultaneously advancing data quality, algorithmic fidelity, and reproducibility, Step1X-3D aims to establish new standards for open research in controllable 3D asset generation.

Citation: If you find our work helpful, please cite us.

license:apache-2.0
0
101

stepvideo-t2v-turbo

license:mit
0
97

Step-Audio-Tokenizer

license:apache-2.0
0
42

NextStep-1.1-Pretrain-256px

license:apache-2.0
0
1