TMElyralab
DeepSeek-V3.1-AWQ-W4AFP8
DeepSeek-V3-0324-AWQ-W4AFP8
This model is a W4AFP8-quantized DeepSeek-V3-0324 with AWQ quantization.

Related PR: https://github.com/sgl-project/sglang/pull/8573
Related project: https://github.com/TMElyralab/sglang/tree/lyraw4afp8

Benchmark

Test configuration: input/output len = 1000/1000, qps = 64, max concurrency = 64, num prompts = 128
Device: 8x H20

Compared to the original model:
- bs=64: input/output throughput has increased by 56%.
- bs=128: input/output throughput has increased by 125%.

Docker image: lmsysorg/sglang:v0.4.6.post5-cu124 or lmsysorg/sglang:v0.5.1.post5-cu126 (the CUDA 12.6 environment needs ptxas updated to 12.8 on Hopper; see reference).

Citation

We are TMElyralab, the Acceleration Team from Tencent Music Entertainment (TME).

DeepSeek-V3-0324 demonstrates notable improvements over its predecessor, DeepSeek-V3, in several key aspects.

- Significant improvements in benchmark performance:
  - MMLU-Pro: 75.9 → 81.2 (+5.3)
  - GPQA: 59.1 → 68.4 (+9.3)
  - AIME: 39.6 → 59.4 (+19.8)
  - LiveCodeBench: 39.2 → 49.2 (+10.0)
- Improved executability of generated code
- More aesthetically pleasing web pages and game front-ends
- Enhanced style and content quality:
  - Aligned with the R1 writing style
  - Better quality in medium-to-long-form writing
- Feature enhancements:
  - Improved multi-turn interactive rewriting
  - Optimized translation quality and letter writing
  - Enhanced report analysis requests with more detailed outputs
  - Increased accuracy in function calling, fixing issues from previous V3 versions

In the official DeepSeek web/app, we use the same system prompt with a specific date. In our web and application environments, the temperature parameter $T_{model}$ is set to 0.3. Because many users use the default temperature of 1.0 in API calls, we have implemented an API temperature $T_{api}$ mapping mechanism that adjusts the input API temperature value of 1.0 to the most suitable model temperature setting of 0.3.
$$ T_{model} = T_{api} \times 0.3 \quad (0 \leq T_{api} \leq 1) $$

$$ T_{model} = T_{api} - 0.7 \quad (1 < T_{api} \leq 2) $$

Thus, if you call V3 via the API, an API temperature of 1.0 corresponds to a model temperature of 0.3.

For file uploading, please follow the template to create prompts, where {file_name}, {file_content} and {question} are arguments. For Web Search, {search_results}, {cur_date}, and {question} are arguments.

The model structure of DeepSeek-V3-0324 is exactly the same as DeepSeek-V3. Please visit the DeepSeek-V3 repo for more information about running this model locally.

This model supports features such as function calling, JSON output, and FIM completion. For instructions on how to construct prompts for these features, please refer to the DeepSeek-V2.5 repo.

NOTE: Hugging Face's Transformers has not been directly supported yet.

This repository and the model weights are licensed under the MIT License.

Contact

If you have any questions, please raise an issue or contact us at [email protected].
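The piecewise mapping above can be sketched in Python. This is a minimal illustration of the published formula only, not DeepSeek's actual server code; the function name is our own:

```python
def api_to_model_temperature(t_api: float) -> float:
    """Map an API temperature to the model temperature.

    T_model = T_api * 0.3   for 0 <= T_api <= 1
    T_model = T_api - 0.7   for 1 <  T_api <= 2
    """
    if 0.0 <= t_api <= 1.0:
        return t_api * 0.3
    if 1.0 < t_api <= 2.0:
        return t_api - 0.7
    raise ValueError("T_api must lie in [0, 2]")

# The default API temperature 1.0 maps to the recommended model temperature 0.3.
```

Note that the two pieces agree at $T_{api} = 1$ ($1 \times 0.3 = 1 - 0.7 = 0.3$), so the mapping is continuous.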
DeepSeek-R1-AWQ-W4AFP8
DeepSeek-R1-0528-AWQ-W4AFP8
MuseV
MuseTalk
MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting

Yue Zhang*, Minhao Liu*, Zhaokang Chen, Bin Wu†, Yingjie He, Chao Zhan, Wenjiang Zhou
(*Equal Contribution, †Corresponding Author, [email protected])

github | huggingface | Project (coming soon) | Technical report (coming soon)

We introduce `MuseTalk`, a real-time high-quality lip-syncing model (30fps+ on an NVIDIA Tesla V100). MuseTalk can be applied with input videos, e.g., generated by MuseV, as a complete virtual human solution.

Overview

`MuseTalk` is a real-time high-quality audio-driven lip-syncing model trained in the latent space of `ft-mse-vae`, which

1. modifies an unseen face according to the input audio, with a face region size of `256 x 256`.
1. supports audio in various languages, such as Chinese, English, and Japanese.
1. supports real-time inference at 30fps+ on an NVIDIA Tesla V100.
1. supports modification of the center point of the proposed face region, which SIGNIFICANTLY affects generation results.
1. provides a checkpoint trained on the HDTF dataset.
1. training codes (coming soon).

News

- [04/02/2024] Released MuseTalk project and pretrained models.

MuseTalk was trained in latent spaces, where the images were encoded by a frozen VAE. The audio was encoded by a frozen `whisper-tiny` model. The architecture of the generation network was borrowed from the UNet of `stable-diffusion-v1-4`, where the audio embeddings were fused with the image embeddings by cross-attention.

The character of the last two rows, `Xinying Sun`, is a supermodel KOL. You can follow her on Douyin.

For video dubbing, we applied a self-developed tool which can detect the talking person.

TODO:
- [x] trained models and inference codes.
- [ ] technical report.
- [ ] training codes.
- [ ] online UI.
- [ ] a better model (may take longer).
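The cross-attention fusion described above can be sketched as follows. This is a minimal NumPy illustration of how audio embeddings condition image latents in a UNet cross-attention layer; the shapes are illustrative and the learned Q/K/V projection matrices are omitted for brevity, so this is not MuseTalk's actual implementation:

```python
import numpy as np

def cross_attend(image_tokens: np.ndarray, audio_tokens: np.ndarray) -> np.ndarray:
    """Fuse audio embeddings into image embeddings via scaled dot-product
    cross-attention: queries come from the image latents, keys/values from audio.

    image_tokens: (n_img, d), audio_tokens: (n_audio, d); returns (n_img, d).
    """
    d = image_tokens.shape[-1]
    scores = image_tokens @ audio_tokens.T / np.sqrt(d)  # (n_img, n_audio)
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over audio tokens
    return weights @ audio_tokens                        # (n_img, d)
```

Each image token thus receives a convex combination of the audio tokens, weighted by query/key similarity; in the real UNet this output is added back into the image feature stream.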
Getting Started

We provide a detailed tutorial on the installation and basic usage of MuseTalk for new users:

Installation

To prepare the Python environment and install additional packages such as opencv, diffusers, mmcv, etc., please follow the steps below:

Build environment

We recommend Python >= 3.10 and CUDA 11.7. Then build the environment as follows:

whisper

Install whisper to extract audio features (only the encoder is used).

Download ffmpeg-static

Download the ffmpeg-static.

Download weights

You can download weights manually as follows:

2. Download the weights of other components:
   - sd-vae-ft-mse
   - whisper
   - dwpose
   - face-parse-bisent
   - resnet18

Finally, these weights should be organized in `models` as follows:

`configs/inference/test.yaml` is the path to the inference configuration file, including `video_path` and `audio_path`. The `video_path` should be either a video file or a directory of images.

Use of bbox_shift to have adjustable results

:mag_right: We have found that the upper bound of the mask has an important impact on mouth openness. Thus, to control the mask region, we suggest using the `bbox_shift` parameter. Positive values (moving towards the lower half) increase mouth openness, while negative values (moving towards the upper half) decrease mouth openness.

You can start by running with the default configuration to obtain the adjustable value range, and then re-run the script within this range. For example, in the case of `Xinying Sun`, after running with the default configuration, it shows that the adjustable value range is [-9, 9]. Then, to decrease the mouth openness, we set the value to `-7`.

:pushpin: More technical details can be found in bbox_shift.

As a complete solution to virtual human generation, it is suggested that you first apply MuseV to generate a video (text-to-video, image-to-video or pose-to-video) by referring to this. Then, you can use `MuseTalk` to generate a lip-sync video by referring to this.
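The effect of `bbox_shift` can be sketched as below. This is a hypothetical helper written only to illustrate the convention described above (image y coordinates grow downwards, so a positive shift moves the mask's upper bound towards the lower half of the face); it is not MuseTalk's actual implementation, and the valid range should be taken from the one reported by the default run:

```python
def shift_mask_upper_bound(bbox, bbox_shift, valid_range=(-9, 9)):
    """Move the upper bound of the face mask by bbox_shift pixels.

    bbox: (x1, y1, x2, y2) with y growing downwards.
    Positive bbox_shift -> upper bound moves towards the lower half
                           -> more mouth openness.
    Negative bbox_shift -> upper bound moves towards the upper half
                           -> less mouth openness.
    """
    lo, hi = valid_range
    if not lo <= bbox_shift <= hi:
        raise ValueError(f"bbox_shift must be within {valid_range}")
    x1, y1, x2, y2 = bbox
    return (x1, y1 + bbox_shift, x2, y2)

# e.g. with the reported range [-9, 9], bbox_shift = -7 decreases mouth openness
```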
If you want to launch online video chats, it is suggested that you generate videos using MuseV and apply the necessary pre-processing, such as face detection, in advance. During online chatting, only the UNet and the VAE decoder are involved, which makes MuseTalk real-time.

Acknowledgement

1. We thank the open-source components whisper, dwpose, face-alignment, face-parsing, and S3FD.
1. MuseTalk has referred much to diffusers.
1. MuseTalk has been built on the `HDTF` dataset.

Limitations

- Resolution: Though MuseTalk uses a face region size of 256 x 256, which makes it better than other open-source methods, it has not yet reached the theoretical resolution bound. We will continue to work on this problem. If you need a higher resolution, you could apply super-resolution models such as GFPGAN in combination with MuseTalk.
- Identity preservation: Some details of the original face are not well preserved, such as mustache, lip shape and color.
- Jitter: There is some jitter, as the current pipeline adopts single-frame generation.

Disclaimer/License

1. `code`: The code of MuseTalk is released under the MIT License. There is no limitation for either academic or commercial usage.
1. `model`: The trained model is available for any purpose, even commercially.
1. `other opensource model`: Other open-source models used must comply with their licenses, such as `whisper`, `ft-mse-vae`, `dwpose`, `S3FD`, etc.
1. The test data are collected from the Internet and are available for non-commercial research purposes only.
1. `AIGC`: This project strives to impact the domain of AI-driven video generation positively. Users are granted the freedom to create videos using this tool, but they are expected to comply with local laws and use it responsibly. The developers do not assume any responsibility for potential misuse by users.