krea

2 models

krea-realtime-video

Krea Realtime 14B is distilled from the Wan 2.1 14B text-to-video model using Self-Forcing, a technique for converting regular video diffusion models into autoregressive models. It achieves a text-to-video inference speed of 11 fps using 4 inference steps on a single NVIDIA B200 GPU. For more details on our training methodology and sampling innovations, refer to our technical blog post.

- Our model is over 10x larger than existing realtime video models
- We introduce novel techniques for mitigating error accumulation, including KV Cache Recomputation and KV Cache Attention Bias
- We develop memory optimizations specific to autoregressive video diffusion models that facilitate training large autoregressive models
- Our model enables realtime interactive capabilities: users can modify prompts mid-generation, restyle videos on the fly, and see first frames within 1 second

Video to Video

Krea Realtime allows users to stream real videos, webcam inputs, or canvas primitives into the model, unlocking controllable video synthesis and editing.

Text to Video

Krea Realtime allows users to generate videos in a streaming fashion with a ~1 s time to first frame. Use the web app at http://localhost:8000/ in your browser (for more advanced use cases and custom pipelines, check out our GitHub repository: https://github.com/krea-ai/realtime-video).

Krea Realtime 14B can be used with the `diffusers` library via the new Modular Diffusers structure. Using the `videostream` input processes video frames as they arrive while maintaining temporal consistency across chunks. To optimize inference speed and memory usage on Hopper-class GPUs (H100s), we recommend `torch.compile`, SageAttention, and FP8 quantization with torchao. First, let's set up our dependencies by enabling SageAttention via Hub kernels and installing the `torchao` and `kernels` packages.
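The dependency setup might look like this (package names taken from the card; versions unpinned, so pin them as needed for your environment):

```shell
# Install the quantization and kernels packages mentioned in the card.
pip install torchao kernels
```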
Alternatively, you can use Flash Attention 3 via kernels by disabling SageAttention. Then we iterate over the blocks of the transformer and apply quantization and `torch.compile`.
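The per-block quantize-and-compile loop can be sketched as follows. This is a minimal illustration, not the model's actual code: a tiny stand-in module replaces the 14B transformer so the sketch runs anywhere, the `blocks` attribute name is an assumption, and the torchao FP8 API (`quantize_` with `float8_dynamic_activation_float8_weight`) is from recent torchao releases and needs FP8-capable hardware, so the quantization step is skipped gracefully when unavailable:

```python
import torch
import torch.nn as nn

# Tiny stand-in for the 14B transformer so this sketch runs on any machine;
# the real model exposes its transformer blocks similarly (attribute name assumed).
class TinyTransformer(nn.Module):
    def __init__(self, n_blocks: int = 2, dim: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_blocks))

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

model = TinyTransformer()

# FP8 dynamic quantization via torchao, applied block by block.
# Skipped when torchao is missing or the hardware lacks FP8 support.
try:
    from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight
    for block in model.blocks:
        quantize_(block, float8_dynamic_activation_float8_weight())
except Exception:
    pass  # torchao unavailable or FP8 unsupported on this machine

# Compile each block individually; torch.compile is lazy, so actual
# compilation only happens on the first forward call.
for i, block in enumerate(model.blocks):
    model.blocks[i] = torch.compile(block)

print(type(model.blocks[0]).__name__)
```

Compiling block by block, rather than wrapping the whole transformer, keeps recompilations local when the autoregressive sampling loop changes shapes between chunks.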

license:apache-2.0

Aesthetic Controlnet

This model can produce highly aesthetic results from an input image and a text prompt.

ControlNet is a method for conditioning diffusion models on arbitrary input features, such as image edges, segmentation maps, or human poses. Aesthetic ControlNet is a version of this technique that uses image features extracted with a Canny edge detector to guide a text-to-image diffusion model trained on a large aesthetic dataset. The base diffusion model is a fine-tuned version of Stable Diffusion 2.1 trained at a resolution of 640x640, and the control network comes from thibaud/controlnet-sd21 by @thibaudz. For more information about ControlNet, please have a look at this thread or at the original work by Lvmin Zhang and Maneesh Agrawala.

Diffusers

Install the following dependencies and then run the code below:

Misuse and Malicious Use

The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating images that people would foreseeably find disturbing, distressing, or offensive, or content that propagates historical or current stereotypes.
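The `diffusers` usage referenced in the card can be sketched as below. This is a hedged illustration, not the card's exact snippet: the repo id `"krea/aesthetic-controlnet"` is inferred from the card title, and the heavy pipeline load (which downloads weights and expects a GPU) is kept inside a function so only the lightweight conditioning helper runs at import time. In practice the edge map comes from `cv2.Canny(image, 100, 200)`:

```python
import numpy as np

def edges_to_condition(edges: np.ndarray) -> np.ndarray:
    """Stack a single-channel Canny edge map into the 3-channel image ControlNet expects."""
    return np.stack([edges] * 3, axis=-1)

def generate(prompt: str, edges: np.ndarray):
    """Run the Aesthetic ControlNet pipeline (downloads weights; repo id assumed)."""
    import torch
    from PIL import Image
    from diffusers import StableDiffusionControlNetPipeline

    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "krea/aesthetic-controlnet", torch_dtype=torch.float16
    ).to("cuda")
    cond = Image.fromarray(edges_to_condition(edges).astype(np.uint8))
    return pipe(prompt, image=cond, num_inference_steps=20).images[0]
```

Since the base model was trained at 640x640, conditioning images near that resolution tend to work best.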
