kernels-community

23 models

flash-attn3

license:apache-2.0
257,532
41

activation

48,771
6

vllm-flash-attn3

This is an implementation of Flash Attention 3 CUDA kernels with support for attention sinks. The attention sinks implementation was contributed to Flash Attention by the vLLM team; the transformers team packaged the implementation and pre-built it for use with the kernels library.

Kernel source: https://github.com/huggingface/kernels-community/tree/main/vllm-flash-attn3

When loading your model with transformers, provide this repository id as the source of the attention implementation. This will automatically resolve and download the appropriate code for your architecture. See more details in this post.

Credits:
- Tri Dao and team for Flash Attention and Flash Attention 3.
- The vLLM team for their implementation and their contribution of attention sinks.
- The transformers team for packaging, testing, building and making it available for use with the kernels library.
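A minimal sketch of that loading call, assuming a transformers version with kernels support and the `kernels` package installed; the model id and dtype here are illustrative, not prescribed by this repository:

```python
import torch
from transformers import AutoModelForCausalLM

# The repository id is passed as the attention implementation; transformers
# resolves and downloads the pre-built kernel for your architecture.
# "openai/gpt-oss-20b" is only an example model id.
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype=torch.bfloat16,
    attn_implementation="kernels-community/vllm-flash-attn3",
)
```

Running this requires a CUDA GPU supported by the kernel; on other hardware, transformers falls back to its built-in attention implementations.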

license:apache-2.0
32,864
41

flash-attn2

Flash Attention is a fast and memory-efficient implementation of the attention mechanism, designed to work with large models and long sequences. This is a Hugging Face compliant kernel build of Flash Attention. Original code: https://github.com/Dao-AILab/flash-attention. `scripts/readme_example.py` provides a simple example of how to use the Flash Attention kernel in PyTorch. It demonstrates standard attention, causal attention, and variable-length sequences.
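Flash Attention produces the same result as standard scaled dot-product attention; only the tiled, memory-efficient execution differs. As a reference point, here is a plain-Python sketch of the math being computed, including the causal masking the example script demonstrates (names and structure are our own, not taken from the kernel's scripts):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(q, k, v, causal=False):
    """Reference single-head scaled dot-product attention.

    q, k, v: lists of row vectors, shape [seq_len][head_dim].
    This mirrors what the fused kernel computes, without any of the
    tiling or recomputation tricks that make Flash Attention fast.
    """
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        if causal:
            # A query may not attend to keys at later positions.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        out.append([sum(wj * vj[t] for wj, vj in zip(w, v)) for t in range(len(v[0]))])
    return out
```

With causal masking, position 0 can only see itself, so its output is exactly `v[0]` — a handy sanity check when comparing against a fused kernel.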

license:bsd-3-clause
15,085
29

cv_utils

1,076
0

rotary

license:bsd-3-clause
956
4

megablocks

license:apache-2.0
602
4

flash-mla

583
5

paged-attention

license:apache-2.0
372
11

deep-gemm

license:mit
78
1

sgl-flash-attn3

license:bsd-3-clause
61
3

sage_attention

This is a build of SageAttention compatible with the kernels library.

15
3

gpt-oss-metal-kernels

Metal kernels that back the OpenAI GPT-OSS reference implementation, repackaged for local experiments on Apple Silicon GPUs. The GPT-OSS project distributes optimized inference primitives for the `gpt-oss-20b` and `gpt-oss-120b` open-weight models, including MXFP4-packed linear layers and fused attention paths that target Metal Performance Shaders on macOS.

The package exposes Python bindings through `gpt_oss_metal_kernels.ops`; these symbols are re-exported in `gpt_oss_metal_kernels.__init__` for convenience. All kernels expect Metal (`mps`) tensors and operate in place on user-provided outputs to minimize additional allocations:

- `f32_bf16w_matmul`, `f32_bf16w_matmul_add`
- `f32_bf16w_dense_matmul_qkv`, `f32_bf16w_dense_matmul_attn_output`, `f32_bf16w_dense_matmul_mlp_gate`
- `f32_bf16w_rmsnorm`
- `bf16_f32_embeddings`
- `f32_rope`
- `f32_bf16w_matmul_qkv`
- `f32_sdpa`
- `f32_topk`, `expert_routing_metadata`, `f32_scatter`

For implementation details, inspect the `.metal` shader files. Each example below compares a Metal kernel against the canonical PyTorch equivalent using shared random inputs. The snippets assume an Apple Silicon machine with an `mps` device and that `kernels` is installed in the active environment. For one of the kernels, the outputs match about 97% of the time; this appears to be related to how the reference implementation is written.

These kernels form the core of the GPT-OSS inference stack, enabling BF16 activations with MXFP4 weights while keeping latency low on Metal GPUs. Use the snippets as templates when validating your own model integrations or when extending the kernel set.
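As an illustration of the "compare against the canonical equivalent" pattern, here is a framework-free reference for the RMSNorm math that a fused kernel like the one above computes. The function name, the default `eps`, and the exact semantics (multiply by the weight after normalization) are assumptions for the sketch, not taken from the package:

```python
import math

def rmsnorm(x, weight, eps=1e-5):
    """Reference RMSNorm: y_i = x_i / sqrt(mean(x^2) + eps) * w_i.

    x, weight: equal-length lists of floats. The Metal kernel additionally
    dequantizes BF16 weights and writes in place on an `mps` tensor; this
    sketch only shows the normalization arithmetic itself.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]
```

With a unit weight vector, the output's mean square is ~1, which is a convenient property to assert when validating a kernel against this reference.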

9
3

metal-flash-sdpa

license:apache-2.0
6
3

triton_kernels

triton-kernels is a set of kernels that enable fast MoE on different architectures. These kernels are compatible with different precisions (e.g. bf16, mxfp4). Original code: https://github.com/triton-lang/triton/tree/main/python/triton_kernels. The current version corresponds to commit 7d0efaa7231661299284a603512fce4fa255e62c. Note that we can't update these kernels at will, since some commits may rely on triton main; unfortunately, we need to wait for a new release. See the related issue: https://github.com/triton-lang/triton/issues/7818
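To make the MoE part concrete, here is a toy, pure-Python sketch of the top-k expert routing step that such kernels accelerate. The function and its renormalized-softmax weighting are illustrative assumptions for a common MoE design, not the triton-kernels API; the real kernels fuse routing with the expert matmuls on the GPU:

```python
import math

def topk_route(gate_logits, k=2):
    """Toy top-k MoE router.

    gate_logits: one token's score per expert. Returns (expert_ids, weights),
    where weights are the softmax of the chosen experts' scores, renormalized
    so the selected weights sum to 1.
    """
    # Indices of the k highest-scoring experts.
    idx = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Stable softmax restricted to the selected experts.
    m = max(gate_logits[i] for i in idx)
    es = [math.exp(gate_logits[i] - m) for i in idx]
    s = sum(es)
    return idx, [e / s for e in es]
```

Each token's output is then the weight-blended result of running only its selected experts, which is where the precision-flexible (bf16, mxfp4) matmul kernels come in.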

license:mit
0
7

fp8-fbgemm

0
1

gpt-oss-triton-kernels

0
1

tinygrad-rms

0
1

finegrained-fp8

0
1

sonic-moe

0
1

triton-scaled-mm

license:bsd-3-clause
0
1

torch_harmonics_attn

Attention mechanisms for the Spherical Harmonics basis using the torch-harmonics package: https://github.com/NVIDIA/torch-harmonics/tree/main/torch_harmonics/attention

0
1

residual_rms_rocm

RMSNorm kernel for ROCm devices from https://github.com/huggingface/hf-rocm-kernels

0
1