kernels-community

23 models

flash-attn3

license:apache-2.0
257,532
41

activation

48,771
6

vllm-flash-attn3

This is an implementation of Flash Attention 3 CUDA kernels with support for attention sinks. The attention sinks implementation was contributed to Flash Attention by the vLLM team; the transformers team packaged the implementation and pre-built it for use with the kernels library.

Kernel source: https://github.com/huggingface/kernels-community/tree/main/vllm-flash-attn3

When loading your model with transformers, provide this repository id as the source of the attention implementation. This will automatically resolve and download the appropriate code for your architecture. See more details in this post.

Credits:
- Tri Dao and team for Flash Attention and Flash Attention 3.
- The vLLM team for their implementation and their contribution of attention sinks.
- The transformers team for packaging, testing, building and making it available for use with the kernels library.
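A minimal sketch of that loading call, assuming a transformers version with kernels support and the `kernels` package installed; the model id and dtype here are illustrative, not prescribed by this repository:

```python
import torch
from transformers import AutoModelForCausalLM

# The repository id is passed as the attention implementation; transformers
# resolves and downloads the pre-built kernel for your architecture.
# "openai/gpt-oss-20b" is only an example model id.
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype=torch.bfloat16,
    attn_implementation="kernels-community/vllm-flash-attn3",
)
```

Running this requires a CUDA GPU supported by the kernel; on other hardware, transformers falls back to its built-in attention implementations.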

license:apache-2.0
32,864
41

flash-attn2

Flash Attention is a fast and memory-efficient implementation of the attention mechanism, designed to work with large models and long sequences. This is a Hugging Face compliant kernel build of Flash Attention. Original code: https://github.com/Dao-AILab/flash-attention. `scripts/readme_example.py` provides a simple example of how to use the Flash Attention kernel in PyTorch. It demonstrates standard attention, causal attention, and variable-length sequences.
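Flash Attention produces the same result as standard scaled dot-product attention; only the tiled, memory-efficient execution differs. As a reference point, here is a plain-Python sketch of the math being computed, including the causal masking the example script demonstrates (names and structure are our own, not taken from the kernel's scripts):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(q, k, v, causal=False):
    """Reference single-head scaled dot-product attention.

    q, k, v: lists of row vectors, shape [seq_len][head_dim].
    This mirrors what the fused kernel computes, without any of the
    tiling or recomputation tricks that make Flash Attention fast.
    """
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        if causal:
            # A query may not attend to keys at later positions.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        out.append([sum(wj * vj[t] for wj, vj in zip(w, v)) for t in range(len(v[0]))])
    return out
```

With causal masking, position 0 can only see itself, so its output is exactly `v[0]` — a handy sanity check when comparing against a fused kernel.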

license:bsd-3-clause
15,085
29

cv_utils

1,076
0

rotary

license:bsd-3-clause
956
4

megablocks

license:apache-2.0
602
4

flash-mla

583
5

paged-attention

license:apache-2.0
372
11

deep-gemm

license:mit
78
1

sgl-flash-attn3

license:bsd-3-clause
61
3

sage_attention

This is a build of SageAttention compatible with the kernels library.

15
3

gpt-oss-metal-kernels

Metal kernels that back the OpenAI GPT-OSS reference implementation, repackaged for local experiments on Apple Silicon GPUs. The GPT-OSS project distributes optimized inference primitives for the `gpt-oss-20b` and `gpt-oss-120b` open-weight models, including MXFP4-packed linear layers and fused attention paths that target Metal Performance Shaders on macOS.

The package exposes Python bindings through `gpt_oss_metal_kernels.ops`; these symbols are re-exported in `gpt_oss_metal_kernels.__init__` for convenience. All kernels expect Metal (`mps`) tensors and operate in place on user-provided outputs to minimize additional allocations:

- `f32_bf16w_matmul`, `f32_bf16w_matmul_add`
- `f32_bf16w_dense_matmul_qkv`, `f32_bf16w_dense_matmul_attn_output`, `f32_bf16w_dense_matmul_mlp_gate`
- `f32_bf16w_rmsnorm`
- `bf16_f32_embeddings`
- `f32_rope`
- `f32_bf16w_matmul_qkv`
- `f32_sdpa`
- `f32_topk`, `expert_routing_metadata`, `f32_scatter`

For implementation details, inspect the `.metal` shader files. Each example below compares a Metal kernel against the canonical PyTorch equivalent using shared random inputs. The snippets assume an Apple Silicon machine with an `mps` device and that `kernels` is installed in the active environment. For one of the kernels, the outputs match about 97% of the time; this appears to be related to how the reference implementation is written.

These kernels form the core of the GPT-OSS inference stack, enabling BF16 activations with MXFP4 weights while keeping latency low on Metal GPUs. Use the snippets as templates when validating your own model integrations or when extending the kernel set.
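As an illustration of the "compare against the canonical equivalent" pattern, here is a framework-free reference for the RMSNorm math that a fused kernel like the one above computes. The function name, the default `eps`, and the exact semantics (multiply by the weight after normalization) are assumptions for the sketch, not taken from the package:

```python
import math

def rmsnorm(x, weight, eps=1e-5):
    """Reference RMSNorm: y_i = x_i / sqrt(mean(x^2) + eps) * w_i.

    x, weight: equal-length lists of floats. The Metal kernel additionally
    dequantizes BF16 weights and writes in place on an `mps` tensor; this
    sketch only shows the normalization arithmetic itself.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]
```

With a unit weight vector, the output's mean square is ~1, which is a convenient property to assert when validating a kernel against this reference.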

9
3

metal-flash-sdpa

license:apache-2.0
6
3

triton_kernels

triton-kernels is a set of kernels that enable fast MoE on different architectures. These kernels are compatible with different precisions (e.g. bf16, mxfp4). Original code: https://github.com/triton-lang/triton/tree/main/python/triton_kernels. The current version corresponds to commit 7d0efaa7231661299284a603512fce4fa255e62c. Note that we can't update these kernels at will, since some commits may rely on triton main; unfortunately, we need to wait for a new release. See the related issue: https://github.com/triton-lang/triton/issues/7818
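To make the MoE part concrete, here is a toy, pure-Python sketch of the top-k expert routing step that such kernels accelerate. The function and its renormalized-softmax weighting are illustrative assumptions for a common MoE design, not the triton-kernels API; the real kernels fuse routing with the expert matmuls on the GPU:

```python
import math

def topk_route(gate_logits, k=2):
    """Toy top-k MoE router.

    gate_logits: one token's score per expert. Returns (expert_ids, weights),
    where weights are the softmax of the chosen experts' scores, renormalized
    so the selected weights sum to 1.
    """
    # Indices of the k highest-scoring experts.
    idx = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Stable softmax restricted to the selected experts.
    m = max(gate_logits[i] for i in idx)
    es = [math.exp(gate_logits[i] - m) for i in idx]
    s = sum(es)
    return idx, [e / s for e in es]
```

Each token's output is then the weight-blended result of running only its selected experts, which is where the precision-flexible (bf16, mxfp4) matmul kernels come in.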

license:mit
0
7

fp8-fbgemm

0
1

gpt-oss-triton-kernels

0
1

tinygrad-rms

0
1

finegrained-fp8

0
1

sonic-moe

0
1

triton-scaled-mm

license:bsd-3-clause
0
1

torch_harmonics_attn

Attention mechanisms for the Spherical Harmonics basis using the torch-harmonics package: https://github.com/NVIDIA/torch-harmonics/tree/main/torch_harmonics/attention

0
1

residual_rms_rocm

RMSNorm kernel for ROCm devices from https://github.com/huggingface/hf-rocm-kernels

0
1