JonathanMiddleton


# Qwen3 Embedding 8B GGUF

## Purpose

Multilingual text-embedding model in GGUF format for efficient CPU/GPU inference with llama.cpp and derivatives.

## Files

| Filename | Precision | Size | Est. MTEB Δ vs FP16 | Notes |
|----------|-----------|------|---------------------|-------|
| `Qwen3-Embedding-8B-F16.gguf` | FP16 | 15.1 GB | 0 | Direct conversion; reference quality |
| `Qwen3-Embedding-8B-Q8_0.gguf` | Q8_0 | 8.6 GB | ≈ +0.02 | Full-precision parity for most tasks |
| `Qwen3-Embedding-8B-Q6_K.gguf` | Q6_K | 6.9 GB | ≈ +0.20 | Balanced size / quality |
| `Qwen3-Embedding-8B-Q5_K_M.gguf` | Q5_K_M | 6.16 GB | ≈ +0.35 | Good recall under tight memory |
| `Qwen3-Embedding-8B-Q4_K_M.gguf` | Q4_K_M | 5.41 GB | ≈ +0.60 | Lowest-size CPU-friendly build |

## Upstream source

- Repository: `Qwen/Qwen3-Embedding-8B`
- Commit: `1d8ad4c` (2025-07-12)
- Licence: Apache-2.0

## Conversion

- Code base: llama.cpp commit `a20f0a1` + PR #14029 (Qwen embedding support).
- Commands:

```bash
# Convert safetensors → GGUF (FP16 reference)
python convert_hf_to_gguf.py Qwen/Qwen3-Embedding-8B \
  --outfile Qwen3-Embedding-8B-F16.gguf \
  --outtype f16

# Quantise, keeping the token-embedding matrix and output tensor at FP16
DIR="."                        # output directory (adjust to your layout)
BASE="Qwen3-Embedding-8B"
SRC="${DIR}/${BASE}-F16.gguf"
EMB_OPT="--token-embedding-type F16 --leave-output-tensor"

for QT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  OUT="${DIR}/${BASE}-${QT}.gguf"
  echo ">> quantising ${QT} -> $(basename "$OUT")"
  llama-quantize $EMB_OPT "$SRC" "$OUT" "$QT" "$(nproc)"
done
```
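Once one of these files is loaded into an embedding-capable server (for example `llama-server --embedding -m Qwen3-Embedding-8B-Q6_K.gguf`, which exposes an OpenAI-compatible `/v1/embeddings` route), downstream retrieval reduces to cosine similarity over the returned vectors. A minimal sketch with synthetic vectors standing in for real embeddings; the server invocation above is an illustration, not part of this card:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs sorted by descending cosine similarity."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Synthetic demo vectors stand in for real /v1/embeddings output.
q = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_by_similarity(q, docs)
print([i for i, _ in ranking])  # most similar document first: [1, 2, 0]
```

The quantization level changes the vectors slightly (the MTEB Δ column above), but not this consumption pattern.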


daisy-milli-base-v18


# Qwen3-Reranker-0.6B

🚨 **Required llama.cpp build:** https://github.com/ngxson/llama.cpp/tree/xsn/qwen3embdrerank

This unmerged fix branch is mandatory for running Qwen3 reranking models. Other HF GGUF quantizations of the 0.6B reranker typically fail in mainline `llama.cpp` because they were not produced with this build. This quantization was produced with the build above and works.

## Purpose

Multilingual text-reranking model in GGUF format for efficient CPU/GPU inference with llama.cpp-compatible back-ends. Parameters: ≈ 0.6 B.

Note: the token-embedding matrix and output tensor are left at FP16 across all quantizations.

## Files

| Filename | Quant | Size (bytes / MiB) | Est. quality Δ vs FP16 |
|----------|-------|--------------------|------------------------|
| `Qwen3-Reranker-0.6B-F16.gguf` | FP16 | 1,197,634,048 B (1142.2 MiB) | 0 (reference) |
| `Qwen3-Reranker-0.6B-Q4_K_M.gguf` | Q4_K_M | 396,476,032 B (378.1 MiB) | TBD |
| `Qwen3-Reranker-0.6B-Q5_K_M.gguf` | Q5_K_M | 444,186,496 B (423.6 MiB) | TBD |
| `Qwen3-Reranker-0.6B-Q6_K.gguf` | Q6_K | 494,878,880 B (472.0 MiB) | TBD |
| `Qwen3-Reranker-0.6B-Q8_0.gguf` | Q8_0 | 639,153,088 B (609.5 MiB) | TBD |

## Upstream source

- Repo: `Qwen/Qwen3-Reranker-0.6B`
- Commit: `f16fc5d` (2025-06-09)
- License: Apache-2.0

## Conversion & quantization

```bash
# Convert safetensors → GGUF (FP16)
python convert_hf_to_gguf.py ~/models/local/Qwen3-Reranker-0.6B \
  --outfile Qwen3-Reranker-0.6B-F16.gguf \
  --outtype f16

# Quantize variants, keeping token embeddings and output tensor at FP16
EMB_OPT="--token-embedding-type F16 --leave-output-tensor"
for QT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  llama-quantize $EMB_OPT Qwen3-Reranker-0.6B-F16.gguf Qwen3-Reranker-0.6B-${QT}.gguf "$QT"
done
```
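Two small sanity checks on the Files table and on how reranker output is consumed, as a sketch: the byte counts convert to the listed MiB figures (1 MiB = 1024² bytes), and a rerank response, however the fix branch's server exposes it, reduces to a sort by relevance score. The result-dict shape below is a hypothetical example, not the branch's documented API:

```python
def bytes_to_mib(n):
    """Convert a byte count to MiB (1 MiB = 1024**2 bytes)."""
    return n / (1024 ** 2)

def order_by_score(results):
    """Sort reranker results (dicts with 'index' and 'score') best-first."""
    return sorted(results, key=lambda r: r["score"], reverse=True)

# The F16 file: 1,197,634,048 bytes ≈ 1142.2 MiB, matching the table.
print(round(bytes_to_mib(1_197_634_048), 1))  # 1142.2

# Hypothetical relevance scores for three candidate documents:
hits = [{"index": 0, "score": -2.1},
        {"index": 1, "score": 3.7},
        {"index": 2, "score": 0.4}]
print([h["index"] for h in order_by_score(hits)])  # [1, 2, 0]
```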


daisy-pico-instruct-20260108


daisy-milli-base-v4.0


daisy-micro-base-v4


daisy-pico-base-20260108-1


daisy-nano


daisy-pico-base


daisy-pico-base-v11


test


daisy-milli-base-v18d.e-tokens163185688576

license:apache-2.0