Quantization Explained: Q4_K_M vs AWQ vs FP16 for Local LLMs


Q4_K_M vs AWQ vs FP16 Comparison

| Dimension | FP16 | Q4_K_M (GGUF) | AWQ (4-bit) |
|---|---|---|---|
| Size (7B model) | ~14 GB | ~4.1 GB | ~4.2 GB |
| Hardware requirement | ≥16 GB VRAM (GPU) | CPU, Apple Silicon, or any GPU via llama.cpp | NVIDIA GPU only (CUDA kernels) |
| Perplexity vs FP16 | Reference | +0.1 to 0.3 | +0.05 to 0.2 |
| Best use case | Fine-tuning, evaluation, quality baselines | CPU/hybrid inference, broad hardware compatibility | Maximum GPU throughput with minimal quality loss |

Running large language models locally means confronting a fundamental resource constraint: VRAM. Model quantization is the primary technique for fitting models that would otherwise exceed consumer hardware limits, but the choice between quantization formats carries real trade-offs in quality, speed, and compatibility. This article provides a direct, practical comparison of three common formats for local LLMs: FP16, Q4_K_M (GGUF), and AWQ.


Prerequisites: All Python code in this article was tested with autoawq==0.2.x, transformers==4.44.x, torch==2.3.x, and accelerate==0.33.x on CUDA 12.1. Pin these versions in your environment to ensure reproducibility. llama.cpp commands assume a recent build from the project's master branch (mid-2024 or later, CMake-based build system). Verify exact script names and binary paths against your checked-out commit.

What Is Model Quantization and Why Does It Matter for Local LLMs?

The Memory Problem with Full-Precision Models

The arithmetic is straightforward. Each parameter in a neural network stored at FP32 (32-bit floating point) occupies 4 bytes. A 7-billion-parameter model at FP32 requires roughly 28 GB of VRAM just to hold the weights, before accounting for KV cache, activations, or framework overhead. Drop to FP16 (16-bit floating point) and that figure halves to approximately 14 GB. A 13B model at FP16 needs around 26 GB; a 70B model needs roughly 140 GB.
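The arithmetic is simple enough to reproduce in a few lines. The sketch below (plain Python, decimal gigabytes, weights only) mirrors the figures above:

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone (no KV cache or overhead)."""
    return n_params * bits_per_weight / 8 / 1e9

# These reproduce the figures quoted above (decimal GB)
print(f"7B  @ FP32: {weight_memory_gb(7e9, 32):.0f} GB")   # 28 GB
print(f"7B  @ FP16: {weight_memory_gb(7e9, 16):.0f} GB")   # 14 GB
print(f"13B @ FP16: {weight_memory_gb(13e9, 16):.0f} GB")  # 26 GB
print(f"70B @ FP16: {weight_memory_gb(70e9, 16):.0f} GB")  # 140 GB
```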

Consumer GPUs ship with 8 to 24 GB of VRAM. The NVIDIA RTX 4090, one of the most capable consumer cards available, tops out at 24 GB. Even a 7B-parameter model at FP16 is a tight fit, and anything larger is simply out of reach without further reduction. This is where quantization becomes essential rather than optional.

How Quantization Reduces Model Size

Quantization reduces the number of bits used to represent each weight. The progression from FP32 to FP16 to INT8 to INT4 successively shrinks the model's memory footprint, but each step sacrifices numerical precision. The core trade-off is size and speed against output quality.

Not all quantization methods are equal. Naive approaches round weights to the nearest representable value at the target bit width, which can degrade quality significantly. More sophisticated techniques analyze which weights matter most and allocate precision accordingly. This article focuses on three specific points along the quantization spectrum: FP16 as the full-fidelity baseline, Q4_K_M as the dominant GGUF format for CPU and hybrid inference, and AWQ as the leading activation-aware method for GPU inference.
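To make the "naive round-to-nearest" baseline concrete, here is a minimal sketch of symmetric per-tensor INT4 quantization in plain Python. This is an illustration only; k-quants and AWQ are substantially more sophisticated than this:

```python
def quantize_rtn_int4(w):
    """Naive symmetric round-to-nearest: one scale shared by the whole tensor."""
    scale = max(abs(x) for x in w) / 7          # symmetric INT4, use -7..7
    q = [max(-8, min(7, round(x / scale))) for x in w]
    return q, scale

w = [0.02, -0.50, 1.30, -0.07]
q, scale = quantize_rtn_int4(w)
w_hat = [v * scale for v in q]                  # dequantized approximation
# Small weights suffer the largest relative error under a single shared scale
print(q)                                        # [0, -3, 7, 0]
print([round(a - b, 3) for a, b in zip(w, w_hat)])
```

Note how the smallest weights (0.02 and -0.07) round all the way to zero: this is exactly the failure mode that importance-aware methods are designed to avoid.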

FP16: The Full-Fidelity Baseline

What FP16 Is

FP16, or half-precision floating point, uses 16 bits per weight: 1 sign bit, 5 exponent bits, and 10 mantissa bits. It provides a dynamic range sufficient for virtually all inference tasks, and the quality loss relative to FP32 is negligible for the vast majority of models. FP16 is a common weight format in Hugging Face safetensors files; BF16 is also widely used.
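Python's standard struct module can unpack these fields directly (format code "e" is IEEE 754 binary16), which makes the 1/5/10 bit layout easy to verify:

```python
import struct

def fp16_fields(x: float):
    """Unpack a Python float into IEEE 754 half-precision bit fields."""
    (raw,) = struct.unpack("<H", struct.pack("<e", x))  # 'e' = binary16
    sign = raw >> 15
    exponent = (raw >> 10) & 0x1F   # 5 exponent bits (bias 15)
    mantissa = raw & 0x3FF          # 10 mantissa bits
    return sign, exponent, mantissa

print(fp16_fields(1.0))    # (0, 15, 0): biased exponent 15 encodes 2**0
print(fp16_fields(-2.5))   # (1, 16, 256): -1.25 * 2**1
```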

When to Use FP16

FP16 is the right choice when ample VRAM is available: at least 16 GB for a 7B model in practice, with 24 GB providing comfortable margin for KV cache and longer contexts. For 70B-class models, this typically means A100 or H100 datacenter hardware with 80+ GB. It is also the right choice when maximum output quality is non-negotiable, such as during evaluation benchmarks, fine-tuning, or when establishing a quality baseline for measuring quantization degradation. FP16 is the reference point from which all quantized formats are derived. Every quality comparison in this article uses FP16 as ground truth.

Q4_K_M: The GGUF Sweet Spot for CPU and Hybrid Inference

Understanding GGUF and the K-Quant Naming Convention

GGUF is the file format native to llama.cpp, the C/C++ inference engine that made local LLM inference practical on consumer hardware. The format is a single-file container that bundles model weights, tokenizer data, and metadata.

The name Q4_K_M encodes three pieces of information. Q4 means the weights are primarily stored at 4-bit precision. K indicates the k-quant method, a quantization approach developed by the llama.cpp community that differs fundamentally from naive round-to-nearest schemes. M stands for medium, describing a mixed-precision strategy where different tensor types receive different bit widths based on their importance to model quality.

K-quants perform importance-weighted bit allocation. Attention and output projection tensors, which disproportionately affect output quality, retain higher precision (often 6 bits), while less critical feedforward layers are quantized more aggressively to 4 bits. This selective approach yields meaningfully better quality than assigning a uniform 4-bit width to every tensor.
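The size figures quoted in this article let you back out the effective bit width this mixed-precision layout produces; a quick sanity check:

```python
# Effective bits-per-weight implied by the figures in this article:
# a ~4.1 GB Q4_K_M file holding a 7B-parameter model.
file_bytes = 4.1e9
n_params = 7e9
effective_bpw = file_bytes * 8 / n_params
# Above a flat 4.0 because attention/output tensors keep ~6 bits
print(f"~{effective_bpw:.1f} bits per weight")  # ~4.7
```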

Performance and Quality Profile

A 7B-parameter model quantized to Q4_K_M typically occupies approximately 4.1 GB on disk (varies by model architecture), roughly 70% smaller than the FP16 original. Community benchmarks on Llama-family models report perplexity increases on WikiText-2 in the range of 0.1 to 0.3 points relative to FP16, though exact values depend on the specific model, llama.cpp version, and evaluation parameters (stride, context length). This is a modest degradation that is imperceptible in most conversational and instruction-following tasks.

The defining advantage of GGUF Q4_K_M is hardware flexibility. It runs on machines with zero GPU via llama.cpp's CPU inference path, which uses AVX2/AVX-512 and ARM NEON SIMD instructions. It also supports partial GPU offloading, where some layers run on the GPU while the rest execute on CPU, allowing users with smaller GPUs to accelerate inference proportionally to available VRAM.
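As a rough planning aid for partial offloading, the sketch below estimates how many layers to offload (llama.cpp's -ngl flag) for a given VRAM budget. Every constant here is a hypothetical placeholder: real per-layer sizes vary by model and quant level, so measure rather than trust these numbers:

```python
def layers_to_offload(vram_gb, total_layers=32, layer_mb=130, reserve_mb=1024):
    """Estimate how many transformer layers fit in VRAM after reserving headroom.

    layer_mb and reserve_mb are illustrative guesses, not measured values.
    """
    budget_mb = vram_gb * 1024 - reserve_mb
    return max(0, min(total_layers, budget_mb // layer_mb))

for vram in (4, 8, 24):
    n = layers_to_offload(vram)
    print(f"{vram} GB VRAM -> llama-cli -m model.gguf -ngl {n}")
```

With 8 GB or more, the whole hypothetical 32-layer model fits on the GPU; at 4 GB, roughly two-thirds of the layers offload and the remainder run on CPU.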

Loading a Q4_K_M Model with Ollama

Ollama provides the simplest path to running a GGUF-quantized model locally. The following demonstrates pulling a pre-quantized model and running it, as well as creating a custom Modelfile to load any GGUF file:

# Pull and run a pre-quantized model directly
# Verify current tag availability at ollama.com/library/llama3
ollama pull llama3:8b-instruct-q4_k_m
ollama run llama3:8b-instruct-q4_k_m "Explain quantization in one paragraph."

To load a custom GGUF file, create a Modelfile:

# Modelfile
FROM ./llama-3-8b-instruct.Q4_K_M.gguf

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

PARAMETER temperature 0.7
PARAMETER num_ctx 4096

Then build and run:

ollama create my-llama3-q4km -f Modelfile
ollama run my-llama3-q4km "What are the benefits of k-quant quantization?"

AWQ: Activation-Aware Weight Quantization for GPU Inference

How AWQ Works Differently

AWQ, introduced by researchers at MIT (Lin et al., 2023; arXiv:2306.00978), takes a fundamentally different approach to deciding which weights deserve protection during quantization. Rather than examining weight magnitudes alone, AWQ analyzes activation distributions from a calibration dataset to identify which weights produce the largest activations during inference. AWQ preserves these "salient" weights at higher effective precision through per-channel scaling factors, while it quantizes the remaining weights to 4-bit integers.

The storage format uses 4-bit integer weights with FP16 scaling factors, organized into groups (typically group size 128). AutoAWQ stores the result as safetensors files alongside a quantization configuration JSON. AWQ targets GPU execution exclusively. It relies on custom CUDA kernels for dequantization and matrix multiplication and does not support CPU-only inference.
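The group-wise storage layout can be illustrated with a simplified sketch: one scale and zero point per group of 128 weights. This shows the bookkeeping only; it omits AWQ's activation-aware scale search and AutoAWQ's packed on-disk format:

```python
def quantize_groupwise(w, group_size=128, bits=4):
    """Asymmetric group-wise quantization: one scale and zero point per group."""
    levels = 2 ** bits - 1                        # codes 0..15 for 4-bit
    q, scales, zeros = [], [], []
    for i in range(0, len(w), group_size):
        g = w[i:i + group_size]
        lo, hi = min(g), max(g)
        scale = (hi - lo) / levels or 1.0         # guard flat groups
        zero = round(-lo / scale)
        q.extend(max(0, min(levels, round(x / scale) + zero)) for x in g)
        scales.append(scale)
        zeros.append(zero)
    return q, scales, zeros

# 256 weights -> 2 groups, so only 2 scales and 2 zero points of overhead
w = [(-1) ** i * (i / 256) for i in range(256)]
q, scales, zeros = quantize_groupwise(w)
print(len(q), len(scales), len(zeros))  # 256 2 2
```

Smaller groups track local weight ranges more tightly but store more scales; group size 128 is the conventional balance point.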

Performance and Quality Profile

A 7B-parameter AWQ model occupies approximately 4.2 GB on disk (varies by model architecture), comparable to Q4_K_M GGUF. The quality retention edges ahead of Q4_K_M: perplexity degradation on WikiText-2 falls in the range of 0.05 to 0.2 points relative to FP16 on Llama-family models tested in the AWQ paper (Lin et al., 2023), owing to the activation-aware weight selection strategy. Exact values depend on the model and calibration data used. On several published benchmarks in that paper, AWQ outperforms GPTQ at 4-bit on perplexity; results vary by model and calibration data.

Where AWQ separates itself is GPU inference speed. The optimized CUDA kernels generate tokens faster on NVIDIA GPUs compared to GGUF models with full GPU offload. The AWQ paper reports throughput improvements of roughly 1.2 to 1.5x over comparable GPTQ configurations, though the exact delta against GGUF depends on model size, batch configuration, and GPU architecture. Frameworks like vLLM and Hugging Face Text Generation Inference (TGI) have native AWQ support with kernel-level optimizations that exploit the fixed integer arithmetic patterns.

Loading an AWQ Model with Hugging Face Transformers

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch

# Tested with: autoawq==0.2.x, transformers==4.44.x, torch==2.3.x
# Verify model availability at hf.co before use
model_id = "hugging-quants/Meta-Llama-3-8B-Instruct-AWQ-INT4"

# Load the AWQ-quantized model onto GPU
# fuse_layers=True requires Ampere (RTX 30xx) or newer; set False on older GPUs
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    fuse_layers=True,
    trust_remote_code=False,
    safetensors=True,
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)

# Generate text
prompt = "Explain the difference between quantization and pruning."
tokens = tokenizer(prompt, return_tensors="pt").to("cuda")
input_length = tokens["input_ids"].shape[-1]

with torch.no_grad():
    output = model.generate(
        **tokens,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
    )

print(tokenizer.decode(output[0][input_length:], skip_special_tokens=True))

Setting fuse_layers=True enables fused attention and MLP kernels, which further improve throughput on supported architectures (Ampere and newer). The top_p=0.9 parameter works alongside temperature to control sampling behavior and avoid degenerate outputs. The response is decoded starting from input_length to exclude the echoed prompt tokens.

Head-to-Head Comparison: Benchmark Table and Analysis

Benchmark Methodology

The comparison uses a Llama 3 8B model across all three formats. Metrics include file size on disk, VRAM usage at runtime, token generation speed, and perplexity measured on WikiText-2. The hardware reference points are representative consumer GPUs: an RTX 3090 and RTX 4090, both with 24 GB VRAM. GPU token generation figures in the table below correspond to the RTX 4090; CPU figures assume a comparable high-end consumer CPU. Exact library versions, batch sizes, and evaluation parameters (stride, context length) should be matched when reproducing these results.

Note: The tok/s values in the table below are qualitative estimates based on community benchmarks and published results, not controlled measurements from a single unified test run. Treat them as approximate rankings, not exact figures.

Comparison Table

| Metric | FP16 | Q4_K_M (GGUF) | AWQ (4-bit) |
|---|---|---|---|
| File Size (7-8B) | ~14 GB | ~4.1 GB | ~4.2 GB |
| VRAM Required | ~14.5 GB | ~5.5 GB (full GPU offload) / 0 (CPU only) | ~5.5 GB |
| Tokens/sec (GPU, RTX 4090) | ~50-70 (estimate) | ~80-110 (estimate) | ~100-140 (estimate) |
| Tokens/sec (CPU) | <1 on consumer CPUs for 7B | ~10-20 (estimate, varies by CPU) | Not supported |
| Perplexity vs FP16 | Reference | +0.1 to 0.3 | +0.05 to 0.2 |
| Ecosystem | HF Transformers, vLLM | llama.cpp, Ollama, LM Studio | vLLM, HF Transformers, TGI |

Key Takeaways from the Benchmarks

If you need the lowest perplexity at 4-bit and your machine has an NVIDIA GPU, AWQ is the strongest option. The activation-aware calibration process yields tighter perplexity to FP16 than Q4_K_M, and the dedicated CUDA kernels deliver the highest throughput among the three formats.

Users running on CPU, Apple Silicon, or mixed CPU/GPU setups should default to Q4_K_M. It is the only format among the three that supports CPU-only inference, partial GPU offloading, and runs natively on macOS Metal via llama.cpp. Without a dedicated NVIDIA GPU, it is effectively the only viable path.

FP16 remains the correct choice for quality-sensitive workflows like fine-tuning and evaluation, but it demands hardware that most local setups cannot provide for models above 7B parameters. Think of it as the reference checkpoint, not a deployment format for constrained environments.

Decision Flowchart: Which Format Should You Choose?

The following decision tree captures the primary selection criteria:

START
  ├─ Do you have a dedicated NVIDIA GPU with ≥8 GB VRAM?
  │    ├─ YES → Is maximum inference speed your priority?
  │    │         ├─ YES → Use AWQ
  │    │         └─ NO → Do you have ≥16 GB VRAM?
  │    │                   ├─ YES → Use FP16
  │    │                   └─ NO → Use AWQ or Q4_K_M (GPU offload)
  │    └─ NO → Do you have an AMD GPU (ROCm) or Apple Silicon (Metal)?
  │              ├─ YES → Use Q4_K_M (GGUF) with GPU offload via llama.cpp
  │              └─ NO → Use Q4_K_M (GGUF) with CPU inference
  └─ Are you fine-tuning or evaluating model quality?
       ├─ YES → Use FP16
       └─ NO → Follow GPU branch above

The first branch point is hardware: without an NVIDIA GPU, AWQ is off the table entirely, making Q4_K_M the default. llama.cpp supports GPU-accelerated inference on AMD GPUs via ROCm and on Apple Silicon via Metal, so Q4_K_M with GPU offload is viable on those platforms as well. With an NVIDIA GPU but limited VRAM (8 to 12 GB), both AWQ and Q4_K_M with GPU offloading are viable, and the choice depends on whether the user values speed (AWQ) or the ability to fall back to CPU layers (Q4_K_M). With 16+ GB of VRAM and no speed constraints, FP16 preserves full model fidelity. The fine-tuning branch is separate because quantized weights are not suitable as a starting point for training; FP16 or FP32 checkpoints are required.

Converting Between Formats: Practical Code Examples

Converting FP16 to GGUF Q4_K_M with llama.cpp

This requires a local clone of llama.cpp and a Hugging Face FP16 model on disk.

Disk space: Ensure at least 35 GB of free disk space before beginning. The workflow stores the FP16 model (~14 GB), an intermediate FP16 GGUF (~14 GB), and the final Q4_K_M output (~4 GB).

# Clone llama.cpp and build the quantization tool
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build with CMake (the current build system for llama.cpp)
# Detect available CPU cores cross-platform
if command -v nproc >/dev/null 2>&1; then
    JOBS=$(nproc)
elif command -v sysctl >/dev/null 2>&1; then
    JOBS=$(sysctl -n hw.logicalcpu)
else
    JOBS=4
fi

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$JOBS"

# Verify the conversion script name for your llama.cpp version:
# ls *.py
# As of mid-2024, the canonical script is convert_hf_to_gguf.py at the repo root.

# Convert the HF model to GGUF FP16 format first
python convert_hf_to_gguf.py \
    /path/to/Llama-3-8B-Instruct \
    --outfile llama-3-8b-instruct-fp16.gguf \
    --outtype f16

# Quantize the FP16 GGUF to Q4_K_M
QUANTIZE_BIN="./build/bin/llama-quantize"

# Fallback for non-standard build layouts
if [ ! -f "$QUANTIZE_BIN" ]; then
    QUANTIZE_BIN=$(find ./build -name "llama-quantize" -type f | head -1)
fi

if [ ! -f "$QUANTIZE_BIN" ]; then
    echo "ERROR: llama-quantize binary not found. Check build output." >&2
    exit 1
fi

"$QUANTIZE_BIN" \
    llama-3-8b-instruct-fp16.gguf \
    llama-3-8b-instruct-Q4_K_M.gguf \
    Q4_K_M

The convert_hf_to_gguf.py script reads safetensors weights and tokenizer files and writes an intermediate FP16 GGUF. The llama-quantize binary then applies the k-quant scheme at the specified level. The Q4_K_M option applies mixed precision: higher bit widths to attention tensors and lower to feedforward blocks.

Creating an AWQ Quantized Model with AutoAWQ

VRAM requirement: AWQ quantization loads the full FP16 model plus calibration activations into GPU memory. For a 7B model, ensure at least 20 GB VRAM, or use device_map="auto" to spill to system RAM if needed.

import os
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Tested with: autoawq==0.2.x, transformers==4.44.x, torch==2.3.x
model_path = "/path/to/Llama-3-8B-Instruct"
quant_path = "./Llama-3-8B-Instruct-AWQ"

# Load the FP16 model
# ⚠ trust_remote_code=False prevents execution of untrusted code
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=False)

# Ensure pad_token is set (required for calibration on base Llama variants)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Define quantization configuration
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}

# Ensure output directory exists before starting quantization
os.makedirs(quant_path, exist_ok=True)

# Run AWQ quantization
# The default calibration dataset is a subset of the Pile (as bundled in AutoAWQ).
# Replace with domain-specific data via the `calib_data` parameter for specialized models.
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

The q_group_size of 128 is the standard grouping that balances granularity against overhead. The "version": "GEMM" flag selects the GEMM-based kernel path, which provides higher throughput than GEMV at batch sizes above 1; for single-sequence local use, benchmark "GEMV" or "GEMV_FAST" as alternatives. Supplying a domain-specific calibration set can improve quality for specialized models but is not required for general use.

Quick Note on GPTQ

GPTQ is a related GPU quantization format that predates AWQ and has broader legacy tool support, including early integration with the transformers library and text-generation-webui. On several published benchmarks, AWQ produces lower perplexity at the same bit width and benefits from faster kernel implementations, though results vary by model. GPTQ remains relevant when using older toolchains or when pre-quantized GPTQ models are available but AWQ versions are not. EXL2 (via ExLlamaV2) is a further evolution of GPTQ with dynamic per-layer bit allocation and is worth consulting for GPU-only deployments.

Common Pitfalls and Tips

Mismatched Formats and Runtimes

Hugging Face Transformers cannot load GGUF files natively; they require llama.cpp, Ollama, or LM Studio. Conversely, Ollama cannot load AWQ safetensors files without first converting them to GGUF, a process that is lossy and generally inadvisable. Match the quantization format to the intended inference runtime before you start downloading models.

Quantization Is Not Compression

Quantization permanently alters weight values through lossy precision reduction. This is distinct from lossless file compression. You cannot "dequantize" a quantized model back to the original FP16 weights. (The inference engine dequantizes weights transiently during matrix multiplication, but the stored file retains only the lower-precision values.) Treat quantized models as derived artifacts, not as replacements for the original checkpoint.
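A two-line illustration of the irreversibility, using naive symmetric INT4 with a fixed scale purely for demonstration: two distinct weights collapse onto the same 4-bit code, so no inverse mapping exists.

```python
def int4_roundtrip(w, scale=1 / 7):
    """Quantize to a 4-bit code, then dequantize. Distinct inputs can collide."""
    q = max(-8, min(7, round(w / scale)))
    return q, q * scale

# 0.50 and 0.55 both snap to code 4; the original values are unrecoverable
for w in (0.50, 0.55):
    q, w_hat = int4_roundtrip(w)
    print(f"{w} -> code {q} -> {w_hat:.3f}")
```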

Quant Quality Varies by Model

Larger models tolerate aggressive quantization better than smaller ones. A 70B model at 4-bit loses less perplexity in both absolute and relative terms than a 1.5B model at the same bit width, because the larger parameter count provides redundancy. Code-focused models also degrade faster at low bit widths than general chat models, as precise token prediction for code syntax is more sensitive to weight perturbation.

Summary and Recommendations

FP16 is the uncompromised baseline, suitable when hardware allows it and fidelity matters most. Q4_K_M is the most versatile quantization format, supporting CPU, hybrid, and GPU inference through the llama.cpp ecosystem with modest quality loss. AWQ delivers the best combination of quality retention and GPU inference speed but requires an NVIDIA GPU and a compatible serving framework. For further reference, consult the llama.cpp repository (github.com/ggerganov/llama.cpp), the AutoAWQ documentation (github.com/casper-hansen/AutoAWQ), the Ollama documentation (ollama.com), and the AWQ paper (Lin et al., 2023; arXiv:2306.00978).

SitePoint Team

Sharing our passion for building incredible internet things.
