Quantization Guide
Reduce model size and memory usage with quantization.
Quantization Levels
| Bits | Size Reduction | Quality Loss | Use Case |
|------|----------------|--------------|----------|
| 16 | 0% | None | Original precision |
| 8 | 50% | Minimal | Desktop inference |
| 4 | 75% | Low | Mobile/embedded |
| 2 | 87.5% | High | Extreme compression |
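The reduction figures in the table follow from the ratio of the quantized bit width to the 16-bit baseline. Here is a minimal sketch, assuming weights-only storage and ignoring the per-group scales and zero points that real formats add. `size_reduction` is a hypothetical helper for illustration, not part of llmforge.

```python
def size_reduction(bits: int, baseline_bits: int = 16) -> float:
    """Return the percentage of storage saved relative to the baseline width."""
    if bits <= 0 or bits > baseline_bits:
        raise ValueError("bits must be in the range 1..baseline_bits")
    return (1 - bits / baseline_bits) * 100


# Hypothetical helper, shown only to reproduce the table's percentages.
for bits in (16, 8, 4, 2):
    print(f"{bits}-bit: {size_reduction(bits):.1f}% smaller")
```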
Basic Usage
from llmforge import InferenceEngine
# 4-bit quantization (default for large models)
engine = InferenceEngine(
    "meta-llama/Llama-3.2-1B-Instruct",
    bits=4
)
# 8-bit quantization
engine = InferenceEngine(
    "meta-llama/Llama-3.2-1B-Instruct",
    bits=8
)
Manual Quantization
from llmforge.quant_module import quantize_model
# Quantize an existing model (`model` must already be loaded)
model = quantize_model(model, bits=4)
Quantization Methods
INT8 (Dynamic)
from llmforge.quant.dynamic_quant import apply_dynamic_quant
model = apply_dynamic_quant(model)
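For intuition, INT8 quantization maps each floating-point value to an integer in [-127, 127] through a scale factor, and dynamic schemes compute such scales at run time. The following is a generic sketch of symmetric absmax quantization in plain Python. It is not llmforge's implementation, and `quantize_int8` and `dequantize_int8` are hypothetical names.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric absmax quantization: the largest magnitude maps to 127."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the int8 range used here.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
# The round-trip error per value is at most about half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Small values such as `0.003` collapse to zero here. This kind of rounding error is what grows as the bit width shrinks.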
INT4 (GPTQ)
from llmforge.quant.gptq import apply_gptq
model = apply_gptq(model, tokenizer, dataset)
INT4 (AWQ)
from llmforge.quant.awq import apply_awq
model = apply_awq(model, tokenizer, dataset)
Model Sizing Example
| Model | Params | FP16 | INT8 | INT4 |
|-------|--------|------|------|------|
| Llama-3.2-1B | 1B | 2GB | 1GB | 0.5GB |
| Llama-3.2-3B | 3B | 6GB | 3GB | 1.5GB |
| Llama-3.1-8B | 8B | 16GB | 8GB | 4GB |
| Llama-3.1-70B | 70B | 140GB | 70GB | 35GB |
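The figures above are approximately the parameter count times the bits per weight, divided by eight to convert to bytes. Here is a rough sketch, assuming weights-only storage with 1 GB = 10^9 bytes and ignoring activations, KV cache, and quantization metadata. `estimate_size_gb` is a hypothetical helper, not an llmforge function.

```python
def estimate_size_gb(num_params: float, bits: int) -> float:
    """Approximate weight storage in GB (10**9 bytes) at the given bit width."""
    return num_params * bits / 8 / 1e9


# Nominal parameter counts, as in the table; real checkpoints differ slightly.
for name, params in [("1B", 1e9), ("3B", 3e9), ("8B", 8e9), ("70B", 70e9)]:
    sizes = {bits: estimate_size_gb(params, bits) for bits in (16, 8, 4)}
    print(f"{name}: FP16={sizes[16]:g} GB, INT8={sizes[8]:g} GB, INT4={sizes[4]:g} GB")
```

Actual memory use at inference time will be higher than these estimates, since runtime buffers are not counted.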
Trade-offs
- INT8: Great for batch processing, minimal quality loss
- INT4: Best for deployment, slight quality loss
- AWQ: Generally a better fit than GPTQ for instruction-tuned models
- GPTQ: Generally a better fit for base models
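Since quality loss grows as the bit width drops, one reasonable rule of thumb is to pick the highest precision that fits the available memory. The sketch below assumes weights-only sizing with 1 GB = 10^9 bytes. `pick_bits` is a hypothetical helper, not part of llmforge.

```python
from typing import Optional


def pick_bits(num_params: float, budget_gb: float, options=(16, 8, 4, 2)) -> Optional[int]:
    """Return the highest bit width whose estimated weight size fits the budget."""
    for bits in sorted(options, reverse=True):
        size_gb = num_params * bits / 8 / 1e9  # weights only, 1 GB = 10**9 bytes
        if size_gb <= budget_gb:
            return bits
    return None  # nothing fits, even at the lowest precision


print(pick_bits(8e9, budget_gb=6))    # 8B parameters with a 6 GB budget
print(pick_bits(70e9, budget_gb=24))  # 70B parameters with a 24 GB budget
```

The result can be passed as the `bits` argument shown in Basic Usage. In practice, leave headroom in the budget for activations and the KV cache.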