Inference Guide
The InferenceEngine provides a clean interface for text generation and chat interactions. It inherits from BaseEngine, which handles device detection, model loading, and quantization.
Overview
InferenceEngine (inherits from BaseEngine)
├── Model Loading & Tokenization
├── Device Management (CPU/GPU/MPS/TPU)
├── Quantization (FP16/INT8/INT4)
├── Text Generation
└── Chat Interface
Basic Usage
Plain Text Generation
from llmforge import InferenceEngine
# Initialize engine with default settings
engine = InferenceEngine("Qwen/Qwen3-0.6B")
# Generate text
prompt = "Write a short story about a robot"
result = engine.generate(prompt, max_tokens=128, temperature=0.7)
print(result)
Chat Mode
from llmforge import InferenceEngine
engine = InferenceEngine("Qwen/Qwen3-0.6B")
# Define messages
messages = [
    {"role": "system", "content": "You are a helpful Python coding assistant."},
    {"role": "user", "content": "How do I read a file in Python?"},
    {"role": "assistant", "content": "You can use the built-in `open()` function."},
    {"role": "user", "content": "How about writing?"},
]
# Generate response
response = engine.chat(messages, max_tokens=256, temperature=0.7)
print(response)
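To continue a conversation, each reply has to be appended to the message list before the next user turn is sent. The sketch below shows that bookkeeping only. `chat_turn` and `fake_chat` are hypothetical helpers that are not part of llmforge; `fake_chat` stands in for a call such as `engine.chat` so the snippet runs without loading a model.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def chat_turn(
    chat_fn: Callable[[List[Message]], str],
    history: List[Message],
    user_text: str,
) -> str:
    """Append a user turn, call the model, and record the assistant reply."""
    history.append({"role": "user", "content": user_text})
    reply = chat_fn(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def fake_chat(messages: List[Message]) -> str:
    """Stand-in for a real model call: echoes the last user message."""
    return f"echo: {messages[-1]['content']}"


history: List[Message] = [
    {"role": "system", "content": "You are a helpful Python coding assistant."}
]
first_reply = chat_turn(fake_chat, history, "How do I read a file in Python?")
second_reply = chat_turn(fake_chat, history, "How about writing?")
```

With a real engine, `chat_fn` could presumably be something like `lambda msgs: engine.chat(msgs, max_tokens=256)`, assuming `engine.chat` returns the reply as a string as the example above suggests.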
Device Configuration
Auto-Detect (Default)
# Automatically selects best device: CUDA > TPU > MPS > CPU
engine = InferenceEngine("Qwen/Qwen3-0.6B")
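The priority order in the comment above can be read as a simple first-match search. The following is a minimal sketch of that idea, not llmforge's actual detection code; `DEVICE_PRIORITY` and `pick_device` are hypothetical names, and the list of available devices is passed in by hand rather than probed from hardware.

```python
from typing import Iterable

# Priority order stated in this guide: CUDA > TPU > MPS > CPU.
DEVICE_PRIORITY = ("cuda", "tpu", "mps", "cpu")


def pick_device(available: Iterable[str]) -> str:
    """Return the highest-priority device present in `available`."""
    present = {name.lower() for name in available}
    for name in DEVICE_PRIORITY:
        if name in present:
            return name
    # Assumption: CPU always exists as a last resort.
    return "cpu"


laptop_choice = pick_device(["cpu", "mps"])
server_choice = pick_device(["cpu", "cuda", "tpu"])
```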
Explicit Device
from llmforge import InferenceEngine
# Use specific device
engine = InferenceEngine(
    "Qwen/Qwen3-0.6B",
    device="cuda"  # or "cpu", "mps", "tpu", "auto"
)
Check Available Devices
from llmforge.base_engine import available_devices
devices = available_devices()
for name, info in devices.items():
    print(f"{name}: {info}")
Quantization
Reduce memory usage with quantization:
from llmforge import InferenceEngine
# 4-bit quantization (about 75% smaller than 16-bit weights)
engine = InferenceEngine(
    "Qwen/Qwen3-0.6B",
    bits=4
)
# 8-bit quantization (about 50% smaller than 16-bit weights)
engine = InferenceEngine(
    "Qwen/Qwen3-0.6B",
    bits=8
)
# 16-bit half precision (no quantization)
engine = InferenceEngine(
    "Qwen/Qwen3-0.6B",
    bits=16
)
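The percentages in the comments above follow from simple arithmetic: weight storage is roughly the parameter count times the bits per weight, divided by 8 to get bytes. The sketch below works this out relative to a 16-bit baseline. The parameter count of 0.6 billion is an assumption inferred from the model name, and `weight_memory_gb` is a hypothetical helper, not an llmforge function.

```python
def weight_memory_gb(num_params: float, bits: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * bits / 8 / 1e9


# Assumption: about 0.6 billion parameters, inferred from the model name.
NUM_PARAMS = 0.6e9

baseline = weight_memory_gb(NUM_PARAMS, 16)
for bits in (16, 8, 4):
    size = weight_memory_gb(NUM_PARAMS, bits)
    saving = 1 - size / baseline
    print(f"{bits:>2}-bit: {size:.2f} GB ({saving:.0%} smaller than 16-bit)")
```

Real quantized checkpoints tend to be somewhat larger than this estimate, because scale factors and any layers left unquantized add overhead.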
Generation Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `prompt` | str | required | Input prompt |
| `max_tokens` | int | 128 | Maximum number of tokens to generate |
| `temperature` | float | 0.7 | Sampling temperature (0 = greedy decoding; higher values give more random output) |
Temperature Guide
| Temperature | Use Case |
|---|---|
| 0.0 | Code generation, factual answers |
| 0.1-0.3 | Precise, focused responses |
| 0.4-0.7 | Balanced (default) |
| 0.8-1.0 | Creative, diverse outputs |
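To see why lower temperatures give more focused output, it helps to look at how temperature is usually applied: logits are divided by the temperature before the softmax. The sketch below uses that standard formulation with made-up logits. Whether llmforge implements sampling exactly this way is an assumption, and `softmax_with_temperature` is a hypothetical helper.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the largest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.0, 0.2, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, which is consistent with the table's recommendation of low values for code and factual answers.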
Advanced Options
Tensor Parallelism
Automatically shard large models across multiple devices:
from llmforge import InferenceEngine
engine = InferenceEngine(
    "meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel=True  # Auto-shard across devices
)
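As a rough illustration of what sharding a layer means, the sketch below splits a weight matrix into column blocks, multiplies each block separately, and concatenates the partial results. This is the general column-parallel idea only, written in plain Python with made-up numbers. It makes no claim about how llmforge partitions a model, and `matvec` and `split_columns` are hypothetical helpers.

```python
from typing import List

Matrix = List[List[float]]


def matvec(x: List[float], w: Matrix) -> List[float]:
    """Compute x @ w for a row vector x and a matrix w stored as rows."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]


def split_columns(w: Matrix, num_shards: int) -> List[Matrix]:
    """Split w into num_shards column blocks (columns must divide evenly)."""
    cols = len(w[0])
    if cols % num_shards != 0:
        raise ValueError("column count must be divisible by num_shards")
    step = cols // num_shards
    return [[row[k * step:(k + 1) * step] for row in w] for k in range(num_shards)]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

full = matvec(x, w)
# Each shard would live on a different device; here they are plain lists.
partials = [matvec(x, shard) for shard in split_columns(w, 2)]
sharded = [value for part in partials for value in part]
```

Because each column of the output depends on only one column of the weights, the concatenated partial results equal the unsharded product, while each device holds only its own block.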
Layer Offloading
For very large models, enable layer offloading:
from llmforge import InferenceEngine
engine = InferenceEngine(
    "meta-llama/Llama-3.1-70B",
    offload=True,  # Move layers to CPU as needed
    bits=4
)
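One way to think about offloading is as a budgeting problem: layers stay on the accelerator while they fit, and the rest are kept in CPU memory. The sketch below shows that greedy placement with made-up layer sizes. It is an illustration under those assumptions, not llmforge's offloading policy, and `plan_offload` is a hypothetical helper.

```python
from typing import List, Tuple


def plan_offload(
    layer_sizes_gb: List[float], gpu_budget_gb: float
) -> Tuple[List[int], List[int]]:
    """Greedily keep each layer on the accelerator if it still fits the budget.

    Returns (resident_layer_indices, offloaded_layer_indices).
    """
    resident: List[int] = []
    offloaded: List[int] = []
    used = 0.0
    for index, size in enumerate(layer_sizes_gb):
        if used + size <= gpu_budget_gb:
            resident.append(index)
            used += size
        else:
            offloaded.append(index)
    return resident, offloaded


# Made-up numbers: eight equal layers of 0.5 GB and a 2.5 GB budget.
resident, offloaded = plan_offload([0.5] * 8, gpu_budget_gb=2.5)
```

Offloaded layers have to be copied back to the accelerator when they run, so a plan with many offloaded layers trades memory for slower generation.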
Error Handling
from llmforge import InferenceEngine
try:
    engine = InferenceEngine("Qwen/Qwen3-0.6B", device="cuda")
except RuntimeError as e:
    print(f"Device error: {e}")
    # Fall back to CPU
    engine = InferenceEngine("Qwen/Qwen3-0.6B", device="cpu")
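The same fallback can be generalized to an ordered list of devices. The sketch below assumes, as the example above does, that an unavailable device surfaces as a `RuntimeError`. `load_with_fallback` and `fake_factory` are hypothetical helpers, and `fake_factory` stands in for the engine constructor so the snippet runs without any hardware.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def load_with_fallback(
    factory: Callable[[str], T],
    devices: Sequence[str] = ("cuda", "mps", "cpu"),
) -> Tuple[T, str]:
    """Try each device in order; return (engine, device) for the first success."""
    last_error: Optional[RuntimeError] = None
    for device in devices:
        try:
            return factory(device), device
        except RuntimeError as err:
            last_error = err
    raise RuntimeError(f"no usable device among {list(devices)}") from last_error


def fake_factory(device: str) -> str:
    """Stand-in for an engine constructor: only 'cpu' works here."""
    if device != "cpu":
        raise RuntimeError(f"{device} is not available")
    return f"engine-on-{device}"


engine, used_device = load_with_fallback(fake_factory)
```

With the real library, the factory could presumably be `lambda d: InferenceEngine("Qwen/Qwen3-0.6B", device=d)`.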
Next Steps
- RL Engine Guide - Self-improving responses
- Chat API - Interactive chat interface
- Quantization - Memory optimization