Inference Guide

The InferenceEngine provides a clean interface for text generation and chat interactions. It inherits from BaseEngine, which handles device detection, model loading, and quantization.

Overview

InferenceEngine (inherits from BaseEngine)
├── Model Loading & Tokenization
├── Device Management (CPU/GPU/MPS/TPU)
├── Quantization (FP16/INT8/INT4)
├── Text Generation
└── Chat Interface

Basic Usage

Plain Text Generation

from llmforge import InferenceEngine

# Initialize engine with default settings
engine = InferenceEngine("Qwen/Qwen3-0.6B")

# Generate text
prompt = "Write a short story about a robot"
result = engine.generate(prompt, max_tokens=128, temperature=0.7)
print(result)

Chat Mode

from llmforge import InferenceEngine

engine = InferenceEngine("Qwen/Qwen3-0.6B")

# Define messages
messages = [
    {"role": "system", "content": "You are a helpful Python coding assistant."},
    {"role": "user", "content": "How do I read a file in Python?"},
    {"role": "assistant", "content": "You can use the built-in `open()` function."},
    {"role": "user", "content": "How about writing?"},
]

# Generate response
response = engine.chat(messages, max_tokens=256, temperature=0.7)
print(response)
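Under the hood, chat mode applies the model's chat template to flatten the message list into a single prompt before generation. A minimal sketch of that flattening, in pure Python (the role markers below are illustrative only; real templates such as ChatML are model-specific and applied by the tokenizer, not by user code):

```python
def apply_chat_template(messages):
    """Flatten a chat message list into one prompt string.

    Illustrative only: the marker format is invented here; real chat
    templates vary per model and live in the tokenizer config.
    """
    parts = []
    for msg in messages:
        parts.append(f"<|{msg['role']}|>\n{msg['content']}")
    parts.append("<|assistant|>\n")  # cue the model to produce the reply
    return "\n".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful Python coding assistant."},
    {"role": "user", "content": "How about writing?"},
]
prompt = apply_chat_template(messages)
```

The trailing assistant marker is what prompts the model to continue as the assistant rather than echoing another user turn.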

Device Configuration

Auto-Detect (Default)

# Automatically selects best device: CUDA > TPU > MPS > CPU
engine = InferenceEngine("Qwen/Qwen3-0.6B")
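The auto-detection order can be pictured as a simple priority scan. A rough sketch of that logic (the actual implementation lives in BaseEngine and queries the runtime rather than a set of names):

```python
DEVICE_PRIORITY = ["cuda", "tpu", "mps", "cpu"]

def pick_device(available):
    """Return the highest-priority device present in `available`.

    Sketch of the auto-detect order (CUDA > TPU > MPS > CPU);
    `available` stands in for the runtime's availability checks.
    """
    for device in DEVICE_PRIORITY:
        if device in available:
            return device
    return "cpu"  # CPU is always a valid fallback
```

On a machine with both CUDA and MPS visible, this scan returns `"cuda"`; on Apple Silicon without CUDA, `"mps"`.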

Explicit Device

from llmforge import InferenceEngine

# Use specific device
engine = InferenceEngine(
    "Qwen/Qwen3-0.6B",
    device="cuda"   # or "cpu", "mps", "tpu", "auto"
)

Check Available Devices

from llmforge.base_engine import available_devices

devices = available_devices()
for name, info in devices.items():
    print(f"{name}: {info}")

Quantization

Reduce memory usage with quantization:

from llmforge import InferenceEngine

# 4-bit quantization (75% size reduction)
engine = InferenceEngine(
    "Qwen/Qwen3-0.6B",
    bits=4
)

# 8-bit quantization (50% size reduction)
engine = InferenceEngine(
    "Qwen/Qwen3-0.6B", 
    bits=8
)

# Full precision (no quantization)
engine = InferenceEngine(
    "Qwen/Qwen3-0.6B", 
    bits=16
)
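The size reductions quoted in the comments above follow directly from bits per weight. A back-of-the-envelope estimator (parameter count is the only input; real memory footprints also include activations and the KV cache, which this ignores):

```python
def model_size_gb(n_params, bits):
    """Approximate weight memory in GB for n_params parameters at `bits` precision."""
    return n_params * bits / 8 / 1e9  # bits -> bytes -> GB

params = 0.6e9  # ~0.6B parameters, as in Qwen3-0.6B
fp16 = model_size_gb(params, 16)  # full-precision baseline
int8 = model_size_gb(params, 8)   # 50% of the FP16 footprint
int4 = model_size_gb(params, 4)   # 25% of the FP16 footprint
```

For this model the weights alone come to roughly 1.2 GB in FP16, 0.6 GB in INT8, and 0.3 GB in INT4.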

Generation Parameters

Parameter     Type    Default    Description
prompt        str     required   Input prompt
max_tokens    int     128        Maximum number of tokens to generate
temperature   float   0.7        Sampling temperature (0 = greedy; higher = more random)

Temperature Guide

Temperature   Use Case
0.0           Code generation, factual answers
0.1-0.3       Precise, focused responses
0.4-0.7       Balanced (default)
0.8-1.0       Creative, diverse outputs
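Temperature works by rescaling the model's logits before sampling: dividing by a small temperature sharpens the distribution toward the top token, while 1.0 leaves it unchanged. A self-contained sketch of the math in pure Python (no assumptions about llmforge internals):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to sampling probabilities, scaled by temperature.

    temperature = 0 is greedy (all mass on the argmax);
    temperature = 1.0 is the model's raw distribution.
    """
    if temperature == 0:
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.1)  # nearly all mass on token 0
flat = softmax_with_temperature(logits, 1.0)   # raw distribution
```

Lowering the temperature from 1.0 to 0.1 concentrates almost all probability on the highest-logit token, which is why low temperatures suit code generation and factual answers.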

Advanced Options

Tensor Parallelism

Automatically shard large models across multiple devices:

from llmforge import InferenceEngine

engine = InferenceEngine(
    "meta-llama/Llama-3.2-70B-Instruct",
    tensor_parallel=True  # Auto-shard across devices
)
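Conceptually, tensor parallelism splits each weight matrix across devices so every device holds only a slice and computes its share of each matmul. A toy illustration of an even column-wise split (not the library's actual sharding code, which operates on real tensors):

```python
def shard_columns(matrix, n_devices):
    """Split a 2-D weight matrix column-wise into n_devices shards.

    Toy illustration: assumes the column count divides evenly.
    """
    cols = len(matrix[0])
    per = cols // n_devices
    return [
        [row[d * per:(d + 1) * per] for row in matrix]
        for d in range(n_devices)
    ]

weights = [[1, 2, 3, 4],
           [5, 6, 7, 8]]
shards = shard_columns(weights, 2)  # one half of the columns per device
```

Each device then multiplies the input against its shard, and the partial results are concatenated (or summed, for row-wise splits) to reproduce the full output.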

Layer Offloading

For very large models, enable layer offloading:

from llmforge import InferenceEngine

engine = InferenceEngine(
    "meta-llama/Llama-3.1-70B",
    offload=True,  # Move layers to CPU as needed
    bits=4
)

Error Handling

from llmforge import InferenceEngine

try:
    engine = InferenceEngine("Qwen/Qwen3-0.6B", device="cuda")
except RuntimeError as e:
    print(f"Device error: {e}")
    # Fall back to CPU
    engine = InferenceEngine("Qwen/Qwen3-0.6B", device="cpu")
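The try/except pattern above generalizes to a small helper that walks a device preference list. A sketch, where `factory` stands in for `lambda d: InferenceEngine("Qwen/Qwen3-0.6B", device=d)` and is expected to raise RuntimeError on an unavailable device:

```python
def init_with_fallback(factory, devices=("cuda", "mps", "cpu")):
    """Try each device in order; return the first engine that initializes.

    `factory` is any callable taking a device name; it should raise
    RuntimeError when that device is unavailable.
    """
    last_err = None
    for device in devices:
        try:
            return factory(device)
        except RuntimeError as err:
            last_err = err  # remember the failure, try the next device
    raise RuntimeError(f"No usable device found: {last_err}")

# Usage with the real engine might look like:
# engine = init_with_fallback(
#     lambda d: InferenceEngine("Qwen/Qwen3-0.6B", device=d)
# )
```

Keeping the fallback order explicit in one place avoids scattering nested try/except blocks through application code.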

Next Steps