# Quick Start

Get up and running with LLMForge in 5 minutes.
## Your First Generation

### Step 1: Install LLMForge

```bash
pip install llmforge
```
### Step 2: Run Inference

```python
from llmforge import InferenceEngine

# Create an engine - auto-detects the best device (CPU/GPU/MPS)
engine = InferenceEngine("Qwen/Qwen3-0.6B")

# Generate text
response = engine.generate("Write a short poem about AI")
print(response)
```
### Step 3: Chat with the Model

```python
from llmforge import InferenceEngine

engine = InferenceEngine("Qwen/Qwen3-0.6B")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
]

response = engine.chat(messages, max_tokens=128)
print(response)
```
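Under the hood, chat interfaces typically flatten the role/content messages into a single prompt string using the model's chat template. The sketch below illustrates the idea only; it is not LLMForge's actual implementation, and the `<|role|>` markers are placeholder tokens, not Qwen's real template.

```python
# Illustrative only: a minimal chat-template flattener. Real engines use
# the tokenizer's own chat template; the markers here are made up.
def render_chat(messages):
    """Flatten role/content messages into a single prompt string."""
    lines = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    lines.append("<|assistant|>")  # cue the model to start its reply
    return "\n".join(lines)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
]
prompt = render_chat(messages)
```

The final `<|assistant|>` line is what makes the model continue as the assistant rather than extend the user's turn.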
## RL Engine with Self-Improvement

```python
from llmforge import RLEngine

# Create an RL engine with memory
rl = RLEngine("Qwen/Qwen3-0.6B")

# Generate with auto-improvement strategies
for i in range(3):
    response = rl.generate(f"Explain concept {i+1}", max_tokens=128)
    print(f"Response {i+1}: {response[:100]}...")
    print("-" * 50)

# Check stats
stats = rl.get_stats()
print(f"Total generations: {stats['generations']}")
print(f"Memory hits: {stats['memory_hits']}")
```
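To make the `generations` / `memory_hits` stats concrete, here is a hypothetical sketch of the memory idea: cache past responses keyed by prompt and count how often a prompt is served from cache. This is an assumption about the general pattern, not LLMForge's internals.

```python
# Hypothetical sketch of a response memory - NOT LLMForge's actual code.
class ResponseMemory:
    def __init__(self):
        self.cache = {}
        self.generations = 0
        self.memory_hits = 0

    def generate(self, prompt, model_fn):
        """Return a cached response if we've seen this prompt before."""
        self.generations += 1
        if prompt in self.cache:
            self.memory_hits += 1
            return self.cache[prompt]
        response = model_fn(prompt)  # fall through to the model
        self.cache[prompt] = response
        return response

    def get_stats(self):
        return {"generations": self.generations, "memory_hits": self.memory_hits}

mem = ResponseMemory()
echo = lambda p: p.upper()     # stand-in for a real model call
mem.generate("hello", echo)    # cache miss: calls the model
mem.generate("hello", echo)    # cache hit: served from memory
stats = mem.get_stats()        # {"generations": 2, "memory_hits": 1}
```

A cache hit skips the model call entirely, which is why memory hits reduce latency for repeated prompts.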
## Command Line Usage

### Chat Interface

```bash
llmforge chat --model Qwen/Qwen3-0.6B
```

Options:

- `--temp 0.8` - sampling temperature
- `--max-tokens 512` - maximum tokens to generate
- `--system-prompt "You are helpful"` - system prompt
### Generate from File

```bash
llmforge generate --model Qwen/Qwen3-0.6B --prompt-file prompts.txt
```
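A likely format for `prompts.txt` is one prompt per line with blank lines ignored; the sketch below shows that assumed reading logic, not the CLI's confirmed behavior.

```python
# Assumed prompt-file format: one prompt per line, blank lines skipped.
from pathlib import Path

def load_prompts(path):
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]

Path("prompts.txt").write_text("Write a haiku\n\nSummarize RL\n")
prompts = load_prompts("prompts.txt")  # ["Write a haiku", "Summarize RL"]
```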
## Common Models

| Model | Size | Memory (FP16) | Memory (INT4) |
|---|---|---|---|
| Qwen3-0.6B | 0.6B | 1.2 GB | 0.3 GB |
| Llama-3.2-1B | 1B | 2 GB | 0.5 GB |
| Llama-3.2-3B | 3B | 6 GB | 1.5 GB |
| Llama-3.1-8B | 8B | 16 GB | 4 GB |
| Llama-3.1-70B | 70B | 140 GB | 35 GB |
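The table follows a simple rule of thumb: weight memory is roughly parameter count times bytes per weight (FP16 stores 2 bytes per parameter, INT4 about 0.5). A quick check of the arithmetic:

```python
# Rule of thumb: weight memory (GB) ~= params in billions x bytes per weight.
# This counts weights only - activations and KV cache add overhead on top.
def weight_memory_gb(params_billion, bytes_per_param):
    return params_billion * bytes_per_param

fp16 = weight_memory_gb(0.6, 2)    # Qwen3-0.6B in FP16 -> 1.2 GB
int4 = weight_memory_gb(0.6, 0.5)  # Qwen3-0.6B in INT4 -> 0.3 GB
```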
## What's Next?

- Inference Guide - a deep dive into text generation
- RL Engine - self-improving AI responses
- Quantization - reduce memory usage
- Server - run as an API server