LLMForge

Powerful LLM Inference with RL Self-Improvement - Built on PyTorch

Why LLMForge?

LLMForge is a PyTorch-based framework for large language model inference and inference-time self-improvement. It runs on CPU, CUDA, MPS, and TPU devices, supports 4-bit quantization with offloading for very large models, and can improve its outputs without fine-tuning.

Multi-Device Support: Run models on CPU, NVIDIA CUDA GPUs, Apple MPS, or Google TPU with automatic device detection and optimization.
Massive Model Support: Handle 400B+ parameter models with 4-bit quantization, intelligent offloading, and tensor parallelism.
RL Self-Improvement: Improve outputs without fine-tuning using self-critique, best-of-N sampling, and iterative refinement strategies.
Persistent Memory: SQLite-based memory stores high-quality responses and recalls them for similar prompts, improving over time.
Hugging Face Integration: Use thousands of LLMs from Hugging Face Hub with built-in chat templates and tokenizer support.
Streaming Generation: Real-time token-by-token streaming for interactive applications and responsive user experiences.
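
To give a rough sense of why 4-bit quantization matters for the 400B+ case, here is a back-of-the-envelope estimate of weight storage alone. It ignores activations, the KV cache, and quantization metadata such as scales and zero points, so real usage will be higher. The function name is illustrative and not part of LLMForge.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    params = 400e9  # 400B parameters
    for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
        print(f"{label:>5}: {weight_memory_gb(params, bits):.0f} GB")
```

Under these assumptions a 400B-parameter model needs about 800 GB of weights in fp16 and about 200 GB at 4 bits, which is still more than a single accelerator typically holds. That is presumably why offloading and tensor parallelism are listed alongside quantization.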

Quick Install

pip install llmforge

Or install from source:

git clone https://github.com/ZandrixAI/llmforge.git
cd llmforge
pip install -e .
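
After either install path, one way to confirm that the package is visible to the current interpreter is to query its installed metadata with the standard library. This sketch assumes the distribution name matches the `pip install` name above; `installed_version` is a helper written for this example.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("llmforge")
    print(f"llmforge version: {found}" if found else "llmforge is not installed")
```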

Quick Start

Basic Inference

from llmforge import InferenceEngine

# Create inference engine - auto-detects best device
engine = InferenceEngine("Qwen/Qwen3-0.6B")

# Generate text
response = engine.generate("Explain quantum computing in simple terms.")
print(response)
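
The example above relies on automatic device detection. How LLMForge implements it is not shown here, so the following is only a sketch of one plausible approach using public PyTorch calls. The priority order (CUDA, then TPU, then MPS, then CPU), the use of an importable `torch_xla` package as a TPU hint, and both function names are assumptions made for illustration.

```python
import importlib.util


def pick_device(cuda: bool, mps: bool, tpu: bool) -> str:
    """Choose a device string using an assumed priority: CUDA > TPU > MPS > CPU."""
    if cuda:
        return "cuda"
    if tpu:
        return "xla"
    if mps:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe the local environment; fall back to CPU when PyTorch is missing."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    cuda = torch.cuda.is_available()
    # torch.backends.mps only exists in newer PyTorch builds, so guard the lookup.
    mps_backend = getattr(torch.backends, "mps", None)
    mps = bool(mps_backend and mps_backend.is_available())
    # Assumption: an importable torch_xla package hints that a TPU runtime exists.
    tpu = importlib.util.find_spec("torch_xla") is not None
    return pick_device(cuda, mps, tpu)


if __name__ == "__main__":
    print(detect_device())
```

Keeping the priority logic in a pure function separate from the probing makes it easy to test without any accelerator present.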

Chat Mode

from llmforge import InferenceEngine

engine = InferenceEngine("Qwen/Qwen3-0.6B")

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a file in Python?"},
]

response = engine.chat(messages, max_tokens=256)
print(response)
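
Before a chat model sees the `messages` list, a chat template flattens it into a single prompt string. The sketch below renders a simplified ChatML-style layout, the family of format Qwen models use. It is not LLMForge's code and it omits tool calls and other template features. In practice the Hugging Face tokenizer's `apply_chat_template` method performs this step from the template shipped with the model.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Flatten role/content messages into a simplified ChatML-style prompt."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    messages = [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "How do I read a file in Python?"},
    ]
    print(render_chatml(messages))
```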

RL Engine with Self-Improvement

from llmforge import RLEngine

# Create RL engine with memory
rl = RLEngine("Qwen/Qwen3-0.6B")

# Generate with auto-improvement
response = rl.generate("Explain gravity", strategy="auto")
print(response)

# Inspect counters; memory hits should grow as similar prompts repeat
stats = rl.get_stats()
print(f"Generations: {stats['generations']}, Memory hits: {stats['memory_hits']}")
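
To make the interplay between best-of-N sampling and the SQLite memory more concrete, here is a toy sketch. It is not the RLEngine implementation. The class name, the scoring callback, the score threshold, and the choice to treat "similar" prompts as equal after lowercasing and whitespace normalization are all assumptions made for illustration. Only the `generations` and `memory_hits` counter names are taken from the example above.

```python
import itertools
import sqlite3
from typing import Callable, Dict


class MemoryBestOfN:
    """Toy best-of-N sampler backed by an SQLite cache of accepted responses."""

    def __init__(self, sample: Callable[[str], str], score: Callable[[str, str], float],
                 n: int = 4, min_score: float = 0.5, db_path: str = ":memory:") -> None:
        self.sample, self.score, self.n, self.min_score = sample, score, n, min_score
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS memory "
            "(prompt TEXT PRIMARY KEY, response TEXT, score REAL)"
        )
        self.stats: Dict[str, int] = {"generations": 0, "memory_hits": 0}

    @staticmethod
    def _key(prompt: str) -> str:
        # Assumption: "similar" means equal after case and whitespace normalization.
        return " ".join(prompt.lower().split())

    def generate(self, prompt: str) -> str:
        self.stats["generations"] += 1
        row = self.db.execute(
            "SELECT response FROM memory WHERE prompt = ?", (self._key(prompt),)
        ).fetchone()
        if row is not None:
            self.stats["memory_hits"] += 1
            return row[0]
        # Best-of-N: draw N candidates and keep the highest-scoring one.
        candidates = [self.sample(prompt) for _ in range(self.n)]
        best = max(candidates, key=lambda c: self.score(prompt, c))
        best_score = self.score(prompt, best)
        if best_score >= self.min_score:
            # Only responses above the threshold are remembered.
            self.db.execute(
                "INSERT OR REPLACE INTO memory VALUES (?, ?, ?)",
                (self._key(prompt), best, best_score),
            )
            self.db.commit()
        return best


if __name__ == "__main__":
    canned = itertools.cycle(
        ["Gravity pulls.", "Gravity is the attraction between masses.", "It is a force."]
    )
    engine = MemoryBestOfN(
        sample=lambda p: next(canned),   # stand-in for a model call
        score=lambda p, r: len(r) / 50,  # stand-in scorer: longer counts as better
    )
    print(engine.generate("Explain gravity"))
    print(engine.generate("explain   GRAVITY"))
    print(engine.stats)
```

A real system would replace the stand-in sampler with model calls, the scorer with a reward model or self-critique step, and the exact-match lookup with some form of similarity search.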

Key Features Comparison

Feature	LLMForge	vLLM	TGI
Multi-device (CPU/GPU/MPS/TPU)	Yes	Limited	Limited
RL Self-Improvement	Yes	No	No
SQLite Memory	Yes	No	No
400B+ Model Support	Yes	Yes	Yes
Hugging Face Models	Yes	Yes	Yes
PyTorch Native	Yes	No	No

Architecture

LLMForge uses a layered architecture:

graph TD
    subgraph "LLMForge"
        A[User Code] --> B[InferenceEngine]
        A --> C[RLEngine]
        A --> D[Server]

        B --> E[BaseEngine]
        C --> E

        E --> F[Device Layer]
        E --> G[Model Loading]
        E --> H[Quantization]
        E --> I[Tensor Parallelism]

        F --> J[CPU]
        F --> K[CUDA]
        F --> L[MPS]
        F --> M[TPU]
    end

    N[HuggingFace Models] --> G
    O[SQLite Memory] --> C

Component Overview

Component	Description
InferenceEngine	Text generation and chat interface
RLEngine	Self-improvement strategies backed by persistent memory
Server	OpenAI-compatible API server
BaseEngine	Core: device detection, model loading, quantization, tensor parallelism
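
Because the Server component is described as OpenAI-compatible, a client would presumably talk to it with the OpenAI chat completions request shape. The sketch below builds such a request with the standard library without sending it. The host, port, and the assumption that the server exposes the `/v1/chat/completions` path are not confirmed by this page, and `build_chat_request` is a helper written for this example.

```python
import json
import urllib.request
from typing import Dict, List


def build_chat_request(base_url: str, model: str, messages: List[Dict[str, str]],
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a POST request in the OpenAI chat completions format."""
    body = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_chat_request(
        "http://localhost:8000",  # hypothetical host and port
        "Qwen/Qwen3-0.6B",
        [{"role": "user", "content": "How do I read a file in Python?"}],
    )
    print(req.get_method(), req.full_url)
    # To send it, pass req to urllib.request.urlopen and parse the JSON body;
    # in the OpenAI format the reply text is at choices[0].message.content.
```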

Community

GitHub: https://github.com/ZandrixAI/llmforge
PyPI: https://pypi.org/project/llmforge/
Issues: https://github.com/ZandrixAI/llmforge/issues

License

LLMForge is MIT licensed.