Chat API Guide
The chat module provides an interactive command-line chat interface and an API for programmatic chat generation.
Command Line Interface
# Basic chat
llmforge chat --model meta-llama/Llama-3.2-1B-Instruct
# With custom settings
llmforge chat --model meta-llama/Llama-3.2-1B-Instruct \
    --temp 0.8 \
    --max-tokens 512 \
    --system-prompt "You are a helpful assistant"
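When the chat is launched from a script rather than typed by hand, the same flags can be assembled programmatically. The sketch below uses only the flags shown above; `build_chat_command` is a hypothetical helper written for illustration and is not part of llmforge.

```python
import shlex


def build_chat_command(model, temp=None, max_tokens=None, system_prompt=None):
    """Assemble an argv list for `llmforge chat` (hypothetical helper)."""
    argv = ["llmforge", "chat", "--model", model]
    # Only emit a flag when the caller set a value, so the CLI defaults apply otherwise.
    if temp is not None:
        argv += ["--temp", str(temp)]
    if max_tokens is not None:
        argv += ["--max-tokens", str(max_tokens)]
    if system_prompt is not None:
        argv += ["--system-prompt", system_prompt]
    return argv


argv = build_chat_command(
    "meta-llama/Llama-3.2-1B-Instruct",
    temp=0.8,
    max_tokens=512,
    system_prompt="You are a helpful assistant",
)
# shlex.join quotes arguments that contain spaces, giving a copy-pasteable command.
print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv)` would start the interactive session. Keeping the arguments as a list avoids shell-quoting problems with the system prompt.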
Chat Commands
| Command | Description |
|---------|-------------|
| q | Exit chat |
| r | Reset conversation |
| h | Show help |
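To illustrate how these commands might interact with the conversation state, here is a minimal sketch of a dispatcher. It is not llmforge's implementation: `handle_command` is a hypothetical function, and the choice to keep the system prompt after a reset is an assumption.

```python
def handle_command(line, messages, system_prompt=None):
    """Process one input line and return (keep_running, messages).

    Hypothetical helper; the real chat loop may behave differently.
    """
    command = line.strip()
    if command == "q":
        # Exit chat: signal the caller to stop the loop.
        return False, messages
    if command == "r":
        # Reset conversation: drop the history, keeping only the system prompt
        # (assumption: the system prompt survives a reset).
        fresh = []
        if system_prompt is not None:
            fresh.append({"role": "system", "content": system_prompt})
        return True, fresh
    if command == "h":
        # Show help: print the command summary and leave the history untouched.
        print("q: exit chat | r: reset conversation | h: show help")
        return True, messages
    # Anything else is treated as a user message and appended to the history.
    return True, messages + [{"role": "user", "content": command}]


running, history = handle_command("What is PyTorch?", [], system_prompt="Be brief.")
running, history = handle_command("r", history, system_prompt="Be brief.")
print(running, history)
```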
Programmatic Chat
from llmforge.chat import stream_generate
from llmforge import load, make_sampler, make_prompt_cache
# Load model and tokenizer
model, tokenizer = load("meta-llama/Llama-3.2-1B-Instruct")
# Prepare messages
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is PyTorch?"},
]
# Apply chat template
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True
)
# Create sampler
sampler = make_sampler(
    temperature=0.7,
    top_p=0.9
)
# Create prompt cache
prompt_cache = make_prompt_cache(model)
# Stream generated text
for response in stream_generate(
    model,
    tokenizer,
    prompt,
    max_tokens=256,
    sampler=sampler,
    prompt_cache=prompt_cache
):
    print(response.text, end="")
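The example above generates a single reply. For a multi-turn conversation, the assistant's reply has to be appended to `messages` before the next user turn. The following sketch shows that bookkeeping with a stand-in generator, so it runs without a model. `chat_turn` and `fake_generate` are hypothetical names. With a real model, the `generate` callable would presumably wrap `apply_chat_template` and `stream_generate` and yield each `response.text`.

```python
def chat_turn(messages, user_text, generate):
    """Append a user message, stream the reply, and record it in the history."""
    messages.append({"role": "user", "content": user_text})
    chunks = []
    for piece in generate(messages):
        # Print each piece as it arrives and keep it for the history.
        print(piece, end="", flush=True)
        chunks.append(piece)
    print()
    reply = "".join(chunks)
    messages.append({"role": "assistant", "content": reply})
    return reply


def fake_generate(messages):
    """Stand-in for a model: echoes the last user message in two chunks."""
    yield "You said: "
    yield messages[-1]["content"]


history = [{"role": "system", "content": "You are a helpful assistant."}]
chat_turn(history, "What is PyTorch?", fake_generate)
chat_turn(history, "Thanks!", fake_generate)
```

Injecting the generator as a callable keeps the history handling testable independently of model loading.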
Parameters
| Parameter | CLI Flag | Type | Default | Description |
|-----------|----------|------|---------|-------------|
| model | --model | str | Llama-3.2-3B-Instruct-4bit | Model path or Hugging Face repo |
| temperature | --temp | float | 0.0 | Sampling temperature |
| top_p | --top-p | float | 1.0 | Nucleus sampling threshold |
| max_tokens | --max-tokens, -m | int | 256 | Maximum number of tokens to generate |
| system_prompt | --system-prompt | str | None | System prompt |
| adapter_path | --adapter-path | str | None | LoRA adapter path |
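To make the effect of `temperature` and `top_p` concrete, the sketch below implements the textbook form of temperature scaling followed by nucleus (top-p) sampling over a list of raw logits. It is an illustration only and not necessarily how `make_sampler` works internally. `sample_token` is a hypothetical function, and treating a temperature of 0.0 as greedy decoding is an assumed (though common) convention.

```python
import math
import random


def sample_token(logits, temperature=0.0, top_p=1.0, rng=random):
    """Pick a token index from raw logits using temperature and top-p."""
    if temperature == 0.0:
        # Assumed convention: zero temperature means greedy (argmax) decoding.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature-scaled softmax; subtracting the max keeps exp() stable.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample among the kept tokens; choices() renormalizes the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [2.0, 1.0, 0.1, -1.0]
print(sample_token(logits))  # defaults from the table: greedy decoding
print(sample_token(logits, temperature=0.7, top_p=0.9, rng=random.Random(0)))
```

Lower temperatures concentrate probability on the top tokens, while a smaller `top_p` removes the low-probability tail before sampling.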