BaseEngine API¶
BaseEngine provides the functionality shared across engines: model loading, device handling, quantization, and the forward pass.
Class Definition¶
class BaseEngine:
    def __init__(
        self,
        model_name: str,
        device: Optional[str] = None,
        tensor_parallel: bool = True,
        bits: int = 16,
        offload: bool = False,
    )
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name | str | required | HuggingFace model name or local path |
| device | Optional[str] | None | Device: "auto", "cuda", "cpu", "mps", or "tpu"; None is treated as "auto" |
| tensor_parallel | bool | True | Auto-shard across devices |
| bits | int | 16 | Quantization bits: 16 (full), 8 (INT8), 4 (INT4) |
| offload | bool | False | Enable layer offloading |
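As a rough guide to what the bits setting changes, weight memory scales linearly with bit width. The sketch below estimates weights-only memory. The helper name `estimate_weight_memory_gb` and the 7-billion-parameter model size are illustrative assumptions, not part of the BaseEngine API, and real usage also includes activations, the KV cache, and any quantization overhead.

```python
def estimate_weight_memory_gb(num_params: int, bits: int) -> float:
    """Estimate weights-only memory in GB (10**9 bytes) for a given bit width."""
    if bits not in (16, 8, 4):
        raise ValueError(f"unsupported bit width: {bits}")
    # bits / 8 converts bits per parameter to bytes per parameter.
    return num_params * bits / 8 / 1e9


# Hypothetical 7-billion-parameter model at each documented bit width.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {estimate_weight_memory_gb(7_000_000_000, bits):.1f} GB")
```

Under these assumptions the estimate halves with each step down in bit width, which is why a 4-bit load can fit on hardware that cannot hold the 16-bit weights.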
Methods¶
from_pretrained()¶
@classmethod
def from_pretrained(
    cls,
    model_name: str,
    bits: int = 4,
    device: Optional[str] = None,
    offload: bool = True,
    use_accelerate: bool = False,
) -> "BaseEngine"
Load a model with quantization and offloading enabled by default. Unlike the constructor, this method defaults to bits=4 and offload=True.
_forward()¶
def _forward(
    self,
    input_ids: torch.Tensor,
    cache: Optional[dict] = None,
) -> torch.Tensor
Core model forward pass. Returns last-token logits.
_get_eos_ids()¶
def _get_eos_ids(self) -> set
Get end-of-sequence token IDs.
_make_cache()¶
def _make_cache(self) -> dict
Create a fresh KV cache for generation.
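The three internal methods above fit together naturally in a generation loop. The following is a minimal sketch of greedy decoding against a stand-in class, not the actual BaseEngine implementation. `ToyEngine`, `greedy_generate`, the cache contents, and the choice to feed only the newest token after the first step are all assumptions made for illustration, and plain lists stand in for `torch.Tensor` to keep the example dependency-free.

```python
from typing import Optional


class ToyEngine:
    """Stand-in exposing the same three method names; not the real BaseEngine."""

    VOCAB_SIZE = 5

    def _make_cache(self) -> dict:
        # A fresh cache per generation call (contents here are hypothetical).
        return {"tokens_seen": 0}

    def _get_eos_ids(self) -> set:
        return {4}

    def _forward(self, input_ids: list, cache: Optional[dict] = None) -> list:
        # Toy "model": always favors (last token + 1) mod VOCAB_SIZE.
        if cache is not None:
            cache["tokens_seen"] += len(input_ids)
        favored = (input_ids[-1] + 1) % self.VOCAB_SIZE
        # Returns last-token logits: one score per vocabulary entry.
        return [1.0 if i == favored else 0.0 for i in range(self.VOCAB_SIZE)]


def greedy_generate(engine, prompt_ids: list, max_new_tokens: int = 10) -> list:
    cache = engine._make_cache()
    eos_ids = engine._get_eos_ids()
    generated = []
    step_input = list(prompt_ids)  # first step feeds the whole prompt
    for _ in range(max_new_tokens):
        logits = engine._forward(step_input, cache)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        generated.append(next_id)
        if next_id in eos_ids:
            break  # stop once an end-of-sequence token is produced
        step_input = [next_id]  # later steps feed only the new token
    return generated


print(greedy_generate(ToyEngine(), [0, 1]))  # prints [2, 3, 4]
```

The cache is created once per call and passed to every forward step, so each generation starts from a clean state.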
Module Functions¶
available_devices()¶
def available_devices() -> dict
Return a dict of available devices with info:
{
    "cuda": {"name": "CUDA (RTX 3090)", "available": True, "memory": "24.0 GB"},
    "cpu": {"name": "CPU", "available": True},
    "mps": {"name": "MPS (Apple Silicon)", "available": True},
}
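A caller would typically filter this dict on the "available" flag and treat "memory" as optional, since only some entries carry it. The sketch below shows one way to do that. `summarize_devices` is a hypothetical helper, and the sample data is copied from the example above rather than produced by a real call.

```python
def summarize_devices(devices: dict) -> list:
    """Return one human-readable line per device marked available."""
    lines = []
    for key, info in devices.items():
        if not info.get("available", False):
            continue
        memory = info.get("memory")  # only some entries carry this key
        suffix = f", {memory}" if memory else ""
        lines.append(f"{key}: {info['name']}{suffix}")
    return lines


# Sample data in the documented shape (not the result of a live call).
sample = {
    "cuda": {"name": "CUDA (RTX 3090)", "available": True, "memory": "24.0 GB"},
    "cpu": {"name": "CPU", "available": True},
    "mps": {"name": "MPS (Apple Silicon)", "available": True},
}

for line in summarize_devices(sample):
    print(line)
```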
_detect_device()¶
def _detect_device() -> str
Auto-detect the best available device, in priority order: CUDA > TPU > MPS > CPU.
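The priority order above amounts to a first-match scan over a fixed list. The sketch below illustrates that logic only. `pick_device` and its availability-dict argument are hypothetical, and the real function presumably probes the hardware itself rather than taking a dict.

```python
# Documented priority order: CUDA > TPU > MPS > CPU.
PRIORITY = ("cuda", "tpu", "mps", "cpu")


def pick_device(availability: dict) -> str:
    """Return the highest-priority device flagged as available."""
    for name in PRIORITY:
        if availability.get(name, False):
            return name
    # Assumption: fall back to CPU when nothing else is reported.
    return "cpu"


print(pick_device({"cuda": False, "tpu": False, "mps": True, "cpu": True}))  # prints mps
```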
_validate_device()¶
def _validate_device(device: str) -> str
Validate and return device string.
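The exact validation rules are not documented here. A minimal sketch, assuming the accepted values are the five device strings listed for the device parameter, might look like the following. The normalization step and the use of `ValueError` are assumptions, and `validate_device` is a stand-in name rather than the module's own function.

```python
# Device strings listed in the parameter table for `device`.
ALLOWED_DEVICES = ("auto", "cuda", "cpu", "mps", "tpu")


def validate_device(device: str) -> str:
    """Return the normalized device string, or raise if it is not recognized."""
    normalized = device.strip().lower()  # assumption: input is case-insensitive
    if normalized not in ALLOWED_DEVICES:
        raise ValueError(
            f"unknown device {device!r}; expected one of {', '.join(ALLOWED_DEVICES)}"
        )
    return normalized


print(validate_device("CUDA"))  # prints cuda
```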