BaseEngine API

BaseEngine provides the functionality shared by all engines: model loading, device handling, quantization, and the forward pass.

Class Definition

class BaseEngine:
    def __init__(
        self,
        model_name: str,
        device: Optional[str] = None,
        tensor_parallel: bool = True,
        bits: int = 16,
        offload: bool = False,
    )

Parameters

Parameter	Type	Default	Description
`model_name`	str	required	HuggingFace model name or local path
`device`	Optional[str]	None	Device: "auto", "cuda", "cpu", "mps", or "tpu"
`tensor_parallel`	bool	True	Auto-shard across devices
`bits`	int	16	Weight precision in bits: 16 (no quantization), 8 (INT8), 4 (INT4)
`offload`	bool	False	Enable layer offloading
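
As a rough guide to what `bits` changes, memory for the weights scales linearly with bit width. The sketch below is back-of-the-envelope arithmetic rather than part of the BaseEngine API: the helper name `weight_memory_gb` and the 7-billion-parameter model size are illustrative assumptions, and the estimate ignores activations, the KV cache, and quantization overhead such as scales and zero points.

```python
def weight_memory_gb(num_params: int, bits: int) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    if bits not in (16, 8, 4):
        raise ValueError(f"unsupported bit width: {bits}")
    # bits / 8 gives bytes per parameter.
    return num_params * bits / 8 / 1e9


# Hypothetical 7B-parameter model at each documented bit width.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(7_000_000_000, bits):.1f} GB")
```

Under these assumptions, halving `bits` halves the weight footprint, which is usually what decides whether a model fits on a single device without offloading.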

Methods

from_pretrained()

@classmethod
def from_pretrained(
    cls,
    model_name: str,
    bits: int = 4,
    device: Optional[str] = None,
    offload: bool = True,
    use_accelerate: bool = False,
) -> "BaseEngine"

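Load a model with automatic quantization and offloading. Unlike the constructor, this method defaults to 4-bit quantization (`bits=4`) with layer offloading enabled (`offload=True`).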

_forward()

def _forward(
    self,
    input_ids: torch.Tensor,
    cache: Optional[dict] = None
) -> torch.Tensor

Run the core model forward pass and return the logits for the last token.

_get_eos_ids()

def _get_eos_ids(self) -> set

Return the set of end-of-sequence token IDs.

_make_cache()

def _make_cache(self) -> dict

Create a fresh KV cache for generation.
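
The three private methods above plausibly combine into a greedy decoding loop. The sketch below shows one way that could look. It is not the library's implementation: `ToyEngine` and `greedy_generate` are hypothetical names, the toy uses plain Python lists instead of `torch.Tensor` so that it runs without dependencies, and the layout of the cache dict (a single `"step"` counter) is invented for illustration.

```python
class ToyEngine:
    """Hypothetical stand-in that mimics the three documented methods."""

    def __init__(self, script, eos_id, vocab_size=5):
        self._script = list(script)   # token IDs the toy "model" will emit
        self._eos_id = eos_id
        self._vocab_size = vocab_size

    def _make_cache(self) -> dict:
        # Assumed layout; a real KV cache would hold per-layer key/value tensors.
        return {"step": 0}

    def _get_eos_ids(self) -> set:
        return {self._eos_id}

    def _forward(self, input_ids, cache=None):
        # Return "last-token logits": highest score on the next scripted token.
        step = 0 if cache is None else cache["step"]
        if cache is not None:
            cache["step"] = step + 1
        target = self._script[min(step, len(self._script) - 1)]
        return [1.0 if i == target else 0.0 for i in range(self._vocab_size)]


def greedy_generate(engine, prompt_ids, max_new_tokens=16):
    """Greedy decoding: pick the argmax token until an EOS ID appears."""
    cache = engine._make_cache()
    eos_ids = engine._get_eos_ids()
    generated = []
    next_input = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = engine._forward(next_input, cache)
        token = max(range(len(logits)), key=logits.__getitem__)
        if token in eos_ids:
            break
        generated.append(token)
        # Assumption: the cache holds earlier context, so only the new token is fed.
        next_input = [token]
    return generated


print(greedy_generate(ToyEngine([3, 1, 4, 0], eos_id=0), [2, 2]))  # prints [3, 1, 4]
```

Creating a fresh cache per call keeps state from one generation from leaking into the next, which matches the description of `_make_cache()` as producing a fresh cache.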

Module Functions

available_devices()

def available_devices() -> dict

Return a dict of available devices with info:

{
    "cuda": {"name": "CUDA (RTX 3090)", "available": True, "memory": "24.0 GB"},
    "cpu": {"name": "CPU", "available": True},
    "mps": {"name": "MPS (Apple Silicon)", "available": True},
}
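
A caller would typically filter this dict before choosing a device. The sketch below works on the sample value shown above rather than calling the real function. The helper name `usable_devices` is hypothetical, and treating a missing `"available"` key as unavailable is an assumption.

```python
# Sample return value copied from the documentation above.
devices = {
    "cuda": {"name": "CUDA (RTX 3090)", "available": True, "memory": "24.0 GB"},
    "cpu": {"name": "CPU", "available": True},
    "mps": {"name": "MPS (Apple Silicon)", "available": True},
}


def usable_devices(info: dict) -> list:
    """Return the keys whose entry reports available=True."""
    return [name for name, entry in info.items() if entry.get("available")]


print(usable_devices(devices))  # prints ['cuda', 'cpu', 'mps']

# "memory" appears only on some entries in the sample, so read it defensively.
for name, entry in devices.items():
    print(name, entry.get("memory", "n/a"))
```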

_detect_device()

def _detect_device() -> str

Auto-detect the best device, in priority order: CUDA > TPU > MPS > CPU.
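
The priority order can be sketched as a first-match scan. This is a minimal illustration, not the library's code: `pick_device` is a hypothetical name, it takes a dict shaped like the `available_devices()` sample as input instead of probing hardware, and falling back to "cpu" when nothing is reported is an assumption.

```python
# Priority order as documented: CUDA > TPU > MPS > CPU.
PRIORITY = ("cuda", "tpu", "mps", "cpu")


def pick_device(available: dict) -> str:
    """Return the highest-priority device whose entry reports available=True."""
    for name in PRIORITY:
        if available.get(name, {}).get("available"):
            return name
    return "cpu"  # assumed fallback when no device is reported


print(pick_device({"cpu": {"available": True}, "mps": {"available": True}}))  # prints mps
```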

_validate_device()

def _validate_device(device: str) -> str

Validate the device string and return it.
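
The source does not say what validation involves. One plausible reading is a membership check against the device names listed in the parameters table, sketched below. The name `validate_device`, the whitespace and case normalization, and the use of `ValueError` are all assumptions. Indexed forms such as "cuda:0" are not handled here.

```python
# Device names taken from the parameters table above.
KNOWN_DEVICES = {"auto", "cuda", "cpu", "mps", "tpu"}


def validate_device(device: str) -> str:
    """Return the normalized device string, or raise if it is not recognized."""
    normalized = device.strip().lower()  # normalization is an assumption
    if normalized not in KNOWN_DEVICES:
        raise ValueError(
            f"unknown device {device!r}; expected one of {sorted(KNOWN_DEVICES)}"
        )
    return normalized


print(validate_device(" CUDA "))  # prints cuda
```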