Model Catalog
LARouter ships with a curated registry of Gemma 4 models from Unsloth, optimized for local inference via llama.cpp and Apple MLX.
Supported Models
| Model | Params | Context | Modality | Best For |
|---|---|---|---|---|
| Gemma 4 E2B | 2B Dense + PLE | 128K | Text, Image, Audio | Edge inference, ASR, speech translation |
| Gemma 4 E4B | 4B Dense + PLE | 128K | Text, Image, Audio | Fast local multimodal, laptops |
| Gemma 4 26B-A4B | 26B MoE | 256K | Text, Image | Best speed/quality tradeoff |
| Gemma 4 31B | 31B Dense | 256K | Text, Image | Strongest local performance |
HuggingFace Sources
GGUF (llama.cpp — All Platforms)
| Model | Repository | Quantization | Size |
|---|---|---|---|
| E2B | unsloth/gemma-4-E2B-it-GGUF | Q8_0 | ~2.1 GB |
| E4B | unsloth/gemma-4-E4B-it-GGUF | Q8_0 | ~4.3 GB |
| 26B-A4B | unsloth/gemma-4-26B-A4B-it-GGUF | UD-Q4_K_XL | ~16 GB |
| 31B | unsloth/gemma-4-31B-it-GGUF | UD-Q4_K_XL | ~19 GB |
MLX (macOS Apple Silicon)
| Model | Repository | Size |
|---|---|---|
| E4B | unsloth/gemma-4-E4B-it-MLX-8bit | ~4.1 GB |
| 26B-A4B | unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit | ~15 GB |
| 31B | unsloth/gemma-4-31b-it-MLX-8bit | ~33 GB |
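
As a rough illustration of how the two tables relate, the sketch below picks a repository for a given model: the MLX build on Apple Silicon when one is listed, and the GGUF build otherwise. The repository names and quantizations are copied from the tables above. The function names (`is_apple_silicon`, `pick_source`) and the selection rule are assumptions made for illustration, not LARouter's actual logic.

```python
import platform

# Repositories and quantizations copied from the GGUF table above.
GGUF_REPOS = {
    "E2B": ("unsloth/gemma-4-E2B-it-GGUF", "Q8_0"),
    "E4B": ("unsloth/gemma-4-E4B-it-GGUF", "Q8_0"),
    "26B-A4B": ("unsloth/gemma-4-26B-A4B-it-GGUF", "UD-Q4_K_XL"),
    "31B": ("unsloth/gemma-4-31B-it-GGUF", "UD-Q4_K_XL"),
}

# Repositories copied from the MLX table above (no E2B build is listed).
MLX_REPOS = {
    "E4B": "unsloth/gemma-4-E4B-it-MLX-8bit",
    "26B-A4B": "unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit",
    "31B": "unsloth/gemma-4-31b-it-MLX-8bit",
}


def is_apple_silicon(system=None, machine=None):
    """Hypothetical helper: True on macOS running natively on arm64."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return system == "Darwin" and machine == "arm64"


def pick_source(model, apple_silicon):
    """Return (backend, repository) for a model name from the tables.

    Assumed rule: prefer MLX on Apple Silicon when an MLX build exists,
    otherwise fall back to the GGUF build for llama.cpp.
    Raises KeyError for a model that is not in the catalog.
    """
    if apple_silicon and model in MLX_REPOS:
        return ("mlx", MLX_REPOS[model])
    repo, _quant = GGUF_REPOS[model]
    return ("llama.cpp", repo)


if __name__ == "__main__":
    for name in GGUF_REPOS:
        print(name, pick_source(name, is_apple_silicon()))
```

Since the MLX table lists no E2B build, `pick_source("E2B", True)` falls back to the GGUF repository under this rule.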
Default Tier Mapping
[!NOTE] Green = local (free), Purple = cloud (paid). Override any mapping via the WebUI or config/larouter.json.
llama.cpp Server Settings
LARouter starts each model with tuned sampling parameters. For example, the 26B-A4B model is launched as:
```shell
./llama-server \
  --model models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  --mmproj models/mmproj-BF16.gguf \
  --temp 1.0 --top-p 0.95 --top-k 64 \
  --alias "gemma-4-26b-a4b" \
  --port 8001 \
  --chat-template-kwargs '{"enable_thinking":true}'
```
LARouter auto-manages:
- Starting/stopping llama-server processes per model
- Port assignment (8001, 8002, 8003, 8004)
- Health monitoring (GET /health)
- OpenAI-compatible proxy routing
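
The process management described above could be sketched roughly as follows: assign sequential ports starting at 8001, build the `llama-server` argument list shown earlier, and poll `GET /health`. This is a minimal sketch and not LARouter's implementation. The function names and every alias other than `gemma-4-26b-a4b` are hypothetical, and it assumes the server listens on `127.0.0.1` and answers `/health` with HTTP 200 once the model is ready.

```python
import urllib.request

BASE_PORT = 8001  # first port in the range listed above


def assign_ports(aliases, base_port=BASE_PORT):
    """Map each model alias to a sequential port, in the order given."""
    return {alias: base_port + i for i, alias in enumerate(aliases)}


def build_command(model_path, mmproj_path, alias, port):
    """Build the llama-server argv using the flags from the example above."""
    return [
        "./llama-server",
        "--model", model_path,
        "--mmproj", mmproj_path,
        "--temp", "1.0", "--top-p", "0.95", "--top-k", "64",
        "--alias", alias,
        "--port", str(port),
        "--chat-template-kwargs", '{"enable_thinking":true}',
    ]


def is_healthy(port, timeout=2.0):
    """Return True if GET /health on the given local port answers 200.

    Connection failures, timeouts, and HTTP error statuses all surface as
    OSError subclasses in urllib, so they are treated as "not healthy".
    """
    url = f"http://127.0.0.1:{port}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


if __name__ == "__main__":
    # Only "gemma-4-26b-a4b" comes from the document; the rest are made up.
    aliases = ["gemma-4-e2b", "gemma-4-e4b", "gemma-4-26b-a4b", "gemma-4-31b"]
    ports = assign_ports(aliases)
    cmd = build_command(
        "models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
        "models/mmproj-BF16.gguf",
        "gemma-4-26b-a4b",
        ports["gemma-4-26b-a4b"],
    )
    print(" ".join(cmd))
    print("healthy:", is_healthy(ports["gemma-4-26b-a4b"]))
```

The argument list can be passed directly to `subprocess.Popen`, which avoids shell quoting issues with the JSON value of `--chat-template-kwargs`.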
Hardware Requirements
| Model | Min RAM | Recommended | GPU |
|---|---|---|---|
| E2B (Q8_0) | 4 GB | 8 GB | Optional |
| E4B (Q8_0) | 8 GB | 16 GB | Optional |
| 26B-A4B (UD-Q4_K_XL) | 16 GB | 24 GB | Recommended |
| 31B (UD-Q4_K_XL) | 24 GB | 32 GB | Recommended |
[!TIP] On Apple Silicon Macs, use the MLX format for native Metal acceleration; no llama.cpp compilation is needed.
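
To make the hardware table easier to apply, the sketch below returns the largest model whose RAM requirement fits a given machine. The figures are copied from the table. The function name `largest_model_for` and the choice to compare against the "Recommended" column by default are assumptions for illustration.

```python
# (model, min_ram_gb, recommended_ram_gb), ordered smallest to largest,
# copied from the hardware requirements table.
REQUIREMENTS = [
    ("E2B", 4, 8),
    ("E4B", 8, 16),
    ("26B-A4B", 16, 24),
    ("31B", 24, 32),
]


def largest_model_for(ram_gb, use_recommended=True):
    """Return the largest model that fits in ram_gb, or None if none fits.

    With use_recommended=True the "Recommended" column is the threshold;
    otherwise the "Min RAM" column is used.
    """
    best = None
    for model, min_ram, rec_ram in REQUIREMENTS:
        needed = rec_ram if use_recommended else min_ram
        if ram_gb >= needed:
            best = model  # later entries are larger, so keep overwriting
    return best


if __name__ == "__main__":
    for ram in (4, 16, 24, 64):
        print(ram, "GB ->", largest_model_for(ram),
              "(recommended) /", largest_model_for(ram, use_recommended=False),
              "(minimum)")
```

The table's RAM figures apply to the GGUF quantizations listed, so the sketch does not account for the larger MLX 8-bit builds.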