
Model Catalog

LARouter ships with a curated registry of Gemma 4 models from Unsloth, optimized for local inference via llama.cpp and Apple MLX.

Supported Models

| Model | Params | Context | Modality | Best For |
| --- | --- | --- | --- | --- |
| Gemma 4 E2B | 2B Dense + PLE | 128K | Text, Image, Audio | Edge inference, ASR, speech translation |
| Gemma 4 E4B | 4B Dense + PLE | 128K | Text, Image, Audio | Fast local multimodal, laptops |
| Gemma 4 26B-A4B | 26B MoE | 256K | Text, Image | Best speed/quality tradeoff |
| Gemma 4 31B | 31B Dense | 256K | Text, Image | Strongest local performance |

HuggingFace Sources

GGUF (llama.cpp — All Platforms)

| Model | Repository | Quantization | Size |
| --- | --- | --- | --- |
| E2B | unsloth/gemma-4-E2B-it-GGUF | Q8_0 | ~2.1 GB |
| E4B | unsloth/gemma-4-E4B-it-GGUF | Q8_0 | ~4.3 GB |
| 26B-A4B | unsloth/gemma-4-26B-A4B-it-GGUF | UD-Q4_K_XL | ~16 GB |
| 31B | unsloth/gemma-4-31B-it-GGUF | UD-Q4_K_XL | ~19 GB |

MLX (macOS Apple Silicon)

| Model | Repository | Size |
| --- | --- | --- |
| E4B | unsloth/gemma-4-E4B-it-MLX-8bit | ~4.1 GB |
| 26B-A4B | unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit | ~15 GB |
| 31B | unsloth/gemma-4-31b-it-MLX-8bit | ~33 GB |
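
The two source tables suggest a simple selection rule: prefer the MLX build on Apple Silicon when one is listed, and fall back to GGUF everywhere else. The sketch below is one way to express that rule; it is not LARouter's actual registry code. The repository names are copied from the tables, while `pick_source` and the dictionary names are hypothetical helpers introduced for illustration.

```python
import platform

# Repository names copied from the GGUF and MLX tables above.
GGUF_REPOS = {
    "E2B": "unsloth/gemma-4-E2B-it-GGUF",
    "E4B": "unsloth/gemma-4-E4B-it-GGUF",
    "26B-A4B": "unsloth/gemma-4-26B-A4B-it-GGUF",
    "31B": "unsloth/gemma-4-31B-it-GGUF",
}
MLX_REPOS = {
    "E4B": "unsloth/gemma-4-E4B-it-MLX-8bit",
    "26B-A4B": "unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit",
    "31B": "unsloth/gemma-4-31b-it-MLX-8bit",
}


def pick_source(model, system=None, machine=None):
    """Return (format, repository) for a model key.

    Hypothetical helper: `system` and `machine` default to the values
    reported by the standard `platform` module for the current host.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    if model not in GGUF_REPOS:
        raise KeyError(f"unknown model: {model}")
    # Apple Silicon Macs report "Darwin" / "arm64".
    on_apple_silicon = system == "Darwin" and machine == "arm64"
    if on_apple_silicon and model in MLX_REPOS:
        return ("mlx", MLX_REPOS[model])
    # No MLX build listed (e.g. E2B) or a non-Apple platform: use GGUF.
    return ("gguf", GGUF_REPOS[model])
```

Calling `pick_source("E2B")` on an Apple Silicon Mac would still return the GGUF repository, because the MLX table has no E2B entry.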

Default Tier Mapping

[!NOTE] Green = local (free), Purple = cloud (paid). Override any mapping via the WebUI or config/larouter.json.

llama.cpp Server Settings

LARouter starts each model with preset sampling and server parameters. For example, the 26B-A4B model is launched as follows:

./llama-server \
--model models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
--mmproj models/mmproj-BF16.gguf \
--temp 1.0 --top-p 0.95 --top-k 64 \
--alias "gemma-4-26b-a4b" \
--port 8001 \
--chat-template-kwargs '{"enable_thinking":true}'

LARouter auto-manages:

  • Starting/stopping llama-server processes per model
  • Port assignment (8001, 8002, 8003, 8004)
  • Health monitoring (GET /health)
  • OpenAI-compatible proxy routing
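
To make the managed steps concrete, here is a minimal sketch of port assignment, command construction, and health polling. It assumes ports are handed out consecutively from 8001 and that a server counts as ready once `GET /health` returns HTTP 200. The function names (`assign_ports`, `build_command`, `is_healthy`) are hypothetical and are not LARouter's internal API; the flags mirror the command shown earlier in this section.

```python
import json
import urllib.error
import urllib.request

BASE_PORT = 8001  # first port in the range listed above


def assign_ports(aliases):
    """Map each alias to a consecutive port starting at BASE_PORT (assumed scheme)."""
    return {alias: BASE_PORT + i for i, alias in enumerate(aliases)}


def build_command(model_path, mmproj_path, alias, port):
    """Build the llama-server argument list for one model."""
    return [
        "./llama-server",
        "--model", model_path,
        "--mmproj", mmproj_path,
        "--temp", "1.0", "--top-p", "0.95", "--top-k", "64",
        "--alias", alias,
        "--port", str(port),
        "--chat-template-kwargs", json.dumps({"enable_thinking": True}),
    ]


def is_healthy(port, timeout=2.0):
    """Return True if GET /health on the local server answers with HTTP 200."""
    url = f"http://127.0.0.1:{port}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as "not ready".
        return False
```

The returned list can be passed directly to `subprocess.Popen`, after which a caller would poll `is_healthy(port)` until it returns `True` before routing requests to that port.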

Hardware Requirements

| Model | Min RAM | Recommended RAM | GPU |
| --- | --- | --- | --- |
| E2B (Q8_0) | 4 GB | 8 GB | Optional |
| E4B (Q8_0) | 8 GB | 16 GB | Optional |
| 26B-A4B (UD-Q4_K_XL) | 16 GB | 24 GB | Recommended |
| 31B (UD-Q4_K_XL) | 24 GB | 32 GB | Recommended |
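
As a worked reading of the table above, the following sketch picks the largest model whose RAM requirement fits a given machine. The thresholds are copied from the table; `largest_model_for` is a hypothetical helper, and using the recommended figure by default is an assumption rather than documented LARouter behavior.

```python
# (model, minimum RAM in GB, recommended RAM in GB), ordered smallest to largest.
REQUIREMENTS = [
    ("E2B", 4, 8),
    ("E4B", 8, 16),
    ("26B-A4B", 16, 24),
    ("31B", 24, 32),
]


def largest_model_for(ram_gb, use_recommended=True):
    """Return the largest model that fits in `ram_gb`, or None if none fits."""
    column = 2 if use_recommended else 1  # which threshold to compare against
    best = None
    for row in REQUIREMENTS:
        if ram_gb >= row[column]:
            best = row[0]  # later rows are larger models, so keep the last match
    return best
```

For a 16 GB machine this returns `"E4B"` against the recommended figures and `"26B-A4B"` against the minimums, which shows how much headroom the recommended column adds.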

[!TIP] On Apple Silicon Macs, use the MLX format for native Metal acceleration — no llama.cpp compilation needed.