Skip to main content

Local Models

LA Router's local-first architecture lets you run AI models directly on your own hardware — keeping data private, eliminating API costs, and enabling offline operation.

How Local Routing Works

When a request arrives, LA Router's hybrid classifier evaluates the task complexity. If the task falls into a Heartbeat, Simple, or Moderate tier, it is routed to a local model running on your machine via llama.cpp. No data ever leaves your network.

Your App → LA Router → Classifier

┌───────────┼───────────┐
▼ ▼ ▼
Heartbeat Simple Moderate
(2B local) (4B local) (26B local)
│ │ │
└───────────┼───────────┘

llama.cpp
(your hardware)

Small Models — Edge & Quick Tasks

For lightweight tasks like greetings, status checks, formatting, and simple Q&A, LA Router uses small local models (2B–4B parameters). These models:

  • Start up instantly and respond in milliseconds
  • Run on any laptop, even without a GPU
  • Handle routine tasks with zero API cost
  • Are ideal for edge devices, phone-class hardware, and CI pipelines
ModelParamsBest ForHardware
SureCentric LLM2B2BGreetings, status, ASR, speechAny laptop / phone
SureCentric LLM4B4BSummaries, drafts, simple codeLaptop with 8GB RAM

Medium Models — Private Data Processing

For tasks that involve sensitive or proprietary data — document analysis, internal search, compliance review — LA Router routes to medium local models (26B–31B parameters). These models:

  • Process private data entirely on-premises
  • Never transmit content to external APIs
  • Offer quality competitive with cloud models for most business tasks
  • Support multimodal inputs (text + images)
ModelParamsBest ForHardware
SureCentric LLM26B26B (MoE)Document analysis, code review, private data16GB+ RAM, Apple Silicon recommended
SureCentric LLM31B31BComplex reasoning on private data32GB+ RAM, GPU recommended
Private Data Guarantee

When LA Router classifies a task as Heartbeat, Simple, or Moderate, no data is sent to any external service. The entire inference happens locally on your hardware.

Downloading Models from HuggingFace

All local models are downloaded from HuggingFace — the largest open-source model hub. LA Router supports two formats:

GGUF (Universal)

GGUF models run on any hardware via llama.cpp — CPU, GPU, or mixed. They use quantization to reduce memory requirements while preserving quality.

# Download via the dashboard (recommended)
# Navigate to http://localhost:5174/models → click "Download GGUF"

# Or via API
curl -X POST http://127.0.0.1:18790/api/models/download \
-H "Content-Type: application/json" \
-d '{"modelId": "sc-llm4b", "format": "gguf"}'

MLX (Apple Silicon)

MLX models are optimized for Apple Silicon Macs (M1/M2/M3/M4), leveraging the unified memory architecture for maximum performance.

# Download via dashboard → click "Download MLX"
# Or via API with format: "mlx"

Using Privately Trained & Licensed Models

LA Router also supports custom fine-tuned models and commercially licensed LLMs downloaded from HuggingFace:

  1. Fine-tuned models — If your organization has fine-tuned a model for a specific domain (legal, medical, financial), you can host it on a private HuggingFace repository and download it through LA Router.

  2. Licensed models — Commercial models distributed through HuggingFace (with gated access) can be downloaded using your HuggingFace token:

# Set your HuggingFace token for gated model access
export HF_TOKEN=hf_your_token_here

# The model will be downloaded and served locally via llama.cpp
  1. Air-gapped deployment — For maximum security, models can be downloaded once and transferred to air-gapped environments. LA Router manages the local model lifecycle without requiring internet access after initial download.

Model Lifecycle

LA Router manages the full lifecycle of local models:

StageDescription
DownloadPull GGUF or MLX weights from HuggingFace
StartLaunch llama-server with the model loaded
ServeRoute matching requests to the running model
StopGracefully shut down when no longer needed
UpdateCheck for newer quantization versions

All of this is managed through the Models page in the dashboard or via the REST API.

Models Dashboard