API Reference

Core Concepts

Understanding OllamaFlow's core concepts is essential for effective deployment and management. This guide covers the three fundamental components: Frontends, Backends, and Models.

Frontends

A Frontend is a virtual Ollama endpoint that clients connect to. Frontends define how requests are routed and which backends serve those requests.

Frontend Properties

| Property | Description | Default |
|---|---|---|
| Identifier | Unique identifier for the frontend | Required |
| Name | Human-readable name | Required |
| Hostname | Hostname pattern (`*` for catch-all) | `*` |
| TimeoutMs | Request timeout in milliseconds | 60000 |
| LoadBalancing | Load balancing algorithm | RoundRobin |
| Backends | List of backend identifiers to use | [] |
| RequiredModels | Models that must be available | [] |
| AllowEmbeddings | Allow embeddings API requests | true |
| AllowCompletions | Allow completions API requests | true |
| PinnedEmbeddingsProperties | Enforce specific embeddings parameters | {} |
| PinnedCompletionsProperties | Enforce specific completion parameters | {} |
| MaxRequestBodySize | Maximum request size in bytes | 536870912 (512 MB) |
| UseStickySessions | Enable session stickiness | false |
| StickySessionExpirationMs | Session timeout in milliseconds | 1800000 (30 min) |

Load Balancing Algorithms

Round Robin (RoundRobin)

  • Cycles through backends sequentially
  • Ensures even distribution of requests
  • Best for uniform backend capacity

Random (Random)

  • Randomly selects from healthy backends
  • Good for stateless workloads
  • Provides natural load distribution
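
The two algorithms can be sketched in a few lines. This is an illustrative sketch, not OllamaFlow's actual implementation; the backend identifiers are taken from the configuration example below.

```python
import random
from itertools import cycle

def make_round_robin(backends):
    """Return a selector that cycles through backends sequentially."""
    it = cycle(backends)
    return lambda: next(it)

def pick_random(backends):
    """Randomly select one of the healthy backends."""
    return random.choice(backends)

rr = make_round_robin(["gpu-1", "gpu-2", "gpu-3"])
picks = [rr() for _ in range(6)]
# picks == ["gpu-1", "gpu-2", "gpu-3", "gpu-1", "gpu-2", "gpu-3"]
```

With uniform backend capacity the round-robin cycle gives a perfectly even request distribution, while random selection converges to even distribution only over many requests.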

Session Stickiness

Session Stickiness ensures that clients are consistently routed to the same backend for subsequent requests, which is useful for:

  • Stateful Applications: When backends maintain client-specific state
  • Model Warm-up: Keeping frequently accessed models loaded on specific backends
  • Performance Optimization: Reducing model switching overhead

How It Works:

  1. Client Identification: Uses client IP address as identifier
  2. Backend Binding: First request creates a session binding client to a specific backend
  3. Session Persistence: Subsequent requests from the same client route to the bound backend
  4. Automatic Expiration: Sessions expire after the configured timeout period
  5. Health Awareness: Sessions are invalidated if the bound backend becomes unhealthy
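
The steps above can be sketched as an in-memory session table keyed by client IP. This is a hypothetical helper illustrating the binding, expiration, and health-invalidation rules, not OllamaFlow's actual code.

```python
import time

class StickySessions:
    """Sketch of IP-based session stickiness."""

    def __init__(self, expiration_ms=1_800_000):
        self.expiration_ms = expiration_ms
        self.sessions = {}  # client_ip -> (backend_id, last_seen_ms)

    def select(self, client_ip, healthy_backends, fallback):
        now = time.time() * 1000
        entry = self.sessions.get(client_ip)
        if entry:
            backend, last_seen = entry
            # Invalidate the binding if it expired or the backend went unhealthy
            if now - last_seen > self.expiration_ms or backend not in healthy_backends:
                del self.sessions[client_ip]
            else:
                self.sessions[client_ip] = (backend, now)
                return backend
        # First request (or invalidated session): bind via the load balancer
        backend = fallback(healthy_backends)
        self.sessions[client_ip] = (backend, now)
        return backend
```

Here `fallback` stands in for the frontend's configured load balancing algorithm, which picks a backend whenever no valid session binding exists.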

Configuration:

  • UseStickySessions: Enable/disable session stickiness (default: false)
  • StickySessionExpirationMs: Session timeout in milliseconds (default: 1800000, i.e. 30 minutes)
    • Minimum: 10,000 ms (10 seconds)
    • Maximum: 86,400,000 ms (24 hours)

Session Management:

  • Sessions are automatically cleaned up every 5 minutes
  • Expired sessions are removed from memory
  • Backend failures invalidate all associated sessions
  • Sessions are not persisted across OllamaFlow restarts

Security Controls

Frontend security controls enable fine-grained access control and request parameter enforcement:

Request Type Controls

  • AllowEmbeddings: Controls whether embeddings API endpoints are accessible through this frontend
    • Ollama API: /api/embed
    • OpenAI API: /v1/embeddings
  • AllowCompletions: Controls whether completion API endpoints are accessible through this frontend
    • Ollama API: /api/generate, /api/chat
    • OpenAI API: /v1/completions, /v1/chat/completions

For a request to succeed, both the frontend and at least one assigned backend must allow the request type.
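
This combined check can be expressed as a simple predicate. A minimal sketch, assuming frontend and backends are plain dictionaries using the property names from the tables in this guide:

```python
def request_allowed(kind, frontend, backends):
    """A request type succeeds only if the frontend allows it AND at least
    one assigned backend allows it. `kind` is "Embeddings" or "Completions"."""
    key = f"Allow{kind}"  # AllowEmbeddings / AllowCompletions, default true
    return bool(frontend.get(key, True)) and any(b.get(key, True) for b in backends)
```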

Pinned Properties

Pinned properties allow administrators to enforce specific parameters in requests:

  • PinnedEmbeddingsProperties: Key-value pairs automatically merged into all embeddings requests
  • PinnedCompletionsProperties: Key-value pairs automatically merged into all completion requests

Common use cases:

  • Enforce maximum context size: {"options": {"num_ctx": 2048}}
  • Standardize temperature settings: {"options": {"temperature": 0.7}}
  • Override model selection: {"model": "approved-model:latest"}

Properties are merged with client requests, with pinned properties taking precedence over client-specified values.
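
The merge behavior can be illustrated with a small recursive sketch (an assumption about the merge semantics based on the description above, not OllamaFlow's exact implementation):

```python
def merge_pinned(client_request, pinned):
    """Merge pinned properties over a client request.
    Pinned values win on conflicts; nested dicts are merged key by key."""
    merged = dict(client_request)
    for key, value in pinned.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_pinned(merged[key], value)
        else:
            merged[key] = value
    return merged

req = {"model": "llama3:8b", "options": {"temperature": 1.5, "top_p": 0.9}}
pinned = {"options": {"temperature": 0.7, "num_ctx": 2048}}
merge_pinned(req, pinned)
# -> {"model": "llama3:8b",
#     "options": {"temperature": 0.7, "top_p": 0.9, "num_ctx": 2048}}
```

Note that the client's `top_p` survives because only conflicting keys are overridden; the pinned `temperature` replaces the client's value.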

Frontend Configuration Example

{
  "Identifier": "production-frontend",
  "Name": "Production AI Inference",
  "Hostname": "ai.company.com",
  "LoadBalancing": "RoundRobin",
  "TimeoutMs": 90000,
  "Backends": ["gpu-1", "gpu-2", "gpu-3"],
  "RequiredModels": ["llama3:8b", "mistral:7b", "codellama"],
  "AllowEmbeddings": true,
  "AllowCompletions": true,
  "PinnedEmbeddingsProperties": {
    "model": "nomic-embed-text",
    "options": {
      "temperature": 0.1
    }
  },
  "PinnedCompletionsProperties": {
    "options": {
      "temperature": 0.7,
      "num_ctx": 2048
    }
  },
  "MaxRequestBodySize": 1073741824,
  "UseStickySessions": true,
  "StickySessionExpirationMs": 3600000
}

Backends

A Backend represents a physical Ollama instance in your infrastructure. Backends handle the actual AI inference requests.

Backend Properties

| Property | Description | Default |
|---|---|---|
| Identifier | Unique identifier for the backend | Required |
| Name | Human-readable name | Required |
| Hostname | Ollama server hostname/IP | Required |
| Port | Ollama server port | 11434 |
| Ssl | Enable HTTPS for backend communication | false |
| HealthCheckUrl | URL path for health checks | / |
| HealthCheckMethod | HTTP method for health checks (GET or HEAD) | GET |
| UnhealthyThreshold | Failed checks before marking unhealthy | 2 |
| HealthyThreshold | Successful checks before marking healthy | 2 |
| MaxParallelRequests | Maximum concurrent requests | 4 |
| RateLimitRequestsThreshold | Rate limiting threshold | 10 |
| AllowEmbeddings | Allow embeddings API requests | true |
| AllowCompletions | Allow completions API requests | true |
| Labels | Labels that influence backend selection and routing | [] |
| PinnedEmbeddingsProperties | Enforce specific embeddings parameters | {} |
| PinnedCompletionsProperties | Enforce specific completion parameters | {} |

Health Monitoring

OllamaFlow continuously monitors backend health:

  • Health Checks: Periodic HTTP requests to validate backend availability
  • Automatic Failover: Unhealthy backends are removed from load balancing rotation
  • Recovery Detection: Backends are automatically restored when they become healthy

Backend States

  • Healthy: Backend is responding to health checks and available for requests
  • Unhealthy: Backend has failed health checks and is excluded from rotation
  • Unknown: Initial state before first health check completion
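
The three states and the threshold-based transitions can be sketched as a small state machine. This is an illustrative model of the UnhealthyThreshold/HealthyThreshold semantics described above, not OllamaFlow's actual code:

```python
class BackendHealth:
    """Threshold-based backend health tracking sketch."""

    def __init__(self, unhealthy_threshold=2, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.state = "Unknown"  # Initial state before checks complete
        self._successes = 0
        self._failures = 0

    def record(self, check_passed):
        """Record one health check result and update the state."""
        if check_passed:
            self._successes += 1
            self._failures = 0
            if self._successes >= self.healthy_threshold:
                self.state = "Healthy"
        else:
            self._failures += 1
            self._successes = 0
            if self._failures >= self.unhealthy_threshold:
                self.state = "Unhealthy"
        return self.state
```

Requiring consecutive successes or failures before a state change prevents a single transient error (or a single lucky response) from flapping a backend in and out of the rotation.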

Backend Configuration Example

{
  "Identifier": "gpu-server-1",
  "Name": "Primary GPU Server",
  "Hostname": "192.168.1.100",
  "Port": 11434,
  "Ssl": false,
  "HealthCheckUrl": "/",
  "HealthCheckMethod": "GET",
  "UnhealthyThreshold": 3,
  "HealthyThreshold": 2,
  "MaxParallelRequests": 8,
  "RateLimitRequestsThreshold": 20,
  "AllowEmbeddings": true,
  "AllowCompletions": true,
  "Labels": [
    "europe",
    "gdpr"
  ],
  "PinnedEmbeddingsProperties": {
    "options": {
      "num_ctx": 512
    }
  },
  "PinnedCompletionsProperties": {
    "options": {
      "num_ctx": 4096,
      "temperature": 0.8
    }
  }
}

Models

OllamaFlow provides intelligent model management across your backend fleet.

Model Discovery

  • Automatic Detection: OllamaFlow periodically discovers available models on each backend
  • Real-time Updates: Model availability is continuously tracked
  • Cross-Backend Visibility: View which models are available on which backends

Model Synchronization

When a frontend specifies RequiredModels, OllamaFlow automatically:

  1. Checks Availability: Verifies that the required models exist on the associated backends
  2. Downloads Missing Models: Pulls models to backends that don't have them
  3. Parallel Operations: Downloads models concurrently for faster provisioning
  4. Status Tracking: Monitors sync progress and completion
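
The availability check in step 1 amounts to comparing each backend's discovered model inventory against the frontend's RequiredModels list. A minimal sketch (the inventory mapping is a hypothetical input, standing in for OllamaFlow's model discovery results):

```python
def plan_model_sync(required_models, backend_inventory):
    """For each backend, list the required models it is missing so they
    can be pulled (concurrently) during synchronization.

    backend_inventory maps backend identifier -> set of discovered models.
    """
    return {
        backend: [m for m in required_models if m not in models]
        for backend, models in backend_inventory.items()
    }

plan = plan_model_sync(
    ["llama3:8b", "mistral:7b"],
    {"gpu-1": {"llama3:8b"}, "gpu-2": {"llama3:8b", "mistral:7b"}},
)
# -> {"gpu-1": ["mistral:7b"], "gpu-2": []}
```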

Model Management Flow

graph TD
    A[Frontend Configured] --> B[Check Required Models]
    B --> C{Models Available?}
    C -->|Yes| D[Route Requests]
    C -->|No| E[Start Model Sync]
    E --> F[Pull Missing Models]
    F --> G[Update Model Inventory]
    G --> D[Route Requests]

Model Requirements Example

{
  "RequiredModels": [
    "llama3:8b",
    "mistral:7b",
    "codellama:13b",
    "nomic-embed-text"
  ]
}

Request Flow

Understanding how requests flow through OllamaFlow:

  1. Client Request: Client sends request to OllamaFlow frontend
  2. Frontend Matching: OllamaFlow matches request hostname to frontend
  3. Backend Selection: Load balancing algorithm selects healthy backend
  4. Model Verification: Ensures required model is available on selected backend
  5. Request Proxy: Request is forwarded to selected backend
  6. Response Streaming: Response is streamed back to client
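
Step 2, frontend matching, pairs the request's Host header with a frontend's Hostname pattern, falling back to a `*` catch-all frontend if no specific pattern matches. A sketch under that assumption (glob matching via `fnmatch` is an illustrative choice; the actual matching rules may differ):

```python
import fnmatch

def match_frontend(host, frontends):
    """Match a request's Host header to a configured frontend.
    A frontend with Hostname "*" acts as the catch-all fallback."""
    catch_all = None
    for fe in frontends:
        pattern = fe.get("Hostname", "*")
        if pattern == "*":
            catch_all = fe
        elif fnmatch.fnmatch(host, pattern):
            return fe
    return catch_all

frontends = [
    {"Identifier": "default", "Hostname": "*"},
    {"Identifier": "production-frontend", "Hostname": "ai.company.com"},
]
match_frontend("ai.company.com", frontends)   # -> production-frontend
match_frontend("dev.company.com", frontends)  # -> default (catch-all)
```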

Configuration Persistence

All frontend and backend configurations are stored in a SQLite database (ollamaflow.db), ensuring:

  • Persistence: Configurations survive restarts
  • Atomic Updates: Configuration changes are transactional
  • Historical Tracking: Creation and update timestamps are maintained
  • Backup-Friendly: Single file database for easy backup/restore

Next Steps