API Reference

Core Concepts

Understanding OllamaFlow's core concepts is essential for effective deployment and management. This guide covers the three fundamental components: Frontends, Backends, and Models.

Frontends

A Frontend is a virtual Ollama endpoint that clients connect to. Frontends define how requests are routed and which backends serve those requests.

Frontend Properties

| Property | Description | Default |
|---|---|---|
| Identifier | Unique identifier for the frontend | Required |
| Name | Human-readable name | Required |
| Hostname | Hostname pattern (`*` for catch-all) | `*` |
| TimeoutMs | Request timeout in milliseconds | 60000 |
| LoadBalancing | Load balancing algorithm | RoundRobin |
| Backends | List of backend identifiers to use | [] |
| RequiredModels | Models that must be available | [] |
| AllowEmbeddings | Allow embeddings API requests | true |
| AllowCompletions | Allow completions API requests | true |
| PinnedEmbeddingsProperties | Enforce specific embeddings parameters | {} |
| PinnedCompletionsProperties | Enforce specific completion parameters | {} |
| MaxRequestBodySize | Maximum request size in bytes | 536870912 (512 MB) |
| UseStickySessions | Enable session stickiness | false |
| StickySessionExpirationMs | Session timeout in milliseconds | 1800000 (30 min) |

Load Balancing Algorithms

Round Robin (RoundRobin)

  • Cycles through backends sequentially
  • Ensures even distribution of requests
  • Best for uniform backend capacity

Random (Random)

  • Randomly selects from healthy backends
  • Good for stateless workloads
  • Provides natural load distribution
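
The two algorithms can be sketched in a few lines. This is an illustrative sketch, not OllamaFlow's actual implementation; the backend identifiers are taken from the configuration example below.

```python
import random
from itertools import cycle

def make_round_robin(backends):
    """Return a selector that cycles through backends sequentially."""
    it = cycle(backends)
    return lambda: next(it)

def pick_random(backends):
    """Randomly select one of the healthy backends."""
    return random.choice(backends)

rr = make_round_robin(["gpu-1", "gpu-2", "gpu-3"])
picks = [rr() for _ in range(6)]
# picks == ["gpu-1", "gpu-2", "gpu-3", "gpu-1", "gpu-2", "gpu-3"]
```

With uniform backend capacity the round-robin cycle gives a perfectly even request distribution, while random selection converges to even distribution only over many requests.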

Session Stickiness

Session Stickiness ensures that clients are consistently routed to the same backend for subsequent requests, which is useful for:

  • Stateful Applications: When backends maintain client-specific state
  • Model Warm-up: Keeping frequently accessed models loaded on specific backends
  • Performance Optimization: Reducing model switching overhead

How It Works:

  1. Client Identification: Uses client IP address as identifier
  2. Backend Binding: First request creates a session binding client to a specific backend
  3. Session Persistence: Subsequent requests from the same client route to the bound backend
  4. Automatic Expiration: Sessions expire after the configured timeout period
  5. Health Awareness: Sessions are invalidated if the bound backend becomes unhealthy
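
The steps above can be sketched as an in-memory session table keyed by client IP. This is a hypothetical helper illustrating the binding, expiration, and health-invalidation rules, not OllamaFlow's actual code.

```python
import time

class StickySessions:
    """Sketch of IP-based session stickiness."""

    def __init__(self, expiration_ms=1_800_000):
        self.expiration_ms = expiration_ms
        self.sessions = {}  # client_ip -> (backend_id, last_seen_ms)

    def select(self, client_ip, healthy_backends, fallback):
        now = time.time() * 1000
        entry = self.sessions.get(client_ip)
        if entry:
            backend, last_seen = entry
            # Invalidate the binding if it expired or the backend went unhealthy
            if now - last_seen > self.expiration_ms or backend not in healthy_backends:
                del self.sessions[client_ip]
            else:
                self.sessions[client_ip] = (backend, now)
                return backend
        # First request (or invalidated session): bind via the load balancer
        backend = fallback(healthy_backends)
        self.sessions[client_ip] = (backend, now)
        return backend
```

Here `fallback` stands in for the frontend's configured load balancing algorithm, which picks a backend whenever no valid session binding exists.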

Configuration:

  • UseStickySessions: Enable/disable session stickiness (default: false)
  • StickySessionExpirationMs: Session timeout in milliseconds (default: 1800000, i.e. 30 minutes)
    • Minimum: 10,000 ms (10 seconds)
    • Maximum: 86,400,000 ms (24 hours)

Session Management:

  • Sessions are automatically cleaned up every 5 minutes
  • Expired sessions are removed from memory
  • Backend failures invalidate all associated sessions
  • Sessions are not persisted across OllamaFlow restarts

Security Controls

Frontend security controls enable fine-grained access control and request parameter enforcement:

Request Type Controls

  • AllowEmbeddings: Controls whether embeddings API endpoints are accessible through this frontend
    • Ollama API: /api/embed
    • OpenAI API: /v1/embeddings
  • AllowCompletions: Controls whether completion API endpoints are accessible through this frontend
    • Ollama API: /api/generate, /api/chat
    • OpenAI API: /v1/completions, /v1/chat/completions

For a request to succeed, both the frontend and at least one assigned backend must allow the request type.
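
This combined check can be expressed as a simple predicate. A minimal sketch, assuming frontend and backends are plain dictionaries using the property names from the tables in this guide:

```python
def request_allowed(kind, frontend, backends):
    """A request type succeeds only if the frontend allows it AND at least
    one assigned backend allows it. `kind` is "Embeddings" or "Completions"."""
    key = f"Allow{kind}"  # AllowEmbeddings / AllowCompletions, default true
    return bool(frontend.get(key, True)) and any(b.get(key, True) for b in backends)
```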

Pinned Properties

Pinned properties allow administrators to enforce specific parameters in requests:

  • PinnedEmbeddingsProperties: Key-value pairs automatically merged into all embeddings requests
  • PinnedCompletionsProperties: Key-value pairs automatically merged into all completion requests

Common use cases:

  • Enforce maximum context size: {"options": {"num_ctx": 2048}}
  • Standardize temperature settings: {"options": {"temperature": 0.7}}
  • Override model selection: {"model": "approved-model:latest"}

Properties are merged with client requests, with pinned properties taking precedence over client-specified values.
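
The merge behavior can be illustrated with a small recursive sketch (an assumption about the merge semantics based on the description above, not OllamaFlow's exact implementation):

```python
def merge_pinned(client_request, pinned):
    """Merge pinned properties over a client request.
    Pinned values win on conflicts; nested dicts are merged key by key."""
    merged = dict(client_request)
    for key, value in pinned.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_pinned(merged[key], value)
        else:
            merged[key] = value
    return merged

req = {"model": "llama3:8b", "options": {"temperature": 1.5, "top_p": 0.9}}
pinned = {"options": {"temperature": 0.7, "num_ctx": 2048}}
merge_pinned(req, pinned)
# -> {"model": "llama3:8b",
#     "options": {"temperature": 0.7, "top_p": 0.9, "num_ctx": 2048}}
```

Note that the client's `top_p` survives because only conflicting keys are overridden; the pinned `temperature` replaces the client's value.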

Frontend Configuration Example

{
  "Identifier": "production-frontend",
  "Name": "Production AI Inference",
  "Hostname": "ai.company.com",
  "LoadBalancing": "RoundRobin",
  "TimeoutMs": 90000,
  "Backends": ["gpu-1", "gpu-2", "gpu-3"],
  "RequiredModels": ["llama3:8b", "mistral:7b", "codellama"],
  "AllowEmbeddings": true,
  "AllowCompletions": true,
  "PinnedEmbeddingsProperties": {
    "model": "nomic-embed-text",
    "options": {
      "temperature": 0.1
    }
  },
  "PinnedCompletionsProperties": {
    "options": {
      "temperature": 0.7,
      "num_ctx": 2048
    }
  },
  "MaxRequestBodySize": 1073741824,
  "UseStickySessions": true,
  "StickySessionExpirationMs": 3600000
}

Backends

A Backend represents a physical Ollama instance in your infrastructure. Backends handle the actual AI inference requests.

Backend Properties

| Property | Description | Default |
|---|---|---|
| Identifier | Unique identifier for the backend | Required |
| Name | Human-readable name | Required |
| Hostname | Ollama server hostname/IP | Required |
| Port | Ollama server port | 11434 |
| Ssl | Enable HTTPS for backend communication | false |
| HealthCheckUrl | URL path for health checks | / |
| HealthCheckMethod | HTTP method for health checks (GET or HEAD) | GET |
| UnhealthyThreshold | Failed checks before marking unhealthy | 2 |
| HealthyThreshold | Successful checks before marking healthy | 2 |
| MaxParallelRequests | Maximum concurrent requests | 4 |
| RateLimitRequestsThreshold | Rate limiting threshold | 10 |
| AllowEmbeddings | Allow embeddings API requests | true |
| AllowCompletions | Allow completions API requests | true |
| Labels | Labels that influence backend selection and routing | [] |
| PinnedEmbeddingsProperties | Enforce specific embeddings parameters | {} |
| PinnedCompletionsProperties | Enforce specific completion parameters | {} |

Health Monitoring

OllamaFlow continuously monitors backend health:

  • Health Checks: Periodic HTTP requests to validate backend availability
  • Automatic Failover: Unhealthy backends are removed from load balancing rotation
  • Recovery Detection: Backends are automatically restored when they become healthy

Backend States

  • Healthy: Backend is responding to health checks and available for requests
  • Unhealthy: Backend has failed health checks and is excluded from rotation
  • Unknown: Initial state before first health check completion
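
The three states and the threshold-based transitions can be sketched as a small state machine. This is an illustrative model of the UnhealthyThreshold/HealthyThreshold semantics described above, not OllamaFlow's actual code:

```python
class BackendHealth:
    """Threshold-based backend health tracking sketch."""

    def __init__(self, unhealthy_threshold=2, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.state = "Unknown"  # Initial state before checks complete
        self._successes = 0
        self._failures = 0

    def record(self, check_passed):
        """Record one health check result and update the state."""
        if check_passed:
            self._successes += 1
            self._failures = 0
            if self._successes >= self.healthy_threshold:
                self.state = "Healthy"
        else:
            self._failures += 1
            self._successes = 0
            if self._failures >= self.unhealthy_threshold:
                self.state = "Unhealthy"
        return self.state
```

Requiring consecutive successes or failures before a state change prevents a single transient error (or a single lucky response) from flapping a backend in and out of the rotation.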

Backend Configuration Example

{
  "Identifier": "gpu-server-1",
  "Name": "Primary GPU Server",
  "Hostname": "192.168.1.100",
  "Port": 11434,
  "Ssl": false,
  "HealthCheckUrl": "/",
  "HealthCheckMethod": "GET",
  "UnhealthyThreshold": 3,
  "HealthyThreshold": 2,
  "MaxParallelRequests": 8,
  "RateLimitRequestsThreshold": 20,
  "AllowEmbeddings": true,
  "AllowCompletions": true,
  "Labels": [
    "europe",
    "gdpr"
  ],
  "PinnedEmbeddingsProperties": {
    "options": {
      "num_ctx": 512
    }
  },
  "PinnedCompletionsProperties": {
    "options": {
      "num_ctx": 4096,
      "temperature": 0.8
    }
  }
}

Models

OllamaFlow provides intelligent model management across your backend fleet.

Model Discovery

  • Automatic Detection: OllamaFlow periodically discovers available models on each backend
  • Real-time Updates: Model availability is continuously tracked
  • Cross-Backend Visibility: View which models are available on which backends

Model Synchronization

When a frontend specifies RequiredModels, OllamaFlow automatically:

  1. Checks Availability: Verifies that the required models exist on the associated backends
  2. Downloads Missing Models: Pulls models to backends that don't have them
  3. Parallel Operations: Downloads models concurrently for faster provisioning
  4. Status Tracking: Monitors sync progress and completion
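
The availability check in step 1 amounts to comparing each backend's discovered model inventory against the frontend's RequiredModels list. A minimal sketch (the inventory mapping is a hypothetical input, standing in for OllamaFlow's model discovery results):

```python
def plan_model_sync(required_models, backend_inventory):
    """For each backend, list the required models it is missing so they
    can be pulled (concurrently) during synchronization.

    backend_inventory maps backend identifier -> set of discovered models.
    """
    return {
        backend: [m for m in required_models if m not in models]
        for backend, models in backend_inventory.items()
    }

plan = plan_model_sync(
    ["llama3:8b", "mistral:7b"],
    {"gpu-1": {"llama3:8b"}, "gpu-2": {"llama3:8b", "mistral:7b"}},
)
# -> {"gpu-1": ["mistral:7b"], "gpu-2": []}
```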

Model Management Flow

graph TD
    A[Frontend Configured] --> B[Check Required Models]
    B --> C{Models Available?}
    C -->|Yes| D[Route Requests]
    C -->|No| E[Start Model Sync]
    E --> F[Pull Missing Models]
    F --> G[Update Model Inventory]
    G --> D[Route Requests]

Model Requirements Example

{
  "RequiredModels": [
    "llama3:8b",
    "mistral:7b",
    "codellama:13b",
    "nomic-embed-text"
  ]
}

Request Flow

Understanding how requests flow through OllamaFlow:

  1. Client Request: Client sends request to OllamaFlow frontend
  2. Frontend Matching: OllamaFlow matches request hostname to frontend
  3. Backend Selection: Load balancing algorithm selects healthy backend
  4. Model Verification: Ensures required model is available on selected backend
  5. Request Proxy: Request is forwarded to selected backend
  6. Response Streaming: Response is streamed back to client
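
Step 2, frontend matching, pairs the request's Host header with a frontend's Hostname pattern, falling back to a `*` catch-all frontend if no specific pattern matches. A sketch under that assumption (glob matching via `fnmatch` is an illustrative choice; the actual matching rules may differ):

```python
import fnmatch

def match_frontend(host, frontends):
    """Match a request's Host header to a configured frontend.
    A frontend with Hostname "*" acts as the catch-all fallback."""
    catch_all = None
    for fe in frontends:
        pattern = fe.get("Hostname", "*")
        if pattern == "*":
            catch_all = fe
        elif fnmatch.fnmatch(host, pattern):
            return fe
    return catch_all

frontends = [
    {"Identifier": "default", "Hostname": "*"},
    {"Identifier": "production-frontend", "Hostname": "ai.company.com"},
]
match_frontend("ai.company.com", frontends)   # -> production-frontend
match_frontend("dev.company.com", frontends)  # -> default (catch-all)
```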

Configuration Persistence

All frontend and backend configurations are stored in a SQLite database (ollamaflow.db), ensuring:

  • Persistence: Configurations survive restarts
  • Atomic Updates: Configuration changes are transactional
  • Historical Tracking: Creation and update timestamps are maintained
  • Backup-Friendly: Single file database for easy backup/restore

Next Steps