# Core Concepts

Understanding OllamaFlow's core concepts is essential for effective deployment and management. This guide covers the three fundamental components: Frontends, Backends, and Models.

## Frontends

A **Frontend** is a virtual Ollama endpoint that clients connect to. Frontends define how requests are routed and which backends serve those requests.

### Frontend Properties

| Property | Description | Default |
| ----------------------------- | -------------------------------------- | ------------------- |
| `Identifier` | Unique identifier for the frontend | Required |
| `Name` | Human-readable name | Required |
| `Hostname` | Hostname pattern (`*` for catch-all) | `*` |
| `TimeoutMs` | Request timeout in milliseconds | `60000` |
| `LoadBalancing` | Load balancing algorithm | `RoundRobin` |
| `Backends` | List of backend identifiers to use | `[]` |
| `RequiredModels` | Models that must be available | `[]` |
| `AllowEmbeddings` | Allow embeddings API requests | `true` |
| `AllowCompletions` | Allow completions API requests | `true` |
| `PinnedEmbeddingsProperties` | Enforce specific embeddings parameters | `{}` |
| `PinnedCompletionsProperties` | Enforce specific completion parameters | `{}` |
| `MaxRequestBodySize` | Maximum request size in bytes | `536870912` (512MB) |
| `UseStickySessions` | Enable session stickiness | `false` |
| `StickySessionExpirationMs` | Session timeout in milliseconds | `1800000` (30 min) |

### Load Balancing Algorithms

**Round Robin** (`RoundRobin`)

* Cycles through backends sequentially
* Ensures even distribution of requests
* Best for uniform backend capacity

**Random** (`Random`)

* Randomly selects from healthy backends
* Good for stateless workloads
* Provides natural load distribution

### Session Stickiness

**Session Stickiness** ensures that clients are consistently routed to the same backend for subsequent requests, which is useful for:

* **Stateful Applications**: When backends maintain client-specific state
* **Model Warm-up**: Keeping frequently accessed models loaded on specific backends
* **Performance Optimization**: Reducing model switching overhead

**How It Works:**

1. **Client Identification**: Uses the client IP address as the identifier
2. **Backend Binding**: The first request creates a session binding the client to a specific backend
3. **Session Persistence**: Subsequent requests from the same client route to the bound backend
4. **Automatic Expiration**: Sessions expire after the configured timeout period
5. **Health Awareness**: Sessions are invalidated if the bound backend becomes unhealthy

**Configuration:**

* `UseStickySessions`: Enable/disable session stickiness (default: `false`)
* `StickySessionExpirationMs`: Session timeout in milliseconds (default: 30 minutes)
  * **Minimum**: 10,000ms (10 seconds)
  * **Maximum**: 86,400,000ms (24 hours)

**Session Management:**

* Sessions are automatically cleaned up every 5 minutes
* Expired sessions are removed from memory
* Backend failures invalidate all associated sessions
* Sessions are not persisted across OllamaFlow restarts

### Security Controls

Frontend security controls enable fine-grained access control and request parameter enforcement:

#### Request Type Controls

* **`AllowEmbeddings`**: Controls whether embeddings API endpoints are accessible through this frontend
  * Ollama API: `/api/embed`
  * OpenAI API: `/v1/embeddings`
* **`AllowCompletions`**: Controls whether completion API endpoints are accessible through this frontend
  * Ollama API: `/api/generate`, `/api/chat`
  * OpenAI API: `/v1/completions`, `/v1/chat/completions`

For a request to succeed, both the frontend and at least one assigned backend must allow the request type.
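A minimal sketch of this gating rule (illustrative only; the `Endpoint` class and `request_allowed` helper are invented for the example and are not OllamaFlow's actual types):

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    # Mirrors the AllowEmbeddings / AllowCompletions flags that both
    # frontends and backends expose.
    allow_embeddings: bool = True
    allow_completions: bool = True

def request_allowed(frontend: Endpoint, backends: list[Endpoint], embeddings: bool) -> bool:
    """True only if the frontend AND at least one assigned backend allow the request type."""
    def allows(e: Endpoint) -> bool:
        return e.allow_embeddings if embeddings else e.allow_completions
    return allows(frontend) and any(allows(b) for b in backends)

# Frontend permits embeddings, but no assigned backend does: rejected.
fe = Endpoint()
print(request_allowed(fe, [Endpoint(allow_embeddings=False)], embeddings=True))   # False
# One completion-capable backend is enough: accepted.
print(request_allowed(fe, [Endpoint(allow_completions=False), Endpoint()], embeddings=False))  # True
```

Note that the check is an AND across layers (frontend, then backends) but an OR within the backend pool: disabling a request type on the frontend blocks it everywhere, while disabling it on one backend only removes that backend as a candidate.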
#### Pinned Properties

Pinned properties allow administrators to enforce specific parameters in requests:

* **`PinnedEmbeddingsProperties`**: Key-value pairs automatically merged into all embeddings requests
* **`PinnedCompletionsProperties`**: Key-value pairs automatically merged into all completion requests

Common use cases:

* Enforce maximum context size: `{"options": {"num_ctx": 2048}}`
* Standardize temperature settings: `{"options": {"temperature": 0.7}}`
* Override model selection: `{"model": "approved-model:latest"}`

Properties are merged with client requests, with pinned properties taking precedence over client-specified values.

### Frontend Configuration Example

```json
{
  "Identifier": "production-frontend",
  "Name": "Production AI Inference",
  "Hostname": "ai.company.com",
  "LoadBalancing": "RoundRobin",
  "TimeoutMs": 90000,
  "Backends": ["gpu-1", "gpu-2", "gpu-3"],
  "RequiredModels": ["llama3:8b", "mistral:7b", "codellama"],
  "AllowEmbeddings": true,
  "AllowCompletions": true,
  "PinnedEmbeddingsProperties": {
    "model": "nomic-embed-text",
    "options": {
      "temperature": 0.1
    }
  },
  "PinnedCompletionsProperties": {
    "options": {
      "temperature": 0.7,
      "num_ctx": 2048
    }
  },
  "MaxRequestBodySize": 1073741824,
  "UseStickySessions": true,
  "StickySessionExpirationMs": 3600000
}
```

## Backends

A **Backend** represents a physical Ollama instance in your infrastructure. Backends handle the actual AI inference requests.
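Backend health is tracked with the threshold counters described in the properties and Health Monitoring sections below. A minimal sketch of that threshold logic (illustrative only; `HealthTracker` is an invented name, not an OllamaFlow class, and the real implementation may differ):

```python
class HealthTracker:
    """Consecutive-check counters: a backend flips state only after
    UnhealthyThreshold failures or HealthyThreshold successes in a row."""

    def __init__(self, unhealthy_threshold: int = 2, healthy_threshold: int = 2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.state = "Unknown"   # initial state before the first check completes
        self._successes = 0
        self._failures = 0

    def record(self, check_passed: bool) -> str:
        if check_passed:
            self._successes += 1
            self._failures = 0   # a success resets the failure streak
            if self._successes >= self.healthy_threshold:
                self.state = "Healthy"
        else:
            self._failures += 1
            self._successes = 0  # a failure resets the success streak
            if self._failures >= self.unhealthy_threshold:
                self.state = "Unhealthy"
        return self.state

t = HealthTracker()
t.record(True)           # one success: still "Unknown"
print(t.record(True))    # second consecutive success -> "Healthy"
t.record(False)          # one failure: still "Healthy"
print(t.record(False))   # second consecutive failure -> "Unhealthy"
```

Requiring consecutive results in both directions keeps a single dropped health check from bouncing a backend out of rotation, and a single lucky response from restoring a backend that is still flapping.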
### Backend Properties

| Property | Description | Default |
| ----------------------------- | --------------------------------------------------------- | -------- |
| `Identifier` | Unique identifier for the backend | Required |
| `Name` | Human-readable name | Required |
| `Hostname` | Ollama server hostname/IP | Required |
| `Port` | Ollama server port | `11434` |
| `Ssl` | Enable HTTPS for backend communication | `false` |
| `HealthCheckUrl` | URL path for health checks | `/` |
| `HealthCheckMethod` | HTTP method for health checks, either `GET` or `HEAD` | `GET` |
| `UnhealthyThreshold` | Failed checks before marking unhealthy | `2` |
| `HealthyThreshold` | Successful checks before marking healthy | `2` |
| `MaxParallelRequests` | Maximum concurrent requests | `4` |
| `RateLimitRequestsThreshold` | Rate limiting threshold | `10` |
| `AllowEmbeddings` | Allow embeddings API requests | `true` |
| `AllowCompletions` | Allow completions API requests | `true` |
| `Labels` | Labels that influence backend selection and routing | `[]` |
| `PinnedEmbeddingsProperties` | Enforce specific embeddings parameters | `{}` |
| `PinnedCompletionsProperties` | Enforce specific completion parameters | `{}` |

### Health Monitoring

OllamaFlow continuously monitors backend health:

* **Health Checks**: Periodic HTTP requests to validate backend availability
* **Automatic Failover**: Unhealthy backends are removed from load balancing rotation
* **Recovery Detection**: Backends are automatically restored when they become healthy

### Backend States

* **Healthy**: Backend is responding to health checks and available for requests
* **Unhealthy**: Backend has failed health checks and is excluded from rotation
* **Unknown**: Initial state before first health check completion

### Backend Configuration Example

```json
{
  "Identifier": "gpu-server-1",
  "Name": "Primary GPU Server",
  "Hostname": "192.168.1.100",
  "Port": 11434,
  "Ssl": false,
  "HealthCheckUrl": "/",
  "HealthCheckMethod": "GET",
  "UnhealthyThreshold": 3,
  "HealthyThreshold": 2,
  "MaxParallelRequests": 8,
  "RateLimitRequestsThreshold": 20,
  "AllowEmbeddings": true,
  "AllowCompletions": true,
  "Labels": [
    "europe",
    "gdpr"
  ],
  "PinnedEmbeddingsProperties": {
    "options": {
      "num_ctx": 512
    }
  },
  "PinnedCompletionsProperties": {
    "options": {
      "num_ctx": 4096,
      "temperature": 0.8
    }
  }
}
```

## Models

OllamaFlow provides intelligent model management across your backend fleet.

### Model Discovery

* **Automatic Detection**: OllamaFlow periodically discovers available models on each backend
* **Real-time Updates**: Model availability is continuously tracked
* **Cross-Backend Visibility**: View which models are available on which backends

### Model Synchronization

When a frontend specifies `RequiredModels`, OllamaFlow automatically:

1. **Checks Availability**: Verifies if required models exist on associated backends
2. **Downloads Missing Models**: Pulls models to backends that don't have them
3. **Parallel Operations**: Downloads models concurrently for faster provisioning
4. **Status Tracking**: Monitors sync progress and completion

### Model Management Flow

```mermaid
graph TD
    A[Frontend Configured] --> B[Check Required Models]
    B --> C{Models Available?}
    C -->|Yes| D[Route Requests]
    C -->|No| E[Start Model Sync]
    E --> F[Pull Missing Models]
    F --> G[Update Model Inventory]
    G --> D[Route Requests]
```

### Model Requirements Example

```json
{
  "RequiredModels": [
    "llama3:8b",
    "mistral:7b",
    "codellama:13b",
    "nomic-embed-text"
  ]
}
```

## Request Flow

Understanding how requests flow through OllamaFlow:

1. **Client Request**: Client sends request to OllamaFlow frontend
2. **Frontend Matching**: OllamaFlow matches request hostname to frontend
3. **Backend Selection**: Load balancing algorithm selects healthy backend
4. **Model Verification**: Ensures required model is available on selected backend
5. **Request Proxy**: Request is forwarded to selected backend
6. **Response Streaming**: Response is streamed back to client

## Configuration Persistence

All frontend and backend configurations are stored in a SQLite database (`ollamaflow.db`), ensuring:

* **Persistence**: Configurations survive restarts
* **Atomic Updates**: Configuration changes are transactional
* **Historical Tracking**: Creation and update timestamps are maintained
* **Backup-Friendly**: Single file database for easy backup/restore

## Next Steps

* Learn about [Deployment Options](deployment-options.md) for your environment
* Review [Configuration Examples](configuration-examples.md) for common scenarios
* Explore the [API Reference](api-reference.md) for programmatic management