Understanding OllamaFlow's core concepts is essential for effective deployment and management. This guide covers the three fundamental components: Frontends, Backends, and Models.
## Frontends
A Frontend is a virtual Ollama endpoint that clients connect to. Frontends define how requests are routed and which backends serve those requests.
### Frontend Properties

| Property | Description | Default |
|---|---|---|
| `Identifier` | Unique identifier for the frontend | Required |
| `Name` | Human-readable name | Required |
| `Hostname` | Hostname pattern (`*` for catch-all) | `*` |
| `TimeoutMs` | Request timeout in milliseconds | `60000` |
| `LoadBalancing` | Load balancing algorithm | `RoundRobin` |
| `Backends` | List of backend identifiers to use | `[]` |
| `RequiredModels` | Models that must be available | `[]` |
| `AllowEmbeddings` | Allow embeddings API requests | `true` |
| `AllowCompletions` | Allow completions API requests | `true` |
| `PinnedEmbeddingsProperties` | Enforce specific embeddings parameters | `{}` |
| `PinnedCompletionsProperties` | Enforce specific completion parameters | `{}` |
| `MaxRequestBodySize` | Maximum request size in bytes | `536870912` (512 MB) |
| `UseStickySessions` | Enable session stickiness | `false` |
| `StickySessionExpirationMs` | Session timeout in milliseconds | `1800000` (30 min) |
### Load Balancing Algorithms

#### Round Robin (`RoundRobin`)

- Cycles through backends sequentially
- Ensures even distribution of requests
- Best for uniform backend capacity

#### Random (`Random`)

- Randomly selects from healthy backends
- Good for stateless workloads
- Provides natural load distribution
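Both algorithms reduce to a selection function over the current list of healthy backends. A minimal sketch (backend identifiers are illustrative, not OllamaFlow's internal implementation):

```python
import itertools
import random

def round_robin(backends):
    """Return a selector that cycles through backends sequentially."""
    pool = itertools.cycle(backends)
    return lambda: next(pool)

def pick_random(backends):
    """Randomly select one backend from the healthy list."""
    return random.choice(backends)

healthy = ["gpu-1", "gpu-2", "gpu-3"]
next_backend = round_robin(healthy)
picks = [next_backend() for _ in range(6)]
# picks cycles evenly: gpu-1, gpu-2, gpu-3, gpu-1, gpu-2, gpu-3
```

In practice the selector would be re-evaluated against the healthy set on each request, so backends that fail health checks drop out of rotation automatically.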
### Session Stickiness
Session Stickiness ensures that clients are consistently routed to the same backend for subsequent requests, which is useful for:
- Stateful Applications: When backends maintain client-specific state
- Model Warm-up: Keeping frequently accessed models loaded on specific backends
- Performance Optimization: Reducing model switching overhead
How It Works:
1. Client Identification: Uses client IP address as identifier
2. Backend Binding: First request creates a session binding the client to a specific backend
3. Session Persistence: Subsequent requests from the same client route to the bound backend
4. Automatic Expiration: Sessions expire after the configured timeout period
5. Health Awareness: Sessions are invalidated if the bound backend becomes unhealthy
Configuration:
- `UseStickySessions`: Enable/disable session stickiness (default: `false`)
- `StickySessionExpirationMs`: Session timeout in milliseconds (default: 30 minutes)
  - Minimum: 10,000 ms (10 seconds)
  - Maximum: 86,400,000 ms (24 hours)
Session Management:
- Sessions are automatically cleaned up every 5 minutes
- Expired sessions are removed from memory
- Backend failures invalidate all associated sessions
- Sessions are not persisted across OllamaFlow restarts
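The binding, expiration, and health-invalidation behavior described above can be sketched as an in-memory map from client IP to backend. This is an illustrative model, not OllamaFlow's actual code:

```python
import time

class StickySessionStore:
    """Sketch of IP-based session stickiness with TTL expiration."""

    def __init__(self, expiration_ms=1_800_000):
        self.expiration_s = expiration_ms / 1000
        self.sessions = {}  # client_ip -> (backend_id, last_seen)

    def route(self, client_ip, pick_backend, now=None):
        now = time.monotonic() if now is None else now
        entry = self.sessions.get(client_ip)
        if entry and now - entry[1] < self.expiration_s:
            backend = entry[0]        # reuse the bound backend
        else:
            backend = pick_backend()  # new or expired: bind a fresh backend
        self.sessions[client_ip] = (backend, now)
        return backend

    def invalidate_backend(self, backend_id):
        """Drop all sessions bound to a backend that went unhealthy."""
        self.sessions = {ip: e for ip, e in self.sessions.items()
                         if e[0] != backend_id}
```

Note the store lives in memory only, which matches the behavior above: sessions do not survive a restart.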
### Security Controls
Frontend security controls enable fine-grained access control and request parameter enforcement:
#### Request Type Controls

- `AllowEmbeddings`: Controls whether embeddings API endpoints are accessible through this frontend
  - Ollama API: `/api/embed`
  - OpenAI API: `/v1/embeddings`
- `AllowCompletions`: Controls whether completion API endpoints are accessible through this frontend
  - Ollama API: `/api/generate`, `/api/chat`
  - OpenAI API: `/v1/completions`, `/v1/chat/completions`
For a request to succeed, both the frontend and at least one assigned backend must allow the request type.
#### Pinned Properties

Pinned properties allow administrators to enforce specific parameters in requests:

- `PinnedEmbeddingsProperties`: Key-value pairs automatically merged into all embeddings requests
- `PinnedCompletionsProperties`: Key-value pairs automatically merged into all completion requests
Common use cases:
- Enforce maximum context size: `{"options": {"num_ctx": 2048}}`
- Standardize temperature settings: `{"options": {"temperature": 0.7}}`
- Override model selection: `{"model": "approved-model:latest"}`
Properties are merged with client requests, with pinned properties taking precedence over client-specified values.
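The merge semantics can be illustrated with a recursive dictionary merge where pinned values win on conflict. This is a sketch of the described behavior; the exact merge rules (e.g. for nested objects) are assumptions:

```python
def merge_pinned(client_request, pinned):
    """Merge pinned properties over a client request.

    Pinned values take precedence on conflicts; nested dicts
    are merged key by key so unrelated client options survive.
    """
    merged = dict(client_request)
    for key, value in pinned.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_pinned(merged[key], value)
        else:
            merged[key] = value
    return merged

request = {"model": "llama3:8b", "options": {"temperature": 1.2, "top_p": 0.9}}
pinned = {"options": {"temperature": 0.7, "num_ctx": 2048}}
merged = merge_pinned(request, pinned)
# pinned temperature and num_ctx win; the client's model and top_p survive
```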
### Frontend Configuration Example
```json
{
  "Identifier": "production-frontend",
  "Name": "Production AI Inference",
  "Hostname": "ai.company.com",
  "LoadBalancing": "RoundRobin",
  "TimeoutMs": 90000,
  "Backends": ["gpu-1", "gpu-2", "gpu-3"],
  "RequiredModels": ["llama3:8b", "mistral:7b", "codellama"],
  "AllowEmbeddings": true,
  "AllowCompletions": true,
  "PinnedEmbeddingsProperties": {
    "model": "nomic-embed-text",
    "options": {
      "temperature": 0.1
    }
  },
  "PinnedCompletionsProperties": {
    "options": {
      "temperature": 0.7,
      "num_ctx": 2048
    }
  },
  "MaxRequestBodySize": 1073741824,
  "UseStickySessions": true,
  "StickySessionExpirationMs": 3600000
}
```
## Backends
A Backend represents a physical Ollama instance in your infrastructure. Backends handle the actual AI inference requests.
### Backend Properties

| Property | Description | Default |
|---|---|---|
| `Identifier` | Unique identifier for the backend | Required |
| `Name` | Human-readable name | Required |
| `Hostname` | Ollama server hostname/IP | Required |
| `Port` | Ollama server port | `11434` |
| `Ssl` | Enable HTTPS for backend communication | `false` |
| `HealthCheckUrl` | URL path for health checks | `/` |
| `HealthCheckMethod` | HTTP method for health checks, either `GET` or `HEAD` | `GET` |
| `UnhealthyThreshold` | Failed checks before marking unhealthy | `2` |
| `HealthyThreshold` | Successful checks before marking healthy | `2` |
| `MaxParallelRequests` | Maximum concurrent requests | `4` |
| `RateLimitRequestsThreshold` | Rate limiting threshold | `10` |
| `AllowEmbeddings` | Allow embeddings API requests | `true` |
| `AllowCompletions` | Allow completions API requests | `true` |
| `Labels` | Labels that influence backend selection and routing | `[]` |
| `PinnedEmbeddingsProperties` | Enforce specific embeddings parameters | `{}` |
| `PinnedCompletionsProperties` | Enforce specific completion parameters | `{}` |
### Health Monitoring
OllamaFlow continuously monitors backend health:
- Health Checks: Periodic HTTP requests to validate backend availability
- Automatic Failover: Unhealthy backends are removed from load balancing rotation
- Recovery Detection: Backends are automatically restored when they become healthy
### Backend States
- Healthy: Backend is responding to health checks and available for requests
- Unhealthy: Backend has failed health checks and is excluded from rotation
- Unknown: Initial state before first health check completion
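The three states and the two threshold settings combine into a simple state machine: consecutive successes promote a backend to Healthy, consecutive failures demote it to Unhealthy. A sketch of that logic (illustrative, not OllamaFlow's source):

```python
class BackendHealth:
    """Threshold-based health tracking for one backend."""

    def __init__(self, unhealthy_threshold=2, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.state = "Unknown"  # initial state before any check completes
        self.failures = 0
        self.successes = 0

    def record(self, check_ok):
        """Record one health-check result and update the state."""
        if check_ok:
            self.successes += 1
            self.failures = 0
            if self.successes >= self.healthy_threshold:
                self.state = "Healthy"
        else:
            self.failures += 1
            self.successes = 0
            if self.failures >= self.unhealthy_threshold:
                self.state = "Unhealthy"
        return self.state
```

Requiring consecutive results in both directions prevents a single flaky check from flapping a backend in and out of rotation.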
### Backend Configuration Example
```json
{
  "Identifier": "gpu-server-1",
  "Name": "Primary GPU Server",
  "Hostname": "192.168.1.100",
  "Port": 11434,
  "Ssl": false,
  "HealthCheckUrl": "/",
  "HealthCheckMethod": "GET",
  "UnhealthyThreshold": 3,
  "HealthyThreshold": 2,
  "MaxParallelRequests": 8,
  "RateLimitRequestsThreshold": 20,
  "AllowEmbeddings": true,
  "AllowCompletions": true,
  "Labels": [
    "europe",
    "gdpr"
  ],
  "PinnedEmbeddingsProperties": {
    "options": {
      "num_ctx": 512
    }
  },
  "PinnedCompletionsProperties": {
    "options": {
      "num_ctx": 4096,
      "temperature": 0.8
    }
  }
}
```
## Models
OllamaFlow provides intelligent model management across your backend fleet.
### Model Discovery
- Automatic Detection: OllamaFlow periodically discovers available models on each backend
- Real-time Updates: Model availability is continuously tracked
- Cross-Backend Visibility: View which models are available on which backends
### Model Synchronization

When a frontend specifies `RequiredModels`, OllamaFlow automatically:
- Checks Availability: Verifies if required models exist on associated backends
- Downloads Missing Models: Pulls models to backends that don't have them
- Parallel Operations: Downloads models concurrently for faster provisioning
- Status Tracking: Monitors sync progress and completion
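The first two steps amount to diffing each backend's discovered inventory against the required list. A sketch of that planning step (backend names and inventories are illustrative):

```python
def missing_models(required, inventory):
    """Return, per backend, the required models it still needs to pull.

    required:  list of model names the frontend demands
    inventory: mapping of backend id -> models discovered on it
    """
    return {
        backend: [m for m in required if m not in models]
        for backend, models in inventory.items()
    }

required = ["llama3:8b", "nomic-embed-text"]
inventory = {
    "gpu-1": ["llama3:8b"],
    "gpu-2": ["llama3:8b", "nomic-embed-text"],
}
plan = missing_models(required, inventory)
# gpu-1 must pull nomic-embed-text; gpu-2 is already complete
```

Each non-empty entry in the plan would then trigger a pull on that backend, and pulls for different backends can run concurrently.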
### Model Management Flow

```mermaid
graph TD
    A[Frontend Configured] --> B[Check Required Models]
    B --> C{Models Available?}
    C -->|Yes| D[Route Requests]
    C -->|No| E[Start Model Sync]
    E --> F[Pull Missing Models]
    F --> G[Update Model Inventory]
    G --> D
```
### Model Requirements Example
```json
{
  "RequiredModels": [
    "llama3:8b",
    "mistral:7b",
    "codellama:13b",
    "nomic-embed-text"
  ]
}
```
## Request Flow
Understanding how requests flow through OllamaFlow:
1. Client Request: Client sends a request to the OllamaFlow frontend
2. Frontend Matching: OllamaFlow matches the request hostname to a frontend
3. Backend Selection: The load balancing algorithm selects a healthy backend
4. Model Verification: Ensures the required model is available on the selected backend
5. Request Proxy: The request is forwarded to the selected backend
6. Response Streaming: The response is streamed back to the client
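These steps can be traced as a small decision pipeline. The sketch below models steps 1 through 4 with plain data structures (field names mirror the frontend properties above; the proxying and streaming steps are elided):

```python
def handle_request(host, frontends, healthy, models_on, pick):
    """Resolve one request to a backend, or report why routing failed."""
    # Steps 1-2: match the request hostname; '*' acts as catch-all.
    frontend = next(
        (f for f in frontends if f["Hostname"] in (host, "*")), None)
    if frontend is None:
        return None, "no matching frontend"
    # Step 3: restrict to healthy backends and apply the selection algorithm.
    candidates = [b for b in frontend["Backends"] if b in healthy]
    if not candidates:
        return None, "no healthy backend"
    backend = pick(candidates)
    # Step 4: verify the required models exist on the selection.
    required = frontend["RequiredModels"]
    if not all(m in models_on.get(backend, []) for m in required):
        return None, "model missing on backend"
    # Steps 5-6 (proxy and stream) would happen here.
    return backend, "ok"
```

Usage: with a frontend for `ai.company.com` whose only healthy backend holds `llama3:8b`, a matching request resolves to that backend, while an unknown hostname fails fast at step 2.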
## Configuration Persistence

All frontend and backend configurations are stored in a SQLite database (`ollamaflow.db`), ensuring:
- Persistence: Configurations survive restarts
- Atomic Updates: Configuration changes are transactional
- Historical Tracking: Creation and update timestamps are maintained
- Backup-Friendly: Single file database for easy backup/restore
## Next Steps
- Learn about Deployment Options for your environment
- Review Configuration Examples for common scenarios
- Explore the API Reference for programmatic management