OllamaFlow provides three sets of APIs: Ollama-compatible APIs and OpenAI-compatible APIs for AI inference, plus Administrative APIs for cluster management. All APIs support JSON request/response format and maintain full compatibility with existing Ollama and OpenAI clients.
Base URL and Authentication
- Base URL: http://your-ollamaflow-host:43411
- Admin Authentication: Bearer token required for administrative endpoints
- Ollama APIs: No authentication required (proxied to backends)
- OpenAI APIs: No authentication required (proxied to backends)
Authentication Header
# For administrative APIs
curl -H "Authorization: Bearer your-admin-token" \
http://localhost:43411/v1.0/backends
API Compatibility
OllamaFlow supports both Ollama and OpenAI-compatible API formats, allowing clients to use either API style without modification.
Ollama-Compatible APIs
These endpoints maintain full compatibility with the Ollama API, allowing existing clients to work without modification.
Generate Completion
Generate text completions using a specified model.
POST /api/generate
Request Body
{
"model": "llama3:8b",
"prompt": "Why is the sky blue?",
"stream": true,
"options": {
"temperature": 0.8,
"num_predict": 100,
"top_k": 20,
"top_p": 0.9
}
}
cURL Example
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"model": "llama3:8b",
"prompt": "Explain quantum computing in simple terms",
"stream": false,
"options": {
"temperature": 0.7,
"num_predict": 200
}
}' \
http://localhost:43411/api/generate
Response
{
"model": "llama3:8b",
"created_at": "2024-01-15T10:30:00.123456Z",
"response": "Quantum computing is a revolutionary technology...",
"done": true,
"context": [1, 2, 3, 4, 5],
"total_duration": 1234567890,
"load_duration": 123456789,
"prompt_eval_count": 10,
"prompt_eval_duration": 234567890,
"eval_count": 25,
"eval_duration": 876543210
}
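The duration fields follow Ollama's convention of reporting times in nanoseconds. A minimal Python sketch, assuming a non-streaming request against the localhost endpoint shown above, derives generation throughput from those fields:

```python
# A minimal sketch: compute generation throughput from a non-streaming
# /api/generate response. Assumes Ollama's nanosecond duration convention.
import requests

resp = requests.post(
    "http://localhost:43411/api/generate",
    json={"model": "llama3:8b", "prompt": "Why is the sky blue?", "stream": False},
)
resp.raise_for_status()
body = resp.json()

# eval_count tokens were produced over eval_duration nanoseconds
tokens_per_second = body["eval_count"] / (body["eval_duration"] / 1e9)
print(f"{tokens_per_second:.1f} tokens/s")
```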
Chat Completion
Generate chat-style completions with conversation context.
POST /api/chat
Request Body
{
"model": "llama3:8b",
"messages": [
{
"role": "system",
"content": "You are a helpful AI assistant."
},
{
"role": "user",
"content": "What is machine learning?"
}
],
"stream": true,
"options": {
"temperature": 0.8,
"num_ctx": 2048
}
}
cURL Example
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"model": "llama3:8b",
"stream": false,
"messages": [
{
"role": "system",
"content": "You are a helpful assistant specializing in technology."
},
{
"role": "user",
"content": "Explain the difference between AI and ML"
}
]
}' \
http://localhost:43411/api/chat
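The chat endpoint is stateless: the proxy does not store conversation history, so the client resends the full messages array on every turn. A minimal multi-turn sketch, assuming the requests package and the localhost endpoint above:

```python
# A minimal sketch: multi-turn chat. The proxy stores no conversation state,
# so the client resends the full history with every request.
import requests

BASE = "http://localhost:43411"  # your OllamaFlow host
messages = [{"role": "system", "content": "You are a helpful AI assistant."}]

for question in ["What is machine learning?", "Give a one-sentence example."]:
    messages.append({"role": "user", "content": question})
    resp = requests.post(
        f"{BASE}/api/chat",
        json={"model": "llama3:8b", "messages": messages, "stream": False},
    )
    resp.raise_for_status()
    reply = resp.json()["message"]  # assistant turn, Ollama format
    messages.append(reply)          # keep the context for the next turn
    print(reply["content"])
```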
Pull Model
Download a model to the backend instances.
POST /api/pull
Request Body
{
"model": "llama3:8b",
"insecure": false,
"stream": true
}
cURL Example
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"model": "mistral:7b"
}' \
http://localhost:43411/api/pull
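With "stream": true (as in the request body above), the endpoint reports progress as newline-delimited JSON, following Ollama's format of a "status" field plus "total"/"completed" byte counts during downloads. A minimal sketch under those assumptions:

```python
# A minimal sketch: follow pull progress. Each streamed line is a JSON object
# with a "status" field and, while downloading, "total"/"completed" bytes.
import json
import requests

resp = requests.post(
    "http://localhost:43411/api/pull",
    json={"model": "mistral:7b", "stream": True},
    stream=True,
)
resp.raise_for_status()
for line in resp.iter_lines():
    if not line:
        continue
    update = json.loads(line)
    if "total" in update and "completed" in update:
        pct = 100 * update["completed"] / update["total"]
        print(f'{update["status"]}: {pct:.0f}%')
    else:
        print(update.get("status", ""))
```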
Show Model Information
Get detailed information about a specific model.
POST /api/show
Request Body
{
"name": "llama3:8b",
"verbose": true
}
cURL Example
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"name": "llama3:8b"
}' \
http://localhost:43411/api/show
List Models
Get a list of available models across all backends.
GET /api/tags
cURL Example
curl http://localhost:43411/api/tags
Response
{
"models": [
{
"name": "llama3:8b",
"model": "llama3:8b",
"modified_at": "2024-01-15T10:30:00.123456Z",
"size": 4661224576,
"digest": "sha256:8934d96d3f08...",
"details": {
"parent_model": "",
"format": "gguf",
"family": "llama",
"families": ["llama"],
"parameter_size": "8B",
"quantization_level": "Q4_0"
}
}
]
}
List Running Models
Get information about currently running models.
GET /api/ps
cURL Example
curl http://localhost:43411/api/ps
Generate Embeddings
Generate embeddings for text input.
POST /api/embed
Request Body
{
"model": "nomic-embed-text",
"input": "The quick brown fox jumps over the lazy dog"
}
cURL Example
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed-text",
"input": ["Hello world", "How are you?"]
}' \
http://localhost:43411/api/embed
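A common next step is comparing the returned vectors. A minimal sketch, assuming Ollama's /api/embed response shape with the vectors under an "embeddings" array:

```python
# A minimal sketch: cosine similarity between two embedded inputs.
# Assumes the Ollama /api/embed response carries vectors under "embeddings".
import math
import requests

resp = requests.post(
    "http://localhost:43411/api/embed",
    json={"model": "nomic-embed-text", "input": ["Hello world", "How are you?"]},
)
resp.raise_for_status()
a, b = resp.json()["embeddings"]

dot = sum(x * y for x, y in zip(a, b))
norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
print(f"cosine similarity: {dot / norm:.3f}")
```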
Delete Model
Remove a model from backend instances.
DELETE /api/delete
Request Body
{
"name": "llama3:8b"
}
cURL Example
curl -X DELETE \
-H "Content-Type: application/json" \
-d '{
"name": "old-model:7b"
}' \
http://localhost:43411/api/delete
OpenAI-Compatible APIs
OllamaFlow also supports OpenAI-compatible API endpoints, allowing existing OpenAI clients and tools to work seamlessly.
Generate Completion
Generate text completions using OpenAI-compatible format.
POST /v1/completions
Request Body
{
"model": "llama3:8b",
"prompt": "Why is the sky blue?",
"max_tokens": 100,
"temperature": 0.8,
"top_p": 0.9,
"stream": false
}
cURL Example
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"model": "llama3:8b",
"prompt": "Explain quantum computing in simple terms",
"max_tokens": 200,
"temperature": 0.7,
"stream": false
}' \
http://localhost:43411/v1/completions
Chat Completion
Generate chat-style completions using OpenAI-compatible format.
POST /v1/chat/completions
Request Body
{
"model": "llama3:8b",
"messages": [
{
"role": "system",
"content": "You are a helpful AI assistant."
},
{
"role": "user",
"content": "What is machine learning?"
}
],
"max_tokens": 150,
"temperature": 0.8,
"stream": false
}
cURL Example
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"model": "llama3:8b",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant specializing in technology."
},
{
"role": "user",
"content": "Explain the difference between AI and ML"
}
],
"max_tokens": 150,
"temperature": 0.7
}' \
http://localhost:43411/v1/chat/completions
Generate Embeddings
Generate embeddings using OpenAI-compatible format.
POST /v1/embeddings
Request Body
{
"model": "nomic-embed-text",
"input": "The quick brown fox jumps over the lazy dog"
}
cURL Example
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed-text",
"input": ["Hello world", "How are you?"]
}' \
http://localhost:43411/v1/embeddings
Response
{
"object": "list",
"data": [
{
"object": "embedding",
"embedding": [0.1, 0.2, 0.3, ...],
"index": 0
}
],
"model": "nomic-embed-text",
"usage": {
"prompt_tokens": 8,
"total_tokens": 8
}
}
List Models
Get available models using OpenAI-compatible format.
GET /v1/models
cURL Example
curl http://localhost:43411/v1/models
Response
{
"object": "list",
"data": [
{
"id": "llama3:8b",
"object": "model",
"created": 1704067200,
"owned_by": "ollama"
}
]
}
Administrative APIs
These endpoints provide cluster management capabilities and require bearer token authentication.
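Every call below carries the same bearer token, so a preconfigured HTTP session keeps client code short. A minimal Python sketch, with placeholder token and host:

```python
# A minimal sketch: a shared session for the administrative APIs.
# The token and host are placeholders for your deployment.
import requests

BASE = "http://localhost:43411"
admin = requests.Session()
admin.headers.update({
    "Authorization": "Bearer your-admin-token",
    "Content-Type": "application/json",
})

# Every admin call now reuses the same credentials.
print(admin.get(f"{BASE}/v1.0/frontends").json())
print(admin.get(f"{BASE}/v1.0/backends").json())
```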
Frontend Management
List All Frontends
GET /v1.0/frontends
curl -H "Authorization: Bearer your-admin-token" \
http://localhost:43411/v1.0/frontends
Get Frontend
GET /v1.0/frontends/{identifier}
curl -H "Authorization: Bearer your-admin-token" \
http://localhost:43411/v1.0/frontends/frontend1
Create Frontend
PUT /v1.0/frontends
curl -X PUT \
-H "Authorization: Bearer your-admin-token" \
-H "Content-Type: application/json" \
-d '{
"Identifier": "production-frontend",
"Name": "Production AI Inference",
"Hostname": "ai.company.com",
"LoadBalancing": "RoundRobin",
"TimeoutMs": 90000,
"Backends": ["gpu-1", "gpu-2", "gpu-3"],
"RequiredModels": ["llama3:8b", "mistral:7b"],
"MaxRequestBodySize": 1073741824,
"UseStickySessions": true,
"StickySessionExpirationMs": 3600000
}' \
http://localhost:43411/v1.0/frontends
Update Frontend
PUT /v1.0/frontends/{identifier}
curl -X PUT \
-H "Authorization: Bearer your-admin-token" \
-H "Content-Type: application/json" \
-d '{
"Identifier": "production-frontend",
"Name": "Updated Production Frontend",
"Hostname": "*",
"LoadBalancing": "Random",
"Backends": ["gpu-1", "gpu-2", "gpu-3", "gpu-4"],
"RequiredModels": ["llama3:8b", "mistral:7b", "codellama:13b"]
}' \
http://localhost:43411/v1.0/frontends/production-frontend
Delete Frontend
DELETE /v1.0/frontends/{identifier}
curl -X DELETE \
-H "Authorization: Bearer your-admin-token" \
http://localhost:43411/v1.0/frontends/old-frontend
Backend Management
List All Backends
GET /v1.0/backends
curl -H "Authorization: Bearer your-admin-token" \
http://localhost:43411/v1.0/backends
Get Backend
GET /v1.0/backends/{identifier}
curl -H "Authorization: Bearer your-admin-token" \
http://localhost:43411/v1.0/backends/gpu-1
Create Backend
PUT /v1.0/backends
curl -X PUT \
-H "Authorization: Bearer your-admin-token" \
-H "Content-Type: application/json" \
-d '{
"Identifier": "gpu-server-4",
"Name": "GPU Server 4",
"Hostname": "192.168.1.104",
"Port": 11434,
"Ssl": false,
"HealthCheckUrl": "/api/version",
"HealthCheckMethod": "GET",
"UnhealthyThreshold": 3,
"HealthyThreshold": 2,
"MaxParallelRequests": 8,
"RateLimitRequestsThreshold": 20,
"LogRequestBody": false,
"LogResponseBody": false
}' \
http://localhost:43411/v1.0/backends
Update Backend
PUT /v1.0/backends/{identifier}
curl -X PUT \
-H "Authorization: Bearer your-admin-token" \
-H "Content-Type: application/json" \
-d '{
"Identifier": "gpu-server-1",
"Name": "Updated GPU Server 1",
"Hostname": "192.168.1.101",
"Port": 11434,
"MaxParallelRequests": 12,
"UnhealthyThreshold": 2
}' \
http://localhost:43411/v1.0/backends/gpu-server-1
Delete Backend
DELETE /v1.0/backends/{identifier}
curl -X DELETE \
-H "Authorization: Bearer your-admin-token" \
http://localhost:43411/v1.0/backends/old-backend
Health Monitoring
Get All Backend Health
GET /v1.0/backends/health
curl -H "Authorization: Bearer your-admin-token" \
http://localhost:43411/v1.0/backends/health
Response
[
{
"Identifier": "backend1",
"Name": "My localhost Ollama instance",
"Hostname": "localhost",
"Port": 11434,
"Ssl": false,
"UnhealthyThreshold": 2,
"HealthyThreshold": 2,
"HealthCheckMethod": {
"Method": "GET"
},
"HealthCheckUrl": "/",
"MaxParallelRequests": 4,
"RateLimitRequestsThreshold": 10,
"LogRequestFull": false,
"LogRequestBody": false,
"LogResponseBody": false,
"ApiFormat": "Ollama",
"PinnedEmbeddingsProperties": {},
"PinnedCompletionsProperties": {
"model": "qwen2.5:3b",
"options": {
"temperature": 0.1,
"howdy": "doody"
}
},
"AllowEmbeddings": true,
"AllowCompletions": true,
"Active": true,
"CreatedUtc": "2025-09-29T23:15:45.659639Z",
"LastUpdateUtc": "2025-09-29T23:19:07.346900Z",
"HealthySinceUtc": "2025-09-30T01:53:21.026058Z",
"Uptime": "00:25:52.4859452",
"ActiveRequests": 0,
"IsSticky": false
}
]
Get Single Backend Health
GET /v1.0/backends/{identifier}/health
curl -H "Authorization: Bearer your-admin-token" \
http://localhost:43411/v1.0/backends/backend1/health
Response
{
"Identifier": "backend1",
"Name": "My localhost Ollama instance",
"Hostname": "localhost",
"Port": 11434,
"Ssl": false,
"UnhealthyThreshold": 2,
"HealthyThreshold": 2,
"HealthCheckMethod": {
"Method": "GET"
},
"HealthCheckUrl": "/",
"MaxParallelRequests": 4,
"RateLimitRequestsThreshold": 10,
"LogRequestFull": false,
"LogRequestBody": false,
"LogResponseBody": false,
"ApiFormat": "Ollama",
"PinnedEmbeddingsProperties": {},
"PinnedCompletionsProperties": {
"model": "qwen2.5:3b",
"options": {
"temperature": 0.1,
"howdy": "doody"
}
},
"AllowEmbeddings": true,
"AllowCompletions": true,
"Active": true,
"CreatedUtc": "2025-09-29T23:15:45.659639Z",
"LastUpdateUtc": "2025-09-29T23:19:07.346900Z",
"HealthySinceUtc": "2025-09-30T01:53:21.026058Z",
"Uptime": "00:26:32.4690556",
"ActiveRequests": 0,
"IsSticky": false
}
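A minimal polling sketch built on these endpoints. Treating a populated HealthySinceUtc as "healthy" is an assumption based on the responses above, and the 30-second interval is arbitrary:

```python
# A minimal sketch: poll backend health and print a one-line summary.
import time
import requests

BASE = "http://localhost:43411"
HEADERS = {"Authorization": "Bearer your-admin-token"}

while True:
    for b in requests.get(f"{BASE}/v1.0/backends/health", headers=HEADERS).json():
        # Assumption: a populated HealthySinceUtc indicates a healthy backend.
        state = "healthy" if b.get("HealthySinceUtc") else "unhealthy"
        print(f'{b["Identifier"]}: {state}, '
              f'{b["ActiveRequests"]}/{b["MaxParallelRequests"]} active requests')
    time.sleep(30)
```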
Error Responses
All APIs return standard HTTP status codes and JSON error responses.
Error Response Format
{
"error": "BadRequest",
"message": "Invalid request format",
"details": "Missing required field: model",
"timestamp": "2024-01-15T10:30:00.123456Z",
"requestId": "12345678-1234-1234-1234-123456789012"
}
Common Error Codes
| Status | Error Type | Description |
|---|---|---|
| 400 | BadRequest | Invalid request format or parameters |
| 401 | Unauthorized | Missing or invalid bearer token |
| 404 | NotFound | Resource not found |
| 409 | Conflict | Resource already exists or conflict |
| 429 | TooManyRequests | Rate limit exceeded |
| 500 | InternalError | Server error |
| 502 | BadGateway | Backend unavailable |
| 503 | ServiceUnavailable | No healthy backends available |
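A client can surface these errors uniformly. A minimal sketch, assuming the error body format shown above; non-JSON error bodies fall back to the raw HTTP error:

```python
# A minimal sketch: raise a readable error from the documented error body.
import requests

def call_ollamaflow(method: str, url: str, **kwargs) -> dict:
    resp = requests.request(method, url, **kwargs)
    if resp.ok:
        return resp.json()
    try:
        err = resp.json()
    except ValueError:
        resp.raise_for_status()  # no JSON body; surface the raw status
    raise RuntimeError(
        f"{resp.status_code} {err.get('error')}: {err.get('message')} "
        f"(requestId={err.get('requestId')})"
    )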
Rate Limiting
OllamaFlow implements rate limiting at the backend level:
- Each backend has a configurable RateLimitRequestsThreshold
- Requests exceeding the threshold receive 429 Too Many Requests
- Rate limiting is applied per backend, not globally (see the backoff sketch below)
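A minimal backoff sketch for 429 responses; the attempt count and delays are client-side choices, not OllamaFlow settings:

```python
# A minimal sketch: exponential backoff on 429 responses.
import time
import requests

def post_with_backoff(url: str, payload: dict, attempts: int = 5) -> requests.Response:
    delay = 1.0
    for _ in range(attempts):
        resp = requests.post(url, json=payload)
        if resp.status_code != 429:
            return resp
        time.sleep(delay)
        delay *= 2  # 1s, 2s, 4s, ...
    return resp  # still rate limited after all attempts
```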
Streaming Responses
Both Ollama APIs and admin APIs support streaming where applicable:
- Text Generation: Set "stream": true for real-time token streaming
- Model Downloads: Progress updates during model pulls
- Health Monitoring: Server-sent events for real-time status updates
Streaming Example
# Stream text generation
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"model": "llama3:8b",
"prompt": "Write a story about space exploration",
"stream": true
}' \
http://localhost:43411/api/generate
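On the client side, each streamed line is a standalone JSON object (Ollama's newline-delimited format) with a text fragment under "response"; the final object carries "done": true. A minimal consumer sketch under those assumptions:

```python
# A minimal sketch: consume the newline-delimited JSON stream from
# /api/generate, printing fragments as they arrive.
import json
import requests

resp = requests.post(
    "http://localhost:43411/api/generate",
    json={"model": "llama3:8b",
          "prompt": "Write a story about space exploration",
          "stream": True},
    stream=True,
)
resp.raise_for_status()
for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    print(chunk.get("response", ""), end="", flush=True)
    if chunk.get("done"):
        print()
        break
```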
Request Headers
Standard Headers
- Content-Type: application/json - Required for POST/PUT requests
- Accept: application/json - Recommended for consistent responses
- User-Agent: your-client/1.0 - Optional client identification
Custom Headers
- X-Request-ID: uuid - Optional request tracking
- X-Frontend-Hint: frontend-id - Optional frontend selection hint
Response Headers
- X-Request-ID: uuid - Request tracking identifier
- X-Backend-Used: backend-id - Which backend processed the request
- X-Model-Synchronized: true/false - Whether model sync was required
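These headers can be read off any response for tracing. A short sketch using the header names listed above, assuming your deployment emits them:

```python
# A minimal sketch: inspect routing headers on a response.
import requests

resp = requests.get("http://localhost:43411/api/tags")
print("request id:  ", resp.headers.get("X-Request-ID"))
print("backend used:", resp.headers.get("X-Backend-Used"))
```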
Postman Collection
A complete Postman collection with all API endpoints and examples is available in the OllamaFlow repository:
Download: OllamaFlow.postman_collection.json
The collection includes:
- All Ollama-compatible endpoints with sample requests
- Complete admin API coverage with authentication
- Environment variables for easy configuration
- Response examples and test scripts
Security and Access Control
OllamaFlow provides comprehensive security controls through Frontend and Backend configuration:
Request Type Controls
- AllowEmbeddings: Controls access to embeddings endpoints
  - Ollama API: /api/embed
  - OpenAI API: /v1/embeddings
- AllowCompletions: Controls access to completion endpoints
  - Ollama API: /api/generate, /api/chat
  - OpenAI API: /v1/completions, /v1/chat/completions
For a request to succeed, both the frontend and at least one assigned backend must allow the request type.
Pinned Properties
Administrators can enforce specific parameters in requests through pinned properties:
- PinnedEmbeddingsProperties: Key-value pairs merged into all embeddings requests
- PinnedCompletionsProperties: Key-value pairs merged into all completion requests
Pinned properties take precedence over client-specified values, enabling organizational compliance and standardization.
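The merge itself happens inside OllamaFlow; the sketch below only models the precedence rule described above, and the recursive handling of nested objects is an assumption for illustration:

```python
# A minimal sketch of the precedence rule only: the real merge is server-side,
# and the nested-object recursion is an illustrative assumption.
def merge_pinned(client_request: dict, pinned: dict) -> dict:
    merged = dict(client_request)
    for key, value in pinned.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_pinned(merged[key], value)  # descend into nested objects
        else:
            merged[key] = value  # the pinned value wins over the client's
    return merged

client = {"model": "llama3:8b", "options": {"temperature": 0.9, "top_p": 0.8}}
pinned = {"options": {"temperature": 0.1}}
print(merge_pinned(client, pinned))
# {'model': 'llama3:8b', 'options': {'temperature': 0.1, 'top_p': 0.8}}
```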
Example Security Configuration
# Create a frontend that only allows completions with enforced temperature
curl -X PUT \
-H "Authorization: Bearer your-admin-token" \
-H "Content-Type: application/json" \
-d '{
"Identifier": "secure-frontend",
"Name": "Secure Completions Only",
"AllowEmbeddings": false,
"AllowCompletions": true,
"PinnedCompletionsProperties": {
"options": {
"temperature": 0.7,
"num_ctx": 2048
}
},
"Backends": ["secure-backend"]
}' \
http://localhost:43411/v1.0/frontends
With this configuration:
- ✅ Allowed: Completion requests in both API formats
  - POST /api/generate (Ollama)
  - POST /api/chat (Ollama)
  - POST /v1/completions (OpenAI)
  - POST /v1/chat/completions (OpenAI)
- ❌ Blocked: Embeddings requests in both API formats
  - POST /api/embed (Ollama)
  - POST /v1/embeddings (OpenAI)
API Explorer
OllamaFlow includes a companion web-based API Explorer for testing and validation:
- Repository: https://github.com/ollamaflow/apiexplorer
- Purpose: Test and evaluate APIs in scaled inference architectures
- Features: Real-time API testing, JSON validation, response inspection
- Formats: Supports both Ollama and OpenAI API formats
The API Explorer provides an intuitive interface for development, debugging, load testing, and integration validation.
SDK and Client Libraries
OllamaFlow supports both Ollama and OpenAI client libraries:
Ollama-Compatible Libraries
- Python: ollama-python
- JavaScript: ollama-js
- Go: ollama-go
- Rust: ollama-rs
- Java: ollama-java
OpenAI-Compatible Libraries
- Python: openai (official OpenAI Python library)
- JavaScript: openai (official OpenAI Node.js library)
- Go: go-openai
- Rust: async-openai
- Java: openai-java
Simply point these libraries to your OllamaFlow endpoint instead of a direct Ollama or OpenAI instance. For OpenAI libraries, use the base URL http://your-ollamaflow-host:43411/v1.
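For example, the official openai Python library needs only a changed base URL. The api_key is a required placeholder; the proxied endpoints do not check it:

```python
# A minimal sketch: the official openai Python library pointed at OllamaFlow.
from openai import OpenAI

client = OpenAI(base_url="http://your-ollamaflow-host:43411/v1", api_key="unused")

response = client.chat.completions.create(
    model="llama3:8b",
    messages=[{"role": "user", "content": "What is machine learning?"}],
)
print(response.choices[0].message.content)
```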
Next Steps
- Explore Configuration Examples for common scenarios
- Review REST API Basics for API fundamentals
- Check Monitoring and Observability for production insights