API Reference

OllamaFlow provides three sets of APIs: Ollama-compatible and OpenAI-compatible APIs for AI inference, plus Administrative APIs for cluster management. All APIs use JSON request and response bodies and maintain full compatibility with existing Ollama and OpenAI clients.

Base URL and Authentication

  • Base URL: http://your-ollamaflow-host:43411
  • Admin Authentication: Bearer token required for administrative endpoints
  • Ollama APIs: No authentication required (proxied to backends)
  • OpenAI APIs: No authentication required (proxied to backends)

Authentication Header

# For administrative APIs
curl -H "Authorization: Bearer your-admin-token" \
  http://localhost:43411/v1.0/backends

API Compatibility

OllamaFlow supports both Ollama and OpenAI-compatible API formats, allowing clients to use either API style without modification.

Ollama-Compatible APIs

These endpoints maintain full compatibility with the Ollama API, allowing existing clients to work without modification.

Generate Completion

Generate text completions using a specified model.

POST /api/generate

Request Body

{
  "model": "llama3:8b",
  "prompt": "Why is the sky blue?",
  "stream": true,
  "options": {
    "temperature": 0.8,
    "num_predict": 100,
    "top_k": 20,
    "top_p": 0.9
  }
}

cURL Example

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b",
    "prompt": "Explain quantum computing in simple terms",
    "stream": false,
    "options": {
      "temperature": 0.7,
      "num_predict": 200
    }
  }' \
  http://localhost:43411/api/generate

Response

{
  "model": "llama3:8b",
  "created_at": "2024-01-15T10:30:00.123456Z",
  "response": "Quantum computing is a revolutionary technology...",
  "done": true,
  "context": [1, 2, 3, 4, 5],
  "total_duration": 1234567890,
  "load_duration": 123456789,
  "prompt_eval_count": 10,
  "prompt_eval_duration": 234567890,
  "eval_count": 25,
  "eval_duration": 876543210
}

Chat Completion

Generate chat-style completions with conversation context.

POST /api/chat

Request Body

{
  "model": "llama3:8b",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful AI assistant."
    },
    {
      "role": "user",
      "content": "What is machine learning?"
    }
  ],
  "stream": true,
  "options": {
    "temperature": 0.8,
    "num_ctx": 2048
  }
}

cURL Example

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b",
    "stream": false,
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant specializing in technology."
      },
      {
        "role": "user",
        "content": "Explain the difference between AI and ML"
      }
    ]
  }' \
  http://localhost:43411/api/chat

Pull Model

Download a model to the backend instances.

POST /api/pull

Request Body

{
  "model": "llama3:8b",
  "insecure": false,
  "stream": true
}

cURL Example

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral:7b"
  }' \
  http://localhost:43411/api/pull

Show Model Information

Get detailed information about a specific model.

POST /api/show

Request Body

{
  "name": "llama3:8b",
  "verbose": true
}

cURL Example

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama3:8b"
  }' \
  http://localhost:43411/api/show

List Models

Get a list of available models across all backends.

GET /api/tags

cURL Example

curl http://localhost:43411/api/tags

Response

{
  "models": [
    {
      "name": "llama3:8b",
      "model": "llama3:8b",
      "modified_at": "2024-01-15T10:30:00.123456Z",
      "size": 4661224576,
      "digest": "sha256:8934d96d3f08...",
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "llama",
        "families": ["llama"],
        "parameter_size": "8B",
        "quantization_level": "Q4_0"
      }
    }
  ]
}

List Running Models

Get information about currently running models.

GET /api/ps

cURL Example

curl http://localhost:43411/api/ps

Generate Embeddings

Generate embeddings for text input.

POST /api/embed

Request Body

{
  "model": "nomic-embed-text",
  "input": "The quick brown fox jumps over the lazy dog"
}

cURL Example

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "input": ["Hello world", "How are you?"]
  }' \
  http://localhost:43411/api/embed

Delete Model

Remove a model from backend instances.

DELETE /api/delete

Request Body

{
  "name": "llama3:8b"
}

cURL Example

curl -X DELETE \
  -H "Content-Type: application/json" \
  -d '{
    "name": "old-model:7b"
  }' \
  http://localhost:43411/api/delete

OpenAI-Compatible APIs

OllamaFlow also supports OpenAI-compatible API endpoints, allowing existing OpenAI clients and tools to work seamlessly.

Generate Completion

Generate text completions using OpenAI-compatible format.

POST /v1/completions

Request Body

{
  "model": "llama3:8b",
  "prompt": "Why is the sky blue?",
  "max_tokens": 100,
  "temperature": 0.8,
  "top_p": 0.9,
  "stream": false
}

cURL Example

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b",
    "prompt": "Explain quantum computing in simple terms",
    "max_tokens": 200,
    "temperature": 0.7,
    "stream": false
  }' \
  http://localhost:43411/v1/completions

Chat Completion

Generate chat-style completions using OpenAI-compatible format.

POST /v1/chat/completions

Request Body

{
  "model": "llama3:8b",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful AI assistant."
    },
    {
      "role": "user",
      "content": "What is machine learning?"
    }
  ],
  "max_tokens": 150,
  "temperature": 0.8,
  "stream": false
}

cURL Example

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant specializing in technology."
      },
      {
        "role": "user",
        "content": "Explain the difference between AI and ML"
      }
    ],
    "max_tokens": 150,
    "temperature": 0.7
  }' \
  http://localhost:43411/v1/chat/completions

Generate Embeddings

Generate embeddings using OpenAI-compatible format.

POST /v1/embeddings

Request Body

{
  "model": "nomic-embed-text",
  "input": "The quick brown fox jumps over the lazy dog"
}

cURL Example

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "input": ["Hello world", "How are you?"]
  }' \
  http://localhost:43411/v1/embeddings

Response

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.1, 0.2, 0.3, ...],
      "index": 0
    }
  ],
  "model": "nomic-embed-text",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}

List Models

Get available models using OpenAI-compatible format.

GET /v1/models

cURL Example

curl http://localhost:43411/v1/models

Response

{
  "object": "list",
  "data": [
    {
      "id": "llama3:8b",
      "object": "model",
      "created": 1704067200,
      "owned_by": "ollama"
    }
  ]
}

Administrative APIs

These endpoints provide cluster management capabilities and require bearer token authentication.
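
These endpoints can also be scripted. Below is a minimal sketch using the Python requests library; the base URL and token are placeholders, and the Identifier/Hostname fields are assumed to match the backend objects shown in the health responses later in this section.

# Minimal sketch of an authenticated admin call; placeholders throughout
import requests
BASE_URL = "http://localhost:43411"
HEADERS = {"Authorization": "Bearer your-admin-token"}
def list_backends():
    # GET /v1.0/backends returns the configured backends as JSON
    resp = requests.get(f"{BASE_URL}/v1.0/backends", headers=HEADERS)
    resp.raise_for_status()  # raises on 401, 404, 5xx, etc.
    return resp.json()
for backend in list_backends():
    # field names assumed from the health responses shown below
    print(backend["Identifier"], backend["Hostname"])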

Frontend Management

List All Frontends

GET /v1.0/frontends

curl -H "Authorization: Bearer your-admin-token" \
  http://localhost:43411/v1.0/frontends

Get Frontend

GET /v1.0/frontends/{identifier}

curl -H "Authorization: Bearer your-admin-token" \
  http://localhost:43411/v1.0/frontends/frontend1

Create Frontend

PUT /v1.0/frontends

curl -X PUT \
  -H "Authorization: Bearer your-admin-token" \
  -H "Content-Type: application/json" \
  -d '{
    "Identifier": "production-frontend",
    "Name": "Production AI Inference",
    "Hostname": "ai.company.com",
    "LoadBalancing": "RoundRobin",
    "TimeoutMs": 90000,
    "Backends": ["gpu-1", "gpu-2", "gpu-3"],
    "RequiredModels": ["llama3:8b", "mistral:7b"],
    "MaxRequestBodySize": 1073741824,
    "UseStickySessions": true,
    "StickySessionExpirationMs": 3600000
  }' \
  http://localhost:43411/v1.0/frontends

Update Frontend

PUT /v1.0/frontends/{identifier}

curl -X PUT \
  -H "Authorization: Bearer your-admin-token" \
  -H "Content-Type: application/json" \
  -d '{
    "Identifier": "production-frontend",
    "Name": "Updated Production Frontend",
    "Hostname": "*",
    "LoadBalancing": "Random",
    "Backends": ["gpu-1", "gpu-2", "gpu-3", "gpu-4"],
    "RequiredModels": ["llama3:8b", "mistral:7b", "codellama:13b"]
  }' \
  http://localhost:43411/v1.0/frontends/production-frontend

Delete Frontend

DELETE /v1.0/frontends/{identifier}

curl -X DELETE \
  -H "Authorization: Bearer your-admin-token" \
  http://localhost:43411/v1.0/frontends/old-frontend

Backend Management

List All Backends

GET /v1.0/backends

curl -H "Authorization: Bearer your-admin-token" \
  http://localhost:43411/v1.0/backends

Get Backend

GET /v1.0/backends/{identifier}

curl -H "Authorization: Bearer your-admin-token" \
  http://localhost:43411/v1.0/backends/gpu-1

Create Backend

PUT /v1.0/backends

curl -X PUT \
  -H "Authorization: Bearer your-admin-token" \
  -H "Content-Type: application/json" \
  -d '{
    "Identifier": "gpu-server-4",
    "Name": "GPU Server 4",
    "Hostname": "192.168.1.104",
    "Port": 11434,
    "Ssl": false,
    "HealthCheckUrl": "/api/version",
    "HealthCheckMethod": "GET",
    "UnhealthyThreshold": 3,
    "HealthyThreshold": 2,
    "MaxParallelRequests": 8,
    "RateLimitRequestsThreshold": 20,
    "LogRequestBody": false,
    "LogResponseBody": false
  }' \
  http://localhost:43411/v1.0/backends

Update Backend

PUT /v1.0/backends/{identifier}

curl -X PUT \
  -H "Authorization: Bearer your-admin-token" \
  -H "Content-Type: application/json" \
  -d '{
    "Identifier": "gpu-server-1",
    "Name": "Updated GPU Server 1",
    "Hostname": "192.168.1.101",
    "Port": 11434,
    "MaxParallelRequests": 12,
    "UnhealthyThreshold": 2
  }' \
  http://localhost:43411/v1.0/backends/gpu-server-1

Delete Backend

DELETE /v1.0/backends/{identifier}

curl -X DELETE \
  -H "Authorization: Bearer your-admin-token" \
  http://localhost:43411/v1.0/backends/old-backend

Health Monitoring

Get All Backend Health

GET /v1.0/backends/health

curl -H "Authorization: Bearer your-admin-token" \
  http://localhost:43411/v1.0/backends/health

Response

[
  {
    "Identifier": "backend1",
    "Name": "My localhost Ollama instance",
    "Hostname": "localhost",
    "Port": 11434,
    "Ssl": false,
    "UnhealthyThreshold": 2,
    "HealthyThreshold": 2,
    "HealthCheckMethod": {
      "Method": "GET"
    },
    "HealthCheckUrl": "/",
    "MaxParallelRequests": 4,
    "RateLimitRequestsThreshold": 10,
    "LogRequestFull": false,
    "LogRequestBody": false,
    "LogResponseBody": false,
    "ApiFormat": "Ollama",
    "PinnedEmbeddingsProperties": {},
    "PinnedCompletionsProperties": {
      "model": "qwen2.5:3b",
      "options": {
        "temperature": 0.1,
        "howdy": "doody"
      }
    },
    "AllowEmbeddings": true,
    "AllowCompletions": true,
    "Active": true,
    "CreatedUtc": "2025-09-29T23:15:45.659639Z",
    "LastUpdateUtc": "2025-09-29T23:19:07.346900Z",
    "HealthySinceUtc": "2025-09-30T01:53:21.026058Z",
    "Uptime": "00:25:52.4859452",
    "ActiveRequests": 0,
    "IsSticky": false
  }
]

Get Single Backend Health

GET /v1.0/backends/{identifier}/health

curl -H "Authorization: Bearer your-admin-token" \
  http://localhost:43411/v1.0/backends/backend1/health

Response

{
  "Identifier": "backend1",
  "Name": "My localhost Ollama instance",
  "Hostname": "localhost",
  "Port": 11434,
  "Ssl": false,
  "UnhealthyThreshold": 2,
  "HealthyThreshold": 2,
  "HealthCheckMethod": {
    "Method": "GET"
  },
  "HealthCheckUrl": "/",
  "MaxParallelRequests": 4,
  "RateLimitRequestsThreshold": 10,
  "LogRequestFull": false,
  "LogRequestBody": false,
  "LogResponseBody": false,
  "ApiFormat": "Ollama",
  "PinnedEmbeddingsProperties": {},
  "PinnedCompletionsProperties": {
    "model": "qwen2.5:3b",
    "options": {
      "temperature": 0.1,
      "howdy": "doody"
    }
  },
  "AllowEmbeddings": true,
  "AllowCompletions": true,
  "Active": true,
  "CreatedUtc": "2025-09-29T23:15:45.659639Z",
  "LastUpdateUtc": "2025-09-29T23:19:07.346900Z",
  "HealthySinceUtc": "2025-09-30T01:53:21.026058Z",
  "Uptime": "00:26:32.4690556",
  "ActiveRequests": 0,
  "IsSticky": false
}
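
For continuous monitoring, the health endpoint can be polled on an interval. The following is a minimal Python sketch, assuming the response shape shown above; the 30-second interval and the health test (Active plus a populated HealthySinceUtc) are illustrative choices, not OllamaFlow requirements.

# Minimal health-polling sketch using the "requests" library
import time
import requests
BASE_URL = "http://localhost:43411"
HEADERS = {"Authorization": "Bearer your-admin-token"}
def poll_backend_health(interval_seconds=30):
    # Poll GET /v1.0/backends/health and flag anything not healthy
    while True:
        resp = requests.get(f"{BASE_URL}/v1.0/backends/health", headers=HEADERS)
        resp.raise_for_status()
        for backend in resp.json():
            # "Active" and "HealthySinceUtc" appear in the sample response
            # above; treat a missing HealthySinceUtc as not yet healthy
            if not backend.get("Active") or not backend.get("HealthySinceUtc"):
                print(f"backend {backend['Identifier']} appears unhealthy")
        time.sleep(interval_seconds)
# poll_backend_health()  # runs until interrupted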

Error Responses

All APIs return standard HTTP status codes and JSON error responses.

Error Response Format

{
  "error": "BadRequest",
  "message": "Invalid request format",
  "details": "Missing required field: model",
  "timestamp": "2024-01-15T10:30:00.123456Z",
  "requestId": "12345678-1234-1234-1234-123456789012"
}

Common Error Codes

Status  Error Type          Description
400     BadRequest          Invalid request format or parameters
401     Unauthorized        Missing or invalid bearer token
404     NotFound            Resource not found
409     Conflict            Resource already exists or conflict
429     TooManyRequests     Rate limit exceeded
500     InternalError       Server error
502     BadGateway          Backend unavailable
503     ServiceUnavailable  No healthy backends available

Rate Limiting

OllamaFlow implements rate limiting at the backend level:

  • Each backend has a configurable RateLimitRequestsThreshold
  • Requests exceeding the threshold receive 429 Too Many Requests
  • Rate limiting is applied per backend, not globally
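
A common client-side response to a 429 is to back off and retry. The following is a minimal Python sketch; the retry count, delays, and payload are illustrative.

# Minimal retry-with-backoff sketch for 429 responses, using "requests"
import time
import requests
def post_with_retry(url, payload, max_retries=5):
    delay = 1.0
    for attempt in range(max_retries):
        resp = requests.post(url, json=payload)
        if resp.status_code != 429:  # not rate limited: return as-is
            return resp
        time.sleep(delay)            # back off before retrying
        delay *= 2                   # exponential backoff
    raise RuntimeError("rate limited after retries")
resp = post_with_retry(
    "http://localhost:43411/api/generate",
    {"model": "llama3:8b", "prompt": "Hello", "stream": False},
)
print(resp.json().get("response", ""))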

Streaming Responses

Both the Ollama-compatible APIs and the administrative APIs support streaming where applicable:

  • Text Generation: Set "stream": true for real-time token streaming
  • Model Downloads: Progress updates during model pulls
  • Health Monitoring: Server-sent events for real-time status updates

Streaming Example

# Stream text generation
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b",
    "prompt": "Write a story about space exploration",
    "stream": true
  }' \
  http://localhost:43411/api/generate
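
On the client side, an Ollama-style stream arrives as newline-delimited JSON objects, with the final object carrying "done": true. A minimal Python sketch for consuming it (model and prompt are placeholders):

# Consume the newline-delimited JSON stream from /api/generate
import json
import requests
payload = {
    "model": "llama3:8b",
    "prompt": "Write a story about space exploration",
    "stream": True,
}
with requests.post("http://localhost:43411/api/generate",
                   json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():  # one JSON object per line
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):       # final object carries timing stats
            break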

Request Headers

Standard Headers

  • Content-Type: application/json - Required for POST/PUT requests
  • Accept: application/json - Recommended for consistent responses
  • User-Agent: your-client/1.0 - Optional client identification

Custom Headers

  • X-Request-ID: uuid - Optional request tracking
  • X-Frontend-Hint: frontend-id - Optional frontend selection hint

Response Headers

  • X-Request-ID: uuid - Request tracking identifier
  • X-Backend-Used: backend-id - Which backend processed the request
  • X-Model-Synchronized: true/false - Whether model sync was required
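
As a quick sketch, the optional headers above can be exercised from Python like this (header names as listed; model and prompt are placeholders):

# Send an optional X-Request-ID and inspect which backend served the request
import uuid
import requests
request_id = str(uuid.uuid4())
resp = requests.post(
    "http://localhost:43411/api/generate",
    json={"model": "llama3:8b", "prompt": "Hello", "stream": False},
    headers={"X-Request-ID": request_id},
)
print("request id:", resp.headers.get("X-Request-ID"))
print("backend   :", resp.headers.get("X-Backend-Used"))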

Postman Collection

A complete Postman collection with all API endpoints and examples is available in the OllamaFlow repository:

Download: OllamaFlow.postman_collection.json

The collection includes:

  • All Ollama-compatible endpoints with sample requests
  • Complete admin API coverage with authentication
  • Environment variables for easy configuration
  • Response examples and test scripts

Security and Access Control

OllamaFlow provides comprehensive security controls through Frontend and Backend configuration:

Request Type Controls

  • AllowEmbeddings: Controls access to embeddings endpoints
    • Ollama API: /api/embed
    • OpenAI API: /v1/embeddings
  • AllowCompletions: Controls access to completion endpoints
    • Ollama API: /api/generate, /api/chat
    • OpenAI API: /v1/completions, /v1/chat/completions

For a request to succeed, both the frontend and at least one assigned backend must allow the request type.

Pinned Properties

Administrators can enforce specific parameters in requests through pinned properties:

  • PinnedEmbeddingsProperties: Key-value pairs merged into all embeddings requests
  • PinnedCompletionsProperties: Key-value pairs merged into all completion requests

Pinned properties take precedence over client-specified values, enabling organizational compliance and standardization.
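
Conceptually, the result is a dictionary overlay in which pinned keys win. The sketch below illustrates that observable effect only; it is not OllamaFlow's implementation, and the exact handling of nested objects may differ.

# Rough illustration of pinned-property precedence: pinned keys overwrite
# client-supplied keys, recursing into nested objects. NOT OllamaFlow's
# actual implementation, just the observable effect.
def merge_pinned(client_body: dict, pinned: dict) -> dict:
    merged = dict(client_body)
    for key, value in pinned.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_pinned(merged[key], value)  # deep merge
        else:
            merged[key] = value                             # pinned wins
    return merged
client = {"model": "llama3:70b", "options": {"temperature": 1.5, "top_k": 40}}
pinned = {"model": "qwen2.5:3b", "options": {"temperature": 0.1}}
print(merge_pinned(client, pinned))  # model and temperature forced; top_k kept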

Example Security Configuration

# Create a frontend that only allows completions with enforced temperature
curl -X PUT \
  -H "Authorization: Bearer your-admin-token" \
  -H "Content-Type: application/json" \
  -d '{
    "Identifier": "secure-frontend",
    "Name": "Secure Completions Only",
    "AllowEmbeddings": false,
    "AllowCompletions": true,
    "PinnedCompletionsProperties": {
      "options": {
        "temperature": 0.7,
        "num_ctx": 2048
      }
    },
    "Backends": ["secure-backend"]
  }' \
  http://localhost:43411/v1.0/frontends

With this configuration:

  • Allowed: Completion requests to both API formats
    • POST /api/generate (Ollama)
    • POST /api/chat (Ollama)
    • POST /v1/completions (OpenAI)
    • POST /v1/chat/completions (OpenAI)
  • Blocked: Embeddings requests to both API formats
    • POST /api/embed (Ollama)
    • POST /v1/embeddings (OpenAI)

API Explorer

OllamaFlow includes a companion web-based API Explorer for testing and validation:

  • Repository: https://github.com/ollamaflow/apiexplorer
  • Purpose: Test and evaluate APIs in scaled inference architectures
  • Features: Real-time API testing, JSON validation, response inspection
  • Formats: Supports both Ollama and OpenAI API formats

The API Explorer provides an intuitive interface for debugging during development, load testing, and integration validation.

SDK and Client Libraries

OllamaFlow supports both Ollama and OpenAI client libraries:

Ollama-Compatible Libraries

  • Python: ollama-python
  • JavaScript: ollama-js
  • Go: ollama-go
  • Rust: ollama-rs
  • Java: ollama-java

OpenAI-Compatible Libraries

  • Python: openai (official OpenAI Python library)
  • JavaScript: openai (official OpenAI Node.js library)
  • Go: go-openai
  • Rust: async-openai
  • Java: openai-java

Simply point these libraries to your OllamaFlow endpoint instead of a direct Ollama or OpenAI instance. For OpenAI libraries, use the base URL http://your-ollamaflow-host:43411/v1.
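
As a sketch, either style of client can target OllamaFlow by changing only the host (model names and prompts are placeholders):

# Ollama-style client (pip install ollama)
import ollama
oll = ollama.Client(host="http://localhost:43411")
reply = oll.chat(model="llama3:8b",
                 messages=[{"role": "user", "content": "Hello"}])
print(reply["message"]["content"])
# OpenAI-style client (pip install openai); api_key is a placeholder since
# the proxied inference endpoints require no authentication
from openai import OpenAI
oai = OpenAI(base_url="http://localhost:43411/v1", api_key="unused")
completion = oai.chat.completions.create(
    model="llama3:8b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(completion.choices[0].message.content)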

Next Steps