API Reference

OllamaFlow provides three sets of APIs: Ollama-compatible and OpenAI-compatible APIs for AI inference, plus Administrative APIs for cluster management. All APIs use JSON request and response bodies and maintain full compatibility with existing Ollama and OpenAI clients.

Base URL and Authentication

  • Base URL: http://your-ollamaflow-host:43411
  • Admin Authentication: Bearer token required for administrative endpoints
  • Ollama APIs: No authentication required (proxied to backends)
  • OpenAI APIs: No authentication required (proxied to backends)

Authentication Header

# For administrative APIs
curl -H "Authorization: Bearer your-admin-token" \
  http://localhost:43411/v1.0/backends

API Compatibility

OllamaFlow supports both Ollama and OpenAI-compatible API formats, allowing clients to use either API style without modification.

Ollama-Compatible APIs

These endpoints maintain full compatibility with the Ollama API, allowing existing clients to work without modification.

Generate Completion

Generate text completions using a specified model.

POST /api/generate

Request Body

{
  "model": "llama3:8b",
  "prompt": "Why is the sky blue?",
  "stream": true,
  "options": {
    "temperature": 0.8,
    "num_predict": 100,
    "top_k": 20,
    "top_p": 0.9
  }
}

cURL Example

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b",
    "prompt": "Explain quantum computing in simple terms",
    "stream": false,
    "options": {
      "temperature": 0.7,
      "num_predict": 200
    }
  }' \
  http://localhost:43411/api/generate

Response

{
  "model": "llama3:8b",
  "created_at": "2024-01-15T10:30:00.123456Z",
  "response": "Quantum computing is a revolutionary technology...",
  "done": true,
  "context": [1, 2, 3, 4, 5],
  "total_duration": 1234567890,
  "load_duration": 123456789,
  "prompt_eval_count": 10,
  "prompt_eval_duration": 234567890,
  "eval_count": 25,
  "eval_duration": 876543210
}

Chat Completion

Generate chat-style completions with conversation context.

POST /api/chat

Request Body

{
  "model": "llama3:8b",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful AI assistant."
    },
    {
      "role": "user",
      "content": "What is machine learning?"
    }
  ],
  "stream": true,
  "options": {
    "temperature": 0.8,
    "num_ctx": 2048
  }
}

cURL Example

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b",
    "stream": false,
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant specializing in technology."
      },
      {
        "role": "user",
        "content": "Explain the difference between AI and ML"
      }
    ]
  }' \
  http://localhost:43411/api/chat

Pull Model

Download a model to the backend instances.

POST /api/pull

Request Body

{
  "model": "llama3:8b",
  "insecure": false,
  "stream": true
}

cURL Example

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral:7b"
  }' \
  http://localhost:43411/api/pull

Show Model Information

Get detailed information about a specific model.

POST /api/show

Request Body

{
  "name": "llama3:8b",
  "verbose": true
}

cURL Example

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama3:8b"
  }' \
  http://localhost:43411/api/show

List Models

Get a list of available models across all backends.

GET /api/tags

cURL Example

curl http://localhost:43411/api/tags

Response

{
  "models": [
    {
      "name": "llama3:8b",
      "model": "llama3:8b",
      "modified_at": "2024-01-15T10:30:00.123456Z",
      "size": 4661224576,
      "digest": "sha256:8934d96d3f08...",
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "llama",
        "families": ["llama"],
        "parameter_size": "8B",
        "quantization_level": "Q4_0"
      }
    }
  ]
}

List Running Models

Get information about currently running models.

GET /api/ps

cURL Example

curl http://localhost:43411/api/ps

Generate Embeddings

Generate embeddings for text input.

POST /api/embed

Request Body

{
  "model": "nomic-embed-text",
  "input": "The quick brown fox jumps over the lazy dog"
}

cURL Example

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "input": ["Hello world", "How are you?"]
  }' \
  http://localhost:43411/api/embed

Delete Model

Remove a model from backend instances.

DELETE /api/delete

Request Body

{
  "name": "llama3:8b"
}

cURL Example

curl -X DELETE \
  -H "Content-Type: application/json" \
  -d '{
    "name": "old-model:7b"
  }' \
  http://localhost:43411/api/delete

OpenAI-Compatible APIs

OllamaFlow also supports OpenAI-compatible API endpoints, allowing existing OpenAI clients and tools to work seamlessly.

Generate Completion

Generate text completions using OpenAI-compatible format.

POST /v1/completions

Request Body

{
  "model": "llama3:8b",
  "prompt": "Why is the sky blue?",
  "max_tokens": 100,
  "temperature": 0.8,
  "top_p": 0.9,
  "stream": false
}

cURL Example

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b",
    "prompt": "Explain quantum computing in simple terms",
    "max_tokens": 200,
    "temperature": 0.7,
    "stream": false
  }' \
  http://localhost:43411/v1/completions

Chat Completion

Generate chat-style completions using OpenAI-compatible format.

POST /v1/chat/completions

Request Body

{
  "model": "llama3:8b",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful AI assistant."
    },
    {
      "role": "user",
      "content": "What is machine learning?"
    }
  ],
  "max_tokens": 150,
  "temperature": 0.8,
  "stream": false
}

cURL Example

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant specializing in technology."
      },
      {
        "role": "user",
        "content": "Explain the difference between AI and ML"
      }
    ],
    "max_tokens": 150,
    "temperature": 0.7
  }' \
  http://localhost:43411/v1/chat/completions

Generate Embeddings

Generate embeddings using OpenAI-compatible format.

POST /v1/embeddings

Request Body

{
  "model": "nomic-embed-text",
  "input": "The quick brown fox jumps over the lazy dog"
}

cURL Example

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "input": ["Hello world", "How are you?"]
  }' \
  http://localhost:43411/v1/embeddings

Response

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.1, 0.2, 0.3, ...],
      "index": 0
    }
  ],
  "model": "nomic-embed-text",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}

List Models

Get available models using OpenAI-compatible format.

GET /v1/models

cURL Example

curl http://localhost:43411/v1/models

Response

{
  "object": "list",
  "data": [
    {
      "id": "llama3:8b",
      "object": "model",
      "created": 1704067200,
      "owned_by": "ollama"
    }
  ]
}

Administrative APIs

These endpoints provide cluster management capabilities and require bearer token authentication.
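
These endpoints can also be scripted. Below is a minimal sketch using the Python requests library; the base URL and token are placeholders, and the Identifier/Hostname fields are assumed to match the backend objects shown in the health responses later in this section.

# Minimal sketch of an authenticated admin call; placeholders throughout
import requests
BASE_URL = "http://localhost:43411"
HEADERS = {"Authorization": "Bearer your-admin-token"}
def list_backends():
    # GET /v1.0/backends returns the configured backends as JSON
    resp = requests.get(f"{BASE_URL}/v1.0/backends", headers=HEADERS)
    resp.raise_for_status()  # raises on 401, 404, 5xx, etc.
    return resp.json()
for backend in list_backends():
    # field names assumed from the health responses shown below
    print(backend["Identifier"], backend["Hostname"])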

Frontend Management

List All Frontends

GET /v1.0/frontends

curl -H "Authorization: Bearer your-admin-token" \
  http://localhost:43411/v1.0/frontends

Get Frontend

GET /v1.0/frontends/{identifier}

curl -H "Authorization: Bearer your-admin-token" \
  http://localhost:43411/v1.0/frontends/frontend1

Create Frontend

PUT /v1.0/frontends

curl -X PUT \
  -H "Authorization: Bearer your-admin-token" \
  -H "Content-Type: application/json" \
  -d '{
    "Identifier": "production-frontend",
    "Name": "Production AI Inference",
    "Hostname": "ai.company.com",
    "LoadBalancing": "RoundRobin",
    "TimeoutMs": 90000,
    "Backends": ["gpu-1", "gpu-2", "gpu-3"],
    "RequiredModels": ["llama3:8b", "mistral:7b"],
    "MaxRequestBodySize": 1073741824,
    "UseStickySessions": true,
    "StickySessionExpirationMs": 3600000
  }' \
  http://localhost:43411/v1.0/frontends

Update Frontend

PUT /v1.0/frontends/{identifier}

curl -X PUT \
  -H "Authorization: Bearer your-admin-token" \
  -H "Content-Type: application/json" \
  -d '{
    "Identifier": "production-frontend",
    "Name": "Updated Production Frontend",
    "Hostname": "*",
    "LoadBalancing": "Random",
    "Backends": ["gpu-1", "gpu-2", "gpu-3", "gpu-4"],
    "RequiredModels": ["llama3:8b", "mistral:7b", "codellama:13b"]
  }' \
  http://localhost:43411/v1.0/frontends/production-frontend

Delete Frontend

DELETE /v1.0/frontends/{identifier}

curl -X DELETE \
  -H "Authorization: Bearer your-admin-token" \
  http://localhost:43411/v1.0/frontends/old-frontend

Backend Management

List All Backends

GET /v1.0/backends

curl -H "Authorization: Bearer your-admin-token" \
  http://localhost:43411/v1.0/backends

Get Backend

GET /v1.0/backends/{identifier}

curl -H "Authorization: Bearer your-admin-token" \
  http://localhost:43411/v1.0/backends/gpu-1

Create Backend

PUT /v1.0/backends

curl -X PUT \
  -H "Authorization: Bearer your-admin-token" \
  -H "Content-Type: application/json" \
  -d '{
    "Identifier": "gpu-server-4",
    "Name": "GPU Server 4",
    "Hostname": "192.168.1.104",
    "Port": 11434,
    "Ssl": false,
    "HealthCheckUrl": "/api/version",
    "HealthCheckMethod": "GET",
    "UnhealthyThreshold": 3,
    "HealthyThreshold": 2,
    "MaxParallelRequests": 8,
    "RateLimitRequestsThreshold": 20,
    "LogRequestBody": false,
    "LogResponseBody": false
  }' \
  http://localhost:43411/v1.0/backends

Update Backend

PUT /v1.0/backends/{identifier}

curl -X PUT \
  -H "Authorization: Bearer your-admin-token" \
  -H "Content-Type: application/json" \
  -d '{
    "Identifier": "gpu-server-1",
    "Name": "Updated GPU Server 1",
    "Hostname": "192.168.1.101",
    "Port": 11434,
    "MaxParallelRequests": 12,
    "UnhealthyThreshold": 2
  }' \
  http://localhost:43411/v1.0/backends/gpu-server-1

Delete Backend

DELETE /v1.0/backends/{identifier}

curl -X DELETE \
  -H "Authorization: Bearer your-admin-token" \
  http://localhost:43411/v1.0/backends/old-backend

Health Monitoring

Get All Backend Health

GET /v1.0/backends/health

curl -H "Authorization: Bearer your-admin-token" \
  http://localhost:43411/v1.0/backends/health

Response

[
  {
    "Identifier": "backend1",
    "Name": "My localhost Ollama instance",
    "Hostname": "localhost",
    "Port": 11434,
    "Ssl": false,
    "UnhealthyThreshold": 2,
    "HealthyThreshold": 2,
    "HealthCheckMethod": {
      "Method": "GET"
    },
    "HealthCheckUrl": "/",
    "MaxParallelRequests": 4,
    "RateLimitRequestsThreshold": 10,
    "LogRequestFull": false,
    "LogRequestBody": false,
    "LogResponseBody": false,
    "ApiFormat": "Ollama",
    "PinnedEmbeddingsProperties": {},
    "PinnedCompletionsProperties": {
      "model": "qwen2.5:3b",
      "options": {
        "temperature": 0.1,
        "howdy": "doody"
      }
    },
    "AllowEmbeddings": true,
    "AllowCompletions": true,
    "Active": true,
    "CreatedUtc": "2025-09-29T23:15:45.659639Z",
    "LastUpdateUtc": "2025-09-29T23:19:07.346900Z",
    "HealthySinceUtc": "2025-09-30T01:53:21.026058Z",
    "Uptime": "00:25:52.4859452",
    "ActiveRequests": 0,
    "IsSticky": false
  }
]

Get Single Backend Health

GET /v1.0/backends/{identifier}/health

curl -H "Authorization: Bearer your-admin-token" \
  http://localhost:43411/v1.0/backends/backend1/health

Response

{
  "Identifier": "backend1",
  "Name": "My localhost Ollama instance",
  "Hostname": "localhost",
  "Port": 11434,
  "Ssl": false,
  "UnhealthyThreshold": 2,
  "HealthyThreshold": 2,
  "HealthCheckMethod": {
    "Method": "GET"
  },
  "HealthCheckUrl": "/",
  "MaxParallelRequests": 4,
  "RateLimitRequestsThreshold": 10,
  "LogRequestFull": false,
  "LogRequestBody": false,
  "LogResponseBody": false,
  "ApiFormat": "Ollama",
  "PinnedEmbeddingsProperties": {},
  "PinnedCompletionsProperties": {
    "model": "qwen2.5:3b",
    "options": {
      "temperature": 0.1,
      "howdy": "doody"
    }
  },
  "AllowEmbeddings": true,
  "AllowCompletions": true,
  "Active": true,
  "CreatedUtc": "2025-09-29T23:15:45.659639Z",
  "LastUpdateUtc": "2025-09-29T23:19:07.346900Z",
  "HealthySinceUtc": "2025-09-30T01:53:21.026058Z",
  "Uptime": "00:26:32.4690556",
  "ActiveRequests": 0,
  "IsSticky": false
}
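
For continuous monitoring, the health endpoint can be polled on an interval. The following is a minimal Python sketch, assuming the response shape shown above; the 30-second interval and the health test (Active plus a populated HealthySinceUtc) are illustrative choices, not OllamaFlow requirements.

# Minimal health-polling sketch using the "requests" library
import time
import requests
BASE_URL = "http://localhost:43411"
HEADERS = {"Authorization": "Bearer your-admin-token"}
def poll_backend_health(interval_seconds=30):
    # Poll GET /v1.0/backends/health and flag anything not healthy
    while True:
        resp = requests.get(f"{BASE_URL}/v1.0/backends/health", headers=HEADERS)
        resp.raise_for_status()
        for backend in resp.json():
            # "Active" and "HealthySinceUtc" appear in the sample response
            # above; treat a missing HealthySinceUtc as not yet healthy
            if not backend.get("Active") or not backend.get("HealthySinceUtc"):
                print(f"backend {backend['Identifier']} appears unhealthy")
        time.sleep(interval_seconds)
# poll_backend_health()  # runs until interrupted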

Error Responses

All APIs return standard HTTP status codes and JSON error responses.

Error Response Format

{
  "error": "BadRequest",
  "message": "Invalid request format",
  "details": "Missing required field: model",
  "timestamp": "2024-01-15T10:30:00.123456Z",
  "requestId": "12345678-1234-1234-1234-123456789012"
}

Common Error Codes

Status  Error Type          Description
400     BadRequest          Invalid request format or parameters
401     Unauthorized        Missing or invalid bearer token
404     NotFound            Resource not found
409     Conflict            Resource already exists or conflict
429     TooManyRequests     Rate limit exceeded
500     InternalError       Server error
502     BadGateway          Backend unavailable
503     ServiceUnavailable  No healthy backends available

Rate Limiting

OllamaFlow implements rate limiting at the backend level:

  • Each backend has a configurable RateLimitRequestsThreshold
  • Requests exceeding the threshold receive 429 Too Many Requests
  • Rate limiting is applied per backend, not globally
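
A common client-side response to a 429 is to back off and retry. The following is a minimal Python sketch; the retry count, delays, and payload are illustrative.

# Minimal retry-with-backoff sketch for 429 responses, using "requests"
import time
import requests
def post_with_retry(url, payload, max_retries=5):
    delay = 1.0
    for attempt in range(max_retries):
        resp = requests.post(url, json=payload)
        if resp.status_code != 429:  # not rate limited: return as-is
            return resp
        time.sleep(delay)            # back off before retrying
        delay *= 2                   # exponential backoff
    raise RuntimeError("rate limited after retries")
resp = post_with_retry(
    "http://localhost:43411/api/generate",
    {"model": "llama3:8b", "prompt": "Hello", "stream": False},
)
print(resp.json().get("response", ""))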

Streaming Responses

Both the Ollama-compatible APIs and the administrative APIs support streaming where applicable:

  • Text Generation: Set "stream": true for real-time token streaming
  • Model Downloads: Progress updates during model pulls
  • Health Monitoring: Server-sent events for real-time status updates

Streaming Example

# Stream text generation
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b",
    "prompt": "Write a story about space exploration",
    "stream": true
  }' \
  http://localhost:43411/api/generate
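
On the client side, an Ollama-style stream arrives as newline-delimited JSON objects, with the final object carrying "done": true. A minimal Python sketch for consuming it (model and prompt are placeholders):

# Consume the newline-delimited JSON stream from /api/generate
import json
import requests
payload = {
    "model": "llama3:8b",
    "prompt": "Write a story about space exploration",
    "stream": True,
}
with requests.post("http://localhost:43411/api/generate",
                   json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():  # one JSON object per line
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):       # final object carries timing stats
            break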

Request Headers

Standard Headers

  • Content-Type: application/json - Required for POST/PUT requests
  • Accept: application/json - Recommended for consistent responses
  • User-Agent: your-client/1.0 - Optional client identification

Custom Headers

  • X-Request-ID: uuid - Optional request tracking
  • X-Frontend-Hint: frontend-id - Optional frontend selection hint

Response Headers

  • X-Request-ID: uuid - Request tracking identifier
  • X-Backend-Used: backend-id - Which backend processed the request
  • X-Model-Synchronized: true/false - Whether model sync was required
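
As a quick sketch, the optional headers above can be exercised from Python like this (header names as listed; model and prompt are placeholders):

# Send an optional X-Request-ID and inspect which backend served the request
import uuid
import requests
request_id = str(uuid.uuid4())
resp = requests.post(
    "http://localhost:43411/api/generate",
    json={"model": "llama3:8b", "prompt": "Hello", "stream": False},
    headers={"X-Request-ID": request_id},
)
print("request id:", resp.headers.get("X-Request-ID"))
print("backend   :", resp.headers.get("X-Backend-Used"))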

Postman Collection

A complete Postman collection with all API endpoints and examples is available in the OllamaFlow repository:

Download: OllamaFlow.postman_collection.json

The collection includes:

  • All Ollama-compatible endpoints with sample requests
  • Complete admin API coverage with authentication
  • Environment variables for easy configuration
  • Response examples and test scripts

Security and Access Control

OllamaFlow provides comprehensive security controls through Frontend and Backend configuration:

Request Type Controls

  • AllowEmbeddings: Controls access to embeddings endpoints
    • Ollama API: /api/embed
    • OpenAI API: /v1/embeddings
  • AllowCompletions: Controls access to completion endpoints
    • Ollama API: /api/generate, /api/chat
    • OpenAI API: /v1/completions, /v1/chat/completions

For a request to succeed, both the frontend and at least one assigned backend must allow the request type.

Pinned Properties

Administrators can enforce specific parameters in requests through pinned properties:

  • PinnedEmbeddingsProperties: Key-value pairs merged into all embeddings requests
  • PinnedCompletionsProperties: Key-value pairs merged into all completion requests

Pinned properties take precedence over client-specified values, enabling organizational compliance and standardization.
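
Conceptually, the result is a dictionary overlay in which pinned keys win. The sketch below illustrates that observable effect only; it is not OllamaFlow's implementation, and the exact handling of nested objects may differ.

# Rough illustration of pinned-property precedence: pinned keys overwrite
# client-supplied keys, recursing into nested objects. NOT OllamaFlow's
# actual implementation, just the observable effect.
def merge_pinned(client_body: dict, pinned: dict) -> dict:
    merged = dict(client_body)
    for key, value in pinned.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_pinned(merged[key], value)  # deep merge
        else:
            merged[key] = value                             # pinned wins
    return merged
client = {"model": "llama3:70b", "options": {"temperature": 1.5, "top_k": 40}}
pinned = {"model": "qwen2.5:3b", "options": {"temperature": 0.1}}
print(merge_pinned(client, pinned))  # model and temperature forced; top_k kept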

Example Security Configuration

# Create a frontend that only allows completions with enforced temperature
curl -X PUT \
  -H "Authorization: Bearer your-admin-token" \
  -H "Content-Type: application/json" \
  -d '{
    "Identifier": "secure-frontend",
    "Name": "Secure Completions Only",
    "AllowEmbeddings": false,
    "AllowCompletions": true,
    "PinnedCompletionsProperties": {
      "options": {
        "temperature": 0.7,
        "num_ctx": 2048
      }
    },
    "Backends": ["secure-backend"]
  }' \
  http://localhost:43411/v1.0/frontends

With this configuration:

  • Allowed: Completion requests to both API formats
    • POST /api/generate (Ollama)
    • POST /api/chat (Ollama)
    • POST /v1/completions (OpenAI)
    • POST /v1/chat/completions (OpenAI)
  • Blocked: Embeddings requests to both API formats
    • POST /api/embed (Ollama)
    • POST /v1/embeddings (OpenAI)

API Explorer

OllamaFlow includes a companion web-based API Explorer for testing and validation:

  • Repository: https://github.com/ollamaflow/apiexplorer
  • Purpose: Test and evaluate APIs in scaled inference architectures
  • Features: Real-time API testing, JSON validation, response inspection
  • Formats: Supports both Ollama and OpenAI API formats

The API Explorer provides an intuitive interface for debugging during development, load testing, and integration validation.

SDK and Client Libraries

OllamaFlow supports both Ollama and OpenAI client libraries:

Ollama-Compatible Libraries

  • Python: ollama-python
  • JavaScript: ollama-js
  • Go: ollama-go
  • Rust: ollama-rs
  • Java: ollama-java

OpenAI-Compatible Libraries

  • Python: openai (official OpenAI Python library)
  • JavaScript: openai (official OpenAI Node.js library)
  • Go: go-openai
  • Rust: async-openai
  • Java: openai-java

Simply point these libraries to your OllamaFlow endpoint instead of a direct Ollama or OpenAI instance. For OpenAI libraries, use the base URL http://your-ollamaflow-host:43411/v1.
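
As a sketch, either style of client can target OllamaFlow by changing only the host (model names and prompts are placeholders):

# Ollama-style client (pip install ollama)
import ollama
oll = ollama.Client(host="http://localhost:43411")
reply = oll.chat(model="llama3:8b",
                 messages=[{"role": "user", "content": "Hello"}])
print(reply["message"]["content"])
# OpenAI-style client (pip install openai); api_key is a placeholder since
# the proxied inference endpoints require no authentication
from openai import OpenAI
oai = OpenAI(base_url="http://localhost:43411/v1", api_key="unused")
completion = oai.chat.completions.create(
    model="llama3:8b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(completion.choices[0].message.content)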

Next Steps