Ollama Provider Guide

Run AI models locally with full privacy - no API key or cloud service required


Overview

Ollama lets you run open-source large language models entirely on your own machine. NeuroLink integrates with Ollama through a custom OllamaLanguageModel implementation that supports both the native Ollama API (/api/generate) and an OpenAI-compatible mode (/v1/chat/completions).

Key Benefits

  • 100% Local: All inference runs on your hardware, no data leaves your machine
  • No API Key Required: No accounts, billing, or rate limits
  • Offline Capable: Works completely without internet after models are pulled
  • 70+ Models: Llama, Mistral, Qwen, DeepSeek, Gemma, Phi, CodeLlama, and more
  • Tool/Function Calling: Multi-step tool execution via the OpenAI-compatible endpoint
  • Streaming: Full streaming support in both native and OpenAI-compatible modes
  • Multimodal: Image input support for vision-capable models (LLaVA, Llama 3.2)
  • Proxy-Aware: Supports HTTP/HTTPS proxy configuration

API Modes

| Mode | Endpoint | Use Case |
| --- | --- | --- |
| Native (default) | /api/generate | Standard text generation and streaming |
| OpenAI-compatible | /v1/chat/completions | Tool calling, chat-format messages, compatibility |

Tool calling always uses the OpenAI-compatible endpoint regardless of the mode setting.
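This routing rule can be sketched as a small helper. Note this is an illustration of the rule described above, not NeuroLink's actual implementation; `selectEndpoint` is a hypothetical name.

```typescript
// Illustrative sketch of the endpoint-selection rule described above.
type OllamaMode = "native" | "openai-compatible";

function selectEndpoint(mode: OllamaMode, usesTools: boolean): string {
  // Tool calling always goes through the OpenAI-compatible route,
  // regardless of the configured mode.
  if (usesTools || mode === "openai-compatible") {
    return "/v1/chat/completions";
  }
  return "/api/generate";
}

console.log(selectEndpoint("native", false)); // "/api/generate"
```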


Quick Start

1. Install Ollama

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.ai/install.sh | sh

2. Start Ollama and Pull a Model

# Start the Ollama service (may auto-start on install)
ollama serve

# Pull the default model
ollama pull llama3.2:latest

# Verify installation
ollama list

3. Configure Environment (Optional)

Add to your .env file:

# Optional: All values below show defaults. Ollama works with zero configuration.

# Override the default model
OLLAMA_MODEL=llama3.2:latest

# Override the base URL (default: http://localhost:11434)
OLLAMA_BASE_URL=http://localhost:11434

4. Test the Setup

import { NeuroLink } from "@juspay/neurolink";

const ai = new NeuroLink();

const result = await ai.generate({
  input: { text: "Explain quantum computing in simple terms" },
  provider: "ollama",
});

console.log(result.content);

Supported Models

Available Models (from OllamaModels enum)

Any model in the Ollama library can be used by passing its tag to --model. The OllamaModels enum in src/lib/constants/enums.ts provides named constants for common models:

Llama Series

| Enum Key | Model ID | Description |
| --- | --- | --- |
| LLAMA4_SCOUT | llama4:scout | Llama 4 multimodal with vision and tools |
| LLAMA4_MAVERICK | llama4:maverick | Llama 4 multimodal with vision and tools |
| LLAMA3_3_70B | llama3.3:70b | High-performance 70B |
| LLAMA3_2_LATEST | llama3.2:latest | Optimized for edge deployment (default) |
| LLAMA3_2_3B | llama3.2:3b | Compact 3B edge model |
| LLAMA3_2_1B | llama3.2:1b | Ultra-compact 1B model |
| LLAMA3_1_8B | llama3.1:8b | Open model rivaling proprietary models |
| LLAMA3_1_70B | llama3.1:70b | Large-scale open model |
| LLAMA3_1_405B | llama3.1:405b | Largest open Llama model |

Qwen Series

| Enum Key | Model ID | Description |
| --- | --- | --- |
| QWEN3_4B | qwen3:4b | Advanced reasoning, multilingual |
| QWEN3_8B | qwen3:8b | Advanced reasoning, multilingual |
| QWEN3_14B | qwen3:14b | Advanced reasoning, multilingual |
| QWEN3_32B | qwen3:32b | Advanced reasoning, multilingual |
| QWEN3_72B | qwen3:72b | Advanced reasoning, multilingual |
| QWQ_32B | qwq:32b | Reasoning-specialized model |
| QWEN2_5_72B | qwen2.5:72b | Enhanced coding and mathematics |

DeepSeek Series

| Enum Key | Model ID | Description |
| --- | --- | --- |
| DEEPSEEK_R1_7B | deepseek-r1:7b | State-of-the-art reasoning |
| DEEPSEEK_R1_14B | deepseek-r1:14b | Reasoning at 14B scale |
| DEEPSEEK_R1_32B | deepseek-r1:32b | Reasoning at 32B scale |
| DEEPSEEK_R1_70B | deepseek-r1:70b | Large-scale reasoning |
| DEEPSEEK_V3_LATEST | deepseek-v3:latest | Mixture of Experts model |

Mistral Series

| Enum Key | Model ID | Description |
| --- | --- | --- |
| MISTRAL_LATEST | mistral:latest | Efficient general-purpose 7B |
| MISTRAL_SMALL_LATEST | mistral-small:latest | Compact Mistral variant |
| MISTRAL_NEMO_LATEST | mistral-nemo:latest | Nemo architecture |
| MISTRAL_LARGE_LATEST | mistral-large:latest | Largest Mistral model |

Code-Specialized Models

| Enum Key | Model ID | Description |
| --- | --- | --- |
| CODELLAMA_7B | codellama:7b | Code-focused Llama 7B |
| CODELLAMA_13B | codellama:13b | Code-focused Llama 13B |
| CODELLAMA_34B | codellama:34b | Code-focused Llama 34B |
| CODELLAMA_70B | codellama:70b | Code-focused Llama 70B |
| QWEN2_5_CODER_7B | qwen2.5-coder:7b | Qwen coding model |
| QWEN2_5_CODER_32B | qwen2.5-coder:32b | Qwen coding model (large) |
| STARCODER2_3B | starcoder2:3b | Compact code generation |
| STARCODER2_15B | starcoder2:15b | Larger code generation |

Vision-Language Models

| Enum Key | Model ID | Description |
| --- | --- | --- |
| LLAVA_7B | llava:7b | Vision-language 7B |
| LLAVA_13B | llava:13b | Vision-language 13B |
| LLAVA_34B | llava:34b | Vision-language 34B |
| LLAVA_LLAMA3_8B | llava-llama3:8b | LLaVA with Llama 3 backbone |

Other Notable Models

| Enum Key | Model ID | Description |
| --- | --- | --- |
| GEMMA3_LATEST | gemma3:latest | Google Gemma 3 |
| GEMMA2_27B | gemma2:27b | Google Gemma 2 large |
| PHI4_LATEST | phi4:latest | Microsoft Phi 4 |
| PHI3_MINI | phi3:mini | Microsoft Phi 3 compact |
| MIXTRAL_8X7B | mixtral:8x7b | Mixture of Experts |
| MIXTRAL_8X22B | mixtral:8x22b | Large Mixture of Experts |
| COMMAND_R_PLUS | command-r-plus:104b | Cohere enterprise model |
| GLM_5_LATEST | glm-5:latest | Z.AI flagship reasoning |
| NEMOTRON_3_NANO_LATEST | nemotron-3-nano:latest | NVIDIA hybrid MoE, 1M context |

Default Model

The default model is llama3.2:latest (set via OllamaModels.LLAMA3_2_LATEST in the provider registry). The internal OllamaLanguageModel uses llama3.1:8b as its default with llama3.2:latest as a fallback when the primary model fails. Override the default with the OLLAMA_MODEL environment variable.
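The primary/fallback behavior can be illustrated with a minimal sketch. This is a hypothetical helper using the model names from the paragraph above, not the provider's actual implementation:

```typescript
// Sketch of primary-model-with-fallback retry, as described above.
// `generate` stands in for whatever function performs the actual request.
async function generateWithFallback(
  generate: (model: string) => Promise<string>,
  primary = "llama3.1:8b",
  fallback = "llama3.2:latest",
): Promise<string> {
  try {
    return await generate(primary);
  } catch {
    // Primary model failed; retry once with the fallback model.
    return await generate(fallback);
  }
}
```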

Model Selection by Use Case

import { NeuroLink } from "@juspay/neurolink";

const ai = new NeuroLink();

// Fast responses on limited hardware
const quickResult = await ai.generate({
input: { text: "Summarize this text..." },
provider: "ollama",
model: "llama3.2:1b",
});

// Balanced general purpose (recommended)
const balancedResult = await ai.generate({
input: { text: "Analyze this problem..." },
provider: "ollama",
model: "llama3.1:8b",
});

// Code generation
const codeResult = await ai.generate({
input: { text: "Write a Python function to sort a linked list" },
provider: "ollama",
model: "codellama:7b",
});

// Deep reasoning
const reasoningResult = await ai.generate({
input: { text: "Prove this mathematical theorem..." },
provider: "ollama",
model: "deepseek-r1:14b",
});

// Image analysis (vision model)
const visionResult = await ai.generate({
input: {
text: "Describe what you see",
images: ["data:image/jpeg;base64,..."],
},
provider: "ollama",
model: "llava:7b",
});

Model Recommendations by System Resources

| RAM | Recommended Models |
| --- | --- |
| 8 GB | llama3.2:1b, phi3:mini, gemma2:2b |
| 16 GB | llama3.1:8b, mistral:latest, codellama:7b, qwen3:8b |
| 32 GB+ | llama3.3:70b, mixtral:8x7b, deepseek-r1:32b, qwen3:32b |
| 64 GB+ | llama3.1:405b, mixtral:8x22b, deepseek-v3:latest |
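As a rough sketch, the tiers above can be encoded in a small picker. `recommendModel` is a hypothetical helper (the picks are one model per tier from the table), not part of NeuroLink:

```typescript
// Pick a reasonable default model for the available RAM, using one
// representative model from each tier in the table above.
function recommendModel(ramGb: number): string {
  if (ramGb >= 64) return "llama3.1:405b";
  if (ramGb >= 32) return "llama3.3:70b";
  if (ramGb >= 16) return "llama3.1:8b";
  return "llama3.2:1b"; // safe choice for 8 GB systems
}
```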

Provider Aliases

The Ollama provider is registered with the following aliases in the provider registry:

| Alias | Description |
| --- | --- |
| ollama | Primary provider name |
| local | Convenience alias for local models |

Both aliases resolve to the same OllamaProvider. Use either in the --provider flag or the provider option:

# These are equivalent
pnpm run cli -- generate "Hello" --provider ollama
pnpm run cli -- generate "Hello" --provider local

OpenAI-Compatible Mode

By default, NeuroLink uses Ollama's native API (/api/generate). Setting OLLAMA_OPENAI_COMPATIBLE=true switches all requests to the OpenAI-compatible endpoint (/v1/chat/completions).

When to Use OpenAI-Compatible Mode

  • Your Ollama deployment only exposes the OpenAI-compatible route (e.g., certain hosted or proxied setups)
  • You want consistent message formatting across providers
  • You need chat-format messages instead of raw prompt concatenation

Configuration

# Enable OpenAI-compatible mode
OLLAMA_OPENAI_COMPATIBLE=true

Behavior Differences

| Feature | Native Mode (/api/generate) | OpenAI-Compatible Mode (/v1/chat/completions) |
| --- | --- | --- |
| Message format | Concatenated prompt string | Chat messages array |
| System prompt | Sent as system field | Sent as system message role |
| Streaming format | NDJSON lines with response field | SSE with data: prefix, choices[0].delta |
| Image support | Native images field (base64) | Text-only (images converted to text) |

Tool Calling and API Mode

Tool calling always uses the /v1/chat/completions endpoint regardless of the OLLAMA_OPENAI_COMPATIBLE setting. This is because Ollama's tool/function calling support is only available through the OpenAI-compatible API.
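The two streaming wire formats described above can be illustrated with a small parser sketch. This is a hypothetical helper, not NeuroLink internals; it only handles well-formed chunks:

```typescript
// Extract the text delta from one streamed chunk in either format.
function extractDelta(line: string, openaiCompatible: boolean): string {
  if (openaiCompatible) {
    // SSE frame: "data: {...}" with the delta under choices[0].delta.content
    const payload = line.replace(/^data:\s*/, "");
    if (payload === "[DONE]") return "";
    return JSON.parse(payload).choices?.[0]?.delta?.content ?? "";
  }
  // Native NDJSON line: {"response": "...", "done": false}
  return JSON.parse(line).response ?? "";
}
```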


Tool Use / Function Calling

Ollama supports tool calling through its OpenAI-compatible endpoint. The provider converts tools to the OpenAI function calling format and handles multi-step tool execution in a conversation loop.

Tool Capability Detection

By default, tool calling is assumed to be supported for all models. You can restrict tool calling to specific models by configuring OLLAMA_TOOL_CAPABLE_MODELS or setting providers.ollama.modelBehavior.toolCapableModels in the model configuration.
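The gating logic could look roughly like the following sketch, assuming OLLAMA_TOOL_CAPABLE_MODELS holds comma-separated substrings matched against the model tag (`isToolCapable` is a hypothetical helper; the provider's real matching may differ):

```typescript
// If a pattern list is configured, only matching models get tools;
// with no list, every model is assumed tool-capable (the default).
function isToolCapable(model: string, patternList?: string): boolean {
  if (!patternList || patternList.trim() === "") return true;
  return patternList
    .split(",")
    .map((p) => p.trim())
    .some((p) => model.includes(p));
}
```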

The provider includes static recommendations via OllamaProvider.getToolCallingRecommendations():

| Model | Speed | Quality | Size | Notes |
| --- | --- | --- | --- | --- |
| llama3.1:8b-instruct | Fast | Good | 4.6 GB | Best balance of speed and tool capability |
| mistral:7b-instruct-v0.3 | Fast | Good | 4.1 GB | Lightweight with reliable function calling |
| hermes3:8b-llama3.1 | Fast | Good | 4.6 GB | Specialized for tool execution |
| codellama:34b-instruct | Slow | High | 19 GB | Excellent for code-related tool calling |
| firefunction-v2:70b | Slow | High | 40 GB | Optimized specifically for function calling |

SDK Example

const tools = [
  {
    name: "get_weather",
    description: "Get current weather for a location",
    parameters: {
      type: "object",
      properties: {
        location: { type: "string", description: "City name" },
      },
      required: ["location"],
    },
  },
];

const result = await ai.generate({
  input: { text: "What's the weather in Tokyo?" },
  provider: "ollama",
  model: "llama3.1:8b",
  tools,
});

console.log(result.toolCalls);

Multi-Step Tool Execution

The provider supports multi-step tool execution with a configurable maximum number of iterations (controlled by maxSteps, defaulting to DEFAULT_MAX_STEPS). In each iteration:

  1. The model receives the conversation history and available tools
  2. If the model returns tool calls, NeuroLink executes them automatically
  3. Tool results are appended to the conversation history
  4. The model is called again with the updated context
  5. This repeats until the model returns a final text response or the iteration limit is reached
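The steps above can be sketched as a simplified loop. `callModel` and `runTool` are stand-ins for the model request and the tool executor; this is an illustration of the control flow, not NeuroLink's internals:

```typescript
type ToolCall = { name: string; args: Record<string, unknown> };
type ModelTurn = { text?: string; toolCalls?: ToolCall[] };

// Simplified multi-step tool loop, as described above.
async function toolLoop(
  callModel: (history: string[]) => Promise<ModelTurn>,
  runTool: (call: ToolCall) => Promise<string>,
  maxSteps = 5,
): Promise<string> {
  const history: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const turn = await callModel(history);
    // No tool calls means the model produced its final text response.
    if (!turn.toolCalls?.length) return turn.text ?? "";
    for (const call of turn.toolCalls) {
      // Execute each tool and append its result to the conversation.
      history.push(`tool ${call.name} -> ${await runTool(call)}`);
    }
  }
  return ""; // iteration limit reached without a final answer
}
```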

Streaming Responses

Streaming is supported in both native and OpenAI-compatible modes.

const stream = await ai.stream({
  input: { text: "Write a detailed article about local AI" },
  provider: "ollama",
  model: "llama3.1:8b",
});

pnpm run cli -- stream "Write a story about a robot" \
  --provider ollama

The provider performs a health check (GET /api/version) before each streaming request to give an early, actionable error if Ollama is not running.
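A health check like the one described could be sketched as follows. This is a hypothetical helper against the documented GET /api/version route, not the provider's actual code:

```typescript
// Probe GET /api/version and fail fast with an actionable message
// before attempting a streaming request.
async function checkOllamaHealth(baseUrl = "http://localhost:11434"): Promise<string> {
  let res: Response;
  try {
    res = await fetch(`${baseUrl}/api/version`);
  } catch {
    throw new Error(`Ollama is not reachable at ${baseUrl}. Start it with: ollama serve`);
  }
  if (!res.ok) throw new Error(`Ollama health check failed: HTTP ${res.status}`);
  return (await res.json()).version; // Ollama's version string
}
```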


Multimodal Capabilities

Image Analysis

Vision-capable models (LLaVA, Llama 3.2 vision variants) can analyze images. In native mode, images are sent as base64-encoded data in the Ollama images field. In OpenAI-compatible mode, images are converted to text descriptions.

const result = await ai.generate({
  input: {
    text: "Describe what you see in this image",
    images: ["data:image/jpeg;base64,..."],
  },
  provider: "ollama",
  model: "llava:7b",
});

pnpm run cli -- generate "Describe this image" \
  --provider ollama \
  --model "llava:7b" \
  --image ./photo.jpg
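To produce the base64 data URL shape shown in the SDK example, a local file can be encoded like this (`toDataUrl` is a hypothetical convenience helper, not part of NeuroLink):

```typescript
import { readFileSync } from "node:fs";

// Turn a local image file into a data URL suitable for the images field.
function toDataUrl(path: string, mime = "image/jpeg"): string {
  const base64 = readFileSync(path).toString("base64");
  return `data:${mime};base64,${base64}`;
}
```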

PDF Support

PDF inputs are not supported by the Ollama provider. Use a provider with native PDF support (OpenAI, Anthropic, Google Vertex AI, Google AI Studio) for PDF processing.


Configuration Reference

Environment Variables

| Variable | Description | Default | Required |
| --- | --- | --- | --- |
| OLLAMA_BASE_URL | Base URL for the Ollama API | http://localhost:11434 | No |
| OLLAMA_MODEL | Default model to use | llama3.2:latest | No |
| OLLAMA_TIMEOUT | Request timeout in milliseconds | 240000 (4 minutes) | No |
| OLLAMA_OPENAI_COMPATIBLE | Set to true to use the OpenAI-compatible API endpoint | false | No |
| OLLAMA_TOOL_CAPABLE_MODELS | Comma-separated list of model patterns that support tool calling | (empty, all models assumed) | No |
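Resolving these variables with their documented defaults can be sketched as follows (hypothetical `resolveConfig` helper; the defaults are the ones from the table above):

```typescript
interface OllamaConfig {
  baseUrl: string;
  model: string;
  timeoutMs: number;
  openaiCompatible: boolean;
}

// Read the environment, falling back to the documented defaults.
function resolveConfig(env: Record<string, string | undefined>): OllamaConfig {
  return {
    baseUrl: env.OLLAMA_BASE_URL ?? "http://localhost:11434",
    model: env.OLLAMA_MODEL ?? "llama3.2:latest",
    timeoutMs: Number(env.OLLAMA_TIMEOUT ?? "240000"),
    openaiCompatible: env.OLLAMA_OPENAI_COMPATIBLE === "true",
  };
}
```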

CLI Provider Options

| Flag | Values | Description |
| --- | --- | --- |
| --provider / -p | ollama or local | Use Ollama provider |
| --model / -m | Any Ollama model tag | Specific model to use |
| --image | File path | Image for vision models |

Error Handling

The Ollama provider maps errors to specific error types with actionable guidance:

| Error Type | Condition |
| --- | --- |
| NetworkError | Connection refused (Ollama not running), endpoint not found |
| InvalidModelError | Requested model not pulled locally |
| TimeoutError | Request exceeded the configured timeout |
| ProviderError | Other Ollama-side failures |
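The mapping above can be illustrated with a simple classifier sketch. This is a hypothetical helper keyed on common error-message patterns; the provider's real classification logic is more involved:

```typescript
type OllamaErrorType = "NetworkError" | "InvalidModelError" | "TimeoutError" | "ProviderError";

// Map a raw error message to the error types from the table above.
function classifyError(message: string): OllamaErrorType {
  if (/ECONNREFUSED|fetch failed|404/i.test(message)) return "NetworkError";
  if (/model .* not found|not pulled/i.test(message)) return "InvalidModelError";
  if (/timed? ?out|aborted/i.test(message)) return "TimeoutError";
  return "ProviderError"; // everything else is a generic provider failure
}
```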

Troubleshooting

"Connection refused" / Ollama not running

The most common error. The provider checks OLLAMA_BASE_URL (default http://localhost:11434) and will fail if Ollama is not serving.

# Start Ollama
ollama serve

# Verify it is running
curl http://localhost:11434/api/version

# Check if the port is in use
lsof -i :11434 # macOS/Linux
netstat -an | findstr 11434 # Windows

If Ollama is running on a different host or port:

OLLAMA_BASE_URL=http://your-host:11434

"Model not found"

The model must be pulled before it can be used through the API; it is not downloaded automatically.

# Pull the model you need
ollama pull llama3.2:latest

# List installed models
ollama list

# Try a lightweight model first
ollama pull phi3:mini

Timeout errors with large models

Large models (70B+) can take a long time to load into memory on the first request, and inference is slower. Increase the timeout:

# Increase to 10 minutes for very large models
OLLAMA_TIMEOUT=600000

Slow performance

  • Close other memory-intensive applications
  • Use a smaller model variant (e.g., llama3.2:1b instead of llama3.1:70b)
  • GPU acceleration is automatic on supported hardware:
    • Apple Silicon: Metal acceleration on M1/M2/M3/M4
    • NVIDIA: Automatic if CUDA drivers are installed
    • AMD: ROCm support on Linux

Tool calls not working

  1. Ensure your model supports function calling (see Recommended Models for Tool Calling)
  2. Tool calling always uses the /v1/chat/completions endpoint; verify it is accessible:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "hello"}]}'

404 errors from the API

The Ollama version may be too old or the API endpoint has changed.

# Check version
ollama --version

# Update Ollama
# macOS: brew upgrade ollama
# Linux: curl -fsSL https://ollama.ai/install.sh | sh

Privacy and Security

  • All data stays local: No network calls to external services during inference
  • No telemetry from Ollama: Ollama does not track usage
  • Air-gap capable: After pulling models, works entirely offline
  • No API keys stored: No credentials to manage or rotate


Additional Resources