llama.cpp Provider Guide

Fully offline GGUF inference — connect NeuroLink directly to a llama-server process


Overview

llama.cpp is the canonical open-source C/C++ runtime for running GGUF-quantised models on CPU (and GPU). When started with llama-server, it exposes an OpenAI-compatible HTTP API at http://localhost:8080/v1 by default.

NeuroLink's llamacpp provider connects to this server and automatically discovers the loaded model by querying /v1/models at request time. Unlike LM Studio, llama-server loads exactly one model at startup: the GGUF file you supply via -m.
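
You can query the same endpoint yourself to see what the provider will discover. The id in the response depends on the GGUF file the server loaded, so the output below is illustrative:

# Ask the running server which model it has loaded
curl http://localhost:8080/v1/models

# Illustrative response shape (the "id" depends on your GGUF file):
# {"object":"list","data":[{"id":"Llama-3.2-3B-Instruct-Q4_K_M.gguf","object":"model"}]}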

Key Facts

  • Runs locally: No data leaves your machine
  • No API key needed: llama-server does not authenticate by default (NeuroLink sends a placeholder)
  • Single model per process: llama-server loads one GGUF file at startup
  • Auto-discovery: Omit model: and NeuroLink fetches the model ID from /v1/models
  • Default base URL: http://localhost:8080/v1
  • Vision: Depends on the loaded model (LLaVA-style multimodal models supported by llama-server)
  • Streaming: Supported
  • Tool calling: Depends on the loaded model; start llama-server with --jinja for best tool support

Quick Start

1. Install and Build llama.cpp

# Clone the repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build (CPU-only — works on any machine)
cmake -B build
cmake --build build --config Release -j $(nproc)

# The server binary is now at build/bin/llama-server

For GPU-accelerated builds, see the llama.cpp build docs.
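
As a rough sketch, a CUDA build adds one extra CMake flag (flag names have changed across llama.cpp versions, so defer to the build docs for your checkout):

# CUDA build sketch; flag names vary across llama.cpp versions
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)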

2. Download a GGUF Model

# Example: download Llama 3.2 3B Instruct Q4 from Hugging Face
huggingface-cli download \
  bartowski/Llama-3.2-3B-Instruct-GGUF \
  Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --local-dir ./models

Or download directly from https://huggingface.co/models — search for GGUF variants.

3. Start the Server

# Basic startup (CPU inference)
./build/bin/llama-server \
  -m ./models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --port 8080

# With tool/function calling support (recommended)
./build/bin/llama-server \
  -m ./models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --port 8080 \
  --jinja

# GPU-accelerated (N layers offloaded to GPU)
./build/bin/llama-server \
  -m ./models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --port 8080 \
  -ngl 99

The server prints listening on http://127.0.0.1:8080 when ready.
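
Before pointing NeuroLink at it, you can confirm the server is serving requests from another terminal (the same endpoints appear again under Troubleshooting below):

# Health check; should report an OK status once the model has loaded
curl http://localhost:8080/health

# Confirm the model is visible to clients
curl http://localhost:8080/v1/models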

4. Configure Environment (Optional)

No environment variables are required for a default setup:

# Override the base URL if using a non-default port or remote host
LLAMACPP_BASE_URL=http://localhost:8080/v1

# Pin a specific model name (default: auto-discover from /v1/models)
LLAMACPP_MODEL=

# API key — only needed if llama-server is behind an auth-proxying reverse-proxy
LLAMACPP_API_KEY=

5. Install NeuroLink

npm install @juspay/neurolink
# or
pnpm add @juspay/neurolink

6. Generate Your First Response

import { NeuroLink } from "@juspay/neurolink";

const ai = new NeuroLink();

// Omit `model:` — NeuroLink discovers the loaded model automatically
const result = await ai.generate({
  provider: "llamacpp",
  input: { text: "Explain the difference between a stack and a heap." },
});

console.log(result.content);

Model Auto-Discovery

When no model is specified (and LLAMACPP_MODEL is empty), the provider queries GET /v1/models with a 5-second timeout and uses the first model returned, which is whichever GGUF file the server was started with.

If discovery fails, the provider falls back to "loaded-model" as a placeholder and logs a warning. The next call re-attempts discovery, so you do not need to restart your application after starting llama-server.
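
Conceptually, the discovery step is just an HTTP call with a timeout and a fallback. The sketch below mirrors the behaviour described above; it is illustrative, not NeuroLink's actual internal code:

// Illustrative sketch of the auto-discovery behaviour; not NeuroLink's internals
async function discoverModel(baseURL = "http://localhost:8080/v1"): Promise<string> {
  try {
    // Query /v1/models with the 5-second timeout described above
    const res = await fetch(`${baseURL}/models`, {
      signal: AbortSignal.timeout(5000),
    });
    const body = await res.json();
    const id = body?.data?.[0]?.id; // first model returned wins
    if (id) return id;
  } catch {
    // Server unreachable, timed out, or returned malformed JSON
  }
  console.warn("llama.cpp model discovery failed; falling back to placeholder");
  return "loaded-model"; // re-discovery happens on the next call
}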

// Auto-discover
const result = await ai.generate({
  provider: "llamacpp",
  input: { text: "What is a closure in programming?" },
  // No `model:` field
});

To pin the model explicitly:

const result = await ai.generate({
  provider: "llamacpp",
  model: "Llama-3.2-3B-Instruct-Q4_K_M", // match the ID from /v1/models
  input: { text: "What is a closure?" },
});

SDK Usage

Basic Generation (Auto-Discover)

import { NeuroLink } from "@juspay/neurolink";

const ai = new NeuroLink();

const result = await ai.generate({
  provider: "llamacpp",
  input: {
    text: "Write a Rust function to compute the nth Fibonacci number iteratively.",
  },
});

console.log(result.content);

Streaming

import { NeuroLink } from "@juspay/neurolink";

const ai = new NeuroLink();

const stream = await ai.stream({
  provider: "llamacpp",
  input: { text: "Explain how the Linux kernel schedules processes." },
});

for await (const chunk of stream.stream) {
  process.stdout.write(chunk);
}

Per-Call Base URL Override

Useful when llama-server runs on a different machine on your local network or on a non-default port.

const result = await ai.generate({
  provider: "llamacpp",
  input: { text: "Hello from the network!" },
  credentials: {
    llamacpp: {
      baseURL: "http://192.168.1.42:8080/v1",
    },
  },
});

If your llama-server is behind an auth-proxying reverse-proxy:

const result = await ai.generate({
  provider: "llamacpp",
  input: { text: "Hello" },
  credentials: {
    llamacpp: {
      baseURL: "https://llama.internal.example.com/v1",
      apiKey: "bearer-token-for-proxy",
    },
  },
});

CLI Usage

Basic Commands

# Auto-discover the loaded model
pnpm run cli generate "What is garbage collection?" --provider llamacpp

# Use provider aliases
pnpm run cli generate "Hello" --provider llama.cpp
pnpm run cli generate "Hello" --provider llama-cpp

# Pin a model explicitly
pnpm run cli generate "Describe merge sort" \
--provider llamacpp \
--model Llama-3.2-3B-Instruct-Q4_K_M

# Interactive loop (re-discovers model on each request)
pnpm run cli loop --provider llamacpp

# Connect to a server on a different host
LLAMACPP_BASE_URL=http://192.168.1.42:8080/v1 \
  pnpm run cli generate "Hello from network" --provider llamacpp

Provider Aliases

Alias        Example
llamacpp     --provider llamacpp
llama.cpp    --provider llama.cpp
llama-cpp    --provider llama-cpp

Configuration Reference

Environment Variable   Required   Default                    Description
LLAMACPP_BASE_URL      No         http://localhost:8080/v1   Base URL of the llama-server
LLAMACPP_MODEL         No         "" (auto-discover)         Specific model ID; leave blank to auto-discover via /v1/models
LLAMACPP_API_KEY       No         llamacpp (placeholder)     Auth token; only needed for reverse-proxy setups with auth

Feature Support

Feature            Supported         Notes
Text generation    Yes
Streaming          Yes
Tool calling       Model-dependent   Start llama-server with --jinja for function-call template support
Vision / images    Model-dependent   Load a multimodal GGUF (LLaVA-style)
Embeddings         No                Use OpenAI or another embeddings provider
Auto-discovery     Yes               Queries /v1/models at request time; falls back gracefully

llama-server Tips

Context Window

Set a larger context window at startup with -c:

./build/bin/llama-server -m model.gguf -c 8192 --port 8080

GPU Offloading

Use -ngl N to offload N transformer layers to GPU (requires a CUDA or Metal build):

./build/bin/llama-server -m model.gguf -ngl 99 --port 8080

Multiple CPU Threads

./build/bin/llama-server -m model.gguf -t 8 --port 8080

Tool / Function Calling

Start with --jinja to enable Jinja-based chat template processing, which is required for function calling on most models:

./build/bin/llama-server -m model.gguf --jinja --port 8080
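
To check that the server itself accepts tool definitions before involving NeuroLink, you can send a raw OpenAI-style request to its chat completions endpoint. The get_weather tool below is a hypothetical example used only as a probe:

// Probe llama-server's tool support directly over its OpenAI-compatible API.
// The get_weather tool is hypothetical; any well-formed tool definition works.
const res = await fetch("http://localhost:8080/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    messages: [{ role: "user", content: "What is the weather in Paris?" }],
    tools: [
      {
        type: "function",
        function: {
          name: "get_weather",
          description: "Get the current weather for a city",
          parameters: {
            type: "object",
            properties: { city: { type: "string" } },
            required: ["city"],
          },
        },
      },
    ],
  }),
});

const data = await res.json();
// With --jinja and a tool-capable model, the message should contain tool_calls
console.log(JSON.stringify(data.choices?.[0]?.message, null, 2));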

Troubleshooting

"llama.cpp server not reachable"

llama-server is not running or is on a different address.

# Test reachability
curl http://localhost:8080/v1/models

# Start the server
./build/bin/llama-server -m ./models/your-model.gguf --port 8080

"llama.cpp request timed out"

CPU inference can be slow, especially for large models or long prompts. Use a smaller or more aggressively quantised model (e.g., Q4), increase GPU offloading, or raise the NeuroLink timeout setting.

HTTP 400 — model does not support tools

Tool calling requires the model to understand function-call syntax. Restart llama-server with --jinja and use a model fine-tuned for instruction following (e.g., Llama 3.1/3.2 Instruct).

# With Jinja for tool support
./build/bin/llama-server -m model.gguf --jinja --port 8080

Auto-discovery keeps returning "loaded-model"

llama-server is running but /v1/models returned an empty list, or the server is not reachable. Confirm the server started successfully:

curl http://localhost:8080/health
curl http://localhost:8080/v1/models

Server crashes or runs out of memory

The model is too large for the available RAM. Use a more aggressively quantised variant (Q2 or Q4) or a smaller model. You can also limit the batch size at startup with -b 512, as shown below.
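
For example, a conservative startup line combining a smaller batch size with a modest context window (values are illustrative; tune them to your hardware):

./build/bin/llama-server -m model.gguf -b 512 -c 4096 --port 8080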


Need Help? Join the GitHub Discussions or open an issue.