Can I use this with LangChain?

Yes. LangChain's ChatOpenAI class accepts a custom openai_api_base parameter. Set it to http://localhost:9998/v1 and pass any string as the API key. All LangChain chains, agents, and tools that use OpenAI chat models will work against your local server without modification.

Do I need an API key?

Not by default. The local server does not require authentication for basic use. Pass any non-empty string (like not-needed) as the api_key parameter since most SDKs require the field to be set. Pro users can optionally generate developer tokens (tp_*) for authenticated access.

Which models are available?

ToolPiper ships with a curated catalog of open-source models including Llama 3.2 (1B, 3B), Qwen 3.5 (0.8B, 4B), and embedding models like nomic-embed-text. You can also load any GGUF model file. Use GET /v1/models to list everything available on your instance.

Can I use this for embeddings?

Yes. POST /v1/embeddings generates vector embeddings for RAG pipelines, semantic search, and similarity matching. Load an embedding model and use the standard OpenAI embeddings API format. This means your entire RAG pipeline, from indexing to querying to generation, can run on localhost.

Local OpenAI-Compatible API on Mac: Drop-In Replacement on Localhost

You have built an app that calls the OpenAI API. It works great, but every request costs money, sends your users' data to OpenAI's servers, and stops working when their API has an outage. What if you could point that same code at localhost and have it work identically, with a local model, zero cost per request, and complete privacy?

You can. And you do not have to change a single line of application logic to do it.

Why has the OpenAI API become the standard for LLMs?

OpenAI's /v1/chat/completions endpoint has become the de facto standard for language model APIs. Hundreds of tools, libraries, and applications are built against it. Anthropic, Google, Mistral, and open-source projects all offer OpenAI-compatible endpoints because the ecosystem is too large to ignore.

This convergence created something valuable: a universal interface. If your code speaks the OpenAI protocol, it can talk to almost any LLM provider. The request format (model, messages, temperature, stream) and the response format (choices, message, usage) are the same everywhere. The only thing that changes is the URL you point at.

That portability is the key insight. If you can swap the URL, you can swap where inference happens. And if inference can happen on localhost, you get privacy, zero per-query cost, and full offline operation without touching your application code.

This is not a theoretical benefit. Real production codebases with thousands of lines of OpenAI integration code can switch to a local model by changing a single environment variable. No refactoring, no new dependencies, no new abstraction layers.

What does a local OpenAI-compatible API actually give you?

A local compatible server accepts the same JSON request format and returns the same JSON response format. Your existing code, your existing prompts, your existing error handling, your existing streaming logic all stay the same. The only change is two lines of configuration:

- base_url: https://api.openai.com/v1
+ base_url: http://localhost:9998/v1

- api_key: sk-abc123...
+ api_key: not-needed

That is the entire migration. Everything downstream of those two lines continues to work. Your retry logic. Your token counting. Your response parsing. Your streaming handlers. All unchanged.

This matters because migration cost is what keeps people on cloud APIs even when they would prefer to run locally. When the migration cost is two lines, the decision becomes purely about tradeoffs: model quality, latency, privacy, and cost.

What are the real benefits of running the API locally?

Zero cost per request. Cloud API pricing scales with usage. At $0.01-$0.06 per 1K tokens, costs add up fast if you are building an app that makes hundreds of calls per day. A local model costs nothing per query after the one-time download.

Complete data privacy. Every request to a cloud API transmits your prompt text to a remote server. For apps handling user data, confidential code, medical records, or legal documents, that is a compliance problem. A localhost API never sends data anywhere.

No rate limits or throttling. OpenAI's rate limits are generous for casual use and frustrating for batch processing. A local server processes requests as fast as your hardware allows, with no artificial throttling.

Works offline. Cloud APIs fail during outages, on airplanes, and on networks that block API traffic. A local server works without internet.

Deterministic latency. No variable network round trips. Your first-token latency is determined by your hardware, not by the load on someone else's data center.

Full control over the model. You choose which model runs, which version, which quantization level. Cloud providers update, deprecate, and retire models on their own schedule. A local model stays exactly as you deployed it until you decide to change it.

How does ToolPiper serve the OpenAI-compatible API?

ToolPiper runs an OpenAI-compatible HTTP server on port 9998. Install the app, launch it, and the API is live. No Docker, no compilation, no terminal configuration.

The endpoints:

POST /v1/chat/completions - Chat completions, streaming and non-streaming
POST /v1/embeddings - Text embeddings for RAG and similarity search
GET /v1/models - List available models
POST /models/load - Load a specific model into memory

Behind this single API surface, ToolPiper coordinates nine inference backends: llama.cpp on Metal GPU for language models, Apple Intelligence for on-device foundation models, FluidAudio for speech-to-text and text-to-speech on the Neural Engine, MLX Audio for high-quality voice synthesis, Apple Vision for OCR, and more. The API routes requests to the right backend based on the model you specify.

How do you use it with the OpenAI Python SDK?

If you already use the OpenAI Python SDK, the change is minimal:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:9998/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="llama-3.2-3b",
    messages=[{"role": "user", "content": "Explain the difference between REST and GraphQL"}]
)

print(response.choices[0].message.content)

That is real, working code. The OpenAI class accepts any base_url. Point it at localhost, pass any string as the API key (the local server does not require one by default), and every method on the client works against your local model.

Streaming works the same way:

stream = client.chat.completions.create(
    model="llama-3.2-3b",
    messages=[{"role": "user", "content": "Write a haiku about local inference"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Server-sent events stream back with delta tokens, identical to OpenAI's streaming format. Your existing streaming UI code works without modification.

What about the Node.js SDK and other clients?

The same pattern applies to every OpenAI-compatible client:

// Node.js
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:9998/v1',
  apiKey: 'not-needed'
});

const response = await client.chat.completions.create({
  model: 'llama-3.2-3b',
  messages: [{ role: 'user', content: 'Hello' }]
});

# curl
curl http://localhost:9998/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Any tool that accepts a custom OpenAI base URL works out of the box: LangChain, LlamaIndex, Continue.dev, Open Interpreter, Cursor, Aider, and dozens more. If it has a field for "OpenAI Base URL" or "API endpoint," point it at http://localhost:9998/v1.

For environment-variable-driven setups, many tools respect OPENAI_BASE_URL and OPENAI_API_KEY. Set those two variables in your shell profile and every tool that reads them will use your local server by default:

export OPENAI_BASE_URL=http://localhost:9998/v1
export OPENAI_API_KEY=not-needed

How does model management work?

ToolPiper includes a model browser with curated presets. Download models with one click from the GUI, or use the API to load a specific model into memory:

curl -X POST http://localhost:9998/v1/engine/load \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOOLPIPER_BEARER" \
  -d '{"model": "llama-3.2-3b"}'

The GET /v1/models endpoint lists everything available, matching the OpenAI models listing format. You can also bring your own GGUF files and load them directly. ToolPiper monitors system memory pressure in real time and shows you exactly how much RAM each model consumes, so you always know what your hardware can handle.

Does it support embeddings?

Yes. POST /v1/embeddings works with embedding models for RAG pipelines and similarity search:

response = client.embeddings.create(
    model="nomic-embed-text",
    input="Your text to embed"
)

vector = response.data[0].embedding

This means you can run a complete RAG pipeline, from document embedding to query embedding to LLM generation, entirely on localhost. No data leaves your machine at any stage.

What about cloud APIs? Do you have to choose one or the other?

You do not have to choose. When ToolPiper is connected, it can proxy requests to cloud APIs (OpenAI, Anthropic, Gemini) with API key injection from your Mac's Keychain. One base URL handles both local and cloud models. Your code does not need to know which backend is serving a given request.

This means you can use local models for development and testing (free, fast, private) and switch to cloud models for production tasks that need frontier-level quality. Same code path, same error handling, same response format.

Pro users can generate developer tokens (tp_*) for authentication. These work as the api_key parameter in any OpenAI SDK, so you can share a local ToolPiper instance with your team or use it in CI pipelines.

What are the honest limitations?

Not every OpenAI API parameter is supported. Function calling depends on the model's capabilities. Logprobs are not available for all models. There is no fine-tuning endpoint. If your app relies on these features, test them against your local model before switching.

Local model quality is lower than GPT-4 or Claude for complex reasoning tasks. A 3B or 8B parameter model running on your Mac is genuinely capable for most tasks, but it will not match a 1.8-trillion parameter model on hard multi-step reasoning, nuanced legal analysis, or frontier-level code generation.

The API only runs when ToolPiper is running. If you quit the app, port 9998 goes down. For development workflows this is fine. For always-on production services, you need the app running.

Localhost only by default. ToolPiper binds to 127.0.0.1 for security. Other machines on your network cannot reach it without explicit port forwarding. This is intentional, it prevents accidental exposure of your local inference server to the network, but it is worth knowing if you are building a multi-machine setup.

Performance depends on your hardware. An M1 MacBook Air with 8GB runs smaller models (1B-3B parameters) comfortably at 20+ tokens per second. Larger models (7B-8B) need 16GB or more and benefit from the faster memory bandwidth on M2 Pro, M3 Pro, or M4 chips. Check the model selection guide for hardware-specific recommendations.

Try It

Download ModelPiper. Install ToolPiper. Change your base_url to http://localhost:9998/v1. Your existing code works.

Two lines of config. Zero lines of application logic. Complete privacy.

This is part of a series on local-first AI workflows on macOS. See also: 300+ MCP Tools in One Install for how ToolPiper works as an MCP server for Claude Code, Cursor, and other AI coding assistants.