Running Claude Code with a Local LLM — Step-by-Step Guide
A complete walkthrough for connecting Claude Code to a locally-running LLM using Ollama, enabling offline coding assistance without API costs.
Claude Code is one of the most capable coding agents available — but running it against Anthropic's API has real costs, especially when you're in a heavy development flow. What if you could run a similar agentic coding workflow against a local LLM, with no API costs and full privacy?
This guide walks through exactly that setup using Ollama as the local model server and configuring an OpenAI-compatible proxy to bridge the gap.
This approach trades raw capability for cost and privacy. Local models (even large ones) are generally less capable than frontier models for complex coding tasks. Use this for exploration, learning, or when privacy is critical.
Architecture Overview#
The flow is a simple chain: Claude Code sends requests to a local LiteLLM proxy, which translates them into Ollama's API and forwards them to the model running on your machine. Nothing leaves localhost.
Step 1: Install and Configure Ollama#
Ollama provides a simple way to run open-source LLMs locally.
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download installer from https://ollama.com/download
Pull a coding-optimized model — qwen2.5-coder is a strong default:
# Pull Qwen2.5 Coder (14B — good balance of quality and speed)
ollama pull qwen2.5-coder:14b
# Or DeepSeek Coder for more code-focused tasks
ollama pull deepseek-coder-v2:16b
# Verify it works
ollama run qwen2.5-coder:14b "Write a Python function to reverse a string"
Check available models:
ollama list
# NAME                 ID      SIZE    MODIFIED
# qwen2.5-coder:14b    abc123  9.0 GB  2 minutes ago
Step 2: Set Up LiteLLM as an OpenAI-Compatible Proxy#
Claude Code speaks Anthropic's Messages API, while Ollama exposes its own API. LiteLLM sits in between as a proxy, accepting Claude Code's requests and translating them into calls Ollama understands.
# Install LiteLLM
pip install litellm
# Create proxy config
cat > litellm_config.yaml << 'EOF'
model_list:
  - model_name: claude-local
    litellm_params:
      model: ollama/qwen2.5-coder:14b
      api_base: http://localhost:11434
  - model_name: claude-local-fast
    litellm_params:
      model: ollama/qwen2.5-coder:7b
      api_base: http://localhost:11434

general_settings:
  master_key: "sk-local-dev-key"

litellm_settings:
  drop_params: true  # Drop unsupported params instead of erroring
  request_timeout: 600
EOF
Start the proxy:
litellm --config litellm_config.yaml --port 4000
# You should see:
# INFO: LiteLLM proxy server started on http://0.0.0.0:4000
Test the proxy:
curl http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer sk-local-dev-key" \
-H "Content-Type: application/json" \
-d '{
"model": "claude-local",
"messages": [{"role": "user", "content": "Hello, write a hello world in Python"}]
}'
Step 3: Configure Claude Code#
Set environment variables to redirect Claude Code to your local proxy:
# Add to your ~/.bashrc or ~/.zshrc
export ANTHROPIC_API_KEY="sk-local-dev-key"
export ANTHROPIC_BASE_URL="http://localhost:4000"
# Reload your shell
source ~/.zshrc
Now launch Claude Code:
claude
Claude Code will now route requests to your local Ollama model through LiteLLM.
Step 4: Verify the Setup#
Inside Claude Code, run a quick test:
> What model are you running on?
Local models often misreport their own identity, so treat the answer as a smoke test rather than proof. The reliable check is the LiteLLM logs, which confirm requests are being handled locally:
# Restart the proxy with verbose logging to see each request
litellm --config litellm_config.yaml --port 4000 --debug
Optimizing for Coding Tasks#
System Prompt Tuning#
Local models often need more explicit instruction than frontier models. Create a custom system prompt:
# Claude Code reads ~/.claude/CLAUDE.md as global instructions for every session
cat > ~/.claude/CLAUDE.md << 'EOF'
You are an expert software engineer and coding assistant.
When writing code:
- Always include error handling
- Add type hints (Python) or proper types (TypeScript)
- Write clean, readable code with meaningful variable names
- Add docstrings/JSDoc for functions
- Prefer explicit over implicit
When analyzing code:
- Identify potential bugs and edge cases
- Suggest performance improvements
- Point out security concerns
Always think step-by-step before writing code.
EOF
Model Selection by Task#
# For complex tasks, use the larger model
export ANTHROPIC_MODEL="claude-local"        # qwen2.5-coder:14b
# For fast autocomplete/simple tasks
export ANTHROPIC_MODEL="claude-local-fast"   # qwen2.5-coder:7b
Hardware Requirements#
| Model | VRAM (GPU) | RAM (CPU) | Speed |
|---|---|---|---|
| qwen2.5-coder:7b | 6 GB | 10 GB | Fast |
| qwen2.5-coder:14b | 10 GB | 18 GB | Medium |
| deepseek-coder-v2:16b | 12 GB | 20 GB | Medium |
| qwen2.5-coder:32b | 22 GB | 40 GB | Slow |
For CPU-only setups, expect significantly slower inference. A 14B model on a modern MacBook Pro M3 runs at approximately 15-25 tokens/second — usable but not fast.
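The VRAM figures in the table can be sanity-checked with a back-of-the-envelope formula: a Q4-family quantization stores roughly 4-5 bits per weight, and the runtime adds overhead for the KV cache and activations. The defaults below (5 bits, 15% overhead) are assumptions chosen to approximate the table, not published specs.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float = 5.0,
                     overhead: float = 1.15) -> float:
    """Rough VRAM estimate: quantized weights plus ~15% overhead for the
    KV cache, activations, and runtime buffers."""
    weight_gb = params_billion * bits_per_weight / 8  # GB for weights alone
    return round(weight_gb * overhead, 1)

# Rough estimates for the models in the table
for params in (7, 14, 32):
    print(f"qwen2.5-coder:{params}b -> ~{estimate_vram_gb(params)} GB")
```

The estimates land within a gigabyte or two of the table; real usage also grows with context length, so leave headroom if you raise `num_ctx`.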
Limitations vs. Claude API#
Being honest about the trade-offs:
| Capability | Claude API | Local LLM |
|---|---|---|
| Complex multi-file refactoring | Excellent | Fair |
| Bug detection | Excellent | Good |
| Code explanation | Excellent | Good |
| Architecture design | Excellent | Fair |
| Simple code generation | Excellent | Good |
| Cost | Per-token billing | Free (compute only) |
| Privacy | Sends to Anthropic | 100% local |
| Speed | Fast | Depends on hardware |
Troubleshooting#
"Connection refused" errors:
# Make sure Ollama is running
ollama serve
# Make sure LiteLLM proxy is running
ps aux | grep litellm
Slow responses:
# Check GPU utilization
nvidia-smi # NVIDIA
sudo powermetrics --samplers gpu_power # Apple Silicon
# Switch to smaller model if too slow
ollama pull qwen2.5-coder:7b
Model giving poor code quality:
- Try adding more context in your prompt
- Use a larger model (14b or 32b vs 7b)
- Consider switching to deepseek-coder-v2 for pure coding tasks
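Beyond swapping models, Ollama lets you bake generation settings into a derived model via a Modelfile — lowering temperature and raising the context window often helps code quality as much as prompt tweaks. The parameter values and the `qwen-coder-tuned` tag below are illustrative starting points, not tested optima.

```shell
# Create a Modelfile that derives a tuned variant of the base model
cat > Modelfile.coder << 'EOF'
FROM qwen2.5-coder:14b
PARAMETER temperature 0.2
PARAMETER num_ctx 16384
EOF

# Build the derived model (requires Ollama to be installed and running)
if command -v ollama >/dev/null 2>&1; then
    ollama create qwen-coder-tuned -f Modelfile.coder
    ollama run qwen-coder-tuned "Refactor this: def f(x):return x*2"
else
    echo "ollama not found - install it first"
fi
```

To use the tuned variant from Claude Code, point the model entry in litellm_config.yaml at the new tag (ollama/qwen-coder-tuned) and restart the proxy.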
Conclusion#
Running Claude Code against a local LLM is a viable setup for:
- Privacy-sensitive codebases
- Offline development
- Cost-sensitive workflows
- Experimentation and learning
The setup takes about 15 minutes and the result is a surprisingly capable local coding assistant. For production work, the frontier models are still clearly ahead — but for the right use cases, local is a legitimate option.
The gap between local and frontier models is closing fast; a setup that feels like a compromise today may feel perfectly capable within a year.
Written by
Niteen Badgujar
AI Engineer specializing in Agentic AI, LLMs, and production-grade machine learning systems on Azure. Writing to make complex AI concepts accessible and actionable.