
Running Claude Code with a Local LLM — Step-by-Step Guide

A complete walkthrough for connecting Claude Code to a locally-running LLM using Ollama, enabling offline coding assistance without API costs.

February 8, 2026 · 5 min read

Claude Code is one of the most capable coding agents available — but running it against Anthropic's API has real costs, especially when you're in a heavy development flow. What if you could run a similar agentic coding workflow against a local LLM, with no API costs and full privacy?

This guide walks through exactly that setup using Ollama as the local model server and configuring an OpenAI-compatible proxy to bridge the gap.

This approach trades raw capability for cost and privacy. Local models (even large ones) are generally less capable than frontier models for complex coding tasks. Use this for exploration, learning, or when privacy is critical.

Architecture Overview#

The chain is simple: Claude Code sends Anthropic-format requests to a local LiteLLM proxy (port 4000), which translates them and forwards them to Ollama (port 11434), where the model actually runs.

Claude Code -> LiteLLM proxy (:4000) -> Ollama (:11434) -> qwen2.5-coder

Step 1: Install and Configure Ollama#

Ollama provides a simple way to run open-source LLMs locally.

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
 
# Windows: download installer from https://ollama.com/download

Pull a coding-optimized model; I recommend qwen2.5-coder:

# Pull Qwen2.5 Coder (14B — good balance of quality and speed)
ollama pull qwen2.5-coder:14b
 
# Or DeepSeek Coder for more code-focused tasks
ollama pull deepseek-coder-v2:16b
 
# Verify it works
ollama run qwen2.5-coder:14b "Write a Python function to reverse a string"

Check available models:

ollama list
# NAME                    ID              SIZE    MODIFIED
# qwen2.5-coder:14b       abc123          9.0 GB  2 minutes ago

Step 2: Set Up LiteLLM as an OpenAI-Compatible Proxy#

Claude Code speaks Anthropic's Messages API format. LiteLLM sits in the middle: it accepts those requests and translates them into calls to Ollama's native API, while also exposing standard OpenAI-format endpoints that are convenient for testing.

# Install LiteLLM
pip install litellm
 
# Create proxy config
cat > litellm_config.yaml << 'EOF'
model_list:
  - model_name: claude-local
    litellm_params:
      model: ollama/qwen2.5-coder:14b
      api_base: http://localhost:11434
 
  - model_name: claude-local-fast  
    litellm_params:
      model: ollama/qwen2.5-coder:7b
      api_base: http://localhost:11434
 
general_settings:
  master_key: "sk-local-dev-key"
  
litellm_settings:
  drop_params: true  # Drop unsupported params instead of erroring
  request_timeout: 600
EOF

Start the proxy:

litellm --config litellm_config.yaml --port 4000
 
# You should see:
# INFO: LiteLLM proxy server started on http://0.0.0.0:4000

Test the proxy:

curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-local-dev-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-local",
    "messages": [{"role": "user", "content": "Hello, write a hello world in Python"}]
  }'

Step 3: Configure Claude Code#

Set environment variables to redirect Claude Code to your local proxy:

# Add to your ~/.bashrc or ~/.zshrc
export ANTHROPIC_API_KEY="sk-local-dev-key"
export ANTHROPIC_BASE_URL="http://localhost:4000"
 
# Reload your shell
source ~/.zshrc

Now launch Claude Code:

claude

Claude Code will now route requests to your local Ollama model through LiteLLM.
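If you'd rather not export these variables globally, recent versions of Claude Code also read settings from `~/.claude/settings.json`, where an `env` block keeps the redirect scoped to Claude Code alone (treat the exact key names as an assumption to check against your installed version):

```json
{
  "env": {
    "ANTHROPIC_API_KEY": "sk-local-dev-key",
    "ANTHROPIC_BASE_URL": "http://localhost:4000"
  }
}
```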

Step 4: Verify the Setup#

Inside Claude Code, run a quick test:

> What model are you running on?

You can also check the LiteLLM logs to confirm requests are being handled locally:

# Restart the proxy with debug logging to watch each request as it arrives
litellm --config litellm_config.yaml --port 4000 --debug

Optimizing for Coding Tasks#

System Prompt Tuning#

Local models often need more explicit instruction than frontier models. Create a custom system prompt:

# Create ~/.claude/system_prompt.md
cat > ~/.claude/system_prompt.md << 'EOF'
You are an expert software engineer and coding assistant.
 
When writing code:
- Always include error handling
- Add type hints (Python) or proper types (TypeScript)
- Write clean, readable code with meaningful variable names
- Add docstrings/JSDoc for functions
- Prefer explicit over implicit
 
When analyzing code:
- Identify potential bugs and edge cases
- Suggest performance improvements
- Point out security concerns
 
Always think step-by-step before writing code.
EOF
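Claude Code doesn't pick this file up automatically. One way to feed it in, assuming a recent CLI that supports the `--append-system-prompt` flag, is to inline it at launch; the `command -v` guard makes the line a no-op on machines without the CLI:

```shell
# Launch Claude Code with the custom prompt appended to its system prompt
command -v claude >/dev/null \
  && claude --append-system-prompt "$(cat ~/.claude/system_prompt.md)" \
  || echo "claude CLI not found"
```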

Model Selection by Task#

# For complex tasks, use the larger model
export ANTHROPIC_MODEL="claude-local"  # qwen2.5-coder:14b
 
# For fast autocomplete/simple tasks
export ANTHROPIC_MODEL="claude-local-fast"  # qwen2.5-coder:7b

Hardware Requirements#

| Model | VRAM (GPU) | RAM (CPU) | Speed |
|---|---|---|---|
| qwen2.5-coder:7b | 6 GB | 10 GB | Fast |
| qwen2.5-coder:14b | 10 GB | 18 GB | Medium |
| deepseek-coder-v2:16b | 12 GB | 20 GB | Medium |
| qwen2.5-coder:32b | 22 GB | 40 GB | Slow |

For CPU-only setups, expect significantly slower inference. A 14B model on a modern MacBook Pro M3 runs at approximately 15-25 tokens/second — usable but not fast.
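The VRAM column roughly tracks a simple rule of thumb: a 4-bit quantized model needs on the order of 0.7 bytes per parameter once runtime overhead is included. That factor is a heuristic of mine, not a guarantee, but it lines up with the table:

```shell
# Back-of-envelope VRAM estimate for a Q4-quantized model (~0.7 bytes/param incl. overhead)
estimate_vram_gb() {
  local params_billion=$1
  echo $(( params_billion * 7 / 10 ))
}
estimate_vram_gb 14   # prints 9, in line with the ~10 GB in the table
estimate_vram_gb 32   # prints 22, matching the 32B row
```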

Limitations vs. Claude API#

Being honest about the trade-offs:

| Capability | Claude API | Local LLM |
|---|---|---|
| Complex multi-file refactoring | Excellent | Fair |
| Bug detection | Excellent | Good |
| Code explanation | Excellent | Good |
| Architecture design | Excellent | Fair |
| Simple code generation | Excellent | Good |
| Cost | Per-token billing | Free (compute only) |
| Privacy | Sends to Anthropic | 100% local |
| Speed | Fast | Depends on hardware |

Troubleshooting#

"Connection refused" errors:

# Make sure Ollama is running
ollama serve
 
# Make sure LiteLLM proxy is running
ps aux | grep litellm

Slow responses:

# Check GPU utilization
nvidia-smi  # NVIDIA
sudo powermetrics --samplers gpu_power  # Apple Silicon
 
# Switch to smaller model if too slow
ollama pull qwen2.5-coder:7b

Model giving poor code quality:

  • Try adding more context in your prompt
  • Use a larger model (14b or 32b vs 7b)
  • Consider switching to deepseek-coder-v2 for pure coding tasks

Conclusion#

Running Claude Code against a local LLM is a viable setup for:

  • Privacy-sensitive codebases
  • Offline development
  • Cost-sensitive workflows
  • Experimentation and learning

The setup takes about 15 minutes and the result is a surprisingly capable local coding assistant. For production work, the frontier models are still clearly ahead — but for the right use cases, local is a legitimate option.

The gap between local and frontier models is closing fast; a setup that feels like a compromise today may well feel perfectly capable a year from now.


Written by

Niteen Badgujar

AI Engineer specializing in Agentic AI, LLMs, and production-grade machine learning systems on Azure. Writing to make complex AI concepts accessible and actionable.