
Running Claude Code with a Local LLM — Step-by-Step Guide

A complete walkthrough for connecting Claude Code to a locally-running LLM using Ollama, enabling offline coding assistance without API costs.

February 8, 2026 · 5 min read

Claude Code is one of the most capable coding agents available — but running it against Anthropic's API has real costs, especially when you're in a heavy development flow. What if you could run a similar agentic coding workflow against a local LLM, with no API costs and full privacy?

This guide walks through exactly that setup using Ollama as the local model server and configuring an OpenAI-compatible proxy to bridge the gap.

This approach trades raw capability for cost and privacy. Local models (even large ones) are generally less capable than frontier models for complex coding tasks. Use this for exploration, learning, or when privacy is critical.

Architecture Overview#

The chain is simple: Claude Code sends Anthropic-format requests to a local LiteLLM proxy (port 4000), which translates them and forwards them to Ollama (port 11434), where the model actually runs.

Claude Code -> LiteLLM proxy (:4000) -> Ollama (:11434) -> qwen2.5-coder

Step 1: Install and Configure Ollama#

Ollama provides a simple way to run open-source LLMs locally.

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
 
# Windows: download installer from https://ollama.com/download

Pull a coding-optimized model; I recommend qwen2.5-coder:

# Pull Qwen2.5 Coder (14B — good balance of quality and speed)
ollama pull qwen2.5-coder:14b
 
# Or DeepSeek Coder for more code-focused tasks
ollama pull deepseek-coder-v2:16b
 
# Verify it works
ollama run qwen2.5-coder:14b "Write a Python function to reverse a string"

Check available models:

ollama list
# NAME                    ID              SIZE    MODIFIED
# qwen2.5-coder:14b       abc123          9.0 GB  2 minutes ago

Step 2: Set Up LiteLLM as an OpenAI-Compatible Proxy#

Claude Code speaks Anthropic's Messages API format. LiteLLM sits in the middle: it accepts those requests and translates them into calls to Ollama's native API, while also exposing standard OpenAI-format endpoints that are convenient for testing.

# Install LiteLLM
pip install litellm
 
# Create proxy config
cat > litellm_config.yaml << 'EOF'
model_list:
  - model_name: claude-local
    litellm_params:
      model: ollama/qwen2.5-coder:14b
      api_base: http://localhost:11434
 
  - model_name: claude-local-fast  
    litellm_params:
      model: ollama/qwen2.5-coder:7b
      api_base: http://localhost:11434
 
general_settings:
  master_key: "sk-local-dev-key"
  
litellm_settings:
  drop_params: true  # Drop unsupported params instead of erroring
  request_timeout: 600
EOF

Start the proxy:

litellm --config litellm_config.yaml --port 4000
 
# You should see:
# INFO: LiteLLM proxy server started on http://0.0.0.0:4000

Test the proxy:

curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-local-dev-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-local",
    "messages": [{"role": "user", "content": "Hello, write a hello world in Python"}]
  }'

Step 3: Configure Claude Code#

Set environment variables to redirect Claude Code to your local proxy:

# Add to your ~/.bashrc or ~/.zshrc
export ANTHROPIC_API_KEY="sk-local-dev-key"
export ANTHROPIC_BASE_URL="http://localhost:4000"
 
# Reload your shell
source ~/.zshrc

Now launch Claude Code:

claude

Claude Code will now route requests to your local Ollama model through LiteLLM.
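If you'd rather not export these variables globally, recent versions of Claude Code also read settings from `~/.claude/settings.json`, where an `env` block keeps the redirect scoped to Claude Code alone (treat the exact key names as an assumption to check against your installed version):

```json
{
  "env": {
    "ANTHROPIC_API_KEY": "sk-local-dev-key",
    "ANTHROPIC_BASE_URL": "http://localhost:4000"
  }
}
```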

Step 4: Verify the Setup#

Inside Claude Code, run a quick test:

> What model are you running on?

You can also check the LiteLLM logs to confirm requests are being handled locally:

# Restart the proxy with debug logging to watch each request as it arrives
litellm --config litellm_config.yaml --port 4000 --debug

Optimizing for Coding Tasks#

System Prompt Tuning#

Local models often need more explicit instruction than frontier models. Create a custom system prompt:

# Create ~/.claude/system_prompt.md
cat > ~/.claude/system_prompt.md << 'EOF'
You are an expert software engineer and coding assistant.
 
When writing code:
- Always include error handling
- Add type hints (Python) or proper types (TypeScript)
- Write clean, readable code with meaningful variable names
- Add docstrings/JSDoc for functions
- Prefer explicit over implicit
 
When analyzing code:
- Identify potential bugs and edge cases
- Suggest performance improvements
- Point out security concerns
 
Always think step-by-step before writing code.
EOF
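Claude Code doesn't pick this file up automatically. One way to feed it in, assuming a recent CLI that supports the `--append-system-prompt` flag, is to inline it at launch; the `command -v` guard makes the line a no-op on machines without the CLI:

```shell
# Launch Claude Code with the custom prompt appended to its system prompt
command -v claude >/dev/null \
  && claude --append-system-prompt "$(cat ~/.claude/system_prompt.md)" \
  || echo "claude CLI not found"
```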

Model Selection by Task#

# For complex tasks, use the larger model
export ANTHROPIC_MODEL="claude-local"  # qwen2.5-coder:14b
 
# For fast autocomplete/simple tasks
export ANTHROPIC_MODEL="claude-local-fast"  # qwen2.5-coder:7b

Hardware Requirements#

| Model | VRAM (GPU) | RAM (CPU) | Speed |
|---|---|---|---|
| qwen2.5-coder:7b | 6 GB | 10 GB | Fast |
| qwen2.5-coder:14b | 10 GB | 18 GB | Medium |
| deepseek-coder-v2:16b | 12 GB | 20 GB | Medium |
| qwen2.5-coder:32b | 22 GB | 40 GB | Slow |

For CPU-only setups, expect significantly slower inference. A 14B model on a modern MacBook Pro M3 runs at approximately 15-25 tokens/second — usable but not fast.
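The VRAM column roughly tracks a simple rule of thumb: a 4-bit quantized model needs on the order of 0.7 bytes per parameter once runtime overhead is included. That factor is a heuristic of mine, not a guarantee, but it lines up with the table:

```shell
# Back-of-envelope VRAM estimate for a Q4-quantized model (~0.7 bytes/param incl. overhead)
estimate_vram_gb() {
  local params_billion=$1
  echo $(( params_billion * 7 / 10 ))
}
estimate_vram_gb 14   # prints 9, in line with the ~10 GB in the table
estimate_vram_gb 32   # prints 22, matching the 32B row
```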

Limitations vs. Claude API#

Being honest about the trade-offs:

| Capability | Claude API | Local LLM |
|---|---|---|
| Complex multi-file refactoring | Excellent | Fair |
| Bug detection | Excellent | Good |
| Code explanation | Excellent | Good |
| Architecture design | Excellent | Fair |
| Simple code generation | Excellent | Good |
| Cost | Per-token billing | Free (compute only) |
| Privacy | Sends to Anthropic | 100% local |
| Speed | Fast | Depends on hardware |

Troubleshooting#

"Connection refused" errors:

# Make sure Ollama is running
ollama serve
 
# Make sure LiteLLM proxy is running
ps aux | grep litellm

Slow responses:

# Check GPU utilization
nvidia-smi  # NVIDIA
sudo powermetrics --samplers gpu_power  # Apple Silicon
 
# Switch to smaller model if too slow
ollama pull qwen2.5-coder:7b

Model giving poor code quality:

  • Try adding more context in your prompt
  • Use a larger model (14b or 32b vs 7b)
  • Consider switching to deepseek-coder-v2 for pure coding tasks

Conclusion#

Running Claude Code against a local LLM is a viable setup for:

  • Privacy-sensitive codebases
  • Offline development
  • Cost-sensitive workflows
  • Experimentation and learning

The setup takes about 15 minutes and the result is a surprisingly capable local coding assistant. For production work, the frontier models are still clearly ahead — but for the right use cases, local is a legitimate option.

The gap between local and frontier models is closing fast; a setup that feels like a compromise today may well feel perfectly capable a year from now.


Written by

Niteen Badgujar

AI Engineer specializing in Agentic AI, LLMs, and production-grade machine learning systems on Azure. Writing to make complex AI concepts accessible and actionable.