Local LLM Setup

Configure and use local language models with Ollama, vLLM, or LM Studio.

What You'll Learn

  • Install and configure Ollama
  • Set up vLLM for production use
  • Configure LM Studio
  • Troubleshoot common issues

Prerequisites

  • Python 3.12+
  • Mamba Agents installed
  • Sufficient RAM (8GB+ recommended)
  • GPU recommended for faster inference

Option 1: Ollama

Step 1: Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

On macOS or Windows, download the installer from ollama.com/download instead.

Step 2: Start the Server

ollama serve

The server runs on http://localhost:11434 by default.
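
To confirm the server is reachable, a quick check from the shell:

# Should print "Ollama is running"
curl http://localhost:11434

# List installed models via the native API
curl http://localhost:11434/api/tags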

Step 3: Pull a Model

# Small, fast model (3B parameters)
ollama pull llama3.2

# Larger, more capable (8B parameters)
ollama pull llama3.1

# Code-focused model
ollama pull codellama

# List available models
ollama list
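
Before wiring up an agent, a quick smoke test from the CLI:

ollama run llama3.2 "Say hello in one short sentence"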

Step 4: Configure Mamba Agents

from mamba_agents import Agent, AgentSettings

settings = AgentSettings(
    model_backend={
        "base_url": "http://localhost:11434/v1",
        "model": "llama3.2",
        "temperature": 0.7,
    }
)

agent = Agent(settings=settings)
result = agent.run_sync("Hello!")
print(result.output)

Or via environment variables:

export MAMBA_MODEL_BACKEND__BASE_URL=http://localhost:11434/v1
export MAMBA_MODEL_BACKEND__MODEL=llama3.2
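
With those variables exported, constructing AgentSettings with no arguments should pick them up. A minimal sketch, assuming the MAMBA_ prefix and "__" nesting delimiter follow the usual pydantic-settings convention:

from mamba_agents import Agent, AgentSettings

# Assumes AgentSettings reads MAMBA_-prefixed environment variables
# (pydantic-settings style, "__" as the nesting delimiter)
settings = AgentSettings()
agent = Agent(settings=settings)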

Option 2: vLLM (Production)

Step 1: Install vLLM

pip install vllm

Step 2: Start the Server

vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --port 8000 \
    --max-model-len 4096
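
vLLM exposes an OpenAI-compatible API, so you can verify the server is up with a models request:

curl http://localhost:8000/v1/models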

Step 3: Configure Mamba Agents

from mamba_agents import Agent, AgentSettings

settings = AgentSettings(
    model_backend={
        "base_url": "http://localhost:8000/v1",
        "model": "meta-llama/Llama-3.2-3B-Instruct",
    }
)

agent = Agent(settings=settings)

vLLM Production Options

# Multi-GPU setup
vllm serve meta-llama/Llama-3.2-70B-Instruct \
    --tensor-parallel-size 4 \
    --port 8000

# With authentication
vllm serve model-name \
    --api-key your-secret-key

# Memory optimization
vllm serve model-name \
    --gpu-memory-utilization 0.9 \
    --max-num-batched-tokens 4096
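
If you start the server with --api-key, the client has to send the same key. A sketch, assuming model_backend accepts an api_key field (check your version's settings reference):

from mamba_agents import Agent, AgentSettings

settings = AgentSettings(
    model_backend={
        "base_url": "http://localhost:8000/v1",
        "model": "model-name",
        "api_key": "your-secret-key",  # assumed field name; must match --api-key
    }
)
agent = Agent(settings=settings)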

Option 3: LM Studio

Step 1: Install LM Studio

Download from lmstudio.ai

Step 2: Download a Model

  1. Open LM Studio
  2. Go to the "Discover" tab
  3. Search for a model (e.g., "Llama 3.2")
  4. Click "Download"

Step 3: Start the Server

  1. Go to "Local Server" tab
  2. Select your downloaded model
  3. Click "Start Server"
  4. Note the port (default: 1234)

Step 4: Configure Mamba Agents

from mamba_agents import Agent, AgentSettings

settings = AgentSettings(
    model_backend={
        "base_url": "http://localhost:1234/v1",
        "model": "local-model",  # LM Studio uses loaded model
    }
)

agent = Agent(settings=settings)
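
Usage is then identical to the other backends:

result = agent.run_sync("Hello!")
print(result.output)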

Configuration File

Create a config.toml for local development:

[model_backend]
base_url = "http://localhost:11434/v1"
model = "llama3.2"
temperature = 0.7
timeout = 60.0  # Local models can be slower

[logging]
level = "DEBUG"  # Helpful for debugging

[retry]
max_retries = 2
retry_level = 1  # Conservative for local
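
If your setup doesn't pick up config.toml automatically, you can load it with the standard library's tomllib (Python 3.11+). A sketch assuming AgentSettings accepts the TOML sections as keyword arguments:

import tomllib

from mamba_agents import AgentSettings

with open("config.toml", "rb") as f:
    config = tomllib.load(f)

# Assumes the top-level TOML tables map onto AgentSettings fields
settings = AgentSettings(**config)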

Complete Example

import asyncio
from mamba_agents import Agent, AgentSettings, AgentConfig
from mamba_agents.tools import read_file, run_bash


async def main():
    # Configure for local model
    settings = AgentSettings(
        model_backend={
            "base_url": "http://localhost:11434/v1",
            "model": "llama3.2",
            "temperature": 0.7,
            "timeout": 60.0,
        }
    )

    # Create agent
    agent = Agent(
        settings=settings,
        tools=[read_file, run_bash],
        config=AgentConfig(
            system_prompt="You are a helpful coding assistant.",
        ),
    )

    # Run query
    result = await agent.run("List files in the current directory")
    print(result.output)

    # Check usage (local models are free!)
    print(f"Tokens used: {agent.get_usage().total_tokens}")
    print(f"Cost: ${agent.get_cost():.4f}")  # Should be $0.00


if __name__ == "__main__":
    asyncio.run(main())

Troubleshooting

Connection Refused

# Check if server is running
import httpx

try:
    response = httpx.get("http://localhost:11434/v1/models")
    print("Server is running")
    print(f"Models: {response.json()}")
except httpx.ConnectError:
    print("Server not running!")
    print("Start with: ollama serve")

Model Not Found

# List available models
ollama list

# Pull missing model
ollama pull llama3.2

Out of Memory

# Use a smaller model
ollama pull llama3.2:1b  # 1B parameters

# Or for vLLM, reduce memory usage
vllm serve model --gpu-memory-utilization 0.8

Slow Responses

# Increase timeout
settings = AgentSettings(
    model_backend={
        "timeout": 120.0,  # 2 minutes
    }
)

# Or use a smaller model
settings = AgentSettings(
    model_backend={
        "model": "llama3.2:1b",  # Faster, less capable
    }
)

Tool Calling Not Working

Not all local models support tool calling. Check model capabilities:

from mamba_agents.backends import get_profile

profile = get_profile("llama3.2")
if not profile.supports_tools:
    print("This model doesn't support tool calling")
    print("Try: llama3.2:70b or use without tools")

Performance Tips

  1. Use a GPU - Significantly faster than CPU inference
  2. Prefer smaller models - Faster inference, less RAM
  3. Increase the timeout - Local models respond more slowly than hosted APIs
  4. Batch requests - vLLM handles concurrent batching well
  5. Use quantized models - Less RAM, similar quality (see the example below)
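
Ollama publishes quantized variants as model tags. Exact tag names vary by model, so verify them on the model's page at ollama.com:

# 4-bit quantized variant (example tag; verify on ollama.com)
ollama pull llama3.1:8b-instruct-q4_K_M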

Model Recommendations

Use Case            Recommended Model    Size
Quick testing       llama3.2:1b          1B
Development         llama3.2             3B
Production          llama3.1:8b          8B
Code tasks          codellama            7B+
Complex reasoning   llama3.1:70b         70B

Next Steps