How to Run Local LLMs with Ollama: Complete Beginner's Guide (2026)

Want to run AI models on your own computer without paying for API calls, without sending your data to the cloud, and (once a model is downloaded) without an internet connection? Ollama makes it straightforward.

In 2026, Ollama is one of the most popular tools for running large language models (LLMs) locally. This guide walks you through installation, pulling models, using the API, and integrating with Python.

1. What Is Ollama?

Ollama is an open-source tool that lets you download and run LLMs on your local machine with a single command. Think of it as "Docker for AI models" — it handles model downloads, GPU acceleration, and an OpenAI-compatible API, all behind a simple CLI.

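Key benefit: Run models from the Llama, Mistral, Qwen, and DeepSeek families, plus hundreds of other open models, on your own machine, free and private. Which models fit depends on your hardware (see Hardware Requirements).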

2. Installation

Before you install: Ollama downloads model files (roughly 2-50 GB each) to ~/.ollama/ by default. Make sure you have enough disk space. You can change the storage location in the Ollama settings on macOS/Windows, or by setting the OLLAMA_MODELS environment variable.
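
A quick way to check free space before pulling a model is sketched below. It uses only Python's standard library; the helper name has_room_for_model and the 10% safety margin are illustrative assumptions, not part of Ollama.

```python
import shutil
from pathlib import Path


def has_room_for_model(model_size_gb: float, path: str = "~", margin: float = 1.1) -> bool:
    """Return True if the disk that holds `path` has room for a model.

    `margin` is a hypothetical safety factor (10% headroom by default).
    """
    target = Path(path).expanduser()
    free_gb = shutil.disk_usage(target).free / 1e9  # bytes -> decimal gigabytes
    return free_gb >= model_size_gb * margin
```

For example, has_room_for_model(2.0) checks whether the home directory's disk can hold a 2 GB model such as llama3.2:3b. If you moved the model store elsewhere, pass that directory as path.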

macOS

brew install ollama

Or download the .dmg from ollama.com.

Linux

curl -fsSL https://ollama.com/install.sh | sh

Windows

Download the installer from ollama.com/download/windows. After installation, Ollama runs as a background service.

Verify Installation

ollama --version
# Example output (your version number will differ):
# ollama version is 0.9.0
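
If you want to run the same check from a script, for example before an automated setup step, a minimal sketch using only the standard library could look like this. The function name cli_version is an assumption made for illustration.

```python
import shutil
import subprocess
from typing import Optional


def cli_version(command: str = "ollama") -> Optional[str]:
    """Return the output of `<command> --version`, or None if it is not on PATH."""
    exe = shutil.which(command)
    if exe is None:
        return None  # CLI not installed, or not on PATH
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, timeout=10
    )
    # Some tools print version info to stderr, so fall back to it.
    return result.stdout.strip() or result.stderr.strip()
```

A return value of None usually means the installer did not add ollama to your PATH, or the terminal needs to be restarted.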

3. Pull and Run Your First Model

Pull a Model

ollama pull llama3.2:3b

The :3b tag selects the 3-billion-parameter version, which is lightweight, fast, and runs on most computers (it is also what plain llama3.2 resolves to). For more capability, try a 7-8B model such as llama3.1:8b or mistral:7b.

Popular Models to Start With

Model             Size      Best For
llama3.2:3b       2 GB      Quick tests, low-end hardware
mistral:7b        4.1 GB    General-purpose chat, coding
qwen2.5:7b        4.4 GB    Strong multilingual support
deepseek-r1:8b    4.9 GB    Reasoning-heavy tasks
codellama:7b      3.8 GB    Code generation & completion

Chat with the Model

ollama run llama3.2:3b

>>> Explain what a REST API is in simple terms.
A REST API is like a waiter at a restaurant. You (the client)
tell the waiter what you want (the request), the waiter goes
to the kitchen (the server), and brings back your food (the
response). REST is just a set of rules for how that
communication should work.

>>> /bye

4. Use the OpenAI-Compatible API

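Ollama exposes an OpenAI-compatible API under the /v1 path, alongside its native API under /api. Most code written for OpenAI's chat completions API works with Ollama after you change the base_url and the model name, although not every OpenAI feature is supported.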

Start the Server

ollama serve

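This starts the API server on http://localhost:11434. If Ollama is already running as a background service, which is typical after a standard install, the server is already up and this command will report that the address is in use.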

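Test with curl

This request uses Ollama's native /api/generate endpoint. The OpenAI-compatible routes live under /v1.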

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "What is Python?",
  "stream": false
}'
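
The same request can be sent from Python without installing anything. The sketch below mirrors the curl call using the standard library; the function names are illustrative, and generate assumes a server is running on the default port with the model already pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used in this guide


def build_generate_request(model: str, prompt: str,
                           base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=120) as resp:
        # With "stream": false the reply is a single JSON object;
        # the generated text is in its "response" field.
        return json.loads(resp.read())["response"]
```

Calling generate("llama3.2:3b", "What is Python?") returns the model's answer as a string.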

Python Integration (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Ollama doesn't require a real key
)

response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[
        {"role": "user", "content": "Write a hello world in Python"}
    ]
)

print(response.choices[0].message.content)
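
Under the hood, the SDK call above is an HTTP POST to /v1/chat/completions. The sketch below shows the shape of that request and where the reply sits in the JSON response, using only the standard library. The helper names are assumptions for illustration; the Authorization header mirrors the placeholder api_key.

```python
import json
import urllib.request


def chat_request(model: str, messages: list,
                 base_url: str = "http://localhost:11434/v1") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request aimed at Ollama."""
    body = {"model": model, "messages": messages}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer ollama",  # placeholder key, as in the SDK example
        },
        method="POST",
    )


def extract_reply(response_json: dict) -> str:
    """Pull the assistant's text out of a chat completion response."""
    return response_json["choices"][0]["message"]["content"]
```

This is the same path that response.choices[0].message.content follows in the SDK example, which is why switching between OpenAI and Ollama mostly comes down to the base URL and the model name.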

LangChain with Ollama

from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.2:3b", temperature=0.7)
response = llm.invoke("Explain quantum computing simply")
print(response.content)

5. Run Multiple Models at Once

Ollama can keep several models loaded at the same time, as long as they fit in memory. Just pull and run:

ollama pull mistral:7b
ollama pull codellama:7b

# Both sessions talk to the same Ollama server on port 11434; no extra ports are needed
ollama run mistral:7b    # Terminal 1
ollama run codellama:7b  # Terminal 2

The API server handles all loaded models — specify the model name in your API call.
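
To see which models the server can route to, you can query the native /api/tags endpoint, which lists locally available models. The following is a minimal sketch with illustrative helper names; list_local_models assumes the server is running on the default port.

```python
import json
import urllib.request


def model_names(tags_response: dict) -> list[str]:
    """Extract model names from the JSON returned by GET /api/tags."""
    return [m["name"] for m in tags_response.get("models", [])]


def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask the running server which models have been pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return model_names(json.loads(resp.read()))
```

Any name returned here can be used as the model value in an API call. The server loads the model on first use if it is not already in memory.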

6. Create Custom Models with Modelfiles

Need to customize a model? Create a Modelfile:

FROM llama3.2:3b

# Set system prompt
SYSTEM "You are a senior Python developer. Answer with code examples."

# Set parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9

Then build and run the custom model:

ollama create my-python-expert -f Modelfile
ollama run my-python-expert
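
If you maintain several custom models, generating the Modelfile from code keeps them consistent. The sketch below covers only the three instructions used above (FROM, SYSTEM, PARAMETER); render_modelfile is a hypothetical helper, and it assumes the system prompt contains no double quotes.

```python
def render_modelfile(base_model: str, system_prompt: str, parameters: dict) -> str:
    """Render a minimal Modelfile with FROM, SYSTEM, and PARAMETER lines."""
    if '"' in system_prompt:
        # Keep the sketch simple: reject prompts that would break the quoting.
        raise ValueError("system_prompt must not contain double quotes")
    lines = [f"FROM {base_model}", "", f'SYSTEM "{system_prompt}"', ""]
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        "llama3.2:3b",
        "You are a senior Python developer. Answer with code examples.",
        {"temperature": 0.7, "top_p": 0.9},
    )
    print(text)  # write this to a file named Modelfile, then run ollama create
```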

7. Hardware Requirements

Model Size      RAM Needed    GPU (VRAM)    Runs On
3B (light)      8 GB          4 GB          Most laptops (2018+)
7B (medium)     16 GB         8 GB          Modern laptops / Gaming PC
13B+ (heavy)    32 GB         16 GB+        Workstation / Server

No GPU? Ollama runs on CPU too, just slower. For casual use on a 3B model, CPU is perfectly fine.
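
As a rough rule of thumb, the memory taken by the weights is the parameter count times the bytes per weight. The sketch below assumes about 4 bits per weight, which is an assumption about typical quantized downloads rather than a documented Ollama figure. Real usage is higher because of the context cache and runtime overhead, which is why the table's numbers leave headroom.

```python
def estimate_weights_gb(params_billions: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the model weights in decimal gigabytes.

    Covers the weights only; context cache and runtime overhead are extra.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for size in (3, 7, 13):
        print(f"{size}B at 4-bit: ~{estimate_weights_gb(size):.1f} GB of weights")
    print(f"7B at 16-bit: ~{estimate_weights_gb(7, 16):.1f} GB of weights")
```

By this estimate a 7B model needs about 3.5 GB for weights at 4 bits but about 14 GB at 16 bits, which shows how much quantization matters on consumer hardware.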

8. Real-World Use Cases

  • Privacy-first chatbots: No data leaves your machine
  • Offline coding assistant: Code help without internet
  • Document Q&A: Load local PDFs and ask questions (pair with LangChain)
  • Prototyping: Test prompts locally before scaling to cloud APIs
  • Cost savings: Zero API bills — run as much as you want
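
For the prototyping use case, one common pattern is to read the endpoint from environment variables so the same code can target Ollama locally and a hosted API later. The sketch below is one way to do that; the variable names LLM_BASE_URL, LLM_API_KEY, and LLM_MODEL are hypothetical, not a convention defined by Ollama or OpenAI.

```python
import os
from typing import Mapping, Optional

LOCAL_BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint


def client_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return connection settings, defaulting to the local Ollama server."""
    env = os.environ if env is None else env
    return {
        "base_url": env.get("LLM_BASE_URL", LOCAL_BASE_URL),  # hypothetical variable
        "api_key": env.get("LLM_API_KEY", "ollama"),          # hypothetical variable
        "model": env.get("LLM_MODEL", "llama3.2:3b"),         # hypothetical variable
    }
```

The base_url and api_key values can be passed to the OpenAI client constructor shown earlier, and model to the completion call. With no variables set, everything stays on your machine.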

9. Common Mistakes

9.1. Pulling the Largest Model First

Beginners often start with the largest available model (70B+), which requires 40+ GB of RAM and a powerful GPU. Start with a 3B or 7B model. These run on most consumer hardware and give you a working setup in minutes rather than hours.

9.2. Forgetting the Ollama Server Must Be Running

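The ollama CLI and the API are both clients of the same Ollama server. On a standard install the server usually runs as a background service, so ollama run works right away. If it is not running (for example, after you stopped the service or on a manual install), start it with ollama serve in a separate terminal before using the CLI, curl, or the Python SDK. If API calls fail with a connection error or time out, check that the server is listening on port 11434.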
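
A quick way to tell a stopped server apart from other problems is to test whether anything is listening on the port. This is a minimal standard-library sketch; is_port_open is an illustrative name, and a True result only shows that some process accepts connections there, not that it is Ollama.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 11434,
                 timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if is_port_open():
        print("Something is listening on port 11434.")
    else:
        print("Nothing on port 11434. Try starting the server with: ollama serve")
```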

9.3. Using the Wrong Python Package

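There are two Ollama-specific Python packages: ollama (the official library) and langchain-ollama (the LangChain integration, imported as langchain_ollama). The ollama package provides a simpler API for direct use, and langchain-ollama wraps it for chain-based workflows. A third option, shown earlier in this guide, is the openai package pointed at Ollama's /v1 endpoint. Pick the one that matches your workflow, and install langchain-ollama only if you need LangChain features.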
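
If you are unsure which of these clients is already in your environment, you can check without importing them. The sketch below is one way to do it; the mapping of pip package names to import names is written out by hand, and installed_clients is an illustrative helper.

```python
from importlib.util import find_spec

# pip package name -> top-level module name used in import statements
CLIENT_MODULES = {
    "ollama": "ollama",
    "langchain-ollama": "langchain_ollama",
    "openai": "openai",
}


def installed_clients() -> dict[str, bool]:
    """Report which client packages can be imported in this environment."""
    return {pkg: find_spec(mod) is not None for pkg, mod in CLIENT_MODULES.items()}


if __name__ == "__main__":
    for package, present in installed_clients().items():
        print(f"{package}: {'installed' if present else 'not installed'}")
```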

Frequently Asked Questions

Is Ollama really free?

Yes. Ollama is open-source and free to use. The models are free to download, though each comes with its own license, so check the terms before commercial use. Your only running cost is electricity.

Can I use Ollama without a GPU?

Absolutely. Ollama automatically falls back to CPU. 3B models run smoothly on CPU; 7B models are usable but slower.

Is my data private when using Ollama?

Yes. Models run on your machine, so prompts and responses are not sent to an external server; a network connection is needed only to download models. No telemetry unless you opt in.