How to Run LLMs Locally: Complete Guide for 2026
Last updated: February 2026
Running AI models on your own machine means no API costs, no rate limits, no data leaving your computer, and no company deciding what you can or can’t ask. The tradeoff: you need decent hardware and some technical comfort.
Good news — it’s gotten dramatically easier. Here’s everything you need to know.
Why Run Locally?
Privacy. Your prompts never leave your machine. No training on your data. No logs on someone else’s server. For lawyers, doctors, journalists, and anyone handling sensitive information, this matters.
Cost. After the hardware investment, every query is free. If you’re making hundreds of API calls per day, local inference pays for itself in months.
No limits. No rate limiting. No provider-imposed content policies or account bans. Open models still carry whatever refusals were trained into them, but you choose the model and the system prompt.
Offline access. Works on a plane, in a bunker, during an internet outage. The model is on your disk.
Speed. For small-to-medium models on good hardware, local inference can be faster than cloud APIs (no network latency).
What Hardware Do You Need?
The key bottleneck is RAM (for CPU inference) or VRAM (for GPU inference). LLMs are big. Here’s what different setups can handle:
GPU Inference (Faster)
| GPU | VRAM | Max Model Size (quantized) | Performance |
|---|---|---|---|
| RTX 3060 | 12GB | 7-13B params | Good for small models |
| RTX 4060 Ti | 16GB | 13B params | Solid mid-range |
| RTX 4070 Ti Super | 16GB | 13B params | Fast mid-range |
| RTX 4090 | 24GB | 30B params | Excellent |
| RTX 5090 | 32GB | 40B+ params | Best consumer GPU |
| 2x RTX 4090 | 48GB | 70B params | Enthusiast setup |
Apple Silicon (Unified Memory)
| Mac | Memory | Max Model Size (quantized) | Performance |
|---|---|---|---|
| M1/M2 8GB | 8GB | 7B params | Usable but slow |
| M1/M2 Pro 16GB | 16GB | 13B params | Good |
| M2/M3 Pro 32GB | 32GB | 30B params | Very good |
| M3/M4 Max 64GB | 64GB | 70B params | Excellent |
| M2/M3 Ultra 128GB+ | 128GB+ | 100B+ params | Run almost anything |
CPU Only (Slowest, but works)
Any machine with 16GB+ RAM can run small models (7B) on CPU. It’s slow (2-5 tokens/second) but functional. 32GB+ RAM opens up 13B models. 64GB+ can handle 30B.
My recommendation for getting started: If you have a Mac with 16GB+ memory or a PC with an RTX 3060+, you’re good to go. Don’t buy hardware just for this — try it with what you have first.
The Software Stack
Option 1: Ollama (Easiest — Start Here)
Ollama is the “Docker for LLMs.” One command to install, one command to run a model. It handles downloading, quantization, and serving automatically.
Install:
# Mac/Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download from ollama.com
Run a model:
# Download and run Llama 3.1 8B
ollama run llama3.1
# Download and run DeepSeek V3 (huge; needs hundreds of gigabytes of memory)
ollama run deepseek-v3
# Download and run a coding model
ollama run codellama
# List the models you've downloaded
ollama list
That’s it. You’re running a local LLM. Type your prompt, get a response.
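Ollama also runs a local HTTP server (on port 11434 by default), so you can script against it instead of using the interactive prompt. A quick sketch with curl, assuming you've already pulled llama3.1:
# Ask the local Ollama server for a single, non-streaming response
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain quantization in one paragraph.",
  "stream": false
}'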
Best models to start with on Ollama:
| Model | Size | Good For | Min RAM |
|---|---|---|---|
| llama3.1:8b | 4.7GB | General purpose | 8GB |
| mistral | 4.1GB | Fast, good quality | 8GB |
| codellama:13b | 7.4GB | Coding | 16GB |
| deepseek-coder-v2 | 8.9GB | Coding | 16GB |
| llama3.1:70b | 40GB | Best open-source quality | 48GB+ |
| qwen2.5:32b | 18GB | Excellent multilingual | 32GB |
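If you'd rather download a model ahead of time (or check what's already on disk), you can pull it without opening a chat and inspect it:
# Download a model from the table above without starting a chat
ollama pull qwen2.5:32b
# Show its details: parameters, quantization, context length
ollama show qwen2.5:32b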
Option 2: LM Studio (Best GUI)
If you prefer a graphical interface over the terminal, LM Studio is excellent. It’s a desktop app that lets you browse, download, and run models with a ChatGPT-like interface.
Setup:
- Download from lmstudio.ai
- Browse the model catalog
- Click download on any model
- Click “Chat” and start talking
LM Studio also runs a local API server compatible with the OpenAI API format — meaning any tool that works with ChatGPT’s API can be pointed at your local model instead.
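For example, once LM Studio's server is running (port 1234 by default), a standard OpenAI-style request works against it. The model name below is just a placeholder; use whichever model you've loaded in the app:
# Send an OpenAI-format chat request to the local LM Studio server
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about local inference."}]
  }'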
Option 3: llama.cpp (Most Control)
llama.cpp is the engine that powers both Ollama and LM Studio under the hood. Running it directly gives you maximum control over inference parameters, quantization, and performance tuning.
When to use llama.cpp directly:
- You want to fine-tune performance for specific hardware
- You’re building a production application
- You need features that Ollama/LM Studio don’t expose
- You want to run on unusual hardware (old GPUs, embedded systems)
For most people, Ollama or LM Studio is sufficient. llama.cpp is for when you need to go deeper.
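If you do go deeper, here's a minimal sketch of serving a GGUF model with llama.cpp's bundled server. The file path is illustrative, and flags vary between builds, so check the project's README for your version:
# Serve a local GGUF file with an OpenAI-compatible API on port 8080
# -ngl offloads layers to the GPU; omit it for CPU-only inference
llama-server -m ./models/llama-3.1-8b-instruct-Q4_K_M.gguf -ngl 99 --port 8080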
Option 4: vLLM (Production Serving)
If you’re running local models for a team or application (not just personal use), vLLM provides high-throughput serving with features like continuous batching, PagedAttention, and OpenAI-compatible API endpoints.
When to use vLLM:
- Serving models to multiple users
- Building applications that need high throughput
- Production deployments on your own servers
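A minimal sketch of standing up vLLM's OpenAI-compatible server. It assumes a CUDA GPU with enough VRAM for the model, and the model ID is just an example from Hugging Face (gated models also require a Hugging Face token):
# Install vLLM and serve a model with an OpenAI-compatible API (port 8000 by default)
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192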
The Best Local Models Right Now
General Purpose
- Llama 3.1 70B — Best overall open-source model. Needs 48GB+ RAM/VRAM.
- Qwen 2.5 32B — Excellent quality, great multilingual support (especially Chinese). Needs 32GB.
- Mistral Large — Strong reasoning, good instruction following. The Mistral family also comes in smaller sizes.
- DeepSeek V3 — Rivals GPT-4 on many tasks. Very large (needs significant hardware).
Coding
- DeepSeek Coder V2 — Best open-source coding model.
- CodeLlama 34B — Solid, well-tested.
- Qwen 2.5 Coder — Excellent for multiple languages.
Small and Fast (for limited hardware)
- Llama 3.1 8B — Best quality at 8B size.
- Mistral 7B — Fast, efficient, good quality.
- Phi-3 Mini — Microsoft’s small model, surprisingly capable.
Quantization: The Magic Trick
Full-precision models are huge. At 16-bit precision every parameter takes two bytes, so a 70B parameter model needs ~140GB of memory. Nobody has that in a consumer machine.
Quantization compresses models by reducing numerical precision, for example from 16 bits down to 4 or 5 bits per weight. The quality loss is minimal, but the size reduction is dramatic:
| Quantization | Size Reduction | Quality Loss | When to Use |
|---|---|---|---|
| Q8 (8-bit) | ~50% | Negligible | When you have enough RAM |
| Q6_K | ~60% | Very small | Good balance |
| Q5_K_M | ~65% | Small | Recommended default |
| Q4_K_M | ~75% | Noticeable on complex tasks | When RAM is tight |
| Q3_K | ~80% | Significant | Last resort |
Rule of thumb: Use Q5_K_M or Q4_K_M. These give you the best quality-to-size ratio. Ollama and LM Studio handle quantization automatically — you don’t need to think about it unless you want to optimize.
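If you do want to choose a quantization yourself, Ollama publishes multiple tags per model. The tag names below are illustrative; check the model's page on ollama.com for what's actually available:
# Pull a specific quantization instead of the default tag
ollama pull llama3.1:8b-instruct-q4_K_M
# Higher precision, larger download
ollama pull llama3.1:8b-instruct-q8_0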
Connecting Local Models to Your Tools
The killer feature of local LLMs: they can replace cloud APIs in your existing tools.
Ollama + Aider (AI coding agent):
aider --model ollama/deepseek-coder-v2
Ollama + Continue (VS Code extension): Install Continue, set the model to your local Ollama endpoint. Free AI coding assistance in your editor.
LM Studio + Any OpenAI-compatible tool:
LM Studio runs a local server at http://localhost:1234/v1. Point any tool that accepts an OpenAI API endpoint to this address.
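Many tools built on the official OpenAI SDKs also respect the standard environment variables, so you can often redirect them without touching a config file (support varies by tool, so check its docs):
# Point OpenAI-compatible tools at the local LM Studio server
export OPENAI_BASE_URL=http://localhost:1234/v1
# Local servers don't check the key; any non-empty string works
export OPENAI_API_KEY=lm-studio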
Ollama + Open WebUI: A self-hosted ChatGPT-like interface for your local models. Beautiful UI, conversation history, multiple models.
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main
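Once the container is running, the interface is at http://localhost:3000 (that's what the -p 3000:8080 mapping in the command above does).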
Getting Started: The 15-Minute Path
- Install Ollama (1 minute)
- Run ollama run llama3.1 (5 minutes to download, then instant)
- Start chatting (0 minutes)
- Try ollama run deepseek-coder-v2 for coding tasks
- Install Open WebUI if you want a nice chat interface
Total time: 15 minutes. Total cost: $0.
You now have a private, unlimited, free AI assistant running on your own hardware. Welcome to the future.
All tools mentioned in this guide are free or open source.