How to Run LLMs Locally: Complete Guide for 2026
Last updated: February 2026
Running AI models on your own machine means no API costs, no rate limits, no data leaving your computer, and no company deciding what you can or can’t ask. The tradeoff: you need decent hardware and some technical comfort.
Good news — it’s gotten dramatically easier. Here’s everything you need to know.
Why Run Locally?
Privacy. Your prompts never leave your machine. No training on your data. No logs on someone else’s server. For lawyers, doctors, journalists, and anyone handling sensitive information, this matters.
Cost. After the hardware investment, every query is free. If you’re making hundreds of API calls per day, local inference pays for itself in months.
No limits. No rate limiting. No provider-imposed content policies or account bans. Open models still carry whatever refusals were trained into them, but you choose the model and the system prompt.
Offline access. Works on a plane, in a bunker, during an internet outage. The model is on your disk.
Speed. For small-to-medium models on good hardware, local inference can be faster than cloud APIs (no network latency).
What Hardware Do You Need?
The key bottleneck is RAM (for CPU inference) or VRAM (for GPU inference). LLMs are big. Here’s what different setups can handle:
GPU Inference (Faster)
| GPU | VRAM | Max Model Size (quantized) | Performance |
|---|---|---|---|
| RTX 3060 | 12GB | 7-13B params | Good for small models |
| RTX 4060 Ti | 16GB | 13B params | Solid mid-range |
| RTX 4070 Ti Super | 16GB | 13B params | Fast mid-range |
| RTX 4090 | 24GB | 30B params | Excellent |
| RTX 5090 | 32GB | 40B+ params | Best consumer GPU |
| 2x RTX 4090 | 48GB | 70B params | Enthusiast setup |
Apple Silicon (Unified Memory)
| Mac | Memory | Max Model Size (quantized) | Performance |
|---|---|---|---|
| M1/M2 8GB | 8GB | 7B params | Usable but slow |
| M1/M2 Pro 16GB | 16GB | 13B params | Good |
| M2/M3 Pro 32GB | 32GB | 30B params | Very good |
| M3/M4 Max 64GB | 64GB | 70B params | Excellent |
| M2/M3 Ultra 128GB+ | 128GB+ | 100B+ params | Run almost anything |
CPU Only (Slowest, but works)
Any machine with 16GB+ RAM can run small models (7B) on CPU. It’s slow (2-5 tokens/second) but functional. 32GB+ RAM opens up 13B models. 64GB+ can handle 30B.
My recommendation for getting started: If you have a Mac with 16GB+ memory or a PC with an RTX 3060+, you’re good to go. Don’t buy hardware just for this — try it with what you have first.
The Software Stack
Option 1: Ollama (Easiest — Start Here)
Ollama is the “Docker for LLMs.” One command to install, one command to run a model. It handles downloading, quantization, and serving automatically.
Install:
# Mac/Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download from ollama.com
Run a model:
# Download and run Llama 3.1 8B
ollama run llama3.1
# Download and run DeepSeek V3 (huge; needs hundreds of gigabytes of memory)
ollama run deepseek-v3
# Download and run a coding model
ollama run codellama
# List the models you've downloaded
ollama list
That’s it. You’re running a local LLM. Type your prompt, get a response.
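Ollama also runs a local HTTP server (on port 11434 by default), so you can script against it instead of using the interactive prompt. A quick sketch with curl, assuming you've already pulled llama3.1:
# Ask the local Ollama server for a single, non-streaming response
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain quantization in one paragraph.",
  "stream": false
}'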
Best models to start with on Ollama:
| Model | Size | Good For | Min RAM |
|---|---|---|---|
| llama3.1:8b | 4.7GB | General purpose | 8GB |
| mistral | 4.1GB | Fast, good quality | 8GB |
| codellama:13b | 7.4GB | Coding | 16GB |
| deepseek-coder-v2 | 8.9GB | Coding | 16GB |
| llama3.1:70b | 40GB | Best open-source quality | 48GB+ |
| qwen2.5:32b | 18GB | Excellent multilingual | 32GB |
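If you'd rather download a model ahead of time (or check what's already on disk), you can pull it without opening a chat and inspect it:
# Download a model from the table above without starting a chat
ollama pull qwen2.5:32b
# Show its details: parameters, quantization, context length
ollama show qwen2.5:32b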
Option 2: LM Studio (Best GUI)
If you prefer a graphical interface over the terminal, LM Studio is excellent. It’s a desktop app that lets you browse, download, and run models with a ChatGPT-like interface.
Setup:
- Download from lmstudio.ai
- Browse the model catalog
- Click download on any model
- Click “Chat” and start talking
LM Studio also runs a local API server compatible with the OpenAI API format — meaning any tool that works with ChatGPT’s API can be pointed at your local model instead.
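For example, once LM Studio's server is running (port 1234 by default), a standard OpenAI-style request works against it. The model name below is just a placeholder; use whichever model you've loaded in the app:
# Send an OpenAI-format chat request to the local LM Studio server
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about local inference."}]
  }'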
Option 3: llama.cpp (Most Control)
llama.cpp is the engine that powers both Ollama and LM Studio under the hood. Running it directly gives you maximum control over inference parameters, quantization, and performance tuning.
When to use llama.cpp directly:
- You want to fine-tune performance for specific hardware
- You’re building a production application
- You need features that Ollama/LM Studio don’t expose
- You want to run on unusual hardware (old GPUs, embedded systems)
For most people, Ollama or LM Studio is sufficient. llama.cpp is for when you need to go deeper.
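If you do go deeper, here's a minimal sketch of serving a GGUF model with llama.cpp's bundled server. The file path is illustrative, and flags vary between builds, so check the project's README for your version:
# Serve a local GGUF file with an OpenAI-compatible API on port 8080
# -ngl offloads layers to the GPU; omit it for CPU-only inference
llama-server -m ./models/llama-3.1-8b-instruct-Q4_K_M.gguf -ngl 99 --port 8080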
Option 4: vLLM (Production Serving)
If you’re running local models for a team or application (not just personal use), vLLM provides high-throughput serving with features like continuous batching, PagedAttention, and OpenAI-compatible API endpoints.
When to use vLLM:
- Serving models to multiple users
- Building applications that need high throughput
- Production deployments on your own servers
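A minimal sketch of standing up vLLM's OpenAI-compatible server. It assumes a CUDA GPU with enough VRAM for the model, and the model ID is just an example from Hugging Face (gated models also require a Hugging Face token):
# Install vLLM and serve a model with an OpenAI-compatible API (port 8000 by default)
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192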
The Best Local Models Right Now
General Purpose
- Llama 3.1 70B — Best overall open-source model. Needs 48GB+ RAM/VRAM.
- Qwen 2.5 32B — Excellent quality, great multilingual support (especially Chinese). Needs 32GB.
- Mistral Large — Strong reasoning, good instruction following. The Mistral family also comes in smaller sizes.
- DeepSeek V3 — Rivals GPT-4 on many tasks. Very large (needs significant hardware).
Coding
- DeepSeek Coder V2 — Best open-source coding model.
- CodeLlama 34B — Solid, well-tested.
- Qwen 2.5 Coder — Excellent for multiple languages.
Small and Fast (for limited hardware)
- Llama 3.1 8B — Best quality at 8B size.
- Mistral 7B — Fast, efficient, good quality.
- Phi-3 Mini — Microsoft’s small model, surprisingly capable.
Quantization: The Magic Trick
Full-precision models are huge. At 16-bit precision every parameter takes two bytes, so a 70B parameter model needs ~140GB of memory. Nobody has that in a consumer machine.
Quantization compresses models by reducing numerical precision, for example from 16 bits down to 4 or 5 bits per weight. The quality loss is minimal, but the size reduction is dramatic:
| Quantization | Size Reduction | Quality Loss | When to Use |
|---|---|---|---|
| Q8 (8-bit) | ~50% | Negligible | When you have enough RAM |
| Q6_K | ~60% | Very small | Good balance |
| Q5_K_M | ~65% | Small | Recommended default |
| Q4_K_M | ~75% | Noticeable on complex tasks | When RAM is tight |
| Q3_K | ~80% | Significant | Last resort |
Rule of thumb: Use Q5_K_M or Q4_K_M. These give you the best quality-to-size ratio. Ollama and LM Studio handle quantization automatically — you don’t need to think about it unless you want to optimize.
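If you do want to choose a quantization yourself, Ollama publishes multiple tags per model. The tag names below are illustrative; check the model's page on ollama.com for what's actually available:
# Pull a specific quantization instead of the default tag
ollama pull llama3.1:8b-instruct-q4_K_M
# Higher precision, larger download
ollama pull llama3.1:8b-instruct-q8_0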
Connecting Local Models to Your Tools
The killer feature of local LLMs: they can replace cloud APIs in your existing tools.
Ollama + Aider (AI coding agent):
aider --model ollama/deepseek-coder-v2
Ollama + Continue (VS Code extension): Install Continue, set the model to your local Ollama endpoint. Free AI coding assistance in your editor.
LM Studio + Any OpenAI-compatible tool:
LM Studio runs a local server at http://localhost:1234/v1. Point any tool that accepts an OpenAI API endpoint to this address.
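Many tools built on the official OpenAI SDKs also respect the standard environment variables, so you can often redirect them without touching a config file (support varies by tool, so check its docs):
# Point OpenAI-compatible tools at the local LM Studio server
export OPENAI_BASE_URL=http://localhost:1234/v1
# Local servers don't check the key; any non-empty string works
export OPENAI_API_KEY=lm-studio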
Ollama + Open WebUI: A self-hosted ChatGPT-like interface for your local models. Beautiful UI, conversation history, multiple models.
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main
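Once the container is running, the interface is at http://localhost:3000 (that's what the -p 3000:8080 mapping in the command above does).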
Getting Started: The 15-Minute Path
- Install Ollama (1 minute)
- Run ollama run llama3.1 (5 minutes to download, then instant)
- Start chatting (0 minutes)
- Try ollama run deepseek-coder-v2 for coding tasks
- Install Open WebUI if you want a nice chat interface
Total time: 15 minutes. Total cost: $0.
You now have a private, unlimited, free AI assistant running on your own hardware. Welcome to the future.
All tools mentioned in this guide are free or open source.