A practical guide to running AI models locally with Ollama, llama.cpp, and more. Hardware requirements, setup instructions, and when local beats cloud.
Local AI: Running LLMs on Your Own Hardware (And Why You'd Want To)
ChatGPT requires internet. Claude requires an API key. Every prompt goes to someone else's server.
But it doesn't have to.
You can run AI models on your own hardware. Completely offline. Completely private. Often completely free.
Here's how.
Why Run AI Locally?
Privacy
Your data never leaves your machine. No terms of service. No training on your inputs. No logs on someone else's server.
For sensitive work—legal documents, medical notes, proprietary code—this matters.
Cost
After hardware, running local AI is essentially free. No per-token charges. No subscription fees. Process millions of tokens at zero marginal cost.
If you're concerned about API spending, our guide to the real cost of AI pricing breaks down when local makes financial sense versus cloud.
Speed (Sometimes)
No network latency. If you have good hardware, local inference can be faster than API calls, especially for short interactions.
Offline Access
No internet required. Use AI on planes, in remote locations, during outages.
Control
Choose your model. Customize behavior. No unexpected changes when the provider updates.
Learning
Running models locally teaches you how AI actually works. Better understanding leads to better usage.
What You Need
Minimum (Basic Usage)
- CPU: Modern multi-core (Intel 10th gen+, AMD Ryzen 3000+)
- RAM: 16GB (enough for quantized 7B-8B models)
- Storage: 10GB free
- GPU: Not required but helps significantly
This runs small models (7B and under) at usable speeds.
Recommended (Good Experience)
- CPU: Modern 8+ core
- RAM: 32GB
- Storage: SSD with 50GB+ free
- GPU: NVIDIA RTX 3060 or better (12GB VRAM)
This runs medium models (7B-13B) comfortably.
Ideal (Best Performance)
- RAM: 64GB+
- GPU: RTX 4090 (24GB VRAM) or Apple M3 Max
- Storage: Fast NVMe SSD
This runs larger models (30B-70B) at acceptable speeds.
Apple Silicon Note
M1/M2/M3 Macs are excellent for local AI. Unified memory means models can use system RAM, and Metal acceleration is well-supported.
- M1 Pro (16GB): Good for 7B models
- M2 Max (32GB): Good for 13B models
- M3 Max (64GB+): Good for 30B+ models
Method 1: Ollama (Easiest)
Ollama makes running local AI trivially easy.
Installation
Mac:
brew install ollama
Linux:
curl -fsSL https://ollama.ai/install.sh | sh
Windows:
Download from ollama.ai
Basic Usage
# Start Ollama service (usually automatic)
ollama serve
# Run a model (downloads automatically)
ollama run llama3.2
# Now you're chatting with a local AI
>>> What is the capital of France?
The capital of France is Paris.
Popular Models
# General purpose
ollama run llama3.2 # Good all-around
ollama run phi3 # Microsoft's efficient model
ollama run gemma2 # Google's open model
ollama run mistral # Strong 7B model
# Coding
ollama run codellama # Code-focused Llama
ollama run deepseek-coder # Excellent for code
# Small/Fast
ollama run phi3:mini # Very small, fast
ollama run gemma2:2b # Tiny but capable
For more on choosing the right small model for your needs, see our guide to small language models (SLMs).
Using Ollama as API
Ollama serves a local HTTP API on port 11434. The native endpoint looks like this:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Write a haiku about programming"
}'
It also exposes an OpenAI-compatible endpoint, so any OpenAI-compatible client can talk to it.
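For example, the same haiku request through the compatibility layer (a minimal sketch; Ollama exposes the OpenAI-style routes under /v1):
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Write a haiku about programming"}]
  }'
Any OpenAI-compatible SDK can be pointed at http://localhost:11434/v1 (the API key can be any placeholder string).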
Ollama Tips
- Models download once, then run from cache
- Use ollama list to see downloaded models
- Use ollama rm model-name to remove models
- Check GPU usage with ollama ps
Method 2: llama.cpp (Most Efficient)
For maximum performance on limited hardware, llama.cpp is the gold standard.
Installation
# Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build (with GPU support if available)
make LLAMA_CUDA=1 # For NVIDIA
make LLAMA_METAL=1 # For Mac
make # CPU only
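Note that recent llama.cpp releases have moved from the Makefile to CMake and renamed the main binary to llama-cli. If the make targets above aren't available in your checkout, the CMake equivalent looks roughly like this (a sketch; flag names change between versions, so check the repo's README):
# CMake build; binaries land in build/bin/
cmake -B build -DGGML_CUDA=ON      # NVIDIA; Metal is enabled by default on Apple Silicon
cmake --build build --config Release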
Download Models
Get GGUF format models from Hugging Face:
- TheBloke has quantized versions of most models
- Look for Q4_K_M or Q5_K_M for good quality/size balance
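If you prefer the command line, the Hugging Face CLI can pull individual GGUF files. A minimal sketch, where the repo and file names are just examples (browse Hugging Face for the model you actually want):
# Requires: pip install huggingface_hub
huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_K_M.gguf --local-dir models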
Running
./main -m models/llama-3.2-8b.Q4_K_M.gguf \
-n 512 \
--prompt "Write a function to calculate Fibonacci numbers:"
Quantization Explained
GGUF models come in different quantization levels:
| Quantization | Quality | Size | Speed |
| --- | --- | --- | --- |
| Q2_K | Low | Smallest | Fastest |
| Q4_K_M | Good | Small | Fast |
| Q5_K_M | Better | Medium | Medium |
| Q6_K | Great | Larger | Slower |
| Q8_0 | Near-original | Large | Slow |
For most use: Q4_K_M or Q5_K_M.
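As a rough rule of thumb, a GGUF file's size is parameter count × bits per weight ÷ 8. An 8B model at Q4_K_M (roughly 4.5-5 bits per weight in practice) works out to around 4.5-5GB on disk, and needs a bit more than that in memory once the context is loaded, which is why 7B-8B models fit comfortably on a 16GB machine.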
Method 3: LM Studio (GUI)
If you prefer a graphical interface:
- Download from lmstudio.ai
- Browse and download models in-app
- Chat through the GUI or use local API
Good for: People who want local AI without command line.
Method 4: GPT4All
Another user-friendly option:
- Download from gpt4all.io
- Includes models optimized for local use
- Simple GUI with chat interface
Good for: Beginners, quick setup.
Performance Optimization
Use GPU When Possible
GPU inference is 5-50x faster than CPU for supported models.
# Ollama uses GPU automatically if available
# Check with:
ollama ps
Adjust Context Length
Longer context = more memory = slower inference.
# Ollama: set the context window from inside an interactive session
ollama run llama3.2
>>> /set parameter num_ctx 4096
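The context size can also be set per request through the API; a minimal sketch (num_ctx is the option Ollama uses for the context window):
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Summarize this document ...",
  "options": { "num_ctx": 4096 }
}'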
Use Appropriate Quantization
Don't run Q8_0 if Q4_K_M gives acceptable quality for your task.
Batch Requests
If processing many items, batch them rather than one-by-one.
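For example, here is a minimal shell sketch that runs a folder of text files through the local API one after another (it assumes jq is installed and a docs/ directory of .txt files; adjust the model and prompt to taste). The model stays loaded between calls, so there is no per-request startup cost:
# Summarize every .txt file in docs/ with the local model
for f in docs/*.txt; do
  jq -n --arg p "Summarize the following text: $(cat "$f")" \
    '{model: "llama3.2", prompt: $p, stream: false}' |
    curl -s http://localhost:11434/api/generate -d @- |
    jq -r '.response' > "${f%.txt}.summary.txt"
done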
Consider Model Size vs Quality
For many tasks, a fast 7B model beats a slow 70B model. Test before assuming bigger is better.
Use Cases for Local AI
Document Analysis (Privacy)
Process confidential documents without cloud exposure.
Development Assistant (Offline)
Code assistance that works without internet.
Bulk Processing (Cost)
Process thousands of items at zero marginal cost.
Experimentation (Control)
Test different models, prompts, parameters freely.
Learning (Education)
Understand how AI works by running it yourself.
When Local Doesn't Make Sense
Need Highest Quality
GPT-4 and Claude are still better than any local model. When quality is paramount, use cloud.
For a detailed comparison of open source versus closed models, check out our open source vs closed AI models guide.
Limited Hardware
If you only have an old laptop, API access provides better results than struggling with local.
Variable Workloads
If usage is sporadic and low-volume, API per-token pricing beats hardware investment.
Need Latest Features
Vision, real-time voice, and cutting-edge capabilities hit cloud first.
Recommended Setup
For Most People
Ollama + Llama 3.2 8B
brew install ollama # or your platform's install
ollama run llama3.2
Done. You now have local AI.
For Developers
Ollama + multiple models + API access
ollama run llama3.2 # General use
ollama run deepseek-coder # Coding
ollama run phi3 # Fast tasks
Use Ollama's API endpoint for programmatic access. You can even integrate local models into n8n automation workflows for cost-effective AI automation.
For Power Users
llama.cpp + quantized models + custom configurations
Maximum control, maximum efficiency, requires more setup.
The Local AI Experience
Running AI locally feels different than cloud APIs:
- Faster feedback loop (no network latency)
- More experimentation (no cost concerns)
- More understanding (you see what's happening)
- More limitations (hardware constrains model size)
It's not better or worse than cloud AI. It's a different tool for different situations.
Frequently Asked Questions
What is local AI and why would I want to run it?
Local AI means running AI models directly on your own computer instead of using cloud services. This gives you complete privacy (your data never leaves your machine), zero cost per query after initial setup, and the ability to work completely offline.
Can I really run AI models on my laptop?
Yes, modern laptops can run AI models locally. You need at least 16GB RAM for small models, but even an M1 MacBook or mid-range Windows laptop can run models like Llama 3.2 at usable speeds. You don't need expensive GPUs for basic usage.
How does Ollama make running local AI so easy?
Ollama simplifies local AI by handling model downloads, GPU acceleration, and API serving automatically. You just run one command like ollama run llama3.2, and it downloads the model and starts an interactive chat session. No complex setup required.
Is local AI quality as good as ChatGPT or Claude?
Local AI models are generally not as capable as GPT-4 or Claude for complex reasoning tasks. However, for many use cases like document analysis, code assistance, and basic queries, models like Llama 3.2 provide surprisingly good quality at zero marginal cost.
What's the difference between Q4_K_M and Q8_0 quantization?
Quantization reduces model size and memory requirements. Q4_K_M is a 4-bit quantization that offers good quality at smaller size, while Q8_0 is 8-bit with near-original quality but larger size. For most users, Q4_K_M or Q5_K_M provides the best balance of quality and performance.
When should I use local AI instead of cloud APIs?
Use local AI when you need complete privacy (processing sensitive documents), have high volume needs (processing thousands of items at zero cost), want offline access, or are experimenting and learning. Use cloud APIs when you need the absolute highest quality or latest features.
Getting Started Today
- Install Ollama (2 minutes)
- Run ollama run llama3.2 (downloads ~4GB)
- Ask it something
- You're running local AI
That's it. You now have an AI that runs completely on your hardware, works offline, and costs nothing per query.
Try it. The barrier is lower than you think.
Need help implementing AI solutions in your business? Cedar Operations helps companies leverage AI effectively. Let's discuss your needs →