Setting Up llama.cpp on macOS: What Actually Worked for Me
My Python 3.12 setup for Apple Silicon with text generation and embedding models
One thing troubled me these past days: I couldn’t get llama.cpp to install on macOS. I had set it up successfully on Linux using Python 3.13, but the same steps kept failing on my Mac.
llama.cpp is a tool written in C++ that allows you to run large language models on consumer hardware. It supports various model types and provides GPU acceleration on Apple Silicon through Metal. In this setup I use it from Python through the llama-cpp-python bindings. For developers and IT professionals, this means you can run AI models locally without relying on cloud services or paying for API calls.
I searched everywhere: ChatGPT, Claude, and Google. Nothing worked. I went back to the official llama.cpp documentation, and there I found the answer: llama.cpp doesn’t work well with Python 3.13 on macOS.
After more testing, I arrived at the right steps for macOS. I run an M2 Mac, but this should work on other Apple Silicon chips, though I have not tested those variants.
I wanted to verify everything worked. I downloaded two models from Hugging Face in GGUF format: a text generation model and a text embedding model. I created two minimal Python scripts, one for text generation, another for embedding generation.
Both scripts ran fine. The embedding script generated embeddings for input text. The text generation script responded correctly to prompts.
To be certain, I documented all steps. Then I created a fresh directory and repeated everything. It worked again.
Here are my steps, the models I used, and the test scripts. I hope this helps those who want to set up language models locally on macOS for prototyping or development.
Complete Setup Process
Step 1: Environment Setup
# Install Python 3.12 using uv (if not already installed)
uv python install 3.12
# Check Python version
python3.12 --version
# Create virtual environment with Python 3.12
uv venv --python=$(which python3.12)
# Upgrade pip
uv pip install --upgrade pip
# Install required packages with Metal support
uv pip install \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal \
llama-cpp-python \
"langchain>=0.2" "langchain-community>=0.2"
Step 2: Download Models
# Download Nomic embedding model
wget https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF/resolve/main/nomic-embed-text-v1.5.Q8_0.gguf
# Download Qwen text generation model
wget -O qwen2.5-3b-instruct-q4_k_m.gguf \
https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF/resolve/main/qwen2.5-3b-instruct-q4_k_m.gguf
Step 3: Create Test Scripts
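Before writing the test scripts, it can be worth confirming that the two downloads are really GGUF files and not, say, an HTML error page saved under a .gguf name. The sketch below is not part of my original steps. It relies on the fact that a GGUF file starts with the four bytes GGUF followed by a 32-bit version number; the function name read_gguf_version is my own, and the little-endian read assumes a standard (non big-endian) model file.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def read_gguf_version(path):
    """Return the GGUF format version, or None if the file is not GGUF."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # Assumption: standard little-endian file, so the version is a "<I" uint32.
    (version,) = struct.unpack("<I", header[4:8])
    return version


if __name__ == "__main__":
    # File names match the two wget downloads from Step 2.
    for name in ("nomic-embed-text-v1.5.Q8_0.gguf",
                 "qwen2.5-3b-instruct-q4_k_m.gguf"):
        p = Path(name)
        if not p.exists():
            print(f"{name}: missing")
            continue
        version = read_gguf_version(p)
        status = f"GGUF v{version}" if version is not None else "not a GGUF file"
        print(f"{name}: {status}, {p.stat().st_size / 1e6:.0f} MB")
```

If either file is reported as missing or not GGUF, re-running the download is likely quicker than debugging a model load error later.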
Text Generation (langchain_qwen_generation.py):
from langchain_community.llms import LlamaCpp
llm = LlamaCpp(
model_path="qwen2.5-3b-instruct-q4_k_m.gguf",
n_ctx=2048,
temperature=0.7,
max_tokens=100,
verbose=False
)
response = llm.invoke("What is the capital of France?")
print(response)
Embedding Generation (langchain_nomic_embeddings.py):
from langchain_community.embeddings import LlamaCppEmbeddings
embeddings = LlamaCppEmbeddings(
model_path="nomic-embed-text-v1.5.Q8_0.gguf",
n_ctx=512,
verbose=False
)
query = embeddings.embed_query("search_query: What is AI?")
print(f"Dimensions: {len(query)}")
print(f"First 5: {query[:5]}")
Step 4: Run Test Scripts
# Test text generation
uv run langchain_qwen_generation.py
# Test embeddings
uv run langchain_nomic_embeddings.py
What I Learned
For now, use Python 3.12, not 3.13, on macOS. The installation command includes Metal support for Apple Silicon. Both test scripts verify that everything works.
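One way to make the Python version constraint harder to forget is a small guard at the top of each test script. This is only a sketch: check_environment is a name I made up, and the rules it encodes are simply what I tested (Python 3.12 on an Apple Silicon Mac), not official requirements. The arm64 check assumes that a native Apple Silicon Python reports arm64 from platform.machine(), while a Python running under Rosetta reports x86_64.

```python
import platform
import sys


def check_environment(version=None, system=None, machine=None):
    """Return a list of warnings; an empty list means the setup matches this guide."""
    version = tuple(version[:2]) if version is not None else sys.version_info[:2]
    system = system if system is not None else platform.system()
    machine = machine if machine is not None else platform.machine()

    problems = []
    if system != "Darwin":
        # This guide only covers macOS; other platforms are not checked.
        return problems
    if version != (3, 12):
        problems.append(
            f"Python {version[0]}.{version[1]} found; this setup was only tested with 3.12"
        )
    if machine != "arm64":
        problems.append(
            f"machine is {machine!r}, not 'arm64'; the Metal setup here assumes Apple Silicon"
        )
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(f"warning: {problem}")
```

Calling check_environment() with no arguments inspects the running interpreter; the optional parameters exist only so the function can be exercised with other values.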
This setup lets you run language models locally without cloud costs. Good for prototyping and development.
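As a possible next step for prototyping, the vectors from the embedding script can be compared with cosine similarity to rank documents against a query. The sketch below goes beyond what I tested above and uses toy three-dimensional vectors so it runs without a model. In real use the vectors would come from embeddings.embed_query(...) and embeddings.embed_documents([...]). The Nomic model card describes task prefixes, so documents would presumably get a "search_document: " prefix to match the "search_query: " prefix used in the test script. The helper names cosine_similarity and rank_documents are my own.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy vectors stand in for real embeddings (hypothetical values).
    query = [1.0, 0.0, 0.0]
    docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
    for index, score in rank_documents(query, docs):
        print(f"doc {index}: {score:.3f}")
```

Plain Python keeps the example dependency-free; for more than a handful of documents, a vectorized library would be the more practical choice.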



