
🚀 Building and Running vLLM Locally on Your M1 Mac (CPU-Only)

Test LLMs on your M1 Mac without needing a GPU: From Docker build to local LLM server, all on your MacBook


In this post, you’ll go from cloning the repo to asking your locally hosted model for a prompt completion, all running CPU-only on your Apple Silicon Mac (M1).


🧰 What You'll Need

Before we start, here’s what you need (a quick sanity check follows the list):

  • Docker Desktop (with BuildKit and buildx support)

  • macOS with Apple Silicon

  • At least 16 GB RAM (in a CPU-only setup, the model weights and KV cache all live in system memory)

  • Terminal and curl installed
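
If you want to verify the setup first, these two commands (assuming Docker Desktop is already installed and running) confirm that buildx is available and that you’re on an arm64 machine:

docker buildx version
uname -m   # should print "arm64" on Apple Silicon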


🔧 Step 1: Clone the vLLM Repository

Let’s grab the latest vLLM code from GitHub:

git clone git@github.com:vllm-project/vllm.git
cd vllm

This repo has everything you need, including Dockerfiles for CPU and GPU builds.
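
The CPU Dockerfile we’ll use in the next step lives under docker/. If you’re curious which other build targets ship with the repo, a quick listing shows them (exact filenames may vary between vLLM versions):

ls docker/   # Dockerfile.cpu is the one we build from in Step 2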

🏗️ Step 2: Build the vLLM CPU Image

DOCKER_BUILDKIT=1 docker buildx build \
  --platform=linux/arm64 \
  -f docker/Dockerfile.cpu \
  --target=vllm-openai \
  -t vllm-cpu-arm64 \
  --load \
  .
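
The first build can take a while, since vLLM is built from source inside the container. Once it finishes, confirm the image landed in your local registry:

docker images vllm-cpu-arm64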

🧠 Step 3: Start the Inference Server

Let’s fire up the server with a small model (facebook/opt-125m) to keep RAM usage light:

docker run -it --rm \
  -p 8000:8000 \
  -e VLLM_CPU_OMP_THREADS_BIND=nobind \
  vllm-cpu-arm64 facebook/opt-125m

💡 Why VLLM_CPU_OMP_THREADS_BIND=nobind?

On macOS, Docker doesn’t expose NUMA topology or low-level CPU affinity to containers. Setting this variable to nobind disables thread pinning, which prevents runtime errors like: AssertionError: Not enough allowed NUMA nodes to bind threads of 1 CPUWorkers. Allowed NUMA nodes are []. Please try to bind threads manually.
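
Once the startup logs settle down, you can confirm the OpenAI-compatible API is up by listing the models the server is hosting:

curl http://localhost:8000/v1/models   # should list facebook/opt-125m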

🌐 Step 4: Send a Completion Request via curl

Let’s ask the model to complete a prompt:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m", 
    "prompt": "What are three healthy snacks I can keep at my desk? ",
    "max_tokens": 1000,
    "temperature": 0.9,
    "top_p": 0.95
  }'

The model replies with its best attempt:


{"id":"cmpl-b180a957fc581c3e","object":"text_completion","created":1770101699,"model":"facebook/opt-125m","choices":[{"index":0,"text":" The items below are snacks, so I’m assuming snack #1.  I need some fruits, vegetables, and nuts.  The ones I’ve had for a while.\nTin foil, peanut butter, and peanut butter and orange juice  (I found them at some local shop you go to)  Also some dried fruit\nNothing too “healthy” at the moment but I’m thinking of making some almond flour from this lol\nI'd rather use the butter, but I do have avocado so I'm probably gonna make some peanut butter too\nIt's definitely worth it! It's hard to find almond flour, but I’m sure you can find it there","logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":14,"total_tokens":160,"completion_tokens":146,"prompt_tokens_details":null},"kv_transfer_params":null}

On the server side, the logs will show the incoming request and generation stats as well.
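
If you just want the generated text without the JSON wrapper, pipe the response through jq (assuming you have it installed):

curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "What are three healthy snacks I can keep at my desk? ",
    "max_tokens": 1000,
    "temperature": 0.9,
    "top_p": 0.95
  }' | jq -r '.choices[0].text'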

🧪 Tips, Tricks, and Troubleshooting

🐢 It's slow!

Yes, CPUs aren't great at real-time inference for large models. Try lighter models like:

  • facebook/opt-125m

  • EleutherAI/pythia-70m

🧠 Want smarter answers?

Try a bigger model like facebook/opt-1.3b, but be warned: you’ll need considerably more RAM.
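
In either case, swapping models is just a matter of changing the last argument of the docker run command from Step 3. For example, to try the smaller pythia model:

docker run -it --rm \
  -p 8000:8000 \
  -e VLLM_CPU_OMP_THREADS_BIND=nobind \
  vllm-cpu-arm64 EleutherAI/pythia-70m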

🎉 Wrapping Up

In just a few steps, you:

  • Built a CPU-only vLLM image tailored for Apple Silicon

  • Ran a local OpenAI-compatible inference server

  • Sent a live request and got a generated response

No GPU. No cloud. All local. 🧠💻

🚀
May the Forge be with you!