
🚀 Building and Running vLLM Locally on Your M1 Mac (CPU-Only)

Test LLMs on your M1 Mac without needing a GPU: From Docker build to local LLM server, all on your MacBook


In this post, you’ll go from cloning the repo to asking your locally hosted model for a prompt completion, all running CPU-only on your Apple Silicon Mac (M1).


🧰 What You'll Need

Before we start, here’s what you need (a quick sanity check follows the list):

  • Docker Desktop (with BuildKit and buildx support)

  • macOS with Apple Silicon

  • At least 16 GB RAM (in a CPU-only setup, the model weights and KV cache all live in system memory)

  • Terminal and curl installed
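
If you want to verify the setup first, these two commands (assuming Docker Desktop is already installed and running) confirm that buildx is available and that you’re on an arm64 machine:

docker buildx version
uname -m   # should print "arm64" on Apple Silicon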


🔧 Step 1: Clone the vLLM Repository

Let’s grab the latest vLLM code from GitHub:

git clone git@github.com:vllm-project/vllm.git
cd vllm

This repo has everything you need, including Dockerfiles for CPU and GPU builds.
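
The CPU Dockerfile we’ll use in the next step lives under docker/. If you’re curious which other build targets ship with the repo, a quick listing shows them (exact filenames may vary between vLLM versions):

ls docker/   # Dockerfile.cpu is the one we build from in Step 2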

🏗️ Step 2: Build the vLLM CPU Image

DOCKER_BUILDKIT=1 docker buildx build \
  --platform=linux/arm64 \
  -f docker/Dockerfile.cpu \
  --target=vllm-openai \
  -t vllm-cpu-arm64 \
  --load \
  .
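
The first build can take a while, since vLLM is built from source inside the container. Once it finishes, confirm the image landed in your local registry:

docker images vllm-cpu-arm64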

🧠 Step 3: Start the Inference Server

Let’s fire up the server with a small model (facebook/opt-125m) to keep RAM usage light:

docker run -it --rm \
  -p 8000:8000 \
  -e VLLM_CPU_OMP_THREADS_BIND=nobind \
  vllm-cpu-arm64 facebook/opt-125m

💡 Why VLLM_CPU_OMP_THREADS_BIND=nobind?

On macOS, Docker doesn’t expose NUMA topology or low-level CPU affinity to containers. Setting this variable to nobind disables thread pinning, which prevents runtime errors like: AssertionError: Not enough allowed NUMA nodes to bind threads of 1 CPUWorkers. Allowed NUMA nodes are []. Please try to bind threads manually.
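
Once the startup logs settle down, you can confirm the OpenAI-compatible API is up by listing the models the server is hosting:

curl http://localhost:8000/v1/models   # should list facebook/opt-125m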

🌐 Step 4: Send a Completion Request via curl

Let’s ask the model to complete a prompt:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m", 
    "prompt": "What are three healthy snacks I can keep at my desk? ",
    "max_tokens": 1000,
    "temperature": 0.9,
    "top_p": 0.95
  }'

The model replies with its best attempt:


{"id":"cmpl-b180a957fc581c3e","object":"text_completion","created":1770101699,"model":"facebook/opt-125m","choices":[{"index":0,"text":" The items below are snacks, so I’m assuming snack #1.  I need some fruits, vegetables, and nuts.  The ones I’ve had for a while.\nTin foil, peanut butter, and peanut butter and orange juice  (I found them at some local shop you go to)  Also some dried fruit\nNothing too “healthy” at the moment but I’m thinking of making some almond flour from this lol\nI'd rather use the butter, but I do have avocado so I'm probably gonna make some peanut butter too\nIt's definitely worth it! It's hard to find almond flour, but I’m sure you can find it there","logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":14,"total_tokens":160,"completion_tokens":146,"prompt_tokens_details":null},"kv_transfer_params":null}

On the server side, the logs will show the incoming request and generation stats as well.
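
If you just want the generated text without the JSON wrapper, pipe the response through jq (assuming you have it installed):

curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "What are three healthy snacks I can keep at my desk? ",
    "max_tokens": 1000,
    "temperature": 0.9,
    "top_p": 0.95
  }' | jq -r '.choices[0].text'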

🧪 Tips, Tricks, and Troubleshooting

🐢 It's slow!

Yes, CPUs aren't great at real-time inference for large models. Try lighter models like:

  • facebook/opt-125m

  • EleutherAI/pythia-70m

🧠 Want smarter answers?

Try a bigger model like facebook/opt-1.3b, but be warned: you’ll need considerably more RAM.
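
In either case, swapping models is just a matter of changing the last argument of the docker run command from Step 3. For example, to try the smaller pythia model:

docker run -it --rm \
  -p 8000:8000 \
  -e VLLM_CPU_OMP_THREADS_BIND=nobind \
  vllm-cpu-arm64 EleutherAI/pythia-70m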

🎉 Wrapping Up

In just a few steps, you:

  • Built a CPU-only vLLM image tailored for Apple Silicon

  • Ran a local OpenAI-compatible inference server

  • Sent a live request and got a generated response

No GPU. No cloud. All local. 🧠💻

🚀
May the Forge be with you!