vLLM vs LM Studio vs llama.cpp: How to Self-Host AI Models in Hong Kong
If you're self-hosting AI models in Hong Kong — and increasingly, that's the smart move — you need to pick an inference engine. The three main options are vLLM, LM Studio, and llama.cpp (often via Ollama). Each has a clear use case.
Here's when to use what.
The Quick Answer
LM Studio — You want a GUI on your Mac or PC. Download a model, click run, start chatting. No terminal required.
Ollama (llama.cpp) — You want a CLI on your laptop or a lightweight server. One command to pull and run models. Great for development.
vLLM — You need production serving. Multiple users, high throughput, batching, GPU optimization. The serious option.
LM Studio: The Desktop App
What It Is
LM Studio is a desktop application for running LLMs locally. Available for Mac, Windows, and Linux. It provides a visual interface for downloading, managing, and chatting with models.
When to Use It
- You're not a developer (or you are, but you just want to chat with a model)
- You want to browse and download models from a visual catalog
- You want to compare different models side-by-side
- You're on a Mac and want the simplest possible setup
How It Works
Download the app. Browse the model library (it indexes Hugging Face). Click download on a model. Click run. Chat. That's it.
LM Studio handles quantization selection, memory management, and GPU acceleration automatically. It also exposes a local API server if you want to connect other tools.
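That local API server speaks the OpenAI chat-completions format. Here's a minimal sketch using only Python's standard library, assuming LM Studio's default port of 1234 and a model already loaded (the model name is a placeholder — LM Studio reports the actual loaded model's identifier):

```python
import json
import urllib.request

def build_chat_request(prompt, model="local-model",
                       base_url="http://localhost:1234/v1"):
    """Build an OpenAI-style chat-completions request for LM Studio's
    local server. Port 1234 is LM Studio's default; change base_url
    if you've configured a different one."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With LM Studio running and a model loaded, send it like any HTTP call:
# with urllib.request.urlopen(build_chat_request("Hello")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the format is OpenAI-compatible, any tool that accepts a custom base URL can talk to LM Studio the same way.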
Strengths
- Beautiful GUI
- Model discovery and management
- Side-by-side model comparison
- Automatic hardware optimization
- Local API server for tool integration
Limitations
- Desktop-only (no headless server option)
- Not designed for production serving
- Single-user focused
- Limited batch processing
Ollama (llama.cpp under the hood)
What It Is
Ollama is a CLI tool that wraps llama.cpp with a friendly interface. One command to install, one command to run a model. It exposes an OpenAI-compatible API on localhost.
When to Use It
- You prefer the terminal
- You're developing locally and need a quick model to test against
- You want to run models on a remote server via SSH
- You need an OpenAI-compatible endpoint for local development
- You want offline AI on your laptop (commuting on the MTR, spotty wifi)
How It Works
Install Ollama. Pull a model by name. Run it. Ollama handles downloading, quantization, and serving. It exposes an API at localhost:11434 that's compatible with OpenAI's format — so Cursor, OpenCode, and other tools can connect directly.
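Ollama actually exposes two request shapes on that port: its own native endpoint and the OpenAI-compatible one that editor tools expect. A sketch of both, using only the standard library and assuming the default port 11434 (the model name here is illustrative — use whatever you've pulled):

```python
import json
import urllib.request

OLLAMA = "http://localhost:11434"

def _post(path, body):
    """Helper: build a JSON POST request against the local Ollama server."""
    return urllib.request.Request(
        f"{OLLAMA}{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def native_generate_request(prompt, model="qwen2.5-coder"):
    """Ollama's native endpoint: POST /api/generate."""
    return _post("/api/generate",
                 {"model": model, "prompt": prompt, "stream": False})

def openai_chat_request(prompt, model="qwen2.5-coder"):
    """The OpenAI-compatible endpoint: POST /v1/chat/completions.
    Tools like Cursor or OpenCode only need this base URL to connect."""
    return _post("/v1/chat/completions",
                 {"model": model,
                  "messages": [{"role": "user", "content": prompt}]})
```

In practice you rarely write this by hand — you point an existing OpenAI client or editor plugin at `http://localhost:11434/v1` and it just works.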
Strengths
- Simplest CLI experience
- Huge model library
- OpenAI-compatible API out of the box
- Runs on CPU (no GPU required, though GPU helps)
- Low resource overhead
- Great for Apple Silicon Macs
Limitations
- One loaded model at a time (by default)
- Limited batching and throughput optimization
- Not ideal for high-concurrency production
- Basic quantization options compared to manual llama.cpp
Raw llama.cpp
Ollama wraps llama.cpp, but you can use llama.cpp directly for more control. This gives you fine-grained quantization options, custom sampling parameters, and the ability to run on exotic hardware. The tradeoff is more manual setup.
Use raw llama.cpp when:
- You need specific quantization (Q4_K_M, Q5_K_S, etc.)
- You're deploying on unusual hardware
- You need maximum control over inference parameters
- You're building a custom inference pipeline
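Choosing a quantization level is mostly arithmetic: file size is roughly parameter count times bits per weight. A back-of-envelope estimator — the bits-per-weight figures below are rough averages for llama.cpp's quant types, and actual GGUF files vary slightly by architecture:

```python
# Approximate average bits per weight for common llama.cpp quant types.
# Rough figures only; real GGUF sizes vary a little by model architecture.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_S": 5.5,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def model_size_gb(n_params_billions, quant):
    """Estimate GGUF file size in GB (and, roughly, the RAM/VRAM needed
    just to hold the weights) for a given quantization level."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params_billions * 1e9 * bits / 8 / 1e9

# e.g. a 7B model: ~4.2 GB at Q4_K_M, ~14 GB at F16 — which is why
# quantized 7B models fit comfortably on a 16 GB laptop.
```

Add a couple of GB on top for the KV cache and runtime overhead before deciding what fits your machine.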
vLLM: Production Serving
What It Is
vLLM is a high-throughput inference engine designed for production serving. It uses PagedAttention for efficient memory management and supports continuous batching for handling multiple concurrent requests.
When to Use It
- Multiple users hitting the same model
- You need high throughput (hundreds or thousands of requests/hour)
- You're serving a model as an internal API for your team or product
- You have GPU(s) and want maximum utilization
- You need production features: health checks, metrics, auto-scaling
How It Works
Install vLLM (Python package). Point it at a model (Hugging Face ID or local path). It loads the model onto your GPU(s) and starts serving an OpenAI-compatible API.
vLLM's PagedAttention algorithm manages GPU memory like an operating system manages RAM — dynamically allocating and freeing memory blocks as requests come and go. This means it can serve more concurrent requests on the same hardware.
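A toy sketch of that idea (not vLLM's actual implementation): the KV cache is carved into fixed-size blocks, each request grabs blocks only as its sequence grows, and finished requests return their blocks to a shared free pool.

```python
class PagedKVCache:
    """Toy paged KV-cache allocator. Mimics the idea behind
    PagedAttention: fixed-size blocks from a shared free pool,
    allocated on demand and freed when a request finishes."""

    def __init__(self, total_blocks, block_size=16):
        self.block_size = block_size            # tokens per block
        self.free_blocks = list(range(total_blocks))
        self.tables = {}                        # req_id -> list of block ids
        self.lengths = {}                       # req_id -> token count

    def append_token(self, req_id):
        """Account for one more token; allocate a new block only when
        the request's current block is full (or on its first token)."""
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(req_id, []).append(self.free_blocks.pop())
        self.lengths[req_id] = n + 1

    def finish(self, req_id):
        """Request done: return its blocks to the shared pool."""
        self.free_blocks.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)
```

Because blocks go back to the pool the moment a request finishes, short requests stop hogging memory reserved for a worst-case sequence length — and that reclaimed headroom is what lets vLLM pack more concurrent requests onto the same GPU.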
Strengths
- Highest throughput of the three options
- Continuous batching (handles concurrent requests efficiently)
- PagedAttention for memory efficiency
- Tensor parallelism (split one model's layers across multiple GPUs)
- Pipeline parallelism (split a model's layer stack across GPUs or nodes)
- OpenAI-compatible API
- Production-grade reliability
Limitations
- Effectively requires an NVIDIA GPU (CUDA); AMD ROCm support exists but is less mature
- More complex setup than Ollama
- Overkill for single-user local development
- Higher baseline resource requirements
The Decision Matrix
| Scenario | Tool | Why |
|---|---|---|
| Exploring models on your Mac | LM Studio | Visual, easy, no terminal needed |
| Local dev + testing | Ollama | Simple CLI, fast setup, good enough speed |
| Offline coding on the MTR | Ollama | Runs on laptop, no internet |
| Internal team API | vLLM | Handles multiple users, high throughput |
| Production customer-facing | vLLM | Reliability, batching, monitoring |
| SSH into remote server | Ollama or vLLM | Ollama for quick tests, vLLM for serving |
| Maximum control | llama.cpp | Custom quantization, exotic hardware |
| Cost-sensitive production | vLLM | Better hardware utilization = lower per-query cost |
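The matrix boils down to a few questions. A small sketch encoding that logic — the criteria are just this article's rules of thumb, nothing more:

```python
def pick_engine(need_gui=False, production=False, many_users=False,
                need_fine_control=False):
    """Rule-of-thumb engine choice, following the decision matrix:
    production or concurrency -> vLLM; fine-grained control ->
    raw llama.cpp; GUI preference -> LM Studio; otherwise Ollama."""
    if production or many_users:
        return "vLLM"
    if need_fine_control:
        return "llama.cpp"
    if need_gui:
        return "LM Studio"
    return "Ollama"   # the default for local development

# pick_engine(many_users=True)  -> "vLLM"
# pick_engine(need_gui=True)    -> "LM Studio"
# pick_engine()                 -> "Ollama"
```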
The HK Self-Hosting Stack
Here's a practical HK self-hosting stack:
Developer laptops: Ollama with DeepSeek-Coder or Qwen 7B/14B. Fast enough for coding assistance, runs offline.
Team inference server: vLLM on a machine with an A100 or 4090, serving a quantized DeepSeek or Qwen model sized to fit the card (frontier-scale models like DeepSeek-V3.2 or Qwen3-235B need multiple high-memory GPUs). The whole team points their tools at this endpoint.
Production: vLLM on cloud GPUs (Alibaba Cloud or Lambda Labs), with load balancing and monitoring. Multiple model replicas for redundancy.
Experimentation: LM Studio on someone's Mac for trying out new models before the team commits to deploying them.
The key insight: you'll probably use more than one of these tools. They serve different purposes at different stages. Ollama for dev, vLLM for prod, LM Studio for exploration. That's not redundancy — that's the right tool for each job.
Sources
- vLLM — GitHub
- vLLM Documentation
- LM Studio — Official Site
- llama.cpp — GitHub
- Ollama — Official Site
- PagedAttention: Efficient Memory Management for LLM Serving — arXiv
Self-hosting AI models in Hong Kong? Share your setup with us. Subscribe to the Hong Kong AI Podcast or reach out at contact@hongkongaipodcast.com.
Get notified when we publish new articles and episodes. No spam, just signal.