vLLM vs LM Studio vs llama.cpp: How to Self-Host AI Models in Hong Kong
If you're self-hosting AI models in Hong Kong — and increasingly, that's the smart move — you need to pick an inference engine. The three main options are vLLM, LM Studio, and llama.cpp (often via Ollama). Each has a clear use case.
Here's when to use what.
The Quick Answer
LM Studio — You want a GUI on your Mac or PC. Download a model, click run, start chatting. No terminal required.
Ollama (llama.cpp) — You want a CLI on your laptop or a lightweight server. One command to pull and run models. Great for development.
vLLM — You need production serving. Multiple users, high throughput, batching, GPU optimization. The serious option.
LM Studio: The Desktop App
What It Is
LM Studio is a desktop application for running LLMs locally. Available for Mac, Windows, and Linux. It provides a visual interface for downloading, managing, and chatting with models.
When to Use It
- You're not a developer (or you are, but you just want to chat with a model)
- You want to browse and download models from a visual catalog
- You want to compare different models side-by-side
- You're on a Mac and want the simplest possible setup
How It Works
Download the app. Browse the model library (it indexes Hugging Face). Click download on a model. Click run. Chat. That's it.
LM Studio handles quantization selection, memory management, and GPU acceleration automatically. It also exposes a local API server if you want to connect other tools.
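That local API server speaks the OpenAI chat-completions format. Here's a minimal sketch using only Python's standard library, assuming LM Studio's default port of 1234 and a model already loaded (the model name is a placeholder — LM Studio reports the actual loaded model's identifier):

```python
import json
import urllib.request

def build_chat_request(prompt, model="local-model",
                       base_url="http://localhost:1234/v1"):
    """Build an OpenAI-style chat-completions request for LM Studio's
    local server. Port 1234 is LM Studio's default; change base_url
    if you've configured a different one."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With LM Studio running and a model loaded, send it like any HTTP call:
# with urllib.request.urlopen(build_chat_request("Hello")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the format is OpenAI-compatible, any tool that accepts a custom base URL can talk to LM Studio the same way.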
Strengths
- Beautiful GUI
- Model discovery and management
- Side-by-side model comparison
- Automatic hardware optimization
- Local API server for tool integration
Limitations
- Desktop-only (no headless server option)
- Not designed for production serving
- Single-user focused
- Limited batch processing
Ollama (llama.cpp under the hood)
What It Is
Ollama is a CLI tool that wraps llama.cpp with a friendly interface. One command to install, one command to run a model. It exposes an OpenAI-compatible API on localhost.
When to Use It
- You prefer the terminal
- You're developing locally and need a quick model to test against
- You want to run models on a remote server via SSH
- You need an OpenAI-compatible endpoint for local development
- You want offline AI on your laptop (commuting on the MTR, spotty wifi)
How It Works
Install Ollama. Pull a model by name. Run it. Ollama handles downloading, quantization, and serving. It exposes an API at localhost:11434 that's compatible with OpenAI's format — so Cursor, OpenCode, and other tools can connect directly.
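Ollama actually exposes two request shapes on that port: its own native endpoint and the OpenAI-compatible one that editor tools expect. A sketch of both, using only the standard library and assuming the default port 11434 (the model name here is illustrative — use whatever you've pulled):

```python
import json
import urllib.request

OLLAMA = "http://localhost:11434"

def _post(path, body):
    """Helper: build a JSON POST request against the local Ollama server."""
    return urllib.request.Request(
        f"{OLLAMA}{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def native_generate_request(prompt, model="qwen2.5-coder"):
    """Ollama's native endpoint: POST /api/generate."""
    return _post("/api/generate",
                 {"model": model, "prompt": prompt, "stream": False})

def openai_chat_request(prompt, model="qwen2.5-coder"):
    """The OpenAI-compatible endpoint: POST /v1/chat/completions.
    Tools like Cursor or OpenCode only need this base URL to connect."""
    return _post("/v1/chat/completions",
                 {"model": model,
                  "messages": [{"role": "user", "content": prompt}]})
```

In practice you rarely write this by hand — you point an existing OpenAI client or editor plugin at `http://localhost:11434/v1` and it just works.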
Strengths
- Simplest CLI experience
- Huge model library
- OpenAI-compatible API out of the box
- Runs on CPU (no GPU required, though GPU helps)
- Low resource overhead
- Great for Apple Silicon Macs
Limitations
- One loaded model at a time (by default)
- Limited batching and throughput optimization
- Not ideal for high-concurrency production
- Basic quantization options compared to manual llama.cpp
Raw llama.cpp
Ollama wraps llama.cpp, but you can use llama.cpp directly for more control. This gives you fine-grained quantization options, custom sampling parameters, and the ability to run on exotic hardware. The tradeoff is more manual setup.
Use raw llama.cpp when:
- You need specific quantization (Q4_K_M, Q5_K_S, etc.)
- You're deploying on unusual hardware
- You need maximum control over inference parameters
- You're building a custom inference pipeline
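Choosing a quantization level is mostly arithmetic: file size is roughly parameter count times bits per weight. A back-of-envelope estimator — the bits-per-weight figures below are rough averages for llama.cpp's quant types, and actual GGUF files vary slightly by architecture:

```python
# Approximate average bits per weight for common llama.cpp quant types.
# Rough figures only; real GGUF sizes vary a little by model architecture.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_S": 5.5,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def model_size_gb(n_params_billions, quant):
    """Estimate GGUF file size in GB (and, roughly, the RAM/VRAM needed
    just to hold the weights) for a given quantization level."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params_billions * 1e9 * bits / 8 / 1e9

# e.g. a 7B model: ~4.2 GB at Q4_K_M, ~14 GB at F16 — which is why
# quantized 7B models fit comfortably on a 16 GB laptop.
```

Add a couple of GB on top for the KV cache and runtime overhead before deciding what fits your machine.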
vLLM: Production Serving
What It Is
vLLM is a high-throughput inference engine designed for production serving. It uses PagedAttention for efficient memory management and supports continuous batching for handling multiple concurrent requests.
When to Use It
- Multiple users hitting the same model
- You need high throughput (hundreds or thousands of requests/hour)
- You're serving a model as an internal API for your team or product
- You have GPU(s) and want maximum utilization
- You need production features: health checks, metrics, auto-scaling
How It Works
Install vLLM (Python package). Point it at a model (Hugging Face ID or local path). It loads the model onto your GPU(s) and starts serving an OpenAI-compatible API.
vLLM's PagedAttention algorithm manages GPU memory like an operating system manages RAM — dynamically allocating and freeing memory blocks as requests come and go. This means it can serve more concurrent requests on the same hardware.
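A toy sketch of that idea (not vLLM's actual implementation): the KV cache is carved into fixed-size blocks, each request grabs blocks only as its sequence grows, and finished requests return their blocks to a shared free pool.

```python
class PagedKVCache:
    """Toy paged KV-cache allocator. Mimics the idea behind
    PagedAttention: fixed-size blocks from a shared free pool,
    allocated on demand and freed when a request finishes."""

    def __init__(self, total_blocks, block_size=16):
        self.block_size = block_size            # tokens per block
        self.free_blocks = list(range(total_blocks))
        self.tables = {}                        # req_id -> list of block ids
        self.lengths = {}                       # req_id -> token count

    def append_token(self, req_id):
        """Account for one more token; allocate a new block only when
        the request's current block is full (or on its first token)."""
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(req_id, []).append(self.free_blocks.pop())
        self.lengths[req_id] = n + 1

    def finish(self, req_id):
        """Request done: return its blocks to the shared pool."""
        self.free_blocks.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)
```

Because blocks go back to the pool the moment a request finishes, short requests stop hogging memory reserved for a worst-case sequence length — and that reclaimed headroom is what lets vLLM pack more concurrent requests onto the same GPU.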
Strengths
- Highest throughput of the three options
- Continuous batching (handles concurrent requests efficiently)
- PagedAttention for memory efficiency
- Tensor parallelism (split one model's layers across multiple GPUs)
- Pipeline parallelism (split a model's layer stack across GPUs or nodes)
- OpenAI-compatible API
- Production-grade reliability
Limitations
- Effectively requires an NVIDIA GPU (CUDA); AMD ROCm support exists but is less mature
- More complex setup than Ollama
- Overkill for single-user local development
- Higher baseline resource requirements
The Decision Matrix
| Scenario | Tool | Why |
|---|---|---|
| Exploring models on your Mac | LM Studio | Visual, easy, no terminal needed |
| Local dev + testing | Ollama | Simple CLI, fast setup, good enough speed |
| Offline coding on the MTR | Ollama | Runs on laptop, no internet |
| Internal team API | vLLM | Handles multiple users, high throughput |
| Production customer-facing | vLLM | Reliability, batching, monitoring |
| SSH into remote server | Ollama or vLLM | Ollama for quick tests, vLLM for serving |
| Maximum control | llama.cpp | Custom quantization, exotic hardware |
| Cost-sensitive production | vLLM | Better hardware utilization = lower per-query cost |
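The matrix boils down to a few questions. A small sketch encoding that logic — the criteria are just this article's rules of thumb, nothing more:

```python
def pick_engine(need_gui=False, production=False, many_users=False,
                need_fine_control=False):
    """Rule-of-thumb engine choice, following the decision matrix:
    production or concurrency -> vLLM; fine-grained control ->
    raw llama.cpp; GUI preference -> LM Studio; otherwise Ollama."""
    if production or many_users:
        return "vLLM"
    if need_fine_control:
        return "llama.cpp"
    if need_gui:
        return "LM Studio"
    return "Ollama"   # the default for local development

# pick_engine(many_users=True)  -> "vLLM"
# pick_engine(need_gui=True)    -> "LM Studio"
# pick_engine()                 -> "Ollama"
```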
The HK Self-Hosting Stack
Here's a practical HK self-hosting stack:
Developer laptops: Ollama with DeepSeek-Coder or Qwen 7B/14B. Fast enough for coding assistance, runs offline.
Team inference server: vLLM on a machine with an A100 or 4090, serving a quantized DeepSeek or Qwen model sized to fit the card (frontier-scale models like DeepSeek-V3.2 or Qwen3-235B need multiple high-memory GPUs). The whole team points their tools at this endpoint.
Production: vLLM on cloud GPUs (Alibaba Cloud or Lambda Labs), with load balancing and monitoring. Multiple model replicas for redundancy.
Experimentation: LM Studio on someone's Mac for trying out new models before the team commits to deploying them.
The key insight: you'll probably use more than one of these tools. They serve different purposes at different stages. Ollama for dev, vLLM for prod, LM Studio for exploration. That's not redundancy — that's the right tool for each job.
Sources
- vLLM — GitHub
- vLLM Documentation
- LM Studio — Official Site
- llama.cpp — GitHub
- Ollama — Official Site
- PagedAttention: Efficient Memory Management for LLM Serving — arXiv
Self-hosting AI models in Hong Kong? Share your setup with us. Subscribe to the Hong Kong AI Podcast or reach out at contact@hongkongaipodcast.com.
Get notified when we publish new articles and episodes. No spam, just signal.