Practitioner Guides

Self-Hosting DeepSeek in Hong Kong: A Practical Guide

Hong Kong AI Podcast · 2026-03-07 · 7 min read · DeepSeek, Self-Hosting, Ollama, vLLM, Hong Kong

You can use DeepSeek's API from Hong Kong without any issues. But some teams want more: data stays on their infrastructure, no API dependency, no per-token costs, no risk of service changes. Self-hosting gives you all of that.

Here's how to run DeepSeek models on your own hardware, from a MacBook to a production GPU cluster.

Choose Your Model

Not every DeepSeek model is practical to self-host. Here's the realistic breakdown:

DeepSeek-Coder-V2 (16B) — Runs on a MacBook Pro with 32GB RAM. Excellent for coding tasks. This is where most people start.

DeepSeek-V2.5 (236B MoE, ~21B active) — The MoE design keeps per-token compute modest, but all 236B weights still have to be loaded: even at 4-bit quantization that's roughly 120GB, so plan on multiple high-end GPUs or aggressive CPU offloading rather than a single 24GB card.

DeepSeek-V3.2 (671B MoE, ~37B active) — Needs multiple high-end GPUs or a cloud deployment. Not practical for laptops or single-GPU setups, but feasible for teams with budget.

DeepSeek-R1 — The reasoning model. Various sizes available, from distilled versions that run on consumer hardware to the full model requiring serious compute.
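A quick way to sanity-check these hardware claims is to estimate weight memory from parameter count and quantization level. A rough sketch (real runtimes add KV cache and overhead on top):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Weight-only memory footprint in GB: parameters x bits per weight.
    Ignores KV cache and runtime overhead, so add 20-50% headroom."""
    return params_billions * bits_per_weight / 8

print(weight_memory_gb(16, 4))   # 8.0   -- 16B at 4-bit fits a 32GB MacBook
print(weight_memory_gb(236, 4))  # 118.0 -- MoE loads ALL weights, not just the ~21B active
```

Note that a mixture-of-experts design reduces compute per token, not resident memory: every expert has to be in memory even if only a few fire per token.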

Path 1: Ollama on Your Laptop

The easiest way to get started. Takes about 5 minutes.

What you need: A Mac with Apple Silicon (M1 or later) and at least 16GB RAM. 32GB recommended for larger models. Or a Linux/Windows machine with a decent GPU.

Install Ollama: Download from ollama.com. One installer, no dependencies.

Pull a model: Open your terminal and pull the DeepSeek model. Ollama handles quantization and optimization automatically. The download is a few gigabytes depending on the model.
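Concretely, assuming the `deepseek-coder-v2` tag in the Ollama model library (check ollama.com/library for current tag names before pulling):

```shell
# Download the model (several GB; Ollama picks a quantized build automatically)
ollama pull deepseek-coder-v2

# Confirm what's installed locally
ollama list
```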

Run it: Start a chat session from the terminal. Or run the Ollama server and connect Cursor, OpenCode, or any other tool that supports an OpenAI-compatible API endpoint.
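For tool integration, Ollama's server (started by the desktop app, or with `ollama serve`) listens on port 11434 and speaks the OpenAI chat-completions format. A minimal sketch of the request shape — the base URL assumes a default local install, and the model tag is the one pulled above:

```python
import json

OLLAMA_BASE = "http://localhost:11434/v1"  # default Ollama port

def chat_request(base_url: str, model: str, prompt: str):
    """Build the URL and JSON body for an OpenAI-compatible chat completion."""
    url = f"{base_url}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode("utf-8")
    return url, body

url, body = chat_request(OLLAMA_BASE, "deepseek-coder-v2", "Write a quicksort in Python.")
# POST `body` to `url` with Content-Type: application/json (urllib, requests,
# or any OpenAI client pointed at OLLAMA_BASE) to get a completion back.
```

Because this is the same request shape the hosted API accepts, tools like Cursor or OpenCode only need the base URL changed.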

What to expect: DeepSeek-Coder at 16B runs comfortably on an M2 MacBook Pro. Responses are slower than the API (a few tokens per second vs. near-instant), but perfectly usable for coding assistance and general queries. You're trading speed for privacy and zero cost.

Path 2: vLLM on a GPU Server

For production use or teams that need faster inference.

What you need: A server or cloud instance with an NVIDIA GPU. Minimum 24GB VRAM for smaller models, 80GB+ for larger ones.

Cloud options accessible from HK:

  • [Alibaba Cloud ECS](https://www.alibabacloud.com/product/gpu/pricing) with GPU instances (closest geographically)
  • [Lambda Labs](https://lambda.ai/pricing) (US-based but no geographic restrictions, H100 at $2.99/hr)
  • [vast.ai](https://vast.ai/pricing) (marketplace for GPU rentals, cheapest option, from $0.06/hr)
  • Cyberport Supercomputing Centre (if you qualify for access)

Install vLLM: Set up a Python environment, install vLLM. It handles model loading, quantization, and serving.

Serve the model: vLLM exposes an OpenAI-compatible API endpoint. Point your applications at this endpoint just like you'd point at DeepSeek's API or OpenAI's API.
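As a sketch, using a smaller DeepSeek checkpoint from Hugging Face (substitute the model ID and context length that fit your GPU; flag values here are illustrative):

```shell
# Install vLLM into a fresh virtualenv (CUDA build; see vLLM docs for your CUDA version)
pip install vllm

# Serve an OpenAI-compatible endpoint on port 8000
vllm serve deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --max-model-len 8192
```

Applications then talk to http://your-server:8000/v1 exactly as they would to the hosted API.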

What to expect: vLLM with a 4090 or A100 gives you near-API-speed responses. It supports batching (serving multiple users simultaneously), streaming, and all the features you need for production. Typical cost for a cloud A100: $1-2/hour.

Path 3: llama.cpp for Edge Deployment

For running models on constrained hardware — edge devices, older machines, or minimal cloud instances.

What you need: Almost anything. llama.cpp runs on CPUs, which means any server or laptop can run it. GPU acceleration is optional.

What to expect: Performance is in the same ballpark as Ollama (which uses llama.cpp under the hood), but working with llama.cpp directly gives you the most control over quantization levels and memory usage. Useful when you need to fit a model into tight memory constraints.
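A minimal sketch, assuming a quantized GGUF build of the model downloaded from Hugging Face (the filename is illustrative) and a recent llama.cpp checkout:

```shell
# Build llama.cpp from source (CPU-only by default; add -DGGML_CUDA=ON for NVIDIA GPUs)
cmake -B build && cmake --build build --config Release

# Run a quantized GGUF model interactively on CPU
./build/bin/llama-cli -m deepseek-coder-v2-lite-q4_k_m.gguf -p "Explain mutexes" -n 256
```

The quantization level is baked into the GGUF file you download (q4_k_m, q8_0, and so on), which is where the fine-grained memory control comes from.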

Privacy and Compliance Considerations

The main reasons HK teams self-host:

Data sovereignty. Your prompts and data never leave your infrastructure. For finance, healthcare, and legal applications, this can be a regulatory requirement.

No API terms. When you self-host, you're bound by the MIT license (permissive) rather than API terms of service (which can change). You control the model permanently.

Cost at scale. API pricing is per-token. Self-hosting has a fixed cost (hardware or cloud rental). At high usage levels — thousands of requests per day — self-hosting is significantly cheaper.

Availability. No dependency on external services. If DeepSeek's API goes down or changes pricing, your self-hosted model keeps running.

The Hybrid Approach

Rather than going all-in on self-hosting, the practical pattern for most teams is hybrid:

  • API for development and testing (fast, no infrastructure to manage)
  • Self-hosted for production (cost control, privacy, reliability)
  • Ollama on laptops for offline work and experimentation

This gives you the speed of API access during development and the control of self-hosting in production. The OpenAI-compatible API format means switching between API and self-hosted requires changing only the endpoint URL.
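Because every path above exposes the same OpenAI-compatible surface, the switch can live in one place in your code. A sketch — the internal hostname and the environment-variable name are assumptions for illustration, not a standard:

```python
import os

# One OpenAI-compatible request format, three interchangeable backends.
ENDPOINTS = {
    "api": "https://api.deepseek.com/v1",    # hosted DeepSeek API
    "prod": "http://vllm.internal:8000/v1",  # self-hosted vLLM (example hostname)
    "laptop": "http://localhost:11434/v1",   # local Ollama
}

def resolve_endpoint(env=None):
    """Pick the base URL for the current environment (default from $LLM_ENV)."""
    env = env or os.environ.get("LLM_ENV", "laptop")
    return ENDPOINTS[env]

print(resolve_endpoint("laptop"))  # http://localhost:11434/v1
```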

What This Costs

Ollama on a laptop: Free (you already own the hardware)

vLLM on cloud GPU: $1-3/hour for inference, $5-10/hour for larger models

Dedicated GPU server: $2,000-10,000 one-time for a machine with an RTX 4090 or A100

Compare to API costs: DeepSeek API at heavy usage might cost $100-500/month. Self-hosting makes economic sense when your monthly API bill exceeds the amortized cost of hardware.
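The break-even point is simple arithmetic. A sketch with the example figures from this section (the $50/month electricity estimate is an assumption; plug in your own numbers):

```python
def breakeven_months(hardware_cost, monthly_api_bill, monthly_running_cost=50.0):
    """Months until a one-time hardware purchase beats the ongoing API bill."""
    return hardware_cost / (monthly_api_bill - monthly_running_cost)

# $3,000 RTX 4090 workstation vs. a $500/month API bill, ~$50/month electricity:
print(round(breakeven_months(3000, 500), 1))  # 6.7
```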

For most small teams and startups in HK, the sweet spot is: Ollama on laptops for development, DeepSeek API for production, with self-hosting when you hit scale.




Running your own AI infrastructure in Hong Kong? We'd love to hear about your setup. Subscribe to the Hong Kong AI Podcast or reach out at contact@hongkongaipodcast.com.
