StepFun 3.5 Flash: 11B Active Parameters That Punch Above Their Weight
Most AI model announcements focus on getting bigger. StepFun 3.5 Flash is interesting because it focuses on getting smarter about being small.
It's a 196-billion-parameter Mixture-of-Experts (MoE) model. But here's the trick: only 11 billion parameters are active for any given input. That means it runs fast and cheap while scoring 97.3% on AIME 2025 and 74.4% on SWE-bench. And it's Apache 2.0 licensed.
For Hong Kong developers who care about cost and latency — which is basically everyone building real products — this is the model to watch.
The MoE Trick, Explained Simply
Traditional dense models activate every parameter for every input: a 200B dense model engages all 200B parameters for each token it generates. That's powerful, but expensive and slow.
Mixture of Experts splits the model into specialized "experts." For each input, a routing mechanism selects a small subset of experts. StepFun 3.5 Flash has 196B total parameters but routes each token through only 11B worth of computation.
The result: you get the knowledge encoded in 196B parameters at roughly the speed and cost of an 11B model, one of the best knowledge-to-compute ratios among current open models.
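The routing step above can be sketched in a few lines. This is a toy top-k router, purely illustrative: StepFun has not published its exact routing code, and the expert count and k below are made-up numbers.

```python
import math
import random

random.seed(0)

def route_token(scores, k=2):
    """Toy top-k MoE router: keep only the k highest-scoring experts.

    `scores` holds one gating score per expert; a real router computes
    these from the token's hidden state. Everything here is illustrative.
    """
    # Indices of the k experts with the highest gating scores.
    top_k = sorted(range(len(scores)), key=lambda i: scores[i])[-k:]
    # Softmax over just the selected scores gives the mixing weights.
    m = max(scores[i] for i in top_k)
    exps = [math.exp(scores[i] - m) for i in top_k]
    total = sum(exps)
    weights = [e / total for e in exps]
    return top_k, weights

# 64 experts exist, but only 2 run for this token; the rest cost nothing.
gate_scores = [random.gauss(0, 1) for _ in range(64)]
experts, weights = route_token(gate_scores, k=2)
```

The key property is that compute scales with k, not with the total expert count, which is exactly why 196B total parameters can cost like 11B active ones.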
The Benchmarks
These numbers got people's attention:
- AIME 2025: 97.3% — this is a serious math competition benchmark
- SWE-bench: 74.4% — real-world software engineering tasks
- Apache 2.0 license — use it for anything, commercially, no restrictions
To put this in context: 97.3% on AIME puts it in the same league as DeepSeek R1 and other frontier reasoning models. But it runs significantly faster because of the smaller active parameter count.
Why This Matters in Hong Kong
Cost
API pricing follows active parameters, not total parameters. An 11B active model costs a fraction of a 200B+ dense model per token. For startups bootstrapping with limited funding — which describes many startups in HK — this is real money saved.
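A quick back-of-envelope makes the point concrete. The dollar figure and monthly volume below are hypothetical placeholders, not StepFun's actual pricing; only the 11/200 active-parameter ratio comes from the article.

```python
# Hypothetical price for a 200B dense model, $ per 1M tokens (illustrative only).
PRICE_PER_1M_TOKENS_DENSE = 2.00

# If price scales roughly with active parameters, an 11B-active MoE
# costs about 11/200 of the dense model per token.
active_fraction = 11 / 200
moe_price = PRICE_PER_1M_TOKENS_DENSE * active_fraction

# Hypothetical monthly volume for a mid-sized production app.
monthly_tokens = 500_000_000
dense_bill = PRICE_PER_1M_TOKENS_DENSE * monthly_tokens / 1_000_000
moe_bill = moe_price * monthly_tokens / 1_000_000
# Under these assumptions: $1000/month dense vs $55/month MoE.
```

The exact numbers will differ by provider, but the scaling argument holds: per-token cost tracks active compute, not total parameter count.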
Latency
Fewer active parameters means faster inference. If you're building user-facing applications where response time matters, StepFun 3.5 Flash gives you frontier-level quality with near-instant responses.
Self-hosting
11B active parameters means inference needs far less compute than a dense model of the same total size. Note that all 196B weights still have to fit in memory, so realistic setups are a single high-memory datacenter GPU with quantized weights, or a small multi-GPU node. For HK teams self-hosting for privacy or compliance reasons, this still dramatically lowers the barrier.
The HKEX Connection
StepFun is reportedly seeking an IPO on the Hong Kong Stock Exchange, joining Zhipu (Z.ai) and MiniMax, both of which listed on HKEX in January 2026. Hong Kong is becoming the listing destination of choice for Chinese AI companies — and having these companies listed locally means better access to their services and ecosystems.
How to Use It
API: Available through StepFun's platform. OpenAI-compatible API format.
Hugging Face: Full weights available for download under Apache 2.0.
NVIDIA NIM: Available as a pre-optimized deployment through NVIDIA's inference microservice platform. This is the easiest path to production deployment.
Ollama/vLLM: Community quantizations are available for local deployment.
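Because the API is OpenAI-compatible, calling it looks the same as calling any other such endpoint. This is a minimal sketch: the base URL and model id below are placeholders, so check StepFun's documentation for the real values before use.

```python
import json
import urllib.request

# Placeholder endpoint and model id -- verify against StepFun's docs.
BASE_URL = "https://api.stepfun.com/v1"

def build_chat_request(prompt, model="step-3.5-flash", api_key="YOUR_API_KEY"):
    """Build an OpenAI-compatible chat-completions request (not sent here)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_chat_request("Summarise MoE routing in one line.")
# urllib.request.urlopen(req) would send it; any OpenAI-compatible SDK
# (e.g. the openai Python package with a custom base_url) works the same way.
```

The practical upshot of OpenAI compatibility: switching an existing app to StepFun 3.5 Flash is usually just a base-URL and model-name change.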
When to Choose StepFun 3.5 Flash
Pick it over DeepSeek V3.2 when you need speed and cost efficiency over raw capability. Pick it over smaller models when you need frontier-level accuracy. It sits in a sweet spot that most production applications actually need — good enough quality, fast enough speed, cheap enough cost.
The best model isn't always the biggest one. Sometimes it's the one that gives you 97% of the quality at 10% of the cost. That's StepFun 3.5 Flash.
Sources
- Step 3.5 Flash — GitHub
- Step 3.5 Flash — Hugging Face
- Step 3.5 Flash Official Blog
- Step 3.5 Flash Technical Report — arXiv
- Step 3.5 Flash on OpenRouter
- Benchmarks — DeepWiki
Interested in the efficiency frontier of AI models? Subscribe to the Hong Kong AI Podcast for more on the tools and models shaping AI in Hong Kong.
Get notified when we publish new articles and episodes. No spam, just signal.