StepFun 3.5 Flash: 11B Active Parameters That Punch Above Their Weight
Most AI model announcements focus on getting bigger. StepFun 3.5 Flash is interesting because it focuses on getting smarter about being small.
It's a 196-billion-parameter Mixture-of-Experts (MoE) model. But here's the trick: only 11 billion parameters are active for any given input. That means it runs fast and cheap while scoring 97.3% on AIME 2025 and 74.4% on SWE-bench. And it's Apache 2.0 licensed.
For Hong Kong developers who care about cost and latency — which is basically everyone building real products — this is the model to watch.
The MoE Trick, Explained Simply
Traditional dense models activate every parameter for every input: a 200B dense model engages all 200B parameters for each token it generates. That's powerful, but expensive and slow.
Mixture of Experts splits the model into specialized "experts." For each input, a routing mechanism selects a small subset of experts. StepFun 3.5 Flash has 196B total parameters but routes each token through only 11B worth of computation.
The result: you get the knowledge encoded in 196B parameters at roughly the speed and cost of an 11B model, one of the best knowledge-to-compute ratios among current open models.
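The routing step above can be sketched in a few lines. This is a toy top-k router, purely illustrative: StepFun has not published its exact routing code, and the expert count and k below are made-up numbers.

```python
import math
import random

random.seed(0)

def route_token(scores, k=2):
    """Toy top-k MoE router: keep only the k highest-scoring experts.

    `scores` holds one gating score per expert; a real router computes
    these from the token's hidden state. Everything here is illustrative.
    """
    # Indices of the k experts with the highest gating scores.
    top_k = sorted(range(len(scores)), key=lambda i: scores[i])[-k:]
    # Softmax over just the selected scores gives the mixing weights.
    m = max(scores[i] for i in top_k)
    exps = [math.exp(scores[i] - m) for i in top_k]
    total = sum(exps)
    weights = [e / total for e in exps]
    return top_k, weights

# 64 experts exist, but only 2 run for this token; the rest cost nothing.
gate_scores = [random.gauss(0, 1) for _ in range(64)]
experts, weights = route_token(gate_scores, k=2)
```

The key property is that compute scales with k, not with the total expert count, which is exactly why 196B total parameters can cost like 11B active ones.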
The Benchmarks
These numbers got people's attention:
- AIME 2025: 97.3% — this is a serious math competition benchmark
- SWE-bench: 74.4% — real-world software engineering tasks
- Apache 2.0 license — use it for anything, commercially, no restrictions
To put this in context: 97.3% on AIME puts it in the same league as DeepSeek R1 and other frontier reasoning models. But it runs significantly faster because of the smaller active parameter count.
Why This Matters in Hong Kong
Cost
API pricing follows active parameters, not total parameters. An 11B active model costs a fraction of a 200B+ dense model per token. For startups bootstrapping with limited funding — which describes many startups in HK — this is real money saved.
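A quick back-of-envelope makes the point concrete. The dollar figure and monthly volume below are hypothetical placeholders, not StepFun's actual pricing; only the 11/200 active-parameter ratio comes from the article.

```python
# Hypothetical price for a 200B dense model, $ per 1M tokens (illustrative only).
PRICE_PER_1M_TOKENS_DENSE = 2.00

# If price scales roughly with active parameters, an 11B-active MoE
# costs about 11/200 of the dense model per token.
active_fraction = 11 / 200
moe_price = PRICE_PER_1M_TOKENS_DENSE * active_fraction

# Hypothetical monthly volume for a mid-sized production app.
monthly_tokens = 500_000_000
dense_bill = PRICE_PER_1M_TOKENS_DENSE * monthly_tokens / 1_000_000
moe_bill = moe_price * monthly_tokens / 1_000_000
# Under these assumptions: $1000/month dense vs $55/month MoE.
```

The exact numbers will differ by provider, but the scaling argument holds: per-token cost tracks active compute, not total parameter count.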
Latency
Fewer active parameters means faster inference. If you're building user-facing applications where response time matters, StepFun 3.5 Flash gives you frontier-level quality with near-instant responses.
Self-hosting
11B active parameters means inference needs far less compute than a dense model of the same total size. Note that all 196B weights still have to fit in memory, so realistic setups are a single high-memory datacenter GPU with quantized weights, or a small multi-GPU node. For HK teams self-hosting for privacy or compliance reasons, this still dramatically lowers the barrier.
The HKEX Connection
StepFun is reportedly seeking an IPO on the Hong Kong Stock Exchange, joining Zhipu (Z.ai) and MiniMax, both of which listed on HKEX in January 2026. Hong Kong is becoming the listing destination of choice for Chinese AI companies — and having these companies listed locally means better access to their services and ecosystems.
How to Use It
API: Available through StepFun's platform. OpenAI-compatible API format.
Hugging Face: Full weights available for download under Apache 2.0.
NVIDIA NIM: Available as a pre-optimized deployment through NVIDIA's inference microservice platform. This is the easiest path to production deployment.
Ollama/vLLM: Community quantizations are available for local deployment.
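Because the API is OpenAI-compatible, calling it looks the same as calling any other such endpoint. This is a minimal sketch: the base URL and model id below are placeholders, so check StepFun's documentation for the real values before use.

```python
import json
import urllib.request

# Placeholder endpoint and model id -- verify against StepFun's docs.
BASE_URL = "https://api.stepfun.com/v1"

def build_chat_request(prompt, model="step-3.5-flash", api_key="YOUR_API_KEY"):
    """Build an OpenAI-compatible chat-completions request (not sent here)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_chat_request("Summarise MoE routing in one line.")
# urllib.request.urlopen(req) would send it; any OpenAI-compatible SDK
# (e.g. the openai Python package with a custom base_url) works the same way.
```

The practical upshot of OpenAI compatibility: switching an existing app to StepFun 3.5 Flash is usually just a base-URL and model-name change.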
When to Choose StepFun 3.5 Flash
Pick it over DeepSeek V3.2 when you need speed and cost efficiency over raw capability. Pick it over smaller models when you need frontier-level accuracy. It sits in a sweet spot that most production applications actually need — good enough quality, fast enough speed, cheap enough cost.
The best model isn't always the biggest one. Sometimes it's the one that gives you 97% of the quality at 10% of the cost. That's StepFun 3.5 Flash.
Sources
- Step 3.5 Flash — GitHub
- Step 3.5 Flash — Hugging Face
- Step 3.5 Flash Official Blog
- Step 3.5 Flash Technical Report — arXiv
- Step 3.5 Flash on OpenRouter
- Benchmarks — DeepWiki
Interested in the efficiency frontier of AI models? Subscribe to the Hong Kong AI Podcast for more on the tools and models shaping AI in Hong Kong.
Get notified when we publish new articles and episodes. No spam, just signal.