Chinese Internet Incident Case Studies (Part 3)
Introduction
Part 2 covered Alibaba Cloud’s network outage and data center fire. In this third installment, we examine three incidents involving Chinese AI services.
Since 2025, AI services like DeepSeek and Qwen (Alibaba) have spread rapidly. AI inference requests carry a per-request computational cost orders of magnitude greater than ordinary web requests, making them structurally more vulnerable to both attacks and sudden demand spikes than traditional web services. Although each of the three cases involves a different type of failure, the underlying thread is this shared “cost structure unique to AI services.”
Case 3: DeepSeek Large-Scale DDoS Attack (January 25–30, 2025)
Details
When DeepSeek-R1 was released on January 20, 2025, its inference performance (rivaling OpenAI’s o1 at a fraction of the cost) sent shockwaves through markets, wiping roughly $590 billion (approximately ¥91 trillion) from NVIDIA’s market cap in a single day on the Nasdaq. Riding that wave of attention, a coordinated DDoS attack began on January 25, unfolding in three waves.
According to analysis by cybersecurity firm NSFOCUS, the attack concentrated on DeepSeek’s API endpoint (api.deepseek.com), combining NTP reflection and Memcached reflection techniques, with each wave lasting an average of 35 minutes. When DeepSeek switched the IP address resolving its API on January 28, attackers immediately tracked the change and launched a new wave — one that exceeded the previous day’s volume by more than 100x.
During the same period, external security researchers discovered that a ClickHouse database had been inadvertently left publicly accessible, exposing user chat histories and other data.
Response Highlights
DeepSeek publicly acknowledged on January 27 that it was under “large-scale malicious attack,” restricted new user registration to mainland Chinese phone numbers only, and temporarily suspended API services. The sophistication of the attack was notable: the attackers’ ability to immediately track IP address changes led multiple security firms to characterize this as “an attack by a professional team with advance planning,” distinctly different in character from opportunistic vandalism.
Root Cause
The direct cause was an organized external DDoS attack, but the structural backdrop is AI inference’s cost profile. Each AI inference request carries a computational load orders of magnitude greater than a standard web request, meaning that an attacker’s spoofed requests can readily crowd out legitimate users.
Key Takeaways
- Design DDoS resilience for AI services proactively. Bandwidth-based defenses alone are insufficient. Rate limiting calibrated to per-inference cost, at both the token level and the user level, must be designed in from the start (a minimal sketch follows this list).
- Keep status pages and incident communications independent of the main service. Even when the primary service is down, a separate-domain status site and official social media channels must be able to broadcast incident updates. This is the first line of defense against user anxiety spreading out of control.
- Audit database exposure before rapid growth hits. The ClickHouse misconfiguration is a textbook case of “growth velocity outrunning security audit velocity.” Security review should be a mandatory step in the design cycle before scaling up.
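To make the first takeaway concrete, here is a minimal Python sketch of rate limiting metered by estimated inference cost rather than request count: each request is charged its prompt tokens plus the output tokens it reserves. The class names and budget figures are illustrative assumptions, not DeepSeek’s actual implementation.

```python
import time

# Illustrative token-bucket limiter metered in estimated inference tokens
# (prompt + reserved output), not request count. Budget numbers are
# arbitrary assumptions for this sketch.

class TokenBudgetBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # max token budget a user can bank
        self.refill_rate = refill_rate  # budget restored per second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def try_spend(self, cost: float) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False

class InferenceRateLimiter:
    def __init__(self, per_user_capacity: float = 20_000, per_user_refill: float = 200):
        self.buckets: dict[str, TokenBudgetBucket] = {}
        self.per_user_capacity = per_user_capacity
        self.per_user_refill = per_user_refill

    def admit(self, user_id: str, prompt_tokens: int, max_output_tokens: int) -> bool:
        # Charge the worst-case cost up front: output tokens dominate
        # inference cost, so a request reserving a long completion pays more.
        cost = prompt_tokens + max_output_tokens
        bucket = self.buckets.setdefault(
            user_id, TokenBudgetBucket(self.per_user_capacity, self.per_user_refill))
        return bucket.try_spend(cost)

limiter = InferenceRateLimiter()
if not limiter.admit("user-123", prompt_tokens=850, max_output_tokens=4_096):
    print("429 Too Many Requests")  # rejected before the request reaches a GPU
```

The point is the unit of accounting: a flood of cheap spoofed requests that each reserve long completions gets throttled well before it saturates GPU capacity, which purely bandwidth-based defenses would miss.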
Case 4: Alibaba Qwen Lunar New Year Campaign Crash (February 6, 2026)
Details
On February 6, 2026, just before the Chinese Lunar New Year, Alibaba launched a campaign to drive adoption of its AI chatbot Qwen, backed by a budget of 3 billion RMB (approximately ¥63 billion / $430 million). The headline offer: a free bubble tea coupon worth 25 RMB, redeemable via Qwen, at more than 300,000 locations nationwide.
Within just nine hours of launch, over 10 million orders had flooded in, crashing Qwen’s backend. The chatbot began responding with messages like “I am a language model and have no hands. Please order through another delivery app.” Alibaba’s Hong Kong-listed shares fell 2.88% the same day. The company announced the following day that it was “urgently adding server resources” and extended the coupon validity period to February 28 as a compensatory measure.
Response Highlights
The engineering team deployed additional GPU instances within 40 minutes of the outage, partially restoring service. But the damage to user experience — the inability to participate in the campaign — had already occurred, and the coupon extension failed to fully absorb the backlash on social media. It later emerged that Alibaba’s internal teams had in fact identified peak-load risk in advance, but competitive pressure from Tencent, Baidu, and ByteDance had pushed the campaign timeline forward, leaving insufficient time for load testing.
Root Cause
Traditional e-commerce systems (capable of processing 80–90 million transactions per day during Singles’ Day) and AI agent requests differ fundamentally in per-request computational cost, and no AI-specific demand forecasting model was in place. Compounding the failure, WeChat’s blocking of QR-code sharing concentrated inbound traffic onto specific API endpoints.
Key Takeaways
- Build infrastructure assumptions around the scenario where the campaign succeeds. Marketing and engineering teams must operate on the same timeline. Matching campaign scale estimates (expected participants, order volume) against infrastructure capacity should be a mandatory condition for release approval.
- AI inference cost cannot be estimated using e-commerce order cost as a proxy. The per-request computational cost of AI agent requests can be two to three orders of magnitude greater than e-commerce order processing. Load testing tools and threshold settings need to be redesigned for an AI inference context (see the back-of-envelope check after this list).
- A clever error message is no substitute for communicating the recovery timeline and compensation policy. Qwen’s responses went viral on social media, but what users actually needed was “when will this work again?” In an outage, user communications should prioritize a clear recovery estimate and compensation policy.
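As an illustration of the first two takeaways, here is a back-of-envelope capacity check that could serve as a release gate: marketing’s campaign estimates on one side, the provisioned GPU fleet on the other. Every figure below (participant counts, tokens per agent turn, per-GPU throughput, fleet size) is an assumed placeholder, not Alibaba’s data.

```python
import math

# Pre-launch release gate: compare the campaign's estimated peak inference
# load against the provisioned GPU fleet. All numbers are assumed
# placeholders for illustration.

PEAK_HOUR_FRACTION = 0.20  # assumed share of daily traffic in the worst hour

def required_gpus(expected_users: int,
                  requests_per_user: float,
                  tokens_per_request: int,     # an AI agent turn, not an order row
                  gpu_tokens_per_second: int,  # sustained decode throughput per GPU
                  headroom: float = 2.0) -> int:
    peak_hour_requests = expected_users * requests_per_user * PEAK_HOUR_FRACTION
    peak_tokens_per_second = peak_hour_requests * tokens_per_request / 3600
    return math.ceil(peak_tokens_per_second / gpu_tokens_per_second * headroom)

needed = required_gpus(expected_users=10_000_000,   # campaign estimate
                       requests_per_user=3,
                       tokens_per_request=2_000,
                       gpu_tokens_per_second=2_500)
provisioned = 2_000  # assumed fleet size
if needed > provisioned:
    raise SystemExit(f"Release blocked: need ~{needed} GPUs, have {provisioned}")
```

Running the same arithmetic with an e-commerce order-cost proxy (a few database writes per transaction) would clear this gate easily, which is precisely the trap the second takeaway describes.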
Case 5: DeepSeek’s Largest Outage Since Launch (March 29–30, 2026)
Details
At approximately 21:35 on the evening of March 29, 2026, a major outage began affecting DeepSeek’s chat service. Around two hours later, at 23:23, the incident was marked as resolved — but within less than an hour, service degraded again. A full resolution was not declared until 10:33 AM on March 30, resulting in a total downtime of more than eight hours (8 hours and 13 minutes, according to independent monitoring service StatusGator).
Given that DeepSeek had maintained uptime of approximately 99% for more than 14 months since the R1 model’s release, this incident drew attention as “the largest outage since the service launched.” During the outage, observers noted what appeared to be a change in the internal model identifier away from DeepSeek-V3, fueling speculation that a backend test deployment of a next-generation model (V4) may have triggered the failure. Notably, StatusGator’s monitoring history already showed a 40-minute outage on March 5 and a 75-minute outage on March 10, making this 8-hour event read as a build-up of warning signs finally breaking through.
Response Highlights
DeepSeek declared a “Major Outage” on its status page and issued multiple situation updates, but provided no explanation of the root cause. The absence of any “why did this happen” explanation — only “the incident has been resolved” — drew criticism from across the industry. According to reporting by Caixin, daily AI token call volume within China exceeded 140 trillion during this period, suggesting that the gap between surging demand and infrastructure expansion speed was a contributing factor.
Root Cause
DeepSeek has not officially disclosed the cause, and no definitive analysis exists. Three hypotheses circulate in the developer community: a system update in preparation for a next-generation model, a backend architecture change, or capacity being overwhelmed by demand growth.
Key Takeaways
- Understand the risk of declaring an incident “resolved” without explaining the cause. Users and developers want to know why it happened and whether it will recur. Now that AI services have penetrated business-critical workflows, publishing a post-mortem report is not merely a matter of honesty; it is a competitive differentiator.
- Build mechanisms to detect gradual deterioration early. The pattern of March 5, March 10, and March 29 suggests accumulated system load surfacing incrementally. Continuous monitoring of P99 latency and error-rate trends, with escalation when they signal an approaching major failure, is essential (a minimal detector sketch follows this list).
- Design SLAs with the understanding that AI services are infrastructure. Despite being embedded in the daily professional and academic workflows of 355 million users, DeepSeek has published neither an SLA nor a compensation policy. As user dependency deepens, the standard of accountability expected from providers rises accordingly. Leaving this gap unaddressed means that criticism during outages will shift from “temporary frustration” to “loss of trust.”
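On the second takeaway, here is a minimal sketch of the kind of trend detector it calls for: escalate when recent P99 latency runs persistently above a long-term baseline, rather than alerting only on single spikes. The window sizes and thresholds are illustrative assumptions, not values from DeepSeek’s monitoring.

```python
from collections import deque
import statistics

# Minimal degradation-trend detector: escalate when the last hour of
# per-minute P99 latency samples runs persistently above the 24-hour
# baseline. Window sizes and thresholds are illustrative assumptions.

class DegradationDetector:
    def __init__(self,
                 baseline_minutes: int = 1440,   # ~24h of 1-minute samples
                 recent_minutes: int = 60,       # window judged against baseline
                 ratio_threshold: float = 1.5,   # "elevated" = 1.5x baseline median
                 min_breaches: int = 45):        # sustained, not a single spike
        self.baseline = deque(maxlen=baseline_minutes)
        self.recent = deque(maxlen=recent_minutes)
        self.ratio_threshold = ratio_threshold
        self.min_breaches = min_breaches

    def observe(self, p99_ms: float) -> bool:
        """Feed one per-minute P99 sample; return True when escalation is due."""
        self.recent.append(p99_ms)
        self.baseline.append(p99_ms)
        if len(self.baseline) < self.baseline.maxlen:
            return False  # still accumulating a baseline
        floor = statistics.median(self.baseline)
        breaches = sum(1 for v in self.recent if v > floor * self.ratio_threshold)
        return breaches >= self.min_breaches

detector = DegradationDetector()
# In production this would be fed from the metrics pipeline; here, a stub:
for sample in [120.0] * 1440 + [210.0] * 60:  # a day of normal, then an hour elevated
    if detector.observe(sample):
        print("Escalate: sustained P99 degradation")
        break
```

In a real deployment the baseline would come from a metrics store rather than the detector’s own history, so that a slow creep in load cannot quietly drag the baseline upward along with it.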
What the Three Cases Share
All three cases involve AI service outages. The common thread is that the structural reality of AI inference — per-request computational cost orders of magnitude greater than ordinary web requests — amplifies every problem. Vulnerability to DDoS attacks and crashes from sudden demand spikes share the same root.
The frameworks for capacity planning, load testing, and rate limiting developed for traditional e-commerce and web services need to be rethought in the context of AI services. As the DeepSeek cases illustrate, rapidly growing AI services tend to follow a pattern of “quietly accumulating warning signs until crossing a threshold and collapsing abruptly” — and as users begin depending on these services as infrastructure, the standard of accountability expected from providers rises to match.
References
- NSFOCUS, “DeepSeek DDoS Attack Analysis” (January 2025): https://securityboulevard.com/2025/01/the-undercurrent-behind-the-rise-of-deepseek-ddos-attacks-in-the-global-ai-technology-game/
- Yicai, “Qwen Lunar New Year Campaign Outage” (February 2026): https://www.yicaiglobal.com/news/alibabas-website-crashes-as-chinese-tech-giant-starts-usd4324-million-lunar-new-year-push-for-qwen-ai-app
- Caixin, “DeepSeek Outage: 10 Hours Down Amid China’s AI Demand Surge” (March 2026): https://www.caixinglobal.com/2026-03-30/deepseek-goes-out-for-10-hours-amid-chinas-ai-demand-surge-102428989.html