China Internet Incident Response Case Studies (Part 2)

2026/04/06
Kota Kagami

Introduction

In Part 1, we examined the two-tier structure of China’s internet infrastructure — the domestically self-contained design versus dependence on international connectivity — through three incidents: the Alipay discount display bug, the China Unicom local DNS anomaly, and the Great Firewall’s TCP/443 disruption.

Part 2 focuses on two incidents Alibaba Cloud experienced in 2024. One was a short-lived outage caused by a network configuration issue; the other was a prolonged disruption triggered by a physical data center fire. Both cases challenge the assumption that “we’re fine because we’re on the cloud.”

Incident 1: Alibaba Cloud Shanghai AZ-N Outage (July 2, 2024)

What Happened

On the morning of July 2, 2024, Alibaba Cloud detected a network access anomaly in Availability Zone N (AZ-N) of its Shanghai region; the fault began at 10:04 AM Beijing time. Bilibili, China’s largest video platform, and RedNote (Xiaohongshu, a popular image and review social app) had both deployed their core services in this single zone, and both went down in a cascade.

On Bilibili, content refresh failed, comments became unavailable, and personal pages and the messaging system went completely offline. RedNote experienced a total service outage. Alibaba Cloud’s status page records the incident as resolved within 38 minutes.

Response Highlights

Alibaba Cloud’s engineering team detected the fault and activated a pre-configured network flow switching mechanism to reroute traffic away from AZ-N, restoring services incrementally. The fact that this rerouting design existed in advance is what made the 38-minute recovery possible — the team essentially flipped to a prepared failover path rather than improvising a fix.

On the other hand, both Bilibili and RedNote had concentrated their primary services in a single availability zone. The reasoning — headquarters in Shanghai, lowest latency — is rational, but a single-zone design means a single-zone failure produces a total outage. This is a classic trap in cloud architecture.

Root Cause

No formal root cause was disclosed publicly; the most likely explanation is packet loss caused by a network device failure or misconfiguration. Alibaba Cloud has historically been conservative about publishing detailed post-incident analyses, and this case was no exception.

Key Takeaways

Concentrating in a single zone creates an illusion of high availability.

Distributing services across multiple availability zones within the same region, with automatic failover designed for zone-level failures, is essential. Alibaba Cloud itself officially recommends multi-zone deployment. The additional cost is real, but so is the blast radius when a single zone fails.
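As a concrete illustration, multi-zone failover can be as simple as preferring the low-latency home zone while falling back to any other healthy zone. The zone names and endpoint addresses below are hypothetical, and real deployments would drive the healthy-zone set from the provider's health checks rather than a hand-maintained list:

```python
import random

# Hypothetical zone endpoints; the names and addresses are illustrative only.
ZONE_ENDPOINTS = {
    "cn-shanghai-n": "10.0.1.10",
    "cn-shanghai-l": "10.0.2.10",
    "cn-shanghai-m": "10.0.3.10",
}

def pick_endpoint(healthy_zones, preferred="cn-shanghai-n"):
    """Prefer the low-latency home zone, but fail over when it is unhealthy."""
    if preferred in healthy_zones:
        return ZONE_ENDPOINTS[preferred]
    # Home zone is down: fall back to any remaining healthy zone.
    candidates = [z for z in ZONE_ENDPOINTS if z in healthy_zones]
    if not candidates:
        raise RuntimeError("no healthy zone available")
    return ZONE_ENDPOINTS[random.choice(candidates)]
```

A single-zone deployment is the degenerate case where `healthy_zones` either contains the one zone or is empty; there is no fallback branch to take.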

Understand what your cloud provider’s failover mechanisms can actually do — before an incident.

The fast recovery here happened because Alibaba Cloud had a rerouting mechanism ready. For every critical failure scenario, you should know which cloud features are in play and how far automatic recovery will take you.

A 38-minute outage still leaves hours of cleanup work. 

Even a brief disruption creates cache inconsistencies, log gaps, and incomplete session states that need manual remediation. Recovery runbooks should explicitly include a post-recovery cleanup phase as a separate step from the incident itself.

Incident 2: Alibaba Cloud Singapore AZ-C Data Center Fire (September 10–16, 2024)

What Happened

At 10:20 AM Beijing time on September 10, 2024, Alibaba Cloud detected network access anomalies in Availability Zone C (AZ-C) of its Singapore region. The root cause was a fire inside the Digital Realty SIN11 data center that Alibaba Cloud used for that zone. Fire suppression systems activated, blocking personnel from entering the building. With no human access, temperatures inside continued to rise and network equipment began failing.

Alibaba Cloud issued an unusually candid warning in its status updates: engineers could not enter the building, there was no way to control the rising temperature, and a complete network blackout of AZ-C was possible. Users with services deployed in AZ-C were instructed to migrate immediately. ByteDance (TikTok’s parent company) services were also affected. As of September 16, hardware recovery was still ongoing, with some equipment requiring careful drying before power-on could be attempted.

Response Highlights

The most notable aspect of Alibaba Cloud’s response was issuing a proactive warning before the worst-case scenario materialized. Telling users to migrate while the situation was still evolving — rather than waiting until complete failure was confirmed — is the right instinct. It gives customers time to act and limits trust erosion if things deteriorate further.

What cannot be overlooked is the structural problem: the control plane (the management layer that orchestrates cloud resources) was concentrated solely in AZ-C. This same issue had been flagged in the 2022 Hong Kong data center incident. When the control plane is in a single zone, a zone failure doesn’t just bring down compute — it takes down the management tooling itself, making recovery operations impossible until physical access is restored. The fact that this pattern repeated after 2022 is a pointed question for the entire industry.

Root Cause

A physical fire in the data center. What mattered more than the fire itself was the downstream consequence: once personnel could not enter the building, there was no way to manage the affected infrastructure. Remote management capabilities and server protection measures in the event of fire suppression system activation were not part of the data center evaluation criteria — and they should have been.

Key Takeaways

Physical risks fall outside cloud SLAs. 

Fire, flooding, and power failure at the physical layer are not covered by cloud service-level agreements. When evaluating data center regions — especially overseas — include fire resistance, water mitigation design, and site access procedures in your assessment criteria.
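One way to make that assessment repeatable is a weighted scorecard. The criteria and weights below are hypothetical; the point is that physical-layer items sit alongside the usual price and latency columns rather than being omitted:

```python
# Hypothetical criteria and weights; adapt them to your own risk model.
PHYSICAL_CRITERIA = {
    "fire_resistance": 3,
    "water_mitigation": 2,
    "site_access_procedures": 3,
    "remote_management": 3,
}

def physical_risk_score(assessment):
    """Weighted score in [0, 1]; `assessment` maps criterion -> 0..1 rating.
    Criteria missing from the assessment count as 0 (unverified = unsafe)."""
    total = sum(PHYSICAL_CRITERIA.values())
    earned = sum(PHYSICAL_CRITERIA[k] * assessment.get(k, 0.0)
                 for k in PHYSICAL_CRITERIA)
    return earned / total
```

Treating an unassessed criterion as zero is deliberate: it forces the evaluation to happen rather than defaulting to optimism.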

Making the control plane multi-zone is a non-negotiable priority. 

In your own on-premises infrastructure as well, never co-locate management systems (monitoring, authentication, job orchestration) in the same zone as production. If the zone goes down and takes your management tooling with it, you cannot recover.
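This rule is checkable from a deployment inventory. A minimal sketch, assuming a hypothetical mapping of component names to zones; the set of "management" components would come from your own service catalog:

```python
def colocation_violations(placements, production_zone):
    """Return management components that share a zone with production.

    `placements` is a hypothetical inventory mapping component -> zone."""
    management = {"monitoring", "authentication", "job_orchestration"}
    return sorted(c for c, zone in placements.items()
                  if c in management and zone == production_zone)
```

Wired into CI against your infrastructure-as-code state, a check like this turns "never co-locate" from a convention into an enforced invariant.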

Build and test your migration playbook before you need it. 

When an urgent migration notice arrives, your ability to respond depends entirely on preparation done in advance. Know your migration targets, set appropriate DNS TTLs, and verify data portability as part of regular disaster recovery exercises — not as a response to an active incident.
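The DNS TTL item in particular is easy to audit ahead of time: after you repoint a record, resolvers may keep serving the old answer for up to the full TTL. A minimal sketch, assuming record TTLs have already been collected into a dict (the budget of 300 seconds is an illustrative choice, not a standard):

```python
def ttl_violations(record_ttls, budget_seconds=300):
    """Flag DNS records whose TTL exceeds the migration budget.

    `record_ttls` maps record name -> TTL in seconds; anything above the
    budget means cached answers could outlive an emergency cutover."""
    return {name: ttl for name, ttl in record_ttls.items()
            if ttl > budget_seconds}
```

Running this in a scheduled check keeps long TTLs from silently creeping back in between disaster recovery exercises.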

What These Two Incidents Share

The incidents differ sharply in nature. The Shanghai AZ-N failure was a network-layer event resolved in 38 minutes. The Singapore AZ-C fire was a physical event that took nearly a week to fully resolve. But both trace back to the same structural vulnerability: single-zone dependency amplified the impact.

In Incident 1, the problem was on the customer side — Bilibili and RedNote had consolidated their services in one zone. In Incident 2, the problem was on Alibaba Cloud’s side — its own control plane was in a single zone. Together, they raise the same question: who is responsible for availability design, at which layer, and to what extent?

Using a managed cloud service dramatically reduces the operational burden of running infrastructure. But the assumption that “the cloud handles availability” is a liability. Designing across zones and regions, practicing switchover drills regularly, and understanding physical-layer risks are responsibilities that remain with the user — no matter which cloud you’re on.

References