China’s Internet Infrastructure and Three Recent Incident Response Cases (Part 1)
Introduction
System outages in China are never just a case of “an app going down.” Because telecommunications carriers (fixed and mobile), DNS, hyperscale cloud providers, super-app payment platforms, and cross-border network controls are all deeply intertwined, the blast radius of any single failure can rapidly escalate to the level of critical social infrastructure.
This first installment in the series starts with a bird’s-eye view of China’s internet architecture, then examines three incident response cases from 2025. Each one offers rich lessons — not only in restoration, but also in root-cause isolation, user communication, and prevention of recurrence.
1. A High-Level Look at China’s Internet Infrastructure
The Two-Layer Structure: “Domestic” vs. “International”
China’s network is built around massive domestic traffic, and applications are naturally designed to be self-contained within the country. At the same time, many services still have residual dependencies on international connectivity — overseas cloud regions, foreign SaaS, cross-border APIs, foreign CDNs, certificate authorities, and identity platforms.
This produces a failure mode that surfaces again and again: everything inside China is fine, but the service breaks because it cannot reach the outside world — a dynamic explored in the TCP/443 case below.
Carrier Dependency and DNS as the “De Facto Gateway”
China’s major telecom carriers wield enormous influence, and user experience is shaped not just by line quality but by the behavior of local DNS resolvers — the carrier-operated recursive resolvers that most users’ devices point to by default.
When DNS becomes unstable, apps and websites enter a state where “the server is alive but unreachable” — which feels like a total outage to end users. Alibaba Cloud (Aliyun) even calls this out explicitly in its FAQ, noting that instability in Local DNS can lead to resolution anomalies.
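The "server is alive but unreachable" pattern can be triaged with a first-pass check that separates DNS resolution failures from connection failures. The sketch below is illustrative only — real triage would probe from multiple vantage points and resolvers — but it shows the distinction that matters during a Local DNS incident:

```python
import socket

def triage(host: str, port: int = 443, timeout: float = 3.0) -> str:
    """Coarse first-pass triage: did we fail at DNS resolution,
    or at the TCP connection to the resolved address?"""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # Resolution itself failed: the "server alive but unreachable" pattern
        return "dns_failure"
    addr = infos[0][4]
    try:
        with socket.create_connection(addr, timeout=timeout):
            return "reachable"
    except OSError:
        return "server_or_path_failure"
```

Run against a set of canary domains on a schedule, a classifier like this lets you say "resolution is broken, the origin is fine" minutes earlier than user reports would.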
Treating “Mega-App = Mega-Infrastructure”
Super-apps like WeChat (微信) and Alipay (支付宝) are indispensable to daily life in China — serving as the entry point for payments, transfers, bill payments, and a wide array of civic services. When they fail, the ripple effects span finance, commerce, and public services.
Incident response therefore goes beyond SRE-style recovery workflows; it must also include deterring fraud opportunism (fake SMS, phishing) and establishing clear compensation policies — as illustrated in the Alipay case below.
2. Alipay “Government Subsidy” Misapplication (January 16, 2025)
What Happened
On January 16, 2025, Alipay’s payment screen briefly displayed an erroneous “Government Subsidy (政府补贴)” discount, causing payments to be reduced by approximately 20% for a short window. Reports place the incident between roughly 14:40 and 14:45.
Why the Initial Response Mattered — Containing the Chaos
“Discount or subsidy” failures are particularly tricky because secondary damage tends to arrive simultaneously:
- Viral spread on social media causing a surge of additional access
- Amplified user anxiety (“Will I be charged the difference later?”)
- Opportunistic fraud — fake SMS messages demanding repayment, fake links
This case was no exception: fraudulent text messages claiming Alipay would “claw back” the discount began circulating. Alipay responded by publicly stating it would not pursue repayment from affected users and issuing warnings about the fake messages.
Root Cause
Multiple reports describe the cause as a misconfiguration in the back-end management console for a routine marketing campaign — specifically, selecting the wrong discount type and wrong monetary value from a template. This class of incident typically traces back not to a code bug, but to insufficient access controls, review gates, and configuration guardrails.
Key Takeaways
Business rules for discounts and subsidies are riskier than code. Configuration changes pushed to production are notoriously hard to scope for impact.
Build a safety net to detect anomalous discounts. Monitoring the distribution of discount rates across all transactions — or alerting on a sudden spike in specific labels like “Government Subsidy” — can trigger detection before humans notice.
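One way to sketch that safety net is a sliding-window monitor over transaction labels. Everything here is hypothetical — the label name `gov_subsidy`, the window size, and the 5% cap are illustrative, not tuned values — but the shape of the check is what matters: alert when a watched discount label's share of recent transactions spikes.

```python
from collections import Counter, deque

class DiscountLabelMonitor:
    """Sliding-window monitor: alert when a watched discount label's
    share of recent transactions exceeds a cap. Window and threshold
    values are illustrative placeholders."""

    def __init__(self, watched: set, window: int = 1000, max_share: float = 0.05):
        self.watched = watched
        self.window = deque(maxlen=window)
        self.counts = Counter()
        self.max_share = max_share

    def observe(self, label: str) -> bool:
        # Evict the oldest label's count before the deque drops it
        if len(self.window) == self.window.maxlen:
            self.counts[self.window[0]] -= 1
        self.window.append(label)
        self.counts[label] += 1
        if label not in self.watched:
            return False
        share = self.counts[label] / len(self.window)
        return share > self.max_share  # True = raise an alert
```

A sudden burst of "Government Subsidy"-labeled transactions trips the alert long before a human notices the payment screen looks wrong.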
Treat user communication as a financial incident. Clearly addressing whether users will be charged back, confirming fund safety, and warning about fraud — in plain, concise language — is just as important as the technical recovery.
3. China Unicom Local DNS Anomaly (August 12, 2025)
What Happened
On the evening of August 12, 2025, a widespread outage affecting China Unicom customers — primarily in Beijing — left many users unable to access websites and apps. Social media filled with reports of “tons of sites not loading” and payments dropping mid-transaction.
On the technical side, external observers claimed that “DNS had been poisoned and many domains were resolving to 127.0.0.2” — though this was based on third-party observation, not official confirmation.
The Response: Fast Isolation on the Cloud Side
What stands out in this case is that Alibaba Cloud reportedly detected the “Unicom Local DNS anomaly” through its own monitoring at around 19:40 and filed a fault report (报障) with the carrier. The issue was reportedly resolved by around 20:48.
In other words, the cloud provider quickly isolated the problem as external (carrier-side) rather than attributable to their own systems and escalated through the proper channel. When this step is delayed, users and customers keep blaming the cloud provider, wasting response resources on the wrong front.
Key Takeaways
DNS is the entry point for every communication an app makes. Saying “we have redundant DNS, we’re fine” is not sufficient — if the Local DNS resolver that users are pointed to breaks, everything breaks together. Aliyun’s own FAQ acknowledges that Local DNS instability can cause resolution failures.
Practical recommendations:
- Use HTTPDNS, DoH, or DoT — options that bypass the Local DNS resolver (especially important for mobile apps)
- Monitor across multiple carriers and connection types so you can immediately determine whether an incident is carrier-specific
- When DNS is the failure point, consider a static failsafe — a backup routing path for critical domains
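To make the DoH option concrete: both Cloudflare and Google expose a JSON-over-HTTPS DNS API. The sketch below builds a query URL and parses the documented response shape; the Cloudflare endpoint is used as an example, and in production a mobile app would more likely use a vendor HTTPDNS SDK or binary DoH (RFC 8484) rather than this JSON variant.

```python
import json
from urllib.parse import urlencode

# Example endpoint; Google's dns.google/resolve exposes the same JSON shape
DOH_ENDPOINT = "https://cloudflare-dns.com/dns-query"

def doh_query_url(name: str, rtype: str = "A") -> str:
    """Build a DoH JSON-API query URL.
    Send it with the header: accept: application/dns-json"""
    return f"{DOH_ENDPOINT}?{urlencode({'name': name, 'type': rtype})}"

def parse_doh_answer(body: str) -> list:
    """Extract A-record addresses from a DoH JSON response body."""
    doc = json.loads(body)
    if doc.get("Status") != 0:  # 0 = NOERROR
        return []
    # type 1 = A record
    return [rr["data"] for rr in doc.get("Answer", []) if rr.get("type") == 1]
```

Because the query rides on ordinary HTTPS to the resolver, it sidesteps the carrier's Local DNS entirely — which is exactly the dependency that failed in this incident.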
4. The “Border-Side” Outage: TCP/443 Blocked for 74 Minutes (August 20, 2025)
What Happened
For approximately 74 minutes — roughly 00:34 to 01:48 Beijing time on August 20, 2025 — HTTPS traffic between China and the outside world failed broadly. The most impacted protocol was TCP port 443, the standard port for HTTPS. Researchers and practitioners in the network analysis community published findings confirming that during this window, connections over port 443 failed to complete, observed from multiple vantage points.
(References: GFW Report analysis, The Register reporting)
Critically, this was not a case of a specific company’s servers going down. The outage had all the hallmarks of the traffic control mechanisms near the network border (commonly associated with the Great Firewall) broadly interfering with HTTPS connection establishment.
Technical Background
TCP is the communication protocol that reliably delivers data over the internet — guaranteeing order and delivery confirmation. Most web and API traffic runs on top of TCP.
Ports are numbered entry points that differentiate traffic types on a shared server. Port 443 is the standard port for HTTPS (encrypted web traffic); port 80 is used for unencrypted HTTP. Because HTTPS is the standard for virtually all modern web and app communication, blocking port 443 has an outsized impact.
HTTPS and TLS: HTTPS is the encrypted communication method you see as https:// in a browser. TLS is the underlying encryption mechanism (certificate verification, key exchange). The sequence is: TCP connects → TLS establishes encrypted session → HTTP exchanges data.
TCP 3-Way Handshake: Before any data flows, TCP requires three steps:
- Client → Server: SYN (“I’d like to connect”)
- Server → Client: SYN+ACK (“Acknowledged, go ahead”)
- Client → Server: ACK (“Connection established”)
TLS negotiation begins after this handshake completes.
RST (Reset): A TCP signal that says “terminate this connection immediately.” When a device receives an RST, the OS or networking library treats it as “the other side has disconnected.”
RST+ACK Injection (the core mechanism): Observations suggest that during this window, forged RST+ACK packets were injected into TCP 443 connections, forcibly terminating them. “Injection” here means a third party — a device sitting on the network path, neither the client nor the server — sends packets impersonating one of the legitimate endpoints, tricking both sides into thinking the connection was dropped. Think of it less as “cutting the wire” and more as “making both ends believe the wire was cut.”
How It Looked in Practice
RST-based outages typically manifest as:
- Browser: Pages won’t load, get stuck partway, or show “Connection was reset” errors
- Apps: Login fails, APIs time out, payments don’t go through
- Monitoring: Failures appear at the TCP connection or TLS handshake layer — before any HTTP status code (200/500) is returned
The key difficulty: because the failure occurs before the server is even reached, the root cause is far harder to diagnose in the early stages. And because port 443 is used by virtually every service, impact spreads across a wide blast radius instantly.
Why This Is Potentially Catastrophic for Businesses
A domestic service can be fully operational — and still be brought down by this scenario.
International API dependencies become a single point of failure. Even nominally domestic services often rely on:
- Authentication (SSO, identity providers, token issuance)
- Payments and fraud detection (lookup APIs, payment gateways)
- Analytics and advertising (event ingestion, tag delivery)
- Distribution (foreign CDNs, feature flags, app update services)
When port 443 dies, all of these calls fail together — producing the confusing symptom: “our domestic infrastructure is healthy but nothing works.”
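When an overseas dependency can fail wholesale like this, a circuit breaker with a domestic fallback limits the damage: after repeated failures, stop hammering the dead primary and serve from a mirror or cache. This is a minimal sketch — thresholds, cooldowns, and the fallback itself (domestic mirror, cached response, degraded mode) are illustrative choices:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures,
    skip the primary (e.g. an overseas API) for `cooldown` seconds and
    use the fallback (e.g. a domestic mirror or cached response)."""

    def __init__(self, threshold: int = 3, cooldown: float = 60.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                return fallback()       # circuit open: don't touch the primary
            self.opened_at = None       # half-open: give the primary one try
            self.failures = 0
        try:
            result = primary()
            self.failures = 0           # success resets the counter
            return result
        except OSError:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback()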
When operations tooling lives abroad, recovery slows. If your monitoring, ticketing, voice bridges, CI/CD pipelines, or certificate management are all on foreign SaaS platforms, the very tools you need to respond are also affected. Response capacity degrades precisely when you need it most.
74 minutes of downtime keeps paying dividends — in the wrong direction. Even after restoration:
- Batch jobs that failed need to be re-run
- Token refresh gaps emerge (login failures surface hours or days later)
- Monitoring has a blind spot — missing logs and metrics make post-incident analysis difficult
- Data consistency breaks (order states, billing events, inventory counts fall out of sync)
Post-restoration reconciliation is often the real incident.
What Caused It?
The cause has not been officially confirmed. Hypotheses in circulation include:
- Testing / experimental rollout (validating new control logic)
- Misconfiguration (logic applied more broadly than intended)
- New equipment deployment or upgrade (a bug introduced by border device changes)
- Unintended cascade (controls targeting a specific use case expanding to all 443 traffic)
The important framing here is not pinning down the cause, but treating “border-side factors can disrupt port 443 at scale” as a real, documented risk.
Key Takeaways
- Ensure critical paths have China-domestic fallbacks — domestic regions, domestic CDNs, domestic monitoring
- Make external dependencies switchable — multi-region architecture, queuing, retries, and circuit breakers
- Monitor beyond HTTP 200 — measure at the TCP handshake and TLS establishment layers for true end-to-end visibility
- Design for “connectivity is sometimes unavailable” — rather than seeking workarounds that skirt the rules, build business continuity plans that include post-recovery data reconciliation procedures
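The "monitor beyond HTTP 200" point can be made concrete with a staged probe that reports which layer failed: DNS, the TCP handshake, or the TLS handshake. A plain HTTP health check collapses all three into one opaque failure; during the August 20 incident, a probe like this would have shown TCP connecting failures on 443 specifically. A minimal sketch:

```python
import socket
import ssl

def layered_probe(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Probe a host in stages and record how far we got:
    DNS -> TCP handshake -> TLS handshake."""
    result = {"dns": False, "tcp": False, "tls": False, "error": None}
    try:
        addr = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)[0][4]
        result["dns"] = True
        with socket.create_connection(addr, timeout=timeout) as raw:
            result["tcp"] = True  # 3-way handshake completed
            ctx = ssl.create_default_context()
            with ctx.wrap_socket(raw, server_hostname=host):
                result["tls"] = True  # encrypted session established
    except Exception as exc:
        result["error"] = type(exc).__name__
    return result
```

Feeding these per-layer booleans into dashboards, per carrier and per region, is what turns "nothing works" into "TCP to port 443 is being reset at the border."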
5. The Response Pattern Common to All Three Cases
Despite their very different natures, the operational skeleton of each response is strikingly similar:
| Phase | Description |
|---|---|
| Detection | Can you get ahead of user reports? (Payment KPIs, DNS resolution rates, TCP/TLS success rates) |
| Isolation | Is this ours, the carrier’s, the border, or a config change? |
| Containment | Stop the misconfiguration, route around the failing domain, limit the impact surface |
| Communication | Short and certain: fund safety, no clawback, estimated recovery time, fraud warnings |
| Prevention | Configuration guardrails, layered monitoring, architectural review of external dependencies |
References
- GFW Report (English): https://gfw.report/blog/gfw_unconditional_rst_20250820/en
- GFW Report (Chinese): https://gfw.report/blog/gfw_unconditional_rst_20250820/zh
- The Register: https://www.theregister.com/2025/08/21/china_port_443_block_outage
- TechRadar: https://www.techradar.com/pro/china-cut-itself-off-from-the-global-internet-for-an-hour-on-wednesday-but-was-it-a-mistake
- SDxCentral: https://www.sdxcentral.com/news/mystery-outage-cuts-china-off-from-global-internet-traffic
- GIGAZINE: https://gigazine.net/gsc_news/en/20250825-china-gfw-block-all-https/
- Economic Times: https://m.economictimes.com/news/international/us/china-cuts-ties-with-the-global-internet-for-60-minutes-world-baffled-heres-what-happened/articleshow/123439249.cms