System Incident Response Examples: A Simplified Summary of Domestic and International Cases ①
Introduction
Hello! In this article, I briefly introduce five instructive cases of system outages, responses, and recoveries that occurred in and outside Japan in August 2025, along with my own takeaways.
1: PayPal Global Outage (August 1)
Overview
On August 1, 2025, a system-wide outage hit the payment platform of PayPal, one of the world’s largest payment services. Transaction processing for online shopping and other services was temporarily unavailable, and users could neither log in nor complete payments. The incident began in the morning US time, disrupting financial transactions during a critical period.
Cause
A memory management issue within the servers led to database access delays. Specifically, the transaction approval and account management systems were mutually dependent, and a delay in one area significantly degraded the overall platform performance.
Recovery
Recovery involved restarting the affected servers and clearing caches. In parallel, load balancing redirected traffic to other servers, minimizing the impact of the partial outage, and normal operations were subsequently restored.
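As a rough illustration of this kind of health-check-driven redirection (not PayPal’s actual tooling; the backend names, latency threshold, and traffic simulation below are all hypothetical), the sketch drops a slow server from rotation so that requests keep flowing to healthy ones:

```python
import random
from dataclasses import dataclass, field

# Hypothetical backend pool; names, window size, and the latency threshold
# are illustrative only.
LATENCY_THRESHOLD_MS = 500
WINDOW = 20

@dataclass
class Backend:
    name: str
    healthy: bool = True
    recent_latencies_ms: list = field(default_factory=list)

    def record(self, latency_ms: float) -> None:
        # Keep only the most recent measurements.
        self.recent_latencies_ms = (self.recent_latencies_ms + [latency_ms])[-WINDOW:]

    def check_health(self) -> None:
        # Mark the backend unhealthy when its recent average latency is too high,
        # mimicking the database-access-delay symptom described above.
        if self.recent_latencies_ms:
            avg = sum(self.recent_latencies_ms) / len(self.recent_latencies_ms)
            self.healthy = avg < LATENCY_THRESHOLD_MS

def pick_backend(pool: list[Backend]) -> Backend:
    """Route a request to a random healthy backend (simple load balancing)."""
    healthy = [b for b in pool if b.healthy]
    if not healthy:
        raise RuntimeError("no healthy backends available")
    return random.choice(healthy)

if __name__ == "__main__":
    pool = [Backend("pay-api-1"), Backend("pay-api-2"), Backend("pay-api-3")]
    for _ in range(200):
        target = pick_backend(pool)
        # Simulate pay-api-1 degrading, as one part of the platform did in the incident.
        target.record(900 if target.name == "pay-api-1" else 80)
        for backend in pool:
            backend.check_health()
    # pay-api-1 should end up removed from rotation while the others stay healthy.
    print({b.name: b.healthy for b in pool})
```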
Downtime
Approximately 1 hour and 20 minutes.
Notable Point
PayPal is a vital piece of payment infrastructure, serving approximately 438 million users and merchants worldwide, so the outage was a serious incident with an extensive global impact. The timing made it worse: the beginning of the month concentrates recurring transfers such as rent and salary payments.
Source
Insight
PayPal is renowned for its highly secure payment services, and I use it daily. Given its global reach and the trust placed in it, this outage had a considerable impact. It highlights the need for scenario testing that anticipates high-load periods such as the beginning of the month.
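As a minimal sketch of such a scenario test (the concurrency limit and request volume are hypothetical placeholders, not PayPal’s real figures), the script below floods a simulated payment handler with a month-start burst and reports how many requests would have timed out for users:

```python
import asyncio
import random

# Hypothetical capacity figures for a scenario test; real numbers would come
# from production metrics for month-start peaks (rent, salary transfers, etc.).
CONCURRENCY_LIMIT = 50          # what the system can actually serve at once
PEAK_REQUESTS = 2_000           # simulated month-start burst

semaphore = asyncio.Semaphore(CONCURRENCY_LIMIT)

async def handle_payment(request_id: int) -> bool:
    """Stand-in for a payment call; fails if it cannot get capacity quickly."""
    try:
        await asyncio.wait_for(semaphore.acquire(), timeout=0.05)
    except asyncio.TimeoutError:
        return False  # this request would have timed out for the user
    try:
        await asyncio.sleep(random.uniform(0.01, 0.03))  # simulated processing time
        return True
    finally:
        semaphore.release()

async def main() -> None:
    results = await asyncio.gather(*(handle_payment(i) for i in range(PEAK_REQUESTS)))
    failures = results.count(False)
    print(f"failed {failures}/{PEAK_REQUESTS} requests under peak load")

if __name__ == "__main__":
    asyncio.run(main())
```

Running a test like this against realistic capacity numbers makes it obvious, before the first of the month arrives, how far demand can exceed supply before users start seeing errors.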
2: United Airlines System Outage (August 6)
Overview
On August 6, 2025, a major system failure occurred across the United Airlines network in the United States, leading to widespread flight delays and cancellations at major US airports (including Denver, Houston, and Chicago). The failure specifically hit systems responsible for aircraft weight calculation and flight time tracking, resulting in over 1,000 delayed flights and dozens of cancellations.
Cause
The outage was caused by a failure in United Airlines’ proprietary “Unimatic” system, preventing essential information for aircraft weight calculation and flight tracking from being correctly transmitted to other systems.
Recovery
United Airlines initiated system recovery immediately after the incident and resumed normal operations about four hours later. Recovery efforts included manual aircraft adjustments and flight verifications.
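Unimatic’s internal logic is not public, so purely as an illustration of the kind of weight-and-balance arithmetic that crews had to verify by hand during the outage, here is a simplified sketch; every figure and limit in it is hypothetical:

```python
# Generic, simplified weight check. Unimatic's real logic is not public;
# every figure below (weights, limits) is hypothetical and for illustration only.

OPERATING_EMPTY_WEIGHT_KG = 42_000   # aircraft without payload or fuel
MAX_TAKEOFF_WEIGHT_KG = 79_000       # certified limit (hypothetical)
AVG_PASSENGER_WEIGHT_KG = 84         # passenger plus carry-on assumption

def takeoff_weight(passengers: int, cargo_kg: float, fuel_kg: float) -> float:
    """Sum the weight components that dispatch has to account for."""
    return (OPERATING_EMPTY_WEIGHT_KG
            + passengers * AVG_PASSENGER_WEIGHT_KG
            + cargo_kg
            + fuel_kg)

def within_limits(passengers: int, cargo_kg: float, fuel_kg: float) -> bool:
    """The go/no-go check that had to be performed manually while the system was down."""
    return takeoff_weight(passengers, cargo_kg, fuel_kg) <= MAX_TAKEOFF_WEIGHT_KG

if __name__ == "__main__":
    print(takeoff_weight(160, 2_500, 14_000))   # 71,940 kg with these assumptions
    print(within_limits(160, 2_500, 14_000))    # True under the hypothetical limit
```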
Downtime
Approximately 4 hours.
Notable Point
The failure was limited to United Airlines’ specific system and was unrelated to air traffic control systems. This marks the second similar incident this year, following a comparable outage at Alaska Airlines a few weeks prior.
Source
Insight
A system failure at United Airlines, one of the largest carriers in the world, affected not only passengers but also logistics. Because the issue arose in a proprietary system that many other systems depend on, it exposed vulnerabilities across the airline’s overall architecture. The recurrence of similar failures, such as the one at Alaska Airlines, suggests a systemic risk stemming from industry-wide reliance on complex internal systems.
3: Cloudflare Access Outage (August 21)
Overview
On August 21, 2025, Cloudflare, a US-based internet infrastructure provider, experienced a problem on its network’s connection to a specific AWS region (us-east-1), caused by a sudden surge in client traffic. As a result, numerous websites and APIs routed through Cloudflare became slow or unreachable.
Cause
Excessive traffic from a single customer overloaded the connection with AWS: the unexpected spike exceeded the available network bandwidth.
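Cloudflare has not detailed its exact safeguards, but one standard technique for containing this kind of single-customer spike is per-customer rate limiting. The sketch below shows a basic token bucket, with hypothetical rates, that caps one tenant’s burst before it can saturate a shared upstream link:

```python
import time
from dataclasses import dataclass

# Per-customer token bucket. The rates below are hypothetical; the point is
# that one tenant's burst is capped before it can saturate a shared link.

@dataclass
class TokenBucket:
    rate_per_sec: float      # sustained requests per second allowed
    capacity: float          # maximum burst size
    tokens: float = 0.0
    last_refill: float = 0.0

    def allow(self, now: float | None = None) -> bool:
        now = time.monotonic() if now is None else now
        if self.last_refill == 0.0:
            self.last_refill = now
            self.tokens = self.capacity
        # Refill proportionally to elapsed time, up to the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate_per_sec)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # request is throttled instead of overloading the link

buckets: dict[str, TokenBucket] = {}

def admit(customer_id: str) -> bool:
    bucket = buckets.setdefault(customer_id, TokenBucket(rate_per_sec=100, capacity=200))
    return bucket.allow()

if __name__ == "__main__":
    # A sudden burst of 1,000 requests from one customer: only the first ~200 pass.
    allowed = sum(admit("customer-with-spike") for _ in range(1_000))
    print(f"admitted {allowed} of 1000 burst requests")
```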
Recovery
Cloudflare mitigated the delays by shifting traffic to other regions, manually adjusting routing, and reallocating resources. They also strengthened monitoring tools and developed plans for improving traffic management.
Downtime
Approximately 3 hours.
Notable Point
This incident exposed a potential bottleneck in interconnections between major cloud service providers, emphasizing the renewed importance of sophisticated traffic management.
Source
Insight
Cloudflare is a pillar of the internet, mediating communications for applications and websites worldwide, so even this localized outage significantly affected numerous services. In particular, the issue on the AWS link brought the interdependence between major cloud service providers to the surface.
4: Yometel Call Assistance Outage (August 22)
Overview
On August 22, 2025, the call assistance application “Yometel,” designed for individuals with hearing impairments, experienced an issue where incoming calls could not be received on iPhone devices. This rendered a critical function unusable for users who rely on call assistance.
Cause
The cause was attributed to a compatibility issue stemming from app settings and a recent iOS update.
Recovery
The problem was resolved by users performing actions such as reinstalling the app or modifying iOS settings. The company apologized for the inconvenience and implemented measures to prevent recurrence.
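The company’s specific preventive measures have not been disclosed; purely as a generic illustration of one common measure, the sketch below shows a backend-side compatibility gate that flags app/iOS version combinations known to break call handling so affected users can be warned before calls fail silently. All version numbers are hypothetical:

```python
# Hypothetical server-side compatibility gate: the version numbers and the
# affected combination below are illustrative, not Yometel's actual data.

KNOWN_BAD_COMBINATIONS = {
    # (app version, iOS major version) pairs confirmed to break incoming calls.
    ("2.3.1", 18),
}

MINIMUM_SUPPORTED_APP_VERSION = (2, 3, 2)

def parse_version(version: str) -> tuple[int, ...]:
    return tuple(int(part) for part in version.split("."))

def client_status(app_version: str, ios_major: int) -> str:
    """Return what the backend should tell this client at startup."""
    if (app_version, ios_major) in KNOWN_BAD_COMBINATIONS:
        return "update_required"   # show an in-app warning before calls fail silently
    if parse_version(app_version) < MINIMUM_SUPPORTED_APP_VERSION:
        return "update_recommended"
    return "ok"

if __name__ == "__main__":
    print(client_status("2.3.1", 18))  # update_required
    print(client_status("2.3.2", 18))  # ok
```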
Downtime
Approximately 1 hour.
Notable Point
This case highlights the significant impact that accessibility-focused application failures can have on a specific user group, as it directly affects a vital aspect of their communication.
Source
Insight
While the outage was resolved in about an hour, its impact on the daily lives of the target users (people with hearing impairments) was profound. The fix, involving reinstallations and setting changes, was relatively minor but could have been technically challenging for some users.
5: eemo Car Sharing Lock Function Failure (August 27)
Overview
On August 27, 2025, eemo, a car-sharing service specializing in electric vehicles, experienced a vehicle locking function failure, preventing users from securing the vehicle upon return. This required remote locking via customer support, causing significant inconvenience to users.
Cause
A glitch in the communication between the IoT-based vehicle management system and the app resulted in locking instructions failing to reach the vehicle. Communication delays and server malfunctions were identified as the primary causes.
Recovery
The issue was resolved through app restarts and vehicle-side reconfigurations. Users were refunded for any associated delay charges, and the system was modified to prevent future occurrences.
Downtime
Several hours (exact duration not publicly disclosed).
Notable Point
This incident stands out as an example of how a communication failure in an IoT-based service can directly disrupt user convenience.
Source
Insight
The failure, a seemingly simple inability to lock the car, was a serious incident for a car-sharing service, potentially leading to user attrition. It was fortunate that the issue could be resolved by relatively simple means (app restart and reconfiguration), and the provision of refunds helped maintain customer satisfaction. However, given the several-hour downtime and the fact that users could not resolve the issue on their own, the reliance on support intervention highlights the need for greater system autonomy and resilience.
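eemo’s actual architecture is not public, so the following is only a generic sketch of the kind of autonomy that insight points toward: retrying the lock command with a reused idempotency key and escalating automatically if the vehicle never acknowledges, instead of leaving the user to call support. The transport function and all timings are hypothetical stand-ins:

```python
import random
import time
import uuid

# Generic retry-with-acknowledgement pattern for an IoT lock command. The
# transport function below is a simulation; in the real service it would be
# the vehicle communication layer, whose details are not public.

MAX_ATTEMPTS = 3
MAX_BACKOFF_SEC = 4.0

def send_to_vehicle(vehicle_id: str, command: dict) -> bool:
    """Simulated transport: drops the command 40% of the time, like a flaky link."""
    return random.random() > 0.4

def lock_vehicle(vehicle_id: str) -> str:
    # One idempotency key reused across retries, so a duplicated delivery
    # cannot apply the lock command twice on the vehicle side.
    command = {"action": "lock", "idempotency_key": str(uuid.uuid4())}
    for attempt in range(1, MAX_ATTEMPTS + 1):
        if send_to_vehicle(vehicle_id, command):
            return "locked"
        time.sleep(min(MAX_BACKOFF_SEC, 2 ** attempt))  # back off before retrying
    # Escalate automatically instead of leaving the user to phone support.
    return "escalated_to_support"

if __name__ == "__main__":
    print(lock_vehicle("vehicle-042"))
```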