System Incident Response Examples: A Simplified Summary of Domestic and International Cases ①

2025/10/24

今野貴大

Introduction

Hello! In this article, I briefly introduce five instructive system outages that occurred in and outside Japan in August 2025, covering how each was handled and recovered from, along with my own insights.

1: PayPal Global Outage (August 1)

Overview

On August 1, 2025, a system-wide outage hit the payment platform of PayPal, one of the world’s largest payment services. Transaction processing for online shopping and other services was temporarily unavailable, and users could neither log in nor complete payments. The incident occurred in the morning US time, disrupting financial transactions during a critical period.

Cause

A memory management issue within the servers led to database access delays. Because the transaction approval and account management systems depended on each other, a delay in one significantly degraded the performance of the whole platform.

Recovery

Recovery involved restarting the problematic servers and clearing the cache. In parallel, load balancing was implemented to redirect traffic to other servers, minimizing the impact of the partial service disruption. Normal operations were subsequently restored.
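
The idea of redirecting traffic away from problematic servers can be illustrated with a small sketch. This is not PayPal’s actual mechanism, which the status report does not describe; the class name HealthAwareBalancer and the backend names are hypothetical, and the example only shows the general pattern of round-robin routing that skips servers marked unhealthy.

```python
from itertools import cycle


class HealthAwareBalancer:
    """Round-robin over backends, skipping any that are marked unhealthy.

    Hypothetical illustration only; not taken from any real load balancer.
    """

    def __init__(self, backends):
        self.backends = list(backends)
        self.unhealthy = set()
        self._ring = cycle(self.backends)

    def mark_unhealthy(self, backend):
        # Called when a health check fails or a server is being restarted.
        self.unhealthy.add(backend)

    def mark_healthy(self, backend):
        # Called once the server has recovered.
        self.unhealthy.discard(backend)

    def pick(self):
        # Examine each backend at most once per call.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate not in self.unhealthy:
                return candidate
        raise RuntimeError("no healthy backends available")


if __name__ == "__main__":
    lb = HealthAwareBalancer(["app-1", "app-2", "app-3"])
    lb.mark_unhealthy("app-2")  # e.g. the server being restarted
    print([lb.pick() for _ in range(4)])
```

With app-2 marked unhealthy, requests alternate between app-1 and app-3 until the server is marked healthy again.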

Downtime

Approximately 1 hour and 20 minutes.

Notable Point

PayPal is critical payment infrastructure, with approximately 438 million users and merchants worldwide, so the outage had an extensive global impact. The timing made it more severe: the beginning of the month often sees a concentration of regular transfers such as rent and salary payments.

Source

PayPal System Status Report

Insight

PayPal is renowned for its highly secure payment services, and I use it daily. Given its global reliability and extensive use, this outage had a considerable impact. It highlights the need for scenario testing that anticipates high-load situations at business-critical times, such as the beginning of the month.

2: United Airlines System Outage (August 6)

Overview

On August 6, 2025, a major system failure occurred across the United Airlines network in the United States, leading to widespread flight delays and cancellations at major US airports (including Denver, Houston, and Chicago). The failure specifically hit systems responsible for aircraft weight calculation and flight time tracking, resulting in over 1,000 delayed flights and dozens of cancellations.

Cause

The outage was caused by a failure in United Airlines’ proprietary “Unimatic” system, preventing essential information for aircraft weight calculation and flight tracking from being correctly transmitted to other systems.

Recovery

United Airlines initiated system recovery immediately after the incident and resumed normal operations about four hours later. Recovery efforts included manual aircraft adjustments and flight verifications.

Downtime

Approximately 4 hours.

Notable Point

The failure was limited to United Airlines’ specific system and was unrelated to air traffic control systems. This marks the second similar incident this year, following a comparable outage at Alaska Airlines a few weeks prior.

Source

The Guardian

Insight

A system failure at United Airlines, which holds a top-tier share globally, affected not only passengers but also logistics. That the problem arose in a proprietary system on which operations depend heavily exposed vulnerabilities in the airline’s overall system architecture. The recurrence of similar failures, such as the one at Alaska Airlines, suggests a potential systemic risk from industry-wide reliance on complex internal systems.

3: Cloudflare Access Outage (August 21)

Overview

On August 21, 2025, Cloudflare, a US-based internet infrastructure provider, experienced a problem on its network’s connection to a specific AWS region (us-east-1) following a sudden surge in customer traffic. As a result, many websites and APIs routed through Cloudflare became slow or inaccessible.

Cause

Excessive traffic from a specific customer overloaded the connection with AWS: the unexpected spike exceeded the available network bandwidth.

Recovery

Cloudflare mitigated the delays by shifting traffic to other regions, manually adjusting routing, and reallocating resources. They also strengthened monitoring tools and developed plans for improving traffic management.
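
One widely used traffic management technique for keeping a single customer from saturating a shared link is a per-customer token bucket. The sketch below is a generic illustration and is not taken from Cloudflare’s report; the class names, the rate and capacity values, and the customer IDs are all hypothetical. Time is passed in explicitly so that the behavior is easy to follow.

```python
class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` of requests per second."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = 0.0         # timestamp of the previous call

    def allow(self, now, cost=1.0):
        # Refill in proportion to the time elapsed, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerCustomerLimiter:
    """Keeps one bucket per customer so a noisy customer cannot starve others."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, customer_id, now):
        if customer_id not in self.buckets:
            self.buckets[customer_id] = TokenBucket(self.rate, self.capacity)
        return self.buckets[customer_id].allow(now)


if __name__ == "__main__":
    limiter = PerCustomerLimiter(rate=1.0, capacity=2.0)
    # Three back-to-back requests from one customer: the third is rejected.
    print([limiter.allow("noisy", now=0.0) for _ in range(3)])
    # Another customer is unaffected.
    print(limiter.allow("quiet", now=0.0))
```

Because each customer draws from a separate bucket, a spike from one customer is throttled without reducing the capacity available to the others.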

Downtime

Approximately 3 hours.

Notable Point

This incident exposed a potential bottleneck in interconnections between major cloud service providers, emphasizing the renewed importance of sophisticated traffic management.

Source

Cloudflare Incident Report

Insight

Cloudflare is a pillar of the internet, mediating communications for applications and websites worldwide, so even a localized outage significantly affected numerous services. In particular, the problem on the AWS connection brought to the surface the interdependence between major cloud service providers.

4: Yometel Call Assistance Outage (August 22)

Overview

On August 22, 2025, the call assistance application “Yometel,” designed for individuals with hearing impairments, experienced an issue where incoming calls could not be received on iPhone devices. This rendered a critical function unusable for users who rely on call assistance.

Cause

The cause was attributed to a compatibility issue between the app’s settings and a recent iOS update.

Recovery

The problem was resolved by users performing actions such as reinstalling the app or modifying iOS settings. The company apologized for the inconvenience and implemented measures to prevent recurrence.

Downtime

Approximately 1 hour.

Notable Point

This case highlights the significant impact that accessibility-focused application failures can have on a specific user group, as it directly affects a vital aspect of their communication.

Source

Yometel Official Website

Insight

While the outage was resolved in about an hour, its impact on the lives of its target users (those with hearing impairments) was severe. The fix, which involved reinstalling the app or changing settings, was relatively minor but may have been technically challenging for some users.

5: eemo Car Sharing Lock Function Failure (August 27)

Overview

On August 27, 2025, eemo, a car-sharing service specializing in electric vehicles, experienced a vehicle locking function failure, preventing users from securing the vehicle upon return. This required remote locking via customer support, causing significant inconvenience to users.

Cause

A glitch in the communication between the IoT-based vehicle management system and the app resulted in locking instructions failing to reach the vehicle. Communication delays and server malfunctions were identified as the primary causes.
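
A common safeguard against commands that fail to reach a device is to require an acknowledgment and retry with exponential backoff before escalating. The following is a minimal sketch under stated assumptions, not eemo’s actual implementation: the function send_with_retry and the send callable are hypothetical, and send is assumed to return True only when the vehicle confirms the lock command.

```python
import time


def send_with_retry(send, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send() until it is acknowledged or the attempts run out.

    send: hypothetical callable that transmits the lock command and returns
          True only if the vehicle acknowledged it.
    Returns the number of attempts used; raises TimeoutError if the command
    was never acknowledged.
    """
    for attempt in range(1, max_attempts + 1):
        if send():
            return attempt
        if attempt < max_attempts:
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise TimeoutError("lock command was not acknowledged; escalate to support")


if __name__ == "__main__":
    # Simulate a link that drops the first two commands.
    responses = iter([False, False, True])
    delays = []
    attempts = send_with_retry(lambda: next(responses), sleep=delays.append)
    print(attempts, delays)
```

Passing the sleep function as a parameter keeps the example testable without real waiting; in the simulated run the command succeeds on the third attempt after two backoff delays.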

Recovery

The issue was resolved through app restarts and vehicle-side reconfigurations. Users were refunded for any associated delay charges, and the system was modified to prevent future occurrences.

Downtime

Several hours (exact duration not publicly disclosed).

Notable Point

This incident stands out as an example of how a communication failure in an IoT-based service can directly disrupt user convenience.

Source

eemo Official Website

Insight

The failure, a seemingly simple inability to lock the car, was a serious incident for a car-sharing service and could have led to user attrition. Fortunately, the issue could be resolved by relatively simple methods (app restart and reconfiguration), and the refunds helped maintain customer satisfaction. However, given the several-hour downtime and the difficulty users had resolving the issue on their own, the reliance on immediate support highlights the need for greater system autonomy and resilience.
