What NEXCO Central Japan’s Outage Teaches Us About the True Value of IT Operations

2026/04/27
Nakatani Taichi

Introduction: Can You Really Say You Gave It Everything?

The outage that struck NEXCO Central Japan in 2025 disrupted the travel plans of countless people who depend on the expressway as a piece of social infrastructure. In an era where highway toll payments have been “smartified” through ETC (electronic toll collection) systems, many IT engineers who handle day-to-day system operations likely found themselves thinking: What if something like this happened on my watch?

When news like this breaks, engineers naturally gravitate toward questions like: What caused it? How long did recovery take? Those are important questions — but this article deliberately sets the technical details aside.

For years, operations teams have been evaluated primarily on one metric: how fast they restored service. But is that really enough?

Even if a system comes back online, if users remain confused, frontline staff are left exhausted and overwhelmed, and the trust of stakeholders has been eroded — can that response genuinely be called a success?

This article uses the NEXCO Central Japan outage as a lens to examine what it actually means for IT operations to deliver value.

Chapter 1: “Restored = Done” Measures Less Than You Think

Restoration is, of course, essential. A service that stays down leaves users with no options. But restoration is better understood as returning to the starting line — not crossing the finish line.

What matters is what happened along the way.

Consider these situations:

  • The system came back, but customer communication was delayed, leaving end users unable to make informed decisions.
  • Information sharing broke down internally, causing the response to reverse course multiple times.
  • Load concentrated on a small number of people, raising the risk of critical judgment errors.

None of these problems is visible in restoration time alone.

In the NEXCO case, the question of refunds for end users sparked significant public debate. Legally, expressways do not guarantee arrival times — traffic delays don’t warrant refunds, and once you’ve entered the highway, you’ve paid for access to the road itself. That’s the contractually correct position. Emotionally, however, it’s another matter entirely.

Returning to systems: SLA definitions typically mark recovery at the moment the system resumes normal operation. From a contractual standpoint, that’s resolution. But the factors that don’t fit neatly into SLA definitions — the near-misses, the experience of confusion — are precisely what determine the quality of value delivered to customers and end users.
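To make that gap concrete, here is a minimal sketch of what a restoration-time metric actually measures and what it leaves out. The field names and timestamps are invented for illustration; they are not drawn from any real SLA or from the NEXCO incident.

```python
from datetime import datetime

# Hypothetical incident record: all names and times are illustrative.
incident = {
    "detected_at": datetime(2025, 1, 1, 6, 0),
    "restored_at": datetime(2025, 1, 1, 18, 0),           # system resumes normal operation
    "first_user_notice_at": datetime(2025, 1, 1, 11, 0),  # first public guidance to users
}

# What a typical SLA records: the time until the system came back.
restoration_time = incident["restored_at"] - incident["detected_at"]

# What it does not record: how long users waited for any information at all.
notification_lag = incident["first_user_notice_at"] - incident["detected_at"]

print(f"Contractual 'resolution' time: {restoration_time}")
print(f"Time users had no guidance:    {notification_lag}")
```

Both numbers come from the same incident, but only the first one shows up in a typical SLA report.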

Incident response is not a purely technical exercise. It is the work of sustaining a service in its entirety.

Chapter 2: The “Real Value” Operations Must Protect

Operations work is not just about keeping systems running. Its deeper purpose is to support what users actually do and how society functions.

For an expressway, that means:

  • Users can travel with confidence
  • Logistics flow on schedule
  • Society’s trust in the infrastructure is maintained

These are the conditions under which the service has value at all. When an outage occurs, operations may not be able to preserve all of this value. But teams are still expected to reason through: How much can we protect? Which impacts can we minimize? That reasoning is the foundation of business continuity planning (BCP).

When an incident hits, the decisions that matter go beyond “fix the system”:

  • Which information should be communicated first?
  • Which functions should be restored first?
  • Are users in a position to take alternative action?

In the NEXCO outage, one of the identified problems was that ETC recovery decisions defaulted to manual, on-site judgment — creating inconsistency depending on who was responding, and limiting the effectiveness of the initial response. The post-incident remediation specifically addressed this by establishing a central command structure to clarify authority and coordination.

Chapter 3: Incident Response Is Never One Person’s Job

The larger the outage, the more people it involves:

  • Internal operations teams
  • Internal development teams
  • External vendors
  • Frontline staff managing physical locations
  • End users

Each of these parties sees the situation from a different vantage point. The danger is information asymmetry — what one team treats as critical, another may barely register. Left unaddressed, this leads to delayed decisions and misaligned responses.

This is why operations teams must act as connectors of information:

  • Accurately convey what is happening on the ground
  • Share constraints and limitations clearly
  • Refuse to leave the situation ambiguous

These are unglamorous tasks. But they have an outsized effect on how far an outage spreads and how long it lasts.

Closing these gaps requires two foundational things: a defined communication structure, and a clear definition of what constitutes a “major incident.” The NEXCO remediation report defined a large-scale ETC outage as: “a system failure that causes ETC lane disruptions at multiple toll plazas during the same time period.” Having that definition — and a command structure to activate when it’s triggered — is what enables an organization to respond as a unified body rather than a collection of disconnected individuals.
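As a thought experiment, a definition like that can even be made executable, so the command structure is triggered consistently rather than by individual judgment. The sketch below is not NEXCO’s actual implementation; the one-hour window and the data model are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical data model; the real systems and thresholds are not public.
@dataclass
class LaneDisruption:
    toll_plaza: str
    started_at: datetime

def is_large_scale_etc_outage(disruptions: list[LaneDisruption],
                              window: timedelta = timedelta(hours=1)) -> bool:
    """Mirror the published wording: ETC lane disruptions at multiple toll
    plazas during the same time period. The one-hour window is an assumption
    for this sketch, not part of the official definition."""
    for d in disruptions:
        concurrent_plazas = {
            other.toll_plaza
            for other in disruptions
            if abs(other.started_at - d.started_at) <= window
        }
        if len(concurrent_plazas) >= 2:
            return True
    return False

# If the definition is met, the pre-agreed command structure takes over
# instead of leaving recovery decisions to individual on-site judgment.
if is_large_scale_etc_outage([
    LaneDisruption("Plaza A", datetime(2025, 1, 1, 9, 0)),
    LaneDisruption("Plaza B", datetime(2025, 1, 1, 9, 20)),
]):
    print("Activate central incident command")
```

The value lies less in the code than in the fact that the trigger is unambiguous: anyone on the team can evaluate it and arrive at the same answer.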

Chapter 4: Making Decisions Without a Right Answer

The hardest part of incident response is that there often is no correct answer.

  • Prioritize speed, or prioritize certainty?
  • Restore partial functionality, or wait for a full recovery?
  • Act on on-site judgment, or wait for authorization?

Every choice carries both upside and risk. What makes the difference is having a basis for judgment:

  • Which option minimizes impact on users?
  • Does this choice leave room to course-correct?
  • When information is incomplete, how far can we move on assumptions?

These questions improve decision quality. But deliberating over them in real time during a major incident leads to fragmented, inconsistent action across the organization. That is precisely why pre-incident drills matter.

By developing incident scenarios in advance and running through recovery exercises, teams surface problems they wouldn’t otherwise see: gaps in communication chains, unvalidated fallback points, and ambiguity about which functions to prioritize. Drills also help clarify skill levels across the operations team — informing who should be paired with whom, and how the response roster should be structured.
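One lightweight way to keep those findings from evaporating after the drill is to record them in a structured form with a named owner. The sketch below is a hypothetical example, not a prescribed format; the categories and field names are invented here.

```python
from dataclasses import dataclass, field

# Hypothetical structure for capturing tabletop-drill findings.
@dataclass
class DrillFinding:
    category: str      # e.g. "communication", "fallback", "priority", "staffing"
    description: str
    owner: str         # who follows up before the next drill

@dataclass
class DrillReport:
    scenario: str
    findings: list[DrillFinding] = field(default_factory=list)

    def open_items(self, category: str) -> list[DrillFinding]:
        return [f for f in self.findings if f.category == category]

report = DrillReport(scenario="ETC lanes down at multiple plazas")
report.findings.append(DrillFinding(
    category="communication",
    description="Frontline staff had no direct channel to the ops team",
    owner="ops-lead",
))
report.findings.append(DrillFinding(
    category="fallback",
    description="Manual toll-booth procedure never rehearsed end to end",
    owner="site-manager",
))

for f in report.open_items("communication"):
    print(f"[{f.category}] {f.description} -> {f.owner}")
```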

The NEXCO remediation report reflects exactly this: revised communication structures and formalized incident definitions that didn’t exist before. If you haven’t recently reviewed the incident response setup for your own systems, this is a good moment to do so.

Chapter 5: Value Is Determined by What You Learn

Outages cannot be eliminated. That’s precisely why what you do after one matters so much.

  • Why did the decision take so long?
  • Where did information break down?
  • What parts of the response actually worked?

Reflecting on these questions and translating the answers into systemic improvements is what defines the maturity of an operations function. Not just a retrospective memo — but concrete changes to structure, process, and capability.

This accumulation of learning is what meaningfully changes how an organization handles the next incident.

The NEXCO remediation materials reflect this well. The position on refunds remained legally consistent: expressway fees are payment for access, not a guarantee of arrival — so no blanket refunds. But for cases where ETC system failures prevented the service from delivering its expected value, the documentation now provides for individual case review and potential refunds where warranted.

The deeper cause of the delayed response — insufficient communication infrastructure — is also directly addressed with structural remediation. The internal details of why decisions were slow and where communication broke down haven’t been made public. But whether or not the specifics are available, the prompt remains the same: review your own systems, your own escalation chains, your own definitions of “incident,” before the next one happens.

Closing: Operations Is the Work That Remains

No matter how sophisticated systems become, it is ultimately people who sustain them. And the people at the front of that effort are in operations.

When an outage occurs, all users see is the outcome. But behind that outcome are dozens of judgment calls and coordination efforts that never surface publicly.

“It’s restored, so we’re done” is not sufficient.

The question that matters is: What were we able to protect — and how?

That is the standard that modern IT operations should hold themselves to.
