Key Points for Conducting Incident Response Drills
Introduction
Continuing from the previous article, I’d like to use the NEXCO Central Japan incident as a reference point. Rather than analyzing the technical failure itself, this article focuses on what organizations should do to make incident response run more smoothly. The 2025 NEXCO Central Japan ETC system outage caused significant confusion both on the ground and at the management level, and several recurrence prevention measures were subsequently proposed. Building on those, this article discusses what to actually implement in incident response drills: the structured exercises that make prevention measures stick.
What to Cover in an Incident Response Drill
This section discusses the practical contents of a drill plan.
The goal is to avoid the scenario where a real incident hits and everyone freezes, unsure of what to do next. Real emergencies bring pressure, tension, and the unexpected — all of which create room for error. That’s precisely why you need a structured drill plan and the discipline to actually run it. Below are the key elements to include.
1. Defining Drill Prerequisites — Scope and Scenario
Example failure types:
- Database failure
- Network failure
- Authentication infrastructure failure
- External service outage
- Ransomware attack
Scenario settings to decide:
- Single-point failure vs. cascading/compounding failures
- Estimated blast radius
- Business impact severity level
One important note: the scenario details should remain known only to the drill designers. If participants already know where the “failure” is, it stops being a drill and becomes a performance. Treat it like a real incident — unknown origin, unknown scope.
The perspectives to evaluate are also wide-ranging: the validity of your existing BCP, whether SLA targets can be met, product-originated failures, and security-related incidents, among others.
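One way to pin down the prerequisites above is to record each scenario as structured data that only the drill designers hold. The Python sketch below is purely illustrative: the class and field names, and the 1-to-4 severity scale, are assumptions made for this example rather than part of any standard.

```python
from dataclasses import dataclass
from enum import Enum


class FailureType(Enum):
    DATABASE = "database failure"
    NETWORK = "network failure"
    AUTH = "authentication infrastructure failure"
    EXTERNAL_SERVICE = "external service outage"
    RANSOMWARE = "ransomware attack"


@dataclass(frozen=True)
class DrillScenario:
    """Designer-only description of a drill (hypothetical structure)."""
    name: str
    failures: tuple[FailureType, ...]  # more than one means a compound failure
    blast_radius: tuple[str, ...]      # systems or business areas expected to be hit
    severity: int                      # assumed scale: 1 (minor) to 4 (critical)

    @property
    def is_cascading(self) -> bool:
        # A scenario with several injected failures is treated as cascading.
        return len(self.failures) > 1

    def participant_briefing(self) -> dict:
        # Participants learn only that a drill exists, not its origin or scope.
        return {"name": self.name}
```

Keeping the briefing separate from the full scenario makes it harder to leak the failure origin to participants by accident.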
2. Failure Detection Phase
Items to verify:
- Can current monitoring detect the failure?
- Can false positives be distinguished from real alerts?
- Time from detection to escalation
- MTTD (Mean Time to Detect)
- Alert effectiveness
- Whether monitoring thresholds are appropriately calibrated
Unlike a real incident, a drill has a defined start — you know something is “happening.” That said, the drill should still test whether your monitoring infrastructure is actually functional: Are the expected error messages appearing? Are infrastructure-level thresholds breaching correctly? These need to be validated carefully.
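The timing items in the list above can be measured directly from drill records. The following is a minimal sketch, assuming the designers log when each fault was injected, when monitoring detected it, and when it was escalated; the `InjectedFault` record and the function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class InjectedFault:
    injected_at: datetime
    detected_at: Optional[datetime] = None   # None: monitoring never fired
    escalated_at: Optional[datetime] = None  # None: never escalated


def mean_delta(pairs):
    """Average of (end - start) over the given pairs, or None if there are none."""
    pairs = list(pairs)
    if not pairs:
        return None
    total = sum((end - start for start, end in pairs), timedelta())
    return total / len(pairs)


def detection_report(faults):
    detected = [f for f in faults if f.detected_at is not None]
    escalated = [f for f in detected if f.escalated_at is not None]
    return {
        # Faults that monitoring missed entirely are counted, not averaged away.
        "missed": len(faults) - len(detected),
        "mttd": mean_delta((f.injected_at, f.detected_at) for f in detected),
        "mean_detect_to_escalate": mean_delta(
            (f.detected_at, f.escalated_at) for f in escalated
        ),
    }
```

Reporting missed faults separately matters: an MTTD computed only over detected faults can look healthy while monitoring coverage has real gaps.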
3. Initial Response Phase — Command Structure and Crisis Management Activation
Items to verify:
- Incident severity assessment criteria
- Scope and impact triage
- Availability and quality of runbooks
- Who is the incident commander?
- Decision-making authority
- Role assignments
- Operation of the incident command center / war room
In this phase, pay close attention to two things: the quality of the initial analysis and the quality of communication. “Quality of initial analysis” means: can the runbooks actually guide the team to identify the suspected root cause? And is that information being properly reported up to incident command?
Also verify the physical setup: Is there an appropriate space for the incident command center — a large enough meeting room, projection cables, whiteboards? Are there enough tables and chairs? Is the analysis room equipped with the necessary machines and seating?
Small details matter more than you’d think. Running out of whiteboard markers, missing an HDMI adapter — these are the kinds of things that silently degrade your response quality. Checking even the mundane details sharpens your overall incident readiness.
4. Service Continuity Phase — Evaluating and Switching to Alternative Operations
Items to verify:
- Transition to manual operations
- Degraded-mode service operation
- Maintaining priority business functions
- Effectiveness of alternative procedures
- Manual workarounds
- Job net (scheduled batch job chain) verification for priority operations
If one of your drill objectives is testing whether your BCP actually functions, this phase is critical. BCPs don’t aim to restore everything at once — there’s a defined priority order for which functions come back first. The key question is: can you actually restore those high-priority functions?
In the meantime, you need to determine whether business will continue manually during the outage and prepare the relevant runbooks. If it’s feasible to involve customers in the drill, coordinating with customer-side operators to validate BCP-activation procedures would be ideal.
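The priority-order question above can be checked mechanically after the drill. This is a rough sketch, assuming you have the BCP's priority list and the time each function actually came back; the function name, the input shapes, and the example business functions are all hypothetical.

```python
def check_restore_order(priority, restored_at):
    """Compare drill results against the BCP priority order.

    priority:    function names, highest priority first.
    restored_at: name -> minutes from drill start when it was restored;
                 a function missing from the dict was never restored.
    """
    not_restored = [name for name in priority if name not in restored_at]
    restored = [name for name in priority if name in restored_at]
    out_of_order = []
    # Flag adjacent pairs where the lower-priority function came back first.
    for higher, lower in zip(restored, restored[1:]):
        if restored_at[lower] < restored_at[higher]:
            out_of_order.append((higher, lower))
    return {"not_restored": not_restored, "out_of_order": out_of_order}


result = check_restore_order(
    priority=["payments", "order intake", "reporting"],
    restored_at={"payments": 40, "order intake": 25},
)
```

An out-of-order pair is not automatically a failure, but it is worth raising in the retrospective: either the team deviated from the plan, or the plan's ordering does not match how recovery really works.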
5. Retrospective and Improvement
Primary review areas:
- BCP revisions
- Monitoring coverage improvements
- SLA revisions
- Training plan updates
Drills shouldn’t end when the scenario does. A structured retrospective is essential — you need to measure whether the drill played out as designed.
For example, if the drill was testing whether minimum viable job nets function during a BCP event: Did those job nets actually run? Did degraded-mode operations work? Was the failover to the disaster recovery site successful? Evaluate everything: runbook accuracy and accessibility, analysis room readiness, resource availability, and communication quality.
Most retrospectives surface at least some issues. When they do, assess the sufficiency of your BCP, the completeness of your monitoring coverage, and the realism of your SLAs — and propose improvements where needed.
This is also the right time to build or update a skills matrix: a map of which team members were able to handle which failure scenarios, across application, infrastructure, and network domains. A well-maintained skills matrix allows you to assemble incident response teams with balanced expertise when the real thing happens.
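A skills matrix like the one described above can be as simple as a mapping from person to the domains they handled during the drill. The sketch below assumes that shape; the names and the brute-force search are illustrative only, and suit small teams rather than large rosters.

```python
from itertools import combinations

# Hypothetical matrix: who handled which domain during the drill.
EXAMPLE_MATRIX = {
    "aoki": {"application"},
    "baba": {"infrastructure", "network"},
    "chiba": {"application", "network"},
}
REQUIRED = {"application", "infrastructure", "network"}


def uncovered_domains(team, matrix, required):
    """Domains in `required` that nobody on `team` can handle."""
    covered = set().union(*(matrix[member] for member in team))
    return required - covered


def smallest_covering_team(matrix, required):
    """Smallest group covering every required domain, or None if impossible."""
    for size in range(1, len(matrix) + 1):
        for team in combinations(sorted(matrix), size):
            if not uncovered_domains(team, matrix, required):
                return team
    return None
```

Running `uncovered_domains` against a proposed on-call roster is a quick way to see which domain would be left without an expert.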
6. Additional Considerations — Communication and Beyond
- Are you testing ongoing operations, not just recovery?
- Is decision-making being exercised, not just procedures?
- Does the drill include customer-facing communication scenarios?
- Are vendors included in the drill?
- Are you running live exercises, not just tabletop simulations?
The decision-making question is closely tied to customer communication and vendor involvement. In real incidents, you’re simultaneously explaining the situation to customers, tasking vendors with technical investigations, managing status reports, and navigating multiple communication channels at once. Simulating who reports to whom — and who gets tasked with what — means that when real chaos hits, people know what to do.
A Real-World Drill Example
One well-known example is the annual disaster drill run by freee, the Japanese accounting software company. Their scenario involves ransomware infiltrating their environment, destroying a database, and then sending a ransom demand directly to the CEO.
The full details are in the reference links below, but what makes this drill notable is how realistic the scenario is. The ransomware infection causes a live database failure and, at the same time, the CEO receives an actual ransom demand. Critically, the ransom amount was set just below the accounting threshold that would require disclosure, which made quietly paying it a plausible option at the CEO level. That design forces a genuine executive decision under pressure, rather than a scripted response.
Not every organization can replicate that level of realism. But the core objectives remain the same: test your communication flows under stress, verify that your expert resources and runbooks are actually ready, and confirm that your BCP holds up when things get messy.
Closing
Continuing from the previous article on the NEXCO Central Japan incident, I want to advocate again for making incident response drills a core part of building a resilient organization. This article has outlined what to think through when planning and running those drills.
No system is immune to failure. The goal isn’t to prevent every incident — it’s to keep delivering value to end users even when incidents happen. Incident response drills are how you build that capability. Run them, learn from them, and use them to create an organization that can take a hit and keep going.
References
- Mutsumi Takahashi, ITmedia (published 2022-03-18).
freee’s Cloud Disaster Drill: Destroying Their Own DB and Sending a Ransom Demand to the CEO — “Employees Were Traumatized.”
https://www.itmedia.co.jp/news/articles/2203/17/news038.html
(Accessed 2026-05-01)
- NEXCO Central Japan (2025).
Crisis Management Review Committee on Widespread ETC System Failures.
https://www.c-nexco.co.jp/corporate/pressroom/2025_crisis-management_etc
(Accessed 2026-05-01)
- East Nippon Expressway Co., Ltd., Central Nippon Expressway Co., Ltd., West Nippon Expressway Co., Ltd. (2025).
Recurrence Prevention Measures.
https://www.c-nexco.co.jp/corporate/pressroom/2025_crisis-management_etc/pdf/2025_crisis-management_etc08.pdf
(Accessed 2026-05-01)