By Molly Gross, Principal, Power Solutions, LLC
On Thursday, May 7, 2026, Amazon Web Services reported what it described as a thermal event at a single Availability Zone in its US-EAST-1 region in Northern Virginia. According to Amazon’s status updates, rising temperatures inside one data center caused loss of power to affected hardware racks. Amazon’s compute and storage services in the affected zone were impaired, along with the dependent services that ran on them. Recovery proceeded more slowly than Amazon initially anticipated and required bringing additional cooling system capacity online.
Amazon has not disclosed what specifically failed in the cooling system, and we are not speculating about the root cause here. What is verified — and what matters for any organization operating its own data center cooling — is the cascade itself.
The Cascade Is the Lesson
When cooling capacity drops below the heat load it must remove, temperatures rise. When temperatures cross critical thresholds, hardware fails — or powers down out of self-preservation to protect against permanent damage. The compute and storage running on that hardware go offline. The services that depend on that compute and storage degrade or fail.
This is the sequence Amazon itself confirmed in its Friday morning update — that after cooling systems failed in the affected zone, servers automatically shut down when temperatures exceeded operating thresholds to protect the hardware. The cooling failure became a power loss, and the power loss became a service outage.
That is what an outage looks like when it starts as a cooling problem. The headline says outage. The root is in the mechanical room.
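The cascade described above can be sketched in a few lines of code. This is a deliberately crude first-order model, and every number in it (threshold, load, thermal mass) is an illustrative assumption of ours, not a figure from the AWS incident:

```python
# A minimal sketch of the cascade: cooling capacity drops below the heat
# load, temperature rises, and servers self-protect at a threshold.
# All constants are illustrative assumptions, not incident data.

SHUTDOWN_THRESHOLD_C = 35.0       # assumed server self-protect temperature
START_C = 22.0                    # assumed normal room temperature
HEAT_LOAD_KW = 500.0              # assumed IT heat load
THERMAL_MASS_KWH_PER_C = 10.0     # assumed room thermal mass

def room_temperature(t_minutes: float, cooling_kw: float = 0.0) -> float:
    """Approximate room temperature after t minutes of a net heat imbalance.

    When cooling falls below the load, the surplus heat warms the room
    roughly linearly (a deliberately simple model).
    """
    net_kw = HEAT_LOAD_KW - cooling_kw
    if net_kw <= 0:
        return START_C  # cooling keeps up; temperature holds
    return START_C + (net_kw * t_minutes / 60.0) / THERMAL_MASS_KWH_PER_C

# With cooling lost entirely, count the minutes until servers shut down.
minutes = 0
while room_temperature(minutes) < SHUTDOWN_THRESHOLD_C:
    minutes += 1
print(f"Servers self-protect after ~{minutes} minutes without cooling")
```

The point of the sketch is the shape of the curve, not the numbers: once the heat imbalance exists, shutdown is a matter of minutes, not hours.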
Why End-of-Life Cooling Equipment Slows Recovery
How long a cooling failure lasts depends on three operational factors: whether redundant cooling can carry the load while a failed unit is repaired, whether the right spare parts are on hand or accessible quickly, and whether the equipment is covered by a service plan with defined response times.
For end-of-life cooling equipment, all three work against you:
- No new factory service plan coverage. Schneider Electric does not initiate new factory service plans on equipment past end-of-life. Existing coverage may continue for a defined period, but the runway is finite and shrinks as parts inventory depletes.
- Spare parts availability declines. Inventory is finite once a product reaches EOL. A part that once shipped same-day or next-day can carry a multi-week lead time sourced through secondary channels.
- Mean time to repair extends. A failure that would resolve in hours on supported equipment can stretch into days when the right part isn’t on hand and the field technician is working without manufacturer support behind them.
The unit is still running. It still cools. But when something finally fails, the clock starts on a much longer recovery than supported, current equipment would require. Every additional hour the cooling system is down is another hour the temperature climbs toward the threshold where compute starts protecting itself by shutting down.
What Meaningfully Reduces the Risk
Two things significantly reduce the likelihood that a cooling failure cascades into a power and compute outage:
- Current InRow equipment with redundancy designed in. N+1 cooling capacity at the row or room level means a single unit failure does not push the remaining capacity below the load. The other units carry the room while the failed unit is serviced. The current Schneider Electric InRow Direct Expansion and Chilled Water lines are designed for this kind of integration in modern rack-density and AI-influenced environments.
- Comprehensive factory service plan coverage. Equipment under current Schneider Electric factory service coverage gets priority parts availability, scheduled preventive maintenance that catches developing issues before they become failures, and contractual response-time commitments. EcoStruxure IT remote monitoring extends that visibility — temperature trends, status, and predictive alerts surface before a small problem becomes a thermal event.
Neither prevents a thermal event. Both significantly reduce the likelihood that one cascades into an outage — and both shorten the recovery window when a failure does occur.
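The N+1 rule above reduces to a simple check: after losing any one unit, the remaining capacity must still meet the heat load. A minimal sketch with illustrative numbers (not a sizing tool, and not a Schneider Electric method):

```python
def survives_single_failure(unit_capacity_kw: float, num_units: int,
                            heat_load_kw: float) -> bool:
    """True if the remaining units still cover the load after losing one."""
    remaining_kw = unit_capacity_kw * (num_units - 1)
    return remaining_kw >= heat_load_kw

# Example: four 30 kW units against a 90 kW row load (N+1 holds),
# versus three units against the same load (one failure drops below load).
n_plus_1 = survives_single_failure(30.0, 4, 90.0)   # True
bare_n = survives_single_failure(30.0, 3, 90.0)     # False
```

Real sizing must also account for altitude, water temperature, and unit derating, which is why this stays a sketch and an assessment belongs with a qualified engineer.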
Find out where your cooling infrastructure stands — before a failure tells you. Call 800-876-9373, or email [email protected] for more information.
Molly Gross, Principal at Power Solutions, LLC, has over 15 years of experience in critical power for enterprise and government applications. She has extensive knowledge of UPS and data center infrastructure with a specialization in services and product lifecycle management. Molly closely follows emerging trends and innovations in the critical power industry with an eye for incorporating leading edge technologies into both new construction and legacy infrastructures. Connect with Molly on LinkedIn.