The Delta Air Lines system failure on Aug. 8, 2016, forced the cancellation of more than 300 flights, inconvenienced hundreds of thousands of people all over the world, and cost Delta an estimated $150 million. Delta has been somewhat vague in explaining the exact cause of the system failure, but if you are a power professional, facilities manager, or electrical contractor, you have to wonder whether a failure of that magnitude could strike your organization.
The cause of the Delta system shutdown has been described as a cascading series of events in the company's Atlanta, GA-based data center. Reports describe a small fire and a power surge that faulty switchgear failed to mitigate. According to some sources, the malfunction started when Delta IT staff ran a routine scheduled test that switched to the backup generator in the early hours of August 8th. The test created a power surge, which then caused a fire in an Automatic Transfer Switch (ATS). The surge and fire caused about 500 servers to shut down.
The Atlanta data center is configured for 2N redundancy, but when power was lost, some critical systems and equipment did not switch over to backup power. That is when Delta discovered that roughly 300 of its 7,000 data center components were not correctly connected to backup power. According to Delta’s COO, Gil West, “(On) Monday morning a critical power control module at our Technology Command Center malfunctioned, causing a surge to the transformer and a loss of power.” West told The Week, “When this happened, critical systems and network equipment didn’t switch over to backups. Other systems did. And now we’re seeing instability in these systems.” Once power was restored, communication between servers was slow to come back online. That created further confusion and delays as critical Delta systems remained inoperable, and it led to a system malfunction that made worldwide news.
All of this information leaves out the one vulnerability that was the smoking gun. Clearly, the 2N redundancy was not correctly configured for all of the servers in the data center. It could have been as simple as dual-input servers plugged into the same UPS, or it could have been a larger-scale vulnerability in the data center setup. So how exactly do you make sure that this same type of catastrophic failure does not happen in your facility? Maybe you have dual inputs plugged into different UPSs and a tested disaster recovery plan in place. Perhaps you have regularly scheduled maintenance on all vital components. But simple human error, aging infrastructure, or imperfectly configured backup plans can still cause a loss of power and a shutdown. If your business has recently merged, expanded, or invested in new equipment, inadequate power backup or poorly integrated legacy systems can form an elaborate house of cards just waiting for someone to bump the table.
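To make the dual-input point concrete, here is a minimal sketch of an inventory audit that flags any device whose power inputs do not trace back to at least two different UPSs. The device names, UPS labels, and the `Device` and `find_unprotected` names are hypothetical and purely illustrative; a real audit would work from your own as-built power documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    feeds: tuple  # upstream UPS label for each power input (hypothetical labels)


def find_unprotected(devices):
    """Return names of devices whose inputs reach fewer than two distinct UPSs."""
    return [d.name for d in devices if len(set(d.feeds)) < 2]


# Hypothetical inventory, for illustration only.
inventory = [
    Device("db-01", ("UPS-A", "UPS-B")),   # dual-corded, separate UPSs
    Device("app-07", ("UPS-A", "UPS-A")),  # dual-corded, but both cords on one UPS
    Device("net-03", ("UPS-B",)),          # single-corded
]

print(find_unprotected(inventory))  # ['app-07', 'net-03']
```

In this example, both the server with two cords on the same UPS and the single-corded network device are flagged, since losing one UPS would take either of them down.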
Wherever the error occurred in Delta Air Lines’ Atlanta data center, a proactive Risk Assessment Evaluation could have identified the critical points of vulnerability before they led to the failure. Discovering the potential points of failure in your data center configuration before the next outage can help you avoid the same fate.
The Risk Assessment Service is designed to identify the small problems that can quickly escalate into a disaster. A comprehensive Risk Assessment Service from Power Solutions includes an examination of the mechanical, electrical, security technology, and data center backup systems. We examine the power path from where the utility enters the building all the way through to each user’s workstation. This includes identifying potential single points of failure, confirming circuit breaker trip sequences, and reviewing the disaster recovery plan. All findings are presented in a comprehensive, easy-to-read report that includes photographs of each possible problem area, a risk level and vulnerability score for each finding, and a recommended corrective action. The data is presented with numerical and color-coded reporting, so it is easy for the end user to prioritize next steps and present a plan to senior management.
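As an illustration of what identifying a single point of failure means, the sketch below models a power path as a directed graph and reports every component whose loss alone would cut the load off from all power sources. It is not the method used in the Risk Assessment Service; the topology, the component names, and the function names are assumptions made up for this example.

```python
from collections import deque


def reachable(graph, sources, target, removed=None):
    """Breadth-first search from any source to target, skipping `removed`."""
    seen = {s for s in sources if s != removed}
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph, sources, load):
    """Return components whose individual loss disconnects the load from power."""
    components = set(graph) | {n for outs in graph.values() for n in outs}
    candidates = components - {load}
    return sorted(
        c for c in candidates
        if not reachable(graph, sources, load, removed=c)
    )


# Hypothetical power path: two sources share one transfer switch,
# which then feeds two independent UPS/PDU chains to a dual-corded server.
power_path = {
    "utility": ["ats"],
    "generator": ["ats"],
    "ats": ["ups_a", "ups_b"],
    "ups_a": ["pdu_a"],
    "ups_b": ["pdu_b"],
    "pdu_a": ["server"],
    "pdu_b": ["server"],
}

print(single_points_of_failure(power_path, ["utility", "generator"], "server"))
# ['ats']
```

Even though this hypothetical layout has two sources and two UPS chains, the shared transfer switch is still a single point of failure, which is the kind of finding a power path review is meant to surface.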
Electrical system vulnerability is a concern with aging infrastructure, and ensuring power quality and uptime is a constant demand, but disasters are avoidable. Don’t let a Delta situation happen to you.
For more information about the Risk Assessment Service from Power Solutions, or our other products and services that can help you avoid a shutdown, call 800-876-9373.