By Molly Lacerte, Principal, Power Solutions, LLC
The global IT outage caused by a corrupt CrowdStrike update in July of 2024 was a stark reminder of how vulnerable our IT systems are to a single point of failure. A simple software bug took down businesses, governments, and Windows-based systems around the world with a ripple effect that lasted for days. Even computer systems that weren’t running CrowdStrike were affected because cloud-based services or other software providers their systems relied on were impacted.
Our digital world has evolved to the point where we will allow just about any code or software full access behind our firewalls in the name of cybersecurity. The CrowdStrike outage illustrated just how tremendous a vulnerability that access is to global IT infrastructure. It’s a catch-22. To ensure best-in-class cybersecurity, you need to allow certain software programs full access to your data and systems so they can monitor and block malicious attacks from all possible sources. What we saw with CrowdStrike is what happens when, intentionally or not, the program designed to protect the entire system becomes the attacking code.
So, how do we provide robust cybersecurity protection to our data and systems while minimizing the vulnerability to large-scale bugs such as the corrupted CrowdStrike update? For most organizations, that will end up being a risk versus reward analysis that is unique to their specific circumstances, the regulations they are beholden to, and their overall risk tolerance. The key to mitigating that risk is eliminating single points of failure so that there is always an alternate path to maintain uptime. In the CrowdStrike case, diversification of operating systems would have enabled organizations to keep non-Windows-based computers online and operating properly. For regional weather events, a remote disaster recovery site outside of the impacted area can help maintain uptime.
Since we’ve now seen it once, we can assume we will see it again. Before the next widespread IT outage, assess your power infrastructure to determine what can be done to enhance cybersecurity efforts, reduce potential single points of failure, and plan and prepare for the next potential IT disaster.
Enhance Cybersecurity Efforts by Securing Your Network
While your power infrastructure cannot protect your organization from another corrupted software update, it can help secure your network to make cyberattacks and downtime less likely.
It’s important to assess and secure access points to your network. Many companies are quite diligent about securing user access with tools such as VPNs. By monitoring and limiting the computing devices that have access to the corporate network, companies can control who can log onto their network, when, and from where. But what about all the network-connected devices such as UPS equipment, PDUs, and various monitoring devices? Often overlooked as a cybersecurity risk, network management cards allow two-way access to devices throughout the network. The network protocols that provide the convenience of allowing IT managers to shut down or reboot equipment remotely also create potential access points for malicious code. Eaton’s Gigabit Network Card and Schneider Electric’s Secure NMC Systems Subscriptions powered by EcoStruxure have enhanced security protocols to help prevent unauthorized network access through power infrastructure equipment such as UPSs. Most notably, these latest-generation network cards come with a user-configurable firewall to meet specific network security compliance requirements.
Most network-connected devices come with factory default login credentials. These are usually the same username and password from one device to another, and it’s common for IT managers to skip the step of configuring new and unique login credentials for these devices. It is recommended that you check the usernames and passwords for all network-connected devices to ensure they have been updated from the factory defaults. Otherwise, anyone with knowledge of a manufacturer’s default credentials could potentially gain access to your network and all the devices connected to it.
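The credential check described above can be sketched as a simple inventory audit. This is a minimal illustration only: the model names, the default username and password pairs, and the inventory records are hypothetical placeholders, and in practice the credentials would come from a password manager or the devices themselves rather than a hard-coded list.

```python
from dataclasses import dataclass

# Hypothetical factory defaults, for illustration only. Consult each
# manufacturer's documentation for the real values.
FACTORY_DEFAULTS = {
    "ExampleUPS": ("admin", "admin"),
    "ExamplePDU": ("user", "password"),
}


@dataclass
class Device:
    name: str
    model: str
    username: str
    password: str


def uses_factory_default(device):
    """Return True if the credentials match the known default for the model."""
    return FACTORY_DEFAULTS.get(device.model) == (device.username, device.password)


def audit(devices):
    """Return the names of devices still using factory-default credentials."""
    return [d.name for d in devices if uses_factory_default(d)]


# Hypothetical inventory of network-connected power devices.
inventory = [
    Device("ups-rack-01", "ExampleUPS", "admin", "admin"),
    Device("pdu-rack-01a", "ExamplePDU", "netops", "long-unique-passphrase"),
]

flagged = audit(inventory)
print(flagged)
```

Any device name that appears in the printed list would be a candidate for an immediate credential change.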
Reduce Potential Single Points of Failure
The power infrastructure has a very limited capability to protect organizations from malicious code, corrupted software updates, and other network-borne risks. It can, however, be a point of failure in its own right, causing downtime and business disruptions that are potentially just as costly as a cyberattack or errant code. Because of this risk, it is critical that organizations identify and address infrastructure vulnerabilities with power redundancy.
How close an organization gets to true redundancy comes down to a cost-benefit analysis weighed against its tolerance for the risk of downtime. Some highly critical facilities, such as large regional hospitals, will be connected to a redundant utility source and carry that 2N configuration through their power path all the way to redundant back-up power for dual-corded servers and switches. Other, less critical facilities will consider a single utility source with generator back-up power sufficient but may still configure some highly critical equipment with 2N. Some companies will provide redundancy only for their most critical applications and will tolerate outages in other, less critical areas of the facility.
The UPS and PDUs are the most common single points of failure in the power infrastructure and, fortunately, often the easiest to correct. Most major manufacturers offer modular 3-Phase UPS systems, such as the APC Symmetra PX, that can offer N+1 redundancy with extra power modules, where N is the number of modules needed to carry the load. N+1 configurations allow for the failure of a power module without interruption to the IT load. Depending on the load size and the capacity of the UPS chassis, these systems can also be configured for N+2 or more redundancy. Replacing a single-module UPS with a modular UPS is the simplest way to introduce redundancy into the data center back-up power infrastructure.
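As a rough illustration of the N+1 arithmetic, the sketch below works out how many power modules a modular UPS would need for a given load. The 35 kW load and 10 kW module size are hypothetical figures chosen for the example and are not specifications for any particular product. A real sizing exercise would also account for chassis capacity, power factor, and growth.

```python
import math


def modules_required(load_kw, module_kw, redundant_modules=1):
    """Modules needed to carry the load (N) plus the redundant spares."""
    if load_kw <= 0 or module_kw <= 0:
        raise ValueError("load_kw and module_kw must be positive")
    n = math.ceil(load_kw / module_kw)  # N: minimum modules for the load
    return n + redundant_modules


def survives_failures(load_kw, module_kw, installed, failures):
    """True if the remaining modules can still carry the full load."""
    return (installed - failures) * module_kw >= load_kw


# Hypothetical example: 35 kW IT load served by 10 kW power modules.
installed = modules_required(35, 10)  # N = 4, so N+1 = 5 modules
print(installed)
print(survives_failures(35, 10, installed, failures=1))
print(survives_failures(35, 10, installed, failures=2))
```

Passing `redundant_modules=2` gives the N+2 count for the same load.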
On a smaller scale, to fully reap the benefit of dual-corded servers and networking gear, each power cord should be plugged into its own backup power supply. Best practice to maintain 2N redundancy for rack-mounted dual-corded equipment is to have two separate rack PDUs with one input cord plugged into each PDU. In turn, each of the PDUs is plugged into and protected by its own rack-mounted UPS. The APC Smart-UPS Online and the Eaton 9PX are both popular for this application. If one UPS, PDU, or input cord fails, the other side is fully protected and there is no downtime.
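One way to audit a rack against this 2N layout is to record which PDU each cord plugs into and which UPS feeds each PDU, then flag any device whose cords converge on a shared PDU or UPS. The sketch below assumes that mapping is kept in two plain dictionaries. All device, PDU, and UPS names are hypothetical.

```python
def find_shared_paths(device_cords, pdu_to_ups):
    """Flag devices whose power cords do not reach two independent UPSs.

    device_cords: device name -> list of PDUs, one entry per power cord
    pdu_to_ups:   PDU name -> the UPS that feeds it
    """
    problems = {}
    for device, pdus in device_cords.items():
        upses = {pdu_to_ups[p] for p in pdus}
        if len(set(pdus)) < 2:
            problems[device] = "cords do not span two PDUs"
        elif len(upses) < 2:
            problems[device] = "PDUs do not span two UPSs"
    return problems


# Hypothetical rack layout.
pdu_to_ups = {"pdu-a": "ups-a", "pdu-b": "ups-b", "pdu-c": "ups-a"}
device_cords = {
    "server-01": ["pdu-a", "pdu-b"],  # correct 2N wiring
    "switch-01": ["pdu-a", "pdu-a"],  # both cords on one PDU
    "server-02": ["pdu-a", "pdu-c"],  # separate PDUs, same UPS
}

problems = find_shared_paths(device_cords, pdu_to_ups)
print(problems)
```

An empty result would mean every listed device has two independent power paths, at least as far as the recorded mapping is accurate.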
Plan and Prepare for Widespread Outages
Disasters will occur. Some, like major weather events, will come with a few days’ notice. Others, like the corrupt CrowdStrike update, will occur in an instant without warning. IT managers need to be prepared for both. In most cases, a disaster recovery plan will minimize the adverse effects of the disruptive event. For highly critical applications, a well-considered disaster recovery plan often includes a remote Disaster Recovery (DR) site. This is most effective against localized weather events or regional power outages. In the case of the CrowdStrike outage, a DR site would only have been effective if the equipment there was running a non-Windows operating system such as Linux or macOS.
Other recommended preparations include regular maintenance and testing of the generator and automatic transfer switch (ATS), including a check on the fuel supply. UPSs, 3-Phase PDUs, and data center cooling equipment should have factory service plan coverage with regular preventive maintenance visits. Depending on the installation environment and number of cycles, UPS batteries should be proactively replaced once they are between three and five years old.
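The three-to-five-year battery guideline lends itself to a simple age check against installation dates. The sketch below is one possible approach: the status labels, thresholds as function parameters, and the sample dates are assumptions made for illustration, and the right replacement point within that window still depends on the environment and cycle count.

```python
from datetime import date


def battery_status(installed, today, review_years=3, replace_years=5):
    """Classify a UPS battery by age against the three-to-five-year window."""
    age_years = (today - installed).days / 365.25
    if age_years >= replace_years:
        return "replace"
    if age_years >= review_years:
        return "plan replacement"
    return "ok"


# Hypothetical installation dates, checked as of a fixed date.
as_of = date(2024, 7, 19)
print(battery_status(date(2023, 3, 1), as_of))
print(battery_status(date(2020, 1, 15), as_of))
print(battery_status(date(2018, 6, 1), as_of))
```

Passing the check date explicitly keeps the result repeatable. A scheduled report would pass `date.today()` instead.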
The global IT outage caused by the corrupt CrowdStrike software update was unusual in how it was able to proliferate through so many computer systems around the globe within a matter of minutes. While we may not see an outage of that scale in the immediate future, it serves as a reminder of how vulnerable our IT systems and power infrastructure are to single points of failure. Before the next disaster strikes, it’s worth auditing your infrastructure and systems for network access points that may be vulnerable to attack and adding redundancy to potential single points of failure in the power infrastructure. Similarly, it may be time to review and enhance your site’s disaster recovery plan and evaluate your DR site for effectiveness against a range of potential disruptions. If you don’t already have a disaster recovery plan in place, you can start with a data center and power infrastructure Risk Assessment Service to identify potential problem areas.
For more information about Disaster Recovery Planning, Assessment services, and proactive power infrastructure maintenance and factory service plans, call 800-876-9373 or send an email to [email protected].