One Year Later: Reflecting on the CrowdStrike Outage and the Cybersecurity Lessons We Can't Afford to Ignore
How a single software update brought down airlines, banks, and hospitals worldwide
Today marks one year since the CrowdStrike incident that affected roughly 8.5 million Windows systems worldwide, causing what many called the largest outage in the history of information technology.
As we reflect on this pivotal moment in cybersecurity and IT operations, it's important to focus not just on what went wrong, but on the valuable lessons that have emerged from this experience.
What Happened: A Brief Recap

On July 19, 2024, CrowdStrike pushed a flawed content update to its Falcon Endpoint Detection and Response (EDR) software, causing Windows devices to crash to the "Blue Screen of Death" (BSOD) and disrupting critical infrastructure, businesses, and daily life across the globe.
Here are some of the key lessons IT professionals shared on Reddit in the aftermath:
Incident Response and Disaster Recovery
Offline Privileged Systems: Having offline privileged systems can be a lifesaver. "Increase the number of offline privileged systems for IT (updated regularly, but otherwise offline)."
LAPS Configuration: Properly configuring and testing Local Administrator Password Solution (LAPS) is crucial. “Fix LAPS... We got AD back up, and some machines have dropped off with no credentials.”
Backup and Recovery: Ensure robust backup and recovery solutions, especially for critical systems. "Consider your backup infrastructure... guess what endpoint protection the backup servers including the management servers were running?"
Communication Channels: Having alternative communication methods, like SMS, can be vital when email is down. "Consider an SMS list for staff (many couldn't access email)."
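The communication lesson above can be sketched as a simple fallback dispatcher: try the primary channel first, and fall back to out-of-band channels such as SMS when it fails. This is a minimal illustration only; the channel functions below are hypothetical stand-ins, not a real email or SMS gateway API.

```python
# Minimal sketch of out-of-band notification fallback. The channel
# functions are hypothetical placeholders for real integrations
# (corporate email, an SMS gateway, a phone tree).

def notify_with_fallback(message, channels):
    """Try each (name, send_fn) channel in order; return the name of the
    first channel that succeeds, or None if every channel fails."""
    for name, send_fn in channels:
        try:
            send_fn(message)
            return name
        except Exception:
            continue  # this channel is down (e.g. email during the outage)
    return None


if __name__ == "__main__":
    def email(msg):            # simulate mail servers being down
        raise ConnectionError("mail servers unreachable")

    def sms(msg):              # stand-in for a real SMS gateway call
        print(f"SMS sent: {msg}")

    used = notify_with_fallback(
        "Major outage: follow the DR runbook",
        [("email", email), ("sms", sms)],
    )
    print(used)  # -> sms
```

The point of the sketch is ordering: the channels list encodes which paths you trust to still be up when your primary infrastructure is not.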
Security and Deployment
Domain Controller Segregation: Keep at least one Domain Controller on a different endpoint-protection product so a single vendor failure cannot take down all authentication. "Consider separating one Domain Controller (DC) into a different solution to CrowdStrike."
Testing and Rollout: Thorough testing of updates and gradual rollouts can prevent widespread issues. "Test servers always get updated two weeks or so before the production servers."
Documentation: Comprehensive documentation can help in troubleshooting and understanding system interdependencies. "Better documentation with explanation how different components interconnect and session flow goes."
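The testing and rollout advice above amounts to deploying in rings: update a small canary group first, let it soak, then widen only if it stays healthy. Here is a minimal sketch of that ring logic; the ring names, sizes, soak periods, and health check are invented for illustration and do not correspond to any real CrowdStrike or fleet-management API.

```python
# Sketch of a ring-based (staged) rollout: each ring receives the update
# only after the previous, smaller ring has stayed healthy through its
# soak period. All names and numbers below are illustrative assumptions.

ROLLOUT_RINGS = [
    {"name": "canary", "hosts": 10, "soak_hours": 24},
    {"name": "test-servers", "hosts": 200, "soak_hours": 336},  # ~2 weeks
    {"name": "production", "hosts": 8000, "soak_hours": 0},
]


def run_rollout(rings, deploy, is_healthy):
    """Deploy ring by ring. If a ring looks unhealthy after its soak
    period, halt before the update reaches the next (larger) ring."""
    for ring in rings:
        deploy(ring["name"], ring["hosts"])
        if not is_healthy(ring["name"], ring["soak_hours"]):
            return ("halted", ring["name"])  # contain the blast radius
    return ("completed", None)
```

For example, with a health check that fails on the test-server ring, `run_rollout` returns `("halted", "test-servers")` and production is never touched, which is exactly the property the quoted advice is after.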
Key Learnings and Takeaways
1. Resilience Over Prevention The incident reinforced that in an interconnected digital world, outages are inevitable. As IT ecosystems grow more complex, the focus must shift from preventing all failures to building systems that can gracefully handle and recover from them.
2. The Importance of Staged Rollouts One of the most critical lessons is the need for more robust testing and gradual deployment strategies. CrowdStrike attributed the incident to a bug in its content-validation software and committed to staged, staggered deployments to avoid a repeat.
3. Business Continuity Planning is Essential Organizations learned the hard way about the importance of having comprehensive business continuity plans. Companies should identify their critical functions to understand what should be recovered first and prioritize their efforts accordingly.
4. Integration of IT and Business Risk IT risks should be integrated with corporate business risk management, as the outage caused decreased productivity, operational efficiency impacts, financial losses, reputational damage, and customer satisfaction issues.
5. The Value of Visibility and Post-Mortems Truly resilient organizations turn disruption into a powerful data source and blueprint for performance assurance by leveraging advanced visibility tools to conduct deeply informative post-mortems.
6. Recovery Validation is Critical Many organizations discovered their recovery time estimates were overly optimistic. Testing and validating recovery procedures before they're needed became a key priority for IT teams worldwide.
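Takeaway 6 can be made concrete with a simple recovery drill check: measure how long each system actually takes to restore, then compare against its recovery time objective (RTO). The systems and numbers below are invented purely for illustration.

```python
# Sketch of recovery-time validation: flag systems whose measured
# restore time in a drill exceeded their RTO, i.e. systems whose
# recovery estimates were optimistic. All data here is illustrative.

def validate_recovery(drill_results):
    """Return the names of systems whose measured restore time in a
    recovery drill exceeded their recovery time objective (RTO)."""
    return [r["name"] for r in drill_results
            if r["measured_minutes"] > r["rto_minutes"]]


# Hypothetical drill data -- not real figures from the incident.
drills = [
    {"name": "payments-db",        "rto_minutes": 60,  "measured_minutes": 45},
    {"name": "domain-controller",  "rto_minutes": 120, "measured_minutes": 300},
    {"name": "backup-mgmt-server", "rto_minutes": 240, "measured_minutes": 600},
]

print(validate_recovery(drills))  # -> ['domain-controller', 'backup-mgmt-server']
```

Running a drill like this before an incident is what turns an RTO from a number on a slide into a tested commitment.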
Moving Forward: Building a More Resilient Future
The CrowdStrike incident, while disruptive, has served as a wake-up call for the industry. It is an opportunity for organizations to strengthen their systems and enhance resiliency, and it highlighted both the fragility and the resilience of our digital infrastructure. As we mark this anniversary, let's commit to:

- Implementing more robust testing and staged deployment processes
- Investing in comprehensive business continuity planning
- Building systems with graceful failure modes
- Fostering a culture of learning from incidents
- Strengthening collaboration between security, IT operations, and business teams
A Testament to Human Resilience
While the outage caused significant disruption, approximately 99% of affected Windows sensors were back online within ten days, demonstrating the remarkable resilience and dedication of IT professionals worldwide who worked around the clock to restore services. The CrowdStrike incident will be remembered not just as a cautionary tale, but as a turning point that made our digital infrastructure stronger, more resilient, and better prepared for the challenges ahead.
What lessons has your organization taken from the CrowdStrike incident?
Share your thoughts and experiences as we continue to build a more resilient digital future together.