
CrowdStrike IT Outage Explained by a Windows Developer

Understanding the CrowdStrike IT Outage: Insights from a Former Windows Developer

Introduction 

Hey, I’m Dave. Welcome to my shop.

I’m Dave Plummer, a retired software engineer from Microsoft, going back to the MS-DOS and Windows 95 days. Thanks to my time as a Windows developer, today I’m going to explain what the CrowdStrike issue actually is, how kernel mode differs from user mode, why these machines are bluescreening, and how to fix one if you come across it.

Now, I’ve got a lot of experience waking up to bluescreens and having them set the tempo of my day, but this Friday was a little different. First off, I’m retired now, so I don’t debug a lot of daily blue screens. And second, I was traveling in New York City, which left me temporarily stranded as the airlines sorted out the digital carnage.

But that downtime gave me plenty of time to pull out the old MacBook and figure out what was happening to all the Windows machines around the world. As far as we know, the CrowdStrike bluescreens that we have been seeing around the world for the last several days are the result of a bad update to the CrowdStrike software. But why? Today I want to help you understand three key things.

Key Points

  • Why the CrowdStrike software is on the machines at all.
  • What happens when a kernel driver like CrowdStrike fails.
  • Precisely why the CrowdStrike code faults and brings the machines down, and how and why this update caused so much havoc.

Handling Crashes at Microsoft 

For systems developers at Microsoft in the 1990s, handling crashes like this was part of our normal bread and butter. Every dev at Microsoft, at least in my area, had two machines. For example, when I started in Windows NT, I had a Gateway 486 DX2/50 as my main dev machine, and then some old 386 box as the debug machine. Normally you would run your test or debug bits on the debug machine while connected to it as the debugger from your good machine.

Anti-Stress Process 

On nights and weekends, however, we did something far more interesting. We ran a process called Anti-Stress. Anti-Stress was a bundle of tests that would automatically download to the test machines and run under the debugger. So every night, every test machine, along with all the machines in the various labs around campus, would run Anti-Stress and put it through the gauntlet.

The stress tests were normally written by our test engineers, who were software developers specially employed back in those days to find and catch bugs in the system. For example, they might write a test to simply allocate and use as many GDI brush handles as possible. If doing so causes the drawing subsystem to become unstable or causes some other program to crash, then it would be caught and stopped in the debugger immediately.
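To make that concrete, here is a minimal sketch of what such a handle-exhaustion test might look like, using the real Win32 CreateSolidBrush API. The loop, the safety cap, and the deliberate leak are invented for illustration and are not the actual test code from that era.

/* Minimal Win32 sketch: allocate GDI brush handles until creation fails,
 * roughly the kind of resource-exhaustion test described above.
 * Build with: cl brushstress.c gdi32.lib */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    long count = 0;
    HBRUSH brush;

    /* Keep creating brushes (deliberately leaking them) until GDI says no. */
    while ((brush = CreateSolidBrush(RGB(count & 0xFF, 0, 0))) != NULL) {
        count++;
        if (count >= 1000000)
            break;                 /* safety cap for the example */
    }

    printf("created %ld brush handles before failure\n", count);
    return 0;
}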

The following day, all of the crashes and assertions would be tabulated and assigned to an individual developer based on the area of code in which the problem occurred. As the developer responsible, you would then use something like Telnet to connect to the target machine, debug it, and sort it out.

Debugging in Assembly Language 

All this debugging was done in assembly language, whether it was Alpha, MIPS, PowerPC, or x86, and with minimal symbol table information. So it’s not like we had Visual Studio connected. Still, it was enough information to sort out most crashes, find the code responsible, and either fix it or at least enter a bug to track it in our database.

Kernel Mode versus User Mode 

The hardest issues to sort out were the ones that took place deep inside the operating system kernel, which executes at ring zero on the CPU. The operating system uses a ring system to bifurcate code into two distinct modes: kernel mode for the operating system itself and user mode, where your applications run. Kernel mode does tasks such as talking to the hardware and the devices, managing memory, scheduling threads, and all of the really core functionality that the operating system provides.

Application code never runs in kernel mode, and kernel code never runs in user mode. Kernel mode is more privileged, meaning it can see the entire system memory map and what’s in memory at any physical page. User mode only sees the memory map pages that the kernel wants you to see. So if you’re getting the sense that the kernel is very much in control, that’s an accurate picture.

Even if your application needs a service provided by the kernel, it won’t be allowed to just run down inside the kernel and execute it. Instead, your user thread will reach the kernel boundary and then raise an exception and wait. A kernel thread on the kernel side then looks at the specified arguments, fully validates everything, and then runs the required kernel code. When it’s done, the kernel thread returns the results to the user thread and lets it continue on its merry way.

Why Kernel Crashes Are Critical 

There is one other substantive difference between kernel mode and user mode. When application code crashes, the application crashes. When kernel mode crashes, the system crashes. It crashes because it has to. Imagine a really simple bug in the kernel that frees the same memory twice. When the kernel detects that it’s about to free already-freed memory, it treats that as a critical failure and blue screens the system, because the alternatives could be worse.

Consider a scenario where the code that performed the double free is allowed to continue, maybe with an error message, maybe even allowing you to save your work. The problem is that things are so corrupted at that point that saving your work could do more damage, erasing or corrupting the file beyond repair. Worse, since it’s the kernel itself that’s experiencing the issue, application programs are no longer protected from one another in the usual way. The last thing you want is Solitaire triggering a kernel bug that damages your Git enlistment.
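To make the double-free example concrete, here is a minimal user-mode sketch in C. It is not real Windows kernel code; the bugcheck() helper and the bookkeeping structure are invented stand-ins, but they show why an allocator that catches a second free of the same block halts rather than trying to carry on.

/* Minimal user-mode sketch of why a double free is treated as fatal.
 * Not real Windows kernel code; bugcheck() stands in for the halt
 * (blue screen) a kernel performs once its bookkeeping can't be trusted. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    void *ptr;
    int   freed;    /* allocator bookkeeping: has this block been freed? */
} block_t;

static void bugcheck(const char *msg)
{
    fprintf(stderr, "BUGCHECK: %s\n", msg);
    abort();        /* in a kernel, the whole machine stops here */
}

static void tracked_free(block_t *b)
{
    if (b->freed)
        bugcheck("freeing memory that was already freed");
    free(b->ptr);
    b->ptr = NULL;
    b->freed = 1;
}

int main(void)
{
    block_t b = { malloc(32), 0 };
    tracked_free(&b);
    tracked_free(&b);   /* second free: detected and treated as critical */
    return 0;
}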

And that’s why when an unexpected condition occurs in the kernel, the system is just halted. This is not a Windows thing by any stretch. It is true for all modern operating systems like Linux and macOS as well. In fact, the biggest difference is the color of the screen when the system goes down. On Windows, it’s blue, but on Linux it’s black, and on macOS, it’s usually pink. But as on all systems, a kernel issue is a reboot at a minimum.

What Runs in Kernel Mode 

Now that we know a bit about kernel mode versus user mode, let’s talk about what specifically runs in kernel mode. And the answer is very, very little. The only things that go in the kernel mode are things that have to, like the thread scheduler and the heap manager and functionality that must access the hardware, such as the device driver that talks to a GPU across the PCIe bus. And so the totality of what you run in kernel mode really comes down to the operating system itself and device drivers.

And that’s where CrowdStrike enters the picture with their Falcon sensor. Falcon is a security product, and while it’s not just simply an antivirus, it’s not that far off the mark to look at it as though it’s really anti-malware for the server. But rather than just looking for file definitions, it analyzes a wide range of application behavior so that it can try to proactively detect new attacks before they’re categorized and listed in a formal definition.

CrowdStrike Falcon Sensor 

To be able to see that application behavior from a clear vantage point, that code needed to be down in the kernel. Without getting too far into the weeds of what CrowdStrike Falcon actually does, suffice it to say that it has to be in the kernel to do it. And so CrowdStrike wrote a device driver, even though there’s no hardware device that it’s really talking to. But by writing their code as a device driver, it lives down with the kernel in ring zero and has complete and unfettered access to the system, data structures, and the services that they believe it needs to do its job.

Everybody at Microsoft and probably at CrowdStrike is aware of the stakes when you run code in kernel mode, and that’s why Microsoft offers the WHQL certification, which stands for Windows Hardware Quality Labs. Drivers labeled as WHQL certified have been thoroughly tested by the vendor and then have passed the Windows Hardware Lab Kit testing on various platforms and configurations and are signed digitally by Microsoft as being compatible with the Windows operating system. By the time a driver makes it through the WHQL lab tests and certifications, you can be reasonably assured that the driver is robust and trustworthy. And when it’s determined to be so, Microsoft issues that digital certificate for that driver. As long as the driver itself never changes, the certificate remains valid.
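As an aside, on a machine with the Windows SDK installed you can inspect a driver’s signature yourself with the signtool utility; the driver path below is just a placeholder, not a claim about any particular file. The /kp switch asks signtool to check the file against the kernel-mode driver signing policy rather than ordinary Authenticode rules.

signtool verify /v /kp C:\Windows\System32\drivers\example.sys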

CrowdStrike’s Agile Approach 

But what if you’re CrowdStrike and you’re agile, ambitious, and aggressive, and you want to ensure that your customers get the latest protection as soon as new threats emerge? Every time something new pops up on the radar, you could make a new driver and put it through the Hardware Quality Labs, get it certified, signed, and release the updated driver. And for things like video cards, that’s a fine process. I don’t actually know what the WHQL turnaround time is like, whether that’s measured in days or weeks, but it’s not instant, and so you’d have a time window where a zero-day attack could propagate and spread simply because of the delay in getting an updated CrowdStrike driver built and signed.

Dynamic Definition Files 

What CrowdStrike opted to do instead was to include definition files that are processed by the driver but not actually included with it. So when the CrowdStrike driver wakes up, it enumerates a folder on the machine looking for these dynamic definition files, and it does whatever it is that it needs to do with them. But you can already perhaps see the problem. Let’s speculate for a moment that the CrowdStrike dynamic definition files are not merely malware definitions but complete programs in their own right, written in a p-code that the driver can then execute.

In a very real sense, then the driver could take the update and actually execute the p-code within it in kernel mode, even though that update itself has never been signed. The driver becomes the engine that runs the code, and since the driver hasn’t changed, the cert is still valid for the driver. But the update changes the way the driver operates by virtue of the p-code that’s contained in the definitions, and what you’ve got then is unsigned code of unknown provenance running in full kernel mode.

All it would take is a single little bug like a null pointer reference, and the entire temple would be torn down around us. Put more simply, while we don’t yet know the precise cause of the bug, executing untrusted p-code in the kernel is risky business at best and could be asking for trouble.
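We don’t know CrowdStrike’s actual file format, so treat the following C sketch purely as a thought experiment: a driver-style loop that reads opcodes from a definition file and dispatches on them. The opcodes, the file name, and the dispatch logic are all invented; the point is simply that whatever bytes happen to be in the file end up steering code that runs with full privileges.

/* Hypothetical p-code engine sketch (user-mode, illustrative only). */
#include <stdio.h>

enum { OP_END = 1, OP_SCAN = 2, OP_BLOCK = 3 };

static int run_definitions(FILE *f)
{
    int op;
    while ((op = fgetc(f)) != EOF) {
        switch (op) {
        case OP_END:
            return 0;               /* clean end of program */
        case OP_SCAN:
            /* would inspect some process behavior here */
            break;
        case OP_BLOCK:
            /* would block some suspicious action here */
            break;
        default:
            /* Unknown byte: a file of all zeros lands here immediately.
             * A robust engine rejects it; a fragile one that assumes the
             * byte is valid can wander off into bad pointers. */
            return -1;
        }
    }
    return 0;
}

int main(void)
{
    /* Hypothetical definition file name, for illustration only. */
    FILE *f = fopen("C-00000291-sample.sys", "rb");
    if (!f)
        return 1;
    int rc = run_definitions(f);
    fclose(f);
    return rc == 0 ? 0 : 2;
}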

Post-Mortem Debugging 

We can get a better sense of what went wrong by doing a little post-mortem debugging of our own. First, we need a crash dump report, the kind you were used to seeing in the good old NT days but which is now hidden behind the frowny-face blue screen. Depending on how your system is configured, though, you can still get the crash dump info, and so there was no real shortage of dumps around to look at. Here’s an example from Twitter, so let’s take a look. About a third of the way down, you can see the offending instruction that caused the crash.

It’s an attempt to move data into register R9 by loading it from a memory pointer held in R8. Couldn’t be simpler. The only problem is that the pointer in R8 is garbage. It’s not a memory address at all but a small integer, 0x9C, which is likely the offset of the field they’re actually interested in within a data structure. They almost certainly started with a null pointer, added 0x9C to it, and then dereferenced it.
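That failure pattern is easy to reproduce in ordinary C. In the sketch below (the structure layout and field names are invented for illustration), reading a field that sits 0x9C bytes into a structure through a NULL base pointer faults at address 0x9C, the field’s offset, rather than at any real address.

/* Illustrative only: reading a field through a NULL struct pointer
 * faults at an address equal to the field's offset, not at 0. */
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

struct definition_entry {        /* invented layout for the example */
    uint8_t  header[0x9C];       /* padding up to the field we want */
    uint32_t flags;              /* sits at offset 0x9C */
};

int main(void)
{
    struct definition_entry *e = NULL;   /* e.g., a lookup that failed */

    printf("offset of flags = 0x%zx\n",
           offsetof(struct definition_entry, flags));

    /* Morally the same as "mov r9, [r8]" with r8 == 0x9C: the CPU is
     * asked to read from address 0x9C, which isn't mapped, so it faults. */
    uint32_t value = e->flags;           /* crashes here */
    return (int)value;
}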

CrowdStrike driver woes

Now, debugging something like this is often an incremental process where you wind up establishing, “Okay, so this bad thing happened, but what happened upstream to cause it?” And in this case, it appears that the cause is the dynamic data file downloaded as a .sys file. Instead of containing p-code or a malware definition or whatever was supposed to be in the file, it was all just zeros.

We don’t know yet how or why this happened, as CrowdStrike hasn’t publicly released that information yet. What we do know to an almost certainty at this point, however, is that the CrowdStrike driver that processes and handles these updates is not very resilient and appears to have inadequate error checking and parameter validation.

Parameter validation means checking that the data and arguments being passed to a function, and in particular to a kernel function, are valid. If they’re not, the function call should fail cleanly, not bring the entire system down. But in the CrowdStrike case, they’ve got a bug they don’t protect against, and because their code lives in ring zero with the kernel, a bug in CrowdStrike necessarily bug checks the entire machine and deposits you at the dreaded recovery bluescreen.
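Here’s a hedged sketch of what that kind of validation can look like in C. The function name, header layout, and magic value are all made up for the example; the idea is simply to check the pointer, the size, and a basic sanity marker before touching anything, and to fail the call instead of letting bad input propagate into a crash.

/* Illustrative parameter validation: fail the call, don't crash the box.
 * The names, layout, and magic value are invented for the example. */
#include <stddef.h>
#include <stdint.h>

#define UPDATE_MAGIC   0x43535550u   /* hypothetical "CSUP" marker */
#define STATUS_OK      0
#define STATUS_INVALID (-1)

struct update_header {               /* invented layout */
    uint32_t magic;
    uint32_t payload_size;
};

int process_update(const void *buf, size_t len)
{
    const struct update_header *hdr = buf;

    if (buf == NULL || len < sizeof(*hdr))
        return STATUS_INVALID;                 /* reject, don't dereference */
    if (hdr->magic != UPDATE_MAGIC)
        return STATUS_INVALID;                 /* e.g., a file of all zeros */
    if (hdr->payload_size > len - sizeof(*hdr))
        return STATUS_INVALID;                 /* payload must fit the buffer */

    /* ... only now is it safe to walk the payload ... */
    return STATUS_OK;
}

int main(void)
{
    uint8_t zeros[64] = { 0 };                 /* stand-in for the bad update */
    return process_update(zeros, sizeof zeros) == STATUS_INVALID ? 0 : 1;
}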

Windows Resilience 

Even though this isn’t a Windows issue or a fault with Windows itself, many people have asked me why Windows itself isn’t just more resilient to this type of issue. For example, if a driver fails during boot, why not try to boot next time without it and see if that helps?

And Windows, in fact, does offer a number of facilities like that, going back as far as booting NT with the Last Known Good registry hive. But there’s a catch, and that catch is that CrowdStrike marked their driver as what’s known as a boot-start driver. A boot-start driver is a device driver that must be installed to start the Windows operating system.

Most boot-start drivers are included in driver packages that ship in the box with Windows, and Windows installs them automatically during the first boot of the system. My guess is that CrowdStrike decided they didn’t want you booting at all without the protection their product provides, but when it crashes, as it does now, your system is completely borked.
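For context, whether a Windows driver loads at boot is controlled by the Start value under its service key in the registry: 0 (SERVICE_BOOT_START) means the boot loader brings it in before Windows is fully up, while 3 (SERVICE_DEMAND_START) would load it only on demand. The snippet below is a generic, illustrative example; ExampleSensor is a made-up service name, not CrowdStrike’s actual driver.

; Illustrative only: "ExampleSensor" is a hypothetical driver service name.
; Type 1 = SERVICE_KERNEL_DRIVER; Start 0 = SERVICE_BOOT_START (boot-start);
; a value of 3 (SERVICE_DEMAND_START) would instead load the driver on demand.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ExampleSensor]
"Type"=dword:00000001
"Start"=dword:00000000
"ErrorControl"=dword:00000001
"ImagePath"="system32\\drivers\\examplesensor.sys"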

Fixing the Issue 

Fixing a machine with this issue is fortunately not a great deal of work, but it does require physical access to the machine. To fix a machine that’s crashed due to this issue, you need to boot it into safe mode, because safe mode only loads a limited set of drivers and can mercifully get by without this boot-start driver.

You’ll still be able to get into at least a limited system. Then, to fix the machine, use the console or File Explorer and go to C:\Windows\System32\drivers\CrowdStrike. In that folder, find the file matching the pattern C-00000291*.sys (a C, a dash, a run of zeros, then 291) and delete it, along with anything else carrying that 291-and-zeros name. When you reboot, your system should come up completely normal and operational.
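From a command prompt in Safe Mode or the recovery environment, the widely circulated manual workaround looks roughly like this (the drive letter and exact file names on a given machine may vary slightly):

cd C:\Windows\System32\drivers\CrowdStrike
dir C-00000291*.sys
del C-00000291*.sys
shutdown /r /t 0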

The absence of the update file fixes the issue and does not cause any additional ones. It’s a fair bet that the update 291 won’t ever be needed or used again, so you’re fine to nuke it.


Further references 

CrowdStrike IT Outage Explained by a Windows Developer – Dave’s Garage, YouTube (video, 13:40)

The Aftermath of the World’s Biggest IT Outage

The Great Digital Blackout: Fallout from the CrowdStrike-Microsoft Outage

i. Introduction 

On a seemingly ordinary Friday morning, the digital world shuddered. A global IT outage, unprecedented in its scale, brought businesses, governments, and individuals to a standstill. The culprit: a faulty update from cybersecurity firm CrowdStrike, clashing with Microsoft Windows systems. The aftershocks of this event, dubbed the “Great Digital Blackout,” continue to reverberate, raising critical questions about our dependence on a handful of tech giants and the future of cybersecurity.

ii. The Incident

A routine software update within Microsoft’s Azure cloud platform inadvertently triggered a cascading failure across multiple regions. This outage, compounded by a simultaneous breach of CrowdStrike’s security monitoring systems, created a perfect storm of disruption. Within minutes, critical services were rendered inoperative, affecting millions of users and thousands of businesses worldwide. The outage persisted for 48 hours, making it one of the longest and most impactful in history.

iii. Initial Reports and Response

The first signs that something was amiss surfaced around 3:00 AM UTC, when users began reporting issues accessing Microsoft Azure and Office 365 services. Concurrently, CrowdStrike’s Falcon platform started exhibiting anomalies. By 6:00 AM UTC, both companies acknowledged the outage, attributing the cause to a convergence of system failures and a sophisticated cyber attack exploiting vulnerabilities in their systems.

CrowdStrike and Microsoft activated their incident response protocols, working around the clock to mitigate the damage. Microsoft’s global network operations team mobilized to isolate affected servers and reroute traffic, while CrowdStrike’s cybersecurity experts focused on containing the breach and analyzing the attack vectors.

iv. A Perfect Storm: Unpacking the Cause

A. The outage stemmed from a seemingly innocuous update deployed by CrowdStrike, a leading provider of endpoint security solutions. The update, intended to bolster defenses against cyber threats, triggered a series of unforeseen consequences. It interfered with core Windows functionalities, causing machines to enter a reboot loop, effectively rendering them unusable.

B. The domino effect was swift and devastating. Businesses across various sectors – airlines, hospitals, banks, logistics – found themselves crippled. Flights were grounded, financial transactions stalled, and healthcare operations were disrupted.

C. The blame game quickly ensued. CrowdStrike, initially silent, eventually acknowledged their role in the outage and apologized for the inconvenience. However, fingers were also pointed at Microsoft for potential vulnerabilities in their Windows systems that allowed the update to wreak such havoc.

v. Immediate Consequences (Businesses at a Standstill)

The immediate impact of the outage was felt by businesses worldwide. 

A. Microsoft: Thousands of companies dependent on Microsoft’s Azure cloud services found their operations grinding to a halt. Websites went offline, financial transactions stalled, and communication channels were disrupted. E-commerce platforms experienced massive downtime, losing revenue by the minute, while hospital systems relying on cloud-based records faced critical disruptions that compromised patient care.

B. CrowdStrike: Similarly, CrowdStrike’s clientele, comprising numerous Fortune 500 companies, grappled with the fallout. Their critical security monitoring and threat response capabilities were significantly hindered, leaving them vulnerable.

vi. Counting the Costs: Beyond Downtime

The human and economic toll of the Great Digital Blackout is still being calculated. Initial estimates suggest billions of dollars in lost productivity, and preliminary figures put global economic losses at more than $200 billion, but the true cost extends far beyond financial figures. Businesses across sectors reported significant revenue losses, with SMEs particularly hard-hit. Recovery and mitigation efforts further strained financial resources, and insurance claims surged as businesses sought to recoup their losses.

  • Erosion of Trust: The incident exposed the fragility of our increasingly digital world, eroding trust in both CrowdStrike and Microsoft. Businesses and organizations now question the reliability of security solutions and software updates.
  • Supply Chain Disruptions: The interconnectedness of global supply chains was thrown into disarray. Manufacturing, shipping, and logistics faced delays due to communication breakdowns and the inability to process orders electronically.
  • Cybersecurity Concerns: The outage highlighted the potential for cascading effects in cyberattacks. A seemingly minor breach in one system can have a devastating ripple effect across the entire digital ecosystem.

vii. Reputational Damage

Both Microsoft and CrowdStrike suffered severe reputational damage. Trust in Microsoft’s Azure platform and CrowdStrike’s cybersecurity solutions was shaken. Customers, wary of future disruptions, began exploring alternative providers and solutions. The incident underscored the risks of over-reliance on major service providers and ignited discussions about diversifying IT infrastructure.

viii. Regulatory Scrutiny

In the wake of the outage, governments and regulatory bodies worldwide called for increased oversight and stricter regulations. The incident highlighted the need for robust standards to ensure redundancy, effective backup systems, and rapid recovery protocols. In the United States, discussions about enhancing the Cybersecurity Maturity Model Certification (CMMC) framework gained traction, while the European Union considered expanding the scope of the General Data Protection Regulation (GDPR) to include mandatory resilience standards for IT providers.

ix. Data Security and Privacy Concerns

One of the most concerning aspects of the outage was the potential exposure of sensitive data. Both Microsoft and CrowdStrike store vast amounts of critical and confidential data. Although initial investigations suggested that the attackers did not exfiltrate data, the sheer possibility raised alarms among clients and regulatory bodies worldwide.

Governments and compliance agencies intensified their scrutiny, reinforcing the need for robust data protection measures. Customers demanded transparency about what data, if any, had been compromised, leading to an erosion of trust in cloud services.

x. Root Causes and Analysis

Following the containment of the outage, both CrowdStrike and Microsoft launched extensive investigations to determine the root causes. Preliminary reports cited a combination of factors:

A. Zero-Day Exploits: The attackers leveraged zero-day vulnerabilities in both companies’ systems, which had not been previously detected or patched.   

B. Supply Chain Attack: A key supplier providing backend services to both companies was compromised, allowing the attackers to penetrate deeper into their networks.

C. Human Error: Configuration errors and lack of stringent security checks at critical points amplified the impact of the vulnerabilities.

D. Coordinated Attack: Cybersecurity analysts suggested that the attack bore the hallmarks of a highly coordinated and well-funded group, potentially a nation-state actor, given the sophistication and scale. The alignment of the outage across multiple critical services pointed to a deliberate and strategic attempt to undermine global technological infrastructure.

xi. Response Strategies

A. CrowdStrike’s Tactics

  • Swift Containment: Immediate action was taken to contain the breach. CrowdStrike’s incident response teams quickly identified and isolated the compromised segments of their network to prevent further penetration.
  • Vulnerability Mitigation: Patches were rapidly developed and deployed to close the exploited security gaps. Continuous monitoring for signs of lingering threats or additional vulnerabilities was intensified.
  • Client Communication: Transparency became key. CrowdStrike maintained open lines of communication with its clients, providing regular updates, guidance on protective measures, and reassurance to mitigate the trust deficit.

B. Microsoft’s Actions

  • Global Response Scaling: Leveraging its extensive resources, Microsoft scaled up its global cybersecurity operations. Frantic efforts were made to stabilize systems, restore services, and strengthen defenses against potential residual threats.
  • Service Restoration: Microsoft prioritized the phased restoration of services. This approach ensured that each phase underwent rigorous security checks to avoid reintroducing vulnerabilities.
  • Collaboration and Information Sharing: Recognizing the widespread impact, Microsoft facilitated collaboration with other tech firms, cybersecurity experts, and government agencies. Shared intelligence helped in comprehending the attack’s full scope and in developing comprehensive defense mechanisms.

xii. Broad Implications 

A. Evolving Cyber Threat Landscape

  • Increased Sophistication: The attack underscored the evolving sophistication of cyber threats. Traditional security measures are proving insufficient against highly organized and well-funded adversaries.
  • Proactive Security Posture: The event emphasized the need for a proactive security stance, which includes real-time threat intelligence, continuous system monitoring, and regular vulnerability assessments.

B. Trust in Cloud Computing

  • Cloud Strategy Reevaluation: The reliance on cloud services came under scrutiny. Organizations began rethinking their cloud strategies, weighing the advantages against the imperative of reinforcing security protocols.
  • Strengthened Security Measures: There is a growing emphasis on bolstering supply chain security. Companies are urged to implement stringent controls, cross-verify practices with their vendors, and engage in regular security audits.

xiii. A Catalyst for Change: Lessons Learned

The Great Digital Blackout serves as a stark reminder of the need for a comprehensive reevaluation of our approach to cybersecurity and technology dependence. Here are some key takeaways:

  • Prioritize Security by Design: Software development and security solutions need to prioritize “security by design” principles. Rigorous testing and vulnerability assessments are crucial before deploying updates.
  • Enhanced Cybersecurity: The breach of CrowdStrike’s systems highlighted potential vulnerabilities in cybersecurity frameworks. Enhanced security measures and continuous monitoring are vital to prevent similar incidents.
  • Diversity and Redundancy: Over-reliance on a few tech giants can be a vulnerability. Diversifying software and service providers, coupled with built-in redundancies in critical systems, can mitigate the impact of such outages.
  • Redundancy and Backup: The incident underscored the necessity of having redundant systems and robust backup solutions. Businesses are now more aware of the importance of investing in these areas to ensure operational continuity during IT failures.
  • Disaster Recovery Planning: Effective disaster recovery plans are critical. Regular drills and updates to these plans can help organizations respond more efficiently to disruptions.
  • Communication and Transparency: Swift, clear communication during disruptions is essential. Both CrowdStrike and Microsoft initially fell short in this area, causing confusion and exacerbating anxieties.
  • Regulatory Compliance: Adhering to evolving regulatory standards and being proactive in compliance efforts can help businesses avoid penalties and build resilience.
  • International Collaboration: Cybersecurity threats require an international response. Collaboration between governments, tech companies, and security experts is needed to develop robust defense strategies and communication protocols.

xiv. The Road to Recovery: Building Resilience

The path towards recovery from the Great Digital Blackout is multifaceted. It involves:

  • Post-Mortem Analysis: Thorough investigations by CrowdStrike, Microsoft, and independent bodies are needed to identify the root cause of the outage and prevent similar occurrences.
  • Investing in Cybersecurity Awareness: Educating businesses and individuals about cyber threats and best practices is paramount. Regular training and simulation exercises can help organizations respond more effectively to future incidents.
  • Focus on Open Standards: Promoting open standards for software and security solutions can foster interoperability and potentially limit the impact of individual vendor issues.

xv. A New Era of Cybersecurity: Rethinking Reliance

The Great Digital Blackout serves as a wake-up call. It underscores the need for a more robust, collaborative, and adaptable approach to cybersecurity. By diversifying our tech infrastructure, prioritizing communication during disruptions, and fostering international cooperation, we can build a more resilient digital world.

The event also prompts a conversation about our dependence on a handful of tech giants. While these companies have revolutionized our lives, the outage highlighted the potential pitfalls of such concentrated power.

xvi. Conclusion 

The future of technology may involve a shift towards a more decentralized model, with greater emphasis on data sovereignty and user control. While the full impact of the Great Digital Blackout is yet to be fully understood, one thing is certain – the event has irrevocably altered the landscape of cybersecurity, prompting a global conversation about how we navigate the digital age with greater awareness and resilience.

This incident serves as a stark reminder of the interconnected nature of our digital world. As technology continues to evolve, so too must our approaches to managing the risks it brings. The lessons learned from this outage will undoubtedly shape the future of IT infrastructure, making it more robust, secure, and capable of supporting the ever-growing demands of the digital age.

xvii. Further references 

Microsoft IT outages live: Dozens more flights cancelled … – The Independent (independent.co.uk)

Helping our customers through the CrowdStrike outage – Microsoft (news.microsoft.com)

CrowdStrike-Microsoft Outage: What Caused the IT Meltdown – The New York Times (nytimes.com)

Microsoft IT outage live: Millions of devices affected by … – The Independent (independent.co.uk)

What’s next for CrowdStrike, Microsoft after update causes … – USA Today (usatoday.com)

CrowdStrike and Microsoft: What we know about global IT … – BBC (bbc.com)

Chaos persists as IT outage could take time to fix … – BBC (bbc.com)

Huge Microsoft Outage Linked to CrowdStrike Takes Down … – WIRED (wired.com)

CrowdStrike’s Role In the Microsoft IT Outage, Explained – Time (time.com)

CrowdStrike admits ‘defect’ in software update caused IT … – Euronews (euronews.com)

Microsoft: CrowdStrike Update Caused Outage For 8.5 … – CRN (crn.com)

It could take up to two weeks to resolve ‘teething issues’ … – Australian Broadcasting Corporation (abc.net.au)

Microsoft-CrowdStrike Outage Causes Chaos for Flights … – CNET (cnet.com)

Building Resilience Against Cyber Threats 

Establishing Resilience Against Cyber Threats

Building resilience against cyber threats is not just an IT concern; it requires the organization as a whole to participate actively. 

Such resilience means implementing a comprehensive approach that combines IT solutions, policy, education, and awareness. 

i. Building resilience against these threats becomes crucial for individuals, organizations, and even entire nations. 

Here are some key steps to consider:

A. Understand the Landscape:

o Identify potential threats: Familiarize yourself with the different types of cyberattacks, vulnerabilities they exploit, and potential consequences. Analyze your specific environment and its unique risks.

o Know your assets: Inventory your devices, networks, data, and online presences. Classify them based on sensitivity and criticality to prioritize protection efforts.

B. Fortify Your Defenses:

o Implement strong security measures: Install antivirus software, firewalls, and intrusion detection/prevention systems. Maintain updated software and operating systems to patch vulnerabilities.

o Practice good password hygiene: Use strong, unique passwords for each account and enable multi-factor authentication wherever possible.

o Secure your network: Encrypt sensitive data, segment your network to limit potential damage, and use secure protocols for communication.

C. Prepare for the Inevitable:

o Develop an incident response plan: Define roles, responsibilities, and communication protocols in case of a cyberattack. Practice your plan regularly and test its effectiveness.

o Back up your data regularly: Maintain secure backups of your data offline and off-site to ensure rapid recovery in case of an attack or technical failure.

o Implement disaster recovery procedures: Have a plan for resuming operations quickly and minimizing disruptions in case of a cyberattack or other incident.

D. Build a Culture of Security:

o Train your people: Educate your employees, family members, and anyone involved in your systems about cyber threats and best practices for staying safe online.

o Foster a culture of awareness: Encourage open communication about security concerns and suspicious activity. Make reporting these issues easy and accessible.

o Embrace continuous improvement: Keep up-to-date on the latest threats and vulnerabilities, and continually update your security measures and practices.

E. Seek Outside Help:

o Partner with cybersecurity professionals: Utilize experts to audit your security posture, conduct penetration testing, and provide ongoing guidance.

o Stay informed: Follow reputable cybersecurity news sources and alerts to stay aware of emerging threats and vulnerabilities.

o Share information: Collaborate with other individuals and organizations to share best practices and intelligence about cyber threats.

ii. In a world where cyber threats are an ever-present risk, here are some essential steps that organizations can follow:

A. Risk Assessment: The first step towards building resilience is identifying potential vulnerabilities within the system. Regularly conducting risk assessments helps in highlighting areas of weakness and loopholes within the systems that may be exploited by hackers. 

B. Develop a Cybersecurity Framework: Leaning on frameworks such as those developed by the National Institute of Standards and Technology (NIST), an organization can develop its internal guidelines. The framework should involve identifying existing security measures, implementing protective safeguards, detecting anomalies, responding to incidents, and a plan for recovery post-incident.

C. Implement Robust Security Measures: Utilize the latest cybersecurity tools, such as state-of-the-art firewalls, intrusion detection systems, antivirus software, and encryption methods for data protection. Monitor all device connections and ensure IoT devices are secured. Regularly update and patch software and systems to reduce vulnerabilities.

D. Endpoint Security: Implement robust endpoint protection measures, including antivirus software, intrusion detection systems, and regular software updates to secure devices from potential threats.

E. Network Security: Establish a strong network security posture by using firewalls, intrusion prevention systems (IPS), and regularly monitoring network traffic for suspicious activities.

F. Employee Training and Awareness: Cybersecurity education and awareness should be a part of all employees’ training, as human error often leads to security breaches. Regular training sessions on identifying phishing attempts, proper password practices, and safe internet usage can significantly improve the organization’s cyber resilience.

G. Multi-Factor Authentication: Implement multi-factor authentication for all internal systems and processes, significantly reducing the chance of unauthorized access to sensitive information.

H. Data Encryption: Implement end-to-end encryption for sensitive data to protect it during transmission and storage. This ensures that even if unauthorized access occurs, the data remains unreadable.

I. Implement Strong Cyber Hygiene Practices:

   o Regularly update and patch systems and software to eliminate vulnerabilities.

   o Enforce strong password policies and use multi-factor authentication.

J. Secure Configuration:

   o Harden systems by configuring security settings appropriately.

   o Limit the number of privileged accounts and monitor their activity.

K. Incident Response Plan: Have a clear incident response plan in place. In the event of a breach, time is of the essence to minimize damage. A well-prepared plan would include roles and responsibilities, a communication plan, and recovery steps.

L. Data Backups and Recovery Plan: Regularly back up critical data in multiple locations, including offline storage. In the event of a breach or ransomware attack, backups will help the organization recover without paying a ransom or losing vital data.

M. Cyber Insurance: Consider adopting cyber insurance policies. While these don’t prevent attacks, they can certainly mitigate financial losses in case of a significant cybersecurity incident.

N. Vendor Security Assessment: Assess the security measures of third-party vendors and partners. Ensure they adhere to high cybersecurity standards, as weaknesses in their systems can impact your organization.

O. Continuous Monitoring: Implement continuous monitoring of your IT infrastructure and network. This involves real-time analysis of security events to detect and respond to threats promptly.

P. Governance and Compliance: Establish strong governance policies and ensure compliance with industry regulations and standards. This provides a structured framework for maintaining a secure environment.

Q. Business Continuity and Disaster Recovery:

    o Create and test a business continuity plan that includes strategies for dealing with cyber incidents.

    o Set up redundant systems and data backups to maintain operations during and after an attack.

R. Regular Audits and Tests: Regular cybersecurity audits and penetration tests help identify weaknesses in the existing systems and ensure the organization’s defenses can withstand attempted breaches.

S. Threat Intelligence: Stay informed about emerging cyber threats and vulnerabilities by leveraging threat intelligence sources. This knowledge helps in proactively adjusting security measures.

T. Collaboration and Information Sharing: Collaborate with industry peers and participate in information-sharing initiatives. Understanding the threat landscape and learning from others’ experiences can enhance your resilience.

U. Stay Updated: Cyber threats are constantly evolving. Keep abreast of the latest developments, threat vectors, and protective measures. 

iii. Conclusion

Building resilience against cyber threats is not a one-time effort, but rather an ongoing process. By embracing these steps and fostering a proactive approach, you can significantly reduce your risk, minimize potential damage, and create a more secure environment for yourself and those around you. 

In conclusion, building resilience against cyber threats requires a holistic approach including technology, people and processes working together to anticipate, prevent, detect and respond to cyber threats.

Additionally, adopting a framework like NIST Cybersecurity Framework can help in organizing and prioritizing the efforts to build resilience against cyber threats. It’s important to stay informed about emerging threats and continuously evolve your cybersecurity practices to address new challenges.

iv. Further references 

10 Tips for Creating a Cyber Resilience Strategy – CybeReady

Build a strong cyber-resilience strategy with existing tools – TechTarget (techtarget.com)

Building Resilience: Safeguarding Financial Institutions from Modern Cyber … – Ernst & Young (ey.com)

How can you develop resilience in the face of cyber threats? – LinkedIn (linkedin.com)

Understanding Cyber Resilience: Protecting Your Business Against Cyber Threats – Rainbow Secure, LinkedIn

Building a cyber resilient culture – how to embed a culture of cyber resilience in your … – The Business Continuity Institute (thebci.org)

How to Build True Cyber Resilience – InformationWeek (informationweek.com)

Cyber Resilience And Risk Management: Forces Against Cyber Threats – Forbes (forbes.com)

Strengthening Collaboration for Cyber Resilience: The Key to a Secure and … – ISACA (isaca.org)

From Awareness To Resilience: The Evolution Of People-Centric Cybersecurity – Forbes (forbes.com)

Geopolitical resilience: The new board imperative

Geopolitical Resilience: The New Board Imperative

In today’s increasingly complex and interconnected world, geopolitical risks are rising sharply. From trade wars and sanctions to cyberattacks and climate change, companies face a multitude of potential disruptions that can impact their operations, supply chains, and bottom line. 

This is where geopolitical resilience comes into play.

i. What is Geopolitical Resilience?

Geopolitical resilience refers to a company’s ability to anticipate, withstand, and adapt to unforeseen geopolitical events. It’s about proactively assessing and managing risks arising from the ever-evolving global landscape, minimizing their impact on the organization’s performance and long-term viability.

ii. Why is it a Board Imperative?

Traditionally, the management team has been responsible for navigating geopolitical risks. However, the increasing volatility and interconnectedness of the global environment make this an issue that demands board-level attention. Boards are ultimately responsible for the company’s long-term success and sustainability, and geopolitical risks can pose significant threats to these goals.

iii. How can Boards Build Geopolitical Resilience?

Here are some key ways boards can contribute to building and ensuring geopolitical resilience:

A. Sharpen their understanding of the geopolitical landscape: Boards should stay informed about major geopolitical trends, emerging risks, and potential flashpoints around the world. This requires regular briefings, scenario planning exercises, and engagement with external experts.

B. Monitor developments and exercise oversight: Boards need to actively monitor how major geopolitical events unfold and assess their potential impact on the company’s operations. This includes oversight of risk management plans, scenario-based responses, and contingency measures.

C. Champion a culture of risk awareness: Boards should set the tone for a strong risk management culture within the organization. This involves encouraging regular risk assessments, transparent communication about potential threats, and proactive implementation of mitigation strategies.

D. Hold management accountable: Boards must hold management accountable for developing and implementing effective geopolitical risk management strategies. This includes ensuring adequate resources are allocated, expertise is available, and contingency plans are regularly tested and updated.

iv. Boards should prioritize the following strategies:

A. Risk Assessment:

   o Regularly conduct comprehensive geopolitical risk assessments to identify potential threats to the business.

   o Assess the impact of geopolitical events on supply chains, markets, and regulatory environments.

B. Scenario Planning:

   o Develop scenario plans to anticipate and respond to different geopolitical situations.

   o Consider the potential effects on operations, finances, and stakeholder relationships.

C. Diversification and Redundancy:

   o Diversify supply chains and key partnerships to reduce vulnerability to geopolitical disruptions.

   o Establish redundancy in critical operations to ensure continuity during periods of geopolitical uncertainty.

D. Regulatory Compliance:

   o Stay informed about changing global regulations and compliance requirements.

   o Adjust business strategies to align with evolving geopolitical landscapes and regulatory frameworks.

E. Stakeholder Engagement:

   o Foster strong relationships with governments, local communities, and international partners.

   o Proactively engage with stakeholders to navigate geopolitical challenges collaboratively.

F. Cybersecurity Preparedness:

   o Enhance cybersecurity measures to protect against geopolitical threats, including cyber-attacks from state-sponsored actors.

   o Implement robust data protection and privacy measures to comply with varying international standards.

G. Talent Management:

   o Build a diverse and adaptable workforce capable of navigating geopolitical complexities.

   o Provide cross-cultural training to employees operating in regions prone to geopolitical tensions.

H. Financial Resilience:

   o Maintain financial flexibility to withstand economic and geopolitical shocks.

   o Consider currency risks and fluctuations in financial planning and decision-making.

I. Monitoring and Early Warning Systems:

   o Establish monitoring systems to track geopolitical developments and receive early warnings.

   o Utilize intelligence networks and data analytics for timely risk detection.

J. Adaptability and Agility:

    o Foster an organizational culture that values adaptability and agility.

    o Develop flexible business models capable of adjusting to geopolitical shifts quickly.

K. Communication Strategy:

    o Develop a robust communication strategy to address stakeholders during times of geopolitical uncertainty.

    o Ensure transparency and clarity in conveying the organization’s position and response plans.

L. Sustainability and ESG Focus:

    o Embrace sustainability practices and maintain a strong focus on Environmental, Social, and Governance (ESG) factors.

    o Demonstrate commitment to responsible business practices amid geopolitical challenges.

By integrating these strategies, boards can enhance geopolitical resilience, ensuring the organization is well-prepared to navigate the complexities of an ever-changing global landscape.

v. Resources and Tools:

Several resources and tools can help boards in their quest for geopolitical resilience:

o McKinsey’s “Geopolitical Resilience: The New Board Imperative” report: This report provides a comprehensive framework for boards to navigate geopolitical risks and build resilience.

o World Economic Forum’s Global Risks Report: This annual report offers insights into the top global risks, including geopolitical ones, and can help boards prioritize their focus.

o External geopolitical risk advisory firms: Several firms specialize in providing companies and boards with tailored geopolitical risk analysis and mitigation strategies.

vi. Conclusion:

Building geopolitical resilience is no longer a luxury but a necessity for companies operating in today’s turbulent world. 

By actively engaging with this issue, boards can play a crucial role in safeguarding their organization’s future and ensuring its long-term success in the face of an uncertain geopolitical landscape.

Building Fraud Resilience in the Digital Era

Building fraud resilience in the digital era requires organizations to adopt a comprehensive approach that encompasses prevention, detection, and response.

This involves implementing robust security measures, fostering a culture of fraud awareness, and leveraging technology to proactively identify and mitigate fraud risks.

Here are key strategies to enhance fraud resilience:

A. Risk Assessment and Understanding:

   – Conduct a comprehensive risk assessment to identify potential fraud risks specific to your organization. Understand the nature of digital fraud threats, including phishing, account takeovers, and social engineering.

B. Establish a Fraud Risk Management Program: Develop a centralized fraud risk management program that outlines the organization’s approach to identifying, assessing, and managing fraud risks. This program should include clear roles and responsibilities, risk assessment methodologies, and incident response procedures.

C. Identify and Assess Fraud Risks: Regularly conduct fraud risk assessments to identify and prioritize potential fraud threats. This involves analyzing historical data, industry trends, and emerging fraud tactics to understand the organization’s vulnerability to various fraud schemes.

D. Implement Robust Security Controls: Implement a layered security architecture that encompasses physical, network, and application security controls to protect sensitive data and prevent unauthorized access. This includes firewalls, intrusion detection systems, data encryption, and access controls.

E. Foster a Culture of Fraud Awareness: Raise awareness among employees about fraud risks and common fraud schemes. Provide training and education to help employees recognize and report suspicious activity, reducing the risk of insider fraud.

F. Establish a Collaborative Fraud Response Process: Develop a clear and well-defined incident response plan to effectively handle fraud incidents. This plan should include communication protocols, escalation procedures, and law enforcement coordination.

G. Stay Informed About Fraud Regulations: Keep abreast of emerging fraud regulations and industry standards to ensure compliance and implement appropriate measures to address evolving regulatory requirements.

H. Behavioral Biometrics: Explore the use of behavioral biometrics to authenticate users based on their unique patterns of behavior. This can include keystroke dynamics, mouse movements, and other behavioral markers, adding an extra layer of security.

I. Multi-Factor Authentication (MFA): Enforce multi-factor authentication to enhance user verification. Require users to provide multiple forms of identification, such as passwords, biometrics, or one-time passcodes, to access sensitive information or conduct transactions.

J. Utilize Fraud Detection Technologies: Leverage fraud detection technologies, such as machine learning and artificial intelligence, to analyze transaction data and identify anomalous patterns that may indicate fraudulent activity. These technologies can provide real-time insights and enable proactive fraud prevention.

K. Implement strong access controls: Precise control over who has access to certain data can play a crucial role in preventing fraudulent activities. The principle of least privilege (PoLP) should be applied, where users are given the minimum levels of access necessary to complete their tasks.

L. Real-Time Transaction Monitoring: Implement real-time monitoring of transactions to quickly detect and respond to potentially fraudulent activities. Automated systems can analyze transaction patterns and flag suspicious behavior for further investigation.

M. Collaboration and Information Sharing: Collaborate with industry peers, law enforcement agencies, and cybersecurity organizations to share information about emerging fraud trends and tactics. Collective intelligence enhances your ability to stay ahead of evolving threats.

N. Secure Development Practices: Integrate secure development practices into software and application development processes. This includes regular security assessments, code reviews, and adherence to best practices for mitigating vulnerabilities.

O. Identity Verification Solutions: Utilize robust identity verification solutions to ensure that individuals accessing your systems or services are who they claim to be. This can involve document verification, biometric authentication, or knowledge-based authentication.

P. Customer Authentication Controls: Provide customers with controls to manage and customize their authentication settings. This empowers users to set preferences for security features and receive alerts for suspicious activities.

Q. Customer Education and Awareness: Educate customers about common fraud tactics and best practices for protecting their accounts. Promote awareness of phishing emails, social engineering attempts, and the importance of secure password practices.

R. Create a culture of accountability and transparency: Foster a positive workplace culture where ethical behavior is valued and rewarded. Make it easy and safe for employees to report suspicious behavior.

S. Incident Response Plan: Have a robust incident response plan in place. In the event of fraud, the organization should be prepared to respond swiftly to minimize damage, gather evidence, and take necessary legal steps.

T. Customer Support and Communication: Maintain effective customer support channels to address inquiries and reports of potential fraud. Clear communication with customers during incidents helps build trust and confidence in your organization’s commitment to security.

U. Continuous Monitoring and Adaptation: Implement continuous monitoring of your fraud prevention measures. Regularly reassess and adapt strategies based on the evolving threat landscape and changes in the digital environment.

Building fraud resilience in the digital era is an ongoing process that requires a combination of technological measures, user education, and strategic planning. By adopting a proactive and adaptive approach, organizations can strengthen their defenses against digital fraud and protect both their assets and the trust of their customers.

https://www.mckinsey.com/capabilities/risk-and-resilience/our-insights/cybersecurity-in-a-digital-era

https://m.grazitti.com/blog/building-cyber-resilience-in-the-age-of-digital-transformation/

https://www2.deloitte.com/content/dam/Deloitte/my/Documents/risk/my-risk-sustainability-risk-fraud-resilient-organisations.pdf

https://www.pasai.org/blog/2020/5/29/cybersecurity-building-digital-resilience-in-a-virtual-world

https://www.ey.com/en_ao/how-digital-transformation-increases-consumer-and-retail-fraud-risks

How to audit Business Continuity?

Auditing business continuity involves assessing an organization’s plans and strategies to keep its operations functional in the event of a disaster or any significant disruption. 

Here are steps on how to audit business continuity:

A. Establish the Audit Scope: Determine what aspects of the organization’s continuity plan will be evaluated. This could include risk assessments, business impact analyses, recovery strategies and procedures, communication structures, or rehearsal and testing procedures.

B. Understand the Business Continuity Policy: Review the company’s policy on business continuity to understand what strategies and standards the organization has set. Understand the objectives of the business continuity plan.

C. Review the Business Continuity Plan (BCP): This plan should outline the organization’s strategy for maintaining operations during a disruption. The plan should have clear objectives, recovery strategies, and a comprehensive list of roles and responsibilities. Make sure it’s up to date and relevant to the organization’s needs.

D. Interview Key Personnel: Interview those involved in the creation and execution of the business continuity plan to understand their roles and responsibilities. This could include top management, department leaders, or designated crisis response team members.

E. Review Processes and Procedures: Examine the steps laid out for responding to a disruption. This can be anything from data backups, supply chain alternatives, customer communication procedures, to staff duties.

F. Check for Regulatory Compliance: Ensure that the business continuity plan adheres to all necessary laws and regulations specific to your industry.

G. Examine Risk Assessments: The organization should have conducted a risk assessment that identifies potential threats and vulnerabilities. Review this assessment to make sure all risks have been considered and that the BCP has strategies in place to mitigate those risks.

H. Business Impact Analysis (BIA): Evaluate the organization’s BIA, which should identify critical business functions and their dependencies. This analysis should also estimate the impact of these functions failing and the maximum acceptable outage time.

I. Check Training and Awareness Programs: Verify if the organization has training programs in place to educate employees about the BCP. Employees should be aware of their responsibilities during a disruption, and there should be regular drills to test the plan.

J. Evaluate Testing and Maintenance Procedures: Examine the process of testing the continuity plan and maintaining its relevance over time. This includes checking if regular tests are carried out, if there’s a procedure for updating the plan, and if lessons from any past disruptions were incorporated.

K. Evaluate Incident Management Plan: The plan should clearly outline the procedures to handle an incident, including communication strategies, escalation procedures, and recovery steps.

L. Test the Plan: The most effective way to evaluate a BCP is to conduct a mock disaster exercise. This will help identify any gaps or weaknesses in the plan. Make sure the organization conducts these exercises regularly and updates the BCP based on the results.

M. Investigate Resources and Tools: Take note of any resources or tools in place to support the continuity plan. This could include IT systems for data recovery, emergency supplies, or alternative work sites.

N. Assess Documentation: Check that all elements of the business continuity plan are properly documented and easily accessible by all relevant personnel.

O. Review Previous Audit Reports: If there have been previous audits of the BCP, review these reports for any unresolved issues that should be addressed.

P. Provide Recommendations: After identifying strengths and weaknesses of the plan, provide clear, actionable recommendations for improvement.

Q. Document and Report Findings: All findings from the audit should be documented and communicated back to the organization. This report should include any areas of non-compliance, risks identified, suggested improvements, and good practices observed.

Here are some additional considerations for auditing business continuity:

o Alignment with Business Objectives: Ensure the BCP aligns with the organization’s overall business objectives and risk tolerance levels.

o Regularity of Audits: Conduct regular audits to ensure the BCP remains current and effective in addressing evolving risks.

o Continuous Improvement: Encourage a culture of continuous improvement in business continuity planning and response capabilities.

o Management Commitment: Secure strong management commitment and support for business continuity initiatives.

o Training and Awareness: Provide regular training and awareness programs for employees on business continuity procedures and their roles in responding to disruptions.

The goal of auditing business continuity is not to point out failures or mistakes, but to enhance the organization’s resilience and ensure it can weather disruptions and recover effectively.