Risk Management and Business Impact Analysis (BIA)

Risk Management and Business Impact Analysis (BIA) are both crucial steps in forming an information security program. Which one comes first can be a point of debate.

Risk Management vs. Business Impact Analysis: Which Comes First in Information Security Programs?

In the ever-evolving landscape of information security, organizations face a multitude of threats and vulnerabilities. To effectively safeguard their assets, they must establish robust information security programs that encompass various strategies and methodologies. Among these, risk management and business impact analysis (BIA) are two critical components. 

However, the question of which should come first often sparks debate among professionals in the field.

Understanding Risk Management and BIA

Risk Management

Risk management involves identifying, assessing, and prioritizing risks to minimize the impact of unforeseen events on an organization. This process is fundamental for establishing a secure environment, as it allows organizations to anticipate potential threats, allocate resources efficiently, and implement controls to mitigate identified risks.

Key components of risk management include:

  • Risk Identification: Determining what risks could affect the organization.
  • Risk Assessment: Evaluating the likelihood and potential impact of identified risks.
  • Risk Mitigation: Developing strategies to reduce or eliminate risks.
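
As a minimal illustration of these steps, here is a sketch that scores a toy risk register with the classic likelihood-times-impact model; the risks and their 1-to-5 ratings are illustrative assumptions, not a prescribed methodology.

```python
# Minimal sketch of a qualitative risk register (all values are
# illustrative assumptions, not real organizational data).
risks = [
    {"name": "Phishing campaign",    "likelihood": 4, "impact": 3},
    {"name": "Data center outage",   "likelihood": 2, "impact": 5},
    {"name": "Unpatched web server", "likelihood": 3, "impact": 4},
]

for risk in risks:
    # Classic model: risk score = likelihood x impact, both on 1-5 scales.
    risk["score"] = risk["likelihood"] * risk["impact"]

# Mitigation planning starts with the highest-scoring risks.
for risk in sorted(risks, key=lambda r: r["score"], reverse=True):
    print(f"{risk['name']}: score {risk['score']}")
```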

Business Impact Analysis (BIA)

BIA, on the other hand, focuses on understanding the potential effects of business disruptions. It assesses how critical various business functions are to the organization and the consequences of those functions being interrupted. By identifying essential operations and their dependencies, BIA helps organizations prioritize recovery efforts and resources.

Key components of BIA include:

  • Identifying Critical Functions: Determining which business operations are vital for organizational continuity.
  • Assessing Impact: Evaluating the consequences of disruptions on these critical functions.
  • Developing Recovery Strategies: Creating plans to restore essential operations in the event of a disruption.
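
To make the BIA side concrete, here is a minimal sketch that ranks business functions for recovery priority, assuming each function has an estimated hourly downtime cost and a maximum tolerable downtime (MTD); the functions and figures are invented for illustration.

```python
# Minimal BIA sketch: rank business functions for recovery priority.
# Functions, hourly downtime costs, and MTD values are illustrative.
functions = [
    {"name": "Order processing", "cost_per_hour": 50_000, "mtd_hours": 4},
    {"name": "Payroll",          "cost_per_hour": 5_000,  "mtd_hours": 48},
    {"name": "Internal wiki",    "cost_per_hour": 200,    "mtd_hours": 168},
]

# Recover functions with the shortest maximum tolerable downtime (MTD)
# first; break ties by the cost a disruption accrues per hour.
for f in sorted(functions, key=lambda f: (f["mtd_hours"], -f["cost_per_hour"])):
    print(f"{f['name']}: MTD {f['mtd_hours']}h, ~${f['cost_per_hour']:,}/hour")
```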

The Debate: Which Comes First?

The debate over whether risk management or BIA should come first is rooted in the different objectives and methodologies of each process.

Advocates for Risk Management First

Proponents of starting with risk management argue that understanding risks is essential before delving into business impacts. They believe that risk management lays the groundwork for effective BIA by identifying the threats and vulnerabilities that could affect critical business functions. Without a clear understanding of risks, conducting a BIA might not yield relevant insights, as it may not account for all potential threats that could disrupt operations.

Advocates for BIA First

Conversely, those who advocate for conducting BIA first contend that understanding the potential impact of disruptions is crucial for prioritizing risks. They argue that without knowing which functions are most critical to the organization, risk assessments may lack focus and relevance. Starting with BIA allows organizations to align their risk management efforts with their most vital operations, ensuring that resources are directed toward protecting what matters most.

Here’s a perspective that draws from established standards and frameworks, such as the NIST Cybersecurity Framework and ISO 27001, to shed light on this topic.

Perspective on Order: Risk Management First

In my view, risk management should come before business impact analysis (BIA) for several reasons:

  • Foundation for Informed Decision-Making:
    • Risk management involves identifying, assessing, and prioritizing risks to organizational assets, including information systems. By understanding the various risks that could affect the organization, decision-makers can better determine which assets are most critical to protect.
    • A comprehensive risk assessment helps to pinpoint vulnerabilities, threats, and potential impacts, which sets the stage for a more effective BIA.
  • Contextualizing the BIA:
    • The BIA’s primary goal is to assess the potential impacts of disruptions on business operations. Having a thorough understanding of the risks enables the BIA to focus on the most relevant scenarios and prioritize business functions that could be most affected by these risks.
    • When BIA is performed with a risk-informed approach, it becomes more aligned with actual threats and vulnerabilities that the organization faces, leading to more realistic planning and resource allocation.
  • Alignment with Standards:
    • Frameworks like NIST SP 800-30 (Guide for Conducting Risk Assessments) and ISO 27005 (Information Security Risk Management) emphasize the need for risk assessments as a precursor to business continuity planning, which includes BIA.
    • The NIST Cybersecurity Framework encourages organizations to identify and assess risks before developing their response strategies, further highlighting the importance of risk management in establishing an effective security posture.

Several authoritative sources and frameworks support the perspective that risk management should come before business impact analysis (BIA) in forming an information security program:

  • NIST Cybersecurity Framework (NIST CSF):
    • This framework emphasizes the importance of identifying risks as the first step in developing a comprehensive cybersecurity program. It guides organizations to understand their risks to inform their overall strategy.
    • Source: NIST Cybersecurity Framework
  • NIST SP 800-30: Guide for Conducting Risk Assessments:
    • This publication outlines the process of risk assessment, which includes identifying and assessing risks to inform subsequent security decisions and planning, including BIA.
    • Source: NIST SP 800-30
  • ISO/IEC 27001: Information Security Management Systems (ISMS):
    • This international standard provides a systematic approach to managing sensitive company information, including risk assessment as a foundational component for establishing security controls and business continuity plans, which include BIA.
    • Source: ISO/IEC 27001
  • ISO 22301: Societal Security – Business Continuity Management Systems:
    • This standard provides requirements for a business continuity management system (BCMS) and emphasizes the need for risk assessments to inform the BIA process.
    • Source: ISO 22301
  • COBIT 2019: Framework for the Governance and Management of Enterprise Information and Technology:
    • COBIT emphasizes governance and management practices that include risk management as a critical process to ensure alignment with organizational objectives, which informs BIA and other security-related decisions.
    • Source: COBIT 2019
  • Business Continuity Institute (BCI) Good Practice Guidelines:
    • The BCI provides guidance on best practices for business continuity management, highlighting the need for risk assessment before conducting a BIA.
    • Source: BCI Good Practice Guidelines

Conclusion

Both risk management and business impact analysis are crucial steps in forming a robust information security program. The question of which should come first remains a point of debate, highlighting the complexity of managing risks and impacts within an organization. Engaging in this discussion can provide valuable insights that help organizations effectively integrate these processes to enhance their overall security posture.

Further References

  • Business Impact Analysis vs Risk Assessment: How do They … (Continuity2), https://continuity2.com
  • Understanding Risk Assessment Vs. Business Impact … (Planet Compliance), https://www.planetcompliance.com
  • Business Impact Assessment vs. Risk Assessment (Centraleyes), https://www.centraleyes.com
  • Risk Assessment or Business Impact Analysis (GMH Continuity Architects), http://www.gmhasia.com
  • Business Impact Analysis (BIA): A Practical Approach (IT Governance EU), https://www.itgovernance.eu
  • Using Business Impact Analysis to Inform Risk Prioritization … (NIST), https://nvlpubs.nist.gov

Key Risk and Control Indicators for Supplier, Supply Chain, and Third-Party Risks

To effectively monitor supplier risk, supply chain risk, or third-party risk, you can utilize Key Risk Indicators (KRIs) and Key Control Indicators (KCIs). These metrics will help you assess potential risks and the effectiveness of controls, while also providing insights into trends over time.

Key Risk Indicators (KRIs) for Supplier, Supply Chain, and Third-Party Risks

KRIs are metrics used to signal potential risks and vulnerabilities within the supply chain or third-party relationships. They are forward-looking and help you anticipate problems before they become critical. Examples of KRIs in this area include:

  1. Supplier Financial Stability:
  • KRI Example: Credit rating changes, late payment history, or declining revenue.
  • Scoring: Quantitative (e.g., credit score) or Qualitative (e.g., High, Medium, Low).
  • Trend Analysis: Monitor credit score trends over time, showing potential risks of insolvency.
  2. Delivery Performance:
  • KRI Example: On-time delivery rate, missed deadlines, or lead time variability.
  • Scoring: Quantitative (% on-time deliveries, days late).
  • Trend Analysis: Plot historical delivery data to identify degradation in supplier performance.
  3. Compliance and Regulatory Risk:
  • KRI Example: Frequency of compliance violations, audit findings, or changes in regulatory status.
  • Scoring: Quantitative (number of violations) or Qualitative (compliance level: Full, Partial, Non-compliant).
  • Trend Analysis: Track compliance issues over time to spot recurring or increasing risks.
  4. Supplier Concentration Risk (see the sketch after this list):
  • KRI Example: Percentage of critical goods supplied by a single supplier.
  • Scoring: Quantitative (% of total procurement from one supplier).
  • Trend Analysis: Monitor shifts in supplier concentration to identify dependency risks.
  5. Geopolitical Risk Exposure:
  • KRI Example: Number of suppliers in high-risk regions (e.g., war zones, politically unstable areas).
  • Scoring: Qualitative (High, Medium, Low based on geographic stability) or Quantitative (number of suppliers).
  • Trend Analysis: Track geopolitical risk exposure by region and its potential impact on the supply chain.
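
As a worked example of one KRI from the list above, the sketch below computes supplier concentration as each supplier's share of total spend; the spend figures and the 25%/50% thresholds are illustrative assumptions.

```python
# Sketch of the supplier concentration KRI: each supplier's share of total
# spend. Spend figures and the Low/Medium/High thresholds are assumptions.
spend_by_supplier = {
    "Acme Components": 1_200_000,
    "Globex":            300_000,
    "Initech":           100_000,
}
total_spend = sum(spend_by_supplier.values())

for supplier, spend in spend_by_supplier.items():
    share = spend / total_spend
    level = "High" if share > 0.50 else "Medium" if share > 0.25 else "Low"
    print(f"{supplier}: {share:.0%} of spend -> {level} concentration risk")
```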

Key Control Indicators (KCIs) for Supplier, Supply Chain, and Third-Party Risks

KCIs monitor the effectiveness of controls that are in place to mitigate risks. They help ensure that appropriate actions are being taken to manage identified risks.

  1. Supplier Audits and Assessments:
  • KCI Example: Number and frequency of completed supplier audits.
  • Scoring: Quantitative (number of audits) or Qualitative (audit compliance level: High, Medium, Low).
  • Trend Analysis: Track the completion of audits and any recurring issues or improvements.
  2. Contractual Compliance (see the sketch after this list):
  • KCI Example: Percentage of suppliers in compliance with contractual terms.
  • Scoring: Quantitative (% of compliant contracts).
  • Trend Analysis: Measure trends in contract compliance to assess control effectiveness over time.
  3. Supply Chain Visibility:
  • KCI Example: Availability of real-time data across the supply chain (e.g., inventory levels, shipping status).
  • Scoring: Quantitative (real-time data integration rate) or Qualitative (High, Medium, Low visibility).
  • Trend Analysis: Monitor the improvement or decline in supply chain transparency and control effectiveness.
  4. Incident Response Time:
  • KCI Example: Average time to resolve supply chain disruptions (e.g., cyberattacks, natural disasters).
  • Scoring: Quantitative (time to resolve incidents in hours or days).
  • Trend Analysis: Track the response time to incidents to measure the efficiency of controls.
  5. Training and Certification:
  • KCI Example: Percentage of supplier personnel trained in cybersecurity or compliance.
  • Scoring: Quantitative (% of trained personnel).
  • Trend Analysis: Track training completion rates over time to ensure ongoing compliance and control enhancement.
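
Here is a similar sketch for one KCI from the list above: quarter-over-quarter contractual-compliance rates with a simple trend direction; the data points are invented for illustration.

```python
# Sketch of the contractual-compliance KCI tracked over time.
# The quarterly percentages are invented for illustration.
compliance_by_quarter = {
    "2024-Q1": 81.0,
    "2024-Q2": 84.5,
    "2024-Q3": 88.0,
    "2024-Q4": 86.5,
}

quarters = sorted(compliance_by_quarter)
for prev, curr in zip(quarters, quarters[1:]):
    delta = compliance_by_quarter[curr] - compliance_by_quarter[prev]
    trend = "improving" if delta > 0 else "deteriorating" if delta < 0 else "flat"
    print(f"{curr}: {compliance_by_quarter[curr]:.1f}% compliant "
          f"({trend}, {delta:+.1f} pts)")
```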

Scoring KRIs and KCIs:

Both quantitative and qualitative scoring methods can be used to measure KRIs and KCIs, depending on the specific risk or control being monitored:

  • Quantitative Scoring: Often numeric, providing hard data (e.g., % on-time deliveries, number of incidents). This is useful for trend analysis and visualizing data over time.
  • Qualitative Scoring: Typically categorical (e.g., High, Medium, Low). Useful when the data is more subjective or when precise numbers are not available but can still indicate trends.

Trend Analysis for KRIs and KCIs:

  • Use dashboards and reporting tools (e.g., Power BI, Tableau) to visualize trends in both KRIs and KCIs.
  • Track the progression of indicators over time to detect improvements, deteriorations, or sudden changes.
  • Plot both KRIs and KCIs together to show how risk exposure relates to the effectiveness of controls.
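
As a rough illustration of the last point, this sketch plots one KRI against one KCI on a shared timeline with matplotlib, standing in for a dashboard tool; the series names and values are assumptions.

```python
# Sketch: plot a KRI against a KCI on one timeline (a stand-in for a
# Power BI/Tableau dashboard). All series values are assumptions.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
kri_late_deliveries = [12, 15, 11, 9, 8, 7]   # KRI: late deliveries/month
kci_audits_completed = [2, 3, 4, 4, 5, 5]     # KCI: supplier audits/month

fig, ax_kri = plt.subplots()
ax_kri.plot(months, kri_late_deliveries, marker="o", color="tab:red")
ax_kri.set_ylabel("KRI: late deliveries")

ax_kci = ax_kri.twinx()  # second y-axis so the two scales coexist
ax_kci.plot(months, kci_audits_completed, marker="s", color="tab:green")
ax_kci.set_ylabel("KCI: audits completed")

ax_kri.set_title("Risk exposure vs. control effectiveness")
plt.show()
```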

By using this combination of KRIs and KCIs, and applying either quantitative or qualitative scoring methods, you can develop a robust framework for assessing supplier and third-party risks while also tracking control effectiveness. This allows for ongoing monitoring and the ability to display trends over time, providing a clear picture of risk management performance.

Tools for KRIs and KCIs for supplier risk, supply chain, and third-party risk management:

The following platforms provide built-in scoring and trend-analysis capabilities for these indicators and typically require minimal tuning to deploy:

A. Prevalent Third-Party Risk Management Platform

  • Capabilities: Prevalent offers automated risk assessments, scoring, and monitoring of third-party risks, including Key Risk Indicators. It includes features like dynamic risk scoring, risk reports, and dashboards that allow you to track trends over time.
  • Out-of-the-box features: Prebuilt templates and vendor risk profiles for faster deployment. Minimal tuning is needed to customize risk indicators.
  • Scoring: Quantitative and qualitative scoring based on risk factors like financial health, regulatory compliance, and operational risk.
  • Trend analysis: Visualize risk trends across multiple suppliers over time through dashboards.
  • Website: Prevalent

B. RiskWatch

  • Capabilities: RiskWatch provides automated risk assessments, scoring, and monitoring for supply chain and third-party risks. It includes out-of-the-box templates for creating KRIs and KCIs across different domains like cybersecurity, operational risk, and vendor compliance.
  • Out-of-the-box features: Prebuilt assessments for vendor risk management and supply chain risk with customizable KPIs.
  • Scoring: Quantitative scoring (e.g., risk scores) and qualitative ratings (e.g., high, medium, low) to reflect overall risk.
  • Trend analysis: Automatically generate risk trend reports to view changes over time.
  • Website: RiskWatch

C. MetricStream Third-Party Risk Management

  • Capabilities: MetricStream offers a comprehensive platform for managing third-party and supply chain risk. It supports KRIs and KCIs through built-in risk scoring, assessments, and monitoring.
  • Out-of-the-box features: Pre-configured templates and risk indicators for third-party risk assessments with options to customize.
  • Scoring: Quantitative and qualitative scoring based on factors like vendor performance, compliance, and cybersecurity.
  • Trend analysis: Visual dashboards for monitoring key risk/control trends and risk assessments over time.
  • Website: MetricStream

D. Aravo Third-Party Risk Management

  • Capabilities: Aravo provides a platform for third-party risk management with built-in KRIs and KCIs to assess and monitor suppliers and vendors. The platform comes with automated workflows for risk assessments and tracking.
  • Out-of-the-box features: Pre-configured templates for third-party risk with minimal need for customization.
  • Scoring: Both quantitative and qualitative risk scoring, based on compliance and operational risks.
  • Trend analysis: Real-time dashboards and reports that track and display risk trends and the effectiveness of control measures.
  • Website: Aravo

E. RSA Archer Third-Party Governance

  • Capabilities: RSA Archer offers robust tools for managing third-party and supply chain risks, with pre-configured KRIs, KCIs, and workflows. It also has reporting capabilities that visualize risks and control trends over time.
  • Out-of-the-box features: Configurable dashboards, risk templates, and automated workflows for supplier risk management.
  • Scoring: Quantitative risk scoring through risk indicators, combined with qualitative evaluations of vendor performance and compliance.
  • Trend analysis: Visual dashboards for monitoring trends, and built-in analytics for historical comparisons of risk and control performance.
  • Website: RSA Archer

F. LogicManager

  • Capabilities: LogicManager includes features for managing third-party risk, with configurable risk indicators and automated reporting for tracking trends in both risk exposure and control effectiveness.
  • Out-of-the-box features: Prebuilt templates and automated workflows to quickly deploy risk assessments and KRIs.
  • Scoring: Risk scores based on vendor financial health, regulatory compliance, and operational risk. Both qualitative and quantitative metrics can be incorporated.
  • Trend analysis: Trend analysis via automated reporting and dashboards that track risk factors over time.
  • Website: LogicManager

G. Coupa Risk Assess

  • Capabilities: Coupa’s risk management module provides an integrated platform for assessing supply chain and third-party risks. It uses KRIs and KCIs to monitor supplier performance and compliance with minimal setup.
  • Out-of-the-box features: Preconfigured risk indicators and reporting templates for quick implementation.
  • Scoring: Quantitative scoring using real-time risk data from multiple sources, as well as qualitative assessments.
  • Trend analysis: Built-in analytics and visualizations that show trends in supplier risk over time.
  • Website: Coupa

H. ProcessUnity Vendor Risk Management

  • Capabilities: ProcessUnity provides a vendor risk management solution that tracks key risks and controls for third-party suppliers. It includes KRIs and KCIs, real-time reporting, and analytics to display trends in vendor risk.
  • Out-of-the-box features: Ready-to-use templates for vendor risk scoring and assessments.
  • Scoring: Quantitative and qualitative risk assessments based on supplier financial health, regulatory compliance, and cybersecurity posture.
  • Trend analysis: Customizable dashboards to visualize risk and control trends over time.
  • Website: ProcessUnity

Conclusion:

These tools are designed to automate the tracking of Key Risk Indicators (KRIs) and Key Control Indicators (KCIs), providing both quantitative and qualitative scoring, as well as trend analysis. Many of them come with out-of-the-box templates and minimal tuning requirements, making them suitable for quickly implementing a risk management framework for supplier and third-party risk.

Other References:

https://www.prevalent.net/blog/use-nist-sp-800-53-for-third-party-supply-chain-risk-management

https://www.upguard.com/blog/kpis-to-measure-tprm

https://www.venminder.com/blog/examples-key-risk-indicators-third-party-management

Risk-Based Assessment of Privileged Access Rights: Distinguishing Permissions by Type and Impact

Evaluating Privileged Access Rights: A Risk-Based Approach to Categorizing Permissions by Type and Impact

In today’s complex security landscape, effectively managing privileged access rights is essential to protecting an organization’s sensitive data and infrastructure. A risk-based assessment approach helps organizations identify and prioritize risks linked to various types of access permissions.

By categorizing permissions based on their type and potential impact, security teams can better allocate resources and implement controls to mitigate high-risk access. This approach not only strengthens security but also ensures that privileged access is granted and monitored according to its actual risk, reducing the chances of unauthorized use or exploitation.

A key element of a comprehensive risk-based assessment model is distinguishing between different types of privileged access rights. Each type of permission carries its own level of risk, and not all privileged access is equally risky.

Let’s break down how you might distinguish between privileged access rights based on specific types of permissions:

Types of Permissions and Privileged Access:

  • Administrative Control Rights:
    • System Administrator Access: This is typically the highest level of privilege, where a user has full control over the system, including the ability to modify configurations, manage users, install software, and make system-wide changes. This type of access poses the greatest risk and must be subject to strict control and monitoring.
    • Network Administrator Access: Similar to system admin access, network administrators can configure and control network devices (routers, switches, firewalls). This access is critical for maintaining security and operational integrity and is considered high-risk due to the potential to disrupt network operations.

  • Data Access Permissions:
    • Read-Only Privilege: Access to view sensitive data without the ability to modify or delete it is still considered privileged but poses a lower risk compared to write or execute privileges. This access is common in scenarios where users need to analyze or audit information but don’t require editing capabilities.
    • Read/Write/Modify Privilege: Access to alter or modify sensitive data (e.g., financial records, HR data, customer information) significantly increases the risk of data integrity and privacy violations. These permissions require additional oversight to prevent misuse or unauthorized changes.
    • Delete/Destroy Data: Permissions that allow users to delete critical data pose the highest risk, as they could lead to irrecoverable loss. This should be categorized as a highly privileged access right.

  • Security and Audit Privileges:
    • Audit Log Access: Access to view and manage security logs can be classified as privileged since it may allow users to conceal unauthorized activities by deleting or altering audit trails. This requires close monitoring, as tampering with logs can hinder security investigations.
    • Security Policy Management: Users who can configure or alter security settings (e.g., firewall rules, encryption keys, access control policies) hold highly privileged roles. Their actions can directly affect the organization’s security posture.

  • Escalation and Override Rights:
    • Privilege Escalation: Some accounts have the ability to grant themselves or others additional permissions (e.g., temporarily elevating their own access to an administrative level). This ability to escalate privileges poses a significant risk and should be strictly controlled.
    • Override/Bypass Security Controls: Access to disable or bypass critical security mechanisms (e.g., antivirus, DLP, encryption) should be considered highly privileged as it exposes systems to potential compromise.

Risk-Based Distinction by Type of Privilege:

When designing the risk-based assessment, the model should assign different risk weights to these types of permissions:

  • Administrative controls would carry the highest risk, due to the potential for widespread system impact.
  • Data modification permissions would carry moderate to high risk, depending on the sensitivity of the data.
  • Read-only permissions would be assessed as lower risk, as they do not allow users to alter or manipulate data but could still lead to data leakage if exposed.
  • Security management and privilege escalation should be assessed as high-risk, due to the potential to undermine security mechanisms.

Scoring Privileged Access Based on Permission Type:

Each type of permission should be integrated into your risk-scoring model as part of the overall assessment:

  • Control Privileges: High-risk score (e.g., 5/5)
  • Modification Privileges: Moderate to high-risk score (e.g., 3-4/5)
  • Read-Only Privileges: Low to moderate risk score (e.g., 2/5)
  • Escalation/Override Rights: High-risk score (e.g., 5/5)

The assessment model should consider not just the role or account type, but also the nature of the permission granted to the user. By evaluating these different permission levels, you can more effectively determine which access rights are truly privileged and require heightened security measures and scrutiny.
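
One way to encode this in practice is a simple lookup from permission type to baseline risk score, mirroring the scores suggested above; the permission-type labels, the default score, and the privileged threshold are illustrative assumptions.

```python
# Sketch: map permission types to baseline risk scores (1-5), mirroring
# the scoring suggested above. Labels and the default are assumptions.
PERMISSION_RISK = {
    "admin_control":       5,  # system/network administrative control
    "escalation_override": 5,  # privilege escalation or control bypass
    "security_management": 5,  # firewall rules, keys, access policies
    "data_delete":         5,  # delete/destroy critical data
    "data_modify":         4,  # read/write/modify sensitive data
    "data_read_only":      2,  # view-only access to sensitive data
}

def classify(permission_type: str) -> str:
    """Treat scores of 4 or 5 as privileged access (threshold assumed)."""
    score = PERMISSION_RISK.get(permission_type, 3)  # unknown -> moderate
    return "privileged" if score >= 4 else "standard"

print(classify("data_read_only"))  # standard
print(classify("admin_control"))   # privileged
```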

Conclusion:

In conclusion, managing privileged access rights is a critical component of safeguarding an organization’s sensitive data and infrastructure in today’s complex security environment. Adopting a risk-based assessment approach enables organizations to identify and address risks associated with different access permissions more effectively.

By classifying permissions based on their potential impact, security teams can prioritize high-risk areas, implement targeted controls, and ensure that access is monitored according to its true risk level. This strategy not only fortifies the organization’s security posture but also minimizes the potential for unauthorized access or misuse of critical systems.

https://www.oneidentity.com/community/blogs/b/privileged-access-management/posts/how-to-conduct-a-privileged-access-management-risk-assessment

Design a Risk-Based Method

How To

Designing a risk-based method to assess whether an access right is considered privileged requires a structured approach that evaluates the access’s potential impact, sensitivity, and criticality. The method should focus on identifying high-risk access points that could significantly affect the organization if misused. 

Here’s a step-by-step guide:

A. Define Privileged Access Criteria

First, define what constitutes “privileged access” within the organization. Typically, privileged access includes:

  • Access that grants administrative rights, like system or database administrator roles.
  • Access to modify security settings or configurations.
  • Access to critical or sensitive systems (e.g., financial systems, customer databases).
  • Access to override, bypass, or disable security mechanisms.

B. Categorize Access Levels

Classify access rights into categories based on potential risk:

  • Standard Access: Rights that allow basic, day-to-day operations without security or administrative privileges.
  • Elevated Access: Rights that grant users access to additional resources or functions but are not critical or highly sensitive.
  • Privileged Access: Rights that involve significant control over systems, networks, or sensitive data, which could affect organizational security if misused.

C. Risk Factors for Privileged Access

To assess whether an access right should be considered privileged, consider the following risk factors:

  • Scope of Control: Does the access allow the user to change system configurations or security settings? Broad access to system resources indicates higher risk.
  • Impact of Misuse: What would be the consequence of misuse? High-risk access can cause significant financial, reputational, or operational damage.
  • Data Sensitivity: Does the access provide visibility or control over sensitive data (e.g., personal information, financial data, intellectual property)?
  • User Autonomy: Is the user able to bypass security controls or escalate privileges? If so, it is likely privileged access.

D. Create a Risk-Based Scoring Model

Develop a scoring model that assigns a risk score based on the factors above. This model can use a numeric scale (e.g., 1-5) or categories like “Low,” “Medium,” and “High.” Each access type would be evaluated based on:

  • Criticality of the system (e.g., critical business functions vs. non-essential services).
  • Sensitivity of the data (e.g., personally identifiable information (PII) vs. non-sensitive data).
  • Impact of abuse or compromise (e.g., financial loss, regulatory non-compliance).

For example:

  • Low-risk access: Viewing non-sensitive data with no ability to modify.
  • Medium-risk access: Access to modify specific data but without broad control over systems.
  • High-risk access (Privileged): Full control over systems or access to sensitive data with the ability to modify or delete critical assets.

E. Automate and Review Regularly

Automate this risk-based model where possible using identity and access management (IAM) tools to continuously evaluate and reclassify access based on the risk level. The system should flag accounts with high-risk privileges for additional monitoring or review.

F. Implement Controls for Privileged Access

For access deemed privileged:

  • Apply Enhanced Controls: Use multi-factor authentication (MFA), session monitoring, and audit logs to track activities performed by privileged users.
  • Conduct Periodic Reviews: Regularly review privileged access rights to ensure they are still necessary and aligned with job roles.
  • Principle of Least Privilege: Always assign the least amount of access necessary to perform the role.

G. Incorporate Organizational Input

Collaborate with system owners, security teams, and risk management personnel to understand the specific context of access rights within your organization. This will help in fine-tuning the criteria and scoring model based on the business impact.

Example Model:

| Factor | Score (1-5) | Weight | Description |
| --- | --- | --- | --- |
| Scope of Control | 1-5 | 30% | Admin privileges, system settings access |
| Data Sensitivity | 1-5 | 30% | Access to PII, financial data, critical IP |
| Impact of Misuse | 1-5 | 25% | Potential damage caused by abuse of the access |
| User Autonomy | 1-5 | 15% | Ability to bypass security or escalate privileges |
| Total Score | Weighted score | 100% | Sum of weighted scores |

Access with a higher total score would be classified as privileged and subject to additional controls.
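
A minimal sketch of the weighted model above might look like the following; the factor weights come from the table, while the sample account's scores and the 3.5 classification threshold are illustrative assumptions.

```python
# Sketch of the weighted model in the table above. The factor weights come
# from the table; the sample scores and 3.5 threshold are assumptions.
WEIGHTS = {
    "scope_of_control": 0.30,
    "data_sensitivity": 0.30,
    "impact_of_misuse": 0.25,
    "user_autonomy":    0.15,
}

def weighted_score(scores: dict) -> float:
    """Sum of factor score (1-5) times factor weight."""
    return sum(scores[factor] * weight for factor, weight in WEIGHTS.items())

dba_access = {"scope_of_control": 4, "data_sensitivity": 5,
              "impact_of_misuse": 5, "user_autonomy": 3}

score = weighted_score(dba_access)
print(f"Weighted score: {score:.2f} ->",
      "privileged" if score >= 3.5 else "standard")
```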

Conclusion

This risk-based approach to determining privileged access ensures that access rights are evaluated not just based on the role or function but also on the potential risk and impact they pose. Regular reviews and automation further strengthen the assessment, keeping access rights in line with the organization’s evolving security posture.

Cybersecurity Checklist for 2024, Transition to ISO 27001:2022

2024 Cybersecurity Guide: Adapting to ISO 27001:2022

In the ever-evolving world of cybersecurity, staying ahead of emerging threats and ensuring compliance with international standards is paramount. With the release of ISO 27001:2022, organizations are now tasked with transitioning to the updated standard to maintain their Information Security Management Systems (ISMS). This transition is not just about updating policies and procedures; it involves a thorough review and alignment of security practices with the new requirements. Below is a comprehensive cybersecurity checklist to guide your organization through the transition to ISO 27001:2022, ensuring you remain compliant and resilient in 2024.

A. Understand the Key Changes in ISO 27001:2022

  • Action: Familiarize yourself with the updates in ISO 27001:2022, particularly the changes in Annex A controls, which now align with ISO 27002:2022.
  • Key Changes Include:
    • Reduction of control categories from 14 to 4: Organizational, People, Physical, and Technological controls.
    • Introduction of new controls, such as threat intelligence, information security for cloud services, and data masking.
    • Enhanced focus on risk management and more granular requirements for control objectives.

B. Update Your Risk Assessment Process

  • Action: Revisit your risk assessment process to ensure it aligns with the updated standard’s focus on risk management.
  • Steps to Take:
    • Identify new threats and vulnerabilities introduced by changes in technology, regulations, and business operations.
    • Ensure that risk assessments are performed regularly and that results are documented and communicated to relevant stakeholders.
    • Update your risk treatment plan to address newly identified risks and ensure that controls are implemented accordingly.

C. Review and Update Information Security Policies

  • Action: Conduct a thorough review of all information security policies to ensure they reflect the new requirements of ISO 27001:2022.
  • Focus Areas:
    • Incorporate the new controls introduced in ISO 27001:2022 into your policies.
    • Ensure that policies address the use of cloud services, remote work, and mobile devices, which have become increasingly prevalent.
    • Align policies with the organization’s risk appetite and ensure they are communicated effectively across the organization.

D. Enhance Security Awareness and Training Programs

  • Action: Update your security awareness and training programs to reflect the new standard’s emphasis on people controls.
  • Training Should Cover:
    • The importance of information security and each employee’s role in maintaining it.
    • New and emerging threats, including phishing, social engineering, and ransomware.
    • Best practices for secure communication, data handling, and remote work.

E. Strengthen Technical Controls and Cybersecurity Measures

  • Action: Assess and enhance your technical controls to ensure they meet the requirements of ISO 27001:2022.
  • Key Technical Controls:
    • Threat Intelligence: Implement systems to gather, analyze, and respond to threat intelligence, enabling proactive defense against cyber threats.
    • Data Masking and Encryption: Ensure that sensitive data is masked and encrypted, both in transit and at rest, to protect against unauthorized access.
    • Cloud Security: Review and strengthen the security measures for cloud services, ensuring compliance with the new standard’s requirements.

F. Conduct a Gap Analysis and Internal Audit

  • Action: Perform a gap analysis to identify areas where your current ISMS falls short of the ISO 27001:2022 requirements.
  • Steps to Follow:
    • Compare your existing controls and processes against the new standard.
    • Document any gaps and create an action plan to address them.
    • Conduct an internal audit to verify that the updated ISMS meets the new standard and is ready for external certification.
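
A lightweight way to track the output of such a gap analysis is a simple register keyed by control, as sketched below; the three Annex A controls shown are real 2022 additions, but the statuses, owners, and structure are illustrative assumptions.

```python
# Sketch of a gap-analysis register for the steps above. The three controls
# are among the new ISO/IEC 27001:2022 Annex A controls; statuses and
# owners are placeholder assumptions.
current_state = {
    "A.5.7 Threat intelligence":                      "missing",
    "A.5.23 Information security for cloud services": "partial",
    "A.8.11 Data masking":                            "implemented",
}

gaps = [
    {"control": control, "status": status, "owner": "TBD", "due": "TBD"}
    for control, status in current_state.items()
    if status != "implemented"
]

for gap in gaps:
    print(f"GAP: {gap['control']} ({gap['status']}) - owner: {gap['owner']}")
```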

G. Update Incident Response and Business Continuity Plans

  • Action: Review and update your incident response and business continuity plans to ensure they align with the new requirements.
  • Key Considerations:
    • Ensure that the plans address new and emerging threats, including advanced persistent threats (APTs) and supply chain attacks.
    • Test the effectiveness of your incident response plan through regular drills and simulations.
    • Update recovery time objectives (RTOs) and recovery point objectives (RPOs) to reflect the organization’s current risk environment.

H. Engage Leadership and Stakeholders

  • Action: Ensure that leadership is actively involved in the transition process and understands the implications of the new standard.
  • Steps to Take:
    • Present the benefits and challenges of transitioning to ISO 27001:2022 to senior management.
    • Secure necessary resources and support for the transition, including budget allocation and personnel.
    • Regularly update stakeholders on the progress of the transition and address any concerns.

I. Prepare for External Certification

  • Action: Engage with a certified external auditor to schedule your ISO 27001:2022 certification audit.
  • Preparation Tips:
    • Ensure that all documentation is up-to-date and reflects the new standard’s requirements.
    • Conduct a pre-audit review to identify any remaining issues or areas for improvement.
    • Ensure that all employees are prepared for the audit and understand their roles in maintaining compliance.

J. Monitor, Review, and Improve

  • Action: Establish a continuous monitoring and improvement process to maintain compliance with ISO 27001:2022.
  • Key Activities:
    • Regularly review the effectiveness of your controls and update them as needed.
    • Stay informed about new threats, vulnerabilities, and best practices in cybersecurity.
    • Foster a culture of continuous improvement, ensuring that the organization remains resilient in the face of evolving risks.

Conclusion

Transitioning to ISO 27001:2022 is a critical step in ensuring that your organization’s cybersecurity posture remains strong and compliant with international standards. By following this comprehensive checklist, you can navigate the complexities of the transition process, address emerging threats, and maintain a robust Information Security Management System that meets the demands of 2024 and beyond. Stay proactive, engage leadership, and commit to continuous improvement to achieve lasting success in your cybersecurity efforts.

Other References

  • ISO 27001:2022 Transition Guidance For Clients (NQA), https://www.nqa.com
  • A Guide to Transitioning to ISO 27001:2022 (IT Governance USA), https://www.itgovernanceusa.com
  • ISO 27001:2022 – How to update to the ISO27001 latest … (DataGuard), https://www.dataguard.co.uk
  • ISO/IEC 27001 Transition: What You Should Know (SGS SA), https://www.sgs.com
  • Transition to ISO IEC 27001:2022 (DNV), https://www.dnv.com
  • Ultimate Guide to ISO 27001 Compliance [Updated 2024] (Sprinto), https://sprinto.com
  • ISO 27001:2022 Transition Guide (Johanson Group, LLP), https://www.johansonllp.com

Artificial Intelligence and Information Systems Auditors: Developing New Skills and Competencies for the Future

The Evolving Role of IS Auditors in the Age of AI: Emerging Skills and Competencies

As the AI revolution continues to reshape industries, the role of Information Systems (IS) Auditors is undergoing significant transformation.  New competencies are emerging as essential, driven by the increasing integration of AI technologies into business processes. The demands of the IS auditing profession are shifting, requiring auditors to develop expertise in several critical areas.

Key Competencies at the Forefront of the Shift

A. Advanced Data Analytics with AI Techniques

  • IS Auditors must now be proficient in advanced data analytics, focusing on AI-specific techniques and big data handling. This expertise is crucial for assessing AI-driven systems, ensuring that data integrity, accuracy, and reliability are maintained.

B. AI Governance and Risk Management Frameworks

  • Understanding and applying AI governance frameworks are becoming central to the auditor’s role. IS Auditors must be capable of evaluating AI governance structures, ensuring that AI implementations adhere to risk management protocols and align with business objectives.

C. Explainable AI and Algorithmic Auditing

  • As AI systems become more complex, the need for explainability grows. IS Auditors must develop the ability to audit AI algorithms, ensuring that they are transparent, fair, and accountable. This competency is vital for maintaining trust in AI systems and for complying with regulatory requirements.

D. Evolving Regulatory Landscape

  • The regulatory environment around AI is rapidly evolving, with new laws and frameworks like the EU AI Act and the NIST AI Risk Management Framework. IS Auditors must stay informed about these developments and understand how to integrate AI-specific regulations with existing standards.

E. Ethical AI

  • Ensuring that AI systems are developed and deployed ethically is becoming a core responsibility of IS Auditors. This involves assessing AI for potential biases, fairness, and the overall impact on society.

Additional Competencies for the AI-Driven Era

As IS Auditors adapt to this new landscape, several additional competencies will be essential:

A. AI Lifecycle Management

  • Auditors need to understand the complete AI lifecycle, from data collection and model training to deployment and ongoing monitoring. This knowledge is crucial for assessing risks at every stage of AI development.

B. AI Security and Cyber Threats

  • With AI systems becoming integral to business operations, IS Auditors must be knowledgeable about AI-specific cybersecurity threats, such as adversarial attacks and AI algorithm manipulation.

C. Continuous Learning Systems Auditing

  • Traditional auditing frameworks may not fully apply to AI systems that continuously learn and adapt. IS Auditors must develop expertise in auditing these dynamic systems to ensure ongoing compliance and risk management.

D. Human-AI Collaboration Auditing

  • Understanding how AI and human decision-makers interact is crucial. Auditors must evaluate the effectiveness of AI-human collaboration, ensuring that AI supports rather than undermines human judgment.

E. Data Privacy and AI

  • As AI systems often require vast amounts of data, IS Auditors need in-depth knowledge of data privacy regulations as they apply to AI, ensuring compliance while balancing the need for high-quality data.

F. AI Ethics and Bias Detection

  • Proficiency in identifying and mitigating biases within AI systems is essential. IS Auditors must ensure that AI deployments align with ethical standards, promoting fairness and equity.

G. Cross-Disciplinary Knowledge

  • The complexity of AI requires auditors to draw on knowledge from disciplines beyond traditional IT, including law, ethics, and behavioral sciences, to fully understand AI’s implications.

H. Stakeholder Communication and AI Literacy

  • IS Auditors must effectively communicate complex AI concepts to non-technical stakeholders, ensuring transparency and understanding across the organization.

I. AI Tool Proficiency

  • Familiarity with AI tools and platforms used for data analysis, model development, and AI auditing is essential. Practical experience with these tools enables auditors to provide accurate and actionable insights.

J. Scenario Planning and AI Impact Assessment

  • Skills in scenario planning and assessing AI’s broader impacts on business processes, compliance, and risk are crucial for providing comprehensive oversight.

Conclusion

The role of IS Auditors is rapidly evolving in response to the growing influence of AI technologies. The competencies highlighted here, along with the additional skills outlined, will form the foundation of IS auditing in the AI-driven era. Engaging in ongoing discussions and staying informed about these emerging requirements will ensure that the profession continues to adapt and thrive in this new landscape.

https://www.researchgate.net/publication/375920565_ANALYZING_THE_ROLE_OF_ARTIFICIAL_INTELLIGENCE_IN_IT_AUDIT_CURRENT_PRACTICES_AND_FUTURE_PROSPECTS

How to identify risk areas in GDPR compliance

Identifying risk areas in GDPR compliance involves a systematic approach to understanding where personal data may be vulnerable and where an organization might not fully meet the requirements set out by the regulation. Here’s a step-by-step thought process to help identify these risk areas:

1. Understand the Scope of GDPR:

  • Identify Personal Data: Determine what constitutes personal data within your organization. This includes any information that can directly or indirectly identify an individual (e.g., names, email addresses, IP addresses, etc.).
  • Mapping Data Flows: Understand how personal data flows through your organization. Identify where data is collected, processed, stored, and transferred, both within and outside the organization.

2. Conduct a Data Inventory:

  • Data Collection Points: Identify all points where personal data is collected, whether online (e.g., websites, apps) or offline (e.g., paper forms).
  • Data Processing Activities: Document the various processes where personal data is used (e.g., customer relationship management, HR processes, marketing activities).
  • Third-Party Relationships: Identify third parties (e.g., vendors, service providers) that have access to or process personal data on your behalf.
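
As a sketch of what a data-inventory entry might capture for the three points above, here is a minimal record structure; the field names and sample values are illustrative assumptions, not a complete record of processing activities.

```python
# Sketch of a minimal data-inventory entry covering the three points
# above. Field names and sample values are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class ProcessingActivity:
    name: str
    collection_points: list       # where the data is gathered
    data_categories: list         # what personal data is involved
    third_parties: list = field(default_factory=list)  # who else processes it

crm = ProcessingActivity(
    name="Customer relationship management",
    collection_points=["website signup form", "support phone line"],
    data_categories=["name", "email address", "IP address"],
    third_parties=["cloud CRM vendor"],
)
print(crm)
```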

3. Assess Legal Basis for Data Processing:

  • Review Consent Mechanisms: Ensure that consent is obtained in a GDPR-compliant manner, meaning it is freely given, specific, informed, and unambiguous.
  • Alternative Legal Bases: For data processing activities not based on consent, ensure there is a valid legal basis (e.g., contract necessity, legitimate interest, legal obligation).

4. Evaluate Data Subject Rights:

  • Access to Data: Check if you have mechanisms in place for data subjects to access their personal data.
  • Rectification and Erasure: Ensure processes exist for correcting inaccurate data and fulfilling requests for data deletion (“right to be forgotten”).
  • Portability and Restriction: Evaluate your ability to provide data portability and to restrict processing when requested by the data subject.

5. Review Data Security Measures:

  • Technical Safeguards: Assess whether your organization has adequate technical measures (e.g., encryption, access controls) to protect personal data.
  • Organizational Measures: Ensure that policies, procedures, and training are in place to mitigate the risk of data breaches.
  • Incident Response: Review your procedures for detecting, reporting, and responding to data breaches, ensuring they align with GDPR requirements (e.g., 72-hour notification window).
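
To illustrate the notification window mentioned above, this small sketch computes the 72-hour deadline from the moment a breach is detected; both timestamps are invented for illustration.

```python
# Sketch of the 72-hour window check: GDPR Art. 33 expects notification to
# the supervisory authority within 72 hours of becoming aware of a breach.
# Both timestamps below are invented for illustration.
from datetime import datetime, timedelta

detected_at = datetime(2024, 9, 6, 14, 30)
notify_by = detected_at + timedelta(hours=72)

now = datetime(2024, 9, 7, 9, 0)
hours_left = (notify_by - now).total_seconds() / 3600
print(f"Notify supervisory authority by {notify_by} ({hours_left:.1f}h left)")
```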

6. Evaluate Data Transfer Practices:

  • International Data Transfers: Identify any transfers of personal data outside the EU/EEA. Ensure that appropriate safeguards are in place (e.g., Standard Contractual Clauses, Binding Corporate Rules).
  • Data Localization Laws: Be aware of any local laws that may impact data transfers and ensure compliance with those as well.

7. Assess Data Retention and Minimization:

  • Retention Policies: Review your data retention policies to ensure that personal data is kept no longer than necessary for the purposes for which it was collected.
  • Data Minimization: Evaluate whether you are collecting and processing only the minimum amount of personal data necessary for your purposes.

8. Governance and Accountability:

  • Data Protection Officer (DPO): Determine if your organization requires a DPO and ensure that the role is fulfilled by someone with the necessary expertise and independence.
  • Record Keeping: Ensure that records of processing activities are maintained and can be provided upon request.
  • GDPR Training: Evaluate whether employees, particularly those handling personal data, have received adequate training on GDPR requirements.

9. Monitor Regulatory Changes and Case Law:

  • Stay Updated: Regularly review updates to GDPR guidelines, case law, and enforcement actions to identify new or evolving risk areas.
  • Regulatory Engagement: Engage with Data Protection Authorities (DPAs) when necessary to clarify compliance expectations.

10. Conduct Regular Audits and Risk Assessments:

  • Internal Audits: Regularly audit your GDPR compliance processes to identify gaps or areas of improvement.
  • Risk Assessments: Conduct Data Protection Impact Assessments (DPIAs) for processing activities that are likely to result in high risks to individuals’ rights and freedoms.

11. Engage with Stakeholders:

  • Cross-Functional Collaboration: Work with various departments (e.g., IT, Legal, HR, Marketing) to identify risks from their specific perspectives.
  • Third-Party Risk: Engage with third parties to ensure their compliance with GDPR, especially if they process data on your behalf.

12. Develop a Mitigation Plan:

  • Prioritize Risks: Based on the identified risks, prioritize them based on their potential impact and likelihood.
  • Action Plan: Develop and implement an action plan to mitigate these risks, including updating policies, enhancing security measures, and providing additional training.

Conclusion:

Identifying risk areas in GDPR compliance is an ongoing process that requires a thorough understanding of the regulation, continuous monitoring of data practices, and active collaboration across the organization. By systematically addressing each aspect of GDPR, organizations can better manage compliance risks and protect the personal data they handle.

https://www.techtarget.com/searchdatamanagement/tip/Six-data-risk-management-steps-for-GDPR-compliance

Cybersecurity Takedowns

Information security programs are not easy or totally successful on a global scale. In fact, performing a takedown—that is, successfully removing or blocking malware implemented on a vast scale and/or stopping malicious individuals or organizations that create and disseminate it—is very difficult for many reasons. Examining several cybersecurity response programs, evaluating their levels of success and describing various common malware programs can help reveal methods to help combat cyber-incidents.

https://www.isaca.org/resources/isaca-journal/issues/2019/volume-6/cybersecurity-takedowns

Based on the information in the article “Cybersecurity Takedowns,” here are some additional recommendations that align with current frameworks, standards, and guidelines for improving cybersecurity measures:

  1. Enhanced Coordination and Collaboration:
    • Foster stronger coordination among software vendors, internet service providers, and internet malware researchers to stop malicious activities before they escalate.
    • Establish and support focused groups dedicated to consistent software solutions and updates across vendors.
  2. Timely Updates and Patch Management:
    • Ensure timely updates of antivirus software and regular patch management to mitigate zero-day vulnerabilities.
    • Encourage organizations to adopt automated patch management systems to ensure consistency and timeliness.
  3. Improved Threat Detection and Response:
    • Utilize AI and machine learning technologies to enhance the detection of cyber anomalies and respond to threats more effectively.
    • Implement robust intrusion detection and prevention systems that can quickly identify and mitigate zero-day and AI-driven attacks.
  4. Regular Penetration Testing:
    • Conduct frequent penetration testing to assess the strength of cyber defenses and identify vulnerabilities before they can be exploited.
    • Use results from penetration tests to prioritize and remediate critical vulnerabilities.
  5. Comprehensive Cyberhygiene Practices:
    • Promote good cyberhygiene practices across all organizations, regardless of size, to ensure data protection and security.
    • Implement secure configurations for all devices, maintain mobile device management policies, and ensure the use of approved software and applications only.
  6. Network and Device Security Enhancements:
    • Protect the network by implementing segmentation, user-access controls, multifactor authentication, and continuous network monitoring.
    • Secure all devices through standardized configurations, regular maintenance, and real-time scanning for sensitive data movements.
  7. Data Protection Measures:
    • Use data encryption for data at rest and in transit to safeguard sensitive information.
    • Regularly back up data and test restoration processes to ensure data integrity and availability in case of a breach or ransomware attack.
  8. Supply Chain Security:
    • Conduct security reviews and assessments of supply chain partners to ensure uniform security standards.
    • Implement random inspections and tests to verify compliance with access and authentication controls.
  9. Strengthening Legal and Enforcement Measures:
    • Advocate for stronger penalties and standardized laws across countries to deter cybercriminal activities.
    • Improve international cooperation for cybercrime investigations and takedowns through coordinated efforts and information sharing.
  10. Addressing Emerging Threats:
    • Develop and deploy tools to recognize and mitigate threats from the Internet of Things (IoT) devices, which are often poorly secured.
    • Prepare for weaponized artificial intelligence threats by investing in advanced detection and mitigation technologies.

By implementing these recommendations, organizations can strengthen their cybersecurity posture and be better prepared to respond to the ever-evolving landscape of cyber threats.

CrowdStrike IT Outage Explained by a Windows Developer

Understanding the CrowdStrike IT Outage: Insights from a Former Windows Developer

Introduction 

Hey, I’m Dave. Welcome to my shop.

I’m Dave Plummer, a retired software engineer from Microsoft, going back to the MS-DOS and Windows 95 days. Thanks to my time as a Windows developer, today I’m going to explain what the CrowdStrike issue actually is, the key difference in kernel mode, and why these machines are bluescreening, as well as how to fix it if you come across one.

Now, I’ve got a lot of experience waking up to bluescreens and having them set the tempo of my day, but this Friday was a little different. However, first off, I’m retired now, so I don’t debug a lot of daily blue screens. And second, I was traveling in New York City, which left me temporarily stranded as the airlines sorted out the digital carnage.

But that downtime gave me plenty of time to pull out the old MacBook and figure out what was happening to all the Windows machines around the world. As far as we know, the CrowdStrike bluescreens that we have been seeing around the world for the last several days are the result of a bad update to the CrowdStrike software. But why? Today I want to help you understand three key things.

Key Points

  • Why the CrowdStrike software is on the machines at all.
  • What happens when a kernel driver like CrowdStrike fails.
  • Precisely why the CrowdStrike code faults and brings the machines down, and how and why this update caused so much havoc.

Handling Crashes at Microsoft 

As systems developers at Microsoft in the 1990s, we handled crashes like this as part of our normal bread and butter. Every dev at Microsoft, at least in my area, had two machines. For example, when I started in Windows NT, I had a Gateway 486 DX2/50 as my main dev machine, and then some old 386 box as the debug machine. Normally you would run your test or debug bits on the debug machine while connected to it as the debugger from your good machine.

Anti-Stress Process 

On nights and weekends, however, we did something far more interesting. We ran a process called Anti-Stress. Anti-Stress was a bundle of tests that would automatically download to the test machines and run under the debugger. So every night, every test machine, along with all the machines in the various labs around campus, would run Anti-Stress and put it through the gauntlet.

The stress tests were normally written by our test engineers, who were software developers specially employed back in those days to find and catch bugs in the system. For example, they might write a test to simply allocate and use as many GDI brush handles as possible. If doing so causes the drawing subsystem to become unstable or causes some other program to crash, then it would be caught and stopped in the debugger immediately.

The following day, all of the crashes and assertions would be tabulated and assigned to an individual developer based on the area of code in which the problem occurred. As the developer responsible, you would then use something like Telnet to connect to the target machine, debug it, and sort it out.

Debugging in Assembly Language 

All this debugging was done in assembly language, whether it was Alpha, MIPS, PowerPC, or x86, and with minimal symbol table information. So it’s not like we had Visual Studio connected. Still, it was enough information to sort out most crashes, find the code responsible, and either fix it or at least enter a bug to track it in our database.

Kernel Mode versus User Mode 

The hardest issues to sort out were the ones that took place deep inside the operating system kernel, which executes at ring zero on the CPU. The operating system uses a ring system to bifurcate code into two distinct modes: kernel mode for the operating system itself and user mode, where your applications run. Kernel mode does tasks such as talking to the hardware and the devices, managing memory, scheduling threads, and all of the really core functionality that the operating system provides.

Application code never runs in kernel mode, and kernel code never runs in user mode. Kernel mode is more privileged, meaning it can see the entire system memory map and the contents of any physical page. User mode sees only the memory pages that the kernel wants it to see. So if you’re getting the sense that the kernel is very much in control, that’s an accurate picture.

Even if your application needs a service provided by the kernel, it won’t be allowed to just run down inside the kernel and execute it. Instead, your user thread will reach the kernel boundary and then raise an exception and wait. A kernel thread on the kernel side then looks at the specified arguments, fully validates everything, and then runs the required kernel code. When it’s done, the kernel thread returns the results to the user thread and lets it continue on its merry way.
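
To make that handoff concrete, here’s a minimal user-mode sketch of the pattern just described: the application never executes kernel code itself; it submits a request across a single boundary, and the “kernel side” validates the arguments before doing the privileged work. All of the names here (syscall_dispatch, kernel_write_log) are invented for illustration.

    /* Minimal user-mode model of the user/kernel handoff described above.
       All names are invented for illustration. */
    #include <stdio.h>

    typedef enum { STATUS_SUCCESS, STATUS_INVALID_PARAMETER } status_t;

    /* "Kernel side": fully validates the arguments, then does the work. */
    static status_t kernel_write_log(const char *msg) {
        if (msg == NULL)
            return STATUS_INVALID_PARAMETER;  /* fail the call, not the system */
        printf("[kernel] %s\n", msg);
        return STATUS_SUCCESS;
    }

    /* The boundary: the only door from user mode into the kernel. */
    static status_t syscall_dispatch(int number, const char *arg) {
        switch (number) {
        case 1:  return kernel_write_log(arg);     /* validated on the far side */
        default: return STATUS_INVALID_PARAMETER;  /* unknown service: refused  */
        }
    }

    int main(void) {
        /* "User side": submits a request and waits for the result. */
        status_t s = syscall_dispatch(1, "hello from user mode");
        return s == STATUS_SUCCESS ? 0 : 1;
    }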

Why Kernel Crashes Are Critical 

There is one other substantive difference between kernel mode and user mode. When application code crashes, the application crashes. When kernel mode crashes, the system crashes. It crashes because it has to. Imagine a really simple bug in the kernel that frees memory twice. When the kernel detects that it’s about to free already-freed memory, it treats this as a critical failure and blue screens the system, because the alternatives could be worse.

Consider a scenario where this double-free code is allowed to continue, maybe with an error message, maybe even allowing you to save your work. The problem is that things are so corrupted at this point that saving your work could do more damage, erasing or corrupting the file beyond repair. Worse, since it’s the kernel itself that’s experiencing the issue, application programs are no longer protected from one another in the usual way. The last thing you want is Solitaire triggering a kernel bug that damages your Git enlistment.

And that’s why when an unexpected condition occurs in the kernel, the system is just halted. This is not a Windows thing by any stretch. It is true for all modern operating systems like Linux and macOS as well. In fact, the biggest difference is the color of the screen when the system goes down. On Windows, it’s blue, but on Linux it’s black, and on macOS, it’s usually pink. But as on all systems, a kernel issue is a reboot at a minimum.
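
To see why halting is the least-bad option, here’s a toy user-mode model of a heap manager that detects a double free and deliberately halts rather than continuing with untrustworthy state. The allocator and names are invented; in the real Windows kernel this role is played by a bug check (KeBugCheckEx), which is what paints the blue screen.

    /* Toy model of why a kernel halts on heap corruption rather than
       continuing. The allocator and names are invented for illustration. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { void *mem; int freed; } tracked_alloc_t;

    static void kernel_free(tracked_alloc_t *a) {
        if (a->freed) {
            /* The heap's state can no longer be trusted; halting is safer
               than, say, flushing corrupt data to disk. In the real Windows
               kernel this is the role of a bug check (KeBugCheckEx). */
            fprintf(stderr, "BUGCHECK: double free detected, halting\n");
            abort();
        }
        free(a->mem);
        a->mem = NULL;
        a->freed = 1;
    }

    int main(void) {
        tracked_alloc_t a = { malloc(64), 0 };
        kernel_free(&a);
        kernel_free(&a);  /* the second free triggers the simulated bug check */
        return 0;
    }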

What Runs in Kernel Mode 

Now that we know a bit about kernel mode versus user mode, let’s talk about what specifically runs in kernel mode. And the answer is very, very little. The only things that go in the kernel mode are things that have to, like the thread scheduler and the heap manager and functionality that must access the hardware, such as the device driver that talks to a GPU across the PCIe bus. And so the totality of what you run in kernel mode really comes down to the operating system itself and device drivers.

And that’s where CrowdStrike enters the picture with their Falcon sensor. Falcon is a security product, and while it’s not just simply an antivirus, it’s not that far off the mark to look at it as though it’s really anti-malware for the server. But rather than just looking for file definitions, it analyzes a wide range of application behavior so that it can try to proactively detect new attacks before they’re categorized and listed in a formal definition.

CrowdStrike Falcon Sensor 

To be able to see that application behavior from a clear vantage point, that code needed to be down in the kernel. Without getting too far into the weeds of what CrowdStrike Falcon actually does, suffice it to say that it has to be in the kernel to do it. And so CrowdStrike wrote a device driver, even though there’s no hardware device that it’s really talking to. But by writing their code as a device driver, it lives down with the kernel in ring zero and has complete and unfettered access to the system, data structures, and the services that they believe it needs to do its job.

Everybody at Microsoft, and probably at CrowdStrike, is aware of the stakes when you run code in kernel mode, and that’s why Microsoft offers WHQL certification, which stands for Windows Hardware Quality Labs. Drivers labeled as WHQL certified have been thoroughly tested by the vendor, have passed the Windows Hardware Lab Kit testing on various platforms and configurations, and are digitally signed by Microsoft as compatible with the Windows operating system. By the time a driver makes it through the WHQL tests and certification, you can be reasonably assured that it is robust and trustworthy, and Microsoft issues a digital certificate for it. As long as the driver itself never changes, the certificate remains valid.

CrowdStrike’s Agile Approach 

But what if you’re CrowdStrike and you’re agile, ambitious, and aggressive, and you want to ensure that your customers get the latest protection as soon as new threats emerge? Every time something new pops up on the radar, you could make a new driver and put it through the Hardware Quality Labs, get it certified, signed, and release the updated driver. And for things like video cards, that’s a fine process. I don’t actually know what the WHQL turnaround time is like, whether that’s measured in days or weeks, but it’s not instant, and so you’d have a time window where a zero-day attack could propagate and spread simply because of the delay in getting an updated CrowdStrike driver built and signed.

Dynamic Definition Files 

What CrowdStrike opted to do instead was to include definition files that are processed by the driver but not actually included with it. So when the CrowdStrike driver wakes up, it enumerates a folder on the machine looking for these dynamic definition files, and it does whatever it is that it needs to do with them. But you can already perhaps see the problem. Let’s speculate for a moment that the CrowdStrike dynamic definition files are not merely malware definitions but complete programs in their own right, written in a p-code that the driver can then execute.

In a very real sense, then, the driver could take the update and execute the p-code within it in kernel mode, even though that update itself has never been signed. The driver becomes the engine that runs the code, and since the driver hasn’t changed, the certificate is still valid for the driver. But the update changes the way the driver operates by virtue of the p-code it contains, and what you’ve got then is unsigned code of unknown provenance running in full kernel mode.

All it would take is a single little bug like a null pointer reference, and the entire temple would be torn down around us. Put more simply, while we don’t yet know the precise cause of the bug, executing untrusted p-code in the kernel is risky business at best and could be asking for trouble.
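
To make the speculation concrete, here’s a minimal sketch of the pattern being described: a small, signed “engine” that executes unsigned p-code read from a definition file. The opcodes and file format are entirely invented. Note that the interpreter’s missing bounds checks mirror exactly the kind of latent bug that is survivable in user mode but fatal in ring zero.

    /* Minimal sketch of a signed "engine" running unsigned p-code from a
       definition file. Opcodes and format are invented for illustration. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    enum { OP_HALT = 0, OP_PUSH = 1, OP_ADD = 2, OP_PRINT = 3 };

    static void run_pcode(const uint8_t *code, size_t len) {
        int64_t stack[32];
        int sp = 0;
        size_t pc = 0;
        while (pc < len) {
            switch (code[pc++]) {
            case OP_PUSH:  stack[sp++] = code[pc++]; break; /* no bounds check! */
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_PRINT: printf("%lld\n", (long long)stack[--sp]); break;
            case OP_HALT:  return;
            default:       return;  /* unknown opcode: bail out */
            }
        }
    }

    int main(void) {
        /* This "update" was never code-signed, yet the signed engine happily
           executes it. In kernel mode, any interpreter bug is a system crash. */
        const uint8_t update[] = { OP_PUSH, 40, OP_PUSH, 2, OP_ADD, OP_PRINT, OP_HALT };
        run_pcode(update, sizeof update);
        return 0;
    }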

Post-Mortem Debugging 

We can get a better sense of what went wrong by doing a little post-mortem debugging of our own. First, we need access to a crash dump report, the kind you’re used to getting in the good old NT days but that is now hidden behind the sad-face blue screen. Depending on how your system is configured, though, you can still get the crash dump info, and so there was no real shortage of dumps to look at. Here’s an example from Twitter, so let’s take a look. About a third of the way down, you can see the offending instruction that caused the crash.

It’s an attempt to load register R9 from a memory pointer held in register R8. Couldn’t be simpler. The only problem is that the pointer in R8 is garbage. It’s not a memory address at all but a small integer, 0x9C, which is likely the offset of the field they’re actually interested in within the data structure. They almost certainly started with a null pointer, added 0x9C to it, and then just dereferenced it.
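
Here’s a small, hedged reconstruction of that failure mode: a structure whose field happens to sit at offset 0x9C, reached through a null base pointer. The layout below is invented purely to match the offset; the point is that compiling rec->entry produces a load from base + 0x9C, so a null base asks the CPU to read address 0x9C, the garbage pointer seen in the dump.

    /* Hedged reconstruction of the faulting access: a field at offset 0x9C
       reached through a null base pointer. The layout is invented to match. */
    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint8_t  header[0x9C];  /* padding so 'entry' lands at offset 0x9C */
        uint32_t entry;         /* the field the driver presumably wanted  */
    } def_record_t;

    int main(void) {
        def_record_t *rec = NULL;  /* the upstream bug: a null base pointer */

        /* rec->entry compiles to "load from rec + 0x9C"; with rec null, the
           CPU is asked to read address 0x9C. We print the offset instead of
           performing the fatal dereference. */
        printf("offset of entry: 0x%zX\n", offsetof(def_record_t, entry));
        printf("address a null base would read: %p\n",
               (void *)(uintptr_t)offsetof(def_record_t, entry));
        (void)rec;
        return 0;
    }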

CrowdStrike driver woes

Now, debugging something like this is often an incremental process where you wind up establishing, “Okay, so this bad thing happened, but what happened upstream beforehand to cause the bad thing?” And in this case, it appears that the cause is the dynamic data file, downloaded as a .sys file. Instead of containing p-code or a malware definition or whatever was supposed to be in the file, it was all just zeros.

We don’t know yet how or why this happened, as CrowdStrike hasn’t publicly released that information yet. What we do know with near certainty at this point, however, is that the CrowdStrike driver that processes and handles these updates is not very resilient, with what appears to be inadequate error checking and parameter validation.

Parameter validation means checking to ensure that the data and arguments being passed to a function, and in particular to a kernel function, are valid and good. If they’re not, the code should fail the function call, not crash the entire system. But in the CrowdStrike case, there’s a bug they don’t protect against, and because their code lives in ring zero with the kernel, a bug in CrowdStrike necessarily bug checks the entire machine and deposits you at the dreaded recovery blue screen.
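
As a sketch of what that missing validation might look like, here’s a user-mode model of an update parser that rejects a malformed definition file, including one that is all zeros, instead of charging ahead. The file format, magic value, and names are all hypothetical.

    /* Sketch of the missing defensive checks: validate a definition file
       before acting on it. Format, magic value, and names are hypothetical. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define DEF_MAGIC 0x43440291u  /* invented magic number */

    typedef enum { STATUS_SUCCESS, STATUS_INVALID_PARAMETER } status_t;

    static status_t load_definitions(const uint8_t *buf, size_t len) {
        if (buf == NULL || len < sizeof(uint32_t))
            return STATUS_INVALID_PARAMETER;   /* reject; don't dereference */

        uint32_t magic;
        memcpy(&magic, buf, sizeof magic);
        if (magic != DEF_MAGIC)
            return STATUS_INVALID_PARAMETER;   /* catches a file of all zeros */

        /* ... only now is it safe to parse the rest of the file ... */
        return STATUS_SUCCESS;
    }

    int main(void) {
        uint8_t zeros[512] = {0};              /* a stand-in for the bad update */
        if (load_definitions(zeros, sizeof zeros) != STATUS_SUCCESS)
            fprintf(stderr, "update rejected; the system keeps running\n");
        return 0;
    }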

Windows Resilience 

Even though this isn’t a Windows issue or a fault with Windows itself, many people have asked me why Windows itself isn’t just more resilient to this type of issue. For example, if a driver fails during boot, why not try to boot next time without it and see if that helps?

And Windows, in fact, does offer a number of facilities like that, going back as far as booting NT with the last known good registry hive. But there’s a catch, and that catch is that CrowdStrike marked their driver as what’s known as a boot-start driver. A boot-start driver is a device driver that must be loaded for the Windows operating system to start.

Most boot-start drivers are included in driver packages that ship in the box with Windows, and Windows installs them automatically during the first boot of the system. My guess is that CrowdStrike decided they didn’t want you booting at all without the protection provided by their product, but when it crashes, as it does now, your system is completely borked.

Fixing the Issue 

Fixing a machine with this issue is fortunately not a great deal of work, but it does require physical access to the machine. To fix a machine that’s crashed due to this issue, you need to boot it into safe mode, because safe mode loads only a limited set of drivers and can mercifully still come up without this boot-start driver.

You’ll still be able to get into at least a limited system. Then, to fix the machine, use the console or File Explorer and go to C:\Windows\System32\drivers\CrowdStrike. In that folder, find the file matching the pattern C-00000291*.sys and delete it, along with anything else whose name contains 291 and that run of zeros. When you reboot, your system should come up completely normal and operational.

The absence of the update file fixes the issue and does not cause any additional ones. It’s a fair bet that update 291 won’t ever be needed or used again, so you’re fine to nuke it.

Further references 

CrowdStrike IT Outage Explained by a Windows Developer, Dave’s Garage (YouTube), 13:40

The Aftermath of the World’s Biggest IT Outage

The Great Digital Blackout: Fallout from the CrowdStrike-Microsoft Outage

i. Introduction 

On a seemingly ordinary Friday morning, the digital world shuddered. A global IT outage, unprecedented in its scale, brought businesses, governments, and individuals to a standstill. The culprit: a faulty update from cybersecurity firm CrowdStrike, clashing with Microsoft Windows systems. The aftershocks of this event, dubbed the “Great Digital Blackout,” continue to reverberate, raising critical questions about our dependence on a handful of tech giants and the future of cybersecurity.

ii. The Incident

A faulty content update to CrowdStrike’s Falcon software inadvertently triggered cascading failures across multiple regions, and a reported breach of CrowdStrike’s security monitoring systems compounded the disruption into a perfect storm. Within minutes, critical services were rendered inoperative, affecting millions of users and thousands of businesses worldwide. The outage persisted for 48 hours, making it one of the longest and most impactful in history.

iii. Initial Reports and Response

The first signs that something was amiss surfaced around 3:00 AM UTC, when users began reporting issues accessing Microsoft Azure and Office 365 services. Concurrently, CrowdStrike’s Falcon platform started exhibiting anomalies. By 6:00 AM UTC, both companies had acknowledged the outage, attributing the cause to a convergence of system failures and a sophisticated cyber attack exploiting vulnerabilities in their systems.

CrowdStrike and Microsoft activated their incident response protocols, working around the clock to mitigate the damage. Microsoft’s global network operations team mobilized to isolate affected servers and reroute traffic, while CrowdStrike’s cybersecurity experts focused on containing the breach and analyzing the attack vectors.

iv. A Perfect Storm: Unpacking the Cause

A. The outage stemmed from a seemingly innocuous update deployed by CrowdStrike, a leading provider of endpoint security solutions. The update, intended to bolster defenses against cyber threats, triggered a series of unforeseen consequences. It interfered with core Windows functionalities, causing machines to enter a reboot loop, effectively rendering them unusable.

B. The domino effect was swift and devastating. Businesses across various sectors – airlines, hospitals, banks, logistics – found themselves crippled. Flights were grounded, financial transactions stalled, and healthcare operations were disrupted.

C. The blame game quickly ensued. CrowdStrike, initially silent, eventually acknowledged their role in the outage and apologized for the inconvenience. However, fingers were also pointed at Microsoft for potential vulnerabilities in their Windows systems that allowed the update to wreak such havoc.

v. Immediate Consequences (Businesses at a Standstill)

The immediate impact of the outage was felt by businesses worldwide. 

A. Microsoft: Thousands of companies dependent on Microsoft’s Azure cloud services found their operations grinding to a halt. Websites went offline, financial transactions were halted, and communication channels were disrupted. E-commerce platforms experienced massive downtime, losing revenue by the minute, and hospital systems relying on cloud-based records faced critical disruptions that compromised patient care.

B. CrowdStrike: Similarly, CrowdStrike’s clientele, comprising numerous Fortune 500 companies, grappled with the fallout. Their critical security monitoring and threat response capabilities were significantly hindered, leaving them vulnerable.

vi. Counting the Costs: Beyond Downtime

The human and economic toll of the Great Digital Blackout is still being calculated. Preliminary estimates suggest global economic losses exceeding $200 billion, with billions of dollars in lost productivity alone, but the true cost extends far beyond financial figures. Businesses across sectors reported significant revenue losses, with SMEs particularly hard-hit. Recovery and mitigation efforts further strained financial resources, and insurance claims surged as businesses sought to recoup their losses.

  • Erosion of Trust: The incident exposed the fragility of our increasingly digital world, eroding trust in both CrowdStrike and Microsoft. Businesses and organizations now question the reliability of security solutions and software updates.
  • Supply Chain Disruptions: The interconnectedness of global supply chains was thrown into disarray. Manufacturing, shipping, and logistics faced delays due to communication breakdowns and the inability to process orders electronically.
  • Cybersecurity Concerns: The outage highlighted the potential for cascading effects in cyberattacks. A seemingly minor breach in one system can have a devastating ripple effect across the entire digital ecosystem.

vii. Reputational Damage

Both Microsoft and CrowdStrike suffered severe reputational damage. Trust in Microsoft’s Azure platform and CrowdStrike’s cybersecurity solutions was shaken. Customers, wary of future disruptions, began exploring alternative providers and solutions. The incident underscored the risks of over-reliance on major service providers and ignited discussions about diversifying IT infrastructure.

viii. Regulatory Scrutiny

In the wake of the outage, governments and regulatory bodies worldwide called for increased oversight and stricter regulations. The incident highlighted the need for robust standards to ensure redundancy, effective backup systems, and rapid recovery protocols. In the United States, discussions about enhancing the Cybersecurity Maturity Model Certification (CMMC) framework gained traction, while the European Union considered expanding the scope of the General Data Protection Regulation (GDPR) to include mandatory resilience standards for IT providers.

ix. Data Security and Privacy Concerns

One of the most concerning aspects of the outage was the potential exposure of sensitive data. Both Microsoft and CrowdStrike store vast amounts of critical and confidential data. Although initial investigations suggested that the attackers did not exfiltrate data, the sheer possibility raised alarms among clients and regulatory bodies worldwide.

Governments and compliance agencies intensified their scrutiny, reinforcing the need for robust data protection measures. Customers demanded transparency about what data, if any, had been compromised, leading to an erosion of trust in cloud services.

x. Root Causes and Analysis

Following the containment of the outage, both CrowdStrike and Microsoft launched extensive investigations to determine the root causes. Preliminary reports cited a combination of factors:

A. Zero-Day Exploits: The attackers leveraged zero-day vulnerabilities in both companies’ systems, which had not been previously detected or patched.   

B. Supply Chain Attack: A key supplier providing backend services to both companies was compromised, allowing the attackers to penetrate deeper into their networks.

C. Human Error: Configuration errors and lack of stringent security checks at critical points amplified the impact of the vulnerabilities.

D. Coordinated Attack: Cybersecurity analysts suggested that the attack bore the hallmarks of a highly coordinated and well-funded group, potentially a nation-state actor, given the sophistication and scale. The alignment of the outage across multiple critical services pointed to a deliberate and strategic attempt to undermine global technological infrastructure.

xi. Response Strategies

A. CrowdStrike’s Tactics

  • Swift Containment: Immediate action was taken to contain the breach. CrowdStrike’s incident response teams quickly identified and isolated the compromised segments of their network to prevent further penetration.
  • Vulnerability Mitigation: Patches were rapidly developed and deployed to close the exploited security gaps. Continuous monitoring for signs of lingering threats or additional vulnerabilities was intensified.
  • Client Communication: Transparency became key. CrowdStrike maintained open lines of communication with its clients, providing regular updates, guidance on protective measures, and reassurance to mitigate the trust deficit.

B. Microsoft’s Actions

  • Global Response Scaling: Leveraging its extensive resources, Microsoft scaled up its global cybersecurity operations. Intensive efforts were made to stabilize systems, restore services, and strengthen defenses against potential residual threats.
  • Service Restoration: Microsoft prioritized the phased restoration of services. This approach ensured that each phase underwent rigorous security checks to avoid reintroducing vulnerabilities.
  • Collaboration and Information Sharing: Recognizing the widespread impact, Microsoft facilitated collaboration with other tech firms, cybersecurity experts, and government agencies. Shared intelligence helped in comprehending the attack’s full scope and in developing comprehensive defense mechanisms.

xii. Broad Implications 

A. Evolving Cyber Threat Landscape

  • Increased Sophistication: The attack underscored the evolving sophistication of cyber threats. Traditional security measures are proving insufficient against highly organized and well-funded adversaries.
  • Proactive Security Posture: The event emphasized the need for a proactive security stance, which includes real-time threat intelligence, continuous system monitoring, and regular vulnerability assessments.

B. Trust in Cloud Computing

  • Cloud Strategy Reevaluation: The reliance on cloud services came under scrutiny. Organizations began rethinking their cloud strategies, weighing the advantages against the imperative of reinforcing security protocols.
  • Strengthened Security Measures: There is a growing emphasis on bolstering supply chain security. Companies are urged to implement stringent controls, cross-verify practices with their vendors, and engage in regular security audits.

xiii. A Catalyst for Change: Lessons Learned

The Great Digital Blackout serves as a stark reminder of the need for a comprehensive reevaluation of our approach to cybersecurity and technology dependence. Here are some key takeaways:

  • Prioritize Security by Design: Software development and security solutions need to prioritize “security by design” principles. Rigorous testing and vulnerability assessments are crucial before deploying updates.
  • Enhanced Cybersecurity: The breach of CrowdStrike’s systems highlighted potential vulnerabilities in cybersecurity frameworks. Enhanced security measures and continuous monitoring are vital to prevent similar incidents.
  • Diversity and Redundancy: Over-reliance on a few tech giants can be a vulnerability. Diversifying software and service providers, coupled with built-in redundancies in critical systems, can mitigate the impact of such outages.
  • Redundancy and Backup: The incident underscored the necessity of having redundant systems and robust backup solutions. Businesses are now more aware of the importance of investing in these areas to ensure operational continuity during IT failures.
  • Disaster Recovery Planning: Effective disaster recovery plans are critical. Regular drills and updates to these plans can help organizations respond more efficiently to disruptions.
  • Communication and Transparency: Swift, clear communication during disruptions is essential. Both CrowdStrike and Microsoft initially fell short in this area, causing confusion and exacerbating anxieties.
  • Regulatory Compliance: Adhering to evolving regulatory standards and being proactive in compliance efforts can help businesses avoid penalties and build resilience.
  • International Collaboration: Cybersecurity threats require an international response. Collaboration between governments, tech companies, and security experts is needed to develop robust defense strategies and communication protocols.

xiv. The Road to Recovery: Building Resilience

The path towards recovery from the Great Digital Blackout is multifaceted. It involves:

  • Post-Mortem Analysis: Thorough investigations by CrowdStrike, Microsoft, and independent bodies are needed to identify the root cause of the outage and prevent similar occurrences.
  • Investing in Cybersecurity Awareness: Educating businesses and individuals about cyber threats and best practices is paramount. Regular training and simulation exercises can help organizations respond more effectively to future incidents.
  • Focus on Open Standards: Promoting open standards for software and security solutions can foster interoperability and potentially limit the impact of individual vendor issues.

xv. A New Era of Cybersecurity: Rethinking Reliance

The Great Digital Blackout serves as a wake-up call. It underscores the need for a more robust, collaborative, and adaptable approach to cybersecurity. By diversifying our tech infrastructure, prioritizing communication during disruptions, and fostering international cooperation, we can build a more resilient digital world.

The event also prompts a conversation about our dependence on a handful of tech giants. While these companies have revolutionized our lives, the outage highlighted the potential pitfalls of such concentrated power.

xvi. Conclusion 

The future of technology may involve a shift towards a more decentralized model, with greater emphasis on data sovereignty and user control. While the full impact of the Great Digital Blackout is yet to be fully understood, one thing is certain – the event has irrevocably altered the landscape of cybersecurity, prompting a global conversation about how we navigate the digital age with greater awareness and resilience.

This incident serves as a stark reminder of the interconnected nature of our digital world. As technology continues to evolve, so too must our approaches to managing the risks it brings. The lessons learned from this outage will undoubtedly shape the future of IT infrastructure, making it more robust, secure, and capable of supporting the ever-growing demands of the digital age.

xvii. Further references 

Microsoft IT outages live: Dozens more flights cancelled …, The Independent (independent.co.uk)

Helping our customers through the CrowdStrike outage, Microsoft (news.microsoft.com)

CrowdStrike-Microsoft Outage: What Caused the IT Meltdown, The New York Times (nytimes.com)

Microsoft IT outage live: Millions of devices affected by …, The Independent (independent.co.uk)

What’s next for CrowdStrike, Microsoft after update causes …, USA Today (usatoday.com)

CrowdStrike and Microsoft: What we know about global IT …, BBC (bbc.com)

Chaos persists as IT outage could take time to fix …, BBC (bbc.com)

Huge Microsoft Outage Linked to CrowdStrike Takes Down …, WIRED (wired.com)

CrowdStrike’s Role In the Microsoft IT Outage, Explained, Time (time.com)

CrowdStrike admits ‘defect’ in software update caused IT …, Euronews (euronews.com)

Microsoft: CrowdStrike Update Caused Outage For 8.5 …, CRN (crn.com)

It could take up to two weeks to resolve ‘teething issues …, Australian Broadcasting Corporation (abc.net.au)

Microsoft-CrowdStrike Outage Causes Chaos for Flights …, CNET (cnet.com)