Max Potential Effectiveness - Mixed structural models for decision making under uncertainty us

number of staff resources applied to manage the control, or the actual acceptance of the control by users, etc., although any control has a chance to perform at least as well as was intended by the average or expected effect. In the above example, a ‘High’ control which is 95% effective on average would be expected to perform no worse than 80% of the average and possibly up to 140% of the average to a maximum of 100%. This permits the model to employ controls with lower or diverse average effects (e.g. inferior grade policies or software controls) but to also capture the idea that a ‘higher’ or better level or quality of deployment is expected to narrow the variance of the control’s actual performance results regardless of the anticipated average effectiveness.

4) The resulting effectiveness range (Min = 76%; Average = 95%; Max = 100%) is then converted to a PERT distribution⁵⁵ (Figure 80) which is typically used for the estimation of probability distributions with limited information based on expert judgment (Malcolm, Roseboom et al. 1959; Vose 2008).

5) The PERT distribution is then used to randomize the effectiveness value of the control on each Monte Carlo simulation round (i.e. daily) where most of the time the value will reflect performance around the average but may vary as low or as high as the minimum or maximum respectively. In this example, the randomized value on the day is 94% (Figure 80). This permits the actual effectiveness of the control to vary within the anticipated range of effectiveness and reflects the inherent uncertainty of the control’s effectiveness in use despite its intended effectiveness.
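Steps 4 and 5 can be sketched as follows. This is a minimal illustration rather than the model's own implementation: it assumes the common Beta-based form of the PERT distribution with a shape parameter of 4, and it treats the 95% 'Average' as the most likely value of the distribution. The function name `sample_pert` and the fixed seed are hypothetical.

```python
import random

def sample_pert(minimum, mode, maximum, rng, shape=4.0):
    """Draw one value from a PERT distribution using a rescaled Beta draw.

    Assumption: standard PERT parameterisation (shape = 4), with `mode`
    standing in for the 'Average' effectiveness quoted in the text.
    """
    span = maximum - minimum
    alpha = 1.0 + shape * (mode - minimum) / span
    beta = 1.0 + shape * (maximum - mode) / span
    return minimum + span * rng.betavariate(alpha, beta)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
# One randomized effectiveness value per simulated day for a 'High' control.
daily_effectiveness = [sample_pert(0.76, 0.95, 1.00, rng) for _ in range(365)]
mean_effectiveness = sum(daily_effectiveness) / len(daily_effectiveness)
```

Every draw stays inside the 76% to 100% range, and most draws cluster near the most likely value, which is the behaviour described in step 5.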

55 PERT stands for Program Evaluation and Review Technique (Malcolm, Roseboom et al. 1959). See https://reference.wolfram.com/language/ref/PERTDistribution.html for a description of the specification and use of the PERT function.

6) The randomized effectiveness value (76%-100%) is then used as the input probability in a Bernoulli function to calculate whether the controls covered by the widget were in fact effective for the simulation round (Figure 82). In this example, the Web Application Secure Coding Practices consist of 4 sub-controls, each with its own individual model dependencies: ‘Black Box testing’; ‘Static Code Analysis’; ‘Type Safe API’; and ‘Developer Security Training’. While as a group these controls are considered 94% effective, only 3 of the 4 controls have Bernoulli values resulting in 1 for this simulation round: the sub-control ‘TypeSafeAPI’ is considered to be ineffective for that simulation round (Value = ‘F’ or FALSE), which increases the probability that an attack on a vulnerability associated with that control will be successful for that round.
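The Bernoulli step in step 6 can be sketched as one independent trial per sub-control, each using the day's randomized effectiveness value as its success probability. The function name and the seed are hypothetical, and the independence of the four trials is an assumption of this sketch.

```python
import random

def sub_control_outcomes(effectiveness, sub_controls, rng):
    """Return {sub_control: True/False} from one Bernoulli trial per sub-control."""
    return {name: rng.random() < effectiveness for name in sub_controls}

SUB_CONTROLS = [
    "Black Box testing",
    "Static Code Analysis",
    "Type Safe API",
    "Developer Security Training",
]

rng = random.Random(7)  # fixed seed so the sketch is repeatable
outcomes = sub_control_outcomes(0.94, SUB_CONTROLS, rng)
effective_count = sum(outcomes.values())  # number of sub-controls that held this round
```

A sub-control whose trial returns False corresponds to the 'F' value in the example above and leaves the associated vulnerability more exposed for that round.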

Figure 82 - Conversion of Control Effectiveness Level to Bernoulli Values

Attribution of Confidentiality, Integrity and Availability Events

The system then attributes security incidents to specific successful exploits which are inferred to result in associated confidentiality (C), integrity (I) or availability (A) impacts to the component based on CVE guidance⁵⁶. The total number of potential ‘breaches’ in C, I or A per unit of simulation is determined by the number of successful exploits multiplied by the chance of exploit. In the following figure, only the ‘Exploit Command Injection’ exploit was successful out of the four possible exploits, and has resulted in the possibility of both one Integrity breach and one Availability breach:

Figure 83 – Calculation of C, I or A breaches from Successful (non-prevented) Attacks

Once the possible types of breaches are determined, the system adds up all of the ‘non-prevented’ breaches and passes these to an algorithm which then determines whether the breaches are detected within the simulation round. Non-detected breaches are added to the following simulation round’s count of non-prevented breaches which introduces a one-period lag or system memory effect at this stage in the time series. This is considered more realistic than a system in which we assume all non-prevented breaches are fully detected in each round and makes the system security performance more sensitive to the level of detection controls.

Detection of Successful Exploits

Detection of breaches is a function of a range of monitoring controls that affect the organization’s ability to identify breaches as they are occurring and before they have a chance to have substantial impact on the system and the associated dependent business processes. Successful detection lowers the number of breaches that can possibly impact the system within the current simulation round. The number of detected breaches out of the total number of non-prevented breaches is computed based on the Detection control level set by the user and the resulting random control effectiveness percentage calculated in a manner similar to the Prevention Controls described above, where each non-prevented breach has a chance of being detected in proportion to the effectiveness of the Detection control. Successfully detected breaches are passed to the Counter Phase which attempts to eliminate the breach. ‘Non-detected’ breaches are subsequently added to the next day’s newly non-prevented breaches and are subject to detection again in the next simulation round.

[Exploit attribution table: for each exploit (ExploitCommandInjection, ExploitXSS, ExploitRemoteFileInclusion, ExploitSQLInjection), the columns Max Intrusions, C, I, A, Attack? (T/F), Bern, Input and Psi Mean; only ExploitCommandInjection has Attack? = T.]

Modelling undetected breaches in this manner introduces a realistic one-period lag effect on the resulting time series of losses, where undetected breaches continually degrade system performance and must be detected before counter controls can be applied. The system at this stage simulates as expected: if insufficient resources are spent on Detection, the number of undetected breaches increases over time and eventually overwhelms the Detection resources, which cannot catch up. Underspending on detection can therefore cause the number of undetected breaches to increase to the point where the attributed impacts exceed a threshold (e.g. system availability dips below an operationally viable level) and the business experiences catastrophic losses. Figure 84 indicates the conversion of successful C, I, or A breaches into detected and non-detected security incidents respectively:

Figure 84 - Conversion of Non-Prevented Breaches to Detected and Non-Detected Breaches

1. New successful (i.e. ‘non-prevented’) C, I, or A breaches in the current simulation round are summed across all exploited system components (e.g. 8 Availability breaches).

2. Last period’s ‘Undetected breaches’ are added to the current round’s ‘non-prevented’ breaches.

3. Total current round breaches are then subject to ‘Detection’ based on the effectiveness level of the ‘Detection’ controls.

4. Successfully detected breaches in the current round are passed to the ‘Counter’ phase and do not immediately affect system availability in the current round unless subsequently ‘not-countered’. Undetected breaches in the current round impact system availability in the current round (see section 7 below).
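The four steps above can be sketched as a single function that carries undetected breaches into the next round. This is an illustrative sketch, assuming each open breach is detected independently with probability equal to the round's Detection effectiveness; the function name, the daily counts and the 90% effectiveness value are hypothetical.

```python
import random

def detect(new_breaches, carried_over, effectiveness, rng):
    """Split this round's open breaches into (detected, undetected)."""
    total = new_breaches + carried_over                  # steps 1 and 2
    detected = sum(rng.random() < effectiveness          # step 3: one trial per breach
                   for _ in range(total))
    return detected, total - detected                    # step 4

rng = random.Random(1)  # fixed seed so the sketch is repeatable
undetected = 0
history = []
for new_breaches in [8, 3, 0, 5]:  # hypothetical daily non-prevented breach counts
    detected, undetected = detect(new_breaches, undetected, 0.90, rng)
    history.append((detected, undetected))
```

Because the undetected count feeds back into the next call, a low effectiveness value lets the backlog grow from round to round, which is the one-period lag effect described in the text.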


Countering Detected Breaches

In the Counter Phase, detected breaches have a chance of being countered or interrupted by support staff assigned to eliminate detected threats in the environment. Similar to the Detection Control, Counter efforts add a lag effect to the resulting time series of losses since un-countered breaches in one period must be successfully countered in a subsequent period before Recovery controls can be applied and any resulting uncountered breaches have a chance to degrade system performance.

For this control, available counter resources (staff) are assumed to be able to attempt to counter all detected breaches within the day⁵⁷. Similar to the treatment for Detection, the number of ‘detected but not countered’ breaches is then computed based on the Counter control level and the resulting effectiveness percentage set for the simulation. The total number of ‘detected-but-not-countered’ breaches is calculated and then added to the next day’s newly detected breaches; countered breaches are passed to the Recovery Phase. The system at this stage simulates as expected: if Counter controls are insufficient to address the number of detected breaches, the number of un-countered breaches increases over time and eventually overwhelms the Counter resources, which cannot catch up. Underspending on Countering can therefore cause the number of un-countered breaches to increase to the point where the attributed impacts exceed a threshold (e.g. system availability dips below an operationally viable level) and the business experiences catastrophic losses. Figure 85 illustrates the conversion of detected breaches into countered and non-countered incidents:

57 In a real world setting, individual staff would typically be assigned multiple detected breach ‘tickets’ to work on at the same time but may not be able to attempt to counter each detected breach within a given day. In that case, some detected breaches would remain automatically ‘uncountered’ (since countering was not even attempted) and would be directly added to the next day’s detected breaches. In the current model we assume that we can attempt to counter each detected breach within the day regardless of the number of resources assigned to countering. This factor could be easily adjusted to simulate increased or decreased staffing levels, ‘coverage’ or throughput by countering personnel, aside from the overall effectiveness assigned to the Counter control itself. For example, if we assumed that staff could effectively attempt to counter one breach per day, then whenever the number of detected breaches exceeded the number of available staff assigned to counter breaches on the day, the difference between the number of staff and the number of detected breaches needing to be countered could be automatically added to the next day’s newly detected breaches. In sensitivity tests of that version of the model, however, even at high control effectiveness, low non-attempted numbers of breaches quickly add up over several rounds and eventually overwhelm the system. Additional tuning of the model would therefore be required to reflect the difference between overall Counter control effectiveness vs. resource coverage of open breaches.

Figure 85 - Conversion of Detected Breaches to Countered and Non-Countered Security Breaches

1. Current round detected breaches proceed to the ‘Counter’ phase.

2. Last round’s ‘detected but not countered’ breaches are added to the current round’s detected breaches to determine total number of potentially counterable breaches.

3. The number of countered and non-countered breaches for this round is determined based on the effectiveness level of the Counter control.

4. Countered breaches proceed to ‘Recover’ phase. Non-countered breaches are added to next round’s potentially counterable breaches and affect system Availability.
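The capacity-limited variant discussed in footnote 57 can be sketched as below. This is not the model as run in the thesis, which assumes every detected breach is attempted within the day; the `staff` parameter, the one-attempt-per-staff-member rule as coded here, and the daily counts are hypothetical illustrations of that footnote.

```python
import random

def counter_with_capacity(detected, backlog, staff, effectiveness, rng):
    """Return (countered, carried_over) when each staff member attempts one breach per day."""
    open_breaches = detected + backlog
    attempted = min(open_breaches, staff)       # breaches beyond capacity are not attempted
    countered = sum(rng.random() < effectiveness for _ in range(attempted))
    return countered, open_breaches - countered  # failed and non-attempted breaches carry over

rng = random.Random(5)  # fixed seed so the sketch is repeatable
backlog = 0
backlog_by_day = []
for detected in [4, 6, 5, 7, 6]:  # hypothetical daily detected breach counts
    countered, backlog = counter_with_capacity(detected, backlog, staff=5,
                                               effectiveness=0.95, rng=rng)
    backlog_by_day.append(backlog)
```

Setting `staff` at or above the largest possible number of open breaches recovers the unconstrained behaviour of steps 1 to 4.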


Recovering from Countered Breaches

Finally, in the Recovery Phase, the system has a chance to recover system functionality from detected and countered breaches by support staff assigned to the Recovery control function. Similar to the Detection and Counter Controls, Recovery efforts add a lag effect to the resulting time series of losses since any Countered but un-recovered breaches must be addressed in the next round of Recovery. Similar to the treatment for Counter controls, the number of ‘countered but not recovered’ breaches is computed based on the Recovery control level and the resulting effectiveness percentage set by the user in a manner similar to the Prevention, Detection and Counter Controls. Each Countered breach addressed by the assigned Recovery staff member has a chance of being not recovered in inverse proportion to the effectiveness of the control. The total number of ‘countered but not recovered’ breaches is then added to the next day’s newly countered breaches. Recovered breaches are then subtracted from the total number of breaches in the day that have a chance to affect C, I or A factors:

Figure 86 - Conversion of Countered Breaches to Recovered and Non-Recovered Security Breaches

1. Current round countered breaches proceed to the ‘Recover’ phase.

2. Last round’s ‘countered but not recovered’ breaches are added to the current round’s countered breaches to determine total number of potentially recoverable breaches.

3. The number of recovered and non-recovered breaches for this round is determined based on the effectiveness level of the Recover control. Recovered breaches terminate and are not counted as affecting system Availability. Non-recovered breaches are added to next round’s potentially recoverable breaches and affect system Availability.
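Chaining the Detection, Counter and Recovery phases gives the full daily pipeline. The sketch below is a simplified illustration under stated assumptions: each phase is modelled as independent per-breach trials at a fixed effectiveness (the thesis randomizes effectiveness each round), and the arrival counts, seed and function names are hypothetical.

```python
import random

def stage(incoming, backlog, effectiveness, rng):
    """Return (passed, held): each open breach passes the phase with probability `effectiveness`."""
    total = incoming + backlog
    passed = sum(rng.random() < effectiveness for _ in range(total))
    return passed, total - passed

def simulate(non_prevented_per_day, eff_detect, eff_counter, eff_recover, rng):
    """Return one (undetected, uncountered, unrecovered, recovered) tuple per day."""
    undetected = uncountered = unrecovered = 0
    rows = []
    for new in non_prevented_per_day:
        detected, undetected = stage(new, undetected, eff_detect, rng)
        countered, uncountered = stage(detected, uncountered, eff_counter, rng)
        recovered, unrecovered = stage(countered, unrecovered, eff_recover, rng)
        rows.append((undetected, uncountered, unrecovered, recovered))
    return rows

rng = random.Random(3)  # fixed seed so the sketch is repeatable
arrivals = [rng.randint(0, 3) for _ in range(365)]  # hypothetical daily non-prevented breaches
rows = simulate(arrivals, 0.95, 0.95, 0.95, rng)
```

The first three entries of each row are the breach counts that remain open at the end of the day and can therefore affect Availability; the fourth is the number of breaches that terminated.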

The following figure illustrates the combination of undetected, detected-not-countered and countered-not-recovered breaches for a typical High Controls system for one year:

Figure 87 – One Year Time Series of System Availability and Daily Business Losses based on Effective Breach Counts

Calculating Availability (A) Impacts

In the current model, although we calculate breach events attributable to Confidentiality (C), Integrity (I) or Availability (A) impacts, we have focused solely on Availability impacts for the purposes of calculating business losses. Availability is commonly measured as a percentage of system computing capacity from zero to 100%, where operating standards are typically expected to be above 95% and where ‘high availability’ systems incorporating redundancy are expected to operate at 99.9% or higher (Gray and Siewiorek 1991). Availability generally affects the throughput of computing transactions: in cases of less than 100% availability, the average user would experience a range of work slowdown or interruption effects (e.g. computational delays leading to the clock cursor, or intermittent lack of connectivity, etc.). In contrast, Integrity events are assumed to typically result in computing errors such that transactions would require rework.

To calculate the Availability effects, we take a weighted average of the individual Availability impacts of the two sets of impactful breaches: 1) Non-Detected and Detected-but-not-Countered breaches (which are assumed to have similar availability impact profiles since both are not yet countered in the system); and 2) Countered-but-not-Recovered breaches, which are assumed to have a different ‘impact rate’ than undetected or non-countered breaches since countering controls halt and then diminish the effect of the breach over time (Tjoa, Jakoubi et al. 2011). The following lookup table is used to individually determine the impact on Availability of the two sets of breaches. The marginal impact decrement from 100% is an exponential function of the impact rate, where the impact rate is assumed to be half as large for breaches in Recovery as for undetected or detected-but-not-countered breaches:

Figure 88 – Availability Impact Lookup Table (Undetected and Detected-but-not-Countered Breaches)

The resulting Availability impacts are uniquely calculated for each type of breach, for each simulation round, based on the number of breaches in that round:

Figure 89 – Availability Impacts based on # of Undetected + Detected-but-not-Countered Breaches

Figure 90 - Availability Impacts based on # of Countered-but-not-Recovered Breaches

The weighted average Availability % is then calculated based on the number of ‘Undetected + Not Countered’ and ‘Countered-not-Recovered’ breaches respectively:

$$\frac{\text{Undetected} + \text{Detected but not Countered}}{\text{UnPrevented} + \text{Lag Detected not Countered} + \text{Lag Countered not Recovered} - \text{Recovered}} \times \text{Availability \%}\,(\text{Undetected + Not Countered}) \tag{6.1}$$

$$\frac{\text{Not Recovered}}{\text{UnPrevented} + \text{Lag Detected not Countered} + \text{Lag Countered not Recovered} - \text{Recovered}} \times \text{Availability \%}\,(\text{Countered not Recovered}) \tag{6.2}$$
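The weighting in (6.1) and (6.2) can be sketched numerically as follows. The sketch assumes that the shared denominator equals the total number of breaches still open in the round (those not yet countered plus those not yet recovered) and that the two terms are summed to give the weighted average; both are readings of the equations rather than statements taken from the model's code. The function name and the zero-breach convention are hypothetical.

```python
def weighted_availability(n_open, n_not_recovered, avail_open, avail_not_recovered):
    """Weight each breach set's Availability % by its share of all open breaches.

    n_open: undetected + detected-but-not-countered breaches (numerator of 6.1)
    n_not_recovered: countered-but-not-recovered breaches (numerator of 6.2)
    """
    total = n_open + n_not_recovered  # assumed equal to the shared denominator
    if total == 0:
        return 1.0                    # assumption: no open breaches means 100% availability
    term_open = (n_open / total) * avail_open                                  # (6.1)
    term_not_recovered = (n_not_recovered / total) * avail_not_recovered       # (6.2)
    return term_open + term_not_recovered

# Illustrative inputs: 3 breaches not yet countered at 99.53% availability,
# 1 breach countered but not recovered at 99.76% availability.
example = weighted_availability(3, 1, 0.9953, 0.9976)
```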

Availability % by # of Breaches (columns, 0 to 16) for each Trial # (rows):

Trial # 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 100.00% 99.75% 99.57% 99.53% 99.04% 98.95% 98.52% 98.43% 98.28% 98.07% 97.63% 96.76% 96.52% 96.00% 95.83% 94.81% 94.23%
2 100.00% 99.50% 99.18% 98.94% 97.93% 97.25% 96.94% 96.61% 96.50% 96.28% 96.28% 96.02% 95.56% 95.51% 95.22% 94.77% 94.71%
3 100.00% 99.98% 99.96% 99.48% 99.38% 98.81% 98.59% 98.43% 98.38% 97.95% 97.83% 97.65% 97.45% 97.37% 96.95% 96.43% 96.19%
4 100.00% 99.40% 99.15% 99.12% 99.09% 98.97% 98.75% 98.61% 98.20% 96.09% 96.04% 95.65% 95.52% 95.06% 95.00% 94.81% 94.77%
5 100.00% 99.79% 99.68% 99.61% 99.31% 99.07% 99.05% 98.96% 98.91% 98.80% 98.75% 98.20% 97.80% 97.68% 97.21% 97.15% 97.11%
6 100.00% 99.80% 99.42% 98.24% 98.13% 97.84% 97.66% 97.63% 97.52% 97.27% 97.09% 96.96% 96.13% 96.07% 95.65% 95.56% 94.85%
7 100.00% 99.97% 99.83% 99.82% 99.49% 99.11% 99.01% 98.67% 98.58% 98.52% 98.20% 97.54% 97.50% 97.42% 97.25% 97.00% 96.97%
8 100.00% 100.00% 98.54% 98.50% 98.33% 97.64% 97.27% 96.37% 96.04% 95.43% 95.30% 95.22% 94.79% 94.78% 93.96% 93.94% 93.92%
9 100.00% 99.99% 99.50% 99.16% 99.00% 98.28% 97.94% 97.73% 97.51% 97.31% 96.83% 96.34% 96.31% 95.96% 95.52% 95.16% 95.11%
10 100.00% 98.47% 97.85% 97.77% 97.63% 97.47% 97.05% 96.03% 95.71% 95.66% 95.42% 95.26% 94.99% 94.79% 94.68% 94.48% 93.96%

Availability % by # of Breaches (columns, 0 to 16) for each Trial # (rows):

Trial # 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 100.00% 99.76% 99.76% 99.61% 99.55% 99.53% 99.45% 99.29% 99.01% 98.90% 98.79% 98.79% 98.56% 98.27% 98.11% 97.92% 97.88%
2 100.00% 99.93% 99.67% 99.48% 99.19% 99.07% 99.03% 98.67% 98.63% 98.60% 98.51% 98.43% 97.93% 97.92% 97.16% 97.15% 97.01%
3 100.00% 99.88% 99.67% 99.40% 98.62% 98.53% 98.48% 98.45% 98.39% 98.32% 98.16% 98.12% 98.09% 97.76% 97.71% 97.63% 97.25%
4 100.00% 99.78% 99.59% 99.39% 99.37% 99.36% 98.91% 98.88% 98.88% 98.84% 98.84% 98.73% 98.65% 98.46% 98.40% 98.40% 98.16%
5 100.00% 99.84% 99.83% 99.79% 99.43% 99.19% 98.95% 98.71% 98.63% 98.24% 98.15% 97.95% 97.92% 97.78% 97.60% 97.59% 97.47%
6 100.00% 99.99% 99.90% 99.29% 99.09% 99.01% 98.91% 98.13% 97.88% 97.81% 97.70% 97.51% 97.42% 97.27% 97.01% 96.95% 96.85%
7 100.00% 99.98% 99.79% 99.66% 99.60% 99.30% 99.18% 99.17% 99.11% 98.84% 98.56% 98.50% 98.19% 98.07% 97.99% 97.73% 97.64%
8 100.00% 99.96% 99.93% 99.89% 99.59% 99.52% 99.40% 99.35% 99.34% 99.23% 99.21% 98.96% 98.81% 98.76% 98.76% 98.75% 98.52%
9 100.00% 99.79% 99.78% 99.75% 99.67% 99.64% 99.63% 99.52% 98.92% 98.88% 98.87% 98.85% 98.81% 98.72% 98.18% 98.16% 98.05%
10 100.00% 99.96% 99.86% 99.82% 99.67% 99.39% 99.31% 98.80% 98.75% 98.74% 98.64% 98.52% 98.48% 98.11% 97.96% 97.65% 97.61%

Here each set of breaches which can potentially impact the system contributes to a decline in the system Availability percentage. Each individual breach, and therefore each simulation round, contributes a stochastically different impact rating, even in rounds with the same breach activity level. While the cumulative impact of successful non-recovered breaches is continuously increasing, the choice of exponential function is deliberate to reflect the idea that any particular breach has an exponentially declining impact on Availability.
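One way to picture such a stochastic lookup is sketched below. The exact functional form used in the model is not given in this section, so the sketch makes a loud assumption: each additional breach removes a decrement drawn from an exponential distribution, and the mean decrement for breaches in Recovery is half that of breaches not yet countered. The mean values (0.35% and 0.175%), the function name and the seed are hypothetical.

```python
import random

def availability_lookup(max_breaches, mean_decrement, rng):
    """Return availability[n] for n = 0..max_breaches open breaches.

    Assumption: every extra breach subtracts an exponentially distributed
    decrement, so each simulated lookup row differs from the next.
    """
    availability = [1.0]
    for _ in range(max_breaches):
        decrement = rng.expovariate(1.0 / mean_decrement)
        availability.append(max(0.0, availability[-1] - decrement))
    return availability

rng = random.Random(11)  # fixed seed so the sketch is repeatable
open_lookup = availability_lookup(16, 0.0035, rng)        # undetected / not countered
recovery_lookup = availability_lookup(16, 0.00175, rng)   # countered but not recovered
```

Scaling `mean_decrement` up or down changes the severity of individual breach impacts without changing the stochastic shape of the resulting Availability series.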

The impact percentage itself is arbitrary and can be scaled through the exponential function to suit the assumed severity of individual breach impacts. What is primarily important for my purposes is not the level of the resulting decrement but the stochastic behaviour and distribution of the Availability measure (and corresponding loss distribution) as breaches increase, since we are primarily interested in the ‘tail loss’ of the distribution. Here we see that for a Medium Controls system simulated over 30 individual years, the profile of the cumulative number of breaches experienced per year over the first 255 days ranges substantially from 100% availability (zero breaches, no impact) to a decrement of several percentage points below 100% depending on the number of cumulative breaches⁵⁸:

Figure 91 – Monte Carlo Simulation of Cumulative Breaches over 255 days

58 255 days represents the limit at which all 365 day Medium System simulations operated above 95% Availability. For a Medium Controls system, some simulations result in Availability ratings below 95% beyond 19 cumulative breaches, reflecting that, because of the one period lag effect introduced to simulate non-detected/countered/recovered breaches, at lower control levels a system can become ‘unstable’ where availability continuously declines until reaching unacceptable levels of performance. Comparatively, for High and Very High Control systems, the number of cumulative breaches generally remains under 10 per day (i.e. most breaches are

Using the above model, we are able to simulate daily system Availability for any combination of Preventive, Blocking, Detection, Counter and Recovery controls over a single year (365 days), or for multiple years. The following figure shows an example time series of the Availability of the system and the attributed business Loss per Day based on all controls set to “High”:

Figure 92 - High Controls Time Series (95% Control Effectiveness, 365 Days)

With controls set to ‘High’, system Availability generally stays above 99.8% and is mostly above 99.9%, a level of reliability that might be routinely expected in most enterprise systems. The resulting business losses per day attributed to declines in system Availability can be conveniently displayed as a histogram of the probability distribution of losses, which supports further analysis and the economic comparison of control scenarios that involve the moments of the loss distribution, including estimates of the average, the 95% VaR and the CVaR:
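The average, 95% VaR and CVaR of a simulated loss series can be estimated empirically as sketched below. Several conventions exist for empirical quantiles; this sketch uses a simple order-statistic rule and is not necessarily the estimator used in the thesis. The function name and the loss values are hypothetical.

```python
def var_cvar(losses, percent=95):
    """Return (VaR, CVaR) at the given percentile using a simple order-statistic rule."""
    ordered = sorted(losses)
    index = min(len(ordered) - 1, (len(ordered) * percent) // 100)
    var = ordered[index]               # loss at the chosen percentile
    tail = ordered[index:]             # losses at or beyond the VaR
    return var, sum(tail) / len(tail)  # CVaR is the mean of the tail

# Hypothetical daily losses: 1, 2, ..., 100 currency units.
daily_losses = [float(i) for i in range(1, 101)]
mean_loss = sum(daily_losses) / len(daily_losses)
var95, cvar95 = var_cvar(daily_losses)
```

Comparing these three statistics across control scenarios (e.g. Medium versus High controls) gives a compact view of how the tail of the loss distribution shifts with control spending.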

In document Mixed structural models for decision making under uncertainty using stochastic system simulation and experimental economic methods: application to information security control choice (Page 188-200)