4.2 The proposed overload control solution
4.2.2 Host-level design
The architecture of the overload control solution for this level is shown in
Host-level Detection Agent
The Host-level Detection Agent is a multi-threaded application that the VNF provider can deploy in a dedicated VM, in the same cloud infrastructure that runs the VNFs (Figure 4.2a). An alternative approach, which is viable for the provider of the NFVI, is to run the Host-level Detection Agent on the hypervisor as a privileged process (Figure 4.2b). In both cases, the agent is used to detect physical resource contention; in the latter approach, the agent also replaces the VNF-level Detection Agent, in order to protect a VNF from excess traffic. The Host-level Detection Agent monitors one or more VNFs in the NFV network. More than one Host-level Detection Agent can be deployed on the same cloud infrastructure, with each agent monitoring a subset of the VNFs in the NFV network.
The Host-level Detection Agent receives data on virtual CPU utilization from one of two sources: from the VNF-level Detection Agents (if it is deployed by the VNF provider), through a shared ring buffer or another inter-VM communication channel; or from the hypervisor (if it is deployed by the NFVI provider), using APIs provided by the hypervisor.
The Host-level Detection Agent can detect excess traffic towards a VNF by using the same algorithm as the VNF-level Detection Agent (Alg. 2, and eq. (4.1) and (4.2) in section 4.2.1). It periodically samples the virtual CPU usage of the VM, and obtains the network throughput from the Host-level Mitigation Agent; then, it tunes the traffic drop ratio of individual VNFs. In addition, the Host-level Detection Agent can identify overload conditions that are due to physical resource contention. These conditions may occur when the NFVI experiences a fault (such as a broken CPU that must be turned off), which reduces the resources available to the VNFs and causes them to compete for the remaining resources (which may be insufficient to sustain the current workload). Moreover, physical resource contention can
In the case of physical resource contention, dropping traffic may not suffice: the physical resources freed by a VNF could be consumed by neighbour VMs, causing a vicious circle and worsening the performance of the VNF. In this scenario, the most appropriate course of action is to detect that the overload is caused by physical resource contention, and to mitigate the contention by disabling some of the VNFs and reserving resources for the most critical ones. According to the ETSI NFV resiliency requirements [21, sec. 7.3], NFV is expected to support multiple levels of service availability and, under overload conditions, it should be able to downgrade low-priority services and to preempt resources from them (e.g., a video call service should be downgraded or preempted in favor of voice calls).
Under physical CPU contention, a virtual CPU reaches full utilization (i.e., there are no idle CPU ticks) even if the workload is below the virtual CPU quota. For example, if two VMs each have a 1 GHz CPU quota, but both run on an oversubscribed physical CPU (e.g., a 1 GHz physical CPU, with a 2:1 vCPU-to-pCPU ratio), then the hypervisor may be unable to honour the quota, and each VM will actually get only up to 0.5 GHz of CPU cycles. However, detecting physical CPU contention is problematic for a telecom provider that uses an NFVIaaS, since it has no visibility of the underlying physical host. Moreover, physical CPU contention cannot be detected within the VM using traditional CPU monitoring tools: both under CPU contention and under workload peaks, these tools would report 100% utilization of the virtual CPU, since they compute the ratio between busy and idle CPU ticks, and thus cannot discriminate between the two cases.
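The arithmetic of the example above can be checked with a minimal sketch. The proportional split of the physical CPU and the tick counts are assumptions made only for illustration; the function names are hypothetical and real hypervisor schedulers may divide cycles differently.

```python
def effective_shares_ghz(pcpu_ghz, quotas_ghz):
    """Share of one physical CPU that each always-busy vCPU receives.

    Assumption: the scheduler splits an oversubscribed pCPU in proportion
    to the quotas; real hypervisor schedulers may behave differently.
    """
    demand = sum(quotas_ghz)
    if demand <= pcpu_ghz:
        return list(quotas_ghz)          # no contention: quotas are honoured
    scale = pcpu_ghz / demand            # contention: every vCPU is scaled down
    return [q * scale for q in quotas_ghz]


def utilization(busy_ticks, idle_ticks):
    """Busy/(busy+idle) ratio, as reported by in-VM monitoring tools."""
    return busy_ticks / (busy_ticks + idle_ticks)


# Two VMs with a 1 GHz quota each on a single 1 GHz physical CPU.
shares = effective_shares_ghz(1.0, [1.0, 1.0])

# Illustrative tick counts over the same interval (100 ticks fit in the quota).
peak_busy, peak_idle = 100, 0            # workload peak, full quota delivered
contended_busy, contended_idle = 50, 0   # contention, half the quota delivered

same_reading = (utilization(peak_busy, peak_idle)
                == utilization(contended_busy, contended_idle))
```

Under these assumptions both VMs receive half of their quota, and the utilization ratio is identical in the two scenarios because neither has idle ticks; only the absolute busy-tick count differs.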
In order to discriminate between physical CPU contention and other overload conditions, and to perform special actions against contention, we consider the absolute number of busy CPU ticks (including ticks spent executing in both user space and kernel space) that are actually executed by the virtual CPU per unit of time. If there is physical CPU contention, the hypervisor CPU
scheduler gives the virtual CPU fewer physical CPU cycles than its CPU quota allows. Thus, we detect physical CPU contention by monitoring the number of busy CPU ticks actually executed by the virtual CPU, and comparing it to the maximum number allowed by its CPU quota:
\[
\text{pCPU contention} \iff \left(\text{idle ticks} \approx 0\right) \land \left(\text{busy ticks} \not\approx \text{maximum busy ticks}\right)
\]
where the maximum number of busy ticks is calibrated by running a CPU-intensive load on the virtual CPU in the absence of physical CPU contention. The count of busy ticks can be obtained from the VNF-level Detection Agent inside a VM (section 4.2.1), or from the virtualization infrastructure using hypervisor APIs.
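The contention condition can be written as a small predicate. This is only a sketch: the text does not state how "approximately zero" and "not approximately the maximum" are quantified, so the two tolerances below (`idle_eps`, `busy_tol`) and the function name are assumptions.

```python
def is_pcpu_contention(busy_ticks, idle_ticks, max_busy_ticks,
                       idle_eps=0.02, busy_tol=0.10):
    """Return True when a sample looks like physical CPU contention.

    busy_ticks / idle_ticks: ticks counted in the last sampling period.
    max_busy_ticks: calibrated maximum for the same period (CPU-intensive
                    load, no contention).
    idle_eps, busy_tol: assumed tolerances, expressed as fractions of
                        max_busy_ticks; they are not taken from the text.
    """
    no_idle = idle_ticks <= idle_eps * max_busy_ticks            # idle ticks ~ 0
    below_max = busy_ticks < (1.0 - busy_tol) * max_busy_ticks   # busy not ~ max
    return no_idle and below_max


# Three illustrative samples against a calibrated maximum of 100 ticks.
contended = is_pcpu_contention(busy_ticks=50, idle_ticks=0, max_busy_ticks=100)
workload_peak = is_pcpu_contention(busy_ticks=99, idle_ticks=0, max_busy_ticks=100)
lightly_loaded = is_pcpu_contention(busy_ticks=50, idle_ticks=50, max_busy_ticks=100)
```

Only the first sample satisfies both conditions: the workload peak has no idle ticks but reaches the calibrated maximum, and the lightly loaded vCPU still has idle ticks.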
The Host-level Detection Agent periodically samples the number of busy ticks since the previous sample, and estimates the physical CPU share allotted to the virtual CPU by computing the ratio between the busy ticks and the amount of “wall-clock” time that has elapsed. The wall-clock time can be collected by the VNF-level Detection Agent inside a VM (using a paravirtualized clock provided by the hypervisor [91, 92]) or from the virtualization infrastructure.
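As an illustration of the in-VM measurement, the following sketch derives the busy-tick count from the aggregate `cpu` line of a Linux `/proc/stat` snapshot and divides the increment by the elapsed wall-clock time. Using `/proc/stat`, counting user, nice, system, irq and softirq time as "busy", and the default of 100 ticks per second are assumptions; the source does not specify how the agent reads its counters.

```python
def busy_ticks_from_stat(stat_text):
    """Sum the busy tick counters of the aggregate 'cpu' line of /proc/stat.

    Assumption: busy = user + nice + system + irq + softirq; idle, iowait
    and steal time are not counted as busy.
    """
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            user, nice, system, idle, iowait, irq, softirq = map(int, fields[1:8])
            return user + nice + system + irq + softirq
    raise ValueError("no aggregate 'cpu' line found")


def cpu_share(prev_busy, curr_busy, elapsed_seconds,
              ticks_per_second=100, vcpus=1):
    """Fraction of the wall-clock interval in which the vCPUs executed."""
    return (curr_busy - prev_busy) / (ticks_per_second * elapsed_seconds * vcpus)


# Two hypothetical snapshots taken one second of wall-clock time apart.
prev_snapshot = "cpu  1000 0 500 8000 0 0 0 0 0 0\n"
curr_snapshot = "cpu  1030 0 520 8000 0 0 0 50 0 0\n"

share = cpu_share(busy_ticks_from_stat(prev_snapshot),
                  busy_ticks_from_stat(curr_snapshot),
                  elapsed_seconds=1.0)
```

In a running agent the snapshots would come from reading `/proc/stat` periodically, with the elapsed time taken from a clock such as `time.monotonic()`. In this made-up sample the vCPU executed for half of the interval without accumulating any idle ticks, which is the pattern the contention condition looks for.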
The Host-level Detection Agent notifies the Host-level Mitigation Agent when physical CPU contention arises or disappears.
Finally, the Host-level Detection Agent aggregates the information about the overload state of VNFs that it monitors (either caused by excess traffic, or by physical CPU contention), and sends periodic update messages to the Network-level Detection Agent, as discussed later in this section.
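The periodic update could be assembled along the following lines. The message layout, the field names and the JSON encoding are purely hypothetical; the text does not define the format exchanged with the Network-level Detection Agent.

```python
import json
from enum import Enum


class OverloadCause(str, Enum):
    """Overload states distinguished by the Host-level Detection Agent."""
    NONE = "none"
    EXCESS_TRAFFIC = "excess_traffic"
    PCPU_CONTENTION = "pcpu_contention"


def build_update(agent_id, vnf_states, timestamp):
    """Aggregate per-VNF states into one update message (hypothetical format)."""
    overloaded = {vnf: cause.value
                  for vnf, cause in vnf_states.items()
                  if cause is not OverloadCause.NONE}
    return json.dumps({
        "agent": agent_id,
        "timestamp": timestamp,
        "monitored": len(vnf_states),
        "overloaded": overloaded,
    }, sort_keys=True)


# Example with three monitored VNFs (identifiers are made up).
message = build_update(
    "host-agent-1",
    {"vnf-a": OverloadCause.NONE,
     "vnf-b": OverloadCause.EXCESS_TRAFFIC,
     "vnf-c": OverloadCause.PCPU_CONTENTION},
    timestamp=1700000000,
)
```

Reporting only the overloaded VNFs together with the cause keeps the update small while still letting the receiver tell excess traffic apart from physical CPU contention.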
Host-level Mitigation Agent
The Host-level Mitigation Agent is an application that executes in the same environment as the Host-level Detection Agent. It interacts with the Virtualization Infrastructure Manager (VIM) in order to alleviate the contention on physical CPUs, by preempting resources from the less important ("non-critical") VMs. The relative importance of VMs is configured according to the performance and availability requirements of NFV services (e.g., the ETSI NFV resiliency requirements provide examples of service availability levels, where emergency telecommunications have priority over video streaming and other internet traffic [21, sec. 7]).
The Host-level Mitigation Agent periodically checks for physical CPU contention: if contention is present, it selects the VMs with the lowest criticality, and decreases their scheduling priority in order to free CPU time for the highest-criticality VMs. If a VM is already at the lowest scheduling priority, it is suspended. These steps are repeated for as long as the physical CPU contention persists, and are reverted when CPU resources become available. This approach can be easily deployed on existing virtualization technologies, such as the KVM hypervisor and OpenStack, using their APIs to change the execution state of VMs.
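The mitigation steps described above can be sketched as follows. The `VM` record, the integer priority scale and the rule that the most critical VMs are never penalized are assumptions made for illustration; no hypervisor or VIM API is called, so a real agent would replace the attribute updates with the corresponding management calls.

```python
from dataclasses import dataclass

LOWEST_PRIORITY = 0   # assumed bottom of the scheduling priority scale


@dataclass
class VM:
    name: str
    criticality: int        # higher value = more critical service
    priority: int           # current scheduling priority
    default_priority: int   # priority to restore after contention ends
    suspended: bool = False


def mitigate_step(vms):
    """Apply one round of mitigation to the least critical running VMs."""
    running = [vm for vm in vms if not vm.suspended]
    if not running:
        return
    lowest = min(vm.criticality for vm in running)
    if lowest == max(vm.criticality for vm in vms):
        return                          # only the most critical VMs are left
    for vm in running:
        if vm.criticality == lowest:
            if vm.priority > LOWEST_PRIORITY:
                vm.priority -= 1        # first, lower the scheduling priority
            else:
                vm.suspended = True     # already at the lowest: suspend


def revert(vms):
    """Undo the mitigation once CPU resources are available again."""
    for vm in vms:
        vm.suspended = False
        vm.priority = vm.default_priority


# Contention persists for three checks, then disappears.
voice = VM("voice", criticality=2, priority=2, default_priority=2)
video = VM("video", criticality=1, priority=1, default_priority=1)
vms = [voice, video]
history = []
for _ in range(3):
    mitigate_step(vms)
    history.append((video.priority, video.suspended))
revert(vms)
```

In this run the video VM is first deprioritized and then suspended, while the voice VM is left untouched; once contention ends, both VMs return to their default settings.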
Optionally, in the case of NFVI providers, such as in NFVIaaS (Figure 4.2b), the Host-level Mitigation Agent can be used to drop excess traffic towards individual VNFs, in a similar way to the VNF-level Mitigation Agent (section 4.2.1). This objective is achieved by configuring the network traffic forwarding mechanisms of the virtualization infrastructure to establish a network tunnel. When the Host-level Detection Agent detects an overload condition, it can trigger the Host-level Mitigation Agent to drop the excess traffic. The amount and the type of traffic to drop are configured by the Host-level Detection Agent as described in section 4.2.1: the Host-level Detection Agent updates the traffic drop ratio according to a rule that uses resource utilization metrics, and it applies traffic-matching rules to identify which traffic should be dropped.
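The interplay between the drop ratio and the traffic-matching rules can be illustrated with a short sketch. Random per-packet dropping, the dictionary packet representation and the `is_video` rule are assumptions; the rule that updates the drop ratio (section 4.2.1) is not reproduced here, so the ratio is simply passed in as a value.

```python
import random


def should_drop(packet, drop_ratio, matches, rng):
    """Drop a packet with probability drop_ratio, but only if it matches."""
    return matches(packet) and rng.random() < drop_ratio


def is_video(packet):
    """Hypothetical traffic-matching rule: only video traffic may be dropped."""
    return packet["service"] == "video"


rng = random.Random(42)   # seeded only to make the sketch repeatable
packets = [{"id": i, "service": "video" if i % 2 else "voice"}
           for i in range(1000)]

# Apply a drop ratio of 0.3 to the matching (video) traffic.
accepted = [p for p in packets if not should_drop(p, 0.3, is_video, rng)]
voice_accepted = sum(1 for p in accepted if p["service"] == "voice")
video_accepted = sum(1 for p in accepted if p["service"] == "video")
```

All 500 voice packets are accepted because they never match the rule, whereas only a portion of the video packets passes; a drop ratio of 0 would accept everything and a ratio of 1 would reject all matching traffic.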
Figure 4.5. Architecture of network-level detection and mitigation. [Figure labels: Network Detection Agent; Network Mitigation Agent (VM or PM); VNFs; incoming, accepted, and rejected network traffic; network tunnel; network status check; traffic drop rate; input from Host Detection Agents.]