ACI Connecting the DOTs Part One From Zero to Hero


ACI…Connecting the DOTs – Part One

From Zero to Hero

Cisco ACI basically consists of a set of Nexus 9000 switches, deployed in a two-level spine/leaf topology and operating in a special mode, named ACI mode, under the control of an Application Policy Infrastructure Controller (APIC) cluster.

Simply put, imagine ACI as a gigantic distributed multilayer modular switch, centrally manageable and fully programmable via an object-oriented REST API.

Figure 1

As a multilayer switch, ACI:

• performs Layer 2 and Layer 3 forwarding

• interacts with traditional campus and datacenter networks

• exchanges routes with the external world

• implements security and QoS

• allows for service insertion (such as firewalls and load balancers).

In short, it fulfills the expectations for a device of this class, and even exceeds them by performing all these tasks in a smarter, more scalable, more manageable and, substantially, more efficient way.

The secret is, ACI has superpowers, and great power means great responsibility.

This document and the following ones in the series “Cisco ACI… Connecting the Dots” deal with this power and the knowledge a network administrator needs to build in order to manage the resulting responsibility.


PKZipped things to initially know...

Modern networks are all about virtualization. As the mighty Dinesh Dutt reminds us, in his book “Cloud Native Data Center Networking” (highly recommended), we are actually moving from the obsolete inline virtualization model to the more modern and scalable overlay model. The first one, where every hop is aware of which virtual network the passing traffic belongs to, is commonly achieved through VLANs and VRFs.

The second model, where only the first and the last hop are aware of the virtualization, is common to various technologies but, in the case of ACI, it is achieved through VXLAN tunneling.

VXLAN is a data plane virtualization technology that relies on UDP-based tunnels and provides an L2-in-L3 encapsulation using a 24-bit tag, the VXLAN Network Identifier (VNI). This tag is part of the VXLAN header and is used to distinguish the transported L2 domains (RFC 7348).


A typical spine/leaf topology, such as the one shown in Figure-2, can be exploited efficiently by VXLAN by setting the source UDP port of the outer header to a hash of the inner packet’s headers. In this way, the leaf can easily load-balance the resulting tunneled flows, via Equal-Cost Multi-Path (ECMP), among the available outgoing interfaces.
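As a rough illustration of the two mechanisms just described, the following Python sketch packs the 8-byte VXLAN header defined in RFC 7348 and derives an outer UDP source port from the inner flow. The CRC32-based hash and the function names are assumptions made for the example; real switches compute this hash in hardware with their own functions.

```python
import struct
import zlib

# Dynamic/private port range recommended by RFC 7348 for the outer source port
PORT_LOW, PORT_HIGH = 49152, 65535


def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header: flags(8) + reserved(24) + VNI(24) + reserved(8)."""
    if not 0 <= vni < 2**24:
        raise ValueError("the VNI must fit in 24 bits")
    flags = 0x08  # 'I' bit set: the VNI field is valid
    return struct.pack("!II", flags << 24, vni << 8)


def outer_udp_source_port(src_ip: str, dst_ip: str, proto: int,
                          src_port: int, dst_port: int) -> int:
    """Hypothetical hash: map the inner flow's 5-tuple to an outer UDP source port."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return PORT_LOW + zlib.crc32(key) % (PORT_HIGH - PORT_LOW + 1)
```

Because every packet of the same inner flow produces the same outer source port, ECMP keeps a flow on one path while spreading different flows across the available uplinks.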

Even though VXLAN was originally designed only to extend single L2 domains, ACI is able to forward between L2 segments, aka VXLAN routing, by assigning a specific VNI to every VRF defined in the overlay.

Figure 3

Moreover, as shown in Figure-3, the traffic across the ACI fabric is tunneled via iVXLAN, an improvement over the traditional VXLAN encapsulation based on the document “draft-smith-vxlan-group-policy-05” (and its later revisions). Specifically, the figure represents iVXLAN tunnels for point-to-point overlay traffic, established between leaves.

When the encapsulated traffic is instead BUM (Broadcast, Unknown unicast or Multicast), the ACI’s iVXLAN tunneling model changes significantly, as shown in Figure-4.


The APIC assigns a unique GIPo (Global IP Outer) address to each Bridge Domain and VRF defined in the overlay (a Bridge Domain is, substantially, an L2 segment; further explanations will come soon). A GIPo is a /28 IPv4 multicast address, where the last 4 bits can be used to address 16 path identifiers named FTAGs, although ACI limits their use to a maximum index of 12. Therefore, referring back to Figure-4, in which only two spines appear, FTAGs 1, 3, 5, 7, 9, 11 are assigned to the red FTAG-Tree, rooted at SPINE-1, while FTAGs 2, 4, 6, 8, 10, 12 are assigned to the green FTAG-Tree, rooted at SPINE-2 (I have deliberately omitted FTAG 0, which will be illustrated in a future paper when talking about Multi-Pod).

Now, assuming, for example, that an overlay domain is associated with the GIPo address 225.1.1.224/28, ACI will use the range 225.1.1.225-236 as the corresponding FTAG range (12 addresses). As a consequence, when a BUM packet from this overlay domain is received by a leaf, its header is hashed to obtain the corresponding FTAG. The packet is then encapsulated with an iVXLAN header whose outer IP destination address is set to GIPo + FTAG, and sent to the spine that roots the corresponding FTAG tree.
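The GIPo + FTAG arithmetic can be sketched in a few lines of Python. This is only an illustration of the addressing scheme described above: the CRC32 hash, the function names and the odd/even root assignment (taken from the two-spine example of Figure-4) are assumptions, not the actual ACI implementation.

```python
import ipaddress
import zlib

FTAGS = range(1, 13)  # FTAG 1..12, as in the text (FTAG 0 is left out here)


def ftag_address(gipo: str, ftag: int) -> ipaddress.IPv4Address:
    """Return GIPo + FTAG, the outer multicast destination for one FTAG tree."""
    net = ipaddress.ip_network(gipo)
    if net.prefixlen != 28 or not net.is_multicast:
        raise ValueError("a GIPo is expected to be a /28 IPv4 multicast prefix")
    if ftag not in FTAGS:
        raise ValueError("the FTAG must be between 1 and 12")
    return net.network_address + ftag


def pick_ftag(flow_key: bytes) -> int:
    """Hypothetical hash of the BUM packet's headers onto FTAG 1..12."""
    return 1 + zlib.crc32(flow_key) % len(FTAGS)


def tree_root(ftag: int) -> str:
    """Two-spine example: odd FTAGs root at SPINE-1, even FTAGs at SPINE-2."""
    return "SPINE-1" if ftag % 2 else "SPINE-2"
```

With the GIPo of the example, `ftag_address("225.1.1.224/28", 1)` gives 225.1.1.225 and `ftag_address("225.1.1.224/28", 12)` gives 225.1.1.236, matching the range quoted in the text.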

The spine then forwards the encapsulated packet, point-to-multipoint, along the associated multicast distribution tree (the red or the green one). Since spine-to-spine connections are not permitted in ACI, certain leaves, chosen by the APIC, forward the packet back to the remaining spines to complete the distribution (this last step is necessary to support Multi-Pod, for example).

It is well known that VXLAN is a data plane virtualization technology based on a simple learn-and-forward method, and that this behavior places serious limitations on its scalability. To scale out the fabric, a robust control plane is needed.

The control plane is usually scaled out in two ways:

• controller-less approach - BGP-EVPN is a shining example of this first category

• controller-based approach - ACI is a worthy representative of this second category.


After a brief recap on the underlay, it is now time to introduce some of ACI’s important logical building blocks:

Figure 5

• Tenant: it is a logical container for application policies and networking objects. Tenants represent administrative boundaries; they can be totally isolated or they can share resources. Access to a tenant and its functionalities is governed by a programmable Role-Based Access Control (RBAC) model.

An ACI Fabric can contain multiple Tenants.

• VRF: it represents a routing table virtualization that allows for a simultaneous coexistence of multiple routing tables in a single router. In this way, multiple routing instances, with possibly overlapping addresses, can live in the same system and be assigned to different tenants.

A Tenant can contain multiple VRFs.

• Bridge Domain (BD): it represents a single, Fabric-wide, multilayer distributed switch. From the L2 perspective it is the L2 segment: it defines the MAC address space and the flood domain, and it can be terminated at L3 via an SVI that provides the default gateway. Since a BD is a flood domain, a broadcast from any of its subnets is received by all the others.

A VRF can be associated to multiple Bridge Domains.


virtual Endpoints. The Endpoints that are part of an EPG may be chosen in different ways; a common one is to use a VLAN as a selector on the leaf’s front-panel port. For example, a host in VLAN 10 connected to LEAF-1/PORT-5 may have its Endpoint associated with EPG-1, while another host, still in VLAN 10 but connected to LEAF-1/PORT-6, may have its Endpoint associated with EPG-2.

A Bridge Domain can contain multiple Endpoint Groups.

• Endpoint (EP): the most basic definition is that it is represented by a MAC address plus zero or more IP addresses. For example, a host connected to the LEAF’s front-panel port is usually represented by an EP consisting of a MAC address plus, if it has one, a single IP address.

An Endpoint Group can contain multiple Endpoints.
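The containment relationships listed above (Tenant, VRF, Bridge Domain, Endpoint Group, Endpoint) can be pictured as a simple tree. The Python sketch below is only a mental model of that hierarchy; the class and function names are assumptions for illustration and do not correspond to the APIC object model or its REST API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Endpoint:
    mac: str
    ips: list = field(default_factory=list)  # zero or more IP addresses


@dataclass
class EndpointGroup:
    name: str
    endpoints: list = field(default_factory=list)


@dataclass
class BridgeDomain:
    name: str
    epgs: list = field(default_factory=list)


@dataclass
class Vrf:
    name: str
    bridge_domains: list = field(default_factory=list)


@dataclass
class Tenant:
    name: str
    vrfs: list = field(default_factory=list)


def locate(tenant: Tenant, mac: str) -> Optional[tuple]:
    """Walk Tenant -> VRF -> BD -> EPG and return the names that contain the MAC."""
    for vrf in tenant.vrfs:
        for bd in vrf.bridge_domains:
            for epg in bd.epgs:
                if any(ep.mac == mac for ep in epg.endpoints):
                    return (tenant.name, vrf.name, bd.name, epg.name)
    return None
```

Each level holds a list of the level below it, which mirrors the "can contain multiple" statements made for every building block.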

The next chapter focuses on Bridge Domain, a concept that, even if not particularly new, often raises concerns because it breaks some of the traditional network paradigms.

Indeed, in a classification process, if an attribute of the object to be classified determines, inevitably, the class that the object itself belongs to, one can end up by classifying a plane as a bird only because it has wings.

In the context of traditional networking, to belong to a VLAN implies, inevitably, to belong to the L2 segment associated with that VLAN.

According to this, for example, two Tenants connected to the same switch cannot use the same VLAN without sharing the same L2 segment, no matter which port they are connected to.

The only way to solve this problem is to decouple a VLAN identity from its membership in an L2 segment.


It’s time to take the metro… (ethernet!)

The concept of Bridge Domain is part of Cisco’s EVC (Ethernet Virtual Connection) Framework, designed years ago to support Metro Ethernet Services.

As not so many people have Cisco ACI at their personal disposal, but probably a simulation environment instead, I have decided to implement a possible Bridge Domain configuration in a VIRL Lab, as shown in Figure-6.

The Lab consists of:

• 4 border routers, based on Cisco IOSv, each of them split into two VRFs, respectively owning the physical interfaces ge0/1 and ge0/2; each router simulates 2 virtual hosts, HRED and HBLUE

• 1 central router, based on Cisco IOS-XEv, connected to the 4 border routers via 8 distinct interfaces, one for every host’s VRF; this router, R1, is configured with 2 Bridge Domains where it performs L2 forwarding, and with 2 Bridge Domain Interfaces (BDI) that terminate the Bridge Domains at L3 and enable routing.

Figure 6


In the L3 domain, the traffic destined to 10.20.20.13, for example, generated by H1RED from 10.10.10.11 via the interface ge0/1 in vlan-101, is routed by R1 to the interface ge0/2 of H3BLUE, in vlan-122, via the interfaces BDI10 (ingress) and BDI20 (egress).

This represents an inter-BD routing.

The responses to the broadcast-destined ping sent by H1RED and H1BLUE demonstrate that the Bridge Domain is the flood domain.

On the other hand, the intra-BD traceroute,

and the inter-BD traceroute,

still confirm the expected behavior of the L2 and L3 forwarding process.

Host            Interface          Service Instance   VLAN   Bridge Domain
H1 (vrf red)    GigabitEthernet2   11                 101    10
H1 (vrf blue)   GigabitEthernet3   12                 102    20
H2 (vrf red)    GigabitEthernet4   11                 111    10
H2 (vrf blue)   GigabitEthernet5   12                 112    20
H3 (vrf red)    GigabitEthernet6   31                 121    10
H3 (vrf blue)   GigabitEthernet7   32                 122    20
H4 (vrf red)    GigabitEthernet8   41                 121    10
H4 (vrf blue)   GigabitEthernet9   42                 122    20

Table 1


H1RED

interface GigabitEthernet0/1
 description to R1
 mac-address 0200.0010.001a
 no ip address
 exit
interface GigabitEthernet0/1.1
 encapsulation dot1Q 101
 vrf forwarding red
 ip address 10.10.10.11 255.255.255.0
 exit

H3BLUE

interface GigabitEthernet0/2
 description to R1
 mac-address 0200.0020.003b
 no ip address
 exit
interface GigabitEthernet0/2.1
 encapsulation dot1Q 122
 vrf forwarding blue
 ip address 10.20.20.13 255.255.255.0
 exit

Table 2

Table-2 shows H1RED’s and H3BLUE’s configurations of interfaces connected to R1.

R1

interface GigabitEthernet2
 description to H1-RED
 no ip address
 service instance 11 ethernet
  encapsulation dot1q 101
  rewrite ingress tag pop 1 symmetric
  bridge-domain 10
  exit
 exit
interface GigabitEthernet7
 description to H3-BLUE
 no ip address
 service instance 32 ethernet
  encapsulation dot1q 122
  rewrite ingress tag pop 1 symmetric
  bridge-domain 20
  exit
 exit

Table 3

Table-3 shows R1’s configurations of interfaces respectively connected to H1RED and H3BLUE.

I will briefly explain the meaning of these configurations:

• service instance id ethernet: it represents an ethernet flow point (EFP), a logical interface that binds the physical port with the configured Bridge Domain; its id is locally significant only

• encapsulation dot1q x: it is a frame selector; if a frame entering the physical interface carries an 802.1Q VID equal to x, the frame is accepted and its source MAC address is learned on the corresponding EFP; if no match occurs on any of the service instances defined for the physical interface, the frame is dropped
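The selector behavior of the service instances can be modeled as a lookup keyed by physical interface and VLAN ID. The Python sketch below transcribes the rows of Table 1; the names `EFP_TABLE` and `classify` are illustrative assumptions, not IOS-XE internals.

```python
from typing import Optional

# (physical interface on R1, 802.1Q VID) -> (service instance id, bridge domain),
# transcribed from Table 1
EFP_TABLE = {
    ("GigabitEthernet2", 101): (11, 10),
    ("GigabitEthernet3", 102): (12, 20),
    ("GigabitEthernet4", 111): (11, 10),
    ("GigabitEthernet5", 112): (12, 20),
    ("GigabitEthernet6", 121): (31, 10),
    ("GigabitEthernet7", 122): (32, 20),
    ("GigabitEthernet8", 121): (41, 10),
    ("GigabitEthernet9", 122): (42, 20),
}


def classify(interface: str, vid: int) -> Optional[int]:
    """Return the bridge domain for a tagged frame, or None if the frame is dropped."""
    entry = EFP_TABLE.get((interface, vid))
    return entry[1] if entry else None
```

Frames tagged 101, 111 and 121 all land in Bridge Domain 10, while a frame tagged 101 on GigabitEthernet3 matches no service instance and is dropped: the VLAN ID only has meaning together with the port it arrives on.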


R1

interface BDI10
 mac-address 0200.0010.1010
 ip address 10.10.10.1 255.255.255.0
 ip address 172.16.10.1 255.255.255.0 secondary
 exit
interface BDI20
 mac-address 0200.0020.2020
 ip address 10.20.20.1 255.255.255.0
 ip address 172.16.20.1 255.255.255.0 secondary
 exit

Table 4

Table-4 shows R1’s BDI configurations, which L3-terminate Bridge Domain 10 and Bridge Domain 20 respectively.

Please note that, when more than a single subnet is defined in a Bridge Domain, the associated gateways are configured, on the BDI, as secondary IP addresses. By having a look inside the R1’s ARP Table it should be clear, as expected, that the IP addresses of the remote hosts H1RED…H4RED and H1BLUE…H4BLUE have been learned in the context of their respective BDIs.

By examining now the Bridge Domain 10,


the relation between Ethernet Flow Points (EFP), represented by their respective service instances, and the MAC addresses associated with them becomes equally clear.

For example, this excerpt

shows that the MAC address 0200.0010.004A is bound to the Ethernet Flow Point 41 (EFP41), represented by the service instance 41, on the interface GigabitEthernet8, the one facing H4RED. Clearly, it is the MAC address of H4RED.

This last observation concludes the Lab. As a final consideration, I can confirm that the use of Bridge Domains, instead of VLANs, effectively decouples the VLAN tag of the traffic received on a port from membership in a single L2 segment and its flooding domain.

In the EVC context, a VLAN can be used as a selector to associate a packet to an EFP and to a corresponding Bridge Domain, the L2 flooding domain.

In the ACI context, a VLAN can also be used as a selector, and this decoupling is crucial for the Fabric’s scalability and multitenancy.

But, having superpowers, ACI is able to use not only VLAN’s VIDs, but also VXLAN’s VNIDs and NVGRE’s VSIDs as selectors.

ACI will use them to associate packets to Endpoint Groups - the objects tightly connected with Bridge Domains - instead of simply associating packets to Bridge Domains, as I have shown in the EVC case.


At this (end)point, learn, learn, learn…

Having defined the basic element that communicates through the switch as the “Endpoint”, a unidirectional communication between two Endpoints, EP1 and EP2, connected respectively to the front-panel ports P1 and P2 of the leaf L, and part of the Bridge Domain B, can be simply described as a linear sequence of operations where a data packet D enters P1 and exits P2.

Figure 7

When the data packet D enters P1, its MAC address is data-plane learned by L in B, and that’s how a switch normally works, as I have shown in the previous Lab. I will also name P1 as “Fabric Ingress Point” in L,B for D.


If the unidirectional intra-leaf communication in Figure-7 becomes inter-leaf, i.e. it crosses the Fabric encapsulated into an iVXLAN tunnel, ACI uses its superpowers to data-plane learn, on the destination leaf, either D’s source MAC address or its source IP address, associating it with the iVXLAN tunnel towards the source VTEP, as shown in Figure-8.

Here are some further considerations about this learning process:

• the ports connecting the elements of the Fabric itself (the orange F ports in Figure-8) are not associated with the concept of Fabric Ingress Point; this concept only relates to the ports where the Endpoints are directly attached to the Fabric, i.e. the front-panel ports of the leaf switches, in Cisco terminology (the P ports in Figure-8)

• the result of the data-plane learning is stored, by the ACI’s leaves, in 2 different tables:

o Local Station Table (LST) that contains addresses of Endpoints connected to the local leaf

o Global Station Table (GST) that contains addresses of Endpoints connected to remote leaves

• assuming that the packet D contains both a MAC and an IP address, the Station Tables will be populated with:

o LST on LEAF-1: D’s MAC address and D’s IP address

o GST on LEAF-2: D’s MAC address, if the packet has been received via L2 forwarding, or D’s IP address, if the packet has been received via L3 forwarding; the receiving leaf determines whether D was L2 or L3 forwarded by looking at the iVXLAN VNI associated with it

• the ACI’s MAC learning process is scoped to the Bridge Domain, which means duplicate MAC addresses may exist in different Bridge Domains

• the ACI’s IP learning process is scoped to the VRF, which means duplicate IP addresses may exist in different VRFs.
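The considerations above can be condensed into a small Python model of a leaf with its two Station Tables. It is a sketch under stated assumptions: the `Leaf` class, its method names and the "bd"/"vrf" labels for the VNI type are invented for illustration, and real leaves implement this learning in the ASIC.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Leaf:
    name: str
    lst: dict = field(default_factory=dict)  # Local Station Table
    gst: dict = field(default_factory=dict)  # Global Station Table

    def learn_local(self, bd: str, vrf: str, mac: str,
                    ip: Optional[str], port: str) -> None:
        """Front-panel port: learn the MAC (scoped to the BD) and the IP (scoped to the VRF)."""
        self.lst[("mac", bd, mac)] = port
        if ip is not None:
            self.lst[("ip", vrf, ip)] = port

    def learn_remote(self, vni_scope: str, scope_name: str, mac: str,
                     ip: Optional[str], source_vtep: str) -> None:
        """Tunnel exit: a BD VNI means L2 forwarding (learn the MAC),
        a VRF VNI means L3 forwarding (learn the IP)."""
        if vni_scope == "bd":
            self.gst[("mac", scope_name, mac)] = source_vtep
        elif vni_scope == "vrf":
            if ip is not None:
                self.gst[("ip", scope_name, ip)] = source_vtep
        else:
            raise ValueError("vni_scope must be 'bd' or 'vrf'")
```

Because the keys include the Bridge Domain or the VRF, the same MAC address can be learned in two Bridge Domains, and the same IP address in two VRFs, without the entries overwriting each other.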

At this point, a question naturally arises: what is the purpose of learning IP addresses in addition to MAC addresses?


The control plane of an ACI Fabric contains hundreds of thousands of MAC addresses and single IP addresses (/32 IPv4 or /128 IPv6), bound together with the respective source ports (and other specific metadata) in a single gigantic table named Endpoint Table (EP Table).

No more single FIB and RIB glued together by an ARP table, this one built via a slow, control-plane driven and CPU-impacting ARP resolution process, but a new single and efficient Endpoint Table built via an ASIC-accelerated data-plane learning mechanism.

By the way, the possibility to have tens or even hundreds of thousands of MACs and single IP entries in a Table should not create any concerns because these entries, given their totally deterministic nature, are stored in hardware CAM tables, instead of the TCAM ones, making it much easier for the switch’s silicon to manage such a large number of items.

Another advantage of the EP Table concerns the scalability of the ACI’s Fabric security model. A specific document on the ACI’s security model is in my near-future plans; I have touched upon the subject here just for completeness.

Before the conclusion of this chapter, I would like to clarify some typical misconceptions about the ACI’s control-plane that, in my experience, are quite common:

• ACI uses MP-BGP as control-plane protocol to address the Endpoints inside a single Fabric

o That’s not true! ACI uses MP-BGP to address the external routes, the routes that connect the Fabric Endpoints to the networks present outside the Fabric

• ACI uses BGP-EVPN as control-plane protocol to address the Endpoints inside a single Fabric

o That’s not true! BGP-EVPN is used as a spine-to-spine protocol to transport control-plane information in the case of ACI multi-Pod and multi-Site designs

• ACI uses a proprietary protocol named COOP as control-plane protocol to address the Endpoints inside a single Fabric

o That’s true, and I will talk about this shortly.

In conclusion, ACI has shown its first superpower: data-plane learning the IP addresses together with the corresponding MAC addresses, and putting them all into a single Endpoint Table. And that is all great, but where does the big responsibility related to this power lie?


One law to route them all

Consider an Endpoint EPA in subnet SA with IP address IPA and MAC address MACA communicating with an Endpoint EPB in subnet SB with IP address IPB and MAC address MACB via a routed network (see Figure-9).

The packet sent from EPA to EPB has L3 source IPA and L3 destination IPB, L2 destination MACR1A and L2 source MACA, where MACR1A is the MAC address of the interface in SA of R1, the default gateway of EPA.

R1 performs routing because it receives a packet destined to its own MAC address, and this leads to two possibilities:

• First, the subnet SB, the one EPB is part of, is directly connected to R1. In this case, R1 is the last-hop router and therefore it performs a direct delivery of a packet with L3 source IPA and L3 destination IPB, L2 destination MACB and L2 source MACR1B, where MACR1B is the MAC address of the R1 interface in SB.

Figure 9

• Second, the subnet SB, the one EPB is part of, is not directly connected to R1. In this case, R1 is an intermediate-hop router and therefore it delivers to R2, the next hop router to SB, chosen by consulting its routing table, a packet with L3 source IPA and L3 destination IPB, L2 destination MACR2T and L2 source MACR1T, where MACR1T and MACR2T are the MAC addresses of the R1 and R2 interfaces in ST, the transit subnet between R1 and R2.

Figure 10

R2 will then continue routing, facing the same possibilities seen above for R1, and this process continues until EPB is reached.

In both cases, however, IPs stay end-to-end, while MACs change hop-by-hop.
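The hop-by-hop rewrite can be expressed as a tiny Python function. This is a didactic sketch, not router code: the `Packet` class, the `route` helper and the symbolic addresses (MAC_A, MAC_R1A and so on, mirroring the names used in the text) are assumptions for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Packet:
    src_mac: str
    dst_mac: str
    src_ip: str
    dst_ip: str


def route(packet: Packet, ingress_mac: str, egress_mac: str,
          next_mac: str) -> Optional[Packet]:
    """Route only packets L2-destined to the router; rewrite the MACs, keep the IPs."""
    if packet.dst_mac != ingress_mac:
        return None  # not addressed to the router at L2: no routing takes place
    return replace(packet, src_mac=egress_mac, dst_mac=next_mac)


# EPA sends to its default gateway R1
sent = Packet("MAC_A", "MAC_R1A", "IP_A", "IP_B")

# Last-hop case (Figure 9): R1 delivers directly to EPB
direct = route(sent, "MAC_R1A", "MAC_R1B", "MAC_B")

# Intermediate-hop case (Figure 10): R1 hands the packet to R2 over the transit subnet
transit = route(sent, "MAC_R1A", "MAC_R1T", "MAC_R2T")
```

In both results the `src_ip` and `dst_ip` fields are untouched, while `src_mac` and `dst_mac` reflect the segment the packet is about to cross.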


Two important points to be kept in mind emerge from the process illustrated above:

• A router can perform routing when it receives a packet L2-destined to its own MAC address

• When a router is the last-hop router, it performs the direct delivery of the routed packet putting the destination’s host MAC address in the L2 packet’s header.

The question now is how the last-hop router can be aware of the destination’s host MAC address. The answer is ‘by using the ARP protocol’.

If the last-hop router cannot find the MAC address of the destination host in its ARP table, it sends a broadcast ARP-Request message via the interface connected to the L2 segment associated to it. The corresponding unicast ARP-Reply message will return the requested MAC address, enabling the final delivery of the data packet.

This process works perfectly, but here are some considerations in this regard:

• ARP is a control-plane protocol, which means that it is not managed at wire-speed in hardware by the ASICs, but it has to be managed at low speed, in software, by the general purpose CPU that supervises other control-plane protocols, like BGP and OSPF

• An ARP-Request is, mainly, a broadcast message, and modern networks try to avoid broadcasting as much as possible, for scalability reasons

• A great part of the broadcast traditionally present in today’s networks is ARP-related.

Furthermore, an ARP-Request is not only a way to ask for the MAC address associated to a destination IP address, but it is also a way that hosts use to notify, via broadcast, the entire L2 segment about their own IP address. Gratuitous ARP (GARP) is the trick used in this case, and it can consist of:

• Gratuitous ARP-Request, used to detect a duplicate IP address

• Gratuitous ARP-Reply, used to update the other hosts’ ARP table.

Almost any host in the network, excluding a special category called “silent hosts” (like some security devices, such as firewalls), initiates ARP/GARP messages, even for reasons other than signaling its own presence in the network (a hypervisor after a vMotion, or a Cluster after moving an IP to a different NIC, just to name a few).


Make routing great again!

Considering an IP communication from Endpoint EP1 to Endpoint EP2, the possible senders of an ARP-Request are:

• EP1, if EP1 and EP2 share the same subnet

• A last-hop router R, if EP1 and EP2 don’t share the same subnet.

However, regardless of who sends the ARP-Request, I want to refocus on the conclusion of the previous chapter: the idea of removing the broadcast behavior from ARP. But how?

Perhaps by transmitting ARP-Requests via unicast, by ‘routing’ ARP! Wait, wait, wait… ARP is an L2 protocol, how is it possible to ‘route’ L2? By leveraging L3, how else?

Figure-11 shows that the structure of an ARP-Request message contains a Target IP as well. After all, the purpose of an ARP-Request is to obtain the corresponding Target MAC.

Figure 11

By leveraging its first superpower - learning via data-plane the Endpoints’ IP addresses together with their MAC ones - if the IP address in the Target IP field of an ARP-Request has been previously data-plane learned, ACI can route the ARP-Request to it without using any broadcast.

This means two things:

• if the target’s Endpoint port is on the same leaf as the sender’s, the packet will be locally delivered to it

• if the target’s Endpoint port is on some other leaf, the ARP-Request will be encapsulated into the iVXLAN point-to-point tunnel connecting the requester’s VTEP to the target’s VTEP, where the packet will be remotely delivered to it.
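The two delivery cases above can be sketched as a lookup on the Target IP. The Python function below is a simplified model: the table layout, keyed by VRF and IP, and the returned labels are assumptions for illustration, and the case of a Target IP that has not been learned yet is simply reported as "unknown".

```python
def forward_arp_request(target_ip: str, vrf: str,
                        local_table: dict, remote_table: dict) -> tuple:
    """Decide how a leaf delivers an ARP-Request, based on its Target IP."""
    port = local_table.get((vrf, target_ip))
    if port is not None:
        return ("local", port)      # target attached to the same leaf
    vtep = remote_table.get((vrf, target_ip))
    if vtep is not None:
        return ("tunnel", vtep)     # unicast iVXLAN tunnel to the target's VTEP
    return ("unknown", None)        # Target IP not learned yet


# Hypothetical tables of the leaf the requester is attached to
local_table = {("VRF1", "10.10.10.12"): "P2"}
remote_table = {("VRF1", "10.10.10.13"): "VTEP-LEAF-2"}
```

For example, `forward_arp_request("10.10.10.13", "VRF1", local_table, remote_table)` selects the tunnel towards VTEP-LEAF-2, and no broadcast is generated in either of the first two cases.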


The silent hosts, for example, are typically Endpoints that are not initially learned, because they don’t emit any ARP/GARP at startup or later. In those situations, how is it possible to discover their IP addresses without flooding, with broadcast ARP-Requests, the whole Fabric?

ACI has another superpower to address this problem, ARP-Gleaning!

To understand how ARP gleaning works, it is necessary to put in order the information about Endpoints learning provided up to now:

• the Endpoints’ MACs and IPs are locally data-plane learned at Fabric’s Learning Points, and put into a leaf’s table named Local Station Table (LST)

• the Endpoints’ MACs or IPs (MACs for L2 forwarding, IPs for L3 forwarding) are remotely data-plane learned when packets exit from the iVXLAN tunnels terminating at their destination VTEPs, and put into a leaf’s table named Global Station Table (GST)

• ARP-Requests are ‘routed’ to destinations, represented by their ARP Target IP, if ACI has previously learned the address.

Figure 12

Having said that, it is equally necessary to add the information that has been missing up to now, to complete the puzzle:


• The leaves inform the spines of their locally discovered Endpoints via a proprietary protocol named Council Of Oracle Protocol (COOP), where the Citizens, the leaves, must update the Oracles, the spines, about their Endpoints’ discoveries as soon as they happen

• When a leaf needs to forward an L2 or L3 packet to an unknown destination, the leaf proxies (sends iVXLAN-encapsulated) that packet to the spine, hoping that the spine itself can deliver it to the destination.

As a consequence of the last point on the previous list, if the packet proxied to a spine is an ARP-Request with broadcast destination, and if the spine does not know the ARP-Request’s Target-IP, it will perform the ARP-Gleaning process, as shown in Figure-13.

Figure 13

The ARP Gleaning process is, substantially, a special type of ARP-Request (with an EtherType of 0xfff2) that a spine sends to every leaf where the Bridge Domain’s VRF of the original ARP-Requester is deployed, to try to discover a silent host. Each of those leaves then generates a regular ARP-Request for the Target IP, sourced from the IP address of its SVI. In the example, every leaf where the Bridge Domain is deployed contains an SVI that terminates it at L3. This SVI provides the pervasive gateway, a per-subnet, Fabric-wide anycast gateway present on multiple leaves.
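The interplay between COOP updates, spine proxy and ARP Gleaning can be summarized with a small Python model. It is a sketch under stated assumptions: the `Spine` class and its methods are invented names, the COOP database is reduced to a dictionary, and the set of leaves that receive the glean is modeled per Bridge Domain, as in the example of Figure-13.

```python
from dataclasses import dataclass, field


@dataclass
class Spine:
    coop_db: dict = field(default_factory=dict)        # (vrf, ip) -> leaf
    bd_deployment: dict = field(default_factory=dict)  # bridge domain -> set of leaves

    def coop_update(self, vrf: str, ip: str, leaf: str) -> None:
        """A leaf (Citizen) reports a locally learned Endpoint to the spine (Oracle)."""
        self.coop_db[(vrf, ip)] = leaf

    def handle_proxied_arp(self, vrf: str, bd: str, target_ip: str) -> tuple:
        """Forward a proxied ARP-Request if the Target IP is known, otherwise glean."""
        leaf = self.coop_db.get((vrf, target_ip))
        if leaf is not None:
            return ("forward", [leaf])
        # Unknown target: ask the leaves where the Bridge Domain is deployed
        # to ARP for it from their SVI
        return ("glean", sorted(self.bd_deployment.get(bd, set())))
```

Once a silent host answers the ARP-Request generated by a leaf, that leaf learns it locally and reports it via `coop_update`, so the next proxied request for the same Target IP is forwarded instead of gleaned.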


It may seem that we could now put an end to this grueling marathon. Not just yet! There is another not so uncommon case to consider.

Imagine having a Bridge Domain where, for a certain subnet in use, no corresponding SVI IP address has been assigned. In such a context, Figure-14 depicts the Endpoint EP3 as a silent host in this subnet, and the Endpoint EP1 as a part of the subnet as well. Now EP1, to communicate with EP3, sends an ARP-Request for it, and the request is intercepted by the leaf EP1 is connected to. Since this leaf has not learned EP3’s IP address yet (EP3 is a silent host), it proxies the ARP-Request for EP3 to a spine for ARP Gleaning.

The question now is, how can the ARP-Gleaning process remotely trigger an ARP-Request sourced by an SVI’s IP address in the ARP-Request’s Target IP subnet if, as shown in Figure-14, this address has not been assigned?

Figure 14

Well, perhaps this time I should quote the old ABBA song that said: “Knowing me, knowing you… There is nothing we can do!”.

In fact, the only way to solve this problem is, simply, to re-enable the old-fashioned ARP-Flooding mechanism, giving up one of ACI’s best superpowers.

As I said at the beginning of this document, great power means great responsibility, and it is an ACI administrator’s responsibility to decide when and where to leverage ACI’s superpowers, according to specific situations, especially because the situations in which those superpowers are counterproductive are not so uncommon.

I want to illustrate another example.


necessity is to interpose an L3 firewall between Endpoints located in different subnets. Figure-15 presents such a solution, implemented via a Policy Based Routing (PBR) Service Insertion - a typical network design pattern - where two Endpoints in two different Bridge Domains/Subnets are using an L3 multi-homed firewall as an IP gateway to establish and maintain secure communications.

Figure 15

It is important to notice, in the context of the unidirectional communication of a single IP packet, in the same VRF, from EP1 to EP2, that:

• When the data packet D enters P1 of the leaf L, the MAC address of EP1 is data-plane learned in BD1, together with the EP1 IP address.

P1 is a Fabric Learning Point in L for D.

• When the data packet D enters P3, the MAC address of the firewall (FW) is data-plane learned in BD2-FW, together with, again, the EP1 IP address.

P3 is again a Fabric Learning Point in L for D.

The problem, in this configuration, is that since the packet D is recirculated outside the Fabric to be sanitized by the firewall FW, there are two Fabric Learning Points in L for D, P1 and P3.

At P1 the leaf L will learn, in the Bridge Domain BD1, the MAC address of EP1 (0000.0011.1111), whereas at P3 the leaf L will learn, in the Bridge Domain BD2-FW, the MAC address of the Endpoint FW (0000.0044.4444).

This is right, since MACs change hop-by-hop (it is how IP routing normally works). What is not so normal is that, since IPs stay end-to-end, the same IP address, EP1’s (10.10.10.11), is learned by the Fabric on two different ports, P1 and P3, in two different Bridge Domains, BD1 and BD2-FW.

Now, since the ACI’s IP learning process is scoped to the VRF, this behavior at minimum produces a Fabric-wide flapping of the Endpoint database, but it could lead to security problems as well.
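The flapping can be reproduced with a toy model of VRF-scoped IP learning. The Python class below is an illustrative assumption (including the VRF name "VRF1" and the move counter); it only shows why one IP address seen at two learning points in the same VRF keeps moving.

```python
class IpEndpointTable:
    """IP learning scoped to the VRF: one location per (vrf, ip) key."""

    def __init__(self) -> None:
        self.entries = {}  # (vrf, ip) -> (bridge domain, port)
        self.moves = 0     # how many times an already known IP changed location

    def learn(self, vrf: str, ip: str, bd: str, port: str) -> None:
        location = (bd, port)
        previous = self.entries.get((vrf, ip))
        if previous is not None and previous != location:
            self.moves += 1  # the Endpoint appears to have moved
        self.entries[(vrf, ip)] = location


table = IpEndpointTable()
for _ in range(3):  # three packets from EP1 towards EP2
    table.learn("VRF1", "10.10.10.11", "BD1", "P1")     # packet enters from EP1
    table.learn("VRF1", "10.10.10.11", "BD2-FW", "P3")  # same packet returns from FW
```

After only three packets the entry for 10.10.10.11 has changed location five times, and the table ends up pointing at the firewall port rather than at EP1.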


network’s functionality. Exploring those countermeasures further is beyond the purpose of this document, but take into account that ACI requires specific knowledge, to be built step-by-step, lesson after lesson, for each optimization it makes available.

Before I close the subject related to ARP, there is one more thing.

ACI has another superpower - it can leverage the Sender IP field contained in an ARP packet to learn via data-plane the sender’s IP address. This behavior is intended to complement and improve the normal IP data-plane learning: it allows ACI to know the IP address of an Endpoint even before the Endpoint sends a single IP packet.

Unlike ARP-Requests, broadcast and multicast traffic in general cannot be mitigated by a similar trick. However, as I have explained in a previous chapter, its delivery can be optimized by leveraging the multicast services made available by the underlay network, which ACI tunes using concurrent FTAG-rooted multicast distribution trees.

Nevertheless, there is a kind of traffic in the BUM (Broadcast, Unknown unicast, Multicast) family, usually forwarded as broadcast, whose delivery can be optimized by ACI, the unknown unicast.

Regarding the concept of unknown unicast traffic, imagine having two hosts as part of the same IP subnet, H1 and H2, and programming the ARP table of H1 with a static mapping of the H2’s MAC and IP address.

At that point, when H2 is pinged from H1, a unicast frame with H2’s MAC address as L2 destination and H2’s IP address as L3 destination is delivered directly to the connected switch.

If the switch’s MAC address table does not have any information about H2 (this usually happens when a host has been silent long enough for its entry to age out and be removed from the table), this unicast frame, destined to an unknown L2 destination (hence the name ‘unknown unicast’), will be flooded along the entire L2 segment.

Translating the described situation to the context of an ACI Fabric, it is indeed possible that the leaf an Endpoint EP1 is connected to doesn’t have any info about a remote Endpoint EP2, but that the spine does, because other leaves may have discovered EP2 and informed the spine about it. Therefore, the leaf EP1 is connected to uses its superpower to avoid flooding the Bridge Domain, by proxying this unknown unicast packet to a spine, hoping the spine itself has the info needed to deliver it via unicast.


Time to remember… and say Goodbye

ACI is a huge and challenging subject: it redefines and bends some of the realities network admins are used to and, sometimes, it does ‘bad things’ for ‘good reasons’. Studying ACI and becoming confident with both its theoretical and practical aspects requires time and effort but, in the end, I think it is rewarding.

Anyway, to refresh the reader’s memory about the main subjects of this document, what follows is a brief recap of some of the key aspects to remember:

• The ACI Fabric behaves like a gigantic multilayer switch spread across a spine/leaf network topology, where a physical underlay offers connectivity services to a logical overlay built over an iVXLAN virtualization

• In ACI, an Application Policy Infrastructure Controller (APIC) Cluster programs and manages the Fabric, but it sits outside the Fabric’s data path, which means that its failure will not impact the Fabric’s data forwarding

• The underlay transports the overlay’s L2 and L3 traffic, associating it with specific iVXLAN VNIs, one for each overlay Bridge Domain and one for each overlay VRF

• The underlay offers multicast services to the overlay’s BUM traffic, maximizing scalability and performance via multiple concurrent multicast distribution trees named FTAG-Trees

• The main logical structures defined in the overlay are Tenants, VRFs, Bridge Domains, Endpoint Groups and Endpoints. The Tenant represents an administrative context; the others are used to define and support the network topology, security and forwarding

• The concept of Bridge Domain, representing ACI’s L2 domain, relegates the VLAN to the role of a mere selector used to categorize the traffic

• The ACI Fabric control plane is based on a very large database of Endpoints’ MAC and IP Addresses, memorized and replicated among the spines via a proprietary protocol named Council Of Oracle Protocol (COOP)

• ACI does not use the usual FIB, RIB and ARP tables to address the Endpoints learned inside the Fabric; instead, it leverages a single Endpoint Table (EP Table) that, among other metadata, contains MAC addresses and /32 IPv4 and /128 IPv6 addresses; no traditional routing is used inside the Fabric to reach the learned Endpoints


Endpoints; if they cannot address an Endpoint, they send the packet to a spine, counting on the spine to deliver it

• ACI has superpowers. Two of them are the ability to learn, via the data plane, both the MAC and IP identities of an Endpoint from an IP packet entering the Fabric, and to learn, still via the data plane, the Endpoint’s IP identity from an ARP packet entering the Fabric:

o the identities are learnt at the Fabric Ingress Points, that is, the leaves’ front-panel ports

o the MAC address is always learnt

o the IP address is learnt if the packet has to be routed, that is, if it is L2-destined to the MAC address owned by an SVI (pervasive-gateway) in its respective Bridge Domain

o the IP address is also learnt from the Sender IP Field of an ARP-Request

• By leveraging the IP control-plane, ACI can use another superpower - route an ARP-Request to its Target Destination without broadcasting it; this process can be operated directly by a leaf or by a spine, with the latter being able to force the discovery of silent hosts, if necessary, through a special ARP-Request process named ARP Gleaning

• By leveraging the IP control-plane, ACI is able, in many circumstances, to avoid broadcasting across the Fabric the unknown unicast traffic

• Any ACI superpower may have side effects, sometimes disruptive; therefore, it is up to the network administrator to evaluate, case by case, which one to use and when.
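The data-plane learning rules recapped in the list above can be condensed into a small Python sketch. It is only a simplified model under stated assumptions: the function name, its parameters and the gateway MAC value are hypothetical, and the many configuration knobs that influence learning on a real Fabric are ignored.

```python
def learned_identities(dst_mac, src_mac, gateway_macs,
                       src_ip=None, arp_sender_ip=None):
    """Return what is learnt from one frame entering a front-panel port.

    gateway_macs: MAC addresses owned by the Bridge Domain's SVI
                  (pervasive gateway); hypothetical input of this sketch.
    src_ip:        source IP, when the frame carries an IP packet.
    arp_sender_ip: Sender IP field, when the frame carries an ARP packet.
    """
    learned = {"mac": src_mac}              # the MAC address is always learnt
    if arp_sender_ip is not None:
        learned["ip"] = arp_sender_ip       # learnt from the ARP Sender IP field
    elif src_ip is not None and dst_mac in gateway_macs:
        learned["ip"] = src_ip              # learnt because the packet is routed
    return learned


GATEWAY = {"00:00:5e:00:53:01"}             # made-up gateway MAC for the example

bridged = learned_identities("aa:aa:aa:aa:aa:aa", "bb:bb:bb:bb:bb:bb",
                             GATEWAY, src_ip="10.0.0.5")
routed = learned_identities("00:00:5e:00:53:01", "bb:bb:bb:bb:bb:bb",
                            GATEWAY, src_ip="10.0.0.5")
from_arp = learned_identities("ff:ff:ff:ff:ff:ff", "bb:bb:bb:bb:bb:bb",
                              GATEWAY, arp_sender_ip="10.0.0.5")
```

In this model a purely bridged IP packet teaches the leaf only the MAC address, while a routed packet or an ARP packet also teaches it the IP address.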

This document, together with the ones that follow it in the series ‘Cisco ACI – Connecting the Dots…’, represents only the tip of the iceberg and is intended only as an introduction to subjects that other, more specialized documents, books and Cisco Live sessions develop in greater depth.

To consolidate the knowledge on these subjects, I would suggest reading about them from the following sources:

• ACI Endpoint Learning – Whitepaper by Cisco

• Application Centric Infrastructure Design Guide – Whitepaper by Cisco

• Deploying ACI – Book by Cisco Press.