NETWORK DISCOVERY USING INCOMPLETE MEASUREMENTS


by

Brian Eriksson

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy (Electrical Engineering)

At the

University of Wisconsin - Madison


Acknowledgements

First of all, I would like to thank my advisors, Robert Nowak and Paul Barford. This dissertation could not have been written without the extensive professional support given by both, including the countless hours spent editing my paper drafts and presentations. I thank Rob particularly for taking a chance on me early in my graduate school career and teaching me some very important life lessons along the way, and Paul for spending many late deadline nights helping me with papers at the last minute and introducing me to a research area that allowed me to be creative and exploit my unique skill set.

I would also like to thank the many other research collaborators who gave me significant help over the course of my graduate studies, including Bruce Maggs (Duke University/Akamai), Aarti Singh (CMU), Nick Duffield (AT&T), Matthew Roughan (University of Adelaide), Joel Sommers (Colgate University), Peyman Milanfar (University of California - Santa Cruz), Mark Crovella (Boston University), Mark Coates (McGill University), and Sina Farsiu (Duke University). I also thank the remaining members of my PhD committee, Amos Ron, Barry Van Veen, and William Sethares. Final professional thanks go to the many past and current graduate students here at the University of Wisconsin, including Minglei Huang, Jarvis Haupt, Mike Rabbat, Waheed Bajwa, Rui Castro, Laura Balzano, and Gautam Dasarathy.


On a personal note, I would like to thank my parents and my sister for support throughout my graduate school career. You did not always understand what exactly I was doing in graduate school, but you did always understand why I was doing it. Also, to the doctors and nurses at the University of Wisconsin hospitals and clinics. While the final months of my Ph.D. were completely different than I ever could have imagined, everyone here has treated me with respect and care during a difficult time.

Finally, to my fiancee and best friend, Amy. You have been the only force of sanity in my life during moments when I frequently lost my perspective on things. And I hope that in the future a majority of our conversations no longer revolve around complaining about graduate school.


List of Tables

3.1 Details of honeypot data sets used in our study. All data was collected over a one day period on December 22, 2006.

3.2 Counts of occurrences of common source IP addresses in multiple honeypots

3.3 Shared path estimation results for a 1000 node synthetic topology assuming that probes from 800 randomly selected end host nodes were observed in 8 randomly selected monitors.

3.4 Shared path estimation results for the Skitter topology assuming that probes from 700 randomly selected end host nodes were observed in 8 randomly selected monitors.

3.5 Comparison of three different techniques for discovering pairwise distances between end hosts. Where N is the number of end hosts and M is the number of monitors (N ≫ M)

4.1 An example hop count matrix using observed hop elements from the single traceroute path p1 → r1 → r2 → r3 → r4 → p2 (where “-” represents an unknown element).

4.2 Hop Matrix reconstruction error rates. The RMSE of 100,000 core router to core router hop distances held out.

4.3 Division of Matrix Completion Errors for Holdout Data

4.4 Performance of Unseen Link Classification Algorithm with various threshold values using λ thresholding on both the bootstrap thresholding and the hop count thresholding methodologies.


5.1 Upper bound probing complexity for the three probing methodologies for balanced ℓ-ary tree. (Where p(ℓ) is sublinear in ℓ)

5.2 Comparison of number of probes needed to estimate logical topology using synthetic Orbis topologies.

7.1 NTP Dataset - The average geolocation error for various end host to landmark mapping methodologies.

7.2 NTP Dataset - Hop-based Mapping methodology quintile errors.

7.3 Commercial Node Dataset - The average geolocation error for various end host to landmark mapping methodologies.

7.4 Commercial Node Dataset - Hop-based Mapping methodology quintile errors.

7.5 PinPoint Algorithm complexity for both probing and computation. Where N is the number of end hosts, K is the probing budget, T is the number of monitors, M is the number of landmarks, B is the number of bootstrap iterations, and G is the number of feasible geolocation points.

7.6 The geolocation error for all geolocation methodologies using latency data from all landmarks (error distance in miles).

7.7 The performance of the NBgeo Algorithm given additional data (error distance in miles).

7.8 The geolocation error (in miles) for all geolocation methodologies using latency data from all landmarks (for number of landmarks, T = 50, 200).

8.1 GEANT Network Data - Number of false alarms declared for a percentage of the true anomalies detected.

8.2 GEANT Network Data - Number of false alarms declared in order to detect every anomaly in the GEANT dataset (with respect to various wavelet types).

8.3 Synthetic Traffic Matrices - Number of false alarms declared for a percentage of the


8.4 Abilene Network Data - Number of false alarms declared for a percentage of the PCA anomalies detected.

List of Figures

1.1 Toy network topology. (Left) - Physical topology, (Right) - Logical topology.

1.2 Example of the three Internet measurement types (Ping, traceroute, and passive where the monitor is at IP6) between two points in the network, where a time-to-live (TTL) value of 60 indicates that there are 64 − 60 = 4 routers between the two end hosts in the network.

3.1 (Left) Example Network Topology with sources Si sending packets through a core component to monitors Mk, (Right) Example network where S1 and S2 share a border router.

3.2 Comparison of clustering results for Unique Contrast Clustering and Hop Distance Nearest Neighbor in terms of average number of cluster elements in matching IP subspaces

3.3 Example of a subnet having multiple egress points.

3.4 2-D histogram of hop count contrast vectors with clusters highlighted in ellipses.

3.5 Comparison of Gaussian mixture clusters to random clusters. Simulated topology, N = 1000, M = 8.

3.6 Comparison of Gaussian mixture clusters to random clusters. Skitter topology, N = 700, M = 8.

3.7 Striped dots indicate passive measurement data observed, Black dots indicate no information observed (Left) - Observations where Network-centric imputation may perform well, (Right) - Observations where Network-centric imputation will fail.


3.8 Imputation accuracy over a range of randomly selected missing values using data from the real-world honeynet dataset.

3.9 Imputation accuracy over a range of randomly selected missing values using data from M = 16 honeypots. (Left) N = 1000, (Center) N = 2000, (Right) N = 3000

3.10 Spectrum of sharedness (black dots represent routers). (Left) No sharedness, (Center) Intermediate sharedness, (Right) Maximum sharedness

3.11 Example of cluster-level path estimation.

3.12 The effect of increasing the number of clusters on the shared path estimation performance on the simulated topology using the cluster-level shared path estimation method.

3.13 The effect of increasing the number of clusters on the shared path estimation performance on the Skitter topology using the Gaussian mixture EM cluster-level shared path estimation method.

3.14 Topology estimation performance for two different estimation methods in a 1000 node synthetic topology, ((Left) M = 8, (Center) M = 16, (Right) M = 24)

3.15 Performance of topology estimation algorithm in 1000, 2000, and 3000 node synthetic topologies with M = 16, (Left) - Predictive Function Topology Estimation, (Right) - Cluster-Level Topology Estimation.

3.16 Example mask array W, (with N = 4 and M = 2). Note that not all hop-counts from end hosts to monitors are observed, and none of the hop-counts between end hosts are observed.

3.17 Two end hosts with the same hop distance to a single monitor.

3.18 Simulation results for error rates of pairwise hop estimation for synthetic topology versus amount of available data (N = 1000). (left) M = 8, (center) M = 16, (right) M = 32

3.19 Simulation results for error rates of pairwise hop estimation for synthetic topology


3.20 The effect of embedding dimension to estimating the pairwise distance values for the synthetic topology, N = 1000, w/ M = 32 and calculated dimension d = 5, confidence bars indicating +/-1 standard deviation

3.21 The effect of adding additional monitors to estimating the pairwise distance values for the synthetic topology, observing complete hop count data, N = 3000, confidence bars indicating +/-1 standard deviation

3.22 RMSE of pairwise hop estimation simulation results for the Skitter topology (N = 1000). (Left) M = 8, (Center) M = 16, (Right) M = 32

3.23 Simulation results for asymmetric reverse paths for synthetic topology (N = 1000, M = 16) versus amount of available data. (left) Reverse paths off by 1 hop, (center) Reverse paths off by 2 hops, (right) Reverse paths off by 3 hops

4.1 A representation of our pragmatic definition of the Internet’s core.

4.2 Empirical Cumulative Probability for the imputation error using both Matrix Completion and Mean imputation.

4.3 Percentage of total links correctly classified plotted against threshold of confidence upper bound (λ) for both bootstrap upper bound estimate and hop count estimate.

4.4 (Left) - Number of additional unique core routers found using the two probing techniques, (Right) - Number of additional unique core links found using the two probing techniques

5.1 Example of Network Radar on simple logical topology.

5.2 Example simple logical topology in a proper DFS Order.

5.3 (A) - Case A - $\sigma^2_{i,i-1} - \sigma^2_{i-1,i-2} < \delta$. The current end host $x_i$ is attached to the parent of $x_{i-1}$. (B) - Case B - $\sigma^2_{i,i-1} \geq \sigma^2_{i-1,i-2} + \delta$. A new router $r_i$ is created with children $x_i$, $x_{i-1}$ with parent $f(x_{i-2})$.

5.4 (A) - Case C-1 - $\sigma^2_{r^*} - \sigma^2_{i,i-1} < \delta$. The current end host ($x_i$) is attached to router $r^*$. (B) - Case C-2 - $\sigma^2_{r^*} < \sigma^2_{i,i-1} + \delta$. A new router $r_i$ is attached on the path between routers $r^*$ and $f(r^*)$.


5.5 Example of covariance values from a single end host not revealing the entire topology

5.6 (Left) The first split taken on a balanced ℓ-ary tree. (Right) The second split taken on a balanced ℓ-ary tree. Both splits indicated by the dotted line, the arrow indicates the randomly chosen end host covariance values are measured against.

5.7 Real world topology used to test tomography methods

5.8 Topology reconstruction results for the three algorithms (DFS Ordering, Sequential, and Hierarchical Clustering).

6.1 Resulting ordering of Gene Microarray reconstructions. (Left) - Standard Agglomerative Clustering, (Center) - Outlier Based Clustering, (Right) - Robust Outlier Based Clustering.

7.1 (Left) - Probability for latency measurements between 10-19ms being observed given a target’s distance from a monitor. Stem plot - Histogram density estimation, Solid line - Kernel density estimation. (Right) - The kernel estimated probability of placement in each county given latency observation between 10-19ms from a single monitor marked by ’x’.

7.2 (Left) - Estimated posteriori probabilities for all counties in the continental US. (Right) - Estimated posteriori probabilities for constraint-based restricted counties.

7.3 Toy example of network routing geography vs. direct line-of-sight geography.

7.4 Geographic placement of NTP servers.

7.5 Example of network where an end host is C hops away from a landmark, with both sharing the same border router.

7.6 (Left) Hop-based geolocation mean error decay with the number of observed hop counts by each end host. (Right) Hop-based geolocation median error decay with the number of observed hop counts by each end host. (Standard deviations are shown in the error bars)


7.7 Likelihood distribution of distance to landmark given observed latency of 10-20ms. Solid line - Kernel density estimation, Dashed Line - Estimated cumulative distribution, Dashed blocks - histogram

7.8 Empirical cumulative probability of error distance for both NBgeo with constraint information and the CBG method.

7.9 Median geolocation error (in miles) given a limited probing budget (T = 200).

7.10 Cumulative distribution of geolocation error for both PinPoint and Octant algorithms (K = 20, T = 200).

7.11 Cumulative distribution of geolocation error for PinPoint removing the improvements of bootstrap estimation and exponential latency weighting. (K = 20, T = 200).

7.12 Cumulative distribution of geolocation error for confidence quintiles derived from 95% bootstrap confidence interval size (K = 20, T = 200).

8.1 The BasisDetect Framework

8.2 (Left) - 1024 minutes of packet counts across a single link in the GEANT network. Known anomalies are marked with ’x’. (Center) - The first four atoms found in the signal dictionary consisting of a Discrete Cosine Transformation (DCT). (Right) - Comparison of the observed signal with a representation using the best linear combination of the four atoms.

8.3 Fourier analysis of Abilene data (March 1-15, 2004). (Left) - Abilene traffic, (Center) - Important region of power spectrum, (Right) - Fourier approximation

8.4 GEANT Network Data - False Alarm anomalies found for a specified level of true anomaly detection for the three time-series detection methodologies (Fourier, EWMA, BasisDetect).


8.5 Tuning parameter performance experiment, examination of how well the BasisDetect algorithm performs as each of the tuning parameters are removed. Using the full BasisDetect algorithm (γ, ρ learned from training set), BasisDetect w/o residual (γ learned from training set, ρ = 0), and BasisDetect w/o penalty (ρ learned from training set, γ = 0)

8.6 Synthetic Traffic Matrices - False Alarms declared for a specified level of true anomaly detection for the three network-wide detection methodologies (PCA, Distributed Spatial, BasisDetect).

8.7 Abilene Real-World Network Data - Using 15 anomalies found by the PCA methodology, the false alarm rates are displayed for both BasisDetect and the Distributed Spatial methodology.


List of Algorithms

1 - Gaussian Mixture EM Imputation Algorithm

2 - MDS Algorithm with Incomplete Passive Measurements

3 - MDS Algorithm with Incomplete Passive Measurements and BGP Information

4 - Unseen Link Estimation Algorithm - Bootstrap Thresholding

5 - Unseen Target Estimation Probing Algorithm

6 - Ordered Logical Topology Discovery Algorithm

7 - Bisection DFS Ordering Algorithm - bisect(X,δ)

8 - Outlier-based Clustering Algorithm

9 - Robust Outlier Clustering Algorithm

10 - NBgeo - Naive Bayes IP Geolocation Algorithm

11 - PinPoint IP Geolocation Algorithm

12 - Dictionary Construction Algorithm


Abstract

Resolving characteristics of the Internet from empirical measurements is important in the development of new protocols, traffic engineering, advertising, and troubleshooting. Internet measurement campaigns commonly involve heavy network load probes that are usually non-adaptive and incomplete, and thus directly reveal only a fraction of the underlying network characteristics. This dissertation addresses the open problem of Internet characteristic discovery in an incomplete measurement regime. Using partially observed measurements, we specifically focus on the problems of Internet topology discovery, inferring the geographic location of Internet resources, and network anomaly detection.

First, we consider the inference of topological characteristics of the Internet from three distinct forms of incomplete measurements. Initial work demonstrates how Passive Measurements, potentially-incomplete passively observed characteristics of the network, can be used to infer topological structure, such as clustering and shared path lengths. The second form of missing measurements comes in the form of a set of traceroute probes, where we obtain partial knowledge of route lengths between routers in the network. Using a novel statistical methodology, we show how unobserved links between routers can be detected. Finally, we develop a novel targeted delay-based tomographic methodology, which resolves the tree topology of a network using a number of directed measurements that is within a poly-logarithmic factor of derived lower bounds.

The second component of this dissertation focuses on two critical networking problems: geographic location inference of Internet resources and network anomaly detection. In terms of geographic location inference, our methodology exploits a set of landmarks in the network with known geographic location and targeted latency probes to avoid erroneous measurements caused by non-line-of-sight routing of long network paths. The use of a novel embedding algorithm allows for the inferred geolocation of end hosts to be clustered in areas of large population density without explicitly defined population data. Finally, we examine detecting unforeseen anomalous events in a network. Using a limited training set of labeled anomalies, our new anomaly detection framework extracts signal characteristics of anomalous events and detects their occurrence across observed network-wide measurements.


Brian Eriksson

Under the supervision of Professor Robert D. Nowak
At the University of Wisconsin - Madison

Resolving characteristics of the Internet from empirical measurements is important in the development of new protocols, traffic engineering, advertising, and troubleshooting. Internet measurement campaigns commonly involve heavy network load probes that are usually non-adaptive and incomplete, and thus directly reveal only a fraction of the underlying network characteristics. This dissertation addresses the open problem of Internet characteristic discovery in an incomplete measurement regime. Using partially observed measurements, we specifically focus on the problems of Internet topology discovery, inferring the geographic location of Internet resources, and network anomaly detection.

The first problem addressed in this work is the inference of topological characteristics of the Internet from three distinct forms of incomplete measurements. Initial work demonstrates how Passive Measurements, potentially-incomplete passively observed characteristics of the network, can be used to infer topological structure, such as clustering and shared path lengths. The second form of missing measurements comes in the form of a set of traceroute probes, where we obtain partial knowledge of route lengths between routers in the network. Using a novel statistical methodology, we show how unobserved links between routers can be detected. Finally, we develop a novel targeted delay-based tomographic methodology, which resolves the tree topology of a network using a number of directed measurements that is within a poly-logarithmic factor of derived lower bounds.

The second component of this dissertation focuses on two critical networking problems: geographic location inference of Internet resources and network anomaly detection. In terms of geographic location inference, our methodology exploits a set of landmarks in the network with known geographic location and targeted latency probes to avoid erroneous measurements caused by non-line-of-sight routing of long network paths. The use of a novel embedding algorithm allows for the inferred geolocation of end hosts to be clustered in areas of large population density without explicitly defined population data. Finally, we examine detecting unforeseen anomalous events in a network. Using a limited training set of labeled anomalies, our new anomaly detection framework extracts signal characteristics of anomalous events and detects their occurrence across observed network-wide measurements.

Approved:

Professor Robert D. Nowak
Department of Electrical and Computer Engineering
University of Wisconsin - Madison


Chapter 1

Introduction

As science and technology advance, extremely complex systems will become more prevalent. This complexity is often in the form of decentralized systems with a very large number of interdependencies. These systems form loosely defined “networks”, and can be found in areas ranging from social interaction to genetic regulation systems. The analysis of these structures, commonly known as Network Science, has emerged in recent years to develop formal methodologies for predicting and modeling behaviors of these complex systems. This dissertation will focus on developing novel statistical and machine learning techniques in the field of Network Science for application on one of the largest man-made network structures in existence, the Internet.

Over the past quarter century, the Internet has grown into a gigantic, extremely complex infrastructure that connects over a billion users worldwide. The ability to measure, map, and analyze characteristics of the Internet accurately would facilitate network design, network management and network security processes by exposing the strengths and weaknesses in connectivity and opportunities to improve its robustness and performance. Prior work on discovering network characteristics, such as generating Internet maps [1, 2] or IP geolocation [3, 4], has mainly focused on the engineering problems associated with extensively probing the Internet using high network load probes, and then aggregating the vast quantities of data returned. This approach has inherent shortcomings: the estimated characteristics are not timely, due to the large probing load on the network, and frequently updated disambiguation databases are needed. In contrast, the work in this dissertation focuses on transforming the task of resolving network characteristics from an engineering exercise based on exhaustively probing the network to a mathematical inference problem based on exploiting a non-exhaustive subset of observed network measurements. By validating our methodologies on a known network structure such as the Internet, we have a stepping stone for the vast array of other scientific problems where the existence of a network is implied (e.g., genetic regulatory networks, brain networks, etc.).

Thesis Statement

The complexity of the Internet requires more thorough and intelligent analysis of network measurements than previously performed. By exploiting known network structure and novel data fusion methodologies, latent information in noise corrupted and incomplete Internet measurements can be revealed. This hidden information exposes important new features in the network that were previously ignored, with applications in the areas of topology discovery, IP geolocation, and anomaly detection.

1.1 Internet Measurements and Terminology

In order to refer to objects in the Internet, throughout this dissertation we will use terminology common to networking literature. An end host will refer to any object in the Internet that can send or receive information requests. We will focus on layer 3 (the network layer), where these end hosts are connected through routers which direct where data packets travel such that the packet destination is eventually reached. An autonomous system (AS) is a partition of end hosts and routers controlled by a single network operator (e.g., Level3 or AT&T). A physical topology route will indicate the specific physical routers that a path between two end hosts contains, and a logical topology route will refer to the path topology containing only routers with either in-degree or out-degree greater than one. A labeled example of this terminology can be seen in Figure 1.1. In order to resolve characteristics of the network, we will rely on information returned through the use of either Active or Passive Internet Measurements.
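The relationship between the two topology types can be illustrated with a minimal Python sketch. It assumes the physical topology is given as a list of directed edges; the function name `logical_topology` and the node labels are hypothetical, and the rule for retaining sources and sinks that are not end hosts is an assumption made so the contraction is well defined.

```python
from collections import defaultdict


def logical_topology(edges, end_hosts):
    """Contract pass-through routers out of a directed physical topology.

    A router with exactly one incoming and one outgoing link carries no
    branching information, so it is removed and its two links are merged.
    End hosts are always kept. Assumption: nodes with in-degree or
    out-degree of zero are also kept.
    """
    succ = defaultdict(set)
    pred = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)

    nodes = set(succ) | set(pred)
    keep = {
        n for n in nodes
        if n in end_hosts or len(pred[n]) != 1 or len(succ[n]) != 1
    }

    logical = set()
    for u in keep:
        for v in succ[u]:
            # Skip over pass-through routers until a kept node is reached.
            while v not in keep:
                v = next(iter(succ[v]))
            logical.add((u, v))
    return logical


physical = [("h1", "r1"), ("r1", "r2"), ("h3", "r2"), ("r2", "r3"), ("r3", "h2")]
print(sorted(logical_topology(physical, {"h1", "h2", "h3"})))
```

Here routers r1 and r3 each have a single incoming and a single outgoing link, so only the branching router r2 survives in the logical topology.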


Figure 1.1: Toy network topology. (Left) - Physical topology, (Right) - Logical topology.

1.1.1 Active Measurements

A vast majority of prior Internet characteristic discovery research depends on Active Measurements [1, 2, 3, 4, 5, 6, 7, 8, 9], which we define here as tomographic network measurements performed by sending a probe to a specified destination target from a set origin point in the network. The active measurement output will consist of some characteristic of the network between the origin and destination (e.g., delay, router topology, etc.). In this dissertation, we focus on two specific active measurement probes, ping and traceroute.

Ping Measurements

The most basic active probes considered in this dissertation are simple ICMP ping probes [10]. Using an ICMP echo request packet, the origin host sends a probe to a targeted destination end host, and the reply yields both the round trip time (RTT) latency in milliseconds and the time-to-live (TTL) value, which indicates the number of routers between the origin host and the targeted end host. The advantage of ping measurements is that the probes are lightweight, placing very little load on the network path. The main disadvantage is that individual ICMP ping measurements return no further useful topology characteristics (shared path lengths of two network routes, specific routers traversed by a path, etc.).


traceroute Measurements

In prior research (e.g., [1, 2]), the predominant measurement for acquiring Internet topology characteristics has been based on tools similar to traceroute. Standard traceroute probes further exploit ICMP packets to return both the number of routers (i.e., the hop count) between two points in the network, and the set of router interface IP addresses along the path between the two probe points. This probing methodology allows for routing adjacencies to be known along the path between the two probe points (e.g., router A is physically connected to router B). In addition, the router interface IP addresses allow for domain name server (DNS) requests for further information about each router along the observed path. This technique, referred to as unDNS [11], creates location hints that have been used frequently on the problem of estimating an end host’s geographic location [3, 4]. Unfortunately, effective use of such hints requires significant and frequently updated databases which still introduce the possibility of errors [12].
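As an illustration of how routing adjacencies follow from traceroute output, the minimal Python sketch below collects the links implied by consecutive hops. The input format (a list of interface IP addresses per path, with `None` for a hop that did not respond) and the function name `links_from_traceroutes` are assumptions for this example, not the output format of any particular tool.

```python
def links_from_traceroutes(paths):
    """Collect undirected router adjacencies from traceroute paths.

    Each path is an ordered list of interface IP addresses. A hop that
    did not answer is given as None, and no link is inferred across it.
    """
    links = set()
    for path in paths:
        for a, b in zip(path, path[1:]):
            if a is None or b is None:
                continue
            # Store each link with its endpoints sorted so that A-B and
            # B-A count as the same adjacency.
            links.add(tuple(sorted((a, b))))
    return links


paths = [
    ["10.0.0.1", "10.0.1.1", "10.0.2.1"],
    ["10.0.0.1", "10.0.1.1", None, "10.0.3.1"],
]
for link in sorted(links_from_traceroutes(paths)):
    print(link)
```

Note that the sketch treats each interface address as a distinct router; merging several addresses that belong to one physical router is a separate step.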

Great strides have been made in mitigating the problems associated with active probe Internet measurements, such as interface disambiguation in [11, 13], which is the problem of resolving multiple IP addresses associated with a single physical router. This has enabled accurate mapping of ISP topologies (e.g., [2]) and of the Internet’s core (e.g., [1, 14]) using traceroute probes.

However, there are still three important limitations in the use of active probing tools for Internet characteristic discovery. First, the vast size of the Internet means that a set of measurement hosts M and target hosts N, where N ≫ M, must be established in order for the resultant measurements to capture the diverse features of the infrastructure (especially on the edges of the network [14]). Second, active probes sent from monitors to the large set of target hosts result in a significant traffic load on the network. Third, in order to prevent reverse engineering of networks, service providers frequently attempt to thwart structure discovery by blocking ICMP probes to specific routers (and thus blocking both traceroute and ping probes). This results in the acquisition of incomplete active measurements due to (i) - the inability to perform exhaustive probing of all objects in the network in a reasonable length of time, and/or (ii) - the obfuscation of critical network infrastructure by administrators to avoid reverse engineering.


1.1.2 Passive Measurements

One alternative to the use of active probes is to acquire network information passively, where instead of introducing probe-based traffic into the network we measure existing Internet traffic. The methodology we will consider in this dissertation consists of a series of monitors on network links sampling traffic. From passively sampled packets, we can obtain the IP address and Time-To-Live (TTL) count from the packet header. At its origin, each packet is assigned an operating-system-dependent initial TTL value (e.g., 64, 128, or 255). As the data packet traverses the network, the TTL count is decremented by one at each router encountered. When the TTL count reaches zero, the packet is discarded, thus preventing packets from forever traversing the network. Using the technique from [15], the TTL count can be translated into the number of routers between the end host and the passive monitor. It is not uncommon to observe packets from a single end host source at several of the passive monitors, resulting in a vector of hop-count distances from each monitor to that source. These vectors provide an indication of the topological location of the source relative to the monitors, with no additional load added to the network by measurement probes.
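The TTL-to-hop-count translation can be sketched in a few lines of Python. This is a common heuristic and not necessarily the exact technique of [15]: it assumes the initial TTL is the smallest of the listed values that is at least the observed TTL, which is reasonable only when paths are much shorter than the gaps between those values. The function name `hops_from_ttl` is hypothetical.

```python
COMMON_INITIAL_TTLS = (64, 128, 255)


def hops_from_ttl(observed_ttl, initial_ttls=COMMON_INITIAL_TTLS):
    """Estimate the number of routers traversed from an observed TTL.

    Assumption: the sender used the smallest candidate initial TTL that
    is greater than or equal to the observed value.
    """
    if not 0 < observed_ttl <= max(initial_ttls):
        raise ValueError(f"TTL {observed_ttl} outside the expected range")
    initial = min(t for t in initial_ttls if t >= observed_ttl)
    return initial - observed_ttl


print(hops_from_ttl(60))   # 64 - 60 = 4
print(hops_from_ttl(120))  # 128 - 120 = 8
```

The first call reproduces the example used in Figure 1.2, where an observed TTL of 60 corresponds to 4 routers.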

Unfortunately, a finite duration passive measurement campaign will likely result in incomplete measurements, because packets from each host are typically only observed at a subset of the monitors. This can be due either to packet sampling restrictions at our monitors (where only a subset of traffic will be observed) or an end host not directing any traffic towards the locations of specific monitors. Due to the inherently incomplete nature of passive measurements, there is relatively little prior work that addresses passive network monitoring.
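A minimal Python sketch of the resulting incomplete data structure follows. It assumes observations arrive as (end host, monitor, hop count) triples and marks unobserved host-monitor pairs with `None`; the function names and labels are hypothetical and chosen only for illustration.

```python
def hop_count_matrix(observations, hosts, monitors):
    """Arrange passive observations into a hosts-by-monitors matrix.

    Entries never observed are left as None, which reflects the
    incompleteness of a finite passive measurement campaign.
    """
    seen = {(host, monitor): hops for host, monitor, hops in observations}
    return [[seen.get((h, m)) for m in monitors] for h in hosts]


def observed_fraction(matrix):
    """Return the fraction of host-monitor pairs that were observed."""
    cells = [cell for row in matrix for cell in row]
    return sum(cell is not None for cell in cells) / len(cells)


observations = [("a", "m1", 4), ("a", "m2", 7), ("b", "m2", 5)]
matrix = hop_count_matrix(observations, ["a", "b"], ["m1", "m2", "m3"])
print(matrix)
print(observed_fraction(matrix))
```

In this toy case no traffic reaches monitor m3, so its column is entirely missing and only half of the matrix is observed.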

An example of the three Internet measurement types considered in this dissertation is shown in Figure 1.2.

1.2 Motivation and Summary of Major Contributions

We will focus on three significant network characterization problems, (i) - Topology inference, (ii) [...] methodologies to handle the effects of missing and corrupted measurement data to improve upon prior network characterization performance. This missingness will be the result of either measurements that are unavailable during the probing campaign (e.g., incomplete passive measurements), or the result of a targeted probing algorithm where we will select particular active measurements in order to reduce the total load on the network.

Figure 1.2: Example of the three Internet measurement types (Ping, traceroute, and passive where the monitor is at IP6) between two points in the network, where a time-to-live (TTL) value of 60 indicates that there are 64 − 60 = 4 routers between the two end hosts in the network.

1.2.1 Internet Topology Inference

There are significant challenges in any approach to measurement and characterization of Internet topology. First, the lack of built-in support for topology measurement coupled with the desire of many Internet Service Providers to keep much of this information private calls for a distributed measurement infrastructure and structural inference methods that are reliable and robust. Next, the vast size and global footprint of the Internet suggest that a potentially significant number of measurement hosts will be required in order to gather sufficient data to generate comprehensive maps. Finally, the well known dynamic nature of the Internet means that measurements must be taken almost continuously in order to identify changes in a timely fashion. Understanding the Internet’s structure through empirical measurements is important in the development of new topology generators, new protocols, traffic engineering, and troubleshooting, among other things. Our topology inference results offer the possibility of a greatly expanded perspective of Internet structure with much lower network traffic impact and management overhead.

The first step in our topology inference study will focus on primarily using passive measurements to resolve topology characteristics. There are significant challenges in using passive packet measurements for discovering Internet structure. First, and most importantly, the individual measurements themselves would seem to convey almost no information about network structure. Second, end host IP addresses are often considered sensitive and are typically subject to privacy constraints (we address this by only using end host IP addresses as unique identifiers of hosts, and to resolve which specific Autonomous System (AS) the end host resides in). And finally, passive measurements give no indication of which routers were traversed between two points in the network, making the problem of topology discovery far more difficult when compared to an active measurements methodology. Despite these challenges, we will describe passive measurement-based algorithms that enable (i) automatic clustering or grouping of traffic sources that share network paths accurately without relying on IP address or autonomous system information, (ii) topological structure to be inferred accurately with only a small number of active measurements, and (iii) missing information to be recovered, which is a serious challenge in the use of passive packet measurements. We demonstrate our techniques using a series of simulated topologies and empirical data sets. Our experiments show that the clusters established by our method closely correspond to sources that actually share paths. We also show the trade-offs between selectively applied active probes and the accuracy of the inferred topology between sources. Finally, we characterize the degree to which missing information can be recovered from passive measurements, which further enhances the accuracy of the inferred topologies.

The second stage of our topology study focuses on the problem that common mapping campaigns using traceroute reveal only a portion of the underlying topology. We will demonstrate that standard probing methods yield datasets that implicitly contain information about much more than just the directly observed links and routers. Each probe yields information that places constraints on the underlying topology, and by integrating a large number of such constraints it is possible to accurately infer the existence of unseen components of the Internet (i.e., links and routers not directly revealed by the probing). Moreover, we show that this information can be used to adaptively re-focus the probing in order to more quickly discover the topology. These findings suggest radically new and more efficient approaches to Internet mapping, specifically to the discovery of the core of the Internet. We define “Internet core” as the set of routers that is roughly bounded by ingress/egress routers from stub autonomous systems. We describe a novel data analysis methodology designed to accurately infer (i) the number of unseen core routers, (ii) the unseen hop-count distances between observed routers, and (iii) unseen links between observed routers. We use a large experimental dataset to validate the proposed methods. The validation shows that our methods can predict the number of unseen routers to within a 10% error level, estimate 60% of the unseen distances between observed routers to within a one-hop (i.e., a single router) error, and robustly detect over 35% of the unseen links between observed routers. Furthermore, we use the information extracted by our inference methodology to drive an adaptive active-probing scheme. The adaptive probing method allows us to generate maps using roughly 50% fewer probes than standard non-adaptive approaches.

The focus of our topology study then shifts to the field of delay-based tomographic probing. Topology recovery via tomographic inference is potentially an attractive complement to standard methods that use TTL-limited probes. Unfortunately, prior tomographic techniques (e.g., [5, 8]) have required an infeasibly exhaustive number of probes (i.e., quadratic with respect to the number of end hosts considered) for accurate, large-scale topology recovery. We will describe new techniques that aim toward the practical use of tomographic inference for accurate router-level topology measurement. We will focus on a novel Depth-First Search (DFS) Ordering algorithm that clusters end host probe targets based on shared infrastructure, and enables the logical tree topology of the network to be recovered accurately and efficiently without the need for an exhaustive number of measurement probes. We evaluate the capabilities of our DFS Ordering topology recovery algorithm in simulation and find that our method uses 94% fewer probes than exhaustive methods and 50% fewer than the current state-of-the-art. We also present results from a case study in the live Internet where we show that DFS Ordering can recover the logical router-level topology more accurately and with fewer probes than prior techniques.

Finally, we examine theoretical bounds for resolving hierarchical clustering (i.e., the tree topology of a network) from limited and potentially noise-corrupted similarities. Our main contributions prove that a sampling-at-random (i.e., passive sampling) methodology will always require an exhaustive (i.e., quadratic) number of pairwise similarity measurements to resolve the entire clustering hierarchy. This is then contrasted with a targeted, active sampling regime, where we show how, in the presence of uncorrupted measurements, a methodology can be designed that requires only $\sim O(N \log N)$ pairwise similarities to accurately reconstruct a tree topology for $N$ objects. These results are then extended to the regime where our similarities are corrupted with noise, where we present a methodology that will reconstruct the clustering with high probability using only $\sim O(N \,\mathrm{polylog}\, N)$ targeted measurements.

1.2.2 IP Geolocation

The ability to pinpoint the geographic location (or geolocation) of IP hosts is compelling for applications such as on-line advertising and network attack diagnosis. While prior methods (e.g., [3, 4, 9, 16, 17]) can accurately identify the location of hosts in some regions of the Internet, the accuracy of standard IP geolocation techniques can be impaired by noisy measurements (e.g., distance derived from non line-of-sight Internet routes) or potentially misleading information such as DNS names, which are generated by and must be interpreted by people. The hypothesis of our geolocation work is that the accuracy of IP geolocation can be improved through the creation of a flexible analytic framework that incorporates different types of geolocation information.

We introduce the NBgeo framework, a machine-learning classification-based geolocation algorithm. This methodology uses a set of lightweight measurements from a set of known monitors to a target, and then classifies the location of that target based on the most probable geographic region given probability densities learned from a training set. For this study, we employ a Naive Bayes framework that has low computational complexity and enables additional societal information to be easily added to enhance the classification process. We use explicitly defined (i.e., a priori supplied) population data from the US Census [18] to improve upon our estimation. Our results show that the new NBgeo framework yields geolocation estimates with a median error 50 miles lower than that of current measurement-based geolocation methods.
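As a rough illustration of the classification step described above, the following Python sketch scores each candidate region by a log-prior proportional to its population plus independent per-monitor log-likelihoods. The Gaussian likelihood model, the function and region names, the latency values, and the population figures are all assumptions made for illustration; the NBgeo framework learns its probability densities from a training set and is not reproduced here.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log-density of a normal distribution (an assumed, illustrative likelihood)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def classify_region(latencies, region_models, populations):
    """Return the region with the highest Naive Bayes posterior score.

    latencies     -- measured latency (ms) from each monitor to the target
    region_models -- region -> list of (mean, std) per monitor (hypothetical training output)
    populations   -- region -> population count, used as an unnormalized prior
    """
    total_population = sum(populations.values())
    best_region, best_score = None, -math.inf
    for region, per_monitor in region_models.items():
        # Log-prior proportional to the region's share of the population.
        score = math.log(populations[region] / total_population)
        # Naive Bayes: measurements are treated as independent given the region.
        for x, (mean, std) in zip(latencies, per_monitor):
            score += gaussian_logpdf(x, mean, std)
        if score > best_score:
            best_region, best_score = region, score
    return best_region


# Hypothetical two-monitor, two-region example.
REGION_MODELS = {
    "region_a": [(10.0, 3.0), (40.0, 5.0)],
    "region_b": [(35.0, 5.0), (12.0, 3.0)],
}
POPULATIONS = {"region_a": 500_000, "region_b": 50_000}
```

When the measurements fit two regions equally well, the population prior decides the outcome, which is the sense in which population data can sharpen the estimate.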

We then introduce a second novel methodology for IP geolocation that we call PinPoint. PinPoint is based on two key innovations. First, we use a geographically diverse set of Internet hosts with ground truth geographic coordinates as landmarks, providing our algorithm with implicitly defined population information. PinPoint begins by identifying the subset of landmarks that are geographically nearest to the target host using a novel clustering methodology based on hop count measurements. Next, PinPoint uses latency measurements from landmark subsets to geolocate the targets. Using only the latencies from landmarks closest to a target in hop distance results in highly accurate predictors compared to latency measurements from arbitrary landmarks, which tend to distort distance due to the vagaries of routing. PinPoint estimates geolocation from latencies using a novel sparse embedding algorithm that preserves latency distances and encourages the targets to cluster geographically, which is desirable since targets tend to concentrate in cities. This second innovation serves as an important regularization in the embedding process that further mitigates the effects of noise and errors. We demonstrate that PinPoint performs significantly better than all existing geolocation tools using measurements conducted from a large set of end hosts with ground truth locations. Our results show that PinPoint is able to geolocate hosts with a median error of 38 miles and an average error of less than 97 miles. In contrast, the best commercial IP geolocation database yields an average error of 493 miles, while the previous state-of-the-art measurement-based geolocation methodology yields a median error of 91 miles and an average error of 170 miles.

1.2.3 Anomaly Detection

The ability to detect unexpected events in large networks can be a significant benefit to daily network operations. A great deal of work has been done over the past decade to develop effective anomaly detection tools (e.g., [19, 20]), but they remain virtually unused in live network operations due to an unacceptably high false alarm rate. We seek to improve the ability to accurately detect unexpected network events through the use of BasisDetect, a flexible but precise modeling framework. Using a small dataset with labeled anomalies, the BasisDetect framework allows us to define large classes of anomalies and detect them in different types of network data, both from single sources and from multiple, potentially diverse sources. Network anomaly signal characteristics are learned via a novel basis pursuit-based methodology. We demonstrate the feasibility of our BasisDetect framework and compare it to previous detection methods using a combination of synthetic and real-world data. In comparison with previous anomaly detection methods, our BasisDetect methodology shows a 50% reduction in the number of false alarms in a single-node dataset, and over 65% reduction in false alarms for synthetic network-wide data.

1.3 Organization

This dissertation is organized as follows. Chapter 2 describes our work in relation to prior research. Chapter 3 describes how to use passive measurements to discover various characteristics of network topology (end host clustering, shared path estimates, end host-to-end host path lengths). In Chapter 4, given an initial set of Internet probes, we present novel statistical techniques to estimate unseen routers and links in the network. Then in Chapter 5, we describe an intelligent network tomography probing procedure that drastically reduces the total number of active delay-based probes needed to resolve the logical topology of a network. Theoretical bounds for tomographic clustering-based methodologies can be found in Chapter 6. The final two methodology chapters deal with two application-based case studies in resolving network characteristics. In Chapter 7, two methodologies for resolving the geographic location of Internet resources are described. For the geographic estimation of network objects, these methodologies use either an implicit or explicit restriction on areas with large population density. In Chapter 8, a novel model-based anomaly detection procedure is presented. Finally, in Chapter 9, the contributions of this thesis are summarized again and some future directions are explored.


Chapter 2

Related Works

In this chapter we place the contributions of the thesis in the context of prior research.

2.1 Network Topology Discovery from Incomplete Passive Measurements

While there have been many previous studies that have focused on developing methods for estimating network topology, a great deal of prior work in this area has focused solely on using active traceroute-like probes (e.g., [11, 13, 14, 21]). In each case, these studies highlight several challenges associated with this kind of approach, including the need for widely distributed nodes from which probes can be sent (i.e., to address the need for a broad perspective) and the difficult problem of interface disambiguation. A number of large topology mapping efforts that attempt to address the problem of limited perspective have been active for years, including the well-known Skitter [1] and Dimes [22] efforts. While the problem of interface disambiguation has been known since Paxson’s work in the mid-1990s [23], the recent study by Sherwood et al. demonstrates how problematic this issue can be when using standard disambiguation techniques [21]. Another study that is related to ours is by Magoni and Hoerdt [24]. In that paper, the authors describe a traceroute-based approach and encounter the same difficulties with perspective and router interfaces.


of active probe-based measurements to examine router topology structure. Acquisition of passive hop count measurements from packet traffic is a previously studied problem. While the deployment of specialized hardware on TAP’ed links (e.g., [25, 26]) could be used in our work, publicly available data sets almost always anonymize source IP addresses, making it impossible to relate measurements from multiple sites. An alternative form of passive packet measurements is that collected in network honeypots [27, 28, 29, 30]. Honeypots monitor routed but otherwise unused address space, so all traffic directed to these monitors is unwanted and almost always malicious. Honeypots do not solicit traffic; however, low-interaction sensors will respond to incoming connection requests in order to distinguish spoofed addresses. In this way they are not completely passive. However, monitors of large address segments can receive millions of connections per day from systems all over the world and therefore offer a unique and valuable perspective [31]. The unsolicited nature of honeynet traffic, coupled with the volume and wide deployment of monitors, makes it an attractive source of data for our work.

Passive measurements of routing updates have been previously used to establish intra-domain network maps [32], whereas our goal is to discover Internet-wide structure with simpler and more lightweight hop count measurements (the number of routers between two points in the network). The focus of our passive measurement network discovery work in Chapter 3 is on identifying Internet structure in terms of clusters of clients [33], shared paths [7, 6, 34, 5], and end-host to end-host distances [35, 36, 37, 38, 39].

2.1.1 Passive End Host Clustering

Clustering end hosts in the Internet in a topologically significant manner is a problem relevant to the creation of overlay networks [33, 40] and the geolocation of resources [3, 4, 9, 41]. The most relevant prior research [33] uses a BGP routing table-based approach to group IP addresses. In contrast to this prior approach, our methodology will rely not on IP addresses (which can be spoofed [42]), but on passively observed incomplete hop count measurements.


2.1.2 Passive Shared Path Estimation

Using delay-based tomographic methods, prior work (e.g., [5, 6, 7, 34]) has shown how shared logical routing paths can be estimated from a series of active measurements. The main focus of this prior work is relative comparisons between multiple paths in order to estimate the logical topology of a network. Our methodology will show how, using a small number of active probes, we can estimate the number of physical routers shared between two paths in the Internet.

2.1.3 Passive Network Embedding

Previous network embedding methods have considered the different problem of latency estimation between nodes. In [35, 36, 37, 38, 39], methods are proposed in which a set of $M$ landmark nodes are embedded in a low-dimensional Euclidean space, and then $MN$ latency measurements are made between each landmark node and all $N$ other nodes. While past studies have identified difficulties with some of the basic assumptions of embeddings (e.g., [43]), more recent work has shown them to perform quite well in practice [44]. Embeddings have also been proposed as a mechanism for topological inference [38, 39]. These approaches are based on hop-count measurements obtained using an exhaustive number of active probes between landmarks and all other nodes. In contrast, our proposed approach relies primarily on incomplete passive measurements between landmarks and our target end hosts and additionally a negligible number of active probes, resulting in a significantly lighter-weight approach to the problem. Our emphasis on passively collected data avoids the problems of using a large number of active measurements, which include the difficulty in generating real-time Internet topologies from these measurements and the prevalence of blocking of standard active probes by ISPs. The total number of active probes needed for our method will be shown to grow quadratically in the embedding dimension, making the method almost completely dependent on passive measurements. Our embedding methodology, unlike the prior work in [35, 38, 39], is designed to embed IP sources given very incomplete network measurements. In fact, due to their reliance on complete measurements, the previous methods in the area of network embedding are not directly comparable to our methodology.


2.2 Inferring Unseen Structure of the Network Core

In Chapter 4 we examine the characteristics of the Internet core given a limited number of measurements. Our consideration of “Internet core” is informed by prior topology mapping studies including [13, 14, 45]. While these papers provide various definitions of “core”, we believe that a strict definition is of less importance and ultimately arbitrary. The goal of our study is not to find specific boundaries, but to find as much of the central component of the Internet as possible. To that end, our definition is similar to what is given in [46]: roughly, that the core is bounded by routers that are greater than one IP hop beyond end hosts or border routers of stub autonomous systems.

Our work is also informed by our passive measurements work in Chapter 3. In that work, we propose methods for establishing Internet maps based on passive observations of hop counts in packets. While the idea of using inference methods to estimate incomplete hop counts is similar, the work in Chapter 4 differs in objective (unseen core inference), data (the use of active probes), and methods (unseen router estimation, matrix completion, unseen link estimation). Our network embedding work in Chapter 3 examines the problem of estimating pairwise hop counts using incomplete measurements to a set of landmarks. In contrast, the work in Chapter 4 will demonstrate a methodology for estimating pairwise hops using only massively incomplete pairwise measurements between the objects.

The first component in our unseen core estimation study focuses on determining how many previously unseen core routers would be found given an increase in the probing of the network. This problem is motivated by a standard problem in statistics, the “unseen species problem”, where given an incomplete observation, we try to estimate how much was missed. Classic results in [47] estimated the number of unseen species of moths in an environment given a limited observation, and the work in [48] estimates the total number of words Shakespeare knew given his collected works. Recently, methodologies in both [49] and [50] have examined the problem of unseen species estimation in the context of networking. Both of these methodologies are directed towards finding the total number of routers/links in a network given limited observations. While estimating the total number of unseen routers is an interesting problem, validating the results is an impossible task without the entire network available (infeasible when considering the Internet). Our work focuses on the problem of estimating how many additional routers would be found given a fractional increase in the probing infrastructure. This would be of interest to anyone trying to determine whether or not to continue probing a network to discover additional nodes. To the best of our knowledge, our work is the first attempt to estimate the increased coverage of the network found given a feasible number of additional measurements.
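To make the unseen species framing concrete, the Python sketch below computes the classical Good-Toulmin estimate of how many new distinct items would appear if the sample were enlarged by a fraction t of its current size. It is included only as a textbook illustration of this family of estimators; the estimator developed in Chapter 4 may differ, and the function name and router identifiers are hypothetical.

```python
from collections import Counter


def good_toulmin_new_items(observations, t):
    """Good-Toulmin estimate of new distinct items in a further t-fraction of sampling.

    observations -- iterable of item identifiers (e.g., router IDs seen in probes)
    t            -- relative size of the additional sample, 0 <= t <= 1
    """
    if not 0 <= t <= 1:
        raise ValueError("this sketch only supports 0 <= t <= 1")
    times_seen = Counter(observations)         # item -> number of sightings
    prevalence = Counter(times_seen.values())  # k -> number of items seen exactly k times
    # Alternating series: sum over k of (-1)^(k+1) * t^k * (items seen exactly k times).
    return sum((-1) ** (k + 1) * t ** k * count for k, count in prevalence.items())


# Hypothetical sightings: r1 observed twice, r2 and r3 observed once each.
sightings = ["r1", "r1", "r2", "r3"]
estimate = good_toulmin_new_items(sightings, 0.5)
```

The estimate is driven mostly by items seen exactly once: many singletons suggest that further probing is still likely to uncover new routers.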

The next component in our unseen core estimation study shows how the unobserved link lengths between core routers can be estimated. The work uses a set of Internet probes to construct a “hop count matrix” with each element containing the number of routers between two points in the network. Due to the limited number of probes sent throughout the network, this matrix is very sparsely populated. Recent work in [51] has shown that matrices of size $N \times N$ and of rank $r$ can be exactly reconstructed with only $k$ known elements, where $k \sim O(N^{1.2} r \log N)$. Due to the very large size of the matrix, our work uses an efficient matrix factorization method from [52]. We use these prior results in an attempt to infer the unobserved path lengths between arbitrary core routers. By expanding upon these techniques, we develop a novel methodology for estimating unseen link locations in the network, an issue previously unexplored in Internet literature.

Our final component of the unseen core study is a targeted probing methodology that directs the user towards areas of the network that contain the most uncertainty given the current set of measurements. Our targeted probing methodology shares a goal with prior work on the DoubleTree algorithm [53], an intelligent probing mechanism devised for the purpose of sampling end hosts in the Internet. The DoubleTree algorithm uses the specific tree topology characteristics and specially crafted probes to limit the number of probes needed to discover the topology. In contrast to this prior work, our methodology focuses on the reduction of the number of source-destination pairs used to probe the network using standard off-the-shelf techniques (e.g., traceroute), not on the crafting of special probes to minimize measurement load. We offer our targeted probing techniques as validation that the unseen core techniques of inferring unseen core routers and unseen core links are correctly revealing areas of particular uncertainty in the network, where increased probing would result in greater understanding of the current topology of the Internet.

2.3 Toward the Practical Use of Network Tomography

The work in Chapter 5 introduces a new tomographic methodology for resolving the tree-based logical routing topology in a network. The initial work most directly related to the research in this chapter is the set of hierarchical clustering methodologies explored in [5, 6, 7, 34]. The main limitation of these methodologies is the requirement of acquiring the entire covariance matrix (e.g., $O(N^2)$ measurements given $N$ end hosts in the topology). The hierarchical clustering methodology will be considered the worst-case probing bound, as it performs exhaustive probing on the set of end hosts in the network. This is due to the decoupling of topology measurements and topology inference, where no information from prior measurements is used to inform new measurements and topology inference is performed completely separately from the measurement process.

A more efficient probing methodology is the Sequential Topology Inference algorithm from [54]. This work sequentially builds the logical tree structure and leverages the current estimated logical tree structure to determine where the next probe pair measurements should be performed. This work couples topology inference and measurement into one process by exploiting the tree structure of the topology. For a balanced $\ell$-ary tree (a balanced tree where each non-leaf node has exactly $\ell$ children), this reduces the number of probes needed from $O(N^2)$ for hierarchical clustering to $O(N \ell \log_\ell N)$ for the Sequential Topology Inference algorithm.
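To give a feel for the gap between these two growth rates, the short Python sketch below evaluates the number of distinct end-host pairs (which grows quadratically) against the scaling term N ℓ log_ℓ(N). Both are order-of-growth expressions with hidden constants ignored, so the numbers are indicative only; the function names and the chosen values of N and ℓ are assumptions for illustration.

```python
import math


def exhaustive_pair_count(n_hosts):
    """Number of distinct end-host pairs, which grows as O(N^2)."""
    return n_hosts * (n_hosts - 1) // 2


def sequential_probe_scale(n_hosts, branching):
    """Scaling term N * l * log_l(N) for a balanced l-ary tree (constants ignored)."""
    return n_hosts * branching * math.log(n_hosts) / math.log(branching)


# Hypothetical example: 1024 end hosts under a balanced binary logical tree.
n_hosts, branching = 1024, 2
pairs = exhaustive_pair_count(n_hosts)
scale = sequential_probe_scale(n_hosts, branching)
```

For these assumed values the pair count is roughly 25 times larger than the sequential scaling term, and the ratio widens as N grows.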

Our ordering-based method will show how improvements to this performance can be obtained by exploiting the structure of not just the tree topology, but also the topology measurements. We will show how our methodology can further reduce the number of probes by roughly a factor of 2 compared to this current state-of-the-art. This improvement is the result of considering the “ordering” of the end hosts considered, previously referred to as a topological sort [55]. The idea of a topological sort has been explored previously in sensor network literature in [56], where a topological sort of the nodes in a sensor network provides efficient routes through the network with lower power consumption. Due to the focus on wire-line networks in this work, we are not able to choose the routing. Instead we will use a modified version of topological sorting to efficiently reconstruct the logical routing from Internet measurements.

2.4 IP Geolocation using Population Data

Considerable prior work has been done on the subject of IP geolocation [3, 4, 9, 41]. While we are informed by this work and our motivation for highly accurate estimates is the same, the geolocation methodologies described in Chapter 7 take several steps to improve estimation accuracy versus prior algorithms. Unlike the methodologies of [3, 4], no traceroute probes are necessary in either geolocation methodology. This avoids the problems of interface disambiguation [11, 21] and dependency on unreliable unDNS naming conventions that are sometimes used for geolocation [12].

2.4.1 NBgeo Related Work

Recent work in the machine learning literature has shown how complicated classification problems with many degrees-of-freedom can be broken down into several lower-dimensional problems using a technique called Naive Bayes [57]. Empirical work in [58] and [59] has shown considerable improvement on classification using Naive Bayes even against more complicated classification techniques. We will exploit Naive Bayes and standard techniques in nonparametric statistics literature [60] to develop our novel NBgeo geolocation algorithm. The empirical results in [61] showing the super-linear relationship between the population count of a geographic area and the number of routers located in that area will help inform our Naive Bayes methodology. This property will be exploited using population density and geographic data (size, adjacency) of each county acquired using the publicly available databases from the U.S. Census website [18].

2.4.2 PinPoint Related Work

While our Naive Bayes methodology is a first step in exploiting population information and a known partitioning of the geography, this work is expanded upon in the PinPoint algorithm to account for situations where either no population data is known or a natural partitioning of the geography is unavailable. Our PinPoint algorithm also avoids the need for latency measurements from a shared infrastructure, common to the methodologies of [9, 41]. Latency measurements from shared measurement infrastructure, such as PlanetLab [40], have been found to be highly variable [62], which can cause bias in geolocation results. PinPoint instead relies primarily on hop count values from a set of monitors, avoiding this shared measurement infrastructure latency variability problem. Hop counts can be easily established by examining TTL values in IP packets using the method described in [63].
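One common heuristic for turning an observed TTL into a hop count is sketched below in Python: assume the sender used the smallest widely used initial TTL that is at least as large as the observed value, and take the difference. This is an illustration under that assumption and is not necessarily the exact procedure of [63]; the set of candidate initial TTLs and the function name are hypothetical choices.

```python
# Assumed set of default initial TTL values used by common operating systems.
COMMON_INITIAL_TTLS = (32, 64, 128, 255)


def infer_hop_count(observed_ttl):
    """Estimate the number of routers traversed from a received packet's TTL."""
    if not 0 < observed_ttl <= 255:
        raise ValueError("TTL must be in the range 1..255")
    # Pick the smallest candidate initial TTL that could have produced this value.
    initial_ttl = min(ttl for ttl in COMMON_INITIAL_TTLS if ttl >= observed_ttl)
    # Each router on the path decrements the TTL by one.
    return initial_ttl - observed_ttl
```

For an observed TTL of 60 this yields 64 − 60 = 4 routers, matching the example in Figure 1.2. The heuristic can misjudge long paths or hosts that use an unusual initial TTL.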

The PinPoint methodology is also informed by prior work on low-dimensional embedding of observed pairwise distances. Commonly referred to as Network Coordinate Systems, low-dimensional embedding problems in networking literature have been well studied over the past several years [37, 64, 65, 66, 67]. The goal of these studies is to establish a method for accurately estimating latencies between arbitrary hosts in the Internet. A common problem in previous network coordinate algorithms is triangle inequality violations caused by inaccurate long latencies under consideration [68]. These violations are due to the underlying network manifold structure not returning direct line-of-sight measurements for every observation. To limit or avoid this problem altogether, PinPoint considers only short distance (low latency and small hop count) measurements in the embedding algorithm. We hypothesize that these measurements are more likely to be highly correlated to the line-of-sight in the network and therefore avoid triangle inequality violations.

Finally, the Network Time Protocol (NTP) infrastructure plays an intrinsic role in our PinPoint algorithm study. NTP itself was developed by Mills to enable hosts to tightly synchronize their clocks [69]. The protocol specifies that synchronization will be facilitated by a widely deployed hierarchical infrastructure of time servers. At the top of the hierarchy are stratum 0/1 servers, which use either GPS or atomic clocks as their source. A direct consequence of GPS in NTP is a large set of servers distributed throughout the world with precisely known locations, and with the capability to respond to measurement requests. To the best of our knowledge, this is the first work using NTP for IP geolocation. However, PinPoint does not rely exclusively on NTP, and in future empirical studies, we expect to add other nodes to our landmark database including DNS servers that report their locations.

2.5 Model-based Anomaly Detection

The focus of Chapter 8 is on anomaly detection in network time-series data. Initial anomaly detection work considered only single time-series data in isolation (e.g., a single link in a network). This work uses some transformation of the network data to distinguish between standard operating environment and “residual” anomaly energy. These methodologies included analysis using wavelets [70, 71], Exponentially Weighted Moving Average filters [72, 73], and Fourier filtering [71].
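As a simple illustration of the single time-series idea, the Python sketch below uses an Exponentially Weighted Moving Average as the model of normal behavior and flags samples whose residual from the forecast is large. The smoothing weight, the fixed threshold, the choice not to update the forecast on flagged samples, and the traffic values are all assumptions made for illustration and are not taken from [72, 73].

```python
def ewma_anomalies(series, alpha=0.3, threshold=3.0):
    """Return indices whose deviation from the EWMA forecast exceeds a threshold.

    alpha     -- smoothing weight given to the newest observation (assumed value)
    threshold -- absolute residual above which a sample is flagged (assumed value)
    """
    if not series:
        return []
    forecast = series[0]  # initialize the forecast with the first sample
    flagged = []
    for index, value in enumerate(series[1:], start=1):
        residual = value - forecast
        if abs(residual) > threshold:
            flagged.append(index)
            # Assumed policy: do not let a flagged sample contaminate the forecast.
            continue
        forecast = alpha * value + (1 - alpha) * forecast
    return flagged


# Hypothetical link-load samples with one spike at index 4.
traffic = [10.0, 10.5, 9.8, 10.2, 25.0, 10.1, 9.9]
spikes = ewma_anomalies(traffic)
```

A fixed threshold like this is exactly the kind of tuning parameter that drives the false alarm rate: set too low it flags ordinary variation, and set too high it misses smaller events.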

Initial network-wide anomaly detection work focused on the application of Principal Component Analysis (PCA) [19, 74, 75] to a collection of network time-series data. The methodology, originally described in [74], decomposes a traffic matrix into a set of vector components that capture the variance across all links or flows of the network. The components that resolve the highest variance across all links (e.g., the most standard components) are considered to represent standard operating characteristics of the network observed in the link data matrix, the “modeled traffic”. Meanwhile, the less dominant components represent residual traffic that is abnormal to the links in general, the “residual traffic”. The amount of traffic energy in this residual component determines whether or not an anomaly has occurred in the observed traffic on each link. The limitations of this PCA approach are well documented in [76]. In addition to having high sensitivity to tuning parameters, large anomalies in the network can corrupt the “modeled traffic” components and therefore cause obvious events to be ignored by the methodology. In addition, anomalies detected by PCA cannot be localized to the specific anomalous link or router. Finally, it can lead to masking, where one anomaly hides another.

The authors of the Distributed Spatial Anomaly Detection technique described in [20] recognize that one of the main limitations of the PCA approach is the necessity of communicating all flow information back to some centralized computation point. Using non-parametric statistics and False Discovery Rate (FDR) techniques [77], each router in the network generates just a small test statistic that is communicated for anomaly detection. The use of more sophisticated multiple hypothesis detection techniques, like FDR, allows for a better statistical detection rate than simple thresholding. One of the biggest limitations of this approach is the complete decoupling of the measurements in the time domain. Therefore, any temporal correlation between network anomaly events (the measurements at time t helping inform the events from measurements at t + 1) is ignored. In addition, the measurements considered are with respect to traffic volume only, with no discussion of how other network information (bytes, unique IP addresses, entropy measurements) could be intelligently fused into the framework. Finally, the detected anomalies are not necessarily points of interest to a network administrator or anything that might represent the known structure of anomalies in networks; they are simply events of traffic volume that are unlike anything found in the training set. This could be significantly biased by limited training data, introducing the possibility of a large number of reported false alarms. This situation may occur when events are unlike the training set observations and yet uninteresting from a network administration perspective. Other distributed approaches to anomaly detection exist [78, 79], but our focus is not distribution; rather, we aim to carefully treat the false alarm problem.

Our anomaly detection methodology will explo