Learn More, Sample Less: Control of Volume and Variance in Network Measurement

Nick Duffield    Carsten Lund    Mikkel Thorup
AT&T Labs–Research, 180 Park Avenue, Florham Park, NJ 07932, USA
E-mail: {duffield,lund,mthorup}@research.att.com

Abstract

Consider a set of objects i = 1, 2, ..., n, each endowed with a size x_i and a color c_i. We wish to estimate the sums E(c) = Σ_{i: c_i = c} x_i of the sizes of objects of a given color c, from a sampled subset of objects. How should the sampling distribution be chosen in order to jointly control both the variance of the estimators Ê(c) of E(c) and the number of samples taken? This problem is motivated from network measurement, in which the x_i are the byte sizes of traffic flows reported by routers, and the c_i are the common properties of the packets of the flow, e.g., source and destination IP address. In this paper we propose a sampling scheme that optimally controls the volume of the measurements, and the variance of unbiased usage estimates Ê(c), while retaining usage detail down to the finest level of granularity in the colors. We provide algorithms for dynamic control of sample volumes and evaluate them on flow data gathered from a commercial IP network. The algorithms are simple to implement and robust to variation in network conditions. The work reported here has been applied in the measurement infrastructure of the commercial IP network. To not have employed sampling would have entailed an order of magnitude greater capital expenditure to accommodate the measurement traffic and its processing.

Keywords: Internet Measurement, Flows, Sampling, Estimation, Variance Reduction.

1 Introduction

Consider the following estimation problem. A set of objects i = 1, 2, ..., n is given, each object endowed with a size x_i ≥ 0 and a color c_i. We wish to estimate the sums E(c) = Σ_{i: c_i = c} x_i of the sizes of objects of a given color c, from a sampled subset of objects. How should the sampling distribution be chosen in order to jointly control both the variance of the estimators Ê(c) of E(c) and the number of samples taken? This is an abstract version of a practical problem that arises in estimating usage of network resources due to different users and applications.
The usage is determined from network measurements, and sampling is employed to control the resources consumed by the measurements themselves. We start this paper by explaining the motivation behind the stated sampling problem, and showing how the constraints imposed by the intended use of the measurements lead us to employ flow sampling.

1.1 Motivation

The need for detailed network usage data. The collection of network usage data is essential for the engineering and management of communications networks. Until recently, the usage data provided by

network elements (e.g. routers) has been coarse-grained, typically comprising aggregate byte and packet counts in each direction at a given interface, aggregated over time windows of a few minutes. However, these data are no longer sufficient to engineer and manage networks that are moving beyond the undifferentiated service model of the best-effort Internet. Network operators need more finely differentiated information on the use of their network. Examples of such information include (i) the relative volumes of traffic that use different protocols or applications; (ii) traffic matrices, i.e., the volumes of traffic originating from and/or destined to given ranges of Internet Protocol (IP) addresses or Autonomous Systems (AS's); (iii) the packet and byte volumes and durations of user sessions, and of the individual flows of which they are composed. Such information can be used to support network management, in particular: traffic engineering, network planning, peering policy, customer acquisition, usage-based pricing, and network security; some applications are presented in detail in [2, 11, 12]. An important application of traffic matrix estimation is to efficiently redirect traffic from overloaded links. Using this to tune OSPF/IS-IS routing one can typically accommodate 50% more demand; see [15]. Satisfying the data needs of these applications requires gathering usage data differentiated by IP header fields (e.g. source and destination IP address and Type of Service), transport protocol header fields (e.g. source and destination TCP/UDP ports), router information specific to a packet (e.g. input/output interfaces used by a packet), information derived from these and routing state (e.g. source and destination AS numbers), or combinations of these.
Collecting the packet headers themselves as raw data is infeasible due to volume: we expect that a single direction of an OC48 link could produce as much as 100GB of packet headers per hour, this estimate being based on statistics collected for the experiments reported later in this paper. Flow level statistics. In this paper we focus on a measurement paradigm that is widely deployed in the current Internet and that offers some reduction in the volume of gathered data. This is the collection—by routers or dedicated traffic monitors—of IP flow statistics, which are then exported to a remote collection and analysis system. (Some alternative paradigms are reviewed in Section 8). Most generally, an IP flow is a set of packets that are observed in the network within some time period and that share some common property. Particularly interesting for us are "raw" flows: a set of packets observed at a given network element, whose common property is the set of values of those IP header fields that are invariant along a packet's path. (For example, IP addresses are included, Time To Live is not). The common property may include joins with state information at the observation point, e.g., next hop IP address as determined by destination IP address, and routing policy. More generally, the common property may aggregate ranges of IP header fields (e.g. over source subnet rather than individual source IP address), producing non-raw flows. It is handy to call the set or sets of values that form the common property shared by the packets of a flow a "color". A color that defines a raw flow will be called "primary". The granularity at which a router differentiates colors is configurable. To satisfy the requirements of the different applications mentioned above, differentiation of colors should be as fine as possible. In exploratory studies the colors of interest may be determined only after measurements have taken place.
Thus, measurement should distinguish usage by primary color, by collecting raw flows. A router keeps statistics—including total bytes and packets—for active flows passing through it. When a packet arrives at the router, the router determines if a flow is active for the packet's color. If so, it updates statistics for that color, incrementing packet and byte counters. If not, it instantiates a new set of counters for the packet's color. A router will designate a flow as terminated if any of a number of criteria are met.

Figure 1: FLOW OF MEASUREMENT TRAFFIC, AND POTENTIAL SAMPLING POINTS: (left to right) packet capture at router; flow formation and export; staging at mediation station; measurement collector.

When the flow is terminated, its statistics are flushed for export, and the associated memory released for use by new flows. Termination criteria can include (i) timeout: the interpacket time within the flow will not exceed some threshold; (ii) protocol: e.g., observation of a FIN packet of the Transmission Control Protocol (TCP) [24] that is used to terminate a TCP connection; (iii) memory management: the flow is terminated in order to release memory for new flows; (iv) aging: to prevent data staleness, flows are terminated after a given elapsed time since the arrival of the first packet of the flow. Flow definition schemes have been developed in research environments, see e.g. [1, 4], and are the subject of standardization efforts [17, 26]. Reported flow statistics typically include the properties that make up the flow's defining color, its start and end times, and the number of packets and bytes in the flow. Examples of flow definitions employed as part of network management and accounting systems can be found in Cisco's NetFlow [3], Inmon's sFlow [16], Qosient's Argus [25], Riverstone's LFAP [28] and XACCT's Crane [29]. Flow statistics offer considerable compression of information over packet headers, since the flow color is specified once for a given flow. In the earlier OC48 example, the volume of flow statistics is roughly 3GB per hour, i.e., up to a roughly 30-fold volume decrease compared with packet headers. Measurement collection architecture. Figure 1 illustrates an architecture for measurement collection that has been realized in a commercial network. Flow level summaries are constructed at a router from the packets that traverse it.
The summaries are transmitted to the measurement collector, possibly through one or more mediation stations. These may be employed for reasons of reliability: measurement export commonly uses the User Datagram Protocol (UDP), which has no facility to detect and resend measurements lost in transit. However, export to a nearby mediation station over a Local Area Network (LAN) may in practice be very reliable, as compared with export over the Internet to a distant collection point. The mediation station may then reliably transmit measurements to the ultimate collection point, the requisite memory and processing resources being cheaper in workstations than in the routers. The mediation station can also perform aggregation to support distributed network management applications. Data volumes and the need for sampling. The volumes of flow summaries generated by a large network with many interfaces are potentially massive, placing large demands on memory resources at routers, storage

and computation resources at the collector, and transmission bandwidth between them. This motivates the use of sampling to reduce data volume, while at the same time maintaining a representative view of raw flow data. Flow summaries can be sampled by any system on the path from generation to collection. This may be necessary for scaling reasons. The architecture forms a tree, with multiple routers sending measurements to each of several mediation stations. Progressive resampling can be applied at each stage to avoid "implosion" as data progresses up the tree. At the collector or mediation station, a typical analysis application executes a query that performs a custom aggregation (i.e. one not previously performed) over all flows collected in a given time period. Here, the role of sampling is to reduce the execution time of the query. For final storage at the collector, sampling can be used to permanently reduce the volume of historical flow data (non-sampled flows would be discarded) while maintaining the ability to execute custom queries at the finest differentiation of colors. Effectiveness of sampling methods. An elementary sampling method is to uniformly select 1 in N of the flows, either independently (i.e. each object is selected independently with probability 1/N) or deterministically (objects N, 2N, ... are selected and all others are discarded). The statistical properties of any proposed sampling scheme must be evaluated. Do inferences drawn from the samples reflect the properties of the raw data stream? What is the impact of sampling on the variance of usage estimates? A striking feature of flow statistics is that the distributions of the numbers of packets and bytes in flows are heavy-tailed [14]. This property contributes to the roughly 30-fold average compression of data volumes when passing from packet headers to flows that was remarked upon above.
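The variance penalty of uniform sampling on heavy-tailed sizes can be seen in a few lines of simulation. The sketch below is illustrative only (the flow sizes are synthetic Pareto-like draws, not the paper's dataset): each flow is kept with probability 1/N and the sampled total is scaled by N, giving an unbiased but highly variable estimate.

```python
import random

def uniform_sample_estimate(sizes, N, rng):
    # 1-in-N independent sampling: keep each flow with probability 1/N,
    # then scale the sampled total by N to estimate the true total.
    return N * sum(x for x in sizes if rng.random() < 1.0 / N)

rng = random.Random(1)
# Synthetic heavy-tailed flow byte sizes (Pareto-like, shape ~ 1.2).
sizes = [int(100 * (1.0 - rng.random()) ** (-1.0 / 1.2)) for _ in range(10000)]
true_total = sum(sizes)

estimates = [uniform_sample_estimate(sizes, 100, rng) for _ in range(200)]
mean_est = sum(estimates) / len(estimates)
spread = (max(estimates) - min(estimates)) / true_total
# The mean of many estimates tracks the truth, but individual estimates
# swing widely because a few huge flows dominate the total.
print(true_total, round(mean_est), round(spread, 2))
```

Because the inclusion or exclusion of the few largest flows moves the estimate by a large fraction of the total, the relative spread here is typically substantial even though the estimator is unbiased.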
Uniform sampling from heavy-tailed distributions is particularly problematic, since the inclusion or exclusion of a small number of data points can lead to large changes in usage estimates; these are subject to high variance due to the sampling procedure itself. Note that a sampling strategy that samples all big flows and a fair fraction of the smaller flows could reduce the estimator variance. This raises the question: which is the best such sampling strategy? This is the problem we study in this paper. Knowledge of sampling parameters. Sampling parameters used for flow selection must be known when the data is analyzed, in order that it can be interpreted correctly. For example, to estimate the byte rate in a raw packet stream from samples gathered through 1 in N sampling, we need to multiply the byte rate represented in the sampled stream by N. Since raw traffic volumes vary by location and time, we expect that sampling rates will have to vary in order to limit resource usage due to measurement to acceptable levels. For example, an unexpected surge in traffic may require dynamic lowering of sampling rates. Generally, then, sampling rates will not be global variables independent of the data. We believe sampling parameters must be bound to the data (e.g. to individual measurements, or inserted into the stream of measurements). Each entity that samples or resamples the flows must bind its sampling parameters to the data appropriately. We eschew keeping sampling parameters in a separate database to be joined with the data for analysis. This is not robust against bad synchronization between the two data sets, or undocumented manual changes of sampling parameters.

1.2 Contribution

Sampling requirements and the basic problem. The contribution of this paper is to study what makes a good flow sampling strategy. The goals are fourfold: (i) to constrain samples to within a given volume; (ii)

to minimize the variance of usage estimates arising from the sampling itself; (iii) to bind the sampling parameters to the data in order that usage can be estimated transparently; and (iv) with progressive resampling, the composite sampling procedure should enjoy properties (i)–(iii). To formalize the problem, recall the set of flows labeled i = 1, 2, ..., n, equipped with sizes x_i and colors c_i. We wish to sample them in such a way that we can estimate the total size of flows of color c, i.e., the sums E(c) = Σ_{i: c_i = c} x_i, knowing only the sampled sizes and their sampling probabilities, without the need to retain other information on the original set of flow sizes. We propose to continuously stratify the sampling scheme so that the probability that an object is selected depends on its size x. This attaches more weight to larger objects whose omission could skew the total size estimate, and so reduces the impact of heavy tails on variance. We must renormalize the sampled sizes x_i in order that their total over any color c becomes an unbiased estimator of E(c); this will be achieved by coordinating the renormalization with the sampling probabilities. Finally, we require a mechanism to tune the sampling probabilities in order that the volume of sampled objects can be controlled to a desired level, even in the presence of temporal or spatial inhomogeneity in the offered load of flows. Outline. Section 2 develops the basic theory of the sampling method. Section 2.1 establishes a general relationship between the sampling probabilities (as a function of object size) and the renormalization (applied to the sizes) in order that the estimates of E(c) be unbiased. We then turn to examine the relationship between sample volume and estimator variance due to sampling. In Section 2.2 we point out that uniform sampling offers no further ability to control the variance once the average sampling volume is fixed.
Instead, in Section 2.3, we show how the sampling probabilities should be chosen in order to minimize a cost function that takes the form of a linear combination of sample volume and estimator variance. In Section 2.4 we show that optimizing the cost function for the total traffic stream also optimizes for each color individually. We follow up with a number of subsidiary results. In Section 2.5 we extend the formalism to multidimensional size attributes, e.g. to use both packet and byte sizes of flows. In Section 2.6 we describe unbiased estimators of the variance of the sample volume and attribute total. In Section 2.7 we show that the set of sampling operations considered is closed under certain compositions; consequently, the total effect of a set of sampling operations applied successively (e.g. at a router, then at a mediation station) is statistically equivalent to a single sampling operation with appropriate parameters. In Section 3 we demonstrate how the method achieves our aims by applying it to datasets of NetFlow statistics gathered on the backbone of a major service provider. In Section 3.1 we show that the variance of usage estimation from flow summaries is greatly reduced by using size-dependent sampling as opposed to uniform sampling. In Section 3.2 we find that even when packet sampling is mandated at a router (e.g. by software speed constraints), any further desired reduction in measurement volume is best achieved (i.e. with noticeably smaller usage estimator variance) by forming flows and applying size-dependent sampling, as opposed to dispensing with flow formation altogether and just performing further packet sampling. In Section 3.3 we describe an efficient algorithm for size-dependent sampling used in the experiments that avoids the need for explicit pseudorandom number generators. In Section 4 we extend the method to the dynamic control of sample volumes.
The cost function described above contains a positive parameter z that encodes the relative importance we attach to constraining sample volumes as opposed to reducing sample variance. We present an iterative scheme by which z can be adjusted in order that the expected sample volume meets a given constraint. This is useful because even

a static flow length distribution may not be well characterized in advance. In this case it would be difficult to determine in advance the value of z needed in order to meet a given sampling volume constraint. We analyze the rate of convergence of the iteration. Substituting mean sample volume with the actual number of samples taken in the iterative scheme yields an algorithm to dynamically control sample volumes. This enables control of sample volumes under variation in the offered load of flows, e.g., during a sharp rise in the number of short traffic flows that commonly occurs during denial of service attacks [20]. Several variants of the control are discussed in Section 5. In Section 6 we illustrate the effectiveness of such controls with the NetFlow traces that exhibit large transient phenomena. An efficient randomized algorithm for a root finding problem that arises in our work is described in Section 7, along with a Large Deviation-based analysis of its performance under data resampling. The feasibility of a number of alternative measurement schemes and related work are discussed in Section 8. We conclude in Section 9 by outlining a current application in a commercial network of the work described here.

2 Sampling Theory

2.1 Sampling and renormalization

The key elements of our algorithm are size-dependent sampling, and renormalization of the samples. These are described by the following two functions. A sampling function is a function p: R_+ → (0, 1]. The interpretation of the sampling function is that an attribute of size x is to be sampled with probability p(x). P will denote the set of sampling functions. A renormalization function r is a function R_+ → R_+. A sampled size x is then renormalized by replacing it with r(x). Consider a set of n sizes {x_i}_{i=1,...,n} prior to sampling. Initially we can think of this set either as the sizes of all objects, or as the sizes of objects of a particular color. The total size of the n objects is
E = Σ_{i=1}^n x_i.    (1)

Suppose first that we wish to obtain an estimate of E from a subset of sampled values, and, generally, without needing to know the original number n of sizes. Consider an estimate of E comprising a sum of sampled sizes that are then renormalized:

Ê = Σ_{i=1}^n w_i r(x_i),    (2)

where the {w_i}_{i=1,...,n} are independent random indicator variables, w_i taking the value 1 with probability p(x_i) and 0 with probability 1 − p(x_i). Denote by E[Ê] the expected value of Ê over the distribution of the random variables {w_i}. In what follows we shall treat the sizes {x_i} as a given deterministic set: randomness resides only in the {w_i}. Ê is said to be an unbiased estimator of E if E[Ê] = E. Since E[w_i] = p(x_i), Ê is unbiased if

E = Σ_{i=1}^n x_i = Σ_{i=1}^n p(x_i) r(x_i) = E[Ê].    (3)

This happens for all collections {x_i} if and only if

r(x) = x / p(x),  for all x.    (4)

In the rest of this paper we will assume that (4) holds.

2.2 Bounding sample volumes

The sample volume, i.e., the number of samples, is N̂ = Σ_{i=1}^n w_i. Suppose now that we wish to impose a known limit on the expected sample volume E[N̂]. The expected number of samples drawn from n sizes is E[N̂] = Σ_{i=1}^n p(x_i), which is less than some target M if Σ_{i=1}^n p(x_i) ≤ M. For this to be true for all collections {x_i} requires that p(x_i) = M/n for all i: the sampling probability p(x) is independent of x. The choice of a constant sampling function has an interesting and potentially troubling consequence: there is no further latitude in the choice of the sampling function that could be used to attain some other desirable statistical property of Ê, such as keeping its variance small. Indeed, our motivating example comes from cases where the {x_i} have a heavy-tailed distribution, and hence the inclusion or exclusion of a small number of sample points can have a great influence on Ê. One approach to this problem would be to explicitly take into account the distribution of x when choosing the sampling function. This would entail choosing a non-constant p such that the report volume constraint would be satisfied for a class of size distributions, although not for all. This approach has the potential disadvantage that if the size distribution does not fall into the given class, the constraint will no longer be satisfied. Furthermore, it is not clear in examples that the byte or packet distributions can be characterized in a universal fashion, independently of changes in network technologies and applications.
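As a quick numerical check of the unbiasedness condition (4), the sketch below (with made-up sizes and an arbitrary sampling function, not taken from the paper) averages many independent runs of the estimator Ê = Σ w_i r(x_i) with r(x) = x/p(x):

```python
import random

def estimate_total(sizes, p, rng):
    # Keep size x with probability p(x); renormalize each survivor by
    # r(x) = x / p(x), so the sum is an unbiased estimate of the total.
    return sum(x / p(x) for x in sizes if rng.random() < p(x))

rng = random.Random(0)
sizes = [5, 40, 3, 1000, 12, 7, 250, 2]
# Any sampling function with 0 < p(x) <= 1 works; this one is arbitrary.
p = lambda x: min(1.0, (x / 100.0) ** 0.5)

reps = 20000
avg = sum(estimate_total(sizes, p, rng) for _ in range(reps)) / reps
print(sum(sizes), round(avg, 1))  # the average approaches the true total
```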
2.3 Static control of sample volume and variance

Instead, we take an approach which allows us to jointly control the volume of samples N̂ and the variance of the estimator Ê without assumptions on the distribution of the sizes {x_i}. We form a cost function C that embodies our aim of controlling the variance of Ê and the expected number of samples E[N̂]. For z ∈ R_+ we define for each p ∈ P

C_z(p) = Var(Ê) + z² E[N̂].    (5)

z is a parameter that expresses the relative importance attached to minimizing E[N̂] versus minimizing Var(Ê). The variance of Ê is

Var(Ê) = Var(Σ_{i=1}^n w_i r(x_i)) = Σ_{i=1}^n r²(x_i) Var(w_i) = Σ_{i=1}^n r²(x_i) (1 − p(x_i)) p(x_i) = Σ_{i=1}^n x_i² (1 − p(x_i)) / p(x_i).    (6)

Thus

C_z(p) = Σ_{i=1}^n [ x_i² (1 − p(x_i)) / p(x_i) + z² p(x_i) ].    (7)

Define p_z ∈ P by p_z(x) = min{1, x/z}.

Theorem 1  C_z(p_z) ≤ C_z(p) for all p ∈ P and {x_i}, with equality only if p = p_z.

Proof: q ↦ x²(1 − q)/q + z²q is strictly convex on (0, ∞) and minimized at q = x/z. Thus on (0, 1] it is minimized by p_z(x), and the result follows.

We can interpret the form of p_z as follows. High values of the size x, those greater than the threshold z, are sampled with probability 1, whereas lower values are sampled with progressively smaller probability. Thus the contributions of higher values of x to the estimator Ê have greater reliability than smaller values. This is desirable since uncertainty about the higher values can adversely impact the variance of Ê, especially for heavy-tailed distributions of x. We note that for the sampling function p_z, the renormalization function takes the simple form r_z(x) = max{x, z}. We write Ê_z to denote the specific random variable arising from the choice of z.

2.4 Sampling and color partitions

Recall from the introduction that, in our typical application, we think of the packets as being colored, and that our aim is to estimate the total sizes of the packets in each color class c. If c_i is the color of packet i, then E(c) = Σ_{i: c_i = c} x_i is the total size of packets with color c, and our unbiased estimator is Ê(c) = Σ_{i: c_i = c} w_i r(x_i); that is, Ê(c) is obtained simply by summing the sampled renormalized sizes of color c. Let N̂(c) = Σ_{i: c_i = c} w_i be the number of sampled packets with color c. By linearity of expectation, E[N̂] = Σ_c E[N̂(c)]. Also, since each object is picked independently, the Ê(c) are independent for each c, and hence Var(Ê) = Σ_c Var(Ê(c)). Thus,

C_z(p) = Var(Ê) + z² E[N̂] = Σ_c { Var(Ê(c)) + z² E[N̂(c)] }.    (8)

That is, our objective function C_z(p) decomposes over the color classes, so minimizing it also minimizes the cost locally over each color class. One could easily imagine scenarios where one wanted different objectives for different colors.
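The optimal scheme is straightforward to implement. The following sketch (with hypothetical flow records, not the paper's data) applies p_z(x) = min{1, x/z} with renormalization r_z(x) = max{x, z}, and accumulates the per-color estimates Ê_z(c) described above:

```python
import random
from collections import defaultdict

def threshold_sample(flows, z, rng):
    """Size-dependent sampling: keep (color, x) with probability min(1, x/z)
    and report the renormalized size max(x, z) for each survivor."""
    return [(color, max(x, z)) for color, x in flows
            if rng.random() < min(1.0, x / z)]

rng = random.Random(7)
# Hypothetical flows: (color, byte size). Flows with x >= z always survive.
flows = [("web", 20000), ("web", 150), ("dns", 60), ("mail", 900),
         ("web", 45), ("dns", 75), ("mail", 30000), ("web", 800)]
z = 1000.0

true_totals = defaultdict(float)
for color, x in flows:
    true_totals[color] += x

# Averaging per-color estimates over many runs shows unbiasedness per color.
reps = 5000
est_totals = defaultdict(float)
for _ in range(reps):
    for color, rx in threshold_sample(flows, z, rng):
        est_totals[color] += rx / reps
print(dict(true_totals))
print({c: round(v) for c, v in est_totals.items()})
```

Note that the sampler itself never inspects colors: one pass over the flows serves every coloring that may later be applied at analysis time.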
However, in our application, the sampling device is not assumed to distinguish colors, and in addition, we imagine that our samples may later be analyzed with respect to many different colorings. Theorem 1 shows that the strength of our sampling strategy is that it is the unique optimal strategy with respect to (8), no matter the coloring. Indeed, finer control of sampling by color, within a given volume constraint, can only increase estimator variance. Suppose that we wish to control individually the sample volume M_c arising from each color c, while achieving the same total sample volume M = Σ_c M_c over all colors. This would be achieved by applying a differing threshold z_c to the sampling of packets from each color c. The price of imposing finer-grained volume control is to increase the aggregate variance of the Ê(c). The following is a direct corollary of (8) and Theorem 1, on noting that the sampling functions p_{z_c} ≠ p_{z*} are suboptimal for C_{z*}(p).

Corollary 1  Let {A_c : c = 1, ..., k} be a partition of {1, ..., n}, for each member c of the partition let M_c ≤ |A_c| be a target sample volume, and let z_c solve z = Σ_{i ∈ A_c} min{x_i, z} / M_c. Let z* solve z = Σ_{i=1}^n min{x_i, z} / Σ_c M_c. Let Ê_z(c) = Σ_{i ∈ A_c} w_i r_z(x_i) with the w_i distributed according to p_z. Then

Σ_c Var(Ê_{z*}(c)) ≤ Σ_c Var(Ê_{z_c}(c)).    (9)

Proof: Let N̂_z(c) = Σ_{i ∈ A_c} w_i with the w_i distributed according to p_z. By Theorem 1, for each c,

Var(Ê_{z*}(c)) + (z*)² E[N̂_{z*}(c)] ≤ Var(Ê_{z_c}(c)) + (z*)² E[N̂_{z_c}(c)].    (10)

But Σ_c E[N̂_{z*}(c)] = E[N̂_{z*}] = M = Σ_c M_c = Σ_c E[N̂_{z_c}(c)], and hence the result follows on summing (10) over c.

2.5 Multidimensional sizes

In practical cases, the sizes x can be multidimensional. For example, a flow record can contain both byte and packet counts for the flow. Estimation of multiple components of the size may be required. One approach would be to apply the foregoing analysis to each component independently. However, this would require a separate sampling decision for each size, leading to the creation of multiple sampled streams. This is undesirable, both for the extra computation required, and the increased volume of samples resulting. Another approach is to base sampling on one component x_i (e.g. flow byte size) and then use the factor 1/p_z(x_i) to renormalize other components y_i. Although this results in an unbiased estimator for Σ_i y_i, it does not minimize the corresponding cost function for the components y_i. Instead, we outline a simple extension of the previous section that creates only a single sample per size. Consider a multidimensional size x = (x(1), ..., x(d)) ∈ R_+^d presented by each potential sample. Analogous to the one-dimensional case we now have a sampling function p: R_+^d → (0, 1] and a renormalization function r: R_+^d → R_+^d. Given a set of sizes {x_i}, the binary random variable w_i takes the value 1 with probability p(x_i). Similarly to before, one can show that X̂ = Σ_i w_i r(x_i) is an unbiased estimator of X = Σ_i x_i for all size sets {x_i} if and only if r(x) = x / p(x) for all x. For a vector of thresholds (z(1), ..., z(d)), let τ^j ∈ R_+^d denote the vector with components (z(1)^j, ..., z(d)^j). Suppose now we wish to minimize a cost function of the form

C_τ(p) = Var(X̂ · τ^{-1}) + Σ_{i=1}^n p(x_i).    (11)

Similarly to Theorem 1, one finds that C_τ(p_τ) ≤ C_τ(p) for all sampling functions p, for the function p_τ(x) = min{1, x · τ^{-1}}.
Consider replacing the variance of the single variable ‘„“7”<ѶÕÔ × with the sums of variances of the form J ‘›“ ” FEZFy·,GU¤^+Fy·9GG . The corresponding cost function is. Á. ‚Ô — F_QG›2. x. v. Pr. x. v. ‘›“ ”<EZFy·,G • ?*Pr _iF(Ç ? G : ^ — Fy·9G. (12). Á 2  ¢¡£!o˜476 Ù Ç — ÕÔ × — p . The sampling function _ that minimizes  Ô — F_QG is _ —Ô ‹ 2.6 Estimating variance from samples Although Theorem 1 allows us to minimize a linear combination of sample volume and variance, it does not provide a means of estimating the variance of RE direct from the sample since the variance is expressed as a sum over all sizes >!? , not just those that have been sampled. However, we can apply the same ideas as were Ú|Û used above to find an unbiased estimate of E , but now finding an unbiased estimate R of ‘„“7”¤E . v ÚÛ Ú|Û The restriction that R be computed only from the samples requires it to be of the form R 2 J ?*Pr z ?/Ü F > Ú Û Ü \ >  ? p 2’‘›“ ” RE for all o . This requirement is equivfor some function . For unbiasedness that  R v v > we require Ü + > ? \ > ? \ > ? + > ? ? * ?  P r * ?  P r _]F G F G›2 J alent to J — F 4œ}0_iF GGU_]F G for all o >\?qp . Hence. Ü F> „ G 2ƒÝ. >. _iF > G Þ. —. F 4œ}0_iF > GG 6. x Ú Û R 2. and 9. > ? — ? Ý F 4œ}0_]F >+? GG : z _]F >+? G¤Þ. (13). ?G.

The specific forms of V̂ and E[V̂] arising from the choice p = p_z reduce to

V̂_z = Σ_i w_i z (z − x_i)^+,  and  E[V̂_z] = Var(Ê_z) = Σ_i x_i (z − x_i)^+,    (14)

where y^+ = max{0, y}. It is interesting to note that there is no contribution to V̂_z for x_i ≥ z. This is because such x_i are selected with probability 1, and hence their contribution to Ê has no variance. Indeed, the summand in the expression for Var(Ê_z) is zero for x ≥ z, and maximized when x = z/2. By similar reasoning we can find an unbiased estimator Ŵ of the variance of N̂. Writing Ŵ = Σ_i w_i u(x_i) and requiring E[Ŵ] = Σ_i p(x_i) u(x_i) = Σ_i p(x_i)(1 − p(x_i)) = Var(N̂) for all {x_i}, we obtain u(x) = 1 − p(x) and hence

Ŵ_z = Σ_i w_i (1 − x_i/z)^+ = z^{-2} V̂_z,  and  Var(N̂_z) = Σ_i (x_i/z)(1 − x_i/z)^+ = z^{-2} Var(Ê_z).    (15)

As with (14), terms with x_i ≥ z vanish identically, and the largest contributions to Ŵ_z arise when x = z/2.

2.7 Composing sampling operations

The sampling and renormalization operations defined above can be composed. We envisage such composition when data is fed forward through a number of processing stages operating in different locations, or under different volume or processing constraints. Consider then a composition of m sampling procedures controlled by thresholds z_1 ≤ ··· ≤ z_m. The threshold increases at each stage of the composition, corresponding to progressively finer sampling. Let N̂_{z_1,...,z_j} denote the number of samples present after passing through the first j samplers in the composition, and Ê_{z_1,...,z_j} the corresponding unbiased estimator of E. The following result says that the expected sample volume and variance of Ê at stage j are equal to those that would result from a single sampler controlled by threshold z_j.
Given a set S of sizes, let Ω_z(S) denote the set of sampled and renormalized sizes obtained using the threshold z, i.e.,

  Ω_z(S) = { max{z, x} : x ∈ S, w_x = 1 },   (16)

where the w_x are independent random variables taking the value 1 with probability p_z(x) and 0 otherwise.

Theorem 2  Let 0 < z_1 ≤ ... ≤ z_m. Then for each set S of sizes, Ω_{z_m}(Ω_{z_{m−1}}(... Ω_{z_1}(S) ...)) has the same distribution as Ω_{z_m}(S). In particular, E N̂_{z_1,...,z_m} = E N̂_{z_m} and Var X̂_{z_1,...,z_m} = Var X̂_{z_m}.

Proof: Any attribute that survives as far as stage k of sampling is renormalized according to r_{z_1,...,z_k}(x) := r_{z_k} ∘ ... ∘ r_{z_1}(x) = max{z_1, ..., z_k, x} = max{z_k, x} = r_{z_k}(x). It survives with probability p_{z_1,...,z_k}(x), which obeys the recursion p_{z_1,...,z_k}(x) = p_{z_1,...,z_{k−1}}(x) p_{z_k}(r_{z_1,...,z_{k−1}}(x)). But p_{z_k}(r_{z_1,...,z_{k−1}}(x)) = min{1, max{z_{k−1}, x}/z_k}. Since min{1, x/z_{k−1}} min{1, max{x, z_{k−1}}/z_k} = min{1, x/z_k} when z_{k−1} ≤ z_k, a simple induction shows that p_{z_1,...,z_k}(x) = p_{z_k}(x). Thus the renormalization and sampling probability for the chain are determined by the last threshold z_k, and the stated properties follow immediately.

3 Comparison of Static Sampling Strategies

In this section we perform an experimental comparison of the proposed optimal flow sampling strategy with other sampling strategies. We show that, of all the methods compared, optimal flow sampling provides the least estimator variability for a given volume of flow measurements.

Figure 2: COMPARING OPTIMAL AND UNIFORM SAMPLING. Left: optimal, X̂(c) and X(c) vs. color index (curves: nosampling, sample100K). Right: uniform, X̂_unif(c) and X(c) vs. color index (curves: nosampling, uniform33). Byte counts are plotted on a log scale. Note the increased variability of X̂_unif(c) as compared with X̂(c).

3.1 Comparison of Uniform and Optimal Flow Sampling

In this section we compare the experimental performance of optimal sampling and uniform sampling on a set of flow summaries. This comparison is relevant for selecting a sampling method for the selective export of flows, at a mediation station or measurement collector, or within a database used to store flows. Our data for the experiments comprised a trace of NetFlow statistics collected during a two hour period from a number of routers in a commercial IP backbone. The total sample comprised 108,064,914 individual flow records. The mean number of bytes per flow was 10,047. We partitioned the total sample into 3,500 colors, determined by the router interface that generated the record, and by a partition of the address space of the external IP address of the flow record (source address for incoming flows, destination address for outgoing flows). Thus the exact partition was somewhat arbitrary, but it satisfied the following criteria: (i) the colors had large variation in total bytes (from less than 100 bytes to more than 10^11 bytes), (ii) the partition was not based on the byte values, and (iii) there were a reasonably large number of colors. Our implementation of the optimal sampling strategy is described in Section 3.3 below. Our aim here is to exhibit the relative accuracy of the sampling methods in estimating the byte sizes of the individual colors, and particularly of the “important” colors, i.e., those with large byte sizes.
We compared two sampling strategies: uniform33, which is 1 in N sampling with N = 33, and sample100K, which is optimal sampling with threshold z = 100,000. We chose these (relative) values in order that the total number of records sampled in each case would be similar: 3.03% and 3.04% of the records respectively for the two strategies. We compared three sets of numbers: X(c), the actual number of bytes for color c; X̂(c), the number of bytes for color c as determined by optimal sampling and renormalization with sample100K; and X̂_unif(c), the number of bytes for color c as determined by uniform33 sampling and renormalization through multiplication by N = 33. In Figure 2 we plot the nonsampled byte counts X(c) and the sampled byte counts (X̂(c) in the left plot, X̂_unif(c) in the right plot) against color index c, with the colors sorted in decreasing order of X(c). On the vertical axis the byte counts are displayed on a log scale. When no record from a color was sampled, we set the log

byte count to 1: these are the points that lie on the horizontal axis. Note that otherwise X̂(c) ≥ 100,000, since sample100K never yields a value less than the threshold z.

Figure 3: COMPARING OPTIMAL AND UNIFORM SAMPLING: ERROR BARS: portions of the corresponding plots from Figure 2 with error bars for sampling variability added.

Figure 3 is similar to Figure 2 except that we have zoomed in on a small number of colors in the middle of the range. We have added error bars for the sampling error, corresponding to 2 standard deviations computed using (14). In Figure 4 we similarly compare the strategies through the relative error, i.e., (X(c) − X̂(c))/X(c) in the left plot, and (X(c) − X̂_unif(c))/X(c) in the right plot. The vertical axes display the relative error as a percentage, chopped to lie within ±100%. It is clear from the figures that sample100K achieves a considerable reduction in estimator variance over uniform33 for colors with a fair amount of traffic, say at least 1,000,000 bytes. This is exactly the desired behavior: we get good estimates for the important colors. The figures also show how badly uniform sampling can perform in the presence of heavy-tailed sizes: even for colors with a significant number of bytes, the uniform33 strategy can be very poor. The root mean square error is about 500 times greater for uniform than for optimal sampling.

3.2 Comparison of Packet and Flow Sampling Methods

In this section we investigate the relative tradeoffs of sample volume vs. estimation accuracy for sampling methods at routers, and in particular the relative effectiveness of packet and flow sampling methods. Some packet sampling may be required at the router even if flow statistics are constructed.
In heavily loaded backbone routers, flow formation in software may be unable to keep up with the line rate of the interface cards. In this case, flow statistics are formed from a uniformly (i.e., 1 in N) sampled stream of packets. Given a target accuracy for usage estimation, one question we address here is whether further control of the volume of measurement traffic is best achieved by optimal sampling of these flows, or whether flow statistics could be dispensed with altogether, and the volume controlled through packet sampling alone.

As in Section 3.1 we used NetFlow summaries collected in a network backbone to drive this comparison. There were 17,898,241 flows divided into 13,110 colors according to customer-side IP address. Each flow summary details the duration t of the flow, and the total packets n and bytes b that it comprises. We want to determine the average amount M of measurement traffic that would arise if the packet stream represented in the flow summaries with parameters (n, b, t) were subjected to a given sampling strategy. For a pure packet

Figure 4: COMPARING OPTIMAL AND UNIFORM SAMPLING: RELATIVE ERROR. Left: optimal, (X(c) − X̂(c))/X(c). Right: uniform, (X(c) − X̂_unif(c))/X(c).

sampling method, i.e., with no flow statistics formed, M is the number of sampled packets. With flow statistics (whether formed from a full or a sampled packet stream), M is the final number of flow statistics formed. V will denote the variance of the byte usage estimated from the measurement traffic generated from the packet stream of the original flow (n, b, t). For a given sampling method, let M(c) and V(c) denote the sums of M and V respectively over all flow summaries of color c. Thus M(c) is the total expected number of measurements in that color, while V(c) is the variance due to sampling of the estimated total bytes of color c. We describe M and V for the different sampling methods; they are summarized in Table 1.

(i) Optimal Flow Sampling: optimal sampling of flow summaries formed from the full packet stream. With sampling threshold z, M = p_z(b) and V = b(z − b)⁺, the expected variance from (14).

(ii) Uniform Flow Sampling: 1 in N sampling of flow summaries formed from the full packet stream. The expected number of flows is M = 1/N; the estimated bytes take the value Nb with probability 1/N and 0 otherwise, hence V = b²(N − 1).

(iii) Uniform Packet Sampling: the atomic measurements are packets drawn by 1 in N sampling from the full packet stream; no flow summaries are formed. A flow of n packets gives rise to n/N atomic measurements on average. Let b_1, ..., b_n be the sizes of the packets in the flow. Then the estimator variance is (N − 1) Σ_{i=1}^n b_i² ≥ (N − 1) b²/n.
We use this lower bound on the estimator variance, equivalent to making the simplifying assumption that all packets have the same size b/n.

(iv) Uniform Packet Sampling → Flows: packets are sampled uniformly, and flow summaries are formed from the sampled packet stream. The flow summaries are not further sampled, and hence the estimator variance is the same as for uniform packet sampling. We use the same lower bound. Assume that packets are independently sampled; the probability that at least one packet of a flow of n packets is sampled is 1 − (1 − 1/N)^n. This provides a lower bound on the number of resulting flows. A larger number of flows may result if the separation of sampled packets exceeds the flow timeout τ. The mean number of sampled packets is n/N. If this exceeds 1, the worst case is that sampled

method | M | V
optimal flow | p_z(b) | b(z − b)⁺
uniform flow | 1/N | b²(N − 1)
uniform packet | n/N | b²(N − 1)/n
uniform packet → flow (upper) | f(n, t; N, τ) | b²(N − 1)/n
uniform packet → flow (lower) | 1 − (1 − 1/N)^n | b²(N − 1)/n

Table 1: Expected number of measurements M and byte estimator variance V from sampling a single flow of duration t, n packets and b bytes, with flow timeout τ. The function f is defined in (17).

packets are equally spaced at the mean spacing tN/(n − 1) > τ, the flow timeout. In this case, there will be one flow per sampled packet. Thus M = f(n, t; N, τ), where

  f(n, t; N, τ) = 1 if Nt ≤ (n − 1)τ and N < n;  f(n, t; N, τ) = n/N otherwise.   (17)

For comparisons we assume that τ will be no less than the typical value 30 s used for an unsampled packet stream; since f(n, t; N, τ) is nonincreasing in τ, taking τ = 30 gives an upper bound on the sample volumes.

To form a compact measure of estimator sampling variability we average the per-color standard errors √V(c)/X(c) with the color byte totals X(c) as weights, yielding the weighted mean standard error

  S = Σ_c √V(c) / Σ_c X(c).   (18)

This attaches greater importance to a given standard error in estimating a larger byte total. Let P denote the total number of packets represented in the flows, i.e., the sum of n over all flow summaries. The effective sampling period P / Σ_c M(c) is the ratio of the total number P of packets represented in the original flow summaries to the number of measurements Σ_c M(c). As an example, for uniform 1 in N packet sampling, the effective sampling period is N. Figure 5 displays S as a function of the effective sampling period for the various sampling methods; the points on a given curve are generated by varying the underlying sampling rates.
Optimal sampling provides the best accuracy, by at least one order of magnitude over at least five orders of magnitude of the sampling period, the advantage being more pronounced at lower sampling periods. The convergence between optimal flow sampling and packet sampling at very high sampling periods is due to the fact that only very long flows tend to be sampled in either case. Optimal sampling combines the compression advantages of forming flow records with low estimator variance due to size-dependent sampling. We conclude that even when a certain degree of packet sampling is mandated by packet processing constraints in the router, the formation of flows and their optimal sampling still allows more accurate usage estimation than further packet sampling.

3.3 Quasi-random implementation of optimal sampling

Independent pseudorandom selection of flows according to a given sampling function p(x) can be performed using a well-known random number generator; see e.g. [18]. However, the computational costs of effective generators may prohibit their use in some applications. In this section we describe a simple implementation

Figure 5: COMPARISON OF SAMPLING METHODS: weighted mean standard error vs. effective packet sampling period (log-log scale; curves: uniform flow, uniform packet, uniform packet→flow upper and lower bounds, optimal flow).

of the sampling strategy described in Section 2, which was used for all experiments reported in this paper. The implementation is nearly as efficient as the 1 in N sampling strategy. The pseudocode in Figure 6 describes our implementation of a function that determines whether to sample the data. We assume that the input record has an integer field x that contains the size to be sampled over, e.g., bytes in a flow record. Furthermore the record has a real field


that indicates the multiplicative renormalizing factor 1/p_z(x) that will be applied to the data if sampled. The function returns 1 if the data is to be sampled, 0 if not; i.e., the function returns the indicator variable w_i for a given flow i. We chose to return the normalizing factor rather than perform the normalization directly, since it may be desired to apply the factor to other sizes. For example, in the context of flow sampling, we could apply the factor obtained for sampling on byte counts to packet counts as well. This results in an unbiased estimator of the total packet count, although it does not in general have minimal variance. A more sophisticated approach to sampling joint variables can be based on our work on multidimensional sampling described in Section 2.5.

The function keeps track of the sum of the sizes of small data records (i.e., those with x < z) modulo z, using a

static integer count. If count is uniformly distributed on the integers between 0 and z − 1, then a record whose size is greater than z will always be sampled, while a record whose size is less than z will be sampled with probability x/z. Thus, under the uniform distribution assumption, the heuristic implements the optimal strategy.

Theorem 3  The algorithm described in Figure 6 implements the optimal sampling strategy, under the assumption that the variable count is a uniformly distributed integer between 0 and z − 1.

A simple condition for the assumption on count to hold is that (a) the subsequence of flow sizes {x_i : x_i < z} are i.i.d. random variables, and (b) the support of the distribution of the x_i generates {0, 1, ..., z −

int sampleData (DataType *data, int z) {
    static int count = 0;
    if (data->x > z)
        data->samplingFactor = 1.0;
    else {
        count += data->x;
        if (count < z)
            return 0;   // do not sample this data
        else {
            data->samplingFactor = ((double)z)/data->x;
            count = count - z;
        }
    }
    return 1;   // sample this data
}

Figure 6: Quasi-random implementation of the data sampling algorithm.

1} under addition modulo z. It then follows from the theory of random walks that the distribution of count converges to the uniform distribution. In fact, the same result holds under weaker dependence conditions on the x_i, e.g., that they form an ergodic Markov chain.

Although the marginal sampling probabilities conform to p_z(x), the sampling of different flows will not in general be independent using the algorithm of Figure 6. This is for two reasons. First, sampling, rather than being independent, more closely resembles periodic sampling at the appropriate size-dependent rate. To see this, note that a stream of flows of uniform size x will be sampled roughly periodically, with average period z/x. Second, the driving sequence of underlying flow sizes may not be independent. However, we do not believe that this will be a significant issue for estimation. We found that the correlation between the sampling indicators for successive flows fell rapidly (in lag) to near zero. Furthermore, flow summaries of a given color are interspersed with flows of other colors, further reducing correlations of the sampling indicators of flows of a given color. Finally, we note that dependence between flow sizes is irrelevant when sampling with a good pseudorandom number generator; sampling decisions are then independent, or at least as independent as the sequence of numbers produced by the pseudorandom generator.

4 Dynamic Control of Mean Sample Volume

In Section 2 we showed that the cost function C_z(p) = Var X̂ + z² E N̂ is minimized by taking p(x) = p_z(x) = min{1, x/z} as the sampling function.
In the measurement applications that motivate us, we want to be able to directly control the number of samples taken, in order that their volume does not exceed the capacity available for processing, transmission and storage. Clearly the sample volume depends on z, and so the question is: how should z be chosen? In a dynamic context, the volume of objects presented for sampling will generally vary with time. Thus, in order to be useful, a mechanism to control the number of samples must be able to adapt to temporal variations in the rate at which objects are offered up for sampling. This is already an issue for uniform sampling: it may be necessary to adjust the sampling period N, both between devices and at different times

in a single device, in order to control the sampled volumes. In optimal sampling, the threshold z controls the sampling volume. At first sight the task appears more complex than for uniform sampling, since the sample volume depends on both the volume of offered flows and their sizes. However, we have devised an algorithm to adapt z to meet a given volume constraint which requires knowledge only of the target and current sample volumes.

Consider the case that the target mean sample volume M is less than n, the total number of objects from which to sample. N̂_z = Σ_i w_i is the total number of samples obtained using the sampling function p_z. Now the expected number of samples N_z := E N̂_z = Σ_i p_z(x_i) is clearly a non-increasing function of z, and indeed we show below that there exists a unique z* for which N_{z*} = M. A direct approach to finding z* is to construct an algorithm to find the root. In Section 7 we shall provide an algorithm that does just this by recursion on the set of sizes {x_i}. However, this approach is not suited to all sampling applications. For example, storage or processing cycles may not be available to perform the recursion. Instead, we first present a simple iterative scheme to determine z* that works by repeated application to the set of sizes {x_i}. The advantage of this approach over a recursive one will become most evident when we come to consider the application to dynamically changing sets {x_i} in Section 4. Whereas a recursion must complete on a given set {x_i}, the iterative method allows us to replace the set of sizes {x_i} with new data after each stage of the iteration, without requiring storage of the sizes.

Theorem 4  Assume that x_i > 0 for all i = 1, ..., n, that n > M, and that z_1 > 0. Define F(z) = z N_z/M and set z_{k+1} = F(z_k), k ∈ ℕ.

(i) F(z) is concave and the equation F(z) = z has a unique positive solution z*.

(ii)
z_k is monotonic, and lim_{k→∞} z_k = z*.

(iii) N_{z_k} is monotonic, and lim_{k→∞} N_{z_k} = M.

Proof: F(z) has the form Σ_i min{x_i, z}/M, from which the following observations can be made. First, F(z) = Σ_i x_i/M for z ≥ x_max := max_i x_i. Second, as a sum of concave functions, F is concave. Third, F(z) = z n/M > z for z ≤ x_min := min_i x_i, since n > M.

From these properties, F(z) = z has a unique solution z* > 0. Furthermore F(z) > z (resp. F(z) < z) for z < z* (resp. z > z*), and hence {z_k} is a strictly monotonic sequence bounded above by max{z*, z_1}. Therefore lim_{k→∞} z_k = z*, since F is bounded and continuous on (0, max{z*, z_1}]. N_z is clearly continuous and non-increasing in z, and hence N_{z_k} is monotonic in k and converges to M as k → ∞, converging from above if z_1 < z* (i.e., if N_{z_1} > M), and from below if z_1 > z* (i.e., if N_{z_1} < M).

We illustrate the form of F and the convergence of the sequence {z_k} in Figure 7. In some cases, the fixed point z* can be reached in finitely many iterations. Suppose Σ_i x_i/M > x_max = max_i x_i. Then z* = Σ_i x_i/M and F(z) = z* for z ≥ x_max. Thus once z_k falls in the interval [x_max, ∞), the next iteration yields F(z_k) = z*; see Figure 8. We note that by Theorem 4, {z_k} and {N_{z_k}} are monotonic, and so z_k will never overshoot the fixed point z*.

4.1 Rates of convergence

Having established convergence of z_k to the unique fixed point z*, we now investigate how quickly convergence occurs. Our first result shows that the number of iterations required to bring N_z to within a given

Figure 7: ITERATION WITH F: the sequence {z_k} converging to z* from below. Figure 8: ONE-STEP CONVERGENCE: F(z) = z* for z ≥ x_max if Σ_i x_i/M ≥ x_max.

factor of the target M is controlled uniformly in terms of the initial and target thresholds z_1 and z*.

Theorem 5  Starting with z = z_1, it requires no more than 1 + |log_{1+ε}(z*/z_1)| iterations to get N_z within a factor 1 + ε of M, i.e., M/(1 + ε) ≤ N_z ≤ M(1 + ε).

Proof: We prove the case N_{z_1} > M; the case N_{z_1} < M is similar. Let k̄ = max{k : N_{z_k} > M(1 + ε)}. Then

  (1 + ε)^k̄ ≤ Π_{k=1}^{k̄} N_{z_k}/M = Π_{k=1}^{k̄} z_{k+1}/z_k = z_{k̄+1}/z_1 ≤ z*/z_1.   (19)

Thus k̄ ≤ log_{1+ε}(z*/z_1), and the required number of iterations is no more than k̄ + 1. Note that if ε is small, |log_{1+ε}(z*/z_1)| ≈ |log(z*/z_1)|/ε.

In a neighborhood of z*, the rate of convergence of z_k is governed by the derivative of F. Let

  A_z = Σ_{i : x_i ≤ z} x_i  and  B_z = #{i : x_i > z}.   (20)

Observe that N_z = A_z/z + B_z, and hence that F(z) = A_z/M + z B_z/M. F is piecewise linear with right derivative B_z/M and left derivative B_z⁻/M, where B_z⁻ = B_z + #{i : x_i = z}. We now express convergence of the absolute difference of N_z and M in terms of these derivatives.

Theorem 6  (i) Adopt the assumptions of Theorem 4. Then |z_k − z*| ≤ r^k |z_1 − z*|, where r depends on z_1 as follows. If z_1 > z*, take r = B_{z*}/M < 1. Otherwise, for sufficiently small ε > 0 and sufficiently large z_1 < z*, we can take r = B_{z*}⁻/M + ε < 1. Thus, subject to these conditions, the number of iterations required to bring z_k within a distance δ of z* is no greater than |log(δ/|z_1 − z*|)/log r|.

(ii) |N_z − N_{z'}| ≤ |z − z'| n/M for all z, z' > 0. Hence |N_{z_k} − M| ≤ r^k (n/M) |z_1 − z*|.

Proof: (i) Suppose z_1 > z*.
From the concavity of F, z_{k+1} = F(z_k) ≤ F(z*) + (z_k − z*) B_{z*}/M = z* + (z_k − z*) B_{z*}/M, whence z_{k+1} − z* ≤ (z_k − z*) B_{z*}/M. Now B_{z*}/M = 1 − A_{z*}/(z* M) < 1, and so the bound on the required number of iterations is trivial.

Otherwise, for z_1 < z*, using the concavity of F and since z_k is increasing, z* − z_{k+1} ≤ (z* − z_k) B_{z_1}⁻/M. For sufficiently small ε > 0 and sufficiently large z_1 < z*, B_{z_1}⁻/M ≤ B_{z*}⁻/M + ε < 1.

(ii) Denote z_∧ = min{z, z'} and z_∨ = max{z, z'}. N_z and N_{z'} have identical contributions from those x_i not in the interval [z_∧, z_∨]. Hence

  |N_z − N_{z'}| = Σ_{i : z_∧ < x_i ≤ z_∨} (x_i − z_∧)/M ≤ |z − z'| n/M.   (21)

Note that the rate r is uniform for N_{z_1} < M. We believe this is the more interesting case: when the population size n is unknown, one may conservatively choose a large initial z_1 in order that N_{z_1} < M. Worst-case convergence of the absolute difference N_z − M occurs when B_{z*}/M is as close as possible to 1. Since B_{z*} ≤ N_{z*} = M, the worst possible case would entail B_{z*} = M − 1. (One can construct such an example by letting each x_i take one of two sufficiently separated values, with M − 1 of the x_i taking the larger value.) According to Theorem 6, the number of iterations k required to achieve a given absolute difference |N_{z_k} − M| is then O(M). In the next section we show that a modification of the iteration allows exact convergence of N_{z_k} in finitely many steps, whose number grows no worse than O(M).

4.2 Location of z* in finitely many steps

We now show how a simple modification of the iterator F enables exact determination of z* within finitely many iterations for any collection of sizes {x_i}. Let

  F̃(z) = F(z) if B_z ≥ M;  F̃(z) = z (N_z − B_z)/(M − B_z) if B_z < M.   (22)

The modification F̃ of F makes use of the subsidiary quantity B_z, the number of sizes that exceed a given threshold z. Since such sizes are sampled with probability 1 and r_z(x) = max{x, z}, B_z can be determined from the stream of sampled renormalized sizes alone: it is the number of such objects that exceed z. Since F̃(z) = z iff N_z = M, F̃ shares with F the unique fixed point z*.
We say that iteration with F̃ terminates when successive values are identical. Since z* is the unique fixed point of F̃, termination can happen only at z*. Since F̃(z) < F(z) for z > z*, we expect convergence to be faster with F̃ than with F for initial z > z*. We also find that F̃(z) ≥ z* for all z such that B_z < M, so that after at most one iteration we enter the regime of convergence to z* from above.

Theorem 7  Starting with z such that B_z < M, iteration with F̃ terminates on z* after at most O(min{M, log(MA)}) iterations, where A = Σ_i x_i.

Proof: First consider the case that B_z ≤ N_z < M, and hence z > z*. We show this implies that the iterates F̃^k(z) form a decreasing sequence bounded below by z*. Now N_z < M = N_{z*} implies z' := F̃(z) < z. Since N_z = A_z/z + B_z, z' satisfies

  M = A_z/z' + B_z.   (23)

Hence

  N_{z'} = A_{z'}/z' + B_{z'} = A_z/z' + B_z + Σ_{i : z' < x_i ≤ z} (1 − x_i/z') ≤ A_z/z' + B_z = M,   (24)

and so z' ≥ z*. If there are no attributes x_i in (z', z], then A_{z'} = A_z, B_{z'} = B_z, and so M = N_{z'} by (23) and the iteration terminates with z' = z*. An immediate consequence is that each non-terminating iteration increases B_z by at least 1. Since B_z ≤ N_z < M, there can be at most M such iterations.

Define w_z > 0 by M = (1 + w_z) N_z. We are going to show that in each iteration, either A_z or w_z N_z is at least halved. Assume that g := min{min_i x_i, min_{i≠j} |x_i − x_j|} > 0; note g ≥ 1 if the x_i take positive integer values. Since A ≥ A_z ≥ g whenever A_z > 0, there can be at most O(log A) iterations in which A_z is halved. Suppose A_z is not halved, i.e., A_{z'} > A_z/2. Now w_z N_z = M − N_z = A_z(1/z' − 1/z). However, using (24) with z and z' interchanged, we find w_z N_z − w_{z'} N_{z'} = N_{z'} − N_z ≥ A_{z'}(1/z' − 1/z) ≥ (A_z/2)(1/z' − 1/z) = w_z N_z/2. Hence w_{z'} N_{z'} ≤ w_z N_z/2, as desired.

Recall that an iteration that does not yield z* must increase B_z. Hence after one such iteration we have N_z ≥ B_z ≥ 1, and so w_z = M/N_z − 1 ≤ M − 1. To finish the bound we need to show that w_z cannot get too small without the algorithm terminating. Here we exploit the fact that if w_z < g/(2Mz), then z − z' = z w_z N_z/(M − B_z) ≤ g N_z/(2M(M − B_z)) ≤ g/2, using N_z ≤ M and the integer inequality M − B_z ≥ 1. Hence, in one of the next two iterations, z does not cross any x_i, in which case B_z does not change on that iteration, which implies that the algorithm terminates. Thus, while the algorithm runs, w_z ≥ g/(2Mz), and there can be at most O(log M) iterations in which w_z N_z is halved.

Next consider the case that initially N_z > M > B_z. Similarly to (24), one has for z' = F̃(z) that

  N_{z'} = A_{z'}/z' + B_{z'} = A_z/z' + B_z + Σ_{i : z < x_i ≤ z'} (x_i/z' − 1) ≤ A_z/z' + B_z = M,   (25)

and hence z' ≥ z*. Thus after one iteration the problem reduces to the case B_z ≤ N_z < M.

5 Pathwise Dynamic Control of Sample Volumes

The strength of an iterative method is that it controls the sample volume N̂ both in the static case, where the distribution of sizes of the original population {x_i
} is unknown, and, as we shall see in this section, the dynamic case, where the population is changing. The latter ability is important for network measurement applications, where both the volume of offered flows and their sizes can change rapidly. As an example, volumes of short flows have been observed to increase rapidly during denial of service attacks.

The setup for our analysis is as follows. Consider a sequence of time windows labeled by k ∈ ℕ. In each window a set of sizes {x_i^(k) : i = 1, 2, ..., n_k} is available for sampling. Given a threshold z, the total sample volume in window k is N̂_z^(k) = Σ_{i=1}^{n_k} w_i^(k), where the w_i^(k) are independent random variables taking the value 1 with probability p_z(x_i^(k)) and 0 otherwise. The set of sampled renormalized sizes from window k is Ω_z^(k) = {max{x_i^(k), z} : w_i^(k) = 1}, and the estimate of the total size of {x_i^(k)} is X̂^(k) = Σ_{y ∈ Ω_z^(k)} y.

Generally we want to use control schemes that require relatively sparse knowledge about the past. For example, we do not want to require knowledge of the full sets of original sizes {x_i^(k)} from past windows, due to the storage requirements this would entail. The control schemes that we will use here are based on the iterative algorithms of Section 4. The value of z_{k+1} to use for sampling during window k + 1 is a function only of the target sample volume M and the following quantities from the immediately previous window: the threshold z_k, the sample volume N̂^(k) := N̂_{z_k}^(k) and, optionally, the set of sampled renormalized sizes Ω^(k). Let B^(k) = #{y ∈ Ω^(k) : y > z_k}.

We first specify three candidate control algorithms based on the results of Section 4. The Conservative Algorithm is based on the basic iteration defined in Theorem 4. The Aggressive Algorithm uses the modification described in Section 4.2, with the aim of speeding up convergence when N̂ < M. The Root Finding

Algorithm uses a limited recursion to speed up convergence when N̂ > M. This assumes the requisite storage and processing cycles are available. These algorithms can be supplemented by two additional control mechanisms. Emergency Volume Control rapidly adjusts the threshold z if systematic variations in the offered load cause N̂ to significantly exceed the target M before the end of the window. Variance Compensation estimates the inherent variance of the sample volume, and adjusts the target volume M downwards accordingly.

5.1 Conservative algorithm.

  z_{k+1} = z_k max{N̂^(k), 1}/M.   (26)

The maximum with 1 comes into effect when z is so large that no sizes end up being sampled, i.e., when N̂^(k) = 0. The effect is to drive z down and so make some sampling more likely at the next iteration.

5.2 Aggressive algorithm.

  z_{k+1} = z_k max{N̂^(k), 1}/M if N̂^(k) ≥ M;  z_{k+1} = z_k (N̂^(k) − B^(k))/(M − B^(k)) if M > N̂^(k) > 0.   (27)

Here we apply the form of the iteration (22) only when M > N̂^(k), rather than under the weaker condition M > B^(k). The reason for this is that the jump from z < z* to F̃(z) > z* can lead to oscillatory behavior in the presence of size variability.

5.3 Root finding algorithm

The third candidate arises from the fact, established in Section 2.7, that sampling and renormalizing a set of sizes S = {x_i : i = 1, ..., n} using threshold z, and then resampling the resulting set with threshold z' > z, is statistically equivalent to performing a single sampling operation with threshold z'. Suppose that for a given value z we obtain the sample volume N̂_z > M. Conditioning on the fixed set of sampled values Ω_z(S), we can write the conditional expectation of the sample volume under resampling from Ω_z(S) with the threshold z' > z as E[N̂_{z',z} | Ω_z(S)] = Σ_{y ∈ Ω_z(S)} p_{z'}(y). We rewrite
$\hat N_{z',z} = \sum_i w_i \, p_{z'}(\max(x_i, z))$, where the $w_i$ are independent random variables taking the value $1$ with probability $p_z(x_i)$ and $0$ otherwise. From Theorem 2 it follows that $\mathsf{E}[\hat N_{z',z}] = N_{z'}$ when $z' \ge z$, i.e., $\hat N_{z',z}$ is an unbiased estimator of $N_{z'}$. Furthermore, for a given realization of the $\{w_i\}$, $z' \mapsto \hat N_{z',z}$ is non-increasing. Consequently it is relatively simple to determine the root $\hat z^* > z$ of the equation $\hat N_{\hat z^*, z} = M$. We denote this root by $\rho(S, M)$. An algorithm to determine the root is presented in Section 7 below. The root finding algorithm is then:

$$z_{n+1} = \begin{cases} \rho(S^{(n)}, M) & \text{if } \hat N^{(n)} > M \\[1ex] z_n \, \dfrac{\max\{\hat N^{(n)}, 1\}}{M} & \text{if } M \ge \hat N^{(n)} \ge 0 \end{cases} \qquad (28)$$

We will also speak of combining the aggressive with the root finding approach, by which we mean using (27) when $M > \hat N^{(n)} > 0$ and (28) when $\hat N^{(n)} > M$.
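The paper defers its root-finding procedure to Section 7. Purely as an illustration of why the root is simple to determine, the following sketch solves $\sum_{r \in S} \min(1, r/z') = M$ exactly by examining each linear piece of the continuous, non-increasing left-hand side; the function name and error handling are our own, not the paper's.

```python
def find_root(sampled, target):
    # Solve sum(min(1, r/z) for r in sampled) == target for z,
    # assuming 0 < target < len(sampled). On each piece, exactly the
    # j largest sizes satisfy r >= z and contribute 1 apiece, while
    # the remainder contribute (sum of remainder) / z.
    rs = sorted(sampled, reverse=True)   # r_1 >= r_2 >= ... >= r_k
    k = len(rs)
    tail = sum(rs)
    for j in range(k + 1):
        if j > 0:
            tail -= rs[j - 1]            # tail = sum of the k - j smallest
        if target <= j:
            continue                     # this piece cannot reach target
        z = tail / (target - j)
        upper = rs[j - 1] if j > 0 else float("inf")
        lower = rs[j] if j < k else 0.0
        if lower <= z <= upper:          # candidate lies on this piece
            return z
    raise ValueError("no root: need 0 < target < len(sampled)")
```

Bisection would also work here, since the map is continuous and non-increasing in $z'$; the piecewise solve is exact and needs only one pass over the sorted sizes.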
