

Statistics

This chapter reviews basic statistical concepts and techniques. We start by considering a critical problem in statistics: choosing a representative sample. We then discuss statistical techniques to deal with some situations that frequently arise in carrying out research in computer networking: describing data parsimoniously, inferring the parameters of a population from a sample, comparing outcomes, and inferring correlation or independence of variables. We conclude with some approaches to dealing with large data sets and a description of common mistakes in statistical analysis and how to avoid them.

2.1 Sampling a Population

The universe of individuals under study constitutes a population that can be characterized by its inherent parameters, such as its range, minimum, maximum, mean, or variance. In many practical situations, the population is infinite, so we have to estimate its parameters by studying a carefully chosen subset, or sample.

The parameters of a sample, such as its range, mean, and variance, are called its statistics. In standard notation, population parameters are denoted using the Greek alphabet, and sample statistics are represented using the Latin alphabet.

For example, the population mean and variance parameters are denoted μ and σ², respectively, and the corresponding sample mean and variance statistics are denoted m (or x̄) and s², respectively.
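The distinction between parameters and statistics can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of packet sizes, the sample size of 200, and the seed are all hypothetical, and the population is made finite so that its parameters can be computed directly.

```python
import random
import statistics

# Hypothetical finite population: 10,000 packet sizes in bytes (an assumption).
rng = random.Random(42)
population = [rng.choice([40, 576, 1500]) for _ in range(10_000)]

# Population parameters (mu, sigma^2): computed over every element.
mu = statistics.fmean(population)
sigma2 = statistics.pvariance(population, mu)

# Sample statistics (m, s^2): computed over a randomly chosen subset.
sample = rng.sample(population, 200)
m = statistics.fmean(sample)
s2 = statistics.variance(sample, m)  # sample variance, divides by n - 1

print(f"population: mu={mu:.1f}  sigma^2={sigma2:.1f}")
print(f"sample:     m={m:.1f}  s^2={s2:.1f}")
```

The sample statistics m and s² approximate, but generally do not equal, the parameters μ and σ².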

When choosing a sample, it is important to carefully identify the underlying population, as the next example illustrates.

EXAMPLE 2.1: CHOICE OF POPULATION

Suppose that you capture a trace of all UDP packets sent on a link from your campus router to your university’s internet service provider (ISP) from 6 a.m. to 9 p.m. on Monday, November 17, 2008. What is the underlying population?

There are many choices:

• The population of UDP packets sent from your campus router to your university’s ISP from 12:00:01 a.m. to 11:59:59 p.m. on November 17, 2008

• The population of UDP packets sent from your campus router to your university’s ISP from 12:00:01 a.m. to 11:59:59 p.m. on Mondays

• The population of UDP packets sent from your campus router to your university’s ISP from 12:00:01 a.m. to 11:59:59 p.m. on days that are not holidays

• The population of UDP packets sent from your campus router to your university’s ISP from 12:00:01 a.m. to 11:59:59 p.m. on a typical day

• The population of UDP packets sent from a typical university’s campus router to a typical university’s ISP from 12:00:01 a.m. to 11:59:59 p.m. on a typical day

• The population of UDP packets sent from a typical access router to a typical ISP router from 12:00:01 a.m. to 11:59:59 p.m. on a typical day

• ...

• The population of all UDP packets sent on the Internet in 2008

• The population of all UDP packets sent since 1969 (the year the Internet was created)

Each population in this list is a superset of the previous population. As you go down the list, therefore, conclusions about the population that you draw from your sample are more general. Unfortunately, these conclusions are also less valid. For instance, it is hard to believe that a single day’s sample on a single link is representative of all UDP packets sent on the Internet in 2008!

The difficulty when setting up a measurement study is determining a sample that is representative of the population under study. Conversely, given a sample, you are faced with determining the population that the sample represents. This population lies in the spectrum between the most specific population, which is the sample itself and about which your conclusions are certainly true, and the most general population, about which usually no valid conclusions can be drawn. Unfortunately, the only guide to making this judgment is experience, and even experts may disagree with any decision you make.

2.1.1 Types of Sampling

As Example 2.1 shows, collecting a sample before identifying the corresponding population puts the metaphorical cart in front of the horse. Instead, one should first identify a population to study and only then choose samples that are representative of that population. By representative, we mean a sample chosen such that every member of the population is equally likely to be a member of the sample. In contrast, if the sample is chosen so that some members of the population are more likely to be in the sample than others, the sample is biased, and the conclusions drawn from it may be inaccurate. Of course, representativeness is in the eye of the beholder. Nevertheless, explicitly stating the population and then the sampling technique will aid in identifying and removing otherwise hidden biases.
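The effect of a biased selection rule can be made concrete with a small calculation. The sketch below is purely illustrative: the five packet sizes are hypothetical, and the biased rule (selection probability proportional to packet size, as would happen if one picked a random byte on the wire and recorded the packet containing it) is an assumed scenario rather than one taken from the text.

```python
# Hypothetical population of packet sizes in bytes (an assumption).
sizes = [40, 40, 40, 576, 1500]

# Representative rule: every packet is equally likely to be selected,
# so the expected sampled size equals the population mean.
uniform_mean = sum(sizes) / len(sizes)

# Biased rule (assumed): a packet is selected with probability
# proportional to its size, so large packets are over-represented.
total = sum(sizes)
biased_mean = sum(x * (x / total) for x in sizes)

print(f"expected size, representative sample: {uniform_mean:.1f}")
print(f"expected size, size-biased sample:    {biased_mean:.1f}")
```

Under these assumptions, the biased rule overestimates the mean packet size because it favors large packets, which is the kind of hidden bias that stating the population and sampling technique explicitly helps expose.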

Here are some standard sampling techniques.

• In random, or proportional, sampling, an unbiased decision rule is used to select elements of the sample from the population. An example of such a rule is: “Choose an element of the population with probability 0.05.” For example, in doing Monte Carlo simulations, varying the seed values in random-number generators randomly perturbs simulation trajectories so that one can argue that the results of the simulation are randomly selected from the space of all possible simulation trajectories.

• In the stratified random approach, the population is first categorized into groups of elements that are expected to differ in some significant way. Then, each group is randomly sampled to create an overall sample of the population. For example, one could first categorize packets on a link according to their transport protocol (TCP, UDP, or other), then sample each category separately in proportion to their ratio in the population.

• The systematic approach is similar to random sampling but sometimes simpler to carry out. We assume that the population can be enumerated in some random fashion (i.e., with no discernible pattern). Then, the systematic sampling rule is to select every kth element of this random enumeration. For instance, if we expected packet arrivals to a switch to be in no particular order with respect to their destination port, the destination port of every 100th arriving packet would constitute a systematic sample.


• Cluster sampling, like stratified sampling, is appropriate when the population naturally partitions itself into distinct groups. As with stratified sampling, the population is divided into groups, and each group is separately sampled. Grouping may reflect geography or an element type. However, unlike in stratified sampling, with cluster sampling, the identity of the cluster is preserved, and statistics are computed individually for each cluster. In contrast to stratified sampling, where the grouping attempts to increase precision, with cluster sampling, the goal is to reduce the cost of creating the sample. Cluster sampling may be done hierarchically, with each level of the hierarchy, or stage, further refining the grouping.

• With purposive sampling, the idea is to sample only elements that meet a specific definition of the population. For example, suppose that we wanted to study all IP packets that are 40 bytes long, corresponding to a zero data payload. Then, we could set up a packet filter that captured only these packets, constituting a purposive sample.

• A convenience sample involves studying the population elements that happen to be conveniently available. For example, you may examine call traces from a cooperative cell phone operator to estimate mean call durations. Although it may not be possible to claim that call durations on that provider are representative of all cellular calls, because the duration is influenced by the pricing policies of each operator, this may be all that is available and, on balance, is probably better than not having any data at all.
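Three of the techniques above (random, stratified random, and systematic sampling) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names, the synthetic packet population, the 10% stratum fraction, and the rule of keeping at least one element per stratum are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def random_sample(population, p, rng):
    """Random sampling: keep each element independently with probability p."""
    return [x for x in population if rng.random() < p]


def stratified_sample(population, key, fraction, rng):
    """Stratified random sampling: group by key, then randomly sample
    each group in proportion to its share of the population."""
    strata = defaultdict(list)
    for x in population:
        strata[key(x)].append(x)
    sample = []
    for group in strata.values():
        # Assumption: keep at least one element from every stratum.
        n = max(1, round(fraction * len(group)))
        sample.extend(rng.sample(group, n))
    return sample


def systematic_sample(population, k):
    """Systematic sampling: select every kth element of the enumeration."""
    return population[k - 1::k]


# Hypothetical population: 1,000 packets tagged with a transport protocol,
# shuffled so the enumeration has no discernible pattern.
rng = random.Random(7)
protocols = ["TCP"] * 700 + ["UDP"] * 250 + ["other"] * 50
rng.shuffle(protocols)
packets = list(enumerate(protocols))

rand = random_sample(packets, 0.05, rng)
strat = stratified_sample(packets, key=lambda pkt: pkt[1], fraction=0.1, rng=rng)
syst = systematic_sample(packets, 100)

print(len(rand), len(strat), len(syst))
```

With these inputs, the stratified sample contains 70 TCP, 25 UDP, and 5 other packets, matching their ratio in the population, whereas the size of the random sample varies from run to run around an expected value of 50.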

2.1.2 Scales

Gathering a sample requires measuring some physical quantity along a scale. Not all quantities correspond to values along a real line. We distinguish four types of scales.

1. A nominal, or categorical, scale corresponds to categories. Quantities arranged in a nominal scale cannot be mutually compared. For example, the transport-protocol type of a packet (i.e., UDP, TCP, other) constitutes a nominal scale.

2. An ordinal scale defines an ordering, but distances along the ordinal scale are meaningless. A typical ordinal scale is the Likert scale, where 0 corresponds to “strongly disagree,” 1 to “disagree,” 2 to “neutral,” 3 to “agree,” and 4 to “strongly agree.” A similar scale, with the scale ranging from “poor” to “excellent,” is often used to compute the mean opinion score (MOS) of a set of consumers of audio or video content to rank the quality of the content.


In document Mathematical Foundations of Computer Networking (Page 72-76)