• No results found

The IETF relies on various vehicles to discuss and distribute protocols it seeks to

standardize. Proposed IETF protocols that have reached a certain level of maturity

are typically disseminated through Requests for Comments (RFCs), namely, have

progressed through rounds of discussions in an IETF Working Group (WG), and

been deemed stable and significant enough to warrant formal publication. RFCs

can belong to different categories, and we focus on standards track RFCs. Our

motivation is that they correspond to protocols that are of sound design, so that an

eventual failure to gain widespread adoption is unlikely to stem from fundamental

technical flaws.

We select a representative sample of RFCs in three steps. The first step involves

eliminating all RFCs issued since 2009. This is to ensure that enough time had

elapsed since the protocol’s initial release for a reasonable assessment of its adoption

status. The second step involves eliminating irrelevant RFCs from the above set.

Irrelevant RFCs are those that are not introducing a new protocol or an extension

not consider data link or physical layer protocols is that IETF was not involved in

the standardization of many of the protocols in these two categories, and therefore,

it is unlikely that we can infer anything from the available data. The set of relevant

pre-2009 standards track RFCs is still huge, and characterizing all of them (using

the characteristics defined in the previous section) is infeasible, therefore, we need

to find a representative sample to infer (potential) statistical correlation between

a number of characteristics and the success of protocols. We seek not only to find

such correlations, but to quantify them and find their statistical significance. Hence,

our sample needs to be representative of all types of protocols, regardless of their

popularity, to avoid potential bias in our statistical findings. In other words, our

focus is on finding a sample that is not dominated by popular protocols and the rules

that govern them, so that our findings can be generalized to all kinds of protocols.

Thus, the third step focuses on producing a subset of RFCs for our analysis. This

was performed by combining two different sampling methods, one producing 110

and the other 120 RFCs.

The motivation for using two different sampling methods is to avoid over-

sampling “popular” protocols that tend to see many extensions and correspond-

ingly generate a large number of RFCs, i.e., a simple random sampling tends to

sample popular protocols more often than other protocols, thereby introducing a

bias in our dataset. The first sampling method involves randomly sampling all past

sampled WGs (in the rest of the chapter, we refer to this sample as “WG-based”),

producing 120 “major” RFCs. In other words, we select 120 WGs from the total

of 179 WGs that have at least one relevant pre-2009 standards track RFC. Then,

from each WG we select the major RFC it produced, i.e., an RFC that is the most

important one among all the standards track RFCs of that particular WG. This

sampling method gives each major RFC a chance of 120179 ' 67% to be included in

our final dataset. The second sampling method involves selecting 110 RFCs ran-

domly from the non-major RFCs, i.e., any other relevant pre-2009 standards track

RFC that is not in the 179 major RFCs set, and in the rest of this chapter we refer

to the resulting sample as “random”. Since there are a total of 1473 relevant RFCs

that are in the standards track and pre-2009, the second sampling method basically

selects 110 out of 1473-179, giving each RFC in this set a chance of 1473110179 '8.5%.

We seek to combine these two sample sets to create a larger set that includes all

types of protocols regardless of their popularities. However, we can only combine

these datasets if their statistical differences are not beyond what we intended. In

particular, our goal is to create a sample that is not biased towards popular proto-

cols, i.e.,not containing many extensions of a few popular protocols. Therefore, we

used the WG-based method to sample more “new” protocols, i.e., protocols with a

fresh design, even if they are not popular. Except for this feature that we intended

to have, and the ones that are closely associated with it, we want the two samples

into a larger dataset does not create a new statistical bias. In other words, we want

to combine the two datasets if and only if their distributions are not drastically

different (except for those dimensions that we intended to be different,e.g., charac-

teristics related to new protocols). We relegate the tests that shows this similarity

to the end of this section, and here only explain what are the general characteristics

of the final sample.

The final sample is found by mixing the above two samples, and as a result

of our two sampling methods, each relevant RFC has a non-zero (i.e., more than

8.5%) chance of being a part of this mixture. If we have found all 230 RFCs in our

list by simple random sampling, the chance of each relevant RFC being selected

would be exactly 1473230 = 16%. However, our sampling assigns a higher chance

(67%) to major RFCs, and a lower chance (8.5%) to others. As alluded to earlier,

this is done to avoid the bias (towards more popular protocols) associated with

simple random sampling. In other words, our final dataset is a more representative

sample (compared to a simple random sample of the same size) of all types of RFCs

regardless of their popularity, since it captures a larger range of protocols, and

therefore, we can generalize our findings to a broader set of RFCs.

The next and most time consuming step involved characterizing the protocol

described in each RFC according to the features of Section 4.2, as well as label it

as successful or not. Both are essentially manual processes that involve our own

external sources, e.g., books, technology blogs, source code forums, product web

pages, IETF mailing lists, etc. This process, while lengthy, was to some extent

simplified by the fact that we dealt primarily with binary decisions for each feature.

This mitigated the impact of unavoidable inaccuracy in the information that led

to classifying each protocol as either having or not having a particular feature.

Internal cross-validation was performed between the authors, and the result of our

classification is available for external review (See Appendix D). The results are in

the form of a spreadsheet with each row representing one of the 230 protocols under

consideration, and the columns corresponding to the different features of Section 4.2

in addition to the label of successful vs. not successful. Comments are available for

cells in the spreadsheet that highlight the motivations behind its setting and/or

pointers to documentation used to support the choice that was made.

As protocol functionality arguably represents a natural partitioning, we present

in Table 4.1 classification results for functionality features, as well as the corre-

sponding number of protocols deemed successful in our two samples, random and

WG-based. Complete results are available from the spreadsheet.

A T S C R D Succ.

Random (110) 33 10 26 8 24 9 71

WG-based (120) 32 6 36 15 8 23 64

before combining them into a larger dataset. We examined the similarity between

the distributions of our two samples, namely, WG-based and random, by applying

the binomial proportion exact test (i.e.,Fisher’s test), each feature. In other words,

for each feature we have two vectors of binary values (with sizes of 110 and 120),

to which we apply the Fisher’s test. The null hypothesis of this test is whether the

two samples are drawn from similar distributions (H0 : p1 =p2), since it is easier

to define such a test compared to a null hypothesis that the two distributions are

different. If the outcomes show that we cannot reject the null hypothesis for all of

the features (except for those that we intended to be different), then it is reasonable

to assume that combining the two datasets does not create a new bias. In other

words, the ideal outcome is that we cannot reject the null hypothesis for most of

the features with a significance level of 95%. The result of the Fisher’s test is a

p−valuethat basically shows the probability of the two samples being drawn from

the same distribution, i.e., the significance with which the null hypothesis can be

rejected is found by “1 - p-value”. These p−values are listed for each feature in

Table 4.2. The rows labeled “Feature” list the corresponding feature on which the

test was performed, and the rows below them (labeled “p-value”) show the p-values

of the Fisher’s test.

It can be seen that as of the RFCs’ labels, i.e., being successful or not, we

cannot reject the null hypothesis with 95% significance (p−value= 0.107). In other

different distributions when it comes to success or failure of protocols. According

to the results listed in Table 4.2 the above claim is true for all other features except

for 4 of them, namely, being a new protocol, generating additional value for other

protocols, being a Routing protocol, and being a Data plane protocol.

Among these 4 features, “being a new protocol” is the most important, since

the rest of the differences are mostly associated with it (see below). By examining

the datasets, we realized that the WG-based sampling produces more new protocols

than the random sampling method, which as alluded to earlier, is what we expected

to happen, i.e., more protocols with fresh designs are going to be present in this

dataset compared to the random sample.

The rest of the differences are mainly associated with being a new protocol

and how the RFCs are distributed. For instance, new protocols mainly generate

additional value for other protocols, compared to extensions of existing protocols

that mostly generate value for their parent protocols. Therefore, since the WG-

based sampling generates more new protocols, the “generating additional value for

other protocols” feature has a higher proportion in this dataset compared to the

random sample. Also, the difference between the proportion of the routing protocols

in the two samples comes from the fact that there are a few new routing protocols in

the IETF standards, and they are mainly popular, therefore, the random sampling

selects more of them (i.e., extensions of those popular protocols) compared to the

plane protocols between the two samples is due to having a large number of major

data plane RFCs (fresh designs) from various WGs, which results in more of them

being selected by the WG-based sampling.

The results of the above tests show that despite the different sampling methods,

and the slight differences in the datasets that they produce (which were intentional

to avoid the bias toward more popular protocols), the two datasets, namely, WG-

based and random, do not have drastically different distributions. Moreover, as we

explain in the next section, we categorize RFCs based on their functionalities (and

being new or not for some categories) before applying the classification methods

to them, and as a result, the slight differences mentioned above do not affect our

findings. In other words, not only the mixture is a more representative sample of

all types of protocols (regardless of their popularities), but also, according to the

results of the Fisher’s test, it does not introduce a new bias that affects our final

findings.

Feature Succ. A T S C R D

p-value 0.107 0.660 0.301 0.301 0.271 0.001 0.021

Feature New Replace backwards Net. Gear Local Domain Internet

p-value 0.000 0.107 0.186 0.891 0.445 0.884 1.000

Feature Increase Change Gen Value Secur. Scala. Perf. Other

p-value 0.787 0.787 0.013 0.623 0.350 0.508 0.515