The IETF relies on various vehicles to discuss and distribute protocols it seeks to
standardize. Proposed IETF protocols that have reached a certain level of maturity
are typically disseminated through Requests for Comments (RFCs), namely, have
progressed through rounds of discussions in an IETF Working Group (WG), and
been deemed stable and significant enough to warrant formal publication. RFCs
can belong to different categories, and we focus on standards track RFCs. Our
motivation is that they correspond to protocols that are of sound design, so that an
eventual failure to gain widespread adoption is unlikely to stem from fundamental
technical flaws.
We select a representative sample of RFCs in three steps. The first step involves
eliminating all RFCs issued since 2009. This is to ensure that enough time had
elapsed since the protocol’s initial release for a reasonable assessment of its adoption
status. The second step involves eliminating irrelevant RFCs from the above set.
Irrelevant RFCs are those that are not introducing a new protocol or an extension
not consider data link or physical layer protocols is that IETF was not involved in
the standardization of many of the protocols in these two categories, and therefore,
it is unlikely that we can infer anything from the available data. The set of relevant
pre-2009 standards track RFCs is still huge, and characterizing all of them (using
the characteristics defined in the previous section) is infeasible, therefore, we need
to find a representative sample to infer (potential) statistical correlation between
a number of characteristics and the success of protocols. We seek not only to find
such correlations, but to quantify them and find their statistical significance. Hence,
our sample needs to be representative of all types of protocols, regardless of their
popularity, to avoid potential bias in our statistical findings. In other words, our
focus is on finding a sample that is not dominated by popular protocols and the rules
that govern them, so that our findings can be generalized to all kinds of protocols.
Thus, the third step focuses on producing a subset of RFCs for our analysis. This
was performed by combining two different sampling methods, one producing 110
and the other 120 RFCs.
The motivation for using two different sampling methods is to avoid over-
sampling “popular” protocols that tend to see many extensions and correspond-
ingly generate a large number of RFCs, i.e., a simple random sampling tends to
sample popular protocols more often than other protocols, thereby introducing a
bias in our dataset. The first sampling method involves randomly sampling all past
sampled WGs (in the rest of the chapter, we refer to this sample as “WG-based”),
producing 120 “major” RFCs. In other words, we select 120 WGs from the total
of 179 WGs that have at least one relevant pre-2009 standards track RFC. Then,
from each WG we select the major RFC it produced, i.e., an RFC that is the most
important one among all the standards track RFCs of that particular WG. This
sampling method gives each major RFC a chance of 120179 ' 67% to be included in
our final dataset. The second sampling method involves selecting 110 RFCs ran-
domly from the non-major RFCs, i.e., any other relevant pre-2009 standards track
RFC that is not in the 179 major RFCs set, and in the rest of this chapter we refer
to the resulting sample as “random”. Since there are a total of 1473 relevant RFCs
that are in the standards track and pre-2009, the second sampling method basically
selects 110 out of 1473-179, giving each RFC in this set a chance of 1473110−179 '8.5%.
We seek to combine these two sample sets to create a larger set that includes all
types of protocols regardless of their popularities. However, we can only combine
these datasets if their statistical differences are not beyond what we intended. In
particular, our goal is to create a sample that is not biased towards popular proto-
cols, i.e.,not containing many extensions of a few popular protocols. Therefore, we
used the WG-based method to sample more “new” protocols, i.e., protocols with a
fresh design, even if they are not popular. Except for this feature that we intended
to have, and the ones that are closely associated with it, we want the two samples
into a larger dataset does not create a new statistical bias. In other words, we want
to combine the two datasets if and only if their distributions are not drastically
different (except for those dimensions that we intended to be different,e.g., charac-
teristics related to new protocols). We relegate the tests that shows this similarity
to the end of this section, and here only explain what are the general characteristics
of the final sample.
The final sample is found by mixing the above two samples, and as a result
of our two sampling methods, each relevant RFC has a non-zero (i.e., more than
8.5%) chance of being a part of this mixture. If we have found all 230 RFCs in our
list by simple random sampling, the chance of each relevant RFC being selected
would be exactly 1473230 = 16%. However, our sampling assigns a higher chance
(67%) to major RFCs, and a lower chance (8.5%) to others. As alluded to earlier,
this is done to avoid the bias (towards more popular protocols) associated with
simple random sampling. In other words, our final dataset is a more representative
sample (compared to a simple random sample of the same size) of all types of RFCs
regardless of their popularity, since it captures a larger range of protocols, and
therefore, we can generalize our findings to a broader set of RFCs.
The next and most time consuming step involved characterizing the protocol
described in each RFC according to the features of Section 4.2, as well as label it
as successful or not. Both are essentially manual processes that involve our own
external sources, e.g., books, technology blogs, source code forums, product web
pages, IETF mailing lists, etc. This process, while lengthy, was to some extent
simplified by the fact that we dealt primarily with binary decisions for each feature.
This mitigated the impact of unavoidable inaccuracy in the information that led
to classifying each protocol as either having or not having a particular feature.
Internal cross-validation was performed between the authors, and the result of our
classification is available for external review (See Appendix D). The results are in
the form of a spreadsheet with each row representing one of the 230 protocols under
consideration, and the columns corresponding to the different features of Section 4.2
in addition to the label of successful vs. not successful. Comments are available for
cells in the spreadsheet that highlight the motivations behind its setting and/or
pointers to documentation used to support the choice that was made.
As protocol functionality arguably represents a natural partitioning, we present
in Table 4.1 classification results for functionality features, as well as the corre-
sponding number of protocols deemed successful in our two samples, random and
WG-based. Complete results are available from the spreadsheet.
A T S C R D Succ.
Random (110) 33 10 26 8 24 9 71
WG-based (120) 32 6 36 15 8 23 64
before combining them into a larger dataset. We examined the similarity between
the distributions of our two samples, namely, WG-based and random, by applying
the binomial proportion exact test (i.e.,Fisher’s test), each feature. In other words,
for each feature we have two vectors of binary values (with sizes of 110 and 120),
to which we apply the Fisher’s test. The null hypothesis of this test is whether the
two samples are drawn from similar distributions (H0 : p1 =p2), since it is easier
to define such a test compared to a null hypothesis that the two distributions are
different. If the outcomes show that we cannot reject the null hypothesis for all of
the features (except for those that we intended to be different), then it is reasonable
to assume that combining the two datasets does not create a new bias. In other
words, the ideal outcome is that we cannot reject the null hypothesis for most of
the features with a significance level of 95%. The result of the Fisher’s test is a
p−valuethat basically shows the probability of the two samples being drawn from
the same distribution, i.e., the significance with which the null hypothesis can be
rejected is found by “1 - p-value”. These p−values are listed for each feature in
Table 4.2. The rows labeled “Feature” list the corresponding feature on which the
test was performed, and the rows below them (labeled “p-value”) show the p-values
of the Fisher’s test.
It can be seen that as of the RFCs’ labels, i.e., being successful or not, we
cannot reject the null hypothesis with 95% significance (p−value= 0.107). In other
different distributions when it comes to success or failure of protocols. According
to the results listed in Table 4.2 the above claim is true for all other features except
for 4 of them, namely, being a new protocol, generating additional value for other
protocols, being a Routing protocol, and being a Data plane protocol.
Among these 4 features, “being a new protocol” is the most important, since
the rest of the differences are mostly associated with it (see below). By examining
the datasets, we realized that the WG-based sampling produces more new protocols
than the random sampling method, which as alluded to earlier, is what we expected
to happen, i.e., more protocols with fresh designs are going to be present in this
dataset compared to the random sample.
The rest of the differences are mainly associated with being a new protocol
and how the RFCs are distributed. For instance, new protocols mainly generate
additional value for other protocols, compared to extensions of existing protocols
that mostly generate value for their parent protocols. Therefore, since the WG-
based sampling generates more new protocols, the “generating additional value for
other protocols” feature has a higher proportion in this dataset compared to the
random sample. Also, the difference between the proportion of the routing protocols
in the two samples comes from the fact that there are a few new routing protocols in
the IETF standards, and they are mainly popular, therefore, the random sampling
selects more of them (i.e., extensions of those popular protocols) compared to the
plane protocols between the two samples is due to having a large number of major
data plane RFCs (fresh designs) from various WGs, which results in more of them
being selected by the WG-based sampling.
The results of the above tests show that despite the different sampling methods,
and the slight differences in the datasets that they produce (which were intentional
to avoid the bias toward more popular protocols), the two datasets, namely, WG-
based and random, do not have drastically different distributions. Moreover, as we
explain in the next section, we categorize RFCs based on their functionalities (and
being new or not for some categories) before applying the classification methods
to them, and as a result, the slight differences mentioned above do not affect our
findings. In other words, not only the mixture is a more representative sample of
all types of protocols (regardless of their popularities), but also, according to the
results of the Fisher’s test, it does not introduce a new bias that affects our final
findings.
Feature Succ. A T S C R D
p-value 0.107 0.660 0.301 0.301 0.271 0.001 0.021
Feature New Replace backwards Net. Gear Local Domain Internet
p-value 0.000 0.107 0.186 0.891 0.445 0.884 1.000
Feature Increase Change Gen Value Secur. Scala. Perf. Other
p-value 0.787 0.787 0.013 0.623 0.350 0.508 0.515