Limitations and Reliability of the Data - Internet Standardization: A Participant Analysis

Establishing the limitations and reliability of the data we have collected is perhaps not entirely trivial. Data is collected from several, quite different, sources and some of the data is embedded in the actual text of IETF working documents.

There are several aspects of the data we have collected that are uncertain. The data we have collected was not originally intended for, nor set up with the goal that it would be easy to insert into a relational database and be analyzed. While building the project database we have had to make several manual enhancements and exceptions to the algorithms parsing the IETF data. Mostly unstructured natural language is, as mentioned, challenging to parse using computer scripts. Therefore the integrity, validity, consistency, and accuracy of the data might also be hard to establish exactly.

The IETF working documents are the primary source of data we explored. They are also the most challenging in respect to establishing an exact quality of the data collected. The data we collect reside in the Author’s Address section of the documents. The specification for the Author’s Address section can be found in RFC 2223, and it reads as follows: Each RFC must have at the very end a section giving the author’s address, including the name and postal address, the telephone number, (optional: a FAX number) and the Internet email address [116]. RFC 2223 does however not specify specifically who to mention in the Author’s Address section.

This very issue was also discussed in the Document Life cycle Training session on the 73^rd IETF meeting. The conclusion in the training session was that currently the IETF has somewhat imprecise rules on defining and documenting exactly who should be documented as authors and contributors in the Author’s Address section of the document. Currently the data reflects what the authors have chosen to write in the documents and thus significant contributors can be missing or others can be mentioned as contributors even though they have not contributed at all.

In this case, the amount of IETF documents in the project database permits us to do quite extensive manual evaluation of the parsed results. We have 5334 RFCs and 53911 Internet-Drafts in the project database. The parsed Author’s Address sections of all RFCs have been checked and errors have been corrected. We also created tools that enable us to make and store custom rules and exceptions to enhance the accuracy of the Author’s Address section parser algorithm. The special rules and exceptions database table contain 3648 strings we know that are author names, 2465 strings we know to be organization names, 613 strings we know that certainly are not author names, and 1089 lines that we order the parser to skip altogether. These special rules greatly enhance the parser algorithm and in many cases enable us to correctly parse several Author’s Address sections that without these rules would be quite impossible to parse.

Meeting attendee lists for each IETF meeting are published as a part of the meeting proceedings. These list also reflect what the attendees have chosen to document about themselves when registering for the meeting. 31 of the 45 IETF meeting attendee lists we have inserted into our database contain the most relevant data for our analysis, namely the attendee names and their organizational affiliation. 14 of the attendee lists lack information about the attendees’ organizational affiliation, however those lists do contain the attendees’ email addresses. Cross referencing meeting attendee lists and author data from the working documents help us to establish the missing information. Email address domain names are also helpful in establishing the organizational affiliation in many cases.

To collect data from the IETF mailing lists should be technically a fairly straight forward task. However, the mailing list message files, that should confirm to the Internet Message Format, does not always do that. There are several erroneous files that contain incorrect messages in the archive. The reasons for the errors are difficult to present. Certainly there are email client software that do not produce messages conforming to the prevalent standard. The bulk of the messages are parseable and our parser script can successfully extract the specified data from the messages. In

addition to technical obstacles, there is also the issue of spam messages on the mailing lists. The IETF mailing lists are however managed in accordance to an IESG statement that spam control must be provided on each IETF mailing list [73].

However, there are a substantial amount of spam messages on the IETF mailing lists interfering with our analysis. The parser script does however match the X-Spam-Score header field and insert the value into the project database, thus enabling us to sort out the bulk of the spam messages later on.

The additional data is mostly collected from external sources and stored, as is, into the project database. Financial data about the selected set of companies is picked up from the financial statements in the annual reports of the companies by hand. Stock performance data is gathered from the MSN Money service, also using a manual process. In the scope of this thesis we thus consider the aforementioned data to be reliable and correct. The organizations for which we have collected financial data are presented in Section 5.4 in Table 11.

The USPTO database was queried using the full-text quick search tools provided on the web-page [142] [140]. The full-text search returns the patents or patent applications that contain our search string anywhere in the writing. We did not limit the query to match our search string only in References section. Using this approach we believe that we will get a more generic salience figure for a specific RFC-document in the patent domain.

Similarly the Google Search and the Google Scholar academic literature Search gives us generic figures on the importance of the RFCs. Both databases were queried in a full-text search manner, thereby the number of hits reflect documents where the search string may be anywhere in the writing. Regarding the Google queries, it is also challenging to establish the definite quality dimensions and limitations of the data. The data will therefore be used bearing these limitations in mind. The Google Search hits reflects a general popularity of the documents within the Internet community and the Google Scholar hits will provide us with a general understanding of how each RFC have impacted the academic community.

5 Analysis

In this chapter we will perform certain analyses on the data we have constructed into our database. Primarily we will search for a meter measuring success in Inter-net standardization. We will also take a closer look at the accomplishments of the Finnish ICT cluster within the IETF. Ultimately we will also establish the correla-tion between the efforts put into Internet standardizacorrela-tion by IETF participants and their success on the marketplace.

5.1 Methodology

As mentioned, our goal is to quantify the results and take different analytical views on IETF standardization work. We have collected data measuring several different aspects of participants’ activities within the IETF. In addition to that, we also have data that enables us to look at the participants’ performance outside the standard-ization scene and thereby compare their standardstandard-ization activities to their perfor-mance in their actual line of business.

Deciding how to measure performance and success in standardization work, proved to be challenging. Public academic research on this topic seems to be quite scarce.

There are some articles focusing on e.g. measuring the performance of SDOs, other articles looking at the impact of the actual standards produced, and also some articles analyzing the co-operative and competitive aspects of standardization work [128], [125], [100]. Research on the inner workings of SDOs appears absent, also data on the inner workings of SDOs is not generally available. Hence, in the scope of this thesis we have had to put quite an effort in constructing the IETF data to be analyzed. The aforementioned issues consequently have an effect on our choice of analysis methodology.

A primary goal will be to establish what the benchmark for success regarding IETF standardization work is. This part of the analysis will be done from the IETF’s point of view. Exemplary participants best serving the IETF’s mission will be isolated.

IETF data covering a longer time-span will be used to enable us to establish trends and changes over time.

In addition to taking a look at the inner workings of the IETF we will also establish the correlation between the level of activity in Internet standardization and partici-pants’ success on the marketplace. Figures such as Net Sales and R&D expenditure will be compared to quantities in IETF participation.

Finally we will also analyze the accomplishments of the Finnish ICT cluster in the IETF. The Finnish participants will be extracted from the data and similar analyzes will be made on them, as on the IETF general population. The Finnish company SSH Communications Security will be given special attention, as their technology and their success has a special relationship to standards development work done in the IETF.

Our analysis methodology also includes a system to classify IETF documents by importance, thereby giving participant contributions an extra dimension of qual-ity. This document importance classification can be made only for the RFCs, as there by definition are no references to Internet-Drafts. The IETF process docu-ment states that under no circumstances should an Internet-Draft be referenced by any paper, report, or Request-for-Proposal, nor should a vendor claim compliance with an Internet-Draft [13]. The RFCs are however given an importance dimension measured using the variables listed in Table 8.

Table 8: RFC importance factors.

Variable Amount of Source

RFCcr RFC cross-references IETF RFCs

RFCip Issued patents PATFT: USPTO

RFCpa Patent applications AppFT: USPTO RFCar Ref. in academic literature Google Scholar RFCgh Hits using Google search Google Search

Using all five importance variables will enable us to calculate a unified importance value, RFCi, for each RFC as in Equation 1.

RF Ci = (RF Ccr)/100 + (RF Cip)2 + (RF Cpa)1 + (RF Car)/100 + (RF Cgh)/1000 (1) Equation 1 will be used in Section 5.2 to classify all RFCs by their impact and as a continuum to that, in Section 5.3, we will determine the participants who have actively contributed to the most influential RFCs. First we will however take a look at some general parameters of IETF work.

In document Internet Standardization: A Participant Analysis (Page 51-55)