Evaluation - An Alternative Approach: Bootstrapping a Value Chain


5.1.3 Evaluation

To assess the usability of our semi-automatic approach, we compare it to a fully manual metadata extraction and import workflow. We present an approximate metric to measure the manual effort required for importing event metadata using our approach vs. the fully manual one. The following algorithm shows the steps required when performing this task fully manually:

Algorithm 2 Extraction. Steps for Manual Extraction and Import of Information

1: Prepare a CSV file with the following columns: event title, event date, city, country, field, event type, event homepage (each piece of information counts as one unit in our measurement)

2: for each email do

3:     Check (manually) if the email is a CfP email

4:     Open the relevant email

5:     Read the content and write down the following information in the CSV file: event title, event date, city, country, field, event type, event homepage (each piece of information counts as one unit in our measurement)

6: Import the CSV file into OpenResearch.org

We assume that the user has the level of experience required to carry out these steps. To approximately measure the user’s effort, we count each atomic action the user has to perform manually as one unit. To determine the difference between manually gathering information and using SAANSET over a fixed period of time, we calculate the number of units required. This allows us to quantify the improvement that SAANSET provides over the fully manual workflow.

Showcase: Semantic Web Emails over Three Months. We applied the evaluation methodology introduced previously to the task of importing event metadata from the posts of the semantic web mailing list from March to May 2017. Table 5.2 shows 312 emails over that period, 88 of them being CfPs.

The effort to perform all required actions (let their number be n) is computed as follows:

\[ \mathit{overallEffort}(a) = \sum_{i=0}^{n-1} \mathit{effortPerAction}(i) \]

The following enumeration shows the effort required for each manual action of performing the overall task of extracting event metadata from the given 312 emails. The effort is given in units (u):

1. 1u: create a CSV file

2. 312u: check each email if it is a CfP

3. 88u: open each CfP email

4. 616u: from each CfP email, extract 7 pieces of information

5. 1u: import the CSV file into OR

Thus, the total effort of gathering and importing information amounts to 1018u.

Using SAANSET, the first two steps from this list (creating a CSV file and checking whether each email is a CfP) are automated, so the remaining manual effort amounts to 88u + 616u + 1u = 705u. In this particular example, the user therefore needs to perform ∼31% fewer manual actions for the complete process when using SAANSET.
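The unit counts of this showcase can be reproduced with a few lines of code. The following is a minimal sketch of the counting scheme only; the class and method names are hypothetical and are not part of SAANSET.

```java
// Sketch of the unit-counting metric from the showcase.
// All names here are illustrative assumptions, not SAANSET code.
class ShowcaseEffort {
    // event title, event date, city, country, field, event type, event homepage
    static final int FIELDS_PER_CFP = 7;

    // Fully manual workflow: create CSV (1u) + check every email
    // + open each CfP email + extract 7 fields per CfP + import CSV (1u).
    static int manualEffort(int totalEmails, int cfpEmails) {
        return 1 + totalEmails + cfpEmails + FIELDS_PER_CFP * cfpEmails + 1;
    }

    // With SAANSET, creating the CSV file and checking the emails are
    // automated; opening, extracting, and importing remain manual.
    static int assistedEffort(int cfpEmails) {
        return cfpEmails + FIELDS_PER_CFP * cfpEmails + 1;
    }

    // Percentage of manual actions saved by the assisted workflow.
    static double savedPercent(int manual, int assisted) {
        return 100.0 * (manual - assisted) / manual;
    }

    public static void main(String[] args) {
        int manual = manualEffort(312, 88);
        int assisted = assistedEffort(88);
        System.out.println("manual units:   " + manual);
        System.out.println("assisted units: " + assisted);
        System.out.println("saved percent:  " + savedPercent(manual, assisted));
    }
}
```

For the 312 emails and 88 CfPs of the showcase, this yields 1018u and 705u, i.e., a saving of roughly 31%.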

Formalization and Generalization. From the previous calculation, the benefit gained by using SAANSET in the scenario explained above can be quantified as 31%, which serves as a reference point. For a general benefit analysis, we define a formalization as follows. Let $N$ be the total number of emails received through one or more mailing lists and $N_C$ the number of CfP emails among them.

Following the same steps as for the specific scenario above, we arrive at the following expression for the effort of manually gathering information:

\[ 2u + Nu + 8N_C u \tag{5.1} \]

2u corresponds to the initial, one-time actions of creating a CSV file and importing the metadata to OR. For each of the $N$ emails, the user has to check whether it is a CfP or not (effort $Nu$). For each of the $N_C$ CfP emails, the user has to open it and extract 7 pieces of information from it into the CSV file (event title, event date, city, country, field, event type, event homepage), resulting in an effort of $8N_C u$.
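As a consistency check, substituting the showcase values ($N = 312$, $N_C = 88$) into expression (5.1) reproduces the total effort counted for the fully manual workflow:

```latex
\[ 2u + 312u + 8 \cdot 88\,u = 2u + 312u + 704u = 1018u \]
```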

Assuming a large number of emails, we can neglect the constant summand 2u, resulting in an approximate effort of $Nu + 8N_C u$. Using SAANSET, which automates the checking of whether or not an email is a CfP, the user’s remaining manual effort reduces to $8N_C u$. The ratio of the effort with SAANSET vs. the all-manual workflow is therefore approximated by the following expression:

\[ \frac{8N_C}{N + 8N_C} \tag{5.2} \]

Equivalently, the following benefit function answers the user’s question of what percentage of his or her effort will be saved thanks to SAANSET:

\[ \mathit{savedEffort} = \left(1 - \frac{8 \cdot N_C}{N + 8 \cdot N_C}\right) \cdot 100 \tag{5.3} \]

[Plot: x-axis "Ratio of CfP emails to total emails" (0 to 1); y-axis "Benefit of the implementation in %" (0 to 100)]

Figure 5.3: Effectiveness of SAANSET. The behavior of SAANSET is reported in terms of the benefit function, plotted against the ratio of the number of CfP emails to the total number of emails in a mailing list.

Figure 5.3 visualizes the benefit function (5.3). We can clearly see that SAANSET provides greater benefit the lower the ratio of CfP emails is, or, in other words, the higher the ratio of irrelevant emails is. If no CfP emails are sent via a mailing list, SAANSET eliminates all emails and there is no need for manual user actions. If all emails were CfPs, SAANSET would automate 1/9 of the work (about 11%), leaving the manual extraction of information to the user.
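The benefit function (5.3) is simple enough to evaluate directly. The sketch below is an illustration under the same simplification as the text (the constant 2u is neglected); the class and method names are hypothetical.

```java
import java.util.Locale;

// Sketch of the benefit function (5.3); names are illustrative assumptions.
class BenefitFunction {
    // Percentage of manual effort saved, given the total number of emails N
    // and the number of CfP emails N_C (with 0 <= N_C <= N and N > 0).
    static double savedEffort(long totalEmails, long cfpEmails) {
        if (totalEmails <= 0 || cfpEmails < 0 || cfpEmails > totalEmails) {
            throw new IllegalArgumentException("require N > 0 and 0 <= N_C <= N");
        }
        double assisted = 8.0 * cfpEmails;        // 8 * N_C
        double manual = totalEmails + assisted;   // N + 8 * N_C
        return (1.0 - assisted / manual) * 100.0;
    }

    public static void main(String[] args) {
        long total = 1000; // arbitrary list size chosen for the sweep
        // Sweep the CfP ratio from 0 to 1 in steps of 0.2, as on the x-axis of Figure 5.3.
        for (int step = 0; step <= 5; step++) {
            long cfps = total * step / 5;
            System.out.println(String.format(Locale.ROOT,
                    "ratio=%.1f saved=%.1f%%", step / 5.0, savedEffort(total, cfps)));
        }
    }
}
```

The two boundary cases discussed above fall out directly: a ratio of 0 gives a saving of 100%, and a ratio of 1 gives 100/9, i.e., about 11.1%. For the showcase (N = 312, N_C = 88) the function returns about 30.7%, in line with the 31% obtained from the exact count.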

Adaptation to the specific structure of a given mailing list is as easy as changing one XPath expression, as shown by the placeholder XPATH_QUERY in the following listing:

    // get some information from the XML
    for (int i = 0; i < children.length; i++) {
      XML[] informations = xml.getChildren(XPATH_QUERY);
      information = informations[i].getContent();
    }
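A comparable extraction step can be sketched in plain Java with the standard javax.xml.xpath API. This is an illustration under assumptions, not SAANSET's actual code: the archive structure, the sample subjects, and the value of XPATH_QUERY below are made up for the example, and only the query string would need to change for a differently structured mailing list archive.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Sketch: extract text content from XML with a configurable XPath query.
class XPathExtraction {
    // Hypothetical query for a made-up archive layout; adapt per mailing list.
    static final String XPATH_QUERY = "/archive/message/subject";

    // Returns the trimmed text content of every node matched by the query.
    static List<String> extract(String xml, String query) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(query, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            // Wrap checked parser/XPath exceptions to keep the call site simple.
            throw new IllegalStateException("XML extraction failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample input, for illustration only.
        String xml = "<archive>"
                + "<message><subject>CfP: ExampleConf 2017</subject></message>"
                + "<message><subject>Re: SPARQL question</subject></message>"
                + "</archive>";
        System.out.println(extract(xml, XPATH_QUERY));
    }
}
```

Evaluating the query once and iterating over the resulting node list avoids re-running the query on every loop iteration.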

In document Collaborative Integration, Publishing and Analysis of Distributed Scholarly Metadata (Page 142-145)