5.3 Data Retrieval
5.3.2 Template-based Crawler within OSS Infrastructures
As previously described, in OSS a vast amount of publicly available community information is distributed over the project Web resources: mailing lists, forums, etc. (cf. Subsection 5.2.1). These platforms provide a huge data pool for further knowledge mining. Collecting this information from the Web is the first step in building a comprehensive analysis. To evaluate our template-based crawler, we applied it to crawl data from the Eclipse forum18 and the Eclipse mailing list archives19.
Both platforms use standardized software packages. EclipseZone runs on Jive Forum20, a Java Web application for bulletin boards and discussion forums. The Eclipse mailing list archives use Mailman21, an OSS application designed specifically for mailing list management. The information and communication structure of the Eclipse user community on EclipseZone is depicted in Figure 5.11. The main structural entities are the forums, each of which provides the context for a particular set of discussions. Each discussion board is usually related to a particular Eclipse plug-in such as the Web Tools Platform. An overview Web page of the project lists all available forums and serves as the subject of the crawler’s root task. From this overview page, several independent subtasks are spawned, one for each forum. They, in turn, crawl the forums’ thread lists and extract information about thread subjects, their authors, post timestamps, and other statistics such as the view count. For each thread found in the list, a new task is created, which is responsible for downloading the thread and retrieving the actual contents of its posts. Finally, from the content, we can retrieve links to the user profiles in order to collect information about the users’ date of registration, professional level, home page, and other useful attributes.
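The task hierarchy described above (root task, forum tasks, thread tasks, profile tasks) can be illustrated with a minimal sketch. It is not the crawler's actual implementation: the in-memory `SITE` dictionary, the page names, and the `crawl` function are hypothetical stand-ins for the Web pages and the XSL templates, and a simple queue replaces whatever task scheduling the real crawler uses.

```python
from collections import deque

# Hypothetical in-memory "site". Each entry lists the links that a template
# would extract from that page; no network access is involved.
SITE = {
    "overview": {"kind": "overview", "links": ["forum/wtp", "forum/jdt"]},
    "forum/wtp": {"kind": "forum", "links": ["thread/1", "thread/2"]},
    "forum/jdt": {"kind": "forum", "links": ["thread/3"]},
    "thread/1": {"kind": "thread", "links": ["user/alice"]},
    "thread/2": {"kind": "thread", "links": ["user/bob", "user/alice"]},
    "thread/3": {"kind": "thread", "links": ["user/bob"]},
    "user/alice": {"kind": "profile", "links": []},
    "user/bob": {"kind": "profile", "links": []},
}


def crawl(root):
    """Process tasks from a queue; every processed page may spawn subtasks."""
    queue = deque([root])
    seen = {root}          # avoid crawling the same profile twice
    visited = []           # (page kind, page name) in processing order
    while queue:
        url = queue.popleft()
        page = SITE[url]
        visited.append((page["kind"], url))
        for link in page["links"]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


visited = crawl("overview")
for kind, url in visited:
    print(kind, url)
```

Because every task only depends on the page it was spawned from, the subtasks for different forums and threads are independent of each other, which is what allows them to be processed separately.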
Figure 5.11: Forum Crawling by Template-based Crawler.
18 EclipseZone, http://www.eclipsezone.com/eclipse/forums/c5605.html, last checked 2014/03/17
19 Eclipse Mailing Lists, https://dev.eclipse.org/mailman/listinfo, last checked 2014/03/17
20 Jive Forum, http://www.jivesoftware.com/, last checked 2014/03/17
21 Mailman, http://www.gnu.org/software/mailman/, last checked 2014/03/17
For mailing lists the procedure is similar. Older mails can only be found in the mailing lists’ archives. Here again, the overview page provides a list of all available mailing lists within the project community. Accordingly, this page serves as the entry page from which separate tasks for each individual mailing list are spawned, as depicted in Figure 5.12. Every mailing list archive page provides, for each month of each year, links to the postings ordered by thread, subject, author, or date. From this page, a list of independent subtasks is created, one for every year-month combination. To replicate the discussion structure, the crawler templates prescribe following the ‘ordered by thread’ page. Each generated subtask extracts the thread subject lines and the dates of the posts. Then, for each detected link, a new task is created that is responsible for downloading the post and retrieving its content and date, information about its author (name and email), and some other useful attributes. In the particular case of Mailman mailing lists, extracting posting dates is challenging because the date information is spread across the Web page. Therefore, two nested loops have to be applied: (1) to extract the date of a group of messages and (2) to extract the specific time of each message. Data format and position, of course, depend on the mailing list software package the archive uses and on the configured presentation format. The thread structure of the messages is derived from reply relationships. The implemented mechanism checks for each message whether it is a thread starter or a reply to another posting. In the first case, a new thread is created in a local database with the name of the header message. In the second case, the reply message is assigned to the same thread as the message it replies to.
Figure 5.12: Mailing List Archive Crawling by Template-based Crawler.
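The reply-based thread reconstruction can be sketched as follows. This is only an illustration under stated assumptions: the message dictionaries, the field names `id`, `in_reply_to`, and `subject`, and the function `assign_threads` are hypothetical, and messages are assumed to arrive in an order in which a message precedes its replies. A reply whose parent is unknown is treated as a thread starter here, which is a design choice of this sketch and not necessarily the crawler's behaviour.

```python
def assign_threads(messages):
    """Group messages into threads using their reply relationships."""
    thread_of = {}   # message id -> id of the thread it belongs to
    threads = {}     # thread id -> {"name": ..., "messages": [...]}
    for msg in messages:
        parent = msg.get("in_reply_to")
        if parent in thread_of:
            # Reply: join the thread of the message being replied to.
            tid = thread_of[parent]
        else:
            # Thread starter (or unknown parent): open a new thread
            # named after the header message.
            tid = msg["id"]
            threads[tid] = {"name": msg["subject"], "messages": []}
        thread_of[msg["id"]] = tid
        threads[tid]["messages"].append(msg["id"])
    return threads


# Hypothetical sample data.
messages = [
    {"id": "m1", "in_reply_to": None, "subject": "Build fails"},
    {"id": "m2", "in_reply_to": "m1", "subject": "Re: Build fails"},
    {"id": "m3", "in_reply_to": "m2", "subject": "Re: Build fails"},
    {"id": "m4", "in_reply_to": None, "subject": "Plug-in question"},
]
threads = assign_threads(messages)
print(threads)
```

Replies to replies end up in the thread of the original starter because each message inherits the thread identifier of its parent.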
All information extracted from the Web media by our crawler is stored in a local DB2 database for further analysis. The XSL templates generate the corresponding SQL queries for duplicate checks as well as the insert or update statements, depending on whether the data is already available. This approach helps maintain a complete and consistent replicated data set. Additionally, the crawler considers only posts which have not been crawled in previous runs. It requests the latest timestamp of the posts already stored in the database and compares it with the timestamps extracted from the Web pages being crawled. As soon as the crawler finds a thread which contains only older posts, the spawning of new tasks is stopped. This saves a considerable amount of time and bandwidth. The mechanism is entirely controlled within the XSL templates, making other, more exhaustive scenarios feasible as well. Incremental crawling allows us to retrieve fast and complete updates. Thus, with the crawler installed as a cron job, a true copy of the data can be maintained with minimal effort. The concept of the template-based crawler fits well into the Mediabase workflow (cf. Figure 5.2).
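The duplicate check and the incremental stop condition can be sketched as below. Several things here are assumptions made for illustration: SQLite from the Python standard library stands in for DB2, the SQL is written by hand instead of being generated by XSL templates, the `posts` table layout and all function names are hypothetical, timestamps are ISO-formatted strings (so string comparison matches chronological order), and thread lists are assumed to be ordered by most recent activity first.

```python
import sqlite3

# SQLite is only a stand-in for the local DB2 database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id TEXT PRIMARY KEY, thread TEXT, posted TEXT, body TEXT)"
)


def upsert(thread_id, post):
    """Check for a duplicate, then update or insert accordingly."""
    exists = conn.execute(
        "SELECT 1 FROM posts WHERE id = ?", (post["id"],)
    ).fetchone()
    if exists:
        conn.execute(
            "UPDATE posts SET thread = ?, posted = ?, body = ? WHERE id = ?",
            (thread_id, post["posted"], post["body"], post["id"]),
        )
    else:
        conn.execute(
            "INSERT INTO posts (id, thread, posted, body) VALUES (?, ?, ?, ?)",
            (post["id"], thread_id, post["posted"], post["body"]),
        )


def latest_timestamp():
    """Latest stored post timestamp, or '' if the table is empty."""
    row = conn.execute("SELECT MAX(posted) FROM posts").fetchone()
    return row[0] or ""


def crawl_incrementally(thread_list):
    """Store only new posts; stop at the first thread without any."""
    cutoff = latest_timestamp()
    stored = 0
    for thread in thread_list:  # assumed: most recently active first
        new_posts = [p for p in thread["posts"] if p["posted"] > cutoff]
        if not new_posts:
            break  # only older posts: stop spawning further tasks
        for post in new_posts:
            upsert(thread["id"], post)
            stored += 1
    return stored


# Hypothetical crawl result, newest thread first.
first_run = [
    {"id": "t2", "posts": [{"id": "p3", "posted": "2014-03-02T10:00", "body": "c"}]},
    {"id": "t1", "posts": [
        {"id": "p1", "posted": "2014-03-01T09:00", "body": "a"},
        {"id": "p2", "posted": "2014-03-01T12:00", "body": "b"},
    ]},
]
stored_first = crawl_incrementally(first_run)   # initial full crawl
stored_again = crawl_incrementally(first_run)   # nothing new on the site
```

On the second call the very first thread already contains only known posts, so the loop ends immediately without touching the remaining threads, which is where the savings in time and bandwidth come from.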
It is worth noting that all artifacts collected from both the forums and the mailing lists are stored in XML format. Most importantly, we use the pureXML extension of the DB2 database, which allows a fully intact replication of the data together with advanced query options. This approach makes it possible to select all media objects from the user posts, such as embedded Adobe Flash objects or ordinary JPEG images. These elements are marked with special HTML/XML tags (e.g. img or object/embed).
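To illustrate how media objects can be selected by their tags, the following sketch parses a single stored post and collects its img, object, and embed elements. It is an assumption-laden stand-in: in the setup described above such a selection would be expressed as a query against the XML column in DB2, whereas this example uses Python's standard XML parser, and the document layout in `POST_XML` as well as the function `media_objects` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical stored post; the real XML layout is not given in the text.
POST_XML = """
<post id="p1">
  <body>
    <p>See the screenshot: <img src="shot.jpg"/></p>
    <object data="demo.swf"><embed src="demo.swf"/></object>
  </body>
</post>
"""

MEDIA_TAGS = {"img", "object", "embed"}


def media_objects(xml_text):
    """Return (tag, resource) pairs for all media elements in a post."""
    root = ET.fromstring(xml_text)
    found = []
    for element in root.iter():  # document order
        if element.tag in MEDIA_TAGS:
            resource = element.get("src") or element.get("data")
            found.append((element.tag, resource))
    return found


media = media_objects(POST_XML)
print(media)
```

Keeping the posts as intact XML is what makes this kind of structural selection possible; a plain-text copy of the post body would have lost the tags.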