4. Data and Analysis
4.1 Overview
4.3.2 Working on the web
Scacchi (2002) noted the extent to which communication about open source project
requirements utilized “informal” online communications. The World Wide Web, in particular,
offers open source software development a means for communicating about projects. Other
communication channels are used for messaging related to project management, such as e-mail
or Internet Relay Chat (IRC), and source code management systems (such as Git[107] or
Mercurial[108]) use file transfers to communicate the actual changes made to code, but it is the
Web which provides the advertising, with the home page of a project giving an important first
impression. In particular, Choi, Chengalur-Smith & Whitmore (2010) analyzed page hits on
project websites and found four significant cues: “project description, screenshot availability,
downloadable initial work availability, and project website availability” (p. 75). Wiki-to-Speech
aimed to ‘speechify’ the delivery of this advertising message through a combination of images
with computer-generated voice overs. An increase in developer interest was expected when
Wiki-to-Speech could ‘talk about itself’. The first example of a talking presentation was
demonstrated for the Auckland Python User Group on 20 April 2011[109].
Web access was also required for chatbot entries competing in the Chatterbox Challenge, the chatbot competition that the original research proposal had suggested as a catalyst to attract developer interest. Winning this challenge in 2004[110] appeared to have contributed to the popularity and success of AIML, the Artificial Intelligence Markup Language, developed by Richard Wallace for Alice[111]. As of the 2011 contest deadline, Wiki-to-Speech was only able to provide a downloadable version[112], which played a wiki-based script[113] as demonstrated in a YouTube video[114] titled “Chatterbox Challenge 2011 Script”. Consequently, rather than winning the Challenge, Wiki-to-Speech was disqualified.
107 http://git-scm.com/
108 http://mercurial.selenic.com/
109 YouTube video “CherryPy and Wiki-to-Speech” http://www.youtube.com/watch?v=8aVL_cG2PZM
110 http://www.daniellechuchran.com/contest_history.html lists Chatterbox Challenge winners: Alice (2004),
111 https://files.ifi.uzh.ch/cl/hess/classes/seminare/chatbots/style.pdf
112 https://code.google.com/p/open-allure-ds/downloads/detail?name=WikiToSpeech-win-0.1d38-for-Chatterbox-Challenge-2011.exe&can=2&q=
113 https://code.google.com/p/wiki-to-speech/wiki/ChatterboxChallenge2011Script
114 http://www.youtube.com/watch?v=MHp2j4OwVD8
115 https://code.google.com/p/wiki-to-speech/wiki/DocumentClassificationUsingTheNaturalLanguageToolkitByBenHealey
While the chatbot approach was text-only, a combination of slides with computer-generated voice overs appeared to offer a better, multimedia approach to project communication. In fact, images combined with voice overs could be used to produce video. The design first aimed to incorporate image links into the wiki-based script, as shown by the lines containing a path parameter, Slide1.JPG, and Slide2.JPG in this excerpt[115] of a script from the Wiki-to-Speech wiki:
[path=http://dl.dropbox.com/u/12838403/20110905/ben_healey_kiwipycon2011_presso_text_to_speech/]
Slide1.JPG
Thanks for coming along. I am Ben Healey and this talk will be about using the python-based Natural Language Toolkit to automate document classification. My background is in market research and analytics, so my day job primarily involves coding in SAS, working with databases and excel to extract business insights. I also advise on survey design and development.
I am relatively new to Python and have recently had reason to use the Natural Language Toolkit to help me with some document classification I need to do. So, when the Kiwi Pycon call for papers came around I thought this would be a good opportunity to learn some more about this process and share my experience with others.
Slide2.JPG
My aim today is to cover the overall process involved in developing a document classification algorithm using the NLTK. You'll come away with an understanding of where to start if you want to do something similar yourself. I'll also introduce some terms specific to Machine Learning and the NLTK.
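The layout of this excerpt can be illustrated with a short parsing sketch. The Python code below is an illustration under stated assumptions, not the Wiki-to-Speech implementation: the names Slide and parse_script are hypothetical, and the rules (a [path=...] line sets a base URL, a line consisting of an image file name starts a new slide, and any other non-empty line is narration for the current slide) are inferred only from the excerpt.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical rules inferred from the excerpt, not taken from the project source.
PATH_LINE = re.compile(r"^\[path=(?P<url>[^\]]+)\]$")
IMAGE_LINE = re.compile(r"^\S+\.(?:jpe?g|png|gif)$", re.IGNORECASE)


@dataclass
class Slide:
    """One slide image plus the paragraphs to be spoken over it."""
    image_url: str
    narration: List[str] = field(default_factory=list)


def parse_script(text: str) -> List[Slide]:
    """Split a wiki-style script into slides with their narration."""
    base_url = ""
    slides: List[Slide] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        path_match = PATH_LINE.match(line)
        if path_match:
            # A path line sets the base URL for the image names that follow.
            base_url = path_match.group("url")
        elif IMAGE_LINE.match(line):
            # An image file name on its own line starts a new slide.
            slides.append(Slide(image_url=base_url + line))
        elif slides:
            # Any other text is narration for the most recent slide.
            # Text that appears before the first slide is ignored here.
            slides[-1].narration.append(line)
    return slides


# Placeholder script with an example.org path, not the real presentation.
example = """[path=http://example.org/talk/]
Slide1.JPG
Thanks for coming along.
I am relatively new to Python.
Slide2.JPG
My aim today is to cover the overall process.
"""

for slide in parse_script(example):
    print(slide.image_url, len(slide.narration))
```

Under these assumptions, each Slide pairs one image URL with the paragraphs to be passed to a text-to-speech engine, which is the pairing a slideshow or video renderer would need.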
This wiki script produced a slideshow version[116] and a video version[117] as output. YouTube’s audience retention measure for the video, shown in Figure 64, indicated that using text-to-speech to generate the voice over did not present an insurmountable obstacle for some viewers, who watched the 32-minute video to the end (average view duration: 3 minutes and 52 seconds).
116 http://dl.dropbox.com/u/12838403/20110905/ben_healey_kiwipycon2011_presso_text_to_speech.htm
117
Figure 64: Audience Retention for the Wiki-to-Speech-generated YouTube video, "Document Classification Using the Natural Language Toolkit", uploaded 5 September 2011. N=783 as of 25 March 2013.
Some presentations worked better than others in this format. For example, a shorter presentation titled “Stigmergy”[118] showed above-average relative audience retention at a point three and a half minutes into the 5-minute, 11-second presentation, as shown in the YouTube report in Figure 65.
118
Figure 65: Relative Audience Retention for the Wiki-to-Speech-generated YouTube video, "Stigmergy", uploaded 12 September 2011. N=265 as of 25 March 2013.
The greatest value added by a Wiki-to-Speech presentation appeared to be its ‘reach’, as many more people might have a chance to view a presentation than would have without the tool. This was demonstrated by a video created using Wiki-to-Speech[119] titled “Understanding the software development process: participation, role dynamics and coordination issues”, posted 31 August 2011, which had 1,276 views and above-average relative audience retention as of 24 March 2013, as shown in Figure 66. Without this Wiki-to-Speech version, no more than 30 people (those attending in person) would ever have heard this talk.
119
Figure 66: Relative Audience Retention for the Wiki-to-Speech-generated YouTube video, “Understanding the software development process: participation, role dynamics and coordination issues”, uploaded 12 September 2011. N=1,278 as of 25 March 2013.
For Wiki-to-Speech to function effectively as an advertisement/trainer for prospective Wiki-to-Speech open source project collaborators, however, the presentation production process needed to be simplified and streamlined. By 23 July 2011, a demonstration video[120] titled “Wiki-to-Speech ODP Conversion Demo” could run through the instructions twice in under 2 minutes, but the procedure still required a dozen steps, including downloading and installing the Wiki-to-Speech tools. Working on the web made possible the additional step of working through the web: providing a presentation conversion service from a server to clients equipped with only a web browser.
120
On 7 May 2011, the first web service for Wiki-to-Speech utilized a Python-based localhost tunneling solution called PageKite[121]. Running the presentation converter on an internet-connected PC started a localhost web service, accessible by a web browser on that same PC. Running PageKite simultaneously made the service visible to the internet at the web address http://wikitospeech.pagekite.me. With this arrangement, users of the system could upload a presentation and have it converted into a talking presentation. Also, if many users were to be supported, the whole conversion system could be located at a cloud computing service, such as Amazon Web Services[122].
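The localhost half of this arrangement can be sketched as follows. This is a minimal illustration using only Python's standard library, not the Wiki-to-Speech converter: ConversionHandler, start_service, upload, the /convert path, and the JSON reply are all hypothetical, and the conversion step is replaced by a stub that merely reports the size of the upload. A tunneling tool such as PageKite would be run separately to expose the local port to the internet.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class ConversionHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that accepts an uploaded presentation via POST."""

    def do_POST(self):
        if self.path != "/convert":
            self.send_error(404, "Unknown path")
            return
        length = int(self.headers.get("Content-Length", 0))
        uploaded = self.rfile.read(length)
        # Stub standing in for the real conversion step (assumption).
        reply = json.dumps({"received_bytes": len(uploaded)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, format, *args):
        pass  # Keep the example quiet; the default writes a log line to stderr.


def start_service(port: int = 0) -> HTTPServer:
    """Start the service on localhost only; port 0 picks any free port."""
    server = HTTPServer(("127.0.0.1", port), ConversionHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def upload(server: HTTPServer, payload: bytes) -> dict:
    """Act as the browser on the same PC: POST a file to the local service."""
    url = f"http://127.0.0.1:{server.server_port}/convert"
    request = urllib.request.Request(url, data=payload, method="POST")
    # An empty ProxyHandler bypasses any system proxy for this local request.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return json.loads(response.read())


server = start_service()
print(upload(server, b"fake presentation bytes"))
server.shutdown()
server.server_close()
```

Binding to 127.0.0.1 mirrors the description above: on its own, the service is reachable only from the same machine, and it becomes publicly visible only once a tunnel forwards outside requests to the local port.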