Chapter 2: Literature Review
3.5 Content analysis: Duolingo discussion forum user statements
In the case of the Duolingo (2012) Discussion forums, the texts for analysis came from a search on the Duolingo website, where any registered user can search for any word across the Discussion streams via the “Discussion” tab at the top of the page. Figure 3.6 shows a screenshot of the Discussion stream page, with the message threads that appear under the “Popular” tab and the Search box into which the search term is entered.
Figure 3.6: Discussion stream: www.duolingo.com/discussion
Because these texts are web-based, it was initially felt that they would be easily accessible. However, as Duolingo is a proprietary entity, there were certain restrictions as to the number of message threads available and the randomness with which they were generated, which posed some challenges for the validity of the study. Once the documents were isolated, the next step in the process was the application of the coding schema in Figure 3.5.
3.5.1 Access to content and sampling
To analyse the statements of Duolingo users, it was necessary to access the website (Duolingo, 2012). Once a user profile has been created, all Discussion forums on the site are accessible, and any discussion can be read by clicking the “Discussion” tab at the top of the Homepage and entering a search term into the “Search” field. All discussion on the site is hosted in this “Discussion stream,” but the volume of traffic is very high. At this stage, access was made difficult by the fact that the data is owned by the company and that its scope is in the range of “big data” (Neuendorf 2016, p. 204).
The initial trawl through the Discussion forums took the form of searching for specific terms across all of the message threads. The user interface on the site is problematic in that, when a search term is entered into the Discussion stream, the results are returned as links and the page on which the search term was entered is not saved. In practice, this means that it is not possible to use the back arrow to return to the Discussion stream for the particular search term, and returned items need to be opened in separate windows. The development team was contacted on a number of occasions, but no information was forthcoming about their algorithms for matching search terms or about granting access to the forums for research purposes. Initially the decision was taken to search for the three specific terms from the SDT Motivation Theory, namely Competence, Autonomy and Relatedness, and the results of these searches were kept in Word documents and added to subsequent lists of messages from the Discussion forums. However, the numbers returned by these searches were too high to be workable, necessitating another approach.
In order to sample messages in a more representative fashion, the search term “the” was entered into the Discussion Stream. This was used as a way of estimating how many messages were on the forums altogether. Because Duolingo is a proprietary site, the results were not entirely accurate, and the algorithms for their return were not transparent. The initial search returned 800,000+ answers. On returning to the search, however, I found that the number of messages had increased, until a final search, left running overnight (March 29, 2017), suggested somewhere around 1.2 million responses altogether. According to Oates, a representative sample for a “population of 1 million or more … (is) … just over 1000” (2006, p. 101). Thus, it was determined that I would conduct the analysis on 1000 message threads.
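A figure of “just over 1000” is consistent with a standard sample-size calculation for estimating a proportion. The sketch below uses Cochran's formula with a finite population correction. The parameters (95% confidence, a ±3% margin of error and maximum variability, p = 0.5) are assumptions chosen for illustration; they are not taken from Oates, and the function name is hypothetical.

```python
import math

def cochran_sample_size(population, margin=0.03, z=1.96, p=0.5):
    """Sample size for estimating a proportion in a finite population.

    margin, z and p are illustrative assumptions, not values from the thesis.
    """
    # Size required for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction; negligible when the population is very large.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

# Approximate forum size suggested by the final overnight search.
required = cochran_sample_size(1_200_000)
print(required)
```

Under these assumptions the required size changes very little between a population of 800,000 and one of 1.2 million, which is why the uncertainty in the total message count has little bearing on the target sample size.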
Isolating 1000 messages in a random manner also proved difficult. Due to the proprietary nature of the data, I was forced to rely on the randomness of the messages returned in Duolingo’s own searches, without knowing how this was determined by the site. This meant that the initial search from the 1.2 million+ responses generated only 920 messages, presumably because the data was too big to be handled by our servers. The documents associated with these responses were copied into a spreadsheet. Among these, duplicates were found, and 51 more results appeared at the end of the original 920 messages. After all duplicates were removed from the 971 messages, 913 entries remained. A further two duplicates were isolated, leaving 911 unique message threads from the Discussion forums.
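The duplicate removal described above was carried out in a spreadsheet. A scripted equivalent might look like the following sketch, which keeps the first occurrence of each thread in its original order. The matching rule (ignoring case and extra whitespace) is an assumption, and the sample messages are invented for illustration.

```python
def remove_duplicates(threads):
    """Return threads with duplicates dropped, keeping first occurrences in order."""
    seen = set()
    unique = []
    for text in threads:
        # Normalise whitespace and case so trivially different copies match.
        key = " ".join(text.split()).casefold()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

# Invented examples standing in for copied message threads.
sample = [
    "How do streaks work?",
    "how do  streaks work?",
    "Is Welsh coming soon?",
]
unique = remove_duplicates(sample)
print(len(sample), "copied;", len(unique), "unique")
```

Running a check of this kind in a single pass would avoid the need for a second round of duplicate removal, although near-duplicates that differ in wording would still need to be identified by hand.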
This left the problem of how to bring the number to a total of 1000. A further 90 messages were generated using the three search terms from SDT and taking the first 30 responses from each search. This ensured that the data set contained a proportion of documents relevant to the analysis, but may slightly neutralise the relative differences between the prevalence figures of each motivational component. Although this is not a fully replicable methodology, given the constraints of working with this proprietary website, this was determined to be the best way to approach the generation of a representative sample. It should be noted that if the search were to be conducted on a different day, the results could well be different, both because of the lack of transparency in the search algorithms and because the site is live, and users visit and update it daily.
As stated in Appendix D, the 1000 message threads used as texts in this analysis were too long to include in the Appendices. For this reason, they are in a separate file contained on a CD in the back cover of the thesis and available from the author, on request.
3.5.2 Method of analysis
The texts taken from the Discussion forums were isolated into units consisting of either the paragraph in which the term appeared, or the entire comment. Sometimes the entire comment needed to be included because the term appeared a number of times within that comment, each time performing the same function, and counting each appearance as a separate use of the term would have skewed the results. The decision on the length of the unit was made when conducting the search for the term as set out below.
After collating the messages into one PDF document, I began a search to isolate the terms and their synonyms. These synonyms were as identified in sections 3.4.2.1 and 3.4.2.2, but once again the number of responses made handling the data in any meaningful way very difficult. It was clear at this stage that the Duolingo official publications were much smaller in number, and I took the decision to conduct the Content analysis on these documents first. Given that I was looking to compare Duolingo from the users’ perspective with Duolingo as described by its creators, it was decided that the search would be most useful for comparison if I used only terms found to be present in the Duolingo official publications. The Discussion forum message threads were explored only after that search was complete.
Terms were entered into the Advanced search box in Adobe Acrobat Pro DC Reader, where users are given the option of entering a word or phrase, and then selecting whether the search should be whole words or not. A separate screen returning all the responses then appears. From here, instances of the terms were isolated, and the entire comment was copied and pasted into a Word document before being copied into a table, with the term highlighted in bold text. Once all of the words were found, these separate units were analysed according to the analysis schema as set out in section 3.4.3.3.
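The search itself was carried out in Adobe Acrobat. The same logic, a whole-word or partial match with each comment treated as a single unit however many times the term appears in it, can be sketched as follows. The function name and the sample comments are hypothetical and serve only to illustrate the unit-of-analysis rule.

```python
import re

def find_units(comments, term, whole_word=True):
    """Return (comment, hit_count) for every comment that contains the term."""
    pattern = re.escape(term)
    if whole_word:
        # Word boundaries stop "communicate" from matching "communicates".
        pattern = r"\b" + pattern + r"\b"
    regex = re.compile(pattern, re.IGNORECASE)
    units = []
    for comment in comments:
        hits = len(regex.findall(comment))
        if hits:
            # One unit per comment, even when the term appears several times.
            units.append((comment, hits))
    return units

# Invented comments for illustration.
comments = [
    "I want to communicate with my in-laws. To communicate well I need practice.",
    "She communicates clearly.",
    "Streak freeze is great.",
]
units = find_units(comments, "communicate")
for comment, hits in units:
    print(hits, "hit(s):", comment)
```

Recording the hit count alongside each unit keeps the information about repeated use without letting a single comment inflate the number of units.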
A separate count of occurrences was kept, detailing:
• overall number of units;
• number of foreign words;
• number of linguistic usages;
• number of irrelevant usages;
and, where relevant:
o number of positive usages;
o number of negative usages.
In Table 3.5, we see an example of how the incremental information was kept. These results will be discussed in detail in Chapter 7 Discussion.
Term          Occurrences           Percentages
Communicate   Overall          54   3.5% of threads including the term
              Foreign word     -
              Linguistic use   3
              Irrelevant       40
              Positive         9    82% positive usages
              Negative         2    18% negative usages
Table 3.5: Results from Duolingo Discussion forums on term: Communicate
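The positive and negative percentages in Table 3.5 appear to be calculated over the evaluative usages only (positive plus negative), rather than over all units. The short sketch below reproduces them on that assumption, treating the dash in the foreign-word row as zero; both readings are inferences from the table rather than statements from the text.

```python
# Counts for the term "Communicate" as given in Table 3.5.
counts = {
    "foreign": 0,       # shown as "-" in the table; assumed to mean zero
    "linguistic": 3,
    "irrelevant": 40,
    "positive": 9,
    "negative": 2,
}

# The category counts should add up to the overall number of units.
overall = sum(counts.values())

# Assumed base for the percentages: evaluative usages only.
evaluative = counts["positive"] + counts["negative"]
positive_pct = round(100 * counts["positive"] / evaluative)
negative_pct = round(100 * counts["negative"] / evaluative)

print(overall, positive_pct, negative_pct)
```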