Chapter 6: Obtaining Data in the Speech Community
6.6 Corroborating Evidence
6.6.1 Twitter™ Posts
Twitter has become one of the most important microblogging sites of recent times and an important source of information on recent events. It is a tool for broadcasting news, expressing opinions, and simply communicating with friends. Short messages, known as taġrīdāt ‘tweets’, are usually informal: they include alternative spellings, slang, neologisms, and links, and mostly ignore punctuation. Each tweet contains the following information:
1. Username – the author of the tweet.
2. Timestamp – time when the tweet was written.
3. Location (optional) – place where the tweet was written.
4. Posting method – how the tweet is published (e.g. iPhone, web, and others).
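The four fields listed above could be recorded for each collected tweet roughly as follows. This is a minimal sketch only: the class name `TweetRecord`, its field names, and the placeholder values are illustrative assumptions, not a format used by Twitter or in this study.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class TweetRecord:
    """One manually collected tweet (hypothetical record layout)."""
    username: str                    # 1. author of the tweet
    timestamp: datetime              # 2. time when the tweet was written
    posting_method: str              # 4. e.g. "iPhone" or "web"
    text: str                        # the message itself
    location: Optional[str] = None   # 3. optional, so it may be missing


# Placeholder values for illustration; this is not a real tweet.
example = TweetRecord(
    username="example_user",
    timestamp=datetime(2014, 1, 1, 12, 0),
    posting_method="web",
    text="placeholder text",
)
```

Keeping the location as an optional field mirrors the fact that not every tweet carries it.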
The type of language used in electronic communication is ‘hybrid, showing both speech-like and writing-like features, as well as features that are unique to the digital medium and are, to some extent, the result of its technological restrictions’ (Deumert 2009: 860).
Before turning to Twitter as corroborating evidence, I searched for an online dialect corpus of colloquial Kuwaiti. I accessed the acclaimed online Arabic corpus ‘arabiCorpus’.23
This corpus contains five main categories or genres: Newspapers, Modern Literature, Nonfiction, Egyptian Colloquial, and Premodern. The whole corpus totals 123,854,642 words. Among the newspapers, I found one Kuwaiti newspaper (Al-Waṭan, 2002 editions)24 which contains 6,454,411 words. However, one drawback is that the language of journalism in Kuwait is Modern Written Arabic. It is thus hard to find spoken Kuwaiti colloquial material in daily newspapers and magazines, although some columnists code-mix, inserting colloquial and slang terms into Arabic sentences. Therefore, I turned to Twitter to collect samples of language use of modern Kuwaiti colloquialisms for this study. However, all information retrieval had to be done manually, unless I paid for the information to be retrieved. For example, a keyword like dašš can be looked up on Twitter in two ways: (i) in Arabic orthography, دش, or (ii) in romanised form, since for technical reasons tweets are sometimes written in a romanised Arabic chat alphabet; in that case the romanised dash should be typed in instead of the phonemic dašš. This kind of language is known as ʿArabīzī, a portmanteau word coined from ʿarabī (‘Arabic’) and inglīzī (‘English’) (cf. Holes 2010: 312, 2011: 139, 2013: 283).
23 <http://arabicorpus.byu.edu> 24 <http://alwatan.kuwait.tt>
However, when I typed in dašš in Arabic orthography, the search returned only tweets containing the 3rd person masculine singular perfect form dašš, without any of its derivations. I also received tweets containing the word dišš, meaning ‘a satellite dish’ or ‘douche’ (‘shower’), which are English and French borrowings, respectively. As for xalla, I mainly looked it up in Arabic script, but I also had to type the following romanised forms in order to retrieve more examples of use: khala, khalla, 5ala, and 5alla. In this orthography, ‘5’ and ‘kh’ stand for the Arabic voiceless uvular fricative [خ] /x/.
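The romanised search forms mentioned above can be enumerated mechanically. The following is a minimal sketch, not the procedure used in this study, where all searches were typed in by hand. The mapping table and the rule that a geminate may be written single or double (only when its romanisation is a single letter) are assumptions inferred from the forms khala, khalla, 5ala, 5alla, and dash; the function names are hypothetical.

```python
import unicodedata
from itertools import product

# Assumed transcription-to-chat-alphabet mapping; only these two
# correspondences are taken from the forms cited in the text.
ROMANISATIONS = {
    "x": ["kh", "5"],     # voiceless uvular fricative
    "\u0161": ["sh"],     # s with caron, as in the transcription of 'dash'
}


def split_segments(word):
    """Group a transcription into (symbol, is_geminate) pairs."""
    segments = []
    i = 0
    while i < len(word):
        doubled = i + 1 < len(word) and word[i + 1] == word[i]
        segments.append((word[i], doubled))
        i += 2 if doubled else 1
    return segments


def romanised_variants(word):
    """List candidate chat-alphabet spellings of a transcribed word."""
    word = unicodedata.normalize("NFC", word)
    options = []
    for symbol, doubled in split_segments(word):
        choices = []
        for spelling in ROMANISATIONS.get(symbol, [symbol]):
            choices.append(spelling)
            # Assumption: geminates are optionally doubled, but digraphs
            # such as 'sh' are never doubled.
            if doubled and len(spelling) == 1:
                choices.append(spelling + spelling)
        options.append(choices)
    return ["".join(parts) for parts in product(*options)]
```

Under these assumptions, `romanised_variants("xalla")` yields the four forms listed above, and the transcription of dašš yields the single form dash. The rule is deliberately narrow; real chat-alphabet spelling varies more than this sketch allows.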
In general, I followed three preliminary steps in using Twitter: (i) logging into the Twitter page,25 (ii) entering the keyword in the search engine box, and (iii) collecting Twitter posts for analysis. I also used a Twitter partner application called Topsy26 that maintains an index of hundreds of billions of tweets.
I aimed to concentrate solely on the way that some words occur regularly whenever another word is used, i.e. collocations, or, as described by Firth (1957: 14), ‘actual words in habitual company’. This is because lexical items tend to co-occur more frequently in natural language use than ‘syntax and semantics alone would dictate’ (Krishnamurthy 2009: 97). Consequently, I mainly used Twitter to extract examples of present-day collocations and to see whether it is the case that the more established the meaning, the more reliant it is on collocation. I also aimed to look at the versatility of each of the verbs in terms of their ability to collocate with a variety of different expressions.
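Once tweets have been collected, the words that habitually accompany a verb can be tallied with a simple window count. The sketch below is illustrative only: the function name `collocates`, whitespace tokenisation, and the window size are assumptions, and no such script was part of the manual procedure described here.

```python
from collections import Counter


def collocates(tweets, node_forms, window=2):
    """Count the words occurring within `window` tokens of any node form.

    tweets     -- iterable of tweet texts (strings)
    node_forms -- set of spellings of the verb, e.g. {"dash"}
    window     -- number of tokens inspected on each side of the node
    """
    counts = Counter()
    for tweet in tweets:
        tokens = tweet.lower().split()   # naive whitespace tokenisation
        for i, token in enumerate(tokens):
            if token in node_forms:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts
```

A call such as `collocates(sample, {"khala", "khalla", "5ala", "5alla"}).most_common(10)` would list the ten most frequent neighbours of xalla in a list of collected tweets named `sample` (a hypothetical variable). Raw counts of this kind say nothing about statistical association, so they are only a first step towards identifying collocations.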
25 <http://twitter.com> 26 <topsy.com/tweets>
I selected Twitter for three reasons. First, it is an excellent source of data from urban, educated, literate Kuwaitis, and it is indicative of a new domain of use, although it happens to be written. Twitter is a hybrid mode of expression because its users pay close attention to economy of effort (140 characters per tweet), in contrast to talk, which may contain a lot of repetition and pauses. Twitter is thus an innovative means of collecting colloquial data, but it should be recognised that this is a colloquial variety that appears in a new form and a new dimension.27 Crystal (2001: 67) has highlighted the ‘strong, creative spirit’ that characterises the language of Internet users: ‘The rate at which they have been coining terms and introducing playful variations into established ones has no parallel in contemporary language use’. Second, I could not guarantee that the four verbs under study would appear in my other sources; my audio recordings and TV shows do not necessarily contain simple examples of dašš, xalla, miša, or i a . Third, the Arabic corpora available on the Internet do not contain spoken data from colloquial Kuwaiti. Twitter posts are treated here as confirmatory data, and I believe the Twitter material can be used to answer the following questions: What is the capacity of the target verbs for collocation (the extent to which the four verbs are polysemous)? How robust (frequency) and persistent (stability) are the different senses of the four verbs?
By and large, Twitter is a dynamic corpus: it is constantly growing, in contrast to a stable corpus, which does not necessarily change in size (cf. Hanks 2013: 32). According to Baker et al. (2006: 64), ‘[d]ynamic corpora are useful in that they provide the means to monitor language change over time’. In this respect, I mainly emphasise the value of Twitter data for the study of meaning, but I do not entirely reject introspective data. In many areas of semantics and pragmatics, ‘intuitions are strong and stable, across all native speakers, whether linguistically naïve or trained, and must be given the status of data’ (Stubbs 2002: 71). Despite the value of Twitter data, I cannot vouch for its accuracy.28
27 See, for example, the presentation given by Jack Grieve et al. 2014. ‘Big Data Dialectology: Analyzing Lexical Spread in a Multi-Billion Word Corpus of American English’ (Northern Arizona University: American Association for Corpus Linguistics).
28 Also, it is difficult to discern the correct pronunciation of the tweets because the Arabic script is