
3.3 Modeling users’ timelines with LDA

3.3.1 How to create a corpus?

First, a list of documents is needed, one document per file in a directory. It can be obtained with the following code:

Example 3.9: read data

    import os

    def read_data(path):
        # Read every file in the directory `path` into a list of documents.
        docs = []
        for file_name in os.listdir(path):
            with open(os.path.join(path, file_name)) as f:
                # Python 2: decode to unicode, replacing undecodable bytes.
                docs.append(unicode(f.read(), errors='replace'))
        return docs

Data Cleaning

Data cleaning is the most important step in natural language processing. In the case of Twitter, this task is even more difficult because Tweets are short and cover very diverse topics. First of all, all words have been converted to lowercase. Then, each document has been split into words. After that, all numbers have been removed, but words containing numbers have been kept. Next, words of three characters or fewer have been removed, because short words like “to”, “I”, “me”, etc. do not help the algorithm. Next, stopwords have been removed using the nltk stopwords list. After that, words have been lemmatized, that is, each word has been converted into its lemma, or dictionary form. For instance, “are” is transformed into “be” and “cats” into “cat”. Furthermore, bigrams have also been taken into account. Some words, like New York or San Francisco, always appear together. Using bigrams we can detect such pairs and treat them as a single token. In the preprocessing function below, we find the bigrams and then add them to the original data, because we would like to keep the words “machine” and “learning” as well as the bigram “machine learning”. Computing n-grams of a large dataset can be computationally and memory expensive.

Figure 3.4: Class Diagram of LDAModel
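The filtering steps described above can be tried on a single toy document before reading the full preprocessing function. The following is only a minimal sketch that uses the standard library: the tiny stopword set, the function name `clean_tokens` and the sample sentence are assumptions made for illustration, and lemmatization and bigrams are left out.

```python
import re

# Hypothetical stand-in for the nltk English stopword list (assumption).
TOY_STOPWORDS = {"this", "that", "with", "have", "from"}

def clean_tokens(text, stopwords=TOY_STOPWORDS):
    """Lowercase, tokenize and filter one document."""
    tokens = re.findall(r"\w+", text.lower())
    # Drop pure numbers but keep words that contain digits.
    tokens = [t for t in tokens if not t.isnumeric()]
    # Drop words of three characters or fewer.
    tokens = [t for t in tokens if len(t) > 3]
    # Drop stopwords.
    return [t for t in tokens if t not in stopwords]

tweet = "I have 2 tickets for the Python3 meetup with friends in 2019"
print(clean_tokens(tweet))  # ['tickets', 'python3', 'meetup', 'friends']
```

Note that “python3” survives because it only contains a digit, while “2” and “2019” are dropped as pure numbers.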

This is the code used to preprocess the data:

    import gensim
    from nltk.corpus import stopwords
    from nltk.stem.wordnet import WordNetLemmatizer
    from nltk.tokenize import RegexpTokenizer

    def preprocessing(docs):
        # Split the documents into tokens.
        tokenizer = RegexpTokenizer(r'\w+')
        stops = set(stopwords.words('english'))  # nltk stopwords list
        for idx in range(len(docs)):
            docs[idx] = docs[idx].lower()  # Convert to lowercase.
            docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.
        # Remove numbers, but not words that contain numbers.
        docs = [[token for token in doc if not token.isnumeric()] for doc in docs]
        # Remove words of three characters or fewer.
        docs = [[token for token in doc if len(token) > 3] for doc in docs]
        # Remove stopwords.
        docs = [[token for token in doc if token not in stops] for doc in docs]
        # Lemmatize all words in documents
        # (lemmatize() treats tokens as nouns unless a pos argument is given).
        lemmatizer = WordNetLemmatizer()
        docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]
        # Add bigrams to docs (only ones that appear 20 times or more).
        bigram = gensim.models.Phrases(docs, min_count=20)
        for idx in range(len(docs)):
            for token in bigram[docs[idx]]:
                if '_' in token:
                    # Token is a bigram, add to document.
                    docs[idx].append(token)
        return docs
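To make the bigram step more concrete, the idea of keeping the original words and appending frequent pairs can be sketched with the standard library alone. This is a deliberate simplification and not how gensim works internally: `Phrases` also applies a score threshold on top of `min_count`, whereas the hypothetical helper `add_frequent_bigrams` below relies on raw pair counts only.

```python
from collections import Counter

def add_frequent_bigrams(docs, min_count=2):
    """Append 'w1_w2' tokens for adjacent pairs seen at least min_count times."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))
    result = []
    for doc in docs:
        bigrams = [f"{a}_{b}" for a, b in zip(doc, doc[1:])
                   if pair_counts[(a, b)] >= min_count]
        # Keep the original unigrams and add the bigrams after them.
        result.append(doc + bigrams)
    return result

docs = [
    ["machine", "learning", "rocks"],
    ["love", "machine", "learning"],
    ["learning", "machine"],
]
for doc in add_frequent_bigrams(docs):
    print(doc)
```

Only the pair (“machine”, “learning”) occurs twice, so “machine_learning” is appended to the first two documents and the third is left unchanged. One caveat about the listing above: the pattern `\w+` also matches underscores, so a token that already contains `_` (for example a user name) passes the `'_' in token` test and is appended a second time.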

Dictionary and corpus

Once we have the data preprocessed, creating the dictionary and the corpus is straightforward thanks to Gensim.

Example 3.11: get dictionary

    from gensim import corpora

    def get_dictionary(texts):
        # turn our tokenized documents into an id <-> term dictionary
        dictionary = corpora.Dictionary(texts)
        # dictionary.filter_extremes(no_below=20, no_above=0.5)
        dictionary.filter_n_most_frequent(5)
        return dictionary
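What such a dictionary holds can be illustrated without Gensim. The sketch below is an approximation under stated assumptions: the function `build_dictionary` is hypothetical, it ranks tokens by the number of documents they occur in, and its id assignment and tie-breaking are chosen for determinism and need not match what Gensim does.

```python
from collections import Counter

def build_dictionary(texts, remove_n=0):
    """Map each token to an integer id, dropping the remove_n tokens
    that occur in the most documents."""
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(text))
    # Sort by descending frequency, then alphabetically for determinism.
    ranked = sorted(doc_freq, key=lambda tok: (-doc_freq[tok], tok))
    kept = sorted(ranked[remove_n:])
    return {token: idx for idx, token in enumerate(kept)}

texts = [["cat", "dog", "fish"], ["cat", "dog"], ["cat", "bird"]]
print(build_dictionary(texts, remove_n=1))  # {'bird': 0, 'dog': 1, 'fish': 2}
```

Here “cat” appears in all three documents, so it is the one token removed. Very frequent tokens of this kind tend to carry little information for telling topics apart.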

Example 3.12: get corpus

    def get_corpus(dictionary, texts):
        # convert tokenized documents into a document-term matrix
        users_corpora = [dictionary.doc2bow(text) for text in texts]
        return users_corpora
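Each entry of the resulting corpus is a bag of words: a list of (token id, count) pairs for one document. A minimal pure-Python sketch of that representation follows. The helper `to_bag_of_words` and the small `token2id` mapping are assumptions for illustration, and, like `doc2bow` with its default settings, the sketch ignores tokens that are not in the dictionary.

```python
from collections import Counter

def to_bag_of_words(token2id, text):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

token2id = {"bird": 0, "dog": 1, "fish": 2}
corpus = [to_bag_of_words(token2id, text)
          for text in [["dog", "fish", "dog"], ["cat", "bird"]]]
print(corpus)  # [[(1, 2), (2, 1)], [(0, 1)]]
```

Word order is discarded and only the counts remain, which is the input format LDA works with.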

Figure 3.5 shows the class Dictionary, which is a subclass of Mapping. Mapping, in turn, is a subclass of Container, Iterable and Sized. This is possible because Python, in contrast to Java, supports multiple inheritance. The module _abcoll (Python 2) defines the abstract base classes for collections.

Figure 3.5: Class Diagram of Dictionary
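The inheritance relationship can be checked with a small mapping class. This is a sketch written for Python 3, where the abstract base classes are available from `collections.abc`. `TinyDictionary` is a hypothetical class made up for illustration and is not Gensim's Dictionary.

```python
from collections.abc import Container, Iterable, Mapping, Sized

class TinyDictionary(Mapping):
    """Hypothetical read-only id -> token mapping (not gensim's class)."""

    def __init__(self, tokens):
        self._id2token = dict(enumerate(tokens))

    # Mapping requires these three methods ...
    def __getitem__(self, token_id):
        return self._id2token[token_id]

    def __iter__(self):
        return iter(self._id2token)

    def __len__(self):
        return len(self._id2token)

d = TinyDictionary(["bird", "dog"])
# ... and supplies the rest (keys, items, get, __contains__) as mixins.
print(d[1], 0 in d, list(d.items()))
print(isinstance(d, Container), isinstance(d, Iterable), isinstance(d, Sized))
```

Because Mapping derives from all three base classes, every `isinstance` check in the last line is true, even though `TinyDictionary` names only Mapping as its parent.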

In document Topic modeling for analysing similarity between users in Twitter (Page 32-35)