2.2 Data Mining in Social Media Environments
2.3.6 Data Mining and Machine Learning Tools
R[81] is a highly capable open-source, multi-paradigm programming language developed specifically for Statistical Computing and Data Mining. It is heavily influenced by mathematics and shares some similarities with MATLAB, given that it is largely based on matrix arithmetic. R accomplishes this through developer-friendly data structures such as vectors, matrices and arrays, the most popular being the DataFrame, a high-level data structure used to represent data tables. R is not, however, strictly a matrix arithmetic tool (although its performance is similar to that of MATLAB or Octave), as it is highly extensible through CRAN, a repository of user-created packages. It is thus possible to extend R’s Data Mining capabilities by importing additional ML algorithms, using different plotting libraries, or even adding a Web Framework to the project, among other options. Moreover, because R provides a large number of matrix-based operations (which, once again, can be extended with CRAN packages), it can be used for the whole pipeline of the KDD process, from the preprocessing to the postprocessing stage[82].
weka
The Waikato Environment for Knowledge Analysis (WEKA) is open-source software developed at the University of Waikato with the goal of expediting research in DM and ML by unifying algorithms and knowledge analysis tools in a single suite[83]. It also addresses the need to create new algorithms for data manipulation and model evaluation without resorting to different infrastructures.
The WEKA suite is accessible through a Graphical User Interface (GUI) and a Java API. The user interface offers three different views: the Explorer, the Knowledge Flow and the Experimenter. The Explorer view is divided into several sections that deal with different kinds of tasks. The first is called Preprocess and, as the name implies, handles the preprocessing stage of KDD pipelines. It allows loading data from different sources and formats or generating it synthetically, and provides the tools for data transformation and filtering. The second is the Classify section, which handles the application of supervised learning algorithms to selected data and offers the tools for model validation and visualization. The third is the Cluster section, which follows the same principle but applies unsupervised learning algorithms from the family of clustering algorithms, and the fourth, Associate, handles association rule methods. WEKA’s focus is mainly on supervised learning, namely Classification and Regression algorithms. It has less support for unsupervised learning tasks[83]. Besides the capabilities mentioned, the Explorer view also offers dedicated methods for attribute selection and data visualization, through the Select Attributes and Visualize sections.
The second view, Knowledge Flow, presents a more visual setup of the data processing pipeline. It borrows the concept of data flow diagrams to let users drag, drop and connect nodes that represent most of the functionality presented in the Explorer view. This makes it possible to exploit the advantages of incremental algorithms, i.e. algorithms that do not need to load complete data sets into memory in order to process them.
The last view in the WEKA GUI is the Experimenter view, which eases experimentation and performance comparison of different models on different data sets and allows the computational effort to be distributed across different machines.
The data processing capabilities described so far can also be achieved through the extensive Java API, whose style mirrors the workflows of the visual interface and makes translating them into code an accessible task.
python data science stack
Python is a mature, open-source, multi-paradigm, general-purpose language. Over the years it has gained considerable popularity in the data science community. Given its high-level, interactive nature, Python was an attractive target for the development of scientific libraries. This led to the creation of a set of libraries that are widely used for algorithmic development and data analysis.
NumPy[84] is a low-level library that adds support for n-dimensional arrays, as well as a set of functions to operate on them, thus allowing Python to be used as a matrix arithmetic tool. For more advanced operations, SciPy[85] can be used. It adds support for much more complex operations, such as image or signal manipulation and interpolation operations.
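As an illustration of NumPy used as a matrix arithmetic tool, the following minimal sketch builds two small arrays and applies a few common operations to them. The variable names and values are chosen purely for illustration and are not taken from this work.

```python
import numpy as np

# Two small matrices stored as n-dimensional arrays (illustrative values).
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[0.0, 1.0], [1.0, 0.0]])

# Element-by-element product: each entry of a is multiplied by the matching entry of b.
elementwise = a * b

# Matrix product, as in linear algebra.
product = a @ b

# Mean of each column (axis=0 aggregates over the rows).
column_means = a.mean(axis=0)

# Multiplying a matrix by its inverse should give (approximately) the identity.
identity_check = a @ np.linalg.inv(a)

print(product)
print(column_means)
```

The distinction between `*` and `@` is worth noting: the former operates element by element, whereas the latter performs the matrix product that a MATLAB user would expect from `*`.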
The previously enumerated tools pose the inconvenience of being rather low-level. To fill this gap there is Pandas[86]. The main feature of this library is the DataFrame object, which acts mainly as a wrapper around NumPy arrays. It is a two-dimensional tabular data structure that adds slicing, grouping and reshaping operations.
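The slicing, grouping and reshaping operations mentioned above can be sketched as follows. The table of posts, its column names and its values are hypothetical and serve only to show the shape of each operation.

```python
import pandas as pd

# Hypothetical table of social media posts (illustrative data only).
posts = pd.DataFrame({
    "user": ["ana", "rui", "ana", "rui"],
    "day": ["mon", "mon", "tue", "tue"],
    "likes": [3, 5, 2, 7],
})

# Slicing: keep the rows with more than 2 likes and only two of the columns.
popular = posts.loc[posts["likes"] > 2, ["user", "likes"]]

# Grouping: total number of likes per user.
likes_per_user = posts.groupby("user")["likes"].sum()

# Reshaping: one row per user and one column per day.
wide = posts.pivot(index="user", columns="day", values="likes")

# The underlying values remain available as a NumPy array.
likes_array = posts["likes"].to_numpy()

print(likes_per_user)
print(wide)
```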
With the main issues of data manipulation addressed, the Data Mining component is still missing, which is where meaningful patterns and useful knowledge are exposed. The most relevant option in the Python ecosystem is Scikit-Learn[87]. This library provides implementations of most of the widely used Machine Learning algorithms while keeping a consistent interface between them, which makes experimentation quite accessible.
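The consistent interface can be illustrated with a minimal sketch in which three unrelated classifiers are trained and evaluated by exactly the same loop. The choice of data set (the Iris data bundled with the library), of classifiers and of parameters is an assumption made for illustration and does not reflect the models used in this work.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Small data set shipped with scikit-learn (no download needed).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Three different algorithms, all exposing the same fit/score methods.
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
    "logistic regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                   # train on the training split
    scores[name] = model.score(X_test, y_test)    # accuracy on the held-out split
    print(f"{name}: {scores[name]:.2f}")
```

Because every estimator follows the same `fit` and `score` convention, swapping one algorithm for another only requires changing an entry in the dictionary.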
One of the main criticisms of the Python language is its low performance. However, the implementations of all the previously described libraries are extended with C bindings. These bindings are seamlessly accessed from Python, which brings the performance of these libraries close to that of compiled languages[87].
Python also has an extensive library dedicated to Natural Language Processing (NLP) called Natural Language Toolkit (NLTK). It offers a broad Python API to deal with human-language text, along with a collection of lexical resources and corpora for Text Mining tasks[88]. It is worth noting that NLTK has dedicated resources to deal with the Portuguese language.
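A minimal sketch of a Text Mining preprocessing step with NLTK is given below. It assumes that only components requiring no additional corpus download are used, namely a regular-expression tokenizer, a frequency distribution and the Snowball stemmer for Portuguese. The example sentence is invented for illustration.

```python
from nltk import FreqDist
from nltk.stem import SnowballStemmer
from nltk.tokenize import wordpunct_tokenize

# Invented Portuguese sentence (illustrative only).
text = "Os utilizadores partilham notícias e os amigos comentam as notícias."

# Split into word and punctuation tokens, keep alphabetic tokens, lowercase them.
tokens = [t.lower() for t in wordpunct_tokenize(text) if t.isalpha()]

# Count how often each token occurs.
freq = FreqDist(tokens)

# Reduce each token to its stem with the Portuguese Snowball stemmer.
stemmer = SnowballStemmer("portuguese")
stems = [stemmer.stem(t) for t in tokens]

print(freq.most_common(2))
print(stems)
```

Other resources, such as stop word lists or tagged corpora, have to be fetched separately with `nltk.download` before they can be used.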
Chapter 3
System Description and Architecture
This chapter introduces the requirements necessary to fulfil this dissertation’s purpose and describes how its development was conceptualized.