Ebook: Data visualization tools for users (English)

(1)

Tools for data visualization

Five data visualization tools

Get the benefit from data with four webinars

(2)

Data Science stands today as a multidisciplinary profession. The following is intended to be a basic guide to some useful resources available for each facet of these professionals' work.

(3)

Data Science stands today as a multidisciplinary profession, in which knowledge from various areas overlaps in a profile more typical of the Renaissance than of this super-specialized 21st century.

TOOLS AND

LANGUAGES

• SQL

• SQLite

• sqlite3

• RSQLite

• Toad

• TOra

• RapidMiner

• KNIME

• Pentaho

• RODBC

• RJDBC

• pyODBC

• mxODBC

• SQLAlchemy

• pandas

• data.table

• XML

• jsonlite

• json

Given the scarcity of formal training in this field, data scientists are forced to gather scattered knowledge and tools to develop their skills as well as possible.

The following is intended to be a basic guide, obviously not exhaustive, to some useful resources available for each facet of these professionals' work.

(4)

Data management

Part of the work of the data scientist is to capture, clean up and store information in a format suitable for processing and analysis.

The most usual scenario is to access a copy of the data source for a one-time or periodic capture. You will need to know SQL to access the data stored in relational databases. Each database has a console to execute SQL queries, although most people prefer to use a graphical environment with information about tables, fields and indexes. Some of the most popular data management tools are Toad, proprietary software for Microsoft's platform, and TOra, which is open-source and cross-platform. Once the data is extracted we can store it in plain text files, which we then upload to our working environment for machine learning or load into a tool such as SQLite.

(5)

SQLite is a lightweight relational database with no external dependencies that does not need to be installed on a server. Moving a database is as easy as copying a single file. In our case, we can process the information without concurrent or multi-user access to the source data, which perfectly suits the characteristics of SQLite. The languages we use for our algorithms have connectivity to SQLite (Python through sqlite3, and R through RSQLite), so we can choose to import the data before preprocessing or to do part of the preprocessing in the database itself, which will spare us more than one problem once the number of records grows.
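As a rough illustration of doing part of the preprocessing inside SQLite, the following sketch uses Python's standard sqlite3 and csv modules. The table name, columns and sample rows are hypothetical and only serve the example; a real extract would come from a plain text file on disk.

```python
import csv
import io
import sqlite3

# Hypothetical plain-text extract; in practice this would be a file on disk.
RAW_CSV = """customer_id,country,amount
1,ES,120.50
2,ES,
3,MX,80.00
4,MX,45.25
"""


def load_into_sqlite(csv_text, db_path=":memory:"):
    """Load CSV text into a SQLite table and return the open connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE purchases (customer_id INTEGER, country TEXT, amount REAL)"
    )
    rows = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        # Store empty amounts as NULL so the database can filter them out.
        amount = float(rec["amount"]) if rec["amount"] else None
        rows.append((int(rec["customer_id"]), rec["country"], amount))
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


conn = load_into_sqlite(RAW_CSV)

# Part of the preprocessing (dropping incomplete records) runs in the database.
clean = conn.execute(
    "SELECT customer_id, country, amount FROM purchases "
    "WHERE amount IS NOT NULL ORDER BY customer_id"
).fetchall()
print(clean)
```

Passing a file path instead of ":memory:" would keep the whole database in a single file that can be copied between machines.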

Another alternative to bulk data capture is to use a tool that covers the full ETL cycle (Extraction, Transformation and Load), e.g. RapidMiner, KNIME or Pentaho. With them, we can graphically define the data acquisition and cleansing cycles using connectors.

Once we have guaranteed access to the data source during preprocessing, we can use an ODBC or JDBC connection (RODBC and RJDBC in R, and pyODBC, mxODBC and SQLAlchemy in Python) and benefit from performing joins (JOIN) and aggregations (GROUP BY) in the database engine, subsequently importing only the results.
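A minimal sketch of pushing the JOIN and GROUP BY down to the database engine and importing only the aggregated result. An in-memory SQLite database stands in for the real source here; with an ODBC driver the connection object would differ, but the SQL would look much the same. The tables and values are invented for illustration.

```python
import sqlite3

# In-memory SQLite is only a stand-in for the real database connection.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'ES'), (2, 'ES'), (3, 'MX');
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 20.0), (3, 7.5);
    """
)

# The engine joins and aggregates; only one row per country is imported.
QUERY = """
SELECT c.country, COUNT(*) AS n_orders, SUM(o.amount) AS total
FROM orders AS o
JOIN customers AS c ON c.id = o.customer_id
GROUP BY c.country
ORDER BY c.country
"""
summary = conn.execute(QUERY).fetchall()
print(summary)
```

The design point is that four order rows are reduced to two summary rows before they ever leave the database, which matters far more when the source tables hold millions of records.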

For external processing, pandas (a Python library) and data.table (an R package) are our first choice. data.table lets us circumvent one of R's weaknesses, memory management, by performing vectorized operations and grouping by reference without having to duplicate objects temporarily.

(6)

A third scenario would be to access information generated in real time and transmitted in formats like XML or JSON. These are called incremental learning projects, and among them we find recommendation systems, online advertising and high-frequency trading.

For this we will use tools like XML or jsonlite (R packages), or xml and json (Python modules). With them we will capture the stream, make our predictions, send them back in the same format, and update our model once the source system later provides us with the results actually observed.
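The capture, predict, reply and update loop described above can be sketched with Python's json module. The message fields ("id", "features", "observed") and the tiny linear model trained by stochastic gradient descent are assumptions made for this example, not a prescribed protocol or algorithm.

```python
import json


class OnlineLinearModel:
    """Tiny linear regressor updated one observation at a time (SGD)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        return self.bias + sum(w * x for w, x in zip(self.weights, features))

    def update(self, features, observed):
        # One gradient step on the squared error of this single observation.
        error = self.predict(features) - observed
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        self.bias -= self.learning_rate * error


def handle_request(model, message):
    """Parse an incoming JSON request and answer in the same format."""
    request = json.loads(message)
    prediction = model.predict(request["features"])
    return json.dumps({"id": request["id"], "prediction": prediction})


def handle_feedback(model, message):
    """Update the model once the source system reports the observed result."""
    feedback = json.loads(message)
    model.update(feedback["features"], feedback["observed"])


model = OnlineLinearModel(n_features=2)
reply = handle_request(model, '{"id": 1, "features": [1.0, 2.0]}')
handle_feedback(model, '{"id": 1, "features": [1.0, 2.0], "observed": 3.0}')
reply_after = handle_request(model, '{"id": 2, "features": [1.0, 2.0]}')
print(reply, reply_after)
```

After a single feedback message the prediction for the same features moves from 0 toward the observed value, which is the essence of incremental learning: the model is never retrained from scratch.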

(7)

Data analysis

Even though the Business Intelligence, Data Warehousing and Machine Learning fields are all part of Data Science, the latter is the one that requires the greatest number of specific utilities. Hence, our toolbox will need to include R and Python, the programming languages most widely used in machine learning.

For Python we highlight the scikit-learn suite, which covers almost all techniques, except perhaps neural networks. For these we have several interesting alternatives, such as Caffe and Pylearn2. The latter is based on Theano, a Python library that allows symbolic definitions and transparent use of GPUs.

(8)

For specific techniques in R we highlight:

• Gradient boosting: gbm and xgboost.

• Random forests for classification and regression: randomForest and randomForestSRC.

• Support vector machines: e1071, LiblineaR and kernlab.

• Regularized regression (Ridge, Lasso and ElasticNet): glmnet.

• Generalized additive models: gam.

• Clustering: cluster.

There are also some general purpose tools that will make our life easier in R:

• data.table: Fast reading of text files; creation, modification and deletion of columns by reference; joins by a common key or group, and summary of data.

• foreach: Execution of parallel processes against a previously defined backend with utilities such as doMC or doParallel.

• bigmemory: Manage massive matrices in R and share information across multiple sessions or parallel analyses.

• caret: Compare models, control data partitions (splitting, bootstrapping, subsampling) and tune parameters (grid search).

• Matrix: Manage sparse matrices and transform categorical variables to binary (one-hot encoding) using the sparse.model.matrix function.

If we need to change any R package we will need C++ and some utilities that allow us to rebuild it: Rtools, an environment for creating R packages under Windows, and devtools, which facilitates all processes related to development.
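To make the idea of one-hot encoding into a sparse matrix concrete, here is a small pure-Python sketch. It is not a reproduction of R's sparse.model.matrix (which, among other things, handles reference levels); the function names and the sample data are hypothetical.

```python
def one_hot_sparse(values):
    """Encode a categorical column as sparse one-hot (row, column) coordinates."""
    categories = sorted(set(values))
    column_of = {cat: idx for idx, cat in enumerate(categories)}
    # Only the positions of the 1s are stored; every other cell is implicitly 0.
    nonzero = [(row, column_of[val]) for row, val in enumerate(values)]
    return categories, nonzero


def to_dense(nonzero, n_rows, n_cols):
    """Expand the sparse coordinates into a full 0/1 matrix (for inspection)."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col in nonzero:
        dense[row][col] = 1
    return dense


colors = ["red", "green", "red", "blue"]
categories, nonzero = one_hot_sparse(colors)
dense = to_dense(nonzero, len(colors), len(categories))
print(categories, nonzero, dense)
```

The sparse form stores one coordinate pair per row regardless of how many categories exist, whereas the dense form grows with the number of categories, which is why sparse storage pays off for high-cardinality variables.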

(9)

Distributed environments deserve a special mention. If we have dealt with data from a large institution or company, we will probably have experience working with the so-called Hadoop ecosystem. Hadoop combines a distributed file system (HDFS) with a processing model (MapReduce) that allows information to be processed in parallel.

Among the machine learning tools compatible with Hadoop we find:

• Vowpal Wabbit: Online learning methods based on gradient descent.

• Mahout: A suite of algorithms, including recommendation systems, clustering, logistic regression and random forests.

• h2o: Perhaps the fastest-growing tool, with a large number of parallelizable algorithms. It can be run from a graphical environment or from R or Python.
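The MapReduce model mentioned for Hadoop can be sketched in a single process as a word count. This only illustrates the map, shuffle and reduce phases; a real Hadoop job would distribute these phases across nodes and read its input from HDFS. All function names are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    # Each document could be mapped on a different node; here we loop locally.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = map_reduce(["big data big value", "data tools"])
print(counts)
```

Because each map call depends on one document only and each reduce call on one key only, both phases can be parallelized freely, which is what makes the model suitable for a cluster.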

The data scientist should also keep abreast of the ongoing generational shift from Hadoop to Spark. Spark has several advantages over Hadoop for processing information and running algorithms. The main one is speed: it can be up to 100 times faster because, unlike Hadoop MapReduce, it works in memory and only writes to disk when necessary.

(10)

Spark can run independently or coexist as a component of Hadoop, which allows the migration to be planned gradually. You can, for example, use HBase as a database, although Cassandra is emerging as a storage solution thanks to its redundancy and scalability.


(11)

Finally, a brief reference to the presentation of results.

The most popular tools are clearly lattice and ggplot2 for R, and Matplotlib for Python. But if we need professional presentations embedded in web environments, the best choice is certainly D3.js.

Among the integrated Business Intelligence environments with a clear focus on presentation we should highlight the well-known Tableau and, as alternatives for graphical exploration of data, Birst and Necto.

(12)

Five data visualization tools

We present some of the best data visualization tools that you can use in your business to take full advantage of the large amount of information created every day in the digital world.

(13)

The digital universe is reaching new thresholds. The amount of data generated by both private users and companies is growing at a rapid pace. In fact, according to a study by IDC and EMC, the world of digital data is doubling in size every two years, and by 2020 it will have generated 44 zettabytes of information, that is, 44 trillion gigabytes of structured and unstructured data.
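The unit conversion and the doubling rule quoted from the study can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 zettabyte = 10^21 bytes, 1 gigabyte = 10^9 bytes) and the short-scale trillion (10^12); the function names are illustrative.

```python
ZETTABYTE = 10**21  # bytes, decimal (SI) definition
GIGABYTE = 10**9    # bytes, decimal (SI) definition
TRILLION = 10**12   # short scale


def zettabytes_to_gigabytes(zettabytes):
    """Convert zettabytes to gigabytes using decimal units."""
    return zettabytes * ZETTABYTE // GIGABYTE


def projected_size(start_size, years, doubling_period_years=2):
    """Size after `years` if the volume doubles every `doubling_period_years`."""
    return start_size * 2 ** (years / doubling_period_years)


gigabytes_2020 = zettabytes_to_gigabytes(44)
print(gigabytes_2020 // TRILLION, "trillion gigabytes")
print(projected_size(1, 4), "times the initial size after four years")
```

Doubling every two years means a fourfold increase in four years and roughly a thirty-two-fold increase in a decade.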

Creating or visiting a website, participating in a blog, gaining followers, posting comments, sending a tweet or just surfing the internet produces a whole range of data that, if exploited properly, can be of great value for companies.

VISUALIZATION TOOLS INDEX

• Google Fusion Tables

• CartoDB

• Tableau Public

• iCharts

• Smart Data Report

(14)

The big challenge, however, is to make sense of all that data: to capture it, link it, analyze it and extract its true value, so that the information can be presented in an attractive, clear, concise and understandable manner that facilitates decision making in your business. Visually exploring and analyzing customer data can also help you discover new ways to reach customers, segment them better, personalize offers for products or services, and generate innovative ideas, among many other possibilities that can help maintain the engagement between your brand and your users over time.

Where to start

The first steps in data visualization may be intimidating. Fortunately, just as data keeps growing, so does the number of tools that help us get the most out of it. Here we present the five tools that we consider the best, based on the capabilities they provide and the level of experience required.

(15)

Google Fusion Tables

It's an excellent tool for beginners or for those who don't know how to program. For more advanced users there is an API that produces graphics or maps from data. One of the advantages of this application is the diversity of data representations it offers. It also offers a relatively fast way to create graphics and maps, including GIS functions to analyze data by geographic area.

This tool is frequently used by The Guardian to produce detailed maps very quickly.

(16)

CartoDB

This is an open-source service with a friendly interface, aimed at any user regardless of technical level. It lets you create a variety of interactive maps, choosing from a catalog of options (which includes Google Maps) or adding your own customized maps.

The most interesting feature of this tool is that it lets you access Twitter data to see how users react to a brand, a particular marketing campaign or an event. A good example is the tweet-tracking map created last year for the launch of Beyoncé's latest album, which clearly shows the places where the release had the most impact. This is a great source of visual information for marketing professionals and businesses.

It should also be highlighted that it has an active group of developers who provide extensive documentation and examples. In addition, the open nature of its API makes it possible to keep creating new integrations and to extend the capabilities of the tool with new libraries.

(17)

Tableau Public

With Tableau Public you can easily create interactive maps, bar and pie charts, etc. One of its advantages is that, as with Google Fusion Tables, you can import tables from Excel to make your work easier. In a matter of minutes you can generate an interactive graphic, embed it in your website and share it. For example, the news portal GlobalPost used it to create a series of charts about the best countries to do business in Africa.

In the recently released 8.2 version we can also find the new OpenStreetMap tool, which produces very detailed maps from local data such as cafes or shops. Tableau Public is a free tool,

(18)

iCharts

You can get started in the world of data visualization with the service offered by iCharts, which has a free version (Basic) and two premium options (Platinum and Enterprise). With this tool you can create visualizations in just a few steps, importing Excel and Google Drive documents or adding data manually.

This tool lets you share your graphics privately with your collaborators, and edit and update them with new data through its cloud computing service. You can even share them with your clients through emails, newsletters or social networks.

Among the companies using this service we find the prestigious consulting firm IDC, which uses iCharts to provide visual representations of relevant data included in its reports.

(19)

Smart Data Report

Finally, we also recommend Smart Data Report, which is not as powerful as the previous tools but has the advantage of being an affordable data solution for entrepreneurs and small businesses whose workers don't have much spare time.

Among other services, this website offers free data analysis and the option to receive reports by email, without having to create them yourself. Once the service has your report ready, it generates HTML code that you can embed in your corporate website or in your articles.

(20)

Mapping data, visualizing it in geospatial apps and applying machine learning. We put our knowledge into practice with the help of these video tutorials.

(21)

Mapping data

CartoDB explains how to convert location data into knowledge for your business. In this tutorial you can learn how to analyze, visualize and build data apps using the CartoDB tool.

(22)

Machine Learning

Now that summer is around the corner, Andrés González, solutions manager for Big Data and Data Prediction at Clever Task, shows us how to make forecasts from data in a very specific area: the tourism sector.

(23)

Geospatial apps

And if you want to learn to create apps and geospatial data, you can't miss this tutorial –also by CartoDB– explaining how you can make the most of an API –in this case the one opened by BBVA for the

(24)

Good examples of visualization

Finally, to round off this selection, Alberto Cairo, professor of data visualization at the University of Miami, teaches us good practices in data visualization. It's good to learn from our own mistakes and from the

(25)

THIS MIGHT INTEREST YOU

Innovation Edge Big Data: to create business value with data

Emerging Tech: Data visualization beyond the noise

Infographic: the keys of Big Data by DJ Patil

Infographic: Big Data, chronology, present and future

Case study: data visualization with Illustreets and CartoDB

(26)