
ISSN(Online): 2320-9801

ISSN (Print): 2320-9798

International Journal of Innovative Research in Computer and Communication Engineering

(An ISO 3297: 2007 Certified Organization)

Vol. 4, Issue 9, September 2016

A Study on Data Science Methodologies, People and Skill Set

Kaustav Ghosh, Asoke Nath

Former Student, Department of Computer Science, St. Xavier’s College (Autonomous), Kolkata, India

Associate Professor, Department of Computer Science, St. Xavier’s College (Autonomous), Kolkata, India

ABSTRACT: Data science refers to an emerging area of work concerned with the collection, analysis, visualization, management and preservation of large volumes of information. Although the name seems to connect most strongly with areas such as databases and computer science, many different kinds of skills, including non-mathematical ones, are also required. Data science operates on big data, and present-day organizations depend on it for their economic growth. It is valuable in more ways than one might expect. Because the concept is very new, it requires a clear understanding of the umbrella term ‘data science’ as well as of how to build a data science team, since data science is not a one-person job. A data science team consists of people with complementary skill sets, which are essential for solving complex business and data science problems. In the present paper the authors give a brief overview of the scope and challenges of data science in the 21st century.

KEYWORDS: Hadoop, HBase, MongoDB, IBCF, fault-tolerant, R, Python, JavaScript, PHP, Perl, MATLAB, XML, REST, SQL, Cassandra, PCA technique, data analyst, Singular Value Decomposition (SVD), Microsoft Excel, Statistical Analysis System (SAS), data silo.

I. INTRODUCTION

In today’s world, data is available in vast amounts from a wide range of sectors. Companies in almost every industry are focused on exploiting data to gain competitive advantage. In the past, companies employed teams of statisticians, modelers, and analysts to explore datasets manually. Present-day data volume and variety far outpace the capacity of manual analysis. More powerful computers are now available, networking has become ubiquitous, and efficient algorithms have been developed that can connect datasets to enable broader and deeper analyses than were previously possible. All of this has led to increased use of data science and data mining techniques in business applications. The widest applications of data science are in marketing, particularly tasks like targeted marketing, online advertising and recommendations for cross-selling. It is also widely applied in general customer relationship management, which analyzes customer behavior in order to manage attrition and maximize expected customer value. The financial industry uses data science for credit scoring and trading, and in operations via fraud detection and workforce management. Big retailers like Walmart and Amazon apply data science and data mining throughout their businesses, from marketing to supply-chain management.

Data science is at the center of many such large projects and applications, as it involves principles, processes, and techniques for understanding phenomena through automated data analysis. Data-driven decision-making, that is, making decisions based on the analysis of data rather than on pure intuition, has several advantages, and such decisions are more often than not accurate. A marketer, for example, can select advertisements based on an analysis of how consumers react and respond to different advertisements. Big data technologies like Hadoop, HBase, and MongoDB, which are abundant these days, help cope with the speed and volume of big data and facilitate data engineering and the implementation of data science and data mining techniques.

That said, data science is still a young field; its particular concerns are fairly new and general principles are only beginning to emerge. With theories about several of its aspects and general principles still being formulated, the field remains largely experimental [5].


most important information in their data warehouses. Data mining tools predict future trends and behaviors and allow businesses to make proactive, knowledge-driven decisions. They answer questions that were previously too time-consuming to resolve. They search databases for hidden patterns and find predictive information that experts may miss because it lies outside their expectations. They can be implemented readily and rapidly on existing software and hardware platforms to enhance the value of existing information resources. They can be integrated with new products and systems and can work online. When implemented on high-performance client-server or parallel processing computers, they can analyze massive databases to answer questions like "Which clients are most likely to respond to my next offer or advertisement and why?" [6]

Data warehouse: A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of the management decision-making process. Subject-oriented: The stored data targets specific subjects, such as total sales or number of customers, rather than general data on everyday operations. Integrated: The data may be distributed across heterogeneous sources, such as sales data in a relational database and customer information in flat files, which have to be integrated. Time-variant: The stored data need not be current; it carries an element of time, like sales data for the last 5 years. Non-volatile: The data warehouse must be separate from the enterprise operational database and hence is not subjected to frequent modification. It generally has only two operations performed on it: loading data and accessing data [7].

Machine Learning: Machine learning is a type of artificial intelligence that gives computers the ability to learn without being explicitly programmed. It focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data. The process of machine learning is similar to that of data mining: both search through data to find patterns. The difference is that data mining is carried out by a person, in a specific situation, on a particular data set and with a goal in mind, for that person's own understanding, while machine learning uses data to detect patterns and adjust program actions accordingly. An example is Facebook's “News Feed” feature, which uses machine learning to personalize each member's feed. If a member reads, likes or comments on a particular friend's posts, the News Feed starts to show more of that friend's activity. Facebook uses statistical analysis and predictive analytics to identify patterns in the user's data and uses these patterns to populate the News Feed. If a member stops reading, liking or commenting on a particular friend's posts, the new data is included in the data set and the News Feed adjusts accordingly. Machine learning algorithms are categorized into two types: supervised learning and unsupervised learning [8].

Supervised learning: Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples, each of which is a pair consisting of an input object, typically a vector, and a desired output value. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. In an optimal scenario, a supervised learning algorithm correctly determines the class labels for unseen objects [9].

Supervised learning problems can be further categorized into two types:

Classification: A classification problem is one where the output variable is a category, such as “red” or “blue”, or “disease” or “no disease”.

Regression: A regression problem is one where the output variable is a real value, such as “dollars” or “weight” [12].
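The two kinds of supervised problem can be illustrated with a minimal sketch. The data, the nearest-neighbour rule and the least-squares line below are illustrative assumptions chosen for brevity, not methods prescribed by the paper; the names `labeled_points`, `classify` and `fit_line` are hypothetical.

```python
from math import dist

# Hypothetical labeled training data: (input vector, desired output) pairs.
labeled_points = [
    ((1.0, 1.0), "red"),
    ((1.5, 2.0), "red"),
    ((5.0, 5.0), "blue"),
    ((6.0, 5.5), "blue"),
]

def classify(train, x):
    """Classification: return the label of the training vector closest to x."""
    _, label = min(train, key=lambda pair: dist(pair[0], x))
    return label

def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# The classifier outputs a category; the regression outputs a real value.
predicted_label = classify(labeled_points, (1.2, 1.4))
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted_value = slope * 5 + intercept
```

In both cases the function is inferred from labeled examples and then applied to an input that was not in the training data.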

Unsupervised learning: Unsupervised learning is the machine learning task of inferring a function to describe hidden structure from unlabeled data. As the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. It aims to model the underlying structure or distribution in the data in order to learn more about the data. In unsupervised learning, algorithms are left on their own to discover and represent the interesting structure in the data [12][13]. Unsupervised learning can be categorized into two types:

Clustering: A clustering problem is where we want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.

Association: An association rule learning problem is where we want to discover rules that describe large portions of our data, such as people that buy X also tend to buy Y. This is dealt with in detail in “section 4.C”,


II. ADVANTAGES OF BIG DATA

Big data certainly has disadvantages as far as its collection, storage, authentication and secure management are concerned, and many difficulties have to be overcome to deal with it and extract results from it. However, its benefits are too significant to ignore. Let us look at some of the advantages that big data brings with it:

• Let us start with an instance that is common these days and involves many of us: online shopping. Most of us have been customers of online shopping websites at some point. Today’s consumers are hard to win over. We look around a lot before buying, talk to friends and social network peers about their purchases, expect to be treated as individuals and want to be sincerely thanked for buying products. Big data captures these details so that we can engage in an almost one-on-one, real-time conversation with online shopping sites. This is not a luxury: if we are not treated the way we want to be, we will switch to an alternative in the blink of an eye. In a bank, big data technologies allow the clerk to check a customer’s profile in real time and learn about relevant schemes or services to recommend to that customer. Big data tools help direct call-center representatives to offer specific incentives based on the caller's customer segment, the duration of the relationship with the company, or the type of complaint. Big data also helps unite the digital and physical shopping spheres: a retailer could suggest an offer based on a need the consumer has indicated on social media.

• Big data can help a company understand how others perceive its products so that it can adapt and adjust its marketing strategy accordingly, if need be. Analysis of unstructured social media text allows the company to uncover the sentiments of its customers and even segment them by geographical location or demographic group. Big data makes it possible to test many variations of computer-aided designs very quickly, so that a company can check how minor changes in, say, material affect costs, lead times and performance. The efficiency of the production process can be raised accordingly.

• The success of a company depends not only on how it is run; social and economic factors are crucial as well. Predictive analytics allows newspaper reports and social media feeds to be scanned and analyzed so that the company can continually keep up to speed with the latest developments in the industry and its environment. Big data allows detailed health checks on the company’s suppliers and customers, so that action can be taken when one of them is at risk of defaulting.

• Big data tools allow the entire data landscape across the company to be mapped, so that internal threats can be analyzed. They bring to notice sensitive information that is not appropriately protected and help make sure it is stored according to regulatory requirements. Real-time big data analytics can flag any situation where, say, credit card data (16-digit numbers) is stored or emailed out, so that it can be investigated.

• Big data analytics facilitates personalizing the content and look and feel of a website in real time so that it suits each consumer entering it, depending on their sex, nationality or the route by which they arrived at the site. Common examples include ready-made recommendations such as Amazon’s use of real-time, item-based collaborative filtering (IBCF) to drive its “frequently bought together” and “customers who bought this item also bought” features. LinkedIn’s “people you may know” and “companies you may want to follow” suggestions are further examples. Amazon reportedly generates about 20% more revenue by virtue of this method.

• Big data tools replace impractical, costly averages with measurements. Factories estimate that certain types of equipment are likely to wear out after a certain amount of time, and so tend to replace every piece within that time even if it has much more useful life left. Big data tools access massive amounts of data and use their speed to spot failing grid devices and predict when they will fail. This results in a much more cost-effective replacement strategy and less downtime, since faulty devices are tracked down faster.

• In the past, business users who needed to analyze large volumes of varied data required the help of experts and other professionals, as they themselves lacked the technical skills to do so. Often, by the time they received the requested information, it was obsolete or incorrect. With big data tools, technical teams can now do the groundwork and build algorithms for faster searches. They can develop systems and install interactive and dynamic visualization tools that allow business users to analyze, view and benefit from the data.

• Big data plays an even greater role in health care strategies. When a patient is diagnosed with cancer, they undergo a


that is tailored for his individual genes. With human genome mapping and big data tools, light has been shed on this issue. It will soon be possible for everyone to have their genes mapped as part of their medical record. This will help bridge the gap between medicines and the genetic determinants that cause a disease, so that drugs can be developed that are tailored exclusively to treat those cases.

• Big data storage facilities such as cloud computing offer a way to keep large volumes of data in a fault-tolerant manner at very low cost, with efficient retrieval techniques. A company can have its own private cloud, or a group of companies can share a community cloud or a hybrid cloud, and use it to store large volumes of data or to run a number of applications and IT tools. This reduces the cost of installing new software and hardware as well as of licensing software.

• To cater to the needs of ordinary people, many smart cities have adopted big data tools for the benefit of their citizens and the environment. The city of Oslo in Norway has reduced street lighting energy consumption by 62% with smart solutions. The Memphis Police Department has reduced serious crime by 30% by using predictive software since 2006. The city of Portland, Oregon used technology to optimize the timing of its traffic signals and was able to eliminate more than 157,000 metric tons of CO2 emissions in just six years, equivalent to taking 30,000 passenger vehicles off the roads for an entire year [10].
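The item-based collaborative filtering mentioned in the list above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Amazon's actual system: the purchase histories are invented, similarity is a cosine-style score on co-purchase counts, and the names `purchases`, `item_similarities` and `also_bought` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical purchase histories: customer -> set of items bought.
purchases = {
    "ann": {"book", "lamp", "pen"},
    "bob": {"book", "pen"},
    "cat": {"book", "lamp"},
    "dan": {"pen", "ink"},
}

def item_similarities(purchases):
    """Score each item pair by co-purchases, normalized by item popularity."""
    item_count = defaultdict(int)
    pair_count = defaultdict(int)
    for basket in purchases.values():
        for item in basket:
            item_count[item] += 1
        for a, b in combinations(sorted(basket), 2):
            pair_count[(a, b)] += 1
    sims = defaultdict(dict)
    for (a, b), together in pair_count.items():
        score = together / sqrt(item_count[a] * item_count[b])
        sims[a][b] = score
        sims[b][a] = score
    return sims

def also_bought(sims, item, k=2):
    """Return the k items most similar to `item` (ties broken by name)."""
    ranked = sorted(sims[item].items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:k]]

sims = item_similarities(purchases)
book_suggestions = also_bought(sims, "book")
```

The similarity table is computed once over items rather than over customers, which is what distinguishes the item-based variant: a lookup for one item then needs no per-customer computation.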

III. DATA SCIENCE

Data Science is an umbrella term describing the branch of science that deals with discovering knowledge from huge sets of data, mostly unstructured and semi-structured but also structured, by means of data inference and exploration. It involves using mathematics and algorithms to gain insights from these huge volumes of data and to solve analytically complex business problems. It uncovers essential information that lies hidden in the data. It revolves around rigorous, evidence-based analytical tasks and the ability to build decisions on them. It also requires significant inventiveness and technical expertise to find smart solutions. By learning from data, data science helps companies operate and strategize more intelligently, adding considerable and meaningful value to them.

It is a multidisciplinary craft that combines people, process, computational and big data platforms, application-specific purpose and programmability. These can be described as the 5 P’s of data science, each of which plays a significant part in data science activities.

Purpose: The purpose is the challenge or set of challenges defined by the big data strategy. It can be related to a scientific analysis with a hypothesis or to a business metric that needs to be analyzed based on big data.

People: People refers to the data engineers, data scientists and data managers. These are people with skills ranging from analysis using statistics, machine learning and mathematics to data management, programming, computing and business domain knowledge. In other words, it is a group of researchers with complementary skills.

Process: Once a team is defined and a purpose is established, the team can start on a process, which is essentially iterative. The process of data science includes techniques for statistics, machine learning, programming, data management and computing.

The process remains conceptual in the beginning and defines the set of steps and how everyone can contribute to them. Similar reusable processes may be applicable to many applications with different purposes when employed within different workflows. Data science workflows combine steps in executable graphs. Executing such processes requires access to many datasets, small or big, which brings new opportunities and challenges.

A data science workflow is made up of many steps or tasks, such as data collection, data cleaning, data processing, data analysis and result visualization. Data science processes may require user interaction and other manual operations, or may be fully automated.
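As a rough illustration of such a workflow, the sketch below chains the five steps named above as plain functions. It assumes the simplest possible executable graph, a linear chain, and every function name and data value is hypothetical.

```python
def collect():
    """Data collection: return raw readings (hypothetical strings)."""
    return [" 12", "7", "", "oops", "30 "]

def clean(raw):
    """Data cleaning: trim whitespace and drop non-numeric entries."""
    stripped = (r.strip() for r in raw)
    return [r for r in stripped if r.isdigit()]

def process(cleaned):
    """Data processing: convert the cleaned strings to integers."""
    return [int(r) for r in cleaned]

def analyze(values):
    """Data analysis: compute a small summary."""
    return {"count": len(values), "mean": sum(values) / len(values)}

def visualize(summary):
    """Result visualization: render the summary as aligned text lines."""
    return "\n".join(f"{name:>5} | {value:.2f}" for name, value in summary.items())

def run_workflow(source, *steps):
    """Run the source step, then feed each result into the next step."""
    data = source()
    for step in steps:
        data = step(data)
    return data

report = run_workflow(collect, clean, process, analyze, visualize)
```

Because each step only depends on the output of the previous one, a step can be swapped or reused in a different workflow without touching the others.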

Challenges for the data science process include:

• Easily integrating all needed tasks to build a process as mentioned.

• Finding the best computing resources and efficiently scheduling process executions on those resources based on process definition, parameter settings and user preferences.


required, performing the application. Scalability of such a platform is an obvious requirement for any data science solution architecture.

Programmability: Programming languages like R and Python, and patterns like MapReduce, help in capturing and accomplishing scalable data science processes. Tools that provide access to such programming techniques are key to making the data science process programmable on a variety of platforms [11].
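The MapReduce pattern mentioned above can be sketched as a single-process word count. This is only an illustration of the pattern's three phases under simplifying assumptions; a real framework such as Hadoop distributes the phases across machines, and the function names and documents here are hypothetical.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def map_reduce(documents):
    """Run map over every document, then shuffle and reduce per key."""
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())

documents = ["big data needs big tools", "data science uses data"]
word_counts = map_reduce(documents)
```

The map and reduce functions never share state, which is the property that lets a framework run them in parallel on separate partitions of the data.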

IV. STEPS IN A DATA SCIENCE PROCESS

A. DATA ACQUIRING:

Data is the raw material from which solutions to business problems and data science problems are built. Data can be costly or available for free [5]. Data acquisition is the very first step in the data science process. It covers anything that deals with retrieving data, including finding, accessing, obtaining and moving data [2]. Some data is easy to acquire, while other data requires great effort [5]. The step includes identifying, and obtaining authenticated access to, all related data and transporting data from sources to distributed file systems. It is basically a matter of obtaining source material before analyzing or acting on it. No stone should be left unturned in finding suitable data related to the problem at hand. Any relevant data may prove useful for analysis in some way, and leaving out even a small amount of important data may lead to incorrect conclusions. Data may come from anywhere, local or remote, in many varieties, structured or unstructured, and at different velocities. Technologies exist for accessing different types of data. A lot of data, such as structured big data from organizations, exists in conventional relational databases. The tool for accessing such data is SQL, which is supported by all relational database management systems. Data can also exist in files such as text files or Excel spreadsheets. Scripting languages like JavaScript, PHP, Perl, R, and MATLAB can be used to get data from files. A scripting language can be general purpose or specialized for specific functions. A popular way to get data is from websites. Web pages are written using a set of standards approved by the W3C that includes a variety of formats and services. A common format is XML (Extensible Markup Language), which uses markup symbols or tags to describe the contents of a document such as a web page. Many websites also host web services which provide programmatic access to their data. There are several types of web services, a popular one being REST (Representational State Transfer), an approach to implementing web services with performance, scalability and maintainability in mind. REST is easy to use as well. WebSocket servers allow real-time updates from websites. NoSQL storage systems are increasingly used to manage the variety of data types found in big data. Examples of NoSQL storage systems include Cassandra, MongoDB, and HBase. NoSQL systems have APIs that allow users to access data, and most provide data access via a web service interface like REST [2].
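A minimal sketch of three of the access routes described above (SQL on a relational database, a delimited text file, and XML) is given below using only the Python standard library. The table, file contents and element names are invented for illustration, and an in-memory SQLite database and in-memory strings stand in for real sources; REST and NoSQL access are not shown because they need a live service.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET

# 1. Relational data: query an in-memory SQLite database with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 50.0)],
)
sales_by_region = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
conn.close()

# 2. File data: parse comma-separated text (a string stands in for a file).
csv_text = "name,age\nAsha,34\nRavi,29\n"
customers = list(csv.DictReader(io.StringIO(csv_text)))

# 3. XML data: read tagged content from a small document.
xml_text = "<catalog><item sku='A1'>Lamp</item><item sku='B2'>Pen</item></catalog>"
root = ET.fromstring(xml_text)
items = {element.get("sku"): element.text for element in root.iter("item")}
```

Note that values read from files and XML arrive as strings, so type conversion is left to the preparation step that follows acquisition.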

B. DATA PREPARATION:

The raw data obtained from the various sources in the previous step is rarely in a format suitable for analysis. Data needs to be prepared for analysis after it is acquired. Data preparation is divided into two steps based on the nature of the activity: exploring data and pre-processing data.

Exploring data: Once the data has been obtained, it needs to be explored. A preliminary investigation is carried out to get a better understanding of the specific characteristics of the data. Things like correlations, general trends and outliers are looked into, improving the chances of using the data effectively. Correlation graphs help explore dependencies between different variables in the data. Graphing the general trends of variables shows whether there is a consistent direction in which their values are moving, such as a sales price going up or down. Plotting outliers helps in checking for errors in the data due to measurement. Outliers that are not erroneous help in finding exceptional events. Summary statistics are quantities that capture various characteristics of a set of values with a single number or a small set of numbers. Summary statistics such as mean, median, mode, range and standard deviation can be calculated to provide numerical values that describe the data. Mean and median measure the location of a set of values. Mode is the value that occurs most often. Range and standard deviation measure the spread of the data. Visualization techniques also give a quick and useful way to look at data in this phase. A heat map, for instance, gives a quick idea of hotspot locations. Other types of graphs, like histograms, give an idea of the distribution of the data and show skewness or unusual dispersion [2]. Box plots, a convenient way of graphically depicting groups of numerical data


and spikes are easy to spot here. Scatter plots show the relation between two variables. All these graphs help us understand the complexity of the data and guide us in the upcoming steps [2].
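The numerical side of data exploration (summary statistics, correlation and outlier checks) can be sketched with the Python standard library. The sample values are invented, and the 1.5 × IQR outlier rule is a common convention assumed here rather than one stated in the paper; `summarize`, `pearson` and `iqr_outliers` are hypothetical names.

```python
import statistics as st

def summarize(values):
    """Location (mean, median, mode) and spread (range, standard deviation)."""
    return {
        "mean": st.mean(values),
        "median": st.median(values),
        "mode": st.mode(values),
        "range": max(values) - min(values),
        "stdev": st.stdev(values),
    }

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = st.mean(xs), st.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)

def iqr_outliers(values):
    """Flag values more than 1.5 interquartile ranges outside the quartiles."""
    q1, _, q3 = st.quantiles(values, n=4)
    margin = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - margin or v > q3 + margin]

daily_sales = [10, 12, 11, 13, 12, 14, 95]  # hypothetical; 95 looks suspicious
summary = summarize(daily_sales)
suspects = iqr_outliers(daily_sales)
price_tax_correlation = pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

In this sample the mean is pulled far above the median by the single large value, which is the kind of discrepancy that prompts a closer look at outliers.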

Preprocessing data:

Preprocessing of data for analysis comes after a bit more is known about the data through exploratory analysis. Pre-processing has two main aims: to clean the data to address data quality issues, and to transform the data to make it fit for analysis. Real-world data is very messy and requires a lot of preparation to deal with quality issues. Examples of quality issues in data from real-world applications include inconsistent data, like a customer with multiple addresses; duplicate records, like a customer address recorded at two sales locations where the two records do not even agree; missing values, like a missing customer age in demographic studies; invalid data, like a 6-digit zip code; outlier values; and others. Data flows in at great speed, and it is next to impossible to prevent quality issues at such speed. As a result, data has to be relieved of its quality issues by deleting the affected records or correcting them wherever possible.

Some of the methods for overcoming data quality issues include removing records with missing values, merging duplicate records, retaining newer values in case of a conflict, replacing missing values with a reasonable estimate and removing unimportant outliers, to mention only a few. When working with incomplete or incorrect data, domain knowledge is essential for making wise decisions.
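Several of the cleaning methods listed above can be sketched on a small in-memory table. The records, field names and rules are assumptions made for illustration: a valid zip code is taken to be exactly five digits, conflicts are resolved by keeping the most recently updated record, and a missing age is replaced by the median of the known ages.

```python
from statistics import median

# Hypothetical customer records with typical quality issues.
records = [
    {"id": 1, "age": 34, "zip": "70001", "updated": "2016-01-05"},
    {"id": 1, "age": 34, "zip": "70016", "updated": "2016-06-20"},    # conflicting duplicate
    {"id": 2, "age": None, "zip": "70020", "updated": "2016-03-11"},  # missing value
    {"id": 3, "age": 41, "zip": "700", "updated": "2016-02-02"},      # invalid zip
    {"id": 4, "age": 29, "zip": "70045", "updated": "2016-04-18"},
]

def merge_duplicates(rows):
    """Keep one record per id, retaining the most recently updated one."""
    latest = {}
    for row in rows:
        kept = latest.get(row["id"])
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        if kept is None or row["updated"] > kept["updated"]:
            latest[row["id"]] = row
    return list(latest.values())

def drop_invalid_zip(rows, length=5):
    """Remove records whose zip code is not exactly `length` digits."""
    return [r for r in rows if r["zip"].isdigit() and len(r["zip"]) == length]

def fill_missing_age(rows):
    """Replace a missing age with the median of the known ages."""
    estimate = median(r["age"] for r in rows if r["age"] is not None)
    return [{**r, "age": estimate if r["age"] is None else r["age"]} for r in rows]

cleaned = fill_missing_age(drop_invalid_zip(merge_duplicates(records)))
```

Whether to drop, merge or estimate is a judgment call for each field, which is where the domain knowledge mentioned above comes in.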

After the data is cleaned comes the task of manipulating it into the format required for analysis. This step is referred to as data manipulation, further preprocessing, data wrangling or data munging, and its operations include scaling, transformation, feature selection and dimensionality reduction.

(Data wrangling and data munging are related terms. They can be defined as the process of manually converting or mapping data from its raw form into a format that allows for more convenient consumption of the data, with the help of semi-automated tools. Steps may include further munging the data using algorithms or parsing the data into predefined data structures, data visualization, data aggregation or training a statistical model. It brings data together into cohesive views and is quite often referred to as the janitorial work of cleaning up data so that it is polished and ready for further use [14].)

Scaling involves changing the range of values to lie within a specified range. This prevents large values from dominating the results. Transformations, like aggregation, can be carried out on data to reduce noise and variability. Aggregated data has less variability, which helps analysis. For example, daily sales figures fluctuate sharply, and aggregating them into weekly or monthly sales figures results in smoother data. Feature selection involves removing redundant or irrelevant features, combining features or creating new features. Removing one of two correlated features does not negatively affect the analysis results. For example, the purchase price of a product and the amount of sales tax paid are correlated, so removing one of them, such as the sales tax, is helpful. Removing redundant or irrelevant features makes the analysis simpler. Combining features or creating new ones can also be useful, such as adding an applicant’s education level as a feature in a loan approval application. Various algorithms can also automatically determine the most relevant features based on mathematical properties. Dimensionality reduction helps when the data has a large number of dimensions. It involves finding a smaller subset of dimensions that captures most of the variation in the data. It thus reduces the dimensions of the data, removing irrelevant features and making analysis simpler. The PCA (principal component analysis) technique helps in dimensionality reduction. Organizing data in the correct format is also an important part of data preparation. For example, given samples recording daily changes in stock prices, grouping the prices by market segment, such as real estate or health care, may help analysis, depending on which stock belongs to which segment. Grouping them together makes it easier to compute statistics like mean, range and standard deviation for each group [2].
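Two of the transformations described above, scaling and aggregation, can be sketched as follows. Min-max scaling is assumed as the scaling method, since the paper does not name one, and the sales figures and function names are hypothetical.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Rescale values linearly so that they span the range [low, high]."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        return [low for _ in values]  # constant input: nothing to spread out
    factor = (high - low) / (largest - smallest)
    return [low + (v - smallest) * factor for v in values]

def aggregate(values, size):
    """Sum consecutive blocks of `size` values, e.g. 7 daily figures per week."""
    return [sum(values[i:i + size]) for i in range(0, len(values), size)]

scaled = min_max_scale([10, 20, 30, 50])

# Hypothetical daily sales for two weeks: noisy from day to day.
daily_sales = [5, 9, 4, 12, 7, 3, 10, 6, 11, 5, 8, 4, 9, 7]
weekly_sales = aggregate(daily_sales, 7)
```

In this sample the daily figures range from 3 to 12, yet both weekly totals come out equal, which illustrates how aggregation removes day-to-day variability.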

In summary, data preparation is a very important phase of the data science process and takes the most time. It is a time-consuming and meticulous process, and if good data is not created for analysis, good results will not be generated no matter how good the technology used. Garbage in, garbage out [2].

C. ANALYZING DATA:


data can take multiple iterations, and may even require going back to step 1 and step 2 to get more data or to package the data in a different way.

The analysis technique uses the input data to build a model, and the model generates the output data. Different types of problems call for different types of analysis techniques. The main categories are classification, clustering, regression, association analysis and graph analysis.

Classification aims to predict the category of the input data, for example predicting the weather as sunny, rainy, windy or cloudy. Other examples are classifying a tumor as benign or malignant, and identifying a handwritten digit as one of the ten categories from 0 to 9.

Regression uses a model that predicts a numeric value instead of a category, for example the price of a stock. Other examples of regression include estimating the weekly sales of a new product or predicting the score of a test.

In clustering, the goal is to organize similar items into groups, for example grouping a company’s customer base into distinct segments such as teenagers, adults and seniors so that marketing can be targeted more effectively. Identifying areas of similar topography (mountains, deserts, plains) for land use applications and determining different groups of weather patterns (rainy, cold, snowy) are other instances where clustering is used.

Association analysis produces a set of association rules that capture associations between items or events. The rules determine when and which items or events occur together. A common example is market basket analysis, which studies customers’ purchasing histories to find items that are bought together, so that those items can be presented in groups and the company benefits from increased sales. According to a well-known data mining myth, a supermarket chain used association analysis to discover a connection between two seemingly unrelated products: many customers who went to the supermarket late on Sunday night to buy diapers, probably fathers, also tended to buy beer. Using this information, diapers and beer were placed together, and sales of both items reportedly rose sharply. This became known as the famous diaper-beer connection.

Graph analysis is used when the data can be transformed into a graph representation with nodes and links. Such data arise when there are many entities with connections between them, as in social networks. Graph analysis can be used, for instance, to explore the spread of diseases through hospitals’ and doctors’ records, or to identify security threats by monitoring social media, email and text data.
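The association rules described above are commonly scored with support, confidence and lift. The following is a minimal sketch of those three measures in Python, assuming a tiny set of invented baskets; the data, the names BASKETS, support, confidence and lift, and the choice of measures are illustrative assumptions and are not taken from the cited sources.

```python
# Hypothetical toy baskets, invented for illustration; not data from the paper.
BASKETS = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    items = set(itemset)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent.

    Assumes the antecedent occurs in at least one basket (otherwise this divides by zero).
    """
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Confidence relative to the consequent's base rate; above 1 suggests a positive association."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


# Score the rule {diapers} -> {beer} on the toy baskets.
rule_support = support({"diapers", "beer"}, BASKETS)
rule_confidence = confidence({"diapers"}, {"beer"}, BASKETS)
rule_lift = lift({"diapers"}, {"beer"}, BASKETS)
print(rule_support, rule_confidence, rule_lift)
```

In these invented baskets, a lift above 1 for the rule diapers -> beer would be read as the two items appearing together more often than their individual frequencies alone would suggest.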

Data modeling begins by selecting one of the abovementioned techniques that is thought to be appropriate for the type of problem. The model is then constructed using the prepared data and validated using new data samples. A common approach is to divide the prepared data into one set for constructing the model and a second, reserved set for evaluating the model after construction. How the model is evaluated depends on the type of analysis technique used.
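The approach of reserving part of the prepared data for evaluation can be sketched as a simple hold-out split. This is a minimal illustration using only the Python standard library; the function name train_test_split, the 25% test fraction and the fixed seed are assumptions made for the example, not values given in the source.

```python
import random


def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of `samples` and return (train, test) lists.

    `test_fraction` and `seed` are illustrative defaults, not recommendations.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical prepared data: 20 sample identifiers.
samples = list(range(20))
train, test = train_test_split(samples, test_fraction=0.25, seed=42)
print(len(train), "samples for building the model,", len(test), "held back for evaluation")
```

Shuffling before splitting avoids reserving, say, only the most recent records for evaluation; whether that is desirable depends on the problem, for example on whether the data is time-ordered.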

Each technique is evaluated in its own way. For classification and regression, the correct output is known for each sample in the input data, so comparing the correct output with the output predicted by the model provides a way to evaluate the model. For clustering, the resulting groups are examined to see whether they make sense for the application. Deciding factors include whether the customer segments reflect the actual customer base and whether they help the targeted marketing campaigns. Association analysis and graph analysis require some investigation to see whether the results are correct. For example, network traffic delays are investigated to see whether the sources of the delays are really where the model predicts them to be.
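For classification and regression, the comparison between correct and predicted outputs is usually summarized in a single number. The sketch below uses accuracy for classification and root mean squared error (RMSE) for regression; both metrics are chosen here as common examples and are not prescribed by the source, and all labels and prices are invented.

```python
import math


def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted category matches the correct one."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def rmse(true_values, predicted_values):
    """Root mean squared error between correct and predicted numeric values."""
    if len(true_values) != len(predicted_values):
        raise ValueError("value lists must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical held-out samples for a weather classifier and a stock price regressor.
weather_true = ["sunny", "rainy", "windy", "sunny", "cloudy"]
weather_pred = ["sunny", "rainy", "sunny", "sunny", "cloudy"]
price_true = [10.0, 12.0, 11.0, 13.0]
price_pred = [11.0, 12.0, 9.0, 13.0]

print("classification accuracy:", accuracy(weather_true, weather_pred))
print("regression RMSE:", rmse(price_true, price_pred))
```

Higher accuracy and lower RMSE indicate a better fit on the held-out samples, but whether a given value is good enough depends on the success criteria of the project.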

After the model is evaluated, a sense of its performance on the data is obtained, and this guides the next steps. Further work may still be done: the model can be analyzed with more data, with different types of data, or with additional attributes such as zip code to obtain finer-grained results. To sum up, the model must perform well with respect to the success criteria defined at the beginning of the project [2].

D. COMMUNICATING RESULTS:

Communicating results involves evaluating the analytical results, presenting them visually and creating reports that assess the results with respect to the success criteria. Communicating the insights paves the way for the actions that need to follow.

as faring, compared to the success criteria determined at the beginning of the project. These criteria must be backed by adequate facts. Some of the analysis results may be unfavorable or counter to what was expected, and some may be inconclusive or puzzling; all of these need to be shown in the presentation. Domain experts may also find some of the results puzzling or inconclusive, but such results lead to further analysis. All findings need to be presented so that informed decisions can be made. Visualization is an important aspect of presenting the results. Techniques include scatter plots, line graphs, heat maps and other types of graphs. The output data is plotted much like the input data was earlier, with a similar set of tools. Analysis results can also be stored in tables so that they can be examined in more depth later. Open source visualization tools available today include R, a software package for general data analysis with powerful visualization capabilities, and Python, a general-purpose programming language with a number of packages that support data analysis and graphics. D3 is a JavaScript library for producing interactive web-based visualizations and data-driven documents. Leaflet is a lightweight, mobile-friendly JavaScript library for creating interactive maps. Tableau Public allows users to create visualizations in public profiles and to share them or embed them in a site or blog. Timeline is a JavaScript library for creating timelines.

Summing up, findings need to be reported by presenting the results, and graphs and visualization tools add value to that presentation [2].

E. TURNING INSIGHTS INTO ACTION:

Insights are the products of data science, and they have to be extracted from large and diverse data. After the results of the analysis have been evaluated and reports on their value have been produced, the next step is to take action based on those insights. The entire purpose of bringing in data and analyzing it is to find actionable insights within the data, to answer questions and to improve business processes. Such actions may include changing something in a process to remove bottlenecks, adding more data to make applications more accurate, and segmenting data into more well-defined groups for effective marketing. Once the actions have been determined, the next task is to figure out how to implement them. The business stakeholders play a major part in this. The impact of each action on the application needs to be observed, which in turn leads to further evaluation and next steps. Next steps may include additional analysis to obtain better results, revisiting certain datasets and exploring additional opportunities [2].

The data science process is iterative, and findings from one step may require previous steps to be repeated with new information. It often happens that a model passes strict evaluation tests in the lab and proves to be extremely accurate, yet external considerations make it practically inapplicable or economically infeasible. For example, a common flaw of detection solutions such as fraud detection, spam detection and intrusion monitoring is the production of false alarms [5]. This requires going back to the lab and remodeling.
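The false alarm problem can be made concrete with a little arithmetic: when the event being detected is rare, even a detector that looks accurate in the lab can raise far more false alarms than true ones. The sketch below works through one such case; every number in it (case count, prevalence, sensitivity, false positive rate) is a hypothetical assumption chosen for illustration and does not come from [5].

```python
def alarm_statistics(n_cases, prevalence, sensitivity, false_positive_rate):
    """Expected alarm counts for a detector applied to `n_cases` cases.

    prevalence          : fraction of cases that are truly positive (e.g. fraudulent)
    sensitivity         : fraction of true positives that the detector flags
    false_positive_rate : fraction of true negatives that the detector flags anyway
    """
    positives = n_cases * prevalence
    negatives = n_cases - positives
    true_alarms = positives * sensitivity
    false_alarms = negatives * false_positive_rate
    total_alarms = true_alarms + false_alarms
    # Precision: the share of raised alarms that are real.
    precision = true_alarms / total_alarms if total_alarms else 0.0
    return {
        "true_alarms": true_alarms,
        "false_alarms": false_alarms,
        "precision": precision,
    }


# Hypothetical scenario: 100,000 transactions, 0.1% of them fraudulent,
# a detector that catches 99% of fraud and wrongly flags 1% of legitimate ones.
stats = alarm_statistics(
    n_cases=100_000, prevalence=0.001, sensitivity=0.99, false_positive_rate=0.01
)
print(stats)
```

Under these assumed numbers, false alarms outnumber true alarms by roughly ten to one, so only about 9% of raised alarms are real. This is the kind of external consideration that can make an apparently accurate model impractical to deploy.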

The data science process is never static and is not a one-time analysis. Models built to produce insights need constant improvement based on further empirical evidence or data. For example, a book retailer such as Amazon.com can constantly improve its model of a customer’s book preferences using the customer’s demographics, purchase history and book reviews. The retailer can also use this information to predict which customers are likely to buy a particular book and then market the book to them. In this way data science turns analysis of past and current information into action. Business leaders and decision makers take action based on the evidence provided by their data science teams.

V. BUILDING A DATA SCIENCE TEAM

Data science is a team game. It involves a large group of people working together to solve real-life, practical data science projects. The group includes data engineers, data scientists, data science managers who lead teams of data scientists and data engineers, and external people contacted from outside the data science team. A data science team depends on quality communication between all of them.

A. DATA ENGINEERS:

analyze them and building production-level machine learning algorithms and implementing them on servers so that they can be applied to huge databases of observations and possibly function in real time. He may buy or order the hardware, or prepare the system on which the actual computations are performed. He decides which servers need to be bought and how they are organized, and which software runs on top of the databases and servers. He monitors data storage and keeps track of who is using which data. Data retrieval is entirely part of his job. He therefore needs skills in infrastructure development on both the software and the hardware side, as well as in software engineering. He must be adept at data processing at scale. He should be comfortable working with SQL databases and NoSQL databases such as MongoDB, and with running Hadoop. He is not involved in the day-to-day analysis of data. A data engineer may come from a computer science or IT background, or from a quantitative background with some computer science expertise. In terms of character, he must have the desire to find answers on his own and the ability to work hard under pressure. He must be friendly and easy to communicate with [4].

B. DATA SCIENTISTS:

Like a data engineer, a data scientist is involved in pulling data out of databases. However, his other main tasks include cleaning the data, running experiments on it, analyzing and visualizing it, and communicating the results to the data science manager or other people in the organization. He must be adept at statistics, know how to infer and predict, and have a solid grip on machine learning algorithms. He must know exactly how strong the data are when decisions are based on them. He must have skills in general-purpose data science languages such as R and Python, which are widely used to analyze data. He must also be skilled in SQL, MongoDB and the like. He holds the key to moving things forward. He must know how to build machine learning or prediction algorithms that the data engineers can then apply at scale. A data scientist may come from a statistics or biostatistics background, having learnt many applications and worked on a large number of real-life datasets. To be able to interact with other people in the company, he should be trained in software engineering and know a little about version control and other programming languages. In terms of character, he must be open to learning, be meticulous about getting all details right, and not be intimidated by new datasets, new data types, new ideas or new software. He must react positively to new datasets and not try to force them into the box of data he already knows [4].

C. DATA SCIENCE MANAGERS:

A data science manager is responsible for building the data science team. He recruits data engineers and data scientists. He sets goals and priorities and identifies problems within the organization that need the attention of the data science team. He puts the right people on the right problems. He monitors interactions within the team as well as with people in external organizations that are involved with the team. He supervises meetings and steers things in the right direction. In terms of skills, he must know the software and hardware being used. Knowledge of either data science or data engineering is an added bonus, as he can then offer suggestions about the infrastructure, about fixing problems, and about prediction and machine learning algorithms that do not give the expected results. It is not mandatory for him to be qualified in data science or data engineering, but he must be well acquainted with the work of the data scientists and data engineers as well as the work of the other teams. He must have a clear picture of what is achievable and what is not. Several factors play a part in this, such as data availability, algorithm availability and the scaling up of smaller solutions into larger ones. He may come from a data science or data engineering background, or from a managerial background with data science or data engineering knowledge. In terms of character, he must have the best communication skills, as he acts as an interface between the team and the upper levels of the organization, and as a moderator at the same level across the organization to ensure that information is well spread. He also promotes the data science team to other people, explaining its capabilities and how it can help. He must be supportive and motivational. He must also be able to take the blame for disappointing results [4].

tough holding individual meetings. Group meetings, held by the manager, are useful in those situations. The ideal is a good mix of individual meetings, where people can share their personal issues and successes, and group meetings, where everybody can come together and communicate. The manager must initiate meetings between the team members and any external units involved in the project. The manager has to monitor the meetings in every case, either directly, by being involved in the Slack channel or attending the meetings, or indirectly, by hearing feedback from the external organization or from anyone he has embedded there.

Intra-team meetings are essential because they help individuals overcome difficulties they cannot solve on their own; a solution is often easy once another member points it out. This adds to the overall progress of the team. Intra-team meetings also give an idea of what exactly each individual is working on. Inter-team meetings between teams working on a project are essential as well. A meeting between the data engineering team and the data science team can resolve problems with, for example, infrastructure that was built by the engineering team and is being used by the data science team. The environment in team meetings must allow everyone to chip in, and everyone must be open to comments, opinions and criticism, which should be delivered in a way that everyone present can accept. This is where the manager’s actions matter as far as motivation is concerned. The attitude must never be rude or meant to demean others.

The manager must monitor all communication so that it is neither too slow nor too fast. Slow communication indicates that not enough detail is going back and forth, while fast communication indicates that an external unit is making too many demands of the data scientists. In that situation the manager must moderate these issues and take some of the heat off the data scientists or data engineers who are at the receiving end of those requests. A data science manager must keep his door open to the data scientists and engineers for consultation whenever they wish, whether via email, chat or meetings. Formal meetings must not be held so frequently that they break the concentration of the data scientists and slow down progress; data science work requires a lot of uninterrupted and intense effort.

The manager must look to grow the organization both in headcount and in the skill set of the existing individuals, by introducing them to new tools, encouraging them to take up courses, or letting them take part in seminars and conferences where knowledge is shared. He must provide enough opportunities to each of them. He must also help them promote themselves by identifying opportunities for them and enabling interactions at external conferences and meetups so that their connections increase. After the completion of each task, the manager must celebrate success, individual as well as group, small or big, to motivate people and inspire them for future work [4].

VI. MORE ON SKILLS OF A DATA SCIENTIST

A. MATHEMATICS EXPERTISE:

To derive insight from data, the first and foremost requirement in a data scientist is the ability to view data through a quantitative lens. Data can have textures, patterns, dimensions and correlations that can be expressed numerically, and drawing inferences from data requires adequate knowledge of mathematical techniques. Solutions to many business problems involve building analytic models that are based on profound mathematical theory. Understanding how these models work is as important as knowing how to build them. A data scientist who knows how to build a model but lacks a mechanical understanding of how it works ends up misusing algorithms, misinterpreting results and deriving misleading insights that can steer a business in the wrong direction.

Data science is not all about statistics. Statistics is important, but it is not the only type of mathematics that a data scientist should understand well. Statistics has two main branches: classical statistics and Bayesian statistics. Colloquially, stats refers to classical statistics, but knowledge of both branches is very important for a data scientist. Many inferential techniques and machine learning algorithms rely heavily on linear algebra. For example, SVD (Singular Value Decomposition), a popular method for discovering hidden characteristics in a data set, is based on matrix mathematics and does not have much to do with classical statistics. Thus a data scientist should have substantial in-depth knowledge of mathematics [1].
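To give a feel for the matrix mathematics behind SVD, the sketch below approximates only the largest singular value and its singular vectors using power iteration on the product of the transposed matrix with the matrix itself. This is a simplified stand-in for a full SVD, written in plain Python for illustration; in practice a library routine such as numpy.linalg.svd would normally be used. The matrix is invented, and the function names are assumptions made for the example.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def leading_singular_triple(matrix, iterations=200):
    """Approximate (sigma, u, v) for the largest singular value of `matrix`.

    Uses power iteration on A^T A, starting from a vector of ones. Assumes the
    matrix is non-zero and that the start vector is not orthogonal to the
    leading right singular vector.
    """
    matrix_t = transpose(matrix)
    v = [1.0] * len(matrix[0])
    for _ in range(iterations):
        w = matvec(matrix_t, matvec(matrix, v))  # (A^T A) v
        length = norm(w)
        v = [x / length for x in w]
    av = matvec(matrix, v)
    sigma = norm(av)               # largest singular value
    u = [x / sigma for x in av]    # matching left singular vector
    return sigma, u, v


# Hypothetical ratings matrix (rows: customers, columns: products) in which
# every row is a multiple of one hidden taste profile (1, 2, 0).
ratings = [
    [2.0, 4.0, 0.0],
    [1.0, 2.0, 0.0],
    [3.0, 6.0, 0.0],
]
sigma, u, v = leading_singular_triple(ratings)
print("largest singular value:", sigma)
print("hidden product profile:", v)
print("customer weights:", u)
```

Here the right singular vector recovers the direction of the shared taste profile and the left singular vector indicates how strongly each customer follows it, which is the sense in which SVD discovers hidden characteristics in a data set.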

B. TECHNOLOGY AND HACKING:

using tools far more sophisticated than Microsoft Excel, like SQL, SAS (Statistical Analysis System), and R, all of which require technical and coding ability. Along with the capacity to use these high-performance tools, a data scientist must possess ingenious problem-solving ability in order to master data exploration, that is, to piece together unstructured information and extract the most vital insights.

A data scientist needs to be an algorithmic thinker. He must be able to break down unorganized problems and recompose them in ways that are solvable. A good data scientist works intimately with existing algorithmic frameworks, but must also be able to create algorithms of his own to solve complex problems when the situation demands. He needs a strong mental grasp of high-dimensional data and complicated data control flows. In the end he must have a clear picture of how all the separate fragments come together to form a cohesive solution [1].

C. BUSINESS ACUMEN:

Although this skill is less essential to a data scientist than the other two, a data scientist still needs to be a tactical business consultant. Business acumen is needed alongside a grasp of technology and algorithms. Working with data from breakfast to dinner, a data scientist is positioned to learn from data in ways no one else can. This brings the responsibility to translate observations into shared knowledge and to contribute to strategies for solving core business problems. A data scientist must therefore be able to tell a convincing story from the data: a cohesive narrative of problems and their solutions, with data insights as supporting pillars, that leads to guidance. There must be a clear alignment between a data science project and its business goals. To sum up, the value does not come from data, mathematics or technology individually; it comes from leveraging all of them to build valuable capabilities with strong business influence [1].

VII. RESULT

• Fig 1 [2] shows the entire data science process as a flow chart. It is described in Section IV.

• Fig 2 [16] shows some of the methods used to explore data and get a view of its characteristics: 2(a) correlation graph, 2(b) plot of general trends of variables, 2(c) heat map, 2(d) histogram, 2(e) line graph and 2(f) scatter plot. They are described in Section IV.B.

• A Venn diagram of the main skills needed by a data scientist shows them as the intersection of mathematics expertise, hacking skills and business acumen, as in Fig 3(a) [1]. A closer look at the skill set shows that it demands skills in fields such as statistics, advanced computing, domain knowledge, scientific methods and visualization, to mention only some, as shown in Fig 3(b) [16]. Section VI describes the main skills necessary for a data scientist.

Fig: 2 (a) Correlation graph (b) General trends of variables (c) Heat map (d) Histogram (e) Line graph

Fig: 3

VIII. CONCLUSION

Data science is an evolution of traditional business intelligence. The two share common characteristics: both are information-centric and analytical, rely on numerical values, use statistical methods to extract insights from data, and demand skills in visualization and lucid presentation of results to bring out new findings. However, as data grew in volume and variety, businesses began leaning on data science to capitalize on market opportunities and gain a competitive edge.

Business intelligence systems deliver answers to questions that are known in advance. They are single systems that do not allow predictions. They show relationships among variables without making it easy to draw new meaning from those relationships, and they are inefficient at obtaining insights from new data. Data science, on the other hand, helps in achieving real-time insights and quick predictions. It provides customer insights that companies can use to predict present as well as future patterns, which helps them react to customers’ behavior and reach out to them in a better way. Business intelligence works on past records and makes organizations retrospective and reactive in their analysis of data. Data science makes people predictive, proactive and empirical. Today it is a fundamental aspect of all data-driven organizations, putting business intelligence far behind.

Nowadays much of the data, if not all, is semi-structured or unstructured and requires tools like Hadoop and NoSQL databases for storage and handling, often in real time and on the go. Business intelligence does not allow this flexibility in the use of data and works mainly on structured data captured in data warehouses and data silos. Because of its static nature, business intelligence data sources tend to be pre-planned and are added slowly. This puts businesses behind by making them miss opportunities. By virtue of its properties, data science also takes pressure off the data warehouses and decreases data storage cost. Business intelligence does not offer the room for exploration and experimentation in how data is collected and managed that data science provides. Business intelligence provides a single version of the truth, while data science qualifies its findings with precision, confidence levels and a wider range of probabilities.

In the past, business intelligence systems were often owned and operated by the IT department, which passed intelligence along to analysts who interpreted it. With the advent of data science, the analysts are in charge. The new big data solutions are designed to be owned by analysts, who tend to spend as little time as possible on IT management and devote most of their time to analyzing data and making predictions on which to base business decisions [15].

REFERENCES

[1] https://datajobs.com/what-is-data-science, Last visited: 9th September 2016.

[2] Coursera: Introduction to Big Data. University of California, San Diego.

[3] https://en.wikipedia.org/wiki/Box_plot, Last visited: 9th September 2016.

[4] Coursera: Building a Data Science Team. Johns Hopkins University.

[5] Foster Provost and Tom Fawcett, “Data Science for Business”. O’Reilly publications.

[6] http://www.thearling.com/text/dmwhite/dmwhite.htm, Last visited: 9th September 2016.

[7] Rohan Sharma, Kalpit Shah, Yeshesvini Shirahatti, Smruti Patel. “Data Warehouse and OLAP Technology. Part–I”.

[8] http://whatis.techtarget.com/definition/machine-learning, Last visited: 9th September 2016.

[10] http://datascienceseries.com/stories/ten-practical-big-data-benefits, Last visited: 9th September 2016.

[11] http://words.sdsc.edu/words-data-science/data-science, Last visited: 9th September 2016.

[12] http://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/, Last visited: 9th September 2016.

[13] https://en.wikipedia.org/wiki/Unsupervised_learning, Last visited: 9th September 2016.

[14] https://en.wikipedia.org/wiki/Data_wrangling, Last visited: 10th September 2016.

[15] http://www.itproportal.com/2016/08/18/10-differences-between-data-science-and-business-intelligence/, Last visited: 11th September 2016.

[16] https://www.google.co.in/ (various sources)

BIOGRAPHY

Kaustav Ghosh is a postgraduate in Computer Science from St. Xavier’s College (Autonomous), Kolkata. He completed his master’s degree in 2016. He has published papers on Cognitive Radio and Big Data in international journals. He is currently doing research in Data Science, Big Data Analytics and Cloud Computing.
