• No results found

CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science

N/A
N/A
Protected

Academic year: 2021

Share "CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science"

Copied!
31
0
0

Loading.... (view fulltext now)

Full text

(1)

20

CAP4773/CIS6930

Projects in Data Science, Fall 2014

[Review] Overview of Data Science

Dr. Daisy Zhe Wang CISE Department University of Florida

August 25th 2014

(2)

Review

Overview of Data Science

• Why Data Science?

• What is Data Science?

• What are some prominent examples of Data Science?

• How to become a Data Scientist?

• Who are hiring Data Scientists Now?

(3)

The Dawn of Big Data

• Google, Yahoo today

– Core Business: Web Search and Computational advertising – Google: 35,000 searches/sec

– Yahoo! scale: 600 million users per month, 4 billion clicks per day, 25 terabytes of data collected every day

• Netflix 2007

– Movie recommendations

– 100 million ratings, 500,000 users, 18,000 movies – netflix prize

• Amazon 2003

– Product recommendations, reviews

– 29 million customers, millions of products

• Word Economic Forum 2011 at Davos

– Personal data – digital data created by and about people – represents a new economic “asset class”, touching all aspects of society.

22

(4)

How Big is Your Data?

• Kilobyte (1000 bytes)

• Megabyte (1 000 000 bytes)

• Gigabyte (1 000 000 000 bytes)

• Terabyte (1 000 000 000 000 bytes)

• Petabyte (1 000 000 000 000 000 bytes)

• Exabyte (1 000 000 000 000 000 000 bytes)

• Zettabyte (1 000 000 000 000 000 000 000 bytes)

• Yottabyte (1 000 000 000 000 000 000 000 000 bytes)

It is not only about volumes!

(5)

5 Vs of Big Data

• Raw Data: Volume

• Change over time: Velocity

• Data types: Variety

• Data Quality: Veracity

• Information for Decision Making: Value

24

(6)

Multimodal Data Sources on

World Cup Common

Twitter

~5M tweets with 244,000

associating images

News & Blogs 776 blog posts and associated

pictures

Youtube 460 minutes of

Flickr

913 images and associated

captions Facebook

250 posts and

associated images Google Images

1,920 images and associated

captions

(7)

Current Trends

• Applications has bigger data and need more advanced analysis

– Example: Web, Corporate documents and Emails

• Natural Language Processing

– Example: Social Media

• Network/Graph Analysis

• IT Infrastructure moving to Cloud Computing (2006)

– Parallel Computing, SaaS

– Elastic Computing – pay-as-you-go, scale up/down

• Data Science arise given this application pull and technology push (2011)

26

(8)
(9)

What is Data Science?

28

(10)

Data Science – A Definition

Data Science is the science which uses computer science, statistics and machine learning,

visualization and human-computer interactions to collect, clean, integrate, analyze, visualize, interact with data to create data products.

(11)

Goal of Data Science

Turn data into data products.

30

(12)

Data to Data Products

• Web Text Data  Knowledge Bases with Facts & Relationships

• Twitter Text/Images  Real-time Events and News

• Knowledge bases  first-order inference rules and constraints

• Text Data, Social Media Data  Product Review and Consumer Satisfaction

• Software Log Data  Automatic Trouble Shooting

• Genotype and Phenotype Data  common patterns  New treatment for Cancer

(13)

Other Data Products

• Financial products for investment or retirement funds

• Legal profession uses e-discovery tool for retrieval and review of legal documents

• Political campaign management

• Sports (e.g., Oakland baseball team)

• Remote Sensing for Environment Monitoring

32

(14)

What are some prominent examples of Data Science?

(15)

Data Products – Google

• Web Search

• Google Ads

• News Recommendation Engine

• Google Maps

Currently one of the best if not the best IT

company to work for. (Google event on Jan 21/22)

34

(16)

Data Products – Netflix

• Personalized Movie Ratings

• Movie Recommendations

• Similar Movies

• Movie Categories (e.g., 80’s movie with a strong female lead, Kung Fu movies)

BlockBuster is out of the business …

(17)

Data Products – LinkedIn/Facebook

• People you may know

• Applications you may like

• Jobs/Events you might be interested

• Classifier for bad users and bad content

• With high accuracy, Facebook can guess whether you are single or married

Who does not have LinkedIn/Facebook Account?

36

(18)

Data Products – Twitter

• Text Analysis – Spam Filter/Similarity Search

• User Sentiment/Satisfaction/Feedback

• News Breakout

• Trend and Topics

200 million users as of 2011, generating over

200 million tweets and handling over 1.6 billion search queries per day

(19)

Data Products – Splunk

• Degradation, Failure Detection

• Identify Security Breach

• Event Monitoring

• Troubleshoot Tools

• Cross-platform Event Correlation Founded 2004, IPO in 2012

38

(20)

How to become a Data Scientist?

(21)

The Life of Data (state-of-the-art)

Collect Clean Integrate Visualization Interface

Users

Data Sources

Analysis

40

(22)

Challenges in Data Science

• Find Data (Data Collection, Quality)

• Prepare Data (Noisy, Incomplete, Diverse, Streaming …)

• Analyze Data (Scalable, Accurate, Real-time, Advanced Methods, Probabilities and

Uncertainties ...)

• Represent Analysis Results (i.e. data product) (Story-telling, Interactive, explainable…)

(23)

Skill Set of a Data Scientist

• Data Management

– Data collection, storage, cleaning, filtering, integration

• Large-scale Parallel Data Processing

– Parallel computing

• Statistics and Machine Learning

– Data modeling, inference, prediction, pattern recognition …

• Interface and Data Visualization

– HCI design, visualization, story-telling …

42

(24)

Who are hiring Data Scientists Now?

(25)

“Sexy Job” in the next 10 years

“The sexy job in the next ten years will be … The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a

hugely important skill.”

-- Hal Varian, Google Chief Economist, 2009

44

(26)

Who’s hiring Data Scientist?

• IT companies: Google, Twitter, Lexis/Nexis, Facebook, Pivotal/EMC…

• Media and Financial sectors Fox, CNN, NYT, Bloomburg,…

• Research: Biology, Medicine, Physics, Psychology,…

• Information office in government and corporations…

• Law firms: e-discovery tools…

(27)

Books on Data Science

46

(28)

Additional Reading Pointers

• Data Science Summit (Strata)

(http://www.datascientistsummit.com/ )

• Kaggle Competitions

(http://www.kaggle.com/)

• Data Science course at Berkeley

(http://datascienc.es/ ) & on Coursera

(https://www.coursera.org/course/datasci/ )

(29)

Summary

• Why now: Dawn of Big Data <<- Need for Advanced Analytics and Cloud Computing

• What is it: Data ->> Data Product, many examples incl. Google, Netflix, Splunk, LinkedIn

• How to become: Data management, parallel

computing and data processing, statistical machine learning, and visualization skills

– Life of Data

• Who are hiring: Data Scientists are in great

demands, from industry to government to science.

48

(30)

Announcements

• Course Website:

http://www.cise.ufl.edu/class/cis6930fa14pro

• New registration slots will be available @ 4pm TODAY

– For first year MS students with no data mining

background: please first take Introduction to Data Science

(31)

Next Few Lectures

• Graph Algorithms and Graph Databases

– Offline queries: page-rank over Web link structure

– Online queries: SPARQL (sub-graph matching) over RDF stores

• Shortest paths (social network transactions)

• Text and Image Extraction and Retrieval

– Features Extraction

– Bag of Words model + Solr indexing

– More advanced: Entity and event extractions/classifications

• Knowledge Base Construction, Querying, Integration and Mining

– General, Health, Ecology, Medical, … – Rule/Constraint Learning

– More advanced: Uncertainty Management and reasoning

50

References

Related documents

Late Miocene lacustrine microporous micrites of the Madrid Basin (Spain) have a similar matrix microfabric as Cenomanian to Early Turonian shallow-marine carbonates of the

Hertel and Martin (2008), provide a simplified interpretation of the technical modalities. The model here follows those authors in modeling SSM. To briefly outline, if a

In conclusion, for the studied Taiwanese population of diabetic patients undergoing hemodialysis, increased mortality rates are associated with higher average FPG levels at 1 and

The main wall of the living room has been designated as a &#34;Model Wall&#34; of Delta Gamma girls -- ELLE smiles at us from a Hawaiian Tropic ad and a Miss June USC

In this section we’ll discuss two of the most typical kinds of sore throat—the common, mild sore throat often associated with other illnesses or environmental factors, and strep

Thus, using the same sample reported by Viding and colleagues, our primary goal was to examine whether teacher-rated CU traits in 7-year-old twins demonstrated different genetic

Do the parts or materials within your supply chain contain any of the following. REACH 17 SVHC Substances (released in Jul.