The Library (Big) Data scientist

(1)

The Library (Big) Data scientist

IFLA/ALA webinar: “Big Data: new roles and opportunities for new librarians”
June 15th 2016
IFLA Big Data Special Interest Group (SIG)
Wouter Klapwijk, Stellenbosch University, SIG convenor
[email protected]

(2)

IFLA Big Data SIG

•  Proposed at WLIC 2014, Lyon
•  Petitioned for at WLIC 2015, Cape Town
•  Endorsed by the IFLA Professional Committee in December 2015
•  SIG sponsor: IT Section
•  Objectives:
    1.  Provide a focus point for developing ideas regarding Big Data as it affects libraries
    2.  Provide a platform within IFLA to assess and develop the avenues of response from IFLA to this developing area

(3)

Deconstructing “Big Data” and “data science”

This talk is based on a number of everyday questions:
1.  What does “data science” mean?
    o  Is it only happening in Tech places like Facebook and Google?
2.  What are Data Scientists?
    o  Can Librarians also be Data Scientists?
3.  Is data science the science of Big Data?
    o  What is the relationship between Big Data and data science?
4.  Exactly what is “Big Data” anyway?
    o  Just how big is Big? Or is Big relative? Is library data Big?

(4)

Data science

•  A set of fundamental principles that guide the extraction of knowledge from data
•  The “civil engineering of data”: turning data into data products
Goal of data science:
•  to improve decision-making, for the betterment of organizations and society at large

(5)

Relation to other “engineering” concepts

“data mining”
•  helps accomplish data science goals via technologies that incorporate data science principles
•  but … its techniques are much more extensive than the set of principles comprising data science
“data warehousing”
•  a facilitating technology for “data mining”
•  but … not always included as part of “data mining”

(6)

Relation to other computing concepts

“data processing”
•  is more general than data science
•  there is processing involved in all aspects of computing
“Big Data technologies”
•  are often used for data processing in support of data mining techniques
•  … and other data science activities

(7)

Science or Craft?

•  The term data science has existed for over 30 years – it is a field unto itself
•  Its foundation rests on century-old practices of Statistics and Mathematics and, since the mid-20th century, also Computer Science
•  It is not just a rebranding of Statistics and Machine Learning in the context of the Tech industry
•  Much of the field's development is happening in Industry, and not in Academia

(8)

Examples of data science products

Domain: Example
Internet: Recommendation systems (Amazon = books; Facebook = friends)
Finance: Credit ratings
Education: Personalized learning and assessment
Government: Policies based on data

(9)

Practicing data science

What do Data Scientists do?

(10)

The Data Scientist

Two aspects to consider:
1.  understand what they DO in business
2.  understand which SKILLS they must possess

(11)

What do they DO?

1.  They ask questions
    o  Probe, being curious
2.  They (try to) solve problems
    o  Analytical thinking, making new discoveries
3.  They cultivate (new) soft skills
    o  Communicating and visualizing data

(12)

What do they DO?

1.  They ask questions
    o  Probe, being curious
2.  They (try to) solve problems
    o  Analytical thinking, making new discoveries
3.  They cultivate (new) soft skills
    o  Communicating and visualizing data
How much of the above do you, as a librarian, do?

(13)

The library professional’s genes?

Is it in our pedigree to continuously ask questions?
Do we have the taste and mindset for analytical thinking?
Do we only do ad hoc analysis, or do we prefer an ongoing conversation with data?
Is there enough of a Business Analyst or Social Scientist in us?

(14)

Which SKILLS do they need?

Data Scientist

•  Domain-specific skills
•  Soft skills
•  Hard skills

(15)

Which SKILLS do they need?

Data Scientist

•  Domain-specific skills: Understand the business, e.g. librarianship
•  Soft skills: Communication
•  Hard skills: Linear algebra, Statistics, Artificial Intelligence, Machine Learning

(16)

Data Scientists are team members

•  Statistician
•  Mathematician
•  Data programmer
•  Social scientist
•  Systems administrator
•  ?
Different skills are embedded across multi-disciplinary team members

(17)

The Data Scientist team profile

It is important to understand data science even if you never intend to do it yourself

(18)

Practicing data science

What does the craft of data science look like?

(19)

3 disciplinary areas

SOURCES
•  DATA
ANALYTICS
•  COLLECT
•  CLEAN
•  INTEGRATE
•  PROCESS
VISUALIZATION
•  COMMUNICATE

(20)

3 disciplinary areas

SOURCES
•  DATA
ANALYTICS
•  COLLECT
•  CLEAN
•  INTEGRATE
•  PROCESS
VISUALIZATION
•  COMMUNICATE
Each disciplinary area requires different skills: Systems Administrator, Data Programmer, App designer

(21)

SOURCES

Database (1960 - )
•  First integrated datastore (Bachman), 1963
•  Relational data model (Codd), 1970
•  SQL (Boyce & Chamberlin), 1970+
Data Warehouse (1975 - )
•  First commercial RDBMS (Oracle), 1979
•  DB2 (IBM), 1983
•  First KDD workshop, 1989
•  First KDD data mining conference (Fayyad, Piatetsky-Shapiro), 1995
Big Data (2005 - )
•  NoSQL (Evans), 2009

(22)

SOURCES

SMALL DATA
•  Databases
•  Data Warehouses
•  Quantitative and qualitative
•  Mostly structured and indexical
•  Metadata
•  “Long tail data”
BIG DATA
•  Mostly unstructured (80%)
•  Various sources
•  Needs to be related and combined
•  Social
•  A/V
•  Logs
•  Incomplete
Data Taxonomy: some data are neither just big nor just small

(23)

SOURCES

Small data

•  The term denotes the opposite of Big Data
•  Data usually housed in databases and data warehouses
•  Usually structured, qualitative and indexical in nature
•  Examples: Library data, Research Data (RDM)
•  Research data = primary data

(24)

SOURCES

Big data

•  Datasets that are too large for traditional data processing and storage systems (3V’s, 4V’s, 5V’s)
•  Classified into 3 classes of “datafication”:
    1.  Directed data (e.g. surveillance data)
    2.  Automated data (e.g. device generated data)
    3.  Volunteered data (e.g. social networks data)

(25)

ANALYTICS

Database (1960 - )
•  First integrated datastore (Bachman), 1963
•  Relational data model (Codd), 1970
•  SQL (Boyce & Chamberlin), 1970+
Data Warehouse (1975 - )
•  First commercial RDBMS (Oracle), 1979
•  DB2 (IBM), 1983
•  First KDD workshop, 1989
•  First KDD data mining conference (Fayyad, Piatetsky-Shapiro), 1995
Big Data (2005 - )
•  NoSQL (Evans), 2009

(26)

ANALYTICS

There are 4 broad classes of analytics (often used in combination):
1.  Data mining and pattern recognition
    o  AI – Machine Learning – Data Mining
2.  Data visualization and visual analytics
    o  App development
3.  Statistical analysis
    o  Statistical techniques and principles (regression, etc.)
4.  Prediction, simulation, and optimization
    o  Algorithms
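As a small illustration of the statistical analysis class listed above, the following sketch fits an ordinary least-squares regression line in plain Python. The monthly figures and the names `fit_line`, `visits` and `loans` are invented for illustration and do not come from the talk.

```python
# Minimal sketch: ordinary least-squares fit of y = intercept + slope * x.
# All data below are hypothetical and only serve to show the technique.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly gate counts (visits) and loans for one library branch.
visits = [1200, 1500, 1100, 1800, 1650, 1400]
loans = [310, 395, 290, 470, 430, 360]

slope, intercept = fit_line(visits, loans)
predicted = intercept + slope * 2000  # estimated loans for a month with 2000 visits
print(f"loans ~ {intercept:.1f} + {slope:.3f} * visits; at 2000 visits: {predicted:.0f}")
```

The same function works unchanged on a handful of rows or on a much longer series, which is the sense in which such techniques are independent of data size.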

(27)

Practicing data science

In Libraries

(28)

1. Data at Scale

Value and Insight can be extracted from small data by scaling them up into

(29)

2. Analyzing Exhaust data

Exhaust data = produced as a by-product of the main function of a device or system
Most exhaust data is transient in nature – it is never examined and simply discarded!
Example: log of a self-checkout unit
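To make the self-checkout example concrete, here is a minimal Python sketch that counts checkouts per hour of day from such a log. The log layout (ISO timestamp, unit id, event name, item barcode) and the names `SAMPLE_LOG` and `checkouts_per_hour` are assumptions made for illustration; a real unit will write its own format.

```python
# Minimal sketch: turning a self-checkout log (exhaust data) into hourly counts.
# The log format below is hypothetical, not that of any particular product.
from collections import Counter
from datetime import datetime

SAMPLE_LOG = """\
2016-06-15T09:02:11 unit-1 CHECKOUT 30001234
2016-06-15T09:47:52 unit-1 CHECKOUT 30005678
2016-06-15T10:05:03 unit-2 ERROR card-read-failure
2016-06-15T10:15:40 unit-2 CHECKOUT 30009012
2016-06-15T14:30:09 unit-1 CHECKOUT 30003456
"""


def checkouts_per_hour(log_text):
    """Count CHECKOUT events per hour of day, skipping malformed lines."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) < 3 or parts[2] != "CHECKOUT":
            continue  # other events (errors, etc.) are ignored in this sketch
        try:
            stamp = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue  # unparseable timestamp
        counts[stamp.hour] += 1
    return counts


hourly = checkouts_per_hour(SAMPLE_LOG)
for hour in sorted(hourly):
    # A crude text bar chart: one '#' per checkout in that hour.
    print(f"{hour:02d}:00  {'#' * hourly[hour]}  ({hourly[hour]})")
```

Counts like these could feed staffing or opening-hours decisions, which is the kind of insight that is lost when the log is simply discarded.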

(30)

Example of analyzing Exhaust data

Structured and Unstructured data: VISITS, PATRONS, LOANS, LOCATIONS, DIGITIZED BOOKS, Books, DVDs, Journals
Actionable Insights
•  Better forecasts for future library planning
•  Better usage of systems and resources
•  Productivity gain with better decision-making

(31)

Examples of Analysis and Visualization

•  Library analytics toolkit – Harvard University: https://osc.hul.harvard.edu/liblab/projects/library-analytics-toolkit
•  Text analytics – Google Books Ngram Viewer: https://books.google.com/ngrams
•  Open Source implementation – Bookworm: http://bookworm.culturomics.org

(32)

In summary

The fundamental principle of data science is that data, and the capability to extract useful knowledge from it, should be regarded as a key strategic asset.
Libraries must learn to start thinking data-analytically.
Do we only use gut and intuition, or also data and rigor, in our decision-making?

(33)

In summary

You can apply the same principles, tools and techniques to small data as you would to big data
“…the tools of data science are as appropriate for gigabyte as they are for petabyte scale datasets…” (https://datascience.berkeley.edu/about/what-is-data-science/)

(34)

In summary

Challenges for librarians:
•  There is a shortage of Big Data talent
•  The Big Data SIG is attempting to understand and frame Big Data problems
Opportunities for librarians:
•  Grow your data analytical skills
•  Attend online courses: Khan Academy, Coursera, Software Carpentry, digital books
•  There are free software tools: R (SQL Server 2016 includes R), Python, app visualization tools

(35)

Thank you

[email protected]
