The Library (Big) Data scien4st
IFLA/ALA webinar: “Big Data: new roles and opportuni4es for new librarians” June 15th 2016 IFLA Big Data Special Interest Group (SIG) Wouter Klapwijk, Stellenbosch University, SIG convenor [email protected]IFLA Big Data SIG
• Proposed at WLIC 2014, Lyon • Pe44oned for at WLIC 2015, Cape Town • Endorsed by the IFLA Professional CommiXee in December 2015 • SIG sponsor: IT Sec4on • Objec4ves: 1. Provide a focus point for developing ideas regarding Big Data as it affects libraries 2. Provide a pla[orm within IFLA to assess and develop the avenues of response from IFLA to this developing areaDeconstruc<ng “Big Data” and “data science”
This talk is based on a number of everyday ques4ons: 1. What does “data science” mean? ² is it only happening in Tech places like Facebook and Google? 2. What are Data Scien4st? ² can Librarians also be Data Scien4sts? 3. Is data science the science of Big Data? ² what is the rela4onship between Big Data and data science? 4. Exactly what is “Big Data” anyway? ² just how big is Big? or is Big rela4ve? is library data Big?Data science
§ A set of fundamental principles that guide the extrac4on of knowledge from data § The “civil engineering of data”: turning data into data products Goal of data science:Ø to improve decision-making, for the beXerment of organiza4ons and society at large
Rela<on to other “engineering” concepts
“data mining” ü helps accomplish data science goals via technologies that incorporate data science principles ü but … its techniques are much more extensive than the set of principles comprising data science “data warehousing” ü a facilita4ng technology for “data mining” ü but … not always included as part of “data mining”Rela<on to other compu<ng concepts
“data processing” ü is more general than data science ü there is processing involved in all aspects of compu4ng “Big Data technologies” ü are ocen used for data processing in support of data mining techniques ü … and other data science ac4vi4esScience or CraG?
v The term data science has existed for over 30 years – it is a field onto itself v Founda4on rests in century old prac4ces of Sta4s4cs, Mathema4cs, and since mid-20th century, also Computer Sciences v It is not just a rebranding of Sta4s4cs and Machine Learning in the context of the Tech industry v Much of the field development is happening in Industry, and not in AcademiaExamples of data science products
Domain Example Internet Recommenda4on systems (Amazon = books; Facebook = friends) Finance Credit ra4ngs Educa<on Personalized learning and assessment Government Policies based on dataPrac<cing data science
What do Data Scien4st do?
The Data Scien<st
Two aspects to consider: 1. understand what they DO in business 2. understand which SKILLS they must possessWhat do they DO?
1. They ask ques4ons o Probe, being curious 2. They (try to) solve problems o Analy4cal thinking, making new discoveries 3. They cul4vate (new) soc skills o Communica4ng and visualizing dataWhat do they DO?
1. They ask ques4ons o Probe, being curious 2. They (try to) solve problems o Analy4cal thinking, making new discoveries 3. They cul4vate (new) soc skills o Communica4ng and visualizing data How much of the above do you as librarian do?The library professional’s genes?
Is it in our pedigree to con4nuously ask ques4ons? Do we have the taste and mindset for analy4cal thinking? Do we only do ad hoc analysis, or do we prefer an ongoing conversa4on with data? Is there enough of a Business Analyst or Social Scien4st in us?Which SKILLS do they need?
Data Scien4st
Domain-specific skills Soc skills Hard skilsWhich SKILLS do they need?
Data Scien4st
Domain-specific skills Soc skills Hard skils Communica4on Linear algebra Sta4s4cs Ar4ficial Intelligence Machine Learning Understand the business, e.g. librarianshipData Scien<sts are team members
Sta4s4cian Mathema4cian Data programmer Social scien4st Systems administrator ? Different skills are embedded across mul<-disciplinary team membersThe Data Scien<st team profile
It is important to understand data science even if you never intend to do it yourself 0 2 4 6 8 10 12 14Prac<cing data science
What does the crac of data science look like?
3 disciplinary areas
SOURCES • DATA ANALYTICS • COLLECT • CLEAN • INTEGRATE • PROCESS VISUALIZATION • COMMUNICATE3 disciplinary areas
SOURCES • DATA ANALYTICS • COLLECT • CLEAN • INTEGRATE • PROCESS VISUALIZATION • COMMUNICATE Each disciplinary area requires different skills Systems Administrator Data Programmer App designerSOURCES
Database (1960 - ) First integrated datastore (Bachman), 1963 Rela4onal data model (Codd), 1970 SQL (Boyce & Chamberlain), 1970+ Data Warehouse (1975 - ) First commercial RDBMS (Oracle), 1979 DB2 (IBM), 1983 First KDD workshop, 1989 First KDD data mining conference (Fayyaad, Shapiro), 1995 Big Data (2005 - ) NoSQL (Evans), 2009SOURCES
SMALL DATA Databases Data Warehouses Quan4ta4ve and qualita4ve Mostly structured and indexical Metadata “Long tail data” BIG DATA Mostly unstructured (80%) Various sources Needs to be related and combined Social A/V Logs Incomplete Data Taxonomy: some data are neither just big nor just smallSOURCES
Small data
• The term denotes the opposite of Big Data • Data usually housed in databases and data warehouses • Usually structured, qualita4ve and indexical in nature • Examples: Library data, Research Data (RDM) • Research data = primary dataSOURCES
Big data
• Datasets that are too large for tradi4onal data processing and storage systems (3V’s, 4V’s, 5V’s) • Classified into 3 classes of “datafica2on”: 1. Directed data (e.g. surveillance data) 2. Automated data (e.g. device generated data) 3. Volunteered data (e.g. social networks data)ANALYTICS
Database (1960 - ) First integrated datastore (Bachman), 1963 Rela4onal data model (Codd), 1970 SQL (Boyce & Chamerlain), 1970+ Data Warehouse (1975 - ) First commercial RDBMS (Oracle), 1979 DB2 (IBM), 1983 First KDD workshop, 1989 First KDD data mining conference (Fayyaad, Shapiro), 1995 Big Data (2005 - ) NoSQL (Evans), 2009ANALYTICS
There are 4 broad classes of analy4cs (ocen used in combina4on): 1. Data mining and paXern recogni4on v AI – Machine Learning – Data Mining 2. Data visualiza4on and visual analy4cs v App development 3. Sta4s4cal analysis v Sta4s4cal techniques and principles (regression, etc.) 4. Predic4on, simula4on, and op4miza4on v AlgorithmsPrac<cing data science
In Libraries
1. Data at Scale
Value and Insight can be extracted from small data by scaling them up into
2. Analyzing Exhaust data
Exhaust data = produced as a by-product of the main func4on of a device or system Most exhaust data is transient in nature – it is never examined and simply discarded! Example: log of a self-checkout unitExample of analyzing Exhaust data
Structured and Unstructured data VISITS PATRONS LOANS LOCATIONS DIGITIZED BOOKS Books DVDs Journals Ac<onable Insights BeXer forecasts for future library planning BeXer usage of systems and resources Produc4vity gain with beXer decision-makingExamples of Analysis and Visualiza<on
Library analy4cs toolkit – Harvard University: hXps://osc.hul.harvard.edu/liblab/projects/library-analy4cs-toolkit Text analy4cs – Google Books Ngram Viewer: hXps://books.google.com/ngrams Open Source implementa4on – Bookworm: hXp://bookworm.culturomics.orgIn summary
The fundamental principle of data science is that data,
and the capability to extract useful knowledge from it,
should be regarded as a key strategic asset. Libraries must learn to start thinking data-analy<cally. Do we only use gut and intui4on, or also data and rigor, in our decision-making?