HathiTrust + HTRC: An Overview

Academic year: 2022

(1)

HathiTrust + HTRC:

An Overview

ROBERT H. MCDONALD IU LIBRARIES

HATHITRUST RESEARCH CENTER

@HATHITRUST @MCDONALD

(2)

Topics in this Deck

What is the HathiTrust?

What is the HathiTrust Research Center?

What are HTRC Tools and Services?

What is HTRC Advanced Collaborative Support?

What’s new with the HTRC Version 4.0 Release?

(3)

HathiTrust Mission and Purpose

To contribute to research, scholarship, and the common good by collaboratively collecting, organizing, preserving, communicating, and sharing the record of human knowledge.

A trusted digital preservation service enabling the broadest possible access worldwide.

An organization with over 130 research libraries partnering to develop its programs.

A range of transformative programs enabled by working at a very large scale.

(4)

The Name

The meaning behind the name

◦ Hathi (hah-tee)--Hindi for elephant

◦ Big, strong

◦ Never forgets, wise

◦ Secure

◦ Trustworthy

Illustration of “Hathi” the elephant from 1895 edition of The Jungle Book found in HathiTrust.

(5)

HathiTrust Membership

(6)

HathiTrust Members…

Allegheny College
American University of Beirut
Arizona State University
Auburn University
Baylor University
Boston College
Boston University
Brandeis University
Brown University
Bryn Mawr College
Bucknell University
Carnegie Mellon University
Case Western Reserve
Claremont Colleges
Colby College
Columbia University
Cornell University
Dartmouth College
DePaul University
Dickinson College
Duke University
Emory University
Getty Research Institute
George Mason University
Georgetown University
Graduate College of the City University of New York
Harvard University Library
Haverford College
Indiana University
Iowa State University
Johns Hopkins University
Kansas State University
Lafayette College
Library of Congress
Macalester College
Massachusetts Institute of Technology
McGill University
Michigan State University
Montana State University
Mount Holyoke College
New Mexico State University
New York Public Library
New York University
North Carolina Central University
North Carolina State University
Northeastern University
Northwestern University
Oklahoma State University
Ohio State University
Pennsylvania State University
Princeton University
Purdue University
Rutgers University
Smith College
Southern Methodist University
Stanford University
State University System of Florida
Swarthmore College
Syracuse University
SUNY Buffalo
Temple University
Texas A&M University
Texas Christian University
Texas Tech University
Tufts University
Tulane University
Union College

(7)

…More HathiTrust Members

University of Alabama
University of Alberta
University of Arizona
University of British Columbia
University of Calgary
University of California
◦ Berkeley
◦ Davis
◦ Irvine
◦ Los Angeles
◦ Merced
◦ Riverside
◦ San Diego
◦ San Francisco
◦ Santa Barbara
◦ Santa Cruz
California Digital Library
The University of Chicago
University of Connecticut
University of Delaware
University of Houston
University of Illinois at Urbana
University of Illinois Chicago
The University of Iowa
University of Kansas
University of Maryland
University of Massachusetts Amherst
University of Miami
University of Michigan
University of Minnesota
University of Mississippi
University of Missouri
University of Nebraska-Lincoln
University of Nevada-Las Vegas
University of New Mexico
University of North Carolina at Chapel Hill
University of Notre Dame
University of Oklahoma
University of Oregon
University of Pennsylvania
University of Pittsburgh
University of Queensland
University of Rochester
Universidad Complutense de Madrid
University of Tennessee, Knoxville
University of Texas
University of Utah
University of Vermont
University of Virginia
University of Washington
University of Wisconsin-Madison
University of Wyoming
University System of Georgia
Utah State University
Vanderbilt University
Virginia Commonwealth University
Virginia Tech
Wake Forest University
Washington University
Washington State University
Wesleyan University
West Virginia University
Williams College
Wichita State University
Yale University Library

(8)

Membership

Membership available to academic/research libraries.

◦ All members have a specific user community that they support, e.g., university libraries.

Member fees support 100% of operations.

2018 fees begin at about $9,500 US.

◦ All members pay an equal share of cost for open content.

◦ Members pay a proportional share for in-copyright materials, based on the overlap between their physical collection and HathiTrust.

Membership is not synonymous with “subscription.”

Focus is on cooperative efforts and cooperative benefits.
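The cost-sharing idea on this slide can be sketched in a few lines of Python. This is only an illustration: the deck does not give the actual fee formula, so the function name `allocate_fees`, its parameters, and every number below are hypothetical. The sketch assumes open-content cost is split equally and in-copyright cost is split in proportion to each member's overlap count.

```python
def allocate_fees(open_cost, in_copyright_cost, overlaps):
    """Split costs among members (hypothetical model, not HathiTrust's real formula).

    open_cost: total cost attributed to open content, shared equally.
    in_copyright_cost: total cost attributed to in-copyright content,
        shared in proportion to each member's overlap.
    overlaps: dict mapping member name -> number of in-copyright volumes
        the member holds in print that are also in HathiTrust.
    """
    if not overlaps:
        return {}
    equal_share = open_cost / len(overlaps)
    total_overlap = sum(overlaps.values())
    fees = {}
    for member, overlap in overlaps.items():
        # Guard against division by zero when no member has any overlap.
        proportional = (
            in_copyright_cost * overlap / total_overlap if total_overlap else 0.0
        )
        fees[member] = equal_share + proportional
    return fees


# Made-up example: two members, B overlaps twice as much as A.
print(allocate_fees(300.0, 600.0, {"A": 100, "B": 200}))
# → {'A': 350.0, 'B': 550.0}
```

Under this model a member with no overlapping in-copyright holdings still pays the equal share for open content, which matches the "equal share" bullet above.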

(9)

Cooperative Work

We draw upon distributed expertise among members

Members: Michigan, Indiana, Illinois, California

Administration ✔

Preservation & Access Repository ✔ ✔

Research Center ✔ ✔

Metadata Management (Zephir)

(10)

Members Govern HathiTrust

Program Steering Committee

Board of Governors

Executive Director

Committees and Working Groups

Operations

(11)

HathiTrust’s Portfolio of Work

Collection Development
◦ Mass Digitization
◦ Member Digitization
◦ Born Digital

Preservation
◦ TRAC Certification
◦ Integrity Monitoring
◦ Format Consistency & Migration

Use
◦ Discovery
  • Catalog
  • Full Text
  • Discovery services
◦ Access
  • Print Disabled Services
  • Differs for members

Rights Management
◦ Investigation & Determination
◦ Licensing

Collection Management
◦ Holdings Analysis
◦ Shared Print Retentions

Computational Research
◦ HathiTrust Research Center
◦ Derived Data Releases
◦ Enhancements to the Corpus

(12)

Collections

(13)

HathiTrust Collections Today

16.07 million total digitized items (volumes)

◦ 7.87 million book titles

◦ 435,000 serial titles

6.02 million items open for reading

◦ Includes public domain & CC-licensed items

The collection includes (mostly) published materials in bound form, (mostly) digitized from library collections.

(14)

[Figure: Growth of HathiTrust Collections (millions of volumes), 2008 to 2017. Data labels as extracted: 2.47, 5.22, 7.83, 9.96, 10.59, 10.87, 13.00, 15.93, 14.81, 15.97]

(15)

Titles by Language

English 50%

German 9%

French 7%

Spanish 7%

Russian 3%

Chinese 3%

Japanese 3%

Italian 2%

Portuguese 2%

Arabic 1%

451 other languages 13%

(16)

Volumes by View/Rights Status

Limited View 62.75%

Public Domain 17.97%

US Fed Docs 6.02%

Public Domain US 13.05%

Open Access/Creative Commons <0.01%

Full View 37.25%

(17)

About copyright….

Yes, this is legal.

Our policies are primarily based on US law…

◦ Exceptions for “fair use”

◦ Exceptions for print disabled

◦ Exceptions for preservation

◦ Some other exceptions which we haven’t fully exploited

We respect copyright law in other jurisdictions.

◦ But we aren’t able to support local copyright laws as easily as we can US laws.

(18)

Access in a Nutshell

Anybody anywhere

◦ Full text search of entire collection (via web)

◦ Text and data mining

◦ In-copyright data mining is in pilot mode (via HTRC)

◦ Services require additional registration

◦ Read public domain and open access works (via web)

Members only

◦ Download public domain and open access works.

◦ Replacement access for lost and damaged print copies (in US).

◦ Access for users who are blind or with print disabilities (where law allows).

(19)

Service for print-disabled users

Provides eligible users with access to any item in the HathiTrust collection, regardless of copyright status.

Eligibility is determined by the member institution following their own established practices.

Access for the user is managed by a service provider on campus.

(20)

HT Addressing Library Issues of Scale

(21)

Collective Action: Copyright Review

Copyright Review Management System

◦ Systematic manual review of copyright registrations to determine the status of portions of the HathiTrust collection

◦ Generously supported by IMLS

WINNER OF THE 2016 L. RAY PATTERSON AWARD

Project: CRMS US
◦ Scope: Pub in US 1923-1963 (includes US State Documents)
◦ Reviewed: 375,576
◦ Out of copyright: 203,172 (54.1%)

Project: CRMS World
◦ Scope: Pub in UK (1875-1944), Canada and Australia (1894-1964)
◦ Reviewed: 312,149
◦ Out of copyright: 159,195 (51%)

(22)

Shared Print Monographs Program

Focus & Goals

◦ Ensure preservation of print and digital collections

◦ Catalyze national/continental collective management of collections

◦ Commit to retain print holdings that mirror book titles in the HathiTrust digital collection

◦ Maintain a lendable print collection distributed among HathiTrust members

◦ Build on existing arrangements

Original proposal, task force charge, & preliminary recommendations:

◦ https://www.hathitrust.org/print_monograph_archiving

(23)

US Federal Documents Program

Focus:

◦ expanded coverage & enhanced access to U.S. federal documents.

Near term activities:

◦ Developing a registry of US Federal Government Documents

◦ https://www.hathitrust.org/usdocs_registry

◦ Digitize! Focus first on known and cataloged materials

◦ Gap analysis driven, focused on print, post-1976 materials

◦ Improve discoverability/findability of collection

◦ https://is.gd/HathiFedDocs

(24)

HathiTrust Research Center (HTRC)

(25)

HathiTrust Research Center Mission and Purpose

Research arm of HathiTrust

Established: July 2011

Collaborative center: Indiana University & University of Illinois

Mission: Enable researchers world-wide to accomplish tera-scale text data-mining and analysis

◦ Develop cyberinfrastructure to enable HPC access to the HathiTrust Digital Library

◦ Develop cutting-edge software tools for processing, analyzing text

◦ Develop translational tools and data that can be used to enhance HathiTrust Digital Library services to users

Educause article - http://go.iu.edu/1SEn

(26)

HathiTrust Research Center

Executive Management Committee

J. Stephen Downie, Co-Director

John Walsh, Co-Director

Harriett Green

Eleanor Dickson

Robert H. McDonald

Marie Ma

(27)

HTRC Org Chart

HTRC Executive Mgmt

Administrative Support

Core Development

Advanced Research

Advanced Collaborative Support

Scholarly Commons

(28)

HathiTrust Research Center (HTRC)

Facilitates text analysis of HathiTrust content

Develops extracted feature work sets that can be used outside of HT+HTRC

Research & Development

Located at Indiana University-Bloomington and the University of Illinois at Urbana-Champaign

(29)

“Non-consumptive” Research: The HathiTrust Research Center

HathiTrust Research Center

◦ Persistent and sustainable structure to enable original and cutting-edge non-consumptive research

◦ Developed collaboratively by Indiana University and University of Illinois.

◦ Additional funding from HathiTrust and foundations

◦ Analytics Portal

◦ https://analytics.hathitrust.org/

◦ Advanced Collaborative Support Programs

◦ https://www.hathitrust.org/hathitrust-research-center-awards-three-acs-projects

◦ Dataset distribution:

◦ https://www.hathitrust.org/datasets
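One way to picture the extracted-feature approach mentioned in this deck is a page-level bag of words: per-page token counts are kept, while word order (and so the readable text) is discarded. The Python sketch below is an assumption-laden illustration only. The function name `extract_page_features` and the field names `seq`, `token_count`, and `counts` are hypothetical and do not reproduce the schema of the HTRC Extracted Features Dataset.

```python
import re
from collections import Counter


def extract_page_features(pages):
    """Turn a list of page texts into per-page token counts.

    Hypothetical sketch: only counts are kept, sorted alphabetically,
    so the original word order cannot be read back from the output.
    """
    features = []
    for seq, text in enumerate(pages, start=1):
        tokens = re.findall(r"\w+", text.lower())
        counts = Counter(tokens)
        features.append(
            {
                "seq": seq,                            # page sequence number
                "token_count": len(tokens),            # total tokens on the page
                "counts": dict(sorted(counts.items())),  # token -> frequency
            }
        )
    return features


# Made-up two-page volume; the second page is blank.
for page in extract_page_features(["The cat saw the dog", ""]):
    print(page)
```

Counts like these are enough for many analyses (word-frequency trends, topic models, classification) without distributing the page text itself.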

(30)

Tracking Technology Diffusion Through Time in the HathiTrust Corpus

Michelle Alexopoulos, University of Toronto

Dr. Alexopoulos, an economist, is using the vast historical record contained in the HathiTrust to study the diffusion of various technologies over time. By tracking word usage trends of 1,214 technology-related terms identified by Alexopoulos, such as the steam engine, her research based on HathiTrust book content has the potential to overturn accepted theories about the economic and societal impacts of a technology.

1,012,633 volumes analyzed.

Over 22 hours of processing using a 32-node cluster on Indiana University’s high-performance supercomputer, Big Red II. Each node had 32 cores and 64 GB of RAM.

[Figures: Linkages to “Steam Engines” implied by the Library of Congress Classification vs. selected subject terms linked to the “Steam engine” n-gram by 1910 (from HT text)]

Example Advanced Collaborative Support Projects
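Tracking word-usage trends for a list of terms, as in the technology diffusion project above, can be sketched roughly as follows. This is not the project's actual pipeline, which the deck does not describe. The function names `tokenize`, `count_phrase`, and `term_trends`, the `(year, text)` input shape, and the sample sentences are all assumptions made for illustration. The sketch computes, for each term and year, the number of occurrences divided by the total tokens published that year.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic runs as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_phrase(tokens, phrase):
    """Count occurrences of a (possibly multi-word) phrase in a token list."""
    n = len(phrase)
    if n == 0:
        return 0
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase)


def term_trends(volumes, terms):
    """volumes: iterable of (year, text) pairs; terms: list of strings.

    Returns {term: {year: occurrences per token in that year}}.
    """
    phrases = {term: tokenize(term) for term in terms}
    hits = {term: defaultdict(int) for term in terms}
    totals = defaultdict(int)
    for year, text in volumes:
        tokens = tokenize(text)
        totals[year] += len(tokens)
        for term, phrase in phrases.items():
            hits[term][year] += count_phrase(tokens, phrase)
    return {
        term: {year: hits[term][year] / totals[year]
               for year in sorted(totals) if totals[year]}
        for term in terms
    }


# Made-up sample volumes.
sample = [
    (1850, "The steam engine drives the mill"),
    (1900, "A steam engine and another steam engine"),
]
print(term_trends(sample, ["steam engine"]))
```

Normalizing by total tokens per year matters because the number of volumes differs greatly across decades; raw counts would mostly reflect collection size.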

(31)

Current HTRC ACS Proposals

1. Measuring Literary Novelty
McGrath, Hintze, and Higgins (Michigan State)

◦ This work draws on ongoing collaborative efforts to develop a method for applying genetic sequencing tools to the study of literature in order to identify and measure literary novelty, and address questions of literary history, canonicity, and prestige.

2. Computational Support for Reading Chicago Reading
Burke, Shanahan, Lucic (DePaul)

◦ The Reading Chicago Reading team will seek to extend their own research on the “One Book, One Chicago” city-wide reading program by incorporating textual analysis on books chosen for the OBOC program, as well as comparison texts.

3. The Power of Place: Structure, Culture, and Continuities in U.S. Women’s Movements
Nelson (Northeastern)

◦ The project will study the women's movement in the United States from 1848-1975 in two cities, New York City and Chicago, using new advances in network analysis and computational text analysis to identify structural and cultural diversity.

(32)

Current HTRC ACS Proposals

4. Modeling the History of Book Design
Bamman & Hartmann (UC Berkeley)

◦ This project will utilize the HTRC Data Capsule to conduct feature extraction on page images from 10,000 in-copyright books in the HathiTrust repository, extracting features such as page construction, line justification, leading between baselines, kerning between letter pairs/combinations, line density per page, characters per line, position of images, typeface (serif, sans-serif) and font size.

5. A Computational History of the U.S. Novel, 1950-2000
Richard Jean So (McGill)

◦ This project seeks to write a new history of the American novel by examining a series of large textual datasets focused on the full cycle of the U.S. literary field from production to reception to canonization.

6. A Writer’s Workshop Workset with the Program Era Project (PEP)
Glass/White/Kelly (Iowa Writers Project, U. Iowa)

◦ This project will compile a proof-of-concept workset with, at first, prominent individuals (faculty, staff, students) who were involved with the Iowa Writers’ Workshop (IWW), then produce “style cards” for each author’s works (by volume), based on stylometric data gathered through text analysis of the IWW workset within the HTRC Data Capsule.

(33)
(34)

HTRC Updates

Funded HTRC Projects

◦ NEW: Data Capsule Appliance (IMLS)

◦ Digging Deeper, Reaching Further (IMLS)

◦ WCSA + Data Capsules: Phase 1 (Mellon)

Policy

◦ Non-Consumptive Use Research Policy (https://www.hathitrust.org/htrc_ncup)

◦ Data Capsules Terms of Use (https://www.hathitrust.org/htrc_dc_tou)

◦ Internal data and software policies

Leadership

◦ New Executive Management members: Eleanor Dickson and Marie Ma

◦ John Walsh & Harriett Green joined as IU Co-Director and UI Library representative, respectively

(35)

HTRC Updates

Tools & Services

◦ HTRC v4.0

◦ Data Capsules upgrades

◦ Bookworm release

Data

◦ Extracted Features Dataset

◦ New HathiTrust gov doc worksets

  ◦ Environmental Protection Agency

  ◦ Bureau of Indian Affairs

◦ Incremental rsync with HathiTrust

Advanced Collaborative Support (ACS)

◦ Finished Round 2 (4 projects)

◦ Released Round 3 RFP and work underway (6 projects)

(36)
(37)
(38)

Current HTRC Architecture

(39)

HTRC UnCamp 2018

Talks and Posters are available at OSF Conferences

◦ https://osf.io/view/htrc_uncamp2018/

(40)

Current IU HTRC Related Project

(41)

THANK YOU and QUESTIONS?

Robert H. McDonald

Associate Dean for Research & Technology Strategies

Deputy Director, Data to Insight Center

[email protected]
