• No results found

Selected Practical Software Development Lessons From A Large Digital Library System. Tibor Šimko August 2011 / openlab talk

N/A
N/A
Protected

Academic year: 2021

Share "Selected Practical Software Development Lessons From A Large Digital Library System. Tibor Šimko August 2011 / openlab talk"

Copied!
125
0
0

Loading.... (view fulltext now)

Full text

(1)

Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Selected Practical Software Development Lessons

From A Large Digital Library System

Tibor Šimko

<[email protected]>

Department of Information Technology

CERN

(2)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

1

Introduction

Digital Library

Invenio

2

Case Studies

Episode 1: Python

Episode 2: Git

Episode 3: Test Suite

Episode 4: Building Efficient Indexes

Episode 5: Load-balancing

(3)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

1

Introduction

Digital Library

Invenio

2

Case Studies

Episode 1: Python

Episode 2: Git

Episode 3: Test Suite

Episode 4: Building Efficient Indexes

Episode 5: Load-balancing

(4)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

(as opposed to print, microform, or other media) and

accessible by computers”

(1) institutional document repositories

(2) world-wide subject-based information systems

Example #1: CERN Document Server

managing CERN and selected non-CERN high-energy

physics and related documents since

1993

more than 1,000,000 records

articles, books, theses, photos, videos, and more

powered by Invenio, free digital library software

(5)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

(as opposed to print, microform, or other media) and

accessible by computers”

(1) institutional document repositories

(2) world-wide subject-based information systems

Example #1: CERN Document Server

managing CERN and selected non-CERN high-energy

physics and related documents since

1993

more than 1,000,000 records

articles, books, theses, photos, videos, and more

powered by Invenio, free digital library software

(6)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing
(7)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing
(8)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing
(9)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing
(10)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing
(11)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing
(12)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing
(13)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing
(14)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing
(15)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing
(16)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Example #2: INSPIRE

world-wide high-energy physics information system

run by CERN, DESY, FNAL, SLAC

metadata curation since 1960s, Invenio technology

since 2007

citation analysis, author/affiliation analysis

close partnership with arXiv and ADS

(17)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing
(18)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing
(19)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing
(20)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing
(21)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

1

Introduction

Digital Library

Invenio

2

Case Studies

Episode 1: Python

Episode 2: Git

Episode 3: Test Suite

Episode 4: Building Efficient Indexes

Episode 5: Load-balancing

(22)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

navigable collection tree

(regular, virtual)

powerful search engine

Google-like speed for up to 5M records

combined metadata, reference and fulltext search

flexible metadata

(MARC, OA)

handling any kind of document (multimedia)

customizable input, formatting and linking

personalization

and

collaborative

features:

alerts, baskets, groups, reviews, comments

internationalisation (28 languages)

open source

, GNU General Public License

co-developed by CERN (2002–), EPFL (2004–),

DESY/FNAL/SLAC (2008–), CfA (2009–)

installed at

∼30 institutions world-wide

(23)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Author

(24)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Author

Ingestion

(25)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Author

Ingestion

Database

(26)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Author

Ingestion

Database

Sources

(27)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Author

Ingestion

Database

Sources

Processing

(28)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Author

Ingestion

Database

Sources

Processing

Dissemination

User

(29)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Author

Ingestion

Database

Sources

Processing

Dissemination

User

Curation

Librarian

(30)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Author

Ingestion

Database

Sources

Processing

Dissemination

User

Curation

Librarian

Overview

(31)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Author

(32)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Author

WebSubmit

WebSession, WebAccess

(33)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Author

WebSubmit

WebSession, WebAccess

Metadata

Full-text

full-text document

metadata

(34)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Author

WebSubmit

WebSession, WebAccess

Metadata

Full-text

full-text document

BibUpload

BibSched

BibConvert

metadata

MARCXML

(35)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Author

WebSubmit

WebSession, WebAccess

Metadata

Full-text

full-text document

BibUpload

BibSched

BibConvert

metadata

MARCXML

BibHarvest

(36)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Author

WebSubmit

WebSession, WebAccess

Metadata

Full-text

full-text document

BibUpload

BibSched

BibConvert

metadata

MARCXML

BibHarvest

OAI Data Source

ElmSubmit

(37)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Author

WebSubmit

WebSession, WebAccess

Metadata

Full-text

full-text document

BibUpload

BibSched

BibConvert

metadata

MARCXML

BibHarvest

OAI Data Source

ElmSubmit

Non-OAI Data Source

(38)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

Full-text

(39)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

RefExtract

Full-text

(40)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

RefExtract

Full-text

BibClassify

(41)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

RefExtract

Full-text

BibClassify

(42)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

RefExtract

Full-text

BibClassify

Clusters

BibIndex

(43)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

RefExtract

Full-text

BibClassify

Clusters

BibIndex

BibRank

(44)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

RefExtract

Full-text

BibClassify

Clusters

BibIndex

BibRank

BibFormat

(45)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

Full-text

Clusters

(46)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

Full-text

Clusters

WebSearch

User

(47)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

Full-text

Clusters

WebSearch

User

WebBasket

(48)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

Full-text

Clusters

WebSearch

User

WebBasket

WebTag

(49)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

Full-text

Clusters

WebSearch

User

WebBasket

WebTag

WebAlert

(50)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

Full-text

Clusters

WebSearch

User

WebBasket

WebTag

WebAlert

BibHarvest

(51)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

Full-text

Clusters

WebSearch

User

WebBasket

WebTag

WebAlert

BibHarvest

OAI Harvester

(52)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

Full-text

Clusters

WebSearch

User

WebBasket

WebTag

WebAlert

BibHarvest

OAI Harvester

WebComment

(53)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

Full-text

Clusters

WebSearch

User

WebBasket

WebTag

WebAlert

BibHarvest

OAI Harvester

WebComment

WebMessage

WebJournal

(54)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

Full-text

Clusters

WebSearch

User

WebBasket

WebTag

WebAlert

BibHarvest

OAI Harvester

WebComment

WebMessage

(55)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

Full-text

Clusters

WebSearch

User

WebBasket

WebTag

WebAlert

BibHarvest

OAI Harvester

WebComment

WebMessage

WebJournal

BibCirculation

(56)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

Full-text

Clusters

WebSearch

User

WebBasket

WebTag

WebAlert

BibHarvest

OAI Harvester

WebComment

WebMessage

WebJournal

BibCirculation

(57)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

Full-text

Clusters

WebSearch

User

WebBasket

WebTag

WebAlert

BibHarvest

OAI Harvester

WebComment

WebMessage

WebJournal

BibCirculation

(58)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

(59)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

(60)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

(61)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

BibEdit

Librarian

Full-text

(62)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

BibEdit

Librarian

Full-text

BatchUploader

(63)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

BibEdit

Librarian

Full-text

BatchUploader

BibCheck

(64)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

BibEdit

Librarian

Full-text

BatchUploader

BibCheck

BibCirculation

(65)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

BibEdit

Librarian

Full-text

BatchUploader

BibCheck

BibCirculation

(66)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

BibEdit

Librarian

Full-text

BatchUploader

BibCheck

BibCirculation

BibDocFile

(67)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

BibEdit

Librarian

Full-text

BatchUploader

BibCheck

BibCirculation

BibDocFile

RefExtract

Tasks

BibCatalog

(68)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

BibEdit

Librarian

Full-text

BatchUploader

BibCheck

BibCirculation

BibDocFile

RefExtract

Tasks

BibCatalog

Knowledge Bases

BibKnowledge

(69)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

BibEdit

Librarian

Full-text

BatchUploader

BibCheck

BibCirculation

BibDocFile

RefExtract

Tasks

BibCatalog

Knowledge Bases

BibKnowledge

(70)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

BibEdit

Librarian

Full-text

BatchUploader

BibCheck

BibCirculation

BibDocFile

RefExtract

Tasks

BibCatalog

Knowledge Bases

BibKnowledge

(71)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

BibEdit

Librarian

Full-text

BatchUploader

BibCheck

BibCirculation

BibDocFile

RefExtract

Tasks

BibCatalog

Knowledge Bases

BibKnowledge

BibMerge

(72)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Metadata

BibEdit

Librarian

Full-text

BatchUploader

BibCheck

BibCirculation

BibDocFile

RefExtract

Tasks

BibCatalog

Knowledge Bases

BibKnowledge

BibMerge

Curation

(73)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

33 modules

codebase

∼290,000 lines of Python code

∼12,000 lines of JavaScript code

∼6,000 lines of XSL code

∼5,000 lines of autotools code

75 authors since inception

∼25 authors and contributors in 2010

many short-term students

importance of

informal

coding standards

10 years of development

started at CERN, first release in 2002

now co-developed world-wide (EU, US)

lego programming... but no silver bullet

(74)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

1

Introduction

Digital Library

Invenio

2

Case Studies

Episode 1: Python

Episode 2: Git

Episode 3: Test Suite

Episode 4: Building Efficient Indexes

Episode 5: Load-balancing

(75)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

easy to read and understand

(good for many temporary developers)

suitable for rapid prototyping

(good for organic-growth software development model)

write code to throw it away

(76)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Japanese art of flower

arrangement

“way of flowers”

natural shapes, graceful

lines

minimalism

“disciplined art form in

which nature and

humanity are brought

together”

(77)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Java?

new Callable() {

public Object call(Object x) {

return x.times(k)

}

}

Python!

(78)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Java?

new Callable() {

public Object call(Object x) {

return x.times(k)

}

}

Python!

(79)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

bytecode interpreted language: what about speed?

Cython

permits to write C extensions easily

combining efficiency of C with high-levelness of Python

Example: intbitset.pyx

ctypedef unsigned long long int word_t

ctypedef struct IntBitSet:

int size

int allocated

word_t trailing_bits

int tot

(80)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

1

Introduction

Digital Library

Invenio

2

Case Studies

Episode 1: Python

Episode 2: Git

Episode 3: Test Suite

Episode 4: Building Efficient Indexes

Episode 5: Load-balancing

(81)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

good for distributed teams

offline development possible

“pull on demand” collaboration model

(as opposed to “shared push” collaboration model)

inherent,natural code review process

commit early, commit often (to private repositories)

rebase and clean (before pushing for public

consumption)

interplay with SVN

(82)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

master

(83)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

C2

master

(84)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

C2

C3

master

v1.0.0

(85)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

C2

C3

master

v1.0.0

C4

(86)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

C2

C3

master

v1.0.0

C4

M1

maint

enance

(87)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

C2

C3

master

v1.0.0

C4

M1

M2

maint

enance

(88)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

C2

C3

master

v1.0.0

C4

M1

M2

maint

enance

C5

(89)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

C2

C3

master

v1.0.0

C4

M1

M2

maint

enance

C5

M3

v1.0.1

(90)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

C2

C3

master

v1.0.0

C4

M1

M2

maint

enance

C5

M3

v1.0.1

M4

(91)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

C2

C3

master

v1.0.0

C4

M1

M2

maint

enance

C5

M3

v1.0.1

M4

N1

next

(92)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

C2

C3

master

v1.0.0

C4

M1

M2

maint

enance

C5

M3

v1.0.1

M4

N1

next

C6

(93)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

C2

C3

master

v1.0.0

C4

M1

M2

maint

enance

C5

M3

v1.0.1

M4

N1

next

C6

N2

(94)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

C2

C3

master

v1.0.0

C4

M1

M2

maint

enance

C5

M3

v1.0.1

M4

N1

next

C6

N2

C7

v1.1.0

(95)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

C2

C3

master

v1.0.0

C4

M1

M2

maint

enance

C5

M3

v1.0.1

M4

N1

next

C6

N2

C7

v1.1.0

M5

v1.0.2

(96)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

C2

C3

master

v1.0.0

C4

M1

M2

maint

enance

C5

M3

v1.0.1

M4

N1

next

C6

N2

C7

v1.1.0

M5

v1.0.2

maint

— release maintenance branch

master

— new feature branch

(97)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

M1

N1

master

maint

next

(98)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

M1

N1

master

maint

next

B1

some-bugfix

(99)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

M1

N1

master

maint

next

B1

some-bugfix

C2

M2

N2

(100)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

M1

N1

master

maint

next

B1

some-bugfix

C2

M2

N2

B2

(101)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

M1

N1

master

maint

next

B1

some-bugfix

C2

M2

N2

B2

M3

merge

(102)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

M1

N1

master

maint

next

B1

some-bugfix

C2

M2

N2

B2

M3

merge

C3

merge

(103)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

M1

N1

master

maint

next

B1

some-bugfix

C2

M2

N2

B2

M3

merge

C3

merge

F1

some-new-feature

(104)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

M1

N1

master

maint

next

B1

some-bugfix

C2

M2

N2

B2

M3

merge

C3

merge

F1

some-new-feature

F2

(105)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

M1

N1

master

maint

next

B1

some-bugfix

C2

M2

N2

B2

M3

merge

C3

merge

F1

some-new-feature

F2

C4

merge

(106)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

M1

N1

master

maint

next

B1

some-bugfix

C2

M2

N2

B2

M3

merge

C3

merge

F1

some-new-feature

F2

C4

merge

N3

merge

(107)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

M1

N1

master

maint

next

B1

some-bugfix

C2

M2

N2

B2

M3

merge

C3

merge

F1

some-new-feature

F2

C4

merge

N3

merge

E1

some-experimental-feature

(108)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

M1

N1

master

maint

next

B1

some-bugfix

C2

M2

N2

B2

M3

merge

C3

merge

F1

some-new-feature

F2

C4

merge

N3

merge

E1

E2

some-experimental-feature

(109)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

C1

M1

N1

master

maint

next

B1

some-bugfix

C2

M2

N2

B2

M3

merge

C3

merge

F1

some-new-feature

F2

C4

merge

N3

merge

E1

E2

some-experimental-feature

N4

merge

(110)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing
(111)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

1

Introduction

Digital Library

Invenio

2

Case Studies

Episode 1: Python

Episode 2: Git

Episode 3: Test Suite

Episode 4: Building Efficient Indexes

Episode 5: Load-balancing

(112)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

test-driven development

when appropriate

e.g. before/while developing

strip_accents()

, write:

Example: search_engine_tests.py

class TestStripAccents(unittest.TestCase):

"""Test for handling of UTF-8 accents."""

def test_strip_accents(self):

"""search engine - stripping of accented letters"""

self.assertEqual("memememe",

search_engine.strip_accents('mémêmëmè'))

self.assertEqual("MEMEMEME",

(113)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

testbed site (Atlantis of Institute Fictive Science)

e.g. Python

mechanize

module to emulate browser

Example: websearch_regression_tests.py

class WebSearchSearchEnginePythonAPITest(unittest.TestCase):

"Check typical search engine Python API calls on the demo data."

def test_search_engine_python_api_for_failed_query(self):

"websearch - search engine Python API for failed query"

self.assertEqual([],

perform_request_search(p='aoeuidhtns'))

def test_search_engine_python_api_for_successful_query(self):

"websearch - search engine Python API for successful query"

self.assertEqual([8, 9, 10, 11, 12, 13, 14, 15, 16, 17,

18, 47],

(114)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

e.g. pages with heavy JavaScript

using

Selenium IDE

extension for Firefox

record and replay browser actions

test for text existence or non-existence on pages

test for link labels and targets

Example: test_search_ellis.html

<tr><td>open</td>

<td>http://localhost</td>

<td></td>

</tr>

<tr><td>type</td>

<td>p</td>

<td>ellis</td>

</tr>

<tr><td>clickAndWait</td>

<td>action_search</td>

<td></td>

</tr>

<tr><td>verifyTextPresent</td>

<td>1. Thermal conductivity of dense quark matter and cooling of stars</td>

<td></td>

</tr>

(115)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

1

Introduction

Digital Library

Invenio

2

Case Studies

Episode 1: Python

Episode 2: Git

Episode 3: Test Suite

Episode 4: Building Efficient Indexes

Episode 5: Load-balancing

3

Conclusions

(116)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

performance-driven design

assumptions:

high number of selects, low number of updates

fast searching, slow indexation

cache everything cacheable

search functionality:

search for words, phrases, regular expressions

search in any field, authors, titles, etc

index design:

forward indexes:

word1

−→

[

rec1

,

rec2

, . . . ]

word2

−→

[

rec2

,

rec7

, . . . ]

reverse indexes:

rec1

−→

[

word1

,

word8

, . . . ]

rec2

−→

[

word1

,

word2

, . . . ]

Zipf’s law

on word frequency:

few words occur very often (e.g.

the

)

(117)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing
(118)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

three important

speed factors

to consider:

speed of finding sets (DB Server)

speed of demarshaling sets (DB

Web App Server)

speed of intersecting sets (Web App Server)

Example: speed of various parts (2002, before optimization)

action / query:

"CERN 2002" "of the this"

---fetching

0.28 sec

0.34 sec

demarshaling

0.78 sec

1.10 sec

adding colls

0.37 sec

0.63 sec

intersecting

0.64 sec

1.19 sec

---total search time

2.07 sec

3.22 sec

(119)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

‘sorted’ (lists, Patricia trees)

‘unsorted’ (hashed sets, binary vectors)

fast prototyping

: (Python, Lisp in 2002)

throw-away coding to test ideas

Example: lists vs dicts, 350K sets in 800K universe

marshaling lists ... 532616+532571 bytes in 1.33 sec

demarshaling lists ... 350000+350000 items in 0.10 sec

merging lists ... 546965 items in 0.34 sec

intersecting lists ... 153035 items in 0.35 sec

marshaling dicts ... 576491+576450 bytes in 0.87 sec

demarshaling dicts ... 350000+350000 items in 0.36 sec

merging dicts ... 546965 items in 0.09 sec

(120)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

binary vectors

found the best compromise!

using

Numeric

Python module (in 2002)

typical search time gain: 4.0 sec

0.2 sec (in 2002)

typical indexing time loss: 7 hours

4 days (in 2002)

mostly spare data modelled via mostly dense data

structure?

free your mind, think critically

further optimization:

Numeric

module not addressing real bits, only bytes

so home-made

intbitset

C extension in 2007

addressing real bits (factor of 8 already)

saving space, saving (indexing) time

(121)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

1

Introduction

Digital Library

Invenio

2

Case Studies

Episode 1: Python

Episode 2: Git

Episode 3: Test Suite

Episode 4: Building Efficient Indexes

Episode 5: Load-balancing

3

Conclusions

(122)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

load of CDS Web and DB servers at the split time:

split leads to efficient use of OS resources by lone,

non-competing Web and DB daemon processes

(123)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

useful for “LHC First Beam Day” rush situations with

many concurrent visitors

Apache

mod_proxy_balancer

User

Load Balancer

App Worker 2

App Worker 1

App Worker 3

(124)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

measure throughput on a sample of typical URLs

Example: inspirebeta.net under gentle siege

$ siege -d 1 -c 20 -t 1m -f inspirebeta_urls.txt

Transactions:

1329 hits

Availability:

100.00 %

Elapsed time:

60.23 secs

Data transferred:

37.12 MB

Response time:

0.41 secs

Transaction rate:

22.07 trans/sec

Throughput:

0.62 MB/sec

Concurrency:

8.96

Successful transactions:

1329

Failed transactions:

0

Longest transaction:

3.05

Shortest transaction:

0.01

(125)

Tibor Šimko

Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

selected lessons from building a digital library system

∼300,000 LOCs from

∼75 authors over

∼10 years

value of rapid prototyping

value of organic-growth software development model

value of coding aesthetics and minimalism

morale from selected anecdotes?

http://cdsweb.cern.ch/ http://inspirebeta.net/

References

Related documents