Conference Introduction

(1)

SKG2006 Introduction

http://www.culturegrid.net/SKG2006/

Guilin China

November 2 2006

Geoffrey Fox

Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University, Bloomington IN 47401

(2)

SKG2006

■ Last year saw the first conference of this series in Beijing, covering
• Knowledge sharing
• Semantic networking
• Grid computing
■ These areas underlie
• Electronic Science (eScience)
• Scholarship
• Communities (the real world)
■ This year we are pleased to present the second conference, which had an 18% acceptance rate for regular papers
■ We look forward to the meeting next year in Xi’an
■ Listen and ask lots of questions!
■ Let’s thank Hai Zhuge and CAS for their wonderful vision and implementation

(3)

Web 2.0, Knowledge and the Semantic Grid

SKG 2006

http://www.culturegrid.net/SKG2006/

Guilin China

November 2 2006

Geoffrey Fox

Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University, Bloomington IN 47401

(4)

Motivation

■ Build Cyberinfrastructure (Grids) that
• Supports science from beginning (planning, instruments) through middle (analysis) to end (refereed publications, follow-on work)
• Integrates with the popular Web 2.0 (community) tools, whose successes point to interesting ways of working together
• Integrates with Digital Library technology
• Does not redo previous work but rather augments it
• Assumes a heterogeneous, fragmented world with multiple platforms
• Allows one to specify and manage all the services and data that a project needs, with a mix of synchronous and asynchronous interaction and of close (classic workflow) and loose (including zero) coupling

(5)

Application Drivers

■ Semantic analysis of scientific documents, as in the case of chemistry, which has very precise naming rules for compounds that allow accurate searches in documents
• Suggesting how to tag a scientific document, either when writing it or after the fact
■ Journal web site of the future, as illustrated by Nature building the social bookmarking tool Connotea
■ Conference support tools, which can benefit from features needed by journals
■ This gives Digital Library (document) enhanced Cyberinfrastructure (CI)

(6)

The Science Drivers

■ From the Workshop on Challenges of Scientific Workflows, http://vtcpc.isi.edu/wiki/index.php/Main_Page
■ Workflow is the underlying support for the current science model
• Distributed, interdisciplinary, data-deluged scientific methodology as an end (instrument, conjecture) to end (paper, Nobel prize) process is a transformative approach
■ Reproducibility is core to the scientific method and requires rich provenance and interoperable persistent repositories, with linkage of open data and publication as well as distributed simulations, data analysis and new algorithms
■ Distributed Science Methodology publishes all steps in a new electronic logbook capturing the scientific process (data analysis) as a rich cloud of resources including emails, PPT and Wikis as well as databases, compiler options, build time/runtime configuration…

(7)

Community Tools

■ E-mail and list-serves are the oldest and best used
■ Kazaa, Instant Messengers, Skype, Napster and BitTorrent for P2P collaboration
• Text, audio-video conferencing, files
■ del.icio.us, Connotea, Citeulike, Bibsonomy and Biolicious manage shared bookmarks
■ MySpace, Bebo, Hotornot, Facebook and similar sites allow you to create (upload) community resources and share them; Friendster and LinkedIn create networks
• http://en.wikipedia.org/wiki/List_of_social_networking_websites
■ Writely, Wikis and Blogs are powerful specialized shared document systems
■ ConferenceXP and WebEx share general applications
■ Google Scholar tells you who has cited your papers, while publisher sites tell you about co-authors
• Windows Live Academic Search has similar goals
■ Note that sharing resources creates (implicit) communities
• Social network tools study graphs to both define communities and extract their properties

(8)

How to use Web2.0 Community tools in CI

■ Nearly all of them have “profiles”, “users”, “groups”, “friends” etc.
• Need to integrate these
■ P2P file sharing: may be useful for sharing files in research groups (virtual organizations)
• Will modify Maze http://maze.pku.edu.cn, a popular Chinese social P2P system with 2.5 million users
■ BitTorrent: more popular than FTP, so why not use it for higher performance, fault tolerant, cached file sharing?
■ MySpace etc.: could consider a MyGridSpace or MyScienceSpace that supports a similar document sharing model, with users uploading pictures, papers and even data/services of interest
• Could include uploaded material in workflows
■ Social bookmarking and linking: discussed later

(9)

Mashups and Grids

■ http://www.programmableweb.com
■ There are 303 “commodity” service Web 2.0 APIs as of October 30 2006
■ Mashups are composed from JavaScript, AJAX and REST and not usually BPEL, WSDL and SOAP
■ Architecture of Mashups and Grids “identical”
■ See Amazon S3 Storage and EC2 Elastic Computing services
■ Mashups enable everybody to

(10)

Mashup APIs with use indicated by size

■ Note most Mashups are implemented client side inside the Browser
■ Most Grid

(11)

(12)

Document-enhanced Cyberinfrastructure

[Architecture diagram. Labels recovered from the figure: Existing User Interface; Existing Document-based Research Tools; Google Scholar; Submit; Journals; Science.gov; Windows Live Academic Search; Citeseer; CMT Conference Management; etc.; Web service Wrappers; New Document-enhanced Research Tools; Integration; Enhancement; User Interface; Community Tools; CiteULike; Generic Document Tools; MyResearch Database; Bibliographic Database; Export RSS, Bibtex, Endnote etc.]

(13)

Digital Library-enhanced Cyberinfrastructure aka Semantic Scholar Grid I

■ Citeseer and Google Scholar scour the Internet and analyze documents for incidental metadata
• Title, author and institution of documents
• Citations, with their own metadata allowing one to match them to other documents
■ Science.gov extracts traditional library metadata from lots of US Government databases
■ These capabilities are sure to become more powerful and to be extended
• Give a “Citation Index” in real time
• Tell you all authors of all papers that cite a paper that cites you etc. (Note it’s a small world, so don’t go too far in link analysis)
• Tell you all citations of all papers in a workshop
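
The "authors of all papers that cite a paper that cites you" query can be read as a bounded walk over a citation graph. The following is only a minimal sketch under stated assumptions: the paper ids, author names and the `PAPERS` table are made up for illustration, and a real service would draw this data from a Citeseer or Google Scholar style index rather than a dictionary. The `max_hops` limit reflects the small-world caution: each extra hop widens the result quickly.

```python
from collections import defaultdict

# Hypothetical toy data: paper id -> (authors, ids of the papers it cites).
PAPERS = {
    "mine": (["Fox"], []),
    "p1": (["Ada", "Bo"], ["mine"]),
    "p2": (["Cy"], ["p1"]),
    "p3": (["Di", "Ada"], ["p2"]),
}


def cited_by_index(papers):
    """Invert the citation lists: paper id -> set of papers that cite it."""
    index = defaultdict(set)
    for pid, (_authors, cites) in papers.items():
        for target in cites:
            index[target].add(pid)
    return index


def citing_authors(papers, start, max_hops):
    """Authors of papers reached from `start` via at most `max_hops` 'is cited by' links."""
    index = cited_by_index(papers)
    seen, frontier, authors = {start}, {start}, set()
    for _ in range(max_hops):
        # Papers citing anything in the current frontier, minus those already visited.
        frontier = {c for p in frontier for c in index.get(p, ())} - seen
        seen |= frontier
        for pid in frontier:
            authors.update(papers[pid][0])
    return authors


print(sorted(citing_authors(PAPERS, "mine", 1)))  # direct citers only
print(sorted(citing_authors(PAPERS, "mine", 2)))  # plus papers citing those
```

With one hop only the direct citers are returned; with two hops the authors of papers citing those papers are added as well.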

(14)

Digital Library-enhanced Cyberinfrastructure aka Semantic Scholar Grid II

■ It is natural to develop knowledge extraction document Services such as those used in Citeseer/Google Scholar but applied to “your” documents of interest that may not have been processed yet
• Such as a paper just submitted to a conference, perhaps
■ These tools can help form useful lists, such as authors of all papers cited in or submitted to a journal
■ OSCAR3 (from Peter Murray-Rust’s group at Cambridge) augments the application independent “core” metadata (title, authors, institutions, citations) with a list of all chemical terms
• This tool is a Service that can be applied to “your” document or to a set of documents harvested in some fashion
• Other fields have natural application specific metadata, and OSCAR-like tools can be developed for them
■ Such high value tools could appear on “publisher” sites of the future

(15)

OSCAR Chemistry Document analysis

■ It detects “magic” chemical strings in text and then
• Stores them as metadata associated with the document
■ Queries ChemInformatics repositories to tell you lots of information about identified compounds
■ Tells you which other documents have this compound

(16)

Scholar Grid III

■ Search and annotation provide unstructured and structured Semantic Web/Grid for documents
■ Other Web 2.0 tools address linkage of people together and people to information
■ Information is metadata, as in profiles, or personal publication, as in Blogs, Wikis, YouTube, MySpace
• All of these involve some sort of collaboration
• From comments on Blogs and uploads to collaborative editing in a Wiki
■ Our projects usually use Wikis as central control (group logbook), and each researcher (including students) can use Blogs to define progress (an experimental Web 2.0 electronic notebook)
• I can comment on student progress with a Blog comment
• Other students can keep abreast of group progress
• Security model not clear
■ There is also P2P file transfer with BitTorrent

(17)

Delicious Semantic Web/Grid

■ http://del.icio.us purchased by Yahoo for ~$30M
■ http://www.CiteULike.org
■ http://www.connotea.org (Nature)
■ Associate metadata with Bookmarks specified by URLs, DOIs (Digital Object Identifiers)
■ Users add comments and keywords (called tags)
■ Users are linked together into groups (communities)
■ Information such as title and authors extracted automatically from some sites (PubMed, ACM, IEEE, Wiley etc.)
■ Bibtex-like additional information in CiteULike
■ This is perhaps de facto Semantic Web, remarkable for its simplicity

(18)

Connotea

(19)

Connotea queried by SERVOGrid

(20)

Biolicious automatically produces (interesting) scientific lists

Advertising!

(21)

Chemical Informatics as a Grid Application

■ Chemical Informatics is the application of information technology to problems in chemistry
• Example problems: managing data in large scale drug discovery and molecular modeling
■ Building blocks: Chemical Informatics resources
• Chemical databases maintained by various groups
  – NIH PubChem, NIH DTP, http://nihroadmap.nih.gov/
• Application codes (both commercial and open source)
  – Data mining such as clustering
  – Quantum chemistry and molecular modeling
• Screening centers (with HTS, High Throughput Screening, devices) measuring interaction of chemicals with biological samples
• Visualization tools
• Web resources: journal articles, etc.
■ The Chemical Informatics Grid http://www.chembiogrid.org needs to integrate these into a common, loosely coupled, distributed computing environment

(22)

OSCAR3 Service from Cambridge UK

■ Oscar3 is a tool for shallow, chemistry-specific natural language parsing of chemical documents (i.e. journal articles)
■ It identifies (or attempts to identify):
• Chemical names: singular nouns, plurals, verbs etc., also formulae and acronyms
• Chemical data: spectra, melting/boiling point, yield etc. in experimental sections
• Other entities: things like N(5)-C(3) and so on
■ Uses SMILES, InChI and CML
■ There is a larger effort, SciBorg, in this area
• http://www.cl.cam.ac.uk/~aac10/escience/sciborg.html

(23)

Workflows Using Chemical Literature

All of PubMed “just” takes about a day to run through OSCAR3 on the 2048 node Big Red

[Workflow diagram. Labels recovered from the figure: Bulk download of PubMed abstracts; OSCAR3 program; OSCAR3 Service; Extract chemical structures; Searchable (structure/similarity) Grid database; Find similar molecules; Local DTP database; PubChem; PDBBind; Find similar documents]

SMILES   NAME      Pubmed ID
CCC      propane   1425356
CC       ethane    3546453
...      ...       ...
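
The table of SMILES strings against PubMed IDs suggests a simple inverted index from compound to documents, which is one way the "find similar documents" step could start. The sketch below is an illustration only, not the workflow's actual implementation: the third row is a made-up record added so that one compound occurs in two abstracts, and matching is by exact SMILES string. Real structure/similarity search would need a cheminformatics toolkit instead of string equality.

```python
from collections import defaultdict

# Rows in the shape of the slide's table: (SMILES, name, PubMed ID).
ROWS = [
    ("CCC", "propane", 1425356),
    ("CC", "ethane", 3546453),
    ("CC", "ethane", 9999999),  # hypothetical PubMed ID, for illustration only
]


def build_index(rows):
    """Map each SMILES string to the set of PubMed IDs that mention it."""
    by_smiles = defaultdict(set)
    for smiles, _name, pmid in rows:
        by_smiles[smiles].add(pmid)
    return by_smiles


def documents_sharing_compound(index, pmid):
    """Other PubMed IDs mentioning at least one compound that `pmid` mentions."""
    related = set()
    for pmids in index.values():
        if pmid in pmids:
            related |= pmids
    related.discard(pmid)
    return related


index = build_index(ROWS)
print(documents_sharing_compound(index, 3546453))
```

Ranking documents by the number of shared compounds would be a natural next step on top of the same index.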

(24)

Initial Results

■ We have a small sample (100) of full text Chemistry papers selected at random from 15 years of PubMed with over 5 million abstracts
• OSCAR3 generates 4.17 compound names per abstract
• and 36.7 compound names per full text, roughly nine times as many
■ Illustrates how much knowledge journal publishers are hiding from us

(25)

Clustering Documents from chemical properties

(26)

Provenance and Delicious CI

■ We can use a del.icio.us style interface to annotate Application Data with (extra) provenance and user comments of any type (describing quality of data, or a keyword relating different data, etc.)
• All data should be labeled by a URI to enable this
• One has in addition Citeseer/OSCAR metadata
■ Current major tagging systems support a flat list of tags without name=value (RDF triple) or schema organization
• Tradeoff between features and pervasive deployment
■ Some extra features are easy to add as a custom service
■ Features not supported by del.icio.us can be uploaded as comments
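
One way to picture "features uploaded as comments" is to carry name=value provenance inside the free-text comment of an otherwise flat-tag annotation, and let a small custom service parse it back out. The sketch below is an assumption-laden illustration: the `Annotation` record, the `@prop` line marker and the example URI are all invented here and are not part of del.icio.us or any other tagging system.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A flat-tag bookmark on a piece of data identified by its URI."""
    uri: str
    tags: list = field(default_factory=list)
    comment: str = ""


def attach_properties(annotation, properties):
    """Append name=value pairs to the comment, one per line, behind a hypothetical '@prop' marker."""
    lines = [f"@prop {name}={value}" for name, value in sorted(properties.items())]
    annotation.comment = (annotation.comment + "\n" + "\n".join(lines)).strip()
    return annotation


def read_properties(annotation):
    """Recover the name=value pairs; ordinary comment text is ignored."""
    props = {}
    for line in annotation.comment.splitlines():
        if line.startswith("@prop ") and "=" in line:
            name, value = line[len("@prop "):].split("=", 1)
            props[name] = value
    return props


note = Annotation("urn:example:dataset/run42", ["gps", "raw"], "looks noisy")
attach_properties(note, {"quality": "low", "instrument": "station-7"})
print(read_properties(note))
```

The tagging site still sees an ordinary comment, so pervasive deployment is kept, while the custom service gains the structured fields.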

(27)

Implementation Strategy

■ Doesn’t seem useful to build the 251st community tool
■ In fact a major barrier to use of existing tools is
• What happens when a better tool comes along and/or the chosen tool disappears (unsupported/removed from the Web)?
■ So assume we use existing tools but wrap them all as web services, so that we can transfer information to new tools and integrate information between tools
• Need some “glue” logic, a “unification” database and a minimal user interface
■ Bookmarking tools: del.icio.us, Connotea, CiteULike (includes plug-ins to major publisher sites)
■ Document: Google Scholar, Windows Live, Citeseer tools, OSCAR3 for Chemistry, Science.gov (later)
■ Journals: Manuscript Central
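
The wrap-and-unify strategy can be sketched as a common wrapper interface plus a merge step that plays the role of the "unification" database. This is a minimal sketch, not the project's code: `BookmarkTool`, `InMemoryTool` and `unify` are hypothetical names, and the stand-in wrapper holds its bookmarks in memory where a real wrapper would call the wrapped tool's own web service.

```python
from abc import ABC, abstractmethod


class BookmarkTool(ABC):
    """Common face that every wrapped tool presents to the glue logic."""
    name: str

    @abstractmethod
    def export(self):
        """Return bookmarks as dicts with 'url' and 'tags' keys."""


class InMemoryTool(BookmarkTool):
    """Stand-in wrapper used for illustration; holds bookmarks in a list."""

    def __init__(self, name, bookmarks):
        self.name = name
        self._bookmarks = bookmarks

    def export(self):
        return list(self._bookmarks)


def unify(tools):
    """Merge all exports into one table: url -> {'tags': set, 'sources': set}."""
    table = {}
    for tool in tools:
        for item in tool.export():
            row = table.setdefault(item["url"], {"tags": set(), "sources": set()})
            row["tags"].update(item["tags"])
            row["sources"].add(tool.name)
    return table


tools = [
    InMemoryTool("del.icio.us", [{"url": "http://example.org/p1", "tags": ["grid"]}]),
    InMemoryTool("Connotea", [{"url": "http://example.org/p1", "tags": ["semantic"]}]),
]
print(unify(tools))
```

Because the glue logic depends only on the wrapper interface, a tool that disappears can be replaced by writing one new wrapper, and its earlier exports remain in the unified table.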

(28)

Current Status

■ Google Scholar, Windows Live Academic Search, del.icio.us, Connotea, CiteULike and OSCAR3 are Web Services
■ Debugging on 500 presentations and papers from my CGL research group
■ Experiment with GGF Presentations, a broad collection of Chemical Informatics resources (explore science document CI link) and the Concurrency&Computation: Practice&Experience Web site (? business model for journals)

http://gf6.ucs.indiana.edu:48990/SemanticResearchGrid

(29)

Knowledge Model for Scientific Journals

■ There are classes of scientific journals
• Large circulation society journals, effectively subsidized by fees of professional society membership; circulations can be more than 10,000
• “Popular” magazine style journals
• A few prestigious journals
• Many specialized journals publishing archival refereed papers, with circulations from one hundred to a few thousand
■ The specialized journals largely sell a mix of paper and (a growing number of) electronic subscriptions to libraries; very few individuals subscribe
• Access is limited and expensive
• Even if one subscribes, one is often restricted on the number of full text papers one can access
• Collections like PubMed only include abstracts
■ Systems like Google Scholar, Windows Live Academic Search and Citeseer cannot fully analyze knowledge in papers unless they get access to full text
■ Current publishing model is hindering and not helping science
■ Similar discussion for journal papers and research data

(30)

Internet Business Models

■ How to make money on the Internet has been debated for many years
■ One can offer content (data on web) and/or services (user customizable transformations of web data)
■ Advertising is the dominant model in large sites
■ Content and Services can be free or paid by Transactions or Subscriptions
• Often there is a mixed model with basic content/services free and one pays for premium features
■ One can charge reader or publisher
• Advertising charges the publisher of the Advert
• In the past, journals were funded by page charges, i.e. one charged the authors (institution) that produced the paper

(31)

Examples of Internet Information and Knowledge Content and Business Model

■ iTunes and other music sources: at the right price, people will pay for convenience
■ News web sites supported by a mix of advertising and premium content
• Not clear the latter is successful except in specialized areas
■ Sites like http://www.chessbase.com/ with collections of Chess Games with occasional annotation
■ Several Financial Service sites
• Yahoo, Google etc. Financial Services with a premium for real-time stock quotes
• Other sites feature commentary that is either free (supported by advertising) or premium content (such as Wall Street Journal and many stock picker sites) which you subscribe to

(32)

Examples of Internet Information and Knowledge Services and Business Model

■ Google etc. online Office versus the more sophisticated paid Microsoft Office, which also has a "history" advantage as it owned the field before the Internet
■ WebEx collaboration services paid by transaction or subscription; not obviously a viable long term model
■ ICC Chess Site http://www.chessclub.com/ supports the community of chess players with free basic access but valuable premium features including better game playing, rating and real-time commentary. Other gaming sites are similar
■ Amazon S3 and Computing Cloud paid services could be successful, as the alternative (buy your own computers) costs real money and is perhaps less reliable

(33)

Publishing Business Model in the Internet Age

■ Journal publishing currently has a business model where the price reflects neither the cost nor the value-added
■ Publishers currently do not have significant internal expertise in new approaches/technologies to drive new business models
■ However, much is outsourced already, and so one can outsource to organizations with new expertise, e.g. to those that know Web 2.0 rather than putting ink on paper
■ There is no clear new business model, but it is plausible that the current model will not survive for that long
• So need to change even if less lucrative or success unclear
■ Note libraries provide funds to publishers and libraries will continue
• Not clear how fast libraries will change, as they also don’t obviously have expertise to support new models
• Some think that one role of university libraries will be curation of data produced by university faculty

(34)

Strengths of Current Publishing Model

■ Permanent “guaranteed” archival storage, but there are other approaches to this such as Amazon S3
■ Uniform look and feel, and copyediting to remove language errors
• Useful but not so valuable that we can trade access for this
• In particular, copyediting can only correct some language errors, as only a subject expert can really rewrite in good grammar and expression
■ Refereeing of a quality implied by the journal and the editorial board
• Most important strength, but the business model does not directly reflect this, as only a small part of the subscription price goes to the editorial function
• For most papers, the cost of refereeing is much less than the other costs of producing the paper
• Not clear why the viewer should pay for refereeing
■ Large amount of pre-existing papers from old issues of journals

(35)

Pressures on Current Publishing Model

■ Mandated open access to scholarly work funded by government
• Cornyn-Lieberman bill in the US
• NIH PubMed Central requires deposit of full text of articles after a length of time
■ Electronic access to publisher sites is not especially good
■ Division of articles into journals and publishers is not very helpful today, when technology does not care about location of information
• Location is just a rather simple annotation (metadata) specifying aspects of the provenance of an article
• Note a special issue of SKG2006 is just an annotation roughly characterizing nature and quality of work
■ Publishing on the Internet is not a valuable service, and has been addressed by Web servers in general and by Web 2.0 in attractive ways
■ Essentially nobody reads or even has access to paper copies of journals
• Not clear it is useful to print specialized journals on paper

(36)

Scholarly Research Community Site

■ Best product should allow one to make best use of knowledge in scholarly publications and data
■ Should integrate journal and conference publications and services
■ Should contain integrated, or support outside, services for curation, annotation, analysis and search
• Looking at Web 2.0 successes, one needs to conveniently share data and set up communities
■ Content is scholarly journals and data
■ Services include
• Annotation as in Connotea, CiteULike, del.icio.us
• Semantic analysis for citations, authors, chemical compounds etc.
• Biolicious style custom classifications including added value contacts
• Search as in Google Scholar, Windows Live Academic Search
• MySpace/Facebook/LinkedIn style services for existing or new contacts
• Support of conference and journal refereeing
• Other conference/journal services such as registration, advertising
• Integration with research such as electronic log books
• Internal integration, e.g. authors in citations are linked to community
• Links to more general document services such as:
  – Online Office style tools
  – WebEx type collaboration

(37)

Business Model for Scholarly Journal/Research Community Site

■ One can charge for advertising, better content, better services or better implementation
■ Natural to start with basic free content and services with advertising
• Content must be free eventually “by law”
• Services will have open source versions anyway, so counter this with free basic services
■ One could use the page charge model to charge for refereeing
■ One charges the user for features that add value. These include:
• Better or better implemented community/digital library services
• Premium content, possibly contracted by the site owner
■ Problem with the advertising business model: audience is specialized (i.e. small) but upscale
■ Problem with charging for community tools: competing with free software, but one can likely offer much better service than free software, just as WebEx does fine in spite of free VNC