SKG2006
Introduction
http://www.culturegrid.net/SKG2006/
Guilin China
November 2 2006
Geoffrey Fox
Computer Science, Informatics, Physics Pervasive Technology Laboratories Indiana University Bloomington IN 47401
SKG2006
■ Last year saw the first conference of this series in Beijing, covering
• Knowledge sharing
• Semantic networking
• Grid computing
■ These areas underlie
• Electronic Science (eScience)
• Scholarship
• Communities (the real world)
■ This year we are pleased to present the second conference, which had an 18% acceptance rate for regular papers
■ We look forward to the meeting next year in Xi’an
■ Listen and ask lots of questions!
■ Let’s thank Hai Zhuge and CAS for their wonderful vision and implementation
Web 2.0, Knowledge and the Semantic Grid
SKG 2006
http://www.culturegrid.net/SKG2006/
Guilin China
November 2 2006
Geoffrey Fox
Computer Science, Informatics, Physics Pervasive Technology Laboratories Indiana University Bloomington IN 47401
Motivation
■ Build Cyberinfrastructure (Grids) that
• Supports science from beginning (planning, instruments) through middle (analysis) to end (refereed publications, follow-on work)
• Integrates with the popular Web 2.0 (community) tools, whose successes point to interesting ways of working together
• Integrates with Digital Library technology
• Does not redo previous work but rather augments it
• Assumes a heterogeneous, fragmented world with multiple platforms
• Allows one to specify and manage all the services and data that a project needs, with a mix of synchronous, asynchronous, close (classic workflow) and loose (including zero) coupling
Application Drivers
■ Semantic analysis of scientific documents, as in the case of chemistry, which has very precise naming rules for compounds that allow accurate searches in documents
• Suggesting how to tag scientific documents, either while writing them or after the fact
■ Journal web site of the future, as illustrated by Nature building the social bookmarking tool Connotea
■ Conference support tools, which can benefit from features needed by journals
■ This gives Digital Library (document) enhanced Cyberinfrastructure (CI)
The Science Drivers
■ From Workshop on Challenges of Scientific Workflows http://vtcpc.isi.edu/wiki/index.php/Main_Page
■ Workflow is underlying support for current science model
• Distributed, interdisciplinary, data-deluged scientific methodology as an end (instrument, conjecture) to end (paper, Nobel prize) process is a transformative approach
■ Reproducibility is core to the scientific method and requires rich provenance, interoperable persistent repositories with linkage of open data and publication as well as distributed simulations, data analysis and new algorithms.
■ Distributed Science Methodology publishes all steps in a new electronic logbook capturing scientific process (data analysis) as a rich cloud of resources including emails, PPT, Wikis as well as databases, compiler options, build time/runtime configuration…
Community Tools
■ E-mail and list-serves are the oldest and best used
■ Kazaa, Instant Messengers, Skype, Napster, BitTorrent for P2P Collaboration: text, audio-video conferencing, files
■ del.icio.us, Connotea, CiteULike, Bibsonomy, Biolicious manage shared bookmarks
■ MySpace, Bebo, Hotornot, Facebook, or similar sites allow you to create (upload) community resources and share them; Friendster, LinkedIn create networks
• http://en.wikipedia.org/wiki/List_of_social_networking_websites
■ Writely, Wikis and Blogs are powerful specialized shared document systems
■ ConferenceXP and WebEx share general applications
■ Google Scholar tells you who has cited your papers while publisher sites tell you about co-authors
• Windows Live Academic Search has similar goals
■ Note sharing resources creates (implicit) communities
• Social network tools study graphs to both define communities and extract their properties
How to use Web 2.0 Community tools in CI
■ Nearly all of them have “profiles”, “users”, “groups”, “friends” etc.
• Need to integrate these
■ P2P File Sharing: Maybe this is useful for sharing files in research groups (virtual organizations)
• Will modify Maze http://maze.pku.edu.cn, a popular Chinese social P2P system with 2.5 million users
■ BitTorrent: more popular than FTP, so why not use it for higher performance, fault tolerant, cached file sharing?
■ MySpace etc.: Could consider MyGridSpace or MyScienceSpace that supports a similar document sharing model with users uploading pictures, papers and even data/services of interest
• Could include uploaded material in workflows
■ Social Bookmarking and linking: discuss later
Mashups and Grids
■ http://www.programmableweb.com There were 303 “commodity” Web 2.0 service APIs on October 30 2006
■ Mashups are composed from JavaScript, AJAX and REST and not usually BPEL, WSDL and SOAP
■ Architecture of Mashups and Grids “identical”
■ See Amazon S3 Storage and EC2 Elastic Computing services
■ Mashups enable everybody to
[Figure: Mashup APIs with use indicated by size]
■ Note most Mashups are implemented client side inside Browser
■ Most Grid
Document-enhanced Cyberinfrastructure
[Figure: architecture diagram. Labels: Existing User Interface; Existing Document-based Research Tools (Google Scholar, Windows Live Academic Search, Citeseer, Science.gov, CMT Conference Management, Submit Journals, etc.); Web service Wrappers; New Document-enhanced Research Tools; Integration/Enhancement User Interface; Community Tools; Generic Document Tools; MyResearch Database; Bibliographic Database; Export: RSS, Bibtex, Endnote etc.; CiteULike]
Digital Library-enhanced Cyberinfrastructure aka Semantic Scholar Grid I
■ Citeseer and Google Scholar scour the Internet and analyze documents for incidental metadata
• Title, author and institution of documents
• Citations with their own metadata allowing one to match to other documents
■ Science.gov extracts traditional library metadata from lots of US Government databases
■ These capabilities are sure to become more powerful and to be extended
• Give “Citation Index” in real time
• Tell you all authors of all papers that cite a paper that cites you etc. (Note it’s a small world so don’t go too far in link analysis)
• Tell you all citations of all papers in a workshop
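The link analysis described above ("all authors of all papers that cite a paper that cites you") can be sketched as a bounded walk over a reverse citation index. This is only an illustration: the paper IDs, author names and the `max_hops` limit are made-up assumptions, not data or an algorithm from Citeseer or Google Scholar.

```python
from collections import defaultdict

# Hypothetical toy data: paper -> list of papers it cites.
CITES = {
    "P1": ["MINE"],
    "P2": ["P1"],
    "P3": ["P1", "MINE"],
    "P4": ["P2"],
}
# Hypothetical authors for each paper.
AUTHORS = {
    "MINE": ["Me"],
    "P1": ["Ann"],
    "P2": ["Bob", "Cy"],
    "P3": ["Ann", "Dee"],
    "P4": ["Eve"],
}

def invert(cites):
    """Build the reverse index: paper -> set of papers that cite it."""
    cited_by = defaultdict(set)
    for paper, refs in cites.items():
        for ref in refs:
            cited_by[ref].add(paper)
    return cited_by

def citing_within(paper, cited_by, max_hops):
    """Papers that reach `paper` through at most `max_hops` citation links."""
    seen, frontier = set(), {paper}
    for _ in range(max_hops):
        # Expand one hop, skipping papers already visited and the start paper.
        frontier = {c for p in frontier for c in cited_by.get(p, ())} - seen - {paper}
        seen |= frontier
    return seen

def authors_of(papers, authors):
    """Sorted, de-duplicated author list for a set of papers."""
    return sorted({a for p in papers for a in authors.get(p, [])})

cited_by = invert(CITES)
two_hop = citing_within("MINE", cited_by, max_hops=2)
print(authors_of(two_hop, AUTHORS))
```

Keeping `max_hops` small reflects the "small world" caution: each extra hop can pull in a large share of the whole literature.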
Digital Library-enhanced Cyberinfrastructure aka Semantic Scholar Grid II
■ It is natural to develop knowledge extraction document Services such as those used in Citeseer/Google Scholar but applied to “your” documents of interest that may not have been processed yet
• Such as a paper just submitted to a conference, perhaps
■ These tools can help form useful lists such as authors of all papers cited in or submitted to a journal
■ OSCAR3 (from Peter Murray-Rust’s group at Cambridge) augments the application independent “core” metadata (Title, authors, institutions, Citations) with a list of all chemical terms
• This tool is a Service that can be applied to “your” document or to a set of documents harvested in some fashion
• Other fields have natural application specific metadata and OSCAR like tools can be developed for them
■ Such high value tools could appear on “publisher” sites of the future
OSCAR Chemistry Document analysis
■ It detects “magic” chemical strings in text and then
• Stores them as metadata associated with document
■ Queries ChemInformatics repositories to tell you lots of information about identified compounds
■ Tells you which other documents have this compound
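The detect / store / look-up cycle above can be illustrated with a deliberately naive sketch. It is not OSCAR's method (OSCAR3 does chemistry-specific natural language parsing); a plain dictionary lookup is swapped in here, and the lexicon, document IDs and texts are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical mini-lexicon: compound name -> SMILES string.
LEXICON = {"propane": "CCC", "ethane": "CC", "ethanol": "CCO"}

def extract_compounds(text, lexicon=LEXICON):
    """Return the sorted lexicon names that occur as whole words in `text`."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(words & set(lexicon))

def annotate(docs, lexicon=LEXICON):
    """Attach compound metadata to each document and build an inverted index."""
    metadata = {}
    index = defaultdict(set)  # compound name -> documents mentioning it
    for doc_id, text in docs.items():
        names = extract_compounds(text, lexicon)
        metadata[doc_id] = [{"name": n, "smiles": lexicon[n]} for n in names]
        for n in names:
            index[n].add(doc_id)
    return metadata, index

# Hypothetical documents.
DOCS = {
    "doc1": "Propane and ethane were separated by distillation.",
    "doc2": "The yield of ethanol was 80%.",
    "doc3": "Ethane oxidation gives ethanol.",
}
metadata, index = annotate(DOCS)
print(metadata["doc1"])
print(sorted(index["ethane"]))  # other documents that have this compound
```

The inverted index is what answers "which other documents have this compound"; querying an external ChemInformatics repository would replace the `lexicon` lookup in a real service.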
Scholar Grid III
■ Search and annotation provide unstructured and structured Semantic Web/Grid for documents
■ Other Web 2.0 tools address linkage of people together and people to information
■ Information is metadata as in profiles or personal publication as in Blogs, Wikis, YouTube, MySpace
• All of these involve some sort of collaboration
• Ranging from comments on Blogs and uploads to collaborative editing in a Wiki
■ Our projects usually use Wikis as central control (group logbook) and each researcher (including students) can use Blogs to define progress (an experimental Web 2.0 electronic notebook)
• I can comment on student progress with Blog comment
• Other students can keep abreast of group progress
• Security model not clear
■ There is also P2P file transfer with BitTorrent
Delicious Semantic Web/Grid
■ http://del.icio.us purchased by Yahoo for ~$30M
■ http://www.CiteULike.org
■ http://www.connotea.org (Nature)
■ Associate metadata with Bookmarks specified by URL’s, DOI’s (Digital Object Identifiers)
■ Users add comments and keywords (called tags)
■ Users are linked together into groups (communities)
■ Information such as title and authors extracted automatically from some sites (PubMed, ACM, IEEE, Wiley etc.)
■ Bibtex like additional information in CiteULike
■ This is perhaps de facto Semantic Web, remarkable for its simplicity
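The simplicity of this model shows in how little structure it needs. The sketch below is an assumed, in-memory data model (bookmark target, user, tags, comment, groups); the class and method names are hypothetical and do not correspond to the del.icio.us, CiteULike or Connotea APIs.

```python
from dataclasses import dataclass, field

@dataclass
class Bookmark:
    target: str                       # URL or DOI identifying the resource
    user: str
    tags: set = field(default_factory=set)
    comment: str = ""

class BookmarkStore:
    """Toy store: bookmarks plus groups (communities) of users."""

    def __init__(self):
        self.bookmarks = []
        self.groups = {}              # group name -> set of user names

    def add(self, target, user, tags=(), comment=""):
        self.bookmarks.append(Bookmark(target, user, set(tags), comment))

    def join(self, group, user):
        self.groups.setdefault(group, set()).add(user)

    def by_tag(self, tag, group=None):
        """Targets carrying `tag`, optionally restricted to one group's members."""
        members = self.groups.get(group, set()) if group is not None else None
        return sorted({b.target for b in self.bookmarks
                       if tag in b.tags and (members is None or b.user in members)})

store = BookmarkStore()
store.join("chem", "alice")
store.add("doi:10.1000/example1", "alice", ["grid", "chemistry"], "worth reading")
store.add("http://example.org/paper", "bob", ["grid"])
print(store.by_tag("grid"))
print(store.by_tag("grid", group="chem"))
```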
Connotea
Connotea queried by SERVOGrid
Biolicious automatically produces (interesting) scientific lists
Advertising!
Chemical Informatics as a Grid Application
■ Chemical Informatics is the application of information technology to problems in chemistry.
• Example problems: managing data in large scale drug discovery and molecular modeling
■ Building Blocks: Chemical Informatics Resources:
• Chemical databases maintained by various groups: NIH PubChem, NIH DTP, http://nihroadmap.nih.gov/
• Application codes (both commercial and open source): data mining such as clustering; quantum chemistry and molecular modeling
• Screening centers (with HTS High Throughput Screening devices) measuring interaction of chemicals with biological samples
• Visualization tools
• Web resources: journal articles, etc.
■ Chemical Informatics Grid http://www.chembiogrid.org needs to integrate these into a common, loosely coupled, distributed computing environment.
OSCAR3 Service from Cambridge UK
■ Oscar3 is a tool for shallow, chemistry-specific natural language parsing of chemical documents (i.e. journal articles).
■ It identifies (or attempts to identify):
• Chemical names: singular nouns, plurals, verbs etc., also formulae and acronyms.
• Chemical data: Spectra, melting/boiling point, yield etc. in experimental sections.
• Other entities: Things like N(5)-C(3) and so on.
■ Uses SMILES, InChI and CML
■ There is a larger effort, SciBorg, in this area
• http://www.cl.cam.ac.uk/~aac10/escience/sciborg.html
Workflows Using Chemical Literature
All of PubMed “just” takes about a day to run through OSCAR3 on 2048 node Big Red
SMILES   NAME      PubMed ID
CCC      propane   1425356
CC       ethane    3546453
...      ...       ...
[Figure: workflow diagram. Labels: Bulk download of PubMed abstracts; Extract chemical structures; OSCAR3 program; OSCAR3 Service; Searchable (structure/similarity) Grid database; Find similar molecules; Local DTP database; PubChem; PDBBind; Find similar documents]
Initial Results
■ We have a small sample (100) of full text Chemistry papers selected at random from 15 years of PubMed with over 5 million abstracts
• OSCAR3 generates 4.17 compound names per abstract
• and 36.7 compound names per full text
■ Illustrates how much knowledge journal publishers are hiding from us: full text yields about 8.8 times as many compound names as the abstract
Clustering Documents from chemical properties
Provenance and Delicious CI
■ We can use del.icio.us style interface to annotate Application Data with (extra) provenance and user comments of any type (describing quality of data or a keyword relating different data etc.)
• All data should be labeled by a URI to enable this
• One has in addition Citeseer/OSCAR metadata
■ Current major tagging systems support flat list of tags without name=value (RDF triple) or schema organization
• Tradeoff between features and pervasive deployment
■ Some extra features are easy to add as a custom service
■ Features not supported by del.icio.us can be uploaded as comments
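One possible workaround for flat tag lists, sketched here as an assumption rather than anything del.icio.us defines, is to pack each name=value pair into a single tag (or comment) string and unpack it in a custom service. The `name=value` convention, the example URI and the property names are all hypothetical.

```python
def encode(pairs):
    """Turn (name, value) pairs into flat strings of the form 'name=value'."""
    return [f"{name}={value}" for name, value in pairs]

def decode(tags):
    """Split flat tags back into plain keywords and name -> values properties."""
    keywords, props = [], {}
    for tag in tags:
        name, sep, value = tag.partition("=")
        if sep and name and value:
            props.setdefault(name, []).append(value)
        else:
            keywords.append(tag)      # ordinary keyword, left untouched
    return keywords, props

# Data item labeled by a (hypothetical) URI, annotated with provenance.
tags = ["earthquake"] + encode([("quality", "validated"), ("source", "run42")])
annotations = {"http://example.org/data/run42": tags}

keywords, props = decode(annotations["http://example.org/data/run42"])
print(keywords)
print(props)
```

The tagging site still sees only a flat list, so pervasive deployment is preserved; the extra structure lives entirely in the custom service that encodes and decodes.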
Implementation Strategy
■ Doesn’t seem useful to build the 251st community tool
■ In fact a major barrier to use of existing tools is
• What happens when a better tool comes along and/or chosen tool disappears (unsupported/removed from Web)
■ So assume we use existing tools but wrap them all as web services so that we can transfer information to new tools and integrate information between tools
• Need some “glue” logic, a “unification” database and minimal user interface
■ Bookmarking tools: del.icio.us, Connotea, CiteULike (includes plug-ins to major publisher sites)
■ Document: Google Scholar, Windows Live, Citeseer tools, OSCAR3 for Chemistry, Science.gov (later)
■ Journals: Manuscript Central
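The wrap-and-unify strategy can be sketched as a common wrapper interface plus a merge step. Everything below is a hypothetical illustration: `ToolWrapper`, `InMemoryTool`, `unify` and `migrate` are invented names, and the in-memory class stands in for real wrappers that would call each tool's web service.

```python
from abc import ABC, abstractmethod

class ToolWrapper(ABC):
    """Common interface each wrapped community tool is assumed to expose."""

    @abstractmethod
    def export(self):
        """Return records as dicts with 'uri' and 'tags' keys."""

    @abstractmethod
    def load(self, records):
        """Import unified records into the tool."""

class InMemoryTool(ToolWrapper):
    """Stand-in for a real service wrapper."""

    def __init__(self, name, records=()):
        self.name = name
        self.records = [dict(r) for r in records]

    def export(self):
        return [dict(r) for r in self.records]

    def load(self, records):
        self.records.extend(dict(r) for r in records)

def unify(tools):
    """Glue logic: merge records from all tools into one table keyed by URI."""
    table = {}
    for tool in tools:
        for rec in tool.export():
            entry = table.setdefault(
                rec["uri"], {"uri": rec["uri"], "tags": set(), "sources": set()})
            entry["tags"] |= set(rec.get("tags", ()))
            entry["sources"].add(tool.name)
    return table

def migrate(table, new_tool):
    """Transfer unified records to a tool that came along later."""
    new_tool.load({"uri": e["uri"], "tags": sorted(e["tags"])} for e in table.values())

tool_a = InMemoryTool("a", [{"uri": "http://example.org/p1", "tags": ["grid"]}])
tool_b = InMemoryTool("b", [{"uri": "http://example.org/p1", "tags": ["web2.0"]},
                            {"uri": "http://example.org/p2", "tags": ["chemistry"]}])
table = unify([tool_a, tool_b])
newer = InMemoryTool("newer")
migrate(table, newer)
print(sorted(table["http://example.org/p1"]["tags"]))
print(len(newer.records))
```

Because the unification table is independent of any one tool, a chosen tool disappearing only costs its wrapper, not the accumulated annotations.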
Current Status
■ Google Scholar, Windows Live Academic Search, del.icio.us, Connotea, CiteULike, OSCAR3 are Web Services
■ Debugging on 500 presentations and papers from my CGL research group
■ Experiment with GGF Presentations, Broad collection of Chemical Informatics resources (explore science document CI link) and Concurrency&Computation: Practice&Experience Web site (?business model for journals)
http://gf6.ucs.indiana.edu:48990/SemanticResearchGrid
Knowledge Model for Scientific Journals
■ There are classes of scientific journals
• Large circulation society journals effectively subsidized by fees of professional society membership; circulations can be more than 10,000
• “Popular” magazine style journals
• A few prestigious journals
• Many specialized journals publishing archival refereed papers with circulations from one hundred to a few thousand
■ The specialized journals largely sell a mix of paper and (a growing number of) electronic subscriptions to libraries and very few individuals subscribe
• Access is limited and expensive
• Even if one subscribes, one is often restricted on the number of full text papers one can access
• Collections like PubMed only include abstracts
■ Systems like Google Scholar, Microsoft Academic Live and Citeseer cannot fully analyze knowledge in papers unless they get access to full text
■ Current publishing model is hindering and not helping science
■ Similar discussion for journal papers and research data
Internet Business Models
■ How to make money on the Internet has been debated for many years
■ One can offer content (data on web) and/or services (user customizable transformations of web data)
■ Advertising is dominant model in large sites.
■ Content and Services can be free or paid by Transactions or Subscriptions.
• Often there is a mixed model with basic content/services free and one pays for premium features
■ One can charge reader or publisher.
• Advertising charges publisher of Advert
• In the past, journals were funded by page charges i.e. one charged the authors (institution) that produced paper
Examples of Internet Information and Knowledge Content and Business Model
■ iTunes and other music sources; at right price, people will pay for convenience
■ News web sites supported by a mix of advertising and premium content.
• Not clear latter successful except in specialized areas
■ Sites like http://www.chessbase.com/ with collections of Chess Games with occasional annotation
■ Several Financial Service sites
• Yahoo, Google etc. Financial Services with premium for real-time stock quotes
• Other sites feature commentary that is either free (supported by advertising) or premium content (such as Wall Street Journal and many stock picker sites) which you subscribe to
Examples of Internet Information and Knowledge Services and Business Model
■ Google etc. online Office versus more sophisticated paid Microsoft Office which also has "history" advantage as it owned the field before the Internet
■ WebEx collaboration services paid by transaction or subscription; not obviously a viable long term model
■ ICC Chess Site http://www.chessclub.com/ supports the community of chess players with free basic access but valuable premium features including better game playing, rating and real-time commentary. Other gaming sites similar
■ Amazon S3 and Computing Cloud paid services could be successful as the alternative (buy your own computers) costs real money and is perhaps less reliable
Publishing Business Model in the Internet Age
■ Journal publishing currently has a business model where the price reflects neither the cost nor the value-added
■ Publishers currently do not have significant internal expertise in new approaches/technologies to drive new business models
■ However much is outsourced already and so one can outsource to organizations with new expertise e.g. to those that know Web 2.0 rather than putting ink on paper
■ There is no clear new business model but it is plausible that the current model will not survive for that long
• So need to change even if less lucrative or success unclear
■ Note libraries provide funds to publishers and libraries will continue
• Not clear how fast libraries will change as they also don’t obviously have expertise to support new models
• Some think that one role of university libraries will be curation of data produced by university faculty
Strengths of Current Publishing Model
■ Permanent “guaranteed” archival storage but there are other approaches such as Amazon S3 to this
■ Uniform look and feel and copyediting to remove language errors.
• Useful but not so valuable that we can trade access for this.
• In particular can only correct some language errors as only a subject expert can really rewrite in good grammar and expression
■ Refereeing of a quality implied by the journal and the editorial board
• Most important strength but business model does not directly reflect this as only a small part of subscription price goes to editorial function
• For most papers cost of refereeing much less than other costs of producing paper
• Not clear why viewer should pay for refereeing
■ Large amount of pre-existing papers from old issues of journals
Pressures on Current Publishing Model
■ Mandated open access to scholarly work funded by government
• Cornyn-Lieberman bill in the US
• NIH PubMed Central requires deposit of full text of articles after a length of time
■ Electronic access to publisher sites is not especially good
■ Division of articles into journals and publishers is not very helpful today where technology does not care about location of information
• Location is just a rather simple annotation (metadata) specifying aspects of provenance of article
• Note a special issue of SKG2006 is just an annotation roughly characterizing nature and quality of work
■ Publishing on the Internet is not a valuable service and has been addressed by Web servers in general and by Web 2.0 in attractive ways
■ Essentially nobody reads or even has access to paper copies of journals
• Not clear it is useful to print specialized journals on paper
Scholarly Research Community Site
■ Best product should allow one to make best use of knowledge in scholarly publications and data
■ Should integrate journal and conference publications and services
■ Should contain integrated or support outside services for curation, annotation, analysis and search
• Looking at Web 2.0 successes, one needs to conveniently share data and set up communities
■ Content is scholarly journals and data
■ Services include
• Annotation as in Connotea, CiteULike, Del.icio.us
• Semantic analysis for citations, authors, chemical compounds etc.
• Biolicious style custom classifications including added value contacts
• Search as in Google Scholar, Microsoft Academic Live
• MySpace/Facebook/LinkedIn style services for existing or new contacts
• Support of conference and journal refereeing
• Other conference/journal services such as registration, advertising
• Integration with research such as electronic log books
• Internal integration e.g. Authors in citations are linked to community
• Links to more general document services such as Online Office style Tools and WebEx type collaboration
Business Model for Scholarly Journal/Research Community Site
■ One can charge for advertising, better content, better services or better implementation
■ Natural is to start with a basic free content and services with advertising.
• Content must be free eventually “by law”
• Services will have open source versions anyway so counter this with free basic services
■ One could use page charge model for charging for refereeing.
■ One charges user for features that add value. These include:
• Better or better implemented community/digital library services
• Premium Content possibly contracted by site owner
■ Problem with Advertising Business model: Audience specialized (i.e. small) but upscale
■ Problem with charging for Community Tools: Competing with free software but likely can offer much better service than free software just as WebEx does fine in spite of free VNC