Semantic Research Grid
Open Grid Forum Web 2.0 Workshop OGF21, Seattle Washington
October 15 2007
Geoffrey Fox, Aurel Cami, Ahmet Fatih Mustacoglu, Ahmet E. Topcu
Community Grids Laboratory,
Indiana University Bloomington IN 47404
Semantic Scholars Grid
[Figure: architecture diagram. Labels: Existing Document-based Tools (Google Scholar, Manuscript Central, Science.gov, Windows Live Academic Search, Citeseer, CMT Conference Management, etc.); Web service Wrappers; Existing User Interface; Integration; Enhancement; New Document-enhanced Research Tools; User Interface; Community Tools; Generic Document Tools]
Delicious Semantic Web/Grid
■ http://del.icio.us purchased by Yahoo for ~$30M
■ http://www.CiteULike.org
■ http://www.connotea.org (Nature)
■ Associate metadata with Bookmarks specified by URL’s, DOI’s (Digital Object Identifiers)
■ Users add comments and keywords (called tags)
■ Users are linked together into groups (communities)
■ Information such as title and authors extracted automatically from some sites (PubMed, ACM, IEEE, Wiley etc.)
■ Bibtex like additional information in CiteULike
■ This is perhaps de facto Semantic Web – remarkable
Example
■ ParallelComputing Collection selected on Cell Tag
■ So far no clear “winner” in tagging space
■ Maybe CiteULike with different metadata better
■ How do I preserve
General Document Semantic Analysis
■ Citeseer and Google Scholar scour the Internet and analyze documents for incidental metadata
• Title, author and institution of documents
• Citations with their own metadata allowing one to match to other documents
■ These capabilities are sure to become more powerful and to be extended
• Give “Citation Index” in real time
• Tell you all authors of all papers that cite a paper that cites you etc. (Note it’s a small world so don’t go too far in link analysis)
• Tell you all citations of all papers in a workshop
• Helps journal editor by suggesting referees based on document
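The link-analysis queries listed above can be illustrated on a toy citation graph. This is a minimal sketch, assuming a hypothetical in-memory map from each paper to the papers it cites and to its authors; the data and function names are invented for illustration and are not taken from Citeseer or Google Scholar.

```python
from collections import defaultdict

# Hypothetical toy data: paper id -> papers it cites, and paper id -> authors.
CITES = {
    "P1": [],            # "your" paper
    "P2": ["P1"],        # cites you
    "P3": ["P2"],        # cites a paper that cites you
    "P4": ["P2", "P3"],
}
AUTHORS = {"P1": ["You"], "P2": ["Ann"], "P3": ["Bob", "Cy"], "P4": ["Ann", "Dee"]}


def cited_by(cites):
    """Invert the citation map: paper id -> set of papers citing it."""
    inverse = defaultdict(set)
    for paper, refs in cites.items():
        for ref in refs:
            inverse[ref].add(paper)
    return inverse


def authors_within(paper, hops, cites=CITES, authors=AUTHORS):
    """Authors of papers reached by following exactly `hops` citing links."""
    inverse = cited_by(cites)
    frontier = {paper}
    for _ in range(hops):
        frontier = {citing for p in frontier for citing in inverse[p]}
    return sorted({a for p in frontier for a in authors[p]})
```

With `hops=1` the function lists the authors who cite a paper directly; with `hops=2` it lists the authors of papers that cite a paper that cites it. The frontier grows quickly with each hop, which is the "small world" caution in the slide.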
Possible challenges
■ Use of Web 2.0 tools in science (and business) is very promising but adoption is currently small
■ Which of many tools will be popular with your colleagues?
■ What happens if the tool you chose is not adopted or, worse, just disappears in an industry “shake-up”?
■ How best to integrate web-tagged documents with Word and LaTeX citations?
■ Need to tag URIs – e.g. database entries, not just URLs (did for journal control system)
■ Is the current security model sufficient?
■ Can we link virtual organization of tagging system with
Roughly what we are doing
■ We are NOT building a new tagging or search system
■ We are building tools integrating and adding value to existing systems
■ We built a mashup linking to del.icio.us, CiteULike, Connotea allowing exchange of tags between sites and between local repositories
■ Repositories also link to local sources (PubsOnline) and Google Scholar (GS) and Windows Live Academic (WLA)
• GS has number of cited publications.
• WLA has Digital Object Identifier (DOI)
■ We implement a rather more powerful access control mechanism
■ We build heuristic tools to mine “web lists” for citations
■ We have an “event” based architecture (consistency model) allowing change actions to be preserved and selectively changed
• Supports integrating different inconsistent views of a given document and
del.icio.us Tags
Download to Local System
Key Concepts of System Architecture
■ Digital Entity (DE): a digital collection of metadata for a citation
■ Event: a time-stamped action on a digital entity. Our event-based model consists of:
• Major Events:
  ■ Insertion or deletion of a digital entity
• Minor Events:
  ■ Modifications to an existing digital entity
• Dataset:
  ■ Collection of major and minor events
■ Service-based Framework (SOAP over HTTP)
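The concepts above map naturally onto a few small record types. This is a minimal sketch, not the SRG implementation: the class names, field names and the `EventKind` values are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class EventKind(Enum):
    INSERT = "insert"   # major event
    DELETE = "delete"   # major event
    MODIFY = "modify"   # minor event


@dataclass
class DigitalEntity:
    """Metadata describing one citation (field names are illustrative)."""
    de_id: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Event:
    """A time-stamped action on a digital entity."""
    de_id: str
    kind: EventKind
    changes: dict = field(default_factory=dict)
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    @property
    def is_major(self) -> bool:
        return self.kind in (EventKind.INSERT, EventKind.DELETE)


@dataclass
class Dataset:
    """A collection of major and minor events."""
    events: list = field(default_factory=list)

    def add(self, event: Event) -> None:
        self.events.append(event)

    def minor_events(self) -> list:
        return [e for e in self.events if not e.is_major]
```

Keeping the events separate from the entity they describe is what lets a history be inspected or partially replayed later.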
Example Subsystem
■ Transfer
■ Download/Upload
■ Modify Digital Entity (DE)
■ Share DE with other users
■ Add/Get More info on a DE
■ History (as a set of events) of a DE and rollback
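One way to read "history as a set of events" with rollback is event replay: the state of a DE is rebuilt by applying its events in order, and rolling back means replaying fewer of them. This is a sketch under that assumption; the `(kind, changes)` tuple layout and the `replay` function are hypothetical, not the SRG API.

```python
def replay(events, upto=None):
    """Rebuild DE metadata by applying events in order.

    `events` is a list of (kind, changes) pairs, oldest first; `upto` limits
    the replay to the first `upto` events, which is how rollback is expressed.
    Returns None if the entity does not exist at that point.
    """
    state = None
    for kind, changes in events[:upto]:
        if kind == "insert":
            state = dict(changes)
        elif kind == "modify" and state is not None:
            state.update(changes)
        elif kind == "delete":
            state = None
    return state


history = [
    ("insert", {"title": "Semantic Research Grid", "tags": ["grid"]}),
    ("modify", {"tags": ["grid", "web2.0"]}),
    ("modify", {"title": "Semantic Research Grid (SRG)"}),
]

current = replay(history)
rolled_back = replay(history, upto=2)   # state before the last modification
```

Because the events themselves are never discarded, a rollback is not destructive: replaying the full history again restores the latest state.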
[Figure: subsystem diagram. Labels: CiteULike, Connotea, Delicious, Research Database (three instances), Core Web]
SRG System Modules I
■ Digital Entity (DE) Management Service
• Manual DE entry into the system
• DE history
• DE versioning and flexible choices (rollback)
• Editing and more info tools for a DE (Update Model)
■ Session and Event Management Services
• Event and dataset management
• DE view options
• User credentials (username/password) - cookie-based
■ Annotation Tools Service
• Transfer Service
• Download service
• Upload Service
• Extract DE and tags from web lists
SRG System Modules II
■ Search Tools Services
• Google Scholar/Windows Live Academic
• Google Scholar Advanced
• Local Database Search:
  ■ Via integrated PubsOnline Tool from Indiana University
  ■ My Research Database
  ■ My Research Database Advanced
■ Authentication and Authorization Services
• Login and Logout service
• DE Access rights management
• Database access rights management
• Administrative tools
■ Other Services
• User Registration
• Username and password recovery
• User’s Profile Management
• DE metadata view options
Technical Issues
■ Event-based model
• Manipulating data and metadata
• How to build event-based model?
  ■ Major and Minor events
  ■ Datasets (collection of minor events)
• How to apply event-based model?
• How to apply modifications to a record (Digital Entity)?
  ■ Keep them in user’s session and let user apply them
  ■ Or apply them automatically to a DE
• How to merge metadata fields of Event and Digital Entity?
  ■ Identification of metadata fields as dynamic or static field
■ How to apply service-based framework as wrapper?
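The merge question can be made concrete with a small sketch. The slides do not define "static" and "dynamic" fields, so the reading used here is an assumption: a static field (for example title or authors) is not overwritten once set, while a dynamic field (for example tags) takes the value carried by the event. The field list and the `merge_event` function are hypothetical.

```python
# Assumption: which fields are static vs. dynamic is illustrative only.
STATIC_FIELDS = {"title", "authors", "doi"}


def merge_event(de_metadata, event_changes, static_fields=STATIC_FIELDS):
    """Return (merged, conflicts) without mutating the inputs."""
    merged = dict(de_metadata)
    conflicts = {}
    for name, value in event_changes.items():
        if name in static_fields and name in merged and merged[name] != value:
            conflicts[name] = value          # keep the stored value, report the clash
        else:
            merged[name] = value             # dynamic field, or static field not yet set
    return merged, conflicts
```

The returned `conflicts` could be held in the user's session for a manual decision, while the non-conflicting changes are applied automatically, which covers both options named in the slide.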
Some recent Features of SRG
• Hybrid Consistency Framework Implementation
– Data-centric strict consistency model
– Implements primary-copy based consistency protocol
– Pull-based:
• Time-based consistency approach.
• Communicates with Annotation Tools to collect updates periodically
– Push-based:
• Updates are distributed to Annotation Tools immediately after they occur on the primary copy
• Periodic Search Tools Implementation
– Search, compare and apply the updates made to a Digital Entity (DE) in the system.
• Unique 128-bit UUID assignment for each Digital Entity
• User Tags view in the system
– Displays all tags belonging to a user
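The push and pull paths of a primary-copy scheme can be sketched in a few lines. This is an illustration under stated assumptions, not the SRG code: `PrimaryCopy` and `Replica` are hypothetical classes, a replica stands in for an annotation tool, and a simple version counter replaces the real update detection. The identifier uses Python's standard `uuid.uuid4()`, which produces a random 128-bit UUID.

```python
import uuid


class Replica:
    """Stand-in for an annotation tool holding a copy of the entities."""

    def __init__(self):
        self.entities = {}
        self.version = 0

    def pull(self, primary):
        """Pull-based: called periodically to fetch updates from the primary copy."""
        if primary.version > self.version:
            self.entities = {k: dict(v) for k, v in primary.entities.items()}
            self.version = primary.version


class PrimaryCopy:
    """All writes go through the primary copy, which numbers them."""

    def __init__(self):
        self.entities = {}
        self.version = 0
        self.push_targets = []

    def insert(self, metadata):
        de_id = str(uuid.uuid4())            # random 128-bit identifier
        self._write(de_id, metadata)
        return de_id

    def update(self, de_id, changes):
        self._write(de_id, {**self.entities[de_id], **changes})

    def _write(self, de_id, metadata):
        self.entities[de_id] = dict(metadata)
        self.version += 1
        for replica in self.push_targets:    # push-based: propagate immediately
            replica.pull(self)
```

A replica registered in `push_targets` sees every write at once; one that is not registered stays stale until its next periodic `pull`, which is the trade-off between the two approaches.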
Metadata Collection from CGL web pages
• The aim is to
– Eliminate duplicate data entry in different web platforms.
– Build richer metadata in SRG using the Digital Entities collected from web pages as a base.
– Share new Digital Entities with other tools and users in SRG
Methodology for Collection
• Collect:
– Digital Entities in Community Grid Publication web pages.
• Analyze:
– Use a heuristic methodology to extract metadata fields of the Digital Entities for CGL publications
• Build:
– RSS objects using collected Digital Entities.
– New tags using collected Digital Entities.
• Compare:
– Collected Digital Entities from CGL web pages with the existing Digital Entities in SRG.
• If they are:
– different: Store new Digital Entities in SRG storage.
– same: Option to update tags and other fields.
• Share:
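The Compare step can be sketched as follows. The slides do not say how two Digital Entities are judged to be the same, so matching on a normalized title is an assumption made here for illustration; `normalize` and `reconcile` are hypothetical names.

```python
def normalize(title):
    """Crude matching key (an assumption): lowercase alphanumerics only."""
    return "".join(ch for ch in title.lower() if ch.isalnum())


def reconcile(collected, existing):
    """Split collected entities into new ones and matches with stored ones.

    Both arguments are lists of dicts with at least a "title" key.
    Returns (new_entities, matches) where matches pairs (collected, stored).
    """
    index = {normalize(de["title"]): de for de in existing}
    new_entities, matches = [], []
    for de in collected:
        stored = index.get(normalize(de["title"]))
        if stored is None:
            new_entities.append(de)          # different: store in SRG
        else:
            matches.append((de, stored))     # same: offer tag/field update
    return new_entities, matches
```

Entries in `new_entities` would be stored, while each pair in `matches` gives the user the option to update tags and other fields on the stored entity.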
Security Model
■ Security in Web 2.0 can be limited
■ We implement a simple but more powerful security model around local tools that wrap Web 2.0 systems
■ We used an access-control matrix model to provide security for our information system
• Supports multiple groups and multiple users for each object.
• Similar to UNIX file system
  ■ The Unix RWX bits correspond to Read, Write, and Execute operations for each file and directory.
• In SRG, a DE (Digital Entity) corresponds to the file element and a folder corresponds to the directory element.
• For each DE and folder, there are three types of access rights
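An access-control matrix with several users and groups per object can be sketched as below. The slide does not name the three SRG access rights, so reusing the Unix Read/Write/Execute bits is an assumption; `Right` and `AccessMatrix` are hypothetical names, not the SRG implementation.

```python
from enum import Flag


class Right(Flag):
    NONE = 0
    EXECUTE = 1          # values mirror the Unix rwx bit positions
    WRITE = 2
    READ = 4


class AccessMatrix:
    """Rows are subjects (users or groups), columns are objects (DEs or folders)."""

    def __init__(self):
        self._cells = {}                     # (subject, object_id) -> Right
        self._groups = {}                    # group name -> set of user names

    def add_to_group(self, group, user):
        self._groups.setdefault(group, set()).add(user)

    def grant(self, subject, object_id, rights):
        key = (subject, object_id)
        self._cells[key] = self._cells.get(key, Right.NONE) | rights

    def allowed(self, user, object_id, right):
        """True if the user, or any group the user belongs to, holds `right`."""
        subjects = [user] + [g for g, members in self._groups.items() if user in members]
        return any(right in self._cells.get((s, object_id), Right.NONE) for s in subjects)
```

Unlike the single owner and single group of a Unix file, each object here can carry entries for any number of users and groups, which is the extension the slide describes.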
Security Model II
■ We have a security model that supports
• Level of Authorization
  ■ Roles are defined as Super Administrator (SA), Group Administrator (GA), and User (U)
  ■ The system allows having more than one SA.
  ■ An existing SA can add other SAs to the system.
  ■ An SA can assign any U to become a GA, and remove a GA from a group.
  ■ Each group should have at least one GA. A GA can add/remove a U from the group
• User profile
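The authorization rules above can be written down as a small set of checks. This is a sketch only: the `Roles` class, its method names and the `RoleError` exception are hypothetical, and enforcing the "at least one GA" rule when a GA is removed is one possible reading of the slide.

```python
class RoleError(Exception):
    pass


class Roles:
    def __init__(self, first_sa):
        self.super_admins = {first_sa}       # more than one SA is allowed
        self.group_admins = {}               # group -> set of GAs
        self.members = {}                    # group -> set of users

    def add_sa(self, actor, user):
        self._require_sa(actor)
        self.super_admins.add(user)

    def assign_ga(self, actor, group, user):
        self._require_sa(actor)
        self.group_admins.setdefault(group, set()).add(user)

    def remove_ga(self, actor, group, user):
        self._require_sa(actor)
        admins = self.group_admins.get(group, set())
        if admins == {user}:
            raise RoleError("each group should keep at least one GA")
        admins.discard(user)

    def add_member(self, actor, group, user):
        self._require_ga(actor, group)
        self.members.setdefault(group, set()).add(user)

    def remove_member(self, actor, group, user):
        self._require_ga(actor, group)
        self.members.get(group, set()).discard(user)

    def _require_sa(self, actor):
        if actor not in self.super_admins:
            raise RoleError("only an SA may do this")

    def _require_ga(self, actor, group):
        if actor not in self.group_admins.get(group, set()):
            raise RoleError("only a GA of the group may change membership")
```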
Current Usage of Semantic Research Grid Project
■ We have used/tested Semantic Research Grid (SRG) (a prototype model) for published scientific research publications in Community Grids Lab at Indiana University
■ In CGL, 20 students, post-docs and faculty members are testing
■ They are using the prototype model for collecting of
Summary
■ Integration
• We have successfully integrated the Google Scholar and Windows Live Academic search tools and the CiteULike, Delicious, and Connotea annotation tools, providing a system that allows dynamic publication.
■ Flexibility and Extensibility
• We provide flexibility allowing integration of different tools having common metadata.
• Easy to add and extend service mechanism
■ Management and Consistency Scheme of Digital Entities
• Allows the manipulation of a digital entity
• Applies Event-based model based on the concept of:
  ■ Major events
  ■ Minor events
  ■ Datasets
• Provides a rollback feature to:
  ■ Support a history tool for a DE
  ■ Merge and change the content of a digital entity
• A service-based framework for using existing annotation tools through web services
Domain Specific Semantic Document Analysis
■ It is natural to develop core document Services such as those used in Citeseer/Google Scholar but applied to “your” documents of interest that may not have been processed yet
• As just submitted to a conference perhaps
■ These tools can help form useful lists such as authors of all cited or submitted papers to a journal
■ OSCAR3 (from Peter Murray-Rust’s group at Cambridge) augments the application independent “core” metadata (Title, authors, institutions, Citations) with a list of all chemical terms
• This tool is a Service that can be applied to “your” document or to a set of documents harvested in some fashion
• Luis Rocha has developed related ideas for Biology
• Other fields have natural application specific metadata and OSCAR like tools can be developed for them
OSCAR3 Chemistry Document analysis
■ It detects “magic” chemical strings in text and then
• Stores them as metadata associated with document
■ Queries ChemInformatics repositories to tell you lots of information about identified compounds
■ Tells you which other
Initial Results from OSCAR on PubMed
■ We have a small sample (100) of full text Chemistry papers selected at random from 15 years of PubMed with over 5 million abstracts
• OSCAR3 generates 4.17 compound names per abstract
• and 36.7 compound names per full text
• 555,007 PubMed abstracts of 2005 – 2006 (part) used for Abstracts (on Big Red)
CICC Chemical Informatics Cyberinfrastructure Collaboratory
[Figure: workflow diagram. Labels: PubMed Database, OSCAR Text Analysis, POV-Ray Parallel Rendering, Initial 3D Structure Calculation, Toxicity Filtering, Cluster Grouping, Docking, Molecular Mechanics Calculations, Quantum Mechanics Calculations, IU’s Varuna Database, NIH PubChem Database, PubChem Database, MOAD Database]
Product databases are wrapped with Web service interfaces and are suitable for inclusion in Taverna workflows.