FishGraph: A Network-Driven
Data Analysis
Patrícia Cavoto*, Victor Cardoso*, Régine Vignes Lebbe§, André Santanchè*
*UNICAMP – University of Campinas, São Paulo, Brazil
§ISYEB - UMR 7205 – CNRS, MNHN, UPMC, EPHE UPMC Univ. Paris 06, Sorbonne Universités, Paris, France
Outline
• Motivation
• Goal
• ReGraph: from FishBase to FishGraph
• Data Experiments
Motivation
Collaborative research involving:
LIS - Laboratory of Information Systems – UNICAMP, Brazil
MNHN - National Museum of Natural History and Sorbonne Universités – Paris, France
FishBase Consortium
Motivation
FishBase: a relational database and information system for
biological data storage of fish species, with millions of records
containing:
Species, taxonomic classification and predators
Locations (country and ecosystem)
Motivation
Identification Key:
A mechanism used in biology to identify a specimen
Composed of a set of questions that guide scientists through the identification
Has one or more species associated
Similar to a decision tree
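To make the decision-tree analogy concrete, here is a minimal sketch of a key as nested questions. The `Question` class, the `identify` function, the tree layout and the species names are all hypothetical and only illustrate the idea; they are not FishBase structures.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Question:
    """One step of a key: alternative statements, each leading onward."""
    # Maps each statement to the next question or to the identified species.
    options: Dict[str, Union["Question", List[str]]]


def identify(node, choose):
    """Walk the key from `node`, letting `choose` pick a statement at each step."""
    while isinstance(node, Question):
        statement = choose(list(node.options))
        node = node.options[statement]
    return node  # the species associated with the leaf that was reached


# Hypothetical two-question key; species names are placeholders.
toy_key = Question({
    "Adipose fin present": Question({
        "Body eel-like": ["species A"],
        "Body not eel-like": ["species B", "species C"],
    }),
    "Adipose fin absent": ["species D"],
})

# Characters observed on the specimen being identified.
observed = {"Adipose fin present", "Body not eel-like"}
result = identify(toy_key, lambda statements: next(s for s in statements if s in observed))
print(result)
```

A leaf holds a list because a key can end in one or more associated species.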
Identification Key Example
6 - Freshwater fishes of Africa (excerpt, one pair of alternatives per line)
• Five pairs of external gill slits / Single, or single pair of gill openings
• Head without extended rostrum, gill slits lateral / Head with extended rostrum, gill slits ventral
• Body without scales, or scales small and not clearly visible / Body with clearly visible scales
• Body slender, elongate and eel-like / Body not eel-like
• …
Identification Key Problem
6 - Freshwater fishes of Africa
… Adipose fin present / Adipose fin absent …
1419 - Species of Schilbe of Africa
… Adipose fin present / Adipose fin absent …
adapted from: http://fishbase.org/keys/description.php?keycode=6 http://fishbase.org/keys/description.php?keycode=1419
Motivation
Biological data (as in FishBase) form a big network
Biologists need network analysis to:
Identify the most important species in a specific food chain;
Define areas (or species) for preservation;
Find relations in a network of identification keys.
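As one illustration of the first need, the sketch below ranks species in a toy food web by the number of feeding links they take part in. Using plain degree as the measure of "importance" is an assumption made here for brevity, and the predator and prey pairs are hypothetical, not FishBase records.

```python
from collections import Counter

# Hypothetical predator -> prey pairs; names are placeholders.
eats = [
    ("shark", "tuna"),
    ("shark", "mackerel"),
    ("tuna", "mackerel"),
    ("tuna", "sardine"),
    ("mackerel", "sardine"),
]


def degree_ranking(edges):
    """Rank nodes by the number of feeding links they take part in."""
    degree = Counter()
    for predator, prey in edges:
        degree[predator] += 1
        degree[prey] += 1
    # Sort by degree (descending), then by name for a stable order.
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))


ranking = degree_ranking(eats)
print(ranking)
```

Other centrality measures could be substituted without changing the shape of the analysis.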
Motivation
How to support biologists in
network-driven analysis?
Goal
Build a network database for analysis from a relational database
ReGraph: from FishBase to FishGraph
Graph databases: Very effective in network analysis
Flexible structure
Easy to traverse transitive relationships
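A minimal sketch of a transitive traversal, assuming a child-to-parent `belongs_to` map with placeholder node names. In a relational schema the same question typically needs one join per level or a recursive query.

```python
# Hypothetical belongs_to edges (child -> parent); names are placeholders.
belongs_to = {
    "species:s1": "genus:g1",
    "genus:g1": "family:f1",
    "family:f1": "order:o1",
    "order:o1": "class:c1",
}


def ancestors(node, parent_of):
    """Follow belongs_to edges transitively and return every ancestor in order."""
    path = []
    seen = {node}
    while node in parent_of:
        node = parent_of[node]
        if node in seen:  # guard against accidental cycles in the data
            break
        seen.add(node)
        path.append(node)
    return path


lineage = ancestors("species:s1", belongs_to)
print(lineage)
```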
ReGraph: from FishBase to FishGraph
ReGraph: a framework that generates a graph database from a relational database.
ReGraph: from FishBase to FishGraph
ReGraph: keeps the graph database synchronized with the relational database (one-way synchronization).
ReGraph: from FishBase to FishGraph
ReGraph: relational and graph databases keep their native form.
Figure label: Current systems
ReGraph: from FishBase to FishGraph
ReGraph: allows adding new data in the graph database.
ReGraph: from FishBase to FishGraph
ReGraph: mapped and annotated subgraphs are integrated and available for analysis.
ReGraph: from FishBase to FishGraph
ReGraph: connects data in the local graph with global graphs on the web.
Semantic Web
ReGraph: from FishBase to FishGraph
STEPS:
1. Map data from relational database to graph
2. Run the ETL process to load initial data
3. The synchronization process starts running after the initial load
4. Add new information as annotation (optional)
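The four steps can be sketched with an in-memory SQLite database standing in for the relational source. This is not ReGraph's implementation: the table names, column names and mapping format are assumptions, and step 3 is reduced to a full reload rather than incremental synchronization.

```python
import sqlite3

# Hypothetical relational source; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE genera  (gen_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE species (spec_id INTEGER PRIMARY KEY, name TEXT, gen_id INTEGER);
    INSERT INTO genera  VALUES (1, 'genus A');
    INSERT INTO species VALUES (10, 'species A1', 1), (11, 'species A2', 1);
""")

# Step 1: the mapping says which rows become nodes and which foreign keys become edges.
node_mappings = [
    ("GENUS", "SELECT gen_id, name FROM genera"),
    ("SPECIES", "SELECT spec_id, name FROM species"),
]
edge_mappings = [
    ("belongs_to", "SPECIES", "GENUS", "SELECT spec_id, gen_id FROM species"),
]


def load(connection):
    """Step 2: run the load and return (nodes, edges)."""
    nodes, edges = {}, set()
    for label, query in node_mappings:
        for row_id, name in connection.execute(query):
            nodes[(label, row_id)] = {"name": name}
    for edge_type, src_label, dst_label, query in edge_mappings:
        for src_id, dst_id in connection.execute(query):
            edges.add(((src_label, src_id), edge_type, (dst_label, dst_id)))
    return nodes, edges


nodes, edges = load(conn)

# Step 3 (naive one-way synchronization): reload after the relational side changes.
conn.execute("INSERT INTO species VALUES (12, 'species A3', 1)")
nodes, edges = load(conn)

# Step 4: annotations live only in the graph and never flow back to the source.
annotations = {(("SPECIES", 10), "note"): "added by a curator"}
print(len(nodes), len(edges))
```

Keeping annotations in a separate structure means a reload cannot overwrite them, which mirrors the one-way direction of the synchronization.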
ReGraph: from FishBase to FishGraph
ReGraph: used to generate FishGraph (graph database) from FishBase (relational database)
ReGraph: from FishBase to FishGraph
Figure labels: GENERA, FAMILIES, ORDERS, CLASSES
ReGraph: from FishBase to FishGraph
Figure labels: SPECIES, ECOSYSTEM, COUNTRY, KEY, GENUS, FAMILY, ORDER, CLASS; edge label: belongs_to
Experiments: Identification Key Analysis
Data used:
Identification keys
Species
Geographic locations (countries and ecosystems)
Experiments: Identification Key Analysis
Long term goal:
Start the identification process from any node across several identification keys.
Goals in this analysis:
Find similarities between keys
Find differences between keys
Analyze groups of keys
The annotated “share” edge connects keys that share at least one species.
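A minimal sketch of how "share" edges could be derived, assuming each key is available as a set of associated species. The key and species identifiers are placeholders.

```python
from itertools import combinations

# Hypothetical key -> species associations; identifiers are placeholders.
key_species = {
    "key 1": {"sp1", "sp2"},
    "key 2": {"sp2", "sp3"},
    "key 3": {"sp4"},
}


def share_edges(associations):
    """Return {(key_a, key_b): shared species} for pairs sharing at least one species."""
    edges = {}
    for a, b in combinations(sorted(associations), 2):
        common = associations[a] & associations[b]
        if common:
            edges[(a, b)] = common
    return edges


shared = share_edges(key_species)
print(shared)
```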
Experiments: Identification Keys Analysis
Figure labels: SPECIES, COUNTRY, GENUS, FAMILY, CLASS; edge label: belongs_to
Experiments: Identification Keys Analysis
Components are based on the “share” edge, which connects two or more distinct keys with their associated species.
A component is a subgraph in which there is a path between every pair of nodes.
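The definition above corresponds to connected components of an undirected graph. A minimal breadth-first sketch follows; the key names and edges are placeholders.

```python
from collections import defaultdict, deque


def connected_components(nodes, edges):
    """Group nodes so that any two nodes in a group are linked by some path."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            # The set difference is a new set, so updating `seen` here is safe.
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        components.append(component)
    return components


# Hypothetical keys linked by "share" edges.
keys = ["key 1", "key 2", "key 3", "key 4"]
share = [("key 1", "key 2"), ("key 2", "key 3")]
components = connected_components(keys, share)
# Keep only components that connect two or more keys.
linked = [c for c in components if len(c) > 1]
print(linked)
```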
Experiments: Identification Keys Analysis
Figure: identification keys 205 and 316 and their species; species colored by family (same order and class)
205 - Key to the species of scorpionfishes occurring in the Western Central Pacific.
316 - Key to the species of Indo-Pacific Scorpionfish (Genus Scorpaenopsis).
Experiments: Taxonomic Classification Analysis
Data used: Class, Order, Family, Genus, Species
Experiments: Taxonomic Classification Analysis
Goals:
Compare data in FishGraph with data in a global graph (DBpedia)
Find divergences
Propose reviews
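A minimal sketch of the comparison, assuming both classifications have already been fetched into dictionaries keyed by species name. Retrieving data from DBpedia and matching names across sources are not shown, and all values are placeholders.

```python
RANKS = ("genus", "family", "order", "class")


def find_divergences(local, remote):
    """Return (species, rank, local value, remote value) wherever the sources disagree."""
    divergences = []
    for species in sorted(local.keys() & remote.keys()):
        for rank in RANKS:
            ours = local[species].get(rank)
            theirs = remote[species].get(rank)
            # Only report a divergence when both sources state a value.
            if ours is not None and theirs is not None and ours != theirs:
                divergences.append((species, rank, ours, theirs))
    return divergences


# Hypothetical classifications; every name is a placeholder.
local_graph = {
    "species A": {"genus": "g1", "family": "f1", "order": "o1", "class": "c1"},
    "species B": {"genus": "g2", "family": "f2", "order": "o1", "class": "c1"},
}
global_graph = {
    "species A": {"genus": "g1", "family": "f1", "order": "o2", "class": "c1"},
    "species B": {"genus": "g2", "family": "f2", "order": "o1", "class": "c1"},
}

divergences = find_divergences(local_graph, global_graph)
print(divergences)
```

Each reported tuple is a candidate for review rather than a confirmed error, since either source may be the outdated one.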
Conclusions
Graph databases to perform network analyses
One-way synchronization
Annotations and connection with other sources on the Web
Network-driven data analysis for knowledge discovery:
Identification keys
Conclusions
Future Work:
Record the provenance of data obtained from web sources
Organize “same as” nodes in the local graph
Enable distinct graph mappings from one relational model
Thank you!
Acknowledgments: Unicamp, LIS members, FishBase Consortium, FAPESP, CNPq, CAPES