CS 846 Software Engineering for Big Data Regular Paper Presentation

(1)

Viziometrics: Analyzing Visual Information in the Scientific Literature

Authors: Po-Shen Lee, Jevin D. West, and Bill Howe

Source: IEEE Transactions on Big Data (Volume: 4, Issue: 1, March 1st 2018)

Presented by: Cheryl Lao

Date: 03/11/20


Hello, I’m Cheryl Lao and today I’ll be presenting a paper called “Viziometrics: Analyzing Visual Information in the Scientific Literature”.

(2)

Overview

01 Motivation

02 Viziometrics: A summary of the Viziometrics paper

03 Key Findings: The key insights from the paper

04 Viziometrics and CS 846: How does this paper relate to the class?

05 Discussion: Discussion points and questions

06 References: References for papers mentioned

Here is an overview of what we will be talking about today.

Before I dive into the content of the paper, I’m going to start by introducing the motivation behind the research.

The bulk of this presentation will focus on the content of the Viziometrics paper. The Key Findings section will summarize the important observations, ideas, and results in the paper.

The Viziometrics and CS 846 section will relate this paper to the content of this course.

The Discussion section will bring up strengths, weaknesses, related works and future work for the paper and offer questions for further discussion.

The References section will have references to all the cited works.

(3)

Motivation

● Figures are information-dense objects that are critical to scientific communication

● Existing bibliometric and scientometric studies focus on measuring citation networks and scientific literature, respectively

● However, little research has focused on the relationship between figure use and citation information

Visual information is a key part of scientific literature, and yet there is little research on how these visual encodings differ across disciplines or how they relate to scientific impact of the papers in which they appear. Bibliometrics and scientometrics focus on measuring citation networks and scientific literature, respectively, but visuals are difficult to study because they aren’t directly machine-readable. Humans are known to retain information better when it is presented visually, so it makes sense that visuals in scientific papers play a large role in aiding reader understanding.

(4)

Research Goals

1. Build new tools and services based on visual information to help researchers find information more effectively

2. Create a set of best practices for scientific communication across disciplines

a. Do the patterns of encoding visual information vary across disciplines and over time?

b. Are visual information encoding patterns related to scientific impact?

The researchers had two main goals:

● to build new tools and services that help researchers find visual results more efficiently

● to create a set of best practices for visual communication in scientific literature across disciplines

This paper also aims to answer the following questions:

● How do patterns of encoding visual information in the literature vary across disciplines?

● How have patterns of encoding visual information in the literature evolved over time?

● Is there any link between patterns of encoding visual information and scientific impact?

(5)

Dataset & Methodology

● http://viziometrics.org/

● Analysis pipeline and result presentation

● 4.8 million figures from more than 650,000 PubMed Central (PMC) papers

○ Equations

○ Diagrams

○ Photos

○ Tables

○ Plots

● Over 10 million unique figures classified in total

(6)

Figure Analysis


This is a diagram explaining the figure analysis pipeline for classification. The process starts with downloading images from Amazon’s AWS S3 service. Metadata related to the images is stored in the database as well.

Each of the images is classified as a singleton or multi-chart figure. Multi-chart figures are identified and dismantled. The singletons and dismantled multi-chart figure components are then classified into one of the 5 categories.

The classified images are stored in a database and accessible from viziometrics.org.

(7)

Figure Classification

● Create “fingerprints” to classify images

○ 1) Extract patches and normalize them

○ 2) Cluster with k-means (k=200) to find codebook patches

○ 3) Compute histograms

○ 4) Classify with SVM

● 91.5% classification accuracy!

The authors use a technique where “fingerprints” are used to classify images. Patches from images are extracted, clustered into groups and then re-encoded as a histogram “fingerprint”.

In this paper, images are normalized to 128px X 128px greyscale images, from which patches can be extracted.

The patches are then clustered using k-means (k=200). K-means is a vector quantization technique that groups some number of samples into k groups that each have a representative sample that is closest to the group mean. The representative image patch is called the codebook patch.

Patches from each training image are matched to the codebook patches and the codebook counters are summarized into histograms.
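To make the fingerprint idea concrete, here is a minimal sketch of steps 1 to 3 in plain Python. It is not the authors' code: the function names, the tiny 2×2 patch size, and the use of cluster means as codebook entries are my own assumptions for illustration, and the final SVM step is left out because it would need a machine learning library.

```python
import random


def extract_patches(image, size=2):
    """Slide a size x size window over a 2-D greyscale image (list of rows)
    and return each patch flattened, with its mean brightness subtracted."""
    patches = []
    for r in range(len(image) - size + 1):
        for c in range(len(image[0]) - size + 1):
            flat = [image[r + i][c + j] for i in range(size) for j in range(size)]
            mean = sum(flat) / len(flat)
            patches.append([v - mean for v in flat])  # simple normalization
    return patches


def sq_dist(a, b):
    """Squared Euclidean distance between two flattened patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(patch, codebook):
    """Index of the codebook entry closest to `patch`."""
    return min(range(len(codebook)), key=lambda i: sq_dist(patch, codebook[i]))


def kmeans(patches, k, iterations=10, seed=0):
    """Plain Lloyd's k-means; the k cluster means serve as the codebook
    (an assumption: the paper's codebook patches may be chosen differently)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(patches, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in patches:
            groups[nearest(p, centroids)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster goes empty
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def fingerprint(image, codebook, size=2):
    """Normalized histogram of codebook matches: the image's 'fingerprint'."""
    patches = extract_patches(image, size)
    counts = [0] * len(codebook)
    for p in patches:
        counts[nearest(p, codebook)] += 1
    return [c / len(patches) for c in counts]


# Toy 4x4 "image": dark left half, bright right half.
image = [[0, 0, 255, 255] for _ in range(4)]
codebook = kmeans(extract_patches(image), k=2)
print(fingerprint(image, codebook))
```

In the paper's setting the codebook would have 200 entries and each histogram would be passed to an SVM as the feature vector for that figure.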

(8)

Figure Dismantling

● Multi-chart figures complicate classification

● ~30% of figures needed dismantling

● The algorithm splits the figure recursively, classifies the fragments and then repeats the classifications on a recursive merge

● 82.9% recall, 84.3% precision

About 30% of the figures extracted from the corpus were multi-chart figures that required dismantling. Each multi-chart figure is split recursively into vertical and horizontal sections based on background colour and layout patterns until no further fragments can be found. An SVM-based binary classifier then decides whether each fragment is a complete chart. The fragments are then merged recursively and the same classifier re-classifies the merged pieces. Discrepancies between the merging and the splitting steps are resolved with a heuristic score function that chooses between the two different subfigure fragment options based on visual homogeneity. This process resulted in an 82.9% recall rate and 84.3% precision. The figure above shows how the dismantling would occur (left to right).
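The recursive splitting step can be sketched roughly as follows. This is only an illustration under my own assumptions: the image is a 2-D list of greyscale values, a "gutter" is a run of at least `min_gap` rows or columns that are pure background, and the function names are hypothetical. The SVM completeness check, the recursive merge, and the heuristic score from the paper are not shown.

```python
def content_segments(lines, background, min_gap):
    """Return (start, end) ranges of content, cutting wherever at least
    `min_gap` consecutive lines consist only of the background value."""
    segments, start, blank = [], None, 0
    for i, line in enumerate(lines):
        if all(v == background for v in line):
            blank += 1
            if start is not None and blank >= min_gap:
                segments.append((start, i - blank + 1))
                start = None
        else:
            if start is None:
                start = i
            blank = 0
    if start is not None:
        segments.append((start, len(lines) - blank))
    return segments


def split_figure(image, background=255, min_gap=2):
    """Recursively cut an image into fragments along horizontal gutters,
    then vertical gutters, until no gutter remains."""
    rows = content_segments(image, background, min_gap)
    if len(rows) > 1:
        return [part for r0, r1 in rows
                for part in split_figure(image[r0:r1], background, min_gap)]
    cols = content_segments(list(zip(*image)), background, min_gap)
    if len(cols) > 1:
        return [part for c0, c1 in cols
                for part in split_figure([row[c0:c1] for row in image],
                                         background, min_gap)]
    return [image]  # no gutter left: treat as a single fragment


# Toy 2x2 grid of panels separated by white (255) gutters.
W = 255
grid = [
    [0, 0, W, W, 0, 0],
    [W, W, W, W, W, W],
    [W, W, W, W, W, W],
    [0, 0, W, W, 0, 0],
]
print(len(split_figure(grid)))  # number of fragments found
```

A splitter this simple over-segments charts that contain internal whitespace, which is one reason a merge step and a classifier are needed afterwards.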

(9)

Multi-Chart Figure Classification

● Pre-classifier to prevent wasting time on dismantling every single figure

● Two factors can be used to identify multicharts:

○ Size and shape of image

○ The layout as described by the splitting algorithm

● 91.8% accuracy achieved

(10)

Measuring Scholarly Influence

● Scholarly influence can be measured using citation networks

● Article-level Eigenfactor (ALEF) score is a ranking method based on the PageRank algorithm

○ Number of steps that the random walker uses is reduced compared to PageRank

○ Random walker teleports to links rather than to nodes as in PageRank

Scholarly influence can be measured using the citations of a paper. The Article-level Eigenfactor (ALEF) score is a ranking method that operates on a citation graph in which each paper is a node and each citation is a directed edge. ALEF modifies the PageRank algorithm (the webpage ranking algorithm by Larry Page) to reduce the number of steps that the random walker takes and to teleport the random walker to links instead of nodes. The ALEF ranking method has been shown to outperform PageRank.
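For reference, here is a minimal sketch of the baseline that ALEF starts from: standard PageRank by power iteration on a small citation graph. This is not ALEF itself. The function name, the conventional damping value of 0.85, and the toy graph are my own assumptions, and the `iterations` and `teleport` parameters are only there to show where a shorter walk or a non-uniform teleport distribution would plug in.

```python
def pagerank(citations, damping=0.85, iterations=50, teleport=None):
    """Power-iteration PageRank on a citation graph.

    citations: dict mapping each paper to the list of papers it cites.
    teleport:  optional dict of teleport probabilities summing to 1
               (defaults to uniform over all papers).
    """
    papers = sorted(set(citations) | {q for cited in citations.values() for q in cited})
    if teleport is None:
        teleport = {p: 1.0 / len(papers) for p in papers}
    rank = dict(teleport)
    for _ in range(iterations):
        # Rank held by papers that cite nothing is redistributed via teleport.
        dangling = sum(rank[p] for p in papers if not citations.get(p))
        new_rank = {p: (1 - damping + damping * dangling) * teleport[p] for p in papers}
        for p in papers:
            cited = citations.get(p, [])
            for q in cited:
                new_rank[q] += damping * rank[p] / len(cited)
        rank = new_rank
    return rank


# Toy graph: papers A and B both cite paper C.
scores = pagerank({"A": ["C"], "B": ["C"], "C": []})
print(max(scores, key=scores.get))  # the most influential paper in this toy graph
```

Each paper's score is the long-run share of time a random reader following citations would spend on it, so a paper cited by many well-cited papers ranks highly.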

(11)

Visual Patterns in Literature

● The classified charts were used to study patterns in the use of visual content over:

○ Scientific domain

○ Publication venue

○ Time

○ Scholarly impact

● How are complex results communicated between disciplines and to the general public?

(12)

Dataset Details & Preprocessing

● Avoid bias by removing papers that meet any of these criteria:

○ No ALEF score

○ Page count could not be determined

○ Published before 1997

○ No figures or inaccessible figures

● Other sources of bias distributed uniformly

● Note: PMC submission is voluntary

● This particular dataset is also skewed towards human biology

Almost all datasets contain bias, and it is up to researchers to preprocess them to reduce it. The authors preprocess the data by removing four kinds of papers from the dataset:

1) Remove papers without an ALEF score: The analysis depends on an ALEF score, so papers with zero citations or other factors preventing them from having an ALEF score need to be removed.

2) Remove papers where page count could not be determined: The figure density calculation depends on page count, but a small percentage of the papers (<10%) did not have a provided PDF file or PMC page count.

3) Remove papers published before 1997: For this particular dataset of papers from PMC, the number of papers per year before 1997 was less than 300 and skewed towards a small number of journals.

4) Remove papers with no figures or inaccessible figures: Papers with no figures were either published without figures or their figures were not uploaded to PMC. Older papers with more citations were more likely to not provide their figures to PMC, and their high ALEF scores would skew the data if included.
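The four exclusion rules above can be sketched as a simple filter. The record layout here (a dict with `alef`, `pages`, `year`, and `figures` keys) and the function names are hypothetical and chosen only for illustration; the sample records are made up and are not data from the paper.

```python
def keep_paper(paper):
    """Return True if a paper record survives all four exclusion rules."""
    has_alef = paper.get("alef") is not None      # 1) has an ALEF score
    has_pages = bool(paper.get("pages"))          # 2) page count is known
    recent = paper.get("year", 0) >= 1997         # 3) not published before 1997
    has_figures = paper.get("figures", 0) > 0     # 4) has accessible figures
    return has_alef and has_pages and recent and has_figures


def figure_density(paper):
    """Figures per page, the quantity that makes the page count necessary."""
    return paper["figures"] / paper["pages"]


# Made-up records, one violating each rule plus one that passes.
papers = [
    {"id": "p1", "alef": 0.8, "pages": 10, "year": 2005, "figures": 5},
    {"id": "p2", "alef": None, "pages": 8, "year": 2010, "figures": 3},
    {"id": "p3", "alef": 0.2, "pages": None, "year": 2012, "figures": 4},
    {"id": "p4", "alef": 0.5, "pages": 6, "year": 1990, "figures": 2},
    {"id": "p5", "alef": 0.9, "pages": 12, "year": 2001, "figures": 0},
]
kept = [p for p in papers if keep_paper(p)]
print([p["id"] for p in kept], [figure_density(p) for p in kept])
```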

These four categories cover most of the sources of biases and the other biases are distributed uniformly, so removal will not have a big impact on the results. It should be noted that submission to PMC is voluntary, so the dataset is biased towards authors

(13)

(14)

Patterns Across Disciplines

● Journal and research topic variations

● ArticleInfluence (AI) scores were used to rank journals

● Journals with high figure-per-page count tended to have higher AI as well

○ Exception: journals with prose-oriented case studies

● Papers from the top ⅓ journals have more figures

● Certain fields have a certain “syntax” which may account for differences

To identify patterns, papers were aggregated by journal and research topics to see how figure types vary across venues and disciplines. The research topics were the ones defined in Thomson Reuters’ JCR. ArticleInfluence (AI) scores were used to rank journals by their influence. With the exception of prose-oriented journals, journals with high figure-per-page count tended to have higher AI as well. In fact, the top one-third of journals (16 out of 50) had more figures than their less-influential counterparts. This difference between journals may be partially explained by “syntax” and culture differences between journals, which lead to different expectations for how a paper should be formatted.

(15)

Patterns Over Time

● Different journals had different patterns

● The PMC dataset did not have a substantial number of papers until 1997

● Figure usage stayed relatively stable after 2006

As mentioned earlier in the presentation, a limitation of the PMC dataset is that it did not have a substantial number of papers until 1997. While the earliest paper goes all the way back to 1937, the number and diversity of papers across the years has fluctuated. In 1997-2002, there was a sampling bias because the dataset was dominated by three journals that made up 77% of the papers in total. As the journals diversified, this effect was reduced. After 2006, the number of figures per page

(16)

Patterns & Research Impact

● Higher impact papers tend to have:

○ higher density and higher proportion of plots and diagrams

○ lower proportion of photos

● Diagrams tend to have a stronger relationship with impact than plots

● Explaining a new idea visually (diagrams) holds more weight than displaying experimental results (plots)

● Papers with no figures tended to be very low-impact

Through an analysis of the relationship between impact and figure usage, it was found that higher impact papers tend to have a higher density and proportion of plots and diagrams as well as a lower proportion of photos (see the figure above).

Explaining new ideas visually with diagrams holds more weight than displaying experimental results in plots, which may explain why diagrams tend to have a stronger association with impact than plots. Although they were omitted in earlier statistics, papers with an ALEF score of 0 (no diagrams) were included to see whether this changed the results. The average score of the data decreased by 5%, but it was found that papers with no figures have very low impact in general.

(17)

A Browser for the Visual Literature

● Figure-centric search application (http://viziometrics.org/)

● Results are ordered by their ALEF scores and the recommender system is based on a hierarchical clustering of an article-level citation network

(18)

Figure Search Evaluation

● A figure-based method search task was used to evaluate Viziometrics

○ 1) Does Viziometrics return relevant figures for figure-oriented search tasks?

○ 2) Which fields should be indexed to maximize accuracy?

○ 3) Does filtering the results for an expected figure type improve accuracy?

A figure-based method search task was used to evaluate Viziometrics. The idea is to use a keyword to search for a particular method which is associated with a figure. The researchers wanted to verify that the relevant figures were being returned on each search. In a test with 7 key phrases (phylogenetic, metabolic pathway, electrophoresis gel, confocal microscopy, fluorescence, survival curve, and ROC curve) and expert labelling as ground truth, the researchers found that 50-100% of the results were relevant for each of the search terms under the best conditions.

Another goal of the study was to identify which fields should be indexed to maximize accuracy. They found that caption-only indexing provided the highest accuracy because keywords in the title or abstract were being associated with all figures in a paper, regardless of whether they were relevant to the individual figures.

The last goal of the study was to see if filtering for an expected figure type improved accuracy. They found that filtering by figure type yielded 2-10 additional relevant figures out of the top 30 results. However, there were some cases where accuracy was reduced with filtering.

The figure above shows the number of relevant returned items for each of the search terms using different filtering and indexing configurations.
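The relevance measure used here is essentially the share of the top results that experts marked as relevant. A minimal sketch, with hypothetical function names and made-up figure IDs rather than data from the study:

```python
def precision_at_k(results, relevant, k=30):
    """Fraction of the top-k returned figure IDs that are in the
    expert-labelled set of relevant figures."""
    top = results[:k]
    if not top:
        return 0.0
    return sum(1 for fig in top if fig in relevant) / len(top)


def filter_by_type(results, figure_types, expected):
    """Keep only results whose classified figure type matches `expected`."""
    return [fig for fig in results if figure_types.get(fig) == expected]


# Made-up ranked results for one query, with expert labels and predicted types.
results = ["f1", "f2", "f3", "f4"]
relevant = {"f1", "f3"}
figure_types = {"f1": "plot", "f2": "photo", "f3": "plot", "f4": "photo"}

unfiltered = precision_at_k(results, relevant, k=4)
filtered = precision_at_k(filter_by_type(results, figure_types, "plot"), relevant, k=4)
print(unfiltered, filtered)
```

Filtering helps only when the figure-type classifier is right about the relevant figures. A relevant figure with a wrong predicted type gets filtered out, which would be consistent with the cases where accuracy dropped.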

(19)

Key Findings

● Figure usage varies across disciplines

● Regardless of discipline, more influential papers tend to have more plots and diagrams

● The most impactful papers have higher figure density and more diagrams in comparison with other figure types

● Viziometrics.org is an effective tool for figure-based search

Viziometrics.org screenshot

This paper presented a series of tools that facilitate research on Viziometrics, the role of visual information encodings in literature. A figure processing pipeline was built to classify figures into equations, diagrams, plots, photos and tables. Using the results from this processing pipeline, the researchers were able to analyze past PMC papers to identify trends across journals, disciplines, and over time as they relate to impact. It was found that figure usage varies across disciplines, but the most influential papers tend to have more diagrams per page and a higher proportion of diagrams compared to other figure types.

The proposed system, viziometrics.org, provided users with an effective way to search for figures directly in a corpus of papers. Given the effectiveness of

(20)

Viziometrics & CS 846

● These are findings that you can use to guide your final project papers!

● Go to http://viziometrics.org/ to try the tool

● Small pipeline optimizations can result in big resource savings when scaled to a big data context

● Preprocessing data is also important

● Past presenters have also talked about using machine learning on massive datasets

The findings presented in this paper can guide students to think critically about the visual information that they include as part of their future research papers.

CS 846 is all about Big Data and this paper works with a massive dataset of 4.8 million figures. The preprocessing done in Viziometrics to clean and standardize the input is a technique that has been employed in almost all of the papers we have discussed so far in class. The pipeline optimizations to avoid unnecessary computation with the multi-chart classifier have also been echoed in many of the other regular papers. Machine learning and big data go hand-in-hand, and this paper presented a very compelling tool that leverages big data and machine learning to extract trends from a massive dataset. I will also mention some papers from the course that relate to this paper in the Related Papers section.

(21)

Discussion - Strengths and Weaknesses

Strengths

● Extensible for potential use on other datasets

● Robust tool for classification of figures

● Impactful insights

Weaknesses

● Colour information is lost in figure classification

● Statistically insignificant results (50-100% of figure search results were relevant)

● Lack of diversity in the dataset used

Strengths

● The Viziometrics system is extensible enough to be used with other figure datasets, which is great for extension to new disciplines.

● The classification tool seems to be robust against a variety of figure types that are common in scientific literature, including multi-charts.

● The insights on figure usage from this paper can have a big impact on how researchers present their findings in papers. Visual encodings are a key part of almost all scientific literature and knowing how to leverage them for better communication benefits authors as well as readers.

Weaknesses

● Colour information is lost in the figure classification step of the pipeline. While colour does pose a large barrier to classification due to its endless variations and lack of standardization, there may be important information related to colour-coding that is lost in the classification.

● The results of the study show that 50-100% of the results were relevant for each search term under the best conditions. The results are not statistically significant because of the small number of trials coupled with the massive variance in the percentage of relevant results.

(22)

Discussion - Related papers

● Papers from class:

○ An acquisition, curation and management workflow for sustainable, terabyte-scale marine image analysis [1]

○ Comparative Evaluation of Big-Data Systems on Scientific Image Analytics Workloads [2]

○ SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning [3]

● External papers:

○ Deep mapping of the visual literature [5]

○ Figureseer: Parsing result-figures in research papers [6]

○ PhyloParser: A Hybrid Algorithm for Extracting Phylogenies from Dendrograms [7]

From CS 846:

[1] talks about a workflow for managing a massive flow of image data from a marine sciences source.

[2] is a comparative evaluation of SciDB, Myria, Spark, Dask, and TensorFlow for handling large volumes of image data.

[3] proposes SPARK-MSNA: an algorithm for aligning multiple similar RNA/DNA sequences using supervised learning. The application of machine learning to image data is similar to Viziometrics.

External papers:

[5] identifies changes in image data in literature over time using a CNN and presents it as a heatmap.

[6] presents an end-to-end framework for parsing result-figures.

[7] proposes a tool for extracting phylogenies from dendrograms and serves as a good example of specialized figure manipulation.

(23)

Discussion - Future Work

1. Case studies with a larger and more diverse data set

2. Paper “syntax” differences across disciplines or publishing venues

3. Different types of figures

4. Crowdsourcing for important figure features

1. Case studies with a larger and more diverse data set

a. This paper used the papers from PMC as a dataset, but this tool could be extended to use different datasets as input. Future work could focus on getting a larger database of papers from a website like the ACM digital library (if copyright laws allow it)

2. Paper “syntax” differences across disciplines or publishing venues

a. The authors mentioned that an in-depth study of the differences in paper formatting expectations between disciplines was beyond the scope of the paper.

3. Different types of figures

a. The Viziometrics system could be extended to classify more specific types of figures like flow charts, process diagrams, or specialized diagrams like UML diagrams.

4. Crowdsourcing for important figure features

a. Since publishing this paper, the authors have added some

(24)

Discussion

1. Do you think that researchers will use viziometrics.org as part of their set of literature review tools?

2. What is the current bottleneck of the Viziometrics system and how can it be overcome?

3. Do you predict that diversifying the input dataset to different disciplines will reduce the accuracy of the tool? How can that effect be minimized?

4. What are some examples of use cases where a researcher may want to look up figures directly instead of the papers associated with them?

To wrap up the presentation, here are some questions for you to consider:

1. Do you think that researchers will use viziometrics.org as part of their set of literature review tools?

2. What is the current bottleneck of the Viziometrics system and how can it be overcome?

3. Do you predict that diversifying the input dataset to different disciplines will reduce the accuracy of the tool? How can that effect be minimized?

4. What are some examples of use cases where a researcher may want to look up figures directly instead of the papers associated with them?

These are my answers for the questions, but I’m looking forward to seeing what your answers are!

1. Yes, but only when it has been extended to cover more disciplines. I would use it if I were a life sciences researcher, but unfortunately it does not provide results relevant to my area of research for now.

2. I think the bottleneck is in the multi-chart dismantler. Although it is only run on ~30% of the images, the recursive splitting, recursive merging, and heuristic comparison steps are very time-intensive compared to the other parts of the algorithm. If there is any way to improve the splitting heuristic to be more accurate, then we can avoid the need for the merging and comparison steps.

3. I think that diversifying the training input does carry the risk of generalizing the model too much. I think that a solution could be to have the user specify which domain they would like to search and load up a different model depending on the general domain.

(25)

(26)

References

[1] T. Schoening, K. Köser, and J. Greinert, “An acquisition, curation and management workflow for sustainable, terabyte-scale marine image analysis,” Sci. Data, vol. 5, 2018, doi: 10.1038/sdata.2018.181.

[2] P. Mehta et al., “Comparative evaluation of big-data systems on scientific image analytics workloads,” in Proceedings of the VLDB Endowment, 2017, vol. 10, no. 11, doi: 10.14778/3137628.3137634.

[3] V. Vineetha, C. L. Biji, and A. S. Nair, “SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning,” Sci. Rep., vol. 9, no. 1, 2019, doi: 10.1038/s41598-019-42966-5.

[4] P.-S. Lee, J. D. West, and B. Howe, “Viziometrics: Analyzing Visual Information in the Scientific Literature,” IEEE Trans. Big Data, vol. 4, no. 1, 2017, doi: 10.1109/tbdata.2017.2689038.

[5] B. Howe, P. S. Lee, M. Grechkin, S. T. Yang, and J. D. West, “Deep mapping of the visual literature,” 2019, doi: 10.1145/3041021.3053065.

[6] N. Siegel, Z. Horvitz, R. Levin, S. Divvala, and A. Farhadi, “Figureseer: Parsing result-figures in research papers,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2016, vol. 9911 LNCS, doi: 10.1007/978-3-319-46478-7_41.

[7] P. S. Lee, S. T. Yang, J. D. West, and B. Howe, “PhyloParser: A Hybrid Algorithm for Extracting Phylogenies from Dendrograms,” in Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, 2017, vol. 1, doi: 10.1109/ICDAR.2017.180.


(27)

Thank you!

If you have any questions, you can contact me at [email protected]!

Slide Template from SlidesGo.com