
4. Applications and examples

This section describes and demonstrates some possible uses of the grImport package.

A straightforward use of the grid.symbols() function is to import an external image as a custom plotting symbol for a scatterplot, as was previously demonstrated in Figure 8 (Section 2.3).

A straightforward use of the grid.picture() function is to add a company logo to a plot. This has been done “for real” within a large pharmaceutical research and development company, but unfortunately legal constraints prevent the publishing of that example. Instead, the code below demonstrates the basic idea by adding the GNU logo from Section 3.8 as a “watermark” background to a lattice barchart.

R> barchart(~ cit, main = "Number of Citations per Year", xlab = "",
+           panel = function(...) {
+               GNUlogo()
+               grid.rect(gp = gpar(fill = rgb(1, 1, 1, .9)))
+               panel.barchart(...)
+           })

This code washes out the logo by simply drawing a semitransparent rectangle over the top.

The resulting plot is shown in Figure 21.
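The washing-out effect of the semitransparent rectangle can be illustrated with a little arithmetic. The following is a minimal sketch, assuming ordinary "source over" alpha compositing; the blend() function and the choice of a black logo pixel are hypothetical and are not part of grImport or lattice.

```r
# Source-over compositing of one colour on top of another.
# 'over' and 'under' are RGB triples in [0, 1]; 'alpha' is the opacity of 'over'.
blend <- function(over, under, alpha) {
    alpha * over + (1 - alpha) * under
}

white <- c(1, 1, 1)   # the rectangle fill, as in rgb(1, 1, 1, .9)
black <- c(0, 0, 0)   # hypothetical pixel of the logo underneath

# A white rectangle with alpha = 0.9 drawn over a black pixel
washed <- blend(white, black, alpha = 0.9)
washed
```

Under this assumption, a black part of the logo ends up as a light gray (90% of the way to white), which is why the logo recedes behind the bars that are drawn afterwards.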

4.1. Scraping data from images

A less obvious application of the grImport package, and one that demonstrates working directly with the Picture objects in R, was suggested by Daniel Jackson of the MRC Biostatistics Unit in Cambridge (private communication). The context for this application is the practice of meta-analysis, specifically in the area of survival data.

Figure 21: A barplot of the exponential growth in the number of citations of R, with a GNU logo watermark.

A problem that researchers face in this area is how to obtain data from published articles when the original data are not provided. An article may include summary tables and plots, but raw data values may not be available. In practice, researchers sometimes resort to measuring plots, such as survival curves, with a pencil and ruler in order to retrieve at least some raw data points. If an article of interest is published in an electronic format, it may be possible to use the grImport package to radically improve both the accuracy and efficiency of such a task.

In order to demonstrate this idea, the following example will process a survival curve that was published in the newsletter of the R project for statistical computing (Lumley 2004).

A survival plot appears on page 27 of this R News issue. Figure 22 shows the original context of the plot.

A number of tools can be used to extract just a single page from a multiple-page PDF document and convert that page to PostScript format. In this case, the resulting file is called page27.ps and this is converted to RGML format by the following code.

R> PostScriptTrace("page27.ps")

We do not need the entire page, but using the picturePaths() function and a little trial and error, it is possible to determine which paths constitute the survival plot in the top-left corner of the page. The following code reads the RGML file into a Picture object and extracts just the paths that make up the crucial part of the plot: paths 3 to 16 draw the axes, path 18 is the green curve, and path 27 is the blue curve.

R> page27 <- readPicture("page27.ps.xml")
R> survivalPlot <- page27[c(3:16, 18, 27)]

The following code draws the extracted paths and the result is shown in Figure23.

R> pushViewport(viewport(gp = gpar(lex = .2)))
R> grid.picture(survivalPlot)
R> popViewport()

The original image shows that the outer tick marks are at locations 0 and 100 on the y-axis and 0 and 2.5 on the x-axis. These locations can be matched to the locations of the paths that make up those tick marks in the Picture object in order to establish a scale for the paths that make up the green and blue curves. For example, the picturePaths() function can be used to determine that the lowest tick mark on the y-axis is drawn by the ninth path in survivalPlot and zero on the vertical scale is at the y-location of this path.

R> zeroY <- survivalPlot@paths[[9]]@y[1]
R> zeroY
   move
6308.73

Figure 22: A page from the newsletter of the R project for statistical computing that includes a survival curve at top left.

Figure 23: The survival plot from the R News article, which is drawn by importing the original page and subsetting the relevant paths from that page.

Similarly, the uppermost tick mark is path 14, so a unit step on the vertical scale is 1/100th of the difference between the y-location of this path and that of path 9.

R> unitY <- (survivalPlot@paths[[14]]@y[1] - zeroY)/100
R> unitY
move
   9

The survival percentages from the green curve (path 15 of survivalPlot) can now be determined using this scale information. Each y-value repeats twice because the green curve is drawn as a step function.

R> greenY <- (survivalPlot@paths[[15]]@y - zeroY)/unitY
R> head(round(unname(greenY), 1), n = 20)
 [1] 100.0 100.0  98.6  98.6  97.1  97.1  95.7  95.7  92.8
[10]  92.8  89.9  89.9  88.4  88.4  85.5  85.5  84.1  84.1
[19]  82.6  82.6

Happily, these numbers match quite well with the values that were used to produce the original plot.

R> library("survival")

R> sfit <- survfit(Surv(time, status) ~ trt, data = veteran)
R> originalGreenY <- sfit$surv[1:sfit$strata[1]]
R> head(round(originalGreenY*100, 1), n = 9)
[1] 98.6 97.1 95.7 92.8 89.9 88.4 85.5 84.1 82.6
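The agreement between the two sets of numbers can also be checked mechanically. The following sketch works only from the rounded values printed in this section, not from the Picture object or the veteran data, and the variable names steps and extracted are illustrative. It keeps one value per step of the curve and drops the initial 100%, on the assumption (consistent with the output shown) that the survfit() values begin at the first event.

```r
# Scaled y-values of the green curve, copied from the output printed above
greenY <- c(100.0, 100.0, 98.6, 98.6, 97.1, 97.1, 95.7, 95.7, 92.8, 92.8,
            89.9, 89.9, 88.4, 88.4, 85.5, 85.5, 84.1, 84.1, 82.6, 82.6)
# Percentages computed from the original data, copied from the output above
originalGreenY <- c(98.6, 97.1, 95.7, 92.8, 89.9, 88.4, 85.5, 84.1, 82.6)

# Each value appears twice (step function), so keep the odd-numbered entries
steps <- greenY[seq(1, length(greenY), by = 2)]
# Drop the initial 100%, which has no counterpart in the values above
extracted <- steps[-1]

# Largest absolute discrepancy between extracted and original values
max(abs(extracted - originalGreenY))
```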

Figure 24: An example of a sequence logo produced by the WebLogo software.

The x-values for the green curve and the data from the blue curve could be extracted in an analogous fashion.
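The same two-point calibration applies to either axis, so it can be wrapped in a small helper. This is only a sketch: calibrate() is a hypothetical function, not part of grImport, and the tick locations below are made-up numbers rather than values read from page27.ps.xml. In practice the locations would come from the relevant tick mark paths, as was done for zeroY and unitY.

```r
# Build a function that maps picture coordinates to data units, given the
# picture locations of two tick marks and the data values they represent.
calibrate <- function(loc, value) {
    stopifnot(length(loc) == 2, length(value) == 2, loc[1] != loc[2])
    # picture units per data unit
    unit <- (loc[2] - loc[1]) / (value[2] - value[1])
    function(coord) value[1] + (coord - loc[1]) / unit
}

# Hypothetical tick locations (NOT taken from the real picture)
toPercent <- calibrate(loc = c(6300, 7200), value = c(0, 100))  # y-axis, 0 to 100
toXValue  <- calibrate(loc = c(1500, 4000), value = c(0, 2.5))  # x-axis, 0 to 2.5

toPercent(c(6300, 6750, 7200))
toXValue(c(1500, 2750, 4000))
```

Returning a closure keeps the scale information for each axis in one place, so the same function can be applied to every path that shares that axis.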

This idea of importing images just to extract the locations from the boundaries of shapes in the image might also be usefully applied to map data that is only available in pictorial form.

4.2. Importing WebLogo images

The next example demonstrates a more sophisticated use of grid.picture() based on work by Toby Dylan Hocking (http://www.ocf.berkeley.edu/~tdhock/). The images to be imported are sequence logos (Schneider and Stephens 1990) as generated by the WebLogo software (Crooks, Hon, Chandonia, and Brenner 2004, http://weblogo.berkeley.edu/).

The idea of sequence logos is to display patterns in aligned genetic sequences. An example of a sequence logo that was created to visualize the importance of different amino acids in a phage display experiment (Smith and Petrenko 1997) is shown in Figure 24.

The logo displays the relative frequency of amino acids at each binding position in the experiment; large letters represent amino acids that are “strong signals” for each position. For example, D and E are strong signals for position 1 and G and S are strong signals for position 3.

A question of interest is whether D at position 1 correlates strongly with G at position 3, or with S at position 3. To answer this, a more detailed plot was produced that combined the overall sequence logo with a dendrogram of the experimental binding results, plus further sequence logos based on important subsets of the overall result. This plot is shown in Figure 25 and it suggests that D at position 1 does correspond to G at position 3, while E at position 1 corresponds to S at position 3.

The more detailed plot was generated by combining sequence logos that were produced by WebLogo with a dendrogram that was produced by R. WebLogo can produce sequence logos in PostScript format, so these were imported to R using grImport and subsetted to remove the axes that WebLogo produces. The R graphics system was then used to produce a dendrogram, with space left on either side for the logos; the logos were positioned relative to the appropriate portions of the dendrogram using grid viewports and the gridBase package (Murrell 2003).

Figure 25: A dendrogram drawn using R’s plot.dendrogram() function, with four imported WebLogo sequence logos drawn alongside: an overall sequence logo on the left and three subsequence logos on the right, next to the relevant leaves of the dendrogram.

5. Limitations

The grImport package is based on PostScript as the original image format and XML as an intermediate image format. These technologies were chosen for a number of reasons, as outlined in previous sections, but they are not without their disadvantages. This section discusses some of the limitations of the grImport approach to importing vector graphics.

One of the reasons for choosing PostScript as the original image format is that PostScript is a sophisticated graphics language. However, there are some things that PostScript cannot represent, one major example being semitransparent colors. This means that if an original image is in a format that allows semitransparent colors, such as PDF or SVG, the transformation to PostScript may lose important features of the original image.

It is possible to convert a PDF image that contains semitransparency to a PostScript format using ghostscript, but the result is a raster image in the PostScript format, so for the purposes of importing the image using grImport, the image is destroyed.

Figure 26: An example of complex clipping: the pattern on the left is clipped to the shape of the text “hello” on the right.

Figure 27: An illustration of the difference between the non-zero winding rule (right) and the even-odd rule (left) for filling the same polygon.

So one problem with PostScript is that it lacks some useful graphical features that are available in other vector formats. A different problem with using PostScript is that it possesses some useful graphical features that R graphics lacks. An example of this problem was discussed in Section 3.8; a PostScript path can be more complex than R is capable of drawing. Previous sections have also mentioned that a PostScript file can contain raster elements, which grImport just ignores, and Bézier curves, which grImport flattens into a series of straight lines.
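To make the idea of flattening concrete, the following sketch approximates a cubic Bézier curve by straight line segments. It is an illustration only: the flattenBezier() function is hypothetical, and sampling at equally spaced parameter values is an assumption chosen for simplicity, not a description of how the flattening is actually carried out for grImport.

```r
# Approximate a cubic Bezier curve with control points p0..p3 (each c(x, y))
# by n straight segments, evaluated at n + 1 equally spaced parameter values.
flattenBezier <- function(p0, p1, p2, p3, n = 10) {
    t <- seq(0, 1, length.out = n + 1)
    # Bernstein basis polynomials of degree 3, one column per control point
    b <- cbind((1 - t)^3, 3 * (1 - t)^2 * t, 3 * (1 - t) * t^2, t^3)
    pts <- b %*% rbind(p0, p1, p2, p3)
    colnames(pts) <- c("x", "y")
    pts
}

# An arch from (0, 0) to (1, 0), flattened into 8 segments (9 points)
pts <- flattenBezier(c(0, 0), c(0, 1), c(1, 1), c(1, 0), n = 8)
pts
```

The result is an ordinary matrix of x- and y-locations, which is the kind of data that a polyline can represent; the smoothness of the original curve is traded for the number of segments.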

Some other PostScript features that are not supported by either R graphics in general or grImport specifically are:

Clipping to arbitrary paths: PostScript allows clipping to the current path, which can be a very complex shape. As a simple example, the image in Figure 26 shows a pattern of radiating lines on the left; the image on the right in Figure 26 shows this pattern being clipped to the outline of a piece of text.

R graphics only allows clipping to a rectangle and grImport currently completely ignores any clipping in a PostScript image, so an original image that makes use of clipping can be imported but will not be reproduced correctly in R.

The path fill rule: When a self-intersecting path is filled with a color, there are two main ways to decide what constitutes the “inside” of the resulting shape. The two algorithms are called the non-zero winding rule and the even-odd rule. The image in Figure 27 demonstrates the difference by showing exactly the same (self-intersecting) path being filled with the two algorithms (the even-odd rule is on the left).

PostScript can perform either type of fill, but R graphics only uses the non-zero winding rule, so an original PostScript image that uses the even-odd rule can be imported, but it will not reproduce correctly in R.
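The difference between the two rules can be made concrete by computing winding numbers. The sketch below is a hypothetical base R illustration, not code from grImport or from the R graphics engine: it sums the signed angles subtended by the edges of a pentagram (a self-intersecting path) as seen from a test point. A point is "inside" under the non-zero winding rule if its winding number is non-zero, and under the even-odd rule if its winding number is odd.

```r
# Winding number of the closed polygon (x, y) around the point (px, py),
# computed by summing the signed angle subtended by each edge.
windingNumber <- function(px, py, x, y) {
    x2 <- c(x[-1], x[1])
    y2 <- c(y[-1], y[1])
    d <- atan2(y2 - py, x2 - px) - atan2(y - py, x - px)
    # wrap each angle difference into [-pi, pi)
    d <- (d + pi) %% (2 * pi) - pi
    round(sum(d) / (2 * pi))
}

# A pentagram: every second vertex of a regular pentagon
theta <- pi / 2 + (0:4) * 4 * pi / 5
x <- cos(theta)
y <- sin(theta)

centre  <- windingNumber(0, 0, x, y)     # inner pentagon
tip     <- windingNumber(0, 0.6, x, y)   # one of the five points of the star
outside <- windingNumber(2, 2, x, y)

wn <- c(centre = centre, tip = tip, outside = outside)
data.frame(winding = wn,
           nonzero = wn != 0,
           evenodd = wn %% 2 != 0)
```

For this path the centre is wound around twice, so it is filled under the non-zero winding rule but left empty under the even-odd rule, while the tips are filled under both.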

Moving on to the choice of XML as an intermediate format, the main problem is that XML is a verbose language that can result in large files. However, this problem has so far proven to be an inconvenience rather than an insurmountable obstacle.

6. Availability

The grImport package is available from the Comprehensive R Archive Network (R Foundation for Statistical Computing 2008) at http://CRAN.R-project.org/package=grImport. This article describes version 0.4-3 of the package.

7. Conclusion

The grImport package implements a three-stage approach to importing vector-based graphical images into the R software environment for statistical computing and graphics.

It is assumed that the original image can be converted to a PostScript format, then functions in the grImport package are provided to convert the PostScript file to an intermediate RGML format, to read the RGML file into S4 objects in R, and to manipulate and draw those objects.

There are limitations to the system, due to the limitations of the PostScript language and due to the limitations of the R graphics system, although some of these issues can be worked around. Furthermore, despite these limitations, the grImport package has been employed in a variety of ways in several real-world applications.

One of the important ideas to take away is that the grImport package is not just about drawing pictures. The package starts with a vector image and transforms the image into data. One of the things that can be done with data in R is to draw it, but there are many other potential applications for the image data.

Acknowledgments

Richard Walton made contributions to the grImport package during a Faculty of Science summer scholarship at the University of Auckland (2005/2006).

Thanks to the authors of the freely-available images that are used in this article. The chess pawn is from a public domain chess board image in SVG format by José Hevia, which was originally sourced from the Open Clip Art Library http://openclipart.org/clipart//recreation/games/chess/chess_game_01.svg, but is now available from http://www.public-domain-photos.com/free-cliparts/other/chess/chess_game_01-3657.htm. The GNU logo is by Aurélio A. Heckert and is available from http://commons.wikimedia.org/wiki/Image:Heckert_GNU_white.svg under a Free Art Licence. The tiger image is distributed as part of the ghostscript software system.

Thanks also for ideas and data to Toby Dylan Hocking and Daniel Jackson. The R citation data used in Figure 21 were taken from an email to the R-help mailing list by John Maindonald.

Comments and suggestions from the anonymous reviewers were extremely useful in focusing and improving the final manuscript.

References

Adobe Systems (1999). Postscript Language Reference. 3rd edition. Addison-Wesley. ISBN 9780201379228.

Adobe Systems (2005). PDF Reference Version 1.6. 5th edition. Adobe Press. ISBN 9780321304742.

Bah T (2007). Inkscape: Guide to a Vector Drawing Program. Prentice Hall Press, Upper Saddle River, NJ, USA. ISBN 9780131357945.

Bivand R, Leisch F, Maechler M (2008). pixmap: Bitmap Images (“Pixel Maps”). R package version 0.4-9, URL http://CRAN.R-project.org/package=pixmap.

Bray T, Paoli J, Sperberg-McQueen CM, Maler E, Yergeau F (2006). Extensible Markup Language (XML) 1.0. World Wide Web Consortium (W3C). URL http://www.w3.org/TR/2006/REC-xml-20060816/.

Clark J (1999). XSL Transformations (XSLT). World Wide Web Consortium (W3C). URL http://www.w3.org/TR/xslt/.

Crooks GE, Hon G, Chandonia JM, Brenner SE (2004). “WebLogo: A Sequence Logo Generator.” Genome Research, 14, 1188–1190.

Ferraiolo J, Jun F, Jackson D (2003). Scalable Vector Graphics (SVG) 1.1 Specification. World Wide Web Consortium (W3C). URL http://www.w3.org/TR/SVG11/.

Kylander OS, Kylander K (1999). GIMP: The Official Handbook. Coriolis Value. ISBN 1576105202.

Lumley T (2004). “The survival Package.” R News, 4(1), 26–28. URL http://CRAN.R-project.org/doc/Rnews/.

Merz T (1997). Ghostscript User Manual. URL ftp://mirror.cs.wisc.edu/pub/mirrors/ghost/gs5man_e.pdf.

Murrell P (2003). “Integrating grid Graphics Output with Base Graphics Output.” R News, 3(2), 7–12. URL http://CRAN.R-project.org/doc/Rnews/.

Murrell P (2005). R Graphics. Chapman & Hall/CRC, Boca Raton, FL. ISBN 1-584-88486-X, URL http://www.stat.auckland.ac.nz/~paul/RGraphics/rgraphics.html.

Nikon Systems Inc (2005). rimage: Image Processing Module for R. R package version 0.5-7, URL http://CRAN.R-project.org/package=rimage.

R Development Core Team (2008). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

R Foundation for Statistical Computing (2008). Comprehensive R Archive Network. URL http://CRAN.R-project.org/.

Sarkar D (2008). lattice: Multivariate Data Visualization with R. Springer-Verlag, New York. ISBN 9780387759685.

Schneider TD, Stephens RM (1990). “Sequence Logos: A New Way to Display Consensus Sequences.” Nucleic Acids Res, 18, 6097–6100.

Sklyar O, Huber W (2006). “Image Analysis for Microscopy Screens.” R News, 6(5), 12–16. URL http://CRAN.R-project.org/doc/Rnews/.

Smith GP, Petrenko VA (1997). “Phage Display.” Chemical Reviews, 97(2), 391–410.

Still M (2005). The Definitive Guide to ImageMagick. Apress. ISBN 9781590595909.

Temple Lang D (2008). XML: Tools for Parsing and Generating XML Within R and S-PLUS. R package version 1.95-3, URL http://CRAN.R-project.org/package=XML.

Veillard D (2009). “The XSLT C Library for GNOME.” http://xmlsoft.org/XSLT/.

Affiliation:

Paul Murrell

Department of Statistics
The University of Auckland
Private Bag 92019

Auckland, New Zealand

Telephone: +64/9/3737599-85392
E-mail: [email protected]

URL: http://www.stat.auckland.ac.nz/~paul/

Journal of Statistical Software

http://www.jstatsoft.org/

published by the American Statistical Association http://www.amstat.org/

Volume 30, Issue 4, April 2009
Submitted: 2008-11-16
Accepted: 2009-02-27
