• No results found

Hybrid Traversal: Efficient Source Code Analysis at Scale

N/A
N/A
Protected

Academic year: 2021

Share "Hybrid Traversal: Efficient Source Code Analysis at Scale"

Copied!
5
0
0

Loading.... (view fulltext now)

Full text

(1)

Computer Science Conference Presentations,

Posters and Proceedings

Computer Science

2018

Hybrid Traversal: Efficient Source Code Analysis at

Scale

Ramanathan Ramu

Iowa State University

Ganesha Upadhyaya

Iowa State University

Hoan A. Nguyen

Iowa State University, hoan@iastate.edu

Hridesh Rajan

Iowa State University, hridesh@iastate.edu

Follow this and additional works at:

https://lib.dr.iastate.edu/cs_conf

Part of the

Software Engineering Commons

This Conference Proceeding is brought to you for free and open access by the Computer Science at Iowa State University Digital Repository. It has been accepted for inclusion in Computer Science Conference Presentations, Posters and Proceedings by an authorized administrator of Iowa State University Digital Repository. For more information, please contactdigirep@iastate.edu.

Recommended Citation

Ramu, Ramanathan; Upadhyaya, Ganesha; Nguyen, Hoan A.; and Rajan, Hridesh, "Hybrid Traversal: Efficient Source Code Analysis at Scale" (2018). Computer Science Conference Presentations, Posters and Proceedings. 45.

(2)

Hybrid Traversal: Efficient Source Code Analysis at Scale

Abstract

Source code analysis at a large scale is useful for solving many software engineering problems, however, could

be very expensive, thus, making its use difficult. This work proposes hybrid traversal, a technique for

performing source code analysis over control flow graphs more efficiently. Analysis over a control flow graph

requires traversing the graph and it can be done using several traversal strategies. Our observation is that no

single traversal strategy is suitable for different analyses and different graphs.

Our key insight is that using the characteristics of the analysis and the properties of the graph it is possible to

select a most efficient traversal strategy for a pair. Our evaluation using a set of analyses with different

characteristics and a large dataset of graphs with different properties shows up to 30% reduction in the

analysis time. Further, the overhead of our technique for selecting the most efficient traversal strategy is very

low; between 0.01%-0.2%.

Disciplines

Computer Sciences | Software Engineering

Comments

This is a manuscript of the proceeding Ramu, Ramanathan, Ganesha Upadhyaya, Hoan A. Nguyen, and

Hridesh Rajan. "Hybrid Traversal: Efficient Source Code Analysis at Scale." ICSE ’18 Companion. 40th

International Conference on Software Engineering, Gothenburg, Sweden, May 27-June 3, 2018. DOI:

10.1145/3183440.3195033

. Posted with permission.

(3)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116

Hybrid Traversal: Efficient Source Code Analysis at Scale

Ramanathan Ramu, Ganesha Upadhyaya, Hoan A Nguyen, Hridesh Rajan

Iowa State University

rramu,ganeshau,hoan,hridesh@iastate.edu

ABSTRACT

Source code analysis at a large scale is useful for solving many software engineering problems, however, could be very expensive, thus, making its use difficult. This work proposes hybrid traversal, a technique for performing source code analysis over control flow graphs more efficiently. Analysis over a control flow graph requires traversing the graph and it can be done using several traversal strategies. Our observation is that no single traversal strategy is suitable for different analyses and different graphs.

Our key insight is that using the characteristics of the analysis and the properties of the graph it is possible to select a most ef-ficient traversal strategy for a <analysis, input graph> pair. Our evaluation using a set of analyses with different characteristics and a large dataset of graphs with different properties shows up to 30% reduction in the analysis time. Further, the overhead of our technique for selecting the most efficient traversal strategy is very low; between 0.01%-0.2%.

1

INTRODUCTION

Performance of source code analysis over graphs heavily depends on the order of the nodes visited during the traversals: the traversal strategy. While graph traversal is a well-studied problem, and vari-ous traversal strategies exists, e.g., depth-first, post-order, reverse post-order, etc, no single strategy works best for different kinds of analyses and different kinds of graphs. Our contribution is hybrid traversal, a novel source code analysis optimization technique for source code analyses expressed as graph traversals.

The current approaches either select a single traversal strategy for all graphs or let user pick one. Our observation is that, for an-alyzing millions of graphs, no single traversal strategy performs best across different kinds of analyses and graphs. Certain charac-teristics of the analysis, such as data flow sensitivity, sensitivity to cycles in the graph, and direction of the graph traversal (top-down or bottom-up), influence the selection of the traversal strategy for an analysis. Similarly, graph properties such as branching factor and cyclicity plays an important role. Our key intuition is that, by combining the characteristics of the analysis and properties of the graph, an optimal traversal strategy can be selected for each <analysis, graph> pair.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

ICSE ’18 Companion, May 27-June 3, 2018, Gothenburg, Sweden © 2018 Copyright held by the owner/author(s).

ACM ISBN 978-1-4503-5663-3/18/05. https://doi.org/10.1145/3183440.3195033

2

APPROACH: HYBRID TRAVERSAL

A source code analysis over control flow graphs (CFGs) performs one or more traversal of the CFG. A traversal visits nodes in the CFG to perform certain computation and generates analysis results for nodes. The analysis model that we assume is a standard control and data flow analysis.

Traversal of Analysis Compute static properties Compute dynamic properties Sensitivity to - data flow - loop Traversal direction Graph - sequential - branch - loop Select and Optimize the best traversal strategy Traversal strategy Graph Standard strategies Run once for each analysis

Run on every graph

Figure 1: Overview of the hybrid approach.

Figure 1 provides an overview of our approach and its key com-ponents. Given a source code analysis and a dataset that contains a collection of CFGs, our technique automatically selects a traver-sal strategy from a set of candidate strategies for each CFG in the dataset. For selecting the traversal strategy, our technique first computes a set of properties of the analysis by statically analyzing the code of the analysis, such as data flow sensitivity and loop sensitivity. The analysis properties are then combined with graph properties, such as branching factor and cyclicity to select a traver-sal strategy for each CFG. The static properties of the travertraver-sals are computed only once for each analysis at compile-time, whereas graph properties are determined for every input graph in the dataset when executing the analysis.

(4)

117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174

ICSE ’18 Companion, May 27-June 3, 2018, Gothenburg, Sweden R. Ramu et al.

175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232

Table 1: List of source code analyses and properties of their involved traversals.Ts: total number of traversals.ti: properties of the

i -th traversal. Flw: data-flow sensitive. Lp: loop sensitive. Dir: traversal direc-tion where —, → and ← mean iterative, forward and backward, respectively. ✓and ✗ for Flw and Lp indicates whether the property is true or false.

Analysis Ts t1 t2 t3

Flw Lp Dir Flw Lp Dir Flw Lp Dir

1 Copy propagation (CP) 3 ✗ ✗ — ✓ ✓ → ✗ ✗ —

2 Common sub-expression detection (CSD) 3 ✗ ✗ — ✓ ✓ → ✗ ✗ —

3 Dead code (DC) 3 ✗ ✗ — ✓ ✓ ← ✗ ✗ —

4 Loop invariant code (LIC) 3 ✗ ✗ — ✓ ✓ → ✗ ✗ — 5 Upsafety analysis (USA) 3 ✗ ✗ — ✓ ✓ → ✗ ✗ —

6 Valid FileReader (VFR) 3 ✗ ✗ — ✓ ✓ → ✗ ✗ —

7 Mismatched wait/notify (MWN) 3 ✗ ✗ — ✓ ✓ → ✗ ✗ — 8 Available expression (AE) 2 ✗ ✗ — ✓ ✓ →

9 Dominator (DOM) 2 ✗ ✗ — ✓ ✗ →

10 Local may alias (LMA) 2 ✗ ✗ — ✓ ✓ →

11 Local must not alias (LMNA) 2 ✗ ✗ — ✓ ✓ →

12 Live variable (LV) 2 ✗ ✗ — ✓ ✓ ←

13 Nullness analysis (NA) 2 ✗ ✗ — ✓ ✓ →

14 Post-dominator (PDOM) 2 ✗ ✗ — ✓ ✗ ←

15 Reaching definition (RD) 2 ✗ ✗ — ✓ ✓ →

16 Resource status (RS) 2 ✗ ✗ — ✓ ✓ →

17 Very busy expression (VBE) 2 ✗ ✗ — ✓ ✓ ← 18 Safe Synchronization (SS) 2 ✗ ✗ — ✓ ✓ → 19 Used and defined variable (UDV) 1 ✗ ✗ —

20 Useless increment in return (UIR) 1 ✗ ✗ — 21 Wait not in loop (WNIL) 1 ✗ ✗ —

Given a set of analysis properties and graph properties, a deci-sion diagram is carefully designed to select the strategy as shown in Figure 2. The list of candidate traversal strategies includes: Any order (ANY ), Increasing order of node ids (INC), Decreasing or-der of node ids (DEC), Post-Oror-der (PO), Reverse Post-Oror-der (RPO), Worklist with Post-Order (WPO), Worklist with Reverse Post-Order (WRPO). The decision tree contains 11 unique paths (P0to P11) lead-ing to a traversal strategy (the leaf nodes). The intermediate nodes are conditions involving either an analysis property or a graph property. The order in which the checks are performed is intuitive. For example, it only makes sense to check the connectivity of the nodes (whether certain nodes have multiple parents), if the analysis is data-sensitive (in which case the outputs of neighbors influence the output of a node). Similarly, if the graph does not contain cycles, it is not necessary to check if the analysis is sensitive to a loop. Our decision tree design ensures that all necessary conditions are checked and only the necessary conditions are checked. While it is difficult to argue the correctness of our decision tree, instead we measured the misprediction rate and found it to be a near zero, where a misprediction happens when our decision tree selects a strategy that is sub-optimal.

3

EVALUATION

To measure the effectiveness of our hybrid approach, we measured the analysis execution times and compared it against the analysis execution times of six standard traversal strategies: DFS, PO, RPO, WPO, WRPO, and ANY, that use a single strategy for all kinds of analysis and graphs. We used 21 source code analyses that traverse CFG and a large dataset of CFGs that contains 287 thousand CFGs, generated from the DaCapo 9.12 benchmark. Table 1 lists the 21 analyses along with their properties.

Table 2 shows the percentage reduction in the analysis time of our hybrid approach against the six standard strategies. For all 21 analyses, our approach was able to reduce the analysis time and the

Table 2: Reduction in running times.Background colors indicate ranges of values: no reduction , (0%, 10%) , [10%, 50%) and [50%, 100%] .

Analysis DFS PO RPO WPO WRPO ANY Analysis DFS PO RPO WPO WRPO ANY

CP 17% 83% 9% 66% 11% 72% LV 38% 30% 84% 11% 56% 75% CSD 41% 93% 39% 74% 4% 89% NA 26% 88% 30% 50% 10% 80% DC 41% 30% 89% 7% 64% 81% PDOM 51% 41% 95% 10% 72% 95% LIC 17% 84% 8% 67% 7% 73% RD 15% 80% 7% 62% 9% 68% USA 36% 92% 34% 72% 9% 87% RS 31% 31% 30% 31% 28% 30% VFR 20% 41% 18% 51% 15% 62% VBE 40% 36% 88% 13% 76% 81% MWN 21% 35% 16% 35% 22% 49% SS 26% 39% 22% 37% 25% 57% AE 40% 14% 39% 73% 14% 87% UDV 6% 5% 6% 10% 9% 3% DOM 54% 97% 48% 70% 6% 95% UIR 2% 2% 1% 3% 3% 0% LMA 35% 46% 28% 74% 6% 46% WNIL 3% 4% 5% 6% 8% 2% LMNA 29% 39% 22% 68% 9% 41% Overall 31% 83% 70% 55% 35% 81%

reduction is between 1%-28% (taking least value in column 2-7 for each analysis in Table 2). For UDV, UIR, and WNIL, our approach saw less benefit, because these are data flow insensitive analyses, in which all nodes are visited exactly once and the order does not matter, hence all traversal strategies would perform similar. We also measured the overhead of our technique for computing the characteristics of the analysis and properties of the graphs and the overhead is negligible, between 0.01%-0.2%.

4

RELATED WORK

Atkinson and Griswold [1] proposed techniques for improving the efficiency of data-flow analysis. Tok et al. [4] proposed a new work-list algorithm for accelerating inter-procedural flow-sensitive data-flow analysis. Hind and Pioli [3] proposed an optimized priority-based worklist algorithm for pointer alias analysis. Bourdoncle [2] proposed the notion of weak topological ordering (WTO) for dataflow analysis and abstraction interpretation domains. These works pro-pose new traversal strategies for improving the efficiency of certain class of source code analysis, whereas our work proposes a novel technique for selecting the optimal traversal strategy from a list of candidate traversal strategies.

5

CONCLUSION

Scaling source code analyses to large code bases is an ongoing research direction. Our work proposes hybrid traversal, a technique for optimizing source code analysis over control flow graphs. Our key intuition is that by combining the analysis properties and graph properties, a suitable traversal strategy can be decided for any analysis and graph pair. Our initial results indicates that up to 30% reduction in the analysis time can be achieved.

ACKNOWLEDGMENTS

This work was supported in part by NSF grants CCF-15-18897, CNS-15-13263, and CCF-14-23370.

REFERENCES

[1] Darren C. Atkinson and William G. Griswold. 2001. Implementation Techniques for Efficient Data-Flow Analysis of Large Programs. In Proceedings of the IEEE Inter-national Conference on Software Maintenance (ICSM’01) (ICSM ’01). IEEE Computer Society, Washington, DC, USA, 52–. https://doi.org/10.1109/ICSM.2001.972711 [2] François Bourdoncle. 1993. Efficient chaotic iteration strategies with widenings.

In Formal Methods in Programming and Their Applications, Dines Bjørner, Manfred Broy, and Igor V. Pottosin (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 128–141.

(5)

233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290

Hybrid Traversal: Efficient Source Code Analysis at Scale ICSE ’18 Companion, May 27-June 3, 2018, Gothenburg, Sweden

291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 [3] Michael Hind and Anthony Pioli. 1998. Assessing the Effects of Flow-Sensitivity

on Pointer Alias Analyses. In SAS. Springer-Verlag, 57–81.

[4] Teck Bok Tok, Samuel Z. Guyer, and Calvin Lin. 2006. Efficient Flow-sensitive Interprocedural Data-flow Analysis in the Presence of Pointers. In Proceedings of the 15th International Conference on Compiler Construction (CC’06). Springer-Verlag, Berlin, Heidelberg, 17–31. https://doi.org/10.1007/11688839_3

References

Related documents

[r]

environment and research.   

 сложеност туристичког производа, који је састављен од великог броја производа и услуга,  «производни процес» траје током целог дана, и сваког дана,

The results of a depth-first traversal, beginning at vertex a, of the graph in Figure 13-11... The results of a breadth-first traversal, beginning at vertex a, of the graph in

Given that it is not unusual that a traversal may have a large number of classes and methods involved Metric 1 and a traversal may have a very long traversal path while the

When a caribou skin tipi was no longer being used, people would take it apart, soak and tan the hides and make clothing or babiche.. Structure of

Supporters of requiring insurers to file standard forms say that when all insurers are required to offer the same policy forms, consumers can make “apples to apples” comparisons

In this dissertation, to efficiently query and manage OPM-compliant provenance, we first propose a provenance collection framework that collects both prospective provenance,