• No results found

8.3 Indexing and Query Processing

8.3.2 Loading an Index

The application also lets the user load an Index from disk, without going through the process of building the index from the application models. To do so one changes to the hLoad Existing Indexi tab and selects the Inverted File by clicking the hLoad Indexi button. If the user also wants to load a certain PageRank structure that is different from the one specified in the AJAXConfig class, he can do so by clicking the hLoad PageRanki and choosing the corresponding file. Figure 8.3 shows the GUI for loading an Index.

8.3.3 Query Processing

To search in an Index, for example for testing purposes, the application also provides this functionality. Any user-defined query can be typed in the hQueryi text field on the hSearchi tab. By clicking the hSearchi button the query is processed and the results

Chapter 8 Infrastructural Setup

Figure 8.3: The UI for loading an Inverted File into memory.

are listed in the text area below the query field. Figure8.4for example shows the results for the query "morcheeba enjoy the ride tzuke judie skye" in a sample index.

Figure 8.4: The UI for testing the Query Processing.

Chapter 9

Related Work

Our work is new as what concerns crawling AJAX Web Applications. Nevertheless, there are parallels with work form the database, IR and DB-IR communities.

Duplicate elimination. There are two kinds of duplicate cases during crawling: syntactic and semantic duplicates. The syntactic duplicates are more than one page linking to the same page with the same URL. Traditional crawlers extract and store all static URLs in a look-up table used to avoids repeated access to the same URL. For the semantics duplicates, several algorithms exist to detect near-duplicate web pages: [26], [29], Broder et al.’s shingling algorithm[12] and Charikar’s [14], [29] random projection. DustBuster (Different URLs with Similar Text) [5] tries to detect duplicate URLs using mining.

This is not possible for AJAX application crawling. Traditional Web Crawling has been addressed in works such as [11]. [32] crawls YouTube but for Data Mining purposes.

Javascript Analysis. The work of [28] analyzes Javascript and constructs a control flow, for testing Javascript applications. However, this model cannot help the crawler identify duplicate states in AJAX applications. Most Javascript such as [15] works deal with detecting common spam redirects on web pages.

Application Models. Colorful XML [27] encodes multiple views over the data in the same data structure; we are similar in the approach (i.e., multiple states in a single-page application), but their model lacks the notion of transition. The Transition Graph bears similarities with ActiveXML [1], a dynamic XML structure enhanced with dynamic calls.

Active XML views are however always evolving, and offer just snapshots of current states as opposed to AJAX Crawling.

Deep Web. Because AJAX applications interact with the server using user input, crawl-ing AJAX is related to Hidden Web [30], [36], [34], [35]. AJAX Crawl does more than a

Chapter 9 Related Work

hidden web crawler, since it focuses on very granular interactions with the application and since it escapes the page paradigm commonly used in search.

Parallelization/Optimization. UbiCrawler [6], [7] and [8] is parallelized using au-tonomous agents for scalability as well as fault-tolerance. Although its architecture is more sophisticated than ours, they share several commonalities. We didn’t partic-ularly focus on fault-tolerance. Also, UbiCrawler doesn’t need a Precrawling phase, because the web-graph is computed during crawling. However, our goal was to avoid any communication between different crawling processes. They also propose techniques to compress the web-graph [9]. We did not consider this kind of optimization, because we focused on the minimization of network calls (Hot Node detection mechanism).

Other work in the database community applies: ranking models such as XRank [25], which itself extends PageRank [10], have been used for generic XML structures. In the Annotated Application Structure we propose, we adapted such techniques and used the interactions between content and states for ranking. Techniques such as [3] can be used to rank based on the lowest common ancestor. TopXSearch[33] results or Threshold Algorithms [19] can be applied for an optimized computation of results and ranking.

Incremental Indexing [13] is also relevant.

Chapter 10

Conclusions and Future Work

Crawling AJAX is a difficult problem, avoided by current search engines. The benefits reflect in improved search results. This work addresses AJAX Crawling pragmatically.

and proposed an AJAX Crawler. The most difficult issues in crawling AJAX, duplicate detection and caching, have been addressed by the crawling algorithm and its optimiza-tions based on hot nodes. The experiments on YouTube proved the benefit of crawling AJAX content and offered an insight on the performance/result quality threshold. Usu-ally, a large number of states already leads to large performance overhead.

There are several avenues for future work, most address ways to decrease the number of AJAX content that needs to be crawled. The first one is crawling more current AJAX applications, such as Google Maps. A second one is to address forms in AJAX applications. Most AJAX applications allow user input. Combining AJAX Search and work on Deep Web can provide insight on which content is relevant for crawling. The next one is crawling and personalization, i.e., focusing on a specific user’s or group’s interaction with the server. This would mean that only relevant states and events are crawled, improving both quality and performance. Finally, crawling AJAX can also be seen as a repetitive process, which can reduce the number of crawled events, by ignoring events which did not cause large changes in previous crawling sessions.

Chapter 10 Conclusions and Future Work

Chapter 11 Bibliography

[1] Serge Abiteboul, Omar Benjelloun, Bogdan Cautis, Ioana Manolescu, Tova Milo, and Nicoleta Preda. Lazy Query Evaluation for Active XML. In SIGMOD, 2004.

[2] Amazon.com. http://www.amazon.com.

[3] Sihem Amer-Yahia, Emiran Curtmola, and Alin Deutsch. Flexible and efficient XML search with complex full-text predicates. In SIGMOD, 2006.

[4] Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto. Modern Information Re-trieval. ACM Press / Addison-Wesley, 1999.

[5] Ziv Bar-Yossef, Idit Keidar, and Uri Schonfeld. Do not Crawl in the Dust: Different URLs with Similar Text. In WWW, 2007.

[6] Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. Trova-tore: Towards a highly scalable distributed web crawler. In Poster Proc. of Tenth International World Wide Web Conference, 2001.

[7] Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. Ubicrawler:

Scalability and fault-tolerance issues. In Poster Proc. of Eleventh International World Wide Web Conference, 2002.

[8] Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. Ubicrawler:

A scalable fully distributed web crawler. In Software: Practice & Experience, 2004.

[9] Paolo Boldi and Sebastiano Vigna. The WebGraph framework I: Compression tech-niques. In Proc. of the Thirteenth International World Wide Web Conference, 2004.

[10] Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks, 30(1-7), 1998.

Chapter 11 Bibliography

[11] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 1998.

[12] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig.

Syntactic Clustering of the Web. In WWW, 1997.

[13] Eric W. Brown, James P. Callan, and W. Bruce Croft. Fast Incremental Indexing for Full-Text Information Retrieval. In VLDB, 1994.

[14] Moses S. Charikar. Similarity Estimation Techniques from Rounding Algorithms.

In STOC, 2002.

[15] Kumar Chellapilla and Alexey Maykov. A Taxonomy of JavaScript Redirection Spam. In AIRWeb, 2007.

[16] COBRA Toolkit. http://html.xamjwg.org/cobra.jsp.

[17] Cristian Duda, Gianni Frey, Donald Kossmann, Reto Matter, and Chong Zhou.

AJAX Crawl: Making AJAX Applications Searchable. Submitted for review.

[18] Cristian Duda, Gianni Frey, Donald Kossmann, and Chong Zhou. AJAXSearch:

Crawling, Indexing and Searching Web 2.0 Applications. In VLDB, 2008.

[19] Ronald Fagin, Amnon Lotem, and Moni Naor. Optimal Aggregation Algorithms for Middleware. In PODS, 2001.

[20] Gianni Frey. Indexing AJAX Web Applications. Master Thesis, ETH Zurich, 2007.

[21] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Pat-terns: Elements of Reusable Object-Oriented Software. Addison Wesley, Reading, Massachusetts, 1994.

[22] Google Mail. http://www.gmail.com.

[23] Google Maps. http://maps.google.com.

[24] Google Suggest. http://www.google.com/webhp?complete=1&hl=en.

[25] Lin Guo, Feng Shao, Chavdar Botev, and Jayavel Shanmugasundaram. XRANK:

Ranked Keyword Search over XML Documents. In SIGMOD, 2003.

[26] Monika Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR, 2006.

Chapter 11 Bibliography

[27] H. V. Jagadish, Laks V. S. Lakshmanan, Monica Scannapieco, Divesh Srivastava, and Nuwee Wiwatwattana. Colorful XML: One Hierarchy Isn’t Enough. In SIG-MOD, 2004.

[28] Chien-Hung Liu, David C. Kung, Pei Hsia, and Chih-Tung Hsu. Structural Testing of Web Applications. In ISSRE, 2000.

[29] Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. Detecting Near-Duplicates for Web Crawling. In WWW, 2007.

[30] Sriram Raghavan and Hector Garcia-Molina. Crawling the Hidden Web. In VLDB, 2001.

[31] Rhino: JavaScript for Java. http://www.mozilla.org/rhino/.

[32] Chirag Shah. YouTube Crawling: A VidArch Year in Retrospect.

http://www.ils.unc.edu/vidarch/vidarch-annrep-2008.pdf.

[33] Martin Theobald, Ralf Schenkel, and Gerhard Weikum. An Efficient and Versatile Query Engine for TopX Search. In VLDB, 2005.

[34] Jiying Wang, Ji-Rong Wen, Fred Lochovsky, and Wei-Ying Ma. Instance-based Schema Matching for Web Databases by Domain-Specific Query Probing. In VLDB, 2004.

[35] Wensheng Wu, AnHai Doan, and Clement Yu. WebIQ: Learning from the Web to Match Deep-Web Query Interfaces. In ICDE, 2006.

[36] Wensheng Wu, Clement Yu, AnHai Doan, and Weiyi Meng. An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web. In SIGMOD, 2004.

[37] Yahoo! Mail. http://mail.yahoo.com.

[38] YouTube. http://www.youtube.com.

[39] YouTube Video Keyword Research and Characteristics of Popular YouTube Queries. http://www.brysonmeunier.com/youtube-video-keyword-research-and-characteristics-of-popular-youtube-queries.

Related documents