Bux, Mühl Accessing the Deep Web: A Survey
Ulf Leser: Text Analytics, Praktikum, Sommersemester 2008
VL Text Analytics
„Accessing the Deep Web: A Survey“
“Accessing the Deep Web: A Survey”, 2007
by Bin He, Mitesh Patel, Zhen Zhang, Kevin Chen-Chuan Chang
Computer Science Department, University of Illinois at Urbana-Champaign
The “Deep Web”
Web content that is not indexed by search engines.
“While the surface Web has linked billions of static HTML pages, it is believed that a far more significant amount of information is 'hidden' in the deep Web, behind the query forms of searchable databases [...]. Such information may not be accessible through static URL links.”
“Accessing the Deep Web”, He, Patel, Zhang, Chang
The “Deep Web”
● Dynamically generated pages (forms, user input)
● Login-protected pages
● Context-dependent pages
● Multimedia pages (e.g. Flash)
2000 study
How large is the “Deep Web”?
● approx. 43,000-96,000 Web sites
● approx. 7,500 TB of data
● approx. 500 times larger than the “Surface Web”
2000 study
Problems:
● Limited to extrapolated estimates of the size of the “Deep Web”
● Relies on “overlap analysis”
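Overlap analysis can be illustrated with a capture-recapture style estimate: if two independently compiled listings of sizes n_a and n_b share a number of entries, the total population is estimated as n_a * n_b / overlap. The following is a minimal sketch under that independence assumption; the function name and the example numbers are hypothetical and are not figures from the 2000 study.

```python
def overlap_estimate(n_a: int, n_b: int, overlap: int) -> float:
    """Estimate the total population size from two overlapping listings.

    The share of the population covered by listing A is approximated by
    overlap / n_b, so the total is n_a / (overlap / n_b).
    """
    if overlap <= 0:
        raise ValueError("the listings must share at least one entry")
    return n_a * n_b / overlap


# Hypothetical example: two directories with 200 and 300 entries, 60 shared.
estimated_total = overlap_estimate(200, 300, 60)
```

The estimate is only as good as the independence assumption: if both listings favor the same popular sites, the overlap is inflated and the total is underestimated.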
2007 study
IP sampling method
● 2,230,124,544 possible IP addresses
● Take 1,000,000 random addresses as a representative sample
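The sampling step can be sketched as follows. This is an illustration only, not the survey's procedure: it draws distinct random 32-bit values and keeps those that Python's `ipaddress` module classifies as globally routable. That filter is an assumption and will not reproduce the slide's count of 2,230,124,544 addresses exactly. The function name `sample_public_ipv4` is hypothetical, and the HTTP request step is deliberately left out.

```python
import ipaddress
import random
from typing import List, Optional


def sample_public_ipv4(k: int, seed: Optional[int] = None) -> List[str]:
    """Draw k distinct, uniformly random, globally routable IPv4 addresses."""
    rng = random.Random(seed)
    seen = set()
    sample = []
    while len(sample) < k:
        value = rng.getrandbits(32)  # uniform over the whole IPv4 space
        if value in seen:
            continue
        seen.add(value)
        address = ipaddress.IPv4Address(value)
        if address.is_global:  # reject private, loopback, reserved ranges
            sample.append(str(address))
    return sample
```

Calling `sample_public_ipv4(1_000_000)` would yield a sample of the size used on the slide; a fixed `seed` makes the draw repeatable.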
IP sampling method
Technique:
● Send HTTP requests to the 1,000,000 IPs (GNU tool: wget)
● Download and analyze the Web pages
● Identify “deep Web sites”
IP sampling method
Identifying “deep Web sites”
● “Web server that provides information maintained in one or more backend Web databases”
● The databases are accessed through a form
IP sampling method
Problems:
● “Virtual hosting” (several Web sites sharing one IP address)
● Not all kinds of “deep Web sites” are covered
Entrance to the Deep Web
● The entrance is a query interface (not a login, polling, registration, message-posting or site-search form)
● Depth is the number of operations to get from the root page to the query interface
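A crude version of the query-interface check can be sketched with Python's standard `html.parser`. The heuristic is an assumption made for illustration, not the survey's classifier: a form counts as a candidate query interface if it has at least one text or search input and no password field. This filters out login forms only. Telling site-search, registration or polling forms apart would need further rules. The names `FormScanner` and `looks_like_query_interface` are hypothetical.

```python
from html.parser import HTMLParser


class FormScanner(HTMLParser):
    """Collect simple per-form statistics from an HTML page."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per completed <form> element
        self._current = None  # statistics of the form being parsed

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form":
            self._current = {"text_inputs": 0, "has_password": False}
        elif tag == "input" and self._current is not None:
            # An <input> without a type attribute defaults to a text field.
            kind = (attributes.get("type") or "text").lower()
            if kind in ("text", "search"):
                self._current["text_inputs"] += 1
            elif kind == "password":
                self._current["has_password"] = True

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def looks_like_query_interface(html: str) -> bool:
    """Heuristic: some form has a text field and no password field."""
    scanner = FormScanner()
    scanner.feed(html)
    scanner.close()
    return any(
        form["text_inputs"] > 0 and not form["has_password"]
        for form in scanner.forms
    )
```

Forms without a closing `</form>` tag are ignored by this sketch, so it assumes reasonably well-formed pages.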
Entrance to the Deep Web
● Methods:
− 100,000 of the 1,000,000 IP samples deep-crawled to depth 10
● Findings:
− 94% of the Web databases appeared within depth 3
− Query interfaces are located at shallow depths
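The depth measurement amounts to a breadth-first search from the root page. The sketch below works on an in-memory link graph instead of a live crawl and assumes the root page counts as depth 0. `links`, `is_target` and the page names are hypothetical stand-ins for the crawler's link extraction and query-interface check.

```python
from collections import deque
from typing import Callable, Dict, List, Optional


def depth_of_first_match(
    links: Dict[str, List[str]],
    root: str,
    is_target: Callable[[str], bool],
    max_depth: int = 10,
) -> Optional[int]:
    """Return the smallest link depth at which is_target holds, or None."""
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        page, depth = queue.popleft()
        if is_target(page):
            return depth
        if depth == max_depth:
            continue  # do not follow links beyond the crawl limit
        for neighbor in links.get(page, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return None


# Hypothetical site: the query interface sits two clicks below the root.
site = {
    "/": ["/about", "/catalog"],
    "/catalog": ["/catalog/search"],
}
depth = depth_of_first_match(site, "/", lambda page: page.endswith("/search"))
```

Because breadth-first search visits pages in order of increasing depth, the first match is also the shallowest one.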
Scale of the Deep Web
● Methods:
− All 1,000,000 IP samples crawled to depth 3
− Depth 3 is sufficient since the Deep Web is located at shallow depths
● Findings:
− 2,256 Web servers found in total
− 126 Deep Web sites with 190 Web databases and 406 query interfaces found
Scale of the Deep Web
● Extrapolation:
− 190 * (2,230,124,544 / 1,000,000) / 0.94 ≈ 450,000 databases
− In a similar way, 307,000 Deep Web sites and 1,258,000 query interfaces have been estimated
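The database extrapolation above can be checked with a few lines of arithmetic. The sketch reproduces only the database figure, because the slide gives the depth-3 coverage factor (0.94) for databases alone. The factors behind the site and query-interface estimates are not stated here, so they are not recomputed. Constant and function names are chosen for illustration.

```python
TOTAL_IPS = 2_230_124_544  # possible IP addresses
SAMPLED_IPS = 1_000_000    # size of the random sample
DEPTH3_COVERAGE = 0.94     # share of Web databases found within depth 3


def extrapolate(found_in_sample: int, coverage: float = DEPTH3_COVERAGE) -> float:
    """Scale a sample count to the full IP space and correct for crawl depth."""
    scale = TOTAL_IPS / SAMPLED_IPS
    return found_in_sample * scale / coverage


estimated_databases = extrapolate(190)  # roughly 450,000
```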
Structure of the Deep Web
● Structured data: relationally represented in the form of attribute-value pairs (e.g. books on Amazon.com)
● Unstructured data: no specific order (e.g. CNN's recent news)
● The Surface Web is mostly unstructured (HTML text)
Structure of the Deep Web
● Methods:
− Manual querying and inspection of the 190 databases found
● Findings:
− 43 unstructured and 147 structured databases
● Extrapolation:
− Data in the Deep Web is mostly structured (3.4:1 ratio)
Subject Diversity of the Deep Web
● Surface Web consists of >80% commerce sites
● Methods:
− Manual categorization of the 190 databases found
− Taxonomy: the 14 top-level categories of Yahoo.com
● Findings:
− Large diversity of subjects
− Even distribution between commercial and non-commercial Web databases
Figure: Distribution of databases over subject category (bar chart, value axis from 0% to 30%). Categories: Business & Economy, Computers & Internet, News & Media, Entertainment, Recreation & Sports, Health, Government, Regional, Society & Culture, Education, Arts & Humanities, Science, Reference, Others.
Search engines
How well do Google and others index the Deep Web?
● 20 “deep Web sites”
● Search with Google, Yahoo and MSN
Searching the Deep Web: deep-Web directories
● Online portal services supporting Deep Web database access
− Sort Web databases into different categories
− Enable online search in their categorized databases
Searching the Deep Web: deep-Web directories
● Examples and their number of categorized databases:
− www.completeplanet.com (70,000+)
− www.lii.org (14,000)
− www.turbo10.com (2,300)
− www.invisibleweb.net (1,000)
Searching the Deep Web: deep-Web directories
● Overall coverage is poor (<20%), considering that there are an estimated 450,000 Web databases
● The Deep Web grows too fast to allow manual categorization
Searching the Deep Web: Future Search Engines
● Traditional search engines fail in the Deep Web
− Limitation of crawling (automated search and extraction)
− Databases are updated too frequently to be indexed properly
− Search engines can't exploit the databases' structure
Searching the Deep Web: Future Search Engines
● Better idea: two-tiered search engine
● Discovery: automated search for Web databases suiting the query
− Realized by crawling and indexing the databases' query interfaces
− No information about the databases' internal data is used
Searching the Deep Web: Future Search Engines
● Forwarding: database-specific search in the discovered databases
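The two tiers, discovery and forwarding, can be illustrated with a toy sketch. Everything in it is hypothetical: the `WebDatabase` class, the keyword-overlap ranking and the two example databases are assumptions made for illustration, not the design proposed in the survey. Discovery looks only at terms taken from each database's query interface. Forwarding then hands the query to the selected databases' own search functions.

```python
from typing import Callable, Dict, List, Set


class WebDatabase:
    """A Web database described only by the terms on its query interface."""

    def __init__(self, name: str, interface_terms: Set[str],
                 search: Callable[[str], List[str]]):
        self.name = name
        self.interface_terms = interface_terms
        self.search = search  # database-specific search function


def discover(query: str, databases: List[WebDatabase],
             top_k: int = 2) -> List[WebDatabase]:
    """Tier 1: rank databases by overlap between query and interface terms."""
    words = set(query.lower().split())
    scored = [(len(words & db.interface_terms), db) for db in databases]
    matching = [pair for pair in scored if pair[0] > 0]
    matching.sort(key=lambda pair: pair[0], reverse=True)
    return [db for _, db in matching[:top_k]]


def forward(query: str, selected: List[WebDatabase]) -> Dict[str, List[str]]:
    """Tier 2: send the query to each selected database's own search."""
    return {db.name: db.search(query) for db in selected}


# Hypothetical databases with canned search functions.
books = WebDatabase("books", {"title", "author", "isbn", "book"},
                    lambda q: ["book result for: " + q])
flights = WebDatabase("flights", {"from", "to", "departure", "flight"},
                      lambda q: ["flight result for: " + q])

results = forward("book author tolkien",
                  discover("book author tolkien", [books, flights]))
```

The point of the split is that the discovery index stays small and stable, since it covers query interfaces only, while the frequently changing data is always queried live at its source.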
References
● “Accessing the Deep Web: A Survey”, Bin He, Mitesh Patel, Zhen Zhang, Kevin Chen-Chuan Chang, 2007
● “The Deep Web: Surfacing Hidden Value”, Michael K. Bergman, 2001