3.2 Search Computing
The aim of Search Computing is to answer multi-domain queries, i.e., queries over multiple semantic fields of interest. It helps users by decomposing queries and automatically assembling complete results from partial answers. Hence, Search Computing aims at filling the gap between generalized search systems, which are unable to find information spanning multiple topics, and domain-specific search systems, which cannot go beyond their domain limits.
Some examples of Search Computing queries are: "Where can I attend an interesting scientific conference in my field and at the same time relax on a beautiful beach nearby?", "Where is the theatre closest to my hotel, offering a high rank action movie and a near-by pizzeria?", "Who is the best doctor who can cure insomnia in a nearby public hospital?", "Which are the highest risk factors associated with the most prevalent diseases among the young population?". These examples show that Search Computing aims at covering a large and increasing spectrum of user queries, which structurally go beyond the capabilities of general-purpose search engines. These queries cannot be answered without capturing some of their semantics, which at a minimum consists of understanding their underlying domains, routing the appropriate query subsets to each domain-specific source, and combining the answers from each expert domain to build a complete answer that is meaningful for the user.
The main idea behind Search Computing is a data integration process: complex queries draw on complex data, and complex data demand integration. In Search Computing, integration is query-specific, as answering queries related to different semantic domains demands intrinsically different data sources: building results for such queries does not require "global data integration", but simply data integration relative to specific domains.
In Search Computing a data source is any data collection accessible on the Web. The Search Computing vision is that each data source should be focused on its single domain of expertise (e.g., travel, music, shows, food, movies, health, genetic diseases, etc.), but that pairs of data sources that share information (e.g., about locations, people, genes) can be linked to each other, so that complex queries spanning more than one data source can use such a pairing (which we call a "composition pattern") to build complex results. An advantage of this approach is its transitivity: if we can pair source A to source B (e.g., pathologies which alter body functions), and then source B to source C (e.g., body function alterations that are treated by drugs), then we can answer queries that connect A to C (e.g., pathologies treated by drugs), and so on. Each source is in charge of its own maintenance policy, thus determining its level of data accuracy and currency.
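The transitivity argument can be illustrated with a minimal Python sketch. It assumes that each pairing is available as a plain list of tuples; the function name `compose` and all data values are hypothetical placeholders, not part of SeCo.

```python
from collections import defaultdict

def compose(pairs_ab, pairs_bc):
    """Chain an A-B pairing with a B-C pairing into a derived A-C pairing."""
    # Index the second pairing by its B element for direct lookup.
    by_b = defaultdict(list)
    for b, c in pairs_bc:
        by_b[b].append(c)
    derived = []
    for a, b in pairs_ab:
        for c in by_b.get(b, []):
            if (a, c) not in derived:  # avoid duplicate A-C links
                derived.append((a, c))
    return derived

# Hypothetical placeholder data: pathologies -> body functions -> drugs.
pathology_function = [("pathology_1", "function_x"), ("pathology_2", "function_y")]
function_drug = [("function_x", "drug_k"), ("function_x", "drug_m")]

pathology_drug = compose(pathology_function, function_drug)
print(pathology_drug)
```

Here `pathology_2` yields no result because no drug is paired with `function_y`: the derived pairing is only as complete as the two pairings it is built from.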
The Search Computing paradigm also addresses the problem of composition patterns, i.e., of coupling data sources for answering multi-domain queries, recalling that the purpose of composition is search, and that results should therefore be presented to users according to some ranking that respects both the original rank of the elements coming from the native data sources and the search intent of the user. SeCo resorts to join [Ilyas 2004], which is however revisited in the context of Search Computing to become service-based and ranking-aware [Polyzotis 2011]. A result item of a multi-domain query is a "combination", built by joining two or more elements coming from distinct data sources and returned by different search engines. In our first query example ("Where can I attend an interesting scientific conference in my field and at the same time relax on a beautiful beach nearby?"), combinations are triples made of: database conferences (extracted from a site specialized in scholarly events, e.g., Dblife [DBLife 2014]), inexpensive flights (extracted from a flight selection site, e.g., Expedia [Expedia 2014] or Edreams [Edreams 2014]), and cities with nice beaches (extracted from tourism or review sites, e.g., Yahoo! Travel or Tripadvisor [YahooTravel 2014, TripAdvisor 2014]).
Inter-domain connections carry semantics: flights connect pairs of cities at given dates; therefore connections use "dates" and "cities" as matching properties. In SeCo, joins are applied in the context of web services, under the assumption that every data source is wrapped as a web service. Such services typically expose a query-like interface, which takes keyword-based input and yields ranked results as output. Services are further composed by using a ranking-preserving join. This operation is referred to as a join of search services [Bozzon 2011a, Bozzon 2010, Braga 2011].
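As an illustration of joining ranked service results on matching properties, the following Python sketch may help. It is a deliberately naive version: it materializes all matching combinations and sorts them by an aggregate score, whereas the join methods cited above are designed to work incrementally on ranked inputs. The class `Hit`, the function `ranked_join`, the product-based score aggregation, and all data values are assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """One ranked result returned by a (hypothetical) search service."""
    attrs: dict
    score: float  # assumed to lie in [0, 1], higher is better

def ranked_join(left, right, keys, combine=lambda a, b: a * b):
    """Join two ranked result lists on `keys`; order combinations by combined score."""
    combos = []
    for l in left:
        for r in right:
            if all(l.attrs[k] == r.attrs[k] for k in keys):
                combos.append(({**l.attrs, **r.attrs}, combine(l.score, r.score)))
    # Ranking-aware step: best combined score first.
    combos.sort(key=lambda c: c[1], reverse=True)
    return combos

# Hypothetical results from a flight service and a conference service.
flights = [
    Hit({"city": "Nice", "date": "2014-06-01", "flight": "F1"}, 0.9),
    Hit({"city": "Oslo", "date": "2014-06-01", "flight": "F2"}, 0.8),
]
conferences = [
    Hit({"city": "Oslo", "date": "2014-06-01", "conf": "C1"}, 0.95),
    Hit({"city": "Nice", "date": "2014-06-01", "conf": "C2"}, 0.5),
]

combinations = ranked_join(flights, conferences, keys=("city", "date"))
for attrs, score in combinations:
    print(attrs["conf"], attrs["flight"], round(score, 2))
```

The matching properties are "city" and "date", as in the flight example; the choice of the product as aggregation function is arbitrary and could be replaced by any monotone function of the input scores.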
Search Computing aims at giving expert users the capability of building similar solutions for different choices of domains which, in the same way as Expedia or Lastminute, share given properties and can therefore be connected. For such purposes, Search Computing offers a collection of methods and techniques for orchestrating the search engines and building global results. Composition patterns are predefined connections between well-identified Web services; therefore, orchestrations are not built arbitrarily, but rather by selecting nodes (representing services) and arcs (representing the links in the composition patterns) within a resource network representing the various knowledge sources and their connections [Brambilla 2011]. This vision is consistent with the emerging idea of moving from an Internet of (disorganized) pages to an Internet of (semantically coherent) objects.
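The idea of selecting nodes and arcs within a resource network can be sketched as a path search over a small graph. The code below is only an illustrative assumption of how such a selection might look: the function `find_orchestration`, the breadth-first strategy, and the service names are hypothetical and are not taken from the SeCo implementation.

```python
from collections import deque

def find_orchestration(arcs, start, goal):
    """Return a chain of services linking `start` to `goal`, or None if none exists."""
    # Build an undirected adjacency map from the composition patterns (arcs).
    graph = {}
    for a, b in arcs:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Hypothetical resource network: nodes are services, arcs are composition patterns.
patterns = [("Conference", "Flight"), ("Flight", "Beach"), ("Hotel", "Restaurant")]

print(find_orchestration(patterns, "Conference", "Beach"))
print(find_orchestration(patterns, "Conference", "Hotel"))
```

Because only predefined composition patterns appear as arcs, services that are not connected in the network (here, "Conference" and "Hotel") cannot be orchestrated together.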
Complex queries are not only hard to answer, but can also be difficult for the user to formulate. Thus, an important research issue is how best to capture the user's search intent and direct the user towards the discovery of their true information need. This can be done by means of liquid queries, a dynamic query interface that lets users extend the scope of queries and then browse query results [Bozzon 2010, Bozzon 2011b].
Finally, Search Computing considers the possibility of automatically inferring the relevant network of data sources required to build the answer from keyword-based user queries [Brambilla 2011]. This will require "understanding" query terms and associating them to resources, through tagging, matching, and clustering techniques. Thereafter, the query will be associated to the "best" network of resources according to matching functions, and dynamically evaluated upon them. This goal is rather ambitious, but it is similar to supporting automatic matching of query terms to services within a semantic network of concepts, currently offered by Kosmix [Rajaraman 2009]. One step in this direction is to extend joins between services to support the notions of partial linguistic matching between terms (supported by vocabularies such as WordNet) or dealing with the predicate "near" in specific domains (e.g., distance, time, money).
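To make the notion of a "near" predicate more concrete, the following Python sketch shows a join whose matching condition is a tolerance on a distance rather than strict equality. It is a simplified assumption: positions are treated as points in a plane, and the names `near` and `predicate_join` as well as the data are hypothetical.

```python
import math

def near(threshold, distance=math.dist):
    """Build a predicate that holds when two values are within `threshold` of each other."""
    def predicate(a, b):
        return distance(a, b) <= threshold
    return predicate

def predicate_join(left, right, key, predicate):
    """Pair every left and right item whose `key` values satisfy `predicate`."""
    return [(l, r) for l in left for r in right if predicate(l[key], r[key])]

# Hypothetical items with planar coordinates (arbitrary units).
theaters = [{"name": "T1", "pos": (0.0, 0.0)}, {"name": "T2", "pos": (5.0, 5.0)}]
pizzerias = [{"name": "P1", "pos": (0.3, 0.4)}]

pairs = predicate_join(theaters, pizzerias, "pos", near(1.0))
print([(t["name"], p["name"]) for t, p in pairs])
```

The same structure would apply to time or money by passing a different `distance` function, for instance the absolute difference between two prices.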
search query example in SeCo data model context. To access web data sources, SeCo Search relies on the Service Description Framework (SDF) [Brambilla 2011], a multi-layered service model. At the top of SDF, the conceptual level is a simple model that characterizes real-world entities and is used to build a domain diagram (DD) encompassing different data services called Service Marts (SM). Service Marts are structurally defined by means of attributes and their relationships. Next, the logical level describes the access to the conceptual entities in terms of data retrieval patterns (Access Patterns, AP) described by input and output attributes. Join operations between access patterns are performed by means of attributes that share the same domains, thus forming a connection pattern. At the bottom, the physical level represents the mapping of access patterns to concrete Web data source interfaces (Service Interfaces, SI), which incorporate the access endpoints together with basic non-functional properties of the service.
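The three SDF levels can be pictured as nested data structures. The Python sketch below is an illustrative assumption rather than the actual SDF schema: the class and field names, the attribute lists, and the `example.org` endpoint are hypothetical, and attributes are matched by name as a simplification of "sharing the same domain".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceMart:          # conceptual level
    name: str
    attributes: tuple

@dataclass(frozen=True)
class AccessPattern:        # logical level
    name: str
    mart: ServiceMart
    inputs: tuple
    outputs: tuple

@dataclass(frozen=True)
class ServiceInterface:     # physical level
    name: str
    pattern: AccessPattern
    endpoint: str

def connection_attributes(a, b):
    """Attributes that appear in both access patterns and could support a join."""
    attrs_b = set(b.inputs) | set(b.outputs)
    return tuple(x for x in a.inputs + a.outputs if x in attrs_b)

# Hypothetical marts and patterns, loosely inspired by a movie/theater scenario.
movie = ServiceMart("Movie", ("Title", "Genre", "Year"))
theater = ServiceMart("Theater", ("Name", "City", "Title"))
ap_movie = AccessPattern("MovieByGenre", movie, inputs=("Genre",), outputs=("Title", "Year"))
ap_theater = AccessPattern("TheaterByCity", theater, inputs=("City",), outputs=("Name", "Title"))
si_movie = ServiceInterface("SI_movie", ap_movie, "https://example.org/movies")

print(connection_attributes(ap_movie, ap_theater))
```

In this toy instance the only shared attribute is "Title", which is therefore the candidate for a connection pattern between the two access patterns.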
To illustrate, let us assume the user wants to pose a query about which movies of a given genre are shown close to their current location in New Zealand. In natural language, the query reads "What are the cinemas showing thrillers in Auckland?". We assume a service framework, as depicted in Figure 3.1, featuring:
1) A Movie service mart that can be accessed via access pattern MovieByTitle (AP1), returning movie details based on their title, with a single corresponding service interface mapped to http://www.imdb.com (SI1).
2) A Cinema service mart that can be accessed via TheaterByCity (AP2), returning New Zealand movie theaters close to an input position, having a service interface mapped to https://www.eventcinemas.co.nz (SI2).
The join operation is performed via the Movie.Title attribute, which is present in the input and output of AP1 and AP2. Given such services, the above query results in the SeCo query language (SeCoQL) query [Braga 2011] below, which accepts as input the movie genre and the address and city of the user's location:
DEFINE QUERY Q($X:String, $U:String, $V:String) AS
SELECT M.*, T.*
FROM AP1 (Movie.Genre: $X, Movie.Year: $Z) AS M USING S1
JOIN AP2 (Theater.Addr: $U, Theater.City: $V, Theater.Phone: $W) AS T USING S2
ON M.Movie.Title=T.Movie.Title
WHERE $V=Auckland AND $X=Thriller

At execution time, after extracting movie and theater instances, the query