

4.3 Service Materialization Model

4.3.1 Service Materialization Model Formalization

When considering a set of data sources and service interfaces, as in the example of Figure 4.2, several data materialization scenarios can emerge. For instance, an application might require the full materialization of a data source over a specific service interface (and its related access pattern). In other scenarios, one might be interested in materializing the whole data source by exploiting all of its access patterns and available services (e.g., the whole IMDB database). Finally, an application might aim at collecting a comprehensive view of the information related to one or more domains (e.g., all the movies ever released). The materialization process applies the service characteristics of the SeCo description framework together with its novel extension, the materialization dimensions contained in access patterns; both are formally defined in this section.

Definition 4.1. In the context of a Service Materialization Model, a materialization over a given service interface $s$ mapped to an access pattern $AP$ is referred to as a single pattern single service materialization (spss). An spss materialization is an ordered pair $spss_m(s, R)$, where $s$ is a service interface definition and $R = \{r_1 \cup r_2 \cup \dots \cup r_n\}$ is the union of all result sets $r_1, \dots, r_n$ obtained during materialization $m$.

Definition 4.2. The materialization coverage of $R$ relative to some $AP$ output $V$ is defined as $Cov = |R| / |V|$. It denotes the ratio between $|R|$, the number of tuples discovered in materialization $m$, and $|V|$, the total number of tuples in output $V$.

For instance, in terms of access pattern MovieByTitle, a materialization over one of its service interfaces (e.g., IMDB1) is considered a single pattern single service materialization (spss) $imdb1_m(IMDB1, R_{IMDB1})$. The coverage of this materialization is the ratio of the number of tuples discovered in the materialization of IMDB1 to the total IMDB1 size.
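As a minimal sketch of Definition 4.2, the coverage ratio can be computed directly from a set of discovered tuples and the size of the output. The function name `coverage`, the sample tuples and the assumed output size of 12 are hypothetical and serve only to illustrate the formula.

```python
def coverage(materialized, output_size):
    """Return Cov = |R| / |V| for the discovered tuples R and |V| = output_size."""
    if output_size <= 0:
        raise ValueError("output size |V| must be positive")
    # R is a set, so repeated tuples are counted once.
    return len(set(materialized)) / output_size


# Hypothetical data: 3 discovered tuples, assumed |V| = 12.
R = [
    ("Avatar", "action", 2009),
    ("Titanic", "drama", 1997),
    ("Up", "comedy", 2009),
    ("Avatar", "action", 2009),  # duplicate, does not raise coverage
]
cov = coverage(R, 12)
```

In practice $|V|$ is usually unknown for a remote source, so a real system would have to rely on an estimate of the output size.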

Definition 4.3. A materialization query over a given access pattern $AP$ is defined as a tuple $q^p = (v_1, \dots, v_n)$, where $q$ is a query identifier; $p$ is the result page number the query refers to, with $1 \le p \le MaxNoPages$, where $MaxNoPages$ is the maximum number of pages retrievable by a query as prescribed by the remote source; each $v_i$ is a value of attribute $i$ in the input domain $I$; and $n$ is determined by the cardinality of $I$.

Definition 4.4. A materialization result set over $AP$ is defined as a set of tuples $\{r(v_1, \dots, v_n)\}$, where each $v_i$ is a value of attribute $i$ in the output domain $O$, and $n$ is determined by the cardinality of $O$.

Definition 4.5. A materialization call over ($s_i$ mapped to) $AP$ is defined as an ordered pair $c(q^p, r_q)$, where (i) $c$ is the unique call id; (ii) $q^p$ is the query that initiated the call; and (iii) $r_q$ is the result set, i.e., the result page $p$ that answers $q^p$.
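Definitions 4.3 to 4.5 can be read as three small record types. The sketch below is one possible encoding, not the author's implementation; the class names `Query` and `Call`, the helper `make_query` and the value `max_pages=5` are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Query:
    """q^p: query id q, result page number p, one value per input attribute."""
    query_id: str
    page: int
    values: Tuple


@dataclass(frozen=True)
class Call:
    """c(q^p, r_q): call id, the initiating query, and the result page."""
    call_id: str
    query: Query
    result: FrozenSet[Tuple]


def make_query(query_id, page, values, max_pages):
    """Build a query, enforcing 1 <= p <= MaxNoPages."""
    if not 1 <= page <= max_pages:
        raise ValueError(f"page {page} is outside 1..{max_pages}")
    return Query(query_id, page, tuple(values))


# Hypothetical usage; max_pages=5 is an assumed provider limit.
q = make_query("imdb1_q001", 1, ("Avatar", "action", 2009), max_pages=5)
c = Call("c1", q, frozenset({("Avatar", "action", 2009)}))
```

Frozen dataclasses are used so that queries and calls are hashable and can be stored in sets, which matches the set-based wording of the definitions.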

During data acquisition of an access pattern, a data source is accessed by issuing a sequence of materialization calls. Each call $c$ contains a query $q^p$ over this access pattern's input interface and the corresponding result set $r_q$, which is expressed over the $AP$ output interface.

Following the above, we define a sequence of materialization calls $\{C_m\}_{j=1}^{k}$ as an ordered pair $(\{Q_m\}_{j=1}^{k}, \{R_m\}_{j=1}^{k})$, where $\{Q_m\}_{j=1}^{k}$ is a set of materialization queries, $\{R_m\}_{j=1}^{k}$ is a set of materialization result sets, and $k$ is the number of calls performed to obtain materialization $m$.
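The call sequence can be sketched as a loop that issues one call per query and accumulates the result pages. This is an illustrative sketch only: `materialize`, `fake_fetch_page` and the tuple layout are hypothetical, and the stub stands in for a remote service that the text does not specify.

```python
def materialize(queries, fetch_page):
    """Issue one call per (values, page) query.

    Returns the list of calls (j, query, new_tuples) and the
    accumulated materialization R.
    """
    calls = []
    R = set()
    for j, (values, page) in enumerate(queries, start=1):
        page_tuples = fetch_page(values, page)
        fresh = set(page_tuples) - R  # keep only tuples not yet in R
        R |= fresh
        calls.append((j, (values, page), fresh))
    return calls, R


def fake_fetch_page(values, page):
    """Hypothetical stub service: every page also repeats the page-1 tuple."""
    (title,) = values
    return [(title, page), (title, 1)]


queries = [(("Avatar",), 1), (("Avatar",), 2)]
calls, R = materialize(queries, fake_fetch_page)
```

Here $k$ equals `len(calls)`, and the second call contributes only one new tuple because the stub returns a duplicate of an already materialized one.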

Furthermore, a materialization call over ($s_i$ mapped to) access pattern $AP$ is expressed in SQL-like syntax:

SELECT distinct(<v_{O,1}, ..., v_{O,m}> AS r_tmp, R_m) AS r_q
FROM s_i^{AP}
WHERE (a_{i,1} = v_{i,1} & ... & a_{i,n} = v_{i,n} & pageNumber = p) AS q^p

Where:

(i) $r_q$ is a result set aggregated by a duplicate tuples detection routine (distinct);

(ii) distinct separates the newly retrieved tuples r_tmp{<v_{O,1}, ..., v_{O,m}>} from tuples already in materialization $R_m$;

(iii) each domain value $v_m$ in an r_tmp element belongs to an attribute element of output interface $O$, and $m = |O|$;

(iv) $q^p$ is expanded so that each domain value $v_I$ is associated with an attribute element of input interface $I$, and $n = |I|$;

(v) $p$ is the result page number the query refers to, $1 \le p \le MaxNoPages$, where $MaxNoPages$ is the maximum number of pages retrievable by the query, as defined by the service provider.

Observed through the MovieByTitle materialization, an example query posed against the IMDB1 service may be written as $imdb1\_q001^{1}$ <'Avatar', 'action', 2009>, where imdb1_q001 is this query's unique id; the superscript 1 signifies the result page number this query relates to; and the terms Avatar, action and 2009 are the values in the respective domains of $I_{MovieByTitle}$.

The query in SQL-like syntax reads:

SELECT distinct({<Title, Genre, Year>} AS r_tmp, R_{IMDB1}) AS r^1_{imdb1_q001}
FROM IMDB1
WHERE (Title = 'Avatar' AND Genre = 'action' AND Year = 2009 AND pageNumber = 1) AS imdb1_q001^1

This query results in:

r_tmp{<'Avatar', action, 2009>, <'Avatar2', action, 2016>, ..., <'The Avatars', action, 2013>, <'Avatar Spirits', documentary, 2010>}.

With the assumption that $R_m$ already contains the tuple <'Avatar Spirits', documentary, 2010>, distinct(r_tmp) delivers the final result:

r^1_{imdb1_q001}{<'Avatar', action, 2009>, <'Avatar2', action, 2016>, ..., <'The Avatars', action, 2013>}.
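The effect of the duplicate detection routine in this example can be reproduced with a short sketch. The function below is an assumed reading of distinct (filtering a result page against the tuples already materialized), not the author's implementation, and only the four tuples spelled out in the example are used; the elided ones are left out.

```python
def distinct(r_tmp, materialized):
    """Return the tuples of r_tmp that are not yet in the materialization,
    preserving order and dropping repeats within the page itself."""
    seen = set(materialized)
    fresh = []
    for t in r_tmp:
        if t not in seen:
            seen.add(t)
            fresh.append(t)
    return fresh


# Tuples listed explicitly in the example above.
r_tmp = [
    ("Avatar", "action", 2009),
    ("Avatar2", "action", 2016),
    ("The Avatars", "action", 2013),
    ("Avatar Spirits", "documentary", 2010),
]
# Assumption from the text: R_m already holds 'Avatar Spirits'.
R_m = {("Avatar Spirits", "documentary", 2010)}

r_q = distinct(r_tmp, R_m)
```

With these inputs `r_q` holds the three action titles, and the documentary tuple is discarded because it is already part of the materialization.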

Definition 4.6. A materialization is considered feasible if there is an input values dictionary (dict) allocated to each attribute in the input domain, that is, $\forall a \in I \;\, \exists\, dict \subseteq V_a : 1 \le |dict| \le |V_a|$.

A feasible solution allows the materialization call set to be executed sequentially.
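The feasibility condition of Definition 4.6 can be checked mechanically when the attribute domains are known. The sketch below assumes that domains and dictionaries are given as Python mappings from attribute name to a collection of values; the names `is_feasible`, `domains` and `dicts`, as well as the sample domains, are hypothetical.

```python
def is_feasible(input_domains, dictionaries):
    """Check that every input attribute a has a dictionary dict with
    dict a subset of V_a and 1 <= |dict| <= |V_a|."""
    for attr, domain in input_domains.items():
        d = set(dictionaries.get(attr, ()))
        if not d:                  # violates 1 <= |dict|
            return False
        if not d <= set(domain):   # violates dict being a subset of V_a
            return False
    return True


# Hypothetical input domains and dictionaries.
domains = {
    "Title": {"Avatar", "Titanic", "Up"},
    "Genre": {"Drama", "Action", "Comedy"},
}
dicts = {"Title": {"Avatar", "Titanic"}, "Genre": {"Action"}}

feasible = is_feasible(domains, dicts)
missing_genre = is_feasible(domains, {"Title": {"Avatar"}})
```

The upper bound $|dict| \le |V_a|$ needs no separate test, since it follows from the subset check.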

A materialization call set $\{C_m\}_{j=1}^{k}$ is limited by $k$, where $k$ is determined by (a) the size of the Cartesian product of all provided dictionaries and (b) the number of returned result set pages. That is, the query set sequence is

$\{Q_m\}_{j=1}^{k} = dict_1 \times dict_2 \times \dots \times dict_n$,

where $k = |dict_1 \times dict_2 \times \dots \times dict_n| \times noOfReturnedPages$, $n = |I|$, and $noOfReturnedPages$ is the number of result pages $r$ retrieved for each query.

This dictates the length of $\{R_m\}_{j=1}^{k}$, which implies the cardinality $|R_m| = \sum_{i=1}^{k} pageSize_i$.

In the case of single pattern single service materialization (spss): $V_{I,s_i} \subseteq V_{I,ap}$ and $V_{O,s_i} \subseteq V_{O,ap}$.

Similarly, limiting factors such as duplicates saturation of the output domain $O_{s_i}$, which occurs when a data source provides result tuples that are already present in the materialization, decrease $Cov_m$, thus rendering the materialization process inefficient in terms of the number of executed queries and the obtained materialization volume.

Let us consider a feasible materialization of IMDB1 whose input dictionaries are subsets of the IMDB1 input domain: $dict_{Title} = \{Avatar, Titanic\}$, $dict_{Genre} = \{Drama, Action, Comedy\}$ and $dict_{Year} = \{2005, 2006, 2007, 2008, 2009\}$. Let us also assume, for the sake of the example, that $noOfReturnedPages$ is predefined as 5 pages per query. The Cartesian product of the provided values gives a query set $Q_{IMDB1}$ of size $k = (2 \times 3 \times 5) \times 5 = 150$. Assuming that each materialization call achieves (a) the maximum result page size and (b) a duplicate-free result of 100 tuples, the cardinality of the produced materialization $R_{IMDB1}$ is $150 \times 100 = 15000$ tuples. As the input dictionaries contain only a subset of the input domain $V_{I,IMDB1}$, the produced materialization is a fraction of the full output domain of IMDB1, $V_{O,IMDB1}$; in terms of coverage, $Cov_{IMDB1} = 15000 / |V_{O,IMDB1}|$. However, the materialization execution still performs to its full potential, as it produces the maximum result size for the given number of executed queries. This example shows that a failure of the queried data source to deliver (a) full-size result pages and (b) duplicate-free result sets would be detrimental to the materialization coverage, as it would decrease the size of the final materialization.
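The arithmetic of this example can be reproduced by enumerating the query set. The sketch uses the dictionaries and the 5 pages per query from the text; the variable names and the constant page size of 100 duplicate-free tuples follow the example's assumptions and are not measured values.

```python
from itertools import product

dict_title = ["Avatar", "Titanic"]
dict_genre = ["Drama", "Action", "Comedy"]
dict_year = [2005, 2006, 2007, 2008, 2009]
pages_per_query = 5   # noOfReturnedPages, as assumed in the example
page_size = 100       # full, duplicate-free result page (assumption)

# One query per combination of dictionary values and per result page.
queries = [
    (values, page)
    for values in product(dict_title, dict_genre, dict_year)
    for page in range(1, pages_per_query + 1)
]

k = len(queries)             # (2 * 3 * 5) * 5
cardinality = k * page_size  # sum of pageSize_i over all k calls
```

Any call that returns a short page or tuples already materialized would lower `cardinality` below this upper bound while `k` stays the same.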