
OGSA-DQP: Services meet Grid query processing

Web Services (WSs) [GGKS02] have emerged as a broadly adopted and appealing computing paradigm for loosely-coupled distributed applications, as they provide language and platform-independent techniques for describing, discovering, invoking and orchestrating collections of networked computational services. The significance of their benefits has forced the Grid community to recast its middleware functionalities as Grid Services (GSs) and thus to develop the Open Grid Services Architecture (OGSA), and a reference middleware implementation released as Globus 3 [FKNT02, TCF+02]. Web and Grid service technologies are converging rapidly and they complement each other. Indeed, WSs mostly focus on platform-neutral description, discovery and invocation, while GSs are more interested in managing service state and lifetime, dynamic discovery and efficient use of distributed computational resources.

OGSA-DQP [AMP+03] is a service-based distributed query processor, which reuses large parts of Polar*, and is built on top of Globus 3. OGSA-DQP is service-based in two orthogonal dimensions:

• it manifests itself as a collection of services, and

• it accesses remote data repositories and analysis tools that are in the form of services.

OGSA-DQP services are built on top of Grid Data Services (GDSs) developed in the context of the OGSA-DAI initiative [OD], which aims to provide a common service interface to Grid-connected DBMSs. GDSs extend the basic GSs of Globus 3. In the remainder of this section, the main characteristics of each of the above classes of service are discussed.

2.5.1 Grid Services and Grid Data Services

GSs in Globus 3 implement basic, application-independent functionalities that are useful to any service. They implement a standard interface for registering their instances with registry services. This interface enables other services to query the registries and retrieve information about them. In this way, they provide for registration and discovery. Instances are created dynamically from Service Factories and are identified by their unique handle. Their lifetime is manageable; thus failed instances can be disposed of in a tidy manner. GSs also provide mechanisms for notification subscription, production, delivery and receipt.
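The interplay of factories, handles, registries and lifetime management can be illustrated with a small toy model. This is only a sketch: the class and method names (`Registry`, `ServiceFactory`, `sweep`) and the example base URI are hypothetical and do not correspond to the Globus 3 API, and notification is not modelled.

```python
import itertools
import time


class Registry:
    """Toy registry: maps instance handles to descriptive metadata."""

    def __init__(self):
        self._entries = {}

    def register(self, handle, metadata):
        self._entries[handle] = metadata

    def unregister(self, handle):
        self._entries.pop(handle, None)

    def query(self, **criteria):
        """Return the handles whose metadata match every criterion."""
        return [h for h, md in self._entries.items()
                if all(md.get(k) == v for k, v in criteria.items())]


class ServiceInstance:
    """An instance with a unique handle and a bounded lifetime."""

    def __init__(self, handle, lifetime_s):
        self.handle = handle
        self.expires_at = time.monotonic() + lifetime_s

    def is_expired(self, now=None):
        now = time.monotonic() if now is None else now
        return now >= self.expires_at


class ServiceFactory:
    """Creates instances on demand and registers each one (hypothetical)."""

    def __init__(self, base_uri, registry):
        self._base_uri = base_uri
        self._registry = registry
        self._counter = itertools.count(1)
        self._instances = {}

    def create(self, kind, lifetime_s=60.0):
        handle = f"{self._base_uri}/{kind}/{next(self._counter)}"
        instance = ServiceInstance(handle, lifetime_s)
        self._instances[handle] = instance
        self._registry.register(handle, {"kind": kind})
        return instance

    def sweep(self, now=None):
        """Dispose of expired instances and return their handles."""
        expired = [h for h, inst in self._instances.items()
                   if inst.is_expired(now)]
        for handle in expired:
            del self._instances[handle]
            self._registry.unregister(handle)
        return expired
```

In this model, a client would call `registry.query(kind=...)` to discover instances, and a periodic `sweep()` stands in for the tidy disposal of instances whose lifetime has elapsed.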

GDSs extend GSs by providing a registry facility for the publication of other GDSs. They are also capable of creating instances tailored to specific, application-dependent requirements with regard to database location and query language. Also, GDSs can be Grid-enabled wrappers of relational, object-relational and XML databases.

2.5.2 OGSA-DQP Services

OGSA-DQP [AMP+03] re-engineered Polar* to conform to service-based principles. OGSA-DQP defines two services, both extending GDSs:

• Grid Distributed Query Services (GDQSs) that encapsulate the Polar* compiler,

• Grid Query Evaluation Services (GQESs) that implement the functionality of the Polar* evaluators.

The main differences between the two systems that are relevant to this thesis are the following:

• OGSA-DQP is not tightly coupled with any persistent storage system, such as SHORE. The result is that OGSA-DQP services are significantly more lightweight than the corresponding Polar* components.

• In OGSA-DQP, metadata are kept in main memory and constructed on a per-session basis, whereas in Polar* metadata are persistent, and thus can be arbitrarily outdated.

• OGSA-DQP employs the GDS generic interface to access remote databases, whereas Polar* relies on manually constructed wrappers.

• In OGSA-DQP, communication between service instances occurs in the form of XML documents transmitted over SOAP/HTTP. Thus, there is no need for MPI messaging, as in Polar*.

• The operation calls in Polar* load local code dynamically, whereas in OGSA-DQP, although the compiler still treats them as typed user-defined functions (UDFs), they constitute calls to WSs. A consequence of this difference is that they can be executed on any GQES, contrary to what happens in Polar*.

2.5.3 Query Planning and Evaluation in OGSA-DQP

The single-node optimisation policy in OGSA-DQP is the same as in Polar*, i.e., employing heuristics that minimise the volume of intermediate results and the size of data transmitted over the network, to reduce the query response time. Multi-node optimisation differs in the scheduling policy, to reflect the fact that operation calls can be deployed on any machine. Also, the compiler is modified to output the query plan in XML (see Figure 2.9 for a simple example). Finally, the optimiser has been modified to read metadata from main memory and not to access any repository.

The steps for setting up query sessions and submitting queries in a GS environment are shown in Figure 2.10:

<Partition>
  <evaluatorURI>6</evaluatorURI>
  <Operator operatorID="0" operatorType="TABLE_SCAN">
    <tupleType>
      <type>Classification</type> <name>Classification.OID</name>
      <type>string</type> <name>Classification.contologyid</name>
      <type>string</type> <name>Classification.cproteinid</name>
    </tupleType>
    <TABLE_SCAN>
      <dataResourceName> Classifications </dataResourceName>
      <GDSHandle> http://rpc52.cs.man.ac.uk:9090/ogsa/services/ogsadai/GridDataServiceFactory </GDSHandle>
      <tableName> Classifications </tableName>
    </TABLE_SCAN>
  </Operator>
  <Operator operatorID="1" operatorType="APPLY">
    <tupleType>
      <type>string</type> <name>Classification.cproteinid</name>
    </tupleType>
    <APPLY>
      <inputOperator> <OperatorID>0</OperatorID> </inputOperator>
      <applyOperationType>PROJECT</applyOperationType>
      <parameters>
        <attributeName>Classification.cproteinid</attributeName>
      </parameters>
    </APPLY>
  </Operator>
  <Operator operatorID="2" operatorType="EXCHANGE">
    <tupleType>
      <type>string</type> <name>Classification.cproteinid</name>
    </tupleType>
    <EXCHANGE>
      <inputOperator> <OperatorID>1</OperatorID> </inputOperator>
      <consumers>
        <operatorReference>
          <EvaluatorURI>7</EvaluatorURI> <OperatorID>0</OperatorID>
        </operatorReference>
        <operatorReference>
          <EvaluatorURI>1</EvaluatorURI> <OperatorID>0</OperatorID>
        </operatorReference>
      </consumers>
      <producersNumber>0</producersNumber>
      <producers> </producers>
      <arbitratorPolicy> <ROUND_ROBIN> 1 </ROUND_ROBIN> </arbitratorPolicy>
    </EXCHANGE>
  </Operator>
</Partition>

Figure 2.9: An example plan fragment that retrieves the Classification relation from evaluator 6, projects the cproteinid attribute, and sends the resulting tuples to evaluators 1 and 7. The fragment corresponds to the lower right partition in Figure 2.7(c).
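To make the structure of such a plan fragment concrete, the following Python sketch walks an abridged copy of the fragment in Figure 2.9 and extracts, for each operator, its type, its input operator and (for exchanges) the consuming evaluators. The element names are taken from the figure; the `summarise` function and the abridgement (tuple types, the GDS handle and the arbitrator policy are omitted) are illustrative assumptions, not part of OGSA-DQP.

```python
import xml.etree.ElementTree as ET

# Abridged from Figure 2.9: tuple types, GDS handle and policy omitted.
PLAN = """
<Partition>
  <evaluatorURI>6</evaluatorURI>
  <Operator operatorID="0" operatorType="TABLE_SCAN">
    <TABLE_SCAN>
      <tableName> Classifications </tableName>
    </TABLE_SCAN>
  </Operator>
  <Operator operatorID="1" operatorType="APPLY">
    <APPLY>
      <inputOperator><OperatorID>0</OperatorID></inputOperator>
      <applyOperationType>PROJECT</applyOperationType>
    </APPLY>
  </Operator>
  <Operator operatorID="2" operatorType="EXCHANGE">
    <EXCHANGE>
      <inputOperator><OperatorID>1</OperatorID></inputOperator>
      <consumers>
        <operatorReference>
          <EvaluatorURI>7</EvaluatorURI><OperatorID>0</OperatorID>
        </operatorReference>
        <operatorReference>
          <EvaluatorURI>1</EvaluatorURI><OperatorID>0</OperatorID>
        </operatorReference>
      </consumers>
    </EXCHANGE>
  </Operator>
</Partition>
"""


def summarise(plan_xml):
    """Return (evaluator, operators) for one <Partition> element."""
    root = ET.fromstring(plan_xml.strip())
    evaluator = root.findtext("evaluatorURI").strip()
    operators = []
    for op in root.findall("Operator"):
        op_type = op.get("operatorType")
        # In the figure, the operator body is a child named after the type.
        body = op.find(op_type)
        operators.append({
            "id": op.get("operatorID"),
            "type": op_type,
            # None for leaf operators such as TABLE_SCAN.
            "input": body.findtext("inputOperator/OperatorID"),
            "consumers": [ref.findtext("EvaluatorURI").strip()
                          for ref in body.findall("consumers/operatorReference")],
        })
    return evaluator, operators


evaluator, operators = summarise(PLAN)
for op in operators:
    print(evaluator, op["id"], op["type"], op["input"], op["consumers"])
```

Following the `input` references from the exchange back to the scan recovers the pipeline described in the caption: scan, project, then exchange to evaluators 7 and 1.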

1. Grid Data Services (GDSs) are registered to a Grid Registry. WSs that may play the role of UDFs are registered too.

2. The client starts a session and chooses which registry to contact, and which Grid and Web services registered to this registry to use.

3. The client informs the static query coordinator about the selected services. The coordinator creates a global database schema by collating the local schemata (only naming conflicts are resolved, as data integration issues are out of the scope of this thesis). The WSs are interpreted as typed UDFs, based on their WSDLs. This step concludes the creation of a query session.

4. The client submits a query to the query coordinator. The latter is responsible for parsing the query statement, creating a query plan, and optimising and parallelising it.

5. The static coordinator dynamically creates as many GQESs as there are sites chosen for evaluation.

6. The results are returned to the client.
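Steps 3 and 5 above can be sketched in a few lines of Python. The text does not say how naming conflicts are resolved, so the policy used here (qualifying a clashing table name with its source name) is purely an assumption, as are the function names and the example source names.

```python
def collate_schemata(local_schemata):
    """Merge per-source table definitions into one global schema (step 3).

    local_schemata maps a source name to {table_name: [column, ...]}.
    Tables whose names clash across sources are qualified with the
    source name; this is an assumed policy, not taken from OGSA-DQP.
    """
    counts = {}
    for tables in local_schemata.values():
        for name in tables:
            counts[name] = counts.get(name, 0) + 1

    global_schema = {}
    for source, tables in local_schemata.items():
        for name, columns in tables.items():
            key = f"{source}.{name}" if counts[name] > 1 else name
            global_schema[key] = {"source": source, "columns": list(columns)}
    return global_schema


def evaluators_needed(partition_sites):
    """One evaluator per distinct site chosen for evaluation (step 5)."""
    return sorted(set(partition_sites))


# Hypothetical example: two sources that both expose a table named "term".
schema = collate_schemata({
    "gims": {"protein": ["id"], "term": ["id"]},
    "go": {"term": ["id", "name"]},
})
print(sorted(schema))
print(evaluators_needed(["6", "7", "1", "6"]))
```

Because the global schema is rebuilt from the selected services at the start of each session, nothing in this sketch is persisted between sessions.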

2.5.4 OGSA-DQP’s approach regarding the novel challenges

OGSA-DQP does not differ from Polar* in how it tackles the issues of resource heterogeneity and adaptivity. However, as data and resource metadata are retrieved afresh for each session in OGSA-DQP, they are more likely to be accurate than in Polar*.