Collection Manager: Integrating Diverse Data Sources on the Grid

Travis Walsh, Sousan Karimi, Kevin Gamiel, Jeremiah Morris, Lavanya Ramakrishnan

MCNC Research and Development Institute

3021 Cornwallis Road, P.O. Box 13910, Research Triangle Park, NC 27709-2889

Abstract: Today’s grid applications face numerous challenges in aggregating diverse data sources and data types spread across distributed domains. GridIR, a grid-based information retrieval system, faces similar challenges. A key component of the GridIR model is the Collection Manager (CM), which builds and publishes uniform data collections populated with data from heterogeneous sources both on and off the grid. CM allows different data sources to be grouped together as a collection and helps monitor them for changes. Collection Manager was designed as part of the Grid Information Retrieval System to manage dispersed heterogeneous data sources. However, CM can work with any other system that needs data access and management in a grid environment. CM is built on the Open Grid Services Architecture (OGSA) and thus takes advantage of the security and notification mechanisms available within the architecture. This paper describes the architecture of the Collection Manager and explains how extensions can be developed to make proprietary data sources accessible on the grid.

1. Introduction

Information retrieval (IR) is a field of research concerned with searching unstructured or semi-structured data, such as text documents, and gathering results pertinent to a user’s query. Modern web search engines are the most widely known implementations of IR systems; however, IR systems are also used for searching libraries, online store catalogs, and similar resources. Grid computing is a technology that enables the integration of distributed computing resources. The Open Grid Services Architecture (OGSA) [1] is a popular grid computing architecture that defines a common, consistent framework allowing dynamic creation, coordination, management, and security of computing resources, called “grid services.” Collections of grid services can be combined to form Virtual Organizations (VOs), a central concept of OGSA-based grid computing.

The GridIR [3] system consists of three distinct services [Figure 1]: Collection Manager Services, which control collections and harvest data; Indexing and Searching Services, which build indices from document collections; and Query Processing Services, which handle distributed searching and result merging. These services are autonomous and can be distributed on different resources. They can be created dynamically and, using OGSA/OGSI standard interfaces, any combination of these services, based on different algorithms, can create new IR systems. The services can also be linked together to create an interoperable network of IR services. The notification framework within OGSA plays a major role within the GridIR system. It is used to track changes in data sources and other state changes in the services, and to notify interested clients.

The ability to integrate distributed data sources and diverse data types into a unified framework serves as groundwork for grid applications. Different data formats and access methods make it difficult to aggregate these diverse sources while keeping access transparent to the end user. Take, for example, a bioinformatics database with a specialized query-response access method that becomes available to scientists. Its data must be made available to existing applications and users while minimizing the burden of adding a new data type, which often involves changes to the entire application. The Collection Manager has separate interfaces for collections and for data sources. Because the data source interface is separated from the collection interface, a new data source type plug-in can be written and installed, making the new data type available to existing users of the CM. The CM can also be queried to return the parameters required to access this new data type. Thus, with the addition of a single data source type plug-in, all of the functionality of the CM and other CM-aware applications becomes available to users of the new database. This ability to abstract the complexity of data management allows various data types and sources to be plugged in dynamically throughout the life of an application on the grid.

Figure 1: GridIR Architecture

2. Collection Manager Architecture

The Collection Manager acts as a bridge between various data sources and other components of the GridIR system. Within the CM, all data is accessed as one or more documents; while it is left to the plug-in to decide what constitutes a document, the CM considers all data sources to be providers of documents. Collection Manager has two main modules: the collection interface module, and the document source type manager module [Figure 3]. The first module, the collection interface, provides notification when data changes, provides metadata to other parts of the GridIR system, and handles interfaces between CM and its clients. The second module, the document source type manager, manages various document source types, and allows easy expansion of the system by handling new document source types and new collection protocols. This module handles the interface between CM and document sources.

Figure 3: Collection Manager Architecture

document's metadata. A subset of the document source API is described in Table 1. At startup, the document source type manager reads a configuration file that specifies where it should load the initial document source classes from. This gives the document source type manager the information it needs to interface with each document source.

All interactions between a client and a document source are mediated through the collection interface module. Thus to explain the usage of the document source we will need to trace the communication not only with the client but with the collection interface module as well.

As shown in [Figure 5], the first step for a client who wants to build a collection in a specific collection interface module is to query that module to find out which document sources are available. The collection interface module gets this information from the document source type manager. In response, the client receives a list of strings naming the document source types that the collection interface module can access through the document source type manager. For example, the list might be: “file”, “http”, “ftp”. These are simple descriptive labels that are used in subsequent calls to identify the document source. The next step for the client is to query the collection interface module for the list of value/attribute pairs accepted by a particular document source. The collection interface module again gets this information from the document source type manager. After making one such round trip for each document source type of interest, the client can build a collection object and send it to the collection interface module.

Figure 5: Interaction between Client and the Collection

The collection object must contain the relevant information necessary for creating and managing a collection. It should contain at least one document source object and may contain a Spider object, a Filter object, and a Schedule object.

2. A Spider object describes a rich set of rules that can be applied to many common hierarchical structures, and sets the criteria on the limits of following links and crawling the web.

3. A Filter object describes a rich set of filtering rules that can be passed to the data source. This helps ensure more efficient storage by preventing the storage of unneeded documents.

The Schedule object is vitally important to the collection interface module. Its functionality is described in more detail in Section 3.

Rather than have the collection interface module manage all of the document source instances itself, we built what is essentially a document source factory, called the document source instance manager. When a collection object is sent to the collection interface module, the module forwards the parts of the collection object that concern document sources to the document source instance manager. The document source instance manager handles ID creation and document source metadata for all of the document source instances that it creates.
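The factory role described above can be sketched as follows. This is a minimal illustration only: the class name `SourceInstanceManager`, its methods, and the use of sequential numeric ids are assumptions made for the example and are not taken from the CM source code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Hypothetical sketch of a document source factory; not the CM implementation. */
final class SourceInstanceManager<T> {
    // Maps a type label such as "file" or "http" to a constructor for that type.
    private final Map<String, Supplier<T>> types = new HashMap<>();
    // Every instance created so far, keyed by the id handed back to the caller.
    private final Map<Long, T> instances = new HashMap<>();
    private final AtomicLong nextId = new AtomicLong(1);

    void registerType(String label, Supplier<T> constructor) {
        types.put(label, constructor);
    }

    /** Creates a new instance of the labelled type and returns its id. */
    long create(String label) {
        Supplier<T> constructor = types.get(label);
        if (constructor == null) {
            throw new IllegalArgumentException("unknown document source type: " + label);
        }
        long id = nextId.getAndIncrement();
        instances.put(id, constructor.get());
        return id;
    }

    T get(long id) {
        return instances.get(id);
    }
}
```

Keeping id allocation inside the factory means the collection interface module only ever handles ids, never the instances themselves.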

Document source API and an example of its use

A new document source type can be added to the system by extending the document source base class. The document source base class includes all of the methods necessary to communicate with the collection interface module and the document source type module. An extension needs to provide the following:

1. A description of its available arguments.

2. A postInit method to read those arguments when sent and perform any necessary housekeeping.

3. A method for the scheduler, or some initializer, to check if any document has changed.

The getInfo method should be overridden. This method returns the DocumentSourceInfo object, which contains a description of the document source and the list of parameters that the document source will accept. A parameter object contains the name of the parameter, a description of the parameter, the data type expected, and the acceptable values for the parameter (minimum, maximum). A minimum of 0 indicates an optional parameter, and a maximum of 0 indicates an unbounded number.
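The minimum/maximum convention can be illustrated with a small sketch. The class `ParamSpec` and its members are hypothetical, and the sketch assumes that the two bounds limit how many values may be supplied for a parameter; the actual parameter class in the CM may interpret them differently.

```java
import java.util.List;

/** Hypothetical stand-in for the CM's parameter description; not the real class. */
final class ParamSpec {
    final String name;
    final String description;
    final int min; // 0 means the parameter is optional
    final int max; // 0 means there is no upper bound

    ParamSpec(String name, String description, int min, int max) {
        this.name = name;
        this.description = description;
        this.min = min;
        this.max = max;
    }

    boolean isOptional() {
        return min == 0;
    }

    boolean isUnbounded() {
        return max == 0;
    }

    /** Assumption: min and max bound the number of values supplied. */
    boolean accepts(List<String> values) {
        int n = values.size();
        return n >= min && (isUnbounded() || n <= max);
    }
}
```

Under this reading, a parameter such as TempSpace that must appear exactly once would be described with a minimum and maximum of 1.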

The postInit method reads in the parameters passed by the collection interface module from the addCollection call. The document source module should do any required housekeeping, such as creating directories or sockets. It is a good idea at this point to do a first run at collecting the data; however, this has been left to the discretion of the document source module implementer.

The most important part of the document source module is contained in the checkForDataChanged method. This method fetches the data and stores it locally. It is assumed that the document source module can detect when a new document has been fetched, a document has changed, or a document has been deleted. For any of these events (add, modify, or delete), the document source module calls the documentEvent method and adds the new event to the eventQueue. This notifies the collection interface module of the change, and the collection interface module in turn notifies the client.
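The change-detection step can be sketched as below. The method names postInit, checkForDataChanged, and queueEvent come from this paper; everything else (the reduced base class `SketchDocumentSource`, the `Event` and `Kind` types, the in-memory `remote` map, and the use of `hashCode` as a fingerprint) is a hypothetical simplification rather than the CM's own code.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical, heavily reduced stand-in for the document source base class. */
abstract class SketchDocumentSource {
    enum Kind { ADDED, MODIFIED, DELETED }

    static final class Event {
        final Kind kind;
        final String docName;
        Event(Kind kind, String docName) { this.kind = kind; this.docName = docName; }
    }

    private final Queue<Event> eventQueue = new ArrayDeque<>();

    protected void queueEvent(Event e) { eventQueue.add(e); }

    /** Empties the queue; a real module would notify the collection interface here. */
    Queue<Event> drainEvents() {
        Queue<Event> out = new ArrayDeque<>(eventQueue);
        eventQueue.clear();
        return out;
    }

    protected abstract void postInit();
    abstract void checkForDataChanged();
}

/** Example extension that compares content fingerprints between runs. */
class InMemorySource extends SketchDocumentSource {
    /** Stands in for the remote data; callers mutate it between checks. */
    final Map<String, String> remote = new HashMap<>();
    /** Fingerprints from the previous run (hashCode is a simplification). */
    private final Map<String, Integer> lastSeen = new HashMap<>();

    @Override protected void postInit() { lastSeen.clear(); }

    @Override void checkForDataChanged() {
        for (Map.Entry<String, String> e : remote.entrySet()) {
            Integer old = lastSeen.get(e.getKey());
            int now = e.getValue().hashCode();
            if (old == null) {
                queueEvent(new Event(Kind.ADDED, e.getKey()));
            } else if (old != now) {
                queueEvent(new Event(Kind.MODIFIED, e.getKey()));
            }
        }
        for (String name : lastSeen.keySet()) {
            if (!remote.containsKey(name)) {
                queueEvent(new Event(Kind.DELETED, name));
            }
        }
        // Remember the current state for the next run.
        lastSeen.clear();
        for (Map.Entry<String, String> e : remote.entrySet()) {
            lastSeen.put(e.getKey(), e.getValue().hashCode());
        }
    }
}
```

A production document source would likely use modification times or a stronger digest instead of `hashCode`, since hash collisions could hide a change.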

Table 1: Document Source Interface

Methods                                                      Comment
protected void addDocument (Document d)                      Registers a new document.
protected void removeDocument (Document d)                   Removes document object.
void addListener (DocumentListener listener, int eventMask)  Registers a listener that receives the document change event.
Document getDocument (long id)                               Provides an individual document by id.
DocumentSourceInfo getInfo ()                                Provides the value, attribute pairs.
protected Object [] getParam (String name)                   Provides parameter values by name.
Iterator documentIterator ()                                 Returns a list of all the documents in this document source.
protected void postInit ()                                   Initializes this document source instance using passed value, attribute pairs.
protected void queueEvent (DocumentEvent event)              Adds document event for notification.
protected void sendEvents ()                                 Sends all queued events.
long size ()                                                 Returns the number of documents in this document source.

3. Scheduling

Many current data sources do not have notification frameworks built into them. The collection interface module, along with the schedule object, provides notification capabilities to these asynchronous data sources. A schedule can be defined as a one-time or recurring event, for example every 2 hours, every 2nd day of the month, or every Wednesday. The scheduler checks the documents at the specified time to determine whether any changes have occurred since the last check. A collection may have no defined schedule, either because the document source only needs to be run once, or because it has some internal method that handles this checking itself.

A schedule object uses standard data types to define dates, times, and intervals. If a user wishes to add a new document source type, knowledge of the scheduling format isn't required. The scheduling module unwraps the scheduling object and handles threading and calls back to the document source. This allows the end user to schedule asynchronous checking of document sources in complex ways without any knowledge of the underlying system. Currently the scheduler can arrange recurring checks by regular interval, time of day, day of week, day of month, day of year, and combinations thereof.
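A recurring interval check of this kind could be sketched with the standard Java concurrency utilities. The class `IntervalScheduler` is hypothetical and is not the CM's scheduling module; the sketch also assumes intervals are written as ISO 8601 durations, the form that `PT900S` takes in the example collection later in this paper. Calendar-based rules such as day of week are not covered.

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of an interval scheduler; not the CM's scheduling module. */
final class IntervalScheduler implements AutoCloseable {
    private final ScheduledExecutorService pool =
            Executors.newSingleThreadScheduledExecutor();

    /** Parses an ISO 8601 duration such as "PT900S" (900 seconds). */
    static Duration parseInterval(String iso) {
        Duration d = Duration.parse(iso);
        if (d.isZero() || d.isNegative()) {
            throw new IllegalArgumentException("interval must be positive: " + iso);
        }
        return d;
    }

    /** Runs the check once immediately, then repeatedly at the given interval. */
    ScheduledFuture<?> scheduleRecurring(Runnable check, Duration interval) {
        return pool.scheduleAtFixedRate(
                check, 0, interval.toMillis(), TimeUnit.MILLISECONDS);
    }

    /** Stops the worker thread and cancels any pending checks. */
    @Override
    public void close() {
        pool.shutdownNow();
    }
}
```

In this arrangement the `Runnable` would wrap a call back into the document source, for example its checkForDataChanged method, so the document source author never deals with threads directly.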

4. Extensibility

To add extensibility, Collection Manager provides base classes and APIs for adding new document source types. This aspect of the Collection Manager makes it a usable service not just to GridIR but also to other applications that need access to data from heterogeneous sources.

5. Collection Life Cycle

Once a collection has been built, the collection interface module creates an instance for each document source contained in the collection object. The collection interface module then starts the scheduling threads (if applicable), and document collection begins.

this information is obtained through the collection interface module. Upon receiving the notification about changes to a collection the client may request to get the data associated with a collection. The collection interface module creates a compressed snapshot of the collection that can be delivered to the client as a whole or in parts. When a collection is no longer needed all associated data source instances, data files, data snapshots, and the collection metadata are removed.

Example client communication

In this section we explain through an example how a client can communicate with the Collection Manager assuming that the CM has the local file data source and web data source types installed. Specific information required for client communication with the CM can be found in the CM GWSDL [4].

First, a client needs to find out which document source types are available to it. The client calls getAvailableTypes and receives a StringList similar to the following:

<StringList> FileSource HTTP </StringList>

Next the client calls getTypeInfo with the string name of the document source type that it is interested in. In turn it receives a list of value attribute pairs that the document source module will accept. For example for a document source of type “FileSource” it receives the following:

<AttributeList>
  <Attribute>
    <name>TempSpace</name>
    <value>Accepts a string in the form of the full path to the temporary space you wish to store file snapshots</value>
  </Attribute>
</AttributeList>

In this case we see that the only attribute a FileSource DS accepts is TempSpace. All of the other information needed by the DS can be obtained by using the elements of a DocumentSource object.

<DocCollection name="My Machine Monitor">
  <DocumentSource documentSourceType="FileSource">
    <uref name="securityLog">file:///var/security/secure.log</uref>
    <uref name="mailSpool">file:///var/mail/spool</uref>
    <attributeList>
      <Attribute>
        <name>TempSpace</name>
        <value>/tmp/cm</value>
      </Attribute>
    </attributeList>
    <spider noparent="1"/>
    <schedule>
      <interval>PT900S</interval>
    </schedule>
  </DocumentSource>
</DocCollection>

The client then submits this object to the CM. In response the CM sends the unique numeric id of the collection. The CM would then crawl these two locations on the local machine every 15 minutes. If the user subscribed to be notified on a change to this collection using the id of the collection she would be notified when data is changed.

A client may wish to receive the data from the collection upon receiving the notification of the change. In order to retrieve the data from a collection a client calls getDataSegment with the collection id, the offset into the data she wants to retrieve (starting at 0), and the preferred message size (in bytes) for receiving data. The response object received by the client is similar to the following:

<GetResponse>
  <compression>gzip</compression>
  <finalSegment>false</finalSegment>
  <data> (this is a byte array in base64binary) </data>
</GetResponse>

The client continues to call getDataSegment, advancing the offset parameter each time by the length of the previously received data, until finalSegment in the GetResponse is set to true.
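The retrieval loop can be sketched as follows. The `Segment` class and `SegmentSource` interface are hypothetical stand-ins for the real service stubs, whose exact signatures are defined by the CM GWSDL [4]; the sketch assumes the data field has already been base64-decoded into bytes.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical client-side view of one getDataSegment response. */
final class Segment {
    final byte[] data;          // assumed to be base64-decoded already
    final boolean finalSegment; // true when no more data follows

    Segment(byte[] data, boolean finalSegment) {
        this.data = data;
        this.finalSegment = finalSegment;
    }
}

/** Hypothetical stand-in for the remote call; the real stub comes from the GWSDL. */
interface SegmentSource {
    Segment getDataSegment(long collectionId, long offset, int preferredSize);
}

final class SnapshotFetcher {
    /** Requests segments until finalSegment is true; returns the compressed bytes. */
    static byte[] fetchAll(SegmentSource cm, long collectionId, int preferredSize) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long offset = 0; // offsets start at 0
        while (true) {
            Segment s = cm.getDataSegment(collectionId, offset, preferredSize);
            out.write(s.data, 0, s.data.length);
            // Advance by what was actually received, which may differ
            // from the preferred size.
            offset += s.data.length;
            if (s.finalSegment) {
                return out.toByteArray();
            }
            if (s.data.length == 0) {
                // Guard against looping forever on a misbehaving server.
                throw new IllegalStateException("empty non-final segment");
            }
        }
    }
}
```

The bytes returned are still compressed; with gzip as the reported compression, a client could decompress them using `java.util.zip.GZIPInputStream`.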

6. Related Work

The GGF Database Access and Integration Services working group (DAIS-WG) addresses similar data management issues, as outlined in [5]. The model proposed by DAIS differs in that it seeks to expose the complex diversity of underlying data representations through a grid interface, whereas the GridIR CM defines a uniform data collection model, in which collection instances are composed of heterogeneous data types, and exposes lightweight interfaces to the collections.

7. Future Work

Currently we use a custom-built grid service interface to transfer data to clients, sending the data across the grid in SOAP messages. A custom interface was required because, while SOAP has no inherent size limitation, implementations of SOAP have problems encoding and decoding large messages. We solved this problem by splitting the data across multiple messages; however, these messages travel over HTTP, which is inefficient for large bulk transfers. We are investigating the integration of the GridFTP protocol to handle data transfer across nodes. GridFTP handles large data transfers using standard security mechanisms and takes advantage of the grid architecture to increase performance.

To ensure scalability and reliability of the Collection Manager, we also plan to use techniques such as replication and migration of services across the grid topology. Self-dividing Collection Manager Services will enable division of the workload of a collection by creating new collection managers on the fly. Currently, upon notification of a change in the collection, peers must request the entire collection as opposed to only those portions actually changed. We would like to investigate the ability of Collection Managers to send document changes as differentials rather than the entire document over the wire.

8. Conclusions

Collection Manager is an integral part of the Grid Information Retrieval system. It facilitates the integration of distributed data sources and diverse data types into a unified framework. CM minimizes the burden of adding new data types by separating the data source plug-in from the collection interface. CM provides information about each new data type so that the type is accessible to any application that uses CM. Collection Manager can be integrated into any system that needs to access distributed and heterogeneous data sources. The notification services and the inherent security of the Collection Manager make it more appealing for accessing data sources that have dynamic data or require different levels of security.

The Collection Manager helps provide a uniform virtualized interface to diverse data sources and types. While the Collection Manager API pertinent to GridIR (i.e., the Collection Interface) is being standardized in the GIR Working Group, we believe there are elements of it that may be relevant in the larger context of the future of data environments. The document source type manager that interfaces with the specific document types could be investigated in conjunction with other data areas for standardization. We also believe that profiles for integrating known data types with this standard will need to be developed. This will help organizations transition existing and new data types to grid environments in a standardized, interoperable manner.

9. Acknowledgements

This material is based upon work supported by NASA under award No(s) NAG 2-1467. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Aeronautics and Space Administration.

10. References

[3] Dovey, Matthew J.; Gamiel, K. “GridIR - Grid Information Retrieval.” Poster at EuroWeb 2002.

[4] Collection Manager GWSDL. http://www.gridir.org/cm.gwsdl