
File object method specification

In order to change the workflow as proposed in section 3.2, the infrastructure, currently a file system, has to be able to manage the user’s analysis scripts.

Most operating systems are capable of associating an application with a specific data file type, e.g. files with the extension txt are often associated with a text editor. This enables the user to open the file by ‘double-clicking’ an icon and relying on the operating system to open the appropriate application which is capable of reading the file.

This mechanism is achieved either by associating the file extension, e.g. txt, with an application or by associating the Multipurpose Internet Mail Extensions (MIME) type (Freed and Borenstein, 1996) with an application. Alternatively, it is possible to inspect the content of the file to determine its type using a file signature (Sammes and Jenkinson, 2007).

Figure 3.2: Shows how a file extension or MIME type can be used to load an application capable of understanding the data format. In this example a text file is opened using the user’s ‘favourite’ text editor. This has the capability to change the data format and count the words in the file, amongst other operations. [Diagram labels: text file (user data), identified by MIME type or file extension; text editor (associated application); Save as, Count words, ... (data operations in the application library).]
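The three identification mechanisms mentioned above (file extension, MIME type and file signature) can be illustrated with a minimal Python sketch. It is not part of the FOM; the function names and the small signature table are assumptions chosen for illustration, and only the standard library modules mimetypes and pathlib are used.

```python
import mimetypes
from pathlib import Path

# A few well-known file signatures (magic numbers); illustrative, not exhaustive.
SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\x1f\x8b": "application/gzip",
}

def type_from_extension(path):
    """Return the lower-case extension without the dot, e.g. 'txt'."""
    return Path(path).suffix.lower().lstrip(".")

def type_from_mime(path):
    """Guess the MIME type from the file name using the standard library table."""
    mime, _encoding = mimetypes.guess_type(str(path))
    return mime

def type_from_signature(data):
    """Match the leading bytes of the content against the known signatures."""
    for magic, mime in SIGNATURES.items():
        if data.startswith(magic):
            return mime
    return None
```

The first two functions look only at the file name, whereas the third inspects the content, so it still works when a file has been renamed or has no extension.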

When a user loads an application associated with a data file, the application can be thought of as a library of functions or methods which are capable of operating on that file. In the example shown in figure 3.2 the txt file is opened using an editor which has features to count the words or change the data encoding, e.g. from ASCII to UTF-8 (Yergeau, 2003).

We propose an infrastructure to extend this common functionality to support custom ‘applications’, in the form of user defined code. Instead of a data file having an associated application, we propose that the data file has associated functions and methods which originate from user code. For example, a user with a script capable of returning the word count of a file can associate this code with files of a specific type, e.g. any file with the txt extension. Any user can then query any txt file to see the associated methods, and then execute one of these methods, for example to return the word count of a file by executing a user supplied script.
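As a minimal sketch of this idea, the following Python fragment associates a user supplied word count function with every txt file and then queries and executes it. The registry, the associate decorator and the helper names are hypothetical and are introduced only for illustration; they are not the interface of the proposed infrastructure.

```python
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical registry: file extension -> {method name: user function}.
_methods_by_extension = defaultdict(dict)

def associate(extension):
    """Decorator that attaches a user function to every file with this extension."""
    def register(func):
        _methods_by_extension[extension][func.__name__] = func
        return func
    return register

def list_methods(path):
    """Query a file for the names of the methods associated with its type."""
    return sorted(_methods_by_extension[Path(path).suffix.lstrip(".")])

def execute(path, method_name):
    """Run one of the associated methods on the file."""
    extension = Path(path).suffix.lstrip(".")
    return _methods_by_extension[extension][method_name](path)

@associate("txt")
def wordCount(path):
    """User supplied script: count whitespace-separated words."""
    return len(Path(path).read_text(encoding="utf-8").split())

# Usage: any txt file now exposes the method without further intervention.
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "example.txt"
    sample.write_text("treat files like programming objects", encoding="utf-8")
    available = list_methods(sample)
    count = execute(sample, "wordCount")
```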

We propose treating files like programming objects in an infrastructure called the File Object Method (FOM). The FOM associates code with files, providing users with the ability to execute these routines as methods on file objects. The aim of the FOM is to extend the usefulness of flatfiles using object orientated programming techniques. This can then be used by hybrid database systems, such as those described in section 2.7, as well as by users who manage data using flatfiles, e.g. scientists.

Extending files to appear as objects ensures users execute the correct methods on the correct file types, thus removing the responsibility for users to ensure they have the correct file format. Users can take advantage of methods supplied by other users without having to request access to the code, as the FOM can list and execute the available methods. As the methods are executed using the FOM, users do not require extensive Application Programming Interface (API) documentation and are not required to set up or understand third-party code. This results in a greater opportunity for users to reuse existing code.

There are two ways in which a user can associate methods with files in the FOM. The first is to simply associate code with a single file, i.e. its path and filename. The second is to associate a type with each method or set of methods, and then associate this with file types. This means that users can deposit a file into the FOM and, without any intervention, be able to list and execute methods that are associated with that file type, using the data they have just deposited. In the text file example, when a user adds a new text file to the file system they will be able to see that there is a method to count the words in that file.
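The two association routes can be sketched as a pair of lookups. In the fragment below the table layout, the type name Text, the file path and the summarise method are all hypothetical and chosen only to make the example concrete; only wordCount is taken from the text.

```python
from pathlib import PurePosixPath

# Route 1: methods attached to one specific file (path and filename).
# The path and the 'summarise' method are hypothetical.
methods_by_path = {
    "/home/user/notes/report.txt": {"summarise"},
}

# Route 2: methods grouped under a type, and types associated with file types.
methods_by_type = {"Text": {"wordCount"}}
types_by_extension = {"txt": {"Text"}}

def resolve_methods(path):
    """Return every method name visible on `path` through either route."""
    names = set(methods_by_path.get(path, set()))
    extension = PurePosixPath(path).suffix.lstrip(".")
    for type_name in types_by_extension.get(extension, set()):
        names |= methods_by_type.get(type_name, set())
    return sorted(names)
```

Under these assumptions a newly deposited file such as /home/user/new.txt resolves to wordCount through the type route alone, whereas the single registered path additionally sees the method attached directly to it.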

After consultation with a number of computational scientists, the three most desirable features to make the FOM concept as simple as possible to adopt were identified and are outlined below.

• Format preservation

Often large volumes of data are compressed using complex compression techniques, which makes flatfiles an attractive option. This contrasts with databases, which store all data in the same format, i.e. the native database format. The FOM capitalises on proprietary data formats by leaving the original files untouched. Even if the files are not compressed, there is an advantage in storing the files in the user’s format. For example, if a corruption in the system made the FOM unusable, or the user decided not to use the FOM, the original files would still be available in the same directory structure and in the same format as the user’s original data. This is similar to the Concurrent Versions System (CVS) approach where the user’s data structure remains unchanged. Additionally, users store data in different data formats as they have applications to process the data. Changing the data format can complicate the analysis of data as i) users have to be able to understand the new format, ii) the new format has to provide an interface for users to access the data, iii) defining a data format standard is complex and iv) it is slower to react to application specification changes as many layers require modification. Often existing data formats take advantage of patterns in the information rather than the data, thus making them more efficient at data compression and random data access, e.g. NAMD (Phillips et al., 2005).

• Code reuse

The aim is to avoid forcing users to rewrite existing code or create wrappers to get existing code to interface with the FOM. This is achieved by permitting users to add their existing code as an object type in the FOM. This relies on the FOM being able to interpret the methods and their parameters so it can display and execute them appropriately. This issue is discussed for each of the different implementations.

Reusing code has two advantages: i) users can be assured of reliable code and ii) users can utilise code of which they were previously unaware.

Figure 3.3: Shows the proposed File Object Method (FOM). The user operations provide an interface to the FOM. The file system view shows the user’s data in its existing format and the programmatic view shows how the data structure is accessible from within a programming environment. [Diagram labels: user operations (Deposit data, Submit code, Execute code, Retrieve data); file system view (/home/user, Simulation1, simulationData1.bin, simulationInfo.txt, example.txt); programmatic view (the same structure with the methods wordCount() and getListOfFiles()).]

• Non-intrusiveness

To make the FOM as useful as possible it is necessary to bring the database world to the user and not force the user to learn any new skills. The aim is to enable users to continue using their existing code with their existing data files. Although the user’s workflow may change, the FOM only extends it with optional steps, leaving the user’s existing workflow unchanged.
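The code reuse feature relies on the FOM being able to interpret the methods and parameters of existing code without wrappers. One way this could be approached in Python is through introspection with the standard inspect module. The sketch below is an assumption about how such interpretation might look, not a description of any of the implementations; the describe_methods helper and the saveAs stub are hypothetical.

```python
import inspect

def describe_methods(namespace):
    """Map each public function in `namespace` to its list of parameter names."""
    described = {}
    for name, obj in namespace.items():
        if inspect.isfunction(obj) and not name.startswith("_"):
            described[name] = list(inspect.signature(obj).parameters)
    return described

# Stand-in for an existing user script, unchanged and unaware of the FOM.
USER_SCRIPT = '''
def wordCount(path):
    with open(path, encoding="utf-8") as handle:
        return len(handle.read().split())

def saveAs(path, encoding="utf-8"):
    raise NotImplementedError("illustrative stub")
'''

namespace = {}
# A real system would need to vet or sandbox submitted code before running it.
exec(USER_SCRIPT, namespace)
methods = describe_methods(namespace)
```

Here methods holds the function names together with their parameter names, which is the information needed to display the methods and to call them with the file path as the first argument.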

The three key areas of the FOM are i) the user operations, ii) the file system view and iii) the programmatic view, of which the file system view remains unchanged.

3.3.1 User operations

Using the FOM, users have four operations available to them as shown in figure 3.3. These operations enable the user to interact with the FOM and are described below.

• Deposit data

Depositing data is as easy as copying it to a local file system; this should not be any different to the user’s current workflow. The files are associated with methods based on the file extension, but users have the capability to associate other methods with a file.

Example 3.1: Shows a programmatic Python example of the FOM model. First the user opens the example text file; this creates a data object which supports the FOM methods. The user can then list the FOM methods that are available and execute the appropriate method to return the data required.

     1  # Load the FOM infrastructure
     2  import FOM
     3
     4  # Open a text file
     5  f = open("Example.txt")
     6
     7  # List all the methods available on that text file.
     8  f.getMethods()
     9  # This output shows that there are three methods available.
    10  ['wordCount', 'saveAsUTF-8', 'saveAsUTF-16']
    11
    12  # Execute the method to count the words in a text file.
    13  f.wordCount()
    14  # This output shows that there are 15 words in the text file.
    15  15

• Submit code

If users want to extend a file object with new methods, they have the ability to submit code that will appear as a method for a specific file type. This operation takes place independently of the data deposition.

• Execute code

Users have the ability to execute methods on files. How the code is executed depends on the implementation, but users can initiate the execution from within a programming language or command shell. Either files or data objects are returned to the user.

• Retrieve data

There are various ways that users can get access to their data. One is to access it through the file system without using the FOM. This direct access ensures that any existing user requirements are met. Another is to access it from within a programming language; this is where users will see the FOM methods which are available for each file. These methods can then be used to retrieve a data object or activate code to produce a data file with the required information.
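The four user operations can be drawn together in a small Python sketch. The FomSketch class and its method names are hypothetical stand-ins introduced for illustration; the sketch assumes a local directory as the file system and plain Python functions as the submitted code, and it is not one of the implementations discussed in this work.

```python
import shutil
import tempfile
from pathlib import Path

class FomSketch:
    """Hypothetical stand-in for the FOM, covering the four user operations."""

    def __init__(self, root):
        self.root = Path(root)
        self.methods = {}  # extension -> {method name: function}

    def deposit(self, source):
        """Deposit data: a plain copy, so the file keeps its original format."""
        return Path(shutil.copy(source, self.root))

    def submit_code(self, extension, func):
        """Submit code: independent of any data deposition."""
        self.methods.setdefault(extension, {})[func.__name__] = func

    def execute(self, name, method):
        """Execute code: run a method associated with the file's extension."""
        path = self.root / name
        return self.methods[path.suffix.lstrip(".")][method](path)

    def retrieve(self, name):
        """Retrieve data: direct file system access, bypassing the methods."""
        return (self.root / name).read_bytes()

def wordCount(path):
    """User supplied code: count whitespace-separated words."""
    return len(path.read_text(encoding="utf-8").split())

with tempfile.TemporaryDirectory() as work, tempfile.TemporaryDirectory() as root:
    original = Path(work) / "example.txt"
    original.write_text("one two three", encoding="utf-8")
    fom = FomSketch(root)
    fom.submit_code("txt", wordCount)   # code submitted before any data exists
    fom.deposit(original)
    result = fom.execute("example.txt", "wordCount")
    unchanged = fom.retrieve("example.txt") == original.read_bytes()
```

In this sketch the code is submitted before the data is deposited, reflecting that the two operations are independent, and the retrieved bytes are identical to the deposited file.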

3.3.2 Programmatic view

Figure 3.3 shows the programmatic view of the example files shown in the file system view. The structure remains the same except the files are treated as software objects. The word count example described in the introduction is displayed as a method on the Example.txt file.

Software objects are created from the files in the file system and appear within the programming environment as objects with methods. When a user opens a file, the FOM methods are added to the file object. Example 3.1 illustrates this, showing a Python script opening a file (line 5), listing the methods available on the file (lines 7–10) and finally executing a FOM method on the file (lines 12–15).

There are two types of method: those created by the code submitted by the user and those provided by the FOM framework. For example, the wordCount method is from user defined code and the getMethods method is provided by the FOM.
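The distinction between the two types of method can be sketched with a small Python class. The FileObject class and the USER_METHODS store are hypothetical; only the method names getMethods and wordCount are taken from Example 3.1. The sketch assumes that user methods receive the file path as their first argument.

```python
import tempfile
from pathlib import Path

# Hypothetical store of submitted user code, keyed by file extension.
USER_METHODS = {"txt": {}}

class FileObject:
    """Hypothetical file object; getMethods plays the framework-provided role."""

    def __init__(self, path):
        self.path = Path(path)
        self._user = USER_METHODS.get(self.path.suffix.lstrip("."), {})

    def getMethods(self):
        """Framework method: list the user methods available on this file."""
        return sorted(self._user)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, i.e. for user methods.
        user = self.__dict__.get("_user", {})
        if name in user:
            return lambda *args, **kwargs: user[name](self.path, *args, **kwargs)
        raise AttributeError(name)

def wordCount(path):
    """User defined code: count whitespace-separated words."""
    return len(path.read_text(encoding="utf-8").split())

USER_METHODS["txt"]["wordCount"] = wordCount

with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "Example.txt"
    sample.write_text("a short example file", encoding="utf-8")
    f = FileObject(sample)
    listed = f.getMethods()
    words = f.wordCount()
```

Because getMethods is defined on the class while wordCount is looked up dynamically, submitting new user code changes what a file object offers without any change to the framework class.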

The programmatic view is intended as the main interface into the FOM as it enables users to locate data and execute methods to process the data contained within a file.