6.2 Implementation details
6.2.1 Triplification engine
The triplification engine is the initial step of the data flow within the EXCLAIM framework, and is the main element through which managed applications and services interact with the framework. Its main responsibility is to ‘extract’ raw data from the managed elements and transform it into semantic RDF triples using the CSO vocabulary.
Following the non-intrusive principle of data collection, this component relies on existing APIs provided by the services to gather the required metrics. For example, to obtain the most recent view of the current utilisation of an RDBMS service (e.g., occupied space, number of client connections, current backup processes, etc.), it is necessary to run an SQL query on an auxiliary statistics table¹ intended for system administration purposes, which contains run-time meta-data about the database. The Postgres default rate for updating the statistical information is 500 ms, and it can also be configured to provide a more up-to-date view of the current state of the database. Accordingly, the sampling rate of the EXCLAIM framework for data extraction should be configured to match, so that it operates in an optimised manner: fast enough to provide up-to-date data while avoiding redundant (and expensive) calls to the database.
The following SQL snippet, for example, fetches the current values for the occupied space (i.e., SIZE) and the number of established client connections (i.e., COUNT) to the database destinator_db belonging to the user destinator_user.
Listing 6.1: SQL query snippet.
SELECT pg_database_size('destinator_db') AS SIZE, COUNT(*) AS COUNT
FROM pg_stat_activity a
WHERE a.usename = 'destinator_user'
Other services, such as the IronWorker service, expose their run-time meta-data via a RESTful management API, accessible through client libraries. For example, to get the jobs currently waiting to be processed in the IronWorker queue as an array of JSON objects, it is necessary to invoke the corresponding API function (see Listings 6.2 and 6.3).
Listing 6.2: IronWorker API function to get the number of tasks in the queue.
public List<TaskEntity> getTasks(Map<String, Object> options) throws APIException {
    JsonObject tasks = api.tasksList(options);
    List<TaskEntity> tasksList = new ArrayList<TaskEntity>();
    for (JsonElement task : tasks.get("tasks").getAsJsonArray()) {
        tasksList.add(gson.fromJson(task, TaskEntity.class));
    }
    return tasksList;
}
Listing 6.3: Sample client code to fetch the number of tasks to be executed by the IronWorker service.
Map<String, Object> options = new HashMap<String, Object>();
options.put("running", 1);
List<TaskEntity> tasks = IronWorkerManager.getClient().getTasks(options);
Similar standard mechanisms for extracting data metrics exist for all the other add-on services provided by the cloud platform marketplaces (including Heroku).

¹ In the Postgres Database Management System (DBMS), this table is called pg_stat_activity. Similar tables exist in the majority of other relational databases.
An important benefit of the data extraction approach used by the EXCLAIM framework is that there is no intrusion into the source code of the monitored services and applications. The framework only requires user credentials to access individual service instances, which is an acceptable requirement given that the EXCLAIM framework is assumed to be part of the cloud platform and to act as a trusted entity for the consumers.
Once the raw data is extracted, it needs to be uniformly represented using the CSO ‘building blocks’. At the moment, this is mainly a manual process, which involves mapping between the source raw data and the target semantically-annotated triples.¹
Note that a single raw value is represented with multiple RDF triples, which together form an RDF graph. Depending on the requirements, the additional RDF triples provide less ambiguous, more context-aware information to the EXCLAIM framework. For example, Listing 6.4 below demonstrates how the number of current client connections to the PostgreSQL service (i.e., the service postgres-service-06 has 12 client connections) is translated into the RDF representation to be further processed by the EXCLAIM framework.
Listing 6.4: A single raw value is represented using four RDF triples.
cso:postgres-service-06 rdf:type cso:HerokuPostgresService
cso:postgres-service-06 cso:hasNumberOfConnections cso:number-of-connections-122
cso:number-of-connections-122 rdf:type cso:NumberOfConnections
cso:number-of-connections-122 cso:hasValue "12"^^xsd:int
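The mapping shown in Listing 6.4 could be sketched in Java roughly as follows. This is only an illustration: the class ConnectionCountTriplifier, its method and the plain-string representation of triples are hypothetical and are not taken from the EXCLAIM code base; only the CSO terms come from Listing 6.4.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of EXCLAIM): maps one raw value to the four triples of Listing 6.4. */
public class ConnectionCountTriplifier {

    /**
     * @param serviceId     local name of the service instance, e.g. "postgres-service-06"
     * @param observationId identifier of the observation node, e.g. 122 (assumed to be chosen by the caller)
     * @param connections   raw value returned by the SQL query
     */
    public static List<String> triplify(String serviceId, int observationId, int connections) {
        String service = "cso:" + serviceId;
        String observation = "cso:number-of-connections-" + observationId;

        List<String> triples = new ArrayList<String>();
        // Context triples: state which service is observed and what kind of value it is.
        triples.add(service + " rdf:type cso:HerokuPostgresService");
        triples.add(service + " cso:hasNumberOfConnections " + observation);
        triples.add(observation + " rdf:type cso:NumberOfConnections");
        // Value triple: the only one that carries the measured number.
        triples.add(observation + " cso:hasValue \"" + connections + "\"^^xsd:int");
        return triples;
    }

    public static void main(String[] args) {
        for (String triple : triplify("postgres-service-06", 122, 12)) {
            System.out.println(triple);
        }
    }
}
```

Keeping the three context triples separate from the value triple makes it easy to see which part of the graph changes between samples.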
The newly generated RDF triples are then serialised into string objects and sent to the messaging queue, from where they are picked up by the EXCLAIM framework, de-serialised into RDF triples and processed by the C-SPARQL query engine. In the current version of the EXCLAIM framework, Heroku’s RabbitMQ messaging service is used to transfer information between the data connectors and the core EXCLAIM framework. Using RabbitMQ’s client API, the process of configuring a message queue (see Listing 6.5) and sending a message (see Listing 6.6) can be accomplished in several lines of code.
Listing 6.5: Declaring and initialising a RabbitMQ service queue.
String uri = "amqp://guest:guest@localhost";
ConnectionFactory factory = new ConnectionFactory();
factory.setUri(uri); // host and credentials are parsed from the AMQP URI
Connection connection = factory.newConnection();
Channel rabbitMqTaskChannel = connection.createChannel();
// arguments: queue name, durable, exclusive, autoDelete, extra arguments
rabbitMqTaskChannel.queueDeclare("exclaim_queue", false, false, false, null);

¹ Tools already exist for converting data stored in relational databases into an RDF representation using special mapping languages (e.g., the RDB to RDF Mapping Language (R2RML), http://www.w3.org/TR/r2rml/), which may simplify the triplification process as far as relational data storage services are concerned. Analogous techniques can be envisaged for the RDF triplification of JSON, XML and other formats.
Listing 6.6: Sending a string message to the RabbitMQ queue.
String message = "cso:number-of-connections-122 cso:hasValue \"12\"^^xsd:int";
// the default exchange ("") routes the message to the queue named by the routing key
rabbitMqTaskChannel.basicPublish("", "exclaim_queue", null, message.getBytes());
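The serialisation and de-serialisation steps on either side of the queue can be illustrated with a small round trip. The sketch below is an assumption-laden illustration rather than the framework's actual code: the class TripleRoundTrip is hypothetical, an in-memory queue stands in for the RabbitMQ queue, and a triple is assumed to be serialised as three space-separated terms.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch: an in-memory queue stands in for the RabbitMQ queue. */
public class TripleRoundTrip {

    /** Serialises a triple into the byte payload of a message (assumed format: "s p o"). */
    public static byte[] serialise(String subject, String predicate, String object) {
        return (subject + " " + predicate + " " + object).getBytes(StandardCharsets.UTF_8);
    }

    /** De-serialises a payload into subject, predicate and object; the object may contain spaces. */
    public static String[] deserialise(byte[] payload) {
        String text = new String(payload, StandardCharsets.UTF_8);
        return text.split(" ", 3); // at most three parts, the remainder stays in the object
    }

    /** Sends one triple through the stand-in queue and returns it as seen by the consumer. */
    public static String[] roundTrip(String subject, String predicate, String object) {
        BlockingQueue<byte[]> queue = new LinkedBlockingQueue<byte[]>();
        queue.add(serialise(subject, predicate, object)); // connector side
        return deserialise(queue.remove());               // framework side
    }

    public static void main(String[] args) {
        String[] triple = roundTrip("cso:number-of-connections-122", "cso:hasValue", "\"12\"^^xsd:int");
        System.out.println(triple[0] + " | " + triple[1] + " | " + triple[2]);
    }
}
```

Fixing the character encoding explicitly on both sides avoids depending on the platform default when the connector and the framework run on different hosts.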
Please note that the individual ‘connectors’ responsible for raw data extraction and RDF transformation can be configured to function at various sampling rates. Finding the right sampling rate may depend on various criteria, such as the update rate of the source raw data (e.g., table pg_stat_activity), the amount of computational resources available for frequent data extraction and transformation, and the tolerance of the managed services and applications to potential delays.
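A connector with a configurable sampling rate could be sketched as below. The class SamplingConnector and its method are hypothetical; the extraction task is passed in as a Runnable so that any of the extraction mechanisms described above could be plugged in, and the 500 ms period in the usage example simply mirrors the Postgres default mentioned earlier.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of a connector that samples at a configurable fixed rate. */
public class SamplingConnector {

    /**
     * Runs the extraction task every periodMs milliseconds for durationMs milliseconds
     * and returns the number of samples taken.
     */
    public static int sample(Runnable extraction, long periodMs, long durationMs) {
        AtomicInteger samples = new AtomicInteger();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            extraction.run();          // extract raw data and triplify it
            samples.incrementAndGet(); // count completed samples
        }, 0, periodMs, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(durationMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
        } finally {
            scheduler.shutdownNow();            // stop sampling
        }
        return samples.get();
    }

    public static void main(String[] args) {
        int taken = sample(() -> System.out.println("extract + triplify"), 500, 2000);
        System.out.println("samples taken: " + taken);
    }
}
```

Because the period is a plain parameter, each connector can be tuned independently to the update rate of its own data source.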
Also, as can be seen from the listings above, only the last RDF triple contains the actual number of client connections, whereas the other three triples describe the semantic context of that value and do not change as frequently. In these circumstances, we can distinguish between truly dynamic and relatively static values, and sending both types of triples at the same rate may result in an increased amount of redundant data being sent over the network. Both issues, namely the optimisation of the RDF sampling rates and the minimisation of redundant data sent over the network, are seen as potential directions for future research and are further discussed in Chapter 8.
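One possible way to reduce such redundancy, given here only as a sketch and not as a feature of the current framework, is to send context triples once and value triples on every sample. The class RedundantTripleFilter is hypothetical, and treating cso:hasValue as the only dynamic predicate is an assumption made for the example. Whether the receiving side can retain the static triples across processing windows is outside the scope of this sketch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: forwards static (context) triples once and dynamic (value) triples always. */
public class RedundantTripleFilter {

    // Assumption for this example: only triples with this predicate change between samples.
    private static final String DYNAMIC_PREDICATE = " cso:hasValue ";

    private final Set<String> staticTriplesSent = new HashSet<String>();

    /** Returns the subset of the given triples that actually needs to be sent. */
    public List<String> filter(List<String> triples) {
        List<String> toSend = new ArrayList<String>();
        for (String triple : triples) {
            boolean dynamic = triple.contains(DYNAMIC_PREDICATE);
            // Set.add() returns false if the static triple has been sent before.
            if (dynamic || staticTriplesSent.add(triple)) {
                toSend.add(triple);
            }
        }
        return toSend;
    }
}
```

With the four triples of Listing 6.4 as input, the first call would forward all four triples and subsequent calls only the value triple.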