
The Database Systems and Information Management Group at Technische Universität Berlin

1 Introduction

The Database Systems and Information Management Group, in German known by the acronym DIMA, is part of the Department of Software Engineering and Theoretical Computer Science at the TU Berlin. It is led by Prof. Dr. Volker Markl and consists of 3 postdocs, 8 research associates and 19 student assistants.

2 Research Areas

The Database Systems and Information Management Research Group (DIMA) under the direction of Volker Markl conducts research in the areas of information modeling, business intelligence, query processing, query optimization, the impact of new hardware architectures on information management, and applications. While having a strong focus on system building and validating research in practical scenarios and use cases, the group aims at exploring and providing fundamental and theoretically sound solutions to current major research challenges. The group interacts closely with researchers at prestigious national and international academic institutions and carries out joint research projects with leading IT companies, including Hewlett Packard, IBM, and SAP, as well as innovative small and medium enterprises.

In the following paragraphs, we present our main research projects.

2.1 Stratosphere

Our flagship project is a Collaborative Research Unit funded by the Deutsche Forschungsgemeinschaft (DFG) in which the Technische Universität Berlin, the Humboldt-Universität zu Berlin, and the Hasso-Plattner-Institut in Potsdam are jointly researching "Information Management on the Cloud". Stratosphere aims at considerably advancing the state of the art in data processing on parallel, adaptive architectures. Stratosphere (named after the layer of the atmosphere above the clouds) explores the power of massively parallel computing for complex information management applications. Building on the expertise of the participating researchers, we aim to develop a novel, database-inspired approach to analyze, aggregate, and query very large collections of either textual or (semi-)structured data on a virtualized, massively parallel cluster architecture.

Stratosphere conducts research in the areas of massively parallel data processing engines, a programming model for parallel data programming, robust optimization of declarative data flow programs, continuous re-optimization and adaptation of the execution, data cleansing, and text mining. The unit will validate its work through a benchmark of the overall system performance and by demonstrators in the areas of climate research, the biosciences, and linked open data. The goal of Stratosphere is to jointly research and build a large-scale data processor based on concepts of robust and adaptive execution. We are researching a programming model that extends a functional map/reduce programming model with additional second-order functions. As execution platform we use the Nephele system, a massively parallel data flow engine which is also researched and developed in the project. We are examining real-world use cases in the areas of climate research, information extraction and integration of unstructured data in the life sciences, as well as linked open data and social network graph data.
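To make the idea of extending map/reduce with additional second-order functions more concrete, the following is a minimal, self-contained Python sketch of operators that each take a user-defined first-order function and apply it to records or groups of records. It is an illustrative toy under invented names (pact_map, pact_reduce, pact_cross), not Stratosphere's actual programming interface, which is not detailed above.

```python
# Toy sketch of "second-order functions" beyond plain map/reduce.
# Each operator takes a user-defined first-order function (udf) and
# decides how to apply it to the input records. Names are invented.
from itertools import groupby, product
from operator import itemgetter

def pact_map(udf, records):
    # Apply the user function independently to every record.
    return [out for rec in records for out in udf(rec)]

def pact_reduce(udf, records, key):
    # Group records by key, then hand each whole group to the user function.
    records = sorted(records, key=key)
    return [out for _, grp in groupby(records, key=key) for out in udf(list(grp))]

def pact_cross(udf, left, right):
    # A second-order function absent from plain map/reduce: build the
    # Cartesian product of two inputs and apply the user function per pair.
    return [out for pair in product(left, right) for out in udf(pair)]

if __name__ == "__main__":
    lines = ["cloud data", "cloud"]
    pairs = pact_map(lambda line: [(w, 1) for w in line.split()], lines)
    counts = pact_reduce(lambda grp: [(grp[0][0], sum(c for _, c in grp))],
                         pairs, key=itemgetter(0))
    print(counts)                                   # [('cloud', 2), ('data', 1)]
    print(pact_cross(lambda pair: [pair], [1, 2], ["a", "b"]))
```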

2.2 MIA

The German-language web consists of more than six billion web sites and is second in size only to the English-language web. This vast amount of data could potentially be used for a large number of applications, such as market and trend analysis, opinion and data mining for Business Intelligence, or applications in the domain of language processing technologies. The goal of MIA – A Marketplace for Trusted Information and Analysis – is to create a marketplace-like infrastructure in which this data is stored, refined, and made available in such a way that it enables trade in refined and aggregated data and value-added services. In order to achieve this, we draw upon the results of our substantial research in the areas of Cloud Computing and Information Management.

The marketplace provides the German-language web and its history as a data pool for analysis and value-added services. The initial version focuses on use cases in the domains of media, market research, and consulting. These use cases have special data privacy and security requirements that will be observed. Gradually, the platform will be expanded with additional use cases and services as well as internationalization.

The proposed infrastructure enables new business models with information as a tradable good, which build on algorithmic methods that extract information from semi-structured and unstructured data. By using the platform to collaboratively analyze and refine the data of the German-language web, businesses significantly reduce expenses while at the same time jointly creating the basis for a data economy. This will enable even small and medium-sized businesses to access and compete in this market.

2.3 GoOLAP.info

Today, the Web is one of the world's largest databases. However, due to its textual nature, aggregating and analyzing textual data from the Web analogously to a data warehouse is a hard problem. For instance, users may start from huge amounts of textual data and drill down into tiny sets of specific factual data, may manipulate or share atomic facts, and may repeat this process in an iterative fashion. In the GoOLAP – The Web as Data Warehouse project we investigate fundamental problems in this process: What are common analysis operations of "end users" on natural-language Web text? What is the typical iterative process for generating, verifying, and sharing factual information from plain Web text? Can we integrate both the "cloud", a cluster of massively parallel working machines, and the "crowd", the end users of GoOLAP.info, to solve hard problems such as training tens of thousands of fact extractors, verifying billions of atomic facts, or generating analytical reports from the Web?

The current prototype GoOLAP.info already contains factual information from the Web about several million objects. The keyword-based query interface focuses on simple query intentions, such as "display everything about Airbus", or complex aggregation intentions, such as "list and compare mergers, acquisitions, competitors, and products of airplane technology vendors".
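As an illustration of the two kinds of query intentions mentioned above, the following Python sketch runs them over a small hand-written table of fact triples. The fact tuples, relation names, and helper functions are invented for this example and do not reflect GoOLAP.info's actual data or interface.

```python
# Invented example triples of the form (subject, relation, object).
FACTS = [
    ("Airbus", "competitor", "Boeing"),
    ("Airbus", "product", "A380"),
    ("Boeing", "product", "787"),
]

def everything_about(entity, facts=FACTS):
    # Simple query intention: return every fact mentioning the entity.
    return [f for f in facts if entity in (f[0], f[2])]

def aggregate_by_relation(entities, relations, facts=FACTS):
    # Aggregation intention: group matching facts per relation for a set
    # of entities (e.g., compare competitors and products of vendors).
    result = {rel: [] for rel in relations}
    for subj, rel, obj in facts:
        if subj in entities and rel in relations:
            result[rel].append((subj, obj))
    return result

if __name__ == "__main__":
    print(everything_about("Airbus"))
    print(aggregate_by_relation({"Airbus", "Boeing"}, {"competitor", "product"}))
```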

2.4 ROBUST

Online communities play a central role in vital business functions such as corporate expertise management, marketing, product support, and customer relationship management. Communities on the web easily grow to millions of users and thus need a scalable infrastructure capable of handling millions of discussion threads containing billions of posts. The EU integrated project ROBUST – Risk and Opportunity Management of huge-scale BUSiness communiTies – develops methods and models to monitor and understand the behavior and requirements of users and groups in these communities. A massively parallel cloud infrastructure will handle the processing and analysis of the community data. Project partners like SAP or IBM host communities for customer support on the internet as well as communities for knowledge management in their intranets, which require highly scalable infrastructures for real-time data analysis.

DIMA contributes to the areas of massively parallel processing of community data as well as community-based text analytics and information extraction.

2.5 SCAPE

The SCAPE – SCAlable Preservation Environments – project will develop scalable services for planning and execution of institutional preservation strategies on an open-source platform that orchestrates semi-automated workflows for large-scale, heterogeneous collections of complex digital objects. These services will be able to:

• Identify the need to act to preserve all or parts of a repository through characterisation and trend analysis;

• Define responses to those needs using formal descriptions of preservation policies and preservation plans;

• Allow a high degree of automation, virtualisation of tools, and scalable processing;

• Monitor the quality of preservation processes.

The SCAPE consortium brings together experts from memory institutions, data centres, research labs, universities, and industrial firms in order to research and develop scalable preservation systems that can be practically deployed within the next three to five years. SCAPE is dedicated to producing open-source software solutions available to the entire digital preservation community. The project results will be curated and further exploited by the newly founded Open Planets Foundation. Project results will also be exploited by a small-to-medium enterprise and research institutions within the consortium catering to the preservation community and by two large industrial IT partners.

2.6 BIZWARE

The Database Systems and Information Management Group (DIMA) of the TU Berlin is a research partner in the BMBF-funded regional business initiative BIZWARE, in which several industrial partners from Berlin, the TU Berlin, and the Fraunhofer Institute FIRST work together to advance the long-term scientific and economic development of holistic, model-based software development covering the whole software lifecycle.

In close collaboration with our industrial partners we will develop the "model and software factory" and a runtime environment that allows modeling, generating, and running software components and applications based on domain-specific languages. The goal of the project is to provide innovative technology and methods to automate the phases of software development processes.

Within the BIZWARE initiative, TU Berlin works on the sub-project "Lifecycle management for BIZWARE applications". The joint project will develop the infrastructure and tools to run, test, and configure applications that have been developed with the BIZWARE factory. Furthermore, the results of the project will enable monitoring of the applications in a technical and business manner and provide an environment optimized for end users, test engineers, and software operators. The main focus of TU Berlin is software lifecycle management, which deals with the management of models, software artifacts, and components in dynamic repositories.

2.7 SINDPAD

Parallelization becomes more and more important, even for the architecture of single machines. Recent advances in processor technologies achieve only small performance improvements for single cores. Increasing the compute power of modern architectures mandates increasing the number of compute cores on a single central processing unit (CPU). Graphics Processing Units (GPUs) have a long history of scale-out through parallel processing on many compute cores. Graphics adapters nowadays offer a highly parallel execution environment that, within the context of GPGPU (General-Purpose Processing on Graphics Processing Units), is frequently used in scientific computing. The challenge of GPGPU programming is to design applications for the SIMD architecture (Single Instruction, Multiple Data) of graphics adapters, which allows only a limited range of operators and very limited synchronization mechanisms.

In the course of the SINDPAD project, we will develop an indexing and search technology for structured data sets. We will leverage graphics adapters to support query execution. SINDPAD aims at achieving unprecedented performance compared to conventional systems of equal cost. We consider taking advantage of application characteristics to accelerate data processing. Especially for Business Intelligence (BI) applications, the schema enables the system to store specific data on graphics adapters. This can lead to further speed-ups.

Researchers of the Database Systems and Information Management (DIMA) group at the TU Berlin will play a significant role in the conceptual planning and implementation of algorithms for hybrid GPU/CPU processing. We will analyze query processing algorithms and devise metrics to compare the performance of GPU operators and CPU operators. The SINDPAD: Query Processing on GPUs project is funded by the German Federal Ministry of Economics and Technology and is carried out in cooperation with empulse GmbH.
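As a rough illustration of such a performance comparison, the following CPU-only Python sketch contrasts a tuple-at-a-time selection operator with a data-parallel (vectorized) one and reports a speedup ratio as the metric. The vectorized NumPy kernel merely stands in for a GPU operator here; the function names and setup are our own and are not part of the SINDPAD system.

```python
# Illustrative comparison of a scalar selection operator with a
# data-parallel one; the speedup ratio serves as the comparison metric.
import time
import numpy as np

def scalar_select(values, threshold):
    # Tuple-at-a-time selection, as a naive CPU operator might evaluate it.
    return [v for v in values if v > threshold]

def vectorized_select(values, threshold):
    # Data-parallel selection over the whole column at once
    # (a stand-in for a SIMD/GPU-style operator).
    return values[values > threshold]

def speedup(n=2_000_000, threshold=0.5, seed=0):
    rng = np.random.default_rng(seed)
    column = rng.random(n)
    t0 = time.perf_counter(); scalar_select(column.tolist(), threshold)
    t1 = time.perf_counter(); vectorized_select(column, threshold)
    t2 = time.perf_counter()
    return (t1 - t0) / (t2 - t1)   # >1 means the data-parallel operator wins

if __name__ == "__main__":
    print(f"data-parallel speedup: {speedup():.1f}x")
```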

2.8 ELS

Increasingly, standards for railway systems require novel solutions for mainstream problems, such as the realization of optimal energy efficiency for complex control systems. For example, in order to optimize an ITCS (Intermodal Transport Control System), we require a centralized computer network system that reports and evaluates a carrier's particular situation to enable analysts to make informed decisions on problems of great interest. Achieving this objective would enable the reduction of traction energy demands.

Among the basic components in an ITCS are a centralized computer system, a data communication system, and an on-board computer. In practice, there are numerous influential factors, such as the position of the vehicle and additional vehicular data (e.g., environmental impact, intermodal roadmap conditions, etc.), which must be considered at the design level to realize significant energy conservation. The evaluation of these influential factors involves real-time communication between the rail vehicle and the control station.

The online system components comprise the control centre (ecoC), the underground vehicle component (ecoM), and data communication. Process-independent post-processing of the operating schedule will have to be ensured by an offline component in the control centre. The offline simulation processes and mechanisms for analyzing the impact of simulation decisions are part of this offline component. For the transmission of essential data to the on-board computer in real time, an interface to the vehicle database will be defined. The system component ecoM additionally contains a module that supports the train operator in predictable driving. All functions and programs are bundled and stored in the ecoC manager to support a central, energy-optimal procedure for rail transport. The reduction of the work data for use by the ITCS central station for situational analysis, the selection, storage, and further processing of work data, central optimization, the calculation of management decisions, and the administration of failure and management decision proposals will have to be considered.

In the ELS - An Optimal Energy Control & Failure Management System project, members of the DIMA group at TU Berlin will play a significant role in the:

• conceptualization of a knowledge database for relevant operational scenarios,

• identification and description of data streams,

• construction of efficient renewal strategies in the event of failures,

• articulation of functional and technical specifications.

Moreover, we will also be involved in the implementation of standardized interfaces for the transmission of ELS data and in performing integration tests. Additionally, interfaces for internal and third-party components will have to be carefully designed to meet specific conventions and ensure the optimization of the control system.

3 Teaching

At TU Berlin, we strive to combine teaching with research and practical settings.

Undergraduate and graduate coursework offerings include the usage and implementation of database systems, information modeling, and information integration. In addition to standard database classes, we offer many interesting student projects (combining lectures with hands-on practical exercises) in the areas of data warehousing and business intelligence as well as large-scale data analytics and data mining. Our courses cover current research trends, novel developments, and research results. For practical exercises, we use both commercial systems and open-source software (e.g., Apache Hadoop and Mahout).

The lectures, seminars, and projects offered at DIMA all aim to educate students not only in technology and theory, but also to help foster social skills with respect to teamwork, project management, and leadership, as well as business acumen. Theoretical lectures are accompanied by practical lab courses and exercises, where students learn to work in teams and jointly find solutions to larger problems.

We also give students the opportunity to apply the skills learned in our courses in practical settings. Because we believe this to be very important, we regularly offer research and teaching assistant positions for both graduate and doctoral students, and we help place students in industrial internships with leading international companies.

4 Further Information

Further information on teaching and research can be found on the web pages of the DIMA institute at www.dima.tu-berlin.de.
