Computational storage: an efficient and scalable platform for big data and HPC applications

(1)

Computational storage: an efficient

and scalable platform for big data and HPC

applications

Mahdi Torabzadehkashi

1,2*

_{, Siavash Rezaei}

1,2

_{, Ali HeydariGorji}

1,2

_{, Hosein Bobarshad}

2

_{, Vladimir Alves}

2

and Nader Bagherzadeh

1

Abstract

In the era of big data applications, the demand for more sophisticated data centers and high-performance data processing mechanisms is increasing drastically. Data are origi-nally stored in storage systems. To process data, application servers need to fetch them from storage devices, which imposes the cost of moving data to the system. This cost has a direct relation with the distance of processing engines from the data. This is the key motivation for the emergence of distributed processing platforms such as Hadoop, which move process closer to data. Computational storage devices (CSDs) push the “move process to data” paradigm to its ultimate boundaries by deploying embedded processing engines inside storage devices to process data. In this paper, we intro-duce Catalina, an efficient and flexible computational storage platform, that provides a seamless environment to process data in-place. Catalina is the first CSD equipped with a dedicated application processor running a full-fledged operating system that provides filesystem-level data access for the applications. Thus, a vast spectrum of applications can be ported for running on Catalina CSDs. Due to these unique features, to the best of our knowledge, Catalina CSD is the only in-storage processing platform that can be seamlessly deployed in clusters to run distributed applications such as Hadoop MapReduce and HPC applications in-place without any modifications on the underlying distributed processing framework. For the proof of concept, we build a fully functional Catalina prototype and a CSD-equipped platform using 16 Catalina CSDs to run Intel HiBench Hadoop and HPC benchmarks to investigate the benefits of deploying Catalina CSDs in the distributed processing environments. The experimental results show up to 2.2× improvement in performance and 4.3× reduction in energy consumption, respectively, for running Hadoop MapReduce benchmarks. Addition-ally, thanks to the Neon SIMD engines, the performance and energy efficiency of DFT algorithms are improved up to 5.4× and 8.9×, respectively.

Keywords: Computational storage, In-storage processing, Near-data processing, SSD, Big data, Hadoop, HPC, DFT, System-on-chip

Open Access

© The Author(s) 2019. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creat iveco mmons .org/licen ses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

RESEARCH

*Correspondence: [email protected]

1_{University of California,}

(2)

Introduction

The modern human’s life has been technologized, and nowadays, people rely on big data applications to receive services such as healthcare, entertainment, government services, and transportation in their day-to-day lives. As the usage of these services becomes uni-versal, people generate more unprocessed data, which increases the demand for more sophisticated data centers and big data applications. At current pace, 2.5 quintillion

bytes of data is created each day [1]. Due to the huge volume of data created every

sec-ond, new concepts such as the age of information (AoI) has been established to access

the latest version of data [2]. According to the well-known 4V’s characteristics of big

data, the big data applications need to deal with a very large Volumes of data, which are

in Various types, and their Velocity is more than conventional data, while the data Verac-ity is not confirmed [3].

To process data with the aforementioned characteristics, data should frequently move between storage systems and memory units of the application servers. This high-cost data movement imposes energy consumption and degrades the performance of big data applications. To overcome this issue, data processing has moved toward a new

para-digm: “move process to data” rather than moving high volumes of data. Figure 1

com-pares the traditional “data move to process” concept versus the “move process near data”

paradigm.

In the modern clusters, nodes are connected through a low-latency and power-hungry

interconnect network such as InfiniBand [4] or Myrinet [5]. In such systems, moving

data can be more expensive than processing data [6], and moving the data through an

interconnect network is much more costly. In fact, accessing and transferring stored data from storage systems is a huge barrier toward reaching compelling performance

and energy efficiency. To deal with this issue, some frameworks such as Hadoop [7]

pro-vide mechanisms to process data near where they reside. In other words, these frame-works push process closer to data to avoid massive data movements between storage systems and application servers.

In-storage processing (ISP) is the concept of pushing the process closer to data in its ultimate boundaries. This technology proposes utilizing embedded processing engines inside the storage devices to make them capable of running user applications in-place, so data do not need to leave the device to be processed. This technology has been around for a few years. However, the modern solid-state drives (SSDs) architecture, as well as

data

application server 1

interconnection network

data

server 1

data

server 2

data

server 3

data

server 4

data

server 5

interconnection network

(3)

the availability of powerful embedded processors, make it more appealing to run user applications in-place. SSDs deliver higher data throughput in comparison to hard disk drives (HDDs). Additionally, in contrast to the HDDs, the SSDs can handle multiple I/O commands at the same time. These differences have motivated researchers to modify applications based on the modern SSD architectures to improve the performance of the

applications significantly [8].

The SSDs contain a considerable amount of processing horsepower for managing flash memory array and providing a high-speed interface to host machines. These processing capabilities can provide an environment to run user applications in-place. Based on the reasons mentioned above, this paper focuses on the modern SSD architecture, and in the rest of the paper, computational storage device (CSD) refers to an SSD capable of run-ning user applications in-place.

In an efficient CSD architecture, the embedded ISP engine has access to the data stored in flash memory array through a low-power and high-speed link. Thus, the deployment of such CSDs in clusters can increase the overall performance and efficiency of big data and high-performance computing (HPC) applications. The CSDs and the Hadoop

plat-form both emerged based on the concept of “move process closer to data,” and they both

can be deployed simultaneously in a cluster. This combination can minimize the data movement and increase efficiency for running big data applications in a Hadoop cluster. However, HPC clusters that are not developed based on the Hadoop platform can still utilize CSDs to improve performance and energy consumption.

Processing user applications inside storage units without sending data to the host pro-cessor seems appealing; however, proposing a flexible and efficient CSD architecture has the following challenges.

1. ISP engine: SSDs come with multiple processing cores to run the conventional SSD controller routines. These cores can be utilized for running user applications as well. However, there are two major problems in utilizing the existing SSD cores for in-storage processing. First, these cores are usually busy doing normal SSD operations and using them for running user applications can negatively affect the I/O perfor-mance of the drive. Second, these processing engines are usually real-time cores such as ARM Cortex-R series, which limits the category of the applications that can efficiently run on these cores; also, user applications need major modifications to be able to run on these cores.

2. Host-CSD communication: In a CSD architecture, there should be a mechanism for the communication between host and CSD to submit ISP commands from the host to CSD and receive the results. The conventional SSDs have one physical link con-nected to the host, which is designed for transferring data. There are many protocols

for sending data through this link such as SATA [9], SAS [10], and NVMe over PCIe

[11]. None of these protocols are designed for sending ISP commands and results.

Thus, this is the responsibility of the CSD designer to provide an ISP communication protocol between host and CSD.

(4)

data, and any application running in-place should not expect to be able to access the filesystem-level data. This limits the type of programming models available for devel-oping ISP-enabled applications, and also the reuse of other applications. Hence, the CSD designer should provide a mechanism to access the filesystem metadata inside the ISP engine so that applications that are running in-place can open files, process data, and finally create output files to write back the results.

4. Host-CSD data synchronization: In a CSD-equipped host, both host and ISP engine have access to the same flash memory array. In such a system, without a synchroni-zation mechanism, these two machines may not be able to see each other’s modifica-tions and could result in data corruption.

5. CSD as an augmentable resource: Adding CSDs to a host machine should not limit the host from accessing the data and processing it. The processing horsepower of the CSDs should be an augmentable resource so that the host and CSDs could process data simultaneously. If processing an application in CSD interferes with the host’s access to the data, this would dramatically decrease the utilization of the host and the efficiency of the whole system. A well-designed CSD architecture allows the host to access data stored in the flash memory at any time.

6. Adaptability: CSDs should provide a flexible environment for running different types of applications in-place. If the ISP engine of a CSD supports very limited program-ming languages or needs users to rewrite the application based on a specific pro-gramming model, this can affect the adoptability of the CSD significantly.

7. Distributed in-storage processing: A single CSD with limited processing horsepower may not be able to enhance an application’s performance significantly, so in many cases, there should be multiple CSDs orchestrating together to deliver compelling performance improvement. For doing such distributed processing, CSD designers need to provide the required tools for implementing a distributed processing envi-ronment among multiple CSDs.

8. ISP for high-performance computing: Highly demanding applications such as HPC algorithms can potentially run inside CSDs. However, to serve this class of applica-tions properly, CSDs should be able to boost their performance for some specific applications. In other words, the CSD architecture should be customizable to run some applications in an accelerated mode. Hence, CSD designers are required to provide ASIC- or FPGA-based accelerators to run highly demanding applications satisfactorily.

In this paper, we propose an efficient CSD platform, named Catalina, which addresses all the challenges mentioned above and is flexible enough to run HPC applications in-place

or playing the role of an efficient DataNode in a Hadoop cluster. Catalina is equipped

with a full-fledged Linux operating system (OS) running on a quad-core ARM 64-bit application processor which is dedicated to run user applications in-place. Catalina

uti-lizes a Xilinx Zynq Ultrascale+ MPSoC [12] as the main processing engine and contains

(5)

CSDs [13]. This property makes it easier to adopt this technology in Hadoop clusters. For the proof of concept, we developed a fully functional Catalina CSD prototype and built a platform equipped with 16 Catalina CSDs.

The contributions of this paper can be summarized as follows:

• Proposing the deployment of computational storage devices in clusters, and end-to-end integration of CSDs in clusters for running Hadoop MapReduce and HPC appli-cations.

• Describing the hardware and software architecture of Catalina as the first computa-tional storage device which is seamlessly adoptable in Hadoop and MPI-based clus-ters without any modification in the cluster’s software or hardware architecture. • Prototyping Catalina CSD and developing a platform equipped with 16 Catalina

CSDs to practically evaluate the benefits of deployment of computational storage devices in the clusters.

• Exploring the performance and energy consumption of CSD-equipped clusters using Intel HiBench Hadoop benchmark suite as well as 1D-, 2D-, and 3D-DFT operations on large datasets as the HPC benchmarks.

The rest of this paper is organized as follows: “Background” section reviews subjects

covered in this paper, such as the modern SSD architecture, ISP technology, and Hadoop

platform architecture. In “Method” section, we present the hardware and software

archi-tecture of Catalina CSD and describe the unique features that make it capable of seam-lessly running a wide spectrum of applications in-place. This section also illustrates the developed Catalina prototype and how it can be deployed in clusters to run Hadoop

MapReduce and HPC applications. “Results and discussion” section investigates the

benefits of deploying the proposed solution for running Hadoop MapReduce and HPC applications on clusters. This section demonstrates how increasing the number of

Cat-alina CSDs improves energy consumption and performance of the applications. “Related

works” section reviews the related works in the field of in-storage processing, and “ Con-clusion” section concludes the paper with some final remarks.

Background

The storage system, where data originally reside, plays a crucial role in the performance of applications. In a cluster, the data should be read from the storage system to mem-ory units of the application servers to be processed. As the size of data increases, the role of the storage system becomes more important since the nodes need to talk to the storage units more frequently to fetch data and write back the results. Recently, cluster architects have considered solid-state drives (SSDs) over hard disk drives (HDDs) as the major storage units in modern clusters due to better power efficiency and higher data transition rate [14].

(6)

to communicate with the host, such as NVMe over PCIe [11]. Implementing such inter-faces require embedding more processing horsepower inside SSDs. A modern SSD con-troller is composed of two main parts: 1—front-end (FE) processing engine providing high-speed host interface protocol such as NVMe/PCIe, and 2—back-end (BE) process-ing engine which deals with flash management routines. These two engines talk to each other to accomplish host’s I/O commands.

A NAND flash memory chip is a package containing multiple dies. A die is the small-est unit of flash memory that can independently execute I/O commands and report status. Each die is composed of a few planes, and each plane contains multiple blocks. Erasing is performed at the block-level, so a flash block is the smallest unit that can be erased. Inside each block, there are several pages which are the smallest units that can be programmed and written. The key point in this hierarchical architecture is the pro-grammable unit versus the erase unit. The NAND flash memory can be programmed in page-level, which is usually 4 to 16 kB, while the erase operation cannot be done on a smaller segment than a block which is few megabytes of memory. Also, each flash block can be erased for a finite number of times, and flash blocks wear as erase operations take place, so it is important to balance the number of erase operations among all the flash blocks of an SSD. The process of leveling the number of erase operations is called wear leveling. In addition, the logical address exposed to the host is different from physical block addresses, so there are multiple tables for logical and physical address translation. The flash translation layer (FTL) is composed of all the routines needed to manage flash memory arrays such as logical block mapping, wear leveling, and garbage collection. Overall, the BE processing subsystem of a modern SSD architecture handles FTL, while the FE subsystem provides protocols to communicate with the host. The details of FTL processes and NVMe protocol are out of the scope of this paper. The high-level

architec-ture of a modern SSD is demonstrated in Fig. 2.

The Webscale data center designers have been trying to develop storage architectures that favor high-capacity hosts, and this fact is highlighted at OpenCompute (OCP) by

Microsoft Azure and Facebook that call for up to 64 SSDs attached to each host [15].

front-end

engine NAND flash … NAND flash

channel #1

NAND flash … NAND flash channel #16

…

host

16×512MBps = 8GBps

back-end engine DRAM

controller

switch

NVMe NVMe

NVMe

…

SSD #1

~ 32 GBps

SSD #2

SSD #64

~ 64×1 GBps = 64 GBps

(7)

In Fig. 2, such a storage system is shown where 64 SSDs are attached to a host. For the sake of simplicity, only the details of one SSD are demonstrated. Modern SSDs usu-ally contain 16 or more flash memory channels which can be utilized concurrently for flash array I/O operations. Considering 512 MBps bandwidth per channel, the internal bandwidth of an SSD with 16 flash memory channels is 8 GBps. This huge bandwidth decreases to about 1 GBps due to the complexity of the host interface software and hard-ware architecture. In other words, the accumulated bandwidth of all internal channels of the 64 SSDs reaches the multiplication of the number of SSDs, number of channels per SSD, and 512 MBps (bandwidth of each channel) which is equal to 512 GBps. While the accumulated bandwidth of the SSDs’ external interfaces is equal to 64 multiply by 1 GBps (the host interface bandwidth of each SSD) which is 64 GBps. However, In order to talk to the host, all SSDs required to be connected to a PCIe switch. Hence, the avail-able bandwidth of the host is limited to 32 GBps.

Overall, there is a 16× gap between the accumulated internal bandwidth of all SSDs

and the bandwidth available to the host. In other words, for reading 32 TB of data, the host needs 16 min while internal components of the SSDs can read the same amount of data in about 1 min. Additionally, in such storage systems, data need to continuously move through the complex hardware and software stack between hosts and storage units, which imposes a considerable amount of energy consumption and dramatically decreases the energy efficiency of large data centers. Hence, storage architects need to develop techniques to decrease data movement, and ISP technology has been intro-duced to overcome the aforementioned challenges by moving process to data.

In a traditional CPU-centric scheme, data always move from storage devices to CPU

to be processed, and this mechanism, which is inherently limited by the von Neumann

bottleneck, is the root cause of the challenges above, especially when many SSDs are

con-nected to a host. ISP technology proposes a contrary approach to push the “move process

to data” to its ultimate boundaries where a processing engine inside storage unit takes advantage of high-bandwidth and low-power internal data links and processes data

in-place. In fact, “move process to data” is the concept that leads to the emergence of both

ISP and distributed processing platforms such as Hadoop. Later in this section, it will be described how the Hadoop platform and ISP technology can simultaneously work together in a cluster.

The ISP technology minimizes the data movements in a cluster and also increases the processing horsepower of the cluster by augmenting power-efficient processing engines to the whole system. This technology can potentially be applied to both HDDs and SSDs; however, modern SSD architecture provides better tools for developing such technolo-gies. The SSDs which can run user application in-place are called computational storage devices (CSDs). These storage units are augmentable processing resources, which means they are not designed to replace the high-end processors of modern servers. Instead, they can collaborate with the host’s CPU and augment their efficient processing horse-power to the system.

It is noteworthy that CSDs are fundamentally different than object-based storage

sys-tems such as Seagate Kinetic HDDs [16], which transfer data at the object-level instead

(8)

ID. Consequently, the host does not require to maintain metadata of block addresses of the object. On the other hand, CSDs can run user applications in-place without sending data to a host. There is a vast literature in this field that proposed different CSD architec-tures and investigated the benefits and challenges of deploying CSDs for running

appli-cations in-place. In “Related works” section, we will review the important works in this

field.

A while ago, when the cost of data movement was insignificant in comparison with the computational cost, there could be a centralized storage system, and other hosts could send requests to it to fetch data blocks. With this mechanism and today’s volume of data, a data-intensive application requires large amounts of data to be fetched from the stor-age system, and such huge data movements drastically increase energy consumption. With the emergence of big data, the storage system can no longer be centralized, and the traditional approaches come short of satisfying super-scale applications’ demands, which call for scalable processing platforms. To answer these demands, distributed pro-cessing platforms such as Hadoop are proposed to process data near where they reside [17].

Hadoop has emerged as the leading computing platform for big data analytics and

is the backbone of hyperscale data centers [18], where hundreds to thousands of

com-modity servers are connected to provide service to clients. The Hadoop distributed processing platform consists of two main parts, namely Hadoop filesystem (HDFS) and MapReduce engine. HDFS is the underlying filesystem of the Hadoop platform, and it is responsible for partitioning the data to blocks and distribute the data blocks among nodes. HDFS also generates a certain number of replicas of each block to make the

sys-tem resilient against storage or node failures. It consists of a NameNode host which takes

care of filesystem metadata such as the location of the data block and status of the other

nodes, and multiple DataNodes hosts that store the data blocks.

On top of HDFS, MapReduce platform takes advantage of the partitioned data and runs map and reduce functions and orchestrates the cluster nodes to run distributed applications while data movements are minimalized. The Apache MapReduce 2 (YARN)

is one of the well-known MapReduce platforms [19]. YARN agents manage the

proce-dure of running map tasks, performing shuffle and sort, and running the reduce func-tions to generate the output of the MapReduce application. The most important agents

in the YARN framework are global resource manager (RM), one node manager (NM) per

processing node, and an application master (AM) per MapReduce application.

The RM has a list of all the resources available in the cluster and manages the

high-level resource allocations to MapReduce applications. On the other hand, NMs that run

on the processing nodes manage the local hosts’ resources. The RM regularly talks to the

NMs to manage all the resources and poll the status of the nodes. For each MapReduce

application, there is an AM to observe and manage the progress of the application.

Regu-larly, RM runs in the same node that HDFS NameNode runs (head node), and NMs run

together with the HDFS DataNodes. Figure 3 shows a high-level overview of a

MapRe-duce application on a Hadoop platform.

In the Hadoop environment, data should initially be imported to the HDFS. This process includes partitioning the input data to data blocks and storing them in the

(9)

fashion. Since map functions preferably process the local data blocks, the MapReduce framework is known for moving process closer to data in order to improve the power efficiency and performance of the applications.

A MapReduce application targets a set of input data blocks as well as the user-defined map and reduce functions. The procedure starts by running the map function on the

tar-geted data blocks. Map function instances run concurrently on the DataNodes, consume

data blocks, and produce a set of key/value pairs to be used as the input of the reduce

function. These intermediate key/value pairs are stored locally on the DataNodes that

execute the map function and should be shuffled, sorted, and transferred to the nodes which run the reduce tasks. The Hadoop framework stores the output of the reduce function in HDFS, and subsequently, it can be imported to a host’s local filesystem.

The Hadoop MapReduce platform provides an efficient environment for processing large datasets in a distributed fashion. However, there are research papers in the liter-ature that optimize Hadoop MapReduce platform for different workloads. Generally, HDFS partitions data to fixed-size 64 or 128 megabytes data blocks. The data-inten-sive MapReduce applications send numerous I/O requests to the Hadoop I/O sched-uler, and this leads to a long queue of I/O requests in the schedsched-uler, and consequently, a substantial increase in the data latency. Bulk I/O dispatch (BID) is proposed as a Hadoop I/O scheduler that is optimized for data-intensive MapReduce applications

and can improve the access time of such applications significantly [20].

In some big data applications, the data that are generated in a processing node will be consumed by the same node in the near future. The lineage-aware data management (LDM) is proposed for Hadoop-based platforms to exploit this data locality to decrease

the network footprint [21]. LDM develops a data block graph which is used to

character-ize the future data usage pattern. Using this information, it defines tier-aware storage policies for the Hadoop nodes to mitigate the impact of data dependency.

The Hadoop strategy of “processing data close to where they reside” is completely

aligned with the ISP paradigm [22]. So, they can fortify each other’s benefits when

both are deployed concurrently in a cluster. In other words, Hadoop-enabled CSDs

can play both roles of fast storage units for conventional Hadoop DataNodes and

ISP-enabled DataNodes simultaneously, resulting in augmentation of processing

horse-power of the CSDs to the Hadoop cluster.

Although CSDs can improve the overall performance of a MapReduce applica-tion by augmenting their processing engine to the Hadoop framework, this is not the primary advantage of deploying CSDs in the clusters. Potentially, increasing the total horsepower of a cluster can be achieved by adding more commodity nodes to a

input data

data block #1

Processed data data block #2

data block #N

...

result

HDFS

map map

map

reduce reduce

reduce

shuffling

(10)

cluster. What makes CSDs distinguishable is, in fact, utilizing the high-performance and power-efficient internal data links of the modern SSD architecture for running Hadoop MapReduce applications.

On the other hand, well-designed CSDs can be deployed to run HPC applications in-place. However, CSDs need to deliver a compelling performance when running HPC applications; otherwise, it is hard to justify the complexity of deploying CSDs in the clus-ters while their performance improvement is not satisfactory. In this paper, we argue that CSDs can considerably improve the performance of HPC applications when they utilize ASIC-based accelerators such as Neon advanced SIMD engines.

Methods: Catalina CSD for MapReduce and HPC applications

We mentioned the challenges for proposing a well-designed CSD architecture in “

Intro-duction” section. Based on these challenges, we set eight design goals to propose an effi-cient and flexible CSD architecture. The design goals are as follows: 1—A desired CSD architecture should avoid using real-time processors that are originally intended to run conventional flash management routines for running user applications in-place. Instead, it should contain an ISP-dedicated application processor. 2—There should be a TCP/IP link between the host and the ISP engine, allowing the applications running on the host and CSD to communicate with each other. 3—The ISP engine should have access to the filesystem metadata. 4—Since both host and ISP engine have access to the flash mem-ory concurrently, there should be a synchronization mechanism. 5—Both host and ISP engine should be able to interact with the flash storage simultaneously. 6—There should be an OS running inside the CSD. This OS provides a flexible environment to run a vast spectrum of applications in-place. 7—The desired CSD should support the distributed processing platforms such as Hadoop and message passing interface (MPI). 8—The CSD should have potentials to implement ASIC- and FPGA-based accelerator engines to run CPU-intensive applications in-place with a compelling performance.

In this section, we describe the hardware and software architecture of Catalina, which is designed to satisfy all the design goals mentioned above. This section is composed of four subsections. In the first subsection, different hardware components of Catalina are described, and we discuss how they work together. The second subsection defines Catalina software layers that make it possible to send ISP commands, processing data in-place, and writing the results back. The third subsection demonstrates the fully func-tional Catalina prototype used to investigate the benefits of deploying CSDs in the clus-ters. The last subsection demonstrates how multiple Catalina CSDs can be deployed to run Hadoop MapReduce and HPC applications.

Hardware architecture

Catalina CSD is developed based on Xilinx Zynq Ultrascale+ MPSoC [12]. This device

(11)

controller such as the host and flash memory array interfaces. These two subsystems are packaged in one chip with multiple data links connecting them for high-performance and power-efficient intra-chip data transfers. These two subsystems together provide a proper platform for implementing conventional SSD routines as well as running user applications in-place.

Figure 4 shows the Catalina architecture, implemented using Xilinx Zynq Ultrascale+

MPSoC. On the PL subsystem, there are three conventional components of the control-ler which are host NVMe/PCIe interface, error correction unit, and the flash memory interface. The host interface is responsible for sending and receiving the NVMe/PCIe packets from the host and check the integrity of the payloads. Soft errors can potentially happen in different parts of a device, and SSDs are no exceptions. These errors can hap-pen in the SSD controller and the flash memory packages. There are research works that investigate the effect of the soft errors on the SSD controller and the availability and

reliability of the storage systems [23]. On the other hand, for addressing the soft errors

on the flash memory packages, the error correction unit is utilized to correct the data errors.

The flash memory interface controls the flash memory channels. In Fig. 4, each flash

channel is connected to a set of flash memory packages. Design and implementation of these three components require a considerable amount of engineering resources; how-ever, they are common among all conventional SSDs. We skip the detailed architectures of these components since they are out of the scope of this paper.

On the PS subsystem, there are two ARM Cortex-R5 real-time processors that are used for controlling the components implemented in the PL subsystem as well as run-ning the flash translation layer (FTL) routines. In fact, the conventional firmware rou-tines run on these two real-time processors. One of the real-time processors runs the FE firmware, which controls the host interface module and interprets the host’s I/O com-mands. The other ARM Cortex-R5 processor runs the BE firmware, which is responsible for controlling the error correction and flash interface units. The BE firmware also runs other essential FTL routines such as garbage collection and wear-leveling. The FE and

error correction

unit flash memory interface

…

memory controller

host NVMe/PCIe interface

PL

PS

AXI bus

2x ARM R5 (firmware)

8GB DRAM memory

NVMe

Nand flash

TCP/IP tunnel

channel#1

Ch #2

Ch #3

Ch #C Quad-core

ARM A53 (Linux OS)

NEON SIMD

Floating point unit

ISP engine

(12)

BE firmwares work together to receive the host’s I/O commands, interpret them, and execute flash read/write operations.

All of the components mentioned above are common among conventional SSDs; how-ever, Catalina is equipped with a unique ISP subsystem. This subsystem is dedicated to running user applications in-place. It contains a quad-core ARM Cortex-A53 processor which is equipped with Neon SIMD engines and floating-point units (FPUs). The quad-core processor is capable of running a vast spectrum of applications, while Neon SIMD engines can increase the performance of HPC applications. Overall, the quad-core Cor-tex-A53 processor is the main ISP engine of Catalina, and Neon SIMD engines and the FPUs can accelerate user applications that run in-place.

As Fig. 4 demonstrates, both the ISP engine and the two Cortex-R5 real-time

proces-sors which run the conventional flash management routines are connected via an inter-nal Xilinx advanced extensible interface (AXI) bus. The shared AXI bus makes it possible to transfer data between the BE firmware and the ISP engine efficiently. In other words, the ISP engine can bypass the whole NVMe hardware and software stack and access the data stored in the flash memory array directly by communicating to the BE firmware. There is also an 8 GB DRAM memory connected to the AXI bus, which is shared among all the processing units.

Software stack

The Catalina software stack is designed to achieve the goals we set earlier in this sec-tion. The most important part of the software components is an OS running inside the ISP engine. Therefore, we have ported a full-fledged Linux OS running on the quad-core ARM Cortex-A53 processor. Each core of the quad-core processor has a level-one 32 kB instruction cache and a 32 kB data cache. The ARM processor also contains a 1 MB level-two cache, which is used by all the cores. These caches together with the memory unit and the non-volatile flash memory compose a memory hierarchy. The Catalina OS governs the memory hierarchy for data coherency and integrity. Overall, the OS is a flex-ible environment for running user applications in-place as well as implementing other layers of the software stack.

Running an OS inside the CSD comes with complexity and processing overhead. The first challenge to port an OS inside the CSD is to boot the OS together with the conven-tional flash management firmware. At the boot time, the OS and the firmware should boot up concurrently and complete a handshake phase. This procedure adds a consider-able amount of complexity to the design and manufacturing of the CSD. We have suc-cessfully implemented this dual-boot procedure, and the users do not have any concerns regarding this challenge.

On the other hand, the ISP subsystem consumes a portion of the processing budget for running the OS routines. Since the processing budget of the ISP engine is limited, the OS should not consume a big portion of the resources. We have measured the OS rou-tines overhead in different states of the ISP subsystem, and our results showed that the processing overhead of running the OS is at most 2% of the total processing budget of

the Catalina CSD. Figure 5 demonstrates the architecture of the software layers and how

(13)

Our first design, named Compstor [24], had a custom-developed software stack, and it was unable to be utilized in the distributed processing environments. This problem is addressed in Catalina design by adopting standard software stacks such as MPI and

Hadoop MapReduce [25]. In Fig. 5 there is a cluster of M hosts connected to a TCP/IP

interconnect, and the host 1 is attached to N Catalina CSDs via a PCIe switch. In this

figure, the lowest layer of the software stack is the BE firmware, which implements the FTL procedures. The BE firmware serves both FE firmware which talks to the host via NVMe protocol and a block device driver implemented in the kernel space of the ISP engine’s OS. The block device driver issues flash I/O commands directly to the BE firmware, so the data link through the block device driver bypasses the NVMe/PCIe software and hardware stack. The block device driver also makes it possible to mount the flash storage inside the Catalina OS. It allows a user application running in-place to have filesystem-level access to the data stored in the flash memory array via a high-performance and low-power internal data link.

On the other hand, the ISP engine should also provide a link between applications that run in-place and applications running on the host, so in addition to the block device driver, we have implemented a TCP/IP tunnel through NVMe protocol to transfer TCP/IP packets between the applications running on the host and the appli-cations running inside Catalina. We have utilized NVMe vendor-specific commands to packetize TCP/IP payloads inside the NVMe packets (TCP/IP tunnels through

user application running in-place

Flash memory array interface CSD block

device driver CSD Linux kernel

space (ARM A53)

Nand flash array CSD controller

PL subsystem CSD Linux user

space (ARM A53)

back-end firmware CSD firmware

(ARM R5)

CSD Linux OS

AXI bus intra-chip

inter-chip

NVMe

front-end firmware

inter-process Catalina #2

Catalina #3

Catalina #N

Catalina #1

AXI bus intra-chip TCP/IP over NVMe tunnel PCIe switch

cluster TCP/IP interconnect

user application running on host

host DRAM memory

host OS

host #1

intra-operating system network

host #3

NVM

e

host #M

Neon SIMD API set

…

host #2

(14)

NVMe are demonstrated in Fig. 5 by dashed lines). A software layer implemented on both host OS and the Catalina OS provides the tunneling functionality. Since

distrib-uted platforms such as Hadoop MapReduce [17] and MPI [26] are based on TCP/

IP connection, this link plays a crucial role in running distributed applications. As

shown in Fig. 5, all the N Catalina CSDs that are attached to the host 1 can

concur-rently communicate with applications running on the host.

It is worth mentioning that by using Linux TCP/IP packet routing tools, we can cre-ate an internal network in the host OS, and reroute the packets sent or received by the

Catalina CSDs to the other hosts attached to the TCP/IP interconnect (see Fig. 5).

Alter-natively stated, considering several hosts that are connected via a TCP/IP interconnect, and each of them is equipped with multiple Catalina CSDs, the hosts, as well as all the CSDs attached to them, can communicate to each other via a TCP/IP network. Such CSD-equipped cluster architecture benefits from the efficient ISP capabilities of Catalina CSDs to run distributed applications. In fact, the proposed CSD architecture is an aug-mentable processing resource that is adoptable in the cluster without any modifications in the underlying Hadoop or HPC platforms.

Also, the user applications that run in-place have access to Neon SIMD engines via a set of application programming interfaces (APIs) provided by the Catalina OS. Using these APIs, user applications can potentially be accelerated by the Neon SIMD engine. Overall, the user applications have access to three unique tools, including 1—a high-speed and low-power internal link to the data stored in the flash memory, 2—a TCP/ IP link to the applications running on the host, 3—a set of APIs to utilize Neon SIMD engines.

The last layer of the software stack is the synchronization layer between the host and Catalina OSs. These two OSs can access the data stored in the flash memory array at the filesystem-level and concurrently mount the same storage media, which is a problematic behavior without a synchronization mechanism. Since Catalina contains a full-fledged Linux OS and there is a TCP/IP connection to the host, to address the synchronization

issue, we have utilized the Oracle cluster filesystem 2nd version (OCFS2) [27] between

the host and the CSD. Using OCFS2, both the host and Catalina CSD can issue flash I/O commands and mount the shared flash memory natively. This is the main differ-ence between OCFS2 and network filesystem (NFS). In NFS, only one node mounts the shared storage natively, and other nodes use a network connection to access the shared

storage, so NFS limits the data throughput and also suffers from the single point of

fail-ure problem. On the other hand, using OCFS2, all nodes can mount the storage natively.

Catalina prototype

To prove the feasibility of the proposed computational storage solution and also investi-gate the benefits of deploying Catalina CSDs in clusters, we have designed and manufac-tured a fully functional prototype of Catalina which completely aligns with the hardware

and software architecture that has been described in the previous subsections. Figure 6

shows the Catalina CSD prototype. The CSD controller implemented on a Xilinx Zynq

(15)

The CSD controller is composed of the PS and PL subsystems that implement the two processing engines of the Catalina CSD, namely the conventional FTL and host interface engine and the ISP engine. The hardware specifications of the Catalina prototype

sepa-rated for these two engines come in Table 1.

The prototype of Catalina is able to execute the host’s I/O commands and also provides a straightforward mechanism for offloading user applications to the CSD via a TCP/IP tunnel through NVMe/PCIe. Considering multiple Catalina CSD prototypes attached to a host, an administrative application on the host can initiate user applications on CSDs while the host and the CSDs’ OSs are being synchronized by the OCFS2 filesystem. The user applications can be developed in any language supported by Linux OS, and they can interact with the flash memory at the filesystem-level, similar to when they run on a conventional host machine.

Despite all of the projected benefits of deploying the CSDs, they should be cost-effec-tive to be adoptable in the clusters. After prototyping Catalina, a sensible cost analysis of manufacturing CSDs can be presented. Compared with a regular SSD based on a con-ventional controller, a CSD should be equipped with more processors to run applica-tions in-place efficiently. Interestingly, according to our observaapplica-tions and also the SSD

bill of material analysis [28, 29], the major difference between an SSD and a CSD

manu-facturing costs would be insignificant, since the SSD manumanu-facturing cost is largely domi-nated by the NAND flash memory. The cost of flash memory chips is about 75% of the

SSD price.1_{With other miscellaneous costs (such as DRAM, miscellaneous components,}

and manufacturing costs) that would account for 20–25% of the SSD price, the control-ler would account for at most 5% of the SSD price.

flash memory

modules

CSD controller

Fig. 6 Catalina CSD prototype

Table 1 Catalina prototype hardware specifications

FTL and host interface engine Processing units Two ARM R5 proces-sors @600 MHz FPGA @250 MHz Host interface NVMe over PCIe Gen3 In-storage processing engine Processing units Quad-core ARM

Cortex-A53 with Neon SIMD engines Host interface TCP/IP ove NVMe

Shared DRAM memory 8 GB

(16)

Deploying Catalina in clusters

Catalina is developed concerning a straightforward deployment in clusters. Since it has all the required features to play the role of a regular processing node, the cluster archi-tects do not have to make major modifications in the underlying platforms for

deploy-ment of the Catalina CSDs. Figures 7, 8 illustrate CSD-equipped Hadoop and MPI-based

clusters, respectively, where a head node is connected to M host machines, and each of

the hosts is equipped with N Catalina CSDs. In such a cluster, all CSDs and conventional

nodes are orchestrating together to improve the performance and efficiency of distrib-uted applications.

NameNode (head node)

DataNode (application host)

DataNode (CSD)

DRAM DN CPU

NN head node

application host #M

Catalina

2

Catalina

N

…

1 a

nil

ata

C

DRAM CPU

application host #1

Catalina

2

Catalina

N

…

Catalina

1

…

DN

DN DN DN DN DN DN

DN NN

DN

CPU

DRAM

Fig. 7 Catalina CSD-equipped Hadoop cluster

MPI coordinator (head node)

MPI worker (application host)

MPI worker (CSD)

DRAM CPU

application host #1

Catalina

2

Catalina

N

…

1

ani

lat

a

C

DRAM CPU

application host #M

Catalina

2

Catalina

N

…

Catalina

1

CPU

DRAM head node

…

(17)

In the CSD-equipped Hadoop cluster, the head node runs Hadoop NameNode and

YARN resource manager (RM), while the hosts and Catalina CSDs run DataNodes and

node managers (NMs). In fact, Catalina CSDs play both roles of storage units as well as

efficient DataNodes. Since Hadoop implements its filesystem synchronization

mecha-nism, we exceptionally do not need the OCFS2 filesystem for running Hadoop. However, OCFS2 plays an important role in running MPI-based applications.

Figure 8 illustrates a CSD-equipped cluster based on MPI for running HPC

appli-cations. In this figure, the head node runs an MPI coordinator while the conventional

hosts, and the Catalina CSDs, run the MPI workers. In this MPI-based cluster, each host

is attached to N CSDs, and the data stored on the CSDs are shared between the host

and CSDs, so the MPI workers on the host and the CSDs have access to the shared data. Thanks to the OCFS2 filesystem, the shared data is simultaneously visible to the host and CSDs at the filesystem-level so that the user can freely distribute the processing load among the hosts and the CSD.

Results and discussion

As mentioned in the previous section, deploying Catalina CSDs in clusters is straightfor-ward. After attaching the Catalina CSDs to the host and setting the network configura-tions, the CSDs are exposed to the other hosts in the cluster by their network address (e.g., IP address). From the system-level point of view, the Catalina CSDs are similar to regular processing nodes, and the underlying ISP hardware and software details are invisible to other nodes in the cluster. In this section, we first demonstrate the developed platforms equipped with up to 16 Catalina CSDs and describe how we implemented a CSD-enabled Hadoop and MPI-based clusters on the developed platforms. The second subsection shows the results of running different Hadoop MapReduce and HPC bench-marks and discusses the benefits of deploying Catalina CSDs in clusters.

Experimental setup

The Catalina CSDs are not designed to compete with the modern hosts based on

high-end ×86 processors with tens to hundreds of gigabytes of DRAM. Instead, it is presented

as a resource that augments the processing horsepower of a system and improves the

performance and power efficiency of the applications [25]. However, for getting

consid-erable improvements, we propose attaching multiple Catalina CSDs to host machines.

Figure 9 shows the architecture of the developed platform, which contains 16 Catalina

CSD prototypes. We built this platform to investigate the benefits of deploying Catalina CSDs in clusters.

This platform is composed of a conventional host (called the head node) and an

appli-cation host, which is equipped with the Catalina CSDs. These two hosts, along with the Catalina CSDs, form a distributed environment for running Hadoop MapReduce and

MPI-based HPC applications. We use the head node exclusively for running Hadoop

NameNode and the MPI coordinator to eliminate the load of the administrative tasks on

the processing nodes. In other words, the application host and the CSDs are the

process-ing nodes, while the head node is dedicated only for running the administrative tasks.

To extensively investigate the benefits of Catalina CSDs in different environments,

(18)

medium, and high. The specifications for the head node and the different application

host’s configurations are summarized in Table 2. In order to attach up to 16 Catalina

CSDs to the application host, we used a Cubix Xpander Rackmount unit [30] which

pro-vides 16 PCIe Gen3 slots. This unit and the attached Catalina CSDs are shown in Fig. 9.

The implementations of the Hadoop and the MPI-based clusters are aligned with the

architectures that are shown in Figs. 7, 8. For implementing the Apache Hadoop

clus-ter, we ran Hadoop NameNode and the Yarn resource manager (RM) on the head node,

while the DataNodes and the node managers (NMs) run on the application host, and the

Catalina CSDs that are attached to the application host. The communication between

the head node and the application host is through an Ethernet cable, while Catalina CSDs communicate via the developed TCP/IP through NVMe link.

On the other hand, for running the HPC application based on MPI, we use the head

node for running the MPI coordinator application which initiates and organizes the MPI

workers that run on the application host and the Catalina CSDs. In this case, the OCFS2

1#

ani

lat

a

C

. . .

application host

NVMe/PCIe

Catalina #2 Catalina #8 PCIe switch #1

TCP/IP Tunnel

Catalina #9 _{Catalina #1}

0

Catalina #1

6

. . .

PCIe switch #2

TCP/IP Tunnel

head node

Ethernet cable

Fig. 9 Architecture of the developed platform equipped with 16 Catalina CSDs

Table 2 Specifications of the hosts in the developed system for running experiments Feature Head node Application host configurations

Low Medium High

Processor Xeon E5-2620 v4 Core i3-8100T Core i7-7700 Xeon E5-2620 v4 Memory 32 GB (DDR 4) 32 GB (DDR 4) 32 GB (DDR 4) 32 GB (DDR 4) Storage 4× Samsung 850

(19)

filesystem synchronizes the filesystems of the Catalina CSDs and the application host, so

at any given time the application host can access the whole data stored on all the CSDs

directly, while each CSD only has access to its local data.

As previously discussed, all the packets from/to the CSDs are routed inside the

appli-cation host. This mechanism provides a flexible networking environment without using an extra network switch. However, there is a cost for having such a flexible network. The

application host needs to consume a portion of its processing horsepower to route the

packets. In our design, for each CSD added to application host, an insignificant amount

of the host’s CPU is consumed for the routing mechanism. We have measured this

net-working overhead for different workloads. In highly congested scenarios, the application

host consumed at most 3.5% of its processing horsepower for the routing the TCP/IP

packets among CSDs. This overhead is considered in the results in this section.

Benchmarks and results

This subsection is composed of two parts. First, we describe the targeted Hadoop MapReduce benchmarks and report the performance and energy consumption of ning the benchmarks for different configurations to investigate the benefits of run-ning Hadoop MapReduce applications in-place. Then, we show the results for runrun-ning 1D, 2D, and 3D discrete Fourier transform (DFT) algorithms utilizing the Neon SIMD engines of Catalina CSDs. For reporting the performance, we measure the total elapsed time of running a benchmark on the developed platform for a given configuration. On the other hand, to measure the energy consumption, we use a power meter to measure the power consumption of the platform. Using the logging tool provided by the power meter, we calculate the total energy consumption for running a benchmark. However, we deducted the idle energy consumption from the total calculated energy consumption for all the experiments to eliminate the energy consumption imposed by miscellaneous devices such as the cooling system.

Hadoop MapReduce benchmarks and results

For running Apache Hadoop MapReduce applications on the developed platform, we have

used a subset of the Intel HiBench benchmark suite [31] which includes Sort, Terasort, and

Wordcount benchmarks. We believe that extensive experiments on these three benchmarks can show the potentials of CSD architectures running Hadoop MapReduce applications. These benchmarks have been executed on 16 different platform configurations, which are

listed in Table 3. In all the experiments, the head node specification is fixed and matches

with Table 2, and the number of mappers and reducers tasks are 2000 and 200, respectively.

The application host uses all the attached Catalina CSDs as the storage units (6 CSDs

in the low and medium configurations; and 16 CSDs in the high configuration), while in

each configuration, a certain number of the ISP engines of the CSDs are enabled to run MapReduce application in-place. This way, the scalability of deploying Catalina CSDs in clusters can be investigated. The data size for Sort, Terasort, and Wordcount bench-marks are 8 GB, 1.3 GB, and 80 GB, respectively.

(20)

the 16 different platform configurations, and each experiment has been repeated 30 times that gives us a total of 1440 MapReduce tests.

As previously stated, in all experiments, the application host uses all connected

Cat-alina CSDs as storage units, however, in each test, a certain number of CSDs are enabled to run MapReduce application in-place and play the role of a processing node.

Fig-ures 10, 11 shows the Hadoop MapReduce experiments performance and energy

con-sumption results, respectively.

The diagrams in Fig. 10 show that increasing the number of ISP-enabled CSDs

decreases the elapsed time for all benchmarks. The performance of the high-configured

application host platform increased up to 2.2× when the ISP engines of all 16 Catalina CSDs are enabled. Thus, deploying ISP-enabled CSDs increases the performance of the Hadoop MapReduce benchmarks significantly.

Also, according to these diagrams, the elapsed time for running the MapReduce

benchmarks on the low-configured application host platform equipped with six Catalina

CSDs is close to the elapsed time of running the benchmarks on the high-configured

application host platform with no enabled ISP engine. So, only six Catalina CSDs can improve the performance of a low-end host close to the performance of a high-end host. As we previously discussed, the cost of implementing the ISP engine inside the SSDs is negligible compared to the total cost of manufacturing an SSD, so ISP technology

can considerably improve the performance of Hadoop clusters economically. Figure 11

shows the energy consumption results of running the Hadoop MapReduce benchmarks on the developed platform for different configurations.

According to Fig. 11, the energy consumption of running the benchmarks on the low

-configured application host platform decreases up to 36% by deploying six ISP-enabled

Catalina CSDs. This improvement for the high-configured application host platform

equipped with 16 ISP-enabled Catalina CSDs reaches to 4.3×.

Table 3 Different configurations for running Hadoop MapReduce benchmarks

Experiment number Application host configuration Enabled in-storage processing capability

1 Low None

2 Low 2 ISP-enabled CSDs

5 Medium None

6 Medium 2 ISP-enabled CSDs

9 High None

10 High 2 ISP-enabled CSDs

(21)

With no ISP engine enabled, the low-configured application host platform is less energy efficient than the other configurations. This is expected behavior since running the same benchmark on a more powerful platform takes less time. According to the

dia-grams in Fig. 11, when we enabled six ISP engines, the energy efficiency of the low

-con-figured application host platform can surpass the energy efficiency of the medium- and

high-configured application host platforms equipped with the same number of

ISP-ena-bled CSDs. Although, the performance of the low-configured application host platform

is still lower than the other platforms (see Fig. 10). We believe that this happens because

of the high energy efficiency of the Catalina CSDs.

The Hadoop framework distributes tasks among all of the processing nodes. If a process-ing node gets idle, it will fetch data from other busy nodes and process it. Thus, the amount of data processed by each node in the Hadoop cluster is proportional to its processing

resources. This means in the low-configured application host platform, a larger amount of

data is processed by the Catalina CSDs compared to the amount of data processed by them

in the high-configured application host platform which has an Intel Xeon processor. Since

12 17 22 27 32 37 42 47 52 57 62

0 2 4 6 8 12 16

)se tu ni m( e mi T

Sort benchmark elapsed time

low application host medium application host high application host

30 40 50 60 70 80 90 100 110

0 2 4 6 8 12 16

( e mi T set uni m)

Terasort benchmark elapsed time

20 30 40 50 60 70 80 90

0 2 4 6 8 12 16

( e mi T set uni m)

Number of ISP-enabled CSDs Wordcount benchmark elapsed time

(22)

ISP engines are considerably more energy-efficient than the application host’s processor, as we increase the portion of data processed by the CSDs, the whole platform becomes more

energy-efficient. This justifies why the energy efficiency of the low-configured application

host platform equipped with 6 ISP-enabled CSDs can surpass the energy efficiency of the

high-configured application host platform with the same number of ISP-enabled Catalina

CSDs.

HPC benchmarks and results

In this subsection, we first describe the targeted benchmarks to investigate the effect of deploying Catalina CSDs in clusters for running HPC applications. Then we show and discuss the performance and energy consumption results of running the benchmarks on the developed platform. The HPC applications usually demand a considerable amount

15 25 35 45 55 65 75 85

0 2 4 6 8 12 16

)J

K(

ygr

en

E

Sort energy consumption

45 65 85 105 125 145 165

0 2 4 6 8 12 16

)J

K(

ygr

en

E

Terasort energy consumption

25 45 65 85 105 125

0 2 4 6 8 12 16

)J

K(

ygr

en

E

Number of ISP-enabled CSDs Wordcount energy consumption

(23)

of processing resources and consume a large amount of data. Thus, we only consider the

high-configured application host platform equipped with 16 Catalina CSDs for running the

HPC experiments (see Table 2). We have implemented the MPI framework to run the HPC

benchmark according to the architecture described earlier in this section. The MPI

coordi-nator runs on the head node host, while the application host and the ISP-enabled Catalina

CSDs, run the MPI workers. In the developed platform, the application host can access the

data stored on all the Catalina CSDs; however, each CSD only has access to its local data. In addition, for running the HPC applications in-place, Catalina CSDs should be able to deliver a compelling performance. Therefore, we have utilized the Neon SIMD engines inside the Catalina CSDs. The Neon SIMD engines are application-specific integrated cir-cuit (ASIC) accelerators that are expected to improve the performance and energy effi-ciency of the applications significantly. Overall, this section shows how using ASIC-based accelerators enhances the benefits of deploying CSDs for running HPC applications.

The HPC Challenge benchmark suite [32] which is developed by the University of

Ten-nessee is one of the well-known HPC benchmark suites and is used in many research works

[33–35]. This suite is composed of several benchmarks, each of which focuses on a

par-ticular feature of the HPC clusters such as the ability to do floating-point calculations, the communication speed between nodes, and the potentials of running demanding algorithms such as DFT. Among these benchmarks, we have targeted the DFT algorithm, since it is a CPU-intensive algorithm that also consumes a large amount of data, so it can show the potentials of deploying CSDs in clusters. Also, DFT is one of the most important algo-rithms, as Gilbert Strang, the author of the textbook Linear Algebra and Its Applications

[36] referred to it as “the most important numerical algorithm in our lifetime.” The DFT of

a finite sequence X is a finite sequence Y with the same length of X in a complex-valued

for-mat in the frequency domain. The DFT of the finite sequence X is defined by (1).

In the case of multidimensional input signal of X:xn1,n2,...,n_l

, a d-dimensional DFT is defined as (2).

Considering a large amount of floating-point input data, the multi-dimensional DFT calculation is a challenging CPU-intensive application and can show the potentials of the Catalina CSDs for running HPC applications. Thus, we targeted this algorithm to measure the energy consumption and performance of 1D, 2D, and 3D DFT calculations

of large datasets running on the high-configured application host platform with different

number of ISP-enabled Catalina CSDs. For implementing the DFT algorithm, we

uti-lized the FFTW library [37], which can be compiled to use the Neon SIMD engines of

(1)

Y =F{xn}

y_k =

N−1

n=0

xn·e−

2πi N kn

(2)

yk1,k2,...,k1 =

N1−1

�

n1=0



α n1k1

N1

N2−1

�

n2=0



α n2k2

N2 · · ·

Nd−1 �

nd=0

�

αndkd

Nd ·xn1,n2,...,n_d

� 

 



where α_N

l =exp

�₋₂

π

(24)

Catalina CSDs, and also supports multi-threading capability of the processing nodes in the developed platform.

For running the 1D-, 2D-, and 3D-DFT calculations, we have prepared three

differ-ent datasets. The PTB Diagnostic ECG dataset is used for 1D-DFT calculation. The

PTB Diagnostic ECG is a set of ECG signals collected from healthy volunteers and patients with different heart diseases by Professor Michael Oeff, M.D., at the

Depart-ment of Cardiology of University Clinic Benjamin Franklin in Berlin, Germany [38,

39]. We have duplicated this dataset to generate 200 million 1D objects, each of which

is a sequence of 180 floating-point numbers.

Regularly, 2D-DFT operations are performed on images; therefore, we have gener-ated 14.4 million synthetic grayscale images as the 2D-DFT dataset. On each of these images, a dark point is placed randomly on the image, and other points’ brightness

is relative to their distance from the single darkest point. Figure 12 shows four

sam-ples of these images. For performing 2D-DFT operations, we converted each of the

images to a 50×50 matrix. Overall, the 2D-DFT dataset is composed of 14.4 million

2D objects, where each of which is a sequence of 2500 floating-point numbers. The 3D dataset also is generated using the same method we used for generating the 2D dataset. Each object in the 3D dataset can be described as a cube-shaped 3D object where a single darkest point is placed randomly in the cube-shaped object, and other points’ brightness is relative to their distance from the single darkest point. We have

gen-erated a set of 288,000 three-dimensional objects and converted them to 50×50×50

matrices to represent the dataset for the 3D-DFT operations. Table 4 summarizes the

datasets we used for running 1D-, 2D-, and 3D-DFT operations on the developed CSD-enabled platform.

1

10

20

30

40

50

1 10 20 30 40 50 1

10

20

30

40

50

1 10 20 30 40 50

1

10

20

30

40

50

1 10 20 30 40 50 1

10

20

30

40

50

(25)

Similar to the Hadoop MapReduce experiments, in all of the DFT calculation

experi-ments, the application host has access to the data stored in all of the Catalina CSDs and

the CSDs always play the role of storage units. However, in each test, a certain num-ber of ISP-engines of the CSDs are enabled to show the scalability of ISP technology for

running HPC applications. Figure 13 shows the performance and energy consumption

results of running the DFT calculations on the developed platform for different numbers of ISP-enabled Catalina CSDs. The performance reported in the diagrams is defined as the number of 1D, 2D, and 3D objects that have been processed in a second, and the reported energy consumption is the energy consumed for processing an object. It is worth mentioning that each test has been repeated 20 times, and each result reported in this subsection is the average of all repetitions.

Table 4 Datasets for 1D-, 2D-, and 3D-DFT calculations

Dataset Number of objects Dimensions of an object Total size of the dataset (GB)

1D-DFT 200 million 180×1 288

2D-DFT 14.4 million 50×50 ₂₈₈

3D-DFT 288,000 50×50×50 ₂₈₈

0.00 5000.00 10000.00 15000.00 20000.00 25000.00 30000.00 35000.00

0 1 2 4 6 8 12 16

dn oc es / st cej bo D1

1D-DFT performance

0.00 500.00 1000.00 1500.00 2000.00 2500.00 3000.00

0 1 2 4 6 8 12 16

dn oc es / s tc ej bo D2