Dell EMC PowerStore: Apache Spark Solution Guide

(1)

Technical White Paper

Dell EMC PowerStore: Apache Spark Solution

Guide

Abstract

This document provides a solution overview for Apache Spark running on a Dell EMC™ PowerStore™ appliance.

(2)

Revisions

Date Description

June 2021 Initial release

Acknowledgments

Author: Henry Wong

This document may contain certain words that are not consistent with Dell's current language guidelines. Dell plans to update the document over subsequent future releases to revise these words accordingly.

This document may contain language from third party content that is not under Dell's control and is not consistent with Dell's current guidelines for Dell's own content. When such third party content is updated by the relevant third parties, this document will be revised accordingly.

The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.

Use, copying, and distribution of any software described in this publication requires an applicable software license.

(3)

Executive summary

Apache®_Spark®_{has seen tremendous growth in the past few years. It is the leading platform for big data} distributed processing because of its innovation, speed, and developer-friendly framework. This document offers a high-level overview of the Dell EMC™ PowerStore™ appliance and the benefits of running Apache Spark and Hadoop®_{HDFS on PowerStore. The document also provides installation, configuration, testing,} and a simple use case for Spark and HDFS on PowerStore.

Audience

This document is intended for IT administrators, storage architects, partners, and Dell Technologies™

(6)

1 Introduction

This document was developed using the PowerStore X model appliance, Apache Spark, Apache HDFS, and Red Hat®_{Enterprise Linux}®_{. This section provides an overview for PowerStore, Apache Spark, and Apache} HDFS.

1.1 PowerStore overview

PowerStore achieves new levels of operational simplicity and agility. It uses a container-based microservices architecture, advanced storage technologies, and integrated machine learning to unlock the power of your data. PowerStore is a versatile platform with a performance-centric design that delivers multidimensional scale, always-on data reduction, and support for next-generation media.

PowerStore brings the simplicity of public cloud to on-premises infrastructure, streamlining operations with an integrated machine-learning engine and seamless automation. It also offers predictive analytics to easily monitor, analyze, and troubleshoot the environment. PowerStore is highly adaptable, providing the flexibility to host specialized workloads directly on the appliance and modernize infrastructure without disruption. It also offers investment protection through flexible payment solutions and data-in-place upgrades.

The PowerStore platform is available in two different product models: PowerStore T models and PowerStore X models. PowerStore T models are bare-metal, unified storage arrays which can service block, file, and VMware®_vSphere®_{Virtual Volumes™ (vVols) resources along with numerous data services and efficiencies.} PowerStore X model appliances enable running applications directly on the appliance through the AppsON capability. A native VMware ESXi™ layer runs embedded applications alongside the PowerStore operating system, all in the form of virtual machines. This feature adds to the traditional storage functionality of PowerStore X model appliances, and supports serving external block and vVol storage to servers with multiple protocols.

For more information about PowerStore T models and PowerStore X models, see the documents Dell EMC PowerStore: Introduction to the Platform and Dell EMC PowerStore Virtualization Infrastructure Guide.

1.2 Apache Spark overview

Apache Spark is an open-source distributed processing engine designed to be high performing, scalable, and capable of computing massive amount of data. It can perform a wide range of analytic tasks such as SQL queries, streaming, and machine learning.

Spark supports several popular programming languages such as Java, Scala, Python, and R and provides a unified and consistent set of APIs for these programming languages. Also, it has an extensive set of libraries for SQL (dataframes), machine learning (MLlib), Spark Streaming, and GraphX. These capabilities allow developers to easily build Spark applications by combining different APIs, libraries, and functions.

Spark is built for speed and high performance. Spark loads the entire dataset in memory on the cluster and performs computation on it. The data is kept in memory to minimize disk access. Spark performs

(7)

Spark supports a wide range of storage systems such as local file systems, Apache Hadoop HDFS, Apache Hive, Apache HBase, Cassandra, and more. Figure 1 shows the Spark components in blue. For more information, see the corresponding documentation on https://spark.apache.org.

SQL

Spark Core

Dataframes Streaming GraphX MLlib Java Python Scala R

HDFS Local file

system HBase Hive Cassandra Others

APIs Libraries Core execution engine Storage/Data source Spark components

(8)

A Spark standalone cluster (see Figure 2) consists of one master node and multiple worker nodes. The cluster manager on the master node manages the cluster resources, such as CPUs and memory, and assigns application tasks to the worker nodes. A Spark application is a driver program that establishes a Spark session to the cluster manager and requests resources to perform multiple tasks on the worker nodes. Executors are the Java VMs (JVM) on the worker nodes that perform the tasks and report the status and results back to the driver program.

Spark Worker Node

Executor Task Task Task Executor Task Task Task

Spark Worker Node

Executor Task Task Task Executor Task Task Task Application Driver Program Spark session

Spark Master Node Spark Cluster

Manager

Spark standalone cluster overview

1.3 Apache Hadoop Distributed File System overview

Apache Hadoop is an open-source software suite and framework for big-data processing. Hadoop Distributed File System (HDFS) is one of the core components of Hadoop. It is a distributed file system designed to be massively scalable, fault tolerant, and have high throughput. HDFS can scale up to hundreds of servers and supports large files. It is well suited for applications, such as Spark, that require access to large datasets. HDFS files are divided into blocks and stored on multiple servers. Data block replication (replication factor) places replicas of the block across the cluster to increase data availability and read performance.

Other Hadoop core components include the following: • Hadoop YARN: A cluster and resource manager

• Hadoop MapReduce: A distributed parallel data processing system • Hadoop Common: Core common utilities shared by other modules

HDFS provides the persistent storage and data source for Spark. The configuration in this document does not use Hadoop YARN and MapReduce.

(9)

1.4 The advantages of Spark and Hadoop on PowerStore

Both Spark and HDFS share a similar distributed architecture that requires a powerful, highly scalable, and flexible infrastructure. PowerStore is performance-optimized for any workload, and its adaptable scale-up and scale-out architecture complements the distributed model of these applications. This section highlights the PowerStore features that benefit and extend the application environment.

1.4.1 AppsON brings applications closer to the infrastructure and storage

Bringing applications closer to data increases density and simplifies infrastructure operations. The

PowerStore AppsON capability integrates with VMware vSphere®_{, resulting in streamlined management in} which storage resources plug directly into the virtualization layer. Using VMware as the onboard application environment results in unmatched simplicity, since support is inherently available for any standard VM-based applications. When a new PowerStore X model is deployed, the VASA provider is automatically registered, and the datastore is created, eliminating manual steps and saving time. PowerStore seamlessly integrates the VMware ESXi software into the same hardware. Two ESXi nodes are embedded inside the appliance which has direct access to the same storage resources. This close integration allows applications such as Spark and Hadoop to take full advantage of server and storage virtualization with simplified deployment and management. AppsON is available on the PowerStore X model exclusively.

1.4.2 Agile infrastructure, flexible scaling on a high-performing storage and compute

platform

PowerStore provides flexible scaling with ease of management that compliments the Spark and Hadoop scale-up and scale-out distribution model. The integrated hypervisor dynamically scales up the cluster nodes when the workload requires it, while you can rapidly provision new nodes on the same or on other appliances in a different location.

Big-data applications require large amount of data and computational power for analytics, machining learning, model training, and other workloads. With a PowerStore appliance, administrators can scale up the storage capacity by adding disks and disk expansion enclosures without service interruption at any time. You can also configure multiple PowerStore appliances into a cluster to increase CPUs, memory, storage capacity, and front-end connectivity. Clustering simplifies and centralizes the management of multiple appliances from PowerStore Manager, a single HTML5-based management interface. A cluster can consist of up to four PowerStore T appliances or four PowerStore X appliances. Each appliance within the cluster can have different configurations of CPUs, memory, NVMe drives, and expansion enclosures.

The NVMe architecture is designed for the next-generation NVMe-based storage and takes advantage of low-overhead NVRAM cache. PowerStore is engineered to handle the most demanding workloads.

1.4.3 Mission-critical high availability and fault-tolerant platform

PowerStore provides a high level of stability and reliability for Spark and HDFS. At the hardware level, PowerStore is highly available and fault tolerant. It monitors the storage devices continuously, and it automatically relocates data from failing devices to avoid data loss. The PowerStore X model appliance includes two ESXi nodes and redundant hardware components. The nondisruptive upgrade (NDU) feature further increases overall PowerStore availability. The updates take place on the nodes in a rolling fashion. NDU supports PowerStore software releases, hotfixes, and hardware and disk firmware.

(10)

To support high-value business workloads and service requirements on the application level, it is essential to protect and ensure the availability of the Spark and HDFS nodes. The Spark master node and the HDFS NameNode are central to all operations of the application in the cluster. When they become inaccessible, all applications and storage operations are affected. Also, if a DataNode is not reachable for an extended period, the Namenode determines the blocks on the failed node and starts making copies of the blocks from other replicas until the replication factor is met.

With standard VMware vSphere High Availability (HA) integrated into PowerStore, the embedded VMware ESXi™ hypervisor automatically restarts or migrates failed Spark and HDFS servers to a different ESXi node. This process resumes operations by helping restore Spark and HDFS to its full operation capacity, and it minimizes the chance of the DataNodes being marked dead.

To achieve an even higher level of redundancy and application availability, you can deploy the Spark cluster and HDFS cluster across multiple PowerStore appliances in different racks, floors, or locations. PowerStore improves application availability and provides unparalleled flexibility and mobility to relocate and move across data centers and appliances.

1.4.4 PowerStore inline data reduction reduces storage consumption and cost

Data science and big-data applications continuously pull in a tremendous amount of data from various sources. To help reduce storage consumption and cost, the PowerStore inline data-reduction feature maximizes space savings by combining both software data deduplication and hardware compression. Data reduction works seamlessly in the background, is always enabled, and cannot be disabled. Since data reduction is always active in PowerStore, enabling application or operating system compression may not provide additional savings.

1.4.5 Efficient and convenient snapshot data backup

PowerStore provides Spark and HDFS with extra data protection through array-based snapshots. A PowerStore snapshot is a point-in-time copy of the data. The snapshots are space efficient and require seconds to create. Snapshot data are exact copies of the source data and can be used for application testing, backup, or DevOps. Because of the tight integration with VMware vSphere, PowerStore can take vVol VM snapshots directly from PowerStore Manager using a protection policy schedule or on demand. You can view the VM snapshot information in PowerStore and vCenter.

1.4.6 Secure data protection with ease of mind

With high-value data driving business applications, data security is a top concern for all organizations. Lost or stolen data can seriously damage the reputation of an organization and result in huge financial costs and loss of customer trust. Dell Technologies engineered PowerStore with Data at Rest Encryption (D@RE) which uses self-encrypting drives and supports array-based, self-managed keys. When D@RE is activated, data is encrypted as it is written to disk using the 256-bit Advanced Encryption Standard (AES). PowerStore D@RE provides this data security benefit to Spark applications while eliminating application overhead, performance penalties, and administrative overhead that is typically associated with software-based solutions.

1.4.7 Unified infrastructure and services management

(11)

1.4.8 Spark value and future expansion

Big-data platforms such as Spark and Hadoop create enormous value for organizations. As the value and scale of this data grows, it is critical to have a future-proof platform that is easy to manage. The platform must also provide technical innovation for future growth, and support the application architecture. Spark and Hadoop on PowerStore bring IT organizations the ability to be agile, efficient, and responsive to business demands.

1.5 Terminology

The following terms are used with PowerStore.

Appliance: Solution containing a base enclosure and attached expansion enclosures. The size of an

appliance could be only the base enclosure or the base enclosure plus expansion enclosures.

PowerStore node: Storage controller that provides the processing resources for performing storage

operations and servicing I/O between storage and hosts. Each PowerStore appliance contains two nodes.

Base enclosure: Enclosure containing both nodes (node A and node B) and 25 NVMe drive slots

Expansion enclosure: Enclosures that can be attached to a base enclosure to provide additional storage. Fibre Channel (FC) protocol: Protocol used to perform SCSI commands over a Fibre Channel network. iSCSI: Provides a mechanism for accessing block-level data storage over network connections.

NDU: A nondisruptive upgrade (NDU) updates PowerStore and maximizes its availability by performing rolling

updates. This includes updates for PowerStore software releases, hotfixes, and hardware and disk firmware.

NVMe: Non-Volatile Memory Express is a communication interface and driver for accessing nonvolatile

storage media such as solid-state drives (SSD) and SCM drives through the PCIe bus.

NVMe over Fibre Channel (NVMe-FC): Allows hosts to access storage systems across a network

fabric with the NVMe protocol using Fibre Channel as the underlying transport.

NVRAM: Nonvolatile random-access memory is persistent random-access memory that retains data without

an electrical charge. NVRAM drives are used in PowerStore appliance as additional system write caching.

Volume: A block-level storage device that can be shared out using a protocol such as iSCSI or Fibre

Channel.

Snapshot: A point-in-time view of data that is stored on a storage resource. You can recover files from a

snapshot, restore a storage resource from a snapshot, or provide access to a host.

Storage container: A VMware term for a logical entity that consists of one or more capability profiles and

their storage limits. This entity is known as a vVol datastore when it is mounted in vSphere.

PCIe: Peripheral Component Interconnect Express is a high-speed serial computer expansion bus standard. PowerStore Manager: An HTML5 management interface for creating storage resources and configuring and

(12)

PowerStore T model: Container-based storage system that is running on purpose-built hardware. This

storage system supports unified (block and file) workloads, or block-optimized workloads.

PowerStore X model: Container-based storage system that runs inside a virtual machine that is deployed on

a VMware hypervisor. Besides offering block-optimized workloads, PowerStore also allows you to deploy applications directly on the array.

RecoverPoint for Virtual Machines: Protects VMs in a VMware environment with VM-level granularity and

provides local or remote replication for any point-in-time recovery. This feature is integrated with VMware vCenter and has integrated orchestration and automation capabilities.

SCM: Storage-class memory, also known as persistent memory, is an extremely fast storage technology

supported by PowerStore appliance.

Snapshot: A point-in-time view of data stored on a storage resource. You can recover files from a snapshot,

restore a storage resource from a snapshot, or provide access to a host.

Storage Policy Based Management (SPBM): Using policies to control storage-related capabilities for a VM

and ensure compliance throughout its life cycle.

Thin clone: A read/write copy of a thin block storage resource (volume, volume group, or vSphere VMFS

datastore) that shares blocks with the parent resource.

User snapshot: Snapshot that is created manually by the user or by a protection policy with an associated

snapshot rule. This snapshot type is different than an internal snapshot, which is taken automatically by the system with asynchronous replication.

Virtual machine (VM): An operating system running on a hypervisor, which is used to emulate physical

hardware.

vCenter: VMware vCenter server provides a centralized management platform for VMware vSphere

environments.

VMware vSphere Virtual Volumes (vVols): A VMware storage framework which allows VM data to be

stored on individual vVols. This ability allows for data services to be applied at a VM-level of granularity and according to SPBM. vVols can also refer to the individual storage objects that are used to enable this functionality.

vSphere API for Array Integration (VAAI): A VMware API that improves ESXi host utilization by offloading

storage-related tasks to the storage system.

vSphere API for Storage Awareness (VASA): A VMware vendor-neutral API that enables vSphere to

(13)

2 Sizing considerations

Before you select the PowerStore model, storage media, capacity, and connectivity options, you must first understand the target Spark and HDFS environment. There are many factors to consider that are not limited to the following:

• Consider the amount of data to keep and future data growth. Spark typically requires minimal storage. It performs computation in memory and only requires temporary disk storage when the data does not fit in the memory. HDFS NameNodes requires small amount of storage for storing the HDFS

metadata information and transaction logs. HDFS DataNodes requires a large amount of storage because they manage and store the data on disk.

• Consider the HDFS replication requirement. The HDFS replication factor decides how many copies of the data are replicated across the DataNodes in the cluster.

• Understand the workload patterns. For Spark, computational power and memory are the most important factors to its performance. For Hadoop, disk space, I/O bandwidth, and computational power are important factors.

• Use a 10 Gb or 25 Gb network to provide sufficient network bandwidth and reduce latency especially for HDFS replication.

Also, consider the following resources for the Spark and Hadoop nodes on a PowerStore X appliance: • Determine how many VMs, and the CPU and memory requirements of the VMs.

• While it is possible to run Spark and Hadoop on separate VMs, we recommend placing the data source, HDFS, close to the Spark worker nodes for best performance. If it is not possible to co-locate the Spark worker node and the HDFS DataNode on the same VM, ensure the Spark and Hadoop VMs are close and connected on a fast network.

• Prepare for the event in which one of the ESXi nodes fail, and decide if the CPU and memory resources should be reserved on an ESXi node to accommodate full performance for the applications.

• Do not overcommit CPU and memory resources on PowerStore ESXi nodes in any production or mission-critical environments. However, this practice might be acceptable in test or development environments where the guaranteed performance level is not a concern.

Review the Spark documentation and Hadoop documentation to learn about other software and hardware requirements.

(14)

3 Deploying a Spark cluster with HDFS

The following sections describe what a Spark cluster environment looks like and demonstrates how to set up a simple Spark cluster with HDFS on a PowerStore X appliance.

3.1 Planning for the virtual machines that run Spark and Hadoop

Spark has several deployment modes. The standalone cluster deployment mode, which includes a cluster manager, is the simplest way to deploy Spark in a private cluster. You can also deploy Spark on top of other cluster managers including Apache Hadoop YARN, Apache Mesos, and Kubernetes. This paper presents the Spark standalone cluster deployment in a private cluster on a PowerStore X appliance.

Spark is a computing analytic engine and does not handle the storage of the data. A Spark job reads the data from various sources into memory, processes it and keeps it in memory, and optionally writes the result to storage systems to persist the data. Spark supports many storage systems such as Linux file systems, Hadoop distributed file system (HDFS), Apache Hive, Apache Pig, Cassandra, and others. This paper focuses on setting up a Spark cluster with HDFS.

A Spark standalone cluster consists of one master node and multiple worker nodes. The master node manages the cluster resources and coordinates running Spark applications across the worker nodes. Spark applications are split into multiple tasks which are performed by executors (Java processes) on the worker nodes. One or multiple executors (Java processes) might be launched on each worker node.

For HDFS, it requires a NameNode and multiple DataNodes. The NameNode maintains the files and directory information of the distributed file system and tracks where the data is located within the cluster in the

DataNodes.

To increase Spark processing power, you can add more worker nodes to the cluster which increases the total computational power and memory available for executors. This addition enables more parallel tasks to be performed across the cluster. For storage, more DataNodes provide more storage processing capability and storage capacity in the cluster.

With a PowerStore X appliance, you can deploy both Spark nodes and Hadoop nodes as virtual machines on the appliance. This simplifies the provision and management of the virtual machines and storage.

In this example, the Spark cluster consists of five virtual machines, and the Hadoop cluster consists of six virtual machines. While Spark nodes and Hadoop nodes can run on different virtual machines, co-locating them on the same virtual machines brings the data closer to Spark and reduces the data access time. Table 1 summarizes the roles and software installed on each virtual machine.

Spark and HDFS virtual machine specifications

Virtual machine CPU RAM Software Role

hadoop-namenode-vm10 16 32 GB Hadoop HDFS NameNode

spark-prim-vm10 16 32 GB Spark, Hadoop Spark primary server

HDFS DataNode

spark-wrk-vm10 24 64 GB Spark, Hadoop Spark worker node

HDFS DataNode

(15)

Virtual machine CPU RAM Software Role

HDFS DataNode

spark-bench-vm10 16 32 GB Spark, Spark-bench,

Hadoop

spark-bench driver JupyterLab server

PowerStore Command Line Interface Client

HDFS Client

3.1.1 PowerStore X model appliance

AppsON is a unique PowerStore X model feature where a VMware hypervisor running vSphere ESXi v6.7 is embedded on the two internal hosts. This feature allows applications to run in VMs directly on the appliance. The appliance offers deep integration with vSphere and is fully compatible with VMware tools. During the initial configuration of the appliance, the internal ESXi hosts are configured to register with a vCenter provided by the customer. The initialization automatically applies performance optimizations, or you can apply them manually afterward. These optimizations include the following:

• Create multiple iSCSI targets on the appliance • Configure additional network ports

• Optimize ESXi multipath settings for the appliance • Increase ESXi queue depths

• Configure jumbo frames for cluster and iSCSI networks

For details about the performance best practices, see the following documents. • PowerStore: PowerStore X Performance Best Practice Tuning

• Dell EMC PowerStore Virtualization Guide

• Dell EMC PowerStore: VMware vSphere Best Practices

(16)

3.1.2 PowerStore storage containers and virtual volumes

On PowerStore X models, the VASA provider is automatically registered with vSphere, and the default

storage container is mounted automatically on the internal ESXi nodes through the iSCSI protocol. See Figure 3. For external ESXi hosts, PowerStore can serve block volumes using Fibre Channel (FC), iSCSI, or NVMe over Fabrics (NVMe-OF). PowerStore can also serve vVol storage containers to the external hosts using FC or iSCSI. However, you must manually register the VASA provider, and you must mount the storage

containers manually on the external ESXi hosts.

PowerStore automatically tracks the vVols that belong to each VM. The PowerStore Manager UI shows these vVols objects under the Virtual Machines view. See Figure 4 and Figure 5.

For more information about vVols, storage containers, and vSphere VASA, see the Dell EMC PowerStore Virtualization Guide.

(17)

(18)

3.1.3 Creating virtual machines on PowerStore X model appliance

Using vCenter, the virtual machines for the Spark nodes and the Hadoop nodes are deployed directly on the PowerStore X internal ESXi hosts based on the information in Table 1 (virtual machine specifications), and Table 2 (file system layout). The two PowerStore X internal ESXi hosts are presented and managed in vCenter like other external ESXi hosts. See Figure 6. You can also view the virtual machines in the PowerStore Manager UI, See Figure 7.

We recommend creating the virtual machines from a template which ensures consistency and faster setup. You can import a virtual machine or template from an existing environment to speed up the deployment process. In this example, a Red Hat Enterprise Linux 7.9 template is established with the packages and configuration outlined in section 3.1.3.1.

PowerStore X internal ESXi hosts and virtual machines in vCenter Internal ESXi nodes in PowerStore X

appliance

PowerStore Controller VMs

(19)

Virtual machines in PowerStore Manager

3.1.3.1 Guest virtual machine operating system

Spark runs on Windows, macOS, and Linux. Apache Hadoop supports Linux and Windows but is mostly deployed on Linux. In this example, Red Hat Enterprise Linux (RHEL) is used for all applications. A VM template is created with RHEL plus the following software and configurations. All application VMs are created from the template to ease deployment and ensure consistency:

• Enterprise Linux 7.9 Server with Graphical Desktop • chrony • open-vm-tools • lsscsi • sg3_utils • autofs • iscsi-initiator-utils • java-1.8.0-openjdk • java-1.8.0-openjdk-devel • python3 • python3-pip • python3-setuptools • zlib • zlib-devel • ncurses • ncurses-devel • gcc • opensll-devel • bzip2-devel

(20)

This example applies the following configurations to Red Hat Enterprise Linux: • Disable optional services

for service in firewalld avahi-daemon irqbalance iptables ip6tables do

systemctl disable $service systemctl stop $service done

• Configure the virtual machines to use network time servers. It is a best practice to keep the system clock synchronized across the cluster nodes. chrony is a common time synchronization service available on Linux. Add the network time server IP address in /etc/chrony.conf and enable the service.

server $time_server_1_ip iburst server $time_server_2_ip iburst # systemctl enable chronyd --now

• Ensure the applications have enough system resources to run on the VMs by increasing the ulimit limits. Use the following as a starting point, and adjust the settings if necessary. Set the following in

/etc/security/limits.conf.

* soft nofile 128000 * hard nofile 128000 * hard nproc 16000 * hard fsize -1

* soft core ulimited * soft data unlimited * hard data unlimited * soft stack unlimited * hard stack unlimited

3.1.3.2 File system layout

(21)

Table 2 shows an example of the file system layout for each application VM. It is a best practice to separate application data from the operating system. One or more file systems are dedicated on each VM for HDFS use.

Spark and Hadoop file system layout

Virtual machine Paravirtual SCSI

controllers Virtual disks Size

Mount point Description hadoop-namenode-vm10 2 /dev/sda /dev/sdb 50 GB 100 GB / and swap /data/1

sda used for

operating system and application binaries sdb used for NameNode spark-prim-vm10 spark-wrk-vm10 spark-wrk-vm11 spark-wrk-vm12 spark-wrk-vm13 3 /dev/sda /dev/sdb /dev/sdc 50 GB 100 GB 100 GB / and swap /data/1 /data/2

sda used for

operating system and application binaries sdb and sdc used for data on DataNodes

spark-bench-vm10 1 /dev/sda 50 GB / and swap sda used for

operating system and application binaries

3.1.3.3 Networking

The PowerStore X model creates a vSphere distributed switch (vDS) and a set of preconfigured distributed port groups for internal communications during the initial configuration process. Each internal ESXi node has two 10 Gb connections for vDS uplinks and one 1 Gb connection for the management network. The

(22)

vSphere distributed switch and distributed port groups for PowerStore X

3.2 Installation and configuration of Apache Hadoop

This section shows the basic installation and configuration of a Hadoop cluster. For more advanced topics and configuration, see the documentation at https://hadoop.apache.org/docs/current/.

3.2.1 Installing Hadoop

Apache Hadoop project offers several binary versions and the source code on the project website. For simplicity and ease of installation, download one of the prebuilt binaries from

http://hadoop.apache.org/releases.html. To decide which Hadoop version to use, check the Spark download site at http://spark.apache.org/downloads.html to verify which version of Hadoop is supported. In this example, Hadoop release 3.2.2 is chosen because it is supported by the Spark prebuilt version 3.0.2. The following steps show an example of installing a prebuilt version of Hadoop.

Perform the following steps on each Hadoop node as the root user: 1. Install Java JDK.

# yum install java-1.8.0-openjdk

# yum install java-1.8.0-openjdk-devel 2. Install Python3.

# yum install python3 python3-pip python3-setuptools

(23)

3. Create a hdfs user and group.

When deploying Hadoop on multiple VMs, ensure the hdfs user id (UID) and group id (GID) are the same across all cluster nodes.

# groupadd -g 3000 hdfs

# useradd -u 3000 -g 3000 -d /home/hdfs hdfs # passwd hdfs

4. Download Hadoop from http://hadoop.apache.org/releases.html and save the installation file in /usr/local.

5. Extract the software into a subdirectory in /usr/local. # cd /usr/local

# tar xzvf hadoop-3.2.2.tar.gz 6. Assign ownership to hdfs user.

# chmod -R hdfs:hdfs /usr/local/hadoop-3.2.2

7. Optionally, create a symbolic link to the software directory. Configure a symbolic link to point to the active version of the software. This action ensures a consistent path to the Hadoop program and configuration files between different versions of the software.

# ln -s /usr/local/hadoop-3.2.2 /usr/local/hadoop

8. Configure the following environment variables for the hdfs user in $HOME/.bashrc. export HADOOP_HOME=/usr/local/hadoop

export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop export JAVA_HOME=/usr/lib/jvm/java-1.8.0

export _JAVA_OPTIONS="-Xmx4g -Djava.awt.headless=true" export PATH=${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${PATH} export HADOOP_MAPRED_HOME=/usr/local/hadoop

HADOOP_HOME Set the location of the Hadoop software

HADOOP_CONF_DIR Set the location of the Hadoop configuration files JAVA_HOME Set the location of the java software

_JAVA_OPTIONS Set the java heap size and other java options

PATH Add locations of Hadoop programs to the search path

(24)

3.2.2 Configuring Hadoop HDFS cluster

You can deploy Apache Hadoop on a single node as a standalone instance or across multiple nodes in a cluster setting. The standalone setup is great for performing quick tests or debugging without the overhead of bringing up a full cluster. This paper focuses on setting up a Hadoop cluster with multiple VMs.

1. Set up the Hadoop worker file /usr/local/Hadoop/etc/hadoop/workers. This file contains a list of HDFS DataNodes of the cluster.

$ cd /usr/local/hadoop/etc/hadoop $ cat workers spark-prim-vm10 spark-wrk-vm10 spark-wrk-vm11 spark-wrk-vm12 spark-wrk-vm13

2. Configure the Hadoop environment settings in /usr/local/hadoop/etc/hadoop/hadoop-env.sh. This file contains environment variables for the Hadoop daemons such as the Java process options and Hadoop software location.

export HDFS_NAMENODE_USER=hdfs

export HADOOP_MAPRED_HOME=/usr/local/hadoop export HADOOP_HOME=/usr/local/hadoop

export HADOOP_HEAPSIZE_MAX=4g

export HDFS_NAMENODE_OPTS="Xmx4g Djava.awt.headless=true -XX:+UseParallelGC"

export HDFS_DATANODE_OPTS="Xmx4g Djava.awt.headless=true -XX:+UseParallelGC"

3. Configure Hadoop core configuration settings in /usr/local/hadoop/etc/core-site.xml. This file contains core site settings such as I/O, security, and others. There are hundreds of configurable attributes, and many have default values that are not listed in the core-site.xml file. To see the complete list of these attributes and their description, go to http://hadoop.apache.org and search for the core-default.xml documentation. In this example, the fs.defaultFS and

hadoop.http.staticuser.user attributes are defined in the file.

- fs.defaultFS sets the default file system universal resource identifier (URI) of your environment. - hadoop.http.staticuser.user sets the username that would be used to browse the content of file

(25)

4. Configure the HDFS configuration settings in /usr/local/hadoop/etc/hdfs-site.xml. For a complete list of attributes of HDFS, search for hdfs-default.xml documentation on http://hadoop.apache.org. The following attributes are set in this example.

- dfs.replication specifies the number of block replications for all files in HDFS. The default for block

replication is 3.

- dfs.namenode.name.dir specifies the local file systems to store the NameNode name table

(fsimage).

- dfs.datanode.data.dir specifies the local file systems on the DataNode to store the data blocks. - dfs.datanode.max.transfer.threads specifies the maximum number of threads for transferring data

in and out of the DataNode.

<configuration> <property> <name>dfs.replication</name> <value>3</value> </property> <property> <name>dfs.namenode.name.dir</name> <value>/data/1/dfs/nn</value> </property> <property> <name>dfs.datanode.data.dir</name> <value>/data/1/dfs/dn,/data/2/dfs/dn</value> </property> <property> <name>dfs.datanode.max.transfer.threads</name> <value>4096</value> </property> </configuration>

5. Configure passwordless SSH between the NameNode and the DataNodes. This action allows the NameNode to transfer files, and start and stop the Hadoop daemons remotely without supplying the password for each node. See appendix A for instructions to set up passwordless SSH.

6. Sync the configuration files in the /usr/local/hadoop/etc/hadoop directory on the NameNode to all DataNodes. Use the scp or rsync command to transfer the files between the VMs.

7. Start the Hadoop daemons as the hdfs user.

Hadoop provides a set of scripts in /usr/local/hadoop/sbin to start and stop the daemons and cluster.

- start-dfs.sh, stop-dfs.sh – start and stop NameNode, DataNode, and HDFS daemons.

(26)

8. Validate the Hadoop cluster with the hdfs command or web UI.

a. As the hdfs user, run hdfs dfsadmin -report to show the status of the cluster and each node. $ hdfs dfsadmin -report Configured Capacity: 1073217536000 (999.51 GB) Present Capacity: 1072615031814 (998.95 GB) DFS Remaining: 673199080454 (626.97 GB) DFS Used: 399415951360 (371.99 GB) DFS Used%: 37.24% Replicated Blocks:

Under replicated blocks: 0 Blocks with corrupt replicas: 0 Missing blocks: 0

Missing blocks (with replication factor 1): 0

Low redundancy blocks with highest priority to recover: 0 Pending deletion blocks: 0

Erasure Coded Block Groups: Low redundancy block groups: 0

Block groups with corrupt internal blocks: 0 Missing block groups: 0

Low redundancy blocks with highest priority to recover: 0 Pending deletion blocks: 0

--- Live datanodes (5):

Name: 100.88.XX.XX:9866 (spark-prim-vm10.techsol.local) Hostname: spark-prim-vm10.techsol.local

Decommission Status : Normal

Configured Capacity: 214643507200 (199.90 GB) DFS Used: 64006270976 (59.61 GB) Non DFS Used: 69025792 (65.83 MB) DFS Remaining: 150439768579 (140.11 GB) DFS Used%: 29.82% DFS Remaining%: 70.09%

Configured Cache Capacity: 0 (0 B) Cache Used: 0 (0 B)

Cache Remaining: 0 (0 B) Cache Used%: 100.00% Cache Remaining%: 0.00% Xceivers: 3

Last contact: Thu May 13 09:42:36 CDT 2021 Last Block Report: Thu May 13 07:21:20 CDT 2021 Num of Blocks: 1298

(27)

b. Go to the Hadoop web UI in a browser: http://$NAMENODE_IP:9870.

(28)

Hadoop web UI > DataNode Information

3.3 Installation and configuration of Apache Spark

This section shows the installation and configuration of a Spark cluster including the following software dependencies:

• Java JDK: Spark runs on Java Virtual Machines (JVMs). It requires Java 6 or above. • Programming language interpreter: Spark is written in Scala and is shipped with a Scala

interpreter. Spark also works with Python, Java, and R. For Spark to work from any of these programming languages, install the programming interpreters on the system. According to Apache Spark project, Python is now the most widely used language with Spark.

3.3.1 Installing Spark

Spark offers several prebuilt versions and source code on the project site. For simplicity and ease of

installation, download one of the prebuilt binaries from http://spark.apache.org/downloads.html. If you prefer to build Spark from source for advanced customization, follow the information on

(29)

Perform the following steps as the root user on the Spark nodes: 1. Install Java JDK.

# yum install java-1.8.0-openjdk

# yum install java-1.8.0-openjdk-devel 2. Install the programming language interpreters.

# yum install python3 python3-pip python3-setuptools

3. Create a spark user and group as the root user. When deploying Spark on multiple VMs, ensure the Spark user id (UID) and group id (GID) are the same across all cluster member VMs.

# groupadd -g 3004 spark

# useradd -u 3004 -g 3004 -d /home/spark # passwd spark

4. Download Spark from http://spark.apache.org/downloads.html to the /usr/local directory. In this example, the Spark release is 3.0.2, and the package type is Pre-built for Apache Hadoop 3.2 and

later (see Figure 11). Click the download link to download the file.

Apache Spark download page

Note: The download site is updated periodically with new releases, and older releases may be archived to

another location. Ensure that the prebuilt Spark version is compatible with the Hadoop version that you have chosen.

5. Extract the software in a subdirectory in /usr/local. # cd /usr/local

# tar xzvf spark-3.0.2-bin-hadoop3.2.tgz 6. Assign ownership to spark user.

(30)

7. Optionally, create a symbolic link to the software directory. Configure a symbolic link to point to the active version of the software. This action ensures a consistent path to the Spark program and configuration files between different versions of the software.

# ln -s /usr/local/spark-3.0.2-bin-hadoop3.2 /usr/local/spark 8. Configure the following environment variables for the spark user in $HOME/.bashrc.

export SPARK_HOME=/usr/local/spark

export JAVA_HOME=/usr/lib/jvm/java-1.8.0

export _JAVA_OPTIONS="-Xmx4g -XX:+UseParallelGC"

export PATH=${SPARK_HOME}/bin:${SPARK_HOME}/sbin:${PATH} export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.7-src.zip:${SPARK_HOME}/python/:$PYTHONPATH"

export PYSPARK_PYTHON=/usr/bin/python3

export PYSPARK_DRIVER_PYTHON=/usr/bin/python3

SPARK_HOME Set the location of the Spark software JAVA_HOME Set the location of the java software

_JAVA_OPTIONS Set the java heap size and other java options

PATH Add locations of Spark programs to the search path

PYTHONPATH Set the locations of Spark python libraries for pyspark PYSPARK_PYTHON Set the Python interpreter for pyspark

PYSPARK_DRIVE_PYTHON Set the Python interpreter for pyspark driver 9. Verify the Spark installation using the spark-shell interactive tool as spark user.

$ source ~/.bashrc $ spark-shell

A Spark session is successfully created and waiting for user command.

3.3.2 Configuring a Spark standalone cluster

Spark can run on a single host or on multiple hosts in a cluster setting. To form a Spark standalone cluster, add the cluster nodes to the Spark configuration file and synchronize the environment settings and cluster configuration across all cluster nodes.

Configure and update these files on the master node first, and sync them to all worker nodes. 1. To configure the Spark standalone cluster, add the worker-node information in the

/usr/local/spark/conf/slaves file. In this example, the following worker nodes are added to the Spark

(31)

2. Configure Spark logging in /usr/local/spark/conf/log4j.properties.

Spark uses log4j for logging. Configure log4j by copying the template file in the /usr/local/spark/conf directory. The default settings in the template are a good starting point without any changes. Adjust the parameters if necessary.

$ cd /usr/local/spark/conf

$ cp log4j.properties.template log4j.properties

3. Configure Spark environment settings in /usr/local/spark/conf/spark-env.sh.

The following variables are chosen as a baseline. These variables configure the Spark cluster such as the java classpath and the web portal UI. To see the complete list of variables, see

https://spark.apache.org/docs/latest/spark-standalone.html. # cat spark-env.sh

export HADOOP_HOME=/usr/local/hadoop

export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/

export SPARK_DIST_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath) export SPARK_MASTER_HOST=spark-prim-vm10

export SPARK_MASTER_WEBUI_PORT=9090 export SPARK_WORKER_CORES=22

HADOOP_HOME Set the location of the Hadoop programs and configuration files so that Spark can access the HDFS.

SPARK_DIST_CLASSPATH Include the Hadoop classpath. SPARK_MASTER_HOST Set the master node ip/hostname.

SPARK_MASTER_WEBUI_PORT Set the master web UI port (default is 8080). It might be necessary to change the default port due to conflicts with other applications on the same system.

SPARK_WORKER_CORES Set how many CPU cores Spark applications allow to use. The default is all CPU cores.

4. Configure the Spark application properties in /usr/local/spark/conf/spark-defaults.conf. This configuration file contains properties that control most of the application settings. We recommend reviewing these properties to understand what they do and how they change the

behavior of Spark. See https://spark.apache.org/docs/latest/configuration.html for the comprehensive list of properties, their default values, and description.

(32)

spark.driver.log.persistToDfs.enabled true spark.driver.log.dfsDir /user/spark/driverlogs spark.network.timeout 600000

spark.executor.heartbeatInterval 100000 # Enable history server

spark.eventLog.enabled true

spark.eventLog.dir hdfs://hadoop-namenode-vm10:9000/user/spark/loghistory spark.history.fs.logDirectory

hdfs://hadoop-namenode-vm10:9000/user/spark/loghistory

spark.master Set to the cluster manager ip/hostname

and port. For Spark standalone cluster, the URI is in the form of

spark://$SPARK_MASTER_IP:$PORT.

spark.sql.debug.maxToStringFields Set the maximum number of fields that can be converted to strings in debug output.

spark.dynamicAllocation.enabled Enable dynamic resource allocation. This _{allows dynamic scaling of executors.}

spark.dynamicAllocation.shuffleTracking. enabled

Enable shuffle file tracking for executors without the need for an external shuffle service.

spark.dynamicAllocation.executorIdleTime out

Increase the executor default timeout from 60–600 s. This prevents the executors to be removed prematurely for long running tasks.

spark.driver.log.persistToDfs Enable the applications to write the _{driver logs to a persistent storage. The} default is do not persist the driver logs. spark.driver.log.dfsDir Set the persistent storage location where

the Spark driver stores the logs. In this example, it is set to use a HDFS directory.

spark.network.timeout Set the default timeout for all network connections. The default is 120 s. Increase the timeout for long running tasks, for example, 600000 ms (10 min).

(33)

spark.eventLog.enabled Enable Spark to log events to be used with Spark history server. See section 3.3.3.

spark.eventLog.dir Set the location for the Spark event logs

to be stored. In this example, it is set to use a HDFS directory. See section 3.3.3.

spark.history.fs.logDirectory Specifies the persistent storage where _{the Spark history server can load the} event logs from. This should be set to the same as spark.eventLog.dir. See section 3.3.3.

5. Configure passwordless SSH between the master node and the worker nodes. This allows the master node to transfer files, and start and stop the Spark daemons remotely without supplying the password for each node. See appendix A for instructions to set up passwordless SSH.

6. Sync the configuration files in the /usr/local/spark/conf directory on the Spark master node to all worker nodes. Use the scp or rsync command to transfer the files between the nodes.

7. Start the Spark processes as the spark user.

Spark provides a set of scripts in /usr/local/spark/sbin to start and stop the cluster or individual processes.

start-all.sh, stop-all – start, and stop all Spark processes on the master and worker nodes. This does not include the Spark History Server.

(34)

8. Validate the Spark standalone cluster on the Spark master web UI.

Go to the Spark master web UI in a browser: http://spark-prim-vm10:9090. Verify that the worker node status is alive.

Spark master web UI

3.3.3 Configuring Spark History Server

Spark History Server is a web front end that accesses and displays the event logs generated from the Spark applications across all nodes. You must enable and save the event logs in a centralized location where the Spark History Server can access them. While the Spark master web UI also provides application logs, they are not persisted across Spark restarts. It is useful to have the event logs available for troubleshooting or tuning the application performance. See section 4.2.5 for an example of using the Spark History Server. To enable Spark history server, use the following procedures:

1. Create a directory for the event logs where all Spark worker nodes have write access. In this example, the log directory resides in the HDFS cluster.

On any of the HDFS DataNodes, perform the following commands as the hdfs or spark user to create a new directory and set the ownership to spark user:

$ hdfs dfs -mkdir /user/spark/loghistory

(35)

2. Add the following entries in /usr/local/spark/conf/spark-defaults.conf on all Spark nodes. These entries enable the event logging to the specified HDFS directory from all Spark nodes.

# Enable history server spark.eventLog.enabled true

spark.eventLog.dir hdfs://hadoop-namenode-vm10:9000/user/spark/loghistory spark.history.fs.logDirectory

hdfs://hadoop-namenode-vm10:9000/user/spark/loghistory

3. As spark user, start the Spark History Server on the Spark master node. $ /usr/local/spark/sbin/start-history-server.sh

4. Verify the status of Spark History Server. In a browser, connect to the Spark History Server at http://$SPARK_MASTER_IP:18080

(36)

4 Testing Spark with Spark-bench

Spark-bench is an open-source benchmarking tool that is designed to test various application workloads. The tool provides an integrated data generator and several application workloads including machine learning, streaming, graph processing, and SQL. The goal of the project is to provide the developers a comprehensive Spark-specific benchmark tool that is also easy to use and configure. Developers use the tool to test and validate their configurations, compare performance between different platforms, and identity system bottlenecks in the environment.

4.1 Installing Spark-bench tool

The original Spark-bench code can be downloaded from the github site https://github.com/CODAIT/spark-bench. The code was last updated in November 2018. Since then, other developers have forked the code to fix issues or add enhancements to the tool. In this example, a forked version is used from

https://github.com/ch2994/spark-bench because it updates the Scala version to 2.12. The original Spark-bench code is compiled with Scala version 2.11, which is incompatible with the prebuilt version of Spark 3.x because it is compiled with Scala version 2.12.

The Spark-bench tool can be installed on one of the Spark nodes or on a dedicated node. If Spark-bench is installed on a dedicated node, it is a best practice to keep the Spark-bench node close to the Spark cluster to avoid high network latency.

4.1.1 Installation prerequisites

Spark-bench requires the following software: • Java 8 or above

• Python 3

• Spark software: Spark-bench launches application workloads by making the spark-submit calls to the cluster. Spark software is required on the Spark-bench node.

• Hadoop software: To configure the Spark-bench driver to save the logs to an HDFS directory, the system requires Hadoop software.

Follow the steps in section 3.3.1 and section 3.3.2 to install and configure Spark on the Spark-bench VM, except for configuring /usr/local/spark/conf/slaves. In this example, the Spark-bench VM functions as a dedicated driver and not a part of the Spark cluster. There is no requirement to add the Spark-bench node in the configuration file. Also, there is no requirement to start the Spark daemons on the Spark-bench VM. Follow the steps in section 3.2 to install and configure Hadoop, except for configuring

(37)

4.1.2 Installing Spark-bench

The following procedure installs Spark-bench on a dedicated virtual machine running on the same PowerStore X appliance:

(38)

2. As root user, extract the software into a subdirectory in /usr/local. # cd /usr/local

# unzip spark-bench-scala-2.12.zip

3. Create a symbolic link to the software directory as root user.

# ln -s /usr/local/spark-bench-scala-2.12 /usr/local/spark-bench 4. Assign ownership to the spark user.

# chown -R spark:spark /usr/local/spark-bench-scala-2.12 5. Download the sbt tool from github to compile the Spark-bench code.

# cd /usr/local

# wget https://github.com/sbt/sbt/releases/download/v1.4.8/sbt-1.4.8.zip # unzip sbt-1.4.8.zip

6. Compile the Spark-bench code with sbt tool as spark user. For more information about compiling Spark-bench, see https://codait.github.io/spark-bench/compilation/.

$ cd /usr/local/spark-bench

$ /usr/local/sbt/bin/sbt assembly $ mkdir lib

$ cp -p target/assembly/* lib

7. When sbt assembly is complete, two jar files are generated in the target/assembly directory. Move or copy them to the lib directory.

spark-bench-2.3.0_0.4.0-RELEASE.jar

spark-bench-launch-2.3.0_0.4.0-RELEASE.jar

4.2 Running Spark-bench workloads

This section demonstrates running the Spark-bench KMeans workload, and using the Spark master web UI and Spark History Server to monitor the applications. Spark-bench provides data generators for KMeans, Linear Regression, and Graph. It also includes workloads for KMeans, Logistic Regression, SparkPi, and others. For the complete list of workloads and their definitions, see

https://codait.github.io/spark-bench/workloads/. As noted in the article, even though the project aims to provide comprehensive workloads supported by Spark, some of these workloads have not been fully implemented. For instance, while it can generate data for Linear Regression, the Linear Regression workload has not been implemented yet. This paper focuses on the KMeans workload, a machine learning workload, because Spark-bench supports both generating the data and exercising the KMeans workload against the generated data. The workload reads the data from the storage, performs computation in memory, and writes the results to the storage. Spark-bench also provides example configuration files for different workloads. These examples are good starting points for beginners to explore Spark-bench and the workload configuration files. Before you attempt to implement the KMeans workload, we recommend reading about these examples on

(39)

4.2.1 Generate KMeans dataset

The following data-generation configuration is based on an example configuration from Spark-bench. The example configuration files are in /usr/local/spark-bench/examples. Make a copy of the

data-generation.conf and modify it to fit your Spark environment. At a minimum, adjust the parameters highlighted

to reflect your environment.

$ cat data-generation-8p.conf spark-bench = {

spark-submit-parallel = false spark-submit-config = [{ spark-args = {

// Specify the Spark master address in your env master = "spark://spark-prim-vm10:7077"

// Specify how much memory to request for the executor. Must be less than the avail memory on the worker node

executor-memory = "4G" }

suites-parallel = false workload-suites = [ {

descr = "Generating data for the benchmarks to use" parallel = false

repeat = 1 // generate once and done! benchmark-output = console

workloads = [ {

name = "data-generation-kmeans"

// The generated data is written to the HDFS filesystem in the parquet format

output = "hdfs://hadoop-namenode-vm10:9000/user/spark/testdata-p8-50mil/kmeans-data.parquet"

save-mode = "overwrite"

// Size of the dataset, 50 million rows total rows = 50000000 cols = 24 partitions = 8 } ] } ] }] }

To generate the dataset, run the following command as spark user. Ensure that the spark-bench.sh is invoked by the same user that owns the output directory. These examples assign the spark user the ownership of the HDFS directories.

$ cd /usr/local/spark-bench/examples

(40)

When the application completes, verify the dataset in HDFS. In a web browser, go to the HDFS NameNode web UI at http://$HADOOP_NAMENODE_IP:9870. Click Utilities > Browse the file system. The number of files created is based on the partition parameters that are defined in the configuration file.

Browse KMeans data directory in HDFS

4.2.2 Run KMeans workload

Make a copy of the KMeans configuration file in the example directory. Modify the configuration to fit your environment. The Spark-bench configuration syntax is very flexible and allows defining multiple workloads, repeating workloads, and sequential and parallel executions of workloads to simulate various mix workload patterns. The following example shows three workloads with different Spark settings and each workload is repeated five times. The goal is to test the effect of using different CPU cores for each executor. The configuration also overrides the default executor memory and driver memory setting.

spark-bench = {

(41)

executor-cores = 4 executor-memory = 2g driver-memory = 4g } suites-parallel = false // Workload 1 workload-suites = [ {

descr = "Run kmeans " parallel = false repeat = 5 benchmark-output = "console" workloads = [ { name = "kmeans" input = "hdfs://hadoop-namenode-vm10:9000/user/spark/testdata-p8-50mil/kmeans-data.parquet" k = 10 } ] } ] }, { spark-args = { master = "spark://spark-prim-vm10:7077" executor-cores = 8 executor-memory = 2g driver-memory = 4g } // Workload 2 suites-parallel = false workload-suites = [ {

(42)

spark-args = { master = "spark://spark-prim-vm10:7077" executor-cores = 12 executor-memory = 2g driver-memory = 4g } suites-parallel = false workload-suites = [ {

descr = "Run kmeans 3 " parallel = false repeat = 5 benchmark-output = "console" workloads = [ { name = "kmeans" input = "hdfs://hadoop-namenode-vm10:9000/user/spark/testdata-p8/kmeans-data.parquet" k = 10 } ] } ] }] }

4.2.3 Spark memory and CPU cores

Because Spark is an in-memory computing engine, proper memory configuration is critical to its performance. One of the common errors Spark applications encounter is Out Of Memory errors (OOM). This is typically related to the JAVA Heap setting, the executor memory setting, or the driver memory setting. Every

application has different requirement, and there is not a universal setting that works for every application. The memory configuration might appear to be different from application to application. The general guideline is that the memory setting should be large enough to hold the dataset. Adjust and experiment with these values when the application encounters memory errors. For a complete list of tunable Spark settings, see

https://spark.apache.org/docs/latest/configuration.html.

To adjust the amount of Java Heap space for Spark daemons, set the following environment settings: • _JAVA_OPTIONS = -Xmx4g in spark user $HOME/.bashrc file

• SPARK_DAEMON_JAVA_OPTS = -Xmx4g in $SPARK_HOME/conf/spark-env.sh file

To change the default settings for the Spark driver memory and Spark executor memory, set the following parameters in $SPARK_HOME/conf/spark-defaults.conf.

• spark.driver.memory • spark.executor.memory

(43)

When a Spark application is submitted to the cluster manager, it launches several executors on the Spark worker nodes to process the tasks. By default, one executor is requested on each Spark worker node with 1 GB of memory and all CPU cores available on the node. However, when Spark is co-located with other applications, like Hadoop in this example, some CPU cores should be reserved for the Hadoop daemons. To limit the total number of CPU cores that Spark applications are allowed to use on the system, set

SPARK_WORKER_CORES to the total number of CPU cores on the system minus the number of CPU cores reserved for the other applications in /usr/local/spark/conf/spark-env.sh.

Also, the Spark application might explicitly request the number of cores allowed for each executor. Instead of all available CPU cores for a single executor on a worker node, set the executor-cores parameter in the application to override the default. The example in section 4.2.2 compares the duration of the KMeans workloads using a different number of executor-cores. Figure 15 shows the status summary of the application runs and their durations on the Spark master web UI. In this example, there are four Spark worker nodes. Each node has 22 available CPU cores for Spark applications. When the executor-cores is specified explicitly, Spark automatically calculates the number of executors that it can run on each worker node based on the available CPU cores.

Click the application id link to see the executors details like in Figure 16. The status of the executors might show KILLED even though they are completed successfully because the driver asks the workers to terminate the executors after they finish processing.

Spark application status in Spark master web UI

(44)

Executor status of an application

4.2.4 Spark network timeout

It might be necessary to increase the timeout values for Spark network communication. The default network timeout is 120 s, which might not be long enough for long running tasks. If the application fails with a timeout error, increase the spark.network.timeout value incrementally to find the optimal value.

spark.network.timeout is defined in $SPARK_HOME/conf/spark-defaults.conf. See section 3.3.2.

4.2.5 Monitoring Spark applications

Use the Spark master web UI to monitor the applications that are running or recently completed. The application information does not persist after restarting Spark daemons. To retain information about

completed applications, configure and use the Spark History Server. See section 3.3.3 for information about configuring Spark History Server.

4.2.5.1 Spark master web UI

The Spark master web UI shows the workers status, a summary of CPU cores and memory available and used on each worker, and a list of completed and incomplete applications. It also displays Spark applications information including their core and memory usage, environment settings, stages and tasks, and the

executors information. Click the application id to see the runtime information and log messages. See Figure 15 and Figure 16.

(45)

4.2.5.2 Spark History Server

We recommend configuring the Spark History Server to retain the application information on persistent storage. It is useful to be able to recall the information for troubleshooting or comparing performance with different environment settings. To store the application information, configure Spark worker nodes to write events to a persistent storage location like HDFS. See section 3.3.3 about configuring Spark History Server. To access the Spark History Server, in a web browser, go to http://$SPARK_MASTER_IP:18080. The main page shows a list of completed and incomplete applications. See Figure 17.

Spark History Server web UI

Click the app id link to see more details like the event timeline and the status of the jobs. The jobs and stages sections provide useful information about their runtimes, the functions performed, and the detail logs specific to each job and stage. This allows users to easily see the end-to-end job flow, identify trouble areas, and investigate the cause of the bottleneck or issues. See Figure 18, Figure 19, Figure 20.

(46)

(47)

Show all stages and review the detail log of a specific stage

The environment section, Figure 21, shows the application environment settings and is useful to review how adjusting these settings might affect the performance of the application. For instance, execute the same application with different executor memory settings and compare their performance.

Dell EMC PowerStore: Apache Spark Solution Guide