As you’ve learned, Spark provides a single tool for submitting jobs across all cluster managers, called spark-submit. In Chapter 2 you saw a simple example of submitting a Python program with spark-submit, repeated here in Example 7-1.
Example 7-1. Submitting a Python application
bin/spark-submit my_script.py
When spark-submit is called with nothing but the name of a script or JAR, it simply runs the supplied Spark program locally. Let’s say we wanted to submit this program to a Spark Standalone cluster. We can provide extra flags with the address of a Standalone cluster and a specific size of each executor process we’d like to launch, as shown in Example 7-2.
Example 7-2. Submitting an application with extra arguments
bin/spark-submit --master spark://host:7077 --executor-memory 10g my_script.py
The --master flag specifies a cluster URL to connect to; in this case, the spark:// URL means a cluster using Spark’s Standalone mode (see Table 7-1). We will discuss other URL types later.
Table 7-1. Possible values for the --master flag in spark-submit
Value Explanation
spark://host:port Connect to a Spark Standalone cluster at the specified port. By default Spark Standalone masters use port 7077.
mesos://host:port Connect to a Mesos cluster master at the specified port. By default Mesos masters listen on port 5050.
yarn Connect to a YARN cluster. When running on YARN you’ll need to set the HADOOP_CONF_DIR environment variable to point to the location of your Hadoop configuration directory, which contains information about the cluster.
local Run in local mode with a single core.
local[N] Run in local mode with N cores.
local[*] Run in local mode and use as many cores as the machine has.
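To make the table concrete, here is a minimal Python sketch that classifies a --master value into the categories of Table 7-1. The describe_master helper and its return strings are hypothetical and not part of Spark. Spark does its own parsing internally, and this sketch only mirrors the rows of the table.

```python
import re


def describe_master(master):
    """Map a --master value to the cluster type listed in Table 7-1.

    Hypothetical helper for illustration only; not part of Spark.
    """
    if master.startswith("spark://"):
        return "standalone"
    if master.startswith("mesos://"):
        return "mesos"
    if master == "yarn":
        return "yarn"
    if master == "local":
        return "local, 1 core"
    # local[N] or local[*]
    match = re.fullmatch(r"local\[(\d+|\*)\]", master)
    if match:
        cores = match.group(1)
        if cores == "*":
            return "local, all cores"
        return "local, %s cores" % cores
    raise ValueError("unrecognized master value: %r" % master)
```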
Apart from a cluster URL, spark-submit provides a variety of options that let you control specific details about a particular run of your application. These options fall roughly into two categories. The first is scheduling information, such as the amount of resources you’d like to request for your job (as shown in Example 7-2). The second is information about the runtime dependencies of your application, such as libraries or files you want to deploy to all worker machines.
The general format for spark-submit is shown in Example 7-3.
Example 7-3. General format for spark-submit
bin/spark-submit [options] <app jar | python file> [app options]
[options] is a list of flags for spark-submit. You can list all available flags by running spark-submit --help. The most common flags are described in Table 7-2.
<app jar | python file> refers to the JAR or Python script containing the entry point into your application.
[app options] are options that will be passed on to your application. If the main() method of your program parses its calling arguments, it will see only [app options] and not the flags specific to spark-submit.
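As an illustration of the last point, the following Python sketch shows how a script such as my_script.py might parse its own options. The option names (input_path, --limit) and the parse_app_options function are hypothetical, chosen only for this example. The sketch assumes the script receives its application options in sys.argv[1:].

```python
import argparse


def parse_app_options(argv):
    """Parse only the application's own options.

    argv is expected to be sys.argv[1:]. The flags meant for
    spark-submit itself (such as --master) do not appear in it.
    Option names here are hypothetical.
    """
    parser = argparse.ArgumentParser(description="Options for my_script.py")
    parser.add_argument("input_path", help="path of the data to process")
    parser.add_argument("--limit", type=int, default=10,
                        help="maximum number of records to print")
    return parser.parse_args(argv)

# Inside my_script.py one would typically call:
#   import sys
#   args = parse_app_options(sys.argv[1:])
```

With an invocation such as bin/spark-submit --master spark://host:7077 my_script.py data.txt --limit 5, the script would see only data.txt, --limit, and 5 as its arguments.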
Table 7-2. Common flags for spark-submit
Flag Explanation
--master Indicates the cluster manager to connect to. The options for this flag are described in Table 7-1.
--deploy-mode Whether to launch the driver program locally (“client”) or on one of the worker machines inside the cluster (“cluster”). In client mode spark-submit will run your driver on the same machine where spark-submit is itself being invoked. In cluster mode, the driver will be shipped to execute on a worker node in the cluster. The default is client mode.
--class The “main” class of your application if you’re running a Java or Scala program.
--name A human-readable name for your application. This will be displayed in Spark’s web UI.
--jars A list of JAR files to upload and place on the classpath of your application. If your application depends on a small number of third-party JARs, you can add them here.
--files A list of files to be placed in the working directory of your application. This can be used for data files that you want to distribute to each node.
--py-files A list of files to be added to the PYTHONPATH of your application. This can contain .py, .egg, or .zip files.
--executor-memory The amount of memory to use for executors, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
--driver-memory The amount of memory to use for the driver process, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
spark-submit also allows setting arbitrary SparkConf configuration options, either by using the --conf prop=value flag or by providing a properties file through --properties-file that contains key/value pairs. Chapter 8 will discuss Spark’s configuration system.
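The following Python sketch shows one way these flags could be assembled programmatically, following the general format of Example 7-3. The build_submit_args helper is hypothetical and not part of Spark. It only builds an argument list and does not launch anything. The property spark.ui.port is used purely as a sample key.

```python
def build_submit_args(app, master=None, conf=None,
                      properties_file=None, app_options=()):
    """Assemble a spark-submit argument list: flags, then app, then app options.

    Hypothetical helper for illustration; it does not run spark-submit.
    """
    args = ["bin/spark-submit"]
    if master is not None:
        args += ["--master", master]
    if properties_file is not None:
        args += ["--properties-file", properties_file]
    # Each configuration entry becomes its own --conf prop=value pair.
    for key, value in sorted((conf or {}).items()):
        args += ["--conf", "%s=%s" % (key, value)]
    args.append(app)
    args.extend(app_options)
    return args
```

The resulting list could then be handed to something like subprocess.run on a machine where Spark is installed; that step is left out here because it depends on the local setup.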
Example 7-4 shows a few longer-form invocations of spark-submit using various options.
Example 7-4. Using spark-submit with various options
# Submitting a Java application to Standalone cluster mode
$ ./bin/spark-submit \
  --master spark://hostname:7077 \
  --deploy-mode cluster \
  --class com.databricks.examples.SparkExample \
  --name "Example Program" \
  --jars dep1.jar,dep2.jar,dep3.jar \
  --total-executor-cores 300 \
  --executor-memory 10g \
  myApp.jar "options" "to your application" "go here"
# Submitting a Python application in YARN client mode
$ export HADOOP_CONF_DIR=/opt/hadoop/conf
$ ./bin/spark-submit \
  --master yarn \
  --py-files somelib-1.2.egg,otherlib-4.4.zip,other-file.py \
  --deploy-mode client \
  --name "Example Program" \
  --queue exampleQueue \
  --num-executors 40 \
  --executor-memory 10g \
  my_script.py "options" "to your application" "go here"