Setting Up the Development Environment
WHICH PROPERTIES CAN I SET?
ConfigurationPrinter is a useful tool for discovering what a property is set to in your environment. For a running
daemon, like the namenode, you can see its configuration by viewing the /conf page on its web server. (See Table 10- 6 to find port numbers.)
You can also see the default settings for all the public properties in Hadoop by looking in the share/doc directory of your Hadoop installation for files called core-default.xml, hdfs-default.xml, yarn-default.xml, and mapred-default.xml. Each property has a description that explains what it is for and what values it can be set to.
The default settings files’ documentation can be found online at pages linked from
http://hadoop.apache.org/docs/current/ (look for the “Configuration” heading in the navigation). You can find the defaults for a particular Hadoop release by replacing current in the preceding URL with r<version> — for example,
http://hadoop.apache.org/docs/r2.5.0/.
Be aware that some properties have no effect when set in the client configuration. For example, if you set
yarn.nodemanager.resource.memory-mb in your job submission with the expectation that it would change the
amount of memory available to the node managers running your job, you would be disappointed, because this property is honored only if set in the node manager’s yarn-site.xml file. In general, you can tell the component where a property should be set by its name, so the fact that yarn.nodemanager.resource.memory-mb starts with
yarn.nodemanager gives you a clue that it can be set only for the node manager daemon. This is not a hard and fast
rule, however, so in some cases you may need to resort to trial and error, or even to reading the source.
Configuration property names have changed in Hadoop 2 onward, in order to give them a more regular naming structure. For example, the HDFS properties pertaining to the namenode have been changed to have a dfs.namenode
prefix, so dfs.name.dir is now dfs.namenode.name.dir. Similarly, MapReduce properties have the mapreduce
prefix rather than the older mapred prefix, so mapred.job.name is now mapreduce.job.name.
This book uses the new property names to avoid deprecation warnings. The old property names still work, however, and they are often referred to in older documentation. You can find a table listing the deprecated property names and their replacements on the Hadoop website.
We discuss many of Hadoop’s most important configuration properties throughout this book.
GenericOptionsParser also allows you to set individual properties. For example:
% hadoop ConfigurationPrinter -D color=yellow | grep color color=yellow
Here, the -D option is used to set the configuration property with key color to the value
yellow. Options specified with -D take priority over properties from the configuration
files. This is very useful because you can put defaults into configuration files and then override them with the -D option as needed. A common example of this is setting the
number of reducers for a MapReduce job via -D mapreduce.job.reduces=n. This will override the number of reducers set on the cluster or set in any client-side configuration files.
The other options that GenericOptionsParser and ToolRunner support are listed in Table 6-1. You can find more on Hadoop’s configuration API in The Configuration API.
WARNING
Do not confuse setting Hadoop properties using the -D property=value option to GenericOptionsParser (and ToolRunner) with setting JVM system properties using the -Dproperty=value option to the java command. The
syntax for JVM system properties does not allow any whitespace between the D and the property name, whereas GenericOptionsParser does allow whitespace.
JVM system properties are retrieved from the java.lang.System class, but Hadoop properties are accessible only
from a Configuration object. So, the following command will print nothing, even though the color system property
has been set (via HADOOP_OPTS), because the System class is not used by ConfigurationPrinter: % HADOOP_OPTS='-Dcolor=yellow' \
hadoop ConfigurationPrinter | grep color
If you want to be able to set configuration through system properties, you need to mirror the system properties of interest in the configuration file. See Variable Expansion for further discussion.
Table 6-1. GenericOptionsParser and ToolRunner options
Option Description
-D property=value Sets the given Hadoop configuration property to the given value. Overrides any default or site
properties in the configuration and any properties set via the -conf option.
-conf filename ... Adds the given files to the list of resources in the configuration. This is a convenient way to set site
properties or to set a number of properties at once.
-fs uri Sets the default filesystem to the given URI. Shortcut for -D fs.defaultFS=uri.
-jt host:port Sets the YARN resource manager to the given host and port. (In Hadoop 1, it sets the jobtracker
address, hence the option name.) Shortcut for -D yarn.resourcemanager.address=host:port.
-files file1,file2,...
Copies the specified files from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by MapReduce (usually HDFS) and makes them available to
MapReduce programs in the task’s working directory. (See Distributed Cache for more on the distributed cache mechanism for copying files to machines in the cluster.)
-archives
archive1,archive2,...
Copies the specified archives from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by MapReduce (usually HDFS), unarchives them, and makes them available to MapReduce programs in the task’s working directory.
-libjars jar1,jar2,...
Copies the specified JAR files from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by MapReduce (usually HDFS) and adds them to the MapReduce task’s classpath. This option is a useful way of shipping JAR files that a job is dependent on.