Overview
The Torque cluster is a distributed system of compute nodes that accepts batch job submissions and uses the Torque resource manager to queue the jobs. The following discussion describes how to configure DMS to run jobs on a Torque cluster.
The DMS implementation on the Torque cluster consists of the following components:
• DMS installation - a system where DMS has been installed
• Torque cluster - a set of Red Hat Enterprise Linux 5 workstations where multiple DMS jobs can be run simultaneously. Jobs are managed by the Torque Resource Manager.
• job agent - a DMS process that mediates between DMS and Torque. The job agent submits jobs to Torque, which in turn dispatches the jobs to the nodes of the Torque cluster.
• job agent registry - a DMS list of known machines running job agents. (Note: Typical cases use only a single job agent running on the same node as the DMS GUI.)
• file staging directory - a shared directory used by DMS to store files and other inputs for jobs that will be executed by the Torque cluster. This file staging directory also provides a convenient location for the script files required by the Torque management system to run jobs.
Prerequisites
This set of instructions assumes the following prerequisites:
• A Red Hat Enterprise Linux 5 cluster must be configured and the Torque job and queue management tools must be installed and working.
• The Torque nodes must be a platform supported by DecisionSpace (i.e., 64-bit Red Hat Enterprise Linux 5 on AMD or Intel processors).
• RPMs for OpenMotif 2.2.3 and libXp 1.0.0 must be installed on each of the nodes of the cluster.
• The appropriate DMS-supported simulators must be installed in directories that are accessible from all nodes of the grid. Simulator installation requirements depend on the job types to be run. DMS supports the following simulators on the Torque cluster: Nexus, VIP, and ECLIPSE.
• DMS software must be installed in a folder that is accessible from all nodes of the grid.
• All nodes of the grid must have network connectivity to a Landmark license server that has valid licenses for the various DMS features to be used by the remote jobs. The license server must also have a valid DMS_GRID license.
• All nodes of the grid must have network connectivity to valid licenses for the particular simulators that will be run.
• All nodes of the grid must have read/write access to a “file staging directory” where job input files are distributed by the DMS job agent.
Important: The assistance of a system administrator will probably be required to complete some of the cluster-side installation and configuration tasks described in this appendix, as well as to provide some of the necessary information.
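Some of the prerequisites above can be spot-checked from a shell on a cluster node. The following is a minimal sketch and is not part of the DMS distribution. The function names are hypothetical, and the RPM package names passed to check_rpms (for example openmotif and libXp) are assumptions that should be matched to what your Red Hat repository actually provides.

```shell
#!/bin/sh
# Hypothetical preflight helpers; function names and package names are assumptions.

# Succeeds if the directory exists and a file can be created, read back, and removed.
check_staging_dir() {
    dir=$1
    [ -d "$dir" ] || return 1
    probe="$dir/.dms_probe_$$"
    ( echo ok > "$probe" ) 2>/dev/null || return 1
    [ "$(cat "$probe")" = "ok" ] || { rm -f "$probe"; return 1; }
    rm -f "$probe"
}

# Reports whether each named RPM is installed; skipped where rpm is unavailable.
check_rpms() {
    command -v rpm >/dev/null 2>&1 || { echo "rpm not found; skipping"; return 0; }
    status=0
    for pkg in "$@"; do
        if rpm -q "$pkg" >/dev/null 2>&1; then
            echo "$pkg: installed"
        else
            echo "$pkg: MISSING"
            status=1
        fi
    done
    return $status
}
```

A typical call on each node might be `check_staging_dir /usr/local/dms_staging && check_rpms openmotif libXp`. The check does not cover license connectivity or simulator installation, which still need to be confirmed separately.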
Configure Job Agent and Job Registry
Note: These instructions assume that the DMS application, the DMS job registry, and the
DMS job agent all run on the same Windows PC.
For this configuration, a directory accessible to both the Windows PC and the cluster machines is required. In the examples below, this directory appears as U:\ on the Windows PC and as /usr/local/dms_staging on the cluster machines.
The shared directory contains a “scripts” subdirectory that holds the batch files and shell scripts that you or your system administrator must provide. The “Create Script Files and Configuration Files” section below provides examples of the required batch files and shell scripts.
On the Windows PC, you will need to edit the job agent’s properties file. This is located in a folder under the main DMS installation folder called modules\jobagent\batch. The properties file is called JobAgent.properties.
In the example shown below, items that will need to be customized for the deployment environment are shown in red.
Add the following lines to JobAgent.properties:
# Torque
com.lgc.edms.algorithmDispatcher=Sge5
com.lgc.edms.fileStagingDirectory=u:\\
com.lgc.edms.fileStagingDirectoryLinux=/usr/local/dms_staging
com.lgc.edms.runRemote=u:\\scripts\\submitdmsjob.bat
com.lgc.edms.maxPendingCluster=0
(Note that backslashes in the properties file are an escape character and must be doubled to denote an actual backslash. Alternatively, you can use a single forward slash.)
If the JobAgent.properties file has been modified previously, be sure that any other settings for com.lgc.edms.algorithmDispatcher are commented out or removed.
Explanation of Settings
The algorithmDispatcher=Sge5 string sets the job agent to use the DMS algorithm dispatcher for external job queuing systems.
The fileStagingDirectory and fileStagingDirectoryLinux settings refer to the common folder that is accessible by both the Windows PC and the Linux cluster. The first setting is the folder's name as seen on Windows, and the second setting is the folder's name as seen on Linux.
The runRemote setting is the name of the batch file that DMS will launch on Windows to submit an iteration to the cluster. An example of this batch file is presented later in this document.
The maxPendingCluster setting lets you limit the number of iterations that are simultaneously running on the cluster. This setting is optional. If it is not present or is set to 0, DMS dispatches iterations to the cluster as soon as it can generate them. If it is set to a positive value, DMS limits the number of iterations simultaneously running on the cluster to that value.
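The instruction above asks that only one com.lgc.edms.algorithmDispatcher setting remain active. The sketch below is one way to confirm that, assuming a POSIX shell with grep can read the properties file (for example from a Unix-like shell on the PC, or from a machine that can see a copy of the file). The function name check_dispatcher is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prints the active (uncommented) algorithmDispatcher
# lines of a properties file and succeeds only if there is exactly one.
check_dispatcher() {
    props=$1
    key='com\.lgc\.edms\.algorithmDispatcher'
    # Match lines that begin (after optional whitespace) with the key and "=".
    # Lines that begin with "#" are comments and therefore do not match.
    active=$(grep -E "^[[:space:]]*${key}[[:space:]]*=" "$props" || true)
    count=$(printf '%s\n' "$active" | grep -c . || true)
    if [ -n "$active" ]; then
        printf '%s\n' "$active"
    fi
    [ "$count" -eq 1 ]
}
```

For example, `check_dispatcher JobAgent.properties || echo "expected exactly one active algorithmDispatcher line"` flags both a missing setting and a leftover duplicate.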
Configure DMS to Find the Job Registry
DMS uses an environment variable called COM_LGC_EDMS_JOB_REGISTRY to find the job registry. DMS will submit jobs to the job agent via the job registry. Set the environment variable to the Windows machine name of the PC running the job registry. For example, on a PC with the (hypothetical) machine name MYPC, COM_LGC_EDMS_JOB_REGISTRY would be set to MYPC.
Create Script Files and Configuration Files
The DMS-Torque configuration uses the batch files and shell scripts listed below to execute DMS jobs on the nodes of the Torque cluster. Templates for these files are deployed in the following directory:
..\Landmark\DecisionSpaceDMS5000.3.1\modules\jobagent\batch\cluster_examples
You can copy the templates, place them in the shared "scripts" directory on the system where DMS is installed, and edit them as appropriate for your environment.
In the example files shown below, items that will need to be customized for the deployment environment are shown in red.
Script Files
You must create the following script files and place them in the “scripts” subdirectory of the shared directory where the job inputs and outputs are written.
submitdmsjob.bat
This batch file is executed by the job agent. It uses SSH to execute a command on the cluster. The example below uses the plink application from PuTTY and a private key (.ppk) identity file for public key authentication, which avoids password prompts:
@echo off
setlocal
set CLUSTER_USER=username
set IDENTITY_FILE=c:\Putty\data\MyKey.ppk
set CLUSTER_HOST=torquehost.mydomain.com
set CLUSTER_SUBMIT=/usr/local/dms_staging/scripts/submitdmsjob.sh
plink -l %CLUSTER_USER% -i %IDENTITY_FILE% %CLUSTER_HOST% %CLUSTER_SUBMIT% %*
Explanation of Settings
CLUSTER_USER is the Linux username that will be used to log on to the cluster.
IDENTITY_FILE is the PuTTY private key (.ppk) identity file that plink uses for public key authentication. If you are using a different authentication method (for example, passwords), you will need to modify this batch file to incorporate settings appropriate for the method that you are using.
CLUSTER_HOST is the hostname of the cluster node where DMS will log on to submit jobs.
CLUSTER_SUBMIT is the command that will be executed on the Linux cluster to submit the DMS iteration to the queueing system.
submitdmsjob.sh
This script file is executed by submitdmsjob.bat (via SSH), and submits a DMS job to Torque.
#!/bin/sh
DMS_QSUBMIT=/usr/local/bin/qsub
DMS_QSUBMIT_QUEUE_NAME="batch"
DMS_RUNJOB=/usr/local/dms_staging/scripts/rundmsjob.sh
DMS_SUBSCRIPT_DIR=/tmp/dms
mkdir -p ${DMS_SUBSCRIPT_DIR}
DMS_SUBSCRIPT=${DMS_SUBSCRIPT_DIR}/${RANDOM}.pbs
# Quote the directives so the shell does not treat "#" as the start of a comment
echo "#PBS -m n" >${DMS_SUBSCRIPT}
echo "#PBS -q ${DMS_QSUBMIT_QUEUE_NAME}" >>${DMS_SUBSCRIPT}
echo ${DMS_RUNJOB} $@ >>${DMS_SUBSCRIPT}
${DMS_QSUBMIT} ${DMS_SUBSCRIPT}
rm ${DMS_SUBSCRIPT}
Explanation of Settings
DMS_QSUBMIT is the command to submit a job to the queueing system.
DMS_QSUBMIT_QUEUE_NAME is the Torque queue name.
DMS_RUNJOB is the command that will be executed when the queueing system runs the submitted job.
DMS_SUBSCRIPT_DIR is the directory where the temporary PBS script will be created for the queue submission command.
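To make the effect of submitdmsjob.sh concrete, the sketch below writes the same two #PBS directives and the run command to a temporary file and prints the result instead of calling qsub. It is a dry-run illustration only. The function name write_pbs_script is hypothetical, the argument job1.job is a placeholder rather than a real DMS argument, and mktemp is swapped in for the ${RANDOM}-based file name to avoid name collisions, which is a substitution and not what the template does.

```shell
#!/bin/sh
# Dry-run sketch of the PBS script that submitdmsjob.sh hands to qsub.
# write_pbs_script is a hypothetical name; nothing here contacts Torque.
write_pbs_script() {
    queue=$1
    runjob=$2
    shift 2
    # mktemp replaces ${RANDOM} here (an assumption made for this sketch).
    script=$(mktemp "${TMPDIR:-/tmp}/dms.XXXXXX") || return 1
    {
        # The directives are quoted so the shell does not treat "#" as a comment.
        echo "#PBS -m n"           # -m n: send no mail for this job
        echo "#PBS -q ${queue}"    # -q: destination queue
        echo "${runjob} $*"        # command line the compute node will execute
    } > "$script"
    echo "$script"
}

# Placeholder arguments; substitute the values used in your environment.
pbs=$(write_pbs_script batch /usr/local/dms_staging/scripts/rundmsjob.sh job1.job)
cat "$pbs"
rm -f "$pbs"
```

Inspecting the generated file this way is a quick check that the directives reach the file intact before any job is actually queued.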
rundmsjob.sh
This is the script file that actually executes the DMS cluster component, which in turn executes Nexus, VIP, etc.
#!/bin/bash
export LM_LICENSE_FILE=2013@licensehost
DMS_HOME=/usr/local/DecisionSpaceDMS5000.3.1
DMS_SCRIPTS=/usr/local/dms_staging/scripts
# VIP Core and Exec
export COM_LGC_EDMS_VIPCORE=${DMS_SCRIPTS}/vipcore.sh
export COM_LGC_EDMS_VIPEXEC=${DMS_SCRIPTS}/vipexec.sh
# Nexus Standalone and Nexus exe
export COM_LGC_EDMS_NEXUSSA=${DMS_SCRIPTS}/standalone.sh
export COM_LGC_EDMS_NEXUSEXE=${DMS_SCRIPTS}/nexus.sh
# Other DMS environment variables as necessary for additional simulators
DMS_PLATFORM=linux64
JAVA_HOME=${DMS_HOME}/jre
DMS_BIN=${DMS_HOME}/modules/dsinfra_native/bin/${DMS_PLATFORM}
CLASSPATH=${DMS_HOME}/modules/edms/jar/com_lgc_edms.jar:${DMS_HOME}/modules/dsinfra/jar/\*:${DMS_HOME}/modules/dsinfra_native/jar/\*
export CLASSPATH
LD_LIBRARY_PATH=${DMS_BIN}:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH
${JAVA_HOME}/bin/java \
    com.lgc.edms.dispatchers.sge5.SgeRemoteExecute $@
Explanation of Settings
LM_LICENSE_FILE is the FlexLM license setting.
DMS_HOME is the directory where DMS is installed on the cluster.
DMS_SCRIPTS is the directory where various DMS scripts are located.
COM_LGC_EDMS_VIPCORE, COM_LGC_EDMS_VIPEXEC, COM_LGC_EDMS_NEXUSSA and COM_LGC_EDMS_NEXUSEXE are the scripts launched by DMS to execute VIP Core, VIP Exec, Standalone and Nexus, respectively.
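Because rundmsjob.sh only exports the wrapper script paths, it can be useful to confirm that each path points to an executable file before the first job is submitted. The following is a minimal sketch with a hypothetical function name; it is not part of the DMS templates.

```shell
#!/bin/sh
# Hypothetical helper: verifies that each path is a regular, executable file.
# Returns non-zero if any path fails the check.
check_wrappers() {
    missing=0
    for path in "$@"; do
        if [ -f "$path" ] && [ -x "$path" ]; then
            echo "ok: $path"
        else
            echo "not executable or missing: $path"
            missing=1
        fi
    done
    return $missing
}
```

One possible use is to run `check_wrappers "$COM_LGC_EDMS_VIPCORE" "$COM_LGC_EDMS_VIPEXEC" "$COM_LGC_EDMS_NEXUSSA" "$COM_LGC_EDMS_NEXUSEXE"` from a shell on a cluster node after setting the same variables that rundmsjob.sh exports.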
Simulator Scripts
These are simple wrapper scripts for the simulator executables. They serve as a handy place to put a custom LD_LIBRARY_PATH or other settings. If you already have scripts to handle launching the simulator executables you can use those instead.
nexusvip_env.sh
#!/bin/sh
# environment variables for running flow simulators from DMS
DMS_VIPNEXUS_DIR=/usr/local/Landmark/Nexus-VIP5000.0.2/nexussimulators/LinuxEM64
DMS_VIPNEXUS_LDPATH=${DMS_VIPNEXUS_DIR}/HP-MPI/lib/linux_amd64
DMS_VIPCORE_EXE=${DMS_VIPNEXUS_DIR}/coreEM64_5000_0_2.exe
DMS_VIPEXEC_EXE=${DMS_VIPNEXUS_DIR}/execEM64_5000_0_2.exe
DMS_STANDALONE_EXE=${DMS_VIPNEXUS_DIR}/standaloneEM64_5000_0_2.exe
DMS_NEXUS_EXE=${DMS_VIPNEXUS_DIR}/nexusEM64_5000_0_2.exe
# set the LD path
LD_LIBRARY_PATH=${DMS_VIPNEXUS_LDPATH}:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH
vipcore.sh
#!/bin/sh
. /usr/local/dms_staging/scripts/nexusvip_env.sh
${DMS_VIPCORE_EXE} $@
vipexec.sh
#!/bin/sh
. /usr/local/dms_staging/scripts/nexusvip_env.sh
${DMS_VIPEXEC_EXE} $@
standalone.sh
#!/bin/sh
. /usr/local/dms_staging/scripts/nexusvip_env.sh
${DMS_STANDALONE_EXE} $@
nexus.sh
#!/bin/sh
. /usr/local/dms_staging/scripts/nexusvip_env.sh
${DMS_NEXUS_EXE} $@
PuTTY/Plink Notes
The first time you connect to a Linux system using PuTTY or plink.exe, you will be prompted to accept or reject the new (to PuTTY) host key. If this happens while running plink from a batch file from DMS, the batch file will become “stuck.” To prevent this from happening, connect to the Linux system manually using PuTTY or plink outside of DMS before trying to do so from DMS via the batch file. Be sure to use the exact host name that you will use in the DMS batch file.
To use public key authentication with plink, you will need to use the PuTTY utility puttygen.exe to generate a public/private key pair, and then install the public key on the Linux system. See Chapter 8 in the PuTTY User Guide for details on how to generate the key pair and how to install the public key on the Linux system.
The PuTTY home page: http://www.chiark.greenend.org.uk/~sgtatham/putty/
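On a Linux system running OpenSSH, installing the public key usually means appending the one-line OpenSSH-format key shown by puttygen to the .ssh/authorized_keys file in the cluster user's home directory, with restrictive permissions on the directory and the file. The sketch below assumes OpenSSH defaults. The function name is hypothetical, and the key text must be replaced with your own public key.

```shell
#!/bin/sh
# Hypothetical helper: appends a public key line to <home>/.ssh/authorized_keys
# unless an identical line is already present. Assumes OpenSSH default paths.
install_public_key() {
    home_dir=$1
    key_line=$2
    ssh_dir="$home_dir/.ssh"
    auth_file="$ssh_dir/authorized_keys"
    mkdir -p "$ssh_dir" || return 1
    chmod 700 "$ssh_dir"
    touch "$auth_file"
    # -q: quiet, -x: match whole lines only, -F: treat the key as a fixed string
    if ! grep -qxF -e "$key_line" "$auth_file"; then
        printf '%s\n' "$key_line" >> "$auth_file"
    fi
    chmod 600 "$auth_file"
}
```

It would be run on the cluster login node as the user named in CLUSTER_USER, for example `install_public_key "$HOME" "ssh-rsa AAAA... comment"`, where the quoted string stands for the full key line copied from puttygen.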
Run DMS
Running DMS jobs on the Torque cluster requires launching a job agent on the machine where DMS is installed.
Launch the Job Agent
Perform the following steps on the system where you have installed DMS.
1. Within the DMS install directory, locate the start-all.bat file in the modules\jobagent\batch directory.
2. Execute start-all either by double-clicking it from Windows Explorer or invoking it from the command line after navigating to its directory.
This will start a job agent registry and a job agent. The machine's network name will be added to the registry that was started. (This is the job registry that is identified by your COM_LGC_EDMS_JOB_REGISTRY setting.)
Submit Jobs
This procedure assumes that you have already created the job that you want to submit to the cluster. See the DMS online help for instructions on how to create DMS jobs.
With the job agent running locally, perform the following steps on the system where DMS is installed.
1. Launch DMS.
2. Select the job to be submitted in the Jobs List.
3. Click on the Run Job icon.
4. In the Submit Job dialog box, select the job agent option in the Execute Job field.
5. Select the database where the job results should be stored.
6. Click on OK.
As the job runs, the following events will occur:
• On the system where DMS is installed, the DMS Job Agent window will post results as the iterations are performed.
• On the system where DMS is installed, the shared directory will display .job file names as the iterations are performed. When the job is completed, the .job name will be converted to a .done name.
• If you have specified that particular simulation files be saved (outside of the DMS database), these files will be written back to the directory specified by the DMS job.
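The .job and .done file names described above lend themselves to a simple progress check from any shell that can see the shared directory. The sketch below assumes only those two file extensions and that the files sit directly in the directory passed in. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: counts files directly inside a directory whose names
# end in the given extension (for example "job" or "done").
count_ext() {
    dir=$1
    ext=$2
    n=0
    for f in "$dir"/*."$ext"; do
        # When nothing matches, the pattern is left unexpanded; skip it.
        [ -e "$f" ] || continue
        n=$((n + 1))
    done
    echo "$n"
}

# Prints a one-line summary of unfinished versus finished entries.
dms_progress() {
    echo "in progress: $(count_ext "$1" job), done: $(count_ext "$1" done)"
}
```

On a cluster node, `dms_progress /usr/local/dms_staging` would summarize the staging directory used in the examples in this appendix.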