Depending on the DNS setup, it may be necessary to add the IP address of the head node to the local hosts file (C:\Windows\System32\Drivers\etc\hosts) of the client machine. Use the IP address of the head node NIC that is on the network connecting the client machine and the head node.
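The check for an existing hosts entry can be scripted. The following is a minimal sketch in Python, not part of HPC Pack: the function names are hypothetical, and the address 10.0.0.1 and the name HEADNODE in the usage comment are placeholders. It works on the text of a hosts file and returns the updated text, so writing the result back (which requires administrative rights) is left to the caller.

```python
def has_host_entry(hosts_text, hostname):
    """Return True if hostname appears as a name on any non-comment line."""
    wanted = hostname.lower()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        # fields[0] is the IP address; the remaining fields are host names
        if wanted in (name.lower() for name in fields[1:]):
            return True
    return False


def ensure_host_entry(hosts_text, ip_address, hostname):
    """Return hosts_text with an entry for hostname appended if it is missing."""
    if has_host_entry(hosts_text, hostname):
        return hosts_text
    if hosts_text and not hosts_text.endswith("\n"):
        hosts_text += "\n"
    return hosts_text + f"{ip_address}\t{hostname}\n"


# Example (placeholder values):
#   text = open(r"C:\Windows\System32\Drivers\etc\hosts").read()
#   updated = ensure_host_entry(text, "10.0.0.1", "HEADNODE")
```

Host names are compared case-insensitively, and an entry that is commented out is treated as missing.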
15 Troubleshooting
For general troubleshooting of cluster deployment, operations and job scheduling, refer to the TechNet Library at http://technet.microsoft.com/en-us/library/cc947590(WS.10).aspx. Below is a summary of the most important tools and techniques.
15.1 Diagnostic tools
The built-in diagnostic tools that ship with HPC Pack 2008 R2 often provide good insight into what might cause a specific error.
15.2 Command output
Open the Job Management view of the HPC Cluster Manager to analyze the command output.
15.3 HPC Pack 2008 Log files
The HpcSdm.etf and HPCManagement.etf files should be examined for additional information in case of an error. These files are located in
C:\Program Files\Microsoft HPC Pack 2008 R2\Data\LogFiles
The HPCManagement.etf files exist on all nodes in the cluster; HpcSdm.etf is found only on the head node.
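When several log files have accumulated, it helps to see which ones were written most recently. The sketch below is an illustration in Python, not an HPC Pack tool. It assumes that the relevant files start with HpcSdm or HPCManagement and end in .etf (any suffix between the two, such as a sequence number, is an assumption), and it only lists the files; it does not decode their contents.

```python
from pathlib import Path

# Log directory as given in this section.
LOG_DIR = Path(r"C:\Program Files\Microsoft HPC Pack 2008 R2\Data\LogFiles")


def recent_logs(log_dir, prefixes=("HpcSdm", "HPCManagement"), limit=5):
    """Return up to `limit` matching .etf files, most recently modified first."""
    lowered = tuple(prefix.lower() for prefix in prefixes)
    matches = [
        path
        for path in Path(log_dir).iterdir()
        if path.is_file()
        and path.suffix.lower() == ".etf"
        and path.name.lower().startswith(lowered)
    ]
    matches.sort(key=lambda path: path.stat().st_mtime, reverse=True)
    return matches[:limit]


# Example:
#   for path in recent_logs(LOG_DIR):
#       print(path.name)
```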
15.4 Hosts file
Networking must work flawlessly in order for HPC Server 2008 R2 to operate reliably. This includes IP configuration, DHCP and DNS. In addition, HPC Server 2008 R2 maintains the hosts files on the head node and the compute nodes. The hosts files are located in
C:\Windows\System32\drivers\etc
Below is a sample hosts file from a compute node (NODE1).
15.5 Lizard
To help analyze performance-related problems, an MPI test is built into the diagnostic tools. To put a heavier load on the system, it is recommended that the Lizard tool be downloaded from http://www.microsoft.com/downloads/en/details.aspx?FamilyID=3313856B-02BC-4BDD-B8B6-541F5309F2CE&displaylang=en. This tool provides a simple GUI that automates the standard performance test (the LINPACK benchmark) that is used to rank the 500 most powerful supercomputers in the world (http://www.top500.org).
15.6 Port query tool
As mentioned in earlier sections, accessing the remote database can be difficult, especially since HPC clusters are often located on non-standard network segments. The port query tool, found at http://support.microsoft.com/kb/832919, is an excellent companion for resolving connectivity issues.
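Where the port query tool is not at hand, a basic TCP reachability test can be sketched in a few lines of Python. This is an illustration only and does not replace the tool, which reports more detail. The host name SQLSERVER1 in the usage comment is a placeholder, and port 1433 is the default port of a default SQL Server instance; the actual port depends on how the database server is configured.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example (placeholder host name, default SQL Server port):
#   print(tcp_port_open("SQLSERVER1", 1433))
```

A result of False does not say whether the port is closed, filtered by a firewall, or the host name could not be resolved; distinguishing these cases is what the port query tool is for.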
15.7 Installation failure
If Windows Firewall service is stopped or turned off on the head node computer, installation of HPC Pack 2008 R2 does not complete successfully because the firewall rules required by Windows HPC Server 2008 R2 are not configured. If this problem occurs during installation, entries such as the following may appear in the hpcMsi-DATETIME.txt log file (located in the %Temp%\HPCSetupLogs folder):
CAQuietExec: Warning: failed to add rule HPC Host (TCP-In), continue
CAQuietExec: Firewall rule 'HPC Host for controller (TCP-In)' addition failed: 0x800706d9
CAQuietExec: Warning: failed to add rule HPC Host for controller (TCP-In), continue
CAQuietExec: Firewall rule 'HPC Host x32 (TCP-In)' addition failed: 0x800706d9
To work around this problem, ensure that the Windows Firewall service is started on the head node computer, and then try to install HPC Pack 2008 R2 again. You can start the Windows Firewall service by using the Services MMC snap-in, or by running the following command in an elevated Command Prompt window:
net start MpsSvc
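To confirm that a failed installation was caused by this problem, the setup log can be searched for the firewall rule failures shown above. The following Python sketch is an illustration under the assumption that the entries have the same format as the sample lines in this section; the function name is hypothetical.

```python
import re

# Matches entries such as:
#   CAQuietExec: Firewall rule 'HPC Host x32 (TCP-In)' addition failed: 0x800706d9
FAILED_RULE = re.compile(
    r"Firewall rule '(?P<rule>[^']+)' addition failed: (?P<code>0x[0-9A-Fa-f]+)"
)


def failed_firewall_rules(log_text):
    """Return (rule name, error code) pairs for every failed rule in the log text."""
    return [
        (match.group("rule"), match.group("code"))
        for match in FAILED_RULE.finditer(log_text)
    ]
```

The function takes the log contents as a string, so it can be applied to the text of any hpcMsi-DATETIME.txt file read from the %Temp%\HPCSetupLogs folder. An empty result means that no entries of this form were found, not that the installation succeeded.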
16 Miscellaneous
16.1 Windows Server Update Services (WSUS)
The default option is to apply the same update policy to the compute nodes as to other comparable servers hosted by the IT department. If special operational procedures are required, it is recommended that a dedicated Active Directory organizational unit (OU) be created in which the regular corporate IT policies are blocked. The cluster administrator is then free to communicate update windows to the cluster users, thereby ensuring smooth operations. Essentially, nothing on the HPC side prevents the IT department from carrying out its tasks.
16.2 Environments
These instructions describe a generic failover head node configuration. It is the responsibility of the IT Department to plan for applicable environments:
Production
Acceptance testing
Unit testing
Development
The latter two can be virtualized, and head node failover is not necessary for them. Acceptance testing, however, should be as close to production as possible.
16.3 Disaster recovery
HPC Server 2008 R2 does not provide any built-in tools or hooks for disaster recovery. Any given HPC cluster can only be served by one head node. An easy and robust approach to disaster recovery is to set up a second cluster in a different location that is controlled by another head node. In case of a disaster, the HPC client simply connects to the backup cluster instead of the primary one. All that is required is to select a new cluster in whatever client application is being used. Hardcoded cluster strings in client applications should be avoided at all costs.
Disaster recovery for individual compute nodes is not necessary. If one node fails, the cluster continues to operate, albeit at reduced capacity.
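One way to keep cluster names out of client code is to read an ordered list of head nodes from configuration and use the first one that responds. The Python sketch below illustrates the idea only; it is not an HPC Pack API. The function names, the comma-separated list format and the node names PRIMARY-HN and BACKUP-HN are all hypothetical, and the reachability test is passed in by the caller because the appropriate check depends on the client application.

```python
def parse_head_nodes(value):
    """Split a comma-separated list such as 'PRIMARY-HN,BACKUP-HN' into names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def select_head_node(candidates, is_reachable):
    """Return the first candidate for which is_reachable(name) is true, else None."""
    for name in candidates:
        if is_reachable(name):
            return name
    return None


# Example (hypothetical names; the list would come from a configuration
# file or an environment variable rather than from the source code):
#   nodes = parse_head_nodes("PRIMARY-HN,BACKUP-HN")
#   head_node = select_head_node(nodes, my_reachability_check)
```

Because the order of the list encodes the preference, switching to the backup cluster after a disaster requires a configuration change at most, never a change to the client application.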
16.4 Co-hosting of applications
HPC Server 2008 R2 is designed as a general-purpose HPC cluster. Thus, in general, it is not necessary to design an HPC architecture according to the principle One Application – One Cluster.
For a scalable design, one should obey a few simple rules:
The head node should not be used for anything but controlling the cluster. In other words, do not install other applications on the head node, and do not use it as a file server for the compute nodes.
Always install the full SQL Server, i.e., avoid the Express Edition. This allows unlimited table space and full memory and CPU utilization, which is especially important for large parametric sweep jobs that tend to put a heavy load on the job scheduler.
Ultimately the decision whether to co-host a set of applications will depend on their characteristics.
This decision must be made jointly by the stakeholders involved, i.e., the end users, the IT Department and the ISVs. Decision criteria include, among other things:
Processor architecture. Some applications are very sensitive to processor/socket/core affinity, and care must be taken when scheduling these.
Memory architecture. For certain applications the number of memory channels, not the amount of memory, is the primary limiting factor.
Local disks. Certain applications may require large and very fast local disks for scratch space.
Response time. Certain users must be served immediately, whereas others can wait.
User access pattern. Not all applications may be run at the same time. Certain applications may only be run during certain periods due to regulatory compliance.
Licensing. Some applications require license tokens controlled by license servers such as FlexLM before they can execute on the cluster. Incorrect implementation of these applications may adversely impact the overall availability of the cluster.
HPC Server 2008 R2 is equipped with a set of tools to address these issues (job templates, priorities, preemption, UAC, node groups, etc.). It will require careful analysis in each case, however.
16.5 Head node tuning
For general performance tuning guidelines, we refer to the whitepaper “Windows HPC Server 2008 Head Node Performance Tuning”, which has been distributed as part of the administration class for HPC Server 2008.