• No results found

Microsoft Windows Compute Cluster Server 2003 Evaluation

N/A
N/A
Protected

Academic year: 2021

Share "Microsoft Windows Compute Cluster Server 2003 Evaluation"

Copied!
27
0
0

Loading.... (view fulltext now)

Full text

(1)

Microsoft Windows

Microsoft Windows

Compute Cluster Server 2003

Compute Cluster Server 2003

Evaluation

Georg

Georg HagerHager, Johannes , Johannes HabichHabich (RRZE)(RRZE) Stefan

Stefan DonathDonath ((LehrstuhlLehrstuhl ffüürr SystemsimulationSystemsimulation)) Universit

Universitäätt ErlangenErlangen--NNüürnbergrnberg

ZKI AK Supercomputing 25./26.10.2007, GWDG

(2)

Outline

ƒ Administrative issues

ƒ Installing the Head Node ƒ Cluster Network Topology ƒ RIS-Unattended Installation ƒ Domain integration

ƒ User Environment

ƒ Benchmarks for Intel MPI PingPong ƒ Current user projects

ƒ Evaluating Excel integration

ƒ LINPACK sample

ƒ Homebrew VBA macros for simple Jacobi benchmark ƒ NUMA and affinity issues

(3)

Opteron Test Platform

ƒ 7 quad Opteron nodes (Dual Core Dual Socket) ƒ 4 GB per node

8GB on head node

ƒ Windows 2003 Enterprise + Compute Cluster Pack

ƒ Visual Studio 2005,

Intel compilers, MKL, ACML

ƒ Star-CD

ƒ Gbit Ethernet

ƒ Access via RDP or ssh (sshd from Cygwin) ƒ GUI tool for job control: Cluster Job Manager

(4)

Cluster Layout

ƒ Plan: ccsfront as VMWare virtual machine on master node

ƒ Only machine accessible to users ƒ Easily movable

ƒ However: Serious problems with Visual Studio and Intel compilers inside VM

ƒ Same procedures work on ccsmaster ƒ Being investigated

ƒ For now, people work on ccsmaster

ccsmaster ccs002 ccs003 ccs004 ccs005 ccs006 ccs007 ccs008 RRZE

Switch

ccsfront

(5)

Installing the Head Node

ƒ Preinstalled headnode and cluster nodes from transtec with evaluation version of WinCCS2003

ƒ Preliminary network connection bandwidth evaluation with Intel IMB benchmarks

ƒ No SUA support Æ Clean installation of Win2003 Enterprise R2 x64

ƒ Installation of Intel Fortran and C compiler with Visual Studio 2005 integration

ƒ Intel C++ Compiler 9.1 and 10.0

ƒ Intel Fortran Compiler 9.1

(6)

Cluster Network Topology

ƒ Separate private and public 1GE networks available ƒ DHCP Server could not separate the two scopes to two

physical network adapters

ƒ DHCP Server is reconfigured without warning for Remote Installation Services (RIS) to install nodes unattended

Æ only private network was used, with NAT translation for outside communication

(7)

RIS Unattended Installation

ƒ Creating the RIS (Remote Installation Server) Image from Win2003 installation CD

ƒ By-hand inlining of R2 necessary packets and settings from Win2003 installation CD 2

ƒ Adjustment of wrong paths inside the configuration files ƒ First attempt to install RIS caused a complete DHCP

breakdown, as RIS changes complete DHCP configuration and launches without proper rights for standard scope

192.0.0.x

(8)

RIS Unattended Installation

ƒ Domain Administrator rights are necessary for creating automatically new computer accounts inside domain

ƒ RIS deploy wizard warns user that password is visible in plain text during deployment

Æ huge security risk during each setup! ƒ Suggestions:

ƒ No plain text passwords

ƒ Easy wizard creation of standard Win2003 images

ƒ DHCP Server with multiple instances for each network interface

(9)

Domain Integration

ƒ Headnode and cluster nodes were integrated into

UNI-Erlangen ADS

ƒ Problem of supplying a working directory which is both available as a network share via Windows Explorer and available to the jobs running on the cluster

ƒ User works on mounted network share

ƒ Job works on complete UNC path leading to same network share

Æ One common path for both, but UNC for users is not a preferred choice

Æ Automatic mapping of this location at job start and user login Æ Problems of some Windows software to work with UNC

(10)

Intel MPI Ping Pong benchmark

ƒ Compiling inside SUA and Visual Studio 2005

ƒ Issuing jobs with standard MPIEXEC causes 4 threads to run on each node Æ internode communication measured ƒ Hence:

“Fun and interesting ways to run MPI Jobs on CCS” from http://windowshpc.net/

Now:

pernode.bat script hacks %CCP_NODES% system variable

and only one task is executed per compute node

Æ Measuring real internode connection bandwidth and latency Æ Desirable: flexible specification for „processors per node“

(11)

Intel MPI Ping Pong benchmark

PingPong

INT EL MPI Benchmark on CCS (small packets / latency test)

0 20 40 60 80 100 120 140 160 180 200 1 10 100 1000

Message length [bytes]

L a te n cy [ µ sec] MS: Variations due to “interrupt coalescing” in GE driver

Æ No option to switch off in our driver /

(12)

Projects on the WinCCS Cluster

ƒ Crack propagation (FAU, Materials Science) ƒ F90/OpenMP, C++ (OpenMP to come)

ƒ VirtualFluids (TU Braunschweig) ƒ C++/MPI

ƒ Star-CD (FAU, Fluid Mechanics)

ƒ waLBerla (FAU, System Simulation)

(13)

waLBerla project @ .

Widely applicable lattice Boltzmann from Erlangen

ƒ CFD project based on lattice Boltzmann method

ƒ Modular software concept

ƒ Supports various applications, currently planned:

ƒ Blood flow in aneurysms

ƒ Moving particles and agglomerates

ƒ Free surfaces to simulate foams, fuel cells, a.m.m.

ƒ Charged colloids

ƒ Arbitrary combinations of above

ƒ Integration in efficient massive-parallel environment

ƒ Standardized input and output routines

ƒ User-friendly interface

(14)

waLBerla project @ .

Widely applicable lattice Boltzmann from Erlangen ƒ Porting issues concerning CMake:

ƒ CMake has to be configured to find MPI

ƒ Not possible to specify Cluster Debugger Configurations via CMake (overwrites settings when project is built)

ƒ Visual Studio & Queues

ƒ Not possible to automatically submit and debug parallel job via Visual Studio

ƒ Debugging issues

ƒ MPI Cluster Debugger: Configuration pain to run jobs on remote sites

(15)

Goals

ƒ WCCS as a development environment ƒ Visual Studio

ƒ Parallel debugging ƒ Different compilers

ƒ Some issues (Intel project system, parallel debugging), but ok in general

ƒ Coupling of CCS cluster to MS Excel by use of VBA

ƒ Job construct, submit ƒ Result retrieval

ƒ Visualization

ƒ Behaviour of Windows on a ccNUMA architecture ƒ Locality & affinity issues

(16)

Excel-CCS coupling

ƒ CCS API available for C++ and VBA

ƒ Documentation issues for VBA, but many examples online: http://www.microsoft.com/technet/scriptcenter/ scripts/ccs/job/default.mspx?mfr=true

ƒ API is part of Office Professional only

ƒ Example Excel Sheet and VBA macros for LINPACK parameter scans provided

ƒ Toy project: Make Excel sheet for simple heat equation solver with graphical output of performance numbers

(17)

Evaluating Excel Integration (LINPACK Eample)

ƒ Taking precompiled binaries

offered by MS with ACML

ƒ Tuning LINPACK parameters as suggested in:

“Hands-On Lab –

Building HPC LINPACK Tool” ƒ Issuing jobs directly to Job

Manager

ƒ Querying jobs

ƒ Results are instantly plotted

in Excel 0.00 2.00 4.00 6.00 8.00 10.00 12.00 60 64 68 72 76 80 84 88 92 96 Blocksize Gflops Matrix 44000

(18)

Excel-CCS coupling

ƒ Principles of operation

ƒ Provide Excel worksheet with necessary parameters ƒ Binary name, working dir

ƒ Number of CPUs, walltime limit ƒ Input parameters for application

ƒ Position active elements (buttons,…) linked to VBA macros ƒ VBA communicates with CCS using XML

ƒ First time generation of XML structure from ƒ XML Schema

ƒ template file (saved from Job Manager app.) ƒ scratch

ƒ Link entries to worksheet cells

ƒ Use VBA-CCS API to construct and submit jobs ƒ Many options possible

(19)

job.xml

template

<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <Job xmlns:ns1="http://www.microsoft.com/ComputeCluster/"

Name="job1"

MaximumNumberOfProcessors="4"

MinimumNumberOfProcessors="4"

Owner="unrz55" Priority="Normal" Project="Jacobi"

Runtime="Infinite"> <ns1:Tasks>

<ns1:Task Name="task1"

CommandLine="echo 10 10 5 | heat_ccs.exe"

MaximumNumberOfProcessors="4" MinimumNumberOfProcessors="4" Runtime="Infinite" WorkDirectory="\\Ccsmaster\ccsshare\unrz55\xlstest\" Stderr="3700-3700-50-err.out" Stdout="3700-3700-50-heat.out“ /> </ns1:Tasks> </Job>

(20)
(21)

Submitting a Job with VBA in Excel

’ connect to cluster, defined by xls cell “Cluster” Set objCluster =

CreateObject("Microsoft.ComputeCluster.Cluster")

objCluster.Connect (Range("Cluster").Value) ’ Job object from XML description

Set Job = objCluster.CreateJobFromXml(strXML)

’ obtain user credentials

Set WshNetwork = CreateObject("WScript.Network") UserName = WshNetwork.UserDomain & "\" &

WshNetwork.UserName ’ submit job to queue

ID = objCluster.QueueJob((Job), UserName, "", False, 0)

Many other CCS-API and general VBA functions available ƒ Status query, job cancel etc.

(22)

Result retrieval e.g. by

(23)

Windows on ccNUMA

ƒ Locality and congestion problems on ccNUMA ƒ “First touch” policy for memory

pages ensures local placement ƒ Watch OpenMP init loops

ƒ Even if placement is correct, make sure it stays that way

ƒ “pin” threads/processes to initial sockets

ƒ Issue with OpenMP and MPI

ƒ To make matters worse, FS buffer cache can fill LDs so that apps must use nonlocal memory

ƒ Use “memory sweeper” before production

ƒ How does all this work on Windows?

P C P C C C MI Memory P C P C C C MI Memory LD0 LD1 BC data BC data data

(24)

NUMA Placement and Pinning

with Heat Conduction Solver (Relaxation)

0 100 200 300 400 500 600 700 800 100 250 400 550 700 850 1000 1150 1300 1450 1600 1750 1900 2050 2200 2350 2500 2650 2800 ML U P s /s

placement+pinning placement only no placement

4MB L2

limit NUMA placement: +60%

additional pinning: +30%

Pinning benefit is only due to better NUMA locality!

(25)

0 50 100 150 200 250 300 350 400 1000 2000 3000 4000 5000 6000 7000 8000 9000 1000 0 11000 12000 13000 1400 0 N ML U P s/ s

Buffer Cache and Page Placement

1GB file write in LD0 before benchmark No file write 2GB working set limit

How to limit FS

buffer cache in

Windows?

(26)

Future plans

ƒ Test of Ansys CFX 11

ƒ Test of StarCD

ƒ Test of Allinea DDTLite for Visual Studio

ƒ WalbErla (CFD) development and evaluation of Windows Performance (see Projects)

ƒ Customized Excel sheets for standard production applications (see above)

(27)

Conclusions

ƒ “Well-known” development environment with HPC add-ons

ƒ Batch system/scheduler is not “enterprise-class”

ƒ Ease of use (develop/compile/debug/job submit)

ƒ Room for improvement with parallel debugging

ƒ Similar ccNUMA issues as with Linux, same remedies ƒ Process/thread pinning absolutely essential

ƒ CCS VBA API is extremely fun to play with ƒ May be attractive to production-only users

ƒ Still lacks some coherent documentation

References

Related documents

 Continuing and diversified education with experience in: Systems Engineering and Administration; Project Management; Business Operations, Communications, and

Microsoft, Microsoft Windows, Active Directory, ActiveSync, Internet Explorer, Windows Mobile, Windows Server, Windows XP, SQL Server, Windows XP Tablet PC Edition and Windows

Reach 4 utilizes Microsoft’s integrated range of server products, using handheld devices running Microsoft Windows Mobile and a server network based on Microsoft Windows Server

Describe how Microsoft SQL Server 2000 failover clustering works with Microsoft Windows Server 2003 Cluster service to help maximize availability.. Discuss how SQL Server

Configuring Windows Compute Cluster Server 2003 involves installing the operating system on the head node, joining it to an existing Active Directory domain, and then installing

NOTE: On certain Operating Systems such as Windows Vista and Windows 7 you will need to launch the command prompt (cmd.exe) as an administrator otherwise you may get an

The analysis identifies and documents the common variations of the routine process. Analysis of the routine process and variations can be used to develop a template for the

This chapter puts emphasis on the using of methodical series to calculate manually by means of using computer tools such as Microsoft Excel and later on making source codes