Maui

From shankerbalan.net

Maui Cluster Scheduler, the precursor to Moab Cluster Suite®, is an open source job scheduler for clusters and supercomputers. It is an optimized, configurable tool capable of supporting an array of scheduling policies, dynamic priorities, extensive reservations, and fairshare capabilities. It is currently in use at hundreds of government, academic, and commercial sites throughout the world. All of the capabilities found in Maui are also found in Moab, while Moab has added features including virtual private clusters, basic trigger support, graphical administration tools, and a Web-based user portal.

Why Maui?

I had two choices for a batch queuing system: Condor and PBS (Torque). Having deployed Condor on a previous cluster, I went with Torque on the next one just for kicks. Maui happened to be one of the popular schedulers used with PBS.

Maui was the default choice due to its easy integration. After having it in production for over 3 months, I am quite happy with it. Torque+Maui is, for the moment, my default combination for putting a cluster together.

Getting Started

The cluster should already have Torque configured with its default scheduler, pbs_sched. pbs_sched is a no-frills scheduler and requires minimal configuration changes. Have this working before switching to the Maui scheduler.

OS Environment

  • Fedora Core 4 running on AMD Opteron x86_64 hardware (fully updated)
  • Torque 1.2.0p4 installed to /usr/spool/PBS
  • MAUI 3.2.6p13 installed to /usr/local/maui

Compiling and Installation

Install from source under /usr/local. The default ./configure options work well here. Create a start/stop script for Maui in /etc/init.d/.

Configuration

maui.cfg

# maui.cfg 3.2.6p13

#SERVERMODE TEST
SERVERHOST            master.XXX
# primary admin must be first in list
ADMIN1                maui cluster

# Resource Manager Definition

RMCFG[XXX] TYPE=PBS TIMEOUT=90

# Allocation Manager Definition

AMCFG[bank]  TYPE=NONE

# full parameter docs at http://clusterresources.com/mauidocs/a.fparameters.html
# use the 'schedctl -l' command to display current configuration
RMPOLLINTERVAL        00:00:30
SERVERPORT            42559
SERVERMODE            NORMAL

# Admin: http://clusterresources.com/mauidocs/a.esecurity.html

LOGFILE               maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              1
LOGFILEROLLDEPTH      7

# Job Priority: http://clusterresources.com/mauidocs/5.1jobprioritization.html

QUEUETIMEWEIGHT       1

# FairShare: http://clusterresources.com/mauidocs/6.3fairshare.html

#FSPOLICY              PSDEDICATED
#FSDEPTH               7
#FSINTERVAL            86400
#FSDECAY               0.80

# Throttling Policies: http://clusterresources.com/mauidocs/6.2throttlingpolicies.html

# NONE SPECIFIED

# Backfill: http://clusterresources.com/mauidocs/8.2backfill.html

BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST

# Node Allocation: http://clusterresources.com/mauidocs/5.2nodeallocation.html

NODEALLOCATIONPOLICY  MINRESOURCE

# QOS: http://clusterresources.com/mauidocs/7.3qos.html

# QOSCFG[hi]  PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
# QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE

# Standing Reservations: http://clusterresources.com/mauidocs/7.1.3standingreservations.html

# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test]   17:00:00
# SRDAYS[test]      MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test]   0:30:00

# Creds: http://clusterresources.com/mauidocs/6.1fairnessoverview.html

# USERCFG[DEFAULT]      FSTARGET=25.0
# USERCFG[john]         PRIORITY=100  FSTARGET=10.0-
# GROUPCFG[staff]       PRIORITY=1000 QLIST=hi:low QDEF=hi
# CLASSCFG[batch]       FLAGS=PREEMPTEE
# CLASSCFG[interactive] FLAGS=PREEMPTOR

#NODEMAXLOAD    10.00

USERWEIGHT     1
USERCFG[cluster]  PRIORITY=300

NODECFG[node1]          MAXLOAD=10 MAXJOB=4
NODECFG[node2]          MAXLOAD=10 MAXJOB=4
NODECFG[node3]          MAXLOAD=10 MAXJOB=4
NODECFG[node4]          MAXLOAD=10 MAXJOB=4
NODECFG[node5]          MAXLOAD=10 MAXJOB=4
NODECFG[node6]          MAXLOAD=10 MAXJOB=4
NODECFG[node7]          MAXLOAD=10 MAXJOB=4
NODECFG[node8]          MAXLOAD=10 MAXJOB=4
NODECFG[node9]          MAXLOAD=10 MAXJOB=4
NODECFG[node10]         MAXLOAD=10 MAXJOB=4
#NODEAVAILABILITYPOLICY UTILIZED
#NODEACCESSPOLICY       SHARED
MAXJOBPERUSERPOLICY     40
MAXJOBPERGROUPPOLICY    40
MAXJOBPERACCOUNTPOLICY  40

MAXJOBPERUSERCOUNT      20
MAXJOBPERGROUPCOUNT     20
MAXJOBPERACCOUNTCOUNT   20

SMAXJOBPERUSERCOUNT     40
SMAXJOBPERGROUPCOUNT    40
SMAXJOBPERACCOUNTCOUNT  40

DEFERTIME               0

Tuning

Increasing Load on Nodes

By default, Maui starts N processes on each node, where N equals the CPU count. Sometimes it is desirable to run more processes per node to make better use of node resources such as memory and CPU time.

Increase this limit only after carefully studying the average utilization of CPU, disk, RAM and swap. Use SNMP-based graphing and monitoring tools such as Cacti to keep track of these system parameters.
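For a quick spot check on a single node, before any graphing is in place, something like the following minimal sketch can print the load averages next to a chosen ceiling. The `load_headroom` helper and the `max_load` value of 10 are illustrative assumptions (the value simply mirrors the MAXLOAD setting used on this page). It is not part of Maui or Torque.

```python
import os


def load_headroom(load_avg, max_load):
    """Return how much load average remains before max_load is reached."""
    return max_load - load_avg


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    one, five, fifteen = os.getloadavg()
    cpus = os.cpu_count() or 1
    max_load = 10.0  # assumption: mirrors MAXLOAD=10 from maui.cfg
    print(f"cpus={cpus} load1={one:.2f} load5={five:.2f} load15={fifteen:.2f}")
    print(f"headroom against max_load: {load_headroom(fifteen, max_load):.2f}")
```

A single reading says little on its own. The long-term graphs are still what the limit should be based on.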

The cfg options below permit each node (node1 to node10) to run a maximum of four jobs (2 jobs per CPU), subject to a maximum system load of 10. This makes the best use of node resources in my setup.

NODECFG[node1]          MAXLOAD=10 MAXJOB=4
NODECFG[node2]          MAXLOAD=10 MAXJOB=4
NODECFG[node3]          MAXLOAD=10 MAXJOB=4
NODECFG[node4]          MAXLOAD=10 MAXJOB=4
NODECFG[node5]          MAXLOAD=10 MAXJOB=4
NODECFG[node6]          MAXLOAD=10 MAXJOB=4
NODECFG[node7]          MAXLOAD=10 MAXJOB=4
NODECFG[node8]          MAXLOAD=10 MAXJOB=4
NODECFG[node9]          MAXLOAD=10 MAXJOB=4
NODECFG[node10]         MAXLOAD=10 MAXJOB=4
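Typing one NODECFG line per node gets tedious on larger clusters. As a rough sketch, the lines above could be generated with a few lines of Python. The `nodecfg_lines` function and its parameters are hypothetical names chosen for this example, and it assumes nodes are named node1..nodeN as on this cluster.

```python
def nodecfg_lines(count, max_load=10, max_job=4, prefix="node"):
    """Build one NODECFG line per node, named prefix1..prefixN."""
    lines = []
    for i in range(1, count + 1):
        key = f"NODECFG[{prefix}{i}]"
        # Pad the key so the settings line up in a column, as in maui.cfg above.
        lines.append(f"{key:<24}MAXLOAD={max_load} MAXJOB={max_job}")
    return lines


if __name__ == "__main__":
    # Paste the printed lines into maui.cfg.
    print("\n".join(nodecfg_lines(10)))
```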

Resuming Jobs

After a scheduler restart, Maui puts all previously submitted jobs into a DEFER state, which lasts one day by default, IIRC. Newly submitted jobs get scheduled but the old ones don't. To run the queued jobs immediately after a restart, add the line below:

DEFERTIME               0

Resource Sharing

Maui takes a first come, first served approach to job execution. This is undesirable because a user who has submitted 1000s of jobs will block a user with 10 jobs. The 10 jobs have to wait until the first 1000 jobs have completed.

The scheduler allows rather elaborate control over fair sharing of system resources. The lines below try to achieve the following:

  1. All users have the same weight
  2. User cluster has a high priority, so its jobs get executed sooner
  3. Set the maxjobs per user hard limit to 40
  4. Set the maxjobs per user soft limit to 20

This allows other users to squeeze their jobs into the run queue.

USERWEIGHT              1
USERCFG[cluster]        PRIORITY=300

MAXJOBPERUSERPOLICY     40
MAXJOBPERGROUPPOLICY    40
MAXJOBPERACCOUNTPOLICY  40

MAXJOBPERUSERCOUNT      20
MAXJOBPERGROUPCOUNT     20
MAXJOBPERACCOUNTCOUNT   20

SMAXJOBPERUSERCOUNT     40
SMAXJOBPERGROUPCOUNT    40
SMAXJOBPERACCOUNTCOUNT  40
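To see why a per-user cap helps, here is a toy model of a single scheduling pass. It is a deliberately simplified sketch and not Maui's actual algorithm. The `start_jobs` function, the user names alice and bob, and the figure of 40 slots (10 nodes times MAXJOB=4) are assumptions made for illustration only.

```python
def start_jobs(queue, slots, per_user_cap):
    """Pick jobs to start in FIFO order, skipping users already at their cap."""
    running = {}   # user -> number of jobs started in this pass
    started = []
    for user, job_id in queue:
        if len(started) == slots:
            break  # no free slots left
        if running.get(user, 0) >= per_user_cap:
            continue  # this user is at the cap; look further down the queue
        running[user] = running.get(user, 0) + 1
        started.append((user, job_id))
    return started


# alice submits 1000 jobs first, then bob submits 10.
queue = [("alice", i) for i in range(1000)] + [("bob", i) for i in range(10)]

no_cap = start_jobs(queue, slots=40, per_user_cap=len(queue))
capped = start_jobs(queue, slots=40, per_user_cap=20)

print("bob jobs started without a cap:", sum(1 for u, _ in no_cap if u == "bob"))
print("bob jobs started with a cap of 20:", sum(1 for u, _ in capped if u == "bob"))
```

Without a cap, alice's jobs take every slot and bob waits. With a cap of 20, alice holds 20 slots and all 10 of bob's jobs start in the same pass.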

Monitoring

Nagios

checkcommands.cfg

define command{
       command_name    check_maui
       command_line    $USER1$/check_tcp -H $HOSTADDRESS$ -p 42559
}

services.cfg

define service{
       use                             generic-service
       host_name                       master
       service_description             MAUI
       is_volatile                     0
       check_period                    24x7
       max_check_attempts              3
       normal_check_interval           5
       retry_check_interval            1
       contact_groups                  cluster-admins
       notification_interval           120
       notification_period             24x7
       notification_options            c,r
       check_command                   check_maui
       }
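The check above only verifies that something accepts TCP connections on the Maui server port. When debugging from a shell without the Nagios plugins at hand, a rough equivalent can be sketched in Python. The `port_is_open` function is a hypothetical helper written for this page, not part of Nagios or Maui. The port 42559 comes from SERVERPORT in maui.cfg, and the host defaults to 127.0.0.1 unless one is given on the command line.

```python
import socket
import sys


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    host = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    state = "OK" if port_is_open(host, 42559) else "CRITICAL"
    print(f"MAUI {state}: {host}:42559")
```

Like check_tcp, this says nothing about whether the scheduler is actually scheduling, only that the port answers.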
