Torque

From shankerbalan.net

TORQUE is an open source resource manager providing control over batch jobs and distributed compute nodes. It is a community effort based on the original *PBS project and, with more than 1,200 patches, has incorporated significant advances in the areas of scalability, fault tolerance, and feature extensions contributed by NCSA, OSC, USC, the U.S. Dept of Energy, Sandia, PNNL, U of Buffalo, TeraGrid, and many other leading-edge HPC organizations.

Overview

Torque is the logical successor to PBS/OpenPBS. I had prior experience with Condor from when open source PBS was just coming around. Now that PBS is kickass, it makes sense to just use Torque and the myriad schedulers available for it.

Installation

OS Environment

  • Fedora Core on x86_64 (fully updated)
  • Maui 3.2.6p13
  • Key-based SSH already set up from the master to the nodes and between the nodes themselves.
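
A quick way to confirm the key-based SSH setup before installing anything is to loop over the nodes in batch mode, so that a node which still prompts for a password fails instead of hanging. This is a minimal sketch: the function name check_ssh is made up for illustration, and it only tests logins from the host where it runs, not between the nodes themselves.

```shell
# check_ssh HOST...: report which hosts accept a non-interactive SSH login.
# Hypothetical helper, not part of Torque or OpenSSH.
check_ssh() {
    failed=0
    for host in "$@"; do
        # BatchMode=yes disables password prompts, so a missing key is an error.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "ok      $host"
        else
            echo "FAILED  $host"
            failed=1
        fi
    done
    # Non-zero if at least one host could not be reached.
    return "$failed"
}
```

Running `check_ssh node1 node2 node3` from the master, and then again from one of the nodes, should cover both directions mentioned above.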

Compile / Install

Install Torque into /usr/local.

$ pwd
/home/cluster/packages/torque/torque-1.2.0p4
$ ./configure --prefix=/usr/local --with-scp
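
The configure line above is only the first step. Torque uses the usual autoconf workflow, so the rest of the build is presumably the standard make followed by make install. The wrapper below is a sketch of the whole sequence; the function name build_torque is hypothetical, and make install will typically need root when the prefix is /usr/local.

```shell
# build_torque SRC_DIR [PREFIX]: configure, build, and install from an unpacked source tree.
# Hypothetical wrapper; the configure flags are the ones used on this page.
build_torque() {
    src="$1"
    prefix="${2:-/usr/local}"
    (
        # Run in a subshell so the caller's working directory is left alone.
        cd "$src" || exit 1
        ./configure --prefix="$prefix" --with-scp &&
        make &&
        make install
    )
}
```

For the tree shown above that would be `build_torque /home/cluster/packages/torque/torque-1.2.0p4`.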

The spool directory defaults to /usr/spool/PBS. I am not sure whether this is the ideal location, but it works well for me.

Be sure to put start/stop scripts in place for pbs_server and pbs_sched after the install. pbs_sched works well for testing basic functionality of the grid, but it needs to be disabled before switching to Maui, since Maui takes over the scheduler's role.
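
A start/stop wrapper along these lines is roughly what I mean. It is a sketch rather than a full init script, and it makes a few assumptions: the daemons are in /usr/local/sbin and qterm is in /usr/local/bin (which is where --prefix=/usr/local should put them), the function name pbs_ctl and the PBS_USE_SCHED switch are made up, and pbs_mom on the compute nodes is not handled here.

```shell
# Assumed install locations for --prefix=/usr/local; override if yours differ.
PBS_SBIN="${PBS_SBIN:-/usr/local/sbin}"
PBS_BIN="${PBS_BIN:-/usr/local/bin}"
# Hypothetical switch: set to "no" once Maui replaces pbs_sched.
PBS_USE_SCHED="${PBS_USE_SCHED:-yes}"

# pbs_ctl {start|stop}: start or stop pbs_server and, optionally, pbs_sched.
pbs_ctl() {
    case "$1" in
        start)
            "$PBS_SBIN/pbs_server" || return 1
            if [ "$PBS_USE_SCHED" = "yes" ]; then
                "$PBS_SBIN/pbs_sched" || return 1
            fi
            ;;
        stop)
            if [ "$PBS_USE_SCHED" = "yes" ]; then
                # pbs_sched has no shutdown client here, so signal it by exact name.
                pkill -x pbs_sched
            fi
            # qterm asks pbs_server to shut down; "quick" leaves running jobs alone.
            "$PBS_BIN/qterm" -t quick
            ;;
        *)
            echo "Usage: pbs_ctl {start|stop}" >&2
            return 1
            ;;
    esac
}
```

Dropped into a script that ends with `pbs_ctl "$1"`, this can be called from whatever init mechanism the distribution uses. Setting PBS_USE_SCHED=no is one way to keep pbs_sched out of the picture after moving to Maui.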

Configuration

torque.cfg

SERVERHOST             opterome.ncbs.res.in
ALLOWCOMPUTEHOSTSUBMIT true

nodes

node1 np=4
node2 np=4
node3 np=4
node4 np=4
node5 np=4
node6 np=4
node7 np=4
node8 np=4
node9 np=4
node10 np=4
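
Since every entry follows the same pattern, the nodes file can be generated instead of typed. The sketch below assumes the nodes are named node1 through nodeN and all have the same processor count; gen_nodes is a made-up helper name.

```shell
# gen_nodes COUNT NP: print one "nodeN np=NP" line per node, numbered from 1.
# Hypothetical helper; assumes uniform node names and processor counts.
gen_nodes() {
    count="$1"
    np="$2"
    i=1
    while [ "$i" -le "$count" ]; do
        echo "node$i np=$np"
        i=$((i + 1))
    done
}
```

`gen_nodes 10 4` reproduces the listing above. The output would normally be redirected to server_priv/nodes under the spool directory, which on this setup I would expect to be /usr/spool/PBS/server_priv/nodes.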

server_name

master

Monitoring

Nagios

checkcommands.cfg

define command{
       command_name    check_mom
       command_line    $USER1$/check_tcp -H $HOSTADDRESS$ -p 15002
}
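
The check is nothing more than a TCP connect to the MOM port, so it is easy to try by hand before wiring it into Nagios. The function below is a rough stand-in for check_tcp, not a replacement: it assumes bash (for /dev/tcp) and the coreutils timeout command are available, the name check_mom_port is made up, and the exit codes follow the Nagios plugin convention of 0 for OK and 2 for CRITICAL.

```shell
# check_mom_port HOST [PORT]: try a TCP connect to pbs_mom and report Nagios-style.
# Hypothetical helper; relies on bash's /dev/tcp and coreutils timeout.
check_mom_port() {
    host="$1"
    port="${2:-15002}"
    # The inner bash exits non-zero if the connect fails; timeout caps the wait at 5s.
    if timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
        echo "MOM OK - $host:$port is accepting connections"
        return 0
    fi
    echo "MOM CRITICAL - $host:$port is not reachable"
    return 2
}
```

For example, `check_mom_port node1` tests the same port the Nagios command uses.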

services.cfg

define service{
       use                             generic-service
       host_name                       nodeN
       service_description             MOM
       is_volatile                     0
       check_period                    24x7
       max_check_attempts              3
       normal_check_interval           5
       retry_check_interval            1
       contact_groups                  cluster-admins
       notification_interval           120
       notification_period             24x7
       notification_options            c,r
       check_command                   check_mom
       }
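
nodeN in the definition above is a placeholder. Nagios accepts a comma-separated list in host_name, so a single service definition can cover every node instead of one copy per host. A small sketch for building that list, assuming the same node1 through nodeN naming as the nodes file (node_list is a made-up helper name):

```shell
# node_list COUNT: print "node1,node2,...,nodeCOUNT" for use as a host_name value.
# Hypothetical helper; assumes nodes are numbered from 1 without gaps.
node_list() {
    count="$1"
    i=1
    list=""
    while [ "$i" -le "$count" ]; do
        # Add a comma only when the list already has an entry.
        list="${list:+$list,}node$i"
        i=$((i + 1))
    done
    echo "$list"
}
```

`node_list 3` prints `node1,node2,node3`; the output of `node_list 10` can be pasted in place of nodeN.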
