TORQUE is an open source resource manager providing control over batch jobs and distributed compute nodes. It is a community effort based on the original PBS project and, with more than 1,200 patches, has incorporated significant advances in the areas of scalability, fault tolerance, and feature extensions contributed by NCSA, OSC, USC, the U.S. Dept of Energy, Sandia, PNNL, U of Buffalo, TeraGrid, and many other leading-edge HPC organizations.
Torque is the logical successor to PBS/OpenPBS. I had prior experience with Condor back when open source PBS was just coming around. Now that PBS is kickass, it makes sense to just use Torque and the myriad schedulers available for it.
Install Torque into /usr/local.
$ pwd
/home/cluster/packages/torque/torque-1.2.0p4
$ ./configure --prefix=/usr/local --with-scp
The spool directory goes to /usr/spool/PBS by default. I am not sure whether this is the ideal location but it works well for me.
Be sure to put in a start/stop script for pbs_server and pbs_sched after the install. pbs_sched needs to be disabled before switching to Maui. For testing basic functionality of the grid, pbs_sched works well.
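A start/stop script along these lines might look like the sketch below. It assumes the --prefix=/usr/local install above, so the daemons are expected under /usr/local/sbin and qterm under /usr/local/bin. The pbs_ctl and run helpers, the DRY_RUN switch, and the use of pkill to stop pbs_sched are illustrative choices, not anything Torque ships. Adjust to your distribution's init conventions.

```shell
#!/bin/sh
# Minimal start/stop wrapper for pbs_server and pbs_sched (a sketch).
# PBS_SBIN and PBS_BIN assume the --prefix=/usr/local install.
# Set DRY_RUN=1 to print the commands instead of running them.
PBS_SBIN=${PBS_SBIN:-/usr/local/sbin}
PBS_BIN=${PBS_BIN:-/usr/local/bin}

# Hypothetical helper: run a command, or just echo it in dry-run mode.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: dispatch on the usual init-script actions.
pbs_ctl() {
    case "$1" in
        start)
            run "$PBS_SBIN/pbs_server"
            run "$PBS_SBIN/pbs_sched"
            ;;
        stop)
            # Assumes pkill is available; any other way of signalling
            # the scheduler process would do.
            run pkill -x pbs_sched
            # qterm asks the server to shut down.
            run "$PBS_BIN/qterm" -t quick
            ;;
        restart)
            pbs_ctl stop
            pbs_ctl start
            ;;
        *)
            echo "Usage: pbs_ctl {start|stop|restart}" >&2
            return 1
            ;;
    esac
}

# Act only when an action is given on the command line.
if [ "$#" -gt 0 ]; then
    pbs_ctl "$@"
fi
```

When switching to Maui, the two pbs_sched lines are the ones to drop, so that only pbs_server is managed here.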
SERVERHOST opterome.ncbs.res.in
ALLOWCOMPUTEHOSTSUBMIT true
node1 np=4
node2 np=4
node3 np=4
node4 np=4
node5 np=4
node6 np=4
node7 np=4
node8 np=4
node9 np=4
node10 np=4
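Since the node entries follow a regular pattern, they can be generated rather than typed. The sketch below prints the same ten lines. gen_nodes is a hypothetical helper, and the node count and np value are simply the ones used on this cluster.

```shell
#!/bin/sh
# Print one "<name> np=<cores>" line per compute node.
# gen_nodes is a hypothetical helper, not part of Torque.
gen_nodes() {
    count=$1   # number of compute nodes
    cores=$2   # processors per node (the np= value)
    i=1
    while [ "$i" -le "$count" ]; do
        printf 'node%d np=%d\n' "$i" "$cores"
        i=$((i + 1))
    done
}

# Ten nodes with four processors each, matching the entries listed here.
gen_nodes 10 4
```

Redirect the output into the server's nodes file. In a stock layout that is server_priv/nodes under the spool directory, but check where your install keeps it.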
master
define command{
    command_name    check_mom
    command_line    $USER1$/check_tcp -H $HOSTADDRESS$ -p 15002
}

define service{
    use                     generic-service
    host_name               nodeN
    service_description     MOM
    is_volatile             0
    check_period            24x7
    max_check_attempts      3
    normal_check_interval   5
    retry_check_interval    1
    contact_groups          cluster-admins
    notification_interval   120
    notification_period     24x7
    notification_options    c,r
    check_command           check_mom
}
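The nodeN in the service definition is a placeholder, so one definition is needed per compute node. A loop like the following sketch can emit them. gen_mom_services is a hypothetical helper, the node1..node10 names are assumed from the nodes listing, and only a few directives are shown. The rest would need to be added to the here-document or supplied by the generic-service template.

```shell
#!/bin/sh
# Emit one Nagios "MOM" service definition per compute node.
# gen_mom_services is a hypothetical helper; node names are assumed
# to follow the node1..nodeN pattern.
gen_mom_services() {
    count=$1
    i=1
    while [ "$i" -le "$count" ]; do
        cat <<EOF
define service{
    use                     generic-service
    host_name               node$i
    service_description     MOM
    check_command           check_mom
}

EOF
        i=$((i + 1))
    done
}

# One block for each of the ten nodes.
gen_mom_services 10
```

Nagios also accepts a comma-separated list in host_name, which avoids the repetition if all nodes share identical settings.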