Birmingham GRID integration into the BlueBEAR System

This page documents the changes to the BlueBEAR setup that were needed to implement GRID and that required admin privileges to put into effect, together with the changes made on the Grid CE side to match the BB configuration. Most of this work was done in January-March 2009.

Changes required to allow job submission network calls from outside the cluster

To support Grid job submission from the Computing Element (CE) in Physics, machines there must be able to talk to the qmaster on BlueBEAR (BB). The BB qmaster, however, has no external IP address (in the 147.188 network). The BB export server, on the other hand, does have an external IP address, and already has networking rules that allow worker clients to access the outside world, so it is a small extension to let it act as the communication path for job-related network requests.

Preliminary tests using Shorewall were not successful: Alex of ClusterVision took the view that Shorewall had bugs and could not generate the right raw iptables rules. (This may change: if a future version of Shorewall can reproduce the additional rules that we have implemented as an add-on, then it can take over that job.) For the present, we have additional raw iptables rules, tuned to work with the Particle Physics subnets but easily reconfigured, which are run after the Shorewall service has started, as follows:

On the export server, /etc/rc.d/rc.local contains the extra lines:

# Perform post-shorewall fixups for routing to qmaster, if required
if [ -x /root/nat-qmaster.sh ]; then /root/nat-qmaster.sh; fi

And on that server, /root/nat-qmaster.sh contains the lines:

# L.S.Lowe at bham.ac.uk
# Rules for forwarding torque/moab packets: these rules are on bbexport only, applied after shorewall etc.
# Torque/Moab packets from campus to bbexport are translated to have a destination address of qmaster.
# Implicitly, Torque/Moab packets from qmaster to campus are translated to have a source address of bbexport.
# We assume that the route on qmaster for packets to 147.188.0.0/16 is via gateway 10.141.245.101 (bbexport).
# Note: bbexport is (147.188.126.18 and 10.14x.245.101) and qmaster is (10.14x.255.250) where x is 1 and 3.
# These settings allow a communication path for any client within the PREROUTING source mask:
# the pbs_server itself is configured (via qmgr) to narrow-down which of those clients it will accept or reject.

src=147.188.46.0/23
dst=10.141.255.250
port1=15001
port2=42559
ext=eth2
set -u

if [ $# -eq 0 ] && /sbin/iptables -t nat -n -L PREROUTING | grep -q "dpt:$port1"; then exit 0; fi # exit if done already (any argument forces re-application)

/sbin/iptables -t nat -I PREROUTING -p tcp -i $ext -s $src --dport $port1 -j DNAT --to-destination $dst
/sbin/iptables -t nat -I PREROUTING -p tcp -i $ext -s $src --dport $port2 -j DNAT --to-destination $dst

/sbin/iptables -t filter -I FORWARD -p tcp -i $ext -s $src -d $dst --dport $port1 -j ACCEPT
/sbin/iptables -t filter -I FORWARD -p tcp -i $ext -s $src -d $dst --dport $port2 -j ACCEPT
/sbin/iptables -t filter -I FORWARD -p tcp -o $ext -d $src -s $dst --sport $port1 -j ACCEPT
/sbin/iptables -t filter -I FORWARD -p tcp -o $ext -d $src -s $dst --sport $port2 -j ACCEPT
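
The script above repeats the same three rules for each of the two ports. As a rough sketch of how that rule set could be produced for any list of ports, the function below prints the commands instead of running them, so they can be reviewed first. The name emit_forward_rules is hypothetical and is not part of nat-qmaster.sh.

```shell
# Hypothetical helper (not part of nat-qmaster.sh): print, rather than run,
# the iptables commands needed for each forwarded port.
emit_forward_rules() {
    local src="$1" dst="$2" ext="$3" port
    shift 3
    for port in "$@"; do
        # Rewrite the destination of incoming packets to the qmaster
        echo "/sbin/iptables -t nat -I PREROUTING -p tcp -i $ext -s $src --dport $port -j DNAT --to-destination $dst"
        # Allow the forwarded request and the matching reply
        echo "/sbin/iptables -t filter -I FORWARD -p tcp -i $ext -s $src -d $dst --dport $port -j ACCEPT"
        echo "/sbin/iptables -t filter -I FORWARD -p tcp -o $ext -d $src -s $dst --sport $port -j ACCEPT"
    done
}

# Same values as in the script above
emit_forward_rules 147.188.46.0/23 10.141.255.250 eth2 15001 42559
```

Printing the commands keeps the sketch harmless on any machine; the output would only be run as root on the export server once it has been checked.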

On the qmaster server, we need packets returning to campus machines, in particular the machines in Particle Physics, to go via that same export server, so we have the following additional lines in /etc/rc.d/rc.local:

# L.S.Lowe. For qsub/qstat/showq for grid submission from Physics,
# so packets NATted from export to qmaster return on that same path.
# This could be defined instead in route-eth0, but it's here for now:
route add -net 147.188.46.0 netmask 255.255.254.0 gw 10.141.245.101
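
The iptables rules on the export server describe the Particle Physics subnets as 147.188.46.0/23, while the route here uses the netmask 255.255.254.0. The two must describe the same network for replies to follow the NATted path. The following is a minimal sketch of the conversion, useful when changing the source range. The function name prefix_to_netmask is hypothetical, and the arithmetic assumes a shell with 64-bit integers (as in current bash and dash).

```shell
# Hypothetical helper: convert a CIDR prefix length (0-32) to a dotted netmask.
prefix_to_netmask() {
    local prefix="$1" mask
    # Keep only the top $prefix bits of a 32-bit word set
    mask=$(( (0xFFFFFFFF << (32 - $prefix)) & 0xFFFFFFFF ))
    printf '%d.%d.%d.%d\n' \
        $(( ($mask >> 24) & 255 )) $(( ($mask >> 16) & 255 )) \
        $(( ($mask >> 8) & 255 ))  $(( $mask & 255 ))
}

prefix_to_netmask 23    # prints 255.255.254.0
```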

Changes to the Torque setup

The above section deals with getting external packets in and out of the qmaster. This section deals with the additional requirements in the Torque configuration itself.

The grid nodes (at the time of writing) are u4n081-u4n128. These are allocated to queues using the acl_hosts technique in Torque's qmgr setup, just as is done for other existing queues. (Note that acl_host_enable is set to False in the definitions below: with that setting the acl_hosts list serves as the set of nodes the queue maps to, rather than as a restriction on submitting hosts.) There are two grid queues, glong and gshort, with different walltime and cput limits. The glong queue has access to fewer nodes than gshort (two fewer, at the time of writing), the idea being that gshort jobs will thereby get a better turnround. The queue is selected by the external request broker, so there is currently no need for a routing queue to feed these two queues. Both queues use acl_groups settings to limit who can submit to them, which prevents local users from submitting to those queues.

# Create and define queue glong
create queue glong
set queue glong queue_type = Execution
set queue glong max_user_queuable = 5000
set queue glong resources_max.cput = 48:00:00
set queue glong resources_max.walltime = 72:00:00
set queue glong resources_default.cput = 48:00:00
set queue glong resources_default.walltime = 72:00:00
set queue glong resources_default.nodes = 1:ppn=1
set queue glong resources_default.pmem = 1996mb
set queue glong enabled = True
set queue glong started = True
set queue glong acl_group_enable = True
set queue glong acl_groups = g-atlas
set queue glong acl_groups += g-atlasp
set queue glong acl_groups += g-atlass
... etc ...
set queue glong acl_host_enable = False
set queue glong acl_hosts = u4n083.cvos.cluster
set queue glong acl_hosts += u4n084.cvos.cluster
set queue glong acl_hosts += u4n085.cvos.cluster
... etc ...

# Create and define queue gshort
create queue gshort
set queue gshort queue_type = Execution
set queue gshort max_user_queuable = 5000
set queue gshort resources_max.cput = 00:20:00
set queue gshort resources_max.walltime = 00:30:00
set queue gshort resources_default.cput = 00:20:00
set queue gshort resources_default.walltime = 00:30:00
set queue gshort resources_default.nodes = 1:ppn=1
set queue gshort resources_default.pmem = 1996mb
set queue gshort enabled = True
set queue gshort started = True
set queue gshort acl_group_enable = True
set queue gshort acl_groups = g-atlas
set queue gshort acl_groups += g-atlasp
set queue gshort acl_groups += g-atlass
... etc ...
set queue gshort acl_host_enable = False
set queue gshort acl_hosts = u4n081.cvos.cluster
set queue gshort acl_hosts += u4n082.cvos.cluster
set queue gshort acl_hosts += u4n083.cvos.cluster
... etc ... 
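
The acl_hosts lists above are long and repetitive. As a sketch only, a helper like the following could generate the directives for a contiguous node range. The name emit_acl_hosts is hypothetical, and the range in the example simply reproduces the three glong lines shown above rather than the full list.

```shell
# Hypothetical helper: print qmgr directives assigning nodes u4nNNN, for
# NNN = first..last, to a queue. Pass the numbers without leading zeros
# (83, not 083), since a leading 0 would be read as octal.
emit_acl_hosts() {
    local queue="$1" n="$2" last="$3" op="="
    while [ "$n" -le "$last" ]; do
        printf 'set queue %s acl_hosts %s u4n%03d.cvos.cluster\n' "$queue" "$op" "$n"
        op="+="              # every host after the first is appended
        n=$(( $n + 1 ))
    done
}

emit_acl_hosts glong 83 85
```

The output is plain text in the same form as the listing above, so it can be inspected or compared with the existing configuration before anything is passed to qmgr.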

For particular hosts to be accepted as valid submitters of jobs, they must be made known to Torque. There are two ways to do this within Torque. One is via qmgr:

set server submit_hosts += cename

The other method is to add a line for the submit host to the file /etc/hosts.equiv on the server running pbs_server, namely qmaster. I did the latter, so our CE which handles BB jobs is listed in that file.
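
If the hosts.equiv route is taken, a small idempotent helper avoids duplicate entries when the step is repeated. This is a sketch under assumptions: the function name add_equiv_host is hypothetical, and the file is passed as a parameter so that it can be tried on a scratch copy first.

```shell
# Hypothetical helper: append a host name to a hosts.equiv-style file
# unless an identical line is already present.
add_equiv_host() {
    local file="$1" host="$2"
    # -x: match the whole line, -F: fixed string (dots are not wildcards)
    if ! grep -qxF -- "$host" "$file" 2>/dev/null; then
        echo "$host" >> "$file"
    fi
}

# Usage on qmaster, as root (cename is a placeholder, as in the qmgr line above):
#   add_equiv_host /etc/hosts.equiv cename
```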

In order for information subsystems on the CE to be able to issue certain privileged sorts of Torque commands, they need to be added as operators in the Torque sense, using qmgr (Note added later: this needs to be reviewed, because more recent CE setups seem to work perfectly well without it):

set server operators += edginfo@cename 
set server operators += edguser@cename
set server operators += rgma@cename

Changes to the Torque prologue/epilogue

Some changes to the Torque prologue and epilogue provided by LSL were already in place in order to give non-grid users well-formatted additional information at start and end of job.

Further changes were added for Grid simply to ensure that the BB system is not accidentally or deliberately abused by local users or grid users. The prologue checks that a job submitted to a grid queue has a grid unix group, and vice versa. If a job is found in the wrong queue, the prologue cancels it without retry. In practice it is not possible (with our setup) for a grid job submitted via WMS to end up in a non-grid queue, or for a local BB user to submit to a grid queue (because of the use of acl_groups), so this is largely belt and braces. This code would also be superfluous if all BB queues (that is, including non-grid ones) used acl_groups to limit the userids/groups which could submit to them, but that would be an extra burden on the BB administrators, who would have to keep pace with all the different non-grid groups.

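# Check queue and user/group and disallow invalid combinations. Also see qmgr acl_groups.
# Prologue arguments: $3 is the job owner's group name, $6 is the queue name.
case "$6" in
g*) # grid queue: only grid groups (g-*) are allowed
    case "$3" in g-*) : ;; ?*) echo "Invalid job queue for non-grid user"; exit 1;; esac;;
?*) # any other queue: grid groups are not allowed
    case "$3" in g-*) echo "Invalid job queue for NGS/WLCG user"; exit 1;; ?*) : ;; esac;;
esac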
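
Because the prologue only runs inside Torque, the check is awkward to try out in isolation. The following stand-alone sketch wraps the same pattern matching in a function that takes the queue and group as arguments. The name check_queue_group is hypothetical, and the group "users" and queue "batch" in the examples are made up for illustration. Note that, as in the prologue, any queue whose name begins with "g" is treated as a grid queue.

```shell
# Hypothetical stand-alone version of the prologue check.
# Returns 0 if the queue/group combination is allowed, 1 otherwise.
check_queue_group() {
    local queue="$1" group="$2"
    case "$queue" in
        g*) case "$group" in g-*) return 0 ;; ?*) return 1 ;; esac ;;
        ?*) case "$group" in g-*) return 1 ;; ?*) return 0 ;; esac ;;
    esac
    return 0   # empty queue or group: the prologue takes no action either
}

check_queue_group glong g-atlas && echo "glong/g-atlas: allowed"
check_queue_group glong users   || echo "glong/users: rejected"
check_queue_group batch g-atlas || echo "batch/g-atlas: rejected"
```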

Changes to the Moab setup

In order that information on jobs is available to the CE information system, I believe the following needs to be in effect in the moab.cfg file, or something like it but more specific for the information system userids. The following entry was present anyway on BB as it was required by the Moab Portal (not used by grid). It is perhaps analogous to the recommended entry in a maui.cfg file: ADMIN3 edginfo edguser rgma.

ADMINCFG[4]     USERS=ALL

A request was made to add further lines to the Moab scheduler setup, to apply the appropriate fair-share and job-throttling policies for grid userids/groups. For example:

GROUPCFG[g-atlasp]   FSTARGET=20 MAXPROC=somenumber

Changes to support grid userids and groups

A number of grid users and groups were added to the LDAP database to support GRID. All grid users and groups begin with "g-" to ensure that they can never clash with local userid naming rules. All grid uids and gids are above 100000 for similar reasons. The length of userids was kept within 8 characters, so that the output of commands like qstat and showq continues to look nicely formatted.
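
The three naming rules above are easy to check mechanically before a request is sent off. The function below is a minimal sketch of such a check; the name valid_grid_id is hypothetical, and the numeric id in the example is made up for illustration.

```shell
# Hypothetical validator for the rules described above: the name starts
# with "g-", is at most 8 characters long, and the numeric id is above 100000.
valid_grid_id() {
    local name="$1" id="$2"
    case "$name" in g-?*) ;; *) return 1 ;; esac
    [ "${#name}" -le 8 ] || return 1
    [ "$id" -gt 100000 ] || return 1
    return 0
}

valid_grid_id g-atl001 100101 && echo "g-atl001: ok"
valid_grid_id atlas001 100101 || echo "atlas001: rejected (no g- prefix)"
```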

The IDs are generated by my /home/lcgdata/make-users-groups/makeug* script. Running this script produces three files: /etc/passwd-compatible entries, /etc/group-compatible entries, and a script which uses the useradd/usermod and groupadd/groupmod commands. These can be passed to the BB technical team for them to add to the BB LDAP system. Once the changes have been made, they should be checked to make sure there are no discrepancies between the request and the result.
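
One way to carry out that check is to compare the requested passwd-format entries with what the system reports afterwards. This is a sketch under assumptions: the function name passwd_discrepancies is hypothetical, only the name, uid and gid fields are compared, and the "actual" file is assumed to have been captured separately (for example from getent passwd on a BB node, if enumeration is enabled there).

```shell
# Hypothetical check: list accounts whose name:uid:gid in the requested
# passwd-format file is missing from, or differs in, the actual one.
# Prints nothing when the two agree.
passwd_discrepancies() {
    local requested="$1" actual="$2" key
    cut -d: -f1,3,4 "$requested" | while IFS= read -r key; do
        # key is "name:uid:gid"; look for the same triple in the actual data
        if ! cut -d: -f1,3,4 "$actual" | grep -qxF -- "$key"; then
            echo "mismatch or missing: $key"
        fi
    done
}

# Usage (both file names are placeholders):
#   passwd_discrepancies requested.passwd actual.passwd
```

The same function works unchanged for group-format files if fields 1 and 3 are compared instead, at the cost of a small edit to the cut field list.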

Changes to help with maintenance of grid user accounts

So that our local grid administrators can maintain the contents of grid users' home areas, additions were made to the /etc/sudoers file on login nodes and worker nodes. The aliases and rule below allow those grid admins to run a bash shell in the accounts of grid users only. The command suitable for this purpose is sudo -H -s -u g-atl001, for example.

User_Alias  GRIDADMINS = list-of-pp-GRID-admins
Runas_Alias GRIDUSERS = %g-atlas,%g-atlasp,%g-atlass,%g-cms,%g-cmsp,%g-cmss,%g-lhcb,%g-lhcbp,%g-lhcbs,%g-dteam,%g-dteamp,%g-dteams,%g-ops,%g-opsp,%g-opss,%g-hone,%g-honep,%g-hones,%g-na48,%g-na48p,%g-na48s,%g-ngs,%g-ngsp,%g-ngss
GRIDADMINS ALL = (GRIDUSERS) /bin/bash
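
The Runas_Alias above follows a regular pattern: each VO contributes three unix groups, named g-vo, g-vop and g-vos. As a sketch, the line can be generated from the list of VO names, which makes adding a VO less error-prone. The function name emit_runas_alias is hypothetical.

```shell
# Hypothetical generator: each VO name contributes the groups g-<vo>,
# g-<vo>p and g-<vo>s, following the pattern visible in the alias above.
emit_runas_alias() {
    local alias_name="$1" list="" vo
    shift
    for vo in "$@"; do
        # ${list:+$list,} adds a separating comma only after the first entry
        list="${list:+$list,}%g-$vo,%g-${vo}p,%g-${vo}s"
    done
    echo "Runas_Alias $alias_name = $list"
}

emit_runas_alias GRIDUSERS atlas cms lhcb dteam ops hone na48 ngs
```

Any generated line would still need a syntax check (for example with visudo -c) before it replaces the one in /etc/sudoers.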

Changes on the Physics grid CE to match the BB configuration

On the Physics CE for BB there are no running servers for Torque or Maui, so the pbs_server and maui binaries are not started at boot time. The client commands (binaries) associated with Torque and Maui/Moab are present (as they must be), but they talk to the qmaster node on BlueBEAR.

The BB qmaster node, which runs the pbs_server and the moab scheduler for the BlueBEAR cluster, is on a private network behind a firewall. As discussed in a previous section, this difficulty was circumvented by configuring the BB export machine, which has both public and private network interfaces, to route network packets associated with Torque and Moab from the Physics machines to the qmaster machine, and back.

Client binaries for Moab (mdiag, and some others) were copied from BB to the Physics CE, into directory /usr/bin. This is possible because the systems are compatible (e.g. both are 64-bit), and it is necessary for Moab because maui binaries and moab binaries are not compatible.

Client binaries for Torque (qsub and qstat) could optionally be copied from BB to the Physics CE. If this is done, the libraries that those binaries use must be copied too: Torque on BB is built in an unusual directory, /cvos/shared/apps/Torque/version/lib (where version is the Torque version), so that directory has to be replicated on the Physics CE. In fact I suspect that it is perfectly OK to use the gLite-provided version of the Torque client commands, and that they will talk happily to the pbs_server on the BB qmaster, provided the versions are sufficiently similar that the network API hasn't changed.

In order for the Torque client binaries to make use of the BB qmaster, file /var/spool/pbs/server_name contains the string:

qmaster.cvos.cluster

In order for Moab commands on the CE (like mdiag) to make use of the BB qmaster, file /etc/moab.cfg on the CE contains the lines:

SERVERHOST=qmaster.cvos.cluster
SERVERPORT=42559

The SERVERPORT line is unnecessary (as port 42559 is what moab uses anyway) but is a useful reminder that Moab uses a different communication port from Maui (which uses 40559).

In order for packets to qmaster.cvos.cluster to reach the BB qmaster via the BB export server, the file /etc/hosts on the Physics CE has the additional line:

147.188.126.20          qmaster.cvos.cluster  bbexport.bham.ac.uk

Why have we used qmaster.cvos.cluster in place of using bbexport.bham.ac.uk as the definition for the PBS server and the Moab server? Because otherwise Torque jobs will acquire jobids containing the string bbexport in place of the usual qmaster, which would be confusing.
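
Since the CE setup relies on qmaster.cvos.cluster resolving to the export server's external address, it can be worth confirming what the hosts file actually says. The helper below is a minimal sketch that reads a hosts-format file directly; the name hosts_lookup is hypothetical.

```shell
# Hypothetical helper: print the address that a hosts-format file assigns
# to a name (canonical name or alias), ignoring comments.
hosts_lookup() {
    local file="$1" name="$2"
    awk -v name="$name" '
        { sub(/#.*/, "") }                      # strip trailing comments
        { for (i = 2; i <= NF; i++)
              if ($i == name) { print $1; exit } }
    ' "$file"
}

# Usage on the Physics CE:
#   hosts_lookup /etc/hosts qmaster.cvos.cluster
#   cat /var/spool/pbs/server_name    # should print qmaster.cvos.cluster
```

The address printed for qmaster.cvos.cluster should be the same as the one printed for bbexport.bham.ac.uk, since both names sit on the one line.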

-- LawrenceLowe - 22 Dec 2009

Topic revision: r5 - 12 Jan 2010 - 14:44:30 - LawrenceLowe
 