How BlueBEAR grid workers are initialised at boot time


The BlueBEAR cluster operates under the ClusterVision OS method of booting: when a machine boots, it is reset (via rsync from the master machine) to a known disk image. Grid worker nodes are no different in that respect, and boot from an image not specifically tailored for the Grid. This page describes how a worker machine is then adapted for the Grid environment.

When BB grid started in January 2009, the grid workers booted from the same disk image as all other BB workers: default-image. That changed when grid machines were required to have an up-to-date secure kernel: since then they have booted from an image called default-image-gridpp. In principle this image could itself have been tailored to accomplish some of the things listed below, but this has not been done, partly because we may return to using the same image as other BB workers, and partly because the tailoring differs slightly between grid workers (eg cron jobs). So we continue to use the same technique as when BB grid started in January 2009.

The start-up procedure

In the usual system script /etc/rc.d/rc.local in the default-image-gridpp, there are a couple of extra lines. If the hostname of the current worker is found in the file /home/grid-support/gridnodes, which lists the known grid workers one per line, then control passes to the script /home/grid-support/system/grid-worker-start, which performs the grid worker initialisation. That script consists of:
#!/bin/sh
# L.S.Lowe at
# This script is only called for nodes identified as grid nodes.
# As all worker nodes are reset to the default image on reboot, any changes
# that these scripts make to local system files are for this boot only.
# Invokes all matching scripts in our rc.d directory.
set -u
export PATH=/sbin:/bin:/usr/sbin:/usr/bin
export GLOG=/var/log/grid-worker-start.log
export GHOST=$(hostname -s)
export GHDIR=/home/grid-support
export GSDIR=/home/grid-support/system
if touch "$GLOG" 2>/dev/null; then exec > "$GLOG" 2>&1; fi
for f in "$GSDIR"/rc.d/S*.sh; do [ -x "$f" ] && "$f"; done
exit 0

So script files matching /home/grid-support/system/rc.d/S*.sh are executed for grid worker nodes. These are described in the next section.
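The hostname check that rc.local performs can be sketched as follows; a temporary file stands in for /home/grid-support/gridnodes so the sketch is self-contained, and the worker hostnames are invented for illustration.

```shell
# Hedged sketch of the extra lines in /etc/rc.d/rc.local.
gridnodes=$(mktemp)
printf '%s\n' u101 u102 u103 > "$gridnodes"   # hypothetical grid worker hostnames
host=u102                                     # stand-in for $(hostname -s)
if grep -qx "$host" "$gridnodes"; then
    role=grid    # real rc.local would run /home/grid-support/system/grid-worker-start here
else
    role=local   # normal BB worker: no grid initialisation
fi
echo "$role"
rm -f "$gridnodes"
```

The `grep -qx` matches the whole line exactly, so a hostname that is merely a prefix of another never matches by accident.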

If the worker's hostname is not present in that /home/grid-support/gridnodes file, then the worker considers itself to be a normal BB worker, not a grid worker. It is worth noting at this point that this is not the end of the story: a grid worker also has to be on the list of nodes known to Torque and/or Moab for the appropriate grid Torque queues, just as an ordinary worker has to be on the corresponding list for local job queues. This is a separate issue dealt with elsewhere. It would have been good to have a single source of information for which nodes have which roles, but since the node lists can be defined in Torque (using acl_hosts) or in Moab (using HOSTLISTs), and currently (20091201) are defined in both (slightly differently!), I opted for the simple safe approach, at the small cost of having to update more than one place if the list of nodes changes.
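For illustration only, the two separate definitions might look like this; the queue name, class name, and hostnames are assumptions, not the actual BB configuration:

```shell
# Hypothetical Torque queue ACL, set via qmgr:
qmgr -c "set queue grid acl_host_enable = True"
qmgr -c "set queue grid acl_hosts = u101+u102"
# And, separately, a hypothetical Moab class host list in moab.cfg:
#   CLASSCFG[grid] HOSTLIST=u101,u102
```

Keeping these two definitions in step by hand is exactly the duplication referred to above.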

The system/rc.d/S* scripts

These scripts are executed in turn by the grid-worker-start script. They provide a modular way of handling the initialisation for a grid environment.

Where appropriate, the scripts have to apply their actions not only to the native (SL5) image, but also to other sub-images that might be used by grid jobs, eg the SL4 sub-image. This is necessary for example for profiles, shift, torque, and umount.

This worker init script puts into place restrictive iptables rules to prevent grid jobs accessing other machines on campus, apart from the grid servers in Physics.

Computing services (unrelated to grid) on a campus are sometimes configured on the basis that clients with an on-campus address are granted more access rights than clients with an off-campus address, on the grounds that they are used by local staff and students. There are usually other access controls in place (eg user authentication) but nevertheless, on-campus addresses are more likely in some sense to be trusted. So for our grid machines, the philosophy that has been adopted for grid services is that grid jobs should not be able to take advantage of their position of being on-campus to abuse that trust. This is achieved by blocking all access from grid machines to on-campus machines, apart from specific grid services in Physics, and a few essential infrastructure services.
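A minimal sketch of such restrictive rules follows; the host name and the campus address range are invented for illustration and the real rules in the init script will differ:

```shell
# Hedged sketch only: names and addresses are assumptions.
iptables -A OUTPUT -d grid-ce.example.ac.uk -j ACCEPT   # a specific grid service in Physics
iptables -A OUTPUT -p udp --dport 53 -j ACCEPT          # essential infrastructure (DNS)
iptables -A OUTPUT -d 10.0.0.0/8 -j REJECT              # block the rest of campus
iptables -A OUTPUT -j ACCEPT                            # off-campus traffic is allowed
```

The ordering matters: the specific accepts must precede the campus-wide reject.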

Previous to Dec 2010, the home areas for BB grid users in /egee/home/ were on shared disk. This sharing is unnecessary for grid use, differs from what we have on our main PP cluster, and has performance implications.

This new script creates an /egee/home area for grid user accounts on local worker disk. This area may or may not already exist when the grid worker boots, depending on whether the corresponding partition has been re-created or re-synched. In the present version, the /egee/home area is created on the biggest partition (100GB) of the grid worker's hard disk. This partition corresponds to /tmp, so a directory is created there and bind-mounted to /egee/home, so that /egee/home/userid is the physical home of each grid user's files.
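The core of the arrangement can be sketched as below; the directory name used under /tmp is an assumption:

```shell
# Hedged sketch: create the grid home area on the large /tmp partition and
# bind-mount it so /egee/home/userid is the physical home of grid users' files.
mkdir -p /tmp/egee-home /egee/home   # /tmp/egee-home is a hypothetical name
mount --bind /tmp/egee-home /egee/home
```

A bind mount leaves the data physically under /tmp while presenting it at the conventional /egee/home path.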

To re-establish the user home directories within /egee/home, with their minimal contents, the file /home/grid-support/model/homedirs.tgz is unpacked into directory /egee/home. This tgz file can be re-created by grid administrators when required. The files therein need to have the correct permissions and user/group attributes already set up.
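The unpacking step amounts to something like the following; temporary stand-in paths are used so the sketch is self-contained (on a worker the paths would be /home/grid-support/model/homedirs.tgz and /egee/home, run as root so that ownership is preserved):

```shell
# Hedged sketch with stand-in paths; the user name and file are invented.
model=$(mktemp -d)       # stand-in for /home/grid-support/model
egee_home=$(mktemp -d)   # stand-in for /egee/home
mkdir -p "$model/stage/user001"
echo 'example profile line' > "$model/stage/user001/.profile"
tar -czf "$model/homedirs.tgz" -C "$model/stage" .
# -p preserves the permissions recorded in the tarball (and ownership, when root)
tar -xzpf "$model/homedirs.tgz" -C "$egee_home"
ls "$egee_home/user001"
```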

Note that a daily cron job on each worker will access essential files in these home areas (using touch or a read) and so protect them from deletion by tmpwatch, while other files will be deleted in accordance with the normal tmpwatch criterion (10 days without access).
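Such a daily cron job might look like this sketch; which files count as essential is an assumption:

```shell
# Touch the access time of grid users' dot-files so tmpwatch's
# 10-days-without-access criterion never selects them; other files age normally.
find /egee/home -maxdepth 2 -name '.*' -type f -exec touch -a {} +
```

Only the access time needs refreshing, hence `touch -a` rather than a full touch.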

There remains an alternative of mounting this large partition as /egee, creating the home areas within that, and bind-mounting /tmp to (say) /egee/tmp. There seems no big advantage currently of doing it that way, save that it would remove the need to circumvent tmpwatch.

This new script mounts the /egee/soft area from our RAID NFS server. This contains local grid files, the grid middleware, and the experiment software.

This new script mounts the /egee/torque area from the NFS server, for those worker nodes which require access to it. A cron job on such nodes will copy the latest PBS/Torque server log to this area, as is needed by the Cream CE (which also mounts this area).
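Together, the mount and the log-copying cron task might be sketched as follows; the server name and log paths are assumptions:

```shell
# Hedged sketch: names are invented for illustration.
mount raidnfs:/egee/torque /egee/torque
# Daily cron on such nodes: copy the latest Torque server log for the Cream CE,
# which mounts the same area. Torque names server logs by date (YYYYMMDD).
cp -p "/var/spool/torque/server_logs/$(date +%Y%m%d)" /egee/torque/server_logs/
```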

This worker init script checks the hostname of the current worker and copies cron tasks from /egee/system/cron.d/ for matching hostnames to the /var/spool/cron directory of the worker. These cron tasks include a root task to push-synchronise Torque accounting information with the CE, and other tasks to pull-synchronise certificates and revocation lists.
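The hostname-matched copy can be sketched as below; the naming convention (cron files suffixed with a hostname) is an assumption, and temporary stand-in directories keep the sketch self-contained.

```shell
# Hedged sketch with stand-in paths and hostnames.
crond=$(mktemp -d)   # stand-in for /egee/system/cron.d
spool=$(mktemp -d)   # stand-in for /var/spool/cron
host=u101            # stand-in for $(hostname -s)
touch "$crond/root.u101" "$crond/root.u102"
for f in "$crond"/*."$host"; do
    [ -e "$f" ] && cp "$f" "$spool/root"   # install only this worker's tasks
done
ls "$spool"
```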

On a normal gLite worker (not BB), the following cron tasks exist (at the time of writing, for SL5): cleanup-grid-accounts, edg-pbs-knownhosts, fetch-crl, mom_logs.

On BB, the cron tasks that run are not limited to those that would normally run on a grid worker. Another consideration is that, since much of the software and data is shared by virtue of being in the /egee area, there is no need to have the cron tasks on every grid worker. For example, there is only one certificate revocation area on BB, not one per worker as in a normal gLite setup, so the script that fetches the CRL does not need to run on every worker. The approach taken is to spread the cron tasks across a small number (4 or 5) of the grid workers, with some duplication (redundancy), but set up to run at different times of day. This way, if one or two of the relevant grid workers are down, the tasks still run, albeit less frequently.
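For illustration, a pair of staggered entries might look like this; the times, hostnames and task name are assumptions:

```shell
# Hypothetical crontab entries for the same task on two workers,
# twelve hours apart, giving redundancy if one worker is down:
# on worker u101:
#   15 02 * * *  root  /egee/system/cron.d/fetch-crl
# on worker u105:
#   15 14 * * *  root  /egee/system/cron.d/fetch-crl
```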

This worker init script simply puts in place an /etc/motd file which will remind vendor and local admin staff who happen to log in to a worker node that this is a grid worker and is configured differently from other BB workers.

This worker init script copies files from /egee/system/profile.d directory to the worker's /etc/profile.d directory. This ensures that grid user jobs get the appropriate grid environment before they start.

The important file at the time of writing is as follows:

  • this profile script first checks the current userid and environment (SL4 or SL5). If the current user is non-grid, it does nothing. Otherwise it invokes /egee/soft/middleware/etc/profile.d/ (for SL4) or /egee/soft/SL5/middleware/etc/profile.d/ (for SL5). These scripts are owned by the grid administrators and can be tailored to invoke scripts supplied with the middleware tar-ball and/or other scripts. Afterwards, all relevant grid environment variables for the user are expected to be in place.
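A plausible shape for such a profile script is sketched below; the user-name pattern, the OS test, and the invoked filename (grid-env.sh) are all assumptions:

```shell
# Hedged sketch of an /etc/profile.d script for grid jobs.
case $(id -un) in
    grid*)   # grid pool accounts; the naming pattern is an assumption
        if grep -q 'release 4' /etc/redhat-release 2>/dev/null; then
            . /egee/soft/middleware/etc/profile.d/grid-env.sh       # SL4; filename assumed
        else
            . /egee/soft/SL5/middleware/etc/profile.d/grid-env.sh   # SL5; filename assumed
        fi
        ;;
esac
```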

This script simply puts a soft-link at /etc/shift.conf on the worker, pointing to /home/grid-support/model/shift.conf or /egee/soft/SL5/local/shift.conf, whichever is found first, so that grid workers use the appropriate buffer size when doing rfio. The source file is owned by the grid administrators and its current content (at the time of writing) is:


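The first-found link logic can be sketched as follows, with temporary stand-in directories for the two candidate locations; the file content shown is illustrative only, not the real shift.conf:

```shell
# Hedged sketch with stand-in paths.
model=$(mktemp -d)    # stand-in for /home/grid-support/model
local5=$(mktemp -d)   # stand-in for /egee/soft/SL5/local
etc=$(mktemp -d)      # stand-in for /etc
echo 'illustrative content' > "$model/shift.conf"
for f in "$model/shift.conf" "$local5/shift.conf"; do
    if [ -f "$f" ]; then
        ln -sf "$f" "$etc/shift.conf"   # on a worker: /etc/shift.conf
        break                           # first match wins
    fi
done
readlink "$etc/shift.conf"
```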
This worker init script tweaks the /etc/ssh/sshd_config on the worker so that local BB users cannot ssh into a grid worker machine.
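One plausible form of the tweak, with the group name as an assumption:

```shell
# Deny interactive ssh logins from ordinary BB users on grid workers;
# "bbusers" is a hypothetical group name for local BB accounts.
echo "DenyGroups bbusers" >> /etc/ssh/sshd_config
service sshd reload
```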

This script makes the qstat command available to grid users at /usr/bin/qstat; it would otherwise be in a non-standard location. This was requested for LHCb jobs.
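This presumably amounts to a link such as the following; the Torque install path is an assumption:

```shell
# Make qstat visible at the standard location for grid jobs.
ln -sf /usr/local/torque/bin/qstat /usr/bin/qstat   # source path is hypothetical
```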

This worker init script unmounts file-systems which are not required for grid jobs, thereby giving local BB user files extra protection from grid users. In practice, such file-systems may not have been mounted in the first place, though this depends on /etc/rc.d/rc.local in the grid image in use.
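A minimal sketch, assuming hypothetical mount points:

```shell
# Unmount local BB file-systems not needed by grid jobs, if present.
for fs in /bb/home /bb/apps; do       # mount points are invented for illustration
    umount "$fs" 2>/dev/null || true  # may not have been mounted in the first place
done
```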

-- LawrenceLowe - 09 Dec 2009

Topic revision: r9 - 10 Dec 2010 - 10:33:23 - LawrenceLowe