Local Grid Internals

This is a local topic intended for grid-related notes by Lawrie Lowe and Chris Curtis.

Types of node

CE role

CEs are epgce3, epgce4 (as of 1st May 2009).

epgce3 heads the access to the local grid twin cluster: epgd01-epgd16. It provides the usual CE functionality and has the Torque pbs_server and maui scheduler. The CE epgce3 and its WNs all use an RPM style installation for the LCG/gLite middleware.

epgce4 heads the access to the grid section of the university BlueBEAR cluster: u4n081-u4n128. It provides the usual CE functionality only, and uses the qmaster node on the BlueBEAR cluster as the Torque pbs_server and moab scheduler. The CE epgce4 uses an RPM-style installation for the middleware. The WNs use a non-RPM tar-style installation for the middleware.

Historical note: on epgce4, /opt/edg/var/info/atlas used to be mounted from epgce2. It is now a local directory, as is normal. (The directory /opt/edg/var/info is the experiment software installation status directory, maintained by the experiment software managers.)

Historical note: old CEs:

epgce1 (in its new Dell incarnation) exports a software area to the now-defunct epcf* farm.

epgce2 (old white Supermicro box) used to export an info area to the present epgce4 (see above). It is now a BDII only.

SE role

The SE is epgse1.

We run SRM Disk Pool Manager (DPM) rather than (say) dCache.

SE epgse1 forms the main node and we have additional pool-node epgsr1. As well as being a pool-node, epgsr1 has an area which is NFS-exported to the epgce3 WNs, as described below.

epgse1 has attached RAID f9. This has 16 drives of 750 GB each and is firmware-configured as one RAID-5 logical drive, with one drive as spare. This logical drive is subdivided in the RAID firmware into 6 firmware-partitions of 1668669 MiB (about 1.7 TB) each. Each of these has an fdisk partition table with a single partition, except for one of the six, which has an fdisk partition table with 3 partitions of about 570 GB each. The first five partitions are mounted as f9a to f9e. The 3 smaller ones are mounted as f9f, f9g, f9h. These 3 are system-level copies of a withdrawn RAID f3, and so for backward compatibility there are symlinks from /disk/f3[abc] to /disk/f9[fgh]. All partitions are formatted as ext3.

epgsr1 has attached RAIDs f12, f13, f14 (soon), and f15. Each of these has 24 drives of 1 TB, and each is firmware-configured as two RAID-6 logical drives of about 10 TB with no spare drives. Each logical drive is subdivided in the RAID firmware into 2 firmware-partitions of 4768045 MiB (about 5 TB) each. Each of these has a GPT partition table with a single GPT-partition. So f12, for example, appears to epgsr1 as four SCSI drives, which are mounted as f12a, f12b, f12c, f12d, and similarly for f13, f14 (soon), and f15. All partitions are formatted as xfs.
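As a rough cross-check of the RAID figures above, the nominal capacity of each unit can be worked out from the drive count, spares and parity drives, and compared with the firmware-partition sizes. This is only a sketch: the function names usable_gb and mib_to_gb are made up for illustration, drive sizes are taken as nominal decimal GB, and the drives are assumed to be split equally between logical drives.

```shell
#!/bin/sh
# usable_gb DRIVES SPARES PARITY SIZE_GB [ARRAYS]
# Nominal usable capacity (GB) of ARRAYS equal logical drives built from
# DRIVES disks, after removing SPARES and PARITY drives per logical drive.
usable_gb() {
    drives=$1 spares=$2 parity=$3 size_gb=$4 arrays=${5:-1}
    per_array=$(( (drives - spares) / arrays ))
    echo $(( arrays * (per_array - parity) * size_gb ))
}

# mib_to_gb MIB: convert MiB to decimal GB, rounded down.
mib_to_gb() {
    echo $(( $1 * 1048576 / 1000000000 ))
}

usable_gb 16 1 1 750        # f9: RAID-5, one spare           -> 10500
mib_to_gb 1668669           # one f9 firmware-partition       -> 1749
usable_gb 24 0 2 1000 2     # f12: two RAID-6 logical drives  -> 20000
mib_to_gb 4768045           # one f12 firmware-partition      -> 4999
```

Six partitions of about 1749 GB come to 10494 GB against a nominal 10500 GB for f9, and four partitions of about 4999 GB come to 19996 GB against a nominal 20000 GB for f12, so the quoted sizes are consistent with the drive counts.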

Filesystem f15d is different: it is not a DPM area, but is a large software area which is NFS-exported to the WNs of epgce3. At one time, the software areas on these WNs were mounted from several servers: some mounted from epgce3 itself, and some from f15d on epgsr1. As of August 2009, all software areas occupy directories on f15d, so they can expand and also benefit from the reliability of RAID.

epgsr1 has used both of its gigabit interfaces since 20090611: see the LocalGridBonding page.

BDII role

The BDII is epgce2. This is a 2004 white Streamline-supplied box. It used to be CE for the eScience cluster but this is now defunct (as of April 2009).

Historical note: epgce2:/opt/edg/var/info/atlas used to be exported to epgce4, read-only. This is no longer used.

MON role

epgmo1 serves this role. It also has the DHCP setup, kickstart setups (tftp), mysql, syslogd central logger, ganglia.

VO-box role

Currently epcf01 serves this role as the Alice VO-box.

Some useful information about VOBoxes can be found here

WN role


epgd01-16 (16 twin nodes with 8 cores each) on CE epgce3

These worker nodes are located within the group. Their user home areas are local to the worker node (that is, /home on the worker's hard disk), and so are shared with other jobs on the same WN but not with other WNs, and not with the CE. Job output (and any staged input) uses the usual Torque scp to/from the CE. LCG/gLite middleware is installed from the usual RPMs. Experiment software areas are mounted from 3 different places (as of May 2009): epcf01 for Alice, epgsr1:/disk/f15d for Atlas, and epgce3 for other areas. The aim is to move all of these to epgsr1:/disk/f15d when time permits. As of 20090709, LHCb software has been moved too.

u4n081-128 (48 BlueBEAR nodes with 4 cores each) on CE epgce4

These worker nodes are located with BlueBEAR. They use a relocated tar-style LCG middleware installation (in /egee/soft/middleware), and not LCG RPMs. Their user home areas are on GPFS /egee/home/, and so are shared with other jobs on the same WN and with other grid WNs, but not with the CE. Job output (and any staged input) uses the usual Torque scp to/from the CE. An alternative of using a home area on the WN's hard-disk has not been found necessary (performance-wise) so far. Experiment software areas are all on GPFS in /egee/soft.

GOCDB site configuration

External site monitors

See GridLinks.

Linux installation

Operating system

SL4.6.

PXE boot

Kickstart

Immediate post-install

Package update step, to bring all installed packages up to date: configure /etc/yum.conf or the files in /etc/yum.repos.d/ with entries for our local Birmingham mirror of the Scientific Linux packages, so that a yum update is possible immediately. This is often done as part of the %post processing of the kickstart install.

gLite install: see the LCG 3.1 Install Guide or later. In order to install such packages, add lines to the /etc/yum.conf or files in /etc/yum.repos.d/. Then, for example for a BDII, enter yum install glite-BDII and yum install lcg-CA.
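A repository entry of the kind described above could be generated along the following lines. This is a sketch only: the function name write_repo, the repository id sl-local and the mirror URL are placeholders, not the actual Birmingham settings. The keys used (name, baseurl, enabled, gpgcheck) are standard yum repository options.

```shell
#!/bin/sh
# write_repo DIR ID BASEURL: write DIR/ID.repo describing one yum repository.
write_repo() {
    cat > "$1/$2.repo" <<EOF
[$2]
name=$2
baseurl=$3
enabled=1
gpgcheck=1
EOF
}

# Example with a placeholder URL (on a real node DIR would be /etc/yum.repos.d):
# write_repo /etc/yum.repos.d sl-local http://mirror.example.org/sl/
# yum update
```

Keeping one small file per repository in /etc/yum.repos.d/ rather than editing /etc/yum.conf makes it easier to add or remove a repository from a kickstart %post script.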

yaim (configures the box according to its future role)
site-info.def: in principle, everything that makes a native SL4 into a gLite CE/SE/...

See Yaim User Guide, including site configuration variables.

See CERN's LCG Documents for other YAIM and LCG guides, and the LCG Port table.

cfengine vs rdist vs rsync

Ongoing maintenance

As a useful rule, when modifying a system configuration file, we always try to keep a copy of the originally-supplied version, using an .ORIG suffix: cp -pi $fn $fn.ORIG. If a file is vulnerable to being overwritten by a subsequent package update, we also keep a copy of the modified file: cp -pi $fn $fn.CURR. Subsequently, we keep a dated copy of the previous version according to its creation date: cp -p $fn $fn.yyyymmdd. To locate system changes, we just need to search for files with the .ORIG suffix.
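The backup convention above can be wrapped in a few shell functions. This is a minimal sketch, not a script that exists on the grid nodes: the function names are invented, the interactive -i flag is replaced by an explicit existence test so that the functions can run unattended, and the dated copy takes its stamp from the file's modification time using the -r FILE option of GNU date.

```shell
#!/bin/sh
# keep_orig FILE: save the originally-supplied version once, never overwriting it.
keep_orig() {
    [ -e "$1.ORIG" ] || cp -p "$1" "$1.ORIG"
}

# keep_dated FILE: save the current version under its last-modified date.
keep_dated() {
    stamp=$(date -r "$1" +%Y%m%d) || return 1
    cp -p "$1" "$1.$stamp"
}

# list_changed DIR: list the files under DIR that have a saved .ORIG copy.
list_changed() {
    find "$1" -type f -name '*.ORIG' | sed 's/\.ORIG$//'
}

# Typical use before editing a file:
# keep_orig /etc/hosts.allow && keep_dated /etc/hosts.allow
# list_changed /etc
```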

When determining where a file has come from, or how to update it, it's useful to check if it's a member of a package: rpm -qf fullpathname will give you the package name; and if you're wondering whether the file has been modified from its original, rpm -V packagename will list it if it has.

Keeping packages up to date (RPMs) using yum

Security: IP tables and hosts.allow, user-level with ssh-keys for scp to/from CE

Torque and Maui/Moab

Torque server runs on the CE in the case of epgce3, and on a BlueBEAR server in the case of epgce4.

Service pbs_mom runs on all WNs. The pbs_server service and maui scheduler run on epgce3.

To test that job submission and output retrieval are healthy, use LSL's pbstest command on a CE. You can optionally supply arguments of a grid userid and a number of jobs. The jobs should run for 30 seconds and output returns into directory /tmp/pbstest/.

To put a node offline so that the batch system will start no further jobs on it: pbsnodes -o nodename -N "reason for putting offline"

To list offline nodes: use pbsnodes -ln (dash el en; the -n option works in latest Torque versions to list Notes assigned to a node).

To start or stop a queue: easiest to use qstart or qstop. In the stopped state, a queue will accept jobs but not start them. Jobs which are already running will continue.

To enable or disable a queue: use qenable or qdisable. In the disabled state, a queue will not accept new jobs.

A number of these operations can also be done via the qmgr interface: eg: set queue long started = false.
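The node and queue operations above can be collected into small wrapper functions with a dry-run mode, so that the exact command can be inspected before it touches the batch system. This is a sketch: the function names and the DRY_RUN variable are invented here. pbsnodes -c, which clears the offline flag, is not mentioned above but is a standard Torque option. The qmgr forms are assumed to be equivalent to qstop and qdisable.

```shell
#!/bin/sh
# DRY_RUN=1 (the default in this sketch) prints commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"          # quoting is flattened in the printed form
    else
        "$@"
    fi
}

# offline_node NODE REASON: start no further jobs on NODE and record why.
offline_node() {
    run pbsnodes -o "$1" -N "$2"
}

# online_node NODE: clear the offline flag again.
online_node() {
    run pbsnodes -c "$1"
}

# stop_queue QUEUE: the queue accepts jobs but starts none.
stop_queue() {
    run qmgr -c "set queue $1 started = false"
}

# close_queue QUEUE: the queue accepts no new jobs.
close_queue() {
    run qmgr -c "set queue $1 enabled = false"
}

# Example: offline_node epgd05 "disk fault"
```

Setting DRY_RUN=0 in the environment makes the same functions execute the commands for real on a host where the Torque client tools are installed.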

APEL

Accounting statistics are published via APEL. This is a software collection that parses local log files (such as those related to PBS) and collates statistics on the number of jobs, users and so on. A central accounting server then aggregates the records.
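To give a feel for the kind of collation involved, the sketch below counts completed jobs per local userid from Torque accounting records. It is not APEL's own parser. It assumes the usual Torque accounting record layout of four semicolon-separated fields (timestamp, record type, job id, and a space-separated list of key=value pairs), with record type E marking the end of a job. The function name jobs_per_user is made up.

```shell
#!/bin/sh
# jobs_per_user [FILE...]: count job-end (E) records per user=... value.
# Reads the named accounting files, or standard input if none are given.
jobs_per_user() {
    awk -F';' '
        $2 == "E" {
            n = split($4, kv, " ")
            for (i = 1; i <= n; i++)
                if (kv[i] ~ /^user=/)
                    count[substr(kv[i], 6)]++   # text after "user="
        }
        END { for (u in count) print u, count[u] }
    ' "$@" | sort
}

# Example: jobs_per_user accounting/20090501
# (the location of the accounting directory depends on the Torque installation)
```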

On the BB lcg-CE and CreamCE (currently epgr04 and epgr07), pbs log and accounting files are mounted from ep19x.ph.bham.ac.uk.

BlueBEAR

Particular issues for bluebear: firewalls etc.

Grid admins Lawrie and Chris are allowed to ssh from bluebear login nodes to grid worker nodes. The usual personal password is required (keys won't work as /bb is unmounted). User g-admin can do the ssh as well, and in that case its ssh key will work (since its home area is visible on the grid worker node). From there, you can quickly check how things are running on the node, and/or sudo into a gridaccount (as below) to delve more deeply. Bear in mind the CV health-check script will kill any non-job-matching sessions from time to time.

To allow local grid administrators to help grid users, the admins are permitted under sudo to access those accounts, using the following sort of command on a BlueBEAR node:

sudo -H -s -u gridaccount

Useful local commands

showusers: on a CE/SE, lists local userids and their corresponding DNs.

qs: basically just a qstat. In the case of BlueBEAR (epgr04/07), it runs under ID edguser, which has more listing privileges than root.

qsr: like qs, but only lists running jobs. If you specify an argument, then lines matching that argument are marked with an X on the right.

qsdn: like qs, but summarises the jobs in the system (queued and running) with counts against each userid and their DN.

qsdni and qsdnr: like qsdn, but gives info for input-queued and for running jobs respectively.

listdone: on a CE, lists accounting information for completed jobs, for all users or a particular userid. There are options to sort the output into various orders.

pbstest: on a CE, submits short jobs to Torque; the output will return to directory /tmp/pbstest/. Optionally specify args griduserid and number of jobs.
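For readers without access to the local scripts, the behaviour described for qsr can be approximated with a short filter over qstat output. This is a hypothetical re-creation, not the local qsr script, whose source is not shown here. It assumes the default qstat listing, in which the job state is the fifth whitespace-separated column, and the function name mark_running is invented.

```shell
#!/bin/sh
# mark_running [TEXT]: read qstat-style lines on standard input, keep only
# running jobs (state R in column 5), and append an X to lines containing TEXT.
mark_running() {
    awk -v pat="$1" '
        $5 == "R" {
            if (pat != "" && index($0, pat))
                print $0 "  X"
            else
                print $0
        }
    '
}

# Example: qstat | mark_running atlas
```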

Local Grid Journal

For a diary of events, look at the LocalGridJournal page.

Topic revision: r25 - 05 Jan 2011 - 10:49:29 - ChristopherCurtis
 