Local Grid Cookbook

A How-To guide on installing, monitoring and maintaining new Grid nodes at Birmingham.

Solving key problems

Most problems can be solved by doing the following:
  1. Reboot - Log onto epgmo1 and use the command cfrun $host -- -D reboot, where $host is the fully qualified host name (eg
  2. Run yaim - If rebooting doesn't fix the problem, run yaim from epgmo1 with the command cfrun $host -- -D reyaim.
  3. Reinstall - If all else fails, reinstall the node.

Note that rebooting/reyaiming DPM nodes can take some time. These actions kill the SRM processes, but will only do so after allowing existing transfers to complete!
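
The first two steps above can be wrapped in a small helper. This is a minimal sketch, assuming cfrun is on the PATH on epgmo1. The names cf_action and DRY_RUN, and the host node.example.org, are hypothetical and not part of the local setup. By default it only prints the command it would run.

```shell
#!/bin/sh
# Sketch only: cf_action and DRY_RUN are illustrative names, not local tools.
# Builds the cfrun invocation described above for a host and a cfengine class.
cf_action() {
    # $1 = fully qualified host name, $2 = cfengine class (reboot, reyaim)
    if [ "$#" -ne 2 ]; then
        echo "usage: cf_action HOST CLASS" >&2
        return 2
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: show what would be run without touching the node.
        echo "cfrun $1 -- -D $2"
    else
        cfrun "$1" -- -D "$2"
    fi
}

# Step 1: reboot. Step 2 (only if the problem persists): rerun yaim.
# node.example.org is a placeholder host name.
cf_action node.example.org reboot
cf_action node.example.org reyaim
```

Setting DRY_RUN=0 makes the helper call cfrun for real; keeping the dry run as the default avoids rebooting a DPM node by accident.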


There are a number of key links which should be monitored periodically (ranked in order of importance):

  1. GridPP Nagios - Click on the "Problems" link and search for
  2. ATLAS Production - This should show a) lots of jobs assigned to Birmingham (black line) and b) a low number of failures (light green line). It is useful to compare with Oxford.

A more detailed list of useful links can be found here. There are also a number of locally managed monitoring web pages, including Ganglia, Nagios and Pakiti. These are only viewable from the PP subnet (although some are publicly accessible on port 8888).

Recurring Problems

Job Submission to Cream CE

For some reason, we periodically fail Ops Nagios tests on the CreamCE. This problem can be temporarily fixed by rebooting, but it will recur.


Package update

It is not safe to leave yum to update automatically. Applying automatic updates to nodes (especially the worker nodes) can have unexpected consequences. All updates should be completed "manually", that is to say, via cfengine under controlled conditions. The cfengine template will then take care of any known problems.

  1. Log onto epgmo1 and run the command cfrun $host -- -D yum_update
  2. After updating, it is prudent to rerun yaim with the command cfrun $host -- -D reyaim
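
When several nodes need updating, it can help to print the full sequence of commands for review before running anything. The following is a sketch under that assumption; print_update_plan and the host names are hypothetical.

```shell
#!/bin/sh
# Sketch only: print_update_plan is an illustrative name.
# Prints, for each host given, the update command followed by the reyaim
# command, so the plan can be checked before it is run on epgmo1.
print_update_plan() {
    for host in "$@"; do
        echo "cfrun $host -- -D yum_update"
        echo "cfrun $host -- -D reyaim"
    done
}

# Placeholder host names; substitute real fully qualified names.
print_update_plan wn01.example.org wn02.example.org
```

The output can be pasted into a shell one host at a time, which keeps the update under manual control.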

Edit iptables

  1. Log onto epgmo1 and edit the relevant iptables.rules file. These are kept in /var/cfengine/inputs/repo/$role/iptables.rules, where $role is the role of the machine (eg dpm_head_node or glexec_wn; a list of roles is maintained here). Note that the CE, VM host and DPM Pool Node roles have an extra directory layer in order to further differentiate.
  2. Run cfengine with the command cfrun $host, where $host is the name of the machine to update. Cfengine will then copy the new rules onto the appropriate server and restart the iptables service.
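
The path convention in step 1 can be expressed as a one-line helper. This is only a sketch: rules_path is a hypothetical name, and for roles with an extra directory layer the caller is assumed to pass that layer as part of the role.

```shell
#!/bin/sh
# Sketch only: rules_path is an illustrative helper, not a local tool.
# Maps a role to its iptables.rules path under the cfengine repository.
REPO=/var/cfengine/inputs/repo

rules_path() {
    # $1 = role, e.g. dpm_head_node or glexec_wn. Roles with an extra
    # directory layer are passed with that layer included (assumption).
    echo "$REPO/$1/iptables.rules"
}

rules_path dpm_head_node
rules_path glexec_wn
```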

Adjust Maui settings

  1. Log onto epgmo1 and edit the file /var/cfengine/inputs/repo/ce/twin_cream_ce/maui.cfg as appropriate.
  2. Run cfengine with the command cfrun -- -D restart_maui

Restart services

The pbs_mom/pbs_server, cron, maui, nagios and ganglia services may all be restarted via cfengine. Simply run the command cfrun $host -- -D restart_XXXX, where XXXX is the service name. Note that for pbs services the command is restart_pbs!
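
The naming rule, including the pbs exception, can be sketched as a lookup. The helper restart_class and the host name are hypothetical; the class names other than restart_maui and restart_pbs are assumed to follow the restart_XXXX pattern exactly.

```shell
#!/bin/sh
# Sketch only: restart_class is an illustrative helper, not a local tool.
# Maps a service name to the cfengine class used to restart it.
restart_class() {
    case "$1" in
        # Both pbs services share a single class.
        pbs_mom|pbs_server|pbs) echo "restart_pbs" ;;
        # Assumed to follow the restart_XXXX pattern described above.
        cron|maui|nagios|ganglia) echo "restart_$1" ;;
        *) echo "unknown service: $1" >&2; return 1 ;;
    esac
}

# Print (not run) the resulting commands for a placeholder host.
echo "cfrun node.example.org -- -D $(restart_class maui)"
echo "cfrun node.example.org -- -D $(restart_class pbs_mom)"
```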


A node can be rebooted with the command cfrun $host -- -D reboot. Note that some machines can take a long time to shut down, as they may require some service tasks to finish. For example, DPM nodes will wait for GridFTP transfers to complete before shutting down the service.


Sometimes APEL does not publish accounting information properly. The first notice of a failure to publish will probably come in the form of a failed org.apel.APEL-Pub Nagios test. More information about problems can be found in the APEL log file, /var/log/apel.log.

Some problems can be fixed by running the APEL publisher manually, with the appropriate config file. To run the APEL publisher, log onto the APEL node and execute the commands:

export APEL_HOME=/opt/glite/
/opt/glite/bin/apel-publisher -f /opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml >> /var/log/apel.log 2>&1

The default config, /opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml, should publish all new records. It does not (despite its misleading name) publish missing records, ie it will not publish records it previously missed if it has already published more recent records. If you need to publish records from a gap period (it should be obvious from the Nagios error message), the gap publisher can be invoked by editing publisher-config-yaim.xml. Change the Republish tag from missing to gap, with appropriate date attributes, eg:


<Republish recordStart="2010-12-01" recordEnd="2011-01-05">gap</Republish>
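
The edit can also be made with sed. The following is a sketch on a throwaway sample file, not the live config: it assumes the default tag reads <Republish>missing</Republish> on a single line, and the SampleConfig wrapper element is invented purely for the demonstration.

```shell
#!/bin/sh
# Sketch only: operates on a temporary sample file, not the real
# publisher-config-yaim.xml. The wrapper element is made up.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
<SampleConfig>
  <Republish>missing</Republish>
</SampleConfig>
EOF

start=2010-12-01
end=2011-01-05

# Rewrite the tag to gap mode with the date range; sed keeps the
# original as a .bak copy alongside the edited file.
sed -i.bak "s|<Republish>missing</Republish>|<Republish recordStart=\"$start\" recordEnd=\"$end\">gap</Republish>|" "$cfg"

grep Republish "$cfg"
```

Keeping the .bak copy makes it easy to compare the file before and after the change.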



Local Cluster CE



Local Cluster WN


  1. Log onto u4n128 on BB and sudo -s -H -u g-admin
  2. Make the directory BB:/egee/soft/SL5/middleware/X.Y.Z-0, where X.Y.Z-0 is the version number of the latest glite-WN_TAR release.
  3. Download the latest glite-WN_TAR and glite-WN_TAR-externals tarball releases into the directory BB:/egee/soft/SL5/middleware/X.Y.Z-0 and unzip.
  4. Change the softlink BB:/egee/soft/SL5/middleware/prod to point to BB:/egee/soft/SL5/middleware/X.Y.Z-0
  5. Run the script BB:/egee/soft/SL5/local/yaim-conf/ This will create the directory structure needed to download the CRLs.
  6. Run yaim to configure the middleware: /egee/soft/SL5/middleware/prod/glite/yaim/bin/yaim -c -s /egee/soft/SL5/local/yaim-conf/site-info.def -n glite-WN_TAR
  7. Run yaim to download the CRL config files: /egee/soft/SL5/middleware/prod/glite/yaim/bin/yaim -r -s /egee/soft/SL5/local/yaim-conf/site-info.def -n glite-WN_TAR -f config_certs_userland -f config_crl
  8. Download the actual CRLs: /egee/soft/SL5/local/bin-cron/local-fetch-crl >> /egee/soft/SL5/local/log/fetch-crl-cron.log 2>&1
  9. Run the script BB:/egee/soft/SL5/local/yaim-conf/ This creates a script BB:/egee/soft/SL5/middleware/prod/external/etc/profile.d/, which is sourced by all grid users in order to setup the correct $X509_CERT_DIR and $X509_VOMS_DIR variables. It also creates the directory BB:/egee/soft/SL5/middleware/prod/external/etc/grid-security/gridmapdir, which is used by the /egee/soft/SL5/middleware/prod/lcg/sbin/ script to clean user home areas, and fixes the libldap bug.
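
Steps 2 and 4 (creating the release directory and repointing the prod softlink) can be sketched as follows. This runs against a temporary directory standing in for /egee/soft/SL5/middleware, and X.Y.Z-0 is the same placeholder used in the steps above, so nothing here touches the real installation.

```shell
#!/bin/sh
# Sketch only: uses a temporary directory in place of the real
# middleware area; the version string is a placeholder.
base=$(mktemp -d)
version=X.Y.Z-0

# Step 2: create the directory for the new release.
mkdir -p "$base/$version"

# (Step 3, downloading and unpacking the tarballs, is omitted here.)

# Step 4: repoint the prod softlink at the new release.
# -f replaces an existing link; -n stops ln from following an
# existing prod link and creating the new link inside its target.
ln -sfn "$base/$version" "$base/prod"

ls -l "$base/prod"
```

Because the old release directory is left in place, rolling back is a matter of pointing prod at the previous version again.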

It is sometimes required that the user home areas and software experiment areas be recreated (usually after a problem with the NAS). This may be achieved by running the scripts and in u4n128:/egee/soft/SL5/local/yaim-conf/. The script simply recreates software areas in /egee/soft/SL5/, ensuring that the directories are owned by experiment software users and are group readable.

The script obtains a list of users from the users.conf file and then creates home directories for those users if they are not found in /egee/home. This script will also generate dsa keys for new users. Finally, it harvests all dsa keys from all users and places them in the file public_keys. This file can then be copied into /etc/ssh/extra/opYtert2hpwTCsaRT9f36grTz on epgr04 and epgr07 (ie submission nodes for BlueBEAR). The sshd service should then be restarted.

Note that and both make heavy use of sudo, so these scripts should be run only with the proper rights (currently only tested as curtisc)!


  1. Make the directory /home/lcgui/$ARCH/middleware/X.Y.Z-0, where X.Y.Z-0 is the version number of the latest glite-UI_TAR release.
  2. Download the latest glite-UI_TAR and glite-UI_TAR-externals tarball releases into the directory /home/lcgui/$ARCH/middleware/X.Y.Z-0 and unzip.
  3. Change the softlink /home/lcgui/$ARCH/middleware/prod to point to /home/lcgui/$ARCH/middleware/X.Y.Z-0
  4. Run the script /home/lcgui/$ARCH/yaim-conf/ This will create the directory structure needed to download the CRLs.
  5. Run yaim to configure the middleware: /home/lcgui/$ARCH/middleware/prod/glite/yaim/bin/yaim -c -s /home/lcgui/$ARCH/yaim-conf/site-info.def -n glite-UI_TAR
  6. Run yaim to download the CRL config files: /home/lcgui/$ARCH/middleware/prod/glite/yaim/bin/yaim -r -s /home/lcgui/$ARCH/yaim-conf/site-info.def -n glite-UI_TAR -f config_certs_userland -f config_crl
  7. Download the actual CRLs: /home/lcgui/$ARCH/local/bin-cron/local-fetch-crl >> /home/lcgui/$ARCH/local/log/fetch-crl-cron.log 2>&1

The installation should then be tested from eprexa/b by using the voms-proxy-init, glite-wms-job-submit and glite-wms-job-status commands.
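
A test along these lines might look like the sketch below. It writes a trivial job description and prints the three commands rather than running them, since they need a configured UI and a valid certificate. The VO name, the job description contents and the jobid.txt file name are assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch only: writes a trivial JDL file and prints the test commands.
jdl=$(mktemp)
cat > "$jdl" <<'EOF'
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
EOF

vo=atlas   # placeholder VO; use one you are a member of

# Commands to run by hand on the UI, in order:
# create a proxy, submit the job, then check its status.
cat <<EOF
voms-proxy-init --voms $vo
glite-wms-job-submit -a -o jobid.txt $jdl
glite-wms-job-status -i jobid.txt
EOF
```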


  • Valid $ARCH values are either or SL5
  • To set up the grid UI, users source the script /usr/local/bin/lcguisetup. This calls /home/lcgui/$ARCH/local/lcguisetup.bash, which in turn calls the appropriate script and sets the DPM variables required by ATLAS.
  • UI_TAR 3.1.44-0 appears to have a bug in that external/usr/lib is not appended to LD_LIBRARY_PATH. This bug is fixed by appending the variable in the external/etc/profile.d/ script. This script should be created by running /home/lcgui/ after running yaim.
  • UI_TAR 3.2.6-0 appears not to install any .pem files in external/etc/grid-security/vomsdir/. This was fixed by copying them from the SL4 installation.
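
The LD_LIBRARY_PATH workaround for UI_TAR 3.1.44-0 amounts to appending one directory to the variable. A minimal sketch of that idea follows; append_ld_path is a hypothetical helper, and the full library path is inferred from the directory layout described on this page rather than taken from the actual profile.d script.

```shell
#!/bin/sh
# Sketch only: append_ld_path is an illustrative helper; the library
# path below is inferred from the layout on this page.
append_ld_path() {
    # Append $1 to LD_LIBRARY_PATH unless it is already present.
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;
        *) LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$1" ;;
    esac
    export LD_LIBRARY_PATH
}

ARCH=SL5
append_ld_path "/home/lcgui/$ARCH/middleware/prod/external/usr/lib"
echo "$LD_LIBRARY_PATH"
```

Checking for an existing entry first means the script can be sourced repeatedly without growing the variable.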

DPM Head Node (SE)

DPM Pool Node





-- ChristopherCurtis - 22 Feb 2010

Topic revision: r9 - 05 Jan 2011 - /C=UK/O=eScience/OU=Birmingham/L=ParticlePhysics/CN=christopher curtis