Local Grid Cookbook
A How-To guide on installing, monitoring and maintaining new Grid nodes at Birmingham.
Solving key problems
Most problems can be solved by doing the following:
- Reboot - Log onto epgmo1 and use the command
cfrun $host -- -D reboot
where $host is the fully qualified host name (e.g. epgse1.ph.bham.ac.uk).
- Run yaim - If rebooting doesn't fix the problem, run yaim from epgmo1 with the command
cfrun $host -- -D reyaim
- Reinstall - If all else fails, reinstall the node.
Note that rebooting/reyaiming DPM nodes can take some time. These actions kill the SRM processes, but will only do so after allowing existing transfers to complete!
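The first two steps above can be wrapped in a small helper so that the same command form is always used. This is only a sketch: the function name fix_node and the CFRUN variable are made up for illustration and are not part of cfengine. CFRUN defaults to a dry run that just prints the command.

```shell
# Sketch only, intended to be sourced on epgmo1.
# CFRUN is a hypothetical knob: by default the command is printed, not run.
# Set CFRUN=cfrun to execute for real.
CFRUN="${CFRUN:-echo cfrun}"

# fix_node HOST STEP, where STEP is "reboot" or "reyaim"
fix_node() {
    case "$2" in
        reboot|reyaim)
            $CFRUN "$1" -- -D "$2"
            ;;
        *)
            # Reinstalling is a manual job and is deliberately not automated here.
            echo "unknown step: $2 (use reboot or reyaim)" >&2
            return 1
            ;;
    esac
}

# Example:
#   fix_node epgse1.ph.bham.ac.uk reboot
#   fix_node epgse1.ph.bham.ac.uk reyaim    # only if the reboot did not help
```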
Monitoring
There are a number of key links which should be monitored periodically (ranked in order of importance):
- SAM Nagios - Click on the "Problems" link and search for .bham.ac.uk
- ATLAS Production - This should show a) lots of jobs assigned to Birmingham (black line) and b) a low number of failures (light green line). It is useful to compare with Oxford.
- The ATLAS Ganga Blacklist - Aim to stay off this list!
A more detailed list of useful links can be found here. There are also a number of locally managed monitoring web pages, including Ganglia, Nagios and Pakiti. These are only viewable from the PP subnet (although some are publicly accessible on port 8888).
Recurring Problems
ATLASDATADISK
DATADISKs around the UK are filling up very rapidly. At the suggestion of ATLAS UK, I have taken space from the other disks and added it to DATADISK. This should avoid any space-related difficulties. In case of problems, GRIDPP-STORAGE@JISCMAIL.AC.UK can help with DPM difficulties.
LHCb SAM tests on BlueBEAR
LHCb SAM tests sometimes time out because of the slow filesystem on BlueBEAR. If this happens, reduce the number of ATLAS jobs running by restarting the appropriate qfeed scripts on epgr04.
Job Submission to Cream CE
For some reason the blparser has recently started entering a funny state on the Cream CE (epgr05), which causes problems for direct submission of jobs. Complaints usually come from LHCb or ALICE. It can be fixed by running yaim on the Cream CE (via cfengine on epgmo1).
Maintenance
Package update
It is not safe to leave yum to update automatically. Applying automatic updates to nodes (especially the worker nodes) can have unexpected consequences. All updates should be completed "manually", that is to say, via cfengine under controlled conditions. The cfengine template will then take care of any known problems.
- Log onto epgmo1 and run the command
cfrun $host -- -D yum_update
- After updating, it is prudent to rerun yaim with the command
cfrun $host -- -D reyaim
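When several nodes need updating, the two steps above can be chained so that yaim is only rerun after the update has gone through. This is a sketch: update_nodes and CFRUN are hypothetical names, and it assumes that the exit status of cfrun reflects failure, which has not been checked here.

```shell
# Sketch only. CFRUN defaults to a dry run that prints each command;
# set CFRUN=cfrun to execute for real on epgmo1.
CFRUN="${CFRUN:-echo cfrun}"

# update_nodes HOST [HOST...]
# Runs yum_update and then reyaim on each host, stopping at the first failure.
update_nodes() {
    for host in "$@"; do
        if ! $CFRUN "$host" -- -D yum_update; then
            echo "yum_update failed on $host, not running reyaim" >&2
            return 1
        fi
        $CFRUN "$host" -- -D reyaim || return 1
    done
}

# Example:
#   update_nodes epgse1.ph.bham.ac.uk epgr05.ph.bham.ac.uk
```

Stopping at the first failure keeps a bad update from being rolled out to the remaining hosts in the list.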
Edit iptables
- Log onto epgmo1 and edit the relevant iptables.rules files. These are kept in /var/cfengine/inputs/repo/$role/iptables.rules, where $role is the role of the machine (e.g. dpm_head_node or glexec_wn; a list of roles is maintained here). Note that the CE, VM host and DPM Pool Node roles have an extra directory layer in order to further differentiate.
- Run cfengine with the command
cfrun $host
where $host is the name of the machine to update. Cfengine will then copy the new rules onto the appropriate server and restart the iptables service.
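The path rule above is easy to get wrong for the roles with an extra directory layer, so a tiny helper can build it. This is a sketch: rules_path is a hypothetical name, and the two-level example ce/twin_cream_ce is taken by analogy from the maui.cfg path used elsewhere on this page, so it is an assumption that an iptables.rules file sits there too.

```shell
# Sketch only. REPO is the cfengine repository root on epgmo1.
REPO=/var/cfengine/inputs/repo

# rules_path ROLE
# ROLE may contain the extra directory layer, e.g. ce/twin_cream_ce (assumed layout).
rules_path() {
    printf '%s/%s/iptables.rules\n' "$REPO" "$1"
}

# Example:
#   vi "$(rules_path dpm_head_node)"
#   cfrun $host
```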
Adjust Maui settings
- Log onto epgmo1 and edit the file /var/cfengine/inputs/repo/ce/twin_cream_ce/maui.cfg as appropriate.
- Run cfengine with the command
cfrun epgr05.ph.bham.ac.uk -- -D restart_maui
Restart services
The pbs_mom/pbs_server, cron, maui, nagios and ganglia services may all be restarted via cfengine. Simply run the command
cfrun $host -- -D restart_XXXX
where XXXX is the service name.
Note that for the pbs services XXXX is simply pbs, so the command ends in restart_pbs!
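The restart_XXXX naming rule, including the pbs exception, can be captured in a small lookup. This is a sketch: restart_class, restart_service and CFRUN are hypothetical names introduced here, and only the services listed above are accepted.

```shell
# Sketch only. CFRUN defaults to a dry run that prints the command;
# set CFRUN=cfrun to execute for real on epgmo1.
CFRUN="${CFRUN:-echo cfrun}"

# restart_class SERVICE
# Prints the name to pass after -D for the given service.
restart_class() {
    case "$1" in
        pbs|pbs_mom|pbs_server)   echo restart_pbs ;;   # both pbs services share one name
        cron|maui|nagios|ganglia) echo "restart_$1" ;;
        *)
            echo "no restart defined for service: $1" >&2
            return 1
            ;;
    esac
}

# restart_service HOST SERVICE
restart_service() {
    cls=$(restart_class "$2") || return 1
    $CFRUN "$1" -- -D "$cls"
}

# Example:
#   restart_service epgr05.ph.bham.ac.uk maui
```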
Reboot
This can be achieved with the command
cfrun $host -- -D reboot
Note that some machines can take a long time to shut down as they may require some service tasks to finish. For example, DPM nodes will wait for GridFTP transfers to complete before shutting down the service.
Installation
lcg-CE
Local Cluster CE
glite-WN
Local Cluster WN
- Log onto u4n128 on BB and run
sudo -s -H -u g-admin
- Make the directory BB:/egee/soft/SL5/middleware/X.Y.Z-0, where X.Y.Z-0 is the version number of the latest glite-WN_TAR release.
- Download the latest glite-WN_TAR and glite-WN_TAR-externals tarball releases into the directory BB:/egee/soft/SL5/middleware/X.Y.Z-0 and unpack them.
- Change the softlink BB:/egee/soft/SL5/middleware/prod to point to BB:/egee/soft/SL5/middleware/X.Y.Z-0
- Run the script BB:/egee/soft/SL5/local/yaim-conf/pre_yaim.sh. This will create the directory structure needed to download the CRLs.
- Run yaim to configure the middleware:
/egee/soft/SL5/middleware/prod/glite/yaim/bin/yaim -c -s /egee/soft/SL5/local/yaim-conf/site-info.def -n glite-WN_TAR
- Run yaim to download the CRL config files:
/egee/soft/SL5/middleware/prod/glite/yaim/bin/yaim -r -s /egee/soft/SL5/local/yaim-conf/site-info.def -n glite-WN_TAR -f config_certs_userland -f config_crl
- Download the actual CRLs:
/egee/soft/SL5/local/bin-cron/local-fetch-crl >> /egee/soft/SL5/local/log/fetch-crl-cron.log 2>&1
- Run the script BB:/egee/soft/SL5/local/yaim-conf/post_yaim.sh. This creates a script BB:/egee/soft/SL5/middleware/prod/external/etc/profile.d/x509.sh, which is sourced by all grid users in order to set up the correct $X509_CERT_DIR and $X509_VOMS_DIR variables. It also creates the directory BB:/egee/soft/SL5/middleware/prod/external/etc/grid-security/gridmapdir, which is used by the /egee/soft/SL5/middleware/prod/lcg/sbin/cleanup-grid-accounts.sh script to clean user home areas, and fixes the libldap bug.
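The softlink step in the list above is the one that switches production over to the new release, so it is worth guarding against a mistyped version number. A minimal sketch, assuming GNU ln; switch_prod is a hypothetical helper, not an existing script:

```shell
# Sketch only.
# switch_prod BASE VERSION
# Points BASE/prod at BASE/VERSION, refusing if the release directory is missing.
switch_prod() {
    if [ ! -d "$1/$2" ]; then
        echo "no such release directory: $1/$2" >&2
        return 1
    fi
    # -n stops ln from following the old prod link into the previous release;
    # -f replaces the old link.
    ln -sfn "$1/$2" "$1/prod"
}

# Example (X.Y.Z-0 stands for the real version number):
#   switch_prod /egee/soft/SL5/middleware X.Y.Z-0
```

Keeping each release in its own directory means that rolling back is just a matter of pointing prod at the previous version again.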
It is sometimes required that the user home areas and software experiment areas be recreated (usually after a problem with the NAS). This may be achieved by running the scripts config_users.sh and config_software.sh in u4n128:/egee/soft/SL5/local/yaim-conf/. The config_software.sh script simply recreates software areas in /egee/soft/SL5/, ensuring that the directories are owned by experiment software users and are group readable.
The config_users.sh script obtains a list of users from the users.conf file and then creates home directories for those users if they are not found in /egee/home. This script will also generate DSA keys for new users. Finally, it harvests all DSA keys from all users and places them in the file public_keys. This file can then be copied into /etc/ssh/extra/opYtert2hpwTCsaRT9f36grTz on epgr04 and epgr07 (i.e. the submission nodes for BlueBEAR). The sshd service should then be restarted.
Note that config_users.sh and config_software.sh both make heavy use of sudo, so these scripts should be run only with the proper rights (currently only tested as curtisc)!
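To illustrate the key harvesting step that config_users.sh performs, here is a rough sketch. It is not the real script: harvest_keys is a hypothetical name, and the key location .ssh/id_dsa.pub inside each home directory is an assumption. Unlike the real script, it needs no sudo because it only reads files the caller can already see.

```shell
# Sketch only, not the actual config_users.sh.
# harvest_keys HOME_ROOT OUTFILE
# Concatenates every user's public DSA key (assumed path: <home>/.ssh/id_dsa.pub)
# into OUTFILE, starting from an empty file.
harvest_keys() {
    : > "$2"
    for key in "$1"/*/.ssh/id_dsa.pub; do
        # Skip users without a key (and the unexpanded pattern if nobody has one).
        [ -f "$key" ] && cat "$key" >> "$2"
    done
    return 0
}

# Example:
#   harvest_keys /egee/home public_keys
```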
UI
- Make the directory /home/lcgui/$ARCH/middleware/X.Y.Z-0, where X.Y.Z-0 is the version number of the latest glite-UI_TAR release.
- Download the latest glite-UI_TAR and glite-UI_TAR-externals tarball releases into the directory /home/lcgui/$ARCH/middleware/X.Y.Z-0 and unpack them.
- Change the softlink /home/lcgui/$ARCH/middleware/prod to point to /home/lcgui/$ARCH/middleware/X.Y.Z-0
- Run the script /home/lcgui/$ARCH/yaim-conf/pre_yaim.sh. This will create the directory structure needed to download the CRLs.
- Run yaim to configure the middleware:
/home/lcgui/$ARCH/middleware/prod/glite/yaim/bin/yaim -c -s /home/lcgui/$ARCH/yaim-conf/site-info.def -n glite-UI_TAR
- Run yaim to download the CRL config files:
/home/lcgui/$ARCH/middleware/prod/glite/yaim/bin/yaim -r -s /home/lcgui/$ARCH/yaim-conf/site-info.def -n glite-UI_TAR -f config_certs_userland -f config_crl
- Download the actual CRLs:
/home/lcgui/$ARCH/local/bin-cron/local-fetch-crl >> /home/lcgui/$ARCH/local/log/fetch-crl-cron.log 2>&1
The installation should then be tested from eprexa/b by using the voms-proxy-init, glite-wms-job-submit and glite-wms-job-status commands.
Notes:
- Valid $ARCH values are either SL4.new or SL5
- To set up the grid UI, users source the script /usr/local/bin/lcguisetup. This calls /home/lcgui/$ARCH/local/lcguisetup.bash, which in turn calls the appropriate grid-env.sh script and sets the DPM variables required by ATLAS.
- UI_TAR 3.1.44-0 appears to have a bug in that external/usr/lib is not appended to LD_LIBRARY_PATH. This bug is fixed by appending the path to the variable in the external/etc/profile.d/x509.sh script. This script should be created by running /home/lcgui/SL4.new/yaim-conf/post_yaim.sh after running yaim.
- UI_TAR 3.2.6-0 appears not to install any .pem files in external/etc/grid-security/vomsdir/. This was fixed by copying them from the SL4 installation.
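Since every UI path depends on $ARCH, the restriction from the notes can be checked before the yaim configure command is built. This is a sketch: ui_yaim_cmd is a hypothetical helper that only prints the command from the installation steps and does not run it.

```shell
# Sketch only.
# ui_yaim_cmd ARCH
# Prints the yaim configure command for the given UI tree,
# rejecting anything other than the two valid $ARCH values.
ui_yaim_cmd() {
    case "$1" in
        SL4.new|SL5) ;;
        *)
            echo "invalid ARCH: $1 (expected SL4.new or SL5)" >&2
            return 1
            ;;
    esac
    root="/home/lcgui/$1"
    echo "$root/middleware/prod/glite/yaim/bin/yaim -c -s $root/yaim-conf/site-info.def -n glite-UI_TAR"
}

# Example:
#   ui_yaim_cmd SL5        # inspect the command
#   $(ui_yaim_cmd SL5)     # run it
```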
DPM Head Node (SE)
DPM Pool Node
Site BDII
MonBox
ATLAS Squid
ALICE VOBox
--
ChristopherCurtis - 22 Feb 2010