---+ Local Grid Cookbook

A How-To guide on installing, monitoring and maintaining Grid nodes at Birmingham.

%TOC%

---++ Solving key problems

Most problems can be solved by doing the following:

   1 *Reboot* - Log onto epgmo1 and use the command =cfrun $host -- -D reboot=, where =$host= is the fully qualified host name (eg =epgse1.ph.bham.ac.uk=).
   1 *Run yaim* - If rebooting doesn't fix the problem, run yaim from epgmo1 with the command =cfrun $host -- -D reyaim=.
   1 *Reinstall* - If all else fails, reinstall the node.

Note that rebooting/reyaiming DPM nodes can take some time. These actions kill the SRM processes, but only after allowing existing transfers to complete!

---++ Monitoring

There are a number of key links which should be monitored periodically (ranked in order of importance):

   1 [[https://gridppnagios.physics.ox.ac.uk/nagios/][GridPP Nagios]] - Click on the "Problems" link and search for .bham.ac.uk
   1 [[http://gridinfo.triumf.ca/panglia/sites/day.php?SITE=UKI-SOUTHGRID-BHAM-HEP&SIZE=large][ATLAS Production]] - This should show a) lots of jobs assigned to Birmingham (black line) and b) a low number of failures (light green line). It is useful to compare with [[http://gridinfo.triumf.ca/panglia/sites/day.php?SITE=UKI-SOUTHGRID-OX-HEP&SIZE=large][Oxford]].

A more detailed list of useful links can be found [[GridLinks][here]]. There are also a number of locally managed monitoring web pages, including [[epgr08.ph.bham.ac.uk/ganglia][Ganglia]], [[epgr08.ph.bham.ac.uk/nagios][Nagios]] and [[http://epgr08.ph.bham.ac.uk/pakiti/hosts.php?t=all&o=host&d=&a=all][Pakiti]]. These are only viewable from the PP subnet (although some are publicly accessible on port 8888).

---++ Recurring Problems

---+++ Job Submission to Cream CE

For reasons that are not yet understood, we periodically fail Ops Nagios tests on the CreamCE =epgr07.ph.bham.ac.uk=. This can be fixed temporarily by rebooting, but the problem will recur. _This appears to be related to the blparser service._

---++ Maintenance

---+++ Package update

It is not safe to leave yum to update automatically: applying automatic updates to nodes (especially the worker nodes) can have unexpected consequences. All updates should be completed "manually", that is to say, via cfengine under controlled conditions. The cfengine template will then take care of any known problems.

   1 Log onto =epgmo1= and run the command =cfrun $host -- -D yum_update=.
   1 After updating, it is prudent to rerun yaim with the command =cfrun $host -- -D reyaim=.

---+++ Edit iptables

   1 Log onto =epgmo1= and edit the relevant =iptables.rules= file (an illustrative example is given after this list). These are kept in =/var/cfengine/inputs/repo/$role/iptables.rules=, where =$role= is the role of the machine (eg =dpm_head_node= or =glexec_wn=; a list of roles is maintained [[LocalGridMachines][here]]). Note that the CE, VM host and DPM Pool Node roles have an extra directory layer in order to further differentiate.
   1 Run cfengine with the command =cfrun $host=, where =$host= is the name of the machine to update. Cfengine will then copy the new rules onto the appropriate server and restart the iptables service.
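As a sketch, the change might look like the following. Everything here is an illustrative assumption rather than a rule from the live config: the =glexec_wn= role, the =iptables-restore= style syntax, and the choice of port 8888 (the monitoring port mentioned above).

<verbatim>
# Illustrative example only - the rule, role and port are assumptions, not live config.
# In /var/cfengine/inputs/repo/glexec_wn/iptables.rules, add a line such as:
-A INPUT -m state --state NEW -p tcp --dport 8888 -j ACCEPT

# Then, from epgmo1, push the rules out and restart iptables:
cfrun $host
</verbatim>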
---+++ Adjust Maui settings

   1 Log onto =epgmo1= and edit the file =/var/cfengine/inputs/repo/ce/twin_cream_ce/maui.cfg= as appropriate (an illustrative snippet is shown after this list).
   1 Run cfengine with the command =cfrun epgr05.ph.bham.ac.uk -- -D restart_maui=.
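The fragment below shows the sort of edit that might be made. It is purely illustrative: the directives are standard =maui.cfg= syntax, but the values and group names are invented and are not Birmingham's live settings.

<verbatim>
# Illustrative maui.cfg fragment - values and groups are made up, not the live config
RMPOLLINTERVAL    00:00:30
FSPOLICY          DEDICATEDPS
FSWEIGHT          100
GROUPCFG[atlas]   FSTARGET=50
GROUPCFG[lhcb]    FSTARGET=25
</verbatim>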
---+++ Restart services

The =pbs_mom/pbs_server=, =cron=, =maui=, =nagios= and =ganglia= services may all be restarted via cfengine. Simply run the command =cfrun $host -- -D restart_XXXX=, where =XXXX= is the service name. *Note that for pbs services the command is =restart_pbs=!*

---+++ Reboot

This can be achieved with the command =cfrun $host -- -D reboot=. Note that some machines can take a long time to shut down, as they may need to let service tasks finish. For example, DPM nodes will wait for GridFTP transfers to complete before shutting down the service.

---+++ APEL

Sometimes APEL does not publish accounting information properly. The first sign of a failure to publish will probably be a failed =org.apel.APEL-Pub= Nagios test. More information about the problem can be found in the APEL log file, =/var/log/apel.log=. Some problems can be fixed by running the APEL publisher manually, with the appropriate config file. To run the APEL publisher, log onto the APEL node and execute the commands:

<verbatim>
export APEL_HOME=/opt/glite/
/opt/glite/bin/apel-publisher -f /opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml >> /var/log/apel.log 2>&1
</verbatim>

The default config in =/opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml= should publish all new records. Despite its misleading name, it does not publish missing records, ie it will not publish records it previously missed if it has already published more recent records. If you need to publish records from a gap period (the period should be obvious from the Nagios error message), the gap publisher can be invoked by editing =publisher-config-yaim.xml=: change the Republish tag from =missing= to =gap=, with appropriate date attributes, eg:

<verbatim>
From
  <Republish>missing</Republish>
to
  <Republish recordStart="2010-12-01" recordEnd="2011-01-05">gap</Republish>
</verbatim>

---++ Installation

---+++ CE

The Computing Elements are deployed and managed via cfengine. Relevant files are stored in =/var/cfengine/inputs/repo/ce=. This directory contains some scripts that are global to all CEs (ie =listdone=, =qsub.sh= and =showusers=). It also holds a subdirectory per CE, using the cfengine label (ie files for the lcg-CE that feeds the local cluster are stored in =twin_lcg_ce=). In addition, CEs require machine certificates, and these are held in =/var/cfengine/inputs/repo/certificates/$hostname=.

CEs feeding the local cluster submit jobs via the local torque server, which currently runs on the local cluster Cream CE ( _ie epgr05_). The CEs feeding the BlueBEAR cluster submit jobs via the BlueBEAR torque server, which can be reached via bbexport.bham.ac.uk (147.188.126.20).

---++++ lcg-CE

The lcg-CEs are close to being deprecated, but are still used by some VOs. Strictly speaking, they require SL4 on a 32bit machine. However, the BlueBEAR lcg-CE requires some 64bit binaries (see below), so the Birmingham lcg-CEs have always been deployed on 64bit machines. Assuming that a vanilla 64bit server has been kickstarted, an lcg-CE can be deployed in the following manner:

   1 Edit =/var/cfengine/inputs/cfagent.conf= to reflect the machine name. Typically you will have to change the =class_role=, =cert_path= and =classes= variables.
   1 Make sure the machine has =hostcert.pem= and =hostkey.pem= certificates stored in =/var/cfengine/inputs/repo/certificates/$host=.
   1 Make sure the relevant site-info.def file has been updated. This can be found in =/var/cfengine/inputs/repo/ce/<role>/site-info.def=, where =role= is either =twin_lcg_ce= or =bb_lcg_ce=. The only yaim variables that should need editing are =CE_HOST= and =BATCH_SERVER=.
   1 Run =cfrun $host=. This will make sure all the relevant files are available and directories are created.
   1 Run =cfrun $host -- -D install_fresh=. This will install =lcg-CA=, =lcg-CE= and =glite-TORQUE_utils=.
   1 Configure the CE using yaim with the command =cfrun $host -- -D reyaim=.

---++++ CreamCE

The Cream CEs are the new standard broker between the Grid and clusters. They require SL5 on a 64bit machine - they will not run on a 32bit machine. Assuming that a vanilla 64bit server has been kickstarted, a CreamCE can be deployed in much the same manner as an lcg-CE (the full command sequence is summarised after this list):

   1 Edit =/var/cfengine/inputs/cfagent.conf= to reflect the machine name. Typically you will have to change the =class_role=, =cert_path= and =classes= variables.
   1 Make sure the machine has =hostcert.pem= and =hostkey.pem= certificates stored in =/var/cfengine/inputs/repo/certificates/$host=.
   1 Make sure the relevant site-info.def file has been updated. This can be found in =/var/cfengine/inputs/repo/ce/<role>/site-info.def=, where =role= is either =twin_cream_ce= or =bb_cream_ce=. The only yaim variables that should need editing are =CE_HOST= and =BATCH_SERVER=.
   1 Run =cfrun $host=. This will make sure all the relevant files are available and directories are created.
   1 Run =cfrun $host -- -D install_fresh= to install the middleware packages.
   1 Configure the CE using yaim with the command =cfrun $host -- -D reyaim=.
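In practice, once the cfengine configuration and host certificates are in place (steps 1-3 above), the deployment itself boils down to three commands run from =epgmo1=:

<verbatim>
cfrun $host                        # create directories and copy files into place
cfrun $host -- -D install_fresh    # install the middleware packages
cfrun $host -- -D reyaim           # configure the CE with yaim
</verbatim>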
---++++ Local Cluster CEs

Jobs submitted to the local grid cluster are submitted via the local torque server, which currently runs on the local CreamCE (epgr05). This requires that =glite-TORQUE_server= also be installed and configured on this machine. This is handled automatically by the cfengine =install_fresh= and =reyaim= commands if the host has been properly marked as =twin_cream_ce=.

---++++ BlueBEAR CEs

In order to communicate with the BlueBEAR torque server, the CEs must use torque client binaries containing the appropriate signature. These are available via Lawrie and are distributed via cfengine. They are stored in the file =/var/cfengine/inputs/repo/ce/bb_cream_ce/bbmoab.tar= and are unpacked into =/root/= automatically when the =install_fresh= cfengine command is used.

---+++ glite-WN

---++++ Local Cluster WN

---++++ BlueBEAR WN

   1 Log onto u4n128 on BB and =sudo -s -H -u g-admin=.
   1 Make the directory =BB:/egee/soft/SL5/middleware/X.Y.Z-0=, where =X.Y.Z-0= is the version number of the latest glite-WN_TAR release.
   1 Download the latest =glite-WN_TAR= and =glite-WN_TAR-externals= tarball releases into the directory =BB:/egee/soft/SL5/middleware/X.Y.Z-0= and unzip.
   1 Change the softlink =BB:/egee/soft/SL5/middleware/prod= to point to =BB:/egee/soft/SL5/middleware/X.Y.Z-0=.
   1 Run the script =BB:/egee/soft/SL5/local/yaim-conf/pre_yaim.sh=. This will create the directory structure needed to download the CRLs.
   1 Run yaim to configure the middleware: =/egee/soft/SL5/middleware/prod/glite/yaim/bin/yaim -c -s /egee/soft/SL5/local/yaim-conf/site-info.def -n glite-WN_TAR=
   1 Run yaim to download the CRL config files: =/egee/soft/SL5/middleware/prod/glite/yaim/bin/yaim -r -s /egee/soft/SL5/local/yaim-conf/site-info.def -n glite-WN_TAR -f config_certs_userland -f config_crl=
   1 Download the actual CRLs: =/egee/soft/SL5/local/bin-cron/local-fetch-crl >> /egee/soft/SL5/local/log/fetch-crl-cron.log 2>&1=
   1 Run the script =BB:/egee/soft/SL5/local/yaim-conf/post_yaim.sh=. This creates a script =BB:/egee/soft/SL5/middleware/prod/external/etc/profile.d/x509.sh=, which is sourced by all grid users in order to set up the correct =$X509_CERT_DIR= and =$X509_VOMS_DIR= variables. It also creates the directory =BB:/egee/soft/SL5/middleware/prod/external/etc/grid-security/gridmapdir=, which is used by the =/egee/soft/SL5/middleware/prod/lcg/sbin/cleanup-grid-accounts.sh= script to clean user home areas, and fixes the =libldap= bug.

It is sometimes necessary to recreate the user home areas and experiment software areas (usually after a problem with the NAS). This may be achieved by running the scripts =config_users.sh= and =config_software.sh= in =u4n128:/egee/soft/SL5/local/yaim-conf/=. The =config_software.sh= script simply recreates the software areas in =/egee/soft/SL5/=, ensuring that the directories are owned by the experiment software users and are group readable. The =config_users.sh= script obtains a list of users from the =users.conf= file and then creates home directories for those users if they are not found in =/egee/home=. It will also generate DSA keys for new users. Finally, it harvests the DSA keys from all users and places them in the file =public_keys=. This file can then be copied into =/etc/ssh/extra/opYtert2hpwTCsaRT9f36grTz= on epgr04 and epgr07 (ie the submission nodes for BlueBEAR). The =sshd= service should then be restarted. Note that =config_users.sh= and =config_software.sh= both make heavy use of sudo, so these scripts should be run only with the proper rights (currently only tested as =curtisc=)!

---+++ UI

   1 Make the directory =/home/lcgui/$ARCH/middleware/X.Y.Z-0=, where =X.Y.Z-0= is the version number of the latest glite-UI_TAR release.
   1 Download the latest =glite-UI_TAR= and =glite-UI_TAR-externals= tarball releases into the directory =/home/lcgui/$ARCH/middleware/X.Y.Z-0= and unzip.
   1 Change the softlink =/home/lcgui/$ARCH/middleware/prod= to point to =/home/lcgui/$ARCH/middleware/X.Y.Z-0=.
   1 Run the script =/home/lcgui/$ARCH/yaim-conf/pre_yaim.sh=. This will create the directory structure needed to download the CRLs.
   1 Run yaim to configure the middleware: =/home/lcgui/$ARCH/middleware/prod/glite/yaim/bin/yaim -c -s /home/lcgui/$ARCH/yaim-conf/site-info.def -n glite-UI_TAR=
   1 Run yaim to download the CRL config files: =/home/lcgui/$ARCH/middleware/prod/glite/yaim/bin/yaim -r -s /home/lcgui/$ARCH/yaim-conf/site-info.def -n glite-UI_TAR -f config_certs_userland -f config_crl=
   1 Run the script =/home/lcgui/$ARCH/yaim-conf/post_yaim.sh=.
   1 Download the actual CRLs: =/home/lcgui/$ARCH/local/bin-cron/local-fetch-crl >> /home/lcgui/$ARCH/local/log/fetch-crl-cron.log 2>&1=
   1 Don't forget to download the appropriate VOMS certificates into =/home/lcgui/SL5/middleware/prod/external/etc/grid-security/vomsdir/=.

The installation should then be tested from eprexa/b by using the =voms-proxy-init=, =glite-wms-job-submit= and =glite-wms-job-status= commands.
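A minimal smoke test might look like the sketch below. The JDL is a generic hello-world job rather than a site-specific test, and the =atlas= VO is just an example; substitute a VO you are a member of.

<verbatim>
source /usr/local/bin/lcguisetup
voms-proxy-init --voms atlas

# Hello-world JDL, purely for testing the UI installation
cat > test.jdl <<EOF
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
EOF

glite-wms-job-submit -a -o jobids test.jdl   # -a: delegate a proxy automatically
glite-wms-job-status -i jobids
</verbatim>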
*Notes:*

   * Valid =$ARCH= values are either =SL4.new= or =SL5=.
   * To set up the grid UI, users source the script =/usr/local/bin/lcguisetup=. This calls =/home/lcgui/$ARCH/local/lcguisetup.bash=, which in turn calls the appropriate grid-env.sh script and sets the DPM variables required by ATLAS.
   * UI_TAR 3.1.44-0 appears to have a bug in that =external/usr/lib= is not appended to =LD_LIBRARY_PATH=. This is fixed by appending the variable in the =external/etc/profile.d/x509.sh= script, which should be created by running =/home/lcgui/SL4.new/yaim-conf/post_yaim.sh= after running yaim.
   * UI_TAR 3.2.6-0 appears not to install any =.pem= files in =external/etc/grid-security/vomsdir/=. This was fixed by copying them from the SL4 installation.

---+++ DPM Head Node (SE)

---+++ DPM Pool Node

---+++ Site BDII

---+++ MonBox

---+++ ATLAS Squid

---+++ ALICE VOBox

-- Main.ChristopherCurtis - 22 Feb 2010