Installing SL5 on the Birmingham Worker Nodes
The aim of this project is to install SL5 on the Birmingham Worker Nodes. The plan is as follows:
- Move one worker node (epgd01) offline on epgce3
- Install SL5 on singlet node
- Install WN middleware on singlet node
- Create a new test queue on epgce3, enabled for atlas users and software
- Add the singlet node to new queue
- Submit Hello World test job to new queue
- If that works, request new software installation for SL5
- Submit ATLAS test job
If this is successful, the intention is to roll out SL5 to all other WNs so that Birmingham is entirely SL5.
Node Isolation
Started by using the command
pbsnodes -o epgd01.ph.bham.ac.uk
on epgce3 to move the first node offline. 7 jobs were running on this node at the time - 6 camont and 1 CMS. An attempt was made to move the jobs to other nodes by changing their status to held and then restarting them. This failed: the job status could not be changed, and the jobs could not be moved or restarted (this might be something that needs to change in the future!). The jobs were killed instead.
yaim does not appear to support assigning separate nodes to separate queues directly, so the following method was used to partition the nodes:
-
qmgr -c 'set node epgd01.ph.bham.ac.uk properties+=SL5'
-
qmgr -c 'set queue long resources_default.neednodes=lcgpro'
-
qmgr -c 'set queue short resources_default.neednodes=lcgpro'
This sets properties of the nodes to either have
SL5
or the default
lcgpro
. Each queue then requires a node to have the appropriate property before a job is submitted to it. The command
pbsnodes -c epgd01.ph.bham.ac.uk
was used to move the singlet node back online. The
listnodejobs
command (custom script!) was used to confirm that no jobs were being submitted to the singlet node.
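The partitioning steps above can be collected into a small script. The following is a minimal sketch, not the procedure that was actually run: the run, tag_node and pin_queue helpers and the DRY_RUN variable are names invented for this illustration. In the default dry-run mode the commands are only printed, so that they can be checked before anything is changed on epgce3.

```shell
#!/bin/sh
# Sketch only: partition Torque nodes between queues using node properties.
# DRY_RUN, run, tag_node and pin_queue are illustrative names, not part of Torque.
DRY_RUN="${DRY_RUN:-1}"

# In dry-run mode print the command words (shell quoting is not reproduced),
# otherwise execute the command.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Append a property to a node: $1 = node name, $2 = property.
tag_node() {
    run qmgr -c "set node $1 properties+=$2"
}

# Make a queue ask for nodes with a given property by default:
# $1 = queue name, $2 = property.
pin_queue() {
    run qmgr -c "set queue $1 resources_default.neednodes=$2"
}

NODE=epgd01.ph.bham.ac.uk

tag_node "$NODE" SL5        # mark the singlet node
pin_queue long lcgpro       # production queues keep to the lcgpro nodes
pin_queue short lcgpro
run pbsnodes -c "$NODE"     # clear the offline flag on the singlet node
```

Setting DRY_RUN=0 in the environment makes the script execute the commands instead of printing them.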
Test Queue Setup
This was achieved by first setting up the new queue in PBS and then broadcasting its existence to the BDII by re-running yaim.
-
qmgr -c "print queue long" > output.dat
prints the long queue details to a file
- Changed queue name to
sl5_test
- Removed most access list rights (jobs restricted to ATLAS and dteam)
- Copied
output.dat
commands into qmgr. This set up the new queue.
-
qmgr -c 'set queue sl5_test resources_default.neednodes=SL5'
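The copy-and-rename steps above could also be done with a filter rather than by hand editing. This is a sketch under assumptions: rename_queue and the file name sl5_test.qmgr are made up for the example, and the access list lines would still need to be trimmed by hand as described above.

```shell
#!/bin/sh
# Sketch only: turn a dumped queue definition into one for a new queue.
# rename_queue is an illustrative helper: $1 = old queue name, $2 = new queue name.
# It reads qmgr "print queue" output on stdin and writes the renamed commands
# to stdout. Only "create queue" and "set queue" lines are changed.
rename_queue() {
    sed -e "s/^\(create queue\) $1\$/\1 $2/" \
        -e "s/^\(set queue\) $1 /\1 $2 /"
}

# Intended use on the CE (left commented out so that nothing is changed by accident):
#   qmgr -c "print queue long" | rename_queue long sl5_test > sl5_test.qmgr
#   (edit sl5_test.qmgr to trim the access lists, then)
#   qmgr < sl5_test.qmgr
#   qmgr -c 'set queue sl5_test resources_default.neednodes=SL5'
```

Writing the result to a file first, rather than piping it straight back into qmgr, leaves a chance to review the new queue definition before it is created.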
Reran yaim. The
site-info.def
file was updated to include details of the new queue. epgce3 was rebooted at this point, though a restart of pbs was probably all that was required.
TEST 1: QUEUE SUBMISSION
The first test was to ensure that jobs submitted to the sl5 queue ran on the singlet node. A simple
helloWorld.jdl
job proved this to be true (although the WMS took a very long time to update the job status).
The atlas software group was removed from the acl - Alessandro's automated jobs found the queue very quickly and tried to update software! The group will be added once SL5 has been installed and validated.
SL5 Installation...
... is tricky! An overview of the boot procedures can be found
here. A 64 bit version of SL 5.3 was installed (which also appears to be available at Oxford). Disk partitioning settings were copied from the existing
sl46-x86_64-worker.ks
. New kernel images were required for epgmo1, which were obtained
here.
gLite Installation
The following yum repository entries were configured (in addition to the defaults sl.repo etc enabled in SL5):
-
http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/glite-WN.repo
-
http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/glite-TORQUE_client.repo
-
http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/lcg-CA.repo
and the following packages installed:
-
yum update
(then rebooted to ensure that the new kernel was used)
-
yum -y install lcg-CA
-
yum -y groupinstall glite-WN
-
yum -y install glite-TORQUE_client
The node was then configured using yaim (
/opt/glite/yaim/bin/yaim -c -s /root/yaim-conf/site-info.def -n WN -n TORQUE_client
). Copies of the yaim config files can currently be found on the local filesystem (
/home/cjc/Grid/ManRepo/Yaim/sl5-WN
). These files will be distributed using cfengine at some point.
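The installation steps above fall naturally into two stages either side of the reboot. The sketch below shows one way to script them. It is not the script that was used: DRY_RUN, run, stage1 and stage2 are illustrative names, fetching the .repo files with wget into /etc/yum.repos.d is an assumption, and the -y flag on yum update is added here so that the stage can run unattended.

```shell
#!/bin/sh
# Sketch only: the gLite worker node installation as two stages around the reboot.
# DRY_RUN, run, stage1 and stage2 are illustrative names. Fetching the .repo
# files with wget into /etc/yum.repos.d is an assumption, not taken from the notes.
DRY_RUN="${DRY_RUN:-1}"
REPO_BASE=http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2

# In dry-run mode print the command words, otherwise execute the command.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Stage 1: add the repositories and update. Reboot afterwards to pick up the new kernel.
stage1() {
    for repo in glite-WN glite-TORQUE_client lcg-CA; do
        run wget -q -O "/etc/yum.repos.d/$repo.repo" "$REPO_BASE/$repo.repo"
    done
    run yum -y update
}

# Stage 2: install the middleware and configure the node with yaim.
stage2() {
    run yum -y install lcg-CA
    run yum -y groupinstall glite-WN
    run yum -y install glite-TORQUE_client
    run /opt/glite/yaim/bin/yaim -c -s /root/yaim-conf/site-info.def -n WN -n TORQUE_client
}

case "${1:-}" in
    stage1) stage1 ;;
    stage2) stage2 ;;
    *) echo "usage: $0 stage1|stage2 (reboot between the two stages)" ;;
esac
```

With the default DRY_RUN=1 each stage only prints its commands. Setting DRY_RUN=0 runs them for real, as root.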
Firewall settings
After rebooting the singlet node, a grid test job failed to run, with
checkjob
complaining of
"Server could not connect to MOM"
. iptables was still using its default rules. The rules from another node (epgd02) were exported using the command
iptables-save > rules.dat
and then imported on epgd01 with the command
iptables-restore < rules.dat
(where the file rules.dat was copied from one node to the other in between the save and the restore).
The firewall settings were returning to the defaults after every restart.
/etc/init.d/iptables save
fixed this by saving the active rules so that they persist across restarts (read up on iptables!). Jobs now enter the waiting state on epgd01 because the ssh key has changed. This needs to be updated.
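The firewall rule transfer described above can be wrapped in a few helper functions. This is a sketch with invented function names. The completeness check is an addition that was not part of the original procedure: it is there because a truncated rules file loaded on a remote node could lock out access.

```shell
#!/bin/sh
# Sketch only: copy firewall rules from a reference node to a new node.
# The function names are illustrative; all of them need root on the node concerned.

# On the reference node (epgd02 in the notes above): dump the active rules to file $1.
export_rules() {
    iptables-save > "$1"
}

# Sanity check before loading: iptables-save output normally contains a
# "*filter" table header and closes each table with COMMIT, so a truncated
# copy is likely to fail this test.
rules_look_complete() {
    grep -q '^\*filter' "$1" && grep -q '^COMMIT' "$1"
}

# On the new node (epgd01), after copying the file across: load the rules and
# save them so that they survive a restart.
import_rules() {
    rules_look_complete "$1" || {
        echo "refusing to load $1: file looks incomplete" >&2
        return 1
    }
    iptables-restore < "$1" && /etc/init.d/iptables save
}
```

For example, export_rules rules.dat would be run on epgd02 and import_rules rules.dat on epgd01 once the file has been copied over.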
The following was entered into the
/etc/hosts.allow
file:
sshd: 147.188.47.199 ALL
sshd: 147.188.47.210
sshd: 147.188.47.212
sshd: 147.188.47.33
sshd: 147.188.46.5
#
sshd: 147.188.46.9
nrpe: 147.188.46.5
And the following was added to
/etc/hosts.deny
:
sshd: ALL : spawn /bin/echo `/bin/date` from %h >> /var/log/ssh.log : deny
ALL: ALL
These files could be distributed using cfengine in the future!
SSH Keys
The same key is used for all twin worker nodes. The key pairs are kept in
epgmo1:/data1/grid/s/twins/
, and are copied into
epgd01:/etc/ssh
(via an NFS mount of epgmo1).
These files could be distributed using cfengine in the future!
TEST 2: SCP
Worker nodes are required to push files to and pull files from epgce3 without the use of a passphrase, hence the ssh key setup. This may be tested by using sudo to become another user (e.g. atl088) on a worker node and then using the command
scp -rpB test_file atl088@epgce3.ph.bham.ac.uk:test_file
. If this fails, it will not be possible to submit jobs!
Note that the reverse does not have to be true - epgce3 does not have to be able to send files to worker nodes! As this is not required, the facility should not be set up, keeping security risks to a minimum.
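The scp check above can be wrapped in a function so that it gives a clear pass or fail. This is only a sketch: check_push is an invented name, and the function tests the push direction only.

```shell
#!/bin/sh
# Sketch only: check that a worker node can push a file to the CE without a prompt.
# check_push is an illustrative helper: $1 = local file, $2 = user@host:path.
# scp -B runs in batch mode, so it fails instead of asking for a password.
check_push() {
    if scp -rpB "$1" "$2"; then
        echo "OK: $1 copied to $2"
    else
        echo "FAILED: scp of $1 to $2 did not succeed in batch mode" >&2
        return 1
    fi
}

# Intended use on a worker node, as a pool account (account name taken from the text):
#   check_push test_file atl088@epgce3.ph.bham.ac.uk:test_file
```

The function returns a non-zero status on failure, so it could be used in a loop over several pool accounts or from a monitoring check.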
TEST 3: QUEUE SUBMISSION
This test is exactly the same as Test 1 - a simple hello world job running on the singlet node, the difference being that the node should now be running SL5. The test job was successful.
Automated Installation
A cfengine script was developed to automate the installation process. The working principle is to use kickstart to install the OS and run
/etc/rc.d/rc.local
as a first boot script. This script only installs cfengine. Once the OS installation has been completed, the
cfrun
command on epgmo1 configures the node to be a worker node, downloading the glite middleware and running yaim etc.
Full details can be found
here.
ATLAS Requirements
ATLAS plans to have native SL5 builds of Athena in 2010. Until then, some additional packages are required. Most of these have been bundled into the HEPOSLibs meta package (available
here). This does not install all of the libraries required by ATLAS! ATLAS also requires 32 bit versions of various libraries - details can be found
here.
On the Birmingham twin SL 5.3 worker nodes, the
http://ftp.scientificlinux.org/linux/scientific/53/i386/SL
repository is used for obtaining 32 bit libraries. Yum configuration details are stored in
/etc/yum.repos.d/sl-i386.repo
, which is available for download from the
epgmo1.ph.bham.ac.uk
webserver (thus enabling configuration by cfengine).
Finally, SELinux is set to permissive mode (
SELINUX=permissive
in the file
/etc/sysconfig/selinux
), as detailed
here.
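The SELinux change above is a one-line edit that is easy to script, for example from a first boot script. The helper below is a sketch with an invented name, set_selinux_mode. It writes the result back into the existing file rather than replacing it, in case the config path is a symlink.

```shell
#!/bin/sh
# Sketch only: set the SELinux mode in a config file.
# set_selinux_mode is an illustrative helper: $1 = mode, $2 = config file.
# The new content is written back into the existing file through a temporary
# copy, so that a symlinked config path keeps pointing at the same file.
set_selinux_mode() {
    tmp=$(mktemp) || return 1
    sed "s/^SELINUX=.*/SELINUX=$1/" "$2" > "$tmp" && cat "$tmp" > "$2"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Intended use as root on the worker node:
#   set_selinux_mode permissive /etc/sysconfig/selinux
#   setenforce 0    # switch the running system to permissive without a reboot
```

The pattern only matches lines that start with SELINUX=, so other settings in the file, such as SELINUXTYPE, are left alone.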
ATLAS Installation
Installation of ATLAS software is automated. Requests can be made via
Alessandro De Salvo's webpage. After a number of failures, the automated jobs did manage to install a large number of Athena releases. Note that 15.1.0 and 15.2.0 consistently failed to install. The initial problems were solved by installing the following libraries (in addition to those recommended on the other twikis):
### fresh SL5 machine
lcg-CA
glite-WN
glite-TORQUE_client
HEP_OSlibs_SL5
compat-gcc-34
compat-gcc-34-c++
compat-gcc-34-g77
compat-gcc-34-g77-3.4.6-4.x86_64 #RPM install with excludedocs option
compat-glibc
compat-glibc-headers
compat-libf2c-34
compat-libgcc-296.i386
compat-libstdc++-296
compat-libstdc++-33
compat-readline43
lapack
libgfortran
ghostscript.x86_64
ghostscript.i386
libXpm
glibc-devel-2.5
giflib
compat-openldap
openssl097a
compat-db
libxml2-devel.i386
libxml2-devel.x86_64
popt.i386
popt.x86_64
blas.i386
blas.x86_64
blas-devel.i386
blas-devel.x86_64
sharutils.x86_64
bc.x86_64
curl.x86_64
procmail.x86_64
gcc
gcc-c++
libstdc++-devel
python-devel
openldap-clients
libxslt
PyXML
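With a list of this length it is useful to check which entries are still missing on a given node. The following sketch assumes the list above has been saved to a file. missing_packages and packages.txt are invented names, and the check relies on rpm accepting the name.arch form used for some entries.

```shell
#!/bin/sh
# Sketch only: report which packages from a list are not installed.
# missing_packages is an illustrative helper. It reads one package per line on
# stdin, takes the first word of each line, and skips blank lines and # comments.
missing_packages() {
    while read -r pkg rest; do
        case "$pkg" in
            ''|'#'*) continue ;;
        esac
        rpm -q "$pkg" >/dev/null 2>&1 || echo "$pkg"
    done
}

# Intended use on a worker node, with the list above saved as packages.txt
# (a hypothetical file name):
#   missing_packages < packages.txt
```

Entries that were installed as yum groups rather than as individual RPMs (glite-WN was installed with groupinstall above) may be reported as missing even when the middleware is in place.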
It is interesting to note that an ldap query shows that
t2ce05.physics.ox.ac.uk
is the Oxford SL 5.3 system.
lcg-infosites --vo atlas tag
shows that this CE does not have any recognizable Athena version installed (the tags instead taking the form
atlas-offline-rel_0-15.X.0
). A test grid job has been submitted to see if 14.5.0 is available.
--
ChristopherCurtis - 06 Oct 2009