Installing SL5 on the Birmingham Worker Nodes

The aim of this project is to install SL5 on the Birmingham Worker Nodes. The plan is as follows:

  • Move one worker node (epgd01) offline on epgce3
  • Install SL5 on singlet node
  • Install WN middleware on singlet node
  • Create a new test queue on epgce3, enabled for atlas users and software
  • Add the singlet node to new queue
  • Submit Hello World test job to new queue
  • If that works, request new software installation for SL5
  • Submit ATLAS test job

If this is successful, the intention is to roll out SL5 to all other WNs so that Birmingham is entirely SL5.

Node Isolation

The command pbsnodes -o epgd01.ph.bham.ac.uk was run on epgce3 to move the first node offline. Seven jobs were running on the node at the time: 6 camont and 1 CMS. An attempt was made to move the jobs to other nodes by changing their status to held and then restarting them. This failed: the job status could not be changed, and the jobs could not be moved or restarted (this might be something that needs to change in the future!). The jobs were killed instead.

yaim does not appear to support assigning separate nodes to separate queues directly, so the following method was used to partition the nodes:

  1. qmgr -c 'set node epgd01.ph.bham.ac.uk properties+=SL5'
  2. qmgr -c 'set queue long resources_default.neednodes=lcgpro'
  3. qmgr -c 'set queue short resources_default.neednodes=lcgpro'

This gives each node a property, either SL5 or the default lcgpro. Each queue then requires a node to have the appropriate property before one of its jobs is sent to that node. The command pbsnodes -c epgd01.ph.bham.ac.uk was used to move the singlet node back online. The listnodejobs command (custom script!) was used to confirm that no jobs were being submitted to the singlet node.
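The partitioning commands above follow a simple pattern, which can be sketched as a small shell helper. This is only an illustration: the function name partition_commands is hypothetical, and it prints the qmgr commands rather than running them, so the output can be reviewed before anything is pasted into a shell on epgce3.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the qmgr commands that tag one
# node with a property and pin the given queues to the default property.
# Arguments: node name, node property, default property, then queue names.
partition_commands() {
    node=$1
    node_prop=$2
    default_prop=$3
    shift 3
    printf "qmgr -c 'set node %s properties+=%s'\n" "$node" "$node_prop"
    for queue in "$@"; do
        printf "qmgr -c 'set queue %s resources_default.neednodes=%s'\n" \
            "$queue" "$default_prop"
    done
}

# Reproduces the three commands listed above.
partition_commands epgd01.ph.bham.ac.uk SL5 lcgpro long short
```

Printing instead of executing is a deliberate choice here, since a mistyped neednodes value on a production queue would stop jobs from being scheduled.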

Test Queue Setup

This was achieved by first setting up the new queue in PBS and then broadcasting its existence to the BDII by re-running yaim.

  1. qmgr -c "print queue long" > output.dat prints the long queue details to a file
  2. Changed queue name to sl5_test
  3. Removed most access list rights (jobs restricted to ATLAS and dteam)
  4. Copied output.dat commands into qmgr. This set up the new queue.
  5. qmgr -c "set queue sl5_test resources_default.neednodes=SL5"

The site-info.def file was updated to include details of the new queue, and yaim was rerun. epgce3 was rebooted at this point, though a restart of pbs was probably all that was required.
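The renaming step was done by hand, but it could also be scripted. The sketch below assumes the dump saved in output.dat consists of the usual "create queue NAME" and "set queue NAME ..." lines; the function name rename_queue and the two sample lines are illustrative only, not taken from the real dump.

```shell
#!/bin/sh
# Hypothetical helper: read a qmgr queue dump on stdin and rewrite the
# queue name on "create queue" and "set queue" lines only.
# The trailing ( |$) stops "long" from matching a queue called "longer".
rename_queue() {
    old=$1
    new=$2
    sed -E "s/^((create|set) queue )${old}( |\$)/\1${new}\3/"
}

# Illustrative input; a real dump contains many more attributes.
printf '%s\n' \
    'create queue long' \
    'set queue long queue_type = Execution' |
    rename_queue long sl5_test
```

In practice this would be used as rename_queue long sl5_test < output.dat > sl5_test.dat, with the access list lines still edited by hand before the result is fed back to qmgr.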

TEST 1: QUEUE SUBMISSION

The first test was to ensure that jobs submitted to the sl5_test queue ran on the singlet node. A simple helloWorld.jdl job proved this to be true (although WMS took forever to update the job status).

The atlas software group was removed from the acl - Alessandro's automated jobs found the queue very quickly and tried to update software! The group will be added once SL5 has been installed and validated.

SL5 Installation...

... is tricky! An overview of the boot procedures can be found here. A 64 bit version of SL 5.3 was installed (which also appears to be available at Oxford). Disk partitioning settings were copied from the existing sl46-x86_64-worker.ks. New kernel images were required for epgmo1, which were obtained here.

gLite Installation

The following yum repository entries were configured (in addition to the defaults sl.repo etc enabled in SL5):

  • http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/glite-WN.repo
  • http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/glite-TORQUE_client.repo
  • http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/lcg-CA.repo

and the following commands were run:

  • yum update (followed by a reboot to ensure the new kernel was in use)
  • yum -y install lcg-CA
  • yum -y groupinstall glite-WN
  • yum -y install glite-TORQUE_client

The node was then configured using yaim (/opt/glite/yaim/bin/yaim -c -s /root/yaim-conf/site-info.def -n WN -n TORQUE_client). Copies of the yaim config files can currently be found on the local filesystem (/home/cjc/Grid/ManRepo/Yaim/sl5-WN). These files will be distributed using cfengine at some point.
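The package installation and yaim configuration steps can be collected into one script. This is a sketch rather than the script actually used: the run and install_wn function names and the DRY_RUN switch are assumptions added for illustration, and the initial yum update plus reboot is left out because it has to happen before the rest.

```shell
#!/bin/sh
# Hypothetical wrapper: with DRY_RUN=1 (the default in this sketch) each
# command is only printed; set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# The install and configure sequence described above, in order.
install_wn() {
    run yum -y install lcg-CA
    run yum -y groupinstall glite-WN
    run yum -y install glite-TORQUE_client
    run /opt/glite/yaim/bin/yaim -c -s /root/yaim-conf/site-info.def \
        -n WN -n TORQUE_client
}

install_wn
```

Defaulting to a dry run means the script is safe to try on any machine; the real installation would be started as root with DRY_RUN=0.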

Firewall settings

After rebooting the singlet node, a grid test job failed to run, with checkjob complaining of "Server could not connect to MOM". iptables was still using its default rules. The rules from another node (epgd02) were exported using the command iptables-save > rules.dat and then imported on epgd01 with the command iptables-restore < rules.dat (the file rules.dat was copied from one node to the other in between the save and the restore).

The firewall settings were reverting to the defaults after every restart. /etc/init.d/iptables save fixed this (read up on iptables!!!). Jobs now enter the waiting state on epgd01 because the ssh key has changed. This needs to be updated.

The following was entered into the /etc/hosts.allow file:

sshd: 147.188.47.199 ALL
sshd: 147.188.47.210
sshd: 147.188.47.212
sshd: 147.188.47.33
sshd: 147.188.46.5
#
sshd: 147.188.46.9
nrpe: 147.188.46.5

And the following was added to /etc/hosts.deny:

sshd: ALL : spawn /bin/echo `/bin/date` from %h >> /var/log/ssh.log : deny
ALL: ALL

These files could be distributed using cfengine in the future!

SSH Keys

The same key is used for all twin worker nodes. The key pairs are kept in epgmo1:/data1/grid/s/twins/, and are copied into epgd01:/etc/ssh (via an NFS mount of epgmo1).

These files could be distributed using cfengine in the future!

TEST 2: SCP

Worker nodes are required to pull files from and push files to epgce3 without the use of a passphrase, hence the ssh key setup. This may be tested by switching to another user (e.g. atl088) with sudo on a worker node and then using the command scp -rpB test_file atl088@epgce3.ph.bham.ac.uk:test_file. If this fails, jobs cannot be submitted!
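The scp check can be wrapped in a small function so that it gives a clear pass or fail message. This is a sketch only: the function name check_scp and the SCP override variable are hypothetical, added so the logic can be tried without touching the network. The -B flag puts scp in batch mode, so it fails instead of prompting if key-based login is not working.

```shell
#!/bin/sh
# Hypothetical helper: copy a file in batch mode and report the result.
# SCP may be set to another command (e.g. SCP=true) to exercise the logic
# without a real transfer; by default the real scp is used.
check_scp() {
    file=$1
    dest=$2
    if ${SCP:-scp} -rpB "$file" "$dest"; then
        echo "OK: copied $file to $dest without a passphrase prompt"
    else
        echo "FAIL: batch-mode copy of $file to $dest failed" >&2
        return 1
    fi
}

# Example, run as a pool account such as atl088 on the worker node:
#   check_scp test_file atl088@epgce3.ph.bham.ac.uk:test_file
```

A non-zero return makes the function easy to use from a monitoring script or a cfengine-driven check.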

Note that the reverse does not have to be true - epgce3 does not have to be able to send files to worker nodes! As this is not required, the facility should not be set up, keeping security risks to a minimum.

TEST 3: QUEUE SUBMISSION

This test is exactly the same as Test 1 (a simple hello world job running on the singlet node), the difference being that the node should now be running SL5. The test job was successful.

Automated Installation

A cfengine script was developed to automate the installation process. The working principle is to use kickstart to install the OS and run /etc/rc.d/rc.local as a first boot script. This script only installs cfengine. Once the OS installation has been completed, the cfrun command on epgmo1 configures the node to be a worker node, downloading the glite middleware and running yaim etc.

Full details can be found here.

ATLAS Requirements

ATLAS plans to have native SL5 builds of Athena in 2010. Until then, some additional packages are required. Most of these have been bundled into the HEPOSLibs meta package (available here). This does not install all of the libraries required by ATLAS! ATLAS also requires 32 bit versions of various libraries - details can be found here.

On the Birmingham twin SL 5.3 worker nodes, the http://ftp.scientificlinux.org/linux/scientific/53/i386/SL repository is used for obtaining 32 bit libraries. Yum configuration details are stored in /etc/yum.repos.d/sl-i386.repo, which is available for download from the epgmo1.ph.bham.ac.uk webserver (thus enabling configuration by cfengine).

Finally, SELinux is set to permissive mode (SELINUX=permissive in the file /etc/sysconfig/selinux), as detailed here.
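The SELinux change is a one-line edit that could be scripted, for example from the first boot script or cfengine. The following is a minimal sketch, with the function name set_selinux_permissive being hypothetical. It writes through a temporary file and copies the content back, rather than using sed -i, because /etc/sysconfig/selinux is commonly a symlink and an in-place edit could replace the link with a regular file.

```shell
#!/bin/sh
# Hypothetical helper: set SELINUX=permissive in the given config file
# (default /etc/sysconfig/selinux). The pattern requires "=" directly
# after SELINUX, so the SELINUXTYPE= line is left untouched.
set_selinux_permissive() {
    conf=${1:-/etc/sysconfig/selinux}
    tmp=$(mktemp) || return 1
    sed 's/^SELINUX=.*/SELINUX=permissive/' "$conf" > "$tmp" &&
        cat "$tmp" > "$conf"
    status=$?
    rm -f "$tmp"
    return $status
}

# Example (as root): set_selinux_permissive
```

The file is only read at boot, so the new mode applies after a restart; setenforce 0 switches a running system to permissive mode immediately.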

ATLAS Installation

Installation of ATLAS software is automated. Requests can be made via Alessandro De Salvo's webpage. After a number of failures, the automated jobs did manage to install a large number of Athena releases. Note that 15.1.0 and 15.2.0 consistently failed to install. The initial problems were solved by installing the following libraries (in addition to those recommended on the other twikis):

### fresh SL5 machine

lcg-CA
glite-WN
glite-TORQUE_client

HEP_OSlibs_SL5

compat-gcc-34
compat-gcc-34-c++
compat-gcc-34-g77

compat-gcc-34-g77-3.4.6-4.x86_64 #RPM install with excludedocs option

compat-glibc
compat-glibc-headers
compat-libf2c-34
compat-libgcc-296.i386
compat-libstdc++-296
compat-libstdc++-33
compat-readline43
lapack
libgfortran
ghostscript.x86_64
ghostscript.i386
libXpm
glibc-devel-2.5
giflib
compat-openldap
openssl097a
compat-db
libxml2-devel.i386
libxml2-devel.x86_64
popt.i386
popt.x86_64
blas.i386
blas.x86_64
blas-devel.i386
blas-devel.x86_64
sharutils.x86_64
bc.x86_64
curl.x86_64
procmail.x86_64

gcc
gcc-c++
libstdc++-devel
python-devel
openldap-clients
libxslt
PyXML

It is interesting to note that an ldap query shows that t2ce05.physics.ox.ac.uk is the Oxford SL 5.3 system. lcg-infosites --vo atlas tag shows that this CE does not have any recognizable Athena version installed (the tags instead taking the form atlas-offline-rel_0-15.X.0). A test grid job has been submitted to see if 14.5.0 is available.

-- ChristopherCurtis - 06 Oct 2009


Topic revision: r16 - 02 Feb 2010 - 13:04:04 - ChristopherCurtis
 