---+ Installing SL5 on the Birmingham Worker Nodes

%TOC%

The aim of this project is to install SL5 on the Birmingham Worker Nodes. The plan is as follows:

   * Move one worker node (epgd01) offline on epgce3
   * Install SL5 on the singlet node
   * Install WN middleware on the singlet node
   * Create a new test queue on epgce3, enabled for atlas users and software
   * Add the singlet node to the new queue
   * Submit a Hello World test job to the new queue
   * If that works, request new software installation for SL5
   * Submit an ATLAS test job

If this is successful, the intention is to roll out SL5 to all other WNs so that Birmingham is entirely SL5.

---++ Node Isolation

Started by using the command =pbsnodes -o epgd01.ph.bham.ac.uk= on epgce3 to move the first node offline. 7 jobs were running on this node at the time - 6 camont and 1 CMS. An attempt was made to move the jobs to other nodes by changing their status to held and then restarting. This failed: the job status could not be changed, and the jobs could not be moved or restarted ( _this might be something that needs to change in the future!_). The jobs were killed instead.

Separate nodes for separate queues does not appear to be directly available in yaim, so the following method was used to partition the nodes:

   1 =qmgr -c 'set node epgd01.ph.bham.ac.uk properties+=SL5'=
   1 =qmgr -c 'set queue long resources_default.neednodes=lcgpro'=
   1 =qmgr -c 'set queue short resources_default.neednodes=lcgpro'=

This marks each node with either the =SL5= property or the default =lcgpro=. Each queue then requires a node to have the appropriate property before a job is submitted to it.

The command =pbsnodes -c epgd01.ph.bham.ac.uk= was used to move the singlet node back online. The =listnodejobs= command (custom script!) was used to confirm that no jobs were being submitted to the singlet node.

---++ Test Queue Setup

This was achieved by first setting up the new queue in PBS and then broadcasting its existence to the BDII by re-running yaim.
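The queue-cloning part of this procedure can be sketched as follows. The three-line dump below is an invented stand-in for the real qmgr output (the genuine =output.dat= came from qmgr and was edited by hand), and writing the result to a file rather than feeding it straight back into qmgr is also an assumption:

```shell
# Sketch of cloning the "long" queue definition into a new sl5_test queue.
# The sample dump below is invented; the real output.dat was produced by
# qmgr and trimmed by hand before being read back in.
cat > output.dat <<'EOF'
create queue long
set queue long queue_type = Execution
set queue long resources_default.neednodes = lcgpro
EOF
# Rename the queue and retarget it at the SL5-tagged nodes.
sed -e 's/queue long/queue sl5_test/' \
    -e 's/neednodes = lcgpro/neednodes = SL5/' output.dat > sl5_test.qmgr
cat sl5_test.qmgr
```

In production the edited commands were pasted into qmgr directly; the file here just makes the transformation visible.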
   1 =qmgr -c "print server long" > output.dat= prints the long queue details to a file
   1 Changed the queue name to =sl5_test=
   1 Removed most access list rights (jobs restricted to ATLAS and dteam)
   1 Copied the =output.dat= commands into qmgr. This set up the new queue.
   1 =qmgr -c 'set queue sl5_test resources_default.neednodes=SL5'=

yaim was then re-run. The =site-info.def= file was updated to include details of the new queue. epgce3 was rebooted at this point, though a restart of pbs was probably all that was required.

---++ TEST 1: QUEUE SUBMISSION

The first test was to ensure that jobs submitted to the sl5 queue ran on the singlet node. A simple =helloWorld.jdl= job proved this to be true ( _although the WMS took forever to update the job status_). The atlas software group was removed from the acl - Alessandro's automated jobs found the queue very quickly and tried to update the software! The group will be added back once SL5 has been installed and validated.

---++ SL5 Installation...

... is tricky! An overview of the boot procedures can be found [[https://www.ep.ph.bham.ac.uk/twiki/bin/view/Computing/GridBoot][here]]. A 64 bit version of SL 5.3 was installed (which also appears to be available at Oxford). Disk partitioning settings were copied from the existing =sl46-x86_64-worker.ks=. New kernel images were required for epgmo1, which were obtained [[http://ftp.scientificlinux.org/linux/scientific/53/x86_64/images/pxeboot/][here]].
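For reference, a PXE menu entry for the new images might look like the following. This is only a sketch: the label, the image paths and the kickstart URL are all assumptions - the only facts taken from this page are the SL 5.3 x86_64 pxeboot source and the existence of an epgmo1 webserver:

```
label sl53-x86_64-worker
  kernel sl53/vmlinuz
  append initrd=sl53/initrd.img ks=http://epgmo1.ph.bham.ac.uk/ks/sl53-x86_64-worker.ks
```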
---++ gLite Installation

The following yum repository entries were configured (in addition to the defaults, sl.repo etc., enabled in SL5):

   * =http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/glite-WN.repo=
   * =http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/glite-TORQUE_client.repo=
   * =http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/lcg-CA.repo=

and the following packages installed:

   * =yum update= ( _then rebooted to ensure the new kernel was used_)
   * =yum -y install lcg-CA=
   * =yum -y groupinstall glite-WN=
   * =yum -y install glite-TORQUE_client=

The node was then configured using yaim ( =/opt/glite/yaim/bin/yaim -c -s /root/yaim-conf/site-info.def -n WN -n TORQUE_client=). Copies of the yaim config files can currently be found on the local filesystem ( =/home/cjc/Grid/ManRepo/Yaim/sl5-WN=). These files will be distributed using cfengine at some point.

---++ Firewall settings

After rebooting the singlet node, a grid test job failed to run, with =checkjob= complaining of ="Server could not connect to MOM"=. The iptables were still on the default settings. The settings from another node (epgd02) were exported using the command =iptables-save > rules.dat= and then imported on epgd01 with the command =iptables-restore < rules.dat= (the file rules.dat was copied from one node to the other in between the save and restore). The firewall settings were reverting to the defaults after every restart; =/etc/init.d/iptables save= fixed this ( _read up on iptables!!!_).

Jobs now enter the waiting state on epgd01 because the ssh key has changed. This needs to be updated.
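In summary, the firewall fix amounted to the following command sequence. This is a transcript-style sketch, not a script: all three commands run as root, and the copy of =rules.dat= between the two nodes (by scp or similar) is implied rather than quoted on this page:

```
# on epgd02 (a node with working rules):
iptables-save > rules.dat
# copy rules.dat across to epgd01, then on epgd01:
iptables-restore < rules.dat
# persist the rules across reboots (SL5 init script):
/etc/init.d/iptables save
```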
The following was entered into the =/etc/hosts.allow= file:

<verbatim>
sshd: 147.188.47.199 ALL
sshd: 147.188.47.210
sshd: 147.188.47.212
sshd: 147.188.47.33
sshd: 147.188.46.5
# sshd: 147.188.46.9
nrpe: 147.188.46.5
</verbatim>

And the following was added to =/etc/hosts.deny=:

<verbatim>
sshd: ALL : spawn /bin/echo `/bin/date` from %h >> /var/log/ssh.log : deny
ALL: ALL
</verbatim>

_These files could be distributed using cfengine in the future!_

---++ SSH Keys

The same key is used for all twin worker nodes. The key pairs are kept in =epgmo1:/data1/grid/s/twins/=, and are copied into =epgd01:/etc/ssh= (via an NFS mount of epgmo1).

_These files could be distributed using cfengine in the future!_

---++ TEST 2: SCP

Worker nodes are required to pull files from and push files onto epgce3 without the use of a passphrase, hence the ssh key setup. This may be tested by sudo'ing to another user (e.g. atl088) on a worker node and then using the command =scp -rpB test_file atl088@epgce3.ph.bham.ac.uk:test_file=. If this fails, jobs will not be able to be submitted! Note that the reverse does not have to be true - epgce3 does not have to be able to send files to worker nodes. As this is not required, the facility should not be set up, keeping security risks to a minimum.

---++ TEST 3: QUEUE SUBMISSION

This test is exactly the same as the previous one - a simple hello world job running on the singlet node - the difference being that the node should now be running SL5. The test job was successful.

---++ Automated Installation

A cfengine script was developed to automate the installation process. The working principle is to use kickstart to install the OS and run =/etc/rc.d/rc.local= as a first-boot script. This script only installs cfengine. Once the OS installation has completed, the =cfrun= command on epgmo1 configures the node to be a worker node, downloading the gLite middleware, running yaim, etc.
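The first-boot hook might look like this. A sketch only: the package URL is hypothetical - the only facts taken from this page are the use of =/etc/rc.d/rc.local= as the first-boot script and its single job of installing cfengine:

```
#!/bin/sh
# /etc/rc.d/rc.local -- first boot only: bootstrap cfengine, after which
# cfrun from epgmo1 takes over. (The rpm URL below is hypothetical.)
rpm -Uvh http://epgmo1.ph.bham.ac.uk/rpms/cfengine.rpm
```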
Full details can be found [[https://www.ep.ph.bham.ac.uk/twiki/bin/view/Computing/GridFabricManagement][here]].

---++ ATLAS Requirements

ATLAS plans to have native SL5 builds of Athena in 2010. Until then, some additional packages are required. Most of these have been bundled into the HEPOSLibs meta package (available [[https://twiki.cern.ch/twiki/bin/view/LCG/SL5DependencyRPM][here]]). This does not install all of the libraries required by ATLAS! ATLAS also requires 32 bit versions of various libraries - details can be found [[https://twiki.cern.ch/twiki/bin/view/Atlas/SL5Migration#Compatibility_libraries][here]].

On the Birmingham twin SL 5.3 worker nodes, the =http://ftp.scientificlinux.org/linux/scientific/53/i386/SL= repository is used for obtaining 32 bit libraries. Yum configuration details are stored in =/etc/yum.repos.d/sl-i386.repo=, which is available for download from the =epgmo1.ph.bham.ac.uk= webserver (thus enabling configuration by cfengine).

Finally, SELinux is set to permissive mode ( =SELINUX=permissive= in the file =/etc/sysconfig/selinux=), as detailed [[https://twiki.cern.ch/twiki/bin/view/Atlas/RPMCompatSLC5#SL5_issues][here]].

---++ ATLAS Installation

Installation of the ATLAS software is automated. Requests can be made via [[https://atlas-install.roma1.infn.it/atlas_install/][Alessandro De Salvo's webpage]]. After a number of failures, the automated jobs did manage to install a large number of Athena releases. Note that 15.1.0 and 15.2.0 consistently failed to install.
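The 32 bit repository file described in the ATLAS Requirements section above might contain something like the following. Only the baseurl is taken from this page; the stanza name and the other fields are assumptions, and the sketch writes to the current directory rather than =/etc/yum.repos.d/=:

```shell
# Sketch: generate the 32-bit compatibility repo file. Only the baseurl
# is from this page; [sl-i386] and the other fields are assumptions.
# Written to ./sl-i386.repo here, not /etc/yum.repos.d/.
cat > sl-i386.repo <<'EOF'
[sl-i386]
name=Scientific Linux 5.3 i386 compatibility libraries
baseurl=http://ftp.scientificlinux.org/linux/scientific/53/i386/SL
enabled=1
EOF
cat sl-i386.repo
```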
The initial problems were solved by installing the following libraries (in addition to those recommended on the other twikis):

<verbatim>
### fresh SL5 machine
lcg-CA
glite-WN
glite-TORQUE_client
HEP_OSlibs_SL5
compat-gcc-34
compat-gcc-34-c++
compat-gcc-34-g77
compat-gcc-34-g77-3.4.6-4.x86_64 #RPM install with excludedocs option
compat-glibc
compat-glibc-headers
compat-libf2c-34
compat-libgcc-296.i386
compat-libstdc++-296
compat-libstdc++-33
compat-readline43
lapack
libgfortran
ghostscript.x86_64
ghostscript.i386
libXpm
glibc-devel-2.5
giflib
compat-openldap
openssl097a
compat-db
libxml2-devel.i386
libxml2-devel.x86_64
popt.i386
popt.x86_64
blas.i386
blas.x86_64
blas-devel.i386
blas-devel.x86_64
sharutils.x86_64
bc.x86_64
curl.x86_64
procmail.x86_64
gcc
gcc-c++
libstdc++-devel
python-devel
openldap-clients
libxslt
PyXML
</verbatim>

It is interesting to note that an ldap query shows that =t2ce05.physics.ox.ac.uk= is the Oxford SL 5.3 system. =lcg-infosites --vo atlas tag= shows that this CE does not have any recognizable Athena version installed (the tags instead taking the form =atlas-offline-rel_0-15.X.0=). A test grid job has been submitted to see if 14.5.0 is available.

-- Main.ChristopherCurtis - 06 Oct 2009