---+ Installing SL5 on the Birmingham Worker Nodes

%TOC%

The aim of this project is to install SL5 on the Birmingham Worker Nodes. The plan is as follows:

   * Move one worker node (epgd01) offline on epgce3
   * Install SL5 on the singlet node
   * Install WN middleware on the singlet node
   * Create a new test queue on epgce3, enabled for ATLAS users and software
   * Add the singlet node to the new queue
   * Submit a Hello World test job to the new queue
   * If that works, request new software installation for SL5
   * Submit an ATLAS test job

If this is successful, the intention is to roll out SL5 to all other WNs so that Birmingham is entirely SL5 ( _confirm this with Pete!_).

---++ Node Isolation

Started by using the command =pbsnodes -o epgd01.ph.bham.ac.uk= on epgce3 to move the first node offline. 7 jobs were running on this node at the time - 6 camont and 1 CMS. Attempted to move the jobs to other nodes by changing their status to held and then restarting. This failed: the job status could not be changed, and the jobs could not be moved or restarted ( _this might be something that needs to change in the future!_). The jobs were killed instead.

It doesn't look like assigning separate nodes to separate queues is directly available in yaim, so the following method was used to partition the nodes:

   1 =qmgr -c 'set node epgd01.ph.bham.ac.uk properties+=SL5'=
   1 =qmgr -c 'set queue long resources_default.neednodes=lcgpro'=
   1 =qmgr -c 'set queue short resources_default.neednodes=lcgpro'=

This gives each node either the =SL5= property or the default =lcgpro= property. Each queue then requires a node to have the appropriate property before a job is submitted to it. The command =pbsnodes -c epgd01.ph.bham.ac.uk= was used to move the singlet node back online. The =listnodejobs= command (custom script!) was used to confirm that no jobs were being submitted to the singlet node.

---++ Test Queue Setup

This was achieved by first setting up the new queue in PBS and then broadcasting its existence to the BDII by re-running yaim.

   1 =qmgr -c "print server long" > output.dat= prints the long queue details to a file
   1 Changed the queue name to =sl5_test=
   1 Removed most access list rights (jobs restricted to ATLAS and dteam)
   1 Copied the =output.dat= commands into qmgr. This set up the new queue (an illustrative sketch of these commands appears after the SL5 Installation section below).
   1 =qmgr -c 'set queue sl5_test resources_default.neednodes=SL5'=

Reran yaim. The =site-info.def= file was updated to include details of the new queue. epgce3 was rebooted at this point, though a restart of pbs was probably all that was required.

---++ TEST 1: QUEUE SUBMISSION

The first test was to ensure that jobs submitted to the =sl5_test= queue ran on the singlet node. A simple =helloWorld.jdl= job proved this to be true ( _although WMS took forever to update the job status_). The atlas software group was removed from the ACL - Alessandro's automated jobs found the queue very quickly and tried to update software! The group will be added back once SL5 has been installed and validated.

---++ SL5 Installation...

... is tricky! An overview of the boot procedures can be found [[https://www.ep.ph.bham.ac.uk/twiki/bin/view/Computing/GridBoot][here]]. A 64 bit version of SL 5.3 was installed (which also appears to be available at Oxford). Disk partitioning settings were copied from the existing =sl46-x86_64-worker.ks=. New kernel images were required for epgmo1, which were obtained [[http://ftp.scientificlinux.org/linux/scientific/53/x86_64/images/pxeboot/][here]].
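Staging the new kernel images on the boot server needs something along the following lines. This is only a sketch: the =/tftpboot/sl53-x86_64= directory is an assumption, and the real target is whatever the local PXE configuration actually points at.

<verbatim>
# on epgmo1: fetch the SL 5.3 x86_64 PXE kernel and initrd
# (the target directory below is illustrative - use the path referenced by the PXE config)
cd /tftpboot/sl53-x86_64
wget http://ftp.scientificlinux.org/linux/scientific/53/x86_64/images/pxeboot/vmlinuz
wget http://ftp.scientificlinux.org/linux/scientific/53/x86_64/images/pxeboot/initrd.img
</verbatim>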
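For reference, the edited =output.dat= fed back into qmgr during the Test Queue Setup above would have looked roughly like the following. This is an illustrative sketch rather than the exact file used: the attribute values are inferred from the description (execution queue, restricted to atlas and dteam, tied to the =SL5= node property).

<verbatim>
# illustrative qmgr input for the sl5_test queue (not the exact output.dat used)
create queue sl5_test
set queue sl5_test queue_type = Execution
set queue sl5_test acl_group_enable = True
set queue sl5_test acl_groups = atlas
set queue sl5_test acl_groups += dteam
set queue sl5_test resources_default.neednodes = SL5
set queue sl5_test enabled = True
set queue sl5_test started = True
</verbatim>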
---++ gLite Installation

The following yum repository entries were configured (in addition to the defaults, sl.repo etc., enabled in SL5):

   * =http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/glite-WN.repo=
   * =http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/glite-TORQUE_client.repo=
   * =http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/lcg-CA.repo=

and the following packages installed:

   * =yum update= ( _then rebooted to ensure the new kernel was used_)
   * =yum -y install lcg-CA=
   * =yum -y groupinstall glite-WN=
   * =yum -y install glite-TORQUE_client=

The node was then configured using yaim ( =/opt/glite/yaim/bin/yaim -c -s /root/yaim-conf/site-info.def -n WN -n TORQUE_client=). Copies of the yaim config files can currently be found on the local filesystem ( =/home/cjc/Grid/ManRepo/Yaim/sl5-WN=). These files will be distributed using cfengine at some point.

---++ Firewall settings

After rebooting the singlet node, a grid test job failed to run, with =checkjob= complaining of ="Server could not connect to MOM"=. The iptables rules were still at the default settings. The settings from another node (epgd02) were exported using the command =iptables-save > rules.dat= and then imported on epgd01 with the command =iptables-restore < rules.dat= (the file rules.dat was copied from one node to the other in between the save and the restore). The firewall settings were returning to the defaults after every restart; =/etc/init.d/iptables save= fixed this ( _read up on iptables!!!_). The full save/copy/restore sequence is collected in a sketch near the end of this topic.

Jobs now enter the waiting state on epgd01 because the ssh key has changed. This needs to be updated. The following was entered into the =/etc/hosts.allow= file:

<verbatim>
sshd: 147.188.47.199 ALL
sshd: 147.188.47.210
sshd: 147.188.47.212
sshd: 147.188.47.33
sshd: 147.188.46.5
# sshd: 147.188.46.9
nrpe: 147.188.46.5
</verbatim>

And the following was added to =/etc/hosts.deny=:

<verbatim>
sshd: ALL : spawn /bin/echo `/bin/date` from %h >> /var/log/ssh.log : deny
ALL: ALL
</verbatim>

_These files could be distributed using cfengine in the future!_

---++ SSH Keys

The same key is used for all twin worker nodes. The key pairs are kept in =epgmo1:/data1/grid/s/twins/=, and are copied into =epgd01:/etc/ssh= (via an NFS mount of epgmo1). _These files could be distributed using cfengine in the future!_

---++ TEST 2: SCP

Worker nodes are required to push and pull files to and from epgce3 without the use of a passphrase, hence the ssh key setup. This may be tested by sudo'ing to another user (e.g. atl088) on a worker node and then using the command =scp -rpB test_file atl088@epgce3.ph.bham.ac.uk:test_file=. If this fails, jobs will not be able to be submitted! Note that the reverse does not have to be true - epgce3 does not have to be able to send files to the worker nodes. As this is not required, the facility should not be set up, keeping security risks to a minimum.

---++ TEST 3: QUEUE SUBMISSION

This test is exactly the same as the previous test - a simple hello world job running on the singlet node, the difference being that the node should now be running SL5. The test job was successful.

---++ Automated Installation

A cfengine script was developed to automate the installation process. The working principle is to use kickstart to install the OS and run =/etc/rc.d/rc.local= as a first boot script. This script only installs cfengine. Once the OS installation has been completed, the =cfrun= command on epgmo1 configures the node to be a worker node, downloading the glite middleware, running yaim etc.
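A minimal sketch of the first boot idea is below. The real script is distributed by kickstart and may differ; in particular the cfengine package name and the =cfservd= start-up line are assumptions.

<verbatim>
#!/bin/sh
# tail of /etc/rc.d/rc.local on first boot (illustrative sketch only)
# install cfengine so the node can then be configured remotely
yum -y install cfengine
# cfservd must be listening for the cfrun issued from epgmo1 to reach this node
service cfservd start
</verbatim>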
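For completeness, the firewall fix described in the Firewall settings section above, collected into one sequence (a sketch; epgd02 is the reference node and the =rules.dat= path is illustrative):

<verbatim>
# on the reference node (epgd02): export the working rules and copy them across
iptables-save > /root/rules.dat
scp /root/rules.dat root@epgd01:/root/rules.dat

# on the singlet node (epgd01): load the rules and make them survive a reboot
iptables-restore < /root/rules.dat
/etc/init.d/iptables save
</verbatim>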
Full details can be found [[https://www.ep.ph.bham.ac.uk/twiki/bin/view/Computing/GridFabricManagement][here]].

-- Main.ChristopherCurtis - 06 Oct 2009