---+ Installing SL5 on the Birmingham Worker Nodes

%TOC%

The aim of this project is to install SL5 on the Birmingham Worker Nodes. The plan is as follows:

   * Move one worker node (epgd01) offline on epgce3
   * Install SL5 on the singlet node
   * Install WN middleware on the singlet node
   * Create a new test queue on epgce3, enabled for atlas users and software
   * Add the singlet node to the new queue
   * Submit a Hello World test job to the new queue
   * If that works, request new software installation for SL5
   * Submit an ATLAS test job

If this is successful, the intention is to roll out SL5 to all other WNs so that Birmingham is entirely SL5.

---++ Node Isolation

Started by using the command =pbsnodes -o epgd01.ph.bham.ac.uk= on epgce3 to move the first node offline. 7 jobs were running on this node at the time - 6 camont and 1 CMS. An attempt was made to move the jobs to other nodes by changing their status to held and then restarting. This failed: the job status could not be changed, and the jobs could not be moved or restarted ( _this might be something that needs to change in the future!_). The jobs were killed instead.

Separate nodes for separate queues does not appear to be directly available in yaim, so the following method was used to partition the nodes:

   1 =qmgr -c 'set node epgd01.ph.bham.ac.uk properties+=SL5'=
   1 =qmgr -c 'set queue long resources_default.neednodes=lcgpro'=
   1 =qmgr -c 'set queue short resources_default.neednodes=lcgpro'=

This marks each node with either the =SL5= property or the default =lcgpro=. Each queue then requires a node to have the appropriate property before a job is submitted to it.

The command =pbsnodes -c epgd01.ph.bham.ac.uk= was used to move the singlet node back online. The =listnodejobs= command (custom script!) was used to confirm that no jobs were being submitted to the singlet node.

---++ Test Queue Setup

This was achieved by first setting up the new queue in PBS and then broadcasting its existence to the BDII by re-running yaim.
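The queue-cloning part of this procedure can be sketched as follows. The three-line dump below is an invented stand-in for the real qmgr output (the genuine =output.dat= came from qmgr and was edited by hand), and writing the result to a file rather than feeding it straight back into qmgr is also an assumption:

```shell
# Sketch of cloning the "long" queue definition into a new sl5_test queue.
# The sample dump below is invented; the real output.dat was produced by
# qmgr and trimmed by hand before being read back in.
cat > output.dat <<'EOF'
create queue long
set queue long queue_type = Execution
set queue long resources_default.neednodes = lcgpro
EOF
# Rename the queue and retarget it at the SL5-tagged nodes.
sed -e 's/queue long/queue sl5_test/' \
    -e 's/neednodes = lcgpro/neednodes = SL5/' output.dat > sl5_test.qmgr
cat sl5_test.qmgr
```

In production the edited commands were pasted into qmgr directly; the file here just makes the transformation visible.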
   1 =qmgr -c "print server long" > output.dat= prints the long queue details to a file
   1 Changed the queue name to =sl5_test=
   1 Removed most access list rights (jobs restricted to ATLAS and dteam)
   1 Copied the =output.dat= commands into qmgr. This set up the new queue.
   1 =qmgr -c 'set queue sl5_test resources_default.neednodes=SL5'=

yaim was then re-run. The =site-info.def= file was updated to include details of the new queue. epgce3 was rebooted at this point, though a restart of pbs was probably all that was required.

---++ TEST 1: QUEUE SUBMISSION

The first test was to ensure that jobs submitted to the sl5 queue ran on the singlet node. A simple =helloWorld.jdl= job proved this to be true ( _although the WMS took forever to update the job status_). The atlas software group was removed from the acl - Alessandro's automated jobs found the queue very quickly and tried to update the software! The group will be added back once SL5 has been installed and validated.

---++ SL5 Installation...

... is tricky! An overview of the boot procedures can be found [[https://www.ep.ph.bham.ac.uk/twiki/bin/view/Computing/GridBoot][here]]. A 64 bit version of SL 5.3 was installed (which also appears to be available at Oxford). Disk partitioning settings were copied from the existing =sl46-x86_64-worker.ks=. New kernel images were required for epgmo1, which were obtained [[http://ftp.scientificlinux.org/linux/scientific/53/x86_64/images/pxeboot/][here]].
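For reference, a PXE menu entry for the new images might look like the following. This is only a sketch: the label, the image paths and the kickstart URL are all assumptions - the only facts taken from this page are the SL 5.3 x86_64 pxeboot source and the existence of an epgmo1 webserver:

```
label sl53-x86_64-worker
  kernel sl53/vmlinuz
  append initrd=sl53/initrd.img ks=http://epgmo1.ph.bham.ac.uk/ks/sl53-x86_64-worker.ks
```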
---++ gLite Installation

The following yum repository entries were configured (in addition to the defaults, sl.repo etc., enabled in SL5):

   * =http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/glite-WN.repo=
   * =http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/glite-TORQUE_client.repo=
   * =http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/lcg-CA.repo=

and the following packages installed:

   * =yum update= ( _then rebooted to ensure the new kernel was used_)
   * =yum -y install lcg-CA=
   * =yum -y groupinstall glite-WN=
   * =yum -y install glite-TORQUE_client=

The node was then configured using yaim ( =/opt/glite/yaim/bin/yaim -c -s /root/yaim-conf/site-info.def -n WN -n TORQUE_client=). Copies of the yaim config files can currently be found on the local filesystem ( =/home/cjc/Grid/ManRepo/Yaim/sl5-WN=). These files will be distributed using cfengine at some point.

---++ Firewall settings

After rebooting the singlet node, a grid test job failed to run, with =checkjob= complaining of ="Server could not connect to MOM"=. The iptables were still on the default settings. The settings from another node (epgd02) were exported using the command =iptables-save > rules.dat= and then imported on epgd01 with the command =iptables-restore < rules.dat= (the file rules.dat was copied from one node to the other in between the save and restore). The firewall settings were reverting to the defaults after every restart; =/etc/init.d/iptables save= fixed this ( _read up on iptables!!!_).

Jobs now enter the waiting state on epgd01 because the ssh key has changed. This needs to be updated.
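In summary, the firewall fix amounted to the following command sequence. This is a transcript-style sketch, not a script: all three commands run as root, and the copy of =rules.dat= between the two nodes (by scp or similar) is implied rather than quoted on this page:

```
# on epgd02 (a node with working rules):
iptables-save > rules.dat
# copy rules.dat across to epgd01, then on epgd01:
iptables-restore < rules.dat
# persist the rules across reboots (SL5 init script):
/etc/init.d/iptables save
```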
The following was entered into the =/etc/hosts.allow= file:

<verbatim>
sshd: 147.188.47.199 ALL
sshd: 147.188.47.210
sshd: 147.188.47.212
sshd: 147.188.47.33
sshd: 147.188.46.5
# sshd: 147.188.46.9
nrpe: 147.188.46.5
</verbatim>

And the following was added to =/etc/hosts.deny=:

<verbatim>
sshd: ALL : spawn /bin/echo `/bin/date` from %h >> /var/log/ssh.log : deny
ALL: ALL
</verbatim>

_These files could be distributed using cfengine in the future!_

---++ SSH Keys

The same key is used for all twin worker nodes. The key pairs are kept in =epgmo1:/data1/grid/s/twins/=, and are copied into =epgd01:/etc/ssh= (via an NFS mount of epgmo1).

_These files could be distributed using cfengine in the future!_

---++ TEST 2: SCP

Worker nodes are required to pull files from and push files onto epgce3 without the use of a passphrase, hence the ssh key setup. This may be tested by sudo'ing to another user (e.g. atl088) on a worker node and then using the command =scp -rpB test_file atl088@epgce3.ph.bham.ac.uk:test_file=. If this fails, jobs will not be able to be submitted! Note that the reverse does not have to be true - epgce3 does not have to be able to send files to worker nodes. As this is not required, the facility should not be set up, keeping security risks to a minimum.

---++ TEST 3: QUEUE SUBMISSION

This test is exactly the same as the previous one - a simple hello world job running on the singlet node - the difference being that the node should now be running SL5. The test job was successful.

---++ Automated Installation

A cfengine script was developed to automate the installation process. The working principle is to use kickstart to install the OS and run =/etc/rc.d/rc.local= as a first-boot script. This script only installs cfengine. Once the OS installation has completed, the =cfrun= command on epgmo1 configures the node to be a worker node, downloading the gLite middleware, running yaim, etc.
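The first-boot hook might look like this. A sketch only: the package URL is hypothetical - the only facts taken from this page are the use of =/etc/rc.d/rc.local= as the first-boot script and its single job of installing cfengine:

```
#!/bin/sh
# /etc/rc.d/rc.local -- first boot only: bootstrap cfengine, after which
# cfrun from epgmo1 takes over. (The rpm URL below is hypothetical.)
rpm -Uvh http://epgmo1.ph.bham.ac.uk/rpms/cfengine.rpm
```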
Full details can be found [[https://www.ep.ph.bham.ac.uk/twiki/bin/view/Computing/GridFabricManagement][here]].

---++ ATLAS Requirements

ATLAS plans to have native SL5 builds of Athena in 2010. Until then, some additional packages are required. Most of these have been bundled into the HEPOSLibs meta package (available [[https://twiki.cern.ch/twiki/bin/view/LCG/SL5DependencyRPM][here]]). This does not install all of the libraries required by ATLAS! ATLAS also requires 32 bit versions of various libraries - details can be found [[https://twiki.cern.ch/twiki/bin/view/Atlas/SL5Migration#Compatibility_libraries][here]].

On the Birmingham twin SL 5.3 worker nodes, the =http://ftp.scientificlinux.org/linux/scientific/53/i386/SL= repository is used for obtaining 32 bit libraries. Yum configuration details are stored in =/etc/yum.repos.d/sl-i386.repo=, which is available for download from the =epgmo1.ph.bham.ac.uk= webserver (thus enabling configuration by cfengine).

Finally, SELinux is set to permissive mode ( =SELINUX=permissive= in the file =/etc/sysconfig/selinux=), as detailed [[https://twiki.cern.ch/twiki/bin/view/Atlas/RPMCompatSLC5#SL5_issues][here]].

---++ ATLAS Installation

Installation of the ATLAS software is automated. Requests can be made via [[https://atlas-install.roma1.infn.it/atlas_install/][Alessandro De Salvo's webpage]]. After a number of failures, the automated jobs did manage to install a large number of Athena releases. Note that 15.1.0 and 15.2.0 consistently failed to install.
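The 32 bit repository file described in the ATLAS Requirements section above might contain something like the following. Only the baseurl is taken from this page; the stanza name and the other fields are assumptions, and the sketch writes to the current directory rather than =/etc/yum.repos.d/=:

```shell
# Sketch: generate the 32-bit compatibility repo file. Only the baseurl
# is from this page; [sl-i386] and the other fields are assumptions.
# Written to ./sl-i386.repo here, not /etc/yum.repos.d/.
cat > sl-i386.repo <<'EOF'
[sl-i386]
name=Scientific Linux 5.3 i386 compatibility libraries
baseurl=http://ftp.scientificlinux.org/linux/scientific/53/i386/SL
enabled=1
EOF
cat sl-i386.repo
```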
The initial problems were solved by installing the following libraries (in addition to those recommended on the other twikis):

<verbatim>
### fresh SL5 machine
lcg-CA
glite-WN
glite-TORQUE_client
HEP_OSlibs_SL5
compat-gcc-34
compat-gcc-34-c++
compat-gcc-34-g77
compat-gcc-34-g77-3.4.6-4.x86_64 #RPM install with excludedocs option
compat-glibc
compat-glibc-headers
compat-libf2c-34
compat-libgcc-296.i386
compat-libstdc++-296
compat-libstdc++-33
compat-readline43
lapack
libgfortran
ghostscript.x86_64
ghostscript.i386
libXpm
glibc-devel-2.5
giflib
compat-openldap
openssl097a
compat-db
libxml2-devel.i386
libxml2-devel.x86_64
popt.i386
popt.x86_64
blas.i386
blas.x86_64
blas-devel.i386
blas-devel.x86_64
sharutils.x86_64
bc.x86_64
curl.x86_64
procmail.x86_64
gcc
gcc-c++
libstdc++-devel
python-devel
openldap-clients
libxslt
PyXML
</verbatim>

It is interesting to note that an ldap query shows that =t2ce05.physics.ox.ac.uk= is the Oxford SL 5.3 system. =lcg-infosites --vo atlas tag= shows that this CE does not have any recognizable Athena version installed (the tags instead taking the form =atlas-offline-rel_0-15.X.0=). A test grid job has been submitted to see if 14.5.0 is available.

-- Main.ChristopherCurtis - 06 Oct 2009