Local Grid Journal

This is a reverse-order diary of events, without retrospective editing (so keep it raw and short, max ~3 lines). See other pages like LocalGridInternals for more carefully considered documentation.

20111115 MWS Removed an empty pool 'DPM001' from epgse1 using dpm-rmpool.
20111024 LSL On local /home/lcgui setup, patched-in some files to support eScience CA 2A/2B, so voms-proxy-init works for new students and recent renewers.
20111017 LSL Power socket supplying UPS for grid left rack (epgsr1 and RAIDs f12-f15, epgpe01-04) has hair-line crack and failed at around 9am. Swapped plug to another socket and got things going again by 09:40. Will inform electrician Mark Wicks about this problem. [He will replace during some downtime].
20111010 LSL After SCSI problem on epgsr1 for f12 f13 f15, swapped that SCSI chain with f14 chain by moving cables between cards to see if SCSI problem moves or not.
20111009 MWS epgsr1 was acting up which was assumed to be a SCSI problem. However, after doing a reboot, the server didn't come back up. Offlined the site and closed the queues. Hopefully this can be fixed tomorrow!
20110923 LSL After SCSI problem on epgsr1 for f12 f13 f15, replaced that SCSI card with a new one to see if this fixes the problem.
20110921 MWS SE was acting up this morning with what seemed to be a dpns hang. Restarted but got strange SOAP errors about token headers. Restarted the srmv2.2 and that seemed to fix it.
20110909 MWS After several days of being hit by batches of 100s of jobs at a time, banned user Rafael Mayo (Fusion) across the site by adding to /opt/glite/etc/lcas/ban_users.db. Will email and try to get him to stop.
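For reference, the ban file takes one quoted certificate subject DN per line; a sketch with a made-up DN (the real DN is not recorded here):

```
# /opt/glite/etc/lcas/ban_users.db -- one quoted subject DN per line
"/DC=es/DC=example/O=fusion/CN=Hypothetical Banned User"
```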
20110713 MWS As per GGUS 72515, added acl cern_dest dstdomain .cern.ch http_access allow cern_dest to squid config file.
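Spelled out as squid.conf directives, the addition was:

```
acl cern_dest dstdomain .cern.ch
http_access allow cern_dest
```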
20110712 MWS Fixed GLExec issues so local tests now work. Polices on the Argus server needed sorting out. Future problems may involve not having roles set properly here!
20110708 MWS Started failing certificate NAGIOS tests as we were running the wrong version (1.38 rather than 1.40). Updated using yum update ca-policy-egi-core on the local WNs and copying the resultant certs to the BB WNs.
20110609 MWS Noticed ATLAS analysis jobs were failing with liblcgdm errors. Checked nodes and the links were broken in /opt/lcg/lib. Fixed by hand, but this should be fixed properly in the next WN update.
20110526 MWS SE problems narrowed down to excessive H1 jobs hammering the SE.
20110525 MWS epgse1 showed strange network issues overnight. Rebooting (eventually) fixed it but should keep an eye on odd 'fetch-crl' and 'voms' errors
20110511 MWS All appears well on the recovered VMs on epgpe10. Did a test copy to the SE and that was fine so reopened the queues and jobs have started coming in. Will keep an eye on the Nagios tests to make sure everything is up and running again.
20110511 LSL For epgpe10 problem, updated base system kernel and kernel-xen from 2.6.18-194.32.1 to 2.6.18-238.9.1. Also, Dell have supplied new disk, so after booting from a CD, did a dd copy to new disk: dd if=/dev/sda of=/dev/sdb bs=51200000. This took about 2 hours. Rebooted.
20110510 MWS Updated Maui config to use MAXJOB instead of MAXJOBS and slightly altered weighting to prioritise Atlas, LHCb and Alice. Will keep an eye to make sure jobs go through as expected.
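A minimal maui.cfg sketch of the sort of change this was (parameter names real, group names and numbers illustrative, not the values actually used):

```
# MAXJOB is the spelling Maui honours (MAXJOBS is silently ignored)
GROUPCFG[atlas] FSTARGET=40 MAXJOB=120
GROUPCFG[lhcb]  FSTARGET=25 MAXJOB=60
GROUPCFG[alice] FSTARGET=25 MAXJOB=60
```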
20110508 LSL/MWS Mark notes epgpe10 has gone down again, like 20110505. System messages for epgpe10 logged on epgmo1 starts with mpt2sas0: log_info(0x31110630): originator(PL), code(0x11), sub_code(0x0630), then sd 0:0:0:0: SCSI error: return code = 0x00010000, then scsi 0:0:0:0: rejecting I/O to dead device . Come in and reboot.
20110503 LSL For BB, I asked Alan to propagate sudoers changes of March, including Mark's account, from front-ends to worker nodes too. I've made a rc.d/S60sudo.sh to include Chris's account as required.
20110503 LSL On BB, I've added Mark's account as an extra allowed-user in rc.d/S60sshd.sh which configures /etc/ssh/sshd_config on the grid worker nodes.
20110427 MWS Entered DT. Set epgr02 & 05 queues to Draining and stopped (qstop --) long, short and alice.
20110422 MWS epgr05 was failing NAGIOS with LB Query failures. Rebooting fixed the problem.
20110421 MWS Set epgr04/07 queues back online and set status in /opt/lcg/libexec/lcg-info-dynamic-pbs from Draining to $Status. Needed to reboot epgr04 as qsub/qstat didn't work, but other than that, all fine.
20110415 MWS ALICE BB VO Box was under very heavy load (>15) with CPU idle. Contacted ALICE experts who had a look but recommended reboot. Tried soft reboot and didn't work so hard reset (xm destroy + xm create). All seems well now.
20110414 MWS Request from ATLAS to take 22.5 TB from DATADISK and redistribute to other spacetokens.
20110414 MWS Noticed that new jobs into epgr05 weren't coming in. Rebooted and found BDII didn't start due to 0 diskspace left. Deleted a load of cfengine backup files, rebooted and all is well. Did the same for epgr02 just in case as well. Need a more permanent solution in the long term though.
20110413 MWS Added the new certificate for epgr08. Didn't reyaim/reboot as it didn't seem necessary.
20110413 MWS Attempted to reyaim epgse1 after putting in the new certificate but it got stuck when restarting the dpm. This was eventually traced to gmetric going nuts as it was run every minute. Reduced this to every 30 mins, rebooted (after Ctrl-C'ing out of the reyaim) and reyaimed again. Everything seems to be back up and OK!
20110412 MWS On request from Elena (Atlas) added 1TB to PRODDISK (taken from DATADISK).
20110412 MWS Reyaimed and rebooted epgr07 to put in the new certificate.
20110411 MWS Marked epgr07 and epgr04 in downtime and stopped the queues due to 1.5 weeks of BB downtime.
20110404 MWS On Friday 1st, what looks like an ALICE job took out two nodes (kernel crash in the log) and at about the same time, we either received 2000 jobs causing the local CE and Torque to fall over OR the CE and Torque fell over and jobs were getting hung up over the weekend. Either way, the CE needed rebooting and ~2000 jobs were left in the queue in an odd state. Deleted all these and everything returned to normal!
20110307 CJC All nodes require ssh keys to login, with the exception of epgmo1 (though this may change in the future). Public keys should be stored in the directory epgmo1:/var/cfengine/inputs/repo/general/public_keys/. They will then be distributed to all nodes via the module script modules/modules:ssh. New keys are added to the authorized_keys file using the command cfrun -- -D restart_ssh.
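As a local sketch of what the key distribution amounts to (temp files stand in for the repo directory and a node's authorized_keys; the key material is dummy text, and the merge step is an illustration, not the actual module:ssh code):

```shell
# Stand-ins for /var/cfengine/inputs/repo/general/public_keys/ and a
# node's authorized_keys file -- dummy keys, not real material.
repo=$(mktemp -d)
auth=$(mktemp)
printf 'ssh-rsa AAAAdummy1 user1\n' > "$repo/user1.pub"
printf 'ssh-rsa AAAAdummy2 user2\n' > "$repo/user2.pub"
# merge every repo key into authorized_keys, de-duplicating entries
cat "$repo"/*.pub "$auth" | sort -u > "$auth.new"
mv "$auth.new" "$auth"
```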
20110217 LSL On BB, process accounting now starts via system/rc.d/S*psacct.sh: uses directory /local/account/, a 2GB area which survives reboot.
20110209 LSL On BB, implement logging of outgoing ssh calls on bluebear workers via iptables rule. Test process accounting on one node u4n128.
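Written in iptables-save form, the rule would look something like this (chain, port match and log prefix are illustrative; the exact rule used is not recorded here):

```
# log each new outgoing ssh connection from a worker node
-A OUTPUT -p tcp --dport 22 -m state --state NEW -j LOG --log-prefix "ssh-out: "
```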
20110117 CJC New dteam voms supported on local system SL5 UI.
20101208 LSL f8 RAID now in place in BB serving /egee/soft via NFS. Now /egee/home areas are on local worker disk, like on our local cluster, as a performance enhancement.
20101206 LSL/CJC Take f8 RAID and eprex6 server over to BB; their BB team want to do the physical installation though.
20101125 LSL ep19x BB NAS server fails again with kernel traceback from alloc_pages_internal. Do a soft reboot but filesystem then disappears. Start preparing for redeployment of f8 RAID for BB.
20101112 LSL Prepared a bbmoab.tar of current (5.4.3.s1) Moab client binaries, including Green Computing options, for Chris to put on epgr04.
20101111 LSL RAM memory tests (memtest86+ 2.0.1 and then 4.1.0) on ep19x BB NAS server ran clean for 24 hours, so put /egee filesystem back online.
20101109 CJC Marked epgr04 and epgr07 as draining ahead of the BB downtime.
20101109 CJC Created new home directories and ssh keys for new BB grid users. Full details on how this was automated can be found here.
20101105 CJC After fixing epgr11 DN in GOCDB, apel data appears to be uploading successfully. Check back on November figures in 24 hours. Also check back on September - currently at 1007940 (want to avoid double counting). Full details of upgrade here.
20101104 CJC Removed all tags from epgr04:/opt/edg/var/info/atlas/atlas.list except VO-atlas-cloud-UK and VO-atlas-tier-T2. This should be enough to trigger reinstallation of ATLAS software. This will affect epgr07 as well, as the tag file was shared over NFS.
20101104 CJC Test jobs successfully processed on BB. Submitting full grid type job.
20101103 CJC Re-enabled BB queues and submitted large number of test jobs to ensure that nodes offlined by Green Computing systems can come back online. It appears as though all appropriate nodes have come back online, but no jobs are submitting. Check moab status?
20101103 CJC Emailed tb-support as problems with APEL still persist and there is no reply to the GGUS ticket (https://gus.fzk.de/ws/ticket_info.php?ticket=63654).
20101102 CJC Restored grid middleware according to the LocalGridCookbook instructions, but test jobs submitted from epgr04 are not picking up the grid environment variables. Local config scripts, yaim etc picked up by chance from old /egee filesystem still mounted on BB3. These files need to be added to the backup policy!
20101102 CJC Submitted a helpdesk ticket (and emailed) Alan requesting that the kernel and glibc be updated on the BB nodes. Reinstalling middleware.
20101102 LSL The NAS1104L box which provides /egee has been fitted with a new usb disk-on-module including up to date Open-E software. 3ware firmware already up to date. RAID now reformatted from scratch. Aslam has moved its power to the UPS and will not hard power-off the device on future occasions.
20101101 CJC Submitted GGUS Ticket to APEL after epgr11 fails to upload updated accounting data to Accounting Portal.
20101029 CJC Hard reboot of epaf17.ph.bham.ac.uk after failed reboot due to mount binds still being in place. Removed suid from cfengine modules.
20101029 CJC Moved MonBox role over to SL5 on epgr11.ph.bham.ac.uk. Reyaimed all CEs (epgr02, 04, 05 and 07) and Site BDII to reflect the change. Updated GOCDB. Accounting currently reads 770316 for Birmingham - this should have increased by Monday. If not, GGUS ticket APEL for help.
20101029 CJC All local nodes have been updated and rebooted, and so have been patched against CVE-2010-3904 and CVE-2010-3847. Waiting for BB to come back online before patching.
20101020 CJC Disabled rds module in SL5 installations via cfengine CVE-2010-3904. This should be extended to BB once the filesystem has been fixed.
20101020 CJC NFS server logs copied to /home/lcgdata/logs/NAS/20101019.
20101020 CJC Added module:suid to cfengine tasks. This executes the script /root/cfengine/files/suid_fix.sh on SL5 nodes, either automatically if the lock file /var/cfengine/reports/suid_fix cannot be found or on demand (by setting the variable force_suid_fix). The suid_fix.sh script prevents unauthorised root access via hard links, as described in CVE-2010-3847. Note that rebooting a node will undo the fix, so the reboot and halt cfengine commands attempt to remove the lock file! This should be extended to BlueBEAR once the filesystem has been fixed.
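A toy version of the suid_fix idea, assuming the fix amounts to stripping setuid bits from flagged files (the actual mechanism of suid_fix.sh isn't recorded here; this demos on a temp dir, not /):

```shell
# Simulate one setuid binary in a scratch tree, then find and strip
# the bit -- the pattern a suid-hardening script would follow.
tree=$(mktemp -d)
touch "$tree/demo"
chmod 4755 "$tree/demo"                  # simulate a setuid binary
before=$(find "$tree" -type f -perm -4000)
find "$tree" -type f -perm -4000 -exec chmod u-s {} \;
after=$(find "$tree" -type f -perm -4000)
```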
20101020 CJC Unscheduled downtime for epgr04 (BB lcg-CE), epgr07 (BB CreamCE) and epgr10 (BB Alice VOBox) due to NFS filesystem problems.
20101018 CJC BB NFS box unresponsive. Requested Aslam do a hard restart.
20101012 CJC Enabled ATLAS and ALICE (along with other normal VOs) on epgr07 (the CreamCE for BB). Notified Patricia and Graeme about sending ALICE and ATLAS jobs to this CE. Note that the CreamCE requires access to the torque server logs (not just accounting). These are currently copied onto the NFS server (ep19x.ph.bham.ac.uk:/egee/torque/server_logs) on the BB side every 10 minutes by a cron job. This directory is then NFS mounted onto epgr07:/var/spool/pbs/server_logs.
20101004 CJC Replicated cond10_data.000007.gen.COND._0002.pool.root.4801537.0 and DBRelease-12.7.1.tar.gz.6244710.0 on SE after job failures due to timeouts.
20100929 CJC mysqld service failed to restart after rebooting epgmo1 during kernel upgrades. This caused APEL to fail to publish for 8 days. Restarted mysqld service and republished APEL data. Accounting data should now be up to date.
20100927 CJC epgr07 not accepting jobs because it was not redeployed when the BB pool accounts were redefined. Backing up VM and redeploying.
20100924 CJC epgsr4 (40TB) brought online. Space distributed between ATLAS spacetokens (DATA, MC, SCRATCH, HOT, LOCALGROUP).
20100923 CJC Problem with yaim generated /etc/sudoers file on CreamCE for BB (epgr07). Emailed lcg-rollout.
20100923 CJC Deploying epgr10 as a second VOBox for Alice. This will manage the BlueBEAR software area. NFS mounted ep19x.ph.bham.ac.uk:/egee/soft/SL5/alice on the VOBox.
20100923 CJC BlueBEAR WNs back online, and using the updated kernel. Reyaiming epgr04 to allow jobs again. Updating ticket 62359.
20100923 CJC BlueBEAR WNs appear to be in the down,offline state since late last night. Emailed Alan Reed.
20100922 CJC Official kernel patch released. Updated DPM pool nodes, reyaimed and rebooted. Requested kernel be installed on BlueBEAR WNs.
20100921 CJC epgd[01-24] nodes have kernel updated using yum --enablerepo=sl-testing update. Nodes rebooted, grub checked to make sure that nodes are using new kernel. All other service nodes, with the exception of the DPM pool nodes are updated in the same way. BB nodes are waiting for official kernel release. Supported VOs on epgr04 are reduced to ops only. Downtimes cleared from GOCDB.
20100915 CJC Draining the epgd[01-24] nodes in preparation for kernel fix for the problem described here.
20100915 CJC 4000+ ILC jobs submitted to epgr02/05 by Stephane Poss. Killed off 3500 queued and emailed user. Checking efficiency of remaining jobs - could be useful to distribute to other SouthGrid sites.
20100915 CJC Replaced bucket in server room with a bucket and crate. This should have a large enough volume to contain the air conditioning drainage for the weekend (bucket approximately 3/4 full after 24 hours).
20100914 CJC Submission problem on epgr02. Submitted jobs run, but no output appears to be returned. This would explain the nagios timeouts on epgr02 jobs. Rebooted.
20100914 CJC Pump broken in AirCon D. Maintenance logged problem with central services, waiting on quote for fix. In the meantime, they have uncoupled the drainage, which now empties into a bucket. This is not ideal (bucket should be checked every day), but it does mean the unit is switched on. All WNs brought back online. Temperature steady at 18.5C.
20100914 CJC Switched more nodes off. David Clifford sending someone to look at air conditioning. Temperature peaked at 25C.
20100913 CJC Added SRCFG definition to maui config on epgr05, reserving one slot on both epgd01 and epgd02 for ops jobs and Steve Lloyd. Check back on SAM tests in 24 hours to see if this makes a difference!
20100913 CJC Changed "MAXPROC" to "MAXJOBS" in epgr05 maui definition, following advice on ScotGrid Blog.
20100913 CJC AirCon D powered back on (~5pm). Temperature drops to < 17C.
20100913 CJC Installed epgf01 and epgf02 behind the f12-15 RAIDs to help with air flow. Temperature holding steady at 19.5C.
20100913 CJC AirCon D in W332 failed (switched off permanently). Contacted Dave Clifford. Proceeding to drain alternate WNs (1,4,5,8,9,12,13,16,17,20,21,24) with the intention of powering them off once the jobs have completed. Air temp currently at 19.46C.
20100913 CJC Set epgr04 to draining and glong to enabled = False in preparation for BlueBEAR downtime.
20100831 CJC Moved epgd17 back online, but gave it the property "raid". All other nodes have the property "lcgpro". Modified qsub script so that all jobs require the "lcgpro" property, with the exception of jobs submitted by "atl059", which require the "raid" property. In this way, epgd17.ph.bham.ac.uk has been isolated for the purposes of testing the RAID performance.
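The wrapper logic described above can be sketched like this (the function name is illustrative, not the real qsub script):

```shell
# Decide which node property a job should request: atl059 gets the
# isolated RAID test node, everyone else gets the normal pool.
node_property() {
    if [ "$1" = "atl059" ]; then
        echo raid          # RAID test node epgd17
    else
        echo lcgpro        # all other worker nodes
    fi
}
p_normal=$(node_property atl001)
p_raid=$(node_property atl059)
```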
20100831 CJC Moved epgd17 offline for the purposes of testing a RAID'ed WN.
20100827 CJC Noted that the ATLAS Squid was swapping about 700MB of RAM. Readjusted VM allocations to give Squid 3GB at the expense of epgr02 (hosted on the same server).
20100825 CJC Reyaimed epgr05 after ALICE complained of not being able to submit jobs. Jobs now successfully being submitted.
20100811 CJC Disappeared from Top level BDII again last night when epgr09 stopped responding to ldap queries. Restarted BDII service on epgr09. Adding hourly restart to cfengine. Checking log files for problems.
20100810 CJC Rebooted epgsr1 due to the disappearance of 4 files systems. This fixed the problem.
20100810 CJC Added the 40TB of storage attached to epgsr3 to the DPM. Used to reinstate a 50TB MCDISK spacetoken (along with some storage from DATA and LOCALGROUP). Emailed Brian Davies about making this official.
20100806 CJC Accounting website reports no accounting data for epgr04.ph.bham.ac.uk (noted thanks to SpecInt tagging idea put forward by Pete Gronbech). Checking accounting records on epgr04.ph.bham.ac.uk.
20100806 CJC (Software) bonded epgsr3, all four interface connections working. Waiting for replacement host certificate before adding into DPM. Still have to update epgse1:/etc/shift.conf
20100806 CJC Birmingham back in the information system, and on gstat.
20100806 LSL Reconfigure switch epsw22 to include bond for sr3 on ports 05-08 and future sr4 on ports 09-12. Note: had to use browser IE<=6 or FF<=2 to reconfigure trunking on this DLink switch.
20100806 CJC Rebooted Site BDII after it failed to respond to ldap queries. service bdii status checked out ok before reboot - checking logs...
20100803 CJC qfeed scripts superseded on epgr04 by the qall script. This reads in a list of prioritised usernames, along with a maximum number of jobs they're allowed to run. The script then runs through all queued jobs and submits as many as it can. The script is invoked as root using the command qalld cfengine/files/qall.priorities d&.
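A toy demo of the qall selection step (both file formats are assumptions: the priorities file maps username to max running jobs, and queued jobs are picked until each user's cap is reached):

```shell
# Fake priorities file and queue listing, then the cap-limited pick.
prio=$(mktemp); queue=$(mktemp)
printf 'alice 2\nbob 1\n' > "$prio"
printf 'job1 alice\njob2 alice\njob3 alice\njob4 bob\njob5 bob\n' > "$queue"
# first pass loads caps; second pass releases jobs while under cap
released=$(awk 'NR==FNR { cap[$1] = $2; next }
                run[$2] < cap[$2] { run[$2]++; print $1 }' "$prio" "$queue")
```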
20100803 CJC Ran /sbin/start_udev on epgd02, 08 and 12 to fix the /dev/null bug. epgd10 remains unresponsive.
20100803 CJC Moved epgd02, 08, 10 and 12 offline as they have been hit by the overwriting /dev/null bug.
20100802 CJC Redeployed epgr04 with a reduced number of pool accounts.
20100802 CJC Moved DPM Head Node to SL5.4 machine (keeping same name and IP address). Moved Site BDII to SL5, deploying on VM epgr09. This required changing both node and GIIS information in the GOCDB.
20100730 CJC Moved epgd01 offline due to problem remotely rebooting (NFS?).
20100730 CJC Scheduled downtime for the whole of Monday 2nd August so that the SE can be migrated to SL5.
20100728 CJC Added 1TB to ATLASSCRATCHDISK (at the cost of LOCALGROUPDISK) to avoid being blacklisted. Available space must stay above 1TB!
20100727 CJC Added Pheno and DZero support to local cluster and SE.
20100722 CJC Changed maui FairShare weighting scheme on epgr05 to more extreme values. Previously the FS group weights were treated as percentages, which did not discriminate enough between jobs. Ops jobs should now have the highest priority, followed by ATLAS/ALICE. LHCb jobs follow next, with all other VOs taking the lowest priorities.
20100721 CJC UK CPU and Storage ranks, based on information in the BDII, are made available online.
20100721 CJC Installed voms.gridpp.ac.uk and voms.ngs.ac.uk certificates in /etc/grid-security/vomsdir/ on epgr05. This CE shouldn't need the certificates (relying instead on the vomsdir/VO/*.lsc files), but a bug means that it can't deal with VOs that need to authenticate with the Manchester VOMS server.
20100710 LSL /egee progress: only u4n085 and u4n116 unconverted to new NAS; both are offline, so restart glong queue. Later: all done.
20100709 LSL /egee progress: BB worker nodes u4n081,082,110-128 are on new ep19x NAS. 100-109 are offline awaiting job finish.
20100709 LSL BB SAM tests for epgr04 have been showing u4n128:CRITICAL for the WN-CAver test: info files showed certs were version 1.34, but required 1.36. Later received GGUS ticket 59922. In /egee/soft/SL5/middleware/prod/external/etc/grid-security, I moved certificates/ to certificates.yyyymmdd/ and rsync'd afresh from epgr04.ph.bham.ac.uk::certificates. Done on both currently active /egee directory trees. Suggest re-instating g-admin cron job to do this.
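A local re-enactment of that certificate refresh pattern (temp dirs stand in for the /egee grid-security tree and the epgr04 rsync module; file names are dummies):

```shell
# Move the stale certificates directory aside with a date suffix,
# then rsync a fresh copy in its place.
src=$(mktemp -d); dst=$(mktemp -d)
stamp=$(date +%Y%m%d)
mkdir "$dst/certificates"; touch "$dst/certificates/stale-1.34.pem"
mkdir "$src/certificates"; touch "$src/certificates/fresh-1.36.pem"
mv "$dst/certificates" "$dst/certificates.$stamp"   # keep old tree aside
rsync -a "$src/certificates" "$dst/"                # pull a fresh copy
```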
20100709 LSL Our storage was not being reported by ldap to epgse1 or by Gstat2 on web. Rebooted epgse1 (last night) to remedy. Today found that there were log messages "dpm: failed" and "dpnsdaemon: failed". File dpm/log indicates epaf17 network-down. Restarted network on that, and restarted epgse1, 10am Friday. Query via ldap now showing sensible Size information, absent before.
20100708 LSL Around 5pm: on the console, logged on to all physical and virtual grid machines to check if they were down on the network: all were down except epgsr1 and epgsr2. Did service network restart for those.
20100708 LSL On epgmo1, truncated that big log file, manually set IP addr, copied iptables from iptables.save, moved /etc/cron.d/cfengine_cron to /root directory for now.
20100708 LSL Noticed that most grid servers had no network accessibility. Checked epgmo1 and found it had a 100% cfrun process, and 100% disk full. File /var/log/cfengine_backup.log was 77GB, with messages "You do not have a public key from host epgpe10.ph.bham.ac.uk", "Do you want to accept one on trust (yes/no)", and "Please answer yes or no". File /etc/sysconfig/iptables had been truncated at 4096 bytes, presumably by the disk full condition after an attempted update by cfengine.
20100708 LSL Rebooted several BB workers to check that access to new /egee server worked from a fresh image. It does. Also converting a further handful of workers to use the new /egee.
20100707 LSL On u4n128 tested new BlueBEAR transtec NAS /egee server, known on the network as ep19x and 10.143.245.103: no problems.
20100705 CJC Owing to the Great Pool Account Crisis of 2010 (BlueBEAR Moab hit a hard limit on manageable pool accounts), Camont, CMS, NA48, Southgrid and Zeus have been disabled on the BlueBEAR CEs.
20100705 CJC Allowed ssh connections in epgr03:/etc/hosts.allow from gridppnagios.physics.ox.ac.uk on the ALICE VOBox in order to pass nagios tests. As Patricia could already gsissh from lxplus, failing these OPS tests did not affect functionality.
20100705 CJC Completing dpm-drain of storage hanging off epgse1 by fixing drain errors (dpm-delreplica non-existent physical files and rm -f files marked as in the process of being deleted by the DPM).
20100705 CJC Copied (cp -a as g-admin) /egee/soft/SL5/middleware and /egee/soft/SL5/local/ onto the new file server, mounted at bluebear4x:/mnt/egee-new. Stopped queues to ensure no more software jobs are submitted and started to copy software directories.
20100628 CJC Incremented the kSI2K spec of the CreamCE by 1 to make differentiating between published accounting records on the accounting website easier.
20100622 CJC Serena Psoroulas having difficulties authenticating at Cambridge. Watching pilot job at Birmingham - local ID 2820474 on BB.
20100622 CJC Changed software tags area /opt/edg/var/info on epgr02/05 so that it's hosted on epgpe04 and NFS mounted on the CEs. This may help to alleviate the writing problems experienced by epgr02. Resubmitted 15.8.0 installation job.
20100622 CJC Installed Cream CE on epgr07 to submit jobs to BB. Not in site BDII yet due to problem with firewall on epgr07 preventing connections to bbexport.
20100622 CJC Problems with torque server on epgr05. Appears to be confused about which jobs are actually real. Killed 0% efficient jobs and started to manually qrun some of the backlog. Brought epgd24 back online. Accounting stats recorded at 39782 for June.
20100621 CJC Yum updated all nodes. No reyaim required.
20100621 CJC Removed queued LHCb pilot jobs from epgr02/05. Pilot factory appears to have read a wrong value from the information system and sent too many jobs. Queued pilots are safe to remove because no work has been assigned yet.
20100618 LSL BB user g-atl023 (Steve Lloyd) jobs generating 3000 emails per day recently. These get stuck in campus emailer, causing extra load. Solutions are (a) run a working sendmail server on epgr04, or (b) tweak our qsub so that emails are directed to some account of ours. If $HOME/.forward files on BB worked (they don't!) that would have been another option.
20100617 LSL Ran ATLAS squid test for Alastair according to tb-support email recipe: success. Some discussion going on in Southgrid as to what to configure as our backup ATLAS squid.
20100609 CJC Added ATLAS spacetoken information to ganglia.
20100604 CJC Updated BB:/egee/soft/SL5/local/yaim-conf/users.conf, groups.conf and site-info.def to reflect new users and groups. Reconfigured BlueBEAR WN middleware.
20100603 CJC Generated ssh keys for g-ali, g-bio, g-cal, g-cam, g-stg, g-fus, and g-ze users on bluebear using the command echo ssh-keygen -v -t dsa -f /egee/home/$u/.ssh/id_dsa | sudo -H -s -u $u. New keys copied into /var/cfengine/inputs/repo/ce/sl5_bb_ce/opYtert2hpwTCsaRT9f36grTz on epgmo1 and distributed as /etc/ssh/extra/opYtert2hpwTCsaRT9f36grTz on epgr04. Added new groups to gshort and glong queues as edguser on epgr04. Waiting for new groups to be added to moab (although jobs do run if the qfeed script is used). Added relevant software areas to BB:/egee/soft/SL5.
20100603 CJC Restarted epgr02/5 queues after rebooting epgsr1.
20100603 CJC Stopped epgr02/5 queues whilst investigating epgsr1 unavailable problem. Unable to ping epgsr1 from all machines except epgse1.
20100601 CJC Updated local UI (SL4/5 Local/BB) to support Calice. This involved downloading the grid-voms.desy.de.11017.pem certificate into $GLITE/middleware/prod/external/etc/grid-security/vomsdir/. Also updated $GLITE/yaim-conf/vo.d/calice to reflect changes to available WMS.
20100601 CJC Updated epgr04 yaim definitions (via epgmo1) to reflect support for Biomed, Camont, Calice, and vo.southgrid.ac.uk. Camont user names have also been created, but they're not supported yet. Still waiting for ssh keys and sudo access on BlueBEAR to be sorted.
20100601 CJC Requested 15.6.9.9 for epgr04. Installation tasks appear to be failing on epgr02/5.
20100528 CJC qfeed'ing g-honp14 jobs on epgr04. Check back later to see if this affects the LHCb SAM tests.
20100527 CJC Ping'ing epgsr1 from desktop and epgse1 results in 0% packet loss. Checking hostname...
20100527 CJC Reduced the number of concurrent ATLAS jobs on BlueBEAR (via the qfeed scripts) to 60. This will allow the LHCb SAM tests to execute successfully. This problem will be fixed properly by the new /egee filesystem, to be installed next week (1st June).
20100526 CJC Added ngs.ac.uk support to local SL5 UI. This required the ngs certificate be downloaded from CIC, and installed as /home/lcgui/SL5/middleware/prod/external/etc/grid-security/vomsdir/voms.ngs.ac.uk.25890.pem. Also added support for ngs.ac.uk on BlueBEAR SL5 UI.
20100526 CJC Updated, rebooted and reyaimed epgmo1. Check back tomorrow to make sure that accounting is still being updated.
20100525 CJC Problem authenticating as NGS user on local UI - is ngs supported?
20100525 CJC WNs rebooted and moved back online.
20100524 CJC Marked epgd01-12 offline to drain for the purposes of rebooting and installing a new kernel.
20100524 CJC Requested 15.6.9.4 on epgr04 for Tim. Also waiting for existing installation process to finish before installing on epgr02/05.
20100521 CJC Removed old /opt/edg/var/info/atlas/lock file, dated 12 May, which may be holding up installation processes on epgr02/5. Restarted 15.6.9 installation task.
20100521 CJC Released and re-reserved a new ATLASMCDISK spacetoken in an attempt to fix the ATLAS reporting problem. This was only possible because the ATLASMCDISK was already empty!
20100520 CJC Requesting Athena 15.6.9 on local cluster.
20100517 CJC Birmingham panda queues set back online. Closed related GGUS tickets. srmv2.2 still vulnerable to crashing when querying ATLAS production spacetokens (mcdisk, proddisk etc). SE still reporting invalid size allocations according to Peter Love.
20100514 CJC Noted that the srmv2.2 service fails on epgse1 if queried with the srm-get-space-metadata command. Added the service to the cfengine grid services script, so it should be restarted every hour if it has failed. Contacted dpm-users-forum@cern.ch for advice, but planning on upgrading head node to SL5 VM.
20100513 CJC Restarted cfservd on epgr05 after failure of glite-ce-job-submit. This rules out cfengine as a cause of the job submission problems.
20100512 CJC Updated vo.d/atlas to include the BNL VOMS server, as specified on the CIC Portal.
20100512 CJC Reconfigured CreamCE following these instructions after yum updating to glite-CREAM-1.42-3.jdk5. Job submission now possible via glite-ce-job-submit command. This should solve the ISB problems experienced by ALICE. Cfengine appears to break this, requiring yaim to be rerun. Killed cfservd process on epgr05 until this problem is understood.
20100510 CJC Added separate camont queue to epgr05 (and reconfigured epgr02 in the process) so as to limit camont jobs to 3 hour walltime. This should allow greater flexibility so that camont jobs can be more quickly throttled in the case of an ATLAS avalanche.
20100510 CJC Deleting remaining replicas on ATLASMCDISK spacetoken, as requested. Deletion completed by listing files using dpm-sql-spacetoken-list-files --st=ATLASMCDISK command, followed by dpm-delreplica.
20100510 CJC Noted library problems using UI on Fedora 12.
20100510 CJC Rebooted epgsr1 after selected WNs failed to mount software area. Reboot appears to have fixed the problem.
20100509 CJC Fixed cfengine Grid Services module (/var/cfengine/inputs/modules/module:grid_services), which checks and cleans up rfiod and dpm-gsiftp processes on storage nodes. Contained a bug which meant parent processes were not identified properly. They were then killed off if older than 1 hour. The script should now only kill off slave processes (ie user transfers). Killed servkick processes in order to test overnight.
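The fixed selection logic can be illustrated like this (the "PID PPID COMMAND" input format and the PPID-1 test are assumptions about how the module distinguishes daemons from slaves):

```shell
# Daemons are direct children of init (PPID 1) and must be left
# alone; only slave transfer processes are kill candidates.
ps_out='101 1 dpm-gsiftp
202 101 dpm-gsiftp
203 101 dpm-gsiftp
301 1 rfiod'
slaves=$(printf '%s\n' "$ps_out" | awk '$2 != 1 { print $1 }')
```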
20100507 LSL Those two services on the epgsr servers are stopping every 1 or 2 hours (between 0 and 10 mins past the hour), so /root/bin/servkick now restarts them. Once the problem is understood, that process should be killed.
20100506 LSL The ntpd service on epgsr1 had gone missing so the clock was a couple of minutes slow. Restarted.
20100505 LSL Both epgsr1 and epgsr2 are losing services regularly: dpm-gsiftp and rfiod needed to be restarted. A netstat -ntlp sort was handy for spotting missing services.
20100504 LSL The campus DNS is failing for about 5% of lookups (hostname not found for good hosts). Network team contacted.
20100430 CJC ALICE note problem using Input SandBox with CreamCE.
20100430 CJC /etc/cron.d/grid_services.cron set to run on DPM Head and Pool nodes to check the status of dpm-gsiftp and rfiod and restart if appropriate. This might fix the source of the latest ATLAS errors.
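A sketch of what such a cron.d file could look like (the interval, script path and log path are assumptions, not the recorded values):

```
# /etc/cron.d/grid_services.cron -- check dpm-gsiftp and rfiod regularly
*/10 * * * * root /root/bin/check_grid_services >> /var/log/grid_services.log 2>&1
```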
20100428 CJC In light of pbswebmon data, restricting camont and ATLAS Pilot jobs, as their efficiency is particularly low.
20100428 CJC PBSWebMon now running on epgr08. The original script has been edited to allow monitoring of both epgr05 and BlueBEAR.
20100427 CJC Added epgsr1:/disk/f15c to the new DPM Pool DPM001. Moved all epgse1 filesystems offline and started draining into epgsr1:DPM001. The recovered storage on epgse1 will eventually become the experiment software area, with epgse1 demoted to providing NFS services.
20100426 CJC Rebooting epgsr1 after transfer problems occur.
20100423 CJC epgd01 back online. Installed pakiti2-client on all Grid nodes. Now reporting to epgr08. The pakiti server has yet to be properly configured, and should probably be password protected.
20100423 CJC Marked epgd01 offline due to package inconsistencies after yum update.
20100420 CJC Moved fabric monitoring software (currently Nagios and Ganglia, but soon also Pakiti and PBSWebMon) to epgr08 from epgmo1. If php compromises security, the accounting won't be at risk!
20100410 CJC Manual reboot of epgsr2 after it became unavailable on the network. Problem first occurred at 0317 10/04/2010. ATLAS disabled queues and issued a GGUS ticket. No clue in logs as to what the cause of the problem might have been. Also noted that NTP failed to update time from GMT to BST after rebooting.
20100409 CJC Removing epgd24 from twin cluster and adding to test cluster for the purposes of NFS performance testing.
20100409 CJC Allowing submission to u4n183-4 and u4n127-8 on Bluebear via the epgr04:/root/bin/qgogo script (updated via cfengine). These nodes were previously reserved for maui-assigned jobs, but the nodes can be better employed by allowing ATLAS jobs to run.
20100409 CJC Re-enabled nodes epgd23-24, and ensured that 8 ATLAS production jobs were running per node. ATLAS software area remounted on epgd24 with the actimeo=60 option, in an attempt to reduce the number of getattr calls. Draining epgd20-22 in order to test the effect of varying the block size.
20100409 CJC Draining nodes epgd23-24 to test NFS mount settings.
20100408 CJC Black holes detected (and fixed) on epgd15 and epgd22 - nodes not able to scp files back to CE.
20100406 CJC Removed and recreated all pool accounts on epgr05. This may fix the ops problems with running jobs on the CreamCE.
20100406 CJC Updated DESY vomscert on CEs and DPM head node.
20100401 CJC Reducing the number of ATLAS Production jobs running on local twins in order to let pilot test jobs run.
20100401 CJC Updated all iptables to accept connections from 147.188.128.127, the University network time server. Corrected date and time on epgsr2 and epgd22. epgsr2 had not updated after change to BST, and this might be the source of the ATLAS file transfer problems.
20100401 CJC Allowing ssh connections on ALICE VOBox from 137.138. (lxplus) and 128.142. (CERN ops nagios). Updated /etc/hosts.allow to reflect this.
20100401 CJC Installing rootkits on all nodes and broadcasting root password on university email lists (only joking - April Fool :D)
20100331 CJC Noted error "Pinging service ClusterMonitor ... The service is running at epgr03.ph.bham.ac.uk:8084, uri ClusterMonitor ... connect: Connection refused" on epgr03 (ALICE VOBox).
20100329 CJC Disabled epgr02 queues for the purposes of upgrading worker nodes to glexec_wn.
20100326 CJC Re-enabled queues on epgr04 and removed Draining status.
20100326 CJC Rebooted epgr04 now that the BB:/egee filesystem has been fixed. Submitted a GGUS Ticket regarding the rogue SAM Nagios tests coming from samnag010.cern.ch in the SW Cloud.
20100326 CJC Tried removing and redeploying pool accounts using cfengine module. Jobs failed because glite software maintained references in /etc/grid-security to old pool account names.
20100326 CJC Added epgd17-24 to the local cluster, bringing the number of job slots to 192. Increased the ATLAS maui quota and reyaimed epgr02. Because this machine is still a 64 bit VM, some of the library paths in /opt/glite/etc/lcas/lcas.db and /opt/glite/etc/lcmaps/lcmaps.db were wrong. These are now corrected automatically by cfengine after running yaim.
20100324 CJC Job submission now successful on both CAGE CEs.
20100322 CJC Job submission on epgr08 currently fails.
20100322 CJC Updated epgsr1:/etc/exports (via epgmo1) so that experiment software areas can be mounted on epgr07 as well.
20100322 CJC Deploying epgr08 as an lcg-CE to see if the lcg-CE and CreamCE can co-exist.
20100322 CJC CreamCE/ARGUS/GLEXEC_wn test bench now accepts dteam and ops jobs. Would like to be able to test the pilot job glexec functionality.
20100318 CJC Installed as a test suite - epgr05 -> Cream CE, epgr06 -> ARGUS Server, epgr07 -> GLEXEC_wn. Can't submit jobs yet.
20100317 CJC Birmingham starts to fail the GangaRobot WMS tests, because the wrong certificates were copied to epgsr1 and 2 after update to cfagent.conf on epgmo1. Fixed and now waiting for jobs to return to Birmingham. This does not address the problem of why we are not receiving any panda jobs.
20100317 CJC Fixed CreamCE output retrieval problem by allowing incoming and outgoing traffic on the recommended ports.
20100316 CJC Noted that CreamCE only reports job complete when the firewall is off => check ports.
20100316 CJC BlueBEAR loses /egee filesystem again. Put qhold on remaining queued jobs as edguser on epgr04, and use qdisable on glong and gshort. Move epgr04 to "Draining" state.
20100316 CJC Updated kickstart files in epgmo1:/data1/grid/kickstart to reflect the fact that redhat mirror is now held at 147.188.47.108:/disk/11b/home/redhat/.
20100316 CJC Renamed epgsr3, epgce1, epgce3 and epgce4 according to LocalGridMachines, and rebooted. Note that this required changes to epgmo1:/etc/dhcp.conf and to /etc/sysconfig/network on each physical machine to be renamed.
20100316 CJC Due to DNS problem, flashed /etc/hosts on all grid nodes with the line 147.188.46.8 epgsr1.ph.bham.ac.uk epgsr1. This should allow grid nodes to communicate with the first pool node (ie epgsr1) while the DNS problem is being resolved.
20100315 CJC Noted that although ATLAS Panda jobs submitted from the local UI (ganga 5.4.5) run successfully on BlueBEAR, gangarobot jobs are still failing because of the TAR_WN bug. Soft linked all files in /egee/soft/SL5/middleware/prod/globus/lib into /egee/soft/SL5/middleware/prod/atlas_fix/lib, and prepended the new variable onto the LD_LIBRARY_PATH environment variable in the x509.sh script. This should fix the gangarobot failures.
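The TAR_WN workaround above, roughly as shell (directory paths are from the entry; the `link_libs` helper name is mine):

```shell
# Sketch: mirror the globus libs into a fix directory, which x509.sh then
# prepends to LD_LIBRARY_PATH so ATLAS gangarobot jobs can find them.
link_libs() {  # usage: link_libs <src-lib-dir> <fix-lib-dir>
    mkdir -p "$2"
    for f in "$1"/*; do
        [ -e "$f" ] && ln -sf "$f" "$2/"
    done
}
# as used here (from the entry):
# link_libs /egee/soft/SL5/middleware/prod/globus/lib \
#           /egee/soft/SL5/middleware/prod/atlas_fix/lib
# then in x509.sh: export LD_LIBRARY_PATH="<fix-lib-dir>:$LD_LIBRARY_PATH"
```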
20100315 CJC Added the f14 filesystems into the DPM. Temporarily split the 20TB of space evenly between SCRATCH and LOCALGROUP disks, although it is expected that this space will be dedicated to the TopPhys cache.
20100315 CJC Clean install on epgsr3 and epgr06. Redeploying epgsr3 as a WN for the CreamCE, with epgr06 intended to be a glexec/SCAS server installation.
20100315 CJC Booted the new epgsr1 and configured as a DPM Pool node. Successfully copied new data onto and off the node. Software directories all exported properly. Not yet network bonded. The new epgsr3 is still network bonded, and so should have the IP Address which is hard coded into the bond0 script changed before booting.
20100315 CJC Renamed epgsr1 as epgsr3 and shutdown. Swapped MAC addresses in epgmo1:/etc/dhcp.conf.
20100315 CJC Renamed epgsr3 as epgsr1 and shutdown. Ready for downtime.
20100314 CJC Moved epgr04 into the draining state so that no more new jobs would be submitted before the scheduled downtime on Monday. Moved all local worker nodes offline for the same reason.
20100312 CJC Installed epgsr3.ph.bham.ac.uk, with the intention of reconfiguring as a replacement for epgsr1 on Monday. Glite stack installed, but not configured.
20100311 CJC dteam jobs should now run on the CreamCE. Investigating the possibility of SCAS server. OOPs! Only if the firewall is off!
20100311 CJC Added the project directive -A lowel01 to qsub script on epgr04. (Via epgmo1:/var/cfengine/inputs/repo/ce/qsub.sh, as this file is periodically copied to the CE!)
20100311 CJC Implemented Part 1 of the Grid Backup policy. A cron job on epgmo1 invokes the command /usr/sbin/cfrun -- -D run_backup >> /var/log/cfengine_backup.log 2>&1, which will run the backup module on all grid nodes (excluding epgsr1 as this is still not under the control of cfengine). This module reads a list of files from /root/cfengine/files/backup.rules, and copies the relevant files (preserving permissions, access times and directory structures) to /root/cfengine/backup/`date +%Y%m%d`. The directory is then compressed, ready for Part 2. This will involve distributing to epgsr1 and BB.
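The Part 1 flow above could be sketched roughly as follows (the rules-file format — one absolute path per line — is an assumption, and `backup_files` is a made-up name; paths are from the entry):

```shell
# Rough sketch of the backup module: copy each listed path, preserving
# permissions and directory structure (cp -a --parents, GNU coreutils),
# into a dated directory, then compress it ready for distribution.
backup_files() {  # usage: backup_files <rules-file> <backup-root>
    day=$(date +%Y%m%d)
    dest="$2/$day"
    mkdir -p "$dest"
    while read -r path; do
        [ -e "$path" ] && cp -a --parents "$path" "$dest"
    done < "$1"
    tar -czf "$2/$day.tar.gz" -C "$2" "$day"
}
# e.g. backup_files /root/cfengine/files/backup.rules /root/cfengine/backup
```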
20100309 CJC epgr05.ph.bham.ac.uk advertised as a CreamCE in the Site BDII. Marked as preproduction in the GOCDB. Noted that the GlueCEStateStatus item has the value "Special", and not "Production", which is why jobs are not being matched.
20100309 CJC Installed SL5 UI on BlueBEAR. Users simply log on and source /apps/hep/lcgui/lcguisetup. Note that this only works for SL5 so far. Also note that the SL5 installation is dependent on the CRLs managed by the SL5 WN installation, which are stored in /egee/soft/SL5/middleware/prod/external/etc/grid-security/certificates. The relevant X509 environment variables are set by the /apps/hep/lcgui/SL5/middleware/prod/external/etc/profile.d/x509.sh script, which is created by running the /apps/hep/lcgui/SL5/yaim-conf/post_yaim.sh script after running yaim.
20100305 CJC Rationalized epgse1:/disk/f??/vo folder creation after biomed tried writing files to the non-existent /disk/f9a/biomed directory. All supported VOs should now have the appropriate directories on all SE filesystems.
20100305 CJC BlueBEAR back online. Clearing "Draining" status from epgr04.
20100303 CJC Requested epgr04.ph.bham.ac.uk be removed from the LHCb management system to allow jobs to run at epgr02 (jobs were no longer being submitted because epgr04 was down).
20100302 CJC Implemented a simple pbs monitor for Nagios, which detects WNs in the offline and down state.
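A minimal sketch of such a plugin (it reads `pbsnodes -a` output on stdin so it can be tested; exit codes follow Nagios conventions, and the exact state strings are torque's):

```shell
# Count worker nodes whose "state =" line contains offline or down, and
# report Nagios-style: 0 OK, 2 CRITICAL.
check_pbs_nodes() {  # usage: pbsnodes -a | check_pbs_nodes
    bad=$(awk '/state =/ && (/offline/ || /down/)' | wc -l)
    if [ "$bad" -gt 0 ]; then
        echo "CRITICAL: $bad worker node(s) offline or down"
        return 2
    fi
    echo "OK: all worker nodes up"
    return 0
}
```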
20100302 CJC Updated the SL4 and SL5 UI on the local system. Added support for ILC to both. Note that the SL4 installation (UI_TAR 3.1.44-0) suffers from a bug such that external/usr/lib is appended to the grid environment LD_LIBRARY_PATH variable. This is fixed in the yaim-conf/post_yaim.sh script. The SL5 installation (UI_TAR 3.2.6-0) did not install any .pem or .lsc files in external/etc/grid-security/vomsdir/. These were manually copied from the SL4.new installation.
20100302 CJC Changed epgr04 to Draining status while BB:/egee/ problems continue. Also added epgr04 AT RISK status to GocDB.
20100301 CJC Need to implement x509 fix on UI_TAR installations. Also need to make sure ILC can authenticate on UI. Also need to implement Globus_Port_Range fix for UI. Where do the *.pem files in vomsdir come from in SL5 installation? They appear to just be there in SL4.
20100301 CJC Submitted 69 local jobs from epgr04 to BB as each of the configured users according to the showusers output. Each job runs the BlueBEAR cleanup script, which should remove old files and directories in the BB:/egee/home/ area. It is hoped that this will ease the slow file access problem on BB. Note that this is a temporary measure until the cron jobs are updated on BB!
20100301 CJC Upgraded to glite-WN_TAR 3.2.6-0 on BlueBEAR.
20100301 CJC Created the scripts BB:/egee/soft/SL5/local/yaim-conf/pre_yaim.sh and post_yaim.sh to be run before and after yaim on BlueBEAR when configuring a new WN_TAR release. These scripts make sure that the X509 environment variables are set and creates the gridmapdir directory for the cleanup scripts.
20100226 CJC Removed SL4 tasks from BB:/egee/system/cron.d/cronuser.u4n??8 cron definitions. These will have to be reloaded on the cron nodes to stop the tasks from being executed!
20100226 CJC Removed WN_TAR release 3.2.4-0 from BlueBEAR. Noted that production system is still 3.2.5-0. Will upgrade to 3.2.6-0 after cleanup scripts investigated.
20100224 CJC Submitted a GGUS Ticket regarding the apparent problems publishing APEL data for the past 13 days.
20100224 CJC Added Virtual Hosts to epgmo1 webserver. The University locks down port 80, so this is not externally accessible. Port 8888 is externally accessible. epgmo1.ph.bham.ac.uk:8888 will now serve the ganglia pages. epgmo1.ph.bham.ac.uk will serve the config files held in /var/www/html/config.
20100224 CJC Noted that there are a large number of H1 GridFTP transfers on epgse1. This would make sense in the context of the larger number of production jobs which have just completed. According to Ganglia, the problem appears to be the load and CPU usage, not bandwidth related.
20100223 CJC Yaim Savannah Bug highlights the fact that lcg-CE is not supported on SL4 32bit. Redeploy?
20100223 CJC Updated iptables for pool nodes. This should also now allow communication between the pool node and BB IPs.
20100223 LSL Noticed during epgsr2 reboot that PXE doesn't function when the bonded interfaces eth0-3 are connected to the group-of-4 trunked switch ports. Observed that epgmo1 receives and responds to the PXE dhcp packets, but epgsr2 PXE doesn't see responses. If important, swap cables such that switch XOR algorithm (LocalGridBonding) chooses eth0 for response.
20100223 LSL Noticed epgsr2 RAID disk labels had been assigned wrongly, eg f16c on physical RAID f17, so went offline, backed up /disk/f* to internal disk, re-initialised the RAID file-system labels correctly, restored the disk areas from the backup, and went back online. Note that the LocalGridRaidFormat doc describes the initialisation process.
20100223 CJC Started to remove SL4 software areas from BB.
20100223 CJC epgr04 failing SAM tests related to CRLs. Removed everything in BB:/egee/soft/SL5/middleware/prod/external/etc/grid-security/certificates, and reran yaim config. This fixed the cert test warning, but not the rm test error. Under Investigation.
20100223 CJC Noted that Birmingham has not published any accounting statistics for 12 days. Ran gap publisher on epgmo1 manually. Under investigation.
20100222 CJC Ran yaim on epgsr2. This allowed lcg-cr transfers onto the pool node when the firewall was dropped. Compare with epgsr1 to see which ports need to be open. epgsr2 configured with a bonded network connection, but requires the network cables to be physically moved.
20100222 CJC Changed permissions on /home/lcgui/SL5/local/bin-cron/local-fetch-crl to 755 so that cron job can actually download the CRLs (previously failing - permission denied to execute).
20100222 CJC/LSL Power supply problem noted on epgd09-16. No data on ganglia for these nodes since Sunday 21st, 2pm. Outage due to a blown fuse.
20100222 CJC epgsr2 network booted and started to reinstall. Because epgsr2 was left set to network boot on epgmo1, it started to reinstall after rebooting following the unbonding action. As the RAIDs were not disconnected, they appear to be formatting as well - this should not be allowed to happen again.
20100222 CJC Unbonded epgsr2 and rebooted. This enabled the pxeboot to run (failed for the bonded interface). An interesting question: does the unbonded network connection for eth0 work when connected to the trunked ports on the switch?
20100221 CJC Initialised the qfeed script on epgr04 for g-honp09 in the absence of ATLAS production.
20100219 CJC Ganglia and dpmmgr UIDs got confused on epgsr2, causing DPM to create directory structures belonging to ganglia. Reinstalling in order to ensure completely fresh setup and avoid future difficulties. Moved ganglia installation to come after lcg installation in cfengine.
20100219 CJC Updated all groups.conf so that entries take the form "/alice/ROLE=lcgadmin":::sgm: (was previously "/VO=alice/GROUP=/alice/ROLE=lcgadmin":::sgm:). The old format was causing yaim to only make entries in /etc/grid-security/grid-mapfile for special (ie production) accounts. This caused some intermittent problems on epgse1 during the afternoon.
20100219 CJC Rerunning yaim did not help the transfer problems on epgsr2. Noted that other users have tried to write to the disk, so marking as readonly for now.
20100219 CJC Tried to copy a file onto epgsr2 by setting all other disks to read only and then using the command lcg-cr -v --vo atlas -d epgse1.ph.bham.ac.uk --st ATLASSCRATCHDISK -l lfn:/grid/atlas/users/christophercurtis/test.sh.0 file:///home/cjc/thesis.tar.gz. The transfer appeared to stall, and in epgsr2:/var/log/dpm-gsiftp/gridftp.log the entry "530 Login incorrect. : Could not get virtual id!" was noted. Checked /etc/grid-security/grid-mapfile and found only production accounts listed. Rerunning yaim.
20100219 CJC Network bonded eth0-3 on epgsr2.
20100219 CJC Moved /etc/xen/epgr0* VM definitions on VMHosts to /etc/xen/auto/epgr0*. This allows the machines to boot automatically after the host has booted. Changed initialisation scripts to reflect this.
20100218 CJC Ran yaim on epgsr2, with some success. Storage not yet online because DPM on epgse1 has not been updated. This should really be done automatically somehow...
20100218 CJC Updated ssh permissions on all grid nodes. ssh now only allowed between log in node, Lawrie's desktop and Chris' desktop. No ssh between nodes permitted (with the exception of the required ssh between epgr02 and the twin WNs and epgr04 and BB exports).
20100218 CJC Brian Davies suggests DATA=25.09T, GROUP=5T, HOT=1T, LOCALGROUP=18T, MC=25T, PROD=3.5T and SCRATCH=12T for the ATLAS spacetoken allocations, assuming that epgsr2 is assigned entirely to ATLAS.
20100218 CJC Tim can't download dataset user09.timmartin.105003.pythia_sdiff.MinBiasAthenaV1.AtlOff15.6.1_r1027 from Tokyo. DQ2 complains about a CRL problem. The transfer works at CERN however. Check CRLs on UI.
20100218 CJC epgce2 failed to reboot overnight because of a failed bios RAM test. This machine is known to have bad RAM. To avoid the problem again, the bios settings were changed so that the F1 key does not have to be pressed manually on discovering an error. This should allow the machine to continue to boot.
20100218 CJC Connected and mounted /disk/f16 and /disk/f17 RAIDs to epgsr2. This required creating mount points on epgsr2 ( /disk/f1?[a-d]) and adding entries for each filesystem in /etc/fstab. Rebooted. This machine is now ready for configuring as a pool node. It will also require network bonding in the near future.
20100218 CJC Fixed the epgr04 gmetric.cron by adding /root/bin to the path. Previously not reporting any running jobs because it could not find the qs command.
20100218 CJC Restarted pbs_mom on all WNs. Communication between WNs and epgr02 lost sometime over night.
20100217 CJC epgce2 noted to be incommunicado. Requires manual reboot.
20100217 CJC Found that xen does not automatically restart domains after a VM host is rebooted.
20100217 CJC Prepared epgmo1 for the installation of Storage Pool epgsr2. Linked epgmo1:/tftpboot/pxelinux.cfg/93BC2E25 -> hosts/epgsr2.ph.bham.ac.uk ->/tftpboot/pxelinux.cfg/configs/boot-hd.cfg. Changed bios settings to network boot first. Installed SL5.3, preserving the Dell Utility partition.
20100217 CJC Peter Love confirms that the reason for Panda Jobs not running on BlueBEAR is because ATLAS breaks the LD_LIBRARY_PATH variable in tarball installations.
20100217 CJC Changed all installation scripts to use Local RPM Repo when first installing/updating a new node. This includes changes to the /var/cfengine/inputs/repo/vm/* scripts on epgmo1, as well as to all kickstart files. This avoids difficulties with the main Scientific Linux Repo (which is currently unavailable). Further changes to repo lists may be made after a node has been installed by cfengine.
20100217 CJC Noted that the dpm, dpmcopyd, srmv1 and srmv2.2 services on epgse1 failed to restart after a reboot. Restarted manually.
20100216 CJC Deployed epgr05 as a blank VM ready for the CreamCE. epgr06 deployed as a WN for epgr05.
20100216 CJC Rebooted epgce4 - this should install SL 5.3 x86_64 and prepare the machine for two VM Hosts.
20100216 CJC Added virtual host to epgmo1:/etc/httpd/conf/httpd.conf, listening to *:80. All normal web connections should now be accepted without having to authenticate using SSL. Authentication still used for https://epgmo1.ph.bham.ac.uk/nagios. Changed /data1/grid/kickstart/*, /var/cfengine/inputs/cfagent.conf, /var/cfengine/inputs/repo/vm/* and /var/www/html/*.ks on epgmo1 to reflect this.
20100216 CJC Backed up /opt, /etc, /var and /root on epgce4 to epgsr1:/disk/f15d/epgce4.backup. This will be the final backup before redeploying epgce4 as a VM host.
20100215 CJC Downloaded swevo.ific.uv.es.pem into /etc/grid-security/vomsdir on epgr02 to allow fusion jobs to run.
20100215 CJC Configured SL5 UI. Edited /usr/local/bin/lcguisetup to call /home/lcgui/SL5/local/lcguisetup.bash if the user is in an SL5 environment. Also added /home/lcgui/SL5/local/bin-cron/local-fetch-crl to eprexa cron jobs so that the CRLs are downloaded every 6 hours.
20100215 CJC Installed UI 3.2.6-0 on local system for use on SL5 nodes (ie eprexb). Unzipped UI tarballs into /home/lcgui/SL5/middleware/3.2.6-0 and soft linked to /home/lcgui/SL5/middleware/prod/. Configured with yaim /home/lcgui/SL5/middleware/prod/glite/yaim/bin/yaim -c -s /home/lcgui/SL5/yaim-conf/site-info.conf -n UI_TAR. Note that this must be done from an SL5 node! Changed permissions of profile scripts in /home/lcgui/SL5/middleware/prod/external/etc/profile.d/* to 755 (previously 744). Copied /egee/soft/SL5/local/bin-cron/local-fetch-crl from BlueBEAR into /home/lcgui/SL5/local/bin-cron/ and executed.
20100215 CJC On BlueBEAR, replaced the softlink /egee/soft/SL5/middleware/prod/external/usr/lib64/libldap-2.3.so.0, which previously pointed to libldap-2.3.so.0.2.31 in the same library, with one which points to /usr/lib64/libldap-2.3.so.0. This fixed the ldapsearch error "LDAP vendor version mismatch: library 20343, header 20327".
20100215 CJC Updated BlueBEAR WN tarball to release 3.2.5-0. Untarred release into /egee/soft/SL5/middleware/3.2.5-0 and then updated the softlink /egee/soft/SL5/middleware/prod to point to the new release. Ran yaim twice: /egee/soft/SL5/middleware/prod/glite/yaim/bin/yaim -c -s /egee/soft/SL5/middleware/yaim-conf/site-info.def -n glite-WN_TAR and /egee/soft/SL5/middleware/prod/glite/yaim/bin/yaim -r -s /egee/soft/SL5/middleware/yaim-conf/site-info.def -n glite-WN_TAR -f config_certs_userland -f config_crl to configure the WN release and obtain the CRL url files. Added the file /egee/soft/SL5/middleware/prod/external/etc/profile.d/x509.sh, which sets the X509_CERT_DIR and X509_VOMS_DIR variables, because these are not added by the yaim config. Edited /egee/soft/SL5/middleware/prod/external/etc/profile.d/grid-env.sh to ensure that x509.sh is also called.
20100215 CJC On BlueBEAR, changed /egee/soft/SL5/local/bin-cron/local-fetch-crl so that it retrieves the current CRLs, and places them into /egee/soft/SL5/middleware/prod/external/etc/grid-security/certificates. This script is executed every six hours by the /egee/system/cron.d/cronuser.u4n??8 cron job.
20100215 LSL On BlueBEAR, changed the NFS mount of /egee so that it uses the noatime option, for efficiency, so that simple file accesses do not result in inode re-writes back through the NFS/GPFS system. Observed that the NFS v3 max transfer size is 32768, even if higher value requested.
20100212 CJC Removed epgce4 from site BDII definition. Removed references to myproxy services on epgmo1 by 1) Removing all glite*, edg*, bdii* packages. 2) Removing /opt/bdii, /opt/glite, /opt/globus and /opt/edg directories 3) Reinstalling and reyaiming via cfengine. This has removed all myproxy references in the node BDII.
20100210 CJC Updated the GridPP voms certificate on the local UI. This is held in /home/lcgui/SL4/etc/grid-security/old-1.28/vomsdir/voms.gridpp.ac.uk.pem. The updated version is available here. The old certificate, which expires on 11/02/2010, has been backed up to voms.gridpp.ac.uk.22812.pem. Also updated certificates on epgr02 and epgse1 in the directory /etc/grid-security/vomsdir/.
20100210 CJC Killed qfeed on epgce4 and started it on epgr04 for the user g-atlp13 (Graeme).
20100210 CJC Notified ATLAS decommissioning of epgce4 and replacement by epgr04. Moved epgce4 queues offline by editing /opt/lcg/libexec/lcg-info-dynamic-pbs so that push @output, "GlueCEStateStatus: $Status\n"; becomes push @output, "GlueCEStateStatus: Draining\n";. The command lcg-info --vo atlas --list-ce --attrs 'CEStatus' confirms that epgce4 is not available for jobs. Changed status of epgce4 and epgr04 in GOC DB.
20100210 CJC Wrote a nagios plugin which raises a warning if there are less than 20% of a groups pool accounts left and a critical warning if there are less than 10%.
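The threshold logic of that plugin might look like this (how free accounts are counted — e.g. from gridmapdir — is site-specific and not shown; the function name is mine):

```shell
# Nagios-style thresholds from the entry: WARNING below 20% of a group's
# pool accounts free, CRITICAL below 10% free.
check_pool_accounts() {  # usage: check_pool_accounts <group> <free> <total>
    pct=$((100 * $2 / $3))
    if [ "$pct" -lt 10 ]; then
        echo "CRITICAL: $1 has ${pct}% pool accounts free"; return 2
    elif [ "$pct" -lt 20 ]; then
        echo "WARNING: $1 has ${pct}% pool accounts free"; return 1
    fi
    echo "OK: $1 has ${pct}% pool accounts free"; return 0
}
```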
20100209 CJC Problem with maui/pbs on epgr02 - appears to be unresponsive, even after restarting services. Rebooting machine.
20100209 CJC Pilots failing on BB SL4 complain of not finding libglobus_gsi_proxy_core_gcc32dbgpthr.so.0. This is available in /opt/globus/lib/ on Twin WN.
20100209 CJC Fusion and H1 production appear to be failing gatekeeper tests on epgr02. Investigating
20100209 CJC Nagios remote testing implemented. New tests distributed by cfengine. Added gmetric tests controlled by /root/cfengine/files/gmetric.sh via /etc/cron.d/gmetric.cron on epgr02, epgr04 and epgse1. These tests monitor the number of running jobs and the number of GridFTP transfers, making the results available to Ganglia. Tests distributed via cfengine.
20100208 CJC Patched bug in cfengine deployment of epgr02 which caused qsub -> qsub.sh -> qsub -> qsub.sh... Each time a new job was submitted, it entered an infinite recursive submission. Also added umask=022 to all shellcommands, which should fix reyaim bug.
20100208 CJC Removed egee-NAGIOS, glite-PX and glite-UI packages from epgmo1 - this monitoring information is available elsewhere. Installed vanilla Nagios release with the intention of deploying standard (eg ping, disk usage) and home brew (number of ATLAS production jobs submitted) sensors.
20100204 CJC Very slow Athena compile times noted on BB. Investigating.
20100204 CJC Athena 15.5.0 test job and SQUID test job submitted to epgr04. If successful, decommissioning on epgce4 will begin. Passed the Athena and SQUID tests.
20100203 CJC Re-enabled remote logging on all nodes (except twins). Log messages should be saved both locally and on epgmo1.
20100203 CJC Reinstalled epgd16 successfully and moved it back online.
20100203 CJC Enabled port 7512 on epgmo1 for the purposes of the MyProxy server.
20100202 CJC Graeme's jobs on BB SL4 seem to be failing with the error "Can't locate Globus/Core/Paths.pm in @INC". Investigating.
20100202 CJC Problem with lcg-cr onto SE (failed SAM tests). Investigating. Transient error on SE. Keeping an eye on it.
20100202 CJC Ganglia and Nagios monitoring installed on epgmo1. Nagios creates a high load on epgmo1. Consider reducing polling frequency or upgrading to a better machine!
20100202 CJC Fixed all grid kickstart files to connect to https://epgmo1.ph.bham.ac.uk/ack.php with the --no-check-certificate switch.
20100202 LSL Following actions noted yesterday on BB grid, added sharutils and blas-devel on BB suggested by Chris. Today, after reviewing SL5WN, added PyXML.i386 from 32-bit distro (64-bit already present, so may not be important, but harmless!).
20100202 CJC Reinstalling glite-WN on epgd16 (having put it offline first!) after ATLAS pilot job could not find globus-url-copy.
20100202 CJC epgse1 failing gsirfio lcg-gt SAM tests. Removal of --legacy from epgse1:/opt/glite/etc/gip/provider/se-dpm caused gsirfio support to be appended to ldap output. As this is not supported, BHAM failed SAM tests. --legacy support reintroduced. Awaiting the result of the savannah bug. Tested installation of savannah bug rpm - breaks xrootd support and fails to fix gsirfio problem, although legacy warnings do disappear!
20100201 CJC Re-enabled CMS jobs on epgr02 and epgr04.
20100201 CJC Removed "--legacy" switch from dpm-listspaces call in epgse1:/opt/glite/etc/gip/provider/se-dpm. This should fix the "GlueSACapability has unknown value" gstat2 warnings. This edit is managed by cfengine, so any subsequent reyaim should be fixed by cfengine. The consequence of this is that the SE appears to have dropped out of the lcg-infosites output. Is this important?
20100201 LSL Following on from my actions on 20100118, for SL5 on BB, installed compat-glibc-headers.i386 from 32-bit distro, missing from 64-bit distro. Asked Alan to update SL5 kernel for BB grid worker nodes.
20100201 CJC Changing "/C=FR/O=CNRS/CN=GRID-FR" to "/C=FR/O=CNRS/CN=GRID2-FR" in the vo.d/biomed file appears to fix biomed authentication failure errors in globus-gatekeeper.log
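That change, as a one-liner; $VO_D_FILE stands in for the site's vo.d/biomed file, whose full path is not recorded in the entry:

```shell
# Swap the old GRID-FR CA DN for GRID2-FR in the biomed vo.d file
# (guarded so nothing happens if the file is absent).
VO_D_FILE=${VO_D_FILE:-vo.d/biomed}
if [ -f "$VO_D_FILE" ]; then
    sed -i 's|/C=FR/O=CNRS/CN=GRID-FR|/C=FR/O=CNRS/CN=GRID2-FR|' "$VO_D_FILE"
fi
```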
20100201 CJC dpm-listspaces on epgse1 shows that the ATLAS pool is not using any space, which appears to be contrary to similar output from Oxford and Glasgow DPM output. Suspect this may be the root cause of GlueSAUsedOnlineSize < 0 in the Gstat-prod monitoring. Emailed Gridpp-storage list.
20100131 CJC Added qsub.sh setup to epgr02 to ensure that NGS jobs get assigned to a specific queue (previously not running).
20100131 CJC Added H1 production and lcgadmin roles to QUEUE_ENABLE variables in epgr02 and epgr04 site-info.def files. This will allow H1 production jobs to run (previously failing).
20100128 CJC Birmingham is on the Ganga Blacklist. The ANALY_BHAM Panda jobs are failing on BB, but also the latest WMS job went to epgr04 and failed when it couldn't use DQ2... Review the situation on Monday after installations have progressed.
20100128 CJC epgce4 seems to fail 14.5.0 Pilot jobs consistently. Problem with installation after epgr04 mix up? If epgr04 is up and running soon, epgce4 will be decommissioned anyway! epgr02 jobs seem to be more hit and miss - sometimes they work, sometimes they fail the md5sum test.
20100128 CJC Added SITE_OTHER_WLCG_NAME="UK-SouthGrid" to site-info.defs of all CEs and site BDII in order to pass gstat2 tests.
20100128 CJC SQUID test on epgr04 failed because latest DDM tools (ie DQ2) are not available. Requesting the 1.32 installation be fixed.
20100128 CJC Submitted one ATLAS pilot job to Birmingham with two subjobs. One subjob succeeded, the other failed due to a mismatching md5sum. Submitting job again for reproducibility. Also submitting with other datasets. Update: Other datasets have failed. Either there are lots of corrupt files on the SE (!), or there is a problem with the transfers, or there is a problem with the md5sum installation on the WNs.
20100127 CJC Increased the MAXPROC limit to 100 for both camont and ALICE production - first come, first served whilst ATLAS is down!
20100127 CJC Added logrotate directive to cfengine to ensure all logs are kept for 366 days.
20100127 CJC Added fusion VO to epgr02 + WNs. Installed lcg-vomscerts-desy to enable H1 and Zeus jobs to run.
20100126 CJC Installed GridPP Nagios suite on epgmo1. For this to work, SELinux needed to be moved to permissive mode. It also only works on port 80 (not 8888), so this has broken Ganglia (works on port 8888) and may have knock on effects... keep an eye on the SAM tests!
20100126 CJC Manual yum update of epgsr1 and a reboot. This might fix the ATLAS transfer problems.
20100126 CJC yum updating all SL5 nodes in light of new security bug. Should really develop a rolling reboot script for use in cfengine to safely reboot Twins...
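The wished-for rolling reboot could be sketched like this (the drain/reboot/online sequence uses standard torque and ssh commands; waiting for running jobs to actually finish is left out, and the RUN=echo wrapper makes it a dry run by default):

```shell
# Sketch only: offline each twin in torque, reboot it, bring it back online.
RUN=${RUN:-echo}    # set RUN= (empty) to really execute the commands
rolling_reboot() {  # usage: rolling_reboot node1 node2 ...
    for node in "$@"; do
        $RUN pbsnodes -o "$node"            # offline: stop new jobs landing
        $RUN ssh "$node" shutdown -r now    # reboot (once jobs have drained)
        $RUN pbsnodes -c "$node"            # clear the offline state afterwards
    done
}
```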
20100125 CJC HepSpec06 (32 bit, SL5) results. 9.61 for the Twins, 7.93 for BlueBEAR. Added to CE Information System.
20100122 CJC Test upload of file to SE using xrootd ( xrdcp ~/test.txt root://epgse1.ph.bham.ac.uk//home/alice/test.txt) failed with the error "Last server error 3010 ('Opening path '/home/alice/test.txt' is disallowed.') Error accessing path/file for root://epgse1.ph.bham.ac.uk//home/alice/test.txt".
20100121 CJC Moved epgd16 offline for the purposes of running hepspec. Running HepSpec test as described here. The node has all normal grid services running as per normal, but should not receive any jobs. The node has also been disabled in the epgmo1 cfengine cfrun.hosts list, so no large file transfers should take place.
20100120 CJC Problem on SL5 BB WN - grid-env.sh keeps getting overwritten, resulting in the x509 variables not being set (the x509.sh script must be executed as well). Removed offending yaim call from cron job definition, but this will require u4n108, u4n118 and u4n128 be rebooted.
20100118 CJC Moved the 2009 Diary Entries here.
20100118 LSL See previous item for BlueBEAR: the following packages (both archs) were required: compat-db compat-libf2c-34 compat-libgcc-296 compat-openldap compat-readline43 ghostscript giflib openmotif22 openssl097a tk.
20100118 LSL Review packages in the SL5 image of BlueBEAR using doc https://twiki.cern.ch/twiki/bin/view/LCG/SL5DependencyRPM, specifically packages required by metapackage HEP_OSlibs_SL5-1.0.2-0.x86_64.rpm linked from that doc. Also packages listed in doc https://twiki.cern.ch/twiki/bin/view/Atlas/SL5Migration, under heading LCG Applications Metapackage, for ATLAS.
20100118 CJC RFIO problem detected in 14.5.0 sample job 702 submitted to epgr02.
20100118 CJC Reimplemented Ganglia monitoring on epgmo1. Deployment of Ganglia is under the control of cfengine (and therefore, nodes epgsr1 and epgce4 have not been added yet to the monitoring).
20100115 CJC Reimplemented ATLAS and LHCb pilot roles on epgr02 + WNs, as they were lost during the SL5 conversion. These are now completely maintained by yaim.
20100114 CJC yum update on epgmo1 breaks the apel publishing briefly. Fixed by adding the line JAVA_HOME="/usr/java/latest" to /etc/tomcat5/tomcat5.conf.
20100114 CJC New BB lcg-CE made to work by ensuring local pool account locations match those on BB. Created a softlink /egee/home -> /home and edited /etc/passwd to reflect this.
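The fix above amounts to two small steps; a sketch on a scratch copy (the pool account name and the direction of the path rewrite are assumptions), rather than the live /etc/passwd:

```shell
# Demo of the 20100114 fix on a scratch directory: symlink one home
# prefix to the other, then rewrite a pool account's home dir in a
# passwd copy so the paths match. Safe to run as any user.
tmp=$(mktemp -d)
mkdir "$tmp/home"
ln -sfn "$tmp/home" "$tmp/egee-home"   # live fix was: ln -s /home /egee/home
printf 'fusion001:x:5001:5001::/egee/home/fusion001:/bin/bash\n' > "$tmp/passwd"
sed -i 's#:/egee/home/#:/home/#' "$tmp/passwd"   # live fix edited /etc/passwd
cat "$tmp/passwd"
```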
20100113 CJC PBS communication broken between epgr02 and worker nodes after autoupdate. WNs attempted an auto update, but failed due to package inconsistency in the glite-WN_ext repo. All WNs updated via cfengine and rebooted.
20100112 CJC Rebooted epgr02 after lcg-CE update.
20100112 CJC Auto yum update on epgce4 upgraded the lcg-CE, which broke the torque submission. Turned off yum updates (chkconfig yum off; chkconfig --list yum) and then reinstalled Lawrie's moab tools tarball. Also restored the qsub.bin/qsub.sh setup using the script held on epgr04. Local job submission works, checkjob works (unlike on epgr04). Waiting for a grid test job to return positive.
20100107 CJC Freed 1.4T of dark ATLAS data from epgse1.
20100106 CJC BlueBEAR jobs running again.
20100106 CJC Moved epgd15 offline for the purposes of benchmarking.
20100106 CJC Noted that all jobs (including local jobs) are queued on BlueBEAR. Emailed Alan and Aslam.


Topic revision: r1 - 07 Jan 2013 - /C=UK/O=eScience/OU=Birmingham/L=ParticlePhysics/CN=lawrence lowe