This is a reverse-order diary of events, without retrospective editing (so keep entries raw and short, max ~3 lines). See other pages
for more carefully considered documentation.
20111115 |
MWS |
Removed an empty pool 'DPM001' from epgse1 using dpm-rmpool. |
20111024 |
LSL |
On local /home/lcgui setup, patched-in some files to support eScience CA 2A/2B, so voms-proxy-init works for new students and recent renewers. |
20111017 |
LSL |
Power socket supplying UPS for grid left rack (epgsr1 and RAIDs f12-f15, epgpe01-04) has hair-line crack and failed at around 9am. Swapped plug to another socket and got things going again by 09:40. Will inform electrician Mark Wicks about this problem. [He will replace during some downtime]. |
20111010 |
LSL |
After SCSI problem on epgsr1 for f12 f13 f15, swapped that SCSI chain with f14 chain by moving cables between cards to see if SCSI problem moves or not. |
20111009 |
MWS |
epgsr1 was acting up which was assumed to be a SCSI problem. However, after doing a reboot, the server didn't come back up. Offlined the site and closed the queues. Hopefully this can be fixed tomorrow! |
20110923 |
LSL |
After SCSI problem on epgsr1 for f12 f13 f15, replaced that SCSI card with a new one to see if this fixes the problem. |
20110921 |
MWS |
SE was acting up this morning with what seemed to be a dpns hang. Restarted but got strange SOAP errors about token headers. Restarted the srmv2.2 and that seemed to fix it. |
20110909 |
MWS |
After several days of being hit by batches of 100s of jobs at a time, banned user Rafael Mayo (Fusion) across the site by adding to /opt/glite/etc/lcas/ban_users.db. Will email and try to get him to stop. |
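For reference, the lcas ban_users.db file takes one quoted DN per line; the DN below is a made-up placeholder, not the banned user's actual DN:

```
# /opt/glite/etc/lcas/ban_users.db -- one banned DN per line
# (illustrative placeholder DN, not the real one)
"/DC=es/DC=irisgrid/O=example/CN=banned user"
```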
20110713 |
MWS |
As per GGUS 72515, added 'acl cern_dest dstdomain .cern.ch' and 'http_access allow cern_dest' to the squid config file. |
20110712 |
MWS |
Fixed GLExec issues so local tests now work. Policies on the Argus server needed sorting out. Future problems may involve not having roles set properly here! |
20110708 |
MWS |
Started failing certificate NAGIOS tests as we were running the wrong version (1.38 rather than 1.40). Updated using yum update ca-policy-egi-core on the local WNs and copied the resultant certs to the BB WNs. |
20110609 |
MWS |
Noticed ATLAS analysis jobs were failing with liblcgdm errors. Checked nodes and the links were broken in /opt/lcg/lib. Fixed by hand but this should be fixed in the next WN update. |
20110526 |
MWS |
SE problems narrowed down to excessive H1 jobs hammering the SE. |
20110525 |
MWS |
epgse1 showed strange network issues overnight. Rebooting (eventually) fixed it but should keep an eye on odd 'fetch-crl' and 'voms' errors |
20110511 |
MWS |
All appears well on the recovered VMs on epgpe10. Did a test copy to the SE and that was fine so reopened the queues and jobs have started coming in. Will keep an eye on the Nagios tests to make sure everything is up and running again. |
20110511 |
LSL |
For epgpe10 problem, updated base system kernel and kernel-xen from 2.6.18-194.32.1 to 2.6.18-238.9.1. Also, Dell have supplied new disk, so after booting from a CD, did a dd copy to new disk: dd if=/dev/sda of=/dev/sdb bs=51200000. This took about 2 hours. Rebooted. |
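The dd clone step above can be sketched safely on ordinary files (the real operation copied /dev/sda to /dev/sdb with bs=51200000; paths and sizes here are illustrative):

```shell
# Scaled-down sketch of the disk clone: dd a source "disk" to a
# destination, then verify the copy byte-for-byte.
set -e
src=$(mktemp); dst=$(mktemp)
head -c 1048576 /dev/urandom > "$src"   # 1 MiB stand-in for the source disk
dd if="$src" of="$dst" bs=64K status=none
cmp -s "$src" "$dst" && echo "copy verified"
```

A cmp (or checksum) pass after a whole-disk dd is cheap insurance before rebooting onto the new disk.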
20110510 |
MWS |
Updated Maui config to use MAXJOB instead of MAXJOBS and slightly altered weighting to prioritise ATLAS, LHCb and ALICE. Will keep an eye to make sure jobs go through as expected. |
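For reference, a maui.cfg sketch using the corrected parameter name; the group names, weights and limits below are illustrative assumptions, not the actual Birmingham config:

```
# maui.cfg fragment -- MAXJOB is the parameter Maui recognises
GROUPCFG[atlas] FSTARGET=40 PRIORITY=100 MAXJOB=200
GROUPCFG[lhcb]  FSTARGET=20 PRIORITY=80  MAXJOB=100
GROUPCFG[alice] FSTARGET=20 PRIORITY=80  MAXJOB=100
```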
20110508 |
LSL/MWS |
Mark notes epgpe10 has gone down again, like 20110505. System messages for epgpe10 logged on epgmo1 starts with mpt2sas0: log_info(0x31110630): originator(PL), code(0x11), sub_code(0x0630), then sd 0:0:0:0: SCSI error: return code = 0x00010000, then scsi 0:0:0:0: rejecting I/O to dead device . Come in and reboot. |
20110503 |
LSL |
For BB, I asked Alan to propagate sudoers changes of March, including Mark's account, from front-ends to worker nodes too. I've made a rc.d/S60sudo.sh to include Chris's account as required. |
20110503 |
LSL |
On BB, I've added Mark's account as an extra allowed-user in rc.d/S60sshd.sh which configures /etc/ssh/sshd_config on the grid worker nodes. |
20110427 |
MWS |
Entered DT. Set epgr02 & 05 queues to Draining and stopped (qstop --) long, short and alice. |
20110422 |
MWS |
epgr05 was failing NAGIOS with LB Query failures. Rebooting fixed the problem. |
20110421 |
MWS |
Set epgr04/07 queues back online and set status in /opt/lcg/libexec/lcg-info-dynamic-pbs from Draining to $Status. Needed to reboot epgr04 as qsub/qstat didn't work, but other than that, all fine. |
20110415 |
MWS |
ALICE BB VO Box was under very heavy load (>15) with CPU idle. Contacted ALICE experts who had a look but recommended a reboot. Tried a soft reboot, which didn't work, so hard reset (xm destroy + xm create). All seems well now. |
20110414 |
MWS |
Request from ATLAS to take 22.5 TB from DATADISK and redistribute to other spacetokens. |
20110414 |
MWS |
Noticed that new jobs into epgr05 weren't coming in. Rebooted and found BDII didn't start due to 0 diskspace left. Deleted a load of cfengine backup files, rebooted and all is well. Did the same for epgr02 just in case as well. Need a more permanent solution in the long term though. |
20110413 |
MWS |
Added the new certificate for epgr08. Didn't reyaim/reboot as didn't seem necessary. |
20110413 |
MWS |
Attempted to reyaim epgse1 after putting the new certificate but it got stuck when restarting the dpm. This was eventually traced to gmetric going nuts as it was run every minute. Reduced this time to every 30mins, rebooted (after Ctrl-C'ing out of the reyaim) and reyaim again. Everything seems to be back up and OK! |
20110412 |
MWS |
On request from Elena (Atlas) added 1TB to PRODDISK (taken from DATADISK). |
20110412 |
MWS |
Reyaimed and rebooted epgr07 to put in new certificate. |
20110411 |
MWS |
Marked epgr07 and epgr04 in downtime and stopped the queues due to 1.5 weeks of BB downtime. |
20110404 |
MWS |
On Friday 1st, what looks like an ALICE job took out two nodes (kernel crash in the log) and at about the same time, we either received 2000 jobs causing the local CE and Torque to fall over OR the CE and Torque fell over and jobs were getting hung up over the weekend. Either way, the CE needed rebooting and ~2000 jobs were left in the queue in an odd state. Deleted all these and everything returned to normal! |
20110307 |
CJC |
All nodes require ssh keys to login, with the exception of epgmo1 (though this may change in the future). Public keys should be stored in the directory epgmo1:/var/cfengine/inputs/repo/general/public_keys/ . They will then be distributed to all nodes via the module script modules/modules:ssh . New keys are added to the authorized_keys file using the command cfrun -- -D restart_ssh . |
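The distribution step ultimately appends each public key to authorized_keys; an idempotent append can be sketched as follows (filenames and the key itself are illustrative placeholders, not the real cfengine module):

```shell
# Idempotent append of a public key to authorized_keys -- a sketch of
# what the cfengine restart_ssh step ultimately does.
keyfile=$(mktemp); auth=$(mktemp)
echo "ssh-rsa AAAAB3NzaC1yc2E...example user@host" > "$keyfile"
add_key() {
    # append only if the exact key line is not already present
    grep -qxF "$(cat "$keyfile")" "$auth" || cat "$keyfile" >> "$auth"
}
add_key   # first run appends the key
add_key   # second run is a no-op
```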
20110217 |
LSL |
On BB, process accounting now starts via system/rc.d/S*psacct.sh: uses directory /local/account/, a 2GB area which survives reboot. |
20110209 |
LSL |
On BB, implement logging of outgoing ssh calls on bluebear workers via iptables rule. Test process accounting on one node u4n128. |
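A rule of the kind described might look like the following (an illustrative sketch, not the actual BB rule: it logs new outbound ssh connections so they can be cross-checked against process accounting):

```
# log new outgoing ssh connections from a worker node
iptables -A OUTPUT -p tcp --dport 22 -m state --state NEW -j LOG --log-prefix "OUT-SSH: "
```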
20110117 |
CJC |
New dteam voms supported on local system SL5 UI. |
20101208 |
LSL |
f8 RAID now in place in BB serving /egee/soft via NFS. Now /egee/home areas are on local worker disk, like on our local cluster, as a performance enhancement. |
20101206 |
LSL/CJC |
Take f8 RAID and eprex6 server over to BB; their BB team want to do the physical installation though. |
20101125 |
LSL |
ep19x BB NAS server fails again with kernel traceback from alloc_pages_internal. Do a soft reboot but filesystem then disappears. Start preparing for redeployment of f8 RAID for BB. |
20101112 |
LSL |
Prepared a bbmoab.tar of current (5.4.3.s1) Moab client binaries, including Green Computing options, for Chris to put on epgr04. |
20101111 |
LSL |
RAM memory tests (memtest86+ 2.0.1 and then 4.1.0) on ep19x BB NAS server ran clean for 24 hours, so put /egee filesystem back online. |
20101109 |
CJC |
Marked epgr04 and epgr07 as draining ahead of the BB downtime. |
20101109 |
CJC |
Created new home directories and ssh keys for new BB grid users. Full details on how this was automated can be found here. |
20101105 |
CJC |
After fixing epgr11 DN in GOCDB, apel data appears to be uploading successfully. Check back on November figures in 24 hours. Also check back on September - currently at 1007940 (want to avoid double counting). Full details of upgrade here. |
20101104 |
CJC |
Removed all tags from epgr04:/opt/edg/var/info/atlas/atlas.list except VO-atlas-cloud-UK and VO-atlas-tier-T2 . This should be enough to trigger reinstallation of ATLAS software. This will affect epgr07 as well as tag file was shared over NFS. |
20101104 |
CJC |
Test jobs successfully processed on BB. Submitting full grid type job. |
20101103 |
CJC |
Re-enabled BB queues and submitted large number of test jobs to ensure that nodes offlined by Green Computing systems can come back online. It appears as though all appropriate nodes have come back online, but no jobs are submitting. Check moab status? |
20101103 |
CJC |
Emailed tb-support as problems with APEL still persist and there is no reply to GGUS ticket 63654 (https://gus.fzk.de/ws/ticket_info.php?ticket=63654). |
20101102 |
CJC |
Restored grid middleware according to the LocalGridCookbook instructions, but test jobs submitted from epgr04 are not picking up the grid environment variables. Local config scripts, yaim etc picked up by chance from old /egee filesystem still mounted on BB3. These files need to be added to the backup policy! |
20101102 |
CJC |
Submitted a helpdesk ticket (and emailed) Alan requesting that the kernel and glibc be updated on the BB nodes. Reinstalling middleware. |
20101102 |
LSL |
The NAS1104L box which provides /egee has been fitted with a new usb disk-on-module including up to date Open-E software. 3ware firmware already up to date. RAID now reformatted from scratch. Aslam has moved its power to the UPS, and has been asked not to hard power-off the device in future. |
20101101 |
CJC |
Submitted GGUS Ticket to APEL after epgr11 fails to upload updated accounting data to Accounting Portal. |
20101029 |
CJC |
Hard reboot of epaf17.ph.bham.ac.uk after failed reboot due to mount binds still being in place. Removed suid from cfengine modules. |
20101029 |
CJC |
Moved MonBox role over to SL5 on epgr11.ph.bham.ac.uk. Reyaimed all CEs (epgr02, 04, 05 and 07) and Site BDII to reflect the change. Updated GOCDB. Accounting currently reads 770316 for Birmingham - this should have increased by Monday. If not, GGUS ticket APEL for help. |
20101029 |
CJC |
All local nodes have been updated and rebooted, and so have been patched against CVE-2010-3904 and CVE-2010-3847. Waiting for BB to come back online before patching. |
20101020 |
CJC |
Disabled rds module in SL5 installations via cfengine to mitigate CVE-2010-3904. This should be extended to BB once the filesystem has been fixed. |
20101020 |
CJC |
NFS server logs copied to /home/lcgdata/logs/NAS/20101019 . |
20101020 |
CJC |
Added module:suid to cfengine tasks. This executes the script /root/cfengine/files/suid_fix.sh on SL5 nodes, either automatically if the lock file /var/cfengine/reports/suid_fix cannot be found or on demand (by setting the variable force_suid_fix ). The suid_fix.sh script prevents unauthorised root access via hard links, as described in CVE-2010-3847. Note that rebooting a node will undo the fix, so the reboot and halt cfengine commands attempt to remove the lock file! This should be extended to BlueBEAR once the filesystem has been fixed. |
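The lock-file gating described above can be sketched as follows (paths simplified to temp files; the real lock is /var/cfengine/reports/suid_fix and the real fix is suid_fix.sh):

```shell
# Run a fix once, record a lock, and skip on later runs unless forced.
lock=$(mktemp -u)          # stand-in for /var/cfengine/reports/suid_fix
run_fix() {
    if [ -e "$lock" ] && [ "$1" != "force" ]; then
        echo "skipped"
    else
        # ... apply the actual fix here ...
        touch "$lock"
        echo "applied"
    fi
}
first=$(run_fix)           # applies and creates the lock
second=$(run_fix)          # lock present -> skipped
```

Removing the lock at reboot/halt (as the cfengine commands do) is what forces the fix to be reapplied on the next run.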
20101020 |
CJC |
Unscheduled downtime for epgr04 (BB lcg-CE), epgr07 (BB CreamCE) and epgr10 (BB Alice VOBox) due to NFS filesystem problems. |
20101018 |
CJC |
BB NFS box unresponsive. Requested Aslam do a hard restart. |
20101012 |
CJC |
Enabled ATLAS and ALICE (along with other normal VOs) on epgr07 (the CreamCE for BB). Notified Patricia and Graeme about sending ALICE and ATLAS jobs to this CE. Note that the CreamCE requires access to the torque server logs (not just accounting). These are currently copied onto the NFS server ( ep19x.ph.bham.ac.uk:/egee/torque/server_logs ) on the BB side every 10 minutes by a cron job. This directory is then NFS mounted onto epgr07:/var/spool/pbs/server_logs . |
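The 10-minute copy could be an /etc/cron.d fragment along these lines (paths from the entry; the rsync flags are an assumption, the actual cron job may differ):

```
# copy torque server logs to the NFS export every 10 minutes
*/10 * * * * root rsync -a /var/spool/pbs/server_logs/ /egee/torque/server_logs/
```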
20101004 |
CJC |
Replicated cond10_data.000007.gen.COND._0002.pool.root.4801537.0 and DBRelease-12.7.1.tar.gz.6244710.0 on SE after job failures due to timeouts. |
20100929 |
CJC |
mysqld service failed to restart after rebooting epgmo1 during kernel upgrades. This caused APEL to fail to publish for 8 days. Restarted mysqld service and republished APEL data. Accounting data should now be up to date. |
20100927 |
CJC |
epgr07 not accepting jobs because it was not redeployed when the BB pool accounts were redefined. Backing up VM and redeploying. |
20100924 |
CJC |
epgsr4 (40TB) brought online. Space distributed between ATLAS spacetokens (DATA, MC, SCRATCH, HOT, LOCALGROUP). |
20100923 |
CJC |
Problem with yaim generated /etc/sudoers file on CreamCE for BB (epgr07). Emailed lcg-rollout. |
20100923 |
CJC |
Deploying epgr10 as a second VOBox for Alice. This will manage the BlueBEAR software area. NFS mounted ep19x.ph.bham.ac.uk:/egee/soft/SL5/alice on the VOBox. |
20100923 |
CJC |
BlueBEAR WNs back online, and using the updated kernel. Reyaiming epgr04 to allow jobs again. Updating ticket 62359. |
20100923 |
CJC |
BlueBEAR WNs appear to be in the down,offline state since late last night. Emailed Alan Reed. |
20100922 |
CJC |
Official kernel patch released. Updated DPM pool nodes, reyaimed and rebooted. Requested kernel be installed on BlueBEAR WNs. |
20100921 |
CJC |
epgd[01-24] nodes have kernel updated using yum --enablerepo=sl-testing update . Nodes rebooted, grub checked to make sure that nodes are using new kernel. All other service nodes, with the exception of the DPM pool nodes are updated in the same way. BB nodes are waiting for official kernel release. Supported VOs on epgr04 are reduced to ops only. Downtimes cleared from GOCDB. |
20100915 |
CJC |
Draining the epgd[01-24] nodes in preparation for kernel fix for the problem described here. |
20100915 |
CJC |
4000+ ILC jobs submitted to epgr02/05 by Stephane Poss. Killed off 3500 queued and emailed user. Checking efficiency of remaining jobs - could be useful to distribute to other SouthGrid sites. |
20100915 |
CJC |
Replaced bucket in server room with a bucket and crate. This should have a large enough volume to contain the air conditioning drainage for the weekend (bucket approximately 3/4 full after 24 hours). |
20100914 |
CJC |
Submission problem on epgr02. Submitted jobs run, but no output appears to be returned. This would explain the nagios timeouts on epgr02 jobs. Rebooted. |
20100914 |
CJC |
Pump broken in AirCon D. Maintenance logged problem with central services, waiting on quote for fix. In the meantime, they have uncoupled the drainage, which now empties into a bucket. This is not ideal (bucket should be checked every day), but it does mean the unit is switched on. All WNs brought back online. Temperature steady at 18.5C. |
20100914 |
CJC |
Switched more nodes off. David Clifford sending someone to look at air conditioning. Temperature peaked at 25C. |
20100913 |
CJC |
Added SRCFG definition to maui config on epgr05, reserving one slot on both epgd01 and epgd02 for ops jobs and Steve Lloyd. Check back on SAM tests in 24 hours to see if this makes a difference! |
20100913 |
CJC |
Changed "MAXPROC" to "MAXJOBS" in epgr05 maui definition, following advice on ScotGrid Blog. |
20100913 |
CJC |
AirCon D powered back on (~5pm). Temperature drops to < 17C. |
20100913 |
CJC |
Installed epgf01 and epgf02 behind the f12-15 RAIDs to help with air flow. Temperature holding steady at 19.5C. |
20100913 |
CJC |
AirCon D in W332 failed (switched off permanently). Contacted Dave Clifford. Proceeding to drain alternate WNs (1,4,5,8,9,12,13,16,17,20,21,24) with the intention of powering them off once the jobs have completed. Air temp currently at 19.46C. |
20100913 |
CJC |
Set epgr04 to draining and glong to enabled = False in preparation for BlueBEAR downtime. |
20100831 |
CJC |
Moved epgd17 back online, but gave it the property "raid". All other nodes have the property "lcgpro". Modified qsub script so that all jobs require the "lcgpro" property, with the exception of jobs submitted by "atl059", which require the "raid" property. In this way, epgd17.ph.bham.ac.uk has been isolated for the purposes of testing the RAID performance. |
20100831 |
CJC |
Moved epgd17 offline for the purposes of testing a RAID'ed WN. |
20100827 |
CJC |
Noted that the ATLAS Squid was swapping about 700MB of RAM. Readjusted VM allocations to give Squid 3GB at the expense of epgr02 (hosted on the same server). |
20100825 |
CJC |
Reyaimed epgr05 after ALICE complained of not being able to submit jobs. Jobs now successfully being submitted |
20100811 |
CJC |
Disappeared from Top level BDII again last night when epgr09 stopped responding to ldap queries. Restarted BDII service on epgr09. Adding hourly restart to cfengine. Checking log files for problems. |
20100810 |
CJC |
Rebooted epgsr1 due to the disappearance of 4 files systems. This fixed the problem. |
20100810 |
CJC |
Added the 40TB of storage attached to epgsr3 to the DPM. Used to reinstate a 50TB MCDISK spacetoken (along with some storage from DATA and LOCALGROUP). Emailed Brian Davies about making this official. |
20100806 |
CJC |
Accounting website reports no accounting data for epgr04.ph.bham.ac.uk (noted thanks to SpecInt tagging idea put forward by Pete Gronbech). Checking accounting records on epgr04.ph.bham.ac.uk. |
20100806 |
CJC |
(Software) bonded epgsr3, all four interface connections working. Waiting for replacement host certificate before adding into DPM. Still have to update epgse1:/etc/shift.conf |
20100806 |
CJC |
Birmingham back in the information system, and on gstat. |
20100806 |
LSL |
Reconfigure switch epsw22 to include bond for sr3 on ports 05-08 and future sr4 on ports 09-12. Note: had to use browser IE<=6 or FF<=2 to reconfigure trunking on this DLink switch. |
20100806 |
CJC |
Rebooted Site BDII after it failed to respond to ldap queries. service bdii status checked out ok before reboot - checking logs... |
20100803 |
CJC |
qfeed scripts superseded on epgr04 by the qall script. This reads in a list of prioritised usernames, along with a maximum number of jobs they're allowed to run. The script then runs through all queued jobs and submits as many as it can. The script is invoked as root using the command qalld cfengine/files/qall.priorities d& . |
20100803 |
CJC |
Ran /sbin/start_udev on epgd02, 08 and 12 to fix the /dev/null bug. epgd10 remains unresponsive. |
20100803 |
CJC |
Moved epgd02,08 10 and 12 offline as they have been hit by the overwriting /dev/null bug. |
20100802 |
CJC |
Redeployed epgr04 with a reduced number of pool accounts. |
20100802 |
CJC |
Moved DPM Head Node to SL5.4 machine (keeping same name and IP address). Moved Site BDII to SL5, deploying on VM epgr09. This required changing both node and GIIS information in the GOCDB. |
20100730 |
CJC |
Moved epgd01 offline due to problem remotely rebooting (NFS?). |
20100730 |
CJC |
Scheduled downtime for the whole of Monday 2nd August so that the SE can be migrated to SL5. |
20100728 |
CJC |
Added 1TB to ATLASSCRATCHDISK (at the cost of LOCALGROUPDISK) to avoid being blacklisted. Available space must stay above 1TB! |
20100727 |
CJC |
Added Pheno and DZero support to local cluster and SE. |
20100722 |
CJC |
Changed maui FairShare weighting scheme on epgr05 to more extreme values. Whereas previously the FS group weights were treated as a percentage, there did not appear to be enough of a discriminating factor between jobs. Ops jobs should now have the highest priority, followed by ATLAS/ALICE. LHCb jobs follow next, with all other VOs taking the lowest priorities. |
20100721 |
CJC |
UK CPU and Storage ranks, based on information in the BDII, are made available online. |
20100721 |
CJC |
Installed voms.gridpp.ac.uk and voms.ngs.ac.uk certificates in /etc/grid-security/vomsdir/ on epgr05. This CE shouldn't need the certificates (relying instead on the vomsdir/VO/*.lsc files), but a bug means that it can't deal with VOs that need to authenticate with the Manchester VOMS server. |
20100710 |
LSL |
/egee progress: only u4n085 and u4n116 unconverted to new NAS; both are offline, so restart glong queue. Later: all done. |
20100709 |
LSL |
/egee progress: BB worker nodes u4n081,082,110-128 are on new ep19x NAS. 100-109 are offline awaiting job finish. |
20100709 |
LSL |
BB SAM tests for epgr04 have been showing u4n128:CRITICAL for the WN-CAver test: info files showed certs were version 1.34, but required 1.36. Later received GGUS ticket 59922. In /egee/soft/SL5/middleware/prod/external/etc/grid-security, I moved certificates/ to certificates.yyyymmdd/ and rsync'd afresh from epgr04.ph.bham.ac.uk::certificates. Done on both currently active /egee directory trees. Suggest re-instating g-admin cron job to do this. |
20100709 |
LSL |
Our storage was not being reported by ldap to epgse1 or by Gstat2 on web. Rebooted epgse1 (last night) to remedy. Today found that there were log messages "dpm: failed" and "dpnsdaemon: failed". File dpm/log indicates epaf17 network-down. Restarted network on that, and restarted epgse1, 10am Friday. Query via ldap now showing sensible Size information, absent before. |
20100708 |
LSL |
Around 5pm: on the console, logged on to all physical and virtual grid machines to check if they were down on the network: all were down except epgsr1 and epgsr2. Did service network restart for those. |
20100708 |
LSL |
On epgmo1, truncated that big log file, manually set IP addr, copied iptables from iptables.save, moved /etc/cron.d/cfengine_cron to /root directory for now. |
20100708 |
LSL |
Noticed that most grid servers had no network accessibility. Checked epgmo1 and found it had a 100% cfrun process, and 100% disk full. File /var/log/cfengine_backup.log was 77GB, with messages "You do not have a public key from host epgpe10.ph.bham.ac.uk", "Do you want to accept one on trust (yes/no)", and "Please answer yes or no". File /etc/sysconfig/iptables had been truncated at 4096 bytes, presumably by the disk full condition after an attempted update by cfengine. |
20100708 |
LSL |
Rebooted several BB workers to check that access to new /egee server worked from a fresh image. It does. Also converting a further handful of workers to use the new /egee. |
20100707 |
LSL |
On u4n128 tested new BlueBEAR transtec NAS /egee server, known on the network as ep19x and 10.143.245.103: no problems. |
20100705 |
CJC |
Owing to the Great Pool Account Crisis of 2010 (BlueBEAR Moab hit a hard limit of manageable pool accounts), Camont, CMS, NA48, Southgrid and Zeus have been disabled on the BlueBEAR CEs. |
20100705 |
CJC |
Allowed ssh connections in epgr03:/etc/hosts.allow to gridppnagios.physics.ox.ac.uk on the ALICE VOBox in order to pass nagios tests. As Patricia could already gsissh from lxplus, failing these OPS tests did not affect functionality. |
20100705 |
CJC |
Completing dpm-drain of storage hanging off epgse1 by fixing drain errors ( dpm-delreplica non-existent physical files and rm -f files marked as in the process of being deleted by the DPM). |
20100705 |
CJC |
Copied (cp -a as g-admin) /egee/soft/SL5/middleware and /egee/soft/SL5/local/ onto the new file server, mounted at bluebear4x:/mnt/egee-new . Stopped queues to ensure no more software jobs are submitted and started to copy software directories. |
20100628 |
CJC |
Incremented the kSI2K spec of the CreamCE by 1 to make differentiating between published accounting records on the accounting website easier. |
20100622 |
CJC |
Serena Psoroulas having difficulties authenticating at Cambridge. Watching pilot job at Birmingham - local ID 2820474 on BB. |
20100622 |
CJC |
Changed software tags area /opt/edg/var/info on epgr02/05 so that it's hosted on epgpe04 and NFS mounted on the CEs. This may help to alleviate the writing problems experienced by epgr02. Resubmitted 15.8.0 installation job. |
20100622 |
CJC |
Installed Cream CE on epgr07 to submit jobs to BB. Not in site BDII yet due to problem with firewall on epgr07 preventing connections to bbexport. |
20100622 |
CJC |
Problems with torque server on epgr05. Appears to be confused about which jobs are actually real. Killed 0% efficient jobs and started to manually qrun some of the backlog. Brought epgd24 back online. Accounting stats recorded at 39782 for June. |
20100621 |
CJC |
Yum updated all nodes. No reyaim required. |
20100621 |
CJC |
Removed queued LHCb pilot jobs from epgr02/05. Pilot factory appears to have read a wrong value from the information system and sent too many jobs. Queued pilots are safe to remove because no work has been assigned yet. |
20100618 |
LSL |
BB user g-atl023 (Steve Lloyd) jobs generating 3000 emails per day recently. These get stuck in campus emailer, causing extra load. Solutions are (a) run a working sendmail server on epgr04, or (b) tweak our qsub so that emails are directed to some account of ours. If $HOME/.forward files on BB worked (they don't!) that would have been another option. |
20100617 |
LSL |
Ran ATLAS squid test for Alastair according to tb-support email recipe: success. Some discussion going on in Southgrid as to what to configure as our backup ATLAS squid. |
20100609 |
CJC |
Added ATLAS spacetoken information to ganglia. |
20100604 |
CJC |
Updated BB:/egee/soft/SL5/local/yaim-conf/users.conf , groups.conf and site-info.def to reflect new users and groups. Reconfigured BlueBEAR WN middleware. |
20100603 |
CJC |
Generated ssh keys for g-ali, g-bio, g-cal, g-cam, g-stg, g-fus, and g-ze users on bluebear using the command echo "ssh-keygen -v -t dsa -f /egee/home/$u/.ssh/id_dsa" | sudo -H -s -u $u . New keys copied into /var/cfengine/inputs/repo/ce/sl5_bb_ce/opYtert2hpwTCsaRT9f36grTz on epgmo1 and distributed to epgr04 as /etc/ssh/extra/opYtert2hpwTCsaRT9f36grTz . Added new groups to gshort and glong queues as edguser on epgr04. Waiting for new groups to be added to moab (although jobs do run if the qfeed script is used). Added relevant software areas to BB:/egee/soft/SL5 |
20100603 |
CJC |
Restarted epgr02/5 queues after rebooting epgsr1. |
20100603 |
CJC |
Stopped epgr02/5 queues whilst investigating epgsr1 unavailable problem. Unable to ping epgsr1 from all machines except epgse1. |
20100601 |
CJC |
Updated local UI (SL4/5 Local/BB) to support Calice. This involved downloading the grid-voms.desy.de.11017.pem certificate into $GLITE/middleware/prod/external/etc/grid-security/vomsdir/ . Also updated $GLITE/yaim-conf/vo.d/calice to reflect changes to available WMS. |
20100601 |
CJC |
Updated epgr04 yaim definitions (via epgmo1) to reflect support for Biomed, Camont, Calice, and vo.southgrid.ac.uk. Camont uses names have also been created, but they're not supported yet. Still waiting for ssh keys and sudo access on BlueBEAR to be sorted. |
20100601 |
CJC |
Requested 15.6.9.9 for epgr04. Installation tasks appear to be failing on epgr02/5. |
20100528 |
CJC |
qfeed'ing g-honp14 jobs on epgr04. Check back later to see if this affects the LHCb SAM tests. |
20100527 |
CJC |
Ping'ing epgsr1 from desktop and epgse1 results in 0% packet loss. Checking hostname... |
20100527 |
CJC |
Reduced the number of concurrent ATLAS jobs on BlueBEAR (via the qfeed scripts) to 60. This will allow the LHCb SAM tests to execute successfully. This problem will be fixed properly by the new /egee filesystem, to be installed next week (1st June). |
20100526 |
CJC |
Added ngs.ac.uk support to local SL5 UI. This required the ngs certificate be downloaded from CIC, and installed in /home/lcgui/SL5/middleware/prod/external/etc/grid-security/vomsdir/voms.ngs.ac.uk.25890.pem . Also added support for ngs.ac.uk on BlueBEAR SL5 UI. |
20100526 |
CJC |
Updated, rebooted and reyaimed epgmo1. Check back tomorrow to make sure that accounting is still being updated. |
20100525 |
CJC |
Problem authenticating as NGS user on local UI - is ngs supported? |
20100525 |
CJC |
WNs rebooted and moved back online. |
20100524 |
CJC |
Marked epgd01-12 offline to drain for the purposes of rebooting and installing a new kernel. |
20100524 |
CJC |
Requested 15.6.9.4 on epgr04 for Tim. Also waiting for existing installation process to finish before installing on epgr02/05. |
20100521 |
CJC |
Removed old /opt/edg/var/info/atlas/lock file, dated 12 May, which may be holding up installation processes on epgr02/5. Restarted 15.6.9 installation task. |
20100521 |
CJC |
Released and re-reserved a new ATLASMCDISK spacetoken in an attempt to fix the ATLAS reporting problem. This was only possible because the ATLASMCDISK was already empty! |
20100520 |
CJC |
Requesting Athena 15.6.9 on local cluster. |
20100517 |
CJC |
Birmingham panda queues set back online. Closed related GGUS tickets. srmv2.2 still vulnerable to crashing when querying ATLAS production spacetokens (mcdisk, proddisk etc). SE still reporting invalid size allocations according to Peter Love. |
20100514 |
CJC |
Noted that the srmv2.2 service fails on epgse1 if it is queried with the srm-get-space-metadata command. Added the service to the cfengine grid services script, so it should be restarted every hour if it has failed. Contacted dpm-users-forum@cern.ch for advice, but planning on upgrading head node to SL5 VM. |
20100513 |
CJC |
Restarted cfservd on epgr05 after failure of glite-ce-job-submit . This rules out cfengine as a cause of the job submission problems. |
20100512 |
CJC |
Updated vo.d/atlas to include the BNL VOMS server, as specified on the CIC Portal. |
20100512 |
CJC |
Reconfigured CreamCE following these instructions after yum updating to glite-CREAM-1.42-3.jdk5 . Job submission now possible via glite-ce-job-submit command. This should solve the ISB problems experienced by ALICE. Cfengine appears to break this, requiring yaim to be rerun. Killed cfservd process on epgr05 until this problem is understood. |
20100510 |
CJC |
Added separate camont queue to epgr05 (and reconfigured epgr02 in the process) so as to limit camont jobs to 3 hour walltime. This should allow greater flexibility so that camont jobs can be more quickly throttled in the case of an ATLAS avalanche. |
20100510 |
CJC |
Deleting remaining replicas on ATLASMCDISK spacetoken, as requested. Deletion completed by listing files using the dpm-sql-spacetoken-list-files --st=ATLASMCDISK command, followed by dpm-delreplica . |
20100510 |
CJC |
Noted library problems using UI on Fedora 12. |
20100510 |
CJC |
Rebooted epgsr1 after selected WNs failed to mount software area. Reboot appears to have fixed the problem. |
20100509 |
CJC |
Fixed cfengine Grid Services module ( /var/cfengine/inputs/modules/module:grid_services ), which checks and cleans up rfiod and dpm-gsiftp processes on storage nodes. Contained a bug which meant parent processes were not identified properly. They were then killed off if older than 1 hour. The script should now only kill off slave processes (ie user transfers). Killed servkick processes in order to test overnight. |
20100507 |
LSL |
Those two services on the epgsr servers are stopping every 1 or 2 hours (between 0 and 10 mins past the hour), so /root/bin/servkick now restarts them. Once the problem is understood, that process should be killed. |
20100506 |
LSL |
The ntpd service on epgsr1 had gone missing so the clock was a couple of minutes slow. Restarted. |
20100505 |
LSL |
Both epgsr1 and epgsr2 are losing services regularly: dpm-gsiftp and rfiod needed to be restarted. netstat -ntlp piped through sort was handy for spotting missing services. |
20100504 |
LSL |
The campus DNS is failing for about 5% of lookups (hostname not found for good hosts). Network team contacted. |
20100430 |
CJC |
ALICE note problem using Input SandBox with CreamCE. |
20100430 |
CJC |
/etc/cron.d/grid_services.cron set to run on DPM Head and Pool nodes to check the status of dpm-gsiftp and rfiod and restart if appropriate. This might fix the source of the latest ATLAS errors. |
20100428 |
CJC |
In light of pbswebmon data, restricting camont and ATLAS Pilot jobs, as their efficiency is particularly low. |
20100428 |
CJC |
PBSWebMon now running on epgr08. The original script has been edited to allow monitoring of both epgr05 and BlueBEAR. |
20100427 |
CJC |
Added epgsr1:/disk/f15c to the new DPM Pool DPM001 . Moved all epgse1 filesystems offline and started draining into epgsr1:DPM001 . The recovered storage on epgse1 will eventually become the experiment software area, with epgse1 demoted to providing NFS services. |
20100426 |
CJC |
Rebooting epgsr1 after transfer problems occur. |
20100423 |
CJC |
epgd01 back online. Installed pakiti2-client on all Grid nodes. Now reporting to epgr08. The pakiti server has yet to be properly configured, and should probably be password protected. |
20100423 |
CJC |
Marked epgd01 offline due to package inconsistencies after yum update. |
20100420 |
CJC |
Moved fabric monitoring software (currently Nagios and Ganglia, but soon also Pakiti and PBSWebMon) to epgr08 from epgmo1. If php compromises security, the accounting won't be at risk! |
20100410 |
CJC |
Manual reboot of epgsr2 after it became unavailable on the network. Problem first occurred at 0317 10/04/2010. ATLAS disabled queues and issued a GGUS ticket. No clue in logs as to what the cause of the problem might have been. Also noted that NTP failed to update time from GMT to BST after rebooting. |
20100409 |
CJC |
Removing epgd24 from twin cluster and adding to test cluster for the purposes of NFS performance testing. |
20100409 |
CJC |
Allowing submission to u4n183-4 and u4n127-8 on Bluebear via the epgr04:/root/bin/qgogo script (updated via cfengine). These nodes were previously reserved for maui-assigned jobs, but the nodes can be better employed by allowing ATLAS jobs to run. |
20100409 |
CJC |
Re-enabled nodes epgd23-24, and ensured that 8 ATLAS production jobs were running per node. ATLAS software area remounted on epgd24 with the actimeo=60 option, in an attempt to reduce the number of getattr calls. Draining epgd20-22 in order to test the effect of varying the block size. |
20100409 |
CJC |
Draining nodes epgd23-24 to test NFS mount settings. |
20100408 |
CJC |
Black holes detected (and fixed) on epgd15 and epgd22 - nodes not able to scp files back to CE. |
20100406 |
CJC |
Removed and recreated all pool accounts on epgr05. This may fix the ops problems with running jobs on the CreamCE. |
20100406 |
CJC |
Updated DESY vomscert on CEs and DPM head node. |
20100401 |
CJC |
Reducing the number of ATLAS Production jobs running on local twins in order to let pilot test jobs run. |
20100401 |
CJC |
Updated all iptables to accept connections from 147.188.128.127 , the University network time server. Corrected date and time on epgsr2 and epgd22. epgsr2 had not updated after change to BST, and this might be the source of the ATLAS file transfer problems. |
20100401 |
CJC |
Allowing ssh connections on ALICE VOBox from 137.138. (lxplus) and 128.142. (CERN ops nagios). Updated /etc/hosts.allow to reflect this. |
20100401 |
CJC |
Installing rootkits on all nodes and broadcasting root password on university email lists (only joking - April Fool :D) |
20100331 |
CJC |
Noted error "Pinging service ClusterMonitor ... The service is running at epgr03.ph.bham.ac.uk:8084, uri ClusterMonitor ... connect: Connection refused" on epgr03 (ALICE VOBox). |
20100329 |
CJC |
Disabled epgr02 queues for the purposes of upgrading worker nodes to glexec_wn. |
20100326 |
CJC |
Re-enabled queues on epgr04 and removed Draining status. |
20100326 |
CJC |
Rebooted epgr04 now that the BB:/egee filesystem has been fixed. Submitted a GGUS Ticket regarding the rogue SAM Nagios tests coming from samnag010.cern.ch in the SW Cloud. |
20100326 |
CJC |
Tried removing and redeploying pool accounts using cfengine module. Jobs failed because glite software maintained references in /etc/grid-security to old pool account names. |
20100326 |
CJC |
Added epgd17-24 to the local cluster, bringing the number of job slots to 192. Increased the ATLAS maui quota and reyaimed epgr02. Because this machine is still a 64 bit VM, some of the library paths in /opt/glite/etc/lcas/lcas.db and /opt/glite/etc/lcmaps/lcmaps.db were wrong. These are now corrected automatically by cfengine after running yaim. |
20100324 |
CJC |
Job submission now successful on both CAGE CEs. |
20100322 |
CJC |
Job submission on epgr08 currently fails. |
20100322 |
CJC |
Updated epgsr1:/etc/exports (via epgmo1) so that experiment software areas can be mounted on epgr07 as well. |
20100322 |
CJC |
Deploying epgr08 as an lcg-CE to see if the lcg-CE and CreamCE can co-exist. |
20100322 |
CJC |
CreamCE/ARGUS/GLEXEC_wn test bench now accepts dteam and ops jobs. Would like to be able to test the pilot job glexec functionality. |
20100318 |
CJC |
Installed as a test suite - epgr05 -> Cream CE, epgr06 -> ARGUS Server, epgr07 -> GLEXEC_wn. Can't submit jobs yet. |
20100317 |
CJC |
Birmingham starts to fail the GangaRobot WMS tests, because the wrong certificates were copied to epgsr1 and 2 after update to cfagent.conf on epgmo1. Fixed and now waiting for jobs to return to Birmingham. This does not address the problem as to why we are not receiving any panda jobs. |
20100317 |
CJC |
Fixed CreamCE output retrieval problem by allowing incoming and outgoing traffic on the recommended ports. |
20100316 |
CJC |
Noted that CreamCE only reports job complete when the firewall is off => check ports. |
20100316 |
CJC |
BlueBEAR loses /egee filesystem again. Put qhold on remaining queued jobs as edguser on epgr04, and use qdisable on glong and gshort. Move epgr04 to "Draining" state. |
20100316 |
CJC |
Updated kickstart files in epgmo1:/data1/grid/kickstart to reflect the fact that redhat mirror is now held at 147.188.47.108:/disk/11b/home/redhat/ . |
20100316 |
CJC |
Renamed epgsr3 , epgce1 , epgce3 and epgce4 according to LocalGridMachines, and rebooted. Note that this required changes to epgmo1:/etc/dhcp.conf and to /etc/sysconfig/network on each physical machine to be renamed. |
20100316 |
CJC |
Due to DNS problem, flashed /etc/hosts on all grid nodes with the line 147.188.46.8 epgsr1.ph.bham.ac.uk epgsr1 . This should allow grid nodes to communicate with the first pool node (ie epgsr1) while the DNS problem is being resolved. |
20100315 |
CJC |
Noted that although ATLAS Panda jobs submitted from the local UI (ganga 5.4.5) run successfully on BlueBEAR, gangarobot jobs are still failing because of the TAR_WN bug. Soft linked all files in /egee/soft/SL5/middleware/prod/globus/lib into /egee/soft/SL5/middleware/prod/atlas_fix/lib , and prepended the new variable onto the LD_LIBRARY_PATH environment variable in the x509.sh script. This should fix the gangarobot failures. |
20100315 |
CJC |
Added the f14 filesystems into the DPM. Temporarily split the 20TB of space evenly between SCRATCH and LOCALGROUP disks, although it is expected that this space will be dedicated to the TopPhys cache. |
20100315 |
CJC |
Clean install on epgsr3 and epgr06. Redeploying epgsr3 as a WN for the CreamCE, with epgr06 intended to be a glexec/SCAS server installation. |
20100315 |
CJC |
Booted the new epgsr1 and configured as a DPM Pool node. Successfully copied new data onto and off the node. Software directories all exported properly. Not yet network bonded. The new epgsr3 is still network bonded, and so should have the IP Address which is hard coded into the bond0 script changed before booting. |
20100315 |
CJC |
Renamed epgsr1 as epgsr3 and shutdown. Swapped MAC addresses in epgmo1:/etc/dhcp.conf . |
20100315 |
CJC |
Renamed epgsr3 as epgsr1 and shutdown. Ready for downtime. |
20100314 |
CJC |
Moved epgr04 into the draining state so that no more new jobs would be submitted before the scheduled downtime on Monday. Moved all local worker nodes offline for the same reason. |
20100312 |
CJC |
Installed epgsr3.ph.bham.ac.uk , with the intention of reconfiguring as a replacement for epgsr1 on Monday. Glite stack installed, but not configured. |
20100311 |
CJC |
dteam jobs should now run on the CreamCE. Investigating the possibility of SCAS server. OOPs! Only if the firewall is off! |
20100311 |
CJC |
Added the project directive -A lowel01 to qsub script on epgr04. (Via epgmo1:/var/cfengine/inputs/repo/ce/qsub.sh , as this file is periodically copied to the CE!) |
20100311 |
CJC |
Implemented Part 1 of the Grid Backup policy. A cron job on epgmo1 invokes the command /usr/sbin/cfrun -- -D run_backup >> /var/log/cfengine_backup.log 2>&1 , which will run the backup module on all grid nodes (excluding epgsr1 as this is still not under the control of cfengine). This module reads a list of files from /root/cfengine/files/backup.rules , and copies the relevant files (preserving permissions, access times and directory structures) to /root/cfengine/backup/`date +%Y%m%d` . The directory is then compressed, ready for Part 2. This will involve distributing to epgsr1 and BB. |
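The copy-and-compress step might look like the following sketch (backup_part1 is a hypothetical helper reconstructing what the cfengine backup module does, not its actual code):

```shell
# Read one path per line from the rules file, copy each existing path into
# a dated directory preserving permissions, times and directory structure
# (GNU cp --parents), then compress the result.
backup_part1() {
    rules=$1 dest=$2
    mkdir -p "$dest"
    while IFS= read -r f; do
        [ -e "$f" ] && cp -a --parents "$f" "$dest"
    done < "$rules"
    tar czf "$dest.tar.gz" -C "$(dirname "$dest")" "$(basename "$dest")"
}
# e.g. backup_part1 /root/cfengine/files/backup.rules \
#                   /root/cfengine/backup/$(date +%Y%m%d)
```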
20100309 |
CJC |
epgr05.ph.bham.ac.uk advertised as a CreamCE in the Site BDII. Marked as preproduction in the GOCDB. Noted that the GlueCEStateStatus item has the value "Special", and not "Production", which is why jobs are not being matched. |
20100309 |
CJC |
Installed SL5 UI on BlueBEAR. Users simply log on and source /apps/hep/lcgui/lcguisetup . Note that this only works for SL5 so far. Also note that the SL5 installation is dependent on the CRLs managed by the SL5 WN installation, which are stored in /egee/soft/SL5/middleware/prod/external/etc/grid-security/certificates . The relevant X509 variables are set by the /apps/hep/lcgui/SL5/middleware/prod/external/etc/profile.d/x509.sh script, which is created by running the /apps/hep/lcgui/SL5/yaim-conf/post_yaim.sh script after running yaim. |
20100305 |
CJC |
Rationalized epgse1:/disk/f??/vo folder creation after biomed tried writing files to the non-existent /disk/f9a/biomed directory. All supported VOs should now have the appropriate directories on all SE filesystems. |
20100305 |
CJC |
BlueBEAR back online. Clearing "Draining" status from epgr04. |
20100303 |
CJC |
Requested epgr04.ph.bham.ac.uk be removed from the LHCb management system to allow jobs to run at epgr02 (jobs were no longer being submitted because epgr04 was down). |
20100302 |
CJC |
Implemented a simple pbs monitor for Nagios, which detects WNs in the offline and down state. |
20100302 |
CJC |
Updated the SL4 and SL5 UI on the local system. Added support for ILC to both. Note that the SL4 installation (UI_TAR 3.1.44-0) suffers from a bug such that external/usr/lib is appended to the grid environment LD_LIBRARY_PATH variable. This is fixed in the yaim-conf/post_yaim.sh script. The SL5 installation (UI_TAR 3.2.6-0) did not install any .pem or .lsc files in external/etc/grid-security/vomsdir/ . These were manually copied from the SL4.new installation. |
20100302 |
CJC |
Changed epgr04 to Draining status while BB:/egee/ problems continue. Also added epgr04 AT RISK status to GOCDB. |
20100301 |
CJC |
Need to implement x509 fix on UI_TAR installations. Also need to make sure ILC can authenticate on UI. Also need to implement Globus_Port_Range fix for UI. Where do the *.pem files in vomsdir come from in the SL5 installation? They appear to just be there in SL4. |
20100301 |
CJC |
Submitted 69 local jobs from epgr04 to BB as each of the configured users according to the showusers output. Each job runs the BlueBEAR cleanup script, which should remove old files and directories in the BB:/egee/home/ area. It is hoped that this will ease the slow file access problem on BB. Note that this is a temporary measure until the cron jobs are updated on BB! |
20100301 |
CJC |
Upgraded to glite-WN_TAR 3.2.6-0 on BlueBEAR. |
20100301 |
CJC |
Created the scripts BB:/egee/soft/SL5/local/yaim-conf/pre_yaim.sh and post_yaim.sh to be run before and after yaim on BlueBEAR when configuring a new WN_TAR release. These scripts make sure that the X509 environment variables are set and create the gridmapdir directory for the cleanup scripts. |
20100226 |
CJC |
Removed SL4 tasks from BB:/egee/system/cron.d/cronuser.u4n??8 cron definitions. These will have to be reloaded on the cron nodes to stop the tasks from being executed! |
20100226 |
CJC |
Removed WN_TAR release 3.2.4-0 from BlueBEAR. Noted that production system is still 3.2.5-0. Will upgrade to 3.2.6-0 after cleanup scripts investigated. |
20100224 |
CJC |
Submitted a GGUS Ticket regarding the apparent problems publishing APEL data for the past 13 days. |
20100224 |
CJC |
Added Virtual Hosts to epgmo1 webserver. The University locks down port 80, so this is not externally accessible. Port 8888 is externally accessible. epgmo1.ph.bham.ac.uk:8888 will now serve the ganglia pages. epgmo1.ph.bham.ac.uk will serve the config files held in /var/www/html/config . |
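The vhost split might look roughly like this in httpd.conf (a sketch; the DocumentRoot values are assumptions apart from the config directory named above):

```apache
Listen 80
Listen 8888

# Port 8888 passes the University firewall: serve the ganglia pages here
<VirtualHost *:8888>
    ServerName epgmo1.ph.bham.ac.uk
    DocumentRoot /var/www/html/ganglia
</VirtualHost>

# Port 80 is blocked externally: internal-only config files
<VirtualHost *:80>
    ServerName epgmo1.ph.bham.ac.uk
    DocumentRoot /var/www/html/config
</VirtualHost>
```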
20100224 |
CJC |
Noted that there are a large number of H1 GridFTP transfers on epgse1. This would make sense in the context of the larger number of production jobs which have just completed. According to Ganglia, the problem appears to be the load and CPU usage, not bandwidth related. |
20100223 |
CJC |
Yaim Savannah Bug highlights the fact that lcg-CE is not supported on SL4 32bit. Redeploy? |
20100223 |
CJC |
Updated iptables for pool nodes. This should also now allow communication between the pool node and BB IPs. |
20100223 |
LSL |
Noticed during epgsr2 reboot that PXE doesn't function when the bonded interfaces eth0-3 are connected to the group-of-4 trunked switch ports. Observed that epgmo1 receives and responds to the PXE dhcp packets, but epgsr2 PXE doesn't see responses. If important, swap cables such that switch XOR algorithm (LocalGridBonding) chooses eth0 for response. |
20100223 |
LSL |
Noticed epgsr2 RAID disk labels had been assigned wrongly, eg f16c on physical RAID f17, so went offline, backed up /disk/f* to internal disk, re-initialised the RAID file-system labels correctly, restored the disk areas from the backup, and went back online. Note that the LocalGridRaidFormat doc describes the initialisation process. |
20100223 |
CJC |
Started to remove SL4 software areas from BB. |
20100223 |
CJC |
epgr04 failing SAM tests related to CRLs. Removed everything in BB:/egee/soft/SL5/middleware/prod/external/etc/grid-security/certificates , and reran yaim config. This fixed the cert test warning, but not the rm test error. Under Investigation. |
20100223 |
CJC |
Noted that Birmingham has not published any accounting statistics for 12 days. Ran gap publisher on epgmo1 manually. Under investigation. |
20100222 |
CJC |
Ran yaim on epgsr2 . This allowed lcg-cr transfers onto the pool node when the firewall was dropped. Compare with epgsr1 to see which ports need to be open. epgsr2 configured with a bonded network connection, but requires the network cables to be physically moved. |
20100222 |
CJC |
Changed permissions on /home/lcgui/SL5/local/bin-cron/local-fetch-crl to 755 so that cron job can actually download the CRLs (previously failing - permission denied to execute). |
20100222 |
CJC/LSL |
Power supply problem noted on epgd09-16. No data on ganglia for these nodes since Sunday 21st, 2pm. Outage due to a blown fuse. |
20100222 |
CJC |
epgsr2 network booted and started to reinstall. Because epgsr2 left to network boot on epgmo1, it started to reinstall after rebooting following the unbonding action. As the RAIDs were not disconnected, they appear to be formatting as well - this should not be allowed to happen again. |
20100222 |
CJC |
Unbonded epgsr2 and rebooted. This enabled the pxeboot to run (failed for the bonded interface). An interesting question: does the unbonded network connection for eth0 work when connected to the trunked ports on the switch? |
20100221 |
CJC |
Initialised the qfeed script on epgr04 for g-honp09 in the absence of ATLAS production. |
20100219 |
CJC |
Ganglia and dpmmgr UIDs got confused on epgsr2, causing DPM to create directory structures belonging to ganglia. Reinstalling in order to ensure completely fresh setup and avoid future difficulties. Moved ganglia installation to come after lcg installation in cfengine. |
20100219 |
CJC |
Updated all groups.conf so that entries take the form "/alice/ROLE=lcgadmin":::sgm: (was previously "/VO=alice/GROUP=/alice/ROLE=lcgadmin":::sgm: ). The old format was causing yaim to only make entries in /etc/grid-security/grid-mapfile for special (ie production accounts). This caused some intermittent problems on epgse1 during the afternoon. |
20100219 |
CJC |
Rerunning yaim did not help the transfer problems on epgsr2. Noted that other users have tried to write to the disk, so marking as readonly for now. |
20100219 |
CJC |
Tried to copy a file onto epgsr2 by setting all other disks to read only and then using the command lcg-cr -v --vo atlas -d epgse1.ph.bham.ac.uk --st ATLASSCRATCHDISK -l lfn:/grid/atlas/users/christophercurtis/test.sh.0 file:///home/cjc/thesis.tar.gz . The transfer appeared to stall, and in epgsr2:/var/log/dpm-gsiftp/gridftp.log the entry "530 Login incorrect. : Could not get virtual id!" was noted. Checked /etc/grid-security/grid-mapfile and found only production accounts listed. Rerunning yaim. |
20100219 |
CJC |
Network bonded eth0-3 on epgsr2. |
20100219 |
CJC |
Moved /etc/xen/epgr0* VM definitions on VMHosts to /etc/xen/auto/epgr0* . This allows the machines to boot automatically after the host has booted. Changed initialisation scripts to reflect this. |
20100218 |
CJC |
Ran yaim on epgsr2, with some success. Storage not yet online because DPM on epgse1 has not been updated. This should really be done automatically somehow... |
20100218 |
CJC |
Updated ssh permissions on all grid nodes. ssh now only allowed between the login node, Lawrie's desktop and Chris' desktop. No ssh between nodes permitted (with the exception of the required ssh between epgr02 and the twin WNs and epgr04 and BB exports). |
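With tcp-wrappers this kind of lockdown can be expressed as (a sketch; all hostnames below are placeholders, not the real machines):

```
# /etc/hosts.deny
sshd : ALL

# /etc/hosts.allow -- login node and the two admin desktops only
sshd : loginnode.ph.bham.ac.uk desktop1.ph.bham.ac.uk desktop2.ph.bham.ac.uk
```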
20100218 |
CJC |
Brian Davies suggests DATA=25.09T, GROUP=5T, HOT=1T, LOCALGROUP=18T, MC=25T, PROD=3.5T and SCRATCH=12T for the ATLAS spacetoken allocations, assuming that epgsr2 is assigned entirely to ATLAS. |
20100218 |
CJC |
Tim can't download dataset user09.timmartin.105003.pythia_sdiff.MinBiasAthenaV1.AtlOff15.6.1_r1027 from Tokyo. DQ2 complains about a CRL problem. The transfer works at CERN however. Check CRLs on UI. |
20100218 |
CJC |
epgce2 failed to reboot overnight because of a failed bios RAM test. This machine is known to have bad RAM. To avoid the problem again, the bios settings were changed so that the F1 key does not have to be pressed manually on discovering an error. This should allow the machine to continue to boot. |
20100218 |
CJC |
Connected and mounted /disk/f16 and /disk/f17 RAIDs to epgsr2. This required creating mount points on epgsr2 ( /disk/f1?[a-d] ) and adding entries for each filesystem in /etc/fstab . Rebooted. This machine is now ready for configuring as a pool node. It will also require network bonding in the near future. |
20100218 |
CJC |
Fixed the epgr04 gmetric.cron by adding /root/bin to the path. Previously not reporting any running jobs because it could not find the qs command. |
20100218 |
CJC |
Restarted pbs_mom on all WNs. Communication between WNs and epgr02 lost sometime over night. |
20100217 |
CJC |
epgce2 noted to be incommunicado. Requires manual reboot. |
20100217 |
CJC |
Found that xen does not automatically restart domains after a VM host is rebooted. |
20100217 |
CJC |
Prepared epgmo1 for the installation of Storage Pool epgsr2. Linked epgmo1:/tftpboot/pxelinux.cfg/93BC2E25 -> hosts/epgsr2.ph.bham.ac.uk ->/tftpboot/pxelinux.cfg/configs/boot-hd.cfg . Changed bios settings to network boot first. Installed SL5.3, preserving the Dell Utility partition. |
20100217 |
CJC |
Peter Love confirms that the reason for Panda Jobs not running on BlueBEAR is because ATLAS breaks the LD_LIBRARY_PATH variable in tarball installations. |
20100217 |
CJC |
Changed all installation scripts to use Local RPM Repo when first installing/updating a new node. This includes changes to the /var/cfengine/inputs/repo/vm/* scripts on epgmo1, as well as to all kickstart files. This avoids difficulties with the main Scientific Linux Repo (which is currently unavailable). Further changes to repo lists may be made after a node has been installed by cfengine. |
20100217 |
CJC |
Noted that the dpm , dpmcopyd , srmv1 and srmv2.2 services on epgse1 failed to restart after a reboot. Restarted manually. |
20100216 |
CJC |
Deployed epgr05 as a blank VM ready for the CreamCE. epgr06 deployed as a WN for epgr05. |
20100216 |
CJC |
Rebooted epgce4 - this should install SL 5.3 x86_64 and prepare the machine for two VM Hosts. |
20100216 |
CJC |
Added virtual host to epgmo1:/etc/httpd/conf/httpd.conf , listening to *:80 . All normal web connections should now be accepted without having to authenticate using SSL. Authentication still used for https://epgmo1.ph.bham.ac.uk/nagios . Changed /data1/grid/kickstart/* , /var/cfengine/inputs/cfagent.conf , /var/cfengine/inputs/repo/vm/* and /var/www/html/*.ks on epgmo1 to reflect this. |
20100216 |
CJC |
Backed up /opt , /etc , /var and /root on epgce4 to epgsr1:/disk/f15d/epgce4.backup . This will be the final backup before redeploying epgce4 as a VM host. |
20100215 |
CJC |
Downloaded swevo.ific.uv.es.pem into /etc/grid-security/vomsdir on epgr02 to allow fusion jobs to run. |
20100215 |
CJC |
Configured SL5 UI. Edited /usr/local/bin/lcguisetup to call /home/lcgui/SL5/local/lcguisetup.bash if the user is in an SL5 environment. Also added /home/lcgui/SL5/local/bin-cron/local-fetch-crl to eprexa cron jobs so that the CRLs are downloaded every 6 hours. |
20100215 |
CJC |
Installed UI 3.2.6-0 on local system for use on SL5 nodes (ie eprexb). Unzipped UI tarballs into /home/lcgui/SL5/middleware/3.2.6-0 and soft linked to /home/lcgui/SL5/middleware/prod/ . Configured with yaim /home/lcgui/SL5/middleware/prod/glite/yaim/bin/yaim -c -s /home/lcgui/SL5/yaim-conf/site-info.conf -n UI_TAR . Note that this must be done from an SL5 node! Changed permissions of profile scripts in /home/lcgui/SL5/middleware/prod/external/etc/profile.d/* to 755 (previously 744). Copied /egee/soft/SL5/local/bin-cron/local-fetch-crl from BlueBEAR into /home/lcgui/SL5/local/bin-cron/ and executed. |
20100215 |
CJC |
On BlueBEAR, replaced the softlink /egee/soft/SL5/middleware/prod/external/usr/lib64/libldap-2.3.so.0 , which previously pointed to libldap-2.3.so.0.2.31 in the same directory, with one which points to /usr/lib64/libldap-2.3.so.0 . This fixed the ldapsearch error "LDAP vendor version mismatch: library 20343, header 20327". |
20100215 |
CJC |
Updated BlueBEAR WN tarball to release 3.2.5-0. Untarred release into /egee/soft/SL5/middleware/3.2.5-0 and then updated the softlink /egee/soft/SL5/middleware/prod to point to the new release. Ran yaim twice: /egee/soft/SL5/middleware/prod/glite/yaim/bin/yaim -c -s /egee/soft/SL5/middleware/yaim-conf/site-info.def -n glite-WN_TAR and /egee/soft/SL5/middleware/prod/glite/yaim/bin/yaim -r -s /egee/soft/SL5/middleware/yaim-conf/site-info.def -n glite-WN_TAR -f config_certs_userland -f config_crl to configure the WN release and obtain the CRL url files. Added the file /egee/soft/SL5/middleware/prod/external/etc/profile.d/x509.sh , which sets the X509_CERT_DIR and X509_VOMS_DIR variables, because these are not added by the yaim config. Edited /egee/soft/SL5/middleware/prod/external/etc/profile.d/grid-env.sh to ensure that x509.sh is also called. |
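The hand-added x509.sh amounts to something like the following sketch (only the two variable names and the prod path come from the entry; the exact vomsdir target is an assumption):

```shell
# x509.sh -- point the grid tools at the tarball's own certificate areas,
# since yaim does not set these for a WN_TAR install.
MW=/egee/soft/SL5/middleware/prod
export X509_CERT_DIR=$MW/external/etc/grid-security/certificates
export X509_VOMS_DIR=$MW/external/etc/grid-security/vomsdir
```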
20100215 |
CJC |
On BlueBEAR, changed /egee/soft/SL5/local/bin-cron/local-fetch-crl so that it retrieves the current CRLs, and places them into /egee/soft/SL5/middleware/prod/external/etc/grid-security/certificates . This script is executed every six hours by the /egee/system/cron.d/cronuser.u4n??8 cron job. |
20100215 |
LSL |
On BlueBEAR, changed the NFS mount of /egee so that it uses the noatime option, for efficiency, so that simple file accesses do not result in inode re-writes back through the NFS/GPFS system. Observed that the NFS v3 max transfer size is 32768, even if higher value requested. |
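As an fstab sketch (the server name and transfer-size options are illustrative; only noatime and the observed 32768 ceiling come from the entry):

```
gpfs-nfs-server:/egee  /egee  nfs  noatime,rsize=32768,wsize=32768  0 0
```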
20100212 |
CJC |
Removed epgce4 from site BDII definition. Removed references to myproxy services on epgmo1 by 1) Removing all glite*, edg*, bdii* packages. 2) Removing /opt/bdii, /opt/glite, /opt/globus and /opt/edg directories 3) Reinstalling and reyaiming via cfengine. This has removed all myproxy references in the node BDII. |
20100210 |
CJC |
Updated the GridPP voms certificate on the local UI. This is held in /home/lcgui/SL4/etc/grid-security/old-1.28/vomsdir/voms.gridpp.ac.uk.pem . The updated version is available here. The old certificate, which expires on 11/02/2010, has been backed up to voms.gridpp.ac.uk.22812.pem. Also updated certificates on epgr02 and epgse1 in the directory /etc/grid-security/vomsdir/. |
20100210 |
CJC |
Killed qfeed on epgce4 and started it on epgr04 for the user g-atlp13 (Graeme). |
20100210 |
CJC |
Notified ATLAS decommissioning of epgce4 and replacement by epgr04. Moved epgce4 queues offline by editing /opt/lcg/libexec/lcg-info-dynamic-pbs so that push @output, "GlueCEStateStatus: $Status\n"; becomes push @output, "GlueCEStateStatus: Draining\n"; . The command lcg-info --vo atlas --list-ce --attrs 'CEStatus' confirms that epgce4 is not available for jobs. Changed status of epgce4 and epgr04 in GOC DB. |
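The provider edit can be reproduced mechanically, e.g. with sed (drain_provider is a hypothetical helper; run against a copy of the file first):

```shell
# Force the published CE state to Draining by rewriting the perl push line.
drain_provider() {
    sed 's/GlueCEStateStatus: \$Status/GlueCEStateStatus: Draining/'
}
# e.g. drain_provider < /opt/lcg/libexec/lcg-info-dynamic-pbs
```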
20100210 |
CJC |
Wrote a nagios plugin which raises a warning if there are less than 20% of a group's pool accounts left and a critical warning if there are less than 10%. |
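The plugin's threshold logic reduces to something like this (a sketch; check_pool_accounts and its free/total arguments are hypothetical, only the 20%/10% thresholds come from the entry):

```shell
# Nagios convention: exit 0 = OK, 1 = WARNING, 2 = CRITICAL.
check_pool_accounts() {
    free=$1 total=$2
    pct=$(( 100 * free / total ))
    if [ "$pct" -lt 10 ]; then
        echo "CRITICAL: only ${pct}% of pool accounts free"; return 2
    elif [ "$pct" -lt 20 ]; then
        echo "WARNING: only ${pct}% of pool accounts free"; return 1
    fi
    echo "OK: ${pct}% of pool accounts free"; return 0
}
```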
20100209 |
CJC |
Problem with maui/pbs on epgr02 - appears to be unresponsive, even after restarting services. Rebooting machine. |
20100209 |
CJC |
Pilots failing on BB SL4 complain of not finding libglobus_gsi_proxy_core_gcc32dbgpthr.so.0 . This is available in /opt/globus/lib/ on Twin WN. |
20100209 |
CJC |
Fusion and H1 production appear to be failing gatekeeper tests on epgr02. Investigating. |
20100209 |
CJC |
Nagios remote testing implemented. New tests distributed by cfengine. Added gmetric tests controlled by /root/cfengine/files/gmetric.sh via /etc/cron.d/gmetric.cron on epgr02, epgr04 and epgse1. These tests monitor the number of running jobs and the number of GridFTP transfers, making the results available to Ganglia. Tests distributed via cfengine. |
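The job-count half of such a gmetric test might be sketched as follows (count_running is a hypothetical helper; the column position assumes Torque `qstat -a` output, where the state field is second to last):

```shell
# Count jobs in the running (R) state from `qstat -a`-style output.
count_running() {
    awk '$(NF-1) == "R"' | wc -l
}
# Publish to ganglia, e.g.:
# gmetric --name running_jobs --value "$(qstat -a | count_running)" --type uint32
```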
20100208 |
CJC |
Patched bug in cfengine deployment of epgr02 which caused qsub -> qsub.sh -> qsub -> qsub.sh... Each time a new job was submitted, it entered an infinite recursive submission. Also added umask=022 to all shellcommands, which should fix reyaim bug. |
20100208 |
CJC |
Removed egee-NAGIOS , glite-PX and glite-UI packages from epgmo1 - this monitoring information is available elsewhere. Installed vanilla Nagios release with the intention of deploying standard (eg ping, disk usage) and home brew (number of ATLAS production jobs submitted) sensors. |
20100204 |
CJC |
Very slow Athena compile times noted on BB. Investigating. |
20100204 |
CJC |
Athena 15.5.0 test job and SQUID test job submitted to epgr04. If successful, decommissioning on epgce4 will begin. Passed the Athena and SQUID tests. |
20100203 |
CJC |
Re-enabled remote logging on all nodes (except twins). Log messages should be saved both locally and on epgmo1. |
20100203 |
CJC |
Reinstalled epgd16 successfully and moved it back online. |
20100203 |
CJC |
Enabled port 7512 on epgmo1 for the purposes of the MyProxy server. |
20100202 |
CJC |
Graeme's jobs on BB SL4 seem to be failing with the error "Can't locate Globus/Core/Paths.pm in @INC". Investigating. |
20100202 |
CJC |
Problem with lcg-cr onto SE (failed SAM tests). Investigating. Transient error on SE. Keeping an eye on it. |
20100202 |
CJC |
Ganglia and Nagios monitoring installed on epgmo1. Nagios creates a high load on epgmo1. Consider reducing polling frequency or upgrading to a better machine! |
20100202 |
CJC |
Fixed all grid kickstart files to connect to https://epgmo1.ph.bham.ac.uk/ack.php with the --no-check-certificate switch. |
20100202 |
LSL |
Following actions noted yesterday on BB grid, added sharutils and blas-devel on BB suggested by Chris. Today, after reviewing SL5WN, added PyXML.i386 from 32-bit distro (64-bit already present, so may not be important, but harmless!). |
20100202 |
CJC |
Reinstalling glite-WN on epgd16 (having put it offline first!) after ATLAS pilot job could not find globus-url-copy. |
20100202 |
CJC |
epgse1 failing gsirfio lcg-gt SAM tests. Removal of --legacy from epgse1:/opt/glite/etc/gip/provider/se-dpm caused gsirfio support to be appended to ldap output. As this is not supported, BHAM failed SAM tests. --legacy support reintroduced. Awaiting the result of the savannah bug. Tested installation of savannah bug rpm - breaks xrootd support and fails to fix gsirfio problem, although legacy warnings do disappear! |
20100201 |
CJC |
Re-enabled CMS jobs on epgr02 and epgr04. |
20100201 |
CJC |
Removed "--legacy" switch from dpm-listspaces call in epgse1:/opt/glite/etc/gip/provider/se-dpm . This should fix the "GlueSACapability has unknown value" gstat2 warnings. This edit is managed by cfengine, so any subsequent reyaim should be fixed by cfengine. The consequence of this is that the SE appears to have dropped out of the lcg-infosites output. Is this important? |
20100201 |
LSL |
Following on from my actions on 20100118, for SL5 on BB, installed compat-glibc-headers.i386 from 32-bit distro, missing from 64-bit distro. Asked Alan to update SL5 kernel for BB grid worker nodes. |
20100201 |
CJC |
Changing "/C=FR/O=CNRS/CN=GRID-FR" to "/C=FR/O=CNRS/CN=GRID2-FR" in the vo.d/biomed file appears to fix biomed authentication failure errors in globus-gatekeeper.log |
20100201 |
CJC |
dpm-listspaces on epgse1 shows that the ATLAS pool is not using any space, which appears to be contrary to similar output from Oxford and Glasgow DPM output. Suspect this may be the root cause of GlueSAUsedOnlineSize < 0 in the Gstat-prod monitoring. Emailed Gridpp-storage list. |
20100131 |
CJC |
Added qsub.sh setup to epgr02 to ensure that NGS jobs get assigned to a specific queue (previously not running). |
20100131 |
CJC |
Added H1 production and lcgadmin roles to QUEUE_ENABLE variables in epgr02 and epgr04 site-info.def files. This will allow H1 production jobs to run (previously failing). |
20100128 |
CJC |
Birmingham is on the Ganga Blacklist. The ANALY_BHAM Panda jobs are failing on BB, but also the latest WMS job went to epgr04 and failed when it couldn't use DQ2... Review the situation on Monday after installations have progressed. |
20100128 |
CJC |
epgce4 seems to fail 14.5.0 Pilot jobs consistently. Problem with installation after epgr04 mix up? If epgr04 is up and running soon, epgce4 will be decommissioned anyway! epgr02 jobs seem to be more hit and miss - sometimes they work, sometimes they fail the md5sum test. |
20100128 |
CJC |
Added SITE_OTHER_WLCG_NAME="UK-SouthGrid" to site-info.defs of all CEs and site BDII in order to pass gstat2 tests. |
20100128 |
CJC |
SQUID test on epgr04 failed because latest DDM tools (ie DQ2) are not available. Requesting the 1.32 installation be fixed. |
20100128 |
CJC |
Submitted one ATLAS pilot job to Birmingham with two subjobs. One subjob succeeded, the other failed due to a mismatching md5sum. Submitting job again for reproducibility. Also submitting with other datasets. Update: Other datasets have failed. Either there are lots of corrupt files on the SE (!), or there is a problem with the transfers, or there is a problem with the md5sum installation on the WNs. |
20100127 |
CJC |
Increased the MAXPROC limit to 100 for both camont and ALICE production - first come, first served whilst ATLAS is down! |
20100127 |
CJC |
Added logrotate directive to cfengine to ensure all logs are kept for 366 days. |
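As a logrotate sketch (the file list is illustrative; only the 366-day retention comes from the entry):

```
/var/log/messages {
    daily
    rotate 366
    compress
    missingok
}
```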
20100127 |
CJC |
Added fusion VO to epgr02 + WNs. Installed lcg-vomscerts-desy to enable H1 and Zeus jobs to run. |
20100126 |
CJC |
Installed the GridPP Nagios suite on epgmo1. For this to work, SELinux needed to be moved to permissive mode. It also only works on port 80 (not 8888), so this has broken Ganglia (which runs on port 8888) and may have knock-on effects... keep an eye on the SAM tests! |
20100126 |
CJC |
Manual yum update of epgsr1 and a reboot. This might fix the ATLAS transfer problems. |
20100126 |
CJC |
yum updating all SL5 nodes in light of new security bug. Should really develop a rolling reboot script for use in cfengine to safely reboot Twins... |
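The rolling-reboot idea mentioned above could be sketched roughly as follows. This is not an existing site script: the node names are placeholders and the torque/ssh calls are left commented out as assumptions.

```shell
# Sketch: drain each twin, reboot it, wait for it to return,
# then bring it back online before touching the next one.
handled=""
for node in epgd15 epgd16; do
    echo "draining $node"
    # pbsnodes -o "$node"                       # mark offline in torque
    # ssh "$node" /sbin/shutdown -r now         # reboot
    # until ping -c1 "$node" >/dev/null 2>&1; do sleep 30; done
    # pbsnodes -c "$node"                       # back online
    handled="$handled $node"
done
echo "handled:$handled"
```

Doing the nodes strictly one at a time keeps the twins' shared chassis from losing both halves at once.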
20100125 |
CJC |
HepSpec06 (32 bit, SL5) results. 9.61 for the Twins, 7.93 for BlueBEAR. Added to CE Information System. |
20100122 |
CJC |
Test upload of a file to the SE using xrootd ( xrdcp ~/test.txt root://epgse1.ph.bham.ac.uk//home/alice/test.txt ) failed with the error: Last server error 3010 ('Opening path '/home/alice/test.txt' is disallowed.') Error accessing path/file for root://epgse1.ph.bham.ac.uk//home/alice/test.txt |
20100121 |
CJC |
Moved epgd16 offline for the purposes of running HepSpec. Running the HepSpec test as described here. The node has all normal grid services running, but should not receive any jobs. The node has also been disabled in the epgmo1 cfengine cfrun.hosts list, so no large file transfers should take place. |
20100120 |
CJC |
Problem on the SL5 BB WNs - grid-env.sh keeps getting overwritten, resulting in the x509 variables not being set (the x509.sh script must be executed as well). Removed the offending yaim call from the cron job definition, but this will require that u4n108, u4118 and u4128 be rebooted. |
20100118 |
CJC |
Moved the 2009 Diary Entries here. |
20100118 |
LSL |
See previous item for BlueBEAR: the following packages (both archs) were required: compat-db compat-libf2c-34 compat-libgcc-296 compat-openldap compat-readline43 ghostscript giflib openmotif22 openssl097a tk. |
20100118 |
LSL |
Review packages in the SL5 image of BlueBEAR using doc https://twiki.cern.ch/twiki/bin/view/LCG/SL5DependencyRPM, specifically packages required by metapackage HEP_OSlibs_SL5-1.0.2-0.x86_64.rpm linked from that doc. Also packages listed in doc https://twiki.cern.ch/twiki/bin/view/Atlas/SL5Migration, under heading LCG Applications Metapackage, for ATLAS. |
20100118 |
CJC |
RFIO problem detected in 14.5.0 sample job 702 submitted to epgr02. |
20100118 |
CJC |
Reimplemented Ganglia monitoring on epgmo1. Deployment of Ganglia is under the control of cfengine (and therefore nodes epgsr1 and epgce4 have not yet been added to the monitoring). |
20100115 |
CJC |
Reimplemented the ATLAS and LHCb pilot roles on epgr02 + WNs, as they were lost during the SL5 conversion. These are now completely maintained by yaim. |
20100114 |
CJC |
yum update on epgmo1 briefly broke the apel publishing. Fixed by adding the line JAVA_HOME="/usr/java/latest" to /etc/tomcat5/tomcat5.conf. |
20100114 |
CJC |
New BB lcg-CE made to work by ensuring local pool account locations match those on BB. Created a softlink /egee/home -> /home and edited /etc/passwd to reflect this. |
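The fix above amounts to something like the following. This sketch works in a scratch directory with a made-up pool account; on the real CE the link was /egee/home -> /home and the edit was to the live /etc/passwd, and the exact direction of the home-dir rewrite is an assumption.

```shell
set -e
work=/tmp/bb-passwd-demo
mkdir -p "$work/egee"
# on the CE this was: ln -sfn /home /egee/home
ln -sfn /home "$work/egee/home"
# made-up pool account entry; rewrite its home dir to the BB-side path
printf 'atlas001:x:5001:5001::/home/atlas001:/bin/bash\n' > "$work/passwd"
sed -i 's#:/home/#:/egee/home/#' "$work/passwd"
cat "$work/passwd"
```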
20100113 |
CJC |
PBS communication broken between epgr02 and the worker nodes after autoupdate. The WNs attempted an auto update, but failed due to a package inconsistency in the glite-WN_ext repo. All WNs updated via cfengine and rebooted. |
20100112 |
CJC |
Rebooted epgr02 after lcg-CE update. |
20100112 |
CJC |
Auto yum update on epgce4 upgraded the lcg-CE, which broke torque submission. Turned off yum updates ( chkconfig yum off; chkconfig --list yum ) and then reinstalled Lawrie's moab tools tarball. Also restored the qsub.bin/qsub.sh setup using the script held on epgr04. Local job submission works, checkjob works (unlike epgr04). Waiting for a grid test job to return positive. |
20100107 |
CJC |
Freed 1.4T of dark ATLAS data from epgse1. |
20100106 |
CJC |
BlueBEAR jobs running again. |
20100106 |
CJC |
Moved epgd15 offline for the purposes of benchmarking. |
20100106 |
CJC |
Noted that all jobs (including local jobs) are queued on BlueBEAR. Emailed Alan and Aslam. |