This is a reverse-order diary of events, without retrospective editing (so keep it raw and short, max ~3 lines). See other pages like
for more carefully considered documentation.
20111115 |
MWS |
Removed an empty pool 'DPM001' from epgse1 using dpm-rmpool. |
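For reference, a dry-run sketch of the removal (the dpm-qryconf check is an assumption about how one would first confirm the pool is empty; commands are only echoed since they need the DPM head node):

```shell
# Dry-run: pool name from the entry; commands echoed, not executed.
pool=DPM001
for cmd in "dpm-qryconf" "dpm-rmpool --poolname $pool"; do
    echo "would run: $cmd"
done > ./dpm-cmds.log
cat ./dpm-cmds.log
```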
20111024 |
LSL |
On local /home/lcgui setup, patched-in some files to support eScience CA 2A/2B, so voms-proxy-init works for new students and recent renewers. |
20111017 |
LSL |
Power socket supplying UPS for grid left rack (epgsr1 and RAIDs f12-f15, epgpe01-04) has hair-line crack and failed at around 9am. Swapped plug to another socket and got things going again by 09:40. Will inform electrician Mark Wicks about this problem. [He will replace during some downtime]. |
20111010 |
LSL |
After SCSI problem on epgsr1 for f12 f13 f15, swapped that SCSI chain with f14 chain by moving cables between cards to see if SCSI problem moves or not. |
20111009 |
MWS |
epgsr1 was acting up which was assumed to be a SCSI problem. However, after doing a reboot, the server didn't come back up. Offlined the site and closed the queues. Hopefully this can be fixed tomorrow! |
20110923 |
LSL |
After SCSI problem on epgsr1 for f12 f13 f15, replaced that SCSI card with a new one to see if this fixes the problem. |
20110921 |
MWS |
SE was acting up this morning with what seemed to be a dpns hang. Restarted but got strange SOAP errors about token headers. Restarted the srmv2.2 and that seemed to fix it. |
20110909 |
MWS |
After several days of being hit by batches of 100s of jobs at a time, banned user Rafael Mayo (Fusion) across the site by adding to /opt/glite/etc/lcas/ban_users.db. Will email and try to get him to stop. |
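A sketch of the ban mechanism (the DN shown is illustrative, not the user's real certificate subject; a local file stands in for /opt/glite/etc/lcas/ban_users.db):

```shell
# The ban file is one quoted DN per line; LCAS re-reads it on each call.
ban_db=./ban_users.db
echo '"/C=ES/O=example/CN=Rafael Mayo"' >> "$ban_db"
grep -c '/CN=Rafael Mayo' "$ban_db"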
20110713 |
MWS |
As per GGUS 72515, added 'acl cern_dest dstdomain .cern.ch' and 'http_access allow cern_dest' to the squid config file. |
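The two lines added for GGUS 72515, shown here against a local stand-in for the squid config file:

```shell
# Append the ACL and the access rule (order matters: acl before http_access).
cat >> ./squid.conf <<'EOF'
acl cern_dest dstdomain .cern.ch
http_access allow cern_dest
EOF
grep cern_dest ./squid.conf
```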
20110712 |
MWS |
Fixed GLExec issues so local tests now work. Policies on the Argus server needed sorting out. Future problems may involve not having roles set properly here! |
20110708 |
MWS |
Started failing certificate NAGIOS tests as we were running the wrong version (1.38 rather than 1.40). Updated using yum update ca-policy-egi-core on the local WNs and copying the resultant certs to the BB WNs. |
20110609 |
MWS |
Noticed ATLAS analysis jobs were failing with liblcgdm errors. Checked nodes and the links were broken in /opt/lcg/lib. Fixed by hand but the next WN update should fix this properly. |
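The by-hand fix, sketched against a local stand-in for /opt/lcg/lib (the library version number is illustrative):

```shell
# Recreate the broken liblcgdm symlink pointing at the versioned library.
libdir=./lib
mkdir -p "$libdir"
touch "$libdir/liblcgdm.so.1.8.0"                 # the real shared object
ln -sf liblcgdm.so.1.8.0 "$libdir/liblcgdm.so"    # recreate the broken link
readlink "$libdir/liblcgdm.so"
```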
20110526 |
MWS |
SE problems narrowed down to excessive H1 jobs hammering the SE. |
20110525 |
MWS |
epgse1 showed strange network issues overnight. Rebooting (eventually) fixed it but should keep an eye on odd 'fetch-crl' and 'voms' errors. |
20110511 |
MWS |
All appears well on the recovered VMs on epgpe10. Did a test copy to the SE and that was fine so reopened the queues and jobs have started coming in. Will keep an eye on the Nagios tests to make sure everything is up and running again. |
20110511 |
LSL |
For epgpe10 problem, updated base system kernel and kernel-xen from 2.6.18-194.32.1 to 2.6.18-238.9.1. Also, Dell have supplied a new disk, so after booting from a CD, did a dd copy to new disk: dd if=/dev/sda of=/dev/sdb bs=51200000. This took about 2 hours. Rebooted. |
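The same dd form demonstrated on small local files (stand-ins for /dev/sda and /dev/sdb; block size as used on epgpe10):

```shell
# Make a small source image, clone it with the same bs, verify the copy.
dd if=/dev/zero of=./src.img bs=1024 count=4 2>/dev/null
dd if=./src.img of=./dst.img bs=51200000 2>/dev/null
cmp -s ./src.img ./dst.img && echo "copies identical"
```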
20110510 |
MWS |
Updated Maui config to use MAXJOB instead of MAXJOBS and slightly altered weighting to prioritise ATLAS, LHCb and ALICE. Will keep an eye to make sure jobs go through as expected. |
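An illustrative maui.cfg fragment (the limits and priorities are made up; the point is the MAXJOB keyword, since MAXJOBS is not the valid spelling), written to a local file:

```shell
# Hypothetical per-group limits/weights in Maui's GROUPCFG syntax.
cat > ./maui.cfg <<'EOF'
GROUPCFG[atlas] MAXJOB=200 PRIORITY=100
GROUPCFG[lhcb]  MAXJOB=100 PRIORITY=90
GROUPCFG[alice] MAXJOB=100 PRIORITY=90
EOF
grep -c 'MAXJOB=' ./maui.cfg
```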
20110508 |
LSL/MWS |
Mark notes epgpe10 has gone down again, like 20110505. System messages for epgpe10 logged on epgmo1 start with mpt2sas0: log_info(0x31110630): originator(PL), code(0x11), sub_code(0x0630), then sd 0:0:0:0: SCSI error: return code = 0x00010000, then scsi 0:0:0:0: rejecting I/O to dead device . Came in and rebooted. |
20110503 |
LSL |
For BB, I asked Alan to propagate sudoers changes of March, including Mark's account, from front-ends to worker nodes too. I've made a rc.d/S60sudo.sh to include Chris's account as required. |
20110503 |
LSL |
On BB, I've added Mark's account as an extra allowed-user in rc.d/S60sshd.sh which configures /etc/ssh/sshd_config on the grid worker nodes. |
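A sketch of what S60sshd.sh does (usernames are illustrative; a local file stands in for /etc/ssh/sshd_config):

```shell
# Append an extra account to the existing AllowUsers line.
echo 'AllowUsers lsl mws' > ./sshd_config
sed -i 's/^AllowUsers.*/& mark/' ./sshd_config   # add the extra user
grep '^AllowUsers' ./sshd_config
```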
20110427 |
MWS |
Entered DT. Set epgr02 & 05 queues to Draining and stopped (qstop --) long, short and alice. |
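A dry-run sketch of the queue stop (queue names from the entry; commands are only echoed since they need the live Torque server):

```shell
# Echo the qstop calls instead of executing them.
for q in long short alice; do
    echo "would run: qstop -- $q"
done > ./qstop.log
cat ./qstop.log
```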
20110422 |
MWS |
epgr05 was failing NAGIOS with LB Query failures. Rebooting fixed the problem. |
20110421 |
MWS |
Set epgr04/07 queues back online and set status in /opt/lcg/libexec/lcg-info-dynamic-pbs from Draining to $Status. Needed to reboot epgr04 as qsub/qstat didn't work, but other than that, all fine. |
20110415 |
MWS |
ALICE BB VO Box was under very heavy load (>15) with CPU idle. Contacted ALICE experts who had a look but recommended reboot. Tried soft reboot and didn't work so hard reset (xm destroy + xm create). All seems well now. |
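The hard-reset step sketched as a dry run (the VM name and config path are illustrative; commands echoed, not executed, as they need the Xen dom0):

```shell
# Echo the xm destroy/create pair used for the hard reset.
vm=alicevobox
for cmd in "xm destroy $vm" "xm create $vm.cfg"; do
    echo "would run: $cmd"
done > ./xm.log
cat ./xm.log
```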
20110414 |
MWS |
Request from ATLAS to take 22.5 TB from DATADISK and redistribute to other spacetokens. |
20110414 |
MWS |
Noticed that new jobs into epgr05 weren't coming in. Rebooted and found BDII didn't start due to 0 diskspace left. Deleted a load of cfengine backup files, rebooted and all is well. Did the same for epgr02 just in case as well. Need a more permanent solution in the long term though. |
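The cleanup sketched against a local stand-in directory (the file names are illustrative examples of cfengine's *.cfsaved backups):

```shell
# Create dummy cfengine backups, then delete them the way the cleanup did.
d=./cf-backups
mkdir -p "$d"
touch "$d/promises.cf.cfsaved" "$d/update.cf.cfsaved"
find "$d" -name '*.cfsaved' -delete
find "$d" -name '*.cfsaved' | wc -l
```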
20110413 |
MWS |
Added the new certificate for epgr08. Didn't reyaim/reboot as didn't seem necessary. |
20110413 |
MWS |
Attempted to reyaim epgse1 after putting in the new certificate but it got stuck when restarting the dpm. This was eventually traced to gmetric going nuts as it was run every minute. Reduced this to every 30 mins, rebooted (after Ctrl-C'ing out of the reyaim) and reyaimed again. Everything seems to be back up and OK! |
20110412 |
MWS |
On request from Elena (Atlas) added 1TB to PRODDISK (taken from DATADISK). |
20110412 |
MWS |
Reyaimed and rebooted epgr07 to put in new certificate. |
20110411 |
MWS |
Marked epgr07 and epgr04 in downtime and stopped the queues due to 1.5 weeks of BB downtime. |
20110404 |
MWS |
On Friday 1st, what looks like an ALICE job took out two nodes (kernel crash in the log) and at about the same time, we either received 2000 jobs causing the local CE and Torque to fall over OR the CE and Torque fell over and jobs were getting hung up over the weekend. Either way, the CE needed rebooting and ~2000 jobs were left in the queue in an odd state. Deleted all these and everything returned to normal! |
20110307 |
CJC |
All nodes require ssh keys to login, with the exception of epgmo1 (though this may change in the future). Public keys should be stored in the directory epgmo1:/var/cfengine/inputs/repo/general/public_keys/. They will then be distributed to all nodes via the module script modules/modules:ssh. New keys are added to the authorized_keys file using the command cfrun -- -D restart_ssh. |
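The key-adding procedure sketched with a local stand-in for the repo directory (the key content and filename are placeholders; the cfrun step is echoed since it needs the cfengine server):

```shell
# Stage a public key in the repo dir, then trigger distribution.
repo=./public_keys
mkdir -p "$repo"
printf 'ssh-rsa AAAAB3Nza... user@host\n' > "$repo/user.pub"
echo "would run: cfrun -- -D restart_ssh"
ls "$repo"
```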
20110217 |
LSL |
On BB, process accounting now starts via system/rc.d/S*psacct.sh: uses directory /local/account/, a 2GB area which survives reboot. |
20110209 |
LSL |
On BB, implemented logging of outgoing ssh calls on bluebear workers via an iptables rule. Tested process accounting on one node, u4n128. |
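A sketch of the kind of iptables rule involved (the exact rule form is an assumption, not taken from the entry; echoed since it needs root on the worker node): log new outgoing TCP/22 connections.

```shell
# Echo the hypothetical LOG rule for new outbound ssh connections.
rule="iptables -A OUTPUT -p tcp --dport 22 -m state --state NEW -j LOG --log-prefix 'outgoing-ssh: '"
echo "would run: $rule" > ./iptables.log
cat ./iptables.log
```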
20110117 |
CJC |
New dteam voms supported on local system SL5 UI. |