Local Grid Journal 2011

This is a reverse order diary of events, without retrospective editing (so keep it raw and short, max ~ 3 lines). See other pages like LocalGridInternals for more carefully considered documentation.

20111115 MWS Removed an empty pool 'DPM001' from epgse1 using dpm-rmpool.
20111024 LSL On local /home/lcgui setup, patched-in some files to support eScience CA 2A/2B, so voms-proxy-init works for new students and recent renewers.
20111017 LSL Power socket supplying UPS for grid left rack (epgsr1 and RAIDs f12-f15, epgpe01-04) has hair-line crack and failed at around 9am. Swapped plug to another socket and got things going again by 09:40. Will inform electrician Mark Wicks about this problem. [He will replace during some downtime].
20111010 LSL After SCSI problem on epgsr1 for f12 f13 f15, swapped that SCSI chain with f14 chain by moving cables between cards to see if SCSI problem moves or not.
20111009 MWS epgsr1 was acting up which was assumed to be a SCSI problem. However, after doing a reboot, the server didn't come back up. Offlined the site and closed the queues. Hopefully this can be fixed tomorrow!
20110923 LSL After SCSI problem on epgsr1 for f12 f13 f15, replace that SCSI card with a new one and see if this fixes the problem.
20110921 MWS SE was acting up this morning with what seemed to be a dpns hang. Restarted but got strange SOAP errors about token headers. Restarted the srmv2.2 and that seemed to fix it.
20110909 MWS After several days of being hit by batches of 100s of jobs at a time, banned user Rafael Mayo (Fusion) across the site by adding to /opt/glite/etc/lcas/ban_users.db. Will email and try to get him to stop.
20110713 MWS As per GGUS 72515, added the lines 'acl cern_dest dstdomain .cern.ch' and 'http_access allow cern_dest' to the squid config file.
20110712 MWS Fixed GLExec issues so local tests now work. Policies on the Argus server needed sorting out. Future problems may involve not having roles set properly here!
20110708 MWS Started failing certificate NAGIOS tests as we were running the wrong version (1.38 rather than 1.40). Updated using yum update ca-policy-egi-core on the local WNs and copying the resultant certs to the BB WNs.
20110609 MWS Noticed ATLAS analysis jobs were failing with liblcgdm errors. Checked nodes and the links were broken in /opt/lcg/lib. Fixed by hand but in next WN update, this should be fixed.
20110526 MWS SE problems narrowed down to excessive H1 jobs hammering the SE.
20110525 MWS epgse1 showed strange network issues overnight. Rebooting (eventually) fixed it but should keep an eye on odd 'fetch-crl' and 'voms' errors
20110511 MWS All appears well on the recovered VMs on epgpe10. Did a test copy to the SE and that was fine so reopened the queues and jobs have started coming in. Will keep an eye on the Nagios tests to make sure everything is up and running again.
20110511 LSL For epgpe10 problem, updated base system kernel and kernel-xen from 2.6.18-194.32.1 to 2.6.18-238.9.1. Also, Dell have supplied new disk, so after booting from a CD, did a dd copy to new disk: dd if=/dev/sda of=/dev/sdb bs=51200000. This took about 2 hours. Rebooted.
20110510 MWS Update Maui config to use MAXJOB instead of MAXJOBS and slightly altered weighting to prioritise Atlas, LHCb and Alice. Will keep an eye to make sure jobs go through as expected.
20110508 LSL/MWS Mark notes epgpe10 has gone down again, like 20110505. System messages for epgpe10 logged on epgmo1 start with mpt2sas0: log_info(0x31110630): originator(PL), code(0x11), sub_code(0x0630), then sd 0:0:0:0: SCSI error: return code = 0x00010000, then scsi 0:0:0:0: rejecting I/O to dead device. Come in and reboot.
20110503 LSL For BB, I asked Alan to propagate sudoers changes of March, including Mark's account, from front-ends to worker nodes too. I've made a rc.d/S60sudo.sh to include Chris's account as required.
20110503 LSL On BB, I've added Mark's account as an extra allowed-user in rc.d/S60sshd.sh which configures /etc/ssh/sshd_config on the grid worker nodes.
20110427 MWS Entered DT. Set epgr02 & 05 queues to Draining and stopped (qstop --) long, short and alice.
20110422 MWS epgr05 was failing NAGIOS with LB Query failures. Rebooting fixed the problem.
20110421 MWS Set epgr04/07 queues back online and set status in /opt/lcg/libexec/lcg-info-dynamic-pbs from Draining to $Status. Needed to reboot epgr04 as qsub/qstat didn't work, but other than that, all fine.
20110415 MWS ALICE BB VO Box was under very heavy load (>15) with CPU idle. Contacted ALICE experts who had a look but recommended reboot. Tried soft reboot and didn't work so hard reset (xm destroy + xm create). All seems well now.
20110414 MWS Request from ATLAS to take 22.5 TB from DATADISK and redistribute to other spacetokens.
20110414 MWS Noticed that new jobs into epgr05 weren't coming in. Rebooted and found BDII didn't start due to 0 diskspace left. Deleted a load of cfengine backup files, rebooted and all is well. Did the same for epgr02 just in case as well. Need a more permanent solution in the long term though.
20110413 MWS Added the new certificate for epgr08. Didn't reyaim/reboot as didn't seem necessary.
20110413 MWS Attempted to reyaim epgse1 after putting the new certificate but it got stuck when restarting the dpm. This was eventually traced to gmetric going nuts as it was run every minute. Reduced this time to every 30mins, rebooted (after Ctrl-C'ing out of the reyaim) and reyaim again. Everything seems to be back up and OK!
20110412 MWS On request from Elena (Atlas) added 1TB to PRODDISK (taken from DATADISK).
20110412 MWS Reyaimed and rebooted epgr07 to put in new certificate.
20110411 MWS Marked epgr07 and epgr04 in downtime and stopped the queues due to 1.5 weeks of BB downtime.
20110404 MWS On Friday 1st, what looks like an ALICE job took out two nodes (kernel crash in the log) and at about the same time, we either received 2000 jobs causing the local CE and Torque to fall over OR the CE and Torque fell over and jobs were getting hung up over the weekend. Either way, the CE needed rebooting and ~2000 jobs were left in the queue in an odd state. Deleted all these and everything returned to normal!
20110307 CJC All nodes require ssh keys to login, with the exception of epgmo1 (though this may change in the future). Public keys should be stored in the directory epgmo1:/var/cfengine/inputs/repo/general/public_keys/. They will then be distributed to all nodes via the module script modules/modules:ssh. New keys are added to the authorized_keys file using the command cfrun -- -D restart_ssh.
20110217 LSL On BB, process accounting now starts via system/rc.d/S*psacct.sh: uses directory /local/account/, a 2GB area which survives reboot.
20110209 LSL On BB, implement logging of outgoing ssh calls on bluebear workers via iptables rule. Test process accounting on one node u4n128.
20110117 CJC New dteam voms supported on local system SL5 UI.
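
Entries in this journal start with a YYYYMMDD date and run newest first. A minimal sketch of a checker for that convention follows, assuming it is given only the entry lines (not the title or footer). The function names parse_entry_date and check_journal are hypothetical and not part of any local tooling.

```python
import re
from datetime import date

# An entry line starts with eight digits followed by whitespace.
ENTRY_RE = re.compile(r"^(\d{8})\s")


def parse_entry_date(line):
    """Return the entry's date, or None if the prefix is not a valid YYYYMMDD."""
    match = ENTRY_RE.match(line)
    if not match:
        return None
    stamp = match.group(1)
    try:
        return date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:8]))
    except ValueError:
        # e.g. month 22 or day 0
        return None


def check_journal(lines):
    """Return (line_number, message) pairs for bad or out-of-order dates."""
    problems = []
    previous = None
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        entry_date = parse_entry_date(line)
        if entry_date is None:
            problems.append((number, "invalid date prefix"))
            continue
        # Reverse order: each entry must be no later than the one above it.
        if previous is not None and entry_date > previous:
            problems.append((number, "out of reverse order"))
        previous = entry_date
    return problems
```

Entries sharing a date are accepted in either order, since the prefix carries no time of day.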
Topic revision: r3 - 09 Jan 2013 - 14:44:47 - LawrenceLowe
Computing.LocalGridJournal20102011 moved from Computing.LocalGridJournal on 27 Apr 2012 - 11:02 by LawrenceLowe
 