This is a reverse-order diary of events, without retrospective editing (so keep entries raw and short, max ~3 lines). See other pages
for more carefully considered documentation.
20111115 |
MWS |
Removed an empty pool 'DPM001' from epgse1 using dpm-rmpool. |
20111024 |
LSL |
On local /home/lcgui setup, patched-in some files to support eScience CA 2A/2B, so voms-proxy-init works for new students and recent renewers. |
20111017 |
LSL |
Power socket supplying UPS for grid left rack (epgsr1 and RAIDs f12-f15, epgpe01-04) has hair-line crack and failed at around 9am. Swapped plug to another socket and got things going again by 09:40. Will inform electrician Mark Wicks about this problem. [He will replace during some downtime]. |
20111010 |
LSL |
After SCSI problem on epgsr1 for f12 f13 f15, swapped that SCSI chain with f14 chain by moving cables between cards to see if SCSI problem moves or not. |
20111009 |
MWS |
epgsr1 was acting up which was assumed to be a SCSI problem. However, after doing a reboot, the server didn't come back up. Offlined the site and closed the queues. Hopefully this can be fixed tomorrow! |
20110923 |
LSL |
After SCSI problem on epgsr1 for f12 f13 f15, replaced that SCSI card with a new one to see if this fixes the problem. |
20110921 |
MWS |
SE was acting up this morning with what seemed to be a dpns hang. Restarted but got strange SOAP errors about token headers. Restarted the srmv2.2 and that seemed to fix it. |
20110909 |
MWS |
After several days of being hit by batches of 100s of jobs at a time, banned user Rafael Mayo (Fusion) across the site by adding to /opt/glite/etc/lcas/ban_users.db. Will email and try to get him to stop. |
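For reference, the lcas ban_users.db file takes one quoted DN per line; the DN below is a made-up placeholder, not the banned user's actual DN:

```
# /opt/glite/etc/lcas/ban_users.db -- one banned DN per line
# (illustrative placeholder DN, not the real one)
"/DC=es/DC=irisgrid/O=example/CN=banned user"
```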
20110713 |
MWS |
As per GGUS 72515, added 'acl cern_dest dstdomain .cern.ch' and 'http_access allow cern_dest' to the squid config file. |
20110712 |
MWS |
Fixed GLExec issues so local tests now work. Policies on the Argus server needed sorting out. Future problems may involve not having roles set properly here! |
20110708 |
MWS |
Started failing certificate NAGIOS tests as we were running the wrong version (1.38 rather than 1.40). Updated using yum update ca-policy-egi-core on the local WNs and copied the resultant certs to the BB WNs. |
20110609 |
MWS |
Noticed ATLAS analysis jobs were failing with liblcgdm errors. Checked nodes and the links were broken in /opt/lcg/lib. Fixed by hand but this should be fixed in the next WN update. |
20110526 |
MWS |
SE problems narrowed down to excessive H1 jobs hammering the SE. |
20110525 |
MWS |
epgse1 showed strange network issues overnight. Rebooting (eventually) fixed it but should keep an eye on odd 'fetch-crl' and 'voms' errors |
20110511 |
MWS |
All appears well on the recovered VMs on epgpe10. Did a test copy to the SE and that was fine so reopened the queues and jobs have started coming in. Will keep an eye on the Nagios tests to make sure everything is up and running again. |
20110511 |
LSL |
For epgpe10 problem, updated base system kernel and kernel-xen from 2.6.18-194.32.1 to 2.6.18-238.9.1. Also, Dell have supplied new disk, so after booting from a CD, did a dd copy to new disk: dd if=/dev/sda of=/dev/sdb bs=51200000. This took about 2 hours. Rebooted. |
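The dd clone step above can be sketched safely on ordinary files (the real operation copied /dev/sda to /dev/sdb with bs=51200000; paths and sizes here are illustrative):

```shell
# Scaled-down sketch of the disk clone: dd a source "disk" to a
# destination, then verify the copy byte-for-byte.
set -e
src=$(mktemp); dst=$(mktemp)
head -c 1048576 /dev/urandom > "$src"   # 1 MiB stand-in for the source disk
dd if="$src" of="$dst" bs=64K status=none
cmp -s "$src" "$dst" && echo "copy verified"
```

A cmp (or checksum) pass after a whole-disk dd is cheap insurance before rebooting onto the new disk.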
20110510 |
MWS |
Updated Maui config to use MAXJOB instead of MAXJOBS and slightly altered weighting to prioritise ATLAS, LHCb and ALICE. Will keep an eye to make sure jobs go through as expected. |
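For reference, a maui.cfg sketch using the corrected parameter name; the group names, weights and limits below are illustrative assumptions, not the actual Birmingham config:

```
# maui.cfg fragment -- MAXJOB is the parameter Maui recognises
GROUPCFG[atlas] FSTARGET=40 PRIORITY=100 MAXJOB=200
GROUPCFG[lhcb]  FSTARGET=20 PRIORITY=80  MAXJOB=100
GROUPCFG[alice] FSTARGET=20 PRIORITY=80  MAXJOB=100
```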
20110508 |
LSL/MWS |
Mark notes epgpe10 has gone down again, like 20110505. System messages for epgpe10 logged on epgmo1 starts with mpt2sas0: log_info(0x31110630): originator(PL), code(0x11), sub_code(0x0630), then sd 0:0:0:0: SCSI error: return code = 0x00010000, then scsi 0:0:0:0: rejecting I/O to dead device . Come in and reboot. |
20110503 |
LSL |
For BB, I asked Alan to propagate sudoers changes of March, including Mark's account, from front-ends to worker nodes too. I've made a rc.d/S60sudo.sh to include Chris's account as required. |
20110503 |
LSL |
On BB, I've added Mark's account as an extra allowed-user in rc.d/S60sshd.sh which configures /etc/ssh/sshd_config on the grid worker nodes. |
20110427 |
MWS |
Entered DT. Set epgr02 & 05 queues to Draining and stopped (qstop --) long, short and alice. |
20110422 |
MWS |
epgr05 was failing NAGIOS with LB Query failures. Rebooting fixed the problem. |
20110421 |
MWS |
Set epgr04/07 queues back online and set status in /opt/lcg/libexec/lcg-info-dynamic-pbs from Draining to $Status. Needed to reboot epgr04 as qsub/qstat didn't work, but other than that, all fine. |
20110415 |
MWS |
ALICE BB VO Box was under very heavy load (>15) with CPU idle. Contacted ALICE experts who had a look but recommended a reboot. Tried a soft reboot, which didn't work, so hard reset (xm destroy + xm create). All seems well now. |
20110414 |
MWS |
Request from ATLAS to take 22.5 TB from DATADISK and redistribute to other spacetokens. |
20110414 |
MWS |
Noticed that new jobs into epgr05 weren't coming in. Rebooted and found BDII didn't start due to 0 diskspace left. Deleted a load of cfengine backup files, rebooted and all is well. Did the same for epgr02 just in case as well. Need a more permanent solution in the long term though. |
20110413 |
MWS |
Added the new certificate for epgr08. Didn't reyaim/reboot as didn't seem necessary. |
20110413 |
MWS |
Attempted to reyaim epgse1 after putting the new certificate but it got stuck when restarting the dpm. This was eventually traced to gmetric going nuts as it was run every minute. Reduced this time to every 30mins, rebooted (after Ctrl-C'ing out of the reyaim) and reyaim again. Everything seems to be back up and OK! |
20110412 |
MWS |
On request from Elena (Atlas) added 1TB to PRODDISK (taken from DATADISK). |
20110412 |
MWS |
Reyaimed and rebooted epgr07 to put in new certificate. |
20110411 |
MWS |
Marked epgr07 and epgr04 in downtime and stopped the queues due to 1.5 weeks of BB downtime. |
20110404 |
MWS |
On Friday 1st, what looks like an ALICE job took out two nodes (kernel crash in the log) and at about the same time, we either received 2000 jobs causing the local CE and Torque to fall over OR the CE and Torque fell over and jobs were getting hung up over the weekend. Either way, the CE needed rebooting and ~2000 jobs were left in the queue in an odd state. Deleted all these and everything returned to normal! |
20110307 |
CJC |
All nodes require ssh keys to login, with the exception of epgmo1 (though this may change in the future). Public keys should be stored in the directory epgmo1:/var/cfengine/inputs/repo/general/public_keys/ . They will then be distributed to all nodes via the module script modules/modules:ssh . New keys are added to the authorized_keys file using the command cfrun -- -D restart_ssh . |
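The distribution step ultimately appends each public key to authorized_keys; an idempotent append can be sketched as follows (filenames and the key itself are illustrative placeholders, not the real cfengine module):

```shell
# Idempotent append of a public key to authorized_keys -- a sketch of
# what the cfengine restart_ssh step ultimately does.
keyfile=$(mktemp); auth=$(mktemp)
echo "ssh-rsa AAAAB3NzaC1yc2E...example user@host" > "$keyfile"
add_key() {
    # append only if the exact key line is not already present
    grep -qxF "$(cat "$keyfile")" "$auth" || cat "$keyfile" >> "$auth"
}
add_key   # first run appends the key
add_key   # second run is a no-op
```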
20110217 |
LSL |
On BB, process accounting now starts via system/rc.d/S*psacct.sh: uses directory /local/account/, a 2GB area which survives reboot. |
20110209 |
LSL |
On BB, implement logging of outgoing ssh calls on bluebear workers via iptables rule. Test process accounting on one node u4n128. |
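A rule of the kind described might look like the following (an illustrative sketch, not the actual BB rule: it logs new outbound ssh connections so they can be cross-checked against process accounting):

```
# log new outgoing ssh connections from a worker node
iptables -A OUTPUT -p tcp --dport 22 -m state --state NEW -j LOG --log-prefix "OUT-SSH: "
```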
20110117 |
CJC |
New dteam voms supported on local system SL5 UI. |
20101208 |
LSL |
f8 RAID now in place in BB serving /egee/soft via NFS. Now /egee/home areas are on local worker disk, like on our local cluster, as a performance enhancement. |
20101206 |
LSL/CJC |
Take f8 RAID and eprex6 server over to BB; their BB team want to do the physical installation though. |
20101125 |
LSL |
ep19x BB NAS server fails again with kernel traceback from alloc_pages_internal. Do a soft reboot but filesystem then disappears. Start preparing for redeployment of f8 RAID for BB. |
20101112 |
LSL |
Prepared a bbmoab.tar of current (5.4.3.s1) Moab client binaries, including Green Computing options, for Chris to put on epgr04. |
20101111 |
LSL |
RAM memory tests (memtest86+ 2.0.1 and then 4.1.0) on ep19x BB NAS server ran clean for 24 hours, so put /egee filesystem back online. |
20101109 |
CJC |
Marked epgr04 and epgr07 as draining ahead of the BB downtime. |
20101109 |
CJC |
Created new home directories and ssh keys for new BB grid users. Full details on how this was automated can be found here. |
20101105 |
CJC |
After fixing epgr11 DN in GOCDB, apel data appears to be uploading successfully. Check back on November figures in 24 hours. Also check back on September - currently at 1007940 (want to avoid double counting). Full details of upgrade here. |
20101104 |
CJC |
Removed all tags from epgr04:/opt/edg/var/info/atlas/atlas.list except VO-atlas-cloud-UK and VO-atlas-tier-T2 . This should be enough to trigger reinstallation of ATLAS software. This will affect epgr07 as well as tag file was shared over NFS. |
20101104 |
CJC |
Test jobs successfully processed on BB. Submitting full grid type job. |
20101103 |
CJC |
Re-enabled BB queues and submitted large number of test jobs to ensure that nodes offlined by Green Computing systems can come back online. It appears as though all appropriate nodes have come back online, but no jobs are submitting. Check moab status? |
20101103 |
CJC |
Emailed tb-support as problems with APEL still persist and there is no reply to GGUS ticket 63654 (https://gus.fzk.de/ws/ticket_info.php?ticket=63654). |
20101102 |
CJC |
Restored grid middleware according to the LocalGridCookbook instructions, but test jobs submitted from epgr04 are not picking up the grid environment variables. Local config scripts, yaim etc picked up by chance from old /egee filesystem still mounted on BB3. These files need to be added to the backup policy! |
20101102 |
CJC |
Submitted a helpdesk ticket (and emailed) Alan requesting that the kernel and glibc be updated on the BB nodes. Reinstalling middleware. |
20101102 |
LSL |
The NAS1104L box which provides /egee has been fitted with a new usb disk-on-module including up to date Open-E software. 3ware firmware already up to date. RAID now reformatted from scratch. Aslam has moved its power to the UPS, and has been asked not to hard power-off the device in future. |
20101101 |
CJC |
Submitted GGUS Ticket to APEL after epgr11 fails to upload updated accounting data to Accounting Portal. |
20101029 |
CJC |
Hard reboot of epaf17.ph.bham.ac.uk after failed reboot due to mount binds still being in place. Removed suid from cfengine modules. |
20101029 |
CJC |
Moved MonBox role over to SL5 on epgr11.ph.bham.ac.uk. Reyaimed all CEs (epgr02, 04, 05 and 07) and Site BDII to reflect the change. Updated GOCDB. Accounting currently reads 770316 for Birmingham - this should have increased by Monday. If not, GGUS ticket APEL for help. |
20101029 |
CJC |
All local nodes have been updated and rebooted, and so have been patched against CVE-2010-3904 and CVE-2010-3847. Waiting for BB to come back online before patching. |
20101020 |
CJC |
Disabled rds module in SL5 installations via cfengine to mitigate CVE-2010-3904. This should be extended to BB once the filesystem has been fixed. |
20101020 |
CJC |
NFS server logs copied to /home/lcgdata/logs/NAS/20101019 . |
20101020 |
CJC |
Added module:suid to cfengine tasks. This executes the script /root/cfengine/files/suid_fix.sh on SL5 nodes, either automatically if the lock file /var/cfengine/reports/suid_fix cannot be found or on demand (by setting the variable force_suid_fix ). The suid_fix.sh script prevents unauthorised root access via hard links, as described in CVE-2010-3847. Note that rebooting a node will undo the fix, so the reboot and halt cfengine commands attempt to remove the lock file! This should be extended to BlueBEAR once the filesystem has been fixed. |
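The lock-file gating described above can be sketched as follows (paths simplified to temp files; the real lock is /var/cfengine/reports/suid_fix and the real fix is suid_fix.sh):

```shell
# Run a fix once, record a lock, and skip on later runs unless forced.
lock=$(mktemp -u)          # stand-in for /var/cfengine/reports/suid_fix
run_fix() {
    if [ -e "$lock" ] && [ "$1" != "force" ]; then
        echo "skipped"
    else
        # ... apply the actual fix here ...
        touch "$lock"
        echo "applied"
    fi
}
first=$(run_fix)           # applies and creates the lock
second=$(run_fix)          # lock present -> skipped
```

Removing the lock at reboot/halt (as the cfengine commands do) is what forces the fix to be reapplied on the next run.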
20101020 |
CJC |
Unscheduled downtime for epgr04 (BB lcg-CE), epgr07 (BB CreamCE) and epgr10 (BB Alice VOBox) due to NFS filesystem problems. |
20101018 |
CJC |
BB NFS box unresponsive. Requested Aslam do a hard restart. |
20101012 |
CJC |
Enabled ATLAS and ALICE (along with other normal VOs) on epgr07 (the CreamCE for BB). Notified Patricia and Graeme about sending ALICE and ATLAS jobs to this CE. Note that the CreamCE requires access to the torque server logs (not just accounting). These are currently copied onto the NFS server ( ep19x.ph.bham.ac.uk:/egee/torque/server_logs ) on the BB side every 10 minutes by a cron job. This directory is then NFS mounted onto epgr07:/var/spool/pbs/server_logs . |
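The 10-minute copy could be an /etc/cron.d fragment along these lines (paths from the entry; the rsync flags are an assumption, the actual cron job may differ):

```
# copy torque server logs to the NFS export every 10 minutes
*/10 * * * * root rsync -a /var/spool/pbs/server_logs/ /egee/torque/server_logs/
```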
20101004 |
CJC |
Replicated cond10_data.000007.gen.COND._0002.pool.root.4801537.0 and DBRelease-12.7.1.tar.gz.6244710.0 on SE after job failures due to timeouts. |
20100929 |
CJC |
mysqld service failed to restart after rebooting epgmo1 during kernel upgrades. This caused APEL to fail to publish for 8 days. Restarted mysqld service and republished APEL data. Accounting data should now be up to date. |
20100927 |
CJC |
epgr07 not accepting jobs because it was not redeployed when the BB pool accounts were redefined. Backing up VM and redeploying. |
20100924 |
CJC |
epgsr4 (40TB) brought online. Space distributed between ATLAS spacetokens (DATA, MC, SCRATCH, HOT, LOCALGROUP). |
20100923 |
CJC |
Problem with yaim generated /etc/sudoers file on CreamCE for BB (epgr07). Emailed lcg-rollout. |
20100923 |
CJC |
Deploying epgr10 as a second VOBox for Alice. This will manage the BlueBEAR software area. NFS mounted ep19x.ph.bham.ac.uk:/egee/soft/SL5/alice on the VOBox. |
20100923 |
CJC |
BlueBEAR WNs back online, and using the updated kernel. Reyaiming epgr04 to allow jobs again. Updating ticket 62359. |
20100923 |
CJC |
BlueBEAR WNs appear to be in the down,offline state since late last night. Emailed Alan Reed. |
20100922 |
CJC |
Official kernel patch released. Updated DPM pool nodes, reyaimed and rebooted. Requested kernel be installed on BlueBEAR WNs. |
20100921 |
CJC |
epgd[01-24] nodes have kernel updated using yum --enablerepo=sl-testing update . Nodes rebooted, grub checked to make sure that nodes are using new kernel. All other service nodes, with the exception of the DPM pool nodes are updated in the same way. BB nodes are waiting for official kernel release. Supported VOs on epgr04 are reduced to ops only. Downtimes cleared from GOCDB. |
20100915 |
CJC |
Draining the epgd[01-24] nodes in preparation for kernel fix for the problem described here. |
20100915 |
CJC |
4000+ ILC jobs submitted to epgr02/05 by Stephane Poss. Killed off 3500 queued and emailed user. Checking efficiency of remaining jobs - could be useful to distribute to other SouthGrid sites. |
20100915 |
CJC |
Replaced bucket in server room with a bucket and crate. This should have a large enough volume to contain the air conditioning drainage for the weekend (bucket approximately 3/4 full after 24 hours). |
20100914 |
CJC |
Submission problem on epgr02. Submitted jobs run, but no output appears to be returned. This would explain the nagios timeouts on epgr02 jobs. Rebooted. |
20100914 |
CJC |
Pump broken in AirCon D. Maintenance logged problem with central services, waiting on quote for fix. In the meantime, they have uncoupled the drainage, which now empties into a bucket. This is not ideal (bucket should be checked every day), but it does mean the unit is switched on. All WNs brought back online. Temperature steady at 18.5C. |
20100914 |
CJC |
Switched more nodes off. David Clifford sending someone to look at air conditioning. Temperature peaked at 25C. |
20100913 |
CJC |
Added SRCFG definition to maui config on epgr05, reserving one slot on both epgd01 and epgd02 for ops jobs and Steve Lloyd. Check back on SAM tests in 24 hours to see if this makes a difference! |
20100913 |
CJC |
Changed "MAXPROC" to "MAXJOBS" in epgr05 maui definition, following advice on ScotGrid Blog. |
20100913 |
CJC |
AirCon D powered back on (~5pm). Temperature drops to < 17C. |
20100913 |
CJC |
Installed epgf01 and epgf02 behind the f12-15 RAIDs to help with air flow. Temperature holding steady at 19.5C. |
20100913 |
CJC |
AirCon D in W332 failed (switched off permanently). Contacted Dave Clifford. Proceeding to drain alternate WNs (1,4,5,8,9,12,13,16,17,20,21,24) with the intention of powering them off once the jobs have completed. Air temp currently at 19.46C. |
20100913 |
CJC |
Set epgr04 to draining and glong to enabled = False in preparation for BlueBEAR downtime. |
20100831 |
CJC |
Moved epgd17 back online, but gave it the property "raid". All other nodes have the property "lcgpro". Modified qsub script so that all jobs require the "lcgpro" property, with the exception of jobs submitted by "atl059", which require the "raid" property. In this way, epgd17.ph.bham.ac.uk has been isolated for the purposes of testing the RAID performance. |
20100831 |
CJC |
Moved epgd17 offline for the purposes of testing a RAID'ed WN. |
20100827 |
CJC |
Noted that the ATLAS Squid was swapping about 700MB of RAM. Readjusted VM allocations to give Squid 3GB at the expense of epgr02 (hosted on the same server). |
20100825 |
CJC |
Reyaimed epgr05 after ALICE complained of not being able to submit jobs. Jobs now successfully being submitted |
20100811 |
CJC |
Disappeared from Top level BDII again last night when epgr09 stopped responding to ldap queries. Restarted BDII service on epgr09. Adding hourly restart to cfengine. Checking log files for problems. |
20100810 |
CJC |
Rebooted epgsr1 due to the disappearance of 4 files systems. This fixed the problem. |
20100810 |
CJC |
Added the 40TB of storage attached to epgsr3 to the DPM. Used to reinstate a 50TB MCDISK spacetoken (along with some storage from DATA and LOCALGROUP). Emailed Brian Davies about making this official. |
20100806 |
CJC |
Accounting website reports no accounting data for epgr04.ph.bham.ac.uk (noted thanks to SpecInt tagging idea put forward by Pete Gronbech). Checking accounting records on epgr04.ph.bham.ac.uk. |
20100806 |
CJC |
(Software) bonded epgsr3, all four interface connections working. Waiting for replacement host certificate before adding into DPM. Still have to update epgse1:/etc/shift.conf |
20100806 |
CJC |
Birmingham back in the information system, and on gstat. |
20100806 |
LSL |
Reconfigure switch epsw22 to include bond for sr3 on ports 05-08 and future sr4 on ports 09-12. Note: had to use browser IE<=6 or FF<=2 to reconfigure trunking on this DLink switch. |
20100806 |
CJC |
Rebooted Site BDII after it failed to respond to ldap queries. service bdii status checked out ok before reboot - checking logs... |
20100803 |
CJC |
qfeed scripts superseded on epgr04 by the qall script. This reads in a list of prioritised usernames, along with a maximum number of jobs they're allowed to run. The script then runs through all queued jobs and submits as many as it can. The script is invoked as root using the command qalld cfengine/files/qall.priorities d& . |
20100803 |
CJC |
Ran /sbin/start_udev on epgd02, 08 and 12 to fix the /dev/null bug. epgd10 remains unresponsive. |
20100803 |
CJC |
Moved epgd02,08 10 and 12 offline as they have been hit by the overwriting /dev/null bug. |
20100802 |
CJC |
Redeployed epgr04 with a reduced number of pool accounts. |
20100802 |
CJC |
Moved DPM Head Node to SL5.4 machine (keeping same name and IP address). Moved Site BDII to SL5, deploying on VM epgr09. This required changing both node and GIIS information in the GOCDB. |
20100730 |
CJC |
Moved epgd01 offline due to problem remotely rebooting (NFS?). |
20100730 |
CJC |
Scheduled downtime for the whole of Monday 2nd August so that the SE can be migrated to SL5. |
20100728 |
CJC |
Added 1TB to ATLASSCRATCHDISK (at the cost of LOCALGROUPDISK) to avoid being blacklisted. Available space must stay above 1TB! |
20100727 |
CJC |
Added Pheno and DZero support to local cluster and SE. |
20100722 |
CJC |
Changed maui FairShare weighting scheme on epgr05 to more extreme values. Whereas previously the FS group weights were treated as a percentage, there did not appear to be enough of a discriminating factor between jobs. Ops jobs should now have the highest priority, followed by ATLAS/ALICE. LHCb jobs follow next, with all other VOs taking the lowest priorities. |
20100721 |
CJC |
UK CPU and Storage ranks, based on information in the BDII, are made available online. |
20100721 |
CJC |
Installed voms.gridpp.ac.uk and voms.ngs.ac.uk certificates in /etc/grid-security/vomsdir/ on epgr05. This CE shouldn't need the certificates (relying instead on the vomsdir/VO/*.lsc files), but a bug means that it can't deal with VOs that need to authenticate with the Manchester VOMS server. |
20100710 |
LSL |
/egee progress: only u4n085 and u4n116 unconverted to new NAS; both are offline, so restart glong queue. Later: all done. |
20100709 |
LSL |
/egee progress: BB worker nodes u4n081,082,110-128 are on new ep19x NAS. 100-109 are offline awaiting job finish. |
20100709 |
LSL |
BB SAM tests for epgr04 have been showing u4n128:CRITICAL for the WN-CAver test: info files showed certs were version 1.34, but required 1.36. Later received GGUS ticket 59922. In /egee/soft/SL5/middleware/prod/external/etc/grid-security, I moved certificates/ to certificates.yyyymmdd/ and rsync'd afresh from epgr04.ph.bham.ac.uk::certificates. Done on both currently active /egee directory trees. Suggest re-instating g-admin cron job to do this. |
20100709 |
LSL |
Our storage was not being reported by ldap to epgse1 or by Gstat2 on web. Rebooted epgse1 (last night) to remedy. Today found that there were log messages "dpm: failed" and "dpnsdaemon: failed". File dpm/log indicates epaf17 network-down. Restarted network on that, and restarted epgse1, 10am Friday. Query via ldap now showing sensible Size information, absent before. |
20100708 |
LSL |
Around 5pm: on the console, logged on to all physical and virtual grid machines to check if they were down on the network: all were down except epgsr1 and epgsr2. Did service network restart for those. |
20100708 |
LSL |
On epgmo1, truncated that big log file, manually set IP addr, copied iptables from iptables.save, moved /etc/cron.d/cfengine_cron to /root directory for now. |
20100708 |
LSL |
Noticed that most grid servers had no network accessibility. Checked epgmo1 and found it had a 100% cfrun process, and 100% disk full. File /var/log/cfengine_backup.log was 77GB, with messages "You do not have a public key from host epgpe10.ph.bham.ac.uk", "Do you want to accept one on trust (yes/no)", and "Please answer yes or no". File /etc/sysconfig/iptables had been truncated at 4096 bytes, presumably by the disk full condition after an attempted update by cfengine. |
20100708 |
LSL |
Rebooted several BB workers to check that access to new /egee server worked from a fresh image. It does. Also converting a further handful of workers to use the new /egee. |
20100707 |
LSL |
On u4n128 tested new BlueBEAR transtec NAS /egee server, known on the network as ep19x and 10.143.245.103: no problems. |
20100705 |
CJC |
Owing to the Great Pool Account Crisis of 2010 (BlueBEAR Moab hit a hard limit of manageable pool accounts), Camont, CMS, NA48, Southgrid and Zeus have been disabled on the BlueBEAR CEs. |
20100705 |
CJC |
Allowed ssh connections in epgr03:/etc/hosts.allow to gridppnagios.physics.ox.ac.uk on the ALICE VOBox in order to pass nagios tests. As Patricia could already gsissh from lxplus, failing these OPS tests did not affect functionality. |
20100705 |
CJC |
Completing dpm-drain of storage hanging off epgse1 by fixing drain errors ( dpm-delreplica non-existent physical files and rm -f files marked as in the process of being deleted by the DPM). |
20100705 |
CJC |
Copied (cp -a as g-admin) /egee/soft/SL5/middleware and /egee/soft/SL5/local/ onto the new file server, mounted at bluebear4x:/mnt/egee-new . Stopped queues to ensure no more software jobs are submitted and started to copy software directories. |
20100628 |
CJC |
Incremented the kSI2K spec of the CreamCE by 1 to make differentiating between published accounting records on the accounting website easier. |
20100622 |
CJC |
Serena Psoroulas having difficulties authenticating at Cambridge. Watching pilot job at Birmingham - local ID 2820474 on BB. |
20100622 |
CJC |
Changed software tags area /opt/edg/var/info on epgr02/05 so that it's hosted on epgpe04 and NFS mounted on the CEs. This may help to alleviate the writing problems experienced by epgr02. Resubmitted 15.8.0 installation job. |
20100622 |
CJC |
Installed Cream CE on epgr07 to submit jobs to BB. Not in site BDII yet due to problem with firewall on epgr07 preventing connections to bbexport. |
20100622 |
CJC |
Problems with torque server on epgr05. Appears to be confused about which jobs are actually real. Killed 0% efficient jobs and started to manually qrun some of the backlog. Brought epgd24 back online. Accounting stats recorded at 39782 for June. |
20100621 |
CJC |
Yum updated all nodes. No reyaim required. |
20100621 |
CJC |
Removed queued LHCb pilot jobs from epgr02/05. Pilot factory appears to have read a wrong value from the information system and sent too many jobs. Queued pilots are safe to remove because no work has been assigned yet. |
20100618 |
LSL |
BB user g-atl023 (Steve Lloyd) jobs generating 3000 emails per day recently. These get stuck in campus emailer, causing extra load. Solutions are (a) run a working sendmail server on epgr04, or (b) tweak our qsub so that emails are directed to some account of ours. If $HOME/.forward files on BB worked (they don't!) that would have been another option. |
20100617 |
LSL |
Ran ATLAS squid test for Alastair according to tb-support email recipe: success. Some discussion going on in Southgrid as to what to configure as our backup ATLAS squid. |
20100609 |
CJC |
Added ATLAS spacetoken information to ganglia. |
20100604 |
CJC |
Updated BB:/egee/soft/SL5/local/yaim-conf/users.conf , groups.conf and site-info.def to reflect new users and groups. Reconfigured BlueBEAR WN middleware. |
20100603 |
CJC |
Generated ssh keys for g-ali, g-bio, g-cal, g-cam, g-stg, g-fus, and g-ze users on bluebear using the command echo "ssh-keygen -v -t dsa -f /egee/home/$u/.ssh/id_dsa" | sudo -H -s -u $u . New keys copied into /var/cfengine/inputs/repo/ce/sl5_bb_ce/opYtert2hpwTCsaRT9f36grTz on epgmo1 and distributed to epgr04 as /etc/ssh/extra/opYtert2hpwTCsaRT9f36grTz . Added new groups to gshort and glong queues as edguser on epgr04. Waiting for new groups to be added to moab (although jobs do run if the qfeed script is used). Added relevant software areas to BB:/egee/soft/SL5 |
20100603 |
CJC |
Restarted epgr02/5 queues after rebooting epgsr1. |
20100603 |
CJC |
Stopped epgr02/5 queues whilst investigating epgsr1 unavailable problem. Unable to ping epgsr1 from all machines except epgse1. |
20100601 |
CJC |
Updated local UI (SL4/5 Local/BB) to support Calice. This involved downloading the grid-voms.desy.de.11017.pem certificate into $GLITE/middleware/prod/external/etc/grid-security/vomsdir/ . Also updated $GLITE/yaim-conf/vo.d/calice to reflect changes to available WMS. |
20100601 |
CJC |
Updated epgr04 yaim definitions (via epgmo1) to reflect support for Biomed, Camont, Calice, and vo.southgrid.ac.uk. Camont uses names have also been created, but they're not supported yet. Still waiting for ssh keys and sudo access on BlueBEAR to be sorted. |
20100601 |
CJC |
Requested 15.6.9.9 for epgr04. Installation tasks appear to be failing on epgr02/5. |
20100528 |
CJC |
qfeed'ing g-honp14 jobs on epgr04. Check back later to see if this affects the LHCb SAM tests. |
20100527 |
CJC |
Ping'ing epgsr1 from desktop and epgse1 results in 0% packet loss. Checking hostname... |
20100527 |
CJC |
Reduced the number of concurrent ATLAS jobs on BlueBEAR (via the qfeed scripts) to 60. This will allow the LHCb SAM tests to execute successfully. This problem will be fixed properly by the new /egee filesystem, to be installed next week (1st June). |
20100526 |
CJC |
Added ngs.ac.uk support to local SL5 UI. This required the ngs certificate be downloaded from CIC, and installed in /home/lcgui/SL5/middleware/prod/external/etc/grid-security/vomsdir/voms.ngs.ac.uk.25890.pem . Also added support for ngs.ac.uk on BlueBEAR SL5 UI. |
20100526 |
CJC |
Updated, rebooted and reyaimed epgmo1. Check back tomorrow to make sure that accounting is still being updated. |
20100525 |
CJC |
Problem authenticating as NGS user on local UI - is ngs supported? |
20100525 |
CJC |
WNs rebooted and moved back online. |
20100524 |
CJC |
Marked epgd01-12 offline to drain for the purposes of rebooting and installing a new kernel. |
20100524 |
CJC |
Requested 15.6.9.4 on epgr04 for Tim. Also waiting for existing installation process to finish before installing on epgr02/05. |
20100521 |
CJC |
Removed old /opt/edg/var/info/atlas/lock file, dated 12 May, which may be holding up installation processes on epgr02/5. Restarted 15.6.9 installation task. |
20100521 |
CJC |
Released and re-reserved a new ATLASMCDISK spacetoken in an attempt to fix the ATLAS reporting problem. This was only possible because the ATLASMCDISK was already empty! |
20100520 |
CJC |
Requesting Athena 15.6.9 on local cluster. |
20100517 |
CJC |
Birmingham panda queues set back online. Closed related GGUS tickets. srmv2.2 still vulnerable to crashing when querying ATLAS production spacetokens (mcdisk, proddisk etc). SE still reporting invalid size allocations according to Peter Love. |
20100514 |
CJC |
Noted that the srmv2.2 service fails on epgse1 if it is queried with the srm-get-space-metadata command. Added the service to the cfengine grid services script, so it should be restarted every hour if it has failed. Contacted dpm-users-forum@cern.ch for advice, but planning on upgrading head node to SL5 VM. |
20100513 |
CJC |
Restarted cfservd on epgr05 after failure of glite-ce-job-submit . This rules out cfengine as a cause of the job submission problems. |
20100512 |
CJC |
Updated vo.d/atlas to include the BNL VOMS server, as specified on the CIC Portal. |
20100512 |
CJC |
Reconfigured CreamCE following these instructions after yum updating to glite-CREAM-1.42-3.jdk5 . Job submission now possible via glite-ce-job-submit command. This should solve the ISB problems experienced by ALICE. Cfengine appears to break this, requiring yaim to be rerun. Killed cfservd process on epgr05 until this problem is understood. |
20100510 |
CJC |
Added separate camont queue to epgr05 (and reconfigured epgr02 in the process) so as to limit camont jobs to 3 hour walltime. This should allow greater flexibility so that camont jobs can be more quickly throttled in the case of an ATLAS avalanche. |
20100510 |
CJC |
Deleting remaining replicas on ATLASMCDISK spacetoken, as requested. Deletion completed by listing files using the dpm-sql-spacetoken-list-files --st=ATLASMCDISK command, followed by dpm-delreplica . |
20100510 |
CJC |
Noted library problems using UI on Fedora 12. |
20100510 |
CJC |
Rebooted epgsr1 after selected WNs failed to mount software area. Reboot appears to have fixed the problem. |
20100509 |
CJC |
Fixed cfengine Grid Services module ( /var/cfengine/inputs/modules/module:grid_services ), which checks and cleans up rfiod and dpm-gsiftp processes on storage nodes. Contained a bug which meant parent processes were not identified properly. They were then killed off if older than 1 hour. The script should now only kill off slave processes (ie user transfers). Killed servkick processes in order to test overnight. |
20100507 |
LSL |
Those two services on the epgsr servers are stopping every 1 or 2 hours (between 0 and 10 mins past the hour), so /root/bin/servkick now restarts them. Once the problem is understood, that process should be killed. |
20100506 |
LSL |
The ntpd service on epgsr1 had gone missing so the clock was a couple of minutes slow. Restarted. |
20100505 |
LSL |
Both epgsr1 and epgsr2 are losing services regularly: dpm-gsiftp and rfiod needed to be restarted. netstat -ntlp piped through sort was handy for spotting missing services. |
20100504 |
LSL |
The campus DNS is failing for about 5% of lookups (hostname not found for good hosts). Network team contacted. |
20100430 |
CJC |
ALICE note problem using Input SandBox with CreamCE. |
20100430 |
CJC |
/etc/cron.d/grid_services.cron set to run on DPM Head and Pool nodes to check the status of dpm-gsiftp and rfiod and restart if appropriate. This might fix the source of the latest ATLAS errors. |
20100428 |
CJC |
In light of pbswebmon data, restricting camont and ATLAS Pilot jobs, as their efficiency is particularly low. |
20100428 |
CJC |
PBSWebMon now running on epgr08. The original script has been edited to allow monitoring of both epgr05 and BlueBEAR. |
20100427 |
CJC |
Added epgsr1:/disk/f15c to the new DPM Pool DPM001 . Moved all epgse1 filesystems offline and started draining into epgsr1:DPM001 . The recovered storage on epgse1 will eventually become the experiment software area, with epgse1 demoted to providing NFS services. |
20100426 |
CJC |
Rebooting epgsr1 after transfer problems occur. |
20100423 |
CJC |
epgd01 back online. Installed pakiti2-client on all Grid nodes. Now reporting to epgr08. The pakiti server has yet to be properly configured, and should probably be password protected. |
20100423 |
CJC |
Marked epgd01 offline due to package inconsistencies after yum update. |
20100420 |
CJC |
Moved fabric monitoring software (currently Nagios and Ganglia, but soon also Pakiti and PBSWebMon) to epgr08 from epgmo1. If php compromises security, the accounting won't be at risk! |
20100410 |
CJC |
Manual reboot of epgsr2 after it became unavailable on the network. Problem first occurred at 0317 10/04/2010. ATLAS disabled queues and issued a GGUS ticket. No clue in logs as to what the cause of the problem might have been. Also noted that NTP failed to update time from GMT to BST after rebooting. |
20100409 |
CJC |
Removing epgd24 from twin cluster and adding to test cluster for the purposes of NFS performance testing. |
20100409 |
CJC |
Allowing submission to u4n183-4 and u4n127-8 on Bluebear via the epgr04:/root/bin/qgogo script (updated via cfengine). These nodes were previously reserved for maui-assigned jobs, but the nodes can be better employed by allowing ATLAS jobs to run. |
20100409 |
CJC |
Re-enabled nodes epgd23-24, and ensured that 8 ATLAS production jobs were running per node. ATLAS software area remounted on epgd24 with the actimeo=60 option, in an attempt to reduce the number of getattr calls. Draining epgd20-22 in order to test the effect of varying the block size. |
20100409 |
CJC |
Draining nodes epgd23-24 to test NFS mount settings. |
20100408 |
CJC |
Black holes detected (and fixed) on epgd15 and epgd22 - nodes not able to scp files back to CE. |
20100406 |
CJC |
Removed and recreated all pool accounts on epgr05. This may fix the ops problems with running jobs on the CreamCE. |
20100406 |
CJC |
Updated DESY vomscert on CEs and DPM head node. |
20100401 |
CJC |
Reducing the number of ATLAS Production jobs running on local twins in order to let pilot test jobs run. |
20100401 |
CJC |
Updated all iptables to accept connections from 147.188.128.127 , the University network time server. Corrected date and time on epgsr2 and epgd22. epgsr2 had not updated after change to BST, and this might be the source of the ATLAS file transfer problems. |
20100401 |
CJC |
Allowing ssh connections on ALICE VOBox from 137.138. (lxplus) and 128.142. (CERN ops nagios). Updated /etc/hosts.allow to reflect this. |
20100401 |
CJC |
Installing rootkits on all nodes and broadcasting root password on university email lists (only joking - April Fool :D) |
20100331 |
CJC |
Noted error "Pinging service ClusterMonitor ... The service is running at epgr03.ph.bham.ac.uk:8084, uri ClusterMonitor ... connect: Connection refused" on epgr03 (ALICE VOBox). |
20100329 |
CJC |
Disabled epgr02 queues for the purposes of upgrading worker nodes to glexec_wn. |
20100326 |
CJC |
Re-enabled queues on epgr04 and removed Draining status. |
20100326 |
CJC |
Rebooted epgr04 now that the BB:/egee filesystem has been fixed. Submitted a GGUS Ticket regarding the rogue SAM Nagios tests coming from samnag010.cern.ch in the SW Cloud. |
20100326 |
CJC |
Tried removing and redeploying pool accounts using cfengine module. Jobs failed because glite software maintained references in /etc/grid-security to old pool account names. |
20100326 |
CJC |
Added epgd17-24 to the local cluster, bringing the number of job slots to 192. Increased the ATLAS maui quota and reyaimed epgr02. Because this machine is still a 64 bit VM, some of the library paths in /opt/glite/etc/lcas/lcas.db and /opt/glite/etc/lcmaps/lcmaps.db were wrong. These are now corrected automatically by cfengine after running yaim. |
20100324 |
CJC |
Job submission now successful on both CAGE CEs. |
20100322 |
CJC |
Job submission on epgr08 currently fails. |
20100322 |
CJC |
Updated epgsr1:/etc/exports (via epgmo1) so that experiment software areas can be mounted on epgr07 as well. |
20100322 |
CJC |
Deploying epgr08 as an lcg-CE to see if the lcg-CE and CreamCE can co-exist. |
20100322 |
CJC |
CreamCE/ARGUS/GLEXEC_wn test bench now accepts dteam and ops jobs. Would like to be able to test the pilot job glexec functionality. |
20100318 |
CJC |
Installed as a test suite - epgr05 -> Cream CE, epgr06 -> ARGUS Server, epgr07 -> GLEXEC_wn. Can't submit jobs yet. |
20100317 |
CJC |
Birmingham starts to fail the GangaRobot WMS tests, because the wrong certificates were copied to epgsr1 and 2 after update to cfagent.conf on epgmo1. Fixed and now waiting for jobs to return to Birmingham. This does not address the problem as to why we are not receiving any panda jobs. |
20100317 |
CJC |
Fixed CreamCE output retrieval problem by allowing incoming and outgoing traffic on the recommended ports. |
20100316 |
CJC |
Noted that CreamCE only reports job complete when the firewall is off => check ports. |
20100316 |
CJC |
BlueBEAR loses /egee filesystem again. Put qhold on remaining queued jobs as edguser on epgr04, and use qdisable on glong and gshort. Move epgr04 to "Draining" state. |
20100316 |
CJC |
Updated kickstart files in epgmo1:/data1/grid/kickstart to reflect the fact that redhat mirror is now held at 147.188.47.108:/disk/11b/home/redhat/ . |
20100316 |
CJC |
Renamed epgsr3 , epgce1 , epgce3 and epgce4 according to LocalGridMachines, and rebooted. Note that this required changes to epgmo1:/etc/dhcp.conf and to /etc/sysconfig/network on each physical machine to be renamed. |
20100316 |
CJC |
Due to DNS problem, flashed /etc/hosts on all grid nodes with the line 147.188.46.8 epgsr1.ph.bham.ac.uk epgsr1 . This should allow grid nodes to communicate with the first pool node (ie epgsr1) while the DNS problem is being resolved. |
20100315 |
CJC |
Noted that although ATLAS Panda jobs submitted from the local UI (ganga 5.4.5) run successfully on BlueBEAR, gangarobot jobs are still failing because of the TAR_WN bug. Soft linked all files in /egee/soft/SL5/middleware/prod/globus/lib into /egee/soft/SL5/middleware/prod/atlas_fix/lib , and prepended the new variable onto the LD_LIBRARY_PATH environment variable in the x509.sh script. This should fix the gangarobot failures. |
20100315 |
CJC |
Added the f14 filesystems into the DPM. Temporarily split the 20TB of space evenly between SCRATCH and LOCALGROUP disks, although it is expected that this space will be dedicated to the TopPhys cache. |
20100315 |
CJC |
Clean install on epgsr3 and epgr06. Redeploying epgsr3 as a WN for the CreamCE, with epgr06 intended to be a glexec/SCAS server installation. |
20100315 |
CJC |
Booted the new epgsr1 and configured as a DPM Pool node. Successfully copied new data onto and off the node. Software directories all exported properly. Not yet network bonded. The new epgsr3 is still network bonded, and so should have the IP Address which is hard coded into the bond0 script changed before booting. |
20100315 |
CJC |
Renamed epgsr1 as epgsr3 and shutdown. Swapped MAC addresses in epgmo1:/etc/dhcp.conf . |
20100315 |
CJC |
Renamed epgsr3 as epgsr1 and shutdown. Ready for downtime. |
20100314 |
CJC |
Moved epgr04 into the draining state so that no more new jobs would be submitted before the scheduled downtime on Monday. Moved all local worker nodes offline for the same reason. |
20100312 |
CJC |
Installed epgsr3.ph.bham.ac.uk , with the intention of reconfiguring as a replacement for epgsr1 on Monday. Glite stack installed, but not configured. |
20100311 |
CJC |
dteam jobs should now run on the CreamCE. Investigating the possibility of SCAS server. OOPs! Only if the firewall is off! |
20100311 |
CJC |
Added the project directive -A lowel01 to qsub script on epgr04. (Via epgmo1:/var/cfengine/inputs/repo/ce/qsub.sh , as this file is periodically copied to the CE!) |
20100311 |
CJC |
Implemented Part 1 of the Grid Backup policy. A cron job on epgmo1 invokes the command /usr/sbin/cfrun -- -D run_backup >> /var/log/cfengine_backup.log 2>&1 , which will run the backup module on all grid nodes (excluding epgsr1 as this is still not under the control of cfengine). This module reads a list of files from /root/cfengine/files/backup.rules , and copies the relevant files (preserving permissions, access times and directory structures) to /root/cfengine/backup/`date +%Y%m%d` . The directory is then compressed, ready for Part 2. This will involve distributing to epgsr1 and BB. |
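The copy-and-compress step might look like the following sketch (backup_part1 is a hypothetical helper reconstructing what the cfengine backup module does, not its actual code):

```shell
# Read one path per line from the rules file, copy each existing path into
# a dated directory preserving permissions, times and directory structure
# (GNU cp --parents), then compress the result.
backup_part1() {
    rules=$1 dest=$2
    mkdir -p "$dest"
    while IFS= read -r f; do
        [ -e "$f" ] && cp -a --parents "$f" "$dest"
    done < "$rules"
    tar czf "$dest.tar.gz" -C "$(dirname "$dest")" "$(basename "$dest")"
}
# e.g. backup_part1 /root/cfengine/files/backup.rules \
#                   /root/cfengine/backup/$(date +%Y%m%d)
```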
20100309 |
CJC |
epgr05.ph.bham.ac.uk advertised as a CreamCE in the Site BDII. Marked as preproduction in the GOCDB. Noted that the GlueCEStateStatus item has the value "Special", and not "Production", which is why jobs are not being matched. |
20100309 |
CJC |
Installed SL5 UI on BlueBEAR. Users simply log on and source /apps/hep/lcgui/lcguisetup . Note that this only works for SL5 so far. Also note that the SL5 installation is dependent on the CRLs managed by the SL5 WN installation, which are stored in /egee/soft/SL5/middleware/prod/external/etc/grid-security/certificates . The relevant X509 variables are set by the /apps/hep/lcgui/SL5/middleware/prod/external/etc/profile.d/x509.sh script, which is created by running the /apps/hep/lcgui/SL5/yaim-conf/post_yaim.sh script after running yaim. |
20100305 |
CJC |
Rationalized epgse1:/disk/f??/vo folder creation after biomed tried writing files to the non-existent /disk/f9a/biomed directory. All supported VOs should now have the appropriate directories on all SE filesystems. |
20100305 |
CJC |
BlueBEAR back online. Clearing "Draining" status from epgr04. |
20100303 |
CJC |
Requested epgr04.ph.bham.ac.uk be removed from the LHCb management system to allow jobs to run at epgr02 (jobs were no longer being submitted because epgr04 was down). |
20100302 |
CJC |
Implemented a simple pbs monitor for Nagios, which detects WNs in the offline and down state. |
20100302 |
CJC |
Updated the SL4 and SL5 UI on the local system. Added support for ILC to both. Note that the SL4 installation (UI_TAR 3.1.44-0) suffers from a bug such that external/usr/lib is appended to the grid environment LD_LIBRARY_PATH variable. This is fixed in the yaim-conf/post_yaim.sh script. The SL5 installation (UI_TAR 3.2.6-0) did not install any .pem or .lsc files in external/etc/grid-security/vomsdir/ . These were manually copied from the SL4.new installation. |
20100302 |
CJC |
Changed epgr04 to Draining status while BB:/egee/ problems continue. Also added epgr04 AT RISK status to GOCDB. |
20100301 |
CJC |
Need to implement x509 fix on UI_TAR installations. Also need to make sure ILC can authenticate on UI. Also need to implement Globus_Port_Range fix for UI. Where do the *.pem files in vomsdir come from in the SL5 installation? They appear to just be there in SL4. |
20100301 |
CJC |
Submitted 69 local jobs from epgr04 to BB as each of the configured users according to the showusers output. Each job runs the BlueBEAR cleanup script, which should remove old files and directories in the BB:/egee/home/ area. It is hoped that this will ease the slow file access problem on BB. Note that this is a temporary measure until the cron jobs are updated on BB! |
20100301 |
CJC |
Upgraded to glite-WN_TAR 3.2.6-0 on BlueBEAR. |
20100301 |
CJC |
Created the scripts BB:/egee/soft/SL5/local/yaim-conf/pre_yaim.sh and post_yaim.sh to be run before and after yaim on BlueBEAR when configuring a new WN_TAR release. These scripts make sure that the X509 environment variables are set and create the gridmapdir directory for the cleanup scripts. |
20100226 |
CJC |
Removed SL4 tasks from BB:/egee/system/cron.d/cronuser.u4n??8 cron definitions. These will have to be reloaded on the cron nodes to stop the tasks from being executed! |
20100226 |
CJC |
Removed WN_TAR release 3.2.4-0 from BlueBEAR. Noted that production system is still 3.2.5-0. Will upgrade to 3.2.6-0 after cleanup scripts investigated. |
20100224 |
CJC |
Submitted a GGUS Ticket regarding the apparent problems publishing APEL data for the past 13 days. |
20100224 |
CJC |
Added Virtual Hosts to epgmo1 webserver. The University locks down port 80, so this is not externally accessible. Port 8888 is externally accessible. epgmo1.ph.bham.ac.uk:8888 will now serve the ganglia pages. epgmo1.ph.bham.ac.uk will serve the config files held in /var/www/html/config . |
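The vhost split might look roughly like this in httpd.conf (a sketch; the DocumentRoot values are assumptions apart from the config directory named above):

```apache
Listen 80
Listen 8888

# Port 8888 passes the University firewall: serve the ganglia pages here
<VirtualHost *:8888>
    ServerName epgmo1.ph.bham.ac.uk
    DocumentRoot /var/www/html/ganglia
</VirtualHost>

# Port 80 is blocked externally: internal-only config files
<VirtualHost *:80>
    ServerName epgmo1.ph.bham.ac.uk
    DocumentRoot /var/www/html/config
</VirtualHost>
```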
20100224 |
CJC |
Noted that there are a large number of H1 GridFTP transfers on epgse1. This would make sense in the context of the larger number of production jobs which have just completed. According to Ganglia, the problem appears to be the load and CPU usage, not bandwidth related. |
20100223 |
CJC |
Yaim Savannah Bug highlights the fact that lcg-CE is not supported on SL4 32bit. Redeploy? |
20100223 |
CJC |
Updated iptables for pool nodes. This should also now allow communication between the pool node and BB IPs. |
20100223 |
LSL |
Noticed during epgsr2 reboot that PXE doesn't function when the bonded interfaces eth0-3 are connected to the group-of-4 trunked switch ports. Observed that epgmo1 receives and responds to the PXE dhcp packets, but epgsr2 PXE doesn't see responses. If important, swap cables such that switch XOR algorithm (LocalGridBonding) chooses eth0 for response. |
20100223 |
LSL |
Noticed epgsr2 RAID disk labels had been assigned wrongly, eg f16c on physical RAID f17, so went offline, backed up /disk/f* to internal disk, re-initialised the RAID file-system labels correctly, restored the disk areas from the backup, and went back online. Note that the LocalGridRaidFormat doc describes the initialisation process. |
20100223 |
CJC |
Started to remove SL4 software areas from BB. |
20100223 |
CJC |
epgr04 failing SAM tests related to CRLs. Removed everything in BB:/egee/soft/SL5/middleware/prod/external/etc/grid-security/certificates , and reran yaim config. This fixed the cert test warning, but not the rm test error. Under Investigation. |
20100223 |
CJC |
Noted that Birmingham has not published any accounting statistics for 12 days. Ran gap publisher on epgmo1 manually. Under investigation. |
20100222 |
CJC |
Ran yaim on epgsr2 . This allowed lcg-cr transfers onto the pool node when the firewall was dropped. Compare with epgsr1 to see which ports need to be open. epgsr2 configured with a bonded network connection, but requires the network cables to be physically moved. |
20100222 |
CJC |
Changed permissions on /home/lcgui/SL5/local/bin-cron/local-fetch-crl to 755 so that cron job can actually download the CRLs (previously failing - permission denied to execute). |
20100222 |
CJC/LSL |
Power supply problem noted on epgd09-16. No data on ganglia for these nodes since Sunday 21st, 2pm. Outage due to a blown fuse. |
20100222 |
CJC |
epgsr2 network booted and started to reinstall. Because epgsr2 left to network boot on epgmo1, it started to reinstall after rebooting following the unbonding action. As the RAIDs were not disconnected, they appear to be formatting as well - this should not be allowed to happen again. |
20100222 |
CJC |
Unbonded epgsr2 and rebooted. This enabled the pxeboot to run (failed for the bonded interface). An interesting question: does the unbonded network connection for eth0 work when connected to the trunked ports on the switch? |
20100221 |
CJC |
Initialised the qfeed script on epgr04 for g-honp09 in the absence of ATLAS production. |
20100219 |
CJC |
Ganglia and dpmmgr UIDs got confused on epgsr2, causing DPM to create directory structures belonging to ganglia. Reinstalling in order to ensure completely fresh setup and avoid future difficulties. Moved ganglia installation to come after lcg installation in cfengine. |
20100219 |
CJC |
Updated all groups.conf so that entries take the form "/alice/ROLE=lcgadmin":::sgm: (was previously "/VO=alice/GROUP=/alice/ROLE=lcgadmin":::sgm: ). The old format was causing yaim to only make entries in /etc/grid-security/grid-mapfile for special (ie production accounts). This caused some intermittent problems on epgse1 during the afternoon. |
20100219 |
CJC |
Rerunning yaim did not help the transfer problems on epgsr2. Noted that other users have tried to write to the disk, so marking as readonly for now. |
20100219 |
CJC |
Tried to copy a file onto epgsr2 by setting all other disks to read only and then using the command lcg-cr -v --vo atlas -d epgse1.ph.bham.ac.uk --st ATLASSCRATCHDISK -l lfn:/grid/atlas/users/christophercurtis/test.sh.0 file:///home/cjc/thesis.tar.gz . The transfer appeared to stall, and in epgsr2:/var/log/dpm-gsiftp/gridftp.log the entry "530 Login incorrect. : Could not get virtual id!" was noted. Checked /etc/grid-security/grid-mapfile and found only production accounts listed. Rerunning yaim. |
20100219 |
CJC |
Network bonded eth0-3 on epgsr2. |
20100219 |
CJC |
Moved /etc/xen/epgr0* VM definitions on VMHosts to /etc/xen/auto/epgr0* . This allows the machines to boot automatically after the host has booted. Changed initialisation scripts to reflect this. |
20100218 |
CJC |
Ran yaim on epgsr2, with some success. Storage not yet online because DPM on epgse1 has not been updated. This should really be done automatically somehow... |
20100218 |
CJC |
Updated ssh permissions on all grid nodes. ssh now only allowed between the login node, Lawrie's desktop and Chris' desktop. No ssh between nodes permitted (with the exception of the required ssh between epgr02 and the twin WNs and epgr04 and BB exports). |
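With tcp-wrappers this kind of lockdown can be expressed as (a sketch; all hostnames below are placeholders, not the real machines):

```
# /etc/hosts.deny
sshd : ALL

# /etc/hosts.allow -- login node and the two admin desktops only
sshd : loginnode.ph.bham.ac.uk desktop1.ph.bham.ac.uk desktop2.ph.bham.ac.uk
```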
20100218 |
CJC |
Brian Davies suggests DATA=25.09T, GROUP=5T, HOT=1T, LOCALGROUP=18T, MC=25T, PROD=3.5T and SCRATCH=12T for the ATLAS spacetoken allocations, assuming that epgsr2 is assigned entirely to ATLAS. |
20100218 |
CJC |
Tim can't download dataset user09.timmartin.105003.pythia_sdiff.MinBiasAthenaV1.AtlOff15.6.1_r1027 from Tokyo. DQ2 complains about a CRL problem. The transfer works at CERN however. Check CRLs on UI. |
20100218 |
CJC |
epgce2 failed to reboot overnight because of a failed bios RAM test. This machine is known to have bad RAM. To avoid the problem again, the bios settings were changed so that the F1 key does not have to be pressed manually on discovering an error. This should allow the machine to continue to boot. |
20100218 |
CJC |
Connected and mounted /disk/f16 and /disk/f17 RAIDs to epgsr2. This required creating mount points on epgsr2 ( /disk/f1?[a-d] ) and adding entries for each filesystem in /etc/fstab . Rebooted. This machine is now ready for configuring as a pool node. It will also require network bonding in the near future. |
20100218 |
CJC |
Fixed the epgr04 gmetric.cron by adding /root/bin to the path. Previously not reporting any running jobs because it could not find the qs command. |
20100218 |
CJC |
Restarted pbs_mom on all WNs. Communication between WNs and epgr02 lost sometime over night. |
20100217 |
CJC |
epgce2 noted to be incommunicado. Requires manual reboot. |
20100217 |
CJC |
Found that xen does not automatically restart domains after a VM host is rebooted. |
20100217 |
CJC |
Prepared epgmo1 for the installation of Storage Pool epgsr2. Linked epgmo1:/tftpboot/pxelinux.cfg/93BC2E25 -> hosts/epgsr2.ph.bham.ac.uk ->/tftpboot/pxelinux.cfg/configs/boot-hd.cfg . Changed bios settings to network boot first. Installed SL5.3, preserving the Dell Utility partition. |
20100217 |
CJC |
Peter Love confirms that the reason for Panda Jobs not running on BlueBEAR is because ATLAS breaks the LD_LIBRARY_PATH variable in tarball installations. |
20100217 |
CJC |
Changed all installation scripts to use Local RPM Repo when first installing/updating a new node. This includes changes to the /var/cfengine/inputs/repo/vm/* scripts on epgmo1, as well as to all kickstart files. This avoids difficulties with the main Scientific Linux Repo (which is currently unavailable). Further changes to repo lists may be made after a node has been installed by cfengine. |
20100217 |
CJC |
Noted that the dpm , dpmcopyd , srmv1 and srmv2.2 services on epgse1 failed to restart after a reboot. Restarted manually. |
20100216 |
CJC |
Deployed epgr05 as a blank VM ready for the CreamCE. epgr06 deployed as a WN for epgr05. |
20100216 |
CJC |
Rebooted epgce4 - this should install SL 5.3 x86_64 and prepare the machine for two VM Hosts. |
20100216 |
CJC |
Added virtual host to epgmo1:/etc/httpd/conf/httpd.conf , listening to *:80 . All normal web connections should now be accepted without having to authenticate using SSL. Authentication still used for https://epgmo1.ph.bham.ac.uk/nagios . Changed /data1/grid/kickstart/* , /var/cfengine/inputs/cfagent.conf , /var/cfengine/inputs/repo/vm/* and /var/www/html/*.ks on epgmo1 to reflect this. |
20100216 |
CJC |
Backed up /opt , /etc , /var and /root on epgce4 to epgsr1:/disk/f15d/epgce4.backup . This will be the final backup before redeploying epgce4 as a VM host. |
20100215 |
CJC |
Downloaded swevo.ific.uv.es.pem into /etc/grid-security/vomsdir on epgr02 to allow fusion jobs to run. |
20100215 |
CJC |
Configured SL5 UI. Edited /usr/local/bin/lcguisetup to call /home/lcgui/SL5/local/lcguisetup.bash if the user is in an SL5 environment. Also added /home/lcgui/SL5/local/bin-cron/local-fetch-crl to eprexa cron jobs so that the CRLs are downloaded every 6 hours. |
20100215 |
CJC |
Installed UI 3.2.6-0 on local system for use on SL5 nodes (ie eprexb). Unzipped UI tarballs into /home/lcgui/SL5/middleware/3.2.6-0 and soft linked to /home/lcgui/SL5/middleware/prod/ . Configured with yaim /home/lcgui/SL5/middleware/prod/glite/yaim/bin/yaim -c -s /home/lcgui/SL5/yaim-conf/site-info.conf -n UI_TAR . Note that this must be done from an SL5 node! Changed permissions of profile scripts in /home/lcgui/SL5/middleware/prod/external/etc/profile.d/* to 755 (previously 744). Copied /egee/soft/SL5/local/bin-cron/local-fetch-crl from BlueBEAR into /home/lcgui/SL5/local/bin-cron/ and executed. |
20100215 |
CJC |
On BlueBEAR, replaced the softlink /egee/soft/SL5/middleware/prod/external/usr/lib64/libldap-2.3.so.0 , which previously pointed to libldap-2.3.so.0.2.31 in the same directory, with one which points to /usr/lib64/libldap-2.3.so.0 . This fixed the ldapsearch error "LDAP vendor version mismatch: library 20343, header 20327". |
20100215 |
CJC |
Updated BlueBEAR WN tarball to release 3.2.5-0. Untarred release into /egee/soft/SL5/middleware/3.2.5-0 and then updated the softlink /egee/soft/SL5/middleware/prod to point to the new release. Ran yaim twice: /egee/soft/SL5/middleware/prod/glite/yaim/bin/yaim -c -s /egee/soft/SL5/middleware/yaim-conf/site-info.def -n glite-WN_TAR and /egee/soft/SL5/middleware/prod/glite/yaim/bin/yaim -r -s /egee/soft/SL5/middleware/yaim-conf/site-info.def -n glite-WN_TAR -f config_certs_userland -f config_crl to configure the WN release and obtain the CRL url files. Added the file /egee/soft/SL5/middleware/prod/external/etc/profile.d/x509.sh , which sets the X509_CERT_DIR and X509_VOMS_DIR variables, because these are not added by the yaim config. Edited /egee/soft/SL5/middleware/prod/external/etc/profile.d/grid-env.sh to ensure that x509.sh is also called. |
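The hand-added x509.sh amounts to something like the following sketch (only the two variable names and the prod path come from the entry; the exact vomsdir target is an assumption):

```shell
# x509.sh -- point the grid tools at the tarball's own certificate areas,
# since yaim does not set these for a WN_TAR install.
MW=/egee/soft/SL5/middleware/prod
export X509_CERT_DIR=$MW/external/etc/grid-security/certificates
export X509_VOMS_DIR=$MW/external/etc/grid-security/vomsdir
```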
20100215 |
CJC |
On BlueBEAR, changed /egee/soft/SL5/local/bin-cron/local-fetch-crl so that it retrieves the current CRLs, and places them into /egee/soft/SL5/middleware/prod/external/etc/grid-security/certificates . This script is executed every six hours by the /egee/system/cron.d/cronuser.u4n??8 cron job. |
20100215 |
LSL |
On BlueBEAR, changed the NFS mount of /egee so that it uses the noatime option, for efficiency, so that simple file accesses do not result in inode re-writes back through the NFS/GPFS system. Observed that the NFS v3 max transfer size is 32768, even if higher value requested. |
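As an fstab sketch (the server name and transfer-size options are illustrative; only noatime and the observed 32768 ceiling come from the entry):

```
gpfs-nfs-server:/egee  /egee  nfs  noatime,rsize=32768,wsize=32768  0 0
```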
20100212 |
CJC |
Removed epgce4 from site BDII definition. Removed references to myproxy services on epgmo1 by 1) Removing all glite*, edg*, bdii* packages. 2) Removing /opt/bdii, /opt/glite, /opt/globus and /opt/edg directories 3) Reinstalling and reyaiming via cfengine. This has removed all myproxy references in the node BDII. |
20100210 |
CJC |
Updated the GridPP voms certificate on the local UI. This is held in /home/lcgui/SL4/etc/grid-security/old-1.28/vomsdir/voms.gridpp.ac.uk.pem . The updated version is available here. The old certificate, which expires on 11/02/2010, has been backed up to voms.gridpp.ac.uk.22812.pem. Also updated certificates on epgr02 and epgse1 in the directory /etc/grid-security/vomsdir/. |
20100210 |
CJC |
Killed qfeed on epgce4 and started it on epgr04 for the user g-atlp13 (Graeme). |
20100210 |
CJC |
Notified ATLAS decommissioning of epgce4 and replacement by epgr04. Moved epgce4 queues offline by editing /opt/lcg/libexec/lcg-info-dynamic-pbs so that push @output, "GlueCEStateStatus: $Status\n"; becomes push @output, "GlueCEStateStatus: Draining\n"; . The command lcg-info --vo atlas --list-ce --attrs 'CEStatus' confirms that epgce4 is not available for jobs. Changed status of epgce4 and epgr04 in GOC DB. |
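The provider edit can be reproduced mechanically, e.g. with sed (drain_provider is a hypothetical helper; run against a copy of the file first):

```shell
# Force the published CE state to Draining by rewriting the perl push line.
drain_provider() {
    sed 's/GlueCEStateStatus: \$Status/GlueCEStateStatus: Draining/'
}
# e.g. drain_provider < /opt/lcg/libexec/lcg-info-dynamic-pbs
```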
20100210 |
CJC |
Wrote a nagios plugin which raises a warning if there are less than 20% of a group's pool accounts left and a critical warning if there are less than 10%. |
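The plugin's threshold logic reduces to something like this (a sketch; check_pool_accounts and its free/total arguments are hypothetical, only the 20%/10% thresholds come from the entry):

```shell
# Nagios convention: exit 0 = OK, 1 = WARNING, 2 = CRITICAL.
check_pool_accounts() {
    free=$1 total=$2
    pct=$(( 100 * free / total ))
    if [ "$pct" -lt 10 ]; then
        echo "CRITICAL: only ${pct}% of pool accounts free"; return 2
    elif [ "$pct" -lt 20 ]; then
        echo "WARNING: only ${pct}% of pool accounts free"; return 1
    fi
    echo "OK: ${pct}% of pool accounts free"; return 0
}
```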
20100209 |
CJC |
Problem with maui/pbs on epgr02 - appears to be unresponsive, even after restarting services. Rebooting machine. |
20100209 |
CJC |
Pilots failing on BB SL4 complain of not finding libglobus_gsi_proxy_core_gcc32dbgpthr.so.0 . This is available in /opt/globus/lib/ on Twin WN. |
20100209 |
CJC |
Fusion and H1 production appear to be failing gatekeeper tests on epgr02. Investigating. |
20100209 |
CJC |
Nagios remote testing implemented. New tests distributed by cfengine. Added gmetric tests controlled by /root/cfengine/files/gmetric.sh via /etc/cron.d/gmetric.cron on epgr02, epgr04 and epgse1. These tests monitor the number of running jobs and the number of GridFTP transfers, making the results available to Ganglia. Tests distributed via cfengine. |
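The job-count half of such a gmetric test might be sketched as follows (count_running is a hypothetical helper; the column position assumes Torque `qstat -a` output, where the state field is second to last):

```shell
# Count jobs in the running (R) state from `qstat -a`-style output.
count_running() {
    awk '$(NF-1) == "R"' | wc -l
}
# Publish to ganglia, e.g.:
# gmetric --name running_jobs --value "$(qstat -a | count_running)" --type uint32
```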
20100208 |
CJC |
Patched bug in cfengine deployment of epgr02 which caused qsub -> qsub.sh -> qsub -> qsub.sh... Each time a new job was submitted, it entered an infinite recursive submission. Also added umask=022 to all shellcommands, which should fix reyaim bug. |
20100208 |
CJC |
Removed egee-NAGIOS , glite-PX and glite-UI packages from epgmo1 - this monitoring information is available elsewhere. Installed vanilla Nagios release with the intention of deploying standard (eg ping, disk usage) and home brew (number of ATLAS production jobs submitted) sensors. |
20100204 |
CJC |
Very slow Athena compile times noted on BB. Investigating. |
20100204 |
CJC |
Athena 15.5.0 test job and SQUID test job submitted to epgr04. If successful, decommissioning on epgce4 will begin. Passed the Athena and SQUID tests. |
20100203 |
CJC |
Re-enabled remote logging on all nodes (except twins). Log messages should be saved both locally and on epgmo1. |
20100203 |
CJC |
Reinstalled epgd16 successfully and moved it back online. |
20100203 |
CJC |
Enabled port 7512 on epgmo1 for the purposes of the MyProxy server. |
20100202 |
CJC |
Graeme's jobs on BB SL4 seem to be failing with the error "Can't locate Globus/Core/Paths.pm in @INC". Investigating. |
20100202 |
CJC |
Problem with lcg-cr onto SE (failed SAM tests). Investigating. Transient error on SE. Keeping an eye on it. |
20100202 |
CJC |
Ganglia and Nagios monitoring installed on epgmo1. Nagios creates a high load on epgmo1. Consider reducing polling frequency or upgrading to a better machine! |
20100202 |
CJC |
Fixed all grid kickstart files to connect to https://epgmo1.ph.bham.ac.uk/ack.php with the --no-check-certificate switch. |
20100202 |
LSL |
Following actions noted yesterday on BB grid, added sharutils and blas-devel on BB suggested by Chris. Today, after reviewing SL5WN, added PyXML.i386 from 32-bit distro (64-bit already present, so may not be important, but harmless!). |
20100202 |
CJC |
Reinstalling glite-WN on epgd16 (having put it offline first!) after ATLAS pilot job could not find globus-url-copy. |
20100202 |
CJC |
epgse1 failing gsirfio lcg-gt SAM tests. Removal of --legacy from epgse1:/opt/glite/etc/gip/provider/se-dpm caused gsirfio support to be appended to ldap output. As this is not supported, BHAM failed SAM tests. --legacy support reintroduced. Awaiting the result of the savannah bug. Tested installation of savannah bug rpm - breaks xrootd support and fails to fix gsirfio problem, although legacy warnings do disappear! |
20100201 |
CJC |
Re-enabled CMS jobs on epgr02 and epgr04. |
20100201 |
CJC |
Removed "--legacy" switch from dpm-listspaces call in epgse1:/opt/glite/etc/gip/provider/se-dpm . This should fix the "GlueSACapability has unknown value" gstat2 warnings. This edit is managed by cfengine, so any subsequent reyaim should be fixed by cfengine. The consequence of this is that the SE appears to have dropped out of the lcg-infosites output. Is this important? |
20100201 |
LSL |
Following on from my actions on 20100118, for SL5 on BB, installed compat-glibc-headers.i386 from 32-bit distro, missing from 64-bit distro. Asked Alan to update SL5 kernel for BB grid worker nodes. |
20100201 |
CJC |
Changing "/C=FR/O=CNRS/CN=GRID-FR" to "/C=FR/O=CNRS/CN=GRID2-FR" in the vo.d/biomed file appears to fix biomed authentication failure errors in globus-gatekeeper.log |
20100201 |
CJC |
dpm-listspaces on epgse1 shows that the ATLAS pool is not using any space, which appears to be contrary to similar output from Oxford and Glasgow DPM output. Suspect this may be the root cause of GlueSAUsedOnlineSize < 0 in the Gstat-prod monitoring. Emailed Gridpp-storage list. |
20100131 |
CJC |
Added qsub.sh setup to epgr02 to ensure that NGS jobs get assigned to a specific queue (previously not running). |
20100131 |
CJC |
Added H1 production and lcgadmin roles to QUEUE_ENABLE variables in epgr02 and epgr04 site-info.def files. This will allow H1 production jobs to run (previously failing). |
20100128 |
CJC |
Birmingham is on the Ganga Blacklist. The ANALY_BHAM Panda jobs are failing on BB, but also the latest WMS job went to epgr04 and failed when it couldn't use DQ2... Review the situation on Monday after installations have progressed. |
20100128 |
CJC |
epgce4 seems to fail 14.5.0 Pilot jobs consistently. Problem with installation after epgr04 mix up? If epgr04 is up and running soon, epgce4 will be decommissioned anyway! epgr02 jobs seem to be more hit and miss - sometimes they work, sometimes they fail the md5sum test. |
20100128 |
CJC |
Added SITE_OTHER_WLCG_NAME="UK-SouthGrid" to site-info.defs of all CEs and site BDII in order to pass gstat2 tests. |
20100128 |
CJC |
SQUID test on epgr04 failed because latest DDM tools (ie DQ2) are not available. Requesting the 1.32 installation be fixed. |
20100128 |
CJC |
Submitted one ATLAS pilot job to Birmingham with two subjobs. One subjob succeeded, the other failed due to a mismatching md5sum. Submitting job again for reproducibility. Also submitting with other datasets. Update: Other datasets have failed. Either there are lots of corrupt files on the SE (!), or there is a problem with the transfers, or there is a problem with the md5sum installation on the WNs. |
20100127 |
CJC |
Increased the MAXPROC limit to 100 for both camont and ALICE production - first come, first served whilst ATLAS is down! |
20100127 |
CJC |
Added logrotate directive to cfengine to ensure all logs are kept for 366 days. |
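As a logrotate sketch (the file list is illustrative; only the 366-day retention comes from the entry):

```
/var/log/messages {
    daily
    rotate 366
    compress
    missingok
}
```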
20100127 |
CJC |
Added fusion VO to epgr02 + WNs. Installed lcg-vomscerts-desy to enable H1 and Zeus jobs to run. |
20100126 |
CJC |
Installed the GridPP Nagios suite on epgmo1. For this to work, SELinux needed to be moved to permissive mode. It also only works on port 80 (not 8888), so this has broken Ganglia (which runs on port 8888) and may have knock-on effects... keep an eye on the SAM tests! |
20100126 |
CJC |
Manual yum update of epgsr1 and a reboot. This might fix the ATLAS transfer problems. |
20100126 |
CJC |
yum updating all SL5 nodes in light of new security bug. Should really develop a rolling reboot script for use in cfengine to safely reboot Twins... |
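The rolling-reboot idea mentioned above could be sketched roughly as follows. This is not an existing site script: the node names are placeholders and the torque/ssh calls are left commented out as assumptions.

```shell
# Sketch: drain each twin, reboot it, wait for it to return,
# then bring it back online before touching the next one.
handled=""
for node in epgd15 epgd16; do
    echo "draining $node"
    # pbsnodes -o "$node"                       # mark offline in torque
    # ssh "$node" /sbin/shutdown -r now         # reboot
    # until ping -c1 "$node" >/dev/null 2>&1; do sleep 30; done
    # pbsnodes -c "$node"                       # back online
    handled="$handled $node"
done
echo "handled:$handled"
```

Doing the nodes strictly one at a time keeps the twins' shared chassis from losing both halves at once.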
20100125 |
CJC |
HepSpec06 (32 bit, SL5) results. 9.61 for the Twins, 7.93 for BlueBEAR. Added to CE Information System. |
20100122 |
CJC |
Test upload of a file to the SE using xrootd ( xrdcp ~/test.txt root://epgse1.ph.bham.ac.uk//home/alice/test.txt ) failed with the error: Last server error 3010 ('Opening path '/home/alice/test.txt' is disallowed.') Error accessing path/file for root://epgse1.ph.bham.ac.uk//home/alice/test.txt |
20100121 |
CJC |
Moved epgd16 offline for the purposes of running HepSpec. Running the HepSpec test as described here. The node has all normal grid services running, but should not receive any jobs. The node has also been disabled in the epgmo1 cfengine cfrun.hosts list, so no large file transfers should take place. |
20100120 |
CJC |
Problem on the SL5 BB WNs - grid-env.sh keeps getting overwritten, resulting in the x509 variables not being set (the x509.sh script must be executed as well). Removed the offending yaim call from the cron job definition, but this will require that u4n108, u4118 and u4128 be rebooted. |
20100118 |
CJC |
Moved the 2009 Diary Entries here. |
20100118 |
LSL |
See previous item for BlueBEAR: the following packages (both archs) were required: compat-db compat-libf2c-34 compat-libgcc-296 compat-openldap compat-readline43 ghostscript giflib openmotif22 openssl097a tk. |
20100118 |
LSL |
Review packages in the SL5 image of BlueBEAR using doc https://twiki.cern.ch/twiki/bin/view/LCG/SL5DependencyRPM, specifically packages required by metapackage HEP_OSlibs_SL5-1.0.2-0.x86_64.rpm linked from that doc. Also packages listed in doc https://twiki.cern.ch/twiki/bin/view/Atlas/SL5Migration, under heading LCG Applications Metapackage, for ATLAS. |
20100118 |
CJC |
RFIO problem detected in 14.5.0 sample job 702 submitted to epgr02. |
20100118 |
CJC |
Reimplemented Ganglia monitoring on epgmo1. Deployment of Ganglia is under the control of cfengine (and therefore nodes epgsr1 and epgce4 have not yet been added to the monitoring). |
20100115 |
CJC |
Reimplemented the ATLAS and LHCb pilot roles on epgr02 + WNs, as they were lost during the SL5 conversion. These are now completely maintained by yaim. |
20100114 |
CJC |
yum update on epgmo1 briefly broke the apel publishing. Fixed by adding the line JAVA_HOME="/usr/java/latest" to /etc/tomcat5/tomcat5.conf. |
20100114 |
CJC |
New BB lcg-CE made to work by ensuring local pool account locations match those on BB. Created a softlink /egee/home -> /home and edited /etc/passwd to reflect this. |
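The fix above amounts to something like the following. This sketch works in a scratch directory with a made-up pool account; on the real CE the link was /egee/home -> /home and the edit was to the live /etc/passwd, and the exact direction of the home-dir rewrite is an assumption.

```shell
set -e
work=/tmp/bb-passwd-demo
mkdir -p "$work/egee"
# on the CE this was: ln -sfn /home /egee/home
ln -sfn /home "$work/egee/home"
# made-up pool account entry; rewrite its home dir to the BB-side path
printf 'atlas001:x:5001:5001::/home/atlas001:/bin/bash\n' > "$work/passwd"
sed -i 's#:/home/#:/egee/home/#' "$work/passwd"
cat "$work/passwd"
```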
20100113 |
CJC |
PBS communication broken between epgr02 and the worker nodes after autoupdate. The WNs attempted an auto update, but failed due to a package inconsistency in the glite-WN_ext repo. All WNs updated via cfengine and rebooted. |
20100112 |
CJC |
Rebooted epgr02 after lcg-CE update. |
20100112 |
CJC |
Auto yum update on epgce4 upgraded the lcg-CE, which broke torque submission. Turned off yum updates ( chkconfig yum off; chkconfig --list yum ) and then reinstalled Lawrie's moab tools tarball. Also restored the qsub.bin/qsub.sh setup using the script held on epgr04. Local job submission works, checkjob works (unlike epgr04). Waiting for a grid test job to return positive. |
20100107 |
CJC |
Freed 1.4T of dark ATLAS data from epgse1. |
20100106 |
CJC |
BlueBEAR jobs running again. |
20100106 |
CJC |
Moved epgd15 offline for the purposes of benchmarking. |
20100106 |
CJC |
Noted that all jobs (including local jobs) are queued on BlueBEAR. Emailed Alan and Aslam. |