
Local Grid Journal 2009

This is a reverse order diary of events, without retrospective editing (so keep it raw and short, max ~ 3 lines). See other pages like LocalGridInternals for more carefully considered documentation.

20091216 CJC Upgraded ATLAS Squid to SL 5.3.
20091215 CJC Major work on the AliceBox.
20091215 CJC Added VOBox to Site BDII and re-ran yaim.
20091214 CJC Editing the file epgce4:/opt/lcg/libexec/lcg-info-dynamic-pbs allows for a static queue walltime to be set, but this appears to be global. Would be preferable to set individual times for each queue.
20091213 CJC Added mysql package to epgr02. Used command mysql --pass="MYSQL_PASSWORD" --exec "grant all on accounting.* to 'accounting'@'$NEWCE_HOST' identified by 'APEL_PASSWORD'" to enable accounting access on epgmo1.ph.bham.ac.uk.
20091211 CJC Added long and short queues to epgr02. Removed sl5_test queue.
20091211 CJC Deployed epgr03 for the purposes of the ALICE VOBox and epgr04 for a temporary CE to harvest SL5 Athena installations (machines yet to be configured).
20091211 CJC epgce3 files backed up to epgsr1:/disk/f15d/epgce3.backup before OS reinstallation.
20091211 CJC Converted epgd12-15 to SL5 and added to epgr02 cluster. epgd16 will follow once decommissioning procedures for epgce3 have been confirmed.
20091209 CJC Athena test submitted via grid job to epgr02, epgce3, epgce4, t2ce05.physics.ox.ac.uk and svr026.gla.scotgrid.ac.uk. Glasgow and Oxford passed without problem, and provide a good output comparison. epgce3 passed without a problem. epgr02 complained it could not read a DB file from the SCRATCHDISK - rerunning job to confirm problem. epgce4 passed after updating squid firewall.
20091209 CJC Converted epgd09-11 to SL5 and added to epgr02 cluster.
20091208 CJC Added /var/log/acpid to cfengine to ensure permissions are always 640.
20091208 CJC Killed off a large number of H1 gridFTP transfers to the SE as ATLAS and ops SAM tests were failing. Number of simultaneous H1 jobs on BB to be reduced to 60.
20091208 CJC Updated the firewall on epgr01 to allow Squid Traffic (port 3128). Athena test output looks healthier.
20091208 CJC epgd09-11 moved to offline status for the purposes of reconfiguring as SL5 nodes once running jobs have completed.
20091208 CJC Removed the lock file epgce3:/opt/edg/var/info/atlas/lock (file was created in early June, around the time of the last successful ATLAS install on epgce3). PFC installation successfully completed on epgce3 soon after.
20091204 CJC epgr02 reinstalled and accepting jobs. Yaim run locally this time rather than via cfengine. MAX PROC for ATLAS increased to 40 on BB.
20091204 CJC Resource BDII broken on epgr02. This node is no longer visible to the outside world through the information system. Permissions messed up on info-providing scripts. Reinstall tomorrow.
20091203 CJC Reinstalling epgr02 in an attempt to fix the job submission problem (epgr02 noted to have passed SAM tests on 02/12/09).
20091203 CJC Enabled ntpd on epgr02 and set to autostart on reboot. Rebooted machine. This may fix the job submission problem which dteam, ops, atlas and lhcb have seen in the SAM tests.
20091203 CJC Overnight SAM and Ganga tests successful, which proves the SE continues to work as expected. Requesting H1 try hammering Birmingham again to see if this has had an effect!
20091203 CJC Moved the MySQL DPM database to epaf17. Reyaimed epgse1 to reflect the change. SAM tests post 0217 should be successful. Submitted two test jobs to epgr02 and epgce3.
20091202 CJC Added openldap-clients package to SL5 nodes after SAM tests showed that WNs could not use the ldapsearch command.
20091201 CJC epgr02 moved to production node status in GOCDB.
20091201 CJC Moved nodes epgd01-08 inclusive offline for the purposes of converting to SL5 (epgd09-16 will follow once the remaining jobs on epgce3 have completed). epgr02 reinstalled with 3 vcpus and a copy of the maui.cfg from epgce3. epgr02 to be added temporarily to the production list in the GOCDB. Yaim rerun on epgce3 to reflect reduced nodes.
20091201 CJC (00:21) User epgse1:hon079 has paused jobs involving Birmingham data transfer until later this morning. SE seems more responsive and ATLAS DQ2 transfer was much faster. If the SAM tests are also successful overnight this will confirm that the gridftp processes were too much of a drain on the DPM head node. Investigating the possibility of separating services.
20091130 CJC Ensured that NTP service on epgse1 will auto start with chkconfig ntpd on. Killed H1 jobs on epgce3, which initially reduced the number of globus-gridftp transfers belonging to user epgce3:hon015, but then they rose again. Considering temporary ban of user.
20091130 CJC After Lawrie's reboot, epgse1 still shows wrong time. mysql appears to have calmed down - only 50-60% of CPU time used now. All memory used, into swap. Restricted the number of H1 jobs on epgce3 to 4 in an attempt to reduce user epgce3:hon072's demands on the SE.
20091130 CJC epgse1 was running normal kernel (non smp). Changed /etc/grub.conf to point to smp kernel and rebooted.
20091130 CJC Rebooted epgse1 after failing SRM SAM tests. Machine appeared sluggish and under heavy load. Noted that only 2GB of memory is available on this machine, which was entirely in use (along with a portion of the swap) before the reboot. A higher spec machine may be required in the near future.
20091127 LSL New power supply for f9 RAID arrived from Transtec. Fitted hot-swap without needing to take any systems down - OK.
20091127 CJC epgce3 and epgce4 failing ATLAS SAM tests due to SQUID. Running Deployment tests again.
20091126 LSL RAID f9 has a failed power-supply, one of two in a redundant formation, so it's continuing to work. A replacement PSU will be ordered asap.
20091126 LSL BB jobs now should be going through normally, following a period of a week where the /egee file-system response has been very poor at times. A "ls /egee/home" command now has a sub-second response, rather than taking several minutes. The fix: the BB team killed off Chemistry nwchem jobs that were using MPI over Ethernet rather than Infiniband, and possibly heavy GPFS i/o.
20091125 CJC Updated lcg-CE and glite-TORQUE_server packages and re-ran yaim on epgce3 in an attempt to get jobs running. Jobs seem to submit but then stay queued. The only clue in the maui log is "rc: 15057".
20091125 CJC Installed Fusion certificate on epgse1 and epgsr1 as /etc/grid-security/vomsdir/swevo.ific.uv.es.crt. This may fix the [[https://gus.fzk.de/ws/ticket_info.php?ticket=53543][Fusion GGUS Ticket]].
20091125 CJC Copied epgsr1:/etc/grid-security/vomsdir/grid-voms.desy.de.8119.pem onto epgse1. This file used to exist (cf 20090624), but was not replaced after the SE restoration. This is expected to be the solution to Zeus GGUS ticket.
20091125 CJC Changed the permissions of all software directories mounted on epgce3 WN to 775, which should ensure that ATLAS can install new software versions once again. Requested another PFC installation on epgce3
20091125 CJC Rebooted epgce3. All ATLAS jobs were queued, but only Steve Lloyd's were running (presumably on the short queue). After the reboot, a large number of ATLAS jobs started running. This may fix the failed SAM tests.
20091121 CJC Requested PFC installation on epgce3. This has already been installed on BlueBEAR WNs. The initial request failed when it came to writing the software tag, so the task was restarted...
20091121 CJC Deployed epgr02 as SL4.6 64 bit CE on epgce1. This will be used to manage the SL5 installation. This deployment has no virt-cpu settings, so it may suffer from being too slow (something to keep an eye on). This machine is under the control of cfengine.
20091120 CJC SE fails a number of OPS SAM tests. This is probably due to heavy load on the machine when I ran several dpm-disk-to-dpns processes simultaneously. These processes have terminated and access to the SE confirmed with lcg-cp/lcg-del and a dq2-get.
20091120 CJC Updated ATLAS space tokens. DATADISK now holds up to 25TB data, which will allow us to meet the threshold set by ATLAS for beam data. Other spacetokens rejigged to account for this (see Graeme's email).
20091120 CJC SQUID confirmed to work with the RAL Frontier service using fnget.py. Previous problems were confirmed to be down to a problem at RAL rather than Birmingham.
20091119 CJC Only 2 BlueBEAR nodes online, which appears to be causing a number of failed jobs on epgce4.
20091119 CJC Added VO directories to epgse1:/disk/f*/ in an attempt to fix the SAM put test errors. A re-yaim does not create these directories, so their source on other file systems is still a mystery. Note that yaim has a limited role on the DPM head node because it appears to be limited to defining only one pool. Added /disk/f9c to the dpmPart pool.
20091119 CJC Deployed ATLAS SQUID Server on epgs01.ph.bham.ac.uk. Waiting to be authenticated by RAL (sent email to atlas-uk-comp-operations@cern.ch). The recommended fnget.py test works for the PIC and BNL Frontier servers.
20091119 CJC epgr01 deployed as a 64 bit SL 4.6 machine with 2GB RAM and 50 GB hard disk on epgce1. This machine will eventually host the ATLAS SQUID. This will require much more hard disk space (200 GB+), so the SQUID should be hosted on an NFS directory on epgsr1:/disk/f15d.
20091119 CJC Pete Gronbech highlights BDII problem on Gstat. This is similar to a previous problem.
20091119 CJC Added Virtual Machines to epgmo1:/etc/dhcpd.conf file. Check LocalGridMachines for details on hostnames, MAC and IP addresses.
20091113 CJC Tests show that lcg-cp fails to write files to a filesystem if dpmmgr doesn't already own a vo directory on that filesystem. Question remains: who is responsible for creating those initial directories - manual or yaim? Would need to confirm directory structure on all epgse1 disks. epgsr1 dedicated solely to ATLAS at the moment and relevant filesystems look to be complete. There are no directories on /disk/f15c, but this is earmarked for ALICE.
20091113 CJC Used the command dpm-drain to drain epgse1:/disk/f9a. Removed from DPM with command dpm-rmfs. Checked remaining directory structure for files and then cleared them from the disk. Added filesytem back in to the dpmPart pool with dpm-addfs. This may solve the problems ops were having writing to this disk.
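  A minimal sketch of that drain/re-add sequence, assuming the standard DPM admin tools (the exact options used were not recorded):
    dpm-drain --server epgse1.ph.bham.ac.uk --fs /disk/f9a      # move replicas off the filesystem
    dpm-rmfs --server epgse1.ph.bham.ac.uk --fs /disk/f9a       # remove it from the DPM
    # clear any leftover files/directories on the disk by hand, then re-add it:
    dpm-addfs --poolname dpmPart --server epgse1.ph.bham.ac.uk --fs /disk/f9a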
20091113 CJC Changed owner and group of epgse1:/disk/f3g/alice to dpmmgr. Repeated for babar directory as well. This might solve the problem of alice not being able to write to /disk/f3b/ on the SE (softlinked to f3g).
20091110 CJC Added vm.mmap_min_addr = 4096 to /etc/sysctl.conf on all local nodes, and issued the command /sbin/sysctl -p to ensure that changes are registered. This will guard against future root null pointer weaknesses. A similar fix has been implemented on the BlueBEAR Grid Nodes.
20091109 CJC Requested a HammerCloud test against Birmingham. Private grid jobs have been shown to complete successfully - this should demonstrate that the HammerCloud mechanism is working.
20091108 CJC Rebooted epgd08, 11, 12 and 16 after software areas disappeared.
20091107 CJC Updated kernel-devel and ensured kernel-module-xfs installed on epgsr1 after ATLAS complained their spacetokens were "full" (epgsr1 file systems were actually not available according to dpm-qryconf on epgse1). Rebooted both epgse1 and epgsr1, which resulted in the file systems being available and writeable. Failed one SAM test while epgse1 rebooted.
20091106 CJC Updated lcg-CA, kernel and kernel-smp on all nodes. This leaves only the Bluebear WNs with an unpatched kernel.
20091106 CJC epgce3-4 failing SRM tests (specifically lcg-rm). This may be just down to timeouts on the SE due to the ATLAS spacetoken merging, but a test job has been submitted to ensure that this action is possible.
20091105 CJC DPM database restored on epgse1. The only complication is that /disk/f3a (and f3b and f3c) was a soft link to /disk/f9f (and f9g and f9h). Soft links temporarily restored, with the intention of "draining" the f3 disks in the DPM and replacing them with the f9 disks. Manual drain of ATLASDATADISK restarted.
20091105 CJC Reinstalled SL 4.6 on epgse1 ( --clearpart --all worked in the kickstart file!). Reinstalled glite and DPM software. Currently restoring the MySQL database.
20091103 CJC Kickstart rejected clearpart --drives=hda option for restoring epgse1. Transferring backup files to /disk/f15d with the purpose of trying clearpart --all
20091102 CJC epgce2 reconfigured as the site BDII. Operating system restored to SL 4.6 i386, running kernel version 2.6.9-89.0.15. The machine is under the control of the epgmo1 cfengine. Note: There is a memory fault on this machine, so it must boot with the option mem=1024M. There is a separate config/kickstart chain for this ( sl46-i386-epgce2).
20091030 CJC cfengine removed from all nodes except those known to be safe (epgce1, epgmo1, epgd01). Stray update has caused SL5 64 bit libraries to be installed on epgse1 (SL4 32 bit), causing yum to fail to work. This machine will have to be reinstalled, but DPM must first be backed up. Intend to follow these instructions. Will try to configure epaf17 as DPM head node first...
20091030 CJC Reboot of epgce2 fails. Birmingham marked as down in GOC DB until manual reboot tomorrow.
20091030 CJC A stray cfengine update causes epgce2 to be "updated" into an lcg-CE, thus losing its site BDII status. The update installs SL5 binaries and breaks yum, so rolling back becomes difficult. Starting reinstall of node from scratch.
20091030 CJC Restricted ack.php callback script on epgmo1 to 147.188.46.X. This should stop Google Robot from setting up a boot script!
20091029 CJC ATLASHOTDISK drain complete. Started to drain ATLASDATADISK.
20091028 CJC ATLASGROUPDISK and ATLASPRODDISK moved to epgsr-f13. ATLASTMP, which contains the old ATLASHOTDISK files, is currently draining into epgsr-f13. This only leaves ATLASDATADISK to be moved.
20091027 CJC Started to drain ATLAS spacetokens residing on the epgsr-f12 pool. Files in a particular spacetoken are listed using dpm-sql-spacetoken-list-files --st ATLASPRODDISK. These files are then replicated to the ATLASTMP spacetoken and the originals deleted, along with the original spacetoken. A new space token is created using dpm-reservespace on the epgsr-f13 pool with the name of the original spacetoken. The files on ATLASTMP are then dpm-replicated into the new spacetoken, and the replicas in ATLASTMP deleted.
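  A rough outline of the per-token move, assuming the standard DPM admin commands (the size, lifetime and group below are illustrative, not recorded):
    dpm-sql-spacetoken-list-files --st ATLASPRODDISK        # list files in the old token
    # dpm-replicate each listed file into ATLASTMP, then delete the originals
    dpm-releasespace --token_desc ATLASPRODDISK             # drop the old reservation
    dpm-reservespace --token_desc ATLASPRODDISK --poolname epgsr-f13 --gspace 5T --lifetime Inf --group atlas/Role=production
    # dpm-replicate the files from ATLASTMP into the new token, then delete the ATLASTMP copies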
20091027 CJC Overnight drains of /disk/f12a and f12b returned a total of 11 "No such file or directory", resulting in the drain being incomplete. Question about the relevance of this sent to the storage email list.
20091026 CJC Started draining /disk/f12a on SE. Disks f12b and f12c will follow. The empty filesystems will be split between the epgsr-f13 and ATLASTMP pools. Files remaining on the f12d disk will then be moved manually, via ATLASTMP.
20091026 CJC Amended twin_wn cfengine definition to install i386 libraries in addition to x86_64 versions, which are not installed by the SL5 Dependency RPM. The yum repo file has been made available on the epgmo1 web server in order to aid automated installation.
20091020 CJC BDII service on epgce1 hit by glite 3.1 r57 bug (here). This may have been the reason for the failed ATLAS install jobs. Versions rolled back, and yum automatic updates turned off.
20091020 CJC ATLAS software jobs failed when trying to register software tags. This may be due to the other nodes being available and already having software tags published. Setting up epgce1 as a temporary lcg-CE to handle the install jobs.
20091020 CJC SL5 test grid job looks more promising. Previous jobs failed to source grid-env.sh on arrival at the worker node. Copying the grid-env.sh scripts into /etc/profile.d/ (as opposed to just soft linking) appears to fix this.
20091020 CJC Re-ran yaim on epgce2 to reflect changes to GlueSiteOtherInfo requested by Kashif. Update: epgce2 site-info.def is very old and contains references to resources like epgce1. It has been quickly modified to reflect the existence of epgce4. A full update will follow.
20091019 CJC Added /disk/f15a,b,c to dpm as member of the ATLASPool DPM Pool. Plan to eventually split DPM into three pools - ATLAS, ALICE and Others
20091019 CJC ATLAS installing Athena 14.5.0 on epgce3 for SL5 via epgd01 (link). If this is successful, all worker nodes on epgce3 will have access to SL5 build athena versions.
20091015 CJC CGI switch for preventing endless pxe reinstall loops on epgmo1 doesn't work. Replaced with a php script: http://epgmo1.ph.bham.ac.uk:8888/ack.php
20091013 CJC Accident with cfengine running on epgmo1 distributes ssh keys to all grid machines and breaks communication between epgce3 and its worker nodes. This appears to be fixable with the commands rm -f /etc/ssh/ssh_known_hosts; /opt/edg/sbin/edg-pbs-knownhosts; service sshd restart
20091008 CJC epgd01 running as SL5.3 node. No torque, LCG or other software packages installed yet. Looking at GridFabricManagement software.
20091006 LSL I have taken a full backup of BlueBEAR /egee file-system and copied it to PP system in directory /disk/11a/work/ in case of catastrophic GPFS failure; gzipped tar size is 139 GB.
20091006 CJC SE fails the SAM tests in the early morning due to H1 user stressing the system. Under investigation...
20091006 CJC Attempting to move epgd01 offline in preparation for SL5 installation. Restored maui.cfg from maui.cfg.20090930 - yaim on 01/10/09 appears to have rewritten config file!
20091001 CJC HEPSPEC 06 results made available on site BDII. Added the lines CE_SI00=XXXX, CE_CAPABILITY="CPUScalingReferenceSI00=XXXX" and CE_OTHERDESCR="Cores=192,Benchmark=Y.YY-HEP-SPEC06" to site-info.def on epgce3 and 4, where Y.YY = 7.81 on epgce4 and 9.1 for epgce3. XXXX is given by Y*1000/4.
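  Worked through with those numbers: epgce4 gives 7.81 x 1000 / 4 = 1952.5, so CE_SI00 would be about 1953; epgce3 gives 9.1 x 1000 / 4 = 2275 (the 1000/4 = 250 factor is the usual HEP-SPEC06 to SpecInt2000 conversion).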
20090929 LSL Final tweaks on getting default-image-gridpp on all grid workers, re-run trinity, fix NFS export of /egee from bbexport so that it's read-write! SAM tests being passed now.
20090928 LSL Continuing new image (default-image-gridpp) work: yum install gives post-install grubby error but circumvent that with a soft-link for boot/vmlinuz done by hand. Note that grub.conf contents are irrelevant on BB.
20090925 LSL Following my email and plan of action, Alan of ITS and I collaborating on producing new image for BB with new kernel. ITS want to delay new GPFS kernel module till December shutdown! So new image will use NFS to access the gpfs file-system /egee, mounted from bbexport.
20090923 CJC Edited Twin WNs /etc/fstab to comment out epcf01 mount point (ALICE software area) as this machine is powered off
20090922 CJC One twin chassis (epgd01-02) powered off due to air conditioning failure.
20090915 --- Email from Mingchao requiring all sites update kernel to avoid user escalation vulnerabilities CVE-2009-2692 and CVE-2009-2698. Not yet done on BB but partial mitigation in place.
20090914 CJC No UK sites are running ATLAS production jobs according to ATLAS Dashboard.
20090914 CJC All ATLAS Sites failing SRM SAM tests. GGUS ticket opened. Also noted that, simultaneously (may yet be coincidence...), no long ATLAS jobs are running on ce3 or 4. Continuing to investigate...
20090902 CJC Updated certificate on epgce3
20090827 CJC epgse1 failing SAM tests since the reboot. Don't think it was rebooted properly earlier (terminal froze). Rebooted se again. Appeared to be able to lcg-cp properly once restarted (couldn't before). Awaiting the next SAM test...
20090827 CJC Kernels on all grid resources updated to 2.6.9-89.0.9 which addresses the security incident reported last week. More details here. Lawrie updated eprexa kernel. Waiting to hear what the status is with BB. Rebooting caused a few SAM tests to fail. Should pass the next ones without difficulty.
20090825 CJC Adjusted ATLAS Spacetokens with /opt/lcg/bin/dpm-updatespace to reflect a GUS Ticket suggestion.
20090817 LSL IT Services' Alan Reed had started to take all BB worker nodes offline on Saturday to fix a local difficulty with /projects fs. He returned the grid workers to use after representations yesterday morning.
20090816 CJC Subset of BB Grid nodes back online. New job submission confirmed
20090815 CJC Problem with BB I/O noted, causing confirmed difficulties with UI. Also noted that no new jobs are being submitted - they're just queuing. This appears to be the same for local jobs submitted to non-grid queues. Email message sent to bb users.
20090814 LSL On BlueBEAR, grid workers (and later all login and worker nodes) have the circumvention applied. Ref CVE-2009-2692.
20090814 CJC Linux security issue allows the possibility of root access to some unpatched kernels. Temporary fix available here until kernel patch available. Temporary fix applied to all machines, with the exception of the UI and the BB front end and WNs. Old /etc/modprobe.conf backed up to /etc/modprobe.conf.20080814. epgce1 shutdown until the incident passes (this machine was previously accessible from a public network).
20090812 CJC epgce1 reformatted (several times!). Kickstart cgi script used to avoid reinstallation loops needs attention.
20090812 CJC Work started on installing a test Cream CE. More information (including a full list of system edits) here.
20090812 CJC Birmingham removed from Ganga blacklist having passed the robot jobs
20090811 CJC Dead links in the directory bluebear:/egee/soft/middleware/prod/lcg/lib/python/ updated to point at bluebear:/egee/soft/middleware/3.1.34-0/lcg/lib/python2.3/site-packages/. glong queue restarted
20090811 CJC WN software upgrade on BB failed due to missing python links. Long queue stopped with the command bbstop glong on epgce4.
20090811 CJC Birmingham noted to be blacklisted by Ganga (no obvious reason according to the ganga job logs). Birmingham continuing to pass all ATLAS SAM tests and a manual lcg-cr/lcg-del to the atlasuserdisk was successful. If Birmingham is still on the list by the end of the day, I'll escalate to Ganga operators
20090810 LSL On BB, in directory /egee/soft/middleware/etc/, softlink grid-security now points to grid-security-rsync directory, as created by the rsync operation earlier today (see below). This new setup passed CAver test at 14:18 GMT. The new script therefore can replace manual methods of updating CA certs for BB workers.
20090810 CJC Unpacked glite-WN 3.1.34 and glite-WN-external into /egee/soft/middleware/3.1.34-0 on BlueBEAR. The /egee/soft/middleware/prod softlink was changed to point to the new installation directory. Problem when unpacking originally - dumped output into /egee/soft/middleware. Folders which I think can be removed have been renamed *-REMOVE, and should be removed if we continue to pass the SAM tests!
20090810 CJC Copied the fetch-crl script (2.6.3) from BB to epgce4, replacing the more recent version (2.7.0). Running this manually caused the r0 files to be updated properly (ie they contained revoked certificates). Restored the 2.7.0 fetch-crl script on epgce4 for completeness. Opened a GUS ticket.
20090810 CJC Problem noted by Aslam - ~20 BB grid nodes failed a ClusterVision test and went offline. They have now been brought back up. Aslam keeping an eye on them.
20090810 LSL Noticed lots of error messages in /var/log/fetch-crl-cron.log on both CEs since about 31 July. Files /etc/grid-security/certificates/*r0 only contain a cert and no actual CRL!! And yet BB:/egee/soft/middleware/log/fetch-crl-cron.log shows no problem.
20090810 LSL Wrote a rsync-cert-voms script to facilitate transfer of epgce4 cert/voms CA updates to BB directory /egee/soft/middleware/etc/grid-security-rsync/. Could be invoked as occasional cron job.
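  A minimal sketch of what such a script might do (the BB host alias and exact rsync options are illustrative):
    #!/bin/sh
    # push the CA certificates and VOMS trust files from epgce4 to the BB copy
    DEST=bluebear:/egee/soft/middleware/etc/grid-security-rsync
    rsync -a --delete /etc/grid-security/certificates/ $DEST/certificates/
    rsync -a --delete /etc/grid-security/vomsdir/ $DEST/vomsdir/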
20090807 LSL Corrected setting for GLITE_LOCATION_VAR in BB:/egee/soft/middleware/etc/profile.d/grid-env.sh. Its value was /bb/projects/lcgui/prod/glite/var, and is now /egee/soft/middleware/prod/glite/var, which are identical directories, but /bb is not mounted in grid worker nodes.
20090807 LSL Re-routed mains cables for epgce3 and epgce4 on UPS, so requiring a reboot (19:30-ish). Air-con dripping, needs a call-out.
20090806 CJC Updated UI and BB WN certificates on /home/lcgui/SL4/etc/grid-security/ and BB:/egee/soft/middleware/etc/grid-security/
20090806 CJC Problem with the University network - cannot reach any of the grid nodes (or university web services). Ironically, CIC emails are getting through to notify that there are problems reaching the nodes! [Caused by campus DNS problem 17:10 to 20:00].
20090806 CJC Upgraded to lcg-CA on all rpm nodes (ce1-4, se1, mo1 and Twin WNs).
20090805 LSL Following the success of the /tmp/ cleanup yesterday, I have added a /etc/cron.daily/bhamdircheck.cron to both CEs, which scans /home/* and /tmp and alerts us about directories with excessive files (> 16k). Its purpose is to monitor the monitors, not to correct the situation, which is best left to specialised cron jobs /etc/cron.d/cleanup-grid-accounts and /etc/cron.daily/tmpwatch.
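  A sketch of the kind of check bhamdircheck.cron performs (threshold from this entry; the mail recipient and exact script are illustrative):
    #!/bin/sh
    # warn if any monitored directory tree holds an excessive number of files
    for d in /home/* /tmp; do
        n=$(find "$d" -xdev 2>/dev/null | wc -l)
        [ "$n" -gt 16384 ] && echo "$d contains $n files" | mail -s "dircheck on $(hostname)" root
    done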
20090804 CJC Ran /usr/sbin/tmpwatch [using same command line as /etc/cron.daily/tmpwatch but with 24 as the hours] to clear the very large (> 32000) number of files from epgce4:/tmp. This appears to have fixed something, allowing a large number of jobs from different users to start running on ce4. SAM tests for ops, ATLAS, and LHCb all passed. Should review the maximum number of ATLAS processes allowable on epgce4, as it's hit the limit of 20 user jobs
20090804 CJC Turned on extra logging for the globus-job-manager and marshall ( debug 2 in epgce4:/opt/globus/etc/*.conf)
20090804 CJC Killed LHCb pilot jobs on epgce4 - no more defunct globus-gma processes seen! Large number of globus-job-manager processes seen though. I wonder if there is one for every incoming job which hasn't run? GUS Ticket opened.
20090804 CJC Allowing 64 ATLAS pilot jobs to run concurrently on epgce3 as the cluster is relatively quiet.
20090804 CJC Added the line RPCNFSDCOUNT=24 to the file epgsr1:/etc/sysconfig/nfs to deal with the problem of too many mount requests since the software installation move
20090804 CJC Updated host*.pem on epgce2
20090803 CJC Updated to glite-WN 3.1.34 on epgce3 WNs. Problems rebooting some of the nodes due to partition labels not matching those given in /etc/fstab. Logged onto nodes via terminal room. / was mounted read only, so this had to be remounted as writable with the command mount -o remount,rw /dev/sda2 /. This allowed changes to /etc/fstab. Labels changed to match those given by tune2fs -l /dev/sda1 and /dev/sda2. Nodes remounted. This leaves only a problem with epgd03...
20090731 CJC Updated lcg-BDII on epgce2. Disabled glite-CE repository. Rebooted
20090731 CJC Upgraded to glite-TORQUE_utils (3.1.9-0), lcg-CE (3.1.33-0) and lcg-vomscerts (5.5.0-1) on epgce3 and 4. Rebooting
20090731 CJC Removed a large number of old globus-tmp.u4n* directories from g-atlp08 (Graeme) home directory on Bluebear. Confirmed that ssh keys exist for that user
20090731 CJC Noticed the following error message on epgce4:/var/log/globus-gatekeeper.log when Graeme Stewart tries to submit a pilot job: GSS authentication failure globus_gss_assist token :3: read failure: Connection closed . This might explain why we have no pilot jobs from him.
20090730 CJC Changed max_user_queuable in qmgr on epgce3 to limit user jobs to 1000 on long and short queues. Set max_queuable to 5000, which caused the GlueCEPolicyMaxTotalJobs item to be updated. LHCb pilot jobs hovering around 6550 - expect this to drop to a more manageable level. Epgce3 noted to be very sluggish.
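  The qmgr settings were presumably along these lines (a sketch using the values quoted above):
    qmgr -c "set queue long max_user_queuable = 1000"
    qmgr -c "set queue short max_user_queuable = 1000"
    qmgr -c "set queue long max_queuable = 5000"
    qmgr -c "set queue short max_queuable = 5000"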
20090730 LSL Noticed that NGS jobs were logged with qsub returncode 189, as they don't specify a queue. Updated the wrapper script /usr/bin/qsub.sh to add a suitable -q option on the qsub command for NGS jobs, a method which is epgce3 and epgce4 compatible, on both CEs.
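  A minimal sketch of the idea behind the wrapper (the real /usr/bin/qsub.sh detects NGS jobs specifically; the queue name and back-end path here are illustrative):
    #!/bin/sh
    # pass through to the real qsub, supplying a default -q when the job specifies no queue
    case " $* " in
        *" -q "*) exec /usr/bin/qsub.real "$@" ;;
        *) exec /usr/bin/qsub.real -q long "$@" ;;
    esac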
20090730 CJC Updated maui.cfg to allow Mark to run 100 camont jobs. Updated USERCFG[cam004] and GROUPCFG[camont] MAXPROC=100. This should return to normal on Friday evening!!!
20090730 CJC epgse1 fails SAM tests. Investigation finds that SRMV2.2 service is no longer running. This is almost certainly the reason for the fault and is almost certainly due to the 1.7 DPM bug. Service restarted - next (valid) SAM due just before 10am
20090729 CJC epgce4 still sporadically failing SAM tests for ops and ATLAS. Noticed that ops failed with a Globus Error 10, which is explained here.
20090729 CJC The Maui attribute NODEALLOCATIONPOLICY on epgce3 changed from LOAD to CPULOAD in an attempt to better balance the job allocation.
20090728 LSL To bring epgce4/BB more into line with epgce3, gridusers on BB no longer require a .profile file, the grid environment being set up from /etc/profile.d/grid-env.sh as on epgce3 workers. Checked with pbstest.
20090728 CJC Increased number of ATLAS pilot jobs allowed on epgce3 from 12 to 48. This brings it into line with LHCb pilot jobs (also allowed Peter Love's Pilot DN to run 48 processes).
20090728 LSL Brought my pbstest script up to date. This is useful for submitting jobs and getting the output, for a particular userid. Using this, checked that g-opss07 account on epgce4/BlueBEAR is working OK as far as job submission and output retrieval are concerned: it is.
20090727 CJC Installed lcg-CE-3.1.32 on epgce4 in response to lots of defunct globus-gma processes and there being only LHCb pilot jobs in the queue. LHCb production jobs appear immediately after upgrade
20090727 CJC All experiment software areas for the epgce3 WNs, with the exception of ALICE, are now mounted from epgsr1:/disk/f15d/egee/soft/. More details here
20090727 CJC Installed lcg-CE-3.1.32 on epgce3. This updated globus-gma to version 1.0.12 which I'm assured (TB-SUPPORT 16th July) will address the slow WMS update problem. Also updated glite-SE_dpm_mysql on epgse1 and glite-SE_dpm_disk on epgsr1.
20090727 LSL On BlueBEAR, restrictions were placed on worker machines sshd allowusers in April. This is now updated so as to allow Chris, myself, and also g-admin to ssh from a bluebear login node to grid workers. Note that ssh access uses password, not keys (except for g-admin whose ssh keys are accessible) as /bb is deliberately unmounted on grid workers.
20090727 LSL Noticed for epgce4 no qsub records in /var/log/messages, no jobs being submitted, including SAM test jobs. Rebooted epgce4, and jobs started appearing immediately. Kept ps and netstat outputs in /tmp/psefl.[12] and /tmp/netstat-ntlp.[12] so we can find out what service failed, at leisure.
20090724 LSL We've had 444444 values in bdii responses: see 20090716. Value is in /opt/glite/etc/gip/ldif/static-file-CE.ldif. Spotted that this coincides with message "lcg-info-dynamic-scheduler: VO max jobs backend command returned nonzero exit status" and "Exiting without output, GIP will use static values" in /var/log/messages on CE. Happens a few times per day, but today ~ 40 times for epgce3.
20090724 LSL User g-opss07 had no ssh key files, or .profile, so fixed on BB and epgce4: this has caused SAM tests to fail for a week. File loss unexplained - see also 20090528. Generating keys is documented in LocalGridKeys.
20090724 CJC Certificate renewal for sr1, mo1 and ce2 started. A log of how this was done is available here
20090724 CJC Camont tests complete (until next week). cam004 directive commented out in maui.cfg and GROUPCFG[camont] returns to MAXPROC=4,6
20090723 CJC Curiously, there are queued jobs on epgce1. checkjob -v states they are queued because no resources are available. The majority of jobs belong to ngs. A few more are production/software jobs. Perhaps epgce1 is still broadcast as being available?
20090723 CJC No network saturation was seen in ganglia due to camont jobs. Updated maui.cfg to allow Mark to run 100 camont jobs. Updated USERCFG[cam004] and GROUPCFG[camont] MAXPROC=100. This should return to normal on Friday evening!!!
20090722 CJC Updated maui.cfg to allow Mark to run 25 camont jobs tomorrow. Updated USERCFG[cam004] and GROUPCFG[camont] MAXPROC=25
20090722 CJC Updated maui.cfg to allow more pilot job processes to run on epgce3. Previously limited to 10, now limited to 24 (similar to atlas production)
20090722 CJC Removed empty directories (rmdir) from epgce3:/home/<user>/.globus/job/epgce3.ph.bham.ac.uk/ for users atl073, prdatl08 (Graeme accounts) and atl052, pilatl14, prdatl11, prdatl19 (Peter accounts). Repeated on epgce4 for g-atl012, g-atlo08, g-atlp13, g-atlp17 (Peter) and g-atl057, g-atlp08 (Graeme).
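  The equivalent clean-up loop, roughly (user list from this entry; the find invocation is illustrative):
    for u in atl073 prdatl08 atl052 pilatl14 prdatl11 prdatl19; do
        find /home/$u/.globus/job/epgce3.ph.bham.ac.uk/ -mindepth 1 -type d -empty -exec rmdir {} \;
    done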
20090722 CJC Rebooted epgce3 after failing to see any pilot jobs from hammer cloud
20090721 CJC Started a journal of HammerCloud observations
20090717 CJC Updated Maui FairShare settings on epgce3. ATLAS now has 43% (previously 28%), LHCb 21% (previously 17%) and Zeus 13% (previously 3%). Biomed, Camont and H1 maintained at 3% each. All other VOs reduced to nominal value of 1%. epgce4 remains unchanged for now (ATLAS on 41%). LHCb share will eventually be reduced and replaced by ALICE.
20090716 LSL/CJC epgce4 GSTAT information showed GlueCEStateWaitingJobs: 444444, and Ricardo of LHCb had mentioned this. On the other hand, my own LDAP query showed sensible information. Nevertheless, rebooted epgce4. A compare of process names before/after showed that a service globus-job-manager-marshal restart might have achieved the same thing.
20090715 CJC APEL Publisher RSS feed status now reporting "ERROR [Please use the Gap Publisher to synchronised this dataset]". Publisher rerun on epgmo1 with Gap settings.
20090715 CJC Use vimaui on epgce3 to add the lines USERCFG[cam004] MAXPROC=24 and GROUPCFG[camont] FSTARGET=5 MAXPROC=20,24. This should allow Mark Slater to run Camont tests over the next few days. This change should be removed before 18/07/09!
20090715 CJC Restored WMS to pre RALDownTime defaults on local and bluebear UI. Some VO configurations on BlueBear were updated to reflect the supported WMS
20090714 CJC Installed glite-apel-core-2.0.9-13 and glite-apel-publisher-2.0.9-10 on epgmo1 at the suggestion of the GGUS Helpdesk. Rerunning apel publisher to see if the accounting problem is fixed
20090714 CJC Removed /home/lcgui/SL4/prod/glite/etc/voms/na48-na48-voms.cern.ch on the local UI and replaced it with /home/lcgui/SL4/prod/glite/etc/voms/na48-voms.cern.ch. This should contain the correct VOMS server information to contact.
20090710 CJC Restarted httpd, gmond and gmetad services on epgmo1, re-enabling the Ganglia monitoring.
20090710 CJC epgmo1:/opt/glite/bin/apel-publisher edited to allow 2048MB of memory. Resulted in heavy swapping on epgmo1 but apel has not yet crashed. This might be the source of the occasional RGMA SAM test failures due to timeouts.
20090710 CJC Updated glite-MON on epgmo1. Backup of mysql database dumped to ~cjc/Grid/ManRepo/Backups/epgmo1/20090708/. Reran yaim, with APEL_PUBLISH_LIMIT variable added. New yaim config backup can be found at ~cjc/Grid/ManRepo/Yaim/epgmo1/. This site-info.def file is a stripped down version of the existing file, so other services (Nagios?) may now be broken/missing.
20090709 CJC epgd02 rebooted remotely. Automounts correct lhcb software disk.
20090709 CJC LHCb software area copied from epgce3:/egee/soft/lhcb to epgsr1:/disk/f15d/egee/soft/lhcb after gus ticket reported disk was full. epgce3 WNs updated to mount correct software area. Full details here.
20090707 LSL The DPM packages got updated to 1.7 on 13 May on epgse1 and epgsr1 because my action on 20090512 to chkconfig yum off was insufficient: I should have done a service yum stop too!
20090707 LSL LHCb requests that the qstat command is available to WNs so that jobs can discover their own time-left (GGUS ticket 50000). This is a reasonable request, particularly in a job-wrapper script. Module environment is currently disabled for grid WNs. So the SL5 and SL4 versions of qstat from subdirs of /cvos/shared/apps/torque/ have been copied to /egee/soft/local/ and these are now linked to at grid WN boot time into /usr/bin in base and in SL4 chroot.
20090702 CJC Global problems with the APEL publishing cause CIC to broadcast a request that APEL SAM tests be ignored until further notice.
20090702 CJC Steve Lloyd's tests are reporting Birmingham having 320 CPUs total, but 567 free. Not sure how to fix this ...
20090702 CJC Problem voms-proxy-init'ing on eprexa. Updated certificates, but problem was down to large number of files in /tmp. Emailed Mark to ask about deleting temporary files.
20090702 CJC Changed RFIO buffer on local farm and grid CE's to 4096 bytes. Other sites (QMUL) have run with such a small buffer for a long time and have not experienced problems
20090702 CJC BB back online. Failed one SAM test, but I think this was because the cluster wasn't back up before the end of the scheduled downtime. Four ATLAS pilot jobs are already running, even though the ldap information has not been updated. Reverted /opt/lcg/libexec/lcg-info-dynamic-pbs to original file.
20090629 CJC Haven't solved the mysqld problem, just coded around it :s Added the line tmpdir=/var/tmp to the [mysqld] field in /etc/my.cnf, and this seems to have fixed the problem as I can now restart mysqld as a service. Other dpm/srm/rfio services restarted. Successfully lcg-cr and cp'ed a file from the SE. Awaiting results of SAM tests within the hour...
20090629 CJC Checked yum logs on se1 and sr1. Looks like DPM 1.7 was installed automatically on 13th May. Why problems with the schema only appeared now is still a mystery. Followed these instructions to change the schema - reasonably painless. Can now lcg-cp/r as before and local rfio access has been restored. mysqld still causing problems - can only run as user process. GGUS ticket opened.
20090629 CJC Updated ldap information for epgce4 as BB downtime is causing problems for LHCb (they can still see the queue). Would normally use qmgr -c "set queue glong enabled=false", but we don't own the torque server for BB so this is not possible. Instead, edited /opt/lcg/libexec/lcg-info-dynamic-pbs so that push @output, "GlueCEStateStatus: $Status\n"; becomes push @output, "GlueCEStateStatus: Draining\n";. Backup saved to /opt/lcg/libexec/lcg-info-dynamic-pbs.20090629. The command lcg-info --vo atlas --list-ce --attrs 'CEStatus' confirms that epgce4 is not available for jobs.
20090628 CJC Birmingham site entered into the Atlas Ganga blacklist. epgse1 failing ops and atlas SAM tests - can't lcg-cp from SE.
20090628 CJC yum update -y lcg-CA on epgd01-16
20090628 CJC pbsnodes -a on epgce4 shows status of all BlueBEAR nodes. grep on u4n shows only 14 cpus have jobs (some cpus have more than one job), the highest being u4n128
20090627 CJC Very large (~103) number of gridftp processes running on epgse1. Checked log file and the majority of requests are coming from a single H1 user certificate at a range of different sites. Stopped srmv1, srmv2 and srmv2.2 services temporarily (triggered a failed SAM test) and the number of gridftp processes decreased. Restarted services and epgse1 became more responsive (including responding to local athena file requests). Will continue to monitor and get in touch with H1 user if appropriate.
20090627 CJC SAM tests on epgce3 and epgse1 fail when trying to copy files to the SE. /tmp directory missing on SE, due to mistake on Monday when trying to distribute new certificates using rdist. /tmp has been restored with the same permissions as those on epgsr1. CRL cron job running manually. Expect another set of SAM tests within the hour, which should reveal if this has fixed the problem.
20090626 CJC BB jobs queued again. Re-running fetch CRL script. qstat/qs command fails to respond. Full details here.
20090626 CJC Post Edit: This didn't fix anything! It just republished already existing data! Addressed APEL problem in GGUS tickets 49689 and 49453. Haven't fixed java memory problem, but I think accounting can be successfully published on a day by day basis. Example APEL config file can be found on epgmo1:/opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml.BHAM.gap. Only May 20->May 21 have been published in this manner - waiting for confirmation that it works before proceeding.
20090624 CJC Distributed updated lcg-CA certificates to BlueBEAR and local UIs. Completed by tarring epgce4:/etc/grid-security/vomsdir/ and certificates/ and installing them in /egee/soft/middleware/etc/grid-security/ on BEAR and /home/lcgui/SL4/etc/grid-security/ on the local system. Backups copied to vomsdir.20090625/ and certificates.20090625 in the installation directories.
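  A sketch of the tar-and-unpack step described above (paths from this entry; exact commands not recorded):
    tar -C /etc/grid-security -czf /tmp/grid-security.tgz certificates vomsdir
    # on BEAR and the local UI: back up the existing certificates/ and vomsdir/ first, then
    tar -C /egee/soft/middleware/etc/grid-security -xzf /tmp/grid-security.tgz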
20090624 CJC Regenerated /etc/grid-security/vomsdir/grid-voms.desy.de.8119.pem on epgse1 and epgsr1 based on Zeus .crt file on CIC. Removed grid-voms.desy.de.pem file on se (no serial number in the file name). Also removed /etc/grid-security/vomsdir/zeus/grid-voms0.desy.de.lsc (additional 0 in filename).
20090624 CJC Installed pine using yum on epgce3-4, epgmo1, epgse1 and epgsr1 - I really miss the pico editor!
20090623 CJC After speaking to Joseph, I've killed all his jobs on epgce3 and 4. According to Ganga he didn't have any jobs running, but was previously having problems with them entering the sleep state.
20090623 CJC Adding support for the vo.u-psud.fr VO for Karl. More details can be found here. Confirmed as working.
20090623 CJC Updated lcg-CA on epgce1-4 and epgmo1. epgse1, epgsr1 and ce3 worker nodes all appear to have updated automatically. local UI, BB UI and BB WNs not yet updated (don't know how).
20090623 CJC ALL BB JOBS QUEUED - problem fixed by updating lcg-CA to 1.30 and running manual download of CRLs
20090618 LSL Working on APEL publishing problem still, see 20090615, GGUS ticket 49453. Rebooted epgmo1 which provoked another error: tomcat5 wouldn't start because /etc/tomcat5/tomcat5.conf had JAVA_HOME="/usr/java/jdk1.6.0_10" (file last modified 20090106) whereas only java jdk version is 1.6.0_12 (installed by Yves 20090205). Very odd that it was running at all before my reboot, but the previous reboot was 20090115, before the jdk update, so this was a problem waiting to happen!
20090616 CJC Changed UI setup on local system and BlueBEAR to cope with RAL WMS downtime. Full details can be found here.
20090615 LSL/CJC To fix UI client problem for dpm, soft-link /home/lcgui/SL4/prod/lcg/lib/libshift.so.2.1 -> libdpm.so, and /bb/projects/lcgui/prod/lcg/lib/libshift.so.2.1 -> libdpm.so .
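  i.e. roughly (paths from this entry):
    ln -sf libdpm.so /home/lcgui/SL4/prod/lcg/lib/libshift.so.2.1
    ln -sf libdpm.so /bb/projects/lcgui/prod/lcg/lib/libshift.so.2.1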
20090615 LSL Submitted a GGUS ticket to request help on epgmo1 not publishing APEL information. Symptoms in apel.log are Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded. More recently getting "Warning: No producers found to answer query" messages, but others are getting this which may be due to RAL machine room manoeuvres.
20090612 LSL Following a GGUS ticket pointing out that compat-libstdc++-33.i386 was missing on epgd11, checked and found that only 6 of the 16 nodes were identical for packages. epgd01 to epgd09 have been yum-updated for RedHat packages only, using yum -c /etc/yum.conf.wnredhat update. The remainder will be done if there are no adverse effects. This will at least remove minor discrepancies amongst the packages. Then we can check for major package differences.
20090612 LSL On BlueBEAR, on request, Alan has added a new user g-admin, which can be sudo'd to by me and Chris, just like the other g-users. This ID is now used for files previously owned by Yves, namely /bb/projects/lcgui/ and /egee/soft/middleware/, and for the update-crl cron jobs that run on BB on selected WNs. Checked file /egee/soft/middleware/log/fetch-crl-cron.log later to ensure that the cron jobs were continuing to run successfully: they were.
20090612 LSL make /bb/projects/lcgui/etc/grid-security point to /egee/soft/middleware/etc/grid-security, so that the UI software on BB for local users uses the same certificates and CRLs as are used by grid jobs in the grid-section of BB. This saves having to do a separate cron job or manual update.
20090611 LSL Put in bonding for the epgsr1 interfaces to speed up fetching and rfio reads: documented at LocalGridBonding.
20090608 LSL Tweak the maui scheduling limits for cpu bound users like lhcb and zeus, to make best use of the clusters.
20090604 LSL Fixed up proper ownership of software areas on BlueBEAR, which had not been done since the software areas were copied by Yves from eScience cluster (on epgce2), which used a different uid/gid layout. Particularly necessary for atlas and lhcb areas. Kept Alessandro (ATLAS) and Vladimir (LHCb) informed.
20090603 LSL Noticed that STEP 09 jobs are randomly going to short or long queues on both CEs: a time Requirements expression is missing from the jobs. Wrote a script to qalter and qmove jobs of the relevant user.
20090602 LSL Checked that fair-shares in maui matched those requested by ATLAS and LHCb, and set MAXPROC limits for groups (and some users) to match those fair-shares. To make those semi-permanent for moab on BlueBEAR, so it survives a moab restart, I got Alan to add those to the moab.cfg file.
20090531 LSL Added pilot users for atlas and lhcb. See LocalGridPilotAdd.
20090528 LSL Looking at BB error messages, observed that a few grid users lacked ssh-keys (and also .profile) so job output would always fail to get copied back: g-atl046 g-dtm005 g-dtm056 g-dtm084 g-dtm100 g-ops006. As no-one else lacked these, it is maybe safe to assume that this was a self-inflicted problem by those users.
20090527 CJC Manual edit of site-info.def and vo.d/biomed on epgse1 and epgsr1 to remove extra "/" at end of VO_VOMSES
20090526 LSL Updated bouncycastle package on epgmo1 following advice from APEL team: see 20090520; now the java exceptions for republishing APEL information do not occur. Running my /root/bin/apel-republish to re-process old APEL data, week by week to avoid overloading RGMA, with user DN accounting info added.
20090526 CJC Manual edit of /etc/grid-security/vomsdir/biomed/cclcgvomsli01.in2p3.fr.lsc on epgse1 and epgsr1 to remove extra "/" character at end of first line.
20090521 LSL Updated static information for epgce4: number of job slots to 192, the new figure as of last week. Updated CE_PHYSCPU and CE_LOGCPU in site-info.def, and corresponding definitions in file /opt/glite/etc/gip/ldif/static-file-Cluster.ldif; yaim not actually run. Confirmed that ldap query and later GSTAT reflected the change.
20090521 LSL Updated kernel on epgce3 from 2.6.9-78.0.1 to 2.6.9-78.0.22, as checking new logging didn't reveal the cause of the problem. Not rebooted as yet.
20090521 CJC Copied biomed_certificate.crt to /etc/grid-security/vomsdir/ on epgse1 and epgsr1 (certificate obtained from CIC).
20090521 LSL epgce3 was down 19:30 last night to 09:35 this morning. Similar symptoms to 20090430. Check logging. Approx 77 long jobs continued to run.
20090520 LSL 13:28 to 13:55 on epgce4: qsub and qstat were failing to contact the BlueBEAR pbs server. DNAT ruleset in BlueBEAR export machine, which routes packets from epgce4 to the qmaster machine on the BlueBEAR private network and back, was temporarily missing after Aslam restarted Shorewall. Rang him to remind him that it was necessary to run my /root/nat-qmaster.sh script which adds that ruleset. I must put that in the init.d/shorewall script or put a check in an hourly cron job on bbexport.
20090520 LSL Configuring epgmo1 following info on this DN accounting advice page so that encoded user-DN information is published in APEL accounting. However, this led to java null pointer exceptions in the apel publisher cron job, which I have issued a ticket for.
20090518 LSL f9 Raid replacement disk (ST3750640AS, 750GB) has arrived. Cloning failing drive slot 12 (to spare in slot 10). Then removed failing drive and inserted replacement as new local spare. Reported similar problem in drive slot 11 to supplier.
20090515 LSL looking at possibly redeploying epgce1x as a future backup BDII (mac addr ending f3:79, one of our 2004 Streamline-supplied front-ends, i686 architecture, not to be confused with current epgce1). Successfully installed this machine as a SL4.6 32-bit glite site BDII, with the new site-info.def, and tested its ldap responses remotely.
20090515 LSL created new site-info.def file, starting from the example distributed with glite 3.1, with customisation compatible with our previous site-info.def files. CE-dependent definitions omitted and in future these can go into a subdirectory file.
20090515 LSL fixed problem of epgce2 still advertised as a CE by the epgce2 BDII: on epgce2, kept a ORIG copy of /opt/glite/etc/gip/ldif directory and then deleted static files relating to CE role. Also stopped the globus-gatekeeper service.
20090514 LSL copied off accounting records from epgce1x machine (aka epaf18!) so it can be redeployed. We also have a copy of all accounting records on current epgce1 and epgce2.
20090513 LSL writing a script to add pilot userids/groups to our CEs.
20090512 LSL Pete Gronbech reports that DPM 1.7 is imminent and requires a schema change so should not be done automatically. I've done "chkconfig yum off" on epgse1 and epgsr1 as quick fix to avoid an automatic upgrade.
20090512 LSL Created tgz file from epgce4:/etc/grid-security/{certificates,vomsdir}. Used that to create an updated bluebear /egee/soft/middleware/etc/grid-security/ directory for bluebear's WNs. Used that also on PP system to update /home/lcgui/SL4/etc/grid-security. Ensured preserving same ownership as the userid which automatically updates the CRLs.
20090512 LSL Updated all epgce3 WNs to lcg-CA 1.29: yum -y update lcg-CA.
20090508 LSL Updated lcg-CA pkg to 1.29 on epgce[234] and epgmo1. Servers epgs[er]1 were auto-updated Weds 4am. To do: propagate to UI and WNs.
20090508 LSL Raid f9 reports media write error 311 for drive in slot12, reassign count = 8, though RAID is still in Good status. Googling shows error 311 is Write to disk error. Reported to vendor.
20090508 LSL Steve Lloyd monitor recording a lot of failed Atlas jobs on CE epgce3. SAM tests running clean, apart from an early warning about cert of NIKHEF. Most jobs are running on epgd01. Turns out that epgd01 has no remote NFS mounts currently - all other nodes have the software areas mounted correctly. Put epgd01 offline, pending investigation.
20090506 CJC/LSL epgse1: to remedy SRM GGUS-notified problem, srm1 service restarted.
20090430 LSL epgce3 restarted twice today (11:05 and 20:28), because it froze: responds to ping and telnet 22 but not ssh and console frozen too. Put in extra logging in new /etc/cron.daily/minutely/ to monitor memory use and filesystems.

-- ChristopherCurtis - 19 Jan 2010
