
Local Grid Journal 2009

This is a reverse order diary of events, without retrospective editing (so keep it raw and short, max ~ 3 lines). See other pages like LocalGridInternals for more carefully considered documentation.

20091216 CJC Upgraded ATLAS Squid to SL 5.3.
20091215 CJC Major work on the AliceBox.
20091215 CJC Added VOBox to Site BDII and re-ran yaim.
20091214 CJC Editing the file epgce4:/opt/lcg/libexec/lcg-info-dynamic-pbs allows for a static queue walltime to be set, but this appears to be global. Would be preferable to set individual times for each queue.
20091213 CJC Added mysql package to epgr02. Used command mysql --pass="MYSQL_PASSWORD" --exec "grant all on accounting.* to 'accounting'@'$NEWCE_HOST' identified by 'APEL_PASSWORD'" to enable accounting access on epgmo1.ph.bham.ac.uk.
20091211 CJC Added long and short queues to epgr02. Removed sl5_test queue.
20091211 CJC Deployed epgr03 for the purposes of the ALICE VOBox and epgr04 for a temporary CE to harvest SL5 Athena installations (machines yet to be configured).
20091211 CJC epgce3 files backed up to epgsr1:/disk/f15d/epgce3.backup before OS reinstallation.
20091211 CJC Converted epgd12-15 to SL5 and added to epgr02 cluster. epgd16 will follow once decommissioning procedures for epgce3 have been confirmed.
20091209 CJC Athena test submitted via grid job to epgr02, epgce3, epgce4, t2ce05.physics.ox.ac.uk and svr026.gla.scotgrid.ac.uk. Glasgow and Oxford passed without problem, and provide a good output comparison. epgce3 passed without a problem. epgr02 complained it could not read a DB file from the SCRATCHDISK - rerunning job to confirm problem. epgce4 passed after updating squid firewall.
20091209 CJC Converted epgd09-11 to SL5 and added to epgr02 cluster.
20091208 CJC Added /var/log/acpid to cfengine to ensure permissions are always 640.
20091208 CJC Killed off a large number of H1 gridFTP transfers to the SE as ATLAS and ops SAM tests were failing. Number of simultaneous H1 jobs on BB to be reduced to 60.
20091208 CJC Updated the firewall on epgr01 to allow Squid Traffic (port 3128). Athena test output looks healthier.
20091208 CJC epgd09-11 moved to offline status for the purposes of reconfiguring as SL5 nodes once running jobs have completed.
20091208 CJC Removed the lock file epgce3:/opt/edg/var/info/atlas/lock (file was created in early June, around the time of the last successful ATLAS install on epgce3). PFC installation successfully completed on epgce3 soon after.
20091204 CJC epgr02 reinstalled and accepting jobs. Yaim run locally this time rather than via cfengine. MAX PROC for ATLAS increased to 40 on BB.
20091204 CJC Resource BDII broken on epgr02. This node is no longer visible to the outside world through the information system. Permissions messed up on info-providing scripts. Reinstall tomorrow.
20091203 CJC Reinstalling epgr02 in an attempt to fix the job submission problem (epgr02 noted to have passed SAM tests on 02/12/09).
20091203 CJC Enabled ntpd on epgr02 and set to autostart on reboot. Rebooted machine. This may fix the job submission problem which dteam, ops, atlas and lhcb have seen in the SAM tests.
20091203 CJC Overnight SAM and Ganga tests successful, which proves the SE continues to work as expected. Requesting H1 try hammering Birmingham again to see if this has had an effect!
20091203 CJC Moved the MySQL DPM database to epaf17. Reyaimed epgse1 to reflect the change. SAM tests post 0217 should be successful. Submitted two test jobs to epgr02 and epgce3.
20091202 CJC Added openldap-clients package to SL5 nodes after SAM tests showed that WNs could not use the ldapsearch command.
20091201 CJC epgr02 moved to production node status in GOCDB.
20091201 CJC Moved nodes epgd01-08 inclusive offline for the purposes of converting to SL5 (epgd09-16 will follow once the remaining jobs on epgce3 have completed). epgr02 reinstalled with 3 vcpus and a copy of the maui.cfg from epgce3. epgr02 to be added temporarily to the production list in the GOCDB. Yaim rerun on epgce3 to reflect reduced nodes.
20091201 CJC (00:21) User epgse1:hon079 has paused jobs involving Birmingham data transfer until later this morning. SE seems more responsive and ATLAS DQ2 transfer was much faster. If the SAM tests are also successful overnight this will confirm that the gridftp processes were too much of a drain on the DPM head node. Investigating the possibility of separating services.
20091130 CJC Ensured that NTP service on epgse1 will auto start with chkconfig ntpd on. Killed H1 jobs on epgce3, which initially reduced the number of globus-gridftp transfers belonging to user epgce3:hon015, but then they rose again. Considering temporary ban of user.
20091130 CJC After Lawrie's reboot, epgse1 still shows wrong time. mysql appears to have calmed down - only 50-60% of CPU time used now. All memory used, into swap. Restricted the number of H1 jobs on epgce3 to 4 in an attempt to reduce user epgce3:hon072's demands on the SE.
20091130 CJC epgse1 was running normal kernel (non smp). Changed /etc/grub.conf to point to smp kernel and rebooted.
20091130 CJC Rebooted epgse1 after failing SRM SAM tests. Machine appeared sluggish and under heavy load. Noted that only 2GB of memory is available on this machine, which was entirely in use (along with a portion of the swap) before the reboot. A higher spec machine may be required in the near future.
20091127 LSL New power supply for f9 RAID arrived from Transtec. Fitted hot-swap without needing to take any systems down - OK.
20091127 CJC epgce3 and epgce4 failing ATLAS SAM tests due to SQUID. Running Deployment tests again.
20091126 LSL RAID f9 has a failed power-supply, one of two in a redundant formation, so it's continuing to work. A replacement PSU will be ordered asap.
20091126 LSL BB jobs now should be going through normally, following a period of a week where the /egee file-system response has been very poor at times. A "ls /egee/home" command now has a sub-second response, rather than taking several minutes. The fix: the BB team killed off Chemistry nwchem jobs that were using MPI over Ethernet rather than Infiniband, and possibly heavy GPFS i/o.
20091125 CJC Updated lcg-CE and glite-TORQUE_server packages and re-ran yaim on epgce3 in an attempt to get jobs running. Jobs seem to submit but then stay queued. The only clue in the maui log is "rc: 15057".
20091125 CJC Installed Fusion certificate on epgse1 and epgsr1 as /etc/grid-security/vomsdir/swevo.ific.uv.es.crt. This may fix the [[https://gus.fzk.de/ws/ticket_info.php?ticket=53543][Fusion GGUS Ticket]].
20091125 CJC Copied epgsr1:/etc/grid-security/vomsdir/grid-voms.desy.de.8119.pem onto epgse1. This file used to exist (cf 20090624), but was not replaced after the SE restoration. This is expected to be the solution to Zeus GGUS ticket.
20091125 CJC Changed the permissions of all software directories mounted on epgce3 WN to 775, which should ensure that ATLAS can install new software versions once again. Requested another PFC installation on epgce3
20091125 CJC Rebooted epgce3. All ATLAS jobs were queued, but only Steve Lloyd's were running (presumably on the short queue). After the reboot, a large number of ATLAS jobs started running. This may fix the failed SAM tests.
20091121 CJC Requested PFC installation on epgce3. This has already been installed on BlueBEAR WNs. The initial request failed when it came to writing the software tag, so the task was restarted...
20091121 CJC Deployed epgr02 as SL4.6 64 bit CE on epgce1. This will be used to manage the SL5 installation. This deployment has no virt-cpu settings, so it may suffer from being too slow (something to keep an eye on). This machine is under the control of cfengine.
20091120 CJC SE fails a number of OPS SAM tests. This is probably due to heavy load on the machine when I ran several dpm-disk-to-dpns processes simultaneously. These processes have terminated and access to the SE confirmed with lcg-cp/lcg-del and a dq2-get.
20091120 CJC Updated ATLAS space tokens. DATADISK now holds up to 25TB data, which will allow us to meet the threshold set by ATLAS for beam data. Other spacetokens rejigged to account for this (see Graeme's email).
20091120 CJC SQUID confirmed to work with the RAL Frontier service using fnget.py. Previous problems were confirmed to be down to a problem at RAL rather than Birmingham.
20091119 CJC Only 2 BlueBEAR nodes online, which appears to be causing a number of failed jobs on epgce4.
20091119 CJC Added VO directories to epgse1:/disk/f*/ in an attempt to fix the SAM put test errors. A re-yaim does not create these directories, so their source on other file systems is still a mystery. Note that yaim has a limited role on the DPM head node because it appears to be limited to defining only one pool. Added /disk/f9c to the dpmPart pool.
20091119 CJC Deployed ATLAS SQUID Server on epgs01.ph.bham.ac.uk. Waiting to be authenticated by RAL (sent email to atlas-uk-comp-operations@cern.ch). The recommended fnget.py test works for the PIC and BNL Frontier servers.
20091119 CJC epgr01 deployed as a 64 bit SL 4.6 machine with 2GB RAM and 50 GB hard disk on epgce1. This machine will eventually host the ATLAS SQUID. This will require much more hard disk space (200 GB+), so the SQUID should be hosted on an NFS directory on epgsr1:/disk/f15d.
20091119 CJC Pete Gronbech highlights BDII problem on Gstat. This is similar to a previous problem.
20091119 CJC Added Virtual Machines to epgmo1:/etc/dhcpd.conf file. Check LocalGridMachines for details on hostnames, MAC and IP addresses.
20091113 CJC Tests show that lcg-cp fails to write files to a filesystem if dpmmgr doesn't already own a vo directory on that filesystem. Question remains: who is responsible for creating those initial directories - manual or yaim? Would need to confirm directory structure on all epgse1 disks. epgsr1 dedicated solely to ATLAS at the moment and relevant filesystems look to be complete. There are no directories on /disk/f15c, but this is earmarked for ALICE.
20091113 CJC Used the command dpm-drain to drain epgse1:/disk/f9a. Removed from DPM with command dpm-rmfs. Checked remaining directory structure for files and then cleared them from the disk. Added filesytem back in to the dpmPart pool with dpm-addfs. This may solve the problems ops were having writing to this disk.
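  A minimal sketch of that drain/re-add sequence, assuming the standard DPM admin tools (the exact options used were not recorded):
    dpm-drain --server epgse1.ph.bham.ac.uk --fs /disk/f9a      # move replicas off the filesystem
    dpm-rmfs --server epgse1.ph.bham.ac.uk --fs /disk/f9a       # remove it from the DPM
    # clear any leftover files/directories on the disk by hand, then re-add it:
    dpm-addfs --poolname dpmPart --server epgse1.ph.bham.ac.uk --fs /disk/f9a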
20091113 CJC Changed owner and group of epgse1:/disk/f3g/alice to dpmmgr. Repeated for babar directory as well. This might solve the problem of alice not being able to write to /disk/f3b/ on the SE (softlinked to f3g).
20091110 CJC Added vm.mmap_min_addr = 4096 to /etc/sysctl.conf on all local nodes, and issued the command /sbin/sysctl -p to ensure that changes are registered. This will guard against future root null pointer weaknesses. A similar fix has been implemented on the BlueBEAR Grid Nodes.
20091109 CJC Requested a HammerCloud test against Birmingham. Private grid jobs have been shown to complete successfully - this should demonstrate that the HammerCloud mechanism is working.
20091108 CJC Rebooted epgd08, 11, 12 and 16 after software areas disappeared.
20091107 CJC Updated kernel-devel and ensured kernel-module-xfs installed on epgsr1 after ATLAS complained their spacetokens were "full" (epgsr1 file systems were actually not available according to dpm-qryconf on epgse1). Rebooted both epgse1 and epgsr1, which resulted in the file systems being available and writeable. Failed one SAM test while epgse1 rebooted.
20091106 CJC Updated lcg-CA, kernel and kernel-smp on all nodes. This leaves only the Bluebear WNs with an unpatched kernel.
20091106 CJC epgce3-4 failing SRM tests (specifically lcg-rm). This may be just down to timeouts on the SE due to the ATLAS spacetoken merging, but a test job has been submitted to ensure that this action is possible.
20091105 CJC DPM database restored on epgse1. The only complication is that /disk/f3a (and f3b and f3c) was a soft link to /disk/f9f (and f9g and f9h). Soft links temporarily restored, with the intention of "draining" the f3 disks in the DPM and replacing them with the f9 disks. Manual drain of ATLASDATADISK restarted.
20091105 CJC Reinstalled SL 4.6 on epgse1 ( --clearpart --all worked in the kickstart file!). Reinstalled glite and DPM software. Currently restoring the MySQL database.
20091103 CJC Kickstart rejected clearpart --drives=hda option for restoring epgse1. Transferring backup files to /disk/f15d with the purpose of trying clearpart --all
20091102 CJC epgce2 reconfigured as the site BDII. Operating system restored to SL 4.6 i386, running kernel version 2.6.9-89.0.15. The machine is under the control of the epgmo1 cfengine. Note: There is a memory fault on this machine, so it must boot with the option mem=1024M. There is a separate config/kickstart chain for this ( sl46-i386-epgce2).
20091030 CJC cfengine removed from all nodes except those known to be safe (epgce1, epgmo1, epgd01). Stray update has caused SL5 64 bit libraries to be installed on epgse1 (SL4 32 bit), causing yum to fail to work. This machine will have to be reinstalled, but DPM must first be backed up. Intend to follow these instructions. Will try to configure epaf17 as DPM head node first...
20091030 CJC Reboot of epgce2 fails. Birmingham marked as down in GOC DB until manual reboot tomorrow.
20091030 CJC A stray cfengine update causes epgce2 to be "updated" into an lcg-CE, thus losing its site BDII status. The update installs SL5 binaries and breaks yum, so rolling back becomes difficult. Starting reinstall of node from scratch.
20091030 CJC Restricted ack.php callback script on epgmo1 to 147.188.46.X. This should stop Google Robot from setting up a boot script!
20091029 CJC ATLASHOTDISK drain complete. Started to drain ATLASDATADISK.
20091028 CJC ATLASGROUPDISK and ATLASPRODDISK moved to epgsr-f13. ATLASTMP, which contains the old ATLASHOTDISK files, is currently draining into epgsr-f13. This only leaves ATLASDATADISK to be moved.
20091027 CJC Started to drain ATLAS spacetokens residing on the epgsr-f12 pool. Files in a particular spacetoken are listed using dpm-sql-spacetoken-list-files --st ATLASPRODDISK. These files are then replicated to the ATLASTMP spacetoken and the originals deleted, along with the original spacetoken. A new space token is created using dpm-reservespace on the epgsr-f13 pool with the name of the original spacetoken. The files on ATLASTMP are then dpm-replicated into the new spacetoken, and the replicas in ATLASTMP deleted.
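  A rough outline of the per-token move, assuming the standard DPM admin commands (the size, lifetime and group below are illustrative, not recorded):
    dpm-sql-spacetoken-list-files --st ATLASPRODDISK        # list files in the old token
    # dpm-replicate each listed file into ATLASTMP, then delete the originals
    dpm-releasespace --token_desc ATLASPRODDISK             # drop the old reservation
    dpm-reservespace --token_desc ATLASPRODDISK --poolname epgsr-f13 --gspace 5T --lifetime Inf --group atlas/Role=production
    # dpm-replicate the files from ATLASTMP into the new token, then delete the ATLASTMP copies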
20091027 CJC Overnight drains of /disk/f12a and f12b returned a total of 11 "No such file or directory", resulting in the drain being incomplete. Question about the relevance of this sent to the storage email list.
20091026 CJC Started draining /disk/f12a on SE. Disks f12b and f12c will follow. The empty filesystems will be split between the epgsr-f13 and ATLASTMP pools. Files remaining on the f12d disk will then be moved manually, via ATLASTMP.
20091026 CJC Amended twin_wn cfengine definition to install i386 libraries in addition to x86_64 versions, which are not installed by the SL5 Dependency RPM. The yum repo file has been made available on the epgmo1 web server in order to aid automated installation.
20091020 CJC BDII service on epgce1 hit by glite 3.1 r57 bug (here). This may have been the reason for the failed ATLAS install jobs. Versions rolled back, and yum automatic updates turned off.
20091020 CJC ATLAS software jobs failed when trying to register software tags. This may be due to the other nodes being available and already having software tags published. Setting up epgce1 as a temporary lcg-CE to handle the install jobs.
20091020 CJC SL5 test grid job looks more promising. Previous jobs failed to source grid-env.sh on arrival at the worker node. Copying the grid-env.sh scripts into /etc/profile.d/ (as opposed to just soft linking) appears to fix this.
20091020 CJC Re-ran yaim on epgce2 to reflect changes to GlueSiteOtherInfo requested by Kashif. Update: epgce2 site-info.def is very old and contains references to resources like epgce1. It has been quickly modified to reflect the existence of epgce4. A full update will follow.
20091019 CJC Added /disk/f15a,b,c to dpm as member of the ATLASPool DPM Pool. Plan to eventually split DPM into three pools - ATLAS, ALICE and Others
20091019 CJC ATLAS installing Athena 14.5.0 on epgce3 for SL5 via epgd01 (link). If this is successful, all worker nodes on epgce3 will have access to SL5 build athena versions.
20091015 CJC CGI switch for preventing endless pxe reinstall loops on epgmo1 doesn't work. Replaced with a php script: http://epgmo1.ph.bham.ac.uk:8888/ack.php
20091013 CJC Accident with cfengine running on epgmo1 distributes ssh keys to all grid machines and breaks communication between epgce3 and its worker nodes. This appears to be fixable with the commands rm -f /etc/ssh/ssh_known_hosts; /opt/edg/sbin/edg-pbs-knownhosts; service sshd restart
20091008 CJC epgd01 running as SL5.3 node. No torque, LCG or other software packages installed yet. Looking at GridFabricManagement software.
20091006 LSL I have taken a full backup of BlueBEAR /egee file-system and copied it to PP system in directory /disk/11a/work/ in case of catastrophic GPFS failure; gzipped tar size is 139 GB.
20091006 CJC SE fails the SAM tests in the early morning due to H1 user stressing the system. Under investigation...
20091006 CJC Attempting to move epgd01 offline in preparation for SL5 installation. Restored maui.cfg from maui.cfg.20090930 - yaim on 01/10/09 appears to have rewritten config file!
20091001 CJC HEPSPEC 06 results made available on site BDII. Added the lines CE_SI00=XXXX, CE_CAPABILITY="CPUScalingReferenceSI00=XXXX" and CE_OTHERDESCR="Cores=192,Benchmark=Y.YY-HEP-SPEC06" to site-info.def on epgce3 and 4, where Y.YY = 7.81 on epgce4 and 9.1 for epgce3. XXXX is given by Y*1000/4.
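  Worked through with those numbers: epgce4 gives 7.81 x 1000 / 4 = 1952.5, so CE_SI00 would be about 1953; epgce3 gives 9.1 x 1000 / 4 = 2275 (the 1000/4 = 250 factor is the usual HEP-SPEC06 to SpecInt2000 conversion).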
20090929 LSL Final tweaks on getting default-image-gridpp on all grid workers, re-run trinity, fix NFS export of /egee from bbexport so that it's read-write! SAM tests being passed now.
20090928 LSL Continuing new image (default-image-gridpp) work: yum install gives post-install grubby error but circumvent that with a soft-link for boot/vmlinuz done by hand. Note that grub.conf contents are irrelevant on BB.
20090925 LSL Following my email and plan of action, Alan of ITS and I collaborating on producing new image for BB with new kernel. ITS want to delay new GPFS kernel module till December shutdown! So new image will use NFS to access the gpfs file-system /egee, mounted from bbexport.
20090923 CJC Edited Twin WNs /etc/fstab to comment out epcf01 mount point (ALICE software area) as this machine is powered off
20090922 CJC One twin chassis (epgd01-02) powered off due to air conditioning failure.
20090915 --- Email from Mingchao requiring all sites update kernel to avoid user escalation vulnerabilities CVE-2009-2692 and CVE-2009-2698. Not yet done on BB but partial mitigation in place.
20090914 CJC No UK sites are running ATLAS production jobs according to ATLAS Dashboard.
20090914 CJC All ATLAS Sites failing SRM SAM tests. GGUS ticket opened. Also noted that, simultaneously (may yet be coincidence...), no long ATLAS jobs are running on ce3 or 4. Continuing to investigate...
20090902 CJC Updated certificate on epgce3
20090827 CJC epgse1 failing SAM tests since the reboot. Don't think it was rebooted properly earlier (terminal froze). Rebooted se again. Appeared to be able to lcg-cp properly once restarted (couldn't before). Awaiting the next SAM test...
20090827 CJC Kernels on all grid resources updated to 2.6.9-89.0.9 which addresses the security incident reported last week. More details here. Lawrie updated eprexa kernel. Waiting to hear what the status is with BB. Rebooting caused a few SAM tests to fail. Should pass the next ones without difficulty.
20090825 CJC Adjusted ATLAS Spacetokens with /opt/lcg/bin/dpm-updatespace to reflect a GUS Ticket suggestion.
20090817 LSL IT Services' Alan Reed had started to take all BB worker nodes offline on Saturday to fix a local difficulty with /projects fs. He returned the grid workers to use after representations yesterday morning.
20090816 CJC Subset of BB Grid nodes back online. New job submission confirmed
20090815 CJC Problem with BB I/O noted, causing confirmed difficulties with UI. Also noted that no new jobs are being submitted - they're just queuing. This appears to be the same for local jobs submitted to non-grid queues. Email message sent to bb users.
20090814 LSL On BlueBEAR, grid workers (and later all login and worker nodes) have the circumvention applied. Ref CVE-2009-2692.
20090814 CJC Linux security issue allows the possibility of root access to some unpatched kernels. Temporary fix available here until kernel patch available. Temporary fix applied to all machines, with the exception of the UI and the BB front end and WNs. Old /etc/modprobe.conf backed up to /etc/modprobe.conf.20080814. epgce1 shutdown until the incident passes (this machine was previously accessible from a public network).
20090812 CJC epgce1 reformatted (several times!). Kickstart cgi script used to avoid reinstallation loops needs attention.
20090812 CJC Work started on installing a test Cream CE. More information (including a full list of system edits) here.
20090812 CJC Birmingham removed from Ganga blacklist having passed the robot jobs
20090811 CJC Dead links in the directory bluebear:/egee/soft/middleware/prod/lcg/lib/python/ updated to point at bluebear:/egee/soft/middleware/3.1.34-0/lcg/lib/python2.3/site-packages/. glong queue restarted
20090811 CJC WN software upgrade on BB failed due to missing python links. Long queue stopped with the command bbstop glong on epgce4.
20090811 CJC Birmingham noted to be blacklisted by Ganga (no obvious reason according to the ganga job logs). Birmingham continuing to pass all ATLAS SAM tests and a manual lcg-cr/lcg-del to the atlasuserdisk was successful. If Birmingham is still on the list by the end of the day, I'll escalate to Ganga operators
20090810 LSL On BB, in directory /egee/soft/middleware/etc/, softlink grid-security now points to grid-security-rsync directory, as created by the rsync operation earlier today (see below). This new setup passed CAver test at 14:18 GMT. The new script therefore can replace manual methods of updating CA certs for BB workers.
20090810 CJC Unpacked glite-WN 3.1.34 and glite-WN-external into /egee/soft/middleware/3.1.34-0 on BlueBEAR. The /egee/soft/middleware/prod softlink was changed to point to the new installation directory. Problem when unpacking originally - dumped output into /egee/soft/middleware. Folders which I think can be removed have been renamed *-REMOVE, and should be removed if we continue to pass the SAM tests!
20090810 CJC Copied the fetch-crl script (2.6.3) from BB to epgce4, replacing the more recent version (2.7.0). Running this manually caused the r0 files to be updated properly (ie they contained revoked certificates). Restored the 2.7.0 fetch-crl script on epgce4 for completeness. Opened a GUS ticket.
20090810 CJC Problem noted by Aslam - ~20 BB grid nodes failed a ClusterVision test and went offline. They have now been brought back up. Aslam keeping an eye on them.
20090810 LSL Noticed lots of error messages in /var/log/fetch-crl-cron.log on both CEs since about 31 July. Files /etc/grid-security/certificates/*r0 only contain a cert and no actual CRL!! And yet BB:/egee/soft/middleware/log/fetch-crl-cron.log shows no problem.
20090810 LSL Wrote a rsync-cert-voms script to facilitate transfer of epgce4 cert/voms CA updates to BB directory /egee/soft/middleware/etc/grid-security-rsync/. Could be invoked as occasional cron job.
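  A minimal sketch of what such a script might do (the BB host alias and exact rsync options are illustrative):
    #!/bin/sh
    # push the CA certificates and VOMS trust files from epgce4 to the BB copy
    DEST=bluebear:/egee/soft/middleware/etc/grid-security-rsync
    rsync -a --delete /etc/grid-security/certificates/ $DEST/certificates/
    rsync -a --delete /etc/grid-security/vomsdir/ $DEST/vomsdir/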
20090807 LSL Corrected setting for GLITE_LOCATION_VAR in BB:/egee/soft/middleware/etc/profile.d/grid-env.sh. Its value was /bb/projects/lcgui/prod/glite/var, and is now /egee/soft/middleware/prod/glite/var, which are identical directories, but /bb is not mounted in grid worker nodes.
20090807 LSL Re-routed mains cables for epgce3 and epgce4 on UPS, so requiring a reboot (19:30-ish). Air-con dripping, needs a call-out.
20090806 CJC Updated UI and BB WN certificates on /home/lcgui/SL4/etc/grid-security/ and BB:/egee/soft/middleware/etc/grid-security/
20090806 CJC Problem with the University network - cannot reach any of the grid nodes (or university web services). Ironically, CIC emails are getting through to notify that there are problems reaching the nodes! [Caused by campus DNS problem 17:10 to 20:00].
20090806 CJC Upgraded to lcg-CA on all rpm nodes (ce1-4, se1, mo1 and Twin WNs).
20090805 LSL Following the success of the /tmp/ cleanup yesterday, I have added a /etc/cron.daily/bhamdircheck.cron to both CEs, which scans /home/* and /tmp and alerts us about directories with excessive files (> 16k). Its purpose is to monitor the monitors, not to correct the situation, which is best left to specialised cron jobs /etc/cron.d/cleanup-grid-accounts and /etc/cron.daily/tmpwatch.
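  A sketch of the kind of check bhamdircheck.cron performs (threshold from this entry; the mail recipient and exact script are illustrative):
    #!/bin/sh
    # warn if any monitored directory tree holds an excessive number of files
    for d in /home/* /tmp; do
        n=$(find "$d" -xdev 2>/dev/null | wc -l)
        [ "$n" -gt 16384 ] && echo "$d contains $n files" | mail -s "dircheck on $(hostname)" root
    done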
20090804 CJC Ran /usr/sbin/tmpwatch [using same command line as /etc/cron.daily/tmpwatch but with 24 as the hours] to clear the very large (> 32000) number of files from epgce4:/tmp. This appears to have fixed something, allowing a large number of jobs from different users to start running on ce4. SAM tests for ops, ATLAS, and LHCb all passed. Should review the maximum number of ATLAS processes allowable on epgce4, as it's hit the limit of 20 user jobs
20090804 CJC Turned on extra logging for the globus-job-manager and marshall ( debug 2 in epgce4:/opt/globus/etc/*.conf)
20090804 CJC Killed LHCb pilot jobs on epgce4 - no more defunct globus-gma processes seen! Large number of globus-job-manager processes seen though. I wonder if there is one for every incoming job which hasn't run? GUS Ticket opened.
20090804 CJC Allowing 64 ATLAS pilot jobs to run concurrently on epgce3 as the cluster is relatively quiet.
20090804 CJC Added the line RPCNFSDCOUNT=24 to the file epgsr1:/etc/sysconfig/nfs to deal with the problem of too many mount requests since the software installation move
20090804 CJC Updated host*.pem on epgce2
20090803 CJC Updated to glite-WN 3.1.34 on epgce3 WNs. Problems rebooting some of the nodes due to partition labels not matching those given in /etc/fstab. Logged onto nodes via terminal room. / was mounted read only, so this had to be remounted as writable with the command mount -o remount,rw /dev/sda2 /. This allowed changes to /etc/fstab. Labels changed to match those given by tune2fs -l /dev/sda1 and /dev/sda2. Nodes remounted. This leaves only a problem with epgd03...
20090731 CJC Updated lcg-BDII on epgce2. Disabled glite-CE repository. Rebooted
20090731 CJC Upgraded to glite-TORQUE_utils (3.1.9-0), lcg-CE (3.1.33-0) and lcg-vomscerts (5.5.0-1) on epgce3 and 4. Rebooting
20090731 CJC Removed a large number of old globus-tmp.u4n* directories from g-atlp08 (Graeme) home directory on Bluebear. Confirmed that ssh keys exist for that user
20090731 CJC Noticed the following error message on epgce4:/var/log/globus-gatekeeper.log when Graeme Stewart tries to submit a pilot job: GSS authentication failure globus_gss_assist token :3: read failure: Connection closed . This might explain why we have no pilot jobs from him.
20090730 CJC Changed max_user_queuable in qmgr on epgce3 to limit user jobs to 1000 on long and short queues. Set max_queuable to 5000, which caused the GlueCEPolicyMaxTotalJobs item to be updated. LHCb pilot jobs hovering around 6550 - expect this to drop to a more manageable level. Epgce3 noted to be very sluggish.
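  The qmgr settings were presumably along these lines (a sketch using the values quoted above):
    qmgr -c "set queue long max_user_queuable = 1000"
    qmgr -c "set queue short max_user_queuable = 1000"
    qmgr -c "set queue long max_queuable = 5000"
    qmgr -c "set queue short max_queuable = 5000"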
20090730 LSL Noticed that NGS jobs were logged with qsub returncode 189, as they don't specify a queue. Updated the wrapper script /usr/bin/qsub.sh to add a suitable -q option on the qsub command for NGS jobs, a method which is epgce3 and epgce4 compatible, on both CEs.
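  A minimal sketch of the idea behind the wrapper (the real /usr/bin/qsub.sh detects NGS jobs specifically; the queue name and back-end path here are illustrative):
    #!/bin/sh
    # pass through to the real qsub, supplying a default -q when the job specifies no queue
    case " $* " in
        *" -q "*) exec /usr/bin/qsub.real "$@" ;;
        *) exec /usr/bin/qsub.real -q long "$@" ;;
    esac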
20090730 CJC Updated maui.cfg to allow Mark to run 100 camont jobs. Updated USERCFG[cam004] and GROUPCFG[camont] MAXPROC=100. This should return to normal on Friday evening!!!
20090730 CJC epgse1 fails SAM tests. Investigation finds that SRMV2.2 service is no longer running. This is almost certainly the reason for the fault and is almost certainly due to the 1.7 DPM bug. Service restarted - next (valid) SAM due just before 10am
20090729 CJC epgce4 still sporadically failing SAM tests for ops and ATLAS. Noticed that ops failed with a Globus Error 10, which is explained here.
20090729 CJC The Maui attribute NODEALLOCATIONPOLICY on epgce3 changed from LOAD to CPULOAD in an attempt to better balance the job allocation.
20090728 LSL To bring epgce4/BB more into line with epgce3, gridusers on BB no longer require a .profile file, the grid environment being set up from /etc/profile.d/grid-env.sh as on epgce3 workers. Checked with pbstest.
20090728 CJC Increased number of ATLAS pilot jobs allowed on epgce3 from 12 to 48. This brings it into line with LHCb pilot jobs (also allowed Peter Love's Pilot DN to run 48 processes).
20090728 LSL Brought my pbstest script up to date. This is useful for submitting jobs and getting the output, for a particular userid. Using this, checked that g-opss07 account on epgce4/BlueBEAR is working OK as far as job submission and output retrieval are concerned: it is.
20090727 CJC Installed lcg-CE-3.1.32 on epgce4 in response to lots of defunct globus-gma processes and there being only LHCb pilot jobs in the queue. LHCb production jobs appear immediately after upgrade
20090727 CJC All experiment software areas for the epgce3 WNs, with the exception of ALICE, are now mounted from epgsr1:/disk/f15d/egee/soft/. More details here
20090727 CJC Installed lcg-CE-3.1.32 on epgce3. This updated globus-gma to version 1.0.12 which I'm assured (TB-SUPPORT 16th July) will address the slow WMS update problem. Also updated glite-SE_dpm_mysql on epgse1 and glite-SE_dpm_disk on epgsr1.
20090727 LSL On BlueBEAR, restrictions were placed on worker machines sshd allowusers in April. This is now updated so as to allow Chris, myself, and also g-admin to ssh from a bluebear login node to grid workers. Note that ssh access uses password, not keys (except for g-admin whose ssh keys are accessible) as /bb is deliberately unmounted on grid workers.
20090727 LSL Noticed for epgce4 no qsub records in /var/log/messages, no jobs being submitted, including SAM test jobs. Rebooted epgce4, and jobs started appearing immediately. Kept ps and netstat outputs in /tmp/psefl.[12] and /tmp/netstat-ntlp.[12] so we can find out what service failed, at leisure.
20090724 LSL We've had 444444 values in bdii responses: see 20090716. Value is in /opt/glite/etc/gip/ldif/static-file-CE.ldif. Spotted that this coincides with message "lcg-info-dynamic-scheduler: VO max jobs backend command returned nonzero exit status" and "Exiting without output, GIP will use static values" in /var/log/messages on CE. Happens a few times per day, but today ~ 40 times for epgce3.
20090724 LSL User g-opss07 had no ssh key files, or .profile, so fixed on BB and epgce4: this has caused SAM tests to fail for a week. File loss unexplained - see also 20090528. Generating keys is documented in LocalGridKeys.
20090724 CJC Certificate renewal for sr1, mo1 and ce2 started. A log of how this was done is available here
20090724 CJC Camont tests complete (until next week). cam004 directive commented out in maui.cfg and GROUPCFG[camont] returns to MAXPROC=4,6
20090723 CJC Curiously, there are queued jobs on epgce1. checkjob -v states they are queued because no resources are available. The majority of jobs belong to ngs. A few more are production/software jobs. Perhaps epgce1 is still broadcast as being available?
20090723 CJC No network saturation was seen in ganglia due to camont jobs. Updated maui.cfg to allow Mark to run 100 camont jobs. Updated USERCFG[cam004] and GROUPCFG[camont] MAXPROC=100. This should return to normal on Friday evening!!!
20090722 CJC Updated maui.cfg to allow Mark to run 25 camont jobs tomorrow. Updated USERCFG[cam004] and GROUPCFG[camont] MAXPROC=25
20090722 CJC Updated maui.cfg to allow more pilot job processes to run on epgce3. Previously limited to 10, now limited to 24 (similar to atlas production)
20090722 CJC Removed empty directories (rmdir) from epgce3:/home/<user>/.globus/job/epgce3.ph.bham.ac.uk/ for users atl073, prdatl08 (Graeme accounts) and atl052, pilatl14, prdatl11, prdatl19 (Peter accounts). Repeated on epgce4 for g-atl012, g-atlo08, g-atlp13, g-atlp17 (Peter) and g-atl057, g-atlp08 (Graeme).
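  The equivalent clean-up loop, roughly (user list from this entry; the find invocation is illustrative):
    for u in atl073 prdatl08 atl052 pilatl14 prdatl11 prdatl19; do
        find /home/$u/.globus/job/epgce3.ph.bham.ac.uk/ -mindepth 1 -type d -empty -exec rmdir {} \;
    done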
20090722 CJC Rebooted epgce3 after failing to see any pilot jobs from hammer cloud
20090721 CJC Started a journal of HammerCloud observations
20090717 CJC Updated Maui FairShare settings on epgce3. ATLAS now has 43% (previously 28%), LHCb 21% (previously 17%) and Zeus 13% (previously 3%). Biomed, Camont and H1 maintained at 3% each. All other VOs reduced to nominal value of 1%. epgce4 remains unchanged for now (ATLAS on 41%). LHCb share will eventually be reduced and replaced by ALICE.
20090716 LSL/CJC epgce4 GSTAT information showed GlueCEStateWaitingJobs: 444444, and Ricardo of LHCb had mentioned this. On the other hand, my own LDAP query showed sensible information. Nevertheless, rebooted epgce4. A compare of process names before/after showed that a service globus-job-manager-marshal restart might have achieved the same thing.
20090715 CJC APEL Publisher RSS feed status now reporting "ERROR [Please use the Gap Publisher to synchronised this dataset]". Publisher rerun on epgmo1 with Gap settings.
20090715 CJC Use vimaui on epgce3 to add the lines USERCFG[cam004] MAXPROC=24 and GROUPCFG[camont] FSTARGET=5 MAXPROC=20,24. This should allow Mark Slater to run Camont tests over the next few days. This change should be removed before 18/07/09!
20090715 CJC Restored WMS to pre RALDownTime defaults on local and bluebear UI. Some VO configurations on BlueBear were updated to reflect the supported WMS
20090714 CJC Installed glite-apel-core-2.0.9-13 and glite-apel-publisher-2.0.9-10 on epgmo1 at the suggestion of the GGUS Helpdesk. Rerunning apel publisher to see if the accounting problem is fixed
20090714 CJC Removed /home/lcgui/SL4/prod/glite/etc/voms/na48-na48-voms.cern.ch on the local UI and replaced it with /home/lcgui/SL4/prod/glite/etc/voms/na48-voms.cern.ch. This should contain the correct VOMS server information to contact.
20090710 CJC Restarted httpd, gmond and gmetad services on epgmo1, re-enabling the Ganglia monitoring.
20090710 CJC epgmo1:/opt/glite/bin/apel-publisher edited to allow 2048MB of memory. Resulted in heavy swapping on epgmo1 but apel has not yet crashed. This might be the source of the occasional RGMA SAM test failures due to timeouts.
20090710 CJC Updated glite-MON on epgmo1. Backup of mysql database dumped to ~cjc/Grid/ManRepo/Backups/epgmo1/20090708/. Reran yaim, with APEL_PUBLISH_LIMIT variable added. New yaim config backup can be found at ~cjc/Grid/ManRepo/Yaim/epgmo1/. This site-info.def file is a stripped down version of the existing file, so other services (Nagios?) may now be broken/missing.
20090709 CJC epgd02 rebooted remotely. Automounts correct lhcb software disk.
20090709 CJC LHCb software area copied from epgce3:/egee/soft/lhcb to epgsr1:/disk/f15d/egee/soft/lhcb after gus ticket reported disk was full. epgce3 WNs updated to mount correct software area. Full details here.
20090707 LSL The DPM packages got updated to 1.7 on 13 May on epgse1 and epgsr1 because my action on 20090512 to chkconfig yum off was insufficient: I should have done a service yum stop too!
20090707 LSL LHCb requests that the qstat command is available to WNs so that jobs can discover their own time-left (GGUS ticket 50000). This is a reasonable request, particularly in a job-wrapper script. Module environment is currently disabled for grid WNs. So the SL5 and SL4 versions of qstat from subdirs of /cvos/shared/apps/torque/ have been copied to /egee/soft/local/ and these are now linked to at grid WN boot time into /usr/bin in base and in SL4 chroot.
20090702 CJC Global problems with the APEL publishing cause CIC to broadcast a request that APEL SAM tests be ignored until further notice.
20090702 CJC Steve Lloyd's tests are reporting Birmingham having 320 CPUs total, but 567 free. Not sure how to fix this ...
20090702 CJC Problem voms-proxy-init'ing on eprexa. Updated certificates, but problem was down to large number of files in /tmp. Emailed Mark to ask about deleting temporary files.
20090702 CJC Changed RFIO buffer on local farm and grid CE's to 4096 bytes. Other sites (QMUL) have run with such a small buffer for a long time and have not experienced problems
20090702 CJC BB back online. Failed one SAM test, but I think this was because the cluster wasn't back up before the end of the scheduled downtime. Four ATLAS pilot jobs are already running, even though the ldap information has not been updated. Reverted /opt/lcg/libexec/lcg-info-dynamic-pbs to original file.
20090629 CJC Haven't solved the mysqld problem, just coded around it :s Added the line tmpdir=/var/tmp to the [mysqld] field in /etc/my.cnf, and this seems to have fixed the problem as I can now restart mysqld as a service. Other dpm/srm/rfio services restarted. Successfully lcg-cr and cp'ed a file from the SE. Awaiting results of SAM tests within the hour...
20090629 CJC Checked yum logs on se1 and sr1. Looks like DPM 1.7 was installed automatically on 13th May. Why problems with the schema only appeared now is still a mystery. Followed these instructions to change the schema - reasonably painless. Can now lcg-cp/r as before and local rfio access has been restored. mysqld still causing problems - can only run as user process. GGUS ticket opened.
20090629 CJC Updated ldap information for epgce4 as BB downtime is causing problems for LHCb (they can still see the queue). Would normally use qmgr -c "set queue glong enabled=false", but we don't own the torque server for BB so this is not possible. Instead, edited /opt/lcg/libexec/lcg-info-dynamic-pbs so that push @output, "GlueCEStateStatus: $Status\n"; becomes push @output, "GlueCEStateStatus: Draining\n";. Backup saved to /opt/lcg/libexec/lcg-info-dynamic-pbs.20090629. The command lcg-info --vo atlas --list-ce --attrs 'CEStatus' confirms that epgce4 is not available for jobs.
20090628 CJC Birmingham site entered into the Atlas Ganga blacklist. epgse1 failing ops and atlas SAM tests - can't lcg-cp from SE.
20090628 CJC yum update -y lcg-CA on epgd01-16
20090628 CJC pbsnodes -a on epgce4 shows status of all BlueBEAR nodes. grep on u4n shows only 14 cpus have jobs (some cpus have more than one job), the highest being u4n128
20090627 CJC Very large (~103) number of gridftp processes running on epgse1. Checked log file and the majority of requests are coming from a single H1 user certificate at a range of different sites. Stopped srmv1, srmv2 and srmv2.2 services temporarily (triggered a failed SAM test) and the number of gridftp processes decreased. Restarted services and epgse1 became more responsive (including responding to local athena file requests). Will continue to monitor and get in touch with H1 user if appropriate.
20090627 CJC SAM tests on epgce3 and epgse1 fail when trying to copy files to the SE. /tmp directory missing on SE, due to mistake on Monday when trying to distribute new certificates using rdist. /tmp has been restored with the same permissions as those on epgsr1. CRL cron job running manually. Expect another set of SAM tests within the hour, which should reveal if this has fixed the problem.
20090626 CJC BB jobs queued again. Re-running fetch CRL script. qstat/qs command fails to respond. Full details here.
20090626 CJC Post Edit: This didn't fix anything! It just republished already existing data! Addressed APEL problem in GGUS tickets 49689 and 49453. Haven't fixed java memory problem, but I think accounting can be successfully published on a day by day basis. Example APEL config file can be found on epgmo1:/opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml.BHAM.gap. Only May 20->May 21 have been published in this manner - waiting for confirmation that it works before proceeding.
20090624 CJC Distributed updated lcg-CA certificates to BlueBEAR and local UIs. Completed by tarring epgce4:/etc/grid-security/vomsdir/ and certificates/ and installing them in /egee/soft/middleware/etc/grid-security/ on BEAR and /home/lcgui/SL4/etc/grid-security/ on the local system. Backups copied to vomsdir.20090625/ and certificates.20090625 in the installation directories.
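  A sketch of the tar-and-unpack step described above (paths from this entry; exact commands not recorded):
    tar -C /etc/grid-security -czf /tmp/grid-security.tgz certificates vomsdir
    # on BEAR and the local UI: back up the existing certificates/ and vomsdir/ first, then
    tar -C /egee/soft/middleware/etc/grid-security -xzf /tmp/grid-security.tgz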
20090624 CJC Regenerated /etc/grid-security/vomsdir/grid-voms.desy.de.8119.pem on epgse1 and epgsr1 based on Zeus .crt file on CIC. Removed grid-voms.desy.de.pem file on se (no serial number in the file name). Also removed /etc/grid-security/vomsdir/zeus/grid-voms0.desy.de.lsc (additional 0 in filename).
20090624 CJC Installed pine using yum on epgce3-4, epgmo1, epgse1 and epgsr1 - I really miss the pico editor!
20090623 CJC After speaking to Joseph, I've killed all his jobs on epgce3 and 4. According to Ganga he didn't have any jobs running, but was previously having problems with them entering the sleep state.
20090623 CJC Adding support for the vo.u-psud.fr VO for Karl. More details can be found here. Confirmed as working.
20090623 CJC Updated lcg-CA on epgce1-4 and epgmo1. epgse1, epgsr1 and ce3 worker nodes all appear to have updated automatically. local UI, BB UI and BB WNs not yet updated (don't know how).
20090623 CJC ALL BB JOBS QUEUED - problem fixed by updating lcg-CA to 1.30 and running manual download of CRLs
20090618 LSL Working on APEL publishing problem still, see 20090615, GGUS ticket 49453. Rebooted epgmo1 which provoked another error: tomcat5 wouldn't start because /etc/tomcat5/tomcat5.conf had JAVA_HOME="/usr/java/jdk1.6.0_10" (file last modified 20090106) whereas only java jdk version is 1.6.0_12 (installed by Yves 20090205). Very odd that it was running at all before my reboot, but the previous reboot was 20090115, before the jdk update, so this was a problem waiting to happen!
20090616 CJC Changed UI setup on local system and BlueBEAR to cope with RAL WMS downtime. Full details can be found here.
20090615 LSL/CJC To fix UI client problem for dpm, soft-link /home/lcgui/SL4/prod/lcg/lib/libshift.so.2.1 -> libdpm.so, and /bb/projects/lcgui/prod/lcg/lib/libshift.so.2.1 -> libdpm.so .
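  i.e. roughly (paths from this entry):
    ln -sf libdpm.so /home/lcgui/SL4/prod/lcg/lib/libshift.so.2.1
    ln -sf libdpm.so /bb/projects/lcgui/prod/lcg/lib/libshift.so.2.1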
20090615 LSL Submitted a GGUS ticket to request help on epgmo1 not publishing APEL information. Symptoms in apel.log are Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded. More recently getting "Warning: No producers found to answer query" messages, but others are getting this which may be due to RAL machine room manoeuvres.
20090612 LSL Following a GGUS ticket pointing out that compat-libstdc++-33.i386 was missing on epgd11, checked and found that only 6 of the 16 nodes were identical for packages. epgd01 to epgd09 have been yum-updated for RedHat packages only, using yum -c /etc/yum.conf.wnredhat update. The remainder will be done if there are no adverse effects. This will at least remove minor discrepancies amongst the packages. Then we can check for major package differences.
20090612 LSL On BlueBEAR, on request, Alan has added a new user g-admin, which can be sudo'd to by me and Chris, just like the other g-users. This ID is now used for files previously owned by Yves, namely /bb/projects/lcgui/ and /egee/soft/middleware/, and for the update-crl cron jobs that run on BB on selected WNs. Checked file /egee/soft/middleware/log/fetch-crl-cron.log later to ensure that the cron jobs were continuing to run successfully: they were.
20090612 LSL make /bb/projects/lcgui/etc/grid-security point to /egee/soft/middleware/etc/grid-security, so that the UI software on BB for local users uses the same certificates and CRLs as are used by grid jobs in the grid-section of BB. This saves having to do a separate cron job or manual update.
20090611 LSL Put in bonding for the epgsr1 interfaces to speed up fetching and rfio reads: documented at LocalGridBonding.
20090608 LSL Tweak the maui scheduling limits for cpu bound users like lhcb and zeus, to make best use of the clusters.
20090604 LSL Fixed up proper ownership of software areas on BlueBEAR, which had not been done since the software areas were copied by Yves from eScience cluster (on epgce2), which used a different uid/gid layout. Particularly necessary for atlas and lhcb areas. Kept Alessandro (ATLAS) and Vladimir (LHCb) informed.
20090603 LSL Noticed that STEP 09 jobs are randomly going to short or long queues on both CEs: a time Requirements expression is missing from the jobs. Wrote a script to qalter and qmove jobs of the relevant user.
20090602 LSL Checked that fair-shares in maui matched those requested by ATLAS and LHCb, and set MAXPROC limits for groups (and some users) to match those fair-shares. To make those semi-permanent for moab on BlueBEAR, so it survives a moab restart, I got Alan to add those to the moab.cfg file.
20090531 LSL Added pilot users for atlas and lhcb. See LocalGridPilotAdd.
20090528 LSL Looking at BB error messages, observed that a few grid users lacked ssh-keys (and also .profile) so job output would always fail to get copied back: g-atl046 g-dtm005 g-dtm056 g-dtm084 g-dtm100 g-ops006. As no-one else lacked these, it is maybe safe to assume that this was a self-inflicted problem by those users.
20090527 CJC Manual edit of site-info.def and vo.d/biomed on epgse1 and epgsr1 to remove extra "/" at end of VO_VOMSES
20090526 LSL Updated bouncycastle package on epgmo1 following advice from APEL team: see 20090520; now the java exceptions for republishing APEL information do not occur. Running my /root/bin/apel-republish to re-process old APEL data, week by week to avoid overloading RGMA, with user DN accounting info added.
20090526 CJC Manual edit of /etc/grid-security/vomsdir/biomed/cclcgvomsli01.in2p3.fr.lsc on epgse1 and epgsr1 to remove extra "/" character at end of first line.
20090521 LSL Updated static information for epgce4: number of job slots to 192, the new figure as of last week. Updated CE_PHYSCPU and CE_LOGCPU in site-info.def, and corresponding definitions in file /opt/glite/etc/gip/ldif/static-file-Cluster.ldif; yaim not actually run. Confirmed that ldap query and later GSTAT reflected the change.
20090521 LSL Updated kernel on epgce3 from 2.6.9-78.0.1 to 2.6.9-78.0.22, as checking new logging didn't reveal the cause of the problem. Not rebooted as yet.
20090521 CJC Copied biomed_certificate.crt to /etc/grid-security/vomsdir/ on epgse1 and epgsr1 (certificate obtained from CIC).
20090521 LSL epgce3 was down 19:30 last night to 09:35 this morning. Similar symptoms to 20090430. Check logging. Approx 77 long jobs continued to run.
20090520 LSL 13:28 to 13:55 on epgce4: qsub and qstat were failing to contact the BlueBEAR pbs server. DNAT ruleset in BlueBEAR export machine, which routes packets from epgce4 to the qmaster machine on the BlueBEAR private network and back, was temporarily missing after Aslam restarted Shorewall. Rang him to remind him that it was necessary to run my /root/nat-qmaster.sh script which adds that ruleset. I must put that in the init.d/shorewall script or put a check in an hourly cron job on bbexport.
20090520 LSL Configuring epgmo1 following info on this DN accounting advice page so that encoded user-DN information is published in APEL accounting. However, this led to java null pointer exceptions in the apel publisher cron job, which I have issued a ticket for.
20090518 LSL f9 Raid replacement disk (ST3750640AS, 750GB) has arrived. Cloning failing drive slot 12 (to spare in slot 10). Then removed failing drive and inserted replacement as new local spare. Reported similar problem in drive slot 11 to supplier.
20090515 LSL looking at possibly redeploying epgce1x as a future backup BDII (mac addr ending f3:79, one of our 2004 Streamline-supplied front-ends, i686 architecture, not to be confused with current epgce1). Successfully installed this machine as a SL4.6 32-bit glite site BDII, with the new site-info.def, and tested its ldap responses remotely.
20090515 LSL created new site-info.def file, starting from the example distributed with glite 3.1, with customisation compatible with our previous site-info.def files. CE-dependent definitions omitted and in future these can go into a subdirectory file.
20090515 LSL fixed problem of epgce2 still advertised as a CE by the epgce2 BDII: on epgce2, kept a ORIG copy of /opt/glite/etc/gip/ldif directory and then deleted static files relating to CE role. Also stopped the globus-gatekeeper service.
20090514 LSL copied off accounting records from epgce1x machine (aka epaf18!) so it can be redeployed. We also have a copy of all accounting records on current epgce1 and epgce2.
20090513 LSL writing a script to add pilot userids/groups to our CEs.
20090512 LSL Pete Gronbech reports that DPM 1.7 is imminent and requires a schema change so should not be done automatically. I've done "chkconfig yum off" on epgse1 and epgsr1 as quick fix to avoid an automatic upgrade.
20090512 LSL Created tgz file from epgce4:/etc/grid-security/{certificates,vomsdir}. Used that to create an updated bluebear /egee/soft/middleware/etc/grid-security/ directory for bluebear's WNs. Used that also on PP system to update /home/lcgui/SL4/etc/grid-security. Ensured preserving same ownership as the userid which automatically updates the CRLs.
20090512 LSL Updated all epgce3 WNs to lcg-CA 1.29: yum -y update lcg-CA.
20090508 LSL Updated lcg-CA pkg to 1.29 on epgce[234] and epgmo1. Servers epgs[er]1 were auto-updated Weds 4am. To do: propagate to UI and WNs.
20090508 LSL Raid f9 reports media write error 311 for drive in slot12, reassign count = 8, though RAID is still in Good status. Googling shows error 311 is Write to disk error. Reported to vendor.
20090508 LSL Steve Lloyd monitor recording a lot of failed Atlas jobs on CE epgce3. SAM tests running clean, apart from an early warning about cert of NIKHEF. Most jobs are running on epgd01. Turns out that epgd01 has no remote NFS mounts currently - all other nodes have the software areas mounted correctly. Put epgd01 offline, pending investigation.
20090506 CJC/LSL epgse1: to remedy SRM GGUS-notified problem, srm1 service restarted.
20090430 LSL epgce3 restarted twice today (11:05 and 20:28), because it froze: responds to ping and telnet 22 but not ssh and console frozen too. Put in extra logging in new /etc/cron.daily/minutely/ to monitor memory use and filesystems.

-- ChristopherCurtis - 19 Jan 2010
