Local Grid Journal 2013

This is a reverse-chronological diary of events, without retrospective editing (so keep it raw and short, max ~3 lines). See other pages like LocalGridInternals for more carefully considered documentation.

20130501 MWS To try to drag the life of f14 out a bit more, I've set the weight to 0 so new files shouldn't be written there
20130501 MWS More issues with disks on epgsr1 followed up with a kernel panic. Decided to put f14 back on its own card but just on one channel to see if it's more stable
20130429 MWS Problem found with right (1) channel on f14. Swapped into one daisy chain with other RAIDs on epgsr1 but missed out this channel on f14. Also, switched all LUNs to the one channel in f14 config
20130405 LSL Tidying epgd workers. Moved eight 160GB disks from epgd21-24 to epgd01-08, and two 160GB disks from epgd20 to epgd12 and epgd15 (each had one bad disk). Installed ten new 500GB disks in epgd20-24. So epgd01-19 now have two 160GB, epgd20-24 now have two 500GB.
20130405 MWS Upgraded CREAM, Torque and BDII to EMI2 and jobs seem to be going through OK. Still have ARGUS and APEL to go but I'm waiting for the accounting to sort itself out first.
20130405 MWS Applied the security kernel upgrade to all nodes and rebooted
20130220 MWS Added another QOS group of 'high' for all sgm/ops jobs as setting the priority on the groups didn't seem to be working
20130220 MWS Changed the sysctl network parameters for the pool nodes as these apparently stop the slow transfer to BNL problem (https://ggus.eu/ws/ticket_info.php?ticket=86105). This is still not understood!
20130219 MWS Disabled the cache in Argus as I've heard this will stop the Argus service from failing once a week
20130131 MWS Set the parameter FSQOSWEIGHT and removed user and group ones in the maui.cfg to actually make it register the new QOS fairshares
20130130 MWS After running process accounting on epgf01 for 24 hours and checking the log files didn't blow up, I've enabled process accounting (psacct) on all machines through puppet
20130128 MWS Changed the MAUI config to use QOS Fairshare targets for the various 'groups': Alice(60):Atlas(30):LHCb(5):Others(5). Blanked all the others.
20130125 MWS Added the OpenIPMI-tools to puppet so all machines will now have this installed.
20130124 MWS Released the limit of ALICE jobs in MAUI so they are now only controlled by fairshare. I'll see if we get any more problems with too much load on the epf* nodes.
20130122 MWS Changed the fair share for ATLAS jobs to 15% from 23% as the combined amount of pilot and prod jobs were ~50% (i.e. too high) and blocking the site. Need to improve the fairshare to take account of multiple groups...
20130121 MWS Installed VomsSnooper (and the required java-1.7.0-openjdk) on epgpe03 to ease keeping the VOMS info up to date. Go to /opt/GridDevel/vs_scripts, run set_rpm_paths.sh, go to usecases/newVomsRecsForMySite, run voidRun.sh and then use void/xml/site-info.def to update the site-info.def template.
20130121 LSL C6145: Dell informs us that C6145 batteries for RAID units have a short life and can be proactively replaced.
20130108 LSL BB2 nodes: Investigate 10GbE driver: SL5 built-in version doesn't work, so download mlnx4_en driver version 1.5.9 from Mellanox website, and get it working.
20130108 LSL/MWS BB2 nodes: decide it best not to ask ITS to replace Mellanox ConnectX-3 firmware for PXE as this will make swap of a failing node much more difficult.
20130104 MWS/LSL SE (implemented on epgpe10) using EMI2, previously flaky, is now performing reliably since the nscd service was put into use, caching DNS requests
20130103 LSL RAID f25 switched on, logical drives already in place since before Christmas, now logical volumes created and then partitioned, as required for ESDS raids.
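
Note on the 20130128 QOS fairshare targets: a rough sketch of how to compare the targets Alice(60):Atlas(30):LHCb(5):Others(5) against observed usage. This is not Maui's actual priority calculation; the function name and the usage figures are hypothetical, only the target numbers come from the entry above.

```python
# Targets from the 20130128 entry (percent of the farm per QOS group).
TARGETS = {"alice": 60, "atlas": 30, "lhcb": 5, "others": 5}


def fairshare_deviation(targets, usage):
    """Return target share minus observed share, in percentage points, per group.

    Positive means the group is below its target, negative means above it.
    Hypothetical helper for illustration; not part of Maui.
    """
    total_target = sum(targets.values())
    total_usage = sum(usage.values())
    if total_target <= 0 or total_usage <= 0:
        raise ValueError("targets and usage must each sum to a positive value")
    deviation = {}
    for group, target in targets.items():
        target_pct = 100.0 * target / total_target
        used_pct = 100.0 * usage.get(group, 0) / total_usage
        deviation[group] = target_pct - used_pct
    return deviation


if __name__ == "__main__":
    # Assumed usage figures (e.g. core-hours), not measurements from the site.
    example_usage = {"alice": 400, "atlas": 500, "lhcb": 50, "others": 50}
    for group, delta in fairshare_deviation(TARGETS, example_usage).items():
        print(f"{group}: {delta:+.1f}")
```

With the assumed usage, a group running at 50% against a 30% target shows up as 20 points over, which is the sort of imbalance the 20130122 entry describes.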

