
Local Grid Journal 2013

This is a reverse order diary of events, without retrospective editing (so keep it raw and short, max ~ 3 lines). See other pages like LocalGridInternals for more carefully considered documentation.

20130405 MWS Upgraded CREAM, Torque and BDII to EMI2 and jobs seem to be going through OK. Still have ARGUS and APEL to go but I'm waiting for the accounting to sort itself out first.
20130405 MWS Applied the security kernel upgrade to all nodes and rebooted
20130220 MWS Added another QOS group of 'high' for all sgm/ops jobs as setting the priority on the groups didn't seem to be working
20130220 MWS Changed the sysctl network parameters for the pool nodes as these apparently stop the slow transfer to BNL problem (https://ggus.eu/ws/ticket_info.php?ticket=86105). This is still not understood!
20130219 MWS Disabled the cache in Argus as I've heard this will stop the argus service from failing once a week
20130131 MWS Set the parameter FSQOSWEIGHT and removed user and group ones in the maui.cfg to actually make it register the new QOS fairshares
20130130 MWS After running process accounting on epgf01 for 24 hours and checking the log files didn't blow up, I've enabled process accounting (psacct) on all machines through puppet
20130128 MWS Changed the MAUI config to use QOS Fairshare targets for the various 'groups': Alice(60):Atlas(30):LHCb(5):Others(5). Blanked all the others.
20130125 MWS Added the OpenIPMI-tools to puppet so all machines will now have this installed.
20130124 MWS Released the limit of ALICE jobs in MAUI so they are now only controlled by fairshare. I'll see if we get any more problems with too much load on the epf* nodes.
20130122 MWS Changed the fair share for ATLAS jobs to 15% from 23% as the combined amount of pilot and prod jobs were ~50% (i.e. too high) and blocking the site. Need to improve the fairshare to take account of multiple groups...
20130121 MWS Installed VomsSnooper (and the required java-1.7.0-openjdk) on epgpe03 to ease keeping the VOMS info up to date. Go to /opt/GridDevel/vs_scripts, run set_rpm_paths.sh, go to usecases/newVomsRecsForMySite, run voidRun.sh and then use void/xml/site-info.def to update the site-info.def template.
20130121 LSL C6145: Dell informs that C6145 batteries for RAID units have a short life and can be proactively replaced.
20130108 LSL BB2 nodes: Investigate 10GbE driver: SL5 built-in version doesn't work, so download mlnx4_en driver version 1.5.9 from Mellanox website, and get it working.
20130108 LSL/MWS BB2 nodes: decide it best not to ask ITS to replace Mellanox ConnectX-3 firmware for PXE as this will make swap of a failing node much more difficult.
20130104 MWS/LSL SE (implemented on epgpe10) using EMI2, previously flakey, is now performing reliably since nscd service put into use, caching DNS requests
20130103 LSL RAID f25 switched on, logical drives already in place since before Christmas, now logical volumes created and then partitioned, as required for ESDS raids.
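
The VomsSnooper steps from the 20130121 entry could be wrapped up roughly as below. This is only a sketch: the function name refresh_voms_records and its optional directory argument are made up for illustration, and it assumes both scripts are executable and that set_rpm_paths.sh can simply be executed (if it only exports paths for the current shell it would need sourcing instead).

```shell
# Sketch of the VomsSnooper refresh from the 20130121 entry.
# refresh_voms_records is a hypothetical helper name, not part of VomsSnooper.
# The body runs in a subshell so the caller's working directory is untouched.
refresh_voms_records() (
    # Default is the install path given in the journal entry.
    vs_home="${1:-/opt/GridDevel/vs_scripts}"

    cd "$vs_home" || exit 1
    # Assumption: executing is enough; source it instead if it only sets paths.
    ./set_rpm_paths.sh || exit 1

    cd usecases/newVomsRecsForMySite || exit 1
    ./voidRun.sh || exit 1

    # voidRun.sh is expected to leave the generated records here.
    out="$vs_home/usecases/newVomsRecsForMySite/void/xml/site-info.def"
    [ -f "$out" ] || exit 1

    # Print the path; merging it into the site-info.def template stays manual.
    echo "$out"
)
```

Calling refresh_voms_records with no argument on epgpe03 would use the path from the entry; the printed file is then compared by hand against the site-info.def template.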

Previous journals: LocalGridJournal2012, LocalGridJournal2011, LocalGridJournal2010, LocalGridJournal2009.

Created LawrenceLowe - 07 Jan 2013

Topic revision: r11 - 05 Apr 2013 - 09:19:07 - MarkSlater
 