Local Grid Journal 2013
This is a reverse-order diary of events, without retrospective editing (so keep it raw and short, max ~3 lines). See other pages like LocalGridInternals for more carefully considered documentation.
| 20130220 | MWS | Added another QOS group of 'high' for all sgm/ops jobs, as setting the priority on the groups didn't seem to be working. |
| 20130220 | MWS | Changed the sysctl network parameters for the pool nodes, as these apparently stop the slow transfer to BNL problem (https://ggus.eu/ws/ticket_info.php?ticket=86105). This is still not understood! |
| 20130219 | MWS | Disabled the cache in Argus, as I've heard this will stop the Argus service from failing once a week. |
| 20130131 | MWS | Set the parameter FSQOSWEIGHT and removed the user and group ones in maui.cfg to actually make it register the new QOS fairshares. |
| 20130130 | MWS | After running process accounting on epgf01 for 24 hours and checking the log files didn't blow up, I've enabled process accounting (psacct) on all machines through puppet. |
| 20130128 | MWS | Changed the MAUI config to use QOS Fairshare targets for the various 'groups': Alice(60):Atlas(30):LHCb(5):Others(5). Blanked all the others. |
| 20130125 | MWS | Added OpenIPMI-tools to puppet so all machines will now have this installed. |
| 20130124 | MWS | Released the limit of ALICE jobs in MAUI so they are now only controlled by fairshare. I'll see if we get any more problems with too much load on the epf* nodes. |
| 20130122 | MWS | Changed the fair share for ATLAS jobs to 15% from 23%, as the combined amount of pilot and prod jobs was ~50% (i.e. too high) and blocking the site. Need to improve the fairshare to take account of multiple groups... |
| 20130121 | MWS | Installed VomsSnooper (and the required java-1.7.0-openjdk) on epgpe03 to ease keeping the VOMS info up to date. Go to /opt/GridDevel/vs_scripts, run set_rpm_paths.sh, go to usecases/newVomsRecsForMySite, run voidRun.sh and then use void/xml/site-info.def to update the site-info.def template. |
| 20130121 | LSL | C6145: Dell informs us that C6145 batteries for RAID units have a short life and can be proactively replaced. |
| 20130108 | LSL | BB2 nodes: Investigated the 10GbE driver: the SL5 built-in version doesn't work, so downloaded mlnx4_en driver version 1.5.9 from the Mellanox website and got it working. |
| 20130108 | LSL/MWS | BB2 nodes: decided it best not to ask ITS to replace the Mellanox ConnectX-3 firmware for PXE, as this will make swapping a failing node much more difficult. |
| 20130104 | MWS/LSL | SE (implemented on epgpe10) using EMI2, previously flaky, is now performing reliably since the nscd service was put into use, caching DNS requests. |
| 20130103 | LSL | RAID f25 switched on; logical drives already in place since before Christmas, now logical volumes created and then partitioned, as required for ESDS raids. |
Previous journals:
LocalGridJournal2012,
LocalGridJournal2011,
LocalGridJournal2010,
LocalGridJournal2009.
Created: LawrenceLowe - 07 Jan 2013
Topic revision: r10 - 20 Feb 2013 - /C=UK/O=eScience/OU=Birmingham/L=ParticlePhysics/CN=mark slater