Local Grid Journal 2012

This is a reverse-chronological diary of events, without retrospective editing (so keep it raw and short, max ~3 lines). See other pages like LocalGridInternals for more carefully considered documentation.

20121207 MWS/LSL System installed on BB2 nodes via their 1Gbit interface using normal PXE, after some trial and error. Next: look at supporting their 10GbE interface, ideally also with PXE.
20121206 LSL/MWS Physics power failed at approx 2am. All back up before 10am.
20121130 LSL Another failed C6145 drive, beeping and flashing amber, this time on epgf08, reported to Dell yesterday, fitted replacement today.
20121121 LSL On advice from Dell SkyTech engineer, update C6145 fan control board (FCB) firmware on two chassis from [0117] to [0118]. Two were already at [0118].
20121106 MWS/LSL Campus power failed 6th November, finally fixed 23rd November, meanwhile on a reduced supply so grid is in downtime.
20121019 LSL Notice beeping and see that C6145 system epgf07 slot 9 disk is flashing amber. Contact Dell Pro Support 0844 444 3844. New drive on Monday, pick up old on Tuesday.
20120928 MWS Mark observes epgd19 had problems and is now showing just 12GB of its 16GB memory in /proc/meminfo. Marked offline.
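A minimal sketch of how a memory shortfall like the one on epgd19 could be spotted automatically. It parses the MemTotal line of /proc/meminfo and compares it with the expected size. The function names and the 10% tolerance (to allow for memory reserved by the kernel) are assumptions for illustration, not part of our tooling.

```python
import re


def mem_total_kb(meminfo_text):
    """Return the MemTotal value in kB from /proc/meminfo-style text."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no MemTotal line found")
    return int(match.group(1))


def memory_looks_short(meminfo_text, expected_gb, tolerance=0.10):
    """True if MemTotal is more than `tolerance` below the expected size.

    The tolerance is a guess: the kernel always reports a little less
    than the installed memory, so an exact comparison would misfire.
    """
    expected_kb = expected_gb * 1024 * 1024
    return mem_total_kb(meminfo_text) < expected_kb * (1 - tolerance)
```

On a worker this might be called as memory_looks_short(open("/proc/meminfo").read(), expected_gb=16).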
20120924 MWS/LSL Mark is testing EMI2 upgrade to epgse1 on a VM and later on bare machine - causes difficulties, later reverts to old SE after a number of days of testing.
20120924 MWS/LSL Mark temporarily offlines epgu1n046 for Aslam and BB2 installers to test a BB2 10GbE interface on our Force10 switch.
20120919 MWS/LSL Mark adds some TCP settings in sysctl.conf on storage pool nodes to see the effect on transfer speeds. Lawrie suggests a future refinement would be to tweak txqueuelen to 10000.
20120918 LSL/MWS Do some traceroutes to BNL as additional info for very slow transfer, add to GGUS ticket 86105.
20120806 LSL Noticed that over the last couple of days, epgu1n021 had two syslogged errors: EDAC k8 MC1: extended error code: ECC chipkill x4 error.
20120802 LSL After successive media write errors on f17 over the last weeks, replacing 3 drives, one by one; about 6 hours to do a rebuild for each one. RAID status remains Good.
20120730 LSL Noticed that epgf05 had 24 July syslog kern.info entry: Northbridge Error (node 4): ECC Error in the Probe Filter directory; also 27 July ditto node 0. Nothing in ipmitool sel list.
20120716 LSL Upgrade epgf03 04 07 BIOS to 2.6.0, and BMC to 1.08, like the others. Now all complete.
20120713 LSL Benchmarks finished on epgf01 02 05 06. All returned to GRID use. No IPMI temperature events now (or over the weekend).
20120711 LSL Upgrade epgf08 BIOS from 2.4.0 to 2.6.0, and BMC from 1.04 to 1.08. Same for epgf01 and epgf02. epgf08 returned to use. Still to do: 03 04 07.
20120710 LSL/MWS epgf08 turns itself off at 10:03. Log file /var/log/acpid entries BEGIN/END HANDLER MESSAGES implicate some power event or other at that time.
20120703 MWS Mark notices that epgf01 and epgf02 are very hot and have critical messages in BMC/ipmi log. Turns them off.
20120620 LSL Updated epgf05 and epgf06 BIOS from 2.4.0 to 2.6.0, and BMC from 1.04 to 1.08. But their fans are still running at half the speed of other units.
20120620 LSL Install replacement disks on f15 and f16: now back to fully redundant.
20120620 MWS/LSL All services up and working on the new BHAM-1 network.
20120619 LSL/MWS Installation of 14kW Mitsubishi complete and working by about 4pm. DNS updated by network team to new BHAM-1 subnet. Start reconfiguring for new network.
20120618 LSL/MWS In grid downtime: computer room is polythene-curtained-off for new 14kW Mitsubishi aircon unit to be installed, today and tomorrow.
20120615 LSL Observe that BHAM-1 subnet test DNS definitions are in place, will try to speak to Nick on Monday when he's in.
20120608 LSL/MWS Status is epgd* on, epgu* on, epgf01,02,07,08 on. epgf05,06 are off for engineer. epgf03,04 are off because aircon is at 75% capacity.
20120608 LSL Ran Intel's bootutil on epgf03,04,05,06 to configure the 10GbE interfaces to do PXE boot. See below.
20120607 LSL Ran Intel's bootutil on epgf01,02,07,08 to configure the 10GbE interfaces to do PXE boot. See DellC6145Init page for more details.
20120606 LSL epgf05 and epgf06 are over-heating: flashing green/amber power indicator, and lots of entries in ipmi/BMC log. Maybe lack of fan-speed. Mark is draining these.
20120606 LSL On epgf03-06, fix ifcfg-eth0 file so that it will pick up IP of 10G interface via DHCP. epgf05-06 are using this now for all jobs; epgf03-04 will be restarted soon.
20120601 LSL/MWS Big switch-over for all grid services (except epgmo1) to use S4810 switches. Required adding Intel X520-DA interfaces on the pool nodes epgsr*, which are now at 10G rather than 4x1G. BlueBEAR workers' overall bandwidth thereby improved 10-fold. Workers epgf* still using 1G: will be fixed asap.
20120529 LSL Braywhite fit new Mitsubishi aircon PKA-RP100KAL on left wall. Minimal drilling and no welding or soldering so disruption was minimal, it turned out. Still 1 aircon down, 3 working.
20120524 LSL Try without portable aircon, with just epgf03,04 and epgd01,24 on, plus BB of course. Temp 22 deg (sunny), 22.5 @ 14:30, 22.7 at 16:45.
20120523 LSL/MWS Mark has put kickstart in place for BB nodes, Phil in Elms Road has rebooted them to get system. I've put final 3 gig inserts in place and cabled them separately. BB workers epgu1n001-047 are available. Overall bandwidth will still be limited to 1G until big switch-over for grid services (next week).
20120523 LSL Portable aircon heat-dump to corridor during day. Temperature 18 deg but can't online PP workers because many jobs are longer than working day. Rang maintainers to request they ensure a quote for compressor is with Estates by tomorrow.
20120522 LSL Aircon unit A compressor is over-current, probably internal flow problem, say Integral. They provide two portable aircon units but no suitable heat dump area overnight.
20120521 LSL Reported aircon not coping with heat-wave. Maintenance informed. Temp goes to 35 deg so Mark starts shutting workers.
20120521 LSL S4810 switch #5 installed in BlueBEAR rack, supports u1n001 up, and 10 Gb/s fibre link works after T/R swap. Units picking up Mark's dhcp updates.
20120516 LSL/MWS C6145 unit epgf05 + epgf06 now in production.
20120509 LSL Dell engineer replaced faulty 4GB memory card on epgf02 (JCLB95J). Benchmark HS06 subsequently ran to completion (five hours) without problems.
20120508 LSL/MWS Mark starts to use a nominal 10 HEP-SPEC06 (2500 SI2k) in information system for cluster, together with a $cputmult and $wallmult in mom_priv/config for our different processors.
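A rough sketch of the arithmetic behind the nominal-power scheme, assuming the usual convention of 250 SI2k per HEP-SPEC06 (so a nominal 10 HS06 publishes as 2500) and assuming each multiplier is simply the node's per-core HS06 divided by the nominal value. The per-core figure in the usage note is made up for illustration and is not a measured value for any of our processors.

```python
NOMINAL_HS06 = 10.0      # nominal per-core power published for the cluster
SI2K_PER_HS06 = 250.0    # conventional conversion: 4 HS06 = 1 kSI2k


def scaling_factor(per_core_hs06, nominal=NOMINAL_HS06):
    """Ratio of a node's real per-core power to the published nominal."""
    return per_core_hs06 / nominal


def mom_config_lines(per_core_hs06):
    """Return the two mom_priv/config lines for one node type.

    Both CPU time and wall time are scaled by the same factor, so that
    usage accounted on this node is expressed in nominal-core units.
    """
    factor = scaling_factor(per_core_hs06)
    return ["$cputmult %.3f" % factor, "$wallmult %.3f" % factor]
```

For a hypothetical processor rated at 8.0 HS06 per core, mom_config_lines(8.0) gives "$cputmult 0.800" and "$wallmult 0.800".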
20120508 LSL/MWS C6145 unit epgf03 + epgf04 now in production.
20120504 LSL Mark has disabled BB CE so last grid job to run via BB Torque/MOAB finished on 2nd May 2012.
20120504 LSL Re-install epgf03 after benchmarks. epgf04 not responsive so re-install: now ok, so run 48-core benchmark as burn-in/evaluate.
20120503 LSL epgf02 has Northbridge memory errors. Checked with a SL6 system as well. Engineer called: will visit Tuesday. Running benchmark on epgf03 as burn-in/evaluate.
20120503 MWS/LSL Install SL5.8 on epgf03 using kickstart with XFS root filesystem, after several iterations in the %pre section.
20120426 LSL Required disks for C6145 units have arrived so fit them and start formatting them as RAID.
20120426 LSL Discussion with PSH of BB about son of BB cluster: due Aug/Sept 2012. Will need to reconfigure GridPP to low BB nodes first, because of phased replacement programme.
20120418 LSL CVMFS on BB cluster tends to get "Transport endpoint is not connected" more often than PP cluster, so detect the problem and lazy-umount the mount when it occurs. Seems to work. Hunch that the latest cvmfs version 2.0.13 would include a fix for this problem proves incorrect in practice.
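A minimal sketch of the detect-and-lazy-umount idea from the CVMFS entry above; it is not the script actually deployed. It assumes the stale mount shows up as errno ENOTCONN (the error behind "Transport endpoint is not connected") when the mount point is stat()ed, and that umount -l is run with sufficient privilege. The function names and the injectable run parameter are illustrative only.

```python
import errno
import os
import subprocess


def endpoint_disconnected(path):
    """True if stat() on path fails with ENOTCONN."""
    try:
        os.stat(path)
    except OSError as exc:
        return exc.errno == errno.ENOTCONN
    return False


def lazy_umount_if_stale(path, run=subprocess.call):
    """Lazy-unmount path if it is stale.

    Returns True if an unmount was attempted. `run` is injectable so
    the logic can be exercised without touching real mounts.
    """
    if endpoint_disconnected(path):
        run(["umount", "-l", path])
        return True
    return False
```

This could be called periodically on a path such as /cvmfs/atlas.cern.ch (an example repository name, not taken from this journal); the next access should then trigger a fresh automount.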
