This is a reverse order diary of events, without retrospective editing (so keep it raw and short, max ~ 3 lines). See other pages like
for more carefully considered documentation.
20121207 |
MWS/LSL |
BB2 nodes system installed using their 1Gbit interface, using normal PXE, after some trial and error. Look at supporting their 10GbE interface, ideally also with PXE. |
20121206 |
LSL/MWS |
Physics power failed at approx 2am. All back up before 10am. |
20121130 |
LSL |
Another failed C6145 drive, beeping and flashing amber, this time on epgf08, reported to Dell yesterday, fitted replacement today. |
20121121 |
LSL |
On advice from Dell SkyTech engineer, update C6145 fan control board (FCB) firmware on two chassis from [0117] to [0118]. Two were already at [0118]. |
20121106 |
MWS/LSL |
Campus power failed 6th November, finally fixed 23rd November, meanwhile on a reduced supply so grid is in downtime. |
20121019 |
LSL |
Notice beeping and see that C6145 system epgf07 slot 9 disk is flashing amber. Contact Dell Pro Support 0844 444 3844. New drive on Monday, pick up old on Tuesday. |
20120928 |
MWS |
Mark observes epgd19 had problems and is now showing just 12GB of its 16GB memory in /proc/meminfo. Marked offline. |
20120924 |
MWS/LSL |
Mark is testing EMI2 upgrade to epgse1 on a VM and later on bare machine - causes difficulties, later reverts to old SE after a number of days of testing. |
20120924 |
MWS/LSL |
Mark temporarily offlines epgu1n046 for Aslam and BB2 installers to test a BB2 10GbE interface on our Force10 switch. |
20120919 |
MWS/LSL |
Mark adds some tcp settings in sysctl.conf on storage pool nodes to see effect on transfer speeds. Lawrie suggests a future refinement would be to tweak txqueuelen to 10000. |
20120918 |
LSL/MWS |
Do some traceroutes to BNL as additional info for very slow transfer, add to GGUS ticket 86105. |
20120806 |
LSL |
Noticed that over the last couple of days, epgu1n021 had two syslogged errors: EDAC k8 MC1: extended error code: ECC chipkill x4 error. |
20120802 |
LSL |
After successive media write errors on f17 over the last weeks, replacing 3 drives, one by one; about 6 hours to do a rebuild for each one. RAID status remains Good. |
20120730 |
LSL |
Noticed that epgf05 had 24 July syslog kern.info entry: Northbridge Error (node 4): ECC Error in the Probe Filter directory; also 27 July ditto node 0. Nothing in ipmitool sel list. |
20120716 |
LSL |
Upgrade epgf03 04 07 BIOS to 2.6.0, and BMC to 1.08, like the others. Now all complete. |
20120713 |
LSL |
Benchmarks finished on epgf01 02 05 06. All returned to GRID use. No IPMI temperature events now (or over the weekend). |
20120711 |
LSL |
Upgrade epgf08 BIOS from 2.4.0 to 2.6.0, and BMC from 1.04 to 1.08. Same for epgf01 and epgf02. epgf08 returned to use. Still to do: 03 04 07. |
20120710 |
LSL/MWS |
epgf08 turns itself off at 10:03. Log file /var/log/acpid entries BEGIN/END HANDLER MESSAGES implicate some power event or other at that time. |
20120703 |
MWS |
Mark notices that epgf01 and epgf02 are very hot and have critical messages in BMC/ipmi log. Turns them off. |
20120620 |
LSL |
Updated epgf05 and epgf06 BIOS from 2.4.0 to 2.6.0, and BMC from 1.04 to 1.08. But their fans are still running at half the speed of other units. |
20120620 |
LSL |
Install replacement disks on f15 and f16: now back to fully redundant. |
20120620 |
MWS/LSL |
All services up and working on the new BHAM-1 network. |
20120619 |
LSL/MWS |
Installation of 14kW Mitsubishi complete and working by about 4pm. DNS updated by network team to new BHAM-1 subnet. Start reconfiguring for new network. |
20120618 |
LSL/MWS |
In grid downtime: computer room is polythene-curtained-off for new 14kW Mitsubishi aircon unit to be installed, today and tomorrow. |
20120615 |
LSL |
Observe that BHAM-1 subnet test DNS definitions are in place, will try to speak to Nick on Monday when he's in. |
20120608 |
LSL/MWS |
Status is epgd* on, epgu* on, epgf01,02,07,08 on. epgf05,06 are off for engineer. epgf03,04 are off because aircon is at 75% capacity. |
20120608 |
LSL |
Ran Intel's bootutil on epgf03,04,05,06 to configure the 10GbE interfaces to do PXE boot. See below. |
20120607 |
LSL |
Ran Intel's bootutil on epgf01,02,07,08 to configure the 10GbE interfaces to do PXE boot. See DellC6145Init page for more details. |
20120606 |
LSL |
epgf05 and epgf06 are over-heating: flashing green/amber power indicator, and lots of entries in ipmi/BMC log. Maybe lack of fan-speed. Mark is draining these. |
20120606 |
LSL |
On epgf03-06, fix ifcfg-eth0 file so that it will pick up IP of 10G interface via DHCP. epgf05-06 are using this now for all jobs; epgf03-04 will be restarted soon. |
20120601 |
LSL/MWS |
Big switch-over for all grid services (except epgmo1) to use S4810 switches. Required adding Intel X520-DA interfaces on the pool nodes epgsr*, which are now at 10G rather than 4x1G. BlueBEAR workers overall bandwidth thereby improved 10-fold. Workers epgf* still using 1G: will be fixed asap. |
20120529 |
LSL |
Braywhite fit new Mitsubishi aircon PKA-RP100KAL on left wall. Minimal drilling and no welding or soldering so disruption was minimal, it turned out. Still 1 aircon down, 3 working. |
20120524 |
LSL |
Try without portable aircon, with just epgf03,04 and epgd01,24 on, plus BB of course. Temp 22 deg (sunny), 22.5 @ 14:30, 22.7 at 16:45. |
20120523 |
LSL/MWS |
Mark has put kickstart in place for BB nodes, Phil in Elms Road has rebooted them to get system. I've put final 3 gig inserts in place and cabled them separately. BB workers epgu1n001-047 are available. Overall bandwidth will still be limited to 1G until big switch-over for grid services (next week). |
20120523 |
LSL |
Portable aircon heat-dump to corridor during day. Temperature 18 deg but can't online PP workers because many jobs are longer than working day. Rang maintainers to request they ensure a quote for compressor is with Estates by tomorrow. |
20120522 |
LSL |
Aircon unit A compressor is over-current, probably internal flow problem, say Integral. They provide two portable aircon units but no suitable heat dump area overnight. |
20120521 |
LSL |
Reported aircon not coping with heat-wave. Maintenance informed. Temp goes to 35 deg so Mark starts shutting workers. |
20120521 |
LSL |
S4810 switch #5 installed in BlueBEAR rack, supports u1n001 up, and 10 Gb/s fibre link works after T/R swap. Units picking up Mark's dhcp updates. |
20120516 |
LSL/MWS |
C6145 unit epgf05 + epgf06 now in production. |
20120509 |
LSL |
Dell engineer replaced faulty 4GB memory card on epgf02 (JCLB95J). Benchmark HS06 subsequently ran to completion (five hours) without problems. |
20120508 |
LSL/MWS |
Mark starts to use a nominal 10 HEP-SPEC06 (2500 SI2k) in information system for cluster, together with a $cputmult and $wallmult in mom_priv/config for our different processors. |
20120508 |
LSL/MWS |
C6145 unit epgf03 + epgf04 now in production. |
20120504 |
LSL |
Mark has disabled BB CE so last grid job to run via BB Torque/MOAB finished on 2nd May 2012. |
20120504 |
LSL |
Re-install epgf03 after benchmarks. epgf04 not responsive so re-install: now ok, so run 48-core benchmark as burn-in/evaluate. |
20120503 |
LSL |
epgf02 has Northbridge memory errors. Checked with a SL6 system as well. Engineer called: will visit Tuesday. Running benchmark on epgf03 as burn-in/evaluate. |
20120503 |
MWS/LSL |
Install SL5.8 on epgf03 using kickstart with XFS root filesystem, after several iterations in the %pre section. |
20120426 |
LSL |
Required disks for c6145 units have arrived so fit them and start formatting them as RAID. |
20120426 |
LSL |
Discussion with PSH of BB about son of BB cluster: due Aug/Sept 2012. Will need to reconfigure gridPP to low BB nodes first, because of phased replacement programme. |
20120418 |
LSL |
CVMFS on BB cluster tends to get "Transport endpoint is not connected" more often than on PP cluster, so detect the problem and lazy-umount the affected mount when it occurs. Seems to work. Hunch that the latest cvmfs version 2.0.13 would include a fix for this problem proves incorrect in practice. |
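
The detect-and-lazy-umount workaround in the 20120418 entry could look roughly like the sketch below. This is not the script actually used on the cluster, only an illustration under stated assumptions: a stale FUSE mount is taken to show up as errno ENOTCONN (the "Transport endpoint is not connected" message) when the mount point is stat'ed, and the function names and the example mount path are hypothetical.

```python
import errno
import os
import subprocess


def is_stale_mount(path):
    """Return True if stat'ing `path` fails with ENOTCONN.

    ENOTCONN is the errno behind "Transport endpoint is not connected".
    Any other outcome (success, or a different error) counts as not stale.
    """
    try:
        os.stat(path)
    except OSError as exc:
        return exc.errno == errno.ENOTCONN
    return False


def lazy_umount_if_stale(path, run=subprocess.call):
    """Lazy-unmount `path` if it looks stale.

    Returns True if an unmount was attempted, False otherwise.
    `run` is injectable so the logic can be exercised without root.
    """
    if not is_stale_mount(path):
        return False
    # `umount -l` detaches the mount now and cleans up once it is no longer busy.
    run(["umount", "-l", path])
    return True


if __name__ == "__main__":
    # Hypothetical mount point; substitute whichever repositories the site mounts.
    lazy_umount_if_stale("/cvmfs/atlas.cern.ch")
```

Run periodically (for example from cron) on each worker. The lazy unmount is assumed to be enough for the automounter to remount the repository on next access.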