---+ Local Grid Journal 2012

This is a reverse order diary of events, without retrospective editing (so keep it raw and short, max ~ 3 lines). See other pages like LocalGridInternals for more carefully considered documentation.

| 20121207 | MWS/LSL | BB2 nodes system installed using their 1Gbit interface, using normal PXE, after some trial and error. Look at supporting their 10GbE interface, ideally also with PXE. |
| 20121206 | LSL/MWS | Physics power failed at approx 2am. All back up before 10am. |
| 20121130 | LSL | Another failed C6145 drive, beeping and flashing amber, this time on epgf08, reported to Dell yesterday, fitted replacement today. |
| 20121121 | LSL | On advice from Dell !SkyTech engineer, update C6145 fan control board (FCB) firmware on two chassis from [0117] to [0118]. Two were already at [0118]. |
| 20121106 | MWS/LSL | Campus power failed 6th November, finally fixed 23rd November, meanwhile on a reduced supply so grid is in downtime. |
| 20121019 | LSL | Notice beeping and see that C6145 system epgf07 slot 9 disk is flashing amber. Contact Dell Pro Support 0844 444 3844. New drive on Monday, pick up old on Tuesday. |
| 20120928 | MWS | Mark observes epgd19 had problems and is now showing just 12GB of its 16GB memory in /proc/meminfo. Marked offline. |
| 20120924 | MWS/LSL | Mark is testing EMI2 upgrade to epgse1 on a VM and later on bare machine - causes difficulties, later reverts to old SE after a number of days of testing. |
| 20120924 | MWS/LSL | Mark temporarily offlines epgu1n046 for Aslam and BB2 installers to test a BB2 10GbE interface on our Force10 switch. |
| 20120919 | MWS/LSL | Mark adds some tcp settings in sysctl.conf on storage pool nodes to see effect on transfer speeds. Lawrie suggests a future refinement would be to tweak txqueuelen to 10000. |
| 20120918 | LSL/MWS | Do some traceroutes to BNL as additional info for very slow transfer, add to GGUS ticket 86105. |
| 20120806 | LSL | Noticed that over the last couple of days, epgu1n021 had two syslogged errors: EDAC k8 MC1: extended error code: ECC chipkill x4 error. |
| 20120802 | LSL | After successive media write errors on f17 over the last weeks, replacing 3 drives, one by one; about 6 hours to do a rebuild for each one. RAID status remains Good. |
| 20120730 | LSL | Noticed that epgf05 had 24 July syslog kern.info entry: Northbridge Error (node 4): ECC Error in the Probe Filter directory; also 27 July ditto node 0. Nothing in ipmitool sel list. |
| 20120716 | LSL | Upgrade epgf03 04 07 BIOS to 2.6.0, and BMC to 1.08, like the others. Now all complete. |
| 20120713 | LSL | Benchmarks finished on epgf01 02 05 06. All returned to GRID use. No IPMI temperature events now (or over the weekend). |
| 20120711 | LSL | Upgrade epgf08 BIOS from 2.4.0 to 2.6.0, and BMC from 1.04 to 1.08. Same for epgf01 and epgf02. epgf08 returned to use. Still to do: 03 04 07. |
| 20120710 | LSL/MWS | epgf08 turns itself off at 10:03. Log file /var/log/acpid entries BEGIN/END HANDLER MESSAGES implicate some power event or other at that time. |
| 20120703 | MWS | Mark notices that epgf01 and epgf02 are very hot and have critical messages in BMC/ipmi log. Turns them off. |
| 20120620 | LSL | Updated epgf05 and epgf06 BIOS from 2.4.0 to 2.6.0, and BMC from 1.04 to 1.08. But their fans are still running at half the speed of other units. |
| 20120620 | LSL | Install replacement disks on f15 and f16: now back to fully redundant. |
| 20120620 | MWS/LSL | All services up and working on the new BHAM-1 network. |
| 20120619 | LSL/MWS | Installation of 14kW Mitsubishi complete and working by about 4pm. DNS updated by network team to new BHAM-1 subnet. Start reconfiguring for new network. |
| 20120618 | LSL/MWS | In grid downtime: computer room is polythene-curtained-off for new 14kW Mitsubishi aircon unit to be installed, today and tomorrow. |
| 20120615 | LSL | Observe that BHAM-1 subnet test DNS definitions are in place, will try to speak to Nick on Monday when he's in. |
| 20120608 | LSL/MWS | Status is epgd* on, epgu* on, epgf01,02,07,08 on. epgf05,06 are off for engineer. epgf03,04 are off because aircon is at 75% capacity. |
| 20120608 | LSL | Ran Intel's bootutil on epgf03,04,05,06 to configure the 10GbE interfaces to do PXE boot. See below. |
| 20120607 | LSL | Ran Intel's bootutil on epgf01,02,07,08 to configure the 10GbE interfaces to do PXE boot. See DellC6145Init page for more details. |
| 20120606 | LSL | epgf05 and epgf06 are over-heating: flashing green/amber power indicator, and lots of entries in ipmi/BMC log. Maybe lack of fan-speed. Mark is draining these. |
| 20120606 | LSL | On epgf03-06, fix ifcfg-eth0 file so that it will pick up IP of 10G interface via DHCP. epgf05-06 are using this now for all jobs; epgf03-04 will be restarted soon. |
| 20120601 | LSL/MWS | Big switch-over for all grid services (except epgmo1) to use S4810 switches. Required adding Intel X520-DA interfaces on the pool nodes epgsr*, which are now at 10G rather than 4x1G. BlueBEAR workers overall bandwidth thereby improved 10-fold. Workers epgf* still using 1G: will be fixed asap. |
| 20120529 | LSL | Braywhite fit new Mitsubishi aircon PKA-RP100KAL on left wall. Minimal drilling and no welding or soldering so disruption was minimal, it turned out. Still 1 aircon down, 3 working. |
| 20120524 | LSL | Try without portable aircon, with just epgf03,04 and epgd01,24 on, plus BB of course. Temp 22 deg (sunny), 22.5 @ 14:30, 22.7 at 16:45. |
| 20120523 | LSL/MWS | Mark has put kickstart in place for BB nodes, Phil in Elms Road has rebooted them to get system. I've put final 3 gig inserts in place and cabled them separately. BB workers epgu1n001-047 are available. Overall bandwidth will still be limited to 1G until big switch-over for grid services (next week). |
| 20120523 | LSL | Portable aircon heat-dump to corridor during day. Temperature 18 deg but can't online PP workers because many jobs are longer than working day. Rang maintainers to request they ensure a quote for compressor is with Estates by tomorrow. |
| 20120522 | LSL | Aircon unit A compressor is over-current, probably internal flow problem, say Integral. They provide two portable aircon units but no suitable heat dump area overnight. |
| 20120521 | LSL | Reported aircon not coping with heat-wave. Maintenance informed. Temp goes to 35 deg so Mark starts shutting workers. |
| 20120521 | LSL | S4810 switch #5 installed in BlueBEAR rack, supports u1n001 up, and 10 Gb/s fibre link works after T/R swap. Units picking up Mark's dhcp updates. |
| 20120516 | LSL/MWS | C6145 unit epgf05 + epgf06 now in production. |
| 20120509 | LSL | Dell engineer replaced faulty 4GB memory card on epgf02 (!JCLB95J). Benchmark HS06 subsequently ran to completion (five hours) without problems. |
| 20120508 | LSL/MWS | Mark starts to use a nominal 10 HEP-SPEC06 (2500 SI2k) in information system for cluster, together with a $cputmult and $wallmult in mom_priv/config for our different processors. |
| 20120508 | LSL/MWS | C6145 unit epgf03 + epgf04 now in production. |
| 20120504 | LSL | Mark has disabled BB CE so last grid job to run via BB Torque/MOAB finished on 2nd May 2012. |
| 20120504 | LSL | Re-install epgf03 after benchmarks. epgf04 not responsive so re-install: now ok, so run 48-core benchmark as burn-in/evaluate. |
| 20120503 | LSL | epgf02 has Northbridge memory errors. Checked with a SL6 system as well. Engineer called: will visit Tuesday. Running benchmark on epgf03 as burn-in/evaluate. |
| 20120503 | MWS/LSL | Install SL5.8 on epgf03 using kickstart with XFS root filesystem, after several iterations in the %pre section. |
| 20120426 | LSL | Required disks for c6145 units have arrived so fit them and start formatting them as RAID. |
| 20120426 | LSL | Discussion with PSH of BB about son of BB cluster: due Aug/Sept 2012. Will need to reconfigure gridPP to low BB nodes first, because of phased replacement programme. |
| 20120418 | LSL | CVMFS on BB cluster tends to get "Transport endpoint is not connected" more often than PP cluster so detect the problem and lazy-umount it if detected. Seems to work. Hunch that the latest cvmfs version 2.0.13 would include a fix for this problem proves incorrect in practice. |
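The 20120418 entry describes detecting a CVMFS mount that reports "Transport endpoint is not connected" and lazy-unmounting it. The journal does not record how the check was actually implemented, so the following is only a minimal sketch of one way to do it, assuming the symptom shows up as an ENOTCONN error from stat(). The function names and the repository path in the usage note are hypothetical.

```python
import errno
import os
import subprocess


def is_stale_mount(path, stat_fn=os.stat):
    """Return True if stat() on path fails with ENOTCONN,
    i.e. 'Transport endpoint is not connected'."""
    try:
        stat_fn(path)
    except OSError as exc:
        return exc.errno == errno.ENOTCONN
    return False


def lazy_umount_command(path):
    """Build the lazy-unmount command ('umount -l') for path."""
    return ["umount", "-l", path]


def check_and_unmount(path, stat_fn=os.stat, run=subprocess.call):
    """Lazy-unmount path if it looks stale; return True if an unmount was attempted."""
    if is_stale_mount(path, stat_fn):
        run(lazy_umount_command(path))
        return True
    return False
```

A periodic job on each worker could call `check_and_unmount("/cvmfs/atlas.cern.ch")` as root; that path is just an example repository, not one named in the journal. The `stat_fn` and `run` parameters exist only so the logic can be exercised without a real broken mount.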
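The 20120508 entry mentions publishing a nominal 10 HEP-SPEC06 per core and scaling each processor type with $cputmult and $wallmult in mom_priv/config. A rough sketch of the arithmetic follows, assuming the usual convention of 250 SI2k per HS06 and assuming both multipliers are set to the same ratio of measured to nominal HS06 per core. The journal does not give the real per-core figures, so the values in the usage note are made up.

```python
SI2K_PER_HS06 = 250    # WLCG convention: 4 HS06 = 1 kSI2k
NOMINAL_HS06 = 10.0    # nominal per-core value published for the cluster


def scaling_factor(hs06_per_core, nominal=NOMINAL_HS06):
    """Ratio of a node type's measured HS06 per core to the nominal value."""
    return hs06_per_core / nominal


def mom_config_lines(hs06_per_core):
    """Format the two multiplier lines for a node's mom_priv/config.
    Assumption: CPU time and wall time use the same factor."""
    factor = scaling_factor(hs06_per_core)
    return ["$cputmult %.2f" % factor, "$wallmult %.2f" % factor]
```

For a hypothetical node type measured at 12.5 HS06 per core, `mom_config_lines(12.5)` gives a factor of 1.25 for both lines, and `NOMINAL_HS06 * SI2K_PER_HS06` gives the 2500 SI2k figure quoted for the nominal value.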
Topic revision: r27 - 08 Jan 2013 - /C=UK/O=eScience/OU=Birmingham/L=ParticlePhysics/CN=lawrence lowe