(08 Jan 2013, /C=UK/O=eScience/OU=Birmingham/L=ParticlePhysics/CN=lawrence lowe)
---+ Local Grid Journal 2012

This is a reverse order diary of events, without retrospective editing (so keep it raw and short, max ~ 3 lines). See other pages like LocalGridInternals for more carefully considered documentation.

| 20121207 | MWS/LSL | BB2 nodes system installed using their 1Gbit interface, using normal PXE, after some trial and error. Look at supporting their 10GbE interface, ideally also with PXE. |
| 20121206 | LSL/MWS | Physics power failed at approx 2am. All back up before 10am. |
| 20121130 | LSL | Another failed C6145 drive, beeping and flashing amber, this time on epgf08, reported to Dell yesterday, fitted replacement today. |
| 20121121 | LSL | On advice from Dell !SkyTech engineer, update C6145 fan control board (FCB) firmware on two chassis from [0117] to [0118]. Two were already at [0118]. |
| 20121106 | MWS/LSL | Campus power failed 6th November, finally fixed 23rd November, meanwhile on a reduced supply so grid is in downtime. |
| 20121019 | LSL | Notice beeping and see that C6145 system epgf07 slot 9 disk is flashing amber. Contact Dell Pro Support 0844 444 3844. New drive on Monday, pick up old on Tuesday. |
| 20120928 | MWS | Mark observes epgd19 had problems and is now showing just 12GB of its 16GB memory in /proc/meminfo. Marked offline. |
| 20120924 | MWS/LSL | Mark is testing EMI2 upgrade to epgse1 on a VM and later on bare machine - causes difficulties, later reverts to old SE after number of days testing. |
| 20120924 | MWS/LSL | Mark temporarily offlines epgu1n046 for Aslam and BB2 installers to test a BB2 10GbE interface on our Force10 switch. |
| 20120919 | MWS/LSL | Mark adds some tcp settings in sysctl.conf on storage pool nodes to see effect on transfer speeds. Lawrie suggests a future refinement would be to tweak txqueuelen to 10000. |
| 20120918 | LSL/MWS | Do some traceroutes to BNL as additional info for very slow transfer, add to GGUS ticket 86105. |
| 20120806 | LSL | Noticed that over the last couple of days, epgu1n021 had two syslogged errors: EDAC k8 MC1: extended error code: ECC chipkill x4 error. |
| 20120802 | LSL | After successive media write errors on f17 over the last weeks, replacing 3 drives, one by one; about 6 hours to do a rebuild for each one. RAID status remains Good. |
| 20120730 | LSL | Noticed that epgf05 had 24 July syslog kern.info entry: Northbridge Error (node 4): ECC Error in the Probe Filter directory; also 27 July ditto node 0. Nothing in ipmitool sel list. |
| 20120716 | LSL | Upgrade epgf03 04 07 BIOS to 2.6.0, and BMC to 1.08, like the others. Now all complete. |
| 20120713 | LSL | Benchmarks finished on epgf01 02 05 06. All returned to GRID use. No IPMI temperature events now (or over the weekend). |
| 20120711 | LSL | Upgrade epgf08 BIOS from 2.4.0 to 2.6.0, and BMC from 1.04 to 1.08. Same for epgf01 and epgf02. epgf08 returned to use. Still to do: 03 04 07. |
| 20120710 | LSL/MWS | epgf08 turns itself off at 10:03. Log file /var/log/acpid entries BEGIN/END HANDLER MESSAGES implicates some power event or other at that time. |
| 20120703 | MWS | Mark notices that epgf01 and epgf02 are very hot and have critical messages in BMC/ipmi log. Turns them off. |
| 20120620 | LSL | Updated epgf05 and epgf06 BIOS from 2.4.0 to 2.6.0, and BMC from 1.04 to 1.08. But their fans are still running at half the speed of other units. |
| 20120620 | LSL | Install replacement disks on f15 and f16: now back to fully redundant. |
| 20120620 | MWS/LSL | All services up and working on the new BHAM-1 network. |
| 20120619 | LSL/MWS | Installation of 14kW Mitsubishi complete and working by about 4pm. DNS updated by network team to new BHAM-1 subnet. Start reconfiguring for new network. |
| 20120618 | LSL/MWS | In grid downtime: computer room is polythene-curtained-off for new 14kW Mitsubishi aircon unit to be installed, today and tomorrow. |
| 20120615 | LSL | Observe that BHAM-1 subnet test DNS definitions are in place, will try to speak to Nick on Monday when he's in. |
| 20120608 | LSL/MWS | Status is epgd* on, epgu* on, epgf01,02,07,08 on. epgf05,06 are off for engineer. epgf03,04 are off because aircon is at 75% capacity. |
| 20120608 | LSL | Ran Intel's bootutil on epgf03,04,05,06 to configure the 10GbE interfaces to do PXE boot. See below. |
| 20120607 | LSL | Ran Intel's bootutil on epgf01,02,07,08 to configure the 10GbE interfaces to do PXE boot. See DellC6145Init page for more details. |
| 20120606 | LSL | epgf05 and epgf06 are over-heating: flashing green/amber power indicator, and lots of entries in ipmi/BMC log. Maybe lack of fan-speed. Mark is draining these. |
| 20120606 | LSL | On epgf03-06, fix ifcfg-eth0 file so that it will pick up IP of 10G interface via DHCP. epgf05-06 are using this now for all jobs; epgf03-04 will be restarted soon. |
| 20120601 | LSL/MWS | Big switch-over for all grid services (except epgmo1) to use S4810 switches. Required adding Intel X520-DA interfaces on the pool nodes epgsr*, so which now are at 10G rather than 4x1G. BlueBEAR workers overall bandwidth thereby improved 10-fold. Workers epgf* still using 1G: will be fixed asap. |
| 20120529 | LSL | Braywhite fit new Mitsubishi aircon PKA-RP100KAL on left wall. Minimal drilling and no welding or soldering so disruption was minimal, it turned out. Still 1 aircon down, 3 working. |
| 20120524 | LSL | Try without portable aircon, with just epgf03,04 and epgd01,24 on, plus BB of course. Temp 22 deg (sunny), 22.5 @ 14:30, 22.7 at 16:45. |
| 20120523 | LSL/MWS | Mark has put kickstart in place for BB nodes, Phil in Elms Road has rebooted them to get system. I've put final 3 gig inserts in place and cabled them separately. BB workers epgu1n001-047 are available. Overall bandwidth will still be limited to 1G until big switch-over for grid services (next week). |
| 20120523 | LSL | Portable aircon heat-dump to corridor during day. Temperature 18 deg but can't online PP workers because many jobs are longer than working day. Rang maintainers to request they ensure a quote for compressor is with Estates by tomorrow. |
| 20120522 | LSL | Aircon unit A compressor is over-current, probably internal flow problem, say Integral. They provide two portable aircon units but no suitable heat dump area overnight. |
| 20120521 | LSL | Reported aircon not coping with heat-wave. Maintenance informed. Temp goes to 35 deg so Mark starts shutting workers. |
| 20120521 | LSL | S4810 switch #5 installed in BlueBEAR rack, supports u1n001 up, and 10 Gb/s fibre link works after T/R swap. Units picking up Mark's dhcp updates. |
| 20120516 | LSL/MWS | C6145 unit epgf05 + epgf06 now in production. |
| 20120509 | LSL | Dell engineer replaced faulty 4GB memory card on epgf02 (!JCLB95J). Benchmark HS06 subsequently ran to completion (five hours) without problems. |
| 20120508 | LSL/MWS | Mark starts to use a nominal 10 HEP-SPEC06 (2500 kSI2k) in information system for cluster, together with a $cputmult and $wallmult in mom_priv/config for our different processors. |
| 20120508 | LSL/MWS | C6145 unit epgf03 + epgf04 now in production. |
| 20120504 | LSL | Mark has disabled BB CE so last grid job to run via BB Torque/MOAB finished on 2nd May 2012. |
| 20120504 | LSL | Re-install epgf03 after benchmarks. epgf04 not responsive so re-install: now ok, so run 48-core benchmark as burn-in/evaluate. |
| 20120503 | LSL | epgf02 has Northbridge memory errors. Checked with a SL6 system as well. Engineer called: will visit Tuesday. Running benchmark on epgf03 as burn-in/evaluate. |
| 20120503 | MWS/LSL | Install SL5.8 on epgf03 using kickstart with XFS root filesystem, after several iterations in the %pre section. |
| 20120426 | LSL | Required disks for c6145 units have arrived so fit them and start formatting them as RAID. |
| 20120426 | LSL | Discussion with PSH of BB about son of BB cluster: due Aug/Sept 2012. Will need to reconfigure gridPP to low BB nodes first, because of phased replacement programme. |
| 20120418 | LSL | CVMFS on BB cluster tends to get "Transport endpoint is not connected" more often than PP cluster so detect the problem and lazy-umount it if detected. Seems to work. Hunch that the latest cvmfs version 2.0.13 would include a fix for this problem proves incorrect in practice. |
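
The 20120418 entry describes detecting the CVMFS "Transport endpoint is not connected" condition and lazy-unmounting the affected mount. The journal does not record how this was actually implemented, so the following is only a rough sketch under assumptions: it assumes the condition shows up as errno ENOTCONN when the mount point is stat'ed, and the function names and the injectable =stat= / =run= parameters are hypothetical, introduced here for illustration and testing.

```python
import errno
import os
import subprocess


def is_endpoint_disconnected(path, stat=os.stat):
    """Return True if stat(path) fails with ENOTCONN.

    ENOTCONN is the errno behind the message
    "Transport endpoint is not connected".
    """
    try:
        stat(path)
    except OSError as exc:
        return exc.errno == errno.ENOTCONN
    return False


def lazy_umount_if_stale(path, stat=os.stat, run=subprocess.call):
    """Lazy-umount path if it looks disconnected; return True if we tried.

    `stat` and `run` are injectable (hypothetical parameters) so the
    logic can be exercised without a real mount or root privileges.
    """
    if not is_endpoint_disconnected(path, stat):
        return False
    # "umount -l" detaches the mount now and cleans up once it is idle.
    run(["umount", "-l", path])
    return True
```

A periodic job could call =lazy_umount_if_stale= on each mount point of interest (run as root, since =umount= needs it); the next access would then normally trigger a fresh mount if the repository is automounted, which is an assumption about the local setup.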
Topic revision: r27 - 08 Jan 2013 - /C=UK/O=eScience/OU=Birmingham/L=ParticlePhysics/CN=lawrence lowe
Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.