Preparing Dell C6145 units for first use

Ordering

I ordered our Dell C6145 units using the usual Dell configurator screens, having chosen the two-motherboard version ("2MB") of the unit. It was clear from the configurator screens that everything had to be ordered as a quantity per motherboard: RAM, number of drives, and so on. This was reinforced by the fact that items such as disks and RAM were twice as expensive as on other Dell products, because of the two motherboards. So I configured 9 of the 2.5" 300GB 10krpm SAS drives per motherboard. Unfortunately, Dell delivered each unit with just 5 drives per motherboard, so I had to go back to them to have the remaining drives delivered. Moral: insist that you get (and then check) an order acknowledgement.

Firmware levels

Firmware levels supplied were: VBIOS 0.90.7a, Dell BIOS 2.4.0, BMC 1.04 or 1.06, FCB [00117] or [0118], MegaRAID 2.120.63-1242. The BIOS and BMC were later upgraded on advice from Dell support, to BIOS 2.6.0 and BMC 1.08, to fix overheating issues. The FCB (Fan Control Board) firmware was upgraded to [0118] on the two machines that didn't already have it, on the advice of Sky-Tech engineer Stuart. See later section.

Setting up the disk drives as RAID

The 9 drives per motherboard were to be configured as RAID 5. This number of 2.5" drives is the maximum per motherboard that the C6145 can take when used with the 115 watt processors I had ordered. I had the LSI 9260-8i SAS/SATA card in the configuration. The purpose of RAIDing was to improve the performance of the disk system, given the large number of cores per motherboard (48). While a single disk may give our jobs acceptable performance on a 4-core or 6-core system, we need striping to get the same performance with this many cores. As we wanted 50GB of disk space per core (an ATLAS requirement), this worked out nicely: RAID 5 over nine 300GB drives gives eight drives' worth of usable capacity, 8 x 300GB = 2.4 TB, i.e. 50GB for each of the 48 cores. The C6145 system with 2.5" drives and the LSI 9260-8i comes with MegaRAID support.

A complication was that 2.4 TB is beyond the capability of a conventional DOS partition table, as would be confirmed during the Scientific Linux 5 installation, so GPT partitioning has to be used (if any partitioning is to be used) on an area exceeding 2 TiB. However, GPT partitions are supported neither by a conventional BIOS boot nor by SL5's GRUB, so the /boot area of the normal Linux GRUB setup has to live on a separate, smaller area, not part of the 2.4 TB. It would have been excessive to devote a whole physical disk to the boot area, and a bit clunky to use a USB memory stick as the boot device, so we wanted to split up the RAID set. In MegaRAID, a RAID set can be divided into so-called virtual disks. (On another RAID system, Infortrend, the equivalent nomenclature is partitions, not to be confused with DOS/GPT partitions.) So the technique had to be to create (at least) two virtual disks, with the size of the first one suitable for a DOS partition table: the chosen size was 1 GiB, which is more than enough for a /boot area.

The C6145 in the 2.5" setup comes with LSI MegaRAID ability. On boot, I typed Ctl/H to bring up the WebBIOS utility (nothing to do with the world wide web), which supports a GUI method of configuring the RAID.

In the MegaRAID initial screen, I chose Configuration Wizard and used Manual configuration (as Automatic configuration would have given me RAID 6). I used a Control mouse-click to select the drives to use (all of them, in turn), clicked Add to Array, and on the right-hand side clicked Accept DG and then Next.

In the SPAN screen, I then clicked Add to SPAN, and Next. The space available was shown as 2.178 TB. In MegaRAID, GB and TB appear to mean the binary-based units GiB and TiB.

In the Virtual Drive screen, I chose RAID level 5, kept the Stripe Size as 64 kB, selected Write Back with BBU (later changed to Always Writeback), gave a Select Size of 1 GB, and clicked Accept. This gave me the first virtual drive.

I then clicked Back, Add to SPAN (again!), and Next. In the Virtual Drive screen again, RAID level 5 was now the only option for this RAID set, so again I selected the appropriate write-back option, set the Select Size to 2.177 TB (the available space now indicated), and clicked Accept. This gave me the second virtual drive. Then I clicked Next.

In the Preview screen, it showed VD0 and VD1 as the two virtual drives, with the required sizes. I said Yes to Save Configuration and to Initialize. In the next screen, for each virtual device, I chose Slow Initialization and clicked Go so that it was done immediately (it took about 45 minutes for the large virtual disk). When these were ready to use, I exited the MegaRAID application and rebooted the machine.

This has of course to be done for each motherboard of the C6145 unit. It's worth noting also that the MegaRAID firmware uses slot 0 upwards for counting drives, whereas the engraved numbers on the top of the C6145 chassis run from 1 to 24. This could cause confusion, say when being asked to replace a failing drive. Once the unit is installed in a rack with other machines so that those engraved numbers are hidden, that's less of an issue!
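
Once an operating system is running (later, after installation), the same information can also be inspected from Linux with LSI's MegaCli utility, if it is installed (it isn't part of a standard SL5 install, and the binary may be named MegaCli or MegaCli64 depending on the package). A sketch:

MegaCli64 -LDInfo -Lall -aAll     # list the virtual drives: should show VD0 (1 GiB) and VD1 (2.177 TiB)
MegaCli64 -PDList -aAll           # list the physical drives; note the Slot Number field counts from 0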

BBU and associated settings

The battery backup unit (BBU) of the MegaRAID (LSI 9260-8i) was the source of some warnings at boot, when I had initially selected Writeback with BBU, complaining variously that the battery was absent, charging, or calibrating, and not always fully charged as one would hope. Since Write-back mode is said to perform much better than Write-through mode, I thought it more important that the system should perform well at all times, rather than under-perform for some indeterminate time after boot. So I re-entered the MegaRAID utility, chose Virtual disks on the left, chose each virtual disk in turn and then Properties and Go, and changed the setting to Always Writeback. In the unlikely event that a power failure happens while the BBU is re-charging, causes data loss, and the file-system is subsequently unable to recover, the unit's operating system will have to be re-installed; but no user data is kept on this RAID, so that's not a cause for concern, and re-installation is a straightforward, semi-automatic business with our setups.

Done for all motherboards.
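
The BBU state and the cache policy can also be checked from Linux with MegaCli, if installed, rather than rebooting into WebBIOS; a sketch (the last command corresponds to the Always Writeback choice, with the same data-loss caveat as above):

MegaCli64 -AdpBbuCmd -GetBbuStatus -aAll        # BBU charge state, and whether it is charging or relearning
MegaCli64 -LDGetProp -Cache -LAll -aAll         # current cache policy of each virtual drive
MegaCli64 -LDSetProp CachedBadBBU -LAll -aAll   # keep write-back caching even when the BBU is not ready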

Some problems starting the WebBIOS utility

The WebBIOS utility didn't always start after a Ctl/H, in particular on the first invocation after the drive configuration had changed (because of the under-supply of drives). Ideally the WebBIOS utility takes no longer than 20 seconds to start after the Ctl/H. In my case, the monitor output would stay with the cursor flashing in the top left-hand corner of the screen. The machine might then reboot. One solution was to start the system up with the drives popped out, as I was going to re-initialise them anyway, and then push them in after the initial WebBIOS / MegaRAID screen had appeared. This worked and I could then initialise the RAID.

I had further problems when trying to re-visit the WebBIOS later, on two machines: the cursor would stay flashing in the top left-hand corner. On both occasions the problem disappeared after I used Ctl/Y rather than Ctl/H, then immediately quit the LSI CLI mode, and rebooted. This might be a coincidence of course.

I had trouble with the same GUI when using a PS/2 mouse via a USB-PS/2 "Y" adapter: the cursor stayed at the left edge of the screen. I didn't have that trouble with a USB mouse.

Installing and setting up partitions

I then booted into an SL5.7 installation system. At the point where the anaconda GUI has started, I switched to the console text mode using Ctl/Alt/F2, and configured the partitions initially by hand. This was later incorporated into a kickstart %pre section. The default in SL5 is that partitions start at sector 63, which is not aligned with the RAID's data stripe, so performance is said to be potentially halved. This is discussed briefly in this forum discussion of stripe/partition alignment when using the XFS filesystem, and also here for ext4. The following setup was used in the %pre section of our kickstart:

parted /dev/sda mklabel msdos                  # small 1 GiB virtual disk: DOS partition table
parted /dev/sda mkpart p 2048s 204799s         # ~99 MiB partition for diagnostics
parted /dev/sda mkpart p 204800s 716799s       # 250 MiB partition for /boot

parted /dev/sdb mklabel gpt                    # large 2.177 TiB virtual disk: GPT partition table
parted /dev/sdb mkpart p 2048s 201326591s      # ~96 GiB partition for swap
parted /dev/sdb mkpart p 201326592s 100%       # the rest for the root filesystem

So here I've made two partitions on the small /dev/sda, for diagnostics and /boot (leaving some spare space), and another two on the large /dev/sdb, for swap and for the root filesystem. This is just an example. I've specified sizes in sectors because this version of parted (1.8.1) is not alignment-aware, and more recent versions still do not seem alignment-friendly when using, say, GiB units for start and end. Using stripe-alignment for a low-activity partition like /boot is maybe over-zealous!

Using a multiple of 2048 sectors for the start of partitions assumes a full data stripe width of 1 MiB (for example, for 8 data-drives and a 128 kiB stripe-size), but is also fine if, like me, you have 8 data-drives and used the MegaRAID default stripe-size of 64kiB.
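
A quick way to confirm afterwards that a partition start really is stripe-aligned is to read its start sector from sysfs and check the remainder; a small sketch, assuming /dev/sdb2 and 512-byte sectors:

START=$(cat /sys/block/sdb/sdb2/start)   # start sector of /dev/sdb2
# full stripe: 8 data-drives x 64 KiB = 512 KiB = 1024 sectors of 512 bytes
echo $((START % 1024))                   # 0 means the partition start is stripe-aligned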

Making the filesystem

In order to stripe optimally, the filesystem needs to be informed of the stripe width that underlies the virtual disk. For an ext3 system, this can be done with a mkfs.ext3 or tune2fs command using the option -E stride=16,stripe-width=128, based on our RAID's parameters (64 KiB stripe, 8 data-drives) and the ext3 block size of 4 KiB. If you are using RHEL5 / SL5, tune2fs doesn't support -E, so it's harder to do this after install, and mkfs.ext3 doesn't support the stripe-width option, but mkfs.ext4 does. Also mkfs.xfs is available in the installation system. If using kickstart, the mkfs command with those options could be run in a %pre section of the kickstart file, with --noformat specified on the corresponding part directive in the kickstart body. If not using kickstart, the format can be done from the Ctl/Alt/F2 shell window. So for our example RAID, we could do one of the following (but for SL5, performance is poor for ext4: see section below):

mkfs.ext3 -m 1 -E stride=16 /dev/sdb2   # RHEL5/SL5 restriction, doesn't support stripe-width
mkfs.ext4 -m 1 -E stride=16,stripe-width=128 /dev/sdb2
mkfs.xfs -f -d su=64k,sw=8 /dev/sdb2
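
To make the arithmetic behind those options explicit, a small shell sketch (the numbers are the ones used above; the echoed command is for illustration only):

STRIPE_KIB=64                            # MegaRAID stripe size per drive
BLOCK_KIB=4                              # ext3/ext4 filesystem block size
DATA_DRIVES=8                            # 9 drives in RAID 5 leave 8 data-drives
STRIDE=$((STRIPE_KIB / BLOCK_KIB))       # 64 / 4 = 16
STRIPE_WIDTH=$((STRIDE * DATA_DRIVES))   # 16 * 8 = 128
echo "mkfs.ext4 -m 1 -E stride=$STRIDE,stripe-width=$STRIPE_WIDTH /dev/sdb2"
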
It's worth noting for the future that using XFS for the root partition is not allowed in SL6.2 interactive anaconda (or RHEL6 or CentOS 6.2, it seems), whether by pre-formatting or by selecting it in the GUI. It remains to be seen whether it's allowed when using kickstart. If it isn't, there's a choice of using ext4 if it performs well, or of changing the sdb partitioning so that at least some parts of the file-tree can use XFS. But hopefully this unjustified oddity will be fixed by the time we need SL6.

Initial read performance timings of the RAID

With an SL 5.8 system loaded, I used hdparm -t /dev/sdb as a quick test of the performance of the RAID. This gave a read performance of between 560 and 595 MBytes/second. I then used blockdev --setra 4096 /dev/sdb to increase the read-ahead to 2 MiB. The hdparm test then gave 950 to 960 MBytes/second. This is acceptable and justifies our use of RAID in order to get reasonable throughput with this many-core system. When I get time, I'll repeat the exercise using a proper I/O benchmark, to get read and write performance.
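
For reference, the read-ahead value is given in 512-byte sectors and is not persistent across reboots, so it needs re-applying at boot time (for example from rc.local); a sketch:

blockdev --getra /dev/sdb        # current read-ahead in 512-byte sectors (typically 256, i.e. 128 KiB)
blockdev --setra 4096 /dev/sdb   # 4096 sectors = 2 MiB read-ahead
hdparm -t /dev/sdb               # quick sequential-read re-test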

For comparison with single non-RAIDed disks on another Dell system (R410), the same hdparm test gave me 100-105 MByte/second for a 7.2krpm SATA disk, and 150-165 MBytes/second for a 15krpm SAS disk. This wasn't improved by setting a higher read-ahead.

Initial write performance timings of the RAID

With an XFS file system formatted using mkfs.xfs -f -d su=64k,sw=8 /dev/sdb2, the command time dd if=/dev/zero of=/root/bigfile1 bs=$((1024*1024)) count=$((200*1024)) conv=sync creating a 200GiB file took 4m59s and dd reported the speed as 737 MB/s.

Further XFS results: 5m03s 707 MB/s; 5m05s, 705 MB/s. A proper benchmark needs to be done at some point.

With an ext4 file system formatted using mkfs.ext4 -E stride=16,stripe-width=128 /dev/sdb2, the same dd command as before creating a 200 GiB file took 7m15s and reported a speed of 494 MB/s; with a nodelalloc option on mount (which helps with later OSes), it took 9m33s and dd reported a speed of 375 MB/s! The RAID activity lights showed a cycle of 3 seconds on / 3 seconds off, which is something I've noted elsewhere with ext4. This is more a comment on the vagaries of early ext4 in SL5 than on the C6145: it seems that for large files at least, there's something odd about ext4 in the SL5 release. This was with the SL5.8 release updated to the latest kernel as of 2012-05-03. We shall therefore be using XFS with this RAID!

PXE booting from Intel X520 DA network adapter

I had ordered the C6145 machines with an Intel X520-DA card to provide two 10GbE ports, in addition to the built-in dual 1GbE ports. These cards do not come PXE-boot enabled. We normally use PXE booting, to allow a centralised choice of re-installation or boot from hard disk, so this needed to be enabled. Otherwise we would have had to PXE boot from a 1GbE interface and then use a 10GbE interface once the system was up, which would have worked but would not have been as straightforward as using one interface.

Intel provide a BootUtil package to do this job: a PREBOOT.EXE zipped-up executable can be downloaded, currently at http://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=19186. This executable can be unzipped using an unzip PREBOOT.EXE command under Linux, or by running it as a command under Windows. The APPS/BootUtil directory of this expansion contains what we need. There are versions there of BootUtil for DOS, Linux 32 and 64 bit, and Windows 32 and 64 bit. The version I had downloaded was 1.3.27.0.

As configuring the card to enable PXE is a one-off for each card, not needed day-to-day, I just did the following steps:

  • Create a bootable DOS system on a small USB flash drive (for example, by this method)
  • Copy the APPS/BootUtil/DOS/BootUtil.exe from the above Intel files to this USB-based system
  • Insert this USB flash drive into each C6145 system in turn, and boot into this DOS system
  • Run BOOTUTIL -E to enumerate/list the interfaces. The 10GbE ports were listed as 3 and 4 for me.
  • Run BOOTUTIL -BOOTENABLE=PXE -NIC=3
  • Run BOOTUTIL -BOOTENABLE=PXE -NIC=4
  • Type Ctl/Alt/Delete to reboot
  • Observe that there are now four tries at PXE boot: two for 1GbE ports (version GE V.1.3.64), and two for 10GbE ports (version XE 2.1.50)
  • If preferred, not essential, use Ctl/S just after the Dell logo appears, to configure which ports now attempt PXE boot

Done for all C6145 systems.

The commands above were put into a c6145.bat file on the USB stick. The output of the last BOOTUTIL command was, just for the record:

Port  Network Address    Series   WOL   Flash Firmware           Version
====  =================  =======  ===   ==============           =======
1     00:xx:xx:xx:xx:5c  Gigabit  YES   FLASH Not Present    
2     00:xx:xx:xx:xx:5d  Gigabit  N/A   FLASH Not Present
3     90:e2:xx:xx:xx:18  10GbE    N/A   UEFI,PXE Enabled,iSCSI    2.1.50
4     90:e2:xx:xx:xx:19  10GbE    N/A   UEFI,PXE Enabled,iSCSI    2.1.50 
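
For completeness, the c6145.bat mentioned above amounted to little more than the enable commands already listed; something like:

rem c6145.bat - enable PXE on the Intel X520-DA 10GbE ports (NICs 3 and 4 as enumerated by -E)
BOOTUTIL -E
BOOTUTIL -BOOTENABLE=PXE -NIC=3
BOOTUTIL -BOOTENABLE=PXE -NIC=4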

Upgrading the BIOS and BMC firmware

On advice from Dell after some critical temperature and fan-speed issues, I updated the BIOS firmware. As with the Intel X520-DA update above, I found it quickest to do the update by booting from a small DOS system on a USB stick, prepared from my normal Linux system. So I downloaded a PEC6145BIOS206.exe file from Dell's web site, which is said to be the floppy file version, unzipped it, and put it on the already-mentioned DOS USB stick. For the BMC firmware, I similarly downloaded PEC6145BMC108.exe, unzipped it, and copied the SOCFlash/dos directory contents to the USB stick. Later (Nov 2012), on advice from Sky-Tech engineer Stuart, I downloaded FCB.00_118_A00_customer.exe, unzipped it, and copied the contents to the USB stick.

With these files on the USB-stick DOS system, it was a simple matter of booting from the USB stick on each C6145 node and updating the BIOS, rebooting, updating the BMC firmware, and rebooting into Linux.

These updates fixed the temperature / fan problems we were seeing on these C6145 units.

Physical measurements

Power requirements (230V supply):

Conditions                           | Turbo disabled  | Turbo enabled
                                     | amps  | watts   | amps  | watts
Powered off                          | 0.47  |   14    |       |
One node quiet, one off              | 2.06  |  462    |       |
Both nodes waiting for work          | 3.14  |  715    | 3.52  |  801
Both nodes HS06 unpacking            | 3.44  |  756    | 4.04  |  920
Both nodes HS06 compiling            | 3.62  |  829    | 4.08  |  930
Both nodes HS06 2x48core 64-bit run  | 5.92  | 1352    | 6.28  | 1420
Both nodes HS06 ditto                | 6.11  | 1388    | 6.41  | 1440
Both nodes HS06 ditto peak           | 6.25  | 1420    | 6.45  | 1466
Note that the quoted current and power are for the two redundant power sockets taken together. With these Opteron 6234 processors, Wikipedia says the available modes are 2.4 GHz (base), 2.7 GHz (full-load Turbo), and 3.0 GHz (half-load Turbo). Turbo mode was turned on by entering F2 setup, choosing Advanced and CPU configuration, and setting CPB (Turbo) to Auto in place of Disabled.
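
As a quick way of seeing whether the cores are sitting at the 2.4 GHz base clock or boosting, the kernel's view of the clock can be sampled (a sketch; the per-core values depend on the load at the instant of sampling and on the kernel's cpufreq support):

grep 'cpu MHz' /proc/cpuinfo | sort | uniq -c    # count of cores at each reported frequency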

Deployment

For deployment of these servers as part of our Grid cluster contribution, please look for another document.

Document first version LawrenceLowe - 26 Apr 2012
