Performance tests on 72 TB Infortrend ESDS RAID Storage

Author: L.S.Lowe. File: raidperf24. Original version: 20120723. This update: 20130104. Part of Guide to the Local System.

The RAID for this series of tests is an Infortrend EonStorDS ESDS S24S G2240, equipped with 24 Hitachi disks (3TB, SATA, 7.2k, HUA723030ALA640), and 2 GB of RAM buffer memory, to be set up with 2 RAIDsets of 12 disks each configured as RAID6, and so with the data equivalent of 10 disks each (30 TB of real data). The RAID stripe size will be kept at the factory default: 128 kiB. I have three of these currently, all to be deployed at the same time. For my earlier Infortrend RAIDs, see my local Guide (linked above).

The Infortrend brochure says that this ESDS S24S-G2240 is equipped for host connectivity with two 6Gb/s SAS 4x wide ports, without any host-out ports, and, for an expansion enclosure, one 6Gb/s SAS 4x wide port. I interpret this as meaning peak transfer rate of 24 Gbit/sec, that is 3 GByte/sec, per port. (There are alternative four host-port and dual controller versions).

The HBA card used to connect the RAID to the Dell R710 host was an LSI SAS 9201-16e full-height half-length card (part number LSI00276), which uses a PCI-Express x8 (generation 2) bus. PCI-Express G2 runs at 5 GT/sec per lane, which after 8b/10b encoding gives 500 MBytes/sec per lane, so a theoretical 4 GBytes/sec over the x8 link.

When re-deploying one of the RAIDs (f25) later, the HBA card used to connect the RAID to a Dell R710 host was an LSI SAS3801E low-profile half-length card (part number LSI00138). This operates a 4x wide port at a 3Gb/s link rate, so I understand this to be a peak transfer rate of 12 Gbit/sec, that is 1.5 GByte/sec, per port. It also uses PCI-Express x8 G2. The RAID performance with this lower-spec card was found to be equally good (see results below), so the card wasn't a limiting factor in read or write performance.

One of the Infortrend units had been delivered with firmware level 3.86C.09, and two with 3.86C.11.

Firmware functional performance - version 3.86C.09

Initialization time tests

Two logical drives were created: RAID6, 12 disks each. This was done serially; creation took just a few seconds, and then initialization took under 10 minutes to reach 1% completion, and took 8h04m and 8h02m respectively to reach 100% completion, in the default online mode. No other activity was done during this time: host SAS cables were not connected during initialization. Confirmed later at 7h58m.

Clone time tests

While initialization is generally a one-off operation, cloning a drive or rebuilding the array can be needed a number of times during the life-time of a RAID.

A clone operation of an individual disk drive can be done if it's failing, and if you have a spare drive in the unit to copy it to. To do 5% took around 16 minutes, and 45% took 2h35m, so that's about 5h44m for the complete copy. This works out as around 145 MBytes/second, close to the 157 MBytes/sec official individual sustained transfer speed of these particular Hitachi drives.
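The extrapolation above is easily reproduced; a small shell sketch, using only the figures measured above (45% of a 3 TB drive cloned in 2h35m):

```shell
#!/bin/sh
# Extrapolate the full clone time and the data rate from partial progress.
# Measured above: 45% of a 3 TB (3e12 byte) drive cloned in 155 minutes.
awk 'BEGIN {
  pct = 45; mins = 155; bytes = 3e12;
  total_min = mins * 100 / pct;                     # full-copy time, minutes
  printf "full copy: %dh%02dm\n", total_min/60, total_min%60;
  printf "rate: %.0f MB/s\n", bytes / (total_min * 60) / 1e6;
}'
```

This prints a full-copy time of 5h44m and a rate of about 145 MB/s, matching the figures quoted above.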

Rebuild time tests

For a rebuild after an individual drive failure, the rebuild-time is a critical parameter of the reliability figure calculation for a RAID, as a long rebuild time makes the RAID open to further disk failures during the rebuild window, and gives the potential for complete data loss particularly for RAID-5.

To test re-build time, I needed to simulate a disk failure, so I removed a drive from an existing 12-disk RAID-6 logical drive which was part of an online logical volume, while it was not busy. After a couple of minutes I replaced it, but it was marked EXILED, so I needed to clear that status for this perfectly good drive. It should be possible to do this by clearing the disk reserved area. I couldn't do that on the unit itself, oddly, because the unit was sufficiently busy with the EXILED drive that the password for enabling that operation was cleared several times before I was able to enter it in full! Maybe if I'd used the web GUI, I'd have had more success. Anyway, I cleared the reserved area successfully using another RAID with spare slots. Having re-inserted the cleared drive into its original RAID slot, it was then accepted as a fresh drive, and re-building started. I noted that the configuration disk array parameter Rebuild Priority was set to Normal, as on my previous Infortrend arrays. This presumably only affects relative priority of rebuild versus host-I/O activity, so where there is no host activity, its setting has no likely effect.

Rebuilding reached 1% after 10 minutes, 10% after 54 minutes. The full rebuild time was 8h36m for this 12-disk 3TB RAID6 array.

Firmware functional performance - version 3.88A.03

This section is about one of those blind alleys that you can go down, and then regret it ...!

I had seen that the firmware levels on the other two RAIDs were out of step, and that more recent firmware was available from Infortrend, so I decided with Infortrend's help to update to the latest recommended level at the time: 3.88A.03. Before doing this I deleted the logical drives already created, so that new logical drives could be created with any benefits of the new firmware.

Initialization time tests with this firmware

Unfortunately this update caused me annoying issues. The time to initialize each of the two logical drives (created one at a time) went up by a factor of 4: to be precise, 34h30m and 34h23m respectively. This seems a crazy amount of time. If it just affects initialisation, then of course I can live with it, but not if it indicates underlying performance issues.

I've checked on another similar RAID that this does not depend on whether the initialisation is done in Online or Offline mode: where there is no host activity, I wouldn't expect a difference anyway. I also did a Restore Defaults using the button at the back, which is like a Factory Reset: this cleared the password and the unit name that I had set, but didn't help with restoring a sane initialisation time. (Later on I also did a clear of all logical drives and removal of all drive reserved areas, see the firmware section below, and started again, but this didn't help either). I also checked that turning on Verification on LD Initialization Writes made it take even longer still (108m to do 2%, which extrapolates to 90 hours), so the problem is unlikely to be due to verifies being done unsolicited. As you'd expect, all these tests were done with a Controller Reset (that is, a RAID reboot) after any config change, to have a clean start.

Clone time tests with this firmware

I measured the clone time by having a logical drive for slots 1-12 and a global spare at slot 24, and requesting a clone of slot 12 to slot 24. This simulates a case where slot 12 is known to have a failing drive. Only the two disks are involved in this operation. It took 28 minutes for the clone operation to reach 1% of completion, 56 minutes for 2%, 80 minutes for 3%. So this would take 44 hours to complete! With the earlier 3.86C.09 firmware (above), it would take under 6 hours.

Rebuild time tests with this firmware

I've already mentioned the importance of a low rebuild time in rating a RAID for overall data security and reliability.

For a RAIDset of 12 drives in a RAID6: after 70 minutes, rebuilding had reached just 1%! The full rebuild-time was 60 hours 1 minute. This is to be compared with around 8.5 hours with the earlier firmware!

Manufacturer comments on this firmware

Infortrend were helpful in providing links to the firmware versions, but took a couple of weeks to agree that there was an issue with the later firmware versions, not only with the initialize time but also the rebuild time, and eventually put it down to the addition of an extra layer of data services in those firmware versions. They said that they were planning a new version of firmware which gave the customer the option of turning such services off or on. Unfortunately too late for me: I had RAIDs to deploy!

Downgrading firmware

Needless to say, after my unfortunate experience with the later firmware, and while the manufacturer was mulling over the causes, I had wanted immediately to try the earlier firmware versions that had come with these three RAIDs: in particular 3.86C.09. But it wasn't as simple as that, and it took a while to discover what extra steps were needed to achieve the earlier performance.

The time taken wasn't helped by the fact that installing an old version of firmware couldn't be done quickly via the web browser GUI, as the GUI silently refused to accept versions earlier than the current one, so I ended up doing it via Hyperterm at the recommended speed of 9600 bits/sec, which takes two hours. Multiply that by the number of retries I needed to get the steps right!

My solution to getting a good working version of the Infortrend firmware was:
1. remove all data on the drives by unmaking the Host LUN, deleting the logical volumes, and deleting the logical drives;
2. remove the reserved area on every physical drive;
3. power off, and remove the power leads for a minute;
4. pull out every physical drive (half-way);
5. replace the power leads, and power on with the Restore Defaults button pressed;
6. wait for initialisation, and set a password;
7. shutdown the controller (not a reset) to prepare for the update;
8. update the firmware via the serial port at 9600 baud using Hyperterm;
9. allow reboot with the Restore Defaults button pressed;
10. push in every physical drive;
11. create logical drives and volumes as required.

This may have been excessively elaborate, but certainly it was necessary to remove the reserved area on every drive (which is later automatically re-instated when the disks are made part of a RAIDset), in order to achieve the earlier firmware's good performance. Deciding if other steps are un/necessary I'll leave to others to discover!

Because of these shenanigans required for proper downgrading, I've only tested the 3.86C.09 and 3.88A.03 firmware versions for sure, and not the 3.86C.11 firmware which two of the RAIDs had been delivered with.

Creating LUNs

In Infortrend naming convention, Logical Drives are RAIDsets made up of a set of actual disk drives; Logical Volumes are sets of one or more Logical Drives; Partitions are subsets of Logical Drives/Volumes which can be assigned a channel LUN. In earlier versions of the Infortrend firmware, it was possible to partition a Logical Drive or a Logical Volume and assign either sort of partition to a channel LUN. In the version now in use, there is a strict hierarchy: a Logical Volume must be formed of one or more Logical Drives, only a Logical Volume can be partitioned, and only a Logical Volume Partition can be assigned to a channel LUN. At least this makes the documentation straightforward, I guess.

Creating a Logical Volume takes around a minute, but creating a Partition takes nearly 4 minutes. I found later that deleting a Logical Volume takes 6 minutes. The disks are fairly busy during these times, and no interaction with the user interface was possible during that time.

So I assigned Channel 0 ID 0 LUN 0 to LV0 Partition 0, and Channel 1 ID 0 LUN 0 to LV1 Partition 0.

These Partitions have nothing to do with the DOS or GPT partition that might (or might not) be created on the "disk" corresponding to the LUN, as seen by the host operating system.

When I subsequently attached the device to a server with an operating system, I found for me on that occasion with more than one LUN defined, that the order of the /dev/sdX device files, dynamically assigned at boot time, wasn't top to bottom in the RAID. There are probably various factors which could affect this (cabling, ordering of sockets on the LSI sas/sata HBA card, or presentation order by the RAID unit), but of course this is exactly why Linux uses filesystem LABELs and/or UUIDs, in order to eliminate ambiguities once the filesystems have been setup.

Host I/O performance

Already noted is that the sustained I/O transfer speed of the individual drives in this RAID is around 150 MBytes/second. So the maximum sustained real-data transfer speed in a RAID6 of 10+2 drives is going to be around 1500 MBytes/second, and less than that if other factors come into play (as they do!).
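That ceiling is simply the per-disk sustained rate multiplied by the number of data disks, since the two parity disks in a RAID6 carry no extra real data. As a one-line sketch using the figures above:

```shell
#!/bin/sh
# RAID6 10+2: only the 10 data disks contribute to sustained real-data rate.
awk 'BEGIN { per_disk = 150; data_disks = 10
             printf "%d MB/s max sustained\n", per_disk * data_disks }'
```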

Quick hdparm read performance test

The hdparm command provides an option -t to do quick simple read tests on a disk drive or system. The hdparm -t command was run on the raw /dev/sdb device from a SL5/RHEL5 system, with the --direct option (O_DIRECT), and without (giving normal kernel page-buffered I/O). The amount of read-ahead was modified per run, using the blockdev --setra command. This gave the following results:

[Graph: hdparm read test results]
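The sweep over read-ahead values can be sketched as follows. By default this script just prints the commands it would run; the device name and the read-ahead values here are illustrative, not the exact set used for the graph:

```shell
#!/bin/sh
# Sketch of the read-ahead sweep: repeat hdparm read tests for several
# read-ahead settings. Prints the commands by default; set RUN=1 (as root,
# with DEV pointing at the RAID's raw device) to actually execute them.
DEV=${DEV:-/dev/sdb}
run() { if [ "$RUN" = 1 ]; then "$@"; else echo "$*"; fi; }
for ra in 256 1024 4096 16384; do
  run blockdev --setra "$ra" "$DEV"   # read-ahead, in 512-byte sectors
  run hdparm -t --direct "$DEV"       # O_DIRECT read test
  run hdparm -t "$DEV"                # normal page-buffered read test
done
```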

Quick dd write performance to raw LUN device

A quick write test was run using the dd(1) *nix command, copying from /dev/zero to the raw host device file corresponding to the logical volume partition. Obviously not to be run after putting a filesystem on that device! This was done for a variety of dd bs values ("blocksize"). Most of the bs values were chosen as multiples of the RAID single-disk stripe size, but for comparison there is one which is not such a multiple, at the left-hand edge of each graph. The two graphs are for the earliest and later firmwares.

[Graph: ddwrite results, 3.86C.09 firmware] [Graph: ddwrite results, 3.88A.03 firmware]
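A sketch of that sort of dd loop is below. The bs values shown are multiples of the 128 kiB (131072-byte) per-disk stripe; the device name and total size are illustrative, not the exact values behind the graphs:

```shell
#!/bin/sh
# Sketch: sequential-write test to the raw LUN device, for blocksizes that
# are multiples of the 128 kiB (131072-byte) per-disk stripe.
# WARNING: this overwrites $DEV - never run it once a filesystem is on it.
DEV=${DEV:-/dev/sdb}
SIZE=${SIZE:-8589934592}               # total bytes written per run (8 GiB)
if [ -w "$DEV" ]; then
  for bs in 131072 262144 1310720 2621440; do
    echo "bs=$bs:"
    # dd reports the elapsed time and throughput on its last output line
    dd if=/dev/zero of="$DEV" bs="$bs" count=$(( SIZE / bs )) 2>&1 | tail -n 1
  done
else
  echo "$DEV not writable: skipping"
fi
```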

Creating a filesystem on a logical volume partition

One of the logical volumes of 30TB was formatted as XFS. As I mentioned, for me the order of the dynamically assigned /dev/sdX device files wasn't top to bottom in the RAID; that's not a problem once the filesystems have a filesystem LABEL or a UUID, but worth checking (as always), say by exercising the RAID and checking visually, before setting up that label or uuid. It's perfectly possible for an external RAID to turn up as /dev/sda and an internal disk (say a RAID-1) to get assigned as /dev/sdb!

I formatted and mounted as the bare device, rather than as a GPT partition. (There was no real need for a GPT partition table, but if I did partition, it would be sensible to ensure that partition(s) begin on a RAID full-stripe boundary). I later found this useful XFS guide which confirms my mkfs.xfs parameters below:

# mkfs.xfs -f -d su=128k,sw=10 -L 24a /dev/xxx
meta-data=/dev/xxx               isize=256    agcount=32, agsize=228880256 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=7324168192, imaxpct=25
         =                       sunit=32     swidth=320 blks, unwritten=1
naming   =version 2              bsize=4096  
log      =internal log           bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0

real	0m0.816s
user	0m0.002s
sys	0m0.016s

# mount LABEL=24a /disk/f24a
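The su/sw parameters given to mkfs.xfs follow directly from the RAID geometry: su is the 128 kiB per-disk stripe, and sw is the 10 data disks of the 10+2 RAID6. The sunit/swidth figures that mkfs.xfs reports back are the same quantities expressed in 4 kiB filesystem blocks, which can be checked with a quick calculation:

```shell
#!/bin/sh
# Derive the mkfs.xfs geometry report from the RAID parameters above.
awk 'BEGIN {
  su_bytes   = 128 * 1024;   # per-disk stripe (su=128k)
  data_disks = 10;           # sw=10 for a 10+2 RAID6
  bsize      = 4096;         # XFS block size
  printf "sunit=%d blks\n",  su_bytes / bsize;                 # 128k in blocks
  printf "swidth=%d blks\n", su_bytes * data_disks / bsize;    # full stripe
  printf "full stripe=%d kiB\n", su_bytes * data_disks / 1024;
}'
```

This gives sunit=32 blks and swidth=320 blks, agreeing with the mkfs.xfs output above, and a full-stripe width of 1280 kiB.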

Bonnie++ performance

Bonnie++ 1.03e tests were run on the same Scientific Linux 5.8 system as above. I'm using this system for this RAID in particular as the unit will be acting as a storage pool node for GridPP, and this currently requires SL5/RHEL5.

For the first setup, the system was booted with a kernel parameter of mem=4G in order to limit the amount of RAM available, as we wanted to measure the performance of the RAID, not of the Linux page buffer in RAM; a benchmark file space of 24 GiBytes was used, much larger than that RAM value; the bonnie chunk size was set as the RAID's per-disk stripe: 128 kiB. A second setup using the full RAM of 24 GiBytes and a larger file space of 48 GiBytes and a bonnie chunk size of 8 kiBytes was also done, and yielded very similar throughput results, as in graphs below (and much better random seeks: see detailed figures further below). The benchmark was repeated for a number of different read-ahead values, set using the blockdev --setra command.
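For reference, the first setup corresponds to a bonnie++ command along these lines; this is a sketch, with the mount point illustrative, using bonnie++'s -s size:chunk-size form to set the 24 GiB file space and 128 kiB chunk:

```shell
# Sketch only: 24 GiB benchmark file space with a 128 kiB chunk size,
# run against the XFS filesystem mounted earlier; the host had been
# booted with mem=4G. The directory is illustrative; -u is needed
# when running as root.
bonnie++ -d /disk/f24a -s 24g:128k -u root
```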

More importantly, the tests were done on two different firmwares: the earliest supplied, 3.86C.09, and the latest supplied, 3.88A.03. Just as with the firmware functional tests, the later firmware has significantly lower performance, for Writes and for ReWrites (which in bonnie++ are read & modify & update in place), as seen in the graphs below.

[Graph: bonnie++ results, 3.86C.09 firmware] [Graph: bonnie++ results, 3.88A.03 firmware]

Bonnie++ results in detail

Note that bonnie++ reports data rates in kiBytes/second, while the graphs above use MBytes/second (i.e. millions of bytes), the latter being commonly used when talking about data rates.
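The conversion is a factor of 1024/1000000, a small but easily overlooked 2.4% difference; for example (the 500000 kiB/sec figure here is illustrative, not a measured result):

```shell
#!/bin/sh
# Convert a bonnie++ kiB/sec figure to MB/sec (millions of bytes per second).
# 500000 kiB/sec is an illustrative value, not a measurement from the table.
awk 'BEGIN { kps = 500000; printf "%.0f MB/s\n", kps * 1024 / 1e6 }'
```

So a reported 500000 kiB/sec is 512 MB/s on the graphs.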

bonnie++ results table. Columns: Label - fstype - Read-ahead sectors; Size:Chunk Size; Sequential Output (Per Char, Block, Rewrite); Sequential Input (Per Char, Block); Random Seeks; Num Files; Sequential Create (Create, Read, Delete); Random Create (Create, Read, Delete). Transfer figures are in K/sec with %CPU; create/read/delete figures are in /sec with %CPU.

Result groups:
3.86C.09 firmware:
  Server RAM 4GB, system SL5.8
  Server RAM 24GB, system SL5.8
3.88A.03 firmware:
  Server RAM 4GB, system SL5.8
  Server RAM 24GB, system SL5.8
Reminder of an example good result from above.
Different raid: f26, 3.86C.09 firmware, server rex15, RAM 12GB, system SL6.2:
  One RAID6 set on one channel, single bonnie++
  Two RAID6 sets on one channel, two synched bonnie++ (sum them for the total)
  Two RAID6 sets on 2 separate channels, two synched bonnie++ (sum them for the total)
  One RAID6+0, single bonnie++
  One RAID6+0, two synched bonnie++
  One RAID6+0, single bonnie++ with various read-aheads
  Final set on a single logical RAID6 drive

Worth noting how much more sensible the XFS file create and delete times are under RHEL6 / SL6 (result lines beginning 26a or 26b), compared with the poor XFS times on earlier RHEL5 / SL5 systems (also see my comparison of XFS and ext4 filesystems). Create time is around 6 times shorter, delete time 15 times shorter (check).