Performance tests on 48 TB SATA Infortrend RAID Storage, ext4 and xfs, Oct 2010

Author: L.S.Lowe. File: raidperf20. Original version, tests performed: 20101018. This minor update: 20120426. Part of Guide to the Local System.

The RAID for this series of tests is an Infortrend A24S-G2130, equipped with 24 Hitachi A7K2000 disks (2 TB, SATA, 7.2k rpm, HUA722020ALA330) and 1 GB of RAM buffer memory. It was set up as 2 raidsets of 12 disks, each configured as RAID6 and so each holding the data equivalent of 10 disks, giving about 20 TB of usable space per raidset. The RAID stripe size was kept at the factory default: 128 kB.

The A24S-G2130 RAID controller comes with 4 multi-lane 12Gbps host ports (quoting the Infortrend brochure). The RAID was attached to a Dell R410 server via an LSI SAS3801E host bus adapter card. The server was equipped with two Intel E5520 quad-core processors and 12 GB of RAM. A summary graph of the bonnie++ results is shown below:

[Summary graph: f20 bonnie++ read/write performance]

Making the file-systems

The ext4 file-system was formatted on a 10TB raid-partition as follows:
# time mkfs -t ext4 -E stride=32,stripe-width=320 -i 65536 /dev/sd$dv
....
real    3m4.054s
user    0m2.823s
sys     0m32.838s
# tune2fs -c 0 -i 0 -r 1024000 -L $fs /dev/sd$dv
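
The stride and stripe-width values follow directly from the RAID geometry described above: the 128 kB RAID chunk divided by the 4 kB ext4 block size gives a stride of 32 blocks, and 32 multiplied by the 10 data disks gives a stripe-width of 320 blocks. A minimal sketch of that arithmetic (the device name /dev/sdX is only illustrative):

chunk_kib=128                           # per-disk RAID chunk (stripe) size
block_kib=4                             # ext4 block size
data_disks=10                           # 12-disk RAID6 leaves 10 data disks
stride=$((chunk_kib / block_kib))       # 128 / 4 = 32 blocks
stripe_width=$((stride * data_disks))   # 32 x 10 = 320 blocks
echo mkfs -t ext4 -E stride=$stride,stripe-width=$stripe_width -i 65536 /dev/sdX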

The xfs file-system was formatted on a full 20TB raidset as follows:

# time mkfs.xfs -d su=128k,sw=10 -L $fs /dev/sd$dv
meta-data=/dev/sdb               isize=256    agcount=19, agsize=268435424 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=4883118080, imaxpct=5
         =                       sunit=32     swidth=320 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=32 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

real    0m14.011s
user    0m0.000s
sys     0m0.285s
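
The su and sw values map onto the same geometry: su is the per-disk RAID chunk (128k) and sw is the number of data disks (10), which is why the output above reports sunit=32 and swidth=320 file-system blocks (32 x 4 kB = 128 kB). One way of re-checking the geometry later, assuming a hypothetical mount point of /data, is xfs_info, which reports the stripe unit and width the file-system was created with:

# confirm the stripe unit/width on the mounted file-system
xfs_info /data | grep -E 'sunit|swidth'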

Later: making an ext4 partition of 16 TiB (17.6 TB)

Some months after successfully deploying xfs (required for the 20 TB raidsets, because the ext4 e2fsprogs package only supported filesystems up to 16 TiB), the fileserver had a kernel crash with a traceback implicating XFS. The traceback was similar to that in this bug report. Having also had kernel crashes with xfs filesystems that repeatedly caused the loss of all data on another (non-Infortrend) off-the-shelf NAS device, and with no kernel fix in sight for the problem, I decided it was time to look again at ext4.

I didn't want to disturb the logical-drive RAIDsets of the RAID, which remained at 20 TB each. But mkfs.ext4 didn't make it easy: even when I specified an explicit block-count parameter to keep the file-system within 16 TiB, it checked and complained about the size of the whole device first!

I then thought to make a GPT partition table and partition using:

         parted devicefile mklabel gpt
         parted devicefile mkpart primary 2560s 16TiB
Note the use of 2560 sectors as the start offset of the partition: 2560 512-byte sectors is 1280 kiB, exactly one full stripe width (10 data disks x 128 kB), so data stripes written to the partition remain optimally aligned with the RAID; otherwise performance can suffer. (This is irrespective of filesystem type: for example see this discussion of stripe/partition alignment when using the XFS filesystem.) The use of 16TiB as the end offset ensured that the size was just less than 16 TiB.
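
A quick way of checking that alignment, assuming the same (hypothetical) devicefile, is simply to print the partition table in sector units:

         parted devicefile unit s print
         # the Start column should read 2560s for the new partition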

Later that day I decided not to use a GPT partition, but instead to firmware-partition the RAID device into a 17.6 TB LUN (16777000 MiB, just under 16 TiB), with the remaining 2.4 TB in another LUN. This will make expansion easier once e2fsprogs finally gains support for file-systems bigger than 16 TiB.

Timings for ext4 fsck recovery

See these previous tests on a similar system.

Bonnie++ tests

Bonnie++ tests (version 1.03e) were performed with various parameters. For these particular test runs, the amount of caching that Linux can do in RAM was not what we wanted to measure, even though that will improve performance in many situations. The data size chosen (24 GB) is therefore sufficiently larger than the 12 GB of RAM on the server plus the 1 GB RAM cache in the RAID.

The read-ahead was set to various values using the blockdev command (ra=4096 means a read-ahead of 4096 sectors). The ext4 file-system was mounted both with default options and with -o nodelalloc, and the xfs file-system was also tested. Some tests with the same setup were repeated to get an idea of consistency. The kernel in use was 2.6.32.21-168.fc12.x86_64 in Fedora 12. Bonnie speed results are in kiBytes/second for I/O rates, and in operations per second for file creation and deletion rates.
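
The exact command lines are not recorded here, but a representative run, assuming a hypothetical device /dev/sdX and mount point /data, might have looked something like the sketch below (the bonnie++ -s and -n arguments are inferred from the 24G:128k size/chunk and 256:1000:1000/256 file spec shown in the results):

blockdev --setra 4096 /dev/sdX               # read-ahead in 512-byte sectors
mount -o nodelalloc /dev/sdX /data           # or a plain mount for the default-option runs
bonnie++ -d /data -s 24g:128k -n 256:1000:1000:256 -u root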

Setup  Size:chk  chrW %c  BlkW %c  ReW %c  chrR %c  blkR %c  Seek %c  Create/Erase  SeqCre %c  SeqRead %c  SeqDel %c  RanCre %c  RanRd %c  RanDel %c
ext4 default
ra=256,20b-ext4o24G:128k811759947649060735668717949632399616223.11256:1000:1000/2564365197108795867148393423609727168271599236
ra=2048,20b-ext4o24G:128k881969947950461832109768589863552429228.81256:1000:1000/2563188190114230856884393411449727751281564936
ra=4096,20b-ext4o24G:128k882799947693759906959789159970544131222.01256:1000:1000/2563212789104989866899193414209727139271608736
ra=4096,20b-ext4o24G:128k851659947206657910149785089969397633224.41256:1000:1000/2564397598111441856532794401499727186271610436
ra=6144,20b-ext4o24G:128k8833999472704609535810797999972318634222.71256:1000:1000/2563098290118095906157193417789727127291587338
ra=8192,20b-ext4o24G:128k8292999471103599495410793109971395533222.21256:1000:1000/2563236788123535897241893434559727503271630436
ra=12288,20b-ext4o24G:128k8787299481416599932511795049979059837232.41256:1000:1000/2563138991114661866930193412079629078291569936
ra=16384,20b-ext4o24G:128k82989994786706010379511824469980728035222.22256:1000:1000/2563292289112688907025693387729527280281587538
ra=16384,20b-ext4o24G:128k85754994757156010056811820469980171735223.32256:1000:1000/2563092289117835906567193415899727223271601236
ra=20480,20b-ext4o24G:128k84357994743385910419811774059980001938222.02256:1000:1000/2563182690117924906177794423709726242271644038
ra=24576,20b-ext4o24G:128k88202994771745810448111802769980405836222.31256:1000:1000/2563166589120075896290594416599727206291562437
ra=28672,20b-ext4o24G:128k79100994793555810344011823809980170535227.32256:1000:1000/2563237691113255837157993411439728204331548739
ra=32768,20b-ext4o24G:128k87624994792555910419011809279979621734223.22256:1000:1000/2563010088128455956813094427479727286271603737
ext4 with nodelalloc option
ra=256,20b-ext4o-n24G:128k79649994254848514027417804409836311118213.91256:1000:1000/256354419690431734853895297508224998271489948
ra=2048,20b-ext4o-n24G:128k79368993966247820126822814559872490635215.91256:1000:1000/256252158592973764766695303238225127271494648
ra=4096,20b-ext4o-n24G:128k78046994255478319590821781209874222234214.61256:1000:1000/256258018684738774813995296428325092271497048
ra=4096,20b-ext4o-n24G:128k80446994224798422169924789269874219634211.81256:1000:1000/2563171687104644774931295297118324884251495747
ra=6144,20b-ext4o-n24G:128k82165994255918322312424808119875652336216.81256:1000:1000/256242998692723815035195285908324645261479549
ra=8192,20b-ext4o-n24G:128k80847994235268520507722812749979342637213.11256:1000:1000/256268078698525744976695306738225068261497747
ra=12288,20b-ext4o-n24G:128k80222994214608519015121790309985270739212.61256:1000:1000/256253228697086774800595292018324894261485247
ra=16384,20b-ext4o-n24G:128k80301994212878321256224787389888102144214.81256:1000:1000/256259068788404765028495300318225047261495446
ra=16384,20b-ext4o-n24G:128k81411984165098121772824813339987338539208.41256:1000:1000/256266078792546734946595277327627098281479646
ra=20480,20b-ext4o-n24G:128k81400993695407521458223785149987008938209.21256:1000:1000/256272788796293755180595306458325119261485849
ra=24576,20b-ext4o-n24G:128k77745973692177621831723799819886125038210.31256:1000:1000/256263178790393784951095303138224955261491547
ra=28672,20b-ext4o-n24G:128k77815983736757521202124807039986497438205.51256:1000:1000/2562586387100285875124795302948526021281469048
ra=32768,20b-ext4o-n24G:128k76146993703537722688025813959887285838206.71256:1000:1000/256256088695173754807295261837727206291479247
xfs
ra=256,20a-xfs24G:128k862109929823327769008759839835782619132.21256:1000:1000/25611875759964298103604911352738532498828547
ra=2048,20a-xfs24G:128k853739933552629822788767209972387733128.11256:1000:1000/25611790729763998103195011129728864898846650
ra=4096,20a-xfs24G:128k896839938815033856878771119873213433130.31256:1000:1000/256118087210138898102904911331719504298832248
ra=4096,20a-xfs24G:128k864229931710327689416800729856333925129.11256:1000:1000/256795849104911985094318043528224594574534
ra=4096,20a-xfs24G:128k883649930085727871468784969974920234136.71256:1000:1000/2561182370119711981037648113737110403198834145
ra=6144,20a-xfs24G:128k856869940520435904389800019973821433130.81256:1000:1000/25611733729966298103094911405767556198828048
ra=8192,20a-xfs24G:128k896389928890326910479800269978057334133.51256:1000:1000/25611759729768398103485011285738769498827047
ra=12288,20a-xfs24G:128k881929935879832943989811589984007737154.21256:1000:1000/256118207210296198102794811308749417198825647
ra=16384,20a-xfs24G:128k876189931858228957689812109983725237130.11256:1000:1000/25611893739679498103225011319738782498826347
ra=16384,20a-xfs24G:128k882629922312619705777801069862307328126.31256:1000:1000/256799149111657985073327932507635295521932
ra=16384,20a-xfs24G:128k8979499381078339544210769059884475740132.01256:1000:1000/256117817011213198103314811363719905298827846
ra=20480,20a-xfs24G:128k8849799336654309799810791659885000238136.11256:1000:1000/256117427310228798103084911231759363698834947
ra=24576,20a-xfs24G:128k8961399363928319824510756139784158738139.11256:1000:1000/256117607310278698103084811311759389298838747
ra=28672,20a-xfs24G:128k8948399383640349996910769449884713841142.71256:1000:1000/25611616739217498103225011302689219998832248
ra=32768,20a-xfs24G:128k87983993656953210021410799709884289538129.21256:1000:1000/256117177210173398103134911356738642198843748

Comments: ext4 performance is as good as or better than xfs performance for these tests. A feature of the Read-Write tests (ReW column) is that the performance doubles when the nodelalloc mount option is used.

At the time of writing, ext4/e2fsprogs does not support file-systems bigger than 16 TiB, which is why, for the ext4 tests above, a 10 TB firmware-partition was used. For production purposes, xfs will initially be used because of the ext4 limitation.

Quick Iozone tests

These were performed using Iozone version 3-347. Only Iozone tests 0 and 1 (write/rewrite and read/reread) were run (see man iozone). The results columns are in kiBytes/second. The read-ahead blockdev setting (ra) was set to 4096 and to 16384 sectors, on the ext4 and xfs filesystems. As usual, note that the rewrite test in Iozone is a different sort of test from the similarly-named one in bonnie++.
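
Again the exact invocation isn't recorded; a representative single-process throughput run, assuming a hypothetical mount point /data and a file size large enough to defeat caching (24 GB here is an assumption, matching the bonnie++ runs), might be:

blockdev --setra 16384 /dev/sdX          # or 4096, as in the rows below
iozone -i 0 -i 1 -s 24g -t 1 -F /data/iozone.tmp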

Type of Iozone test                                           | Write  | Rewrite | Read   | Reread
Throughput test with 1 process: ra=4096,20b-ext4o-nodelalloc  | 427428 | 443663  | 619943 | 619656
Throughput test with 1 process: ra=16384,20b-ext4o-nodelalloc | 442429 | 443190  | 720980 | 725769
Throughput test with 1 process: ra=4096,20a-xfs               | 513985 | 172980  | 629283 | 624194
Throughput test with 1 process: ra=16384,20a-xfs              | 452636 | 171898  | 733350 | 732641

L.S.Lowe