Performance tests on SAS Infortrend RAID, ext4 and xfs, Aug 2010

Author: L.S.Lowe. File: raidperf18. Original version: 20100812. This update: 20101023. Part of Guide to the Local System.

The RAID for this series of tests is an Infortrend S12S-G1033, equipped with 12 ST3450856SS disks (450GB, SAS, 15k rpm), configured as RAID6 and so with the data capacity of 10 disks. The RAID stripe (per-disk chunk) size was kept at the factory default of 128 kB. The disk data area is split into four firmware partitions, a b c d: areas a and b are 1.5TB each, and c and d are 0.75TB each.

The RAID was attached to a Dell R410 server via an LSI SAS3801E host bus adapter card. The server was equipped with two quad-core Intel E5520 processors and 12GB of RAM.

Making the file-systems

The ext4 file-system was formatted as follows (this is for a 1.5TB partition). As our average file size is likely to be fairly large, well in excess of 32768 bytes, it was sensible to ask for a bytes-per-inode value of 32768 in place of the default of 16384, with the default inode size of 256 bytes. (In previous systems, these defaults were 8192 and 128 respectively.) For stride: our RAID system uses a default chunk-size of 128 kBytes, which is 32 ext4 blocks of 4 kB each. For stripe-width: the RAID-6 set has 12 disks, giving 10 data disks, so the stripe-width is 10 times the stride value. The tune2fs command here sets the system reserved area to 1% and turns off regular full fsck checking, so occasional full fsck checks will have to be done by hand at a convenient time. The following commands were used (Fedora 12 system):
# time mkfs -t ext4 -E stride=32,stripe-width=320 -i 32768 /dev/sd$dv
mke2fs 1.41.9 (22-Aug-2009)
/dev/sdb is entire device, not just one partition!
Proceed anyway? (y,n) y
......
real    0m57.763s
user    0m0.984s
sys     0m13.239s
# tune2fs -c 0 -i 0 -m 1 -L $fs /dev/sd$dv
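
For a different chunk size or number of data disks, the stride and stripe-width follow from the same arithmetic. The lines below are just an illustrative sketch of that calculation (the shell variable names are mine, not part of the original procedure):

# chunk_kb=128                        # RAID chunk size in kB (per-disk stripe unit)
# data_disks=10                       # 12-disk RAID-6 set minus 2 parity disks
# stride=$((chunk_kb / 4))            # ext4 block size is 4 kB, so stride = 32
# swidth=$((stride * data_disks))     # stripe-width = 320
# mkfs -t ext4 -E stride=$stride,stripe-width=$swidth -i 32768 /dev/sd$dv
# tune2fs -c 0 -i 0 -m 1 -L $fs /dev/sd$dv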

The xfs file-system was formatted as follows: again this is for a 1.5TB partition:

# time mkfs.xfs -d su=128k,sw=10 -L 18b /dev/sd$dv2
meta-data=/dev/sdc               isize=256    agcount=4, agsize=91518048 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=366072192, imaxpct=5
         =                       sunit=32     swidth=320 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=178752, version=2
         =                       sectsz=512   sunit=32 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
real	0m4.540s
user	0m0.001s
sys	0m0.096s

Bonnie++ tests

Bonnie++ tests (version 1.03e) were performed with various parameters. For the purpose of these particular test runs, the amount of caching that Linux can do in RAM was not what we wanted to measure, even though that caching will improve performance in many real situations. The data size chosen (24GB) is therefore comfortably larger than the 12 GB of RAM in the server plus the 1 GB RAM cache in the RAID.

The read-ahead was set to various values using the blockdev command (ra=4096 means a read-ahead of 4096 512-byte sectors, i.e. 2 MiB). The ext4 file-system was mounted both with default options and with -o nodelalloc, and an xfs file-system was tested as well. Some tests with the same setup were repeated to get an idea of consistency. The kernel in use was 2.6.32.16-150.fc12.x86_64 in Fedora 12. Bonnie speed results are in kiBytes/second for I/O rates, and in operations per second for file creation and deletion rates.
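
The exact command lines are not reproduced here, but based on the parameters shown in the results table (a 24G data size with 128k chunks, and a 256:1000:1000/256 file specification), a typical run would have looked roughly like this:

# blockdev --setra 4096 /dev/sdb        # read-ahead of 4096 512-byte sectors
# blockdev --getra /dev/sdb             # check the setting took effect
# mount -t ext4 /dev/sdb /disk/18a
# bonnie++ -d /disk/18a -s 24g:128k -n 256:1000:1000:256 -u root -m 18a-ext4o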

Some of the following tests were performed after the RAID went into production: these are marked + in the Setup column. The kernel in use for these was 2.6.32.21-168.fc12.x86_64. The read-ahead for those tests included values as high as 32768 sectors, to enable comparison with bulk-data RAIDs (like this one); such large values wouldn't be used in production for a home file-system where many files are small.

Setup                       Size:chk  chrW %c  BlkW %c  ReW %c  chrR %c  blkR %c  Seek %c  Create/Erase  SeqCre %c  SeqRead %c  SeqDel %c  RanCre %c  RanRd %c  RanDel %c
ext4 default
ra=256,18a-ext4o24G:128k871119936913660683918663699325346612371.32256:1000:1000/2564365798121547847553196438299727046241621935
ra=1024,18a-ext4o24G:128k820699936784259731797764459735619516355.62256:1000:1000/2564134198121054847449096401389726842251550036
ra=4096,18a-ext4o24G:128k850899925618843906259784899949531922374.33256:1000:1000/2563235592102360866237796401229826357271580937
ra=4096,18a-ext4o24G:128k888709937727161911649825509956114524373.23256:1000:1000/2564445598127743847554096438769726863241603935
ra=4096,18a-ext4o24G:128k882309937172559889659791819949380024359.62256:1000:1000/2564501798125987837493796447229726361241540834
ext4 with nodelalloc option
ra=256,18a-ext4o-n24G:128k79369993332216712218814795209830484315360.52256:1000:1000/256351589693245685212296310818224498231459344
ra=256,18b-ext4o-n+24G:128k77645983285986912169315762999828043615340.32256:1000:1000/256281028883045674879194285828024765251464745
ra=1024,18a-ext4o-n24G:128k79833993327767214782815806889842202518362.12256:1000:1000/256356539694896674951896304688324394231435344
ra=4096,18a-ext4o-n24G:128k80076993312007116988418803639853043625350.52256:1000:1000/256350619692081685347296295558323927241493245
ra=4096,18a-ext4o-n24G:128k82495993399726717932418809739854131924369.32256:1000:1000/256370389699478675426496305868224480221552245
ra=4096,18a-ext4o-n24G:128k82934993334676617988318799999853704023352.32256:1000:1000/256370229698881665563796287278025179231487244
ra=4096,18b-ext4o-n+24G:128k79208993273747216023018808819848699222334.22256:1000:1000/2562659789104505824753793260177325142261424644
ra=8192-18b-ext4o-n+24G:128k79963993296356817804720809769862318228360.23256:1000:1000/256343399689004684937994288028023021241434845
ra=16384-18b-ext4o-n+24G:128k78406993300137017565019801619862907530334.82256:1000:1000/256276338876055644537993258887226492291449647
ra=24576-18b-ext4o-n+24G:128k78195973255166817128218744719962055328321.32256:1000:1000/256268288897686785022594260437025533261462746
ra=32768-18b-ext4o-n+24G:128k78132973203626817018418811049862704528331.53256:1000:1000/256278978877276644914894250657125975291416346
xfs
ra=256,18b-xfs24G:128k858669929379726793309732499728220713295.82256:1000:1000/25611161691113851009771471055864103472100571034
ra=1024,18b-xfs24G:128k878869931765528810608790479839356418322.52256:1000:1000/2561117766115875100976246106476510556999580735
ra=4096,18b-xfs24G:128k903029927970922934378786159851191324308.32256:1000:1000/256112527211067999981847105186610652599630236

Comments: ext4 performance matches or exceeds xfs performance in these tests. A notable feature of the rewrite tests (ReW column) is that the performance roughly doubles when the nodelalloc mount option is used.

Timings for simple data creation and deletion on various types of file-system on a SAS RAID

Files were created using: dd if=/dev/zero of=$fn count=nn bs=131072, where nn was 8 or more, depending on the file size. Files were deleted, after measures to ensure that data and metadata buffers had been flushed out of the server and RAID caches, using: rm -Rf dirname. Timings were taken with the bash time built-in command. The blockdev read-ahead setting for all of these tests was 4096 sectors.
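
A sketch of the procedure for one directory of 1MiB files is shown below. The directory names are illustrative, and the sync plus drop_caches step is shown as a representative way of flushing the server-side caches before the timed delete:

# mkdir -p /disk/18a/test/dir001
# time for i in $(seq -w 1 1000); do dd if=/dev/zero of=/disk/18a/test/dir001/file$i bs=131072 count=8 2>/dev/null; done
# sync                                  # flush dirty data and metadata out to the RAID
# echo 3 > /proc/sys/vm/drop_caches     # drop the server's page, dentry and inode caches
# time rm -Rf /disk/18a/test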

The ext4o and xfs file-systems were 1.5TB each and were up to 44% full. The ext2 and ext3o file-systems were 0.75TB each and were up to 88% full. ext3o and ext4o mean ext3 and ext4 with the (default) ordered mode.

While doing some initial dd tests, the disk activity lights on the RAID were watched. Using dd to create a large file (a few GB or more) showed that the disk activity came in bursts every second: roughly half a second on, half a second off. This did not occur when the file size was below a couple of GB, and the result timings bear this out. Looking at top, flush-8:16 was running at around 70% of one processor, with dd at 20% of another, once the output file grew to around 4GB or bigger (this unit has two quad-core E5520 processors). Also, iotop reported some oddly large figures (>1000 M/s) for Disk Write for flush and jbd2.

After some trials, it was found that using the nodelalloc mount option gave improved performance, oddly! Looking at top again, now with this mount option in place, flush-8:16 was at around 3%, jbd2/sdb-8 at around 10%, and dd at 70%. iotop was also more sensible, reporting dd as writing at 300 M/s. It's not clear whether this is a bug or a feature in the kernel: I am running 2.6.32.16-150.fc12.x86_64 currently. The lower write performance for a default ext4 mount wasn't evident in the bonnie++ results in the section above; on the other hand, bonnie++ generates multiple 1GB files rather than one big file, which may (or may not) be relevant!

Conclusion: the nodelalloc option seems a good idea for sane ext4 behaviour with the present kernel.
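
In practice this just means adding the option at mount time or in /etc/fstab, for example as below (the mount point matches those used above; the label is an assumption based on the mkfs labelling shown earlier):

# mount -t ext4 -o nodelalloc /dev/sdb /disk/18a
or, in /etc/fstab:
LABEL=18a   /disk/18a   ext4   defaults,nodelalloc   1 2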

Create times (output of the bash time built-in), by data size/shape and file-system:

10GB in 1MiB files, 1000 files per dir, 10 directories:
    ext2               real 0m59.269s   user 0m4.571s   sys 0m22.389s
    ext3o              real 0m48.386s   user 0m5.090s   sys 0m34.088s
    ext4o              real 0m36.629s   user 0m4.712s   sys 0m22.520s
    ext4o,nodelalloc   real 0m40.174s   user 0m5.083s   sys 0m33.351s
    xfs                real 0m43.529s   user 0m4.675s   sys 0m23.371s

100GB in 1MiB files, 1000 files per dir, 100 directories:
    ext2               real 12m55.747s  user 0m50.404s  sys 3m46.392s
    ext3o              real 9m37.460s   user 0m50.430s  sys 5m57.278s
    ext4o              real 7m6.369s    user 0m50.177s  sys 3m46.584s
    ext4o,nodelalloc   real 6m55.249s   user 0m49.877s  sys 5m44.072s
    xfs                real 8m25.341s   user 0m50.537s  sys 3m58.345s

100GB in 10MiB files, 100 files per dir, 100 directories:
    ext2               real 9m42.778s   user 0m6.569s   sys 2m12.464s
    ext3o              real 8m45.528s   user 0m6.467s   sys 4m29.624s
    ext4o              real 7m24.878s   user 0m6.707s   sys 2m27.540s
    ext4o,nodelalloc   real 5m38.262s   user 0m6.807s   sys 4m16.459s
    xfs                real 11m19.553s  user 0m6.797s   sys 2m22.753s

100GB in 100MiB files, 100 files per dir, 10 directories:
    ext2               real 6m59.802s   user 0m1.046s   sys 2m6.885s
    ext3o              real 7m15.308s   user 0m0.930s   sys 4m13.310s
    ext4o              real 7m40.650s   user 0m0.886s   sys 2m13.302s
    ext4o,nodelalloc   real 5m30.942s   user 0m1.052s   sys 4m6.909s
    xfs                real 10m46.309s  user 0m0.970s   sys 2m18.211s

100GB in 1GiB files, 100 files per dir, 1 directory:
    ext2               real 6m1.267s    user 0m0.417s   sys 2m6.913s
    ext3o              real 7m11.983s   user 0m0.451s   sys 4m22.091s
    ext4o              real 8m35.436s   user 0m0.397s   sys 3m45.442s
    ext4o,nodelalloc   real 5m35.668s   user 0m0.549s   sys 4m12.811s
    xfs                real 5m44.729s   user 0m0.437s   sys 2m18.844s

100GB in 5GiB files, 20 files per dir, 1 directory:
    ext2               real 6m50.423s   user 0m0.340s   sys 2m12.040s
    ext3o              real 7m5.830s    user 0m0.354s   sys 4m21.068s
    ext4o              real 8m45.591s   user 0m0.302s   sys 4m5.133s
    ext4o,nodelalloc   real 5m19.863s   user 0m0.468s   sys 3m58.840s
    xfs                real 5m14.756s   user 0m0.442s   sys 2m17.564s

100GB in 10GiB files, 10 files per dir, 1 directory:
    ext2               real 6m32.086s   user 0m0.306s   sys 2m17.188s
    ext3o              real 7m41.802s   user 0m0.332s   sys 4m28.373s
    ext4o              real 10m15.886s  user 0m0.323s   sys 3m25.950s
    ext4o,nodelalloc   real 5m22.866s   user 0m0.356s   sys 4m6.127s
    xfs                real 5m4.910s    user 0m0.364s   sys 2m17.101s
Delete times (output of the bash time built-in), by data size/shape and file-system:

10GB in 1MiB files, 1000 files per dir, 10 directories:
    ext2               real 0m36.025s   user 0m0.006s   sys 0m0.368s
    ext3o              real 0m43.659s   user 0m0.007s   sys 0m0.920s
    ext4o              real 0m0.948s    user 0m0.005s   sys 0m0.516s
    ext4o,nodelalloc   real 0m1.179s    user 0m0.003s   sys 0m0.566s
    xfs                real 0m1.671s    user 0m0.006s   sys 0m0.678s

100GB in 1MiB files, 1000 files per dir, 100 directories:
    ext2               real 6m0.509s    user 0m0.065s   sys 0m3.029s
    ext3o              real 7m11.398s   user 0m0.075s   sys 0m9.115s
    ext4o              real 0m9.333s    user 0m0.039s   sys 0m5.268s
    ext4o,nodelalloc   real 0m12.384s   user 0m0.042s   sys 0m5.887s
    xfs                real 0m16.022s   user 0m0.053s   sys 0m6.838s

100GB in 10MiB files, 100 files per dir, 100 directories:
    ext2               real 2m7.691s    user 0m0.019s   sys 0m1.767s
    ext3o              real 2m21.691s   user 0m0.023s   sys 0m5.009s
    ext4o              real 0m5.263s    user 0m0.009s   sys 0m2.667s
    ext4o,nodelalloc   real 0m6.001s    user 0m0.008s   sys 0m2.741s
    xfs                real 0m1.530s    user 0m0.004s   sys 0m0.575s

100GB in 100MiB files, 100 files per dir, 10 directories:
    ext2               real 1m30.588s   user 0m0.003s   sys 0m1.188s
    ext3o              real 1m46.732s   user 0m0.005s   sys 0m4.186s
    ext4o              real 0m3.588s    user 0m0.000s   sys 0m2.441s
    ext4o,nodelalloc   real 0m3.544s    user 0m0.000s   sys 0m2.461s
    xfs                real 0m0.272s    user 0m0.001s   sys 0m0.057s

100GB in 1GiB files, 100 files per dir, 1 directory:
    ext2               real 1m30.137s   user 0m0.000s   sys 0m1.195s
    ext3o              real 1m36.648s   user 0m0.001s   sys 0m4.298s
    ext4o              real 0m6.742s    user 0m0.000s   sys 0m2.587s
    ext4o,nodelalloc   real 0m4.320s    user 0m0.000s   sys 0m2.510s
    xfs                real 0m0.088s    user 0m0.001s   sys 0m0.007s

100GB in 5GiB files, 20 files per dir, 1 directory:
    ext2               real 1m29.608s   user 0m0.000s   sys 0m1.359s
    ext3o              real 1m33.934s   user 0m0.000s   sys 0m4.268s
    ext4o              real 0m3.429s    user 0m0.001s   sys 0m2.483s
    ext4o,nodelalloc   real 0m3.791s    user 0m0.000s   sys 0m2.479s
    xfs                real 0m0.065s    user 0m0.000s   sys 0m0.001s

100GB in 10GiB files, 10 files per dir, 1 directory:
    ext2               real 1m28.458s   user 0m0.000s   sys 0m1.160s
    ext3o              real 1m34.002s   user 0m0.000s   sys 0m4.169s
    ext4o              real 0m3.792s    user 0m0.001s   sys 0m2.470s
    ext4o,nodelalloc   real 0m3.462s    user 0m0.001s   sys 0m2.469s
    xfs                real 0m0.068s    user 0m0.001s   sys 0m0.004s

Iozone tests on 1 channel and 2 channel scenarios

All tests above were performed with the filesystems accessed via the same single SAS I/O channel.

This RAID has two SAS channel interfaces (although just the one controller) and can be configured to split the firmware partitions/LUNs between those two channels; the server also has two SAS channel interfaces. It was therefore useful to check whether using both channels gives a performance gain in the simple situation where the RAID unit is configured as one 12-disk RAIDset and is attached to just one server.

For this test, Iozone version 3-347 was used. All filesystems were formatted as ext4 and mounted with the nodelalloc option. The blockdev read-ahead setting was 4096 sectors for all filesystem devices. Only Iozone tests 0 and 1 (write/rewrite and read/reread) were performed (see man iozone). Results columns are in kiBytes/second.

/dev/sdb on /disk/18a type ext4 (rw,nodelalloc)
/dev/sdc on /disk/18c type ext4 (rw,nodelalloc)
/dev/sdd on /disk/18b type ext4 (rw,nodelalloc)
/dev/sde on /disk/18d type ext4 (rw,nodelalloc)
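
A representative 4-stream Iozone throughput run across those four filesystems would look something like the following; the per-process file size and record size here are illustrative choices, not necessarily the ones used:

# iozone -i 0 -i 1 -t 4 -s 8g -r 1024k \
    -F /disk/18a/ioz1 /disk/18b/ioz2 /disk/18c/ioz3 /disk/18d/ioz4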

The conclusion is no: there is no performance gain from using two channels to access one RAIDset via one controller on this RAID. This isn't unexpected. So I can use the second channel interface on the server for other purposes, such as back-up (the RAID itself doesn't have a SAS-out interface for daisy-chaining). It's interesting to note that the total throughput is better when the multiple Iozone streams are directed at different file-systems (LUNs) than when they are all directed at the same file-system (one LUN).

The final group of eight RAID results shows a completely reconfigured scenario, where the RAID is set up as one LUN of 4.5TB and a GPT partition table has been put in place to carve the area into 4 software partitions:

/dev/sdb1 on /disk/18a type ext4 (rw,nodelalloc)
/dev/sdb2 on /disk/18b type ext4 (rw,nodelalloc)
/dev/sdb3 on /disk/18c type ext4 (rw,nodelalloc)
/dev/sdb4 on /disk/18d type ext4 (rw,nodelalloc)

The GPT-partitioned scenario shows decreased performance compared with the 1-channel, 4-LUN scenario. So our production configuration will be 1 channel with multiple LUNs, and no software partitioning of the LUNs.

Type of Iozone test                            Write   Rewrite   Read   Reread
1 channel for the 4 LUNs: filesystems a b c d all on one channel
Throughput test with 1 process: a 346479 332649 450167 457191
Throughput test with 2 processes: a+b 384678 369588 454230 459465
Throughput test with 3 processes: a+b+c 384599 372825 458146 457339
Throughput test with 4 processes: a+b+c+d 376672 368269 471071 473646
Throughput test with 1 process: a 326613 339005 450204 456079
Throughput test with 2 processes: a+a 316312 310243 371587 366667
Throughput test with 3 processes: a+a+a 298118 297610 344950 340613
Throughput test with 4 processes: a+a+a+a 294010 293234 328245 321795
2 channels for the 4 LUNs: filesystems a&c on ch0, b&d on ch1
Throughput test with 1 process: a 350565 306995 452540 458903
Throughput test with 2 processes: a+b 377580 368764 458391 461924
Throughput test with 3 processes: a+b+c 389046 377313 454927 456743
Throughput test with 4 processes: a+b+c+d 375546 364314 474236 471858
Throughput test with 1 process: a 350777 328803 446399 454587
Throughput test with 2 processes: a+a 318038 310567 362857 362713
Throughput test with 3 processes: a+a+a 297536 298133 339188 336537
Throughput test with 4 processes: a+a+a+a 295239 294594 322317 319261
1 channel for 1 LUN: filesystems a b c d on GPT partitions
Throughput test with 1 process: a 317951 316096 415092 427690
Throughput test with 2 processes: a+b 302721 316027 348647 347296
Throughput test with 3 processes: a+b+c 299956 311405 334372 335956
Throughput test with 4 processes: a+b+c+d 300421 309448 305303 311809
Throughput test with 1 process: a 329863 309818 415362 424972
Throughput test with 2 processes: a+a 296115 291133 344669 342420
Throughput test with 3 processes: a+a+a 276889 276264 320414 318283
Throughput test with 4 processes: a+a+a+a 274836 273876 302671 303608
For comparison, non-RAID internal disk ST3500320NS ra=4096
Throughput test with 1 process: a 84651 79899 96938 96935
Throughput test with 2 processes: a+a 65270 64597 80271 80250
Throughput test with 3 processes: a+a+a 58296 57075 71239 71088
Throughput test with 4 processes: a+a+a+a 50836 49668 61477 61379
Throughput test with 1 process: a (repeat) 90356 79738 99168 99199
Throughput test with 1 process: a (repeat) 81832 81616 96937 96958
Throughput test with 1 process: a (ra=16384) 82035 81164 96938 96975

L.S.Lowe