Author: L.S.Lowe. File: raidperf18. Original version: 20100812. This update: 20101023. Part of Guide to the Local System.
The RAID for this series of tests is an Infortrend S12S-G1033, equipped with 12 ST3450856SS disks (450GB, SAS, 15k rpm), configured as RAID6, and so with the data capacity equivalent of 10 disks. The RAID stripe size was kept at the factory default of 128 kB. The disk data area is split into four firmware partitions, a b c d: areas a and b are 1.5TB each, and c and d are 0.75TB each. The RAID was attached to a Dell R410 server via an LSI SAS3801E host bus adapter card. The server was equipped with two quad-core Intel E5520 processors and 12GB of RAM.
The ext4 file-system was formatted as follows, for a 1.5TB partition. The stride of 32 corresponds to the 128 kB RAID stripe divided by the 4 kB file-system block size, and the stripe-width of 320 is the stride multiplied by the 10 data disks:

    # time mkfs -t ext4 -E stride=32,stripe-width=320 -i 32768 /dev/sd$dv
    mke2fs 1.41.9 (22-Aug-2009)
    /dev/sdb is entire device, not just one partition!
    Proceed anyway? (y,n) y
    ......
    real    0m57.763s
    user    0m0.984s
    sys     0m13.239s
    # tune2fs -c 0 -i 0 -m 1 -L $fs /dev/sd$dv
The xfs file-system was formatted as follows; again this is for a 1.5TB partition:

    # time mkfs.xfs -d su=128k,sw=10 -L 18b /dev/sd$dv2
    meta-data=/dev/sdc               isize=256    agcount=4, agsize=91518048 blks
             =                       sectsz=512   attr=2
    data     =                       bsize=4096   blocks=366072192, imaxpct=5
             =                       sunit=32     swidth=320 blks
    naming   =version 2              bsize=4096   ascii-ci=0
    log      =internal log           bsize=4096   blocks=178752, version=2
             =                       sectsz=512   sunit=32 blks, lazy-count=1
    realtime =none                   extsz=4096   blocks=0, rtextents=0
    real    0m4.540s
    user    0m0.001s
    sys     0m0.096s
The read-ahead was set to various values using the blockdev command (ra=4096 means a read-ahead of 4096 512-byte sectors). The file-system was mounted with the ext4 default options and also with -o nodelalloc, and an xfs file-system was tested as well. Some tests for the same setup were repeated to get an idea of consistency. The kernel in use was 2.6.32.16-150.fc12.x86_64 in Fedora 12. Bonnie++ speed results are in kiBytes/second for I/O rates, and in files per second for the file creation and deletion rates.
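For reference, the read-ahead and mount settings were of this general form (a sketch only: the device name and mount point follow those used elsewhere in this note, and these are not the exact recorded command lines):

    # set the read-ahead of the RAID block device to 4096 sectors (512-byte units)
    blockdev --setra 4096 /dev/sdb
    # check the value currently in effect
    blockdev --getra /dev/sdb
    # mount with ext4 defaults, or with delayed allocation disabled
    mount -t ext4 /dev/sdb /disk/18a
    mount -t ext4 -o nodelalloc /dev/sdb /disk/18a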
Some of the following tests were performed after the RAID went into production mode: these are marked + in the Setup column. The kernel in use for those was 2.6.32.21-168.fc12.x86_64. The read-ahead for those tests included values as high as 32768 sectors, to enable comparison with bulk data RAIDs (like this one); such a large read-ahead wouldn't be used in production for a home file-system where many files are small.
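The bonnie++ command line corresponding to the parameters shown in the table would have been along these lines (a sketch, not the exact invocation used; the target directory and user are placeholders):

    # 24 GiB of data in 128 kB chunks; 256*1024 files of 1000 bytes spread over 256 directories
    bonnie++ -d /disk/18a -s 24g:128k -n 256:1000:1000:256 -u someuser
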
Setup | Size:chunk | ChrW | %CPU | BlkW | %CPU | ReW | %CPU | ChrR | %CPU | BlkR | %CPU | Seeks | %CPU | Create/Erase | SeqCre | %CPU | SeqRead | %CPU | SeqDel | %CPU | RanCre | %CPU | RanRead | %CPU | RanDel | %CPU |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ext4 default | ||||||||||||||||||||||||||
ra=256,18a-ext4o | 24G:128k | 87111 | 99 | 369136 | 60 | 68391 | 8 | 66369 | 93 | 253466 | 12 | 371.3 | 2 | 256:1000:1000/256 | 43657 | 98 | 121547 | 84 | 75531 | 96 | 43829 | 97 | 27046 | 24 | 16219 | 35 |
ra=1024,18a-ext4o | 24G:128k | 82069 | 99 | 367842 | 59 | 73179 | 7 | 76445 | 97 | 356195 | 16 | 355.6 | 2 | 256:1000:1000/256 | 41341 | 98 | 121054 | 84 | 74490 | 96 | 40138 | 97 | 26842 | 25 | 15500 | 36 |
ra=4096,18a-ext4o | 24G:128k | 85089 | 99 | 256188 | 43 | 90625 | 9 | 78489 | 99 | 495319 | 22 | 374.3 | 3 | 256:1000:1000/256 | 32355 | 92 | 102360 | 86 | 62377 | 96 | 40122 | 98 | 26357 | 27 | 15809 | 37 |
ra=4096,18a-ext4o | 24G:128k | 88870 | 99 | 377271 | 61 | 91164 | 9 | 82550 | 99 | 561145 | 24 | 373.2 | 3 | 256:1000:1000/256 | 44455 | 98 | 127743 | 84 | 75540 | 96 | 43876 | 97 | 26863 | 24 | 16039 | 35 |
ra=4096,18a-ext4o | 24G:128k | 88230 | 99 | 371725 | 59 | 88965 | 9 | 79181 | 99 | 493800 | 24 | 359.6 | 2 | 256:1000:1000/256 | 45017 | 98 | 125987 | 83 | 74937 | 96 | 44722 | 97 | 26361 | 24 | 15408 | 34 |
ext4 with nodelalloc option | ||||||||||||||||||||||||||
ra=256,18a-ext4o-n | 24G:128k | 79369 | 99 | 333221 | 67 | 122188 | 14 | 79520 | 98 | 304843 | 15 | 360.5 | 2 | 256:1000:1000/256 | 35158 | 96 | 93245 | 68 | 52122 | 96 | 31081 | 82 | 24498 | 23 | 14593 | 44 |
ra=256,18b-ext4o-n+ | 24G:128k | 77645 | 98 | 328598 | 69 | 121693 | 15 | 76299 | 98 | 280436 | 15 | 340.3 | 2 | 256:1000:1000/256 | 28102 | 88 | 83045 | 67 | 48791 | 94 | 28582 | 80 | 24765 | 25 | 14647 | 45 |
ra=1024,18a-ext4o-n | 24G:128k | 79833 | 99 | 332776 | 72 | 147828 | 15 | 80688 | 98 | 422025 | 18 | 362.1 | 2 | 256:1000:1000/256 | 35653 | 96 | 94896 | 67 | 49518 | 96 | 30468 | 83 | 24394 | 23 | 14353 | 44 |
ra=4096,18a-ext4o-n | 24G:128k | 80076 | 99 | 331200 | 71 | 169884 | 18 | 80363 | 98 | 530436 | 25 | 350.5 | 2 | 256:1000:1000/256 | 35061 | 96 | 92081 | 68 | 53472 | 96 | 29555 | 83 | 23927 | 24 | 14932 | 45 |
ra=4096,18a-ext4o-n | 24G:128k | 82495 | 99 | 339972 | 67 | 179324 | 18 | 80973 | 98 | 541319 | 24 | 369.3 | 2 | 256:1000:1000/256 | 37038 | 96 | 99478 | 67 | 54264 | 96 | 30586 | 82 | 24480 | 22 | 15522 | 45 |
ra=4096,18a-ext4o-n | 24G:128k | 82934 | 99 | 333467 | 66 | 179883 | 18 | 79999 | 98 | 537040 | 23 | 352.3 | 2 | 256:1000:1000/256 | 37022 | 96 | 98881 | 66 | 55637 | 96 | 28727 | 80 | 25179 | 23 | 14872 | 44 |
ra=4096,18b-ext4o-n+ | 24G:128k | 79208 | 99 | 327374 | 72 | 160230 | 18 | 80881 | 98 | 486992 | 22 | 334.2 | 2 | 256:1000:1000/256 | 26597 | 89 | 104505 | 82 | 47537 | 93 | 26017 | 73 | 25142 | 26 | 14246 | 44 |
ra=8192-18b-ext4o-n+ | 24G:128k | 79963 | 99 | 329635 | 68 | 178047 | 20 | 80976 | 98 | 623182 | 28 | 360.2 | 3 | 256:1000:1000/256 | 34339 | 96 | 89004 | 68 | 49379 | 94 | 28802 | 80 | 23021 | 24 | 14348 | 45 |
ra=16384-18b-ext4o-n+ | 24G:128k | 78406 | 99 | 330013 | 70 | 175650 | 19 | 80161 | 98 | 629075 | 30 | 334.8 | 2 | 256:1000:1000/256 | 27633 | 88 | 76055 | 64 | 45379 | 93 | 25888 | 72 | 26492 | 29 | 14496 | 47 |
ra=24576-18b-ext4o-n+ | 24G:128k | 78195 | 97 | 325516 | 68 | 171282 | 18 | 74471 | 99 | 620553 | 28 | 321.3 | 2 | 256:1000:1000/256 | 26828 | 88 | 97686 | 78 | 50225 | 94 | 26043 | 70 | 25533 | 26 | 14627 | 46 |
ra=32768-18b-ext4o-n+ | 24G:128k | 78132 | 97 | 320362 | 68 | 170184 | 18 | 81104 | 98 | 627045 | 28 | 331.5 | 3 | 256:1000:1000/256 | 27897 | 88 | 77276 | 64 | 49148 | 94 | 25065 | 71 | 25975 | 29 | 14163 | 46 |
xfs | ||||||||||||||||||||||||||
ra=256,18b-xfs | 24G:128k | 85866 | 99 | 293797 | 26 | 79330 | 9 | 73249 | 97 | 282207 | 13 | 295.8 | 2 | 256:1000:1000/256 | 11161 | 69 | 111385 | 100 | 9771 | 47 | 10558 | 64 | 103472 | 100 | 5710 | 34 |
ra=1024,18b-xfs | 24G:128k | 87886 | 99 | 317655 | 28 | 81060 | 8 | 79047 | 98 | 393564 | 18 | 322.5 | 2 | 256:1000:1000/256 | 11177 | 66 | 115875 | 100 | 9762 | 46 | 10647 | 65 | 105569 | 99 | 5807 | 35 |
ra=4096,18b-xfs | 24G:128k | 90302 | 99 | 279709 | 22 | 93437 | 8 | 78615 | 98 | 511913 | 24 | 308.3 | 2 | 256:1000:1000/256 | 11252 | 72 | 110679 | 99 | 9818 | 47 | 10518 | 66 | 106525 | 99 | 6302 | 36 |
Comments: ext4 performance matches or exceeds xfs performance in these tests. A notable feature of the rewrite results (ReW column) is that the rate roughly doubles when the nodelalloc mount option is used.
The ext4o and xfs file-systems were 1.5TB each and were up to 44% full. The ext2 and ext3o file-systems were 0.75TB each and were up to 88% full. ext3o and ext4o mean ext3 and ext4 with the (default) ordered mode.
While doing some initial dd tests, the disk activity lights on the RAID were checked. Using dd to create a large file (a few GB or more) showed that the disk activity came in bursts every second: roughly half-a-second on, half-a-second off. This does not occur when the file-size is below a couple of GB. This is borne out by the result timings. Looking at top, flush-8:16 was running at around 70% of one processor, with dd at 20% of another, when the output file grows to around 4GB or bigger. (This unit has 2x E5520 quad processors). Also, iotop reports some oddly large figures (>1000 M/s) for Disk Write for flush and jbd2. After some trials, it was found that using the nodelalloc mount option gave improved performance, oddly! Again looking at top, now with this mount option in place, flush-8:16 is at around 3%, jbd2/sdb-8 at around 10%, and dd 70%. iotop is also more sensible, reporting dd as writing at 300 M/s. It's not clear if this is a bug or a feature in the kernel: I am running 2.6.32.16-150.fc12.x86_64 currently. The lower write performance for a default ext4 mount wasn't evident in the bonnie++ results in the section above; on the other hand, bonnie++ generates multiple 1GB files rather than one big file, which may (or may not) be relevant!
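The dd runs were of this general shape (a sketch; the block size, count and file paths are illustrative, not the exact values used):

    # create a file of several GB and watch the RAID activity lights, top and iotop while it runs
    dd if=/dev/zero of=/disk/18a/bigfile bs=1M count=8192
    # a file below a couple of GB does not show the bursty half-second on/off pattern
    dd if=/dev/zero of=/disk/18a/smallfile bs=1M count=1024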
Conclusion: the nodelalloc option seems a good idea for sane ext4 behaviour with the present kernel.
Data size | Data shape | Create time ext2 | Create time ext3o | Create time ext4o | Create time ext4o,nodelalloc | Create time xfs |
---|---|---|---|---|---|---|
10GB | 1MiB files, 1000 files per dir, 10 directories | real 0m59.269s user 0m4.571s sys 0m22.389s | real 0m48.386s user 0m5.090s sys 0m34.088s | real 0m36.629s user 0m4.712s sys 0m22.520s | real 0m40.174s user 0m5.083s sys 0m33.351s | real 0m43.529s user 0m4.675s sys 0m23.371s |
100GB | 1MiB files, 1000 files per dir, 100 directories | real 12m55.747s user 0m50.404s sys 3m46.392s | real 9m37.460s user 0m50.430s sys 5m57.278s | real 7m6.369s user 0m50.177s sys 3m46.584s | real 6m55.249s user 0m49.877s sys 5m44.072s | real 8m25.341s user 0m50.537s sys 3m58.345s |
100GB | 10MiB files, 100 files per dir, 100 directories | real 9m42.778s user 0m6.569s sys 2m12.464s | real 8m45.528s user 0m6.467s sys 4m29.624s | real 7m24.878s user 0m6.707s sys 2m27.540s | real 5m38.262s user 0m6.807s sys 4m16.459s | real 11m19.553s user 0m6.797s sys 2m22.753s |
100GB | 100MiB files, 100 files per dir, 10 directories | real 6m59.802s user 0m1.046s sys 2m6.885s | real 7m15.308s user 0m0.930s sys 4m13.310s | real 7m40.650s user 0m0.886s sys 2m13.302s | real 5m30.942s user 0m1.052s sys 4m6.909s | real 10m46.309s user 0m0.970s sys 2m18.211s |
100GB | 1GiB files, 100 files per dir, 1 directory | real 6m1.267s user 0m0.417s sys 2m6.913s | real 7m11.983s user 0m0.451s sys 4m22.091s | real 8m35.436s user 0m0.397s sys 3m45.442s | real 5m35.668s user 0m0.549s sys 4m12.811s | real 5m44.729s user 0m0.437s sys 2m18.844s |
100GB | 5GiB files, 20 files per dir, 1 directory | real 6m50.423s user 0m0.340s sys 2m12.040s | real 7m5.830s user 0m0.354s sys 4m21.068s | real 8m45.591s user 0m0.302s sys 4m5.133s | real 5m19.863s user 0m0.468s sys 3m58.840s | real 5m14.756s user 0m0.442s sys 2m17.564s |
100GB | 10GiB files, 10 files per dir, 1 directory | real 6m32.086s user 0m0.306s sys 2m17.188s | real 7m41.802s user 0m0.332s sys 4m28.373s | real 10m15.886s user 0m0.323s sys 3m25.950s | real 5m22.866s user 0m0.356s sys 4m6.127s | real 5m4.910s user 0m0.364s sys 2m17.101s |

Data size | Data shape | Delete time ext2 | Delete time ext3o | Delete time ext4o | Delete time ext4o,nodelalloc | Delete time xfs |
---|---|---|---|---|---|---|
10GB | 1MiB files, 1000 files per dir, 10 directories | real 0m36.025s user 0m0.006s sys 0m0.368s | real 0m43.659s user 0m0.007s sys 0m0.920s | real 0m0.948s user 0m0.005s sys 0m0.516s | real 0m1.179s user 0m0.003s sys 0m0.566s | real 0m1.671s user 0m0.006s sys 0m0.678s |
100GB | 1MiB files, 1000 files per dir, 100 directories | real 6m0.509s user 0m0.065s sys 0m3.029s | real 7m11.398s user 0m0.075s sys 0m9.115s | real 0m9.333s user 0m0.039s sys 0m5.268s | real 0m12.384s user 0m0.042s sys 0m5.887s | real 0m16.022s user 0m0.053s sys 0m6.838s |
100GB | 10MiB files, 100 files per dir, 100 directories | real 2m7.691s user 0m0.019s sys 0m1.767s | real 2m21.691s user 0m0.023s sys 0m5.009s | real 0m5.263s user 0m0.009s sys 0m2.667s | real 0m6.001s user 0m0.008s sys 0m2.741s | real 0m1.530s user 0m0.004s sys 0m0.575s |
100GB | 100MiB files, 100 files per dir, 10 directories | real 1m30.588s user 0m0.003s sys 0m1.188s | real 1m46.732s user 0m0.005s sys 0m4.186s | real 0m3.588s user 0m0.000s sys 0m2.441s | real 0m3.544s user 0m0.000s sys 0m2.461s | real 0m0.272s user 0m0.001s sys 0m0.057s |
100GB | 1GiB files, 100 files per dir, 1 directory | real 1m30.137s user 0m0.000s sys 0m1.195s | real 1m36.648s user 0m0.001s sys 0m4.298s | real 0m6.742s user 0m0.000s sys 0m2.587s | real 0m4.320s user 0m0.000s sys 0m2.510s | real 0m0.088s user 0m0.001s sys 0m0.007s |
100GB | 5GiB files, 20 files per dir, 1 directory | real 1m29.608s user 0m0.000s sys 0m1.359s | real 1m33.934s user 0m0.000s sys 0m4.268s | real 0m3.429s user 0m0.001s sys 0m2.483s | real 0m3.791s user 0m0.000s sys 0m2.479s | real 0m0.065s user 0m0.000s sys 0m0.001s |
100GB | 10GiB files, 10 files per dir, 1 directory | real 1m28.458s user 0m0.000s sys 0m1.160s | real 1m34.002s user 0m0.000s sys 0m4.169s | real 0m3.792s user 0m0.001s sys 0m2.470s | real 0m3.462s user 0m0.001s sys 0m2.469s | real 0m0.068s user 0m0.001s sys 0m0.004s |

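The timings above are output from the time command. A minimal sketch of the kind of create and delete operations involved, assuming files written with dd from /dev/zero and removed with rm -rf (the actual test script is not reproduced in this note), for the 10GB, 1MiB-files case would be:

    # create 10 directories each holding 1000 files of 1 MiB (10GB in total)
    time for d in $(seq 1 10); do
        mkdir -p /disk/18a/dir$d
        for f in $(seq 1 1000); do
            dd if=/dev/zero of=/disk/18a/dir$d/file$f bs=1M count=1 2>/dev/null
        done
    done
    # the delete timing is for removing the same tree
    time rm -rf /disk/18a/dir*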
As this RAID has two SAS channel interfaces (although just the one controller) and can be configured to split the firmware partitions/LUNs between those two channels, and the server also has two SAS channel interfaces, it was useful to check whether that gave any performance gain in the simple situation where the RAID unit is configured as one 12-disk RAIDset and is attached to just one server.
For this test, Iozone version 3-347 was used. All filesystems were formatted with ext4 and mounted with the nodelalloc option. The read-ahead blockdev setting was 4096 sectors for all filesystem devices. Only Iozone tests 0 (write/rewrite) and 1 (read/reread) were performed (see man iozone). Results columns are in kiBytes/second.

    /dev/sdb on /disk/18a type ext4 (rw,nodelalloc)
    /dev/sdc on /disk/18c type ext4 (rw,nodelalloc)
    /dev/sdd on /disk/18b type ext4 (rw,nodelalloc)
    /dev/sde on /disk/18d type ext4 (rw,nodelalloc)
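The Iozone throughput runs would have been of roughly this form (a sketch; the per-process file size, record size and file names are illustrative assumptions rather than the recorded command lines):

    # tests 0 (write/rewrite) and 1 (read/reread) in throughput mode with 4 processes,
    # one target file per file-system so that each stream goes to a different LUN
    iozone -i 0 -i 1 -t 4 -s 8g -r 128k \
        -F /disk/18a/ioz1 /disk/18b/ioz2 /disk/18c/ioz3 /disk/18d/ioz4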
The conclusion is no: there is no performance gain from using two channels to access one RAIDset via one controller on this RAID. This isn't unexpected. So I can use the second channel interface on the server for other purposes, like back-up (the RAID itself doesn't have a SAS-out interface for daisy chaining). It's interesting to note that the total throughput is better when the multiple Iozone streams are directed at different file-systems (LUNs) than when they are all directed at the same file-system (one LUN).
The third block of results in the table below (1 channel for 1 LUN) shows a completely reconfigured scenario where the RAID is set up as one LUN of 4.5TB, and a GPT partition table has been put in place to carve the area into 4 software partitions:

    /dev/sdb1 on /disk/18a type ext4 (rw,nodelalloc)
    /dev/sdb2 on /disk/18b type ext4 (rw,nodelalloc)
    /dev/sdb3 on /disk/18c type ext4 (rw,nodelalloc)
    /dev/sdb4 on /disk/18d type ext4 (rw,nodelalloc)
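Creating the GPT partitions might have been done along these lines (a sketch only; the actual tool used and the partition boundaries are not recorded in this note, so the equal-sized split shown is an assumption):

    # label the 4.5TB LUN with GPT and carve it into four partitions
    parted -s /dev/sdb mklabel gpt
    parted -s /dev/sdb mkpart primary 0% 25%
    parted -s /dev/sdb mkpart primary 25% 50%
    parted -s /dev/sdb mkpart primary 50% 75%
    parted -s /dev/sdb mkpart primary 75% 100%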
That scenario shows decreased performance compared with the 1-channel, 4-LUN scenario. So our production configuration will be 1 channel and multiple LUNs, with no software partitioning of the LUNs.
Type of Iozone test | Write | Rewrite | Read | Reread |
---|---|---|---|---|
1 channel for the 4 LUNs: filesystems a b c d all on one channel | ||||
Throughput test with 1 process: a | 346479 | 332649 | 450167 | 457191 |
Throughput test with 2 processes: a+b | 384678 | 369588 | 454230 | 459465 |
Throughput test with 3 processes: a+b+c | 384599 | 372825 | 458146 | 457339 |
Throughput test with 4 processes: a+b+c+d | 376672 | 368269 | 471071 | 473646 |
Throughput test with 1 process: a | 326613 | 339005 | 450204 | 456079 |
Throughput test with 2 processes: a+a | 316312 | 310243 | 371587 | 366667 |
Throughput test with 3 processes: a+a+a | 298118 | 297610 | 344950 | 340613 |
Throughput test with 4 processes: a+a+a+a | 294010 | 293234 | 328245 | 321795 |
2 channels for the 4 LUNs: filesystems a&c on ch0, b&d on ch1 | ||||
Throughput test with 1 process: a | 350565 | 306995 | 452540 | 458903 |
Throughput test with 2 processes: a+b | 377580 | 368764 | 458391 | 461924 |
Throughput test with 3 processes: a+b+c | 389046 | 377313 | 454927 | 456743 |
Throughput test with 4 processes: a+b+c+d | 375546 | 364314 | 474236 | 471858 |
Throughput test with 1 process: a | 350777 | 328803 | 446399 | 454587 |
Throughput test with 2 processes: a+a | 318038 | 310567 | 362857 | 362713 |
Throughput test with 3 processes: a+a+a | 297536 | 298133 | 339188 | 336537 |
Throughput test with 4 processes: a+a+a+a | 295239 | 294594 | 322317 | 319261 |
1 channel for 1 LUN: filesystems a b c d on GPT partitions | ||||
Throughput test with 1 process: a | 317951 | 316096 | 415092 | 427690 |
Throughput test with 2 processes: a+b | 302721 | 316027 | 348647 | 347296 |
Throughput test with 3 processes: a+b+c | 299956 | 311405 | 334372 | 335956 |
Throughput test with 4 processes: a+b+c+d | 300421 | 309448 | 305303 | 311809 |
Throughput test with 1 process: a | 329863 | 309818 | 415362 | 424972 |
Throughput test with 2 processes: a+a | 296115 | 291133 | 344669 | 342420 |
Throughput test with 3 processes: a+a+a | 276889 | 276264 | 320414 | 318283 |
Throughput test with 4 processes: a+a+a+a | 274836 | 273876 | 302671 | 303608 |
For comparison, non-RAID internal disk ST3500320NS ra=4096 | ||||
Throughput test with 1 process: a | 84651 | 79899 | 96938 | 96935 |
Throughput test with 2 processes: a+a | 65270 | 64597 | 80271 | 80250 |
Throughput test with 3 processes: a+a+a | 58296 | 57075 | 71239 | 71088 |
Throughput test with 4 processes: a+a+a+a | 50836 | 49668 | 61477 | 61379 |
Throughput test with 1 process: a (repeat) | 90356 | 79738 | 99168 | 99199 |
Throughput test with 1 process: a (repeat) | 81832 | 81616 | 96937 | 96958 |
Throughput test with 1 process: a (ra=16384) | 82035 | 81164 | 96938 | 96975 |
L.S.Lowe