Performance tests for ext4 and xfs on an Infortrend RAID

Author: L.S.Lowe. File: raidperf11. Original version: 20081209. This update: 20090731. Part of Guide to the Local System.

These are some performance tests on an Infortrend EonStor RAID system, attached via an LSI22320RB-F SCSI HBA card, also known as the LSI22320-R. The Infortrend RAID is a 24-disk box arranged as two RAID-6 arrays of 12 disks each, each disk 1 TB, so each file-system will be 10 TB. The PowerEdge server's operating system is currently Fedora 11 (64-bit), but the box is dual-bootable to the previous Fedora 10 as well. The RAID has a 1 GB cache and the server has 2 GB RAM and one Intel 5160 dual-core processor. The RAID has a chunk-size of 128 kBytes (the factory default, and optimum on it for sequential access).

Also see this Contents list for earlier and later tests.

The tests show that ext4 is a big improvement over ext3 for the speed of mkfs, fsck and file deletion. Read performance is excellent when the read-ahead setting is chosen appropriately. Write performance is very good, and can be excellent if write-barriers can safely be turned off** in a particular sort of environment.

Timings for ext4 mkfs on a 10 TB file-system

As our average file size is likely to be large, well in excess of 65536 bytes, it was sensible to ask for a bytes-per-inode ratio of 65536 in place of the default of 16384, keeping the default inode size of 256 bytes. (In previous systems, these defaults were 8192 and 128 respectively.) This reduces the inode space overhead to 256/65536 = 0.4%. For stride: our RAID system uses a default chunk-size of 128 kBytes, which is 32 ext4 blocks of 4 kB each. For stripe-width: each RAID-6 set has 12 disks, giving 10 data disks, so the stripe-width is 10 times the stride value. The tune2fs command sets the system reserved area to 400 MB (which should be enough!) and turns off the regular automatic full fsck check, so occasional full fsck checks will have to be done by hand at a convenient time. The following commands were used (Fedora 10 system):
# time mkfs -t ext4 -E stride=32,stripe-width=320 -i 65536 devicename
real 4m42.252s
user 0m2.885s
sys 0m30.641s

### For a similar mkfs but with -i 131072, mkfs took real 3m38s, sys 0m18s
### For a similar mkfs but with -t ext3, mkfs took real 27m10s, sys 0m33s

# tune2fs -c 0 -i 0 -r 102400 devicename
# mount ...
# df -T /disk/11a
Filesystem    Type   1K-blocks      Used Available Use% Mounted on
/dev/sdb      ext4   9726161912    171668 9725580644   1% /disk/11a
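
As a summary of the arithmetic above, here is a sketch of the whole sequence with the stride and stripe-width derived from the RAID geometry. This is an illustration rather than the exact script used here: the device name /dev/sdb and the variable names are placeholders to be substituted for your own array.

    # Illustrative only: derive the mkfs -E values from the RAID geometry.
    dev=/dev/sdb            # assumed device name; substitute your own
    chunk_kb=128            # RAID chunk-size in kB (factory default here)
    ndisks=12               # disks per RAID-6 set
    ndata=$((ndisks - 2))   # RAID-6 loses two disks to parity -> 10 data disks
    blocksize_kb=4          # ext4 block size

    stride=$((chunk_kb / blocksize_kb))   # 128/4 = 32 blocks
    swidth=$((stride * ndata))            # 32*10 = 320 blocks

    time mkfs -t ext4 -E stride=$stride,stripe-width=$swidth -i 65536 $dev

    # 400 MB reserved area (102400 x 4 kB blocks); disable the periodic full
    # checks, so remember to run a full fsck -f by hand at a convenient time.
    tune2fs -c 0 -i 0 -r 102400 $dev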

Timings for ext4 fsck recovery on a 10 TB file-system

# With 2 million large files put on the filesystem ....

# df -TH /disk/11a
Filesystem Type Size Used Avail Use% Mounted on
/dev/sdb ext4 10T 5.1T 5.0T 51% /disk/11a

# Pulled out the server power plug, and then restarted the system ....

# time fsck devicename
fsck 1.41.3 (12-Oct-2008)
e2fsck 1.41.3 (12-Oct-2008)
11a: recovering journal
11a: clean, 2003101/152578048 files, 1244696169/2441239040 blocks
real 0m1.251s
user 0m0.763s
sys 0m0.401s

# Now force a full fsck consistency check on this file-system

# time fsck -f devicename
fsck 1.41.3 (12-Oct-2008)
e2fsck 1.41.3 (12-Oct-2008)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
11a: 2003101/152578048 files (19.8% non-contiguous), 1244696177/2441239040 blocks
real 4m45.834s
user 2m42.751s
sys 0m2.272s
# On moving to a Fedora 11 64-bit system, a further normal fsck check was done,
# and a full fsck consistency check was also forced. Both timings follow.
# Note that there are now 6M files in place of 2M, and the file-system is 25% full, not 50%.
# time fsck.ext4 devicename
e2fsck 1.41.4 (27-Jan-2009)
11a: clean, 6193839/152578048 files, 629022036/2441239040 blocks
real    0m0.854s
user    0m0.654s
sys     0m0.155s
# time fsck.ext4 -f devicename
e2fsck 1.41.4 (27-Jan-2009)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
11a: 6193839/152578048 files (0.2% non-contiguous), 629022036/2441239040 blocks
real 12m27s

Timings for simple data creation and deletion on a 10 TB file-system

Files were created using: dd if=/dev/zero of=$fn count=nn bs=131072, where nn = 8 or more depending on the file size. [Ideally the dd should also use the conv=fsync option, to ensure that buffer-flush overhead is included in the timings, but since the total size being written exceeds RAM by a very large margin, as here, this is not important.] Files were deleted, after measures to ensure buffers were flushed out of the server and RAID caches, using: rm -Rf dirname. Timings were taken with the bash time built-in command.
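
For concreteness, a driver loop along the following lines produces one of the data shapes below (here 1 MiB files, 1000 files per directory, 10 directories, i.e. the 10GB case). This is only a sketch of the method, not the actual script used; the /disk/11a/test path and loop counts are illustrative.

    # create 10 dirs x 1000 files x 1 MiB = 10 GB  (1 MiB = 8 x 128 kB blocks)
    time for d in $(seq 1 10); do
        mkdir -p /disk/11a/test/dir$d
        for f in $(seq 1 1000); do
            dd if=/dev/zero of=/disk/11a/test/dir$d/file$f bs=131072 count=8 2>/dev/null
        done
    done

    # ... other creation steps interleaved here, to push this data out of the
    # server and RAID caches before timing the delete (see the note below) ...

    time rm -Rf /disk/11a/test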

In the tables below, timings labelled SL5 are from an SL5 / RHEL5 system and timings labelled F10 are from a Fedora 10 system. It's the same hardware, with a dual boot.

Create times for 10GB: 1MiB files, 1000 files/dir, 10 dirs
  ext2               SL5: real 0m48.904s   user 0m2.571s    sys 0m13.138s
                     F10: real 0m43.675s   user 0m2.302s    sys 0m15.944s
  ext3o              SL5: real 0m57.437s   user 0m2.730s    sys 0m21.569s
                     F10: real 0m55.972s   user 0m2.343s    sys 0m27.640s
  ext4o (barrier=1)  F10: real 0m59.810s   user 0m2.484s    sys 0m19.168s
  ext4o barrier=0    F10: real 0m57.615s   user 0m2.491s    sys 0m26.440s
  xfs (barrier)      SL5: real 3m35.545s   user 0m2.751s    sys 0m13.102s
                     F10: real 1m56.742s   user 0m2.313s    sys 0m16.775s
  xfs nobarrier      F10: real 0m43.322s   user 0m2.582s    sys 0m13.094s

Create times for 100GB: 1MiB files, 1000 files/dir, 100 dirs
  ext2               SL5: real 8m29.126s   user 0m26.711s   sys 2m4.953s
                     F10: real 7m38.323s   user 0m23.194s   sys 2m43.536s
  ext3o              SL5: real 12m10.949s  user 0m28.111s   sys 3m51.506s
                     F10: real 12m35.604s  user 0m24.770s   sys 4m48.326s
  ext4o (barrier=1)  F10: real 10m33.521s  user 0m26.472s   sys 4m3.919s
  ext4o barrier=0    F10: real 7m42.901s   user 0m28.684s   sys 5m19.282s
  xfs (barrier)      SL5: real 40m33.414s  user 0m28.206s   sys 2m16.300s
                     F10: real 21m33.492s  user 0m24.103s   sys 2m52.960s
  xfs nobarrier      SL5: real 7m29.459s   user 0m26.882s   sys 2m14.342s
                     F10: real 7m25.802s   user 0m24.389s   sys 2m54.731s

Create times for 100GB: 10MiB files, 1000 files/dir, 10 dirs
  ext2               F10: real 6m53.810s   user 0m3.071s    sys 1m37.432s
  ext3o              SL5: real 8m47.557s   user 0m4.284s    sys 2m54.904s
                     F10: real 9m20.885s   user 0m3.594s    sys 3m53.846s
  ext4o (barrier=1)  F10: real 11m16.134s  user 0m3.922s    sys 2m38.688s
  ext4o barrier=0    F10: real 7m49.848s   user 0m3.989s    sys 2m53.951s
  xfs (barrier)      SL5: real 16m19.207s  user 0m4.142s    sys 1m50.147s
                     F10: real 13m56.126s  user 0m3.636s    sys 2m16.774s
  xfs nobarrier      F10: real 7m11.916s   user 0m3.646s    sys 2m16.462s

Create times for 100GB: 10GiB files, 10 files/dir, 1 dir
  ext2               SL5: real 7m17.119s   user 0m0.377s    sys 1m19.629s
                     F10: real 7m22.659s   user 0m0.330s    sys 1m50.320s
  ext3o              SL5: real 8m36.844s   user 0m0.368s    sys 3m5.284s
                     F10: real 11m11.748s  user 0m0.354s    sys 4m2.122s
  ext4o (barrier=1)  F10: real 10m38.975s  user 0m0.318s    sys 2m27.480s
  ext4o barrier=0    F10: real 7m19.614s   user 0m0.325s    sys 2m29.694s
  xfs (barrier)      SL5: real 8m0.995s    user 0m0.423s    sys 1m54.819s
                     F10: real 8m44.645s   user 0m0.361s    sys 2m21.015s
  xfs nobarrier      SL5: real 7m17.117s   user 0m0.403s    sys 1m50.620s
                     F10: real 7m19.851s   user 0m0.339s    sys 2m20.550s


Delete timings: it was important to ensure that all buffers were flushed from the server cache and the RAID cache before starting a delete; otherwise deletes can be spuriously fast. This was done by interleaving other creation steps of the 480GB of data between the creation and the deletion steps for a particular test.

Delete times for 10GB: 1MiB files, 1000 files/dir, 10 dirs
  ext2               SL5: real 0m11.577s   user 0m0.001s    sys 0m0.396s
                     F10: real 0m56.190s   user 0m0.005s    sys 0m0.344s
  ext3o              SL5: real 0m1.033s    user 0m0.005s    sys 0m0.891s
                     F10: real 1m23.686s   user 0m0.013s    sys 0m1.012s
  ext4o (barrier=1)  F10: real 0m0.739s    user 0m0.004s    sys 0m0.471s
  ext4o barrier=0    F10: real 0m0.868s    user 0m0.004s    sys 0m0.490s
  xfs (barrier)      SL5: real 0m29.862s   user 0m0.003s    sys 0m1.199s
                     F10: real 0m5.900s    user 0m0.006s    sys 0m0.441s
  xfs nobarrier      SL5: real 0m6.108s    user 0m0.007s    sys 0m1.304s
                     F10: real 0m1.020s    user 0m0.005s    sys 0m0.478s

Delete times for 100GB: 1MiB files, 1000 files/dir, 100 dirs
  ext2               SL5: real 9m27.612s   user 0m0.055s    sys 0m3.111s
                     F10: real 17m4.312s   user 0m0.074s    sys 0m13.102s
  ext3o              F10: real 13m50.109s  user 0m0.063s    sys 0m9.857s
  ext4o (barrier=1)  F10: real 0m15.999s   user 0m0.032s    sys 0m6.055s
  ext4o barrier=0    F10: real 0m12.935s   user 0m0.037s    sys 0m4.995s
  xfs (barrier)      SL5: real 4m25.052s   user 0m0.054s    sys 0m8.903s
                     F10: real 1m0.754s    user 0m0.049s    sys 0m4.449s
  xfs nobarrier      SL5: real 1m5.323s    user 0m0.062s    sys 0m9.784s
                     F10: real 0m8.438s    user 0m0.041s    sys 0m4.698s

Delete times for 100GB: 10MiB files, 1000 files/dir, 10 dirs
  ext2               SL5: real 3m24.135s   user 0m0.005s    sys 0m1.423s
                     F10: real 2m56.238s   user 0m0.010s    sys 0m1.302s
  ext3o              SL5: real 4m35.323s   user 0m0.002s    sys 0m5.805s
                     F10: real 4m57.621s   user 0m0.010s    sys 0m5.528s
  ext4o (barrier=1)  F10: real 0m10.063s   user 0m0.004s    sys 0m3.218s
  ext4o barrier=0    F10: real 0m12.270s   user 0m0.002s    sys 0m3.548s
  xfs (barrier)      SL5: real 0m26.857s   user 0m0.004s    sys 0m1.239s
                     F10: real 0m6.023s    user 0m0.007s    sys 0m0.432s
  xfs nobarrier      F10: real 0m0.896s    user 0m0.007s    sys 0m0.478s

Delete times for 100GB: 10GiB files, 10 files/dir, 1 dir
  ext2               SL5: real 2m21.418s   user 0m0.000s    sys 0m1.329s
                     F10: real 2m39.210s   user 0m0.000s    sys 0m1.357s
  ext3o              SL5: real 1m35.205s   user 0m0.000s    sys 0m5.204s
                     F10: real 2m22.824s   user 0m0.000s    sys 0m5.376s
  ext4o (barrier=1)  F10: real 0m3.667s    user 0m0.000s    sys 0m2.563s
  ext4o barrier=0    F10: real 0m3.344s    user 0m0.000s    sys 0m2.203s
  xfs (barrier)      SL5: real 0m0.660s    user 0m0.001s    sys 0m0.434s
                     F10: real 0m0.001s    user 0m0.000s    sys 0m0.001s
  xfs nobarrier      SL5: real 0m0.457s    user 0m0.000s    sys 0m0.436s
                     F10: real 0m0.001s    user 0m0.000s    sys 0m0.001s


Read-ahead settings

The read-ahead setting was set by the command:
         blockdev --setra $rab /dev/$dev
where $rab is the read-ahead buffer size in 512-byte sectors; in a 2.6 kernel this is equivalent to doing:
         echo $rabkb > /sys/block/$dev/queue/read_ahead_kb 
where $rabkb is the read-ahead buffer size in kBytes. The system default is 128 kBytes, which is generally far too small for big files.
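
For example, to apply the 2 MiB read-ahead used in the bonnie++ tests below (4096 sectors of 512 bytes = 2048 kB), either of the following would do; the device name sdb is the one that appears earlier on this page and is otherwise an assumption:

    blockdev --setra 4096 /dev/sdb                  # 4096 x 512-byte sectors = 2 MiB
    echo 2048 > /sys/block/sdb/queue/read_ahead_kb  # same thing via sysfs
    blockdev --getra /dev/sdb                       # check the current setting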

Timings for bonnie++ benchmark

Bonnie runs were done for a variety of read-ahead buffer values and of read/write datablock sizes. These are just the results for a read-ahead buffer setting of 4096 sectors (2 MiB), which was optimum for all file-systems, and a read/write datablock size of 32 kB; varying the latter had little effect. In all cases the data size was 8 GB, sufficiently bigger than the RAM and RAID-cache sizes of 2 GB and 1 GB respectively. Only results for the default (ordered) mount option are shown for ext3 and ext4. Presumably ext3 in this kernel version (2.6.27.5-117.fc10.i686) does not support barriers, as there was no real difference between the values for barrier=0 and barrier=1, so only one is shown. This bonnie++ test shows that sequential read performance does not vary much between file-systems when the read-ahead buffer is large, and that write performance is improved if barriers can safely be turned off**. It does not reveal the poor file-deletion performance of ext3 seen in the previous section, probably because the create/delete bonnie step used the default file size of zero. It does show the same poor xfs (barrier) file-creation performance as the previous sections.

bonnie++ step being performed            Units    ext2           ext3o          ext4o (barrier)  ext4o barrier=0  xfs (barrier)  xfs nobarrier
Block Output 8GB:32kB (8GB data, 32kB)   kB/sec   244750         232150         167028           234804           197991         240840
Block Input 8GB:32kB, 2MiB read-ahead    kB/sec   252522         250428         255534           257082           258847         259223
Random seeks                             /sec     389            391            502              514              275            277
Create Sequ,Random                       /sec     29986, 30054   23894, 31727   28018, 28885     28751, 29802     1701, 1688     6475, 6420
Delete Sequ,Random                       /sec     184274, 69779  86227, 14194   68956, 6053      71655, 13578     1752, 1311     19987, 7799


Bonnie CSV values:


ext2----,8G:32k,46572,98,244750,21,95767,12,79417,96,252522,17,389.2,1,
256/256,29986,99,+++++,+++,184274,99,30054,97,430370,99,69779,99
ext3obar,8G:32k,46221,97,232856,50,89583,16,80230,97,250051,17,387.1,1,
256/256,23898,71,444656,100,85154,85,31575,93,427814,100,14294,20
ext3onob,8G:32k,47033,99,232150,49,89235,15,78734,96,250428,16,391.2,1,
256/256,23894,71,446827,99,86227,86,31727,93,430843,99,14194,20
ext4obar,8G:32k,49695,96,167028,20,83297,13,79158,97,255534,17,502.1,1,
256/256,28018,92,423243,99,68956,88,28885,93,406118,100,6053,9
ext4onob,8G:32k,50596,99,234804,29,94345,14,82687,99,257082,18,514.4,1,
256/256,28751,93,420693,99,71655,91,29802,96,406985,99,13578,22
xfs--bar,8G:32k,41380,87,197991,23,91445,23,82362,99,258847,18,275.6,0,
256/256,1701,17,510076,99,1752,5,1688,17,415676,99,1311,4
xfs--nob,8G:32k,47390,99,240840,28,90546,22,82266,99,259223,18,277.0,1,
256/256,6475,61,503892,100,19987,60,6420,61,414236,99,7799,26

Read/write performance over nfs to ext4 and effect of tuning

See this nfs performance page.

Oddities

- When using the SL5 system, several times when under heavy load on an XFS file-system, and also once when making an ext3 file-system, the driver came up with errors of the following form, which required a reboot before the SCSI bus was accessible again. This was never repeated on the Fedora 10 system, so it is perhaps attributable to the particular SL5 version of the mpt driver:
    kernel: mptscsih: ioc1: attempting task abort! (sc=c86a4e40)
    kernel: sd 1:0:11:0:
    kernel:         command: Read(10): 28 00 00 64 00 00 00 00 08 00
    kernel: mptscsih: ioc1: task abort: SUCCESS (sc=c86a4e40)
- On Fedora 10, when using ext2, while creating the 1MiB-file, 1000-files/dir, 100-dir data set and while copying directories using cp -a, and on one occasion when simply creating a single low-level directory on an empty file-system, the operations failed with an I/O error, and there were corresponding messages in /var/log/messages:
    grow_buffers: requested out-of-range block 18446744071757758592 for device sdb
    grow_buffers: requested out-of-range block 18446744071787708803 for device sdb
This may be a problem with kernel 2.6.27.5-117.fc10.i686 specifically. But it is also not clear to me that ext2 actually supports file-systems bigger than 8 TB, so that might have been the problem. Nobody would want to use ext2 in production on such a big file-system anyway: it is included only for performance-comparison tests such as these!

- Mounts for the additional RAID file-systems added to /etc/fstab after install did not work at boot time. The message was fsck.ext4: Unable to resolve LABEL=xx, and the system dropped into file-system repair mode. In that repair mode the file-system is perfectly accessible, oddly. Similar messages appear for an ext3 file-system, or if the device file name is used instead of a LABEL, and probably also if a UUID is used. This is Fedora 10 at kernel level 2.6.27.7-134.fc10.i686. A circumvention is to put noauto in the fourth (options) field of the /etc/fstab entry and then mount it from /etc/rc.d/rc.local instead: that works without problems. An example of that circumvention is sketched below.
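
The sketch uses the label 11a and the mount point /disk/11a from earlier on this page; adjust to taste.

    # /etc/fstab: noauto in the fourth (options) field avoids the failing boot-time fsck/mount
    LABEL=11a   /disk/11a   ext4   defaults,noauto   0 0

    # /etc/rc.d/rc.local: mount it once the system is up
    mount /disk/11a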

** Useful Links


L.S.Lowe