Performance tests for ext4 and xfs on an Infortrend RAID

Author: L.S.Lowe. File: raidperf11. Original version: 20081209. This update: 20090731. Part of Guide to the Local System.

These are some performance tests on a Infortrend EonStor RAID system, attached via a LSI22320RB-F scsi HBA card, also known as LSI22320-R. The Infortrend RAID is a 24-disk box arranged as two RAID-6 arrays of 12 disks each, each disk 1 TB. So each file-system will be 10 TB. The PowerEdge-server operating system is currently Fedora 11 (64-bit), but the box is dual-bootable to the previous Fedora 10 as well. The RAID has a 1GB cache and the server has 2GB RAM and one Intel 5160 dual-core processor. The RAID has a chunk-size of 128 kBytes (which is the factory default and is optimum on it for sequential access).

Also see this Contents list for earlier and later tests.

The tests show that ext4 is a big improvement over ext3 for speed of mkfs and fsck and file deletion. Read performance is excellent when read-ahead settings are set appropriately. Write performance is very good but can be excellent if write-barriers can be safely turned-off** in a particular sort of environment.

Timings for ext4 mkfs on a 10 TB file-system

As our average filesize is likely to be large, well in excess of 65536, it was sensible to ask for a bytes per inode of 65536 in place of the default of 16384, with the default inode size of 256 bytes. (In previous systems, these defaults were 8192 and 128 respectively). This reduces the inode space overhead to 256/65536 = 0.4%. For stride: our RAID system uses a default chunk-size of 128 kBytes, which is 32 ext4-blocks (4 kB). For stripe-width, each RAID-6 set has 12 disks, giving 10 data disks, so the stripe-width is 10 times the stride value. The tune2fs command tunes the system reserved area to 400 MB (should be enough!) and turns off regular full fsck checking, so occasional full fsck checks will have to be done by hand, at a convenient time. The following commands were used (Fedora 10 system):

mkfs -t ext4 -E stride=32,stripe-width=320 -i 65536 devicename
real    4m42.252s
user    0m2.885s
sys     0m30.641s

### For a similar mkfs but with -i 131072, mkfs took real 3m38s, sys 0m18s
### For a similar mkfs but with -t ext3, mkfs took real 27m10s, sys 0m33s 

# tune2fs -c 0 -i 0 -r 102400 devicename
# mount ...
# df -T /disk/11a
Filesystem    Type   1K-blocks      Used Available Use% Mounted on
/dev/sdb      ext4   9726161912    171668 9725580644   1% /disk/11a

Timings for ext4 fsck recovery on a 10 TB file-system

# With 2 million large files put on the filesystem ....

# df -TH /disk/11a
Filesystem    Type     Size   Used  Avail Use% Mounted on
/dev/sdb      ext4      10T   5.1T   5.0T  51% /disk/11a

# Pulled-out server power plug, and then restarted the system ....

# time fsck devicename
fsck 1.41.3 (12-Oct-2008)
e2fsck 1.41.3 (12-Oct-2008)
11a: recovering journal
11a: clean, 2003101/152578048 files, 1244696169/2441239040 blocks
real    0m1.251s
user    0m0.763s
sys     0m0.401s

# Now force a full fsck consistency check on this file-system

# time fsck -f devicename
fsck 1.41.3 (12-Oct-2008)
e2fsck 1.41.3 (12-Oct-2008)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
11a: 2003101/152578048 files (19.8% non-contiguous), 1244696177/2441239040 blocks
real    4m45.834s
user    2m42.751s
sys     0m2.272s

# On moving to a Fedora 11 64-bit system,  do a further normal fsck check.
# Also forced a full fsck consistency check. Both timings follow.
# Note that there are now 6M files in place of 2M, and it's 25% full not 50%.

# time fsck.ext4 devicename
e2fsck 1.41.4 (27-Jan-2009)
11a: clean, 6193839/152578048 files, 629022036/2441239040 blocks
real    0m0.854s
user    0m0.654s
sys     0m0.155s

# fsck.ext4 -f devicename
e2fsck 1.41.4 (27-Jan-2009)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
11a: 6193839/152578048 files (0.2% non-contiguous), 629022036/2441239040 blocks
real    12m27s

Timings for simple data creation and deletion on a 10 TB file-system

Files were created using: dd if=/dev/zero of=$fn count=nn bs=131072, where nn = 8 or more depends on the file size. [Ideally, the dd should also use the conv=fsync option, to ensure buffer flush overhead is including in timings, but if the total size being written exceeds RAM by a very large margin, as here, this is not important.] Files were deleted, after meaures to ensure buffers were flushed out of server and RAID caches, using: rm -Rf dirname. Timings were done by the bash time built-in command.

Data in red are from a SL5 / RHEL5 system. Data in black are from a Fedora 10 system. It's the same hardware, with a dual boot.

Data size	Data shape	Create time ext2	Create time ext3o	Create time ext4o (barrier=1)	Create time ext4o barrier=0	Create time xfs (barrier)	Create time xfs nobarrier
10GB	1MiB file, 1000 files/dir, 10 dir	real 0m48.904s user 0m2.571s sys 0m13.138s real 0m43.675s user 0m2.302s sys 0m15.944s	real 0m57.437s user 0m2.730s sys 0m21.569s real 0m55.972s user 0m2.343s sys 0m27.640s	real 0m59.810s user 0m2.484s sys 0m19.168s	real 0m57.615s user 0m2.491s sys 0m26.440s	real 3m35.545s user 0m2.751s sys 0m13.102s real 1m56.742s user 0m2.313s sys 0m16.775s	real 0m43.322s user 0m2.582s sys 0m13.094s
100GB	1MiB file, 1000 files/dir, 100 dirs	real 8m29.126s user 0m26.711s sys 2m4.953s real 7m38.323s user 0m23.194s sys 2m43.536s	real 12m10.949s user 0m28.111s sys 3m51.506s real 12m35.604s user 0m24.770s sys 4m48.326s	real 10m33.521s user 0m26.472s sys 4m3.919s	real 7m42.901s user 0m28.684s sys 5m19.282s	real 40m33.414s user 0m28.206s sys 2m16.300s real 21m33.492s user 0m24.103s sys 2m52.960s	real 7m29.459s user 0m26.882s sys 2m14.342s real 7m25.802s user 0m24.389s sys 2m54.731s
100GB	10MiB file, 1000 files/dir, 10 dirs	real 6m53.810s user 0m3.071s sys 1m37.432s	real 8m47.557s user 0m4.284s sys 2m54.904s real 9m20.885s user 0m3.594s sys 3m53.846s	real 11m16.134s user 0m3.922s sys 2m38.688s	real 7m49.848s user 0m3.989s sys 2m53.951s	real 16m19.207s user 0m4.142s sys 1m50.147s real 13m56.126s user 0m3.636s sys 2m16.774s	real 7m11.916s user 0m3.646s sys 2m16.462s
100GB	10GiB file, 10 files/dir, 1 dir	real 7m17.119s user 0m0.377s sys 1m19.629s real 7m22.659s user 0m0.330s sys 1m50.320s	real 8m36.844s user 0m0.368s sys 3m5.284s real 11m11.748s user 0m0.354s sys 4m2.122s	real 10m38.975s user 0m0.318s sys 2m27.480s	real 7m19.614s user 0m0.325s sys 2m29.694s	real 8m0.995s user 0m0.423s sys 1m54.819s real 8m44.645s user 0m0.361s sys 2m21.015s	real 7m17.117s user 0m0.403s sys 1m50.620s real 7m19.851s user 0m0.339s sys 2m20.550s

Delete timings: it was important to ensure that all buffers were flushed from the server cache and RAID cache before starting a delete: otherwise deletes can be spuriously fast. This was done by interleaving other creation steps of the 480GB data between creation and deletion for a particular step.

Data size	Data shape	Delete time ext2	Delete time ext3o	Delete time ext4o (barrier=1)	Delete time ext4o barrier=0	Delete time xfs (barrier)	Delete time xfs nobarrier
10GB	1MiB file, 1000 files/dir, 10 dir	real 0m11.577s user 0m0.001s sys 0m0.396s real 0m56.190s user 0m0.005s sys 0m0.344s	real 0m1.033s user 0m0.005s sys 0m0.891s real 1m23.686s user 0m0.013s sys 0m1.012s	real 0m0.739s user 0m0.004s sys 0m0.471s	real 0m0.868s user 0m0.004s sys 0m0.490s	real 0m29.862s user 0m0.003s sys 0m1.199s real 0m5.900s user 0m0.006s sys 0m0.441s	real 0m6.108s user 0m0.007s sys 0m1.304s real 0m1.020s user 0m0.005s sys 0m0.478s
100GB	1MiB file, 1000 files/dir, 100 dirs	real 9m27.612s user 0m0.055s sys 0m3.111s	real 17m4.312s user 0m0.074s sys 0m13.102s real 13m50.109s user 0m0.063s sys 0m9.857s	real 0m15.999s user 0m0.032s sys 0m6.055s	real 0m12.935s user 0m0.037s sys 0m4.995s	real 4m25.052s user 0m0.054s sys 0m8.903s real 1m0.754s user 0m0.049s sys 0m4.449s	real 1m5.323s user 0m0.062s sys 0m9.784s real 0m8.438s user 0m0.041s sys 0m4.698s
100GB	10MiB file, 1000 files/dir, 10 dirs	real 3m24.135s user 0m0.005s sys 0m1.423s real 2m56.238s user 0m0.010s sys 0m1.302s	real 4m35.323s user 0m0.002s sys 0m5.805s real 4m57.621s user 0m0.010s sys 0m5.528s	real 0m10.063s user 0m0.004s sys 0m3.218s	real 0m12.270s user 0m0.002s sys 0m3.548s	real 0m26.857s user 0m0.004s sys 0m1.239s real 0m6.023s user 0m0.007s sys 0m0.432s	real 0m0.896s user 0m0.007s sys 0m0.478s
100GB	10GiB file, 10 files/dir, 1 dir	real 2m21.418s user 0m0.000s sys 0m1.329s real 2m39.210s user 0m0.000s sys 0m1.357s	real 1m35.205s user 0m0.000s sys 0m5.204s real 2m22.824s user 0m0.000s sys 0m5.376s	real 0m3.667s user 0m0.000s sys 0m2.563s	real 0m3.344s user 0m0.000s sys 0m2.203s	real 0m0.660s user 0m0.001s sys 0m0.434s real 0m0.001s user 0m0.000s sys 0m0.001s	real 0m0.457s user 0m0.000s sys 0m0.436s real 0m0.001s user 0m0.000s sys 0m0.001s

Read-ahead settings

The read-ahead setting was set by the command:

         blockdev --setra $rab /dev/$dev

where $rab is the read-ahead buffer size in 512-byte sectors; this was equivalent in 2.6 kernel to doing:

         echo $rabkb > /sys/block/$dev/queue/read_ahead_kb

where $rabkb is the read-ahead buffer size in kBytes. The system default is 128 kBytes, which is generally far too small for big files.

Timings for bonnie++ benchmark

Bonnie runs were done for a variety of read-ahead buffer values and of read/write datablock sizes. These are just the results for read-ahead buffer setting of 4096 sectors (2 MiB), which was optimum for all file-systems, and a read/write datablock size of 32 kB, varying which had not much effect. In all cases, the data size was 8GB, sufficiently bigger than the RAM and raid-cache sizes of 2GB and 1GB respectively. Only results for the default (ordered) mount option are shown for ext3 and ext4. Presumably ext3 in this kernel version 2.6.27.5-117.fc10.i686 does not support barrier, as there was no real difference between the values for options barrier=0,1 so only one is shown. This bonnie++ test show that sequential read preformance does not vary much between file-systems when read-ahead buffer is large, and that write performance is improved if barriers can be safely turned-off**. It does not reveal the poor file-deletion performance of ext3 in the previous section, probably because for the create/delete bonnie step, I used the default filesize of zero. It shows the same poor xfs (barrier) file creation performance of the previous sections.

bonnie++ step being performed	Units	ext2	ext3o	ext4o (barrier)	ext4o barrier=0	xfs (barrier)	xfs nobarrier
Block Output 8GB:32kB (8GB data, 32kB blks)	kB/sec	244750	232150	167028	234804	197991	240840
Block Input 8GB:32kB, 2MiB read-ahead	kB/sec	252522	250428	255534	257082	258847	259223
Random seeks	/sec	389	391	502	514	275	277
Create Sequ,Random	/sec	29986,30054	23894,31727	28018,28885	28751,29802	1701,1688	6475,6420
Delete Sequ,Random	/sec	184274,69779	86227,14194	68956,6053	71655,13578	1752,1311	19987,7799

Bonnie CSV values:


ext2----,8G:32k,46572,98,244750,21,95767,12,79417,96,252522,17,389.2,1,
               256/256,29986,99,+++++,+++,184274,99,30054,97,430370,99,69779,99
ext3obar,8G:32k,46221,97,232856,50,89583,16,80230,97,250051,17,387.1,1,
               256/256,23898,71,444656,100,85154,85,31575,93,427814,100,14294,20
ext3onob,8G:32k,47033,99,232150,49,89235,15,78734,96,250428,16,391.2,1,
               256/256,23894,71,446827,99,86227,86,31727,93,430843,99,14194,20
ext4obar,8G:32k,49695,96,167028,20,83297,13,79158,97,255534,17,502.1,1,
               256/256,28018,92,423243,99,68956,88,28885,93,406118,100,6053,9
ext4onob,8G:32k,50596,99,234804,29,94345,14,82687,99,257082,18,514.4,1,
               256/256,28751,93,420693,99,71655,91,29802,96,406985,99,13578,22
xfs--bar,8G:32k,41380,87,197991,23,91445,23,82362,99,258847,18,275.6,0,
               256/256,1701,17,510076,99,1752,5,1688,17,415676,99,1311,4
xfs--nob,8G:32k,47390,99,240840,28,90546,22,82266,99,259223,18,277.0,1,
               256/256,6475,61,503892,100,19987,60,6420,61,414236,99,7799,26

Read/write performance over nfs to ext4 and effect of tuning

See this nfs performance page.

Oddities

- When using SL5 system, several times when under heavy load for an XFS file-system, and also once when making an ext3 file-system, the driver came up with errors of the following form, which required a reboot before the scsi was accessible again. This was never repeated for the Fedora 10 system, and so maybe is attributable to the particular SL5 version of the mpt driver:

    kernel: mptscsih: ioc1: attempting task abort! (sc=c86a4e40)
    kernel: sd 1:0:11:0:
    kernel:         command: Read(10): 28 00 00 64 00 00 00 00 08 00  
    kernel: mptscsih: ioc1: task abort: SUCCESS (sc=c86a4e40)

- On Fedora 10, when using ext2, when creating the 1MB*1000dirs*100dirs files and when copying directories using cp -a, and on one occasion when simply creating a single low level directory on an empty filesystem, the operations failed with I/O error and there were corresponding messages in /var/log/messages:

   grow_buffers: requested out-of-range block 18446744071757758592 for device sdb
   grow_buffers: requested out-of-range block 18446744071787708803 for device sdb

This may be a problem with kernel 2.6.27.5-117.fc10.i686 specifically. But also, it's not clear to me that ext2 actually supports file-systems bigger than 8TB, so that might have been the problem. Nobody would want to use ext2 in production mode on such a big file-system anyway: only for performance comparison tests such as these!

- Mounts for the additional RAID file-systems added to /etc/fstab after install did not work at boot time. The message was fsck.ext4 : Unable to resolve LABEL=xx, and the system drops into file-system repair mode. In that repair mode, the file-system is perfectly accessible, oddly. Similar messages appear for an ext3 file-system or if the device file name is used instead of a LABEL, and probably also if UUID is used. This is Fedora 10 at kernel level 2.6.27.7-134.fc10.i686. A circumvention is to put noauto in the /etc/fstab entry fourth field and then mount it in /etc/rc.d/rc.local instead: that works without problems.

** Useful Links

Write barriers: http://lwn.net/Articles/283161/
ext4 in kernel: http://mirror.cict.fr/kernel-linux/scm/linux/kernel/git/rdunlap/linux-docs/Documentation/filesystems/ext4.txt

L.S.Lowe