File-systems, the LSI 22320-R card, and the so-called 2 TB limit

Author: L.S.Lowe. File: raid2tblimit. This update: 20070422. Part of Guide to the Local System.

This is a summary of what I needed to do to get an LSI22320RB-F card, also known as LSI22320-R, working with a file-system larger than 2 terabytes.

I have an LSI22320RB-F card (as it says on the box) fitted in a Dell PowerEdge 1950 and attached to a transtec RAID (a rebadged Infortrend EonStor). The operating system is Linux, one of the Red Hat distros (details below).

The SCSI driver is mpt (mptbase, mptscsi, etc), as identified automatically by the installation process.
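
A quick way to confirm which driver is in use is to look at the loaded modules and kernel messages; this is a sketch only, and the exact module names vary between kernel versions:
         lsmod | grep -i mpt        # typically mptspi, mptscsih, mptbase on 2.6 kernels
         dmesg | grep -i fusion     # the driver announces itself as "Fusion MPT ..."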

The RAID has around 15 TB of space altogether: 24 disks of 750 GB each, in a single RAID-6 configuration. I wanted this in two logical drives of around 7.5 TB each.
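
As a rough sanity check of that figure (bash arithmetic; RAID-6 leaves the capacity of 22 of the 24 disks, and "marketing" gigabytes shrink when expressed in TiB; the exact usable size depends on the controller):
         echo $(( 22 * 750 * 10**9 / 1024**4 ))    # prints 15, i.e. about 15 TiB usable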

Hardware and driver limits

When I configured the RAID to have logical drives bigger than 2 TiB, I got I/O errors when putting a GPT partition table on them using parted:
SCSI error : <0 0 10 1> return code = 0xb0000
end_request: I/O error, dev sdb, sector 14646149112
I was using parted and a GPT partition table because the DOS-style partition tables used by fdisk don't support LBA offsets or sizes greater than 2^32 512-byte sectors: in other words 2 TiB. But even parted was giving me problems.
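
For reference, the arithmetic behind that ceiling, and the kind of parted invocation involved, look like this (a sketch only, reusing the /dev/sdb device name from the errors above; parted's scripted -s mode is assumed to be available):
         # MS-DOS partition tables hold 32-bit sector numbers, at 512 bytes per sector:
         echo $(( 2**32 * 512 / 1024**4 ))     # prints 2, i.e. the 2 TiB ceiling
         # GPT labels use 64-bit LBAs, hence parted rather than fdisk:
         parted -s /dev/sdb mklabel gpt        # wipes any existing partition table!
         parted -s /dev/sdb print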

There were also problems with other utilities; for example, with dd:

         # u = 2^32: the first 512-byte LBA beyond the 2 TiB boundary
         u=$((256*256*256*256))
         dd if=/dev/zero of=/dev/sdb bs=512 count=1 seek=$u
         dd: writing `/dev/sdb': Input/output error
whereas the following, one sector lower (LBA 2^32 - 1), was OK:
         dd if=/dev/zero of=/dev/sdb bs=512 count=1 seek=$((u-1))
So writes were failing right at the 2 TiB boundary.

I checked with LSI technical support over the web, and they gave me some advice: change the LSI card firmware from the IME version (which has mirroring support) to the IT version (which has no mirroring support; I didn't need mirroring). This turned out to be ineffective: no change in symptoms. I reverted to the original IME firmware (which I had saved).

Other advice on the web was to use a 64-bit version of Linux. So after my initial system, 32-bit Scientific Linux 4 (a RHEL4 clone), I tried 64-bit Fedora 7 Test 2. That didn't fix the problem either. I finally settled on 32-bit CentOS 5 (a RHEL5 clone).

The actual fix turned out to be simpler than any of those suggestions. The trick was very simple: ensure that the SCSI data rate was 320 MBytes/second, that is 160 MHz with the 16-bit wide path, as configured in the Infortrend RAID configuration TUI. The RAID had been supplied with a default rate of 80 MHz. Changing it to 160 MHz was sufficient for some part of my setup (probably the Linux MPT driver, but possibly the LSI firmware or the RAID firmware, I don't know which) to operate in a mode that supports very large SCSI block addressing. Switching back and forth between the two rates (with a reboot each time) confirmed this behaviour. Thanks to the support team at Transtec for suggesting this idea.
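
After a change like that, one quick way to check that the full logical-drive size is visible to the host is shown below (a sketch, assuming util-linux's blockdev and the sg3_utils package are installed; READ CAPACITY(16) is what devices with more than 2^32 blocks need in order to report their size correctly):
         blockdev --getsize64 /dev/sdb      # size in bytes as seen by the block layer
         sg_readcap -16 /dev/sdb            # capacity reported via READ CAPACITY(16)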

File systems and limits

The RAID logical drives were then formatted using mkfs -t ext3. In the end I used the whole of each logical drive (e.g. /dev/sdb) rather than putting a parted GPT partition table on it and using (say) /dev/sdb1, because I wanted just one file-system per logical drive anyway.
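
For the record, the steps were essentially the following (a sketch; the volume label and mount point are illustrative names, not the ones actually used here):
         mkfs -t ext3 -L data1 /dev/sdb     # one ext3 file-system on the whole device
         mkdir -p /data1
         mount -t ext3 /dev/sdb /data1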

For performance tests, see this performance page.

According to the old Red Hat Enterprise Linux (RHEL) limits page as of 12 March 2007, ext3 in RHEL4 can handle 8 TB file sizes and 8 TB file-system sizes. That limit is no doubt due to 32-bit signed block numbers with a 4 kB block-size. Later ext3 versions use unsigned 32-bit block numbers and so can handle 16 TB file-system sizes with the same 4 kB block-size: for example, the utilities in e2fsprogs use unsigned integers as of version 1.39 (July 2006), which is in Fedora 6+ and RHEL5, and so (quote) can support filesystems between 2**31 and 2**32 blocks. According to the RHEL comparison chart, as of 20 April 2007, ext3 in RHEL5 supports file sizes of 2 TB (sic!), and file-system sizes of 8 TB (certified) and 16 TB (theoretical). Hopefully the theoretical will change to certified shortly.
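
The arithmetic behind those file-system size figures is straightforward (bash arithmetic):
         echo $(( 2**31 * 4096 / 1024**4 ))    # prints 8:  signed 32-bit block numbers, 4 kB blocks
         echo $(( 2**32 * 4096 / 1024**4 ))    # prints 16: unsigned 32-bit block numbers, 4 kB blocks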

(There is a lot of misleading or out-of-date information on the web on this topic - see top of page for the date of this document).

L.S.Lowe