“Linux has a wealth of filesystems to choose from, but we are facing a number of challenges with scaling to the large storage subsystems that are becoming common in today’s data centers. Filesystems need to scale in their ability to address and manage large storage, and also in their ability to detect, repair and correct errors in the data stored on disk.”
— Linux kernel documentation
Let’s see where we are and where we’re going with Linux file systems!
UNIX-family file systems share a common heritage and common features: i-nodes, blocks, block groups, the superblock, and so on. UFS, the UNIX File System, had been superseded by Berkeley's FFS, the Fast File System, by the early 1990s when Linux appeared. FFS or a similar derivative was used by BSD, SunOS, Solaris, and the other major implementations.
Linux started with an FFS derivative suitable for small systems. The first Linux file system in common use for a significant time was Ext2 or the Second Extended File System. Yes, the original Linux file system had been extended, and Ext2 was a further extension. This was the early 1990s and Linux was seen at the time by its creator as a hobbyist system, so supporting disks larger than 512 MB was an extension.
Ext2 was the predominant Linux file system until the early 2000s. Then Ext3 took over. Ext3 extended Ext2 by adding a journal.
I know your data is valuable and you're worried about losing it, but it usually isn't the data itself that goes missing. The most common loss of information happens when the file system loses track of the file itself. The data is still in the data blocks, and the i-node points to the data blocks and describes almost all of the metadata. But that's only almost all of the metadata: the i-node does not store the file name or the directory where the file is located.
Updates to a file system to change file and directory contents require many separate write operations per change. A power failure or system crash in the middle of this sequence will leave the file system data structure in an invalid intermediate state. The i-node and data blocks may remain valid but unreferenced. The orphaned file might linger indefinitely, or the system may have gotten far enough to mark the i-node and data blocks as free and available to be overwritten.
The journal is a special area where the kernel maintains a “to-do list” of the needed change steps. The needed file system manipulations are listed in the journal as a sequence of steps that can be safely repeated during subsequent recovery. The formal term is that file system changes are now atomic in the classic Greek scientific sense — they are not divisible. Either the needed change did not get into the journal, or else the entire change was there and whatever remains at this recovery can be safely done. It might be needlessly repeating steps that were done at an earlier recovery (when the power dropped out again!), but the remaining sequence is safe to repeat.
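The journaling idea above can be sketched in a few lines of code. This is a toy model, not ext3's on-disk format: the class and record names are invented for illustration, and a Python dictionary stands in for the file system state. The point it demonstrates is that "set this field to this value" entries are idempotent, so replaying the journal a second time (after yet another crash during recovery) is harmless.

```python
# Toy sketch of write-ahead journaling (hypothetical names, not the real
# ext3 structures): record the full intended change in the journal first,
# then apply it, so that replay after a crash is always safe.

class Journal:
    def __init__(self):
        self.entries = []          # stands in for the on-disk journal area

    def log(self, key, value):
        # Step 1: the complete change reaches the journal before the
        # "real" file system structures are touched.
        self.entries.append((key, value))

    def replay(self, state):
        # Recovery: redo every journaled step. Because each entry means
        # "set key to value", repeating it leaves the same final state.
        for key, value in self.entries:
            state[key] = value
        return state

journal = Journal()
journal.log("inode_12", "allocated")
journal.log("dir_entry", "notes.txt -> inode_12")

state = journal.replay({})         # recovery after a crash
state = journal.replay(state)      # a second replay changes nothing
```

Real journaling file systems also mark where a complete transaction ends, so a half-written journal entry (the "change did not get into the journal" case) is simply ignored at recovery.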
By default, Ext3 uses the journal for metadata only. There is still a chance of losing or corrupting some file content when power fails, so you can mount Ext3 (and Ext4) with the data=journal option to journal file contents as well. However, that forces all data to be written twice, first to the journal and then to the file system, so performance will suffer. Carefully consider this risk and its tradeoffs. Most organizations are best served by journaling metadata only while maintaining a good backup and recovery system.
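If you do decide full data journaling is worth the performance cost, it can be set persistently in /etc/fstab. The device and mount point below are placeholders; substitute your own.

```
# /etc/fstab entry mounting an Ext4 volume with full data journaling
# (/dev/sdb1 and /srv/data are illustrative placeholders):
/dev/sdb1   /srv/data   ext4   data=journal   0  2
```

The same option works as a one-off with mount -o data=journal. Note that the data journaling mode cannot be changed on a remount of an already-mounted file system.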
The next generation, Ext4, continued the evolution. It brings some significant advances in scalability, performance, and reliability.
As for scaling, an Ext4 file system can be up to 1 exbibyte (2^60 bytes, or 1,024 PiB, or 1,048,576 TiB) in size, holding up to 4 billion files of up to 16 TiB each.
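Those unit conversions are easy to sanity-check with a little arithmetic on powers of two. The figures below are the ones cited in the text; the file-count calculation at the end is just an illustration of the scale involved.

```python
# Sanity-check the Ext4 size figures: 1 EiB = 2**60 bytes.
TIB = 2 ** 40            # tebibyte
PIB = 2 ** 50            # pebibyte
EIB = 2 ** 60            # exbibyte

assert EIB == 1_024 * PIB        # 1 EiB is 1,024 PiB...
assert EIB == 1_048_576 * TIB    # ...or 1,048,576 TiB

max_file = 16 * TIB              # per-file limit cited for Ext4
print(EIB // max_file)           # prints 65536: that many maximal-size
                                 # files would fill a 1 EiB volume
```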
For performance, the old data block mapping scheme used through Ext3 has been replaced with extents, reducing fragmentation and improving I/O performance with large files. Large databases and some media-streaming and high-performance computing applications can also benefit from Ext4's support for persistent pre-allocation.
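Pre-allocation is reachable from ordinary application code through the POSIX fallocate interface, which Python exposes as os.posix_fallocate (Linux-specific; the temporary file here is just for demonstration). Reserving the space up front lets an extent-based file system like Ext4 hand the file large contiguous extents instead of growing it piecemeal.

```python
import os
import tempfile

# Reserve 64 MiB for a file before writing any data to it.
# On Ext4 this asks the file system to allocate the extents now,
# so later writes land in contiguous, pre-reserved space.
RESERVE = 64 * 1024 * 1024

with tempfile.NamedTemporaryFile(delete=False) as f:
    os.posix_fallocate(f.fileno(), 0, RESERVE)   # offset 0, length 64 MiB
    size = os.fstat(f.fileno()).st_size          # file size reflects the reservation

os.unlink(f.name)                                # clean up the demo file
```

A database or media recorder would keep the file and stream data into the reserved region; the allocation survives across crashes because it is ordinary file system metadata.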
Reliability is improved by including checksums in the journal, so a partially written journal entry can be detected and ignored at the next recovery. This has the beneficial side effect of slightly improving performance. The e2fsck file system consistency check also runs much faster on Ext4, because unallocated groups of data blocks and i-nodes are labeled as such and can be skipped.
In most operating systems you don't choose the file system type. Well, there has been Windows and the "VFAT versus NTFS" choice, which basically comes down to backward compatibility versus scale plus security. But in Linux there are choices between entire families. This has been just about the Ext2 to Ext3 to Ext4 evolution. Ext4 is quite capable, but there are more choices to consider. Come back next time for an alternative designed from the beginning for fast I/O on very large file systems.
Your selection of file system sets some limits on file system size and performance, but there are choices you can make about how your kernel drives the hardware and handles the file system. Check out Learning Tree’s Linux optimization and troubleshooting course for suggestions on how to tune performance while maintaining reliability.
Keep in touch with Learning Tree’s new courses for an upcoming 1-day course on Red Hat Enterprise Linux Migration, coming in the second half of 2015. It will cover RHEL 5, 6, and 7 and how to migrate to new versions.