
Common threads: Advanced filesystem implementor's guide, Part 7


Introducing ext3
Level: Introductory
Daniel Robbins (drobbins@gentoo.org), President/CEO, Gentoo Technologies, Inc.
01 Nov 2001
With the 2.4 release of Linux come a host of new filesystem possibilities, including
Reiserfs, XFS, GFS, and others. These filesystems sound cool, but what exactly can
they do, what are they good at, and exactly how do you go about safely using them in a
production Linux environment? Daniel Robbins answers these questions by showing
you how to set up these new advanced filesystems under Linux 2.4. In this installment,
Daniel takes a look at ext3, a new improved version of ext2 with journaling capabilities.

In the past few installments, we've taken a bit of a detour by looking at non-traditional filesystems
such as tmpfs and devfs. Now, it's time to get back to disk-based filesystems, and we do this by
taking a look at ext3. The ext3 filesystem, designed by Dr. Stephen Tweedie, is built on the
framework of the existing ext2 filesystem; in fact, ext3 is very similar to ext2 except for one small
(but important) difference -- it supports journaling. Yet even with this small addition, I think you'll
find that ext3 has several surprising and intriguing capabilities. In this article, I'll give you a
good understanding of how ext3 compares to the other journaling filesystems currently available. In
my next article, we'll get ext3 up and running.
Understanding Ext3
So, how does ext3 compare to ReiserFS? In previous articles, I explained how ReiserFS is well
suited to handling small files (under 4K), and in certain situations, ReiserFS' small file performance
is ten to fifteen times greater than that of ext2 and ext3. However, while ReiserFS has many
strengths, it also has weaknesses. In the current implementation of ReiserFS (version 3.6), certain
file access patterns can actually result in significantly worse performance than ext2 and ext3,
particularly when reading large mail directories. Also, ReiserFS doesn't have a good track record of
NFS compatibility and has poor sparse file performance. In contrast, ext3 is a very well-rounded
filesystem. It's a lot like ext2; it's not going to give you the blazingly fast small-file performance
that ReiserFS gives you, but it's not going to give you any unexpected performance or functionality
hiccups either.
One of the nice things about ext3 is that because it is based on the ext2 code, ext2 and ext3's on-disk
format is identical; this means that a cleanly unmounted ext3 filesystem can be remounted as an
ext2 filesystem with absolutely no problems. And that's not all. Thanks to the fact that ext2 and ext3
use identical metadata, it's possible to perform in-place ext2 to ext3 filesystem upgrades. Yes, you
read that right. By upgrading a few key system utilities, installing a modern 2.4 kernel and typing in
a single tune2fs command per filesystem, you can convert your existing ext2 servers into
journaling ext3 systems. You can even do this while your ext2 filesystems are mounted. The
transition is safe, reversible, and incredibly easy, and unlike a conversion to XFS, JFS, or ReiserFS,
you don't need to back up and recreate your filesystems from scratch. Now, for a moment, consider
the thousands of production ext2 servers in existence that are just minutes away from an ext3
upgrade; then, you'll have a good grasp of ext3's importance to the Linux community.
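To make the in-place upgrade concrete, here is a minimal sketch that performs the conversion on a scratch image file rather than a real partition. It assumes e2fsprogs (mke2fs, tune2fs, dumpe2fs) is installed; the image path is a throwaway name invented for this example, and the script simply skips the demonstration if the tools are missing.

```shell
#!/bin/sh
# Sketch: convert a scratch ext2 image to ext3 in place.
# Assumes e2fsprogs is installed; the image path is made up for this example.
PATH="$PATH:/sbin:/usr/sbin"
img="${TMPDIR:-/tmp}/ext2-demo.$$.img"

if command -v mke2fs >/dev/null 2>&1 &&
   command -v tune2fs >/dev/null 2>&1 &&
   command -v dumpe2fs >/dev/null 2>&1; then
    dd if=/dev/zero of="$img" bs=1024 count=16384 2>/dev/null  # 16 MB scratch file
    mke2fs -q -F -t ext2 "$img"       # plain ext2, no journal yet
    tune2fs -j "$img" >/dev/null      # add a journal: the filesystem is now ext3
    features=$(dumpe2fs -h "$img" 2>/dev/null | grep -i '^Filesystem features')
    case "$features" in
        *has_journal*) result="has_journal" ;;
        *)             result="no_journal"  ;;
    esac
    rm -f "$img"
else
    result="skipped"                  # e2fsprogs not available here
fi
echo "conversion result: $result"
```

On a real system the same single step would be tune2fs -j run against the partition's device node, followed by changing the filesystem type for that entry in /etc/fstab from ext2 to ext3.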
If I had to describe ext3 in one word, I'd call it "comfortable". It's incredibly easy to ext3-enable an
existing ext2 system, and after you do, you're not going to run into any unexpected performance
quirks. And there's yet another way that ext3 excels in the comfort department; ext3 happens to be
one of the most reliable journaled filesystems available for Linux, as I explain below.

Ext3 reliability
In addition to being ext2-compatible, ext3 inherits other benefits by sharing ext2's metadata format.
For one, ext3 users gain access to a rock-solid fsck tool. You'll recall that one of the points of using
a journaling filesystem is to avoid the need for an exhaustive fsck in the first place; however if you
do end up getting corrupt metadata, either from a flaky kernel, bad hard drive, or something else,
you'll greatly appreciate the fact that ext3 inherits ext2's fsck. In contrast, ReiserFS' fsck is in its
infancy, and fixing flaky metadata when it does show up can be a difficult and dangerous process.
Metadata-only journaling
Interestingly, ext3 handles journaling very differently than ReiserFS and other journaling
filesystems do. With ReiserFS, XFS, and JFS, the filesystem driver journals metadata, but makes
no provisions for journaling data. With metadata-only journaling, your filesystem metadata is going
to be rock solid, and you will probably never need to perform an exhaustive fsck. However,
unexpected reboots and system lock-ups can result in significant corruption of recently-modified
data. Ext3 uses a couple of innovative solutions to avoid these problems, which we'll look at in a
bit.
But first, it's important to understand exactly how metadata-only journaling could end up biting
you. As an example, let's say that you were modifying a file called /tmp/myfile.txt when the
machine unexpectedly locked up, forcing a reboot. If you were using a metadata-only journaling
filesystem such as ReiserFS, XFS or JFS, your filesystem metadata would be easily repaired, thanks
to the metadata journal, and you wouldn't need to sit through a laborious fsck.
However, there's the distinct possibility that when you load /tmp/myfile.txt into a text editor, your
file will not simply be missing recent changes, but will contain a good amount of garbage and may
even be completely unreadable. This isn't something that will always happen, but it could happen
and often does.
Here's why. Typical journaled filesystems like ReiserFS, XFS, and JFS take extra special care of
metadata, but don't pay too much attention to data. In our above example, the filesystem driver was
in the process of modifying several filesystem blocks. The filesystem driver updated the appropriate
metadata, but didn't have time to flush the data from its caches to the new blocks on disk. Thus,
when you loaded up /tmp/myfile.txt into a text editor, part or all of the file contained garbage --
blocks of data that didn't get initialized in time before the system locked up.

The ext3 approach


Now that we have a good general understanding of this problem, let's look at how ext3 implements
journaling. In ext3, the journaling code uses a special API called the Journaling Block Device layer,
or JBD. The JBD has been designed for the express purpose of implementing a journal on any kind
of block device. Ext3 implements its journaling by "hooking in" to the JBD API. For example, the
ext3 filesystem code will inform the JBD of modifications it is performing, and will also request
permission from the JBD before modifying certain data on disk. By doing so, the JBD is given the
appropriate opportunities to manage the journal on behalf of the ext3 filesystem driver. It's quite a
nice arrangement, and because the JBD is being developed as a separate, generic entity, it could be
used to add journaling capabilities to other filesystems in the future.
Here are a couple of neat things about the JBD-managed ext3 journal. For one, ext3's journal is
stored in an inode -- a file, basically. Depending on how you "ext3-enable" your filesystem, you
may or may not be able to see this file, located at /.journal. Of course, by storing the journal in an
inode, ext3 is able to add the needed journal to the filesystem without requiring incompatible
extensions to the ext2 metadata. This is one of the key ways that an ext3 filesystem maintains
backwards compatibility with ext2 metadata, and in turn, the ext2 filesystem driver.
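One way to see that the journal is recorded as part of the ordinary filesystem metadata is to dump the superblock of a journaled filesystem and look at its journal-related fields. The sketch below does this on a scratch image; it assumes e2fsprogs is installed, the image path is invented for the example, and the exact set of lines printed may differ between e2fsprogs versions.

```shell
#!/bin/sh
# Sketch: create a small journaled image and list its journal-related
# superblock fields. Assumes e2fsprogs; the image path is made up.
PATH="$PATH:/sbin:/usr/sbin"
img="${TMPDIR:-/tmp}/ext3-journal-demo.$$.img"

if command -v mke2fs >/dev/null 2>&1 && command -v dumpe2fs >/dev/null 2>&1; then
    dd if=/dev/zero of="$img" bs=1024 count=16384 2>/dev/null  # 16 MB scratch file
    mke2fs -q -F -j "$img"            # -j creates the filesystem with a journal
    # Keep only the superblock lines that mention the journal
    journal_info=$(dumpe2fs -h "$img" 2>/dev/null | grep -i 'journal')
    rm -f "$img"
else
    journal_info="skipped: e2fsprogs not found"
fi
echo "$journal_info"
```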

Different journaling approaches


Not surprisingly, it turns out that there are a number of ways to implement a journal. For example, a
filesystem developer could design a journal that stores spans of bytes that need to be modified on
the host filesystem. The advantage of this approach is that your journal would be able to store lots
of tiny little modifications to the filesystem in a very efficient way, since it would only record the
individual bytes that need to be modified and nothing more.
JBD takes another, and in some ways better, approach. Rather than recording spans of bytes that
must be changed, JBD stores the complete modified filesystem blocks themselves. The ext3
filesystem driver also uses this approach and stores complete replicas of the modified blocks (either
1K, 2K, or 4K) in memory to track pending IO operations. At first, this may seem a bit wasteful.
After all, complete blocks contain modified data but may also contain unmodified (already on disk)
data as well.
The approach that the JBD uses is called physical journaling, which means that the JBD uses
complete physical blocks as the underlying currency for implementing the journal. In contrast, the
approach of only storing modified spans of bytes rather than complete blocks is called logical
journaling, and is the approach used by XFS. Because ext3 uses physical journaling, an ext3 journal
will have a larger relative on-disk footprint than, say, an XFS journal. But because ext3 uses
complete blocks internally and in the journal, ext3 doesn't deal with as much complexity as it would
if it were to implement logical journaling. In addition, the use of full blocks allows ext3 to perform
some additional optimizations, such as "squishing" multiple pending IO operations within a single
block into the same in-memory data structure. This, in turn, allows ext3 to write these multiple
changes to disk in a single write operation, rather than many. In addition, because the literal block
data is stored in memory, little or no massaging of the in-memory data is required before writing it
to disk, greatly reducing CPU overhead.
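The "squishing" idea can be illustrated with a little arithmetic. The following toy sketch maps a handful of byte-level changes to the blocks that contain them and then de-duplicates the list; the 4K block size and the byte offsets are made up for the example, and this is only bookkeeping on paper, not how the JBD code is actually written.

```shell
#!/bin/sh
# Toy illustration of block-granular ("physical") journaling bookkeeping.
# Assumptions: 4K blocks; the byte offsets below are invented for the example.
block_size=4096
offsets="10 200 4000 5000 9000 9100"

# Map each modified byte offset to the block that contains it, then
# de-duplicate: several small changes inside one block collapse into a
# single full-block write.
blocks=$(for off in $offsets; do echo $((off / block_size)); done | sort -nu | tr '\n' ' ')
blocks=${blocks% }                    # trim the trailing space

set -- $offsets; n_changes=$#
set -- $blocks;  n_blocks=$#

echo "$n_changes byte-level changes -> $n_blocks block writes (blocks: $blocks)"
```

Here six small modifications touch only three distinct blocks, so three full-block writes cover all of them.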

Ext3, protector of data


And now, we finally get to see how the ext3 filesystem effectively provides both metadata and data
journaling, avoiding the data corruption problem I described earlier in this article. In fact, ext3
actually has two methods to ensure data and metadata integrity.
Originally, ext3 was designed to perform full data and metadata journaling. In this mode (called
"data=journal" mode), the JBD journals all changes to the filesystem, whether they are made to data
or metadata. Because both data and metadata are journaled, JBD can use the journal to bring both
metadata and data back to a consistent state. The drawback of full data journaling is that it can be
slow, although you can reduce the performance penalty by setting up a relatively large journal.
Recently, a new journaling mode has been added to ext3 that provides the benefits of full journaling
but without introducing a severe performance penalty. This new mode works by journaling
metadata only. However, the ext3 filesystem driver keeps track of the particular data blocks that
correspond with each metadata update, grouping them into a single entity called a transaction.
When a transaction is applied to the filesystem proper, the data blocks are written to disk first. Once
they are written, the metadata changes are then written to the journal. By using this technique
(called "data=ordered" mode), ext3 can provide data and metadata consistency, even though only
metadata changes are recorded in the journal. ext3 uses this mode by default.

Conclusion
These days, a lot of people are trying to determine which Linux journaling filesystem is "best". In
truth, there is no one "right" filesystem for every application; each one has its own strengths. This is
one of the benefits of having so many next-generation Linux filesystems from which to choose.
So, instead of picking an arbitrary "best" filesystem and using it for every conceivable application,
it's far preferable to understand each filesystem's strengths and weaknesses so that you can make an
educated decision as to which one to use.
Ext3 has a number of strengths. It has been designed to be extremely easy to deploy. It's based on
the solid ext2 filesystem code and it inherits a great fsck tool. And ext3's journaling capabilities
have been specially designed to ensure the integrity of both metadata and data. All in all, ext3 is a
truly great filesystem, and a worthy successor to the now-venerable ext2 filesystem. Join me in my
next article, when we get ext3 up and running. Until then, you may want to check out the following
resources.

Resources
• Read Daniel's other articles in this series, where he describes:
• the benefits of journaling and ReiserFS (Part 1)
• setting up a ReiserFS system (Part 2)
• using the tmpfs virtual memory filesystem and bind mounts (Part 3)
• the benefits of devfs, the device management filesystem (Part 4)
• beginning the conversion to devfs (Part 5)
• completing the conversion to devfs using an init wrapper (Part 6)

• Read a complete transcript of Dr. Stephen Tweedie's Ext3, Journaling Filesystem
presentation, which was featured at the Ottawa Linux Symposium in July 2000.

• Find out more about using ext3 with 2.4 kernels at Andrew Morton's ext3 for 2.4 page.
Andrew Morton is the man responsible for porting ext3 to the 2.4 kernel, and provided
invaluable assistance in writing this article. If you can't wait until my next article, Andrew
has a very nice ext3 and 2.4 usage page that will show you how to get ext3 up and running
on your system in no time.

• To keep abreast of the latest ext3 developments, be sure to visit the ext3-users mailing list
archive. Of course, you can also subscribe.

• Take Daniel Robbins' free JFS fundamentals tutorial on developerWorks.

• Browse more Linux resources on developerWorks.

Common threads: Advanced filesystem implementor's guide, Part 8
Surprises in ext3
Level: Introductory
Daniel Robbins (drobbins@gentoo.org), President/CEO, Gentoo Technologies, Inc.
01 Dec 2001
With the 2.4 release of Linux come many new filesystem possibilities, including
Reiserfs, XFS, GFS, and others. These filesystems sound cool, but what exactly can
they do, what are they good at, and exactly how do you go about safely using them in a
production Linux environment? Daniel Robbins answers these questions by showing
you how to set up these new advanced filesystems under Linux 2.4. In this installment,
Daniel continues his look at ext3, a new improved version of ext2 with journaling
capabilities. He reveals all the inside information on ext3, and demonstrates some
shockingly good ext3 data=journal interactive performance numbers.

I'm going to be honest. For this article, I was planning to show you how to get ext3 up and running
on your system. Although that's what I said I'd do, I'm not going to do it. Andrew Morton's
excellent "Using the ext3 filesystem in 2.4 kernels" page (see Resources later in this article) already
does a great job of explaining how to ext3-enable your system, so there's no need for me to repeat
all the basics here. Instead, I'm going to delve into some meatier ext3 topics, ones that I think you'll
find very useful. After you read this article, when you're ready to get ext3 up and running, head over
to Andrew's page.
2.4 kernel update
First, let's start with a 2.4 kernel update. I last discussed 2.4 kernel stability when I was covering
ReiserFS. Way back then, finding a stable 2.4 kernel was a challenge, and I recommended sticking
with the known and at that time bleeding-edge 2.4.4-ac9 kernel -- especially for anyone planning to
use the ReiserFS filesystem in a production environment. As you might guess, a lot has happened
since 2.4.4-ac9, and it's definitely time to start looking at newer kernels.
With kernel 2.4.10, the 2.4 series reached a new level of performance and scalability (something
that we've been anticipating for a long time). So, what happened to allow Linux 2.4 to finally grow
up? In an acronym, VM. Linus, recognizing that the 2.4 series wasn't performing spectacularly,
ripped out Linux's problematic VM code and replaced it with a lean and mean VM implementation
from Andrea Arcangeli. Andrea's new VM implementation (which first appeared in 2.4.10) was
really great; it really sped up the kernel and made the entire system more responsive. 2.4.10 was
definitely a major turning point in 2.4 Linux kernel development; up until then, things weren't
looking very good, and many of us were wondering why we weren't FreeBSD developers. We all
should thank Linus for his heroism in making such a major (but sorely needed) change in the 2.4
stable kernel series.
Since Andrea's new VM code needed a bit of time to be integrated seamlessly with the rest of the
kernel, use 2.4.13+. Even better, use 2.4.16+, since the rock-solid ext3 filesystem code was finally
integrated into the official Linus kernel starting with the 2.4.15-pre2 release. There's no reason to
avoid using a 2.4.16+ kernel, and it'll make your job of getting ext3 up and running that much easier.
If you do use a 2.4.16+ kernel, just remember that it's no longer necessary to apply the ext3 patch as
described on Andrew's page (see Resources). Linus already added it for you :)
You'll notice that I recommend using 2.4.16+ rather than 2.4.15+, and with good reason. With the
release of kernel 2.4.15-pre9, a really ugly filesystem corruption bug was introduced to the kernel.
It took until 2.4.16-pre1 for the problem to be identified and fixed, resulting in a span of kernels
(including 2.4.15) that should be avoided at all costs. Choosing a 2.4.16+ kernel allows you to avoid
this bad batch entirely.
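If you want to check a machine against this advice, a rough sketch follows. The helper name kernel_at_least_2_4_16 is invented for this example; it ignores suffixes such as -pre1 or -ac9 and compares only the three numeric fields, so it treats 2.4.16-pre1 (where the fix landed) as acceptable.

```shell
#!/bin/sh
# Sketch: does a kernel version string meet the "2.4.16 or later" advice?
# kernel_at_least_2_4_16 is a hypothetical helper written for this example.
kernel_at_least_2_4_16() {
    ver=${1%%-*}                      # drop any "-pre9" or "-ac9" style suffix
    major=${ver%%.*}; rest=${ver#*.}
    minor=${rest%%.*}; patch=${rest#*.}
    patch=${patch%%.*}
    [ "$major" -gt 2 ] && return 0
    [ "$major" -eq 2 ] && [ "$minor" -gt 4 ] && return 0
    [ "$major" -eq 2 ] && [ "$minor" -eq 4 ] && [ "$patch" -ge 16 ] && return 0
    return 1
}

for v in 2.4.4-ac9 2.4.13 2.4.15 2.4.16; do
    if kernel_at_least_2_4_16 "$v"; then
        echo "$v: ok"
    else
        echo "$v: older than 2.4.16"
    fi
done
```

To test the running kernel, pass the output of uname -r to the function instead of a fixed string.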

Laptops...beware?
Ext3 has a stellar reputation for being a rock-solid filesystem, so I was surprised to learn that quite a
few laptop users were having filesystem corruption problems when they switched to ext3. In
general, it's tempting to react to these kinds of reports by avoiding ext3 entirely; however, after
asking around, I discovered that the disk corruption problems that people were experiencing had
nothing to do with ext3 itself, but were being caused by certain laptop hard drives.
The write cache
You may not know this, but most modern hard drives have something called a "write cache", used
by the hard drive to collect pending write operations. By putting pending writes into a cache, the
hard drive firmware can then reorder and group them so that they're written to disk in the fastest
possible way. The write cache is generally considered to be a very good thing (read Linus'
explanation and opinion of write caching in Resources).
Unfortunately, certain laptop hard drives now on the market have the dubious feature of ignoring
any official ATA request to flush their write cache to disk. This isn't a wonderful design feature,
although it has been allowed by the ATA spec up until recently. With these types of drives, there's
no way for the kernel to guarantee that a particular block has actually been recorded to the disk
platters. Although this sounds like a thorny problem, this particular issue by itself is probably not
the cause of the data corruption problems that people have been experiencing.
However, it gets worse. Some modern laptop hard drives have an even nastier habit of throwing
away their write cache whenever the system is rebooted or suspended. Obviously, if a hard drive
has both of these problems, it's going to regularly corrupt data, and there's nothing that Linux can
do to prevent it from doing so.
So, what's the solution? If you have a laptop, tread carefully. Back up all your important files before
making any major change to your filesystems. If you experience data corruption problems that seem
to fit the pattern of what I described above, particularly with ext3, then remember that it may be
your laptop hard drive that's at fault. In that case, you may want to contact your laptop manufacturer
and inquire about getting a replacement drive. Hopefully, in a few months' time, these flaky hard
drives will be pulled from the market and we'll never need to worry about this issue again.
Now that I've scared you out of your minds, let's take a look at ext3's various data journaling
options.

Journaling options and write latency


Ext3 allows you to choose from one of three data journaling modes at filesystem mount time:
data=writeback, data=ordered, and data=journal.
To specify a journal mode, you can add the appropriate string (data=journal, for example) to
the options section of your /etc/fstab, or specify the -o data=journal command-line option
when calling mount directly. If you'd like to specify the data journaling method used for your root
filesystem (data=ordered is the default), you can use a special kernel boot option called
rootflags. So, if you'd like to put your root filesystem into full data journaling mode, add
rootflags=data=journal to your kernel boot options.
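As a dry-run illustration of these options, the sketch below only prints the mount command and the /etc/fstab line for a chosen mode; it does not mount anything. The device /dev/hda3 and mount point /mnt/data are placeholders, and the function names are invented for this example.

```shell
#!/bin/sh
# Sketch: build (but do not run) the mount command and fstab line for an
# ext3 data journaling mode. Device, mount point, and function names are
# placeholders made up for this example.
valid_mode() {
    case "$1" in
        writeback|ordered|journal) return 0 ;;
        *) return 1 ;;
    esac
}

ext3_mount_cmd() {    # usage: ext3_mount_cmd DEVICE MOUNTPOINT MODE
    valid_mode "$3" || { echo "unknown data mode: $3" >&2; return 1; }
    echo "mount -t ext3 -o data=$3 $1 $2"
}

ext3_fstab_line() {   # usage: ext3_fstab_line DEVICE MOUNTPOINT MODE
    valid_mode "$3" || { echo "unknown data mode: $3" >&2; return 1; }
    # Fields: device, mount point, type, options, dump, fsck pass
    printf '%s\t%s\text3\tdefaults,data=%s\t1 2\n' "$1" "$2" "$3"
}

ext3_mount_cmd  /dev/hda3 /mnt/data journal
ext3_fstab_line /dev/hda3 /mnt/data journal
```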
data=writeback mode
In data=writeback mode, ext3 doesn't do any form of data journaling at all, providing you with
journaling similar to that found in the XFS, JFS, and ReiserFS filesystems (metadata only). As I explained
in my previous article, this could allow recently modified files to become corrupted in the event of
an unexpected reboot. Despite this drawback, data=writeback mode should give you the best
ext3 performance under most conditions.
data=ordered mode
In data=ordered mode, ext3 only officially journals metadata, but it logically groups metadata
and data blocks into a single unit called a transaction. When it's time to write the new metadata out
to disk, the associated data blocks are written first. data=ordered mode effectively solves the
corruption problem found in data=writeback mode and most other journaled filesystems, and
it does so without requiring full data journaling. In general, data=ordered ext3 filesystems
perform slightly slower than data=writeback filesystems, but significantly faster than their full
data journaling counterparts.
When appending data to files, data=ordered mode provides all of the integrity guarantees
offered by ext3's full data journaling mode. However, if part of a file is being overwritten and the
system crashes, it's possible that the region being written will contain a combination of original
blocks interspersed with updated blocks. This is because data=ordered provides no guarantees
as to which blocks are overwritten first, so you can't assume that, just because overwritten block x
was updated, overwritten block x-1 was updated as well. Instead, data=ordered leaves the
write ordering up to the hard drive's write cache. In general, this limitation doesn't end up
negatively impacting people very often, since file appends are generally much more common than
file overwrites. For this reason, data=ordered mode is a good higher-performance replacement
for full data journaling.
data=journal mode
data=journal mode provides full data and metadata journaling. All new data is written to the
journal first, and then to its final location. In the event of a crash, the journal can be replayed,
bringing both data and metadata into a consistent state.
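The difference between the three modes comes down to what is written where, and in what order. The toy model below just prints that ordering as described in the text; write_order is a name invented for this sketch, and the step wording is a simplification rather than a description of what the kernel does in detail.

```shell
#!/bin/sh
# Toy model of the write ordering described for each ext3 data mode.
# It only prints steps; write_order is a hypothetical name for this sketch.
write_order() {
    case "$1" in
        writeback)
            echo "metadata -> journal"
            echo "data -> final location (no ordering guarantee)"
            ;;
        ordered)
            echo "data -> final location"
            echo "metadata -> journal"
            ;;
        journal)
            echo "data + metadata -> journal"
            echo "data + metadata -> final location"
            ;;
        *)
            echo "unknown mode: $1" >&2
            return 1
            ;;
    esac
}

for mode in writeback ordered journal; do
    echo "data=$mode:"
    write_order "$mode" | sed 's/^/  /'
done
```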
Theoretically, data=journal mode is the slowest journaling mode of all, since data gets written
to disk twice rather than once. However, it turns out that in certain situations, data=journal
mode can be blazingly fast. Andrew Morton, after hearing reports on LKML that ext3
data=journal filesystems were giving people unbelievably great interactive filesystem
performance, decided to put together a little test. First, he created a simple shell script designed to
write data to a test filesystem as quickly as possible:

Rapid writing

# Rewrite a 2 GB file of zeros (131072 blocks of 16 KB) in an endless loop
while true
do
    dd if=/dev/zero of=largefile bs=16384 count=131072
done

While data was being written to the test filesystem, he attempted to read 16 MB of data from another
ext2 filesystem on the same disk, timing the results:

Reading a 16 MB file

time cat 16-meg-file > /dev/null

The results were astounding. data=journal mode allowed the 16-meg-file to be read from 9 to
over 13 times faster than other ext3 modes, ReiserFS, and even ext2 (which has no journaling
overhead):
Written-to filesystem     16-meg read time (seconds)
ext2                      78
ReiserFS                  67
ext3 data=ordered         93
ext3 data=writeback       74
ext3 data=journal         7
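The "9 to over 13 times faster" figure can be checked directly from the read times in the table. This short sketch divides each time by the 7-second data=journal result; the hyphenated labels are just shorthand chosen so that each row has two fields.

```shell
#!/bin/sh
# Recompute the speedups from the read times in the table (seconds).
baseline=7    # ext3 data=journal
speedups=$(LC_ALL=C awk -v base="$baseline" '
    { printf "%s %.1f\n", $1, $2 / base }
' <<'EOF'
ext2 78
ReiserFS 67
ext3-ordered 93
ext3-writeback 74
EOF
)
echo "$speedups"
```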
Andrew repeated this test, but tried to read a 16 MB file from the test filesystem (rather than a
different filesystem), and he got identical results. So, what does this mean? Somehow, ext3's
data=journal mode is incredibly well-suited to situations where data needs to be read from and
written to disk at the same time. Therefore, ext3's data=journal mode, which was assumed to
be the slowest of all ext3 modes in nearly all conditions, actually turns out to have a major
performance advantage in busy environments where interactive IO performance needs to be
maximized. Maybe data=journal mode isn't so sluggish after all!
Andrew is still trying to figure out exactly why data=journal mode is doing so much better
than everything else. When he does, he may be able to add the necessary tweaks to the rest of ext3
so that data=writeback and data=ordered modes see some benefit as well.

data=journal tweaks
Some people have had a particular performance problem when using ext3's data=journal mode on
busy servers -- busy NFS servers, in particular. Every thirty seconds, the server experiences a huge
storm of disk-writing activity, causing the system to nearly grind to a halt. If you experience this
problem, it's easy to fix. Simply type the following command as root to tweak Linux's dirty
buffer-flushing algorithm:

Tweaking bdflush

echo 40 0 0 0 60 300 60 0 0 > /proc/sys/vm/bdflush

These new bdflush settings will cause kupdate to run every 0.6 seconds rather than every 5 seconds.
In addition, they tell the kernel to flush a dirty buffer after 3 seconds rather than 30, the default. By
flushing recently-modified data to disk more regularly, these write storms can be avoided. It's
slightly less efficient to do things this way, since the kernel will have fewer opportunities to
combine writes. But for a busy server, writes will happen more consistently, and interactive
performance will be greatly improved.
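To see where the 0.6-second and 3-second figures come from, the sketch below decodes the two relevant fields of the settings string. It assumes that the fifth field is the kupdate interval and the sixth is the dirty-buffer age, both in jiffies, and that HZ is 100 (so 100 jiffies make one second); those assumptions match the numbers quoted above but should be checked against your kernel's documentation.

```shell
#!/bin/sh
# Sketch: decode the two bdflush fields discussed in the text.
# Assumptions: field 5 = kupdate interval, field 6 = dirty-buffer age,
# both in jiffies, with HZ=100 (100 jiffies = 1 second).
hz=100
settings="40 0 0 0 60 300 60 0 0"

set -- $settings
interval_jiffies=$5
age_jiffies=$6

interval_s=$(LC_ALL=C awk -v j="$interval_jiffies" -v hz="$hz" 'BEGIN { printf "%.1f", j / hz }')
age_s=$(LC_ALL=C awk -v j="$age_jiffies" -v hz="$hz" 'BEGIN { printf "%.1f", j / hz }')

echo "kupdate interval: ${interval_s}s, dirty buffer age: ${age_s}s"
```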

Conclusion
We've now concluded our coverage of ext3. Join me in my next article as we explore the many
wonders of...XFS!
