Message-ID: <4AB53F3E.4040007@it-sudparis.eu>
Date: Sat, 19 Sep 2009 22:29:50 +0200
From: jehan procaccia <jehan.procaccia@...sudparis.eu>
To: Theodore Tso <tytso@....edu>
CC: linux-ext4@...r.kernel.org, Eric Sandeen <sandeen@...hat.com>
Subject: Re: howto downgrade ext4 to ext3
Theodore Tso wrote:
> On Fri, Sep 18, 2009 at 11:21:08PM +0200, jehan procaccia wrote:
>
>> I would love to test that option (-o nodelalloc) instead of moving back
>> to ext3.
>> However, I don't understand what it is... Am I taking a risk in terms of
>> data integrity if I set it, or just losing performance?
>> Anyway, I'm not sure it is available; when I search for it in "man mount"
>> I can't find it. Is it an undocumented option?
>>
>
> The mount man page is part of the util-linux package, and so it tends
> to get updated a bit slower than the kernel. The ext4 mount options
> are fully documented in the kernel documentation; so if you install
> the kernel-doc RPM, and look in the Documentation/filesystems/ext4.txt
> you'll get a comprehensive list of ext4 mount options. (Well, as
> comprehensive as we can make it; occasionally we forget to update it,
> but in general we've been pretty good at documenting everything.)
>
> (Checking....)
>
> Ugh, the description for nodelalloc in ext4.txt is pretty horrible;
> it doesn't even parse as a valid English sentence. I don't know how
> that slipped by me (Mingming, Eric; can either of you see if your
> respective companies can snag us a tech writer resource for a day or
> two?), but I'll get that one fixed up.
>
Indeed, there's not much there, and it's not very understandable:
$ less /usr/share/doc/kernel-doc-2.6.18/Documentation/filesystems/ext4.txt
delalloc    (*)   Deferring block allocation until write-out time.
nodelalloc        Disable delayed allocation. Blocks are allocation
                  when data is copied from user to page cache.
> Anyway, delayed allocation is a feature of ext4 which allows us to
> delay allocating blocks until the very last minute --- when the VM
> page writeback routine decides it's time to write dirty pages to disk
> (aka "cleaning pages", or "when the page cleaner runs" --- yeah, OS
> programmers sometimes like to perpetuate some really horrible puns),
> or when a program explicitly forces a file to be written to disk via
> the fsync() system call. This allows the block allocator to make more
> intelligent decisions, which tends to avoid disk fragmentation and
> tends to increase performance. Delayed allocation is one of the
> reasons why simply mounting an unconverted ext2 or ext3 filesystem
> using the ext4 file system driver can result in better performance.
>
>
OK, understood ...
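A rough way to watch this in action, I suppose (assuming filefrag from
e2fsprogs and a kernel with FIEMAP support; the test file path is just an
example):

$ dd if=/dev/zero of=/disk00/testfile bs=1M count=4   # dirty some pages, no sync yet
$ filefrag -v /disk00/testfile    # extents may show as delalloc/unknown: not on disk yet
$ sync                            # force writeback; blocks get allocated now
$ filefrag -v /disk00/testfile    # extents now have real physical block numbers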
> The problem is that in older kernels, we didn't properly account for
> quota. Since we don't attempt to allocate blocks until the page
> cleaner runs, which could potentially be well after the program which
> wrote the file has exited, the out-of-quota error only gets noticed
> when the delayed allocation writepages function is trying to clean up
> dirty pages. This is a "should never happen" situation, and to avoid
> causing the VM to loop forever trying to write pages where the write
> operation would never succeed, the writepages code prints an extremely
> scary message --- and then throws away the user's data.
>
That paragraph is a bit obscure to me... If I understood correctly, you've
described the situation I ran into?
> By using the nodelalloc mount option, ext4 will try to allocate blocks
> while processing each and every write(2) system call. This allows
> quota to be checked right away, and if the user is over quota, the
> write system call will return an error right away. This is less
> efficient in terms of CPU usage, and the block allocator will not be
> able to do as good a job, since it doesn't know how big the file will
> ultimately be when it is doing block-by-block allocation. However, it
> avoids the nasty bug that happens when the user hits an over-quota
> situation in the delalloc writepage function --- and it's no worse
> than what ext3 does.
>
OK, that's where I should go now: mounting with nodelalloc. Lower
performance, but no more "should never happen" situation ;-) .
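A minimal sketch of what I plan to do (device and mount point are from my
setup; whether the option is visible in /proc/mounts may depend on the
kernel build):

$ mount -o remount,nodelalloc /disk00
$ grep disk00 /proc/mounts        # check that nodelalloc took effect

# and to make it permanent, the /etc/fstab line would become something like:
# /dev/mapper/VolGroup...   /disk00   ext4   defaults,nodelalloc   1 2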
> In more modern kernels, we've added quota checking in the write(2)
> system call, so that even though we're not allocating the blocks right
> away, and thus don't know where they will be located on disk, we
> charge them against the user's quota right away. That way the write(2)
> system call can signal the over-quota situation to the user program.
> Unfortunately, these patches aren't present in the version of ext4
> that was backported to RHEL 5.4.
>
>
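If I read this right, on a kernel with those patches (or with nodelalloc),
a write by an over-quota user should now fail immediately with EDQUOT
("Disk quota exceeded") instead of being dropped later at writeback time.
A hypothetical test (user and path are made up):

$ su - someuser -c 'dd if=/dev/zero of=/home/someuser/bigfile bs=1M count=100'
# expected: dd aborts with "Disk quota exceeded" rather than triggering
# mpage_da_map_blocks errors later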
Starting from which kernel version did you add "quota checking in the
write(2) system call"?
In other words, the problem should not arise anymore with a recent kernel,
even while still using delalloc? Would 2.6.30 be OK?
For RHEL, the Fedora project has more recent kernel packages as source
RPMs:
kernel-2.6.29.4-167.fc11.src.rpm or kernel-2.6.31-33.fc12.src.rpm
Perhaps recompiling one of these for RHEL 5.4 could be a workaround instead
of using nodelalloc?
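Roughly something like this, I imagine (a sketch; the Fedora spec file and
build dependencies may well need adjusting for the RHEL 5.4 toolchain, and
the output path depends on the rpmbuild configuration):

$ rpmbuild --rebuild kernel-2.6.31-33.fc12.src.rpm
# then install the resulting kernel binary RPMs from the RPMS/ output directory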
>> But now, how can I check that there's no more problem on that specific
>> partition (/disk00)?
>> When the kernel complains this way, for example:
>> Sep 16 18:06:45 gizeh kernel: mpage_da_map_blocks block allocation
>> failed for inode 39419 at logical offset 0 with max blocks 2 with error
>> -122
>> Sep 16 18:06:45 gizeh kernel: This should not happen.!! Data will be lost
>> I have no indication which partition that inode belongs to. There are so
>> many error messages like this that it won't be easy to tell that none
>> comes from /disk00.
>>
>
> Well, error code 122 is EDQUOT, or "Quota exceeded". So it's very
> likely that this is some other partition. This is a bug; we really
> should print the disk that was involved, and not just the inode number.
> I'll fix that in future kernels (but of course that won't help you for
> RHEL 5.4). What you can do to prove this is to check a quota report,
> and see which users are over quota. You can then check all of your
> ext4 partitions to see which has an inode 39419 which is owned by one
> of your over-quota users, using debugfs:
>
> debugfs -c -R "stat <39419>" /dev/sdXXX
>
>
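For the quota-report step, something like repquota should list the
over-quota users first (a sketch; repquota marks a user over a limit with
a '+' in its output):

$ repquota -a | grep '+'          # rows containing '+' are over a block or file limit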
Good, indeed: I only get -122 errors, and thanks to the search example I
noticed that those errors happen only for apparently over-quota users.
Here's an example:
gizeh kernel: mpage_da_map_blocks block allocation failed for inode
3542694 at logical offset 0 with max blocks 1 with error -122
Message from syslogd@ at Sat Sep 19 21:08:03 2009 ...
[root@...eh ~]
$ debuge4fs -c -R "stat <3542694>" /dev/mapper/VolGroup02S2IA-LVVG02Users07
debuge4fs 1.41.5 (23-Apr-2009)
/dev/mapper/VolGroup02S2IA-LVVG02Users07: catastrophic mode - not
reading inode or group bitmaps
Inode: 3542694 Type: regular Mode: 0644 Flags: 0x80000
Generation: 2336084861 Version: 0x00000000:00000001
User: 42658 Group: 426 Size: 0
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x4ab52c13:81a9f0d4 -- Sat Sep 19 21:08:03 2009
atime: 0x4ab52c13:816ce76c -- Sat Sep 19 21:08:03 2009
mtime: 0x4ab52c13:816ce76c -- Sat Sep 19 21:08:03 2009
crtime: 0x4ab52c13:812fde04 -- Sat Sep 19 21:08:03 2009
Size of extra inode fields: 28
BLOCKS:
[root@...eh ~]
$ getent passwd |grep 42658
karipha:x:42658:426:Karipha BOUMER:/mci/mast2008/karipha:/usr/local/bin/bash
[root@...eh ~]
$ quota -s karipha
Disk quotas for user karipha (uid 42658):
     Filesystem   blocks   quota   limit   grace   files   quota   limit   grace
/dev/mapper/VolGroup02S2IA-LVVG02Users07
                   603M*    489M    538M   39:07    6622   50000   55000
$ find /disk07 -inum 3542694
/disk07/mast2008/karipha/.recently-used.xbel
The other incriminated inodes showed the same result -> over quota. So if
the user's data finally cannot be written... well, quota wouldn't have
allowed it anyway.
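To double-check every inode mentioned in the logs in one pass, a sketch
along these lines (log format and device taken from my examples above):

$ grep 'mpage_da_map_blocks' /var/log/messages \
    | sed 's/.*inode \([0-9]*\) .*/\1/' | sort -u \
    | while read ino; do
        debuge4fs -c -R "stat <$ino>" \
          /dev/mapper/VolGroup02S2IA-LVVG02Users07 | grep '^User:'
      done
# each UID printed can then be compared against the over-quota users
# reported by repquota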
> Hope this helps you understand what's going on.
> - Ted
>
Yes, thanks for that detailed answer.
Regards, Jehan.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html