Message-ID: <4AB53F3E.4040007@it-sudparis.eu>
Date: Sat, 19 Sep 2009 22:29:50 +0200
From: jehan procaccia <jehan.procaccia@...sudparis.eu>
To: Theodore Tso <tytso@....edu>
CC: linux-ext4@...r.kernel.org, Eric Sandeen <sandeen@...hat.com>
Subject: Re: howto downgrade ext4 to ext3
Theodore Tso wrote:
> On Fri, Sep 18, 2009 at 11:21:08PM +0200, jehan procaccia wrote:
>
>> I would love to test that option (-o nodelalloc) instead of moving back
>> to ext3.
>> However, I don't understand what it is... Am I taking a risk in terms of
>> data integrity if I set it, or just losing performance?
>> Anyway, I'm not sure it is available; when I search for it in "man mount"
>> I can't find it. Is it an undocumented option?
>>
>
> The mount man page is part of the util-linux package, and so it tends
> to get updated a bit slower than the kernel. The ext4 mount options
> are fully documented in the kernel documentation; so if you install
> the kernel-doc RPM, and look in the Documentation/filesystems/ext4.txt
> you'll get a comprehensive list of ext4 mount options. (Well, as
> comprehensive as we can make it; occasionally we forget to update it,
> but in general we've been pretty good at documenting everything.)
>
> (Checking....)
>
> Ugh, the description for nodelalloc in ext4.txt is pretty horrible;
> it doesn't even parse as a valid English sentence. I don't know how
> that slipped by me (Mingming, Eric; can either of you see if your
> respective companies can snag us a tech writer resource for a day or
> two?), but I'll get that one fixed up.
>
Indeed, there's not much there, and it's not very understandable:
$ less /usr/share/doc/kernel-doc-2.6.18/Documentation/filesystems/ext4.txt
delalloc    (*)   Deferring block allocation until write-out time.
nodelalloc        Disable delayed allocation. Blocks are allocation
                  when data is copied from user to page cache.
> Anyway, delayed allocation is a feature of ext4 which allows us to
> delay allocating blocks until the very last minute --- when the VM
> page writeback routine decides it's time to write dirty pages to disk
> (aka "cleaning pages", or "when the page cleaner runs" --- yeah, OS
> programmers sometimes like to perpetuate some really horrible puns),
> or when a program explicitly forces a file to be written to disk via
> the fsync() system call. This allows the block allocator to make more
> intelligent decisions, which tends to avoid disk fragmentation and
> tends to increase performance. Delayed allocation is one of the
> reasons why simply mounting an unconverted ext2 or ext3 filesystem
> using the ext4 file system driver can result in better performance.
>
>
OK, understood ...
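A rough way to watch this in action, I suppose (assuming filefrag from
e2fsprogs and a kernel with FIEMAP support; the test file path is just an
example):

$ dd if=/dev/zero of=/disk00/testfile bs=1M count=4   # dirty some pages, no sync yet
$ filefrag -v /disk00/testfile    # extents may show as delalloc/unknown: not on disk yet
$ sync                            # force writeback; blocks get allocated now
$ filefrag -v /disk00/testfile    # extents now have real physical block numbers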
> The problem is that in older kernels, we didn't properly account for
> quota. Since we don't attempt to allocate blocks until the page
> cleaner runs, which could potentially be well after the program which
> wrote the file has exited, the out-of-quota error only gets noticed
> when the delayed allocation writepages function is trying to clean up
> dirty pages. This is a "should never happen" situation, and to avoid
> causing the VM to loop forever trying to write pages where the write
> operation would never succeed, the writepages code prints an extremely
> scary message --- and then throws away the user's data.
>
That paragraph is a bit obscure to me... If I understood correctly, you've
described the situation I ran into?
> By using the nodelalloc mount option, ext4 will try to allocate blocks
> while processing each and every write(2) system call. This allows
> quota to be checked right away, and if the user is over quota, the
> write system call will return an error right away. This is less
> efficient in terms of CPU usage, and the block allocator will not be
> able to do as good a job, since it doesn't know how big the file will
> ultimately be when it is doing block-by-block allocation. However, it
> avoids the nasty bug that happens when the user hits an over-quota
> situation in the delalloc writepage function --- and it's no worse
> than what ext3 does.
>
OK, that's where I should go now: mounting with nodelalloc. Lower
performance, but no more "should never happen" situation ;-) .
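A minimal sketch of what I plan to do (device and mount point are from my
setup; whether the option is visible in /proc/mounts may depend on the
kernel build):

$ mount -o remount,nodelalloc /disk00
$ grep disk00 /proc/mounts        # check that nodelalloc took effect

# and to make it permanent, the /etc/fstab line would become something like:
# /dev/mapper/VolGroup...   /disk00   ext4   defaults,nodelalloc   1 2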
> In more modern kernels, we've added quota checking in the write(2)
> system call, so that even though we're not allocating the blocks right
> away, and thus don't know where they will be located on disk, we
> charge them against the user's quota right away. That way the write(2)
> system call can signal the over-quota situation to the user program.
> Unfortunately, these patches aren't present in the version of ext4
> that was backported to RHEL 5.4.
>
>
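If I read this right, on a kernel with those patches (or with nodelalloc),
a write by an over-quota user should now fail immediately with EDQUOT
("Disk quota exceeded") instead of being dropped later at writeback time.
A hypothetical test (user and path are made up):

$ su - someuser -c 'dd if=/dev/zero of=/home/someuser/bigfile bs=1M count=100'
# expected: dd aborts with "Disk quota exceeded" rather than triggering
# mpage_da_map_blocks errors later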
Starting from which kernel version did you add "quota checking in the
write(2) system call"?
In other words, the problem should not arise anymore with a recent kernel,
even while still using delalloc? Would 2.6.30 be OK?
For RHEL, the Fedora project has more recent kernel packages as source
RPMs:
kernel-2.6.29.4-167.fc11.src.rpm or kernel-2.6.31-33.fc12.src.rpm
Perhaps recompiling one of these for RHEL 5.4 could be a workaround instead
of using nodelalloc?
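Roughly something like this, I imagine (a sketch; the Fedora spec file and
build dependencies may well need adjusting for the RHEL 5.4 toolchain, and
the output path depends on the rpmbuild configuration):

$ rpmbuild --rebuild kernel-2.6.31-33.fc12.src.rpm
# then install the resulting kernel binary RPMs from the RPMS/ output directory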
>> But now, how can I check that there's no more problem on that specific
>> partition (/disk00)?
>> When the kernel complains this way, for example:
>> Sep 16 18:06:45 gizeh kernel: mpage_da_map_blocks block allocation
>> failed for inode 39419 at logical offset 0 with max blocks 2 with error
>> -122
>> Sep 16 18:06:45 gizeh kernel: This should not happen.!! Data will be lost
>> I have no indication which partition that inode belongs to. There are so
>> many error messages like this that it won't be easy to tell that none
>> comes from /disk00.
>>
>
> Well, error code 122 is EDQUOT, or "Quota exceeded". So it's very
> likely that this is some other partition. This is a bug; we really
> should print the disk that was involved, and not just the inode number.
> I'll fix that in future kernels (but of course that won't help you for
> RHEL 5.4). What you can do to prove this is to check a quota report,
> and see which users are over quota. You can then check all of your
> ext4 partitions to see which has an inode 39419 which is owned by one
> of your over-quota users, using debugfs:
>
> debugfs -c -R "stat <39419>" /dev/sdXXX
>
>
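For the quota-report step, something like repquota should list the
over-quota users first (a sketch; repquota marks a user over a limit with
a '+' in its output):

$ repquota -a | grep '+'          # rows containing '+' are over a block or file limit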
Good, indeed: I only get -122 errors, and thanks to the search example I
noticed that those errors happen only for apparently over-quota users.
Here's an example:
gizeh kernel: mpage_da_map_blocks block allocation failed for inode
3542694 at logical offset 0 with max blocks 1 with error -122
Message from syslogd@ at Sat Sep 19 21:08:03 2009 ...
[root@...eh ~]
$ debuge4fs -c -R "stat <3542694>" /dev/mapper/VolGroup02S2IA-LVVG02Users07
debuge4fs 1.41.5 (23-Apr-2009)
/dev/mapper/VolGroup02S2IA-LVVG02Users07: catastrophic mode - not
reading inode or group bitmaps
Inode: 3542694 Type: regular Mode: 0644 Flags: 0x80000
Generation: 2336084861 Version: 0x00000000:00000001
User: 42658 Group: 426 Size: 0
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x4ab52c13:81a9f0d4 -- Sat Sep 19 21:08:03 2009
atime: 0x4ab52c13:816ce76c -- Sat Sep 19 21:08:03 2009
mtime: 0x4ab52c13:816ce76c -- Sat Sep 19 21:08:03 2009
crtime: 0x4ab52c13:812fde04 -- Sat Sep 19 21:08:03 2009
Size of extra inode fields: 28
BLOCKS:
[root@...eh ~]
$ getent passwd |grep 42658
karipha:x:42658:426:Karipha BOUMER:/mci/mast2008/karipha:/usr/local/bin/bash
[root@...eh ~]
$ quota -s karipha
Disk quotas for user karipha (uid 42658):
     Filesystem   blocks   quota   limit   grace   files   quota   limit   grace
/dev/mapper/VolGroup02S2IA-LVVG02Users07
                   603M*    489M    538M   39:07    6622   50000   55000
$ find /disk07 -inum 3542694
/disk07/mast2008/karipha/.recently-used.xbel
The other incriminated inodes showed the same result -> over quota. So if
the user's data finally cannot be written... well, quota wouldn't have
allowed it anyway.
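To double-check every inode mentioned in the logs in one pass, a sketch
along these lines (log format and device taken from my examples above):

$ grep 'mpage_da_map_blocks' /var/log/messages \
    | sed 's/.*inode \([0-9]*\) .*/\1/' | sort -u \
    | while read ino; do
        debuge4fs -c -R "stat <$ino>" \
          /dev/mapper/VolGroup02S2IA-LVVG02Users07 | grep '^User:'
      done
# each UID printed can then be compared against the over-quota users
# reported by repquota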
> Hope this helps you understand what's going on.
> - Ted
>
Yes, thanks for that detailed answer.
Regards, Jehan.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html