linux-ext4 - Re: [PATCH][RFC] resize2fs and uninit

Message-ID: <4AB28DD5.30607@redhat.com>
Date:	Thu, 17 Sep 2009 15:28:21 -0400
From:	Ric Wheeler <rwheeler@...hat.com>
To:	Will Drewry <redpig@...aspill.org>
CC:	Andreas Dilger <adilger@....com>, linux-ext4@...r.kernel.org
Subject: Re: [PATCH][RFC] resize2fs and uninit_bg questions

On 09/16/2009 07:11 PM, Will Drewry wrote:
> On Wed, Sep 16, 2009 at 03:22:50PM -0600, Andreas Dilger wrote:
>    
>> On Sep 16, 2009  15:42 -0500, Will Drewry wrote:
>>      
>>> I'm interested in it for a few reasons:
>>> 1. it undermines the use of uninit_bg on large filesystems as
>>>     e2fsck -f will go back to normal speed, even without those block
>>>     groups being 'used'.  In my local example, it goes from 14s to 2m24s.
>>>        
>> Ah, my bad...  It definitely makes sense to mark new groups added
>> during online resize as {BLOCK,INODE}_UNINIT if that feature is
>> enabled for the filesystem.  The e2fsck slowdown after a resize is
>> only a one-time event (that e2fsck would mark the unused groups as
>> UNINIT again) but it makes sense to do it correctly the first time.
>>      
> Cool - didn't realize e2fsck would swap them back.  That only makes
> it seem like an even heavier burden if I know the backing store is
> zero-filled! :)
>
>    
>>> 2. it will spread the I/O cost out over time.  Online resizing often
>>>     means that you don't want to/can't unmount the fs.  However, a
>>>     large filesystem increase might result in gigabytes of 0s being
>>>     written to the backing store which could result in I/O throttling
>>>     that makes doing it online less useful.  It'd be nice to be able to
>>>     optionally amortize that cost, as is done if the fs had been created
>>>     with mke2fs -E lazy_itable_init=1 at full size initially.
>>>        
>> While this is true, there is a non-zero risk of problems if the inode
>> table isn't zeroed, which is why lazy_itable_init isn't the default.
>> The risk is that if the group descriptor is invalid for some reason
>> (found by bad checksum, or some inode in use beyond itable_unused)
>> then the UNINIT and itable_unused fields will be ignored and a full
>> inode table scan for that group is done.
>>
>> If the itable isn't zeroed, then any old inodes (from a previous
>> filesystem, or garbage) will be "reattached" to the filesystem in
>> lost+found, and may cause a LOT of duplicate blocks processing (slow!).
>>      
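
The fallback described above can be modelled in a few lines. This is only a sketch of the decision as it is explained in this thread, not code from e2fsck; the GroupDesc class, the inodes_to_scan function and the highest_inode_in_use parameter are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class GroupDesc:
    """Hypothetical, simplified view of one block group descriptor."""
    inode_uninit: bool   # INODE_UNINIT flag is set
    itable_unused: int   # never-used inode slots at the end of the table
    checksum_ok: bool    # descriptor checksum verified


def inodes_to_scan(gd, inodes_per_group, highest_inode_in_use=0):
    """Return how many inode slots a checker would read for this group.

    highest_inode_in_use is the 1-based position of the last in-use inode
    in the group; a real checker only learns this while scanning, so
    passing it in is a simplification of this model.
    """
    used_limit = inodes_per_group - gd.itable_unused
    trusted = (
        gd.checksum_ok
        and 0 <= gd.itable_unused <= inodes_per_group
        and highest_inode_in_use <= used_limit
    )
    if not trusted:
        # UNINIT and itable_unused are ignored: scan the whole table,
        # which is where stale inodes in an unzeroed table get picked up.
        return inodes_per_group
    if gd.inode_uninit:
        return 0
    return used_limit
```

With 8192 inodes per group, a trusted uninitialized group costs nothing to scan, while the same group with a bad checksum costs the full 8192 slots. That difference is why the contents of an unzeroed table matter.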
> That makes things a lot clearer - thanks! I wasn't sure what the default
> action was, but it makes sense to assume that corruption would lead
> you to crawl the inode table regardless.  In which case, your best bet
> is to zero-fill it to minimize the weirdness.
>    

One note - the WRITE_SAME command in SCSI has long been used by array 
vendors to do relatively high performance zero fills.

It will actually write the disk (and that can be slow), but it won't do 
multiple transfers of the data block of zeroes from server to storage.

Not sure that is a useful point, but it might be nice to take advantage of :-)

ric
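
For contrast with the point above, here is a rough sketch of zeroing a byte range from the host. The write loop sends every zero block from server to storage, which is the transfer a WRITE_SAME-style offload avoids. As an assumption that goes beyond this thread, the sketch first tries the BLKZEROOUT ioctl that later Linux kernels provide for block devices, which leaves it to the kernel to pick an offload if the device has one. zero_range is a hypothetical helper name, not an e2fsprogs function.

```python
import fcntl
import os
import stat
import struct

# _IO(0x12, 127) from <linux/fs.h>; assumes a Linux kernel that provides it.
BLKZEROOUT = 0x127F


def zero_range(fd, offset, length, chunk=1 << 20):
    """Zero [offset, offset + length) on fd; return which path was used."""
    if stat.S_ISBLK(os.fstat(fd).st_mode):
        try:
            # The kernel decides how to zero the range (possibly offloaded).
            fcntl.ioctl(fd, BLKZEROOUT, struct.pack("QQ", offset, length))
            return "ioctl"
        except OSError:
            pass  # not supported here: fall back to plain writes
    zeros = bytes(chunk)
    pos, end = offset, offset + length
    while pos < end:
        n = min(chunk, end - pos)
        # Each call ships n zero bytes from the host to the storage.
        pos += os.pwrite(fd, zeros[:n], pos)
    os.fsync(fd)
    return "writes"
```

On a regular file the function always takes the write path, which makes it easy to try out on a scratch file before pointing it at a device.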


>> If you had the time to work on the solution, it would be very useful,
>> and we could make lazy_itable_init the default.  What needs to be done
>> is have a thread that is created at filesystem mount that walks all the
>> groups and validates the GDT checksum, and zeroes inode tables and
>> bitmaps that are marked UNINIT w/o ZEROED.  For bonus points it could
>> check bitmap validity (later that might validate a bitmap checksum),
>> compute buddy bitmaps for groups that have free space, etc.
>>
>> The thread would exit once all of the groups have had the inode tables
>> zeroed, or if the filesystem is unmounted.  In the common case (i.e.
>> once all inode tables are zeroed), it would just walk the already-loaded
>> group descriptor table looking for the ZEROED flag and no IO is done,
>> assuming we don't check the on-disk bitmaps on each mount (that could
>> be done only periodically, with a timestamp in the superblock).
>>      
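
The walk described above might look roughly like the following. It is a model under stated assumptions, not kernel code: Group, lazy_init_pass and the zero_itable callback are hypothetical, and the flag values are meant to mirror the ext4 bg_flags bits. The stop event stands in for "the filesystem is unmounted".

```python
import threading
from dataclasses import dataclass

# Intended to mirror the ext4 group descriptor flag bits.
INODE_UNINIT = 0x1
BLOCK_UNINIT = 0x2
INODE_ZEROED = 0x4


@dataclass
class Group:
    """Hypothetical in-memory copy of one group descriptor."""
    flags: int
    checksum_ok: bool = True


def lazy_init_pass(groups, zero_itable, stop):
    """Walk the group table once; return the indexes that were zeroed.

    zero_itable(idx) is a caller-supplied callback that does the actual
    I/O; stop is a threading.Event set when the filesystem is unmounted.
    """
    zeroed = []
    for idx, g in enumerate(groups):
        if stop.is_set():
            break
        if not g.checksum_ok:
            continue  # do not trust the flags of a damaged descriptor
        if (g.flags & INODE_UNINIT) and not (g.flags & INODE_ZEROED):
            zero_itable(idx)
            g.flags |= INODE_ZEROED
            zeroed.append(idx)
    return zeroed
```

Once every group carries the ZEROED flag, a later pass only reads the in-memory descriptors and returns an empty list, which matches the "no IO is done" common case described above.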
> I'd love to have this functionality so it's definitely going on my TODO
> list, but probably not for a while yet.  This is a great description of
> the needed code which will make it that much easier.
>
>    
>>> Would it seriously raise the risk of corruption if uninit_bg is already
>>> in use? Alternately, would a patch to that effect stand a chance of ever
>>> making it upstream?
>>>        
>> If the filesystem is already formatted with lazy_itable_init, then
>> doing further resizing w/o inode table zeroing is fine.
>>      
> Cool -- I'll start on a patch to add that support as a precursor to
> having a mount-triggered itable zeroing thread.  At least then, test
> filesystems and known zero-filled ones will benefit (as you
> pointed out!).
>
>    
>>      
>>> I've attached a version with it being flagged by a "-l" for lazy.
>>>        
>> It might make sense to avoid requiring the user to specify this,
>> rather remembering the option supplied at mke2fs time?  There is
>> the COMPAT_LAZY_BG superblock flag that might be usable for this,
>> though Ted might have some comments about any potential compatibility
>> issues.
>>      
> Cool - yeah, I'd love to make use of the COMPAT_LAZY_BG flag, since
> all references to it (except in e2p/features.c) seem to be gone
> from the e2fsprogs source and the kernel.  I'm happy to rewrite it to do
> so and update mke2fs to set LAZY_BG when lazy_itable_init=1 is set.
>
>    
>> Other than that, the patch looks reasonable at first glance.
>>      
> Thanks!  If Ted has any feedback on the use of COMPAT_LAZY_BG, I'll
> rewrite it using that (or not).  Using COMPAT_LAZY_BG would also be nice
> because it would make it easier to decide when it's okay to online resize
> without initializing itables too (and would fit its initial purpose
> of being useful for sparse files)!
>
> cheers -
> will
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>    
