linux-ext4 - Re: [PATCH] ext4: Do not normalize request from fallocate

Message-ID: <alpine.LFD.2.00.1303251051460.23176@localhost>
Date:	Mon, 25 Mar 2013 11:09:35 +0100 (CET)
From:	Lukáš Czerner <lczerner@...hat.com>
To:	"Theodore Ts'o" <tytso@....edu>
cc:	Lukas Czerner <lczerner@...hat.com>, linux-ext4@...r.kernel.org,
	gharm@...gle.com
Subject: Re: [PATCH] ext4: Do not normalize request from fallocate

On Sat, 23 Mar 2013, Theodore Ts'o wrote:

> Date: Sat, 23 Mar 2013 20:11:43 -0400
> From: Theodore Ts'o <tytso@....edu>
> To: Lukas Czerner <lczerner@...hat.com>
> Cc: linux-ext4@...r.kernel.org, gharm@...gle.com
> Subject: Re: [PATCH] ext4: Do not normalize request from fallocate
> 
> On Thu, Mar 21, 2013 at 04:50:45PM +0100, Lukas Czerner wrote:
> > 
> > Commit 3c6fe77017bc6ce489f231c35fed3220b6691836 mentioned that
> > large fallocate requests were not physically contiguous. However it is
> > important to see why that is the case. Because the request is so big the
> > allocator will try to find free group to allocate from skipping block
> > groups which are used, which is fine. However it will only allocate
> > extents of 2^15-1 block (limitation of uninitialized extent size)
> > which will leave one block in each block group free which will make the
> > extent tree physically non-contiguous, however _only_ by one block which
> > is perfectly fine.
> 
> Well, it's actually really unfortunate.  The file ends up being more
> fragmented, and from an alignment point of view it's really horrid.
> For a RAID array with a power of 2 stripe size, or a flash device with
> a power of 2 erase block size, the result is actually quite
> spectacularly bad:

Sorry for being dense, but I am trying to understand why this is so
bad and what the "expected" column there means.

The physical offset of each extent below starts at the beginning of a
block group, so it seems to me that it is perfectly aligned for every
power of two up to the block group size.

If an extent instead started at the physical offset from the "expected"
column, then it would be misaligned.
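
As a rough check of that claim, here is a minimal sketch that takes the
physical start blocks and the "expected" values from the filefrag output
quoted below and reports the largest power-of-two alignment of each. The
group size of 32768 blocks is an assumption (4096-byte blocks with the
default blocks-per-group), and the helper name is made up for this example.

```python
# Values copied from the quoted filefrag output for the 1 GiB file.
BLOCKS_PER_GROUP = 32768  # assumption: 4 KiB blocks, default group size

physical_starts = [458752, 491520, 589824, 622592, 655360,
                   688128, 720896, 753664, 786432]
expected_starts = [491519, 524287, 622591, 655359,
                   688127, 720895, 753663, 786431]


def max_pow2_alignment(block, limit=BLOCKS_PER_GROUP):
    """Return the largest power of two (capped at limit) that divides block."""
    alignment = 1
    while alignment < limit and block % (alignment * 2) == 0:
        alignment *= 2
    return alignment


if __name__ == "__main__":
    for start in physical_starts:
        print(f"physical {start:>7}: aligned to {max_pow2_alignment(start)} blocks")
    for start in expected_starts:
        print(f"expected {start:>7}: aligned to {max_pow2_alignment(start)} blocks")
```

This only looks at the physical start blocks. It says nothing about how
the logical offsets line up with them.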

Maybe I am missing something, or maybe I misunderstood the concept?
The only problem I see arises when we later want to use that one
remaining block, but that is expected, and the only way to avoid it
is to allocate smaller extents instead, as you suggested below
(16384 blocks).
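
To make the extent counts concrete, here is a small sketch of the
arithmetic, assuming the file is simply cut into chunks no longer than a
given maximum. The function name is hypothetical and this is not how
mballoc decides anything; it only reproduces the lengths seen in the
filefrag output.

```python
FILE_BLOCKS = 262144       # 1 GiB in 4096-byte blocks, from the filefrag output
MAX_UNWRITTEN_LEN = 32767  # 2^15 - 1, the uninitialized extent length limit
SMALLER_LEN = 16384        # the smaller extent size suggested in the thread


def split_into_extents(total_blocks, max_len):
    """Cut total_blocks into consecutive chunks of at most max_len blocks."""
    lengths = []
    remaining = total_blocks
    while remaining > 0:
        chunk = min(remaining, max_len)
        lengths.append(chunk)
        remaining -= chunk
    return lengths


if __name__ == "__main__":
    for max_len in (MAX_UNWRITTEN_LEN, SMALLER_LEN):
        lengths = split_into_extents(FILE_BLOCKS, max_len)
        print(f"max_len={max_len}: {len(lengths)} extents, last one {lengths[-1]} blocks")
```

With a maximum of 32767 blocks this gives eight full extents plus a final
extent of 8 blocks, matching the listing below. With 16384 blocks the
file divides evenly.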

Thanks!
-Lukas

> 
> File size of 1 is 1073741824 (262144 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..   32766:     458752..    491518:  32767:             unwritten
>    1:    32767..   65533:     491520..    524286:  32767:     491519: unwritten
>    2:    65534..   98300:     589824..    622590:  32767:     524287: unwritten
>    3:    98301..  131067:     622592..    655358:  32767:     622591: unwritten
>    4:   131068..  163834:     655360..    688126:  32767:     655359: unwritten
>    5:   163835..  196601:     688128..    720894:  32767:     688127: unwritten
>    6:   196602..  229368:     720896..    753662:  32767:     720895: unwritten
>    7:   229369..  262135:     753664..    786430:  32767:     753663: unwritten
>    8:   262136..  262143:     786432..    786439:      8:     786431: unwritten,eof
> 1: 9 extents found
> 
> That being said, what we were doing before was quite bad, and you're
> quite right about your analysis here:
> 
> > This will never happen when we normalize the request because for some
> > reason (maybe bug) it will be normalized to much smaller request (2048
> > blocks) and those extents will then be merged together not leaving any
> > free block in between - hence physically contiguous. However the fact
> > that we're splitting huge requests into ton of smaller ones and then
> > merging extents together is very _very_ bad for fallocate performance.
> > 
> > The situation is even worse since with commit
> > ec22ba8edb507395c95fbc617eea26a6b2d98797 we no longer merge
> > uninitialized extents, so we end up with an absolutely _huge_ extent
> > tree for bigger fallocate requests, which is also bad for performance,
> > not only during fallocate itself but also when working with the file
> > later on.
> 
> Without this patch, we currently do this for the same 1g file:
> 
> Filesystem type is: ef53
> File size of 2 is 1073741824 (262144 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..    2047:     305152..    307199:   2048:             unwritten
>    1:     2048..    4095:     307200..    309247:   2048:             unwritten
>    	  	       .....
>  106:   217088..  219135:     522240..    524287:   2048:             unwritten
>  107:   219136..  221183:     591872..    593919:   2048:     524288: unwritten
>  108:   221184..  223231:     593920..    595967:   2048:             unwritten
>  		       .....
>  127:   260096..  262143:     632832..    634879:   2048:             unwritten,eof
> 2: 2 extents found
> 
> So I agree that what we're doing is poor, but the question is, can we
> do something which is better than either of these two results?
> 
> That is, can we improve mballoc so that we keep an fallocated gigabyte
> file as physically contiguous as possible, while using an optimal
> number of on-disk extents?   i.e., 9 extents of length 32767.
> 
> Failing that, can we create 20 extents of length 16384 or so?
> 
> 	      	     	       	       	  	 - Ted
> 