linux-ext4 - Re: [PATCH RFC 0/3] Block reservation for ext3

Message-ID: <AANLkTinD=EE0yYFzBHdvu-fKCQhfMH+DetekHWd-RZc7@mail.gmail.com>
Date:	Wed, 13 Oct 2010 10:49:15 +0200
From:	"Amir G." <amir73il@...rs.sourceforge.net>
To:	Jan Kara <jack@...e.cz>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Theodore Tso <tytso@....edu>,
	Ext4 Developers List <linux-ext4@...r.kernel.org>
Subject: Re: [PATCH RFC 0/3] Block reservation for ext3

---------- Forwarded message ----------
From: Amir Goldstein <amir73il@...il.com>
Date: Wed, Oct 13, 2010 at 10:44 AM
Subject: Re: [PATCH RFC 0/3] Block reservation for ext3
To: Jan Kara <jack@...e.cz>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Ted Ts'o
<tytso@....edu>, linux-ext4@...r.kernel.org




On Wed, Oct 13, 2010 at 1:14 AM, Jan Kara <jack@...e.cz> wrote:
>
> On Mon 11-10-10 14:59:45, Andrew Morton wrote:
> > On Mon, 11 Oct 2010 16:28:13 +0200 Jan Kara <jack@...e.cz> wrote:
> >
> > >   Doing allocation at mmap time does not really work - on each mmap we
> > > would have to map blocks for the whole file, which would make mmap a really
> > > expensive operation. Doing it at page-fault as you suggest in (2a) works
> > > (that's the second plausible option IMO) but the increased fragmentation
> > > and thus loss of performance is rather noticeable. I don't have current
> > > numbers but when I tried that last year Berkeley DB was like two or three
> > > times slower.
> >
> > ouch.
> >
> > Can we fix the layout problem?  Are reservation windows of no use here?
>  Reservation windows do not work for this load. The reason is that the
> page-fault order is completely random so we just spend time creating and
> removing tiny reservation windows because the next page fault doing
> allocation is rarely close enough to fall into the small window.
>  The logic in ext3_find_goal() ends up picking blocks close together for
> blocks belonging to the same indirect block if we are lucky but they
> definitely won't be sequentially ordered. For Berkeley DB the situation is
> made worse by the fact that there are several database files and their
> blocks end up interleaved.
>  So we could improve the layout but we'd have to tweak the reservation
> logic and allocator and it's not completely clear to me how.
>  One thing to note is that currently, ext3 *is* in fact doing delayed
> allocation for writes via mmap. We just never called it like that and never
> bothered to do proper space estimation...
>
> > > > 3) Keep a global counter of sparse blocks which are mapped at mmap()
> > > > time, and update it as blocks are allocated, or when the region is
> > > > freed at munmap() time.
> > >   Here again I see the problem that mapping all file blocks at mmap time
> > > is rather expensive and so does not seem viable to me. Also the
> > > overestimation of needed blocks could be rather huge.
> >
> > When I did ext2 delayed allocation back in, err, 2001 I had
> > considerable trouble working out how many blocks to actually reserve
> > for a file block, because it also had to reserve the indirect blocks.
> > One file block allocation can result in reserving four disk blocks!
> > And iirc it was not possible with existing in-core data structures to
> > work out whether all four blocks needed reserving until the actual
> > block allocation had occurred.  So I ended up reserving the worst-case
> > number of indirects, based upon the file offset.  If the disk ran out
> > of "space" I'd do a forced writeback to empty all the reservations and
> > would then take a look to see if the disk was _really_ out of space.
> >
> > Is all of this an issue with this work?  If so, what approach did you
> > take?
>  Yeah, I've spotted exactly the same problem. How I decided to solve it in
> the end is that in memory we keep track of each indirect block that has
> a delay-allocated buffer under it. This allows us to reserve space for each
> indirect block at most once (I didn't bother with making the accounting
> precise for double or triple indirect blocks so when I need to reserve
> space for an indirect block, I reserve the whole path just to be sure). This
> pushes the error in estimation into a rather acceptable range for reasonably
> common workloads - the error can still be 50% for workloads which use just
> one data block in each indirect block but even in this case the absolute
> number of blocks falsely reserved is small.
>  The cost is of course increased complexity of the code, the memory
> spent for tracking those indirect blocks (32 bytes per indirect block), and
> some time for lookups in the RB-tree of the structures. At least the nice
> thing is that when there are no delay-allocated blocks, there isn't any
> overhead (tree is empty).
>
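
For reference, the worst-case reservation Andrew describes above (one data
block plus every indirect block on its path, chosen purely from the file
offset) can be sketched roughly as below. This is only an illustration, not
code from any of the patches: the function names are made up, and it assumes
the classic ext2/ext3 layout of 12 direct pointers in the inode and 32-bit
block numbers in each indirect block.

```c
#include <assert.h>
#include <stdint.h>

#define EXT3_NDIR_BLOCKS 12	/* direct block pointers held in the inode */

/*
 * Hypothetical helper: number of indirect blocks on the path from the
 * inode to logical file block `iblock` (0 = direct ... 3 = triple
 * indirect). Returns -1 if the block is beyond what triple indirection
 * can map.
 */
static int indirect_depth(uint64_t iblock, uint32_t block_size)
{
	uint64_t ptrs;

	assert(block_size >= 1024 && block_size % 4 == 0);
	ptrs = block_size / 4;	/* 32-bit block numbers per indirect block */

	if (iblock < EXT3_NDIR_BLOCKS)
		return 0;
	iblock -= EXT3_NDIR_BLOCKS;
	if (iblock < ptrs)
		return 1;
	iblock -= ptrs;
	if (iblock < ptrs * ptrs)
		return 2;
	iblock -= ptrs * ptrs;
	if (iblock < ptrs * ptrs * ptrs)
		return 3;
	return -1;
}

/*
 * Hypothetical helper: worst-case number of disk blocks to reserve for
 * one delayed file block, i.e. the data block plus the whole indirect
 * path, without looking at what is already allocated.
 */
static int worst_case_reservation(uint64_t iblock, uint32_t block_size)
{
	int depth = indirect_depth(iblock, block_size);

	return depth < 0 ? -1 : depth + 1;
}
```

With 4k blocks, a block in the triple-indirect range gives the "four disk
blocks" case mentioned above, while any of the first 12 blocks needs only
the data block itself.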

How about allocating *only* the indirect blocks on page fault?
IMHO that seems like a fair mixture of high quota accuracy, low
complexity of the accounting code and low file fragmentation (only
the indirect blocks may end up a bit further away from the data).

In my snapshot patches I use the @create arg to get_blocks_handle() to
pass commands such as "allocate only indirect blocks".
The patch is rather simple. I can prepare it for ext3 if you like.
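
To make the idea a bit more concrete, here is a toy user-space model of a
mapping function whose create argument carries a command rather than a
plain boolean. It is not the snapshot patch and not the real
get_blocks_handle() signature: the enum values, the struct, the tiny
4-pointer indirect block and the bump allocator are all invented for
illustration, and only a single level of indirection is modelled.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical command values passed through the "create" argument. */
enum map_cmd {
	MAP_LOOKUP = 0,			/* never allocate */
	MAP_CREATE = 1,			/* allocate indirect and data blocks */
	MAP_CREATE_INDIRECT_ONLY = 2,	/* allocate the indirect path only */
};

#define TOY_NDIR 12	/* direct pointers in the toy inode */
#define TOY_PTRS 4	/* pointers per indirect block, tiny on purpose */

struct toy_inode {
	uint32_t direct[TOY_NDIR];
	uint32_t indirect;		/* block nr of indirect block, 0 = hole */
	uint32_t ind_slots[TOY_PTRS];	/* contents of that indirect block */
	uint32_t next_free;		/* trivial bump allocator */
};

static void toy_inode_init(struct toy_inode *ino)
{
	memset(ino, 0, sizeof(*ino));
	ino->next_free = 1;	/* block 0 is reserved to mean "unmapped" */
}

/* Returns the physical block number for iblock, or 0 if it is a hole. */
static uint32_t toy_get_block(struct toy_inode *ino, uint32_t iblock,
			      enum map_cmd cmd)
{
	uint32_t *slot;

	if (iblock < TOY_NDIR) {
		slot = &ino->direct[iblock];
	} else if (iblock < TOY_NDIR + TOY_PTRS) {
		if (ino->indirect == 0) {
			if (cmd == MAP_LOOKUP)
				return 0;
			/* Both create commands allocate the indirect block. */
			ino->indirect = ino->next_free++;
		}
		slot = &ino->ind_slots[iblock - TOY_NDIR];
	} else {
		return 0;	/* out of range for this toy model */
	}

	/* Only a full create allocates the data block itself. */
	if (*slot == 0 && cmd == MAP_CREATE)
		*slot = ino->next_free++;
	return *slot;
}
```

In this model the page fault path would call toy_get_block() with
MAP_CREATE_INDIRECT_ONLY, so the metadata is accounted for up front, and
writeback would later call it with MAP_CREATE to place the data block.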

Amir.