Date:	Fri, 08 Dec 2006 14:28:10 +1100
From:	Nick Piggin <nickpiggin@...oo.com.au>
To:	Mark Fasheh <mark.fasheh@...cle.com>
CC:	Linux Memory Management <linux-mm@...ck.org>,
	linux-fsdevel@...r.kernel.org,
	linux-kernel <linux-kernel@...r.kernel.org>,
	OGAWA Hirofumi <hirofumi@...l.parknet.co.jp>,
	Andrew Morton <akpm@...gle.com>
Subject: Re: Status of buffered write path (deadlock fixes)

Mark Fasheh wrote:
> Hi Nick,
> 
> On Tue, Dec 05, 2006 at 05:52:02PM +1100, Nick Piggin wrote:
> 
>>Hi,
>>
>>I'd like to try to state where we are WRT the buffered write patches,
>>and ask for comments. Sorry for the wide cc list, but this is an
>>important issue which hasn't had enough review.
> 
> 
> I pulled broken-out-2006-12-05-01-0.tar.gz from ftp.kernel.org and applied
> the following patches to get a reasonable idea of what the final product
> would look like:
> 
> revert-generic_file_buffered_write-handle-zero-length-iovec-segments.patch
> revert-generic_file_buffered_write-deadlock-on-vectored-write.patch
> generic_file_buffered_write-cleanup.patch
> mm-only-mm-debug-write-deadlocks.patch
> mm-fix-pagecache-write-deadlocks.patch
> mm-fix-pagecache-write-deadlocks-comment.patch
> mm-fix-pagecache-write-deadlocks-xip.patch
> mm-fix-pagecache-write-deadlocks-mm-pagecache-write-deadlocks-efault-fix.patch
> mm-fix-pagecache-write-deadlocks-zerolength-fix.patch
> mm-fix-pagecache-write-deadlocks-stale-holes-fix.patch
> fs-prepare_write-fixes.patch
> 
> If this is incorrect, or I should apply further patches, please let me know.
> 
> Hopefully my feedback can be of use to you.

That looks right (there are fixes for the "cont" buffer scheme, but they
probably don't bother you).

>>Well the next -mm will include everything we've done so far. I won't
>>repost patches unless someone would like to comment on a specific one.
>>
>>I think the core generic_file_buffered_write is fairly robust, after
>>fixing the efault and zerolength iov problems picked up in testing
>>(thanks, very helpful!).
>>
>>So now I *believe* we have an approach that solves the deadlock and
>>doesn't expose transient or stale data, transient zeroes, or anything
>>like that.
> 
> 
> In generic_file_buffered_write() we now do:
> 
> 	status = a_ops->commit_write(file, page, offset,offset+copied);
> 
> Which tells the file system to commit only the amount of data that
> filemap_copy_from_user() was able to pull in, despite our zeroing of 
> the newly allocated buffers in the copied != bytes case. Shouldn't we be
> doing:
> 
>         status = a_ops->commit_write(file, page, offset,offset+bytes);
> 
> instead, thus preserving ordered writeout (for ext3, ocfs2, etc.) for those
> parts of the page which are properly allocated and zeroed but haven't been
> copied into yet? I think that in the case of a crash just after the
> transaction is closed in ->commit_write(), we might lose those guarantees,
> exposing stale data on disk.

No, because we might be talking about buffers that haven't been newly
allocated, but are not uptodate (so the pagecache can contain junk).

We can't zero these guys and do the commit_write, because that exposes
transient zeroes to a concurrent reader.

What we *could* do is a full-length commit_write for uptodate pages,
even if the copy is short. But we would still need to do a zero-length
commit_write if the page is not uptodate (this would reduce the number
of new cases that need to be considered).

Does a short commit_write cause a problem for filesystems? They can
still do any and all operations they would have with a full-length one.
But maybe it would be better to eliminate that case. OK.

How about a zero-length commit_write? In that case, again, filesystems
should be able to remain unchanged, *except* that they must not extend
i_size or mark the page uptodate.
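
To make that concrete, here is a minimal sketch of the commit step under
that scheme (illustration only, not the actual patch code; the helper
name is made up):

#include <linux/fs.h>
#include <linux/pagemap.h>

/*
 * Hypothetical helper, for illustration only.  "copied" is what
 * filemap_copy_from_user() actually managed to copy, "bytes" is the
 * length that was passed to ->prepare_write().
 */
static int commit_after_copy(const struct address_space_operations *a_ops,
			     struct file *file, struct page *page,
			     unsigned offset, unsigned bytes, unsigned copied)
{
	if (copied == bytes || PageUptodate(page))
		/*
		 * Either the copy fully succeeded, or the uncopied tail of
		 * the page already held valid data, so committing the full
		 * length is safe and keeps ordered-data guarantees for the
		 * newly allocated blocks.
		 */
		return a_ops->commit_write(file, page, offset,
					   offset + bytes);

	/*
	 * Short copy into a !uptodate page: the uncopied part may be junk.
	 * Commit zero length so the filesystem neither extends i_size nor
	 * marks the page uptodate; the caller unlocks the page and retries
	 * the copy.
	 */
	return a_ops->commit_write(file, page, offset, offset);
}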

>>Error handling is getting close, but there may be cases that nobody
>>has picked up, and I've noticed a couple which I'll explain below.
>>
>>I think we do the right thing WRT pagecache error handling: a
>>!uptodate page remains !uptodate, an uptodate page can handle the
>>write being done in several parts. Comments in the patches attempt
>>to explain how this works. I think it is pretty straightforward.
>>
>>But WRT block allocation in the case of errors, it needs more review.
>>
>>Block allocation:
>>- prepare_write can allocate blocks
>>- prepare_write doesn't need to initialize the pagecache on top of
>>  these blocks where it is within the range specified in prepare_write
>>  (because the copy_from_user will initialise it correctly)
>>- In the case of a !uptodate page, unless the page is brought uptodate
>>  (ie the copy_from_user completely succeeds) and marked dirty, then
>>  a read that sneaks in after we unlock the page (to retry the write)
>>  will try to bring it uptodate by pulling in the uninitialised blocks.
> 
> 
> For some reason, I'm not seeing where BH_New is being cleared in the case
> with no errors or faults. Hopefully I'm wrong, but if I'm not, I believe we
> need to clear the flag somewhere (perhaps in block_commit_write()?).

Hmm, it is a bit inconsistent. It seems to be cleared anywhere from
prepare_write to block_write_full_page.

Where do filesystems need the bit? It would be nice to clear it in
commit_write if possible. Worst case, we'll need a new bit.
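
For illustration, clearing it at commit time could look something like
this minimal sketch (hypothetical helper, not from the patches; where it
really belongs is exactly the open question above):

#include <linux/buffer_head.h>

/*
 * Hypothetical helper, for illustration only: walk the page's buffers at
 * ->commit_write() time and drop any BH_New left over from block
 * allocation done in ->prepare_write().
 */
static void clear_new_buffers(struct page *page)
{
	struct buffer_head *bh, *head;

	if (!page_has_buffers(page))
		return;

	bh = head = page_buffers(page);
	do {
		if (buffer_new(bh))
			clear_buffer_new(bh);
		bh = bh->b_this_page;
	} while (bh != head);
}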

> Ok, that's it for now. I have some thoughts regarding the asymmetry between
> ranges passed to ->prepare_write() and ->commit_write(), but I'd like to
> save those thoughts until I know whether my comments above uncovered real
> issues :)

Very helpful, thanks ;)

-- 
SUSE Labs, Novell Inc.