Date:	Thu, 3 Apr 2008 07:34:37 -0700 (PDT)
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Andrew Morton <akpm@...ux-foundation.org>
cc:	mikulas@...ax.karlin.mff.cuni.cz, viro@...iv.linux.org.uk,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH]: Fix SMP-reordering race in mark_buffer_dirty



On Wed, 2 Apr 2008, Andrew Morton wrote:
> 
> You sure?  A pretty common case would be overwrite of an already-dirty page
> and from a quick read the only place where we modify bh.b_state is the
> set_buffer_uptodate() and clear_buffer_new() in __block_commit_write(),
> both of which could/should be converted to use the same trick.  Like
> __block_prepare_write(), which already does
> 
> 	if (!buffer_uptodate(bh))
> 		set_buffer_uptodate(bh);

Well, that particular optimization is safe, but it's safe because 
"uptodate" is sticky. Once it gets set, it is never reset.

But in general, it's simply a bad idea to do

	if (read)
		atomic-read-modify-write;

because it so often has races. This is pretty much exactly the same bug as 
we had not long ago with

	if (waitqueue_active(..))
		wake_up(..);

and for very similar reasons - the "read" part is very fast, yes, but it's 
also by definition not actually doing all the careful things that the 
atomic operation does (whether it's a CPU-atomic instruction or an atomic 
implemented in software with a spinlock).
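
To make that concrete, here's a rough userspace sketch of the two shapes, 
using C11 atomics and made-up names rather than the real buffer_head 
primitives - it's only meant to show the "plain read skips the atomic op" 
pattern and the barrier-based re-check that the patch below uses:

	#include <stdatomic.h>
	#include <stdio.h>

	#define BH_DIRTY (1u << 0)

	static atomic_uint b_state;	/* stand-in for bh->b_state */

	/*
	 * Racy: nothing orders the plain load against our earlier data
	 * stores or against another CPU clearing the bit, so we may
	 * wrongly skip the atomic set.
	 */
	static void mark_dirty_racy(void)
	{
		if (atomic_load_explicit(&b_state, memory_order_relaxed) & BH_DIRTY)
			return;
		atomic_fetch_or(&b_state, BH_DIRTY);	/* roughly test_set_buffer_dirty() */
	}

	/*
	 * What the patch below does: take the fast path only if the bit
	 * is still seen set after a full barrier (roughly smp_mb()),
	 * otherwise fall back to the atomic op.
	 */
	static void mark_dirty_careful(void)
	{
		if (atomic_load_explicit(&b_state, memory_order_relaxed) & BH_DIRTY) {
			atomic_thread_fence(memory_order_seq_cst);
			if (atomic_load_explicit(&b_state, memory_order_relaxed) & BH_DIRTY)
				return;
		}
		atomic_fetch_or(&b_state, BH_DIRTY);
	}

	int main(void)
	{
		mark_dirty_careful();
		mark_dirty_racy();
		printf("b_state = %#x\n", (unsigned)atomic_load(&b_state));
		return 0;
	}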

> What happened here was back in about, umm, 2001 we discovered one or two
> code paths which when optimised in this way led to overall-measurable (not
> just oprofile-measurable) improvements.  I don't recall which ones they
> were.
> 
> So we then said oh-goody and sprinkled the same pattern all over the place
> on the off-chance.  But I'm sure that over the ages we've let that
> optimisation rot (witness __block_commit_write() above).

And the problem is that 

	if (!buffer_uptodate(bh))
		set_buffer_uptodate(bh);

really isn't "the same" optimization at all as

	if (!buffer_dirty(bh) && !test_set_buffer_dirty(bh)) {
			..

and the latter is simply fundamentally different.
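
To spell out why (a hypothetical, much-simplified userspace model again, 
not the actual buffer_head paths): "uptodate" is only ever set, so a stale 
read at worst repeats an idempotent set_buffer_uptodate(). The dirty bit, 
on the other hand, gets *cleared* by writeback, so a stale "it's already 
dirty" read can lose the re-dirtying entirely:

	#include <stdatomic.h>

	#define BH_DIRTY (1u << 1)

	struct fake_bh {
		atomic_uint b_state;
		int data;		/* stands in for the page contents */
	};

	static void cpu0_overwrite(struct fake_bh *bh, int new_data)
	{
		bh->data = new_data;	/* plain store of the new contents */
		/*
		 * This relaxed load can be ordered before the store above,
		 * and can see the Dirty bit as it was *before* writeback on
		 * another CPU cleared it - in which case we never re-dirty
		 * the buffer.
		 */
		if (atomic_load_explicit(&bh->b_state, memory_order_relaxed) & BH_DIRTY)
			return;
		atomic_fetch_or(&bh->b_state, BH_DIRTY);
	}

	static void cpu1_writeback(struct fake_bh *bh, int *disk)
	{
		atomic_fetch_and(&bh->b_state, ~BH_DIRTY);	/* roughly test_clear_buffer_dirty() */
		*disk = bh->data;	/* may still push out the old contents */
	}

	/*
	 * Bad interleaving: cpu0 loads b_state (sees Dirty) -> cpu1 clears
	 * Dirty and writes the old data -> cpu0's data store lands and the
	 * re-dirtying is skipped.  End state: new data in memory, buffer
	 * clean, nothing left to flush it.  The uptodate check has no such
	 * failure mode because nothing ever clears that bit.
	 */
	int main(void)
	{
		struct fake_bh bh;
		int disk = 0;

		atomic_init(&bh.b_state, BH_DIRTY);
		bh.data = 1;
		cpu1_writeback(&bh, &disk);	/* run sequentially here; the */
		cpu0_overwrite(&bh, 2);		/* race itself needs two CPUs */
		(void)disk;
		return 0;
	}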

> As I say, I expect we could fix this if we want to.  The key point here is
> that a page overwrite does not do lock_buffer(), so it should be possible
> to do the whole operation without modifying bh.b_state.  If we wish to do
> that.

Well, if we really want to do this op, then I'd rather make it really 
obvious in the code what the smp_mb is about, but also make sure that we 
don't unnecessarily do *both* the smp_mb and the actual already-serialized 
bit operation.

But I'd be even happier if we only did these kinds of things when we have 
real performance-data that they help.

		Linus
---
 fs/buffer.c |   15 ++++++++++++++-
 1 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 9819632..39ff144 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1181,7 +1181,20 @@ __getblk_slow(struct block_device *bdev, sector_t block, int size)
 void mark_buffer_dirty(struct buffer_head *bh)
 {
 	WARN_ON_ONCE(!buffer_uptodate(bh));
-	if (!buffer_dirty(bh) && !test_set_buffer_dirty(bh))
+
+	/*
+	 * Very *carefully* optimize the it-is-already-dirty case.
+	 *
+	 * Don't let the final "is it dirty" escape to before we
+	 * perhaps modified the buffer.
+	 */
+	if (buffer_dirty(bh)) {
+		smp_mb();
+		if (buffer_dirty(bh))
+			return;
+	}
+
+	if (!test_set_buffer_dirty(bh))
 		__set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);
 }
 
