linux-kernel - Re: [PATCH] jbd_commit_transaction() races with journal_try_to_drop

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <1210639184.3661.43.camel@localhost.localdomain>
Date:	Mon, 12 May 2008 17:39:43 -0700
From:	Mingming Cao <cmm@...ibm.com>
To:	Jan Kara <jack@...e.cz>
Cc:	Badari Pulavarty <pbadari@...ibm.com>, akpm@...ux-foundation.org,
	linux-ext4@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] jbd_commit_transaction() races with
	journal_try_to_drop_buffers() causing DIO failures

On Mon, 2008-05-12 at 17:54 +0200, Jan Kara wrote:
>   Hello,
> 
> On Fri 09-05-08 15:27:52, Mingming Cao wrote:
> > > > I was able to reproduce the customer problem involving DIO
> > > > (invalidate_inode_pages2) problem by writing simple testcase
> > > > to keep writing to a file using buffered writes and DIO writes
> > > > forever in a loop. I see DIO writes fail with -EIO.
> > > > 
> > > > After a long debug, found 2 cases how this could happen.
> > > > These are race conditions with journal_try_to_free_buffers()
> > > > and journal_commit_transaction().
> > > > 
> > > > 1) journal_submit_data_buffers() tries to get bh_state lock. If
> > > > try lock fails, it drops the j_list_lock and sleeps for
> > > > bh_state lock, while holding a reference on the buffer.
> > > > In the meanwhile, journal_try_to_free_buffers() can clean up the
> > > > journal head could call try_to_free_buffers(). try_to_free_buffers()
> > > > would fail due to the reference held by journal_submit_data_buffers()
> > > > - which in turn causes failues for DIO (invalidate_inode_pages2()).
> > > > 
> > > > 2) When the buffer is on t_locked_list waiting for IO to finish,
> > > > we hold a reference and give up the cpu, if we can't get
> > > > bh_state lock. This causes try_to_free_buffers() to fail.
> > > > 
> > > > Fix is to drop the reference on the buffer if we can't get
> > > > bh_state lock, give up the cpu and re-try the whole operation -
> > > > instead of waiting for the vh_state lock.
> > > > 
> > > > Does this look like a resonable fix ?
> > >   As Mingming pointed out there are few other places where we could hold
> > > the bh reference. Note also that we accumulate references to buffers in the
> > > wbuf[] list and we need that for submit_bh() which consumes one bh
> > > reference. Generally, it seems to me as a too fragile and impractical
> > > rule "nobody can hold bh reference when not holding page lock" which is
> > > basically what it comes down to if you really want to be sure that
> > > journal_try_to_free_buffers() succeeds. And also note that in principle
> > > there are other places which hold references to buffers without holding the
> > > page lock - for example writepage() in ordered mode (although this one is
> > > in practice hardly triggerable). So how we could fix at least the races
> > > with commit code is to implement launder_page() callback for ext3/4 which
> > > would wait for the previous transaction commit in case the page has buffers
> > > that are part of that commit (I don't want this logic in
> > > journal_try_to_free_buffers() as that is called also on memory-reclaim
> > > path, but journal_launder_page() is fine with me). 
> > 
> > I am not sure how we are going to gurantee that by the time
> > journal_try_to_free_buffers() get called, the page has buffers are not
> > as part of the current transaction commit(which could be different than
> > the one we waited in ext3_launder_page())?  
>   Hmm, you are right. It is not enough to just wait in ext3_launder_page()
> because we don't have a transaction for direct_IO started yet. But if we
> actually released buffers from the page there, it should be fine.

Does this match what you are thinking? It certainly slow down the DIO
path, but the positive side is it doesn't disturb the other code path...
thanks for your feedback!

--------------------------------------------

An unexpected EIO error gets returned when writing to a file
using buffered writes and DIO writes at the same time.

We found there are a number of places where journal_try_to_free_buffers()
could race with journal_commit_transaction(), the later still
helds the reference to the buffers on the t_syncdata_list or t_locked_list
, while journal_try_to_free_buffers() tries to free them, which resulting an EIO
error returns back to the dio caller. 

The logic fix is to retry freeing if journal_try_to_free_buffers() to failed
to free those data buffers while journal_commit_transaction() is still
reference those buffers.
This is done via implement ext3 launder_page() callback, instead of inside
journal_try_to_free_buffers() itself, so that it doesn't affecting other code
path calling journal_try_to_free_buffers and only dio path get affected.

Signed-off-by: Mingming Cao <cmm@...ibm.com>
Index: linux-2.6.26-rc1/fs/ext3/inode.c
===================================================================
--- linux-2.6.26-rc1.orig/fs/ext3/inode.c	2008-05-03 11:59:44.000000000 -0700
+++ linux-2.6.26-rc1/fs/ext3/inode.c	2008-05-12 12:41:27.000000000 -0700
@@ -1766,6 +1766,23 @@ static int ext3_journalled_set_page_dirt
 	return __set_page_dirty_nobuffers(page);
 }
 
+static int ext3_launder_page(struct page *page)
+{
+        int ret;
+	int retry = 5;
+
+	while (retry --) {
+		ret = ext3_releasepage(page, GFP_KERNEL);
+		if (ret == 1)
+			break;
+		else
+			schedule();
+	}
+
+        return ret;
+}
+
+
 static const struct address_space_operations ext3_ordered_aops = {
 	.readpage	= ext3_readpage,
 	.readpages	= ext3_readpages,
@@ -1778,6 +1795,7 @@ static const struct address_space_operat
 	.releasepage	= ext3_releasepage,
 	.direct_IO	= ext3_direct_IO,
 	.migratepage	= buffer_migrate_page,
+	.launder_page	= ext3_launder_page,
 };
 
 static const struct address_space_operations ext3_writeback_aops = {
@@ -1792,6 +1810,7 @@ static const struct address_space_operat
 	.releasepage	= ext3_releasepage,
 	.direct_IO	= ext3_direct_IO,
 	.migratepage	= buffer_migrate_page,
+	.launder_page	= ext3_launder_page,
 };
 
 static const struct address_space_operations ext3_journalled_aops = {
@@ -1805,6 +1824,7 @@ static const struct address_space_operat
 	.bmap		= ext3_bmap,
 	.invalidatepage	= ext3_invalidatepage,
 	.releasepage	= ext3_releasepage,
+	.launder_page	= ext3_launder_page,
 };
 
 void ext3_set_aops(struct inode *inode)




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/