[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20110906153601.GA20571@thunk.org>
Date: Tue, 6 Sep 2011 11:36:01 -0400
From: Ted Ts'o <tytso@....edu>
To: Jan Kara <jack@...e.cz>
Cc: Ext4 Developers List <linux-ext4@...r.kernel.org>,
Allison Henderson <achender@...ux.vnet.ibm.com>
Subject: Re: [PATCH] ext4: only call ext4_jbd2_file_inode when an inode has
been extended
On Tue, Sep 06, 2011 at 12:02:53PM +0200, Jan Kara wrote:
> Umm, I don't quite understand the race. It seems to me we can block on
> lack of journal space during writeback only in ext4_da_writepages() where
> we do ext4_journal_start(). But at that point there shouldn't be any pages
> with PageWriteback set which were not submitted for IO. So it's not clear
> to me why PageWriteback bit does not get cleared when IO finishes and
> kjournald would then continue. Is it because IO completion is somehow
> blocked on journal?
No, you're right. I looked more closely at it and the problem is
because of the end_io handling. There was a thead blocked in
mpage_da_map_and_submit path but it was blocked in
ext4_get_wite_access(), and it was not holding PageWriteback at the
time.
> That would seem possible as I'm looking e.g. into
> ext4_convert_unwritten_extents(). But then any waiting for writeback from
> kjournald is prone to deadlocks. In fact regardless of what kjournald does
> there does not seem to be sane lock ordering since we have:
>
> transaction start -> PageWriteback -> transaction start
> ^ ^
> | end_io handling if I'm right
> enforced by write_cache_pages_da
>
> Which is really nasty. We cannot end page writeback until the page is
> readable from disk which means until we have properly updated extent tree.
> But for extent tree update we need a transaction. The only way out I can
> see is to reserve space for extent tree update in a transaction in
> writepages() and holding the transaction open until end_io time when we
> change extent tree and close the transaction. But I'm afraid that's going
> to suck under heavy load...
Yes, that's a problem with ext4_convert_unwritten_extents() being
called out of end_io handling, and of course dioread_nolock does a lot
more of that, hence why it shows up in that mode.
I think the long-term solution here is that we have to reserve space
and make the allocation decision at writepages() time, but we don't
actually modify any on-disk state, hence we don't have to hold a
transaction open. We just prevent those disk blocks from getting
allocated anywhere else, and we tentative assoicate physical blocks
with the logical block numbers, but in a memory structure only. Then
when the pages are written, we can drop PageWriteback. We don't have
to wait until the extent blocks are written to disk, so long as any
callers of ext4_map_blocks() get the right information (via an
in-memory cache that we would have to add to make this whole thing
first), and so long as fsync() not only calls filemap_fdatawrite(),
but also waits for the metadata updates to the extent tree/indirect
blocks have been completed.
- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists