linux-ext4 - Re: [PATCH] ext4: Rewrite ext4_page

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20110620103807.GA4810@quack.suse.cz>
Date:	Mon, 20 Jun 2011 12:38:07 +0200
From:	Jan Kara <jack@...e.cz>
To:	Ted Ts'o <tytso@....edu>
Cc:	Jan Kara <jack@...e.cz>, linux-ext4@...r.kernel.org
Subject: Re: [PATCH] ext4: Rewrite ext4_page_mkwrite() to return locked page

On Sun 19-06-11 22:53:14, Ted Tso wrote:
> On Wed, Jun 08, 2011 at 04:55:49PM +0200, Jan Kara wrote:
> >   OK, so I read the code again and you are right that end_page_writeback()
> > will be called before the work is queued and thus there are no problems
> > with i_mutex (both with multiblock-IO code and with the old code). Sorry for
> > confusion. But reading the code I have a few questions:
> >   In multiblock io submission code we call end_page_writeback() and drop
> > page references before we queue work converting extent from uninitialized
> > to initialized. But what prevents mm from reclaiming the page (and possibly
> > loading zeros from disk) before we finish the conversion? 
> 
> We hold a reference on the page (i.e., via get_page()), which we don't
> drop until we are doing doing the conversion.  The workqueue function
> (ext4_end_io_work, in fs/ext4/page_io.c) calls ext4_free_io_end(),
> which drops the page reference.
  Hmm, let's go though the call chain and see where I miss something:
io_end is allocated and bio is initialized in io_submit_init() - we set
  bi_end_io to ext4_end_bio there
pages are added to io_end ext4_bio_write_page() - it creates io_page
  (p_count is set to 1, we grab page reference - get_page()), then
  io_submit_add_bh() adds io_page to io_end (p_count increased to 2), then
  we do put_io_page() in the end of ext4_bio_write_page() -> p_count == 1.
Eventually bio is submitted, io gets completed and ext4_end_bio() is
  called. It goes though all pages and does put_io_page(), thus p_count
  gets to 0, we clear PageWriteback and do put_page(). From this moment on,
  we hold no reference to the page AFAICS.

Note, there is a separate call chain from ext4_writepage(). There the
reference handling seems to be as you describe - i.e., we get reference in
ext4_set_bh_endio(), we call end_buffer_async_write() in
ext4_end_io_buffer_write() which clears PageWriteback and then we drop page
reference in ext4_free_io_end().

> > Related question is: Commit
> > bd2d0210cf22f2bd0cef72eb97cf94fc7d31d8cc removed getting and putting of
> > inode reference when creating/freeing end IO work. But now I don't see what
> > prevents mm from reaping the inode before the workqueue gets to processing
> > it.
> 
> Actually it was commit f7ad6d2e9201a which removed the getting and
> putting of the inode reference, and it describes the issues involved.
  Ah, OK. Thanks for the pointer.

> > Finally, commit 1449032be17abb69116dbc393f67ceb8bd034f92 returned back the
> > old IO submission code but apparently it forgot to return the old handling
> > of uninitialized buffers so we unconditionnaly call block_write_full_page()
> > without specifying end_io function. 
> 
> Yeah.  That was a bug, no question.  No one apparently ran into it (or
> at least no one bothered to report it to linux-ext4 or linux-kernel as
> far as I know) while it was the default (i.e., during 2.6.37), and as
> of 2.6.38 we are now using the new IO submission code by default
> again.  Given that no one has reported any problems with the new IO
> submission code, I was planning on completely removing the
> nomblk_io_submit mount option for the 3.1 kernel, which would fix the
> bug by virtue of nuking code in question.  :-)
  Yep, that would solve it as well :).

> > So to return to our original discussion regarding i_mutex - you are right
> > currently i_mutex does not cause a deadlock but to fix the first of the
> > above problems we need to somehow pin the page before all metadata is
> > properly updated, that metadata update requires i_mutex, and we will have
> > to be careful to pin the page in such a way so that memory reclaim cannot
> > get stuck reaping that page... If you can solve that, I'll be happy but
> > currently I'd find it more robust to just call end_page_writeback() after
> > we convert extents and avoid fs recursion from allocations.
> 
> So I think the current fs/ext4/page_io.c code solves the problem
> simply by grabbing a reference on the bug, which should prevent the mm
> layer from reclaiming the page.  Does that satisfy you?  Do you see
> anything that I may have missed?
  OK, if we fix the reference counting I'm now convinced that the code
would be correct. Fixing that does not seem to be completely simple - we
need to use p_count to detect when we can do end_page_writeback() when
blocksize < pagesize and io submission races with io completion (current
code is correct in that respect). But we need another mechanism detect when
we can safely put page reference... Separate reference counter would
certainly do but it might be an overkill. Possibly we could get additional
page reference for each io_page reference?

> P.S.  Upon further reflection, perhaps we need to drop a few more code
> comments in it so it's clearer to folks who try to go through this
> code path in the future.
  Well, I think it would be worthwhile to have a highlevel description
somewhere (i.e. what protects inode from freeing, what protects page from
freeing, how we manage PageWriteback, page references and extent
conversion). The actual code is not that hard to follow but it's not easy
to see the design behind it...

								Honza
-- 
Jan Kara <jack@...e.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html