lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180104113301.GE29010@quack2.suse.cz>
Date:   Thu, 4 Jan 2018 12:33:01 +0100
From:   Jan Kara <jack@...e.cz>
To:     Dan Williams <dan.j.williams@...el.com>
Cc:     Jan Kara <jack@...e.cz>, Linux MM <linux-mm@...ck.org>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        linux-xfs <linux-xfs@...r.kernel.org>,
        linux-ext4 <linux-ext4@...r.kernel.org>
Subject: Re: Filesystem crashes due to pages without buffers

On Wed 03-01-18 20:56:32, Dan Williams wrote:
> On Wed, Jan 3, 2018 at 2:04 AM, Jan Kara <jack@...e.cz> wrote:
> > Hello,
> >
> > Over the years I have seen so far unexplained crashed in filesystem's
> > (ext4, xfs) writeback path due to dirty pages without buffers attached to
> > them (see [1] and [2] for relatively recent reports). This was confusing as
> > reclaim takes care not to strip buffers from a dirty page and both
> > filesystems do add buffers to a page when it is first written to - in
> > ->page_mkwrite() and ->write_begin callbacks.
> >
> > Recently I have come across a code path that is probably leading to this
> > inconsistent state and I'd like to discuss how to best fix the problem
> > because it's not obvious to me. Consider the following race:
> >
> > CPU1                                    CPU2
> >
> > addr = mmap(file1, MAP_SHARED, ...);
> > fd2 = open(file2, O_DIRECT | O_RDONLY);
> > read(fd2, addr, len)
> >   do_direct_IO()
> >     page = dio_get_page()
> >       dio_refill_pages()
> >         iov_iter_get_pages()
> >           get_user_pages_fast()
> >             - page fault
> >               ->page_mkwrite()
> >                 block_page_mkwrite()
> >                   lock_page(page);
> >                   - attaches buffers to page
> >                   - makes sure blocks are allocated
> >                   set_page_dirty(page)
> >               - install writeable PTE
> >               unlock_page(page);
> >     submit_page_section(page)
> >       - submits bio with 'page' as a buffer
> >                                         kswapd reclaims pages:
> >                                         ...
> >                                         shrink_page_list()
> >                                           trylock_page(page) - this is the
> >                                             page CPU1 has just faulted in
> >                                           try_to_unmap(page)
> >                                           pageout(page);
> >                                             clear_page_dirty_for_io(page);
> >                                             ->writepage()
> >                                           - let's assume page got written
> >                                             out fast enough, alternatively
> >                                             we could get to the same path as
> >                                             soon as the page IO completes
> >                                           if (page_has_private(page)) {
> >                                             try_to_release_page(page)
> >                                               - reclaims buffers from the
> >                                                 page
> >                                            __remove_mapping(page)
> >                                              - fails as DIO code still
> >                                                holds page reference
> > ...
> >
> > eventually read completes
> >   dio_bio_complete(bio)
> >     set_page_dirty_lock(page)
> >       Bummer, we've just marked the page as dirty without having buffers.
> >       Eventually writeback will find it and filesystem will complain...
> >
> > Am I missing something?
> >
> > The problem here is that filesystems fundamentally assume that a page can
> > be written to only between ->write_begin - ->write_end (in this interval
> > the page is locked), or between ->page_mkwrite - ->writepage and above is
> > an example where this does not hold because when a page reference is
> > acquired through get_user_pages(), page can get written to by the holder of
> > the reference and dirtied even after it has been unmapped from page tables
> > and ->writepage has been called. This is not only a cosmetic issue leading
> > to assertion failure but it can also lead to data loss, data corruption, or
> > other unpleasant surprises as filesystems assume page contents cannot be
> > modified until either ->write_begin() or ->page_mkwrite gets called and
> > those calls are serialized by proper locking with problematic operations
> > such as hole punching etc.
> >
> > I'm not sure how to fix this problem. We could 'simulate' a writeable page
> > fault in set_page_dirty_lock(). It is a bit ugly since we don't have a
> > virtual address of the fault, don't hold mmap_sem, etc., possibly
> > expensive, but it would make filesystems happy. Data stored by GUP user
> > (e.g. read by DIO in the above case) could still get lost if someone e.g.
> > punched hole under the buffer or otherwise messed with the underlying
> > storage of the page while DIO was running but arguably users could expect
> > such outcome.
> >
> > Another possible solution would be to make sure page is writeably mapped
> > until GUP user drops its reference. That would be arguably cleaner but
> > probably that would mean we have to track number of writeable GUP page
> > references separately (no space space in struct page is a problem here) and
> > block page_mkclean() until they are dropped. Also for long term GUP users
> > like Infiniband or V4L we'd have to come up with some solution as we should
> > not block page_mkclean() for so long.
> 
> Do we need to block page_mkclean, or could we defer buffer reclaiming
> to the last put of the page?

As I wrote to Dave the problem is no so much with reclaiming of buffers but
with the fact filesystems don't expect page can be dirtied after
page_mkclean() is finished.

> I think once we have the "register memory with lease" mechanism for
> Infiniband we could expand it to the page cache case. The problem is
> the regression this would cause with userspace that expects it can
> maintain file backed memory registrations indefinitely.
> 
> What are the implications of holding off page_mkclean or release
> buffers indefinitely?

Bad. You cannot write the page to disk until page_mkclean() finishes as
page_mkclean() is part of clear_page_dirty_for_io(). And we really do need
that functionality there e.g. to make sure tail of the last page in the
file is properly zeroed out, storage with DIF/DIX can compute checksum of
the data safely before submitting it to the device etc.

> Is an indefinite / interruptible sleep waiting for the 'put' event of
> a get_user_pages() page unacceptable? The current case that the file
> contents will not be coherent with respect to in-flight RDMA, perhaps
> waiting for that to complete is better than cleaning buffers from the
> page prematurely.

Yeah, indefinite sleep is really a no-go.

> > As a side note DAX needs some solution for GUP users as well. The problems
> > are similar there in nature, just much easier to hit. So at least a
> > solution for long-term GUP users can (and I strongly believe should) be
> > shared between standard and DAX paths.
> 
> In the DAX case we rely on the fact that when the page goes idle we
> only need to worry about the filesytem block map changing, the page
> won't get reallocated somewhere else. We can't use page idle as an
> event in this case, however, if the page reference count is one then
> the DIO code can know that it has the page exclusively, so maybe DAX
> and non-DAX can share the page count == 1 event notification.

The races I describe do not need to involve truncate / hole punching. It is
just enough to race with page writeback. So page references are of no use
here. We would have to specifically track number of references acquired by
GUP or something like that. So what I wanted to share with DAX is the
long-term pin handling, the rest is unclear for now.

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ