linux-kernel - Re: [PATCH v5] userfaultfd/shmem: fix MCOPY_ATOMIC

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <alpine.LSU.2.11.2104081912360.19277@eggly.anvils>
Date:   Thu, 8 Apr 2021 20:08:07 -0700 (PDT)
From:   Hugh Dickins <hughd@...gle.com>
To:     Axel Rasmussen <axelrasmussen@...gle.com>
cc:     Peter Xu <peterx@...hat.com>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Hugh Dickins <hughd@...gle.com>,
        Jerome Glisse <jglisse@...hat.com>,
        Joe Perches <joe@...ches.com>,
        Lokesh Gidra <lokeshgidra@...gle.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        Shaohua Li <shli@...com>, Shuah Khan <shuah@...nel.org>,
        Stephen Rothwell <sfr@...b.auug.org.au>,
        Wang Qing <wangqing@...o.com>,
        LKML <linux-kernel@...r.kernel.org>,
        linux-fsdevel@...r.kernel.org, Linux MM <linux-mm@...ck.org>,
        linux-kselftest@...r.kernel.org, Brian Geffon <bgeffon@...gle.com>,
        Cannon Matthews <cannonmatthews@...gle.com>,
        "Dr . David Alan Gilbert" <dgilbert@...hat.com>,
        David Rientjes <rientjes@...gle.com>,
        Michel Lespinasse <walken@...gle.com>,
        Mina Almasry <almasrymina@...gle.com>,
        Oliver Upton <oupton@...gle.com>
Subject: Re: [PATCH v5] userfaultfd/shmem: fix MCOPY_ATOMIC_CONTINUE
 behavior

On Thu, 8 Apr 2021, Axel Rasmussen wrote:
> On Tue, Apr 6, 2021 at 4:49 PM Peter Xu <peterx@...hat.com> wrote:
> > On Mon, Apr 05, 2021 at 10:19:17AM -0700, Axel Rasmussen wrote:
...
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
...
> > > +
> > > +     if (is_file_backed) {
> > > +             /* The shmem MAP_PRIVATE case requires checking the i_size */
> > > +             inode = dst_vma->vm_file->f_inode;
> > > +             offset = linear_page_index(dst_vma, dst_addr);
> > > +             max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> > > +             ret = -EFAULT;
> > > +             if (unlikely(offset >= max_off))
> > > +                     goto out_unlock;
> >
> > Frankly I don't really know why this must be put into pgtable lock..  Since if
> > not required then it can be moved into UFFDIO_COPY path, as CONTINUE doesn't
> > need it iiuc.  Just raise it up as a pure question.
> 
> It's not clear to me either. shmem_getpage_gfp() does check this twice
> kinda like we're doing, but it doesn't ever touch the PTL. What it
> seems to be worried about is, what happens if a concurrent
> FALLOC_FL_PUNCH_HOLE happens somewhere in the middle of whatever
> manipulation we're doing? From looking at shmem_fallocate(), I think
> the basic point is that truncation happens while "inode_lock(inode)"
> is held, but neither shmem_mcopy_atomic_pte() or the new
> mcopy_atomic_install_ptes() take that lock.
> 
> I'm a bit hesitant to just remove it, run some tests, and then declare
> victory, because it seems plausible it's there to catch some
> semi-hard-to-induce race. I'm not sure how to prove that *isn't*
> needed, so my inclination is to just keep it?
> 
> I'll send a series addressing the feedback so far this afternoon, and
> I'll leave this alone for now - at least, it doesn't seem to hurt
> anything. Maybe Hugh or someone else has some more advice about it. If
> so, I'm happy to remove it in a follow-up.

It takes some thinking about, but the i_size check is required to be
under the pagetable lock, for the MAP_PRIVATE UFFDIO_COPY path, where
it is inserting an anonymous page into the file-backed vma (skipping
actually inserting a page into page cache, as an ordinary fault would).

Not because of FALLOC_FL_PUNCH_HOLE (which makes no change to i_size;
and it's okay if a race fills in the hole immediately afterwards),
but because of truncation (which must remove all beyond i_size).

In the MAP_SHARED case, with a locked page inserted into page cache,
the page lock is enough to exclude concurrent truncation.  But even
in that case the second i_size check (I'm looking at 5.12-rc's
shmem_mfill_atomic_pte(), rather than recent patches which might differ)
is required: because the first i_size check was done before the page
became visible in page cache, so a concurrent truncation could miss it).

Maybe that first check is redundant, though I'm definitely for doing it;
or maybe shmem_add_to_page_cache() would be better if it made that check
itself, under xas_lock (I think the reason it does not is historical).
The second check, in the MAP_SHARED case, does not need to be under
pagetable lock - the page lock on the page cache page is enough -
but probably Andrea placed it there to resemble the anonymous case.

You might then question, how come there is no i_size check in all of
mm/memory.c, where ordinary faulting is handled.  I'll answer that
the pte_same() check, under pagetable lock in wp_page_copy(), is
where the equivalent to userfaultfd's MAP_PRIVATE UFFDIO_COPY check
is made: if the page cache page has already been truncated, that pte
will have been cleared.

Or, if the page cache page is truncated an instant after wp_page_copy()
drops page table lock, then the unmap_mapping_range(,,, even_cows = 1)
which follows truncation has to clean it up.  Er, does that mean that
the i_size check I started off insisting is required, actually is not
required?  Um, maybe, but let's just keep it and say goodnight!

Hugh