Message-ID: <X8lADgHCRqlQi3Xa@redhat.com>
Date: Thu, 3 Dec 2020 14:44:14 -0500
From: Andrea Arcangeli <aarcange@...hat.com>
To: Peter Xu <peterx@...hat.com>
Cc: Hugh Dickins <hughd@...gle.com>,
Matthew Wilcox <willy@...radead.org>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
Andrew Morton <akpm@...ux-foundation.org>,
Mike Rapoport <rppt@...ux.vnet.ibm.com>,
David Hildenbrand <david@...hat.com>
Subject: Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions
on reads
On Thu, Dec 03, 2020 at 01:02:34PM -0500, Peter Xu wrote:
> On Wed, Dec 02, 2020 at 09:36:45PM -0800, Hugh Dickins wrote:
> > On Wed, 2 Dec 2020, Peter Xu wrote:
> > > On Wed, Dec 02, 2020 at 02:37:33PM -0800, Hugh Dickins wrote:
> > > > On Tue, 1 Dec 2020, Andrea Arcangeli wrote:
> > > > >
> > > > > Any suggestions on how to have the per-vaddr per-mm _PAGE_UFFD_WP bit
> > > > > survive the pte invalidates in a way that remains associated to a
> > > > > certain vaddr in a single mm (so it can shoot itself in the foot if it
> > > > > wants, but it can't interfere with all other mm sharing the shmem
> > > > > file) would be welcome...
> > > >
> > > > I think it has to be a new variety of swap-like non_swap_entry() pte,
> > > > see include/linux/swapops.h. Anything else would be more troublesome.
Agreed. Solving it by changing the unmapping of the ptes is also some
trouble but less troublesome than adding new bitmaps to the vma to
handle in vma_merge/vma_split.
> > But those ptes will be pte_present(), so you must provide a pfn,
>
> Could I ask why?
_PAGE_PROTNONE exists for only one reason: so that pte_present returns
true and the page is as good as mapped. The pfn is real and mapped and
everything is up and running fine, except that _PAGE_PRESENT is not set
in the pte. _PAGE_PROTNONE doesn't unmap the pte, it only triggers
faults on a mapped pte.
When we set _PAGE_PROTNONE and clear _PAGE_PRESENT atomically, nothing
changes at all for pte_present and the rest of the common VM code,
except that now you get page faults.
NUMA hinting faults use that and it's a perfect fit there, because you
want to change nothing but still be notified on access.
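
To make the distinction concrete, here is a minimal user-space sketch of
the idea. It is a toy model, not kernel code: every model_* name and the
bit positions are made up for illustration, and the real layout is
arch-specific.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions only; the real values are arch-specific. */
#define MODEL_PAGE_PRESENT  (1ULL << 0)
#define MODEL_PAGE_PROTNONE (1ULL << 8)
#define MODEL_PFN_SHIFT     12

typedef struct { uint64_t val; } model_pte_t;

/* Present as far as common VM code is concerned: either bit is enough. */
static bool model_pte_present(model_pte_t pte)
{
	return (pte.val & (MODEL_PAGE_PRESENT | MODEL_PAGE_PROTNONE)) != 0;
}

/* What the hardware looks at: only the real present bit. */
static bool model_hw_faults(model_pte_t pte)
{
	return (pte.val & MODEL_PAGE_PRESENT) == 0;
}

static uint64_t model_pte_pfn(model_pte_t pte)
{
	return pte.val >> MODEL_PFN_SHIFT;
}

/* Swap PRESENT for PROTNONE in one step; the pfn is left untouched. */
static model_pte_t model_pte_mkprotnone(model_pte_t pte)
{
	assert(pte.val & MODEL_PAGE_PRESENT);
	pte.val &= ~MODEL_PAGE_PRESENT;
	pte.val |= MODEL_PAGE_PROTNONE;
	return pte;
}
```

After model_pte_mkprotnone() the entry still counts as present and still
carries its pfn; the only observable change is that an access now faults.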
Here instead you need to really unmap the page and lose any pfn or
page reference and everything along with it. Just one bit needs to
survive the unmap: the _PAGE_UFFD_WP bit.
I tend to agree this needs to work more like the migration entry, as
Hugh suggested, or use an entirely new mechanism to keep "vma specific
state" alive for filebacked mappings that get zapped but that still
have an intact vma.
The vma removal in the munmap() syscall is then the only point where
the pte is allowed to be cleared for good, and only then can the pte
be freed.
Not even MADV_DONTNEED should be allowed to zero out the pte: it can
drop everything but that single bit.
> Meanwhile, this reminded me another option - besides _PAGE_PROTNONE, can we use
> _PAGE_SPECIAL? That sounds better at least from its naming that it tells it's
> a special page already. We can also leverage existing pte_special() checks here
> and there, then mimic what we do with pte_devmap(), maybe?
That's also for mapped pages, VM_PFNMAP or similar, I think.
From memory, the !pte_present case for filebacked vmas only exists as
a migration entry, so I think we'll just add a branch to that case so
that it can disambiguate the migration entry from the _PAGE_UFFD_WP
bit.
So we can reserve one bit in the migration entry that is always
enforced to be zero when it is a migration entry.
When that bit is set on a non-present page in a filebacked vma, it
will mean _PAGE_UFFD_WP is set for that vaddr in that mm.
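
A rough sketch of how such an encoding could be told apart, as a
user-space toy model. The bit layout, the SK_* names and the sk_*
helpers are all hypothetical and chosen only for illustration; the real
swap entry format is arch-specific and differs from this.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PRESENT       (1ULL << 0)
/* Hypothetical reserved bit: always zero in a real migration entry. */
#define SK_NP_UFFD_WP    (1ULL << 1)
#define SK_PAYLOAD_SHIFT 2

enum sk_kind { SK_NONE, SK_MAPPED, SK_MIGRATION, SK_WP_MARKER };

/* Build a migration-like non-present entry; the reserved bit stays 0. */
static uint64_t sk_make_migration(uint64_t payload)
{
	uint64_t val = payload << SK_PAYLOAD_SHIFT;

	assert(payload != 0);
	assert((val & (SK_PRESENT | SK_NP_UFFD_WP)) == 0);
	return val;
}

/* Non-present entry that carries nothing but the uffd-wp state. */
static uint64_t sk_make_wp_marker(void)
{
	return SK_NP_UFFD_WP;
}

/* The extra branch in the !present case: reserved bit set means marker. */
static enum sk_kind sk_classify(uint64_t val)
{
	if (val == 0)
		return SK_NONE;
	if (val & SK_PRESENT)
		return SK_MAPPED;
	if (val & SK_NP_UFFD_WP)
		return SK_WP_MARKER;
	return SK_MIGRATION;
}
```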
Then we need to pass a parameter in all pte zapping operations to tell
if the unmap event is happening because the vma has been truncated, or
if it's happening for any other reason.
If it's happening for any other reason (page truncate, MADV_DONTNEED,
memory pressure to swap/writepage) we just convert any present pte
with _PAGE_UFFD_WP set to the non-migration entry non-present pte
with the reserved migration entry bit set.
If the present pte has no _PAGE_UFFD_WP then it'll be zapped as usual
regardless of VM_UFFD_WP in the vma or not.
If the pte zapping is instead invoked because of a vma truncation,
which means it's the last unmap operation on that vaddr, the pte zap
logic will be told to ignore the _PAGE_UFFD_WP in any present pte so
the ptes will be zeroed out and later freed as needed.
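
The zap rules above could be sketched roughly as follows. This is a
user-space toy model under stated assumptions, not a proposed patch:
the zp_* names, the reason enum and the bit positions are hypothetical,
and keeping an already-installed marker on a non-final zap is my reading
of "drop everything but that single bit".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ZP_PRESENT   (1ULL << 0)
#define ZP_UFFD_WP   (1ULL << 1)  /* stands in for _PAGE_UFFD_WP in a present pte */
#define ZP_WP_MARKER (1ULL << 2)  /* hypothetical reserved bit in a non-present pte */

/* Hypothetical "why are we zapping" parameter passed down to the zap code. */
enum zp_reason { ZP_VMA_GONE, ZP_PAGE_TRUNCATE, ZP_MADV_DONTNEED, ZP_RECLAIM };

/* Return the value left in the pte slot after zapping one entry. */
static uint64_t zp_zap_one(uint64_t pte, enum zp_reason why)
{
	bool present = (pte & ZP_PRESENT) != 0;
	bool wp_tracked;
	uint64_t out;

	/* Last unmap of this vaddr: ignore the wp bit and clear for good. */
	if (why == ZP_VMA_GONE)
		return 0;

	/* Either a present pte with the wp bit, or a marker left earlier. */
	wp_tracked = present ? (pte & ZP_UFFD_WP) != 0
			     : (pte & ZP_WP_MARKER) != 0;

	/* Drop the pfn and everything else; only the one bit survives. */
	out = wp_tracked ? ZP_WP_MARKER : 0;
	assert((out & ZP_PRESENT) == 0);
	return out;
}
```

For example, zp_zap_one() on a present write-protected pte with
ZP_MADV_DONTNEED leaves only ZP_WP_MARKER behind, while the same pte
zapped with ZP_VMA_GONE becomes zero.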
Thanks,
Andrea