linux-kernel - Re: [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADrL8HX1YbDArmPsGUH+7b6dNxDa3Xo+fd5npGgz=RFehSw6iA@mail.gmail.com>
Date:   Wed, 20 Jul 2022 13:58:06 -0700
From:   James Houghton <jthoughton@...gle.com>
To:     Peter Xu <peterx@...hat.com>
Cc:     Mike Kravetz <mike.kravetz@...cle.com>,
        Muchun Song <songmuchun@...edance.com>,
        David Hildenbrand <david@...hat.com>,
        David Rientjes <rientjes@...gle.com>,
        Axel Rasmussen <axelrasmussen@...gle.com>,
        Mina Almasry <almasrymina@...gle.com>,
        Jue Wang <juew@...gle.com>,
        Manish Mishra <manish.mishra@...anix.com>,
        "Dr . David Alan Gilbert" <dgilbert@...hat.com>,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE

On Fri, Jul 15, 2022 at 10:20 AM Peter Xu <peterx@...hat.com> wrote:
>
> On Fri, Jul 15, 2022 at 09:58:10AM -0700, James Houghton wrote:
> > On Fri, Jul 15, 2022 at 9:21 AM Peter Xu <peterx@...hat.com> wrote:
> > >
> > > On Fri, Jun 24, 2022 at 05:36:50PM +0000, James Houghton wrote:
> > > > The changes here are very similar to the changes made to
> > > > hugetlb_no_page, where we do a high-granularity page table walk and
> > > > do accounting slightly differently because we are mapping only a piece
> > > > of a page.
> > > >
> > > > Signed-off-by: James Houghton <jthoughton@...gle.com>
> > > > ---
> > > >  fs/userfaultfd.c        |  3 +++
> > > >  include/linux/hugetlb.h |  6 +++--
> > > >  mm/hugetlb.c            | 54 +++++++++++++++++++++-----------------
> > > >  mm/userfaultfd.c        | 57 +++++++++++++++++++++++++++++++----------
> > > >  4 files changed, 82 insertions(+), 38 deletions(-)
> > > >
> > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > > index e943370107d0..77c1b8a7d0b9 100644
> > > > --- a/fs/userfaultfd.c
> > > > +++ b/fs/userfaultfd.c
> > > > @@ -245,6 +245,9 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> > > >       if (!ptep)
> > > >               goto out;
> > > >
> > > > +     if (hugetlb_hgm_enabled(vma))
> > > > +             goto out;
> > > > +
> > >
> > > This is weird.  It means we'll never wait for sub-page mapping enabled
> > > vmas.  Why?
> > >
> >
> > `ret` is true in this case, so we're actually *always* waiting.
>
> Aha!  Then I think that's another problem, sorry. :) See Below.
>
> >
> > > Not to mention hugetlb_hgm_enabled() currently is simply VM_SHARED, so it
> > > means we'll stop waiting for all shared hugetlbfs uffd page faults..
> > >
> > > I'd expect in the in-house postcopy tests you should see vcpu threads
> > > spinning on the page faults until it's serviced.
> > >
> > > IMO we still need to properly wait when the pgtable doesn't have the
> > > faulted address covered.  For sub-page mapping it'll probably need to walk
> > > into sub-page levels.
> >
> > Ok, SGTM. I'll do that for the next version. I'm not sure of the
> > consequences of returning `true` here when we should be returning
> > `false`.
>
> We've put ourselves onto the wait queue, if another concurrent
> UFFDIO_CONTINUE happened and pte is already installed, I think this thread
> could be waiting forever on the next schedule().
>
> The solution should be the same - walking the sub-page pgtable would work,
> afaict.
>
> [...]
>
> > > > @@ -6239,14 +6241,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > > >        * registered, we firstly wr-protect a none pte which has no page cache
> > > >        * page backing it, then access the page.
> > > >        */
> > > > -     if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
> > > > +     if (!hugetlb_pte_none_mostly(dst_hpte))
> > > >               goto out_release_unlock;
> > > >
> > > > -     if (vm_shared) {
> > > > -             page_dup_file_rmap(page, true);
> > > > -     } else {
> > > > -             ClearHPageRestoreReserve(page);
> > > > -             hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
> > > > +     if (new_mapping) {
> > >
> > > IIUC you wanted to avoid the mapcount accountings when it's the sub-page
> > > that was going to be mapped.
> > >
> > > Is it a must we get this only from the caller?  Can we know we're doing
> > > sub-page mapping already here and make a decision with e.g. dst_hpte?
> > >
> > > It looks weird to me to pass this explicitly from the caller, especially
> > > that's when we don't really have the pgtable lock so I'm wondering about
> > > possible race conditions too on having stale new_mapping values.
> >
> > The only way to know what the correct value for `new_mapping` should
> > be is to know if we had to change the hstate-level P*D to non-none to
> > service this UFFDIO_CONTINUE request. I'll see if there is a nice way
> > to do that check in `hugetlb_mcopy_atomic_pte`.
> > Right now there is no
>
> Would "new_mapping = dest_hpte->shift != huge_page_shift(hstate)" work (or
> something alike)?

This works in the hugetlb_fault case, because in the hugetlb_fault
case, we install the largest PTE possible. If we are mapping a page
for the first time, we will use an hstate-sized PTE. But for
UFFDIO_CONTINUE, we may be installing a 4K PTE as the first PTE for
the whole hpage.

>
> > race, because we synchronize on the per-hpage mutex.
>
> Yeah not familiar with that mutex enough to tell, as long as that mutex
> guarantees no pgtable update (hmm, then why we need the pgtable lock
> here???) then it looks fine.

Let me take a closer look at this. I'll have a more detailed
explanation for the next version of the RFC.

>
> [...]
>
> > > > @@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > > >       copied = 0;
> > > >       page = NULL;
> > > >       vma_hpagesize = vma_kernel_pagesize(dst_vma);
> > > > +     if (use_hgm)
> > > > +             vma_altpagesize = PAGE_SIZE;
> > >
> > > Do we need to check the "len" to know whether we should use sub-page
> > > mapping or original hpage size?  E.g. any old UFFDIO_CONTINUE code will
> > > still want the old behavior I think.
> >
> > I think that's a fair point; however, if we enable HGM and the address
> > and len happen to be hstate-aligned
>
> The address can, but len (note! not "end" here) cannot?

They both (dst_start and len) need to be hpage-aligned, otherwise we
won't be able to install hstate-sized PTEs. Like if we're installing
4K at the beginning of a 1G hpage, we can't install a PUD, because we
only want to install that 4K.

>
> > , we basically do the same thing as
> > if HGM wasn't enabled. It could be a minor performance optimization to
> > do `vma_altpagesize=vma_hpagesize` in that case, but in terms of how
> > the page tables are set up, the end result would be the same.
>
> Thanks,

Thanks!

>
> --
> Peter Xu
>