lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZAozq9a25utZtsaD@x1n>
Date:   Thu, 9 Mar 2023 14:29:47 -0500
From:   Peter Xu <peterx@...hat.com>
To:     James Houghton <jthoughton@...gle.com>
Cc:     Mike Kravetz <mike.kravetz@...cle.com>,
        Hugh Dickins <hughd@...gle.com>,
        Muchun Song <songmuchun@...edance.com>,
        "Matthew Wilcox (Oracle)" <willy@...radead.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        David Hildenbrand <david@...hat.com>,
        David Rientjes <rientjes@...gle.com>,
        Axel Rasmussen <axelrasmussen@...gle.com>,
        Jiaqi Yan <jiaqiyan@...gle.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH 0/2] mm: rmap: merge HugeTLB mapcount logic with THPs

On Thu, Mar 09, 2023 at 10:05:12AM -0800, James Houghton wrote:
> On Wed, Mar 8, 2023 at 2:10 PM Peter Xu <peterx@...hat.com> wrote:
> >
> > On Mon, Mar 06, 2023 at 11:00:02PM +0000, James Houghton wrote:
> > > HugeTLB pages may soon support being mapped with PTEs. To allow for this
> > > case, merge HugeTLB's mapcount scheme with THP's.
> > >
> > > The first patch of this series comes from the HugeTLB high-granularity
> > > mapping series[1], though with some updates, as the original version
> > > was buggy[2] and incomplete.
> > >
> > > I am sending this change as part of this smaller series in hopes that it
> > > can be more thoroughly scrutinized.
> > >
> > > I haven't run any THP performance tests with this series applied.
> > > HugeTLB pages don't currently support being mapped with
> > > `compound=false`, but this mapcount scheme will make collapsing
> > > compound=false mappings in HugeTLB pages quite slow. This can be
> > > optimized with future patches (likely by taking advantage of HugeTLB's
> > > alignment guarantees).
> > >
> > > Matthew Wilcox is working on a mapcounting scheme[3] that will avoid
> > > the use of each subpage's mapcount. If this series is applied, Matthew's
> > > new scheme will automatically apply to HugeTLB pages.
> >
> > Is this the plan?
> >
> > I may have not followed closely on the latest development of Matthew's
> > idea.  The thing is if the design requires ptes being installed / removed
> > at the same time for the whole folio, then it may not work directly for HGM
> > if HGM wants to support at least postcopy, iiuc, because if we install the
> > whole folio ptes at the same time it seems to beat the whole purpose of
> > having HGM..
> 
> My understanding is that it doesn't *require* all the PTEs in a folio
> to be mapped at the same time. I don't see how it possibly could,
> given that UFFDIO_CONTINUE exists (which can already create PTE-mapped
> THPs today). It would be faster to populate all the PTEs at the same
> time (you would only need to traverse the page table once for the
> entire group to see if you should be incrementing mapcount).
> 
> Though, with respect to unmapping, if PTEs aren't all unmapped at the
> same time, then you could end up with a case where mapcount is still
> incremented but nothing is really mapped. I'm not really sure what
> should be done there, but this problem applies to PTE-mapped THPs the
> same way that it applies to HGMed HugeTLB pages.
> 
> > The patch (especially patch 1) looks good.  So it's a pure question just to
> > make sure we're on the same page.  IIUC your other mapcount proposal may
> > work, but it still needs to be able to take care of ptes in less-than-folio
> > sizes whatever it'll look like at last.
> 
> By my "other mapcount proposal", I assume you mean the "using the
> PAGE_SPECIAL bit to track if mapcount has been incremented or not". It
> really only serves as an optimization for Matthew's scheme (see below
> [2] for some more thoughts), and it doesn't have to only apply to
> HugeTLB.
> 
> I originally thought[1] that Matthew's scheme would be really painful
> for postcopy for HGM without this optimization, but it's actually not
> so bad. Let's assume the worst case, that we're UFFDIO_CONTINUEing
> from the end to the beginning, like in [1]:
> 
> First CONTINUE: pvmw finds an empty PUD, so quickly returns false.
> Second CONTINUE: pvmw finds 511 empty PMDs, then finds 511 empty PTEs,
> then finds a present PTE (from the first CONTINUE).
> Third CONTINUE: pvmw finds 511 empty PMDs, then finds 510 empty PTEs.
> ...
> 514th CONTINUE: pvmw finds 510 empty PMDs, then finds 511 empty PTEs.
> 
> So it'll be slow, but it won't have to check 262k empty PTEs per
> CONTINUE (though you could make this possible with MADV_DONTNEED).
> Even with an HGM implementation that only allows PTE-mapping of
> HugeTLB pages, it should still behave just like this, too.
> 
> > A trivial comment on patch 2 since we're at it: does "a future plan on some
> > arch to support 512GB huge page" justify itself?  It would be better
> > justified, IMHO, when that support is added (and decided to use HGM)?
> 
> That's fine with me. I'm happy to drop that patch.
> 
> > What I feel like is missing (rather than patch 2 itself) is some guard to
> > make sure thp mapcountings will not be abused with new hugetlb sizes
> > coming.
> >
> > How about another BUG_ON() squashed into patch 1 (probably somewhere in
> > page_add_file|anon_rmap()) to make sure folio_size() is always smaller than
> > COMPOUND_MAPPED / 2)?
> 
> Sure, I can add that.
> 
> Thanks, Peter!
> 
> - James
> 
> [1]: https://lore.kernel.org/linux-mm/CADrL8HUrEgt+1qAtEsOHuQeA+WWnggGfLj8_nqHF0k-pqPi52w@mail.gmail.com/
> 
> [2]: Some details on what the optimization might look like:
> 
> So an excerpt of Matthew's scheme would look something like this:
> 
>     /* if we're mapping < folio_nr_pages(folio) worth of PTEs. */
>     if (!folio_has_ptes(folio, vma))
>       atomic_inc(folio->_mapcount);
> 
> where folio_has_ptes() is defined like:
> 
>     if (!page_vma_mapped_walk(...))
>       return false;
>     page_vma_mapped_walk_done(...);
>     return true;
> 
> You might be able to optimize folio_has_ptes() with a block like this
> at the beginning:
> 
>     if (folio_is_naturally_aligned(folio, vma)) {
>       /* optimization for naturally-aligned folios. */
>       if (folio_test_hugetlb(folio)) {
>         /* check hstate-level PTE, and do a similar check as below. */
>       }
>       /* for naturally-aligned THPs: */
>       pmdp = mm_find_pmd(...); /* or just pass it in. */
>       pmd = READ_ONCE(*pmdp);
>       BUG_ON(!pmd_present(pmd) || pmd_leaf(pmd));
>       if (pmd_special(pmd))
>         return true;
>       /* we already hold the PTL for the PTE. */
>       ptl = pmd_lock(mm, pmdp);
>       /* test and set pmd_special */
>       pmd_unlock(ptl)
>       return if_we_set_pmd_special;
>     }
> 
> (pmd_special() doesn't currently exist.) If HugeTLB walking code can
> be merged with generic mm, then HugeTLB wouldn't have a special case
> at all here.

I see what you mean now, thanks.  That looks fine.  I just suspect the
pte_special trick will still be needed if this will start to apply to HGM,
as it seems to not suite perfectly with a large folio size, still.  The
MADV_DONTNEED worst case of having it loop over ~folio_size() times of none
pte is still possible.

-- 
Peter Xu

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ