Message-ID: <Y2UzSOqJr9glD39i@monkey>
Date: Fri, 4 Nov 2022 08:44:08 -0700
From: Mike Kravetz <mike.kravetz@...cle.com>
To: Peter Xu <peterx@...hat.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
Andrew Morton <akpm@...ux-foundation.org>,
James Houghton <jthoughton@...gle.com>,
Miaohe Lin <linmiaohe@...wei.com>,
David Hildenbrand <david@...hat.com>,
Muchun Song <songmuchun@...edance.com>,
Andrea Arcangeli <aarcange@...hat.com>,
Nadav Amit <nadav.amit@...il.com>,
Rik van Riel <riel@...riel.com>
Subject: Re: [PATCH RFC 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe
for pmd unshare
On 11/04/22 11:02, Peter Xu wrote:
> Hi, Mike,
>
> On Thu, Nov 03, 2022 at 05:21:46PM -0700, Mike Kravetz wrote:
> > On 10/30/22 17:29, Peter Xu wrote:
> > > Resolution
> > > ==========
> > >
> > > What this patch proposed is, besides using the vma lock, we can also use
> > > RCU to protect the pgtable page from being freed from under us when
> > > huge_pte_offset() is used. The idea is kind of similar to RCU fast-gup.
> > > Note that fast-gup is very safe regarding pmd unsharing even before vma
> > > lock, because fast-gup relies on RCU to protect walking any pgtable page,
> > > including another mm's.
> > >
> > > To apply the same idea to huge_pte_offset(), it means with proper RCU
> > > protection the pte_t* pointer returned from huge_pte_offset() can also be
> > > always safe to access and de-reference, along with the pgtable lock that
> > > was bound to the pgtable page.
> > >
> > > Patch Layout
> > > ============
> > >
> > > Patch 1 is a trivial cleanup that I noticed when working on this. Please
> > > shout if anyone thinks I should post it separately, or hopefully I can
> > > still just carry it along.
> > >
> > > Patch 2 is the core of the patchset, describing how the helper
> > > huge_pte_offset() should be used correctly. It is only a comment patch,
> > > but it should be the most important one, as the follow-up patches are
> > > just trying to follow the rule it sets up here.
> > >
> > > The rest of the patches resolve all the call sites of huge_pte_offset()
> > > to make sure each one is either under the vma lock (which is perfectly
> > > good enough for safety in this case; the last patch comments on all
> > > those callers so we won't miss a single case, and explains why they're
> > > safe) or under RCU: each patch adds RCU protection to one caller of
> > > huge_pte_offset().
> > >
> > > Tests
> > > =====
> > >
> > > Only lightly tested with the hugetlb kselftests, including uffd; no more
> > > errors are triggered than on current mm-unstable (hugetlb-madvise fails
> > > before and after this series with the error "Unexpected number of free
> > > huge pages line 207"; I haven't had time to look into it).
> >
> > Do not worry about the madvise test failure; that is caused by a
> > recent change.
> >
> > Unless I am missing something, the basic strategy in this series is to
> > wrap calls to huge_pte_offset and subsequent ptep accesses with
> > rcu_read_lock/unlock calls. I must embarrassingly admit that it has
> > been a loooong time since I had to look at RCU usage, and I may not
> > know what I am talking about. However, I seem to recall that one needs
> > to somehow flag the data items being protected from update/freeing. I
> > do not see anything like that in the huge_pmd_unshare routine where
> > the pmd page pointer is updated. Or, is it where the pmd page pointer
> > is dereferenced in huge_pte_offset?
>
> Right. The RCU proposed here is trying to protect the pmd pgtable page
> that will normally be freed in an RCU pattern. Please refer to
> tlb_remove_table_free() (which can be called from tlb_finish_mmu()),
> where the page is released with the RCU API:
>
> call_rcu(&batch->rcu, tlb_remove_table_rcu);
>
Thanks! That is the piece of the puzzle I was missing.
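For anyone else following along, my understanding of that deferred-free
pattern is roughly the below (a simplified sketch of mm/mmu_gather.c, not
the exact kernel code):

```c
/* Sketch of the MMU_GATHER_RCU_TABLE_FREE pattern (simplified).
 *
 * Writer side: rather than freeing a pgtable page immediately when it
 * is unlinked, queue it in a batch and hand the batch to RCU.  Any
 * walker that was inside an RCU read-side critical section when the
 * page was unlinked is guaranteed the page stays valid until it calls
 * rcu_read_unlock().
 */
static void tlb_remove_table_rcu(struct rcu_head *head)
{
	struct mmu_table_batch *batch =
		container_of(head, struct mmu_table_batch, rcu);

	/* ... actually free the queued pgtable pages here ... */
}

static void tlb_table_free_deferred(struct mmu_table_batch *batch)
{
	/* The callback runs only after a full RCU grace period,
	 * i.e. after all current readers have finished. */
	call_rcu(&batch->rcu, tlb_remove_table_rcu);
}
```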
> I mentioned fast-gup just as a reference, since it has the same usage:
> fast-gup runs the same risk without RCU or a similar IPI-based
> protection. But I definitely can be clearer, and I will enrich the
> cover letter in the next post.
>
> In short, my understanding is that pgtable pages (including the shared
> PUD page for hugetlb) need to be freed with caution, because there can
> be software walking the pages with no locks held. In our case, even
> though huge_pte_offset() is called with the mmap lock, due to pmd
> sharing it is not always the same mmap lock as the one held when the
> pgtable needs to be freed, so it's similar to having no lock at all,
> imo. huge_pte_offset() then needs to be protected just like what we do
> with fast-gup.
>
> Please also feel free to refer to the comment chunk at the start of
> asm-generic/tlb.h for more information on the mmu gather API.
>
> >
> > Please ignore if you are certain of this rcu usage, otherwise I will
> > spend some time reeducating myself.
Sorry for any misunderstanding. I am very happy with the RFC and the
work you have done. I was just missing the piece about RCU
synchronization when the page table is removed.
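So, if I now understand correctly, the reader-side contract the series
establishes would look roughly like this (my paraphrase of the rule, not
a patch from the series):

```c
/* Sketch of the reader-side rule for huge_pte_offset() callers:
 * hold rcu_read_lock() across the lookup and any dereference of the
 * returned pte_t *, and take the pgtable lock bound to that page
 * before reading/modifying the pte itself.
 */
pte_t *pte;
spinlock_t *ptl;

rcu_read_lock();			/* pin the pmd pgtable page */
pte = huge_pte_offset(mm, address, huge_page_size(h));
if (pte) {
	ptl = huge_pte_lock(h, mm, pte);
	/* ... safely read/modify the hugetlb pte here ... */
	spin_unlock(ptl);
}
rcu_read_unlock();		/* page may now be freed via call_rcu() */
```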
> I'm not certain, and I'd like to get any form of comment. :)
>
> Sorry if this RFC version is confusing, but if it can at least explain
> the problem we have, and we can agree on the problem first, then that
> will already be a step forward for me. So far that's more important
> than how we resolve it, whether with RCU, the vma lock, or anything
> else.
>
> For a non-RFC series, I think I need to be more careful about some
> details, e.g., the RCU protection for the pgtable page is only usable
> when the arch supports MMU_GATHER_RCU_TABLE_FREE. I thought that was
> always supported, at least for archs with pmd sharing enabled, but I'm
> actually wrong:
>
> arch/arm64/Kconfig: select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
> arch/riscv/Kconfig: select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
> arch/x86/Kconfig: select ARCH_WANT_HUGE_PMD_SHARE
>
> arch/arm/Kconfig: select MMU_GATHER_RCU_TABLE_FREE if SMP && ARM_LPAE
> arch/arm64/Kconfig: select MMU_GATHER_RCU_TABLE_FREE
> arch/powerpc/Kconfig: select MMU_GATHER_RCU_TABLE_FREE
> arch/s390/Kconfig: select MMU_GATHER_RCU_TABLE_FREE
> arch/sparc/Kconfig: select MMU_GATHER_RCU_TABLE_FREE if SMP
> arch/sparc/include/asm/tlb_64.h:#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> arch/x86/Kconfig: select MMU_GATHER_RCU_TABLE_FREE if PARAVIRT
>
> I think it means that at least on RISCV, RCU_TABLE_FREE is not enabled
> and we'll need to rely on IPIs (e.g., I think we need to replace
> rcu_read_lock() with local_irq_disable() on RISCV only, for what this
> patchset wants to do). In the next version, I plan to add a helper,
> let's name it huge_pte_walker_lock() for now, and it should be one of
> the three options:
>
> - if !ARCH_WANT_HUGE_PMD_SHARE: it's no-op
> - else if MMU_GATHER_RCU_TABLE_FREE: it should be rcu_read_lock()
> - else: it should be local_irq_disable()
>
> With that, I think we'll strictly follow what we have with fast-gup,
> and in the meantime it should add zero overhead on archs that do not
> have pmd sharing.
>
> Hope the above helps fill in the missing pieces of the cover letter.
> Again, if anything is still missing I'd be more than glad to know.
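FWIW, from your description the helper might look something like the
below (just a sketch with your tentative name; completely untested):

```c
/* Sketch only, following the three cases described above.
 * huge_pte_walker_lock()/unlock() are the tentative names from the
 * discussion, not existing kernel symbols.
 */
static inline void huge_pte_walker_lock(void)
{
#if !defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE)
	/* No pmd sharing: the pgtable cannot be unshared under us. */
#elif defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
	rcu_read_lock();	/* pgtable pages freed via call_rcu() */
#else
	local_irq_disable();	/* pgtable pages freed after IPI broadcast */
#endif
}

static inline void huge_pte_walker_unlock(void)
{
#if !defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE)
	/* no-op */
#elif defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
	rcu_read_unlock();
#else
	local_irq_enable();
#endif
}
```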
--
Mike Kravetz