Message-ID: <Y2UplvB0dgQYIZWm@x1n>
Date:   Fri, 4 Nov 2022 11:02:46 -0400
From:   Peter Xu <peterx@...hat.com>
To:     Mike Kravetz <mike.kravetz@...cle.com>
Cc:     linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        Andrew Morton <akpm@...ux-foundation.org>,
        James Houghton <jthoughton@...gle.com>,
        Miaohe Lin <linmiaohe@...wei.com>,
        David Hildenbrand <david@...hat.com>,
        Muchun Song <songmuchun@...edance.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Nadav Amit <nadav.amit@...il.com>,
        Rik van Riel <riel@...riel.com>
Subject: Re: [PATCH RFC 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe
 for pmd unshare

Hi, Mike,

On Thu, Nov 03, 2022 at 05:21:46PM -0700, Mike Kravetz wrote:
> On 10/30/22 17:29, Peter Xu wrote:
> > Resolution
> > ==========
> > 
> > What this patch proposed is, besides using the vma lock, we can also use
> > RCU to protect the pgtable page from being freed from under us when
> > huge_pte_offset() is used.  The idea is kind of similar to RCU fast-gup.
> > Note that fast-gup is very safe regarding pmd unsharing even before vma
> > lock, because fast-gup relies on RCU to protect walking any pgtable page,
> > including another mm's.
> > 
> > To apply the same idea to huge_pte_offset(), it means with proper RCU
> > protection the pte_t* pointer returned from huge_pte_offset() can also be
> > always safe to access and de-reference, along with the pgtable lock that
> > was bound to the pgtable page.
> > 
> > Patch Layout
> > ============
> > 
> > Patch 1 is a trivial cleanup that I noticed when working on this.  Please
> > shoot if anyone think I should just post it separately, or hopefully I can
> > still just carry it over.
> > 
> > Patch 2 is the gut of the patchset, describing how we should use the helper
> > huge_pte_offset() correctly. Only a comment patch but should be the most
> > important one, as the follow up patches are just trying to follow the rule
> > it setup here.
> > 
> > The rest patches resolve all the call sites of huge_pte_offset() to make
> > sure either it's with the vma lock (which is perfectly good enough for
> > safety in this case; the last patch commented on all those callers to make
> > sure we won't miss a single case, and why they're safe).  Besides, each of
> > the patch will add rcu protection to one caller of huge_pte_offset().
> > 
> > Tests
> > =====
> > 
> > Only lightly tested on hugetlb kselftests including uffd, no more errors
> > triggered than current mm-unstable (hugetlb-madvise fails before/after
> > here, with error "Unexpected number of free huge pages line 207"; haven't
> > really got time to look into it).
> 
> Do not worry about the madvise test failure, that is caused by a recent
> change.
> 
> Unless I am missing something, the basic strategy in this series is to
> wrap calls to huge_pte_offset and subsequent ptep access with
> rcu_read_lock/unlock calls.  I must embarrassingly admit that it has
> been a loooong time since I had to look at rcu usage and may not know
> what I am talking about.  However, I seem to recall that one needs to
> somehow flag the data items being protected from update/freeing.  I
> do not see anything like that in the huge_pmd_unshare routine where
> pmd page pointer is updated.  Or, is it where the pmd page pointer is
> referenced in huge_pte_offset?

Right.  The RCU proposed here is trying to protect the pmd pgtable page,
which is normally freed in an RCU pattern.  Please refer to
tlb_remove_table_free() (which can be called from tlb_finish_mmu()), where
it's released with the RCU API:

	call_rcu(&batch->rcu, tlb_remove_table_rcu);

I mentioned fast-gup only as a reference for the same usage: fast-gup has
the same risk without RCU or a similar IPI-based protection.  I can
definitely be clearer, though, and I will enrich the cover letter in the
next post.

In short, my understanding is that pgtable pages (including the shared PUD
page for hugetlb) need to be freed with caution because there can be
software walking the pages with no locks.  In our case, even though
huge_pte_offset() is called with the mmap lock held, due to pmd sharing it
is not always the same mmap lock as the one held when the pgtable is freed,
so it's similar to having no lock here, imo.  Then huge_pte_offset() needs
to be protected just like what we do with fast-gup.
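
To make the deferred-free reasoning above a bit more concrete, here is a
single-threaded userspace toy, not kernel code.  Every toy_* name is
hypothetical, and the toy waits for all readers to leave, while real RCU
only waits for readers that started before call_rcu().  It only shows that
a page queued for freeing stays valid while a walker is still inside its
read-side section:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8

/* Toy stand-in for RCU state; hypothetical, for illustration only. */
struct toy_rcu {
	int readers;                      /* read-side nesting depth */
	void (*cb[MAX_PENDING])(void *);  /* deferred free callbacks */
	void *arg[MAX_PENDING];
	size_t npending;
};

/* Run the queued callbacks, but only once no reader is left. */
static void toy_flush(struct toy_rcu *r)
{
	if (r->readers > 0)
		return;
	for (size_t i = 0; i < r->npending; i++)
		r->cb[i](r->arg[i]);
	r->npending = 0;
}

static void toy_read_lock(struct toy_rcu *r)
{
	r->readers++;
}

static void toy_read_unlock(struct toy_rcu *r)
{
	assert(r->readers > 0);
	r->readers--;
	toy_flush(r);
}

/* Queue a free instead of freeing right away (the call_rcu() role). */
static void toy_call_rcu(struct toy_rcu *r, void (*cb)(void *), void *arg)
{
	assert(r->npending < MAX_PENDING);
	r->cb[r->npending] = cb;
	r->arg[r->npending] = arg;
	r->npending++;
	toy_flush(r);
}

/* Toy pgtable page: "freeing" it just sets a flag. */
struct toy_pgtable {
	bool freed;
};

static void toy_free_pgtable(void *p)
{
	((struct toy_pgtable *)p)->freed = true;
}
```

If a walker calls toy_read_lock() before the unshare path calls
toy_call_rcu(), the page is only marked freed after the walker's
toy_read_unlock().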

Please also feel free to refer to the comment chunk at the start of
asm-generic/tlb.h for more information on the mmu gather API.

> 
> Please ignore if you are certain of this rcu usage, otherwise I will
> spend some time reeducating myself.

I'm not certain, and I'd like to get any form of comment. :)

Sorry if this RFC version is confusing, but if it can at least explain the
problem we have, and if we can agree on the problem first, then that'll
already be a step forward to me.  So far that's more important than how we
resolve it, using RCU or vma lock or anything else.

For a non-rfc series, I think I need to be more careful on some details,
e.g., the RCU protection for pgtable pages is only used when the arch
supports MMU_GATHER_RCU_TABLE_FREE.  I thought that was always supported, at
least for archs with pmd sharing enabled, but I was actually wrong:

arch/arm64/Kconfig:     select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
arch/riscv/Kconfig:     select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
arch/x86/Kconfig:       select ARCH_WANT_HUGE_PMD_SHARE

arch/arm/Kconfig:       select MMU_GATHER_RCU_TABLE_FREE if SMP && ARM_LPAE
arch/arm64/Kconfig:     select MMU_GATHER_RCU_TABLE_FREE
arch/powerpc/Kconfig:   select MMU_GATHER_RCU_TABLE_FREE
arch/s390/Kconfig:      select MMU_GATHER_RCU_TABLE_FREE
arch/sparc/Kconfig:     select MMU_GATHER_RCU_TABLE_FREE if SMP
arch/sparc/include/asm/tlb_64.h:#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
arch/x86/Kconfig:       select MMU_GATHER_RCU_TABLE_FREE        if PARAVIRT

I think it means at least on RISCV RCU_TABLE_FREE is not enabled and we'll
need to rely on the IPIs (e.g. I think we need to replace rcu_read_lock()
with local_irq_disable() on RISCV only for what this patchset wanted to
do).  In the next version, I plan to add a helper, let's name it
huge_pte_walker_lock() for now, and it should be one of the three options:

  - if !ARCH_WANT_HUGE_PMD_SHARE:      it's no-op
  - else if MMU_GATHER_RCU_TABLE_FREE: it should be rcu_read_lock()
  - else:                              it should be local_irq_disable()
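
The three options above can be sketched roughly as below.  This is a
userspace mock under stated assumptions, not the actual patch: the four
locking primitives are stubs that only record state, the two bool
parameters stand in for ARCH_WANT_HUGE_PMD_SHARE and
MMU_GATHER_RCU_TABLE_FREE (in the kernel the choice would presumably be
made at build time), and huge_pte_walker_mode() and
huge_pte_walker_unlock() are hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>

enum walker_mode { WALKER_NOOP, WALKER_RCU, WALKER_IRQ_OFF };

/* Userspace stubs; the kernel provides the real primitives. */
static int rcu_depth;
static bool irqs_off;

static void rcu_read_lock(void)     { rcu_depth++; }
static void rcu_read_unlock(void)   { assert(rcu_depth > 0); rcu_depth--; }
static void local_irq_disable(void) { irqs_off = true; }
static void local_irq_enable(void)  { irqs_off = false; }

/* Pick the protection from the two config switches (hypothetical helper). */
static enum walker_mode huge_pte_walker_mode(bool pmd_share, bool rcu_table_free)
{
	if (!pmd_share)
		return WALKER_NOOP;        /* no pmd sharing: nothing to do */
	if (rcu_table_free)
		return WALKER_RCU;         /* pgtables are freed via RCU */
	return WALKER_IRQ_OFF;             /* rely on IPIs, as fast-gup does */
}

static void huge_pte_walker_lock(enum walker_mode mode)
{
	switch (mode) {
	case WALKER_RCU:
		rcu_read_lock();
		break;
	case WALKER_IRQ_OFF:
		local_irq_disable();
		break;
	case WALKER_NOOP:
		break;
	}
}

/* Hypothetical counterpart that undoes huge_pte_walker_lock(). */
static void huge_pte_walker_unlock(enum walker_mode mode)
{
	switch (mode) {
	case WALKER_RCU:
		rcu_read_unlock();
		break;
	case WALKER_IRQ_OFF:
		local_irq_enable();
		break;
	case WALKER_NOOP:
		break;
	}
}
```

A caller would take the lock before huge_pte_offset() and drop it only
after the last access to the returned pte_t pointer.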

With that, I think we'll strictly follow what we have with fast-gup; at the
same time it should add zero overhead on archs that do not have pmd sharing.

Hope the above helps a bit in filling in the missing pieces of the cover
letter.  Or again, if anything is missing I'd be more than glad to know.

Thanks,

-- 
Peter Xu