linux-kernel - Re: [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <70376d57-7924-8ac9-9e93-1831248115a0@redhat.com>
Date:   Wed, 23 Nov 2022 10:40:40 +0100
From:   David Hildenbrand <david@...hat.com>
To:     Peter Xu <peterx@...hat.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Cc:     Rik van Riel <riel@...riel.com>,
        Muchun Song <songmuchun@...edance.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        James Houghton <jthoughton@...gle.com>,
        Nadav Amit <nadav.amit@...il.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Miaohe Lin <linmiaohe@...wei.com>,
        Mike Kravetz <mike.kravetz@...cle.com>
Subject: Re: [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset()
 thread-safe for pmd unshare

On 18.11.22 02:10, Peter Xu wrote:
> Based on latest mm-unstable (96aa38b69507).
> 
> This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> series for pmd unsharing, so this series also depends on that one.
> Hopefully this series can make it a more complete resolution for pmd
> unsharing.
> 
> PS: so far no one strongly ACKed this, let me keep the RFC tag.  But I
> think I'm already more confident than many of the RFCs I posted.
> 
> PS2: there're a lot of changes comparing to rfcv1, so I'm just not adding
> the changelog.  The whole idea is still the same, though.
> 
> Problem
> =======
> 
> huge_pte_offset() is a major helper used by hugetlb code paths to walk a
> hugetlb pgtable.  It's used mostly everywhere since that's needed even
> before taking the pgtable lock.
> 
> huge_pte_offset() is always called with mmap lock held with either read or
> write.
> 
> For normal memory types that's far enough, since any pgtable removal
> requires mmap write lock (e.g. munmap or mm destructions).  However hugetlb
> has the pmd unshare feature, it means not only the pgtable page can be gone
> from under us when we're doing a walking, but also the pgtable page we're
> walking (even after unshared, in this case it can only be the huge PUD page
> which contains 512 huge pmd entries, with the vma VM_SHARED mapped).  It's
> possible because even though freeing the pgtable page requires mmap write
> lock, it doesn't help us when we're walking on another mm's pgtable, so
> it's still on risk even if we're with the current->mm's mmap lock.
> 
> The recent work from Mike on vma lock can resolve most of this already.
> It's achieved by forbidden pmd unsharing during the lock being taken, so no
> further risk of the pgtable page being freed.  It means if we can take the
> vma lock around all huge_pte_offset() callers it'll be safe.
> 
> There're already a bunch of them that we did as per the latest mm-unstable,
> but also quite a few others that we didn't for various reasons.  E.g. it
> may not be applicable for not-allow-to-sleep contexts like FOLL_NOWAIT.
> Or, huge_pmd_share() is actually a tricky user of huge_pte_offset(),
> because even if we took the vma lock, we're walking on another mm's vma!
> Taking vma lock for all the vmas are probably not gonna work.
> 
> I have totally no report showing that I can trigger such a race, but from
> code wise I never see anything that stops the race from happening.  This
> series is trying to resolve that problem.

Let me try understand the basic problem first:

hugetlb walks page tables semi-lockless: while we hold the mmap lock, we 
don't grab the page table locks. That's very hugetlb specific handling 
and I assume hugetlb uses different mechanisms to sync against 
MADV_DONTNEED, concurrent page fault s... but that's no news. hugetlb is 
weird in many ways :)

So, IIUC, you want a mechanism to synchronize against PMD unsharing. 
Can't we use some very basic locking for that?

Using RCU / disabling local irqs seems a bit excessive because we *are* 
holding the mmap lock and only care about concurrent unsharing

-- 
Thanks,

David / dhildenb