Message-ID: <70376d57-7924-8ac9-9e93-1831248115a0@redhat.com>
Date: Wed, 23 Nov 2022 10:40:40 +0100
From: David Hildenbrand <david@...hat.com>
To: Peter Xu <peterx@...hat.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Cc: Rik van Riel <riel@...riel.com>,
Muchun Song <songmuchun@...edance.com>,
Andrew Morton <akpm@...ux-foundation.org>,
James Houghton <jthoughton@...gle.com>,
Nadav Amit <nadav.amit@...il.com>,
Andrea Arcangeli <aarcange@...hat.com>,
Miaohe Lin <linmiaohe@...wei.com>,
Mike Kravetz <mike.kravetz@...cle.com>
Subject: Re: [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare
On 18.11.22 02:10, Peter Xu wrote:
> Based on latest mm-unstable (96aa38b69507).
>
> This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> series for pmd unsharing, so this series also depends on that one.
> Hopefully this series can make it a more complete resolution for pmd
> unsharing.
>
> PS: so far no one strongly ACKed this, let me keep the RFC tag. But I
> think I'm already more confident than many of the RFCs I posted.
>
> PS2: there are a lot of changes compared to rfcv1, so I'm just not adding
> the changelog. The whole idea is still the same, though.
>
> Problem
> =======
>
> huge_pte_offset() is a major helper used by hugetlb code paths to walk a
> hugetlb pgtable. It's used almost everywhere, since it's needed even
> before taking the pgtable lock.
>
> huge_pte_offset() is always called with the mmap lock held for either read
> or write.
>
> For normal memory types that's enough, since any pgtable removal
> requires the mmap write lock (e.g. munmap or mm destruction). However,
> hugetlb has the pmd unshare feature, which means that not only can the
> pgtable page be gone from under us while we're walking, but so can the
> pgtable page we're walking (even after being unshared; in this case it can
> only be the huge PUD page which contains 512 huge pmd entries, with the vma
> VM_SHARED mapped). It's possible because even though freeing the pgtable
> page requires the mmap write lock, that doesn't help us when we're walking
> another mm's pgtable, so we're still at risk even if we hold the
> current->mm's mmap lock.
>
> The recent work from Mike on the vma lock can resolve most of this already.
> It's achieved by forbidding pmd unsharing while the lock is held, so there
> is no further risk of the pgtable page being freed. It means that if we can
> take the vma lock around all huge_pte_offset() callers, it'll be safe.
>
> There are already a bunch of them that we did as per the latest
> mm-unstable, but also quite a few others that we didn't for various
> reasons. E.g. it may not be applicable for contexts that are not allowed
> to sleep, like FOLL_NOWAIT. Or, huge_pmd_share() is actually a tricky user
> of huge_pte_offset(), because even if we took the vma lock, we're walking
> another mm's vma! Taking the vma lock for all the vmas is probably not
> going to work.
>
> I have no report showing that I can trigger such a race, but code-wise I
> don't see anything that stops the race from happening. This series is
> trying to resolve that problem.

Let me try to understand the basic problem first:

hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
don't grab the page table locks. That's very hugetlb-specific handling,
and I assume hugetlb uses different mechanisms to sync against
MADV_DONTNEED, concurrent page faults, ... but that's no news. hugetlb is
weird in many ways :)

So, IIUC, you want a mechanism to synchronize against PMD unsharing.
Can't we use some very basic locking for that?

Using RCU / disabling local irqs seems a bit excessive because we *are*
holding the mmap lock and only care about concurrent unsharing.

--
Thanks,
David / dhildenb