Message-ID: <1eff312b-1aca-6afb-3587-f65e698b3f8c@redhat.com>
Date: Wed, 30 Nov 2022 10:46:24 +0100
From: David Hildenbrand <david@...hat.com>
To: Peter Xu <peterx@...hat.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Cc: James Houghton <jthoughton@...gle.com>,
Jann Horn <jannh@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Andrea Arcangeli <aarcange@...hat.com>,
Rik van Riel <riel@...riel.com>,
Nadav Amit <nadav.amit@...il.com>,
Miaohe Lin <linmiaohe@...wei.com>,
Muchun Song <songmuchun@...edance.com>,
Mike Kravetz <mike.kravetz@...cle.com>
Subject: Re: [PATCH 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for
pmd unshare
On 29.11.22 20:35, Peter Xu wrote:
> Based on latest mm-unstable (9ed079378408).
>
> This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> series for pmd unsharing, mainly covering safe use of huge_pte_offset().
>
> Compared to the previous rfcv2, the major change is that I dropped the new
> pgtable lock and only use the vma lock for locking. The major reason is
> that I overlooked that the pgtable lock was not protected by RCU:
> __pmd_free_tlb() frees the pgtable lock before e.g. call_rcu() for
> RCU_TABLE_FREE archs. OTOH many of the huge_pte_offset() call sites do
> need to take the pgtable lock. It means the valid users for the new RCU
> lock would be very limited.
Thanks.
>
> It's possible that in the future we can rework the pgtable free path to
> only free the pgtable lock after the RCU grace period (moving
> pgtable_pmd_page_dtor() to within tlb_remove_table_rcu()); then the RCU
> lock will make more sense. For now, keep it simple by fixing the races first.
Good.
>
> Since this version attaches a reproducer (below) and also removes the RCU
> (especially, the fallback irqoff) solution, I'm removing the RFC tag.
Very nice, thanks.
>
> Old versions:
>
> rfcv1: https://lore.kernel.org/r/20221030212929.335473-1-peterx@redhat.com
> rfcv2: https://lore.kernel.org/r/20221118011025.2178986-1-peterx@redhat.com
>
> Problem
> =======
>
> huge_pte_offset() is a major helper used by hugetlb code paths to walk a
> hugetlb pgtable. It's used almost everywhere, since it's needed even
> before taking the pgtable lock.
>
> huge_pte_offset() is always called with the mmap lock held, in either read
> or write mode. It was assumed to be safe, but it's actually not. One race
> condition can easily be triggered: (1) first trigger pmd sharing on a
> memory range, (2) do huge_pte_offset() on the range; meanwhile, (3)
> another thread unshares the pmd range, and the pgtable page is prone to
> being lost if the other sharing process wants to free it completely (by
> either munmap or exit mm).
Just so that I understand correctly:
Two processes, #A and #B, share a page table. Process #A runs two
threads, #A1 and #A2.
#A1 walks that shared page table (using huge_pte_offset()), for example,
to resolve a page fault. Concurrently, #A2 triggers unsharing of that
page table (replacing it by a private page table), for example, using
munmap().
So #A1 will eventually read/write the shared page table while we're
placing a private page table. Which would be fine (assuming no unsharing
would be required by #A1), however, if #B also concurrently drops its
reference to the shared page table, the shared page table could
essentially get freed while #A1 is still walking it.
I suspect, looking at the reproducer, that the page table destructor
was called. Will the page table also actually get freed already? IOW,
could #A1 be reading/writing a freed page?
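To make the interleaving concrete, here is a rough timeline of the
scenario as I understand it (a sketch only; the exact call paths are my
assumption, not taken from the reproducer):

```
#A1 (pgtable walk)          #A2 (same mm, munmap)      #B (other mm, exit/munmap)
------------------          ---------------------      --------------------------
huge_pte_offset()
  -> reads shared pmd page
                            huge_pmd_unshare()
                              drops #A's ref on the
                              shared pgtable page
                                                       free_pgtables()
                                                         last ref gone, pgtable
                                                         page (and its lock) freed
dereferences ptep
  -> use-after-free
```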
>
> The recent work from Mike on the vma lock can resolve most of this
> already. It's achieved by forbidding pmd unsharing while the lock is held,
> so there's no further risk of the pgtable page being freed. It means that
> if we can take the vma lock around all huge_pte_offset() callers it'll be
> safe.
Agreed.
>
> A bunch of them are already done as per the latest mm-unstable, but quite
> a few others are not, for various reasons, especially around
> huge_pte_offset() usage.
>
> One more thing to mention is that besides the vma lock, i_mmap_rwsem can
> also be used to protect the pgtable page (along with its pgtable lock) from
> being freed from under us. IOW, huge_pte_offset() callers need to either
> hold the vma lock or i_mmap_rwsem to safely walk the pgtables.
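For reference, the locking rule described in the last quoted paragraph
would look roughly like this at a call site (a sketch only, using the
helper names from Mike's vma lock series as I understand them; not taken
from the actual patches, and `h`/`vma`/`mm`/`address` are assumed
context):

```c
/*
 * Sketch: hold the hugetlb vma lock (or i_mmap_rwsem) across the walk
 * so that pmd unsharing cannot free the pgtable page underneath us.
 */
hugetlb_vma_lock_read(vma);
ptep = huge_pte_offset(mm, address, huge_page_size(h));
if (ptep) {
        ptl = huge_pte_lock(h, mm, ptep);
        /* ... access the pte under the pgtable lock ... */
        spin_unlock(ptl);
}
hugetlb_vma_unlock_read(vma);
/* ptep must not be dereferenced after the vma lock is dropped */
```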
>
> A reproducer of such a problem, based on hugetlb GUP (NOTE: since the
> race is very hard to trigger, one needs to apply another kernel delay
> patch too; see below):
Thanks,
David / dhildenb