Message-ID: <1eff312b-1aca-6afb-3587-f65e698b3f8c@redhat.com>
Date: Wed, 30 Nov 2022 10:46:24 +0100
From: David Hildenbrand <david@...hat.com>
To: Peter Xu <peterx@...hat.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Cc: James Houghton <jthoughton@...gle.com>,
Jann Horn <jannh@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Andrea Arcangeli <aarcange@...hat.com>,
Rik van Riel <riel@...riel.com>,
Nadav Amit <nadav.amit@...il.com>,
Miaohe Lin <linmiaohe@...wei.com>,
Muchun Song <songmuchun@...edance.com>,
Mike Kravetz <mike.kravetz@...cle.com>
Subject: Re: [PATCH 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for
pmd unshare
On 29.11.22 20:35, Peter Xu wrote:
> Based on latest mm-unstable (9ed079378408).
>
> This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> series for pmd unsharing, mainly covering safe use of huge_pte_offset().
>
> Compared to the previous rfcv2, the major change is that I dropped the new
> pgtable lock and only use the vma lock for locking. The major reason is
> that I overlooked that the pgtable lock was not protected by RCU:
> __pmd_free_tlb() frees the pgtable lock before e.g. call_rcu() for
> RCU_TABLE_FREE archs. OTOH many of the huge_pte_offset() call sites do
> need to take the pgtable lock. It means the valid users for the new RCU
> lock would be very limited.
Thanks.
>
> It's possible that in the future we can rework the pgtable free path to
> only free the pgtable lock after the RCU grace period (moving
> pgtable_pmd_page_dtor() to within tlb_remove_table_rcu()); then the RCU
> lock will make more sense. For now, keep it simple by fixing the races first.
Good.
>
> Since this version attaches a reproducer (below) and also removes the RCU
> (especially, the fallback irqoff) solution, I'm removing the RFC tag.
Very nice, thanks.
>
> Old versions:
>
> rfcv1: https://lore.kernel.org/r/20221030212929.335473-1-peterx@redhat.com
> rfcv2: https://lore.kernel.org/r/20221118011025.2178986-1-peterx@redhat.com
>
> Problem
> =======
>
> huge_pte_offset() is a major helper used by hugetlb code paths to walk a
> hugetlb pgtable. It's used almost everywhere, since it's needed even
> before taking the pgtable lock.
>
> huge_pte_offset() is always called with the mmap lock held, in either read
> or write mode. It was assumed to be safe, but it's actually not. One race
> condition can easily be triggered: (1) first trigger pmd sharing on a
> memory range, (2) do huge_pte_offset() on the range; meanwhile, (3)
> another thread unshares the pmd range, and the pgtable page is prone to
> being lost if the other sharing process wants to free it completely (by
> either munmap or exit mm).
Just so that I understand correctly:
Two processes, #A and #B, share a page table. Process #A runs two
threads, #A1 and #A2.
#A1 walks that shared page table (using huge_pte_offset()), for example,
to resolve a page fault. Concurrently, #A2 triggers unsharing of that
page table (replacing it by a private page table), for example, using
munmap().
So #A1 will eventually read/write the shared page table while we're
placing a private page table. Which would be fine (assuming no unsharing
would be required by #A1), however, if #B also concurrently drops its
reference to the shared page table, the shared page table could
essentially get freed while #A1 is still walking it.
I suspect, looking at the reproducer, that the page table destructor
was called. Will the page table also actually get freed already? IOW,
could #A1 be reading/writing a freed page?
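To make the interleaving concrete, here is a rough timeline of the
scenario as I understand it (a sketch only; the exact call paths are my
assumption, not taken from the reproducer):

```
#A1 (pgtable walk)          #A2 (same mm, munmap)      #B (other mm, exit/munmap)
------------------          ---------------------      --------------------------
huge_pte_offset()
  -> reads shared pmd page
                            huge_pmd_unshare()
                              drops #A's ref on the
                              shared pgtable page
                                                       free_pgtables()
                                                         last ref gone, pgtable
                                                         page (and its lock) freed
dereferences ptep
  -> use-after-free
```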
>
> The recent work from Mike on the vma lock can resolve most of this
> already. It's achieved by forbidding pmd unsharing while the lock is held,
> so there's no further risk of the pgtable page being freed. It means that
> if we can take the vma lock around all huge_pte_offset() callers it'll be
> safe.
Agreed.
>
> A bunch of them are already done as per the latest mm-unstable, but quite
> a few others are not, for various reasons, especially around
> huge_pte_offset() usage.
>
> One more thing to mention is that besides the vma lock, i_mmap_rwsem can
> also be used to protect the pgtable page (along with its pgtable lock) from
> being freed from under us. IOW, huge_pte_offset() callers need to either
> hold the vma lock or i_mmap_rwsem to safely walk the pgtables.
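For reference, the locking rule described in the last quoted paragraph
would look roughly like this at a call site (a sketch only, using the
helper names from Mike's vma lock series as I understand them; not taken
from the actual patches, and `h`/`vma`/`mm`/`address` are assumed
context):

```c
/*
 * Sketch: hold the hugetlb vma lock (or i_mmap_rwsem) across the walk
 * so that pmd unsharing cannot free the pgtable page underneath us.
 */
hugetlb_vma_lock_read(vma);
ptep = huge_pte_offset(mm, address, huge_page_size(h));
if (ptep) {
        ptl = huge_pte_lock(h, mm, ptep);
        /* ... access the pte under the pgtable lock ... */
        spin_unlock(ptl);
}
hugetlb_vma_unlock_read(vma);
/* ptep must not be dereferenced after the vma lock is dropped */
```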
>
> A reproducer of such a problem, based on hugetlb GUP (NOTE: since the
> race is very hard to trigger, one needs to apply another kernel delay
> patch too; see below):
Thanks,
David / dhildenb