linux-kernel - Re: [PATCH 00/10] mm/hugetlb: Make huge_pte

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Y4eDixDUnreiyIXq@x1n>
Date:   Wed, 30 Nov 2022 11:23:39 -0500
From:   Peter Xu <peterx@...hat.com>
To:     David Hildenbrand <david@...hat.com>
Cc:     linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        James Houghton <jthoughton@...gle.com>,
        Jann Horn <jannh@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Rik van Riel <riel@...riel.com>,
        Nadav Amit <nadav.amit@...il.com>,
        Miaohe Lin <linmiaohe@...wei.com>,
        Muchun Song <songmuchun@...edance.com>,
        Mike Kravetz <mike.kravetz@...cle.com>
Subject: Re: [PATCH 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for
 pmd unshare

On Wed, Nov 30, 2022 at 10:46:24AM +0100, David Hildenbrand wrote:
> > huge_pte_offset() is always called with mmap lock held with either read or
> > write.  It was assumed to be safe but it's actually not.  One race
> > condition can easily trigger by: (1) firstly trigger pmd share on a memory
> > range, (2) do huge_pte_offset() on the range, then at the meantime, (3)
> > another thread unshare the pmd range, and the pgtable page is prone to lost
> > if the other shared process wants to free it completely (by either munmap
> > or exit mm).
> 
> So just that I understand correctly:
> 
> Two processes, #A and #B, share a page table. Process #A runs two threads,
> #A1 and #A2.
> 
> #A1 walks that shared page table (using huge_pte_offset()), for example, to
> resolve a page fault. Concurrently, #A2 triggers unsharing of that page
> table (replacing it by a private page table),

Not yet replacing it, just unsharing.

If the replacement happened we shouldn't trigger a bug either because
huge_pte_offset() will return the private pgtable page instead.

> for example, using munmap().

munmap() may not work because it needs mmap lock, so it'll wait until #A1
completes huge_pte_offset() walks and release mmap lock read.

Many of other things can trigger unshare, though.  In the reproducer I used
MADV_DONTNEED.

> 
> So #A1 will eventually read/write the shared page table while we're placing
> a private page table. Which would be fine (assuming no unsharing would be
> required by #A1), however, if #B also concurrently drops the reference to
> the shared page table (), the shared page table could essentially get freed
> while #A1 is still walking it.
> 
> I suspect, looking at the reproducer, that the page table deconstructor was
> called. Will the page table also actually get freed already? IOW, could #A1
> be reading/writing a freed page?

If with the existing code base, I think it could.

If with RCU lock, it couldn't, but still since the pgtable lock is freed
even if the page is not, we'll still hit weird issues when accessing the
lock.

And with vma lock it should be all safe.

-- 
Peter Xu