linux-kernel - Re: [PATCH mm-unstable] mm/khugepaged: fix collapse_pte_mapped

Message-ID: <ZOTGvfO31pleXrPF@x1n>
Date:   Tue, 22 Aug 2023 10:31:25 -0400
From:   Peter Xu <peterx@...hat.com>
To:     Hugh Dickins <hughd@...gle.com>
Cc:     Jann Horn <jannh@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...nel.org>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Matthew Wilcox <willy@...radead.org>,
        David Hildenbrand <david@...hat.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Qi Zheng <zhengqi.arch@...edance.com>,
        Yang Shi <shy828301@...il.com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Peter Zijlstra <peterz@...radead.org>,
        Will Deacon <will@...nel.org>, Yu Zhao <yuzhao@...gle.com>,
        Alistair Popple <apopple@...dia.com>,
        Ralph Campbell <rcampbell@...dia.com>,
        Ira Weiny <ira.weiny@...el.com>,
        Steven Price <steven.price@....com>,
        SeongJae Park <sj@...nel.org>,
        Lorenzo Stoakes <lstoakes@...il.com>,
        Huang Ying <ying.huang@...el.com>,
        Naoya Horiguchi <naoya.horiguchi@....com>,
        Christophe Leroy <christophe.leroy@...roup.eu>,
        Zack Rusin <zackr@...are.com>, Jason Gunthorpe <jgg@...pe.ca>,
        Axel Rasmussen <axelrasmussen@...gle.com>,
        Anshuman Khandual <anshuman.khandual@....com>,
        Pasha Tatashin <pasha.tatashin@...een.com>,
        Miaohe Lin <linmiaohe@...wei.com>,
        Minchan Kim <minchan@...nel.org>,
        Christoph Hellwig <hch@...radead.org>,
        Song Liu <song@...nel.org>,
        Thomas Hellstrom <thomas.hellstrom@...ux.intel.com>,
        Russell King <linux@...linux.org.uk>,
        "David S. Miller" <davem@...emloft.net>,
        Michael Ellerman <mpe@...erman.id.au>,
        "Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>,
        Heiko Carstens <hca@...ux.ibm.com>,
        Christian Borntraeger <borntraeger@...ux.ibm.com>,
        Claudio Imbrenda <imbrenda@...ux.ibm.com>,
        Alexander Gordeev <agordeev@...ux.ibm.com>,
        Gerald Schaefer <gerald.schaefer@...ux.ibm.com>,
        Vasily Gorbik <gor@...ux.ibm.com>,
        Vishal Moola <vishal.moola@...il.com>,
        Vlastimil Babka <vbabka@...e.cz>, Zi Yan <ziy@...dia.com>,
        Zach O'Keefe <zokeefe@...gle.com>,
        Linux ARM <linux-arm-kernel@...ts.infradead.org>,
        sparclinux@...r.kernel.org,
        linuxppc-dev <linuxppc-dev@...ts.ozlabs.org>,
        linux-s390 <linux-s390@...r.kernel.org>,
        kernel list <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>
Subject: Re: [PATCH mm-unstable] mm/khugepaged: fix collapse_pte_mapped_thp()
 versus uffd

Hi, Hugh, Jann,

On Mon, Aug 21, 2023 at 07:51:38PM -0700, Hugh Dickins wrote:
> On Mon, 21 Aug 2023, Jann Horn wrote:
> > On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins <hughd@...gle.com> wrote:
> > > Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
> > > shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
> > > thought it had emptied: page lock on the huge page is enough to protect
> > > against WP faults (which find the PTE has been cleared), but not enough
> > > to protect against userfaultfd.  "BUG: Bad rss-counter state" followed.
> > >
> > > retract_page_tables() protects against this by checking !vma->anon_vma;
> > > but we know that MADV_COLLAPSE needs to be able to work on private shmem
> > > mappings, even those with an anon_vma prepared for another part of the
> > > mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
> > > mappings which are userfaultfd_armed().  Whether it needs to work on
> > > private shmem mappings which are userfaultfd_armed(), I'm not so sure:
> > > but assume that it does.
> > 
> > I think we couldn't rely on anon_vma here anyway, since holding the
> > mmap_lock in read mode doesn't prevent concurrent creation of an
> > anon_vma?
> 
> We would have had to do the same as in retract_page_tables() (which
> doesn't even have mmap_lock for read): recheck !vma->anon_vma after
> finally acquiring ptlock.  But the !anon_vma limitation is certainly
> not acceptable here anyway.
> 
> > 
> > > Just for this case, take the pmd_lock() two steps earlier: not because
> > > it gives any protection against this case itself, but because ptlock
> > > nests inside it, and it's the dropping of ptlock which let the bug in.
> > > In other cases, continue to minimize the pmd_lock() hold time.
> > 
> > Special-casing userfaultfd like this makes me a bit uncomfortable; but
> > I also can't find anything other than userfaultfd that would insert
> > pages into regions that are khugepaged-compatible, so I guess this
> > works?
> 
> I'm as sure as I can be that it's solely because userfaultfd breaks
> the usual rules here (and in fairness, IIRC Andrea did ask my permission
> before making it behave that way on shmem, COWing without a source page).
> 
> Perhaps something else will want that same behaviour in future (it's
> tempting, but difficult to guarantee correctness); for now, it is just
> userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
> expecting uffd to add more such exceptional modes in future).
> 
> > 
> > I guess an alternative would be to use a spin_trylock() instead of the
> > current pmd_lock(), and if that fails, temporarily drop the page table
> > lock and then restart from step 2 with both locks held - and at that
> > point the page table scan should be fast since we expect it to usually
> > be empty.
> 
> That's certainly a good idea, if collapse on userfaultfd_armed private
> is anything of a common case (I doubt, but I don't know).  It may be a
> better idea anyway (saving a drop and retake of ptlock).
> 
> I gave it a try, expecting to end up with something that would lead
> me to say "I tried it, but it didn't work out well"; but actually it
> looks okay to me.  I wouldn't say I prefer it, but it seems reasonable,
> and no more complicated (as Peter rightly observes) than the original.
> 
> It's up to you and Peter, and whoever has strong feelings about it,
> to choose between them: I don't mind (but I shall be sad if someone
> demands that I indent that comment deeper - I'm not a fan of long
> multi-line comments near column 80).

No strong opinion here, either.  Just one trivial comment/question below on
the new patch (if that is the one preferred).
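
For anyone following the thread, the trylock ordering used by the v2 patch
can be sketched in userspace roughly as below.  This is only an illustration
under stated assumptions: toy_spinlock, toy_table and every function here
are made-up names, not kernel APIs; "outer" stands in for pml and "inner"
for ptl; and unlike the patch, which only goes back in the uffd private
case, the sketch rechecks whenever inner had to be dropped.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spinlock: a hypothetical stand-in for the kernel's spinlock_t. */
struct toy_spinlock { atomic_bool held; };

void toy_spin_init(struct toy_spinlock *l)
{
	atomic_init(&l->held, false);
}

/* Returns true if the lock was acquired, false if it was already held. */
bool toy_spin_trylock(struct toy_spinlock *l)
{
	return !atomic_exchange_explicit(&l->held, true, memory_order_acquire);
}

void toy_spin_lock(struct toy_spinlock *l)
{
	while (!toy_spin_trylock(l))
		;	/* spin until the holder releases it */
}

void toy_spin_unlock(struct toy_spinlock *l)
{
	atomic_store_explicit(&l->held, false, memory_order_release);
}

/* Hypothetical stand-in for a page table: just a count of valid entries. */
struct toy_table { int nr_entries; };

/*
 * Lock order is outer, then inner.  We start with only inner held, so we
 * may trylock outer but must never block on it while holding inner.
 * Returns with both locks held and the table empty; the return value is
 * the number of passes made over the table (1 on the fast path).
 */
int empty_with_both_locks(struct toy_spinlock *outer,
			  struct toy_spinlock *inner, struct toy_table *t)
{
	bool have_outer = false;
	int passes = 0;

	toy_spin_lock(inner);
recheck:
	t->nr_entries = 0;		/* "step 2": empty the table */
	passes++;
	if (!have_outer) {
		have_outer = true;
		if (outer != inner && !toy_spin_trylock(outer)) {
			/* Slow path: restore the proper lock order. */
			toy_spin_unlock(inner);
			toy_spin_lock(outer);
			toy_spin_lock(inner);
			/* inner was dropped, so entries may have come back. */
			goto recheck;
		}
	}
	return passes;
}

void unlock_both(struct toy_spinlock *outer, struct toy_spinlock *inner)
{
	toy_spin_unlock(inner);
	if (outer != inner)
		toy_spin_unlock(outer);
}
```

The second pass is expected to be cheap, since the table was emptied a
moment earlier and is usually still empty.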

> 
> 
> [PATCH mm-unstable v2] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd
> 
> Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
> shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
> thought it had emptied: page lock on the huge page is enough to protect
> against WP faults (which find the PTE has been cleared), but not enough
> to protect against userfaultfd.  "BUG: Bad rss-counter state" followed.
> 
> retract_page_tables() protects against this by checking !vma->anon_vma;
> but we know that MADV_COLLAPSE needs to be able to work on private shmem
> mappings, even those with an anon_vma prepared for another part of the
> mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
> mappings which are userfaultfd_armed().  Whether it needs to work on
> private shmem mappings which are userfaultfd_armed(), I'm not so sure:
> but assume that it does.
> 
> Now trylock pmd lock without dropping ptlock (suggested by jannh): if
> that fails, drop and retake ptlock around taking pmd lock, and just in
> the uffd private case, go back to recheck and empty the page table.
> 
> Reported-by: Jann Horn <jannh@...gle.com>
> Closes: https://lore.kernel.org/linux-mm/CAG48ez0FxiRC4d3VTu_a9h=rg5FW-kYD5Rg5xo_RDBM0LTTqZQ@mail.gmail.com/
> Fixes: 1043173eb5eb ("mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()")
> Signed-off-by: Hugh Dickins <hughd@...gle.com>
> ---
>  mm/khugepaged.c | 39 +++++++++++++++++++++++++++++----------
>  1 file changed, 29 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 40d43eccdee8..ad1c571772fe 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1476,7 +1476,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  	struct page *hpage;
>  	pte_t *start_pte, *pte;
>  	pmd_t *pmd, pgt_pmd;
> -	spinlock_t *pml, *ptl;
> +	spinlock_t *pml = NULL, *ptl;
>  	int nr_ptes = 0, result = SCAN_FAIL;
>  	int i;
>  
> @@ -1572,9 +1572,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  				haddr, haddr + HPAGE_PMD_SIZE);
>  	mmu_notifier_invalidate_range_start(&range);
>  	notified = true;
> -	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> -	if (!start_pte)		/* mmap_lock + page lock should prevent this */
> -		goto abort;
> +	spin_lock(ptl);

... will the ptl always be valid here?

It comes from the previous round of pte_offset_map_lock(), and I assume that
since the whole "thp collapse without write lock" work landed, it has the
same lifetime as the *pte pointer, so it can become invalid right after the
RCU read lock is released; the mmap read lock is no longer strong enough to
protect the ptl.

Maybe it's all fine, because the thp collapse paths are the only ones that
release the pte pgtable page without the mmap write lock (and so release the
ptl too when doing so), and we at least still hold the page lock, so in the
worst case another concurrent "thp collapse" will still serialize with this
one on the huge page lock.  But that doesn't look as solid as fetching the
ptl again from another pte_offset_map_nolock().  So I'm still raising the
question.  It's possible I just missed something.
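
To illustrate what I mean by fetching the lock again rather than reusing a
cached pointer, here is a rough userspace sketch.  Everything in it is an
assumption made for illustration: toy_pgtable, toy_slot and the functions
are hypothetical names, the slot plays the role of the pmd entry, and the
toy tables are never freed, so the lifetime problem that RCU deals with in
the kernel is not modelled at all.  The only point shown is "look up the
lock from the current entry, take it, then check the entry did not change".

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page table whose lock is embedded in it. */
struct toy_pgtable {
	atomic_bool lock_held;	/* goes away together with the table */
	int nr_entries;
};

/* Hypothetical stand-in for a pmd entry: current table, or NULL. */
struct toy_slot {
	_Atomic(struct toy_pgtable *) table;
};

void toy_unlock(struct toy_pgtable *t)
{
	atomic_store_explicit(&t->lock_held, false, memory_order_release);
}

/*
 * Derive the lock from whatever table the slot points at right now,
 * instead of trusting a lock pointer saved earlier.  Returns the locked
 * table, or NULL if the slot no longer has one.
 */
struct toy_pgtable *toy_lock_current(struct toy_slot *slot)
{
	for (;;) {
		struct toy_pgtable *t = atomic_load(&slot->table);

		if (!t)
			return NULL;	/* table was removed */

		/* Spin until we own this table's embedded lock. */
		while (atomic_exchange_explicit(&t->lock_held, true,
						memory_order_acquire))
			;

		/* Revalidate: the slot must still point at this table. */
		if (atomic_load(&slot->table) == t)
			return t;

		/* It changed while we were locking: retry from the top. */
		toy_unlock(t);
	}
}
```

A caller that gets NULL back knows the table has gone and can bail out,
which a cached lock pointer cannot tell it.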

> +recheck:
> +	start_pte = pte_offset_map(pmd, haddr);
> +	VM_BUG_ON(!start_pte);	/* mmap_lock + page lock should prevent this */
>  
>  	/* step 2: clear page table and adjust rmap */
>  	for (i = 0, addr = haddr, pte = start_pte;
> @@ -1608,20 +1609,36 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  		nr_ptes++;
>  	}
>  
> -	pte_unmap_unlock(start_pte, ptl);
> +	pte_unmap(start_pte);
>  
>  	/* step 3: set proper refcount and mm_counters. */
>  	if (nr_ptes) {
>  		page_ref_sub(hpage, nr_ptes);
>  		add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
> +		nr_ptes = 0;
>  	}
>  
> -	/* step 4: remove page table */
> +	/* step 4: remove empty page table */
> +	if (!pml) {
> +		pml = pmd_lockptr(mm, pmd);
> +		if (pml != ptl && !spin_trylock(pml)) {
> +			spin_unlock(ptl);
> +			spin_lock(pml);
> +			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> +	/*
> +	 * pmd_lock covers a wider range than ptl, and (if split from mm's
> +	 * page_table_lock) ptl nests inside pml. The less time we hold pml,
> +	 * the better; but userfaultfd's mfill_atomic_pte() on a private VMA
> +	 * inserts a valid as-if-COWed PTE without even looking up page cache.
> +	 * So page lock of hpage does not protect from it, so we must not drop
> +	 * ptl before pgt_pmd is removed, so uffd private needs rechecking.
> +	 */
> +			if (userfaultfd_armed(vma) &&
> +			    !(vma->vm_flags & VM_SHARED))
> +				goto recheck;
> +		}
> +	}
>  
> -	/* Huge page lock is still held, so page table must remain empty */
> -	pml = pmd_lock(mm, pmd);
> -	if (ptl != pml)
> -		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>  	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
>  	pmdp_get_lockless_sync();
>  	if (ptl != pml)
> @@ -1648,6 +1665,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  	}
>  	if (start_pte)
>  		pte_unmap_unlock(start_pte, ptl);
> +	if (pml && pml != ptl)
> +		spin_unlock(pml);
>  	if (notified)
>  		mmu_notifier_invalidate_range_end(&range);
>  drop_hpage:
> -- 
> 2.35.3


-- 
Peter Xu