Message-ID: <Y21xlgG6sxP6q5K9@monkey>
Date: Thu, 10 Nov 2022 13:48:06 -0800
From: Mike Kravetz <mike.kravetz@...cle.com>
To: Nadav Amit <nadav.amit@...il.com>
Cc: Linux-MM <linux-mm@...ck.org>,
kernel list <linux-kernel@...r.kernel.org>,
Naoya Horiguchi <naoya.horiguchi@...ux.dev>,
David Hildenbrand <david@...hat.com>,
Axel Rasmussen <axelrasmussen@...gle.com>,
Mina Almasry <almasrymina@...gle.com>,
Peter Xu <peterx@...hat.com>, Rik van Riel <riel@...riel.com>,
Vlastimil Babka <vbabka@...e.cz>,
Matthew Wilcox <willy@...radead.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Wei Chen <harperchen1110@...il.com>, stable@...r.kernel.org
Subject: Re: [PATCH v8 1/2] hugetlb: don't delete vma_lock in hugetlb
MADV_DONTNEED processing
On 11/10/22 12:59, Nadav Amit wrote:
> On Nov 7, 2022, at 5:19 PM, Mike Kravetz <mike.kravetz@...cle.com> wrote:
>
> > madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
> > tables associated with the address range. For hugetlb vmas,
> > zap_page_range will call __unmap_hugepage_range_final. However,
> > __unmap_hugepage_range_final assumes the passed vma is about to be removed
> > and deletes the vma_lock to prevent pmd sharing as the vma is on the way
> > out. In the case of madvise(MADV_DONTNEED) the vma remains, but the
> > missing vma_lock prevents pmd sharing and could potentially lead to issues
> > with truncation/fault races.
>
> I understand the problem in general. Please consider my feedback as
> partial, though.

Thanks for taking a look!
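
(To make the failure mode concrete: the path described above can be
driven entirely from userspace with something like the sketch below.
The mapping size and flags are illustrative and depend on the system's
hugepage configuration.)

	#include <sys/mman.h>
	#include <stdlib.h>

	int main(void)
	{
		size_t len = 2UL * 1024 * 1024;	/* e.g. one 2MB hugepage */
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB,
			       -1, 0);

		if (p == MAP_FAILED)
			exit(1);
		p[0] = 1;	/* fault in the hugepage */

		/* zap the range; the vma itself stays in place */
		madvise(p, len, MADV_DONTNEED);
		return 0;
	}

Without this patch, the vma_lock is gone after the madvise() even
though the vma survives.
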
>
> > @@ -5203,32 +5194,50 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> > unsigned long end, struct page *ref_page,
> > zap_flags_t zap_flags)
> > {
> > + bool final = zap_flags & ZAP_FLAG_UNMAP;
> > +
>
> Not sure why caching final in a local variable helps.

No particular reason. It can be eliminated.
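
For example, the check could just read the flag where it is used; a
sketch against the patched function, with the unchanged body elided:

	void __unmap_hugepage_range_final(struct mmu_gather *tlb,
				struct vm_area_struct *vma, unsigned long start,
				unsigned long end, struct page *ref_page,
				zap_flags_t zap_flags)
	{
		...
		/* test zap_flags directly instead of caching 'final' */
		if (zap_flags & ZAP_FLAG_UNMAP) {
			/* final unmap: the vma is going away */
			...
		}
		...
	}
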
>
> >
> > void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
> > unsigned long end, struct page *ref_page,
> > zap_flags_t zap_flags)
> > {
> > + struct mmu_notifier_range range;
> > struct mmu_gather tlb;
> >
> > + mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
> > + start, end);
> > + adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
> > tlb_gather_mmu(&tlb, vma->vm_mm);
> > +
> > __unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);
>
> Is there a reason for not using range.start and range.end?

After calling adjust_range_if_pmd_sharing_possible, the span from
range.start to range.end can be much larger than the range we actually
want to unmap. The range is only adjusted to account for pmd sharing
if sharing is POSSIBLE; it does not know for sure whether we will
actually 'unshare a pmd'. So range.start/range.end are appropriate for
the mmu notifiers, but not as the arguments to __unmap_hugepage_range.

I suppose adjust_range_if_pmd_sharing_possible could be modified to
check whether unmapping will actually result in unsharing, but it does
not do that today.
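
For reference, the widening is based purely on PUD alignment and
VM_MAYSHARE; roughly this (paraphrasing mm/hugetlb.c, not the exact
upstream code):

	void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
					unsigned long *start, unsigned long *end)
	{
		unsigned long v_start = ALIGN(vma->vm_start, PUD_SIZE),
			      v_end = ALIGN_DOWN(vma->vm_end, PUD_SIZE);

		/* sharing needs a shared mapping spanning a full PUD */
		if (!(vma->vm_flags & VM_MAYSHARE) || v_end <= v_start ||
		    *end <= v_start || *start >= v_end)
			return;

		/* widen to PUD boundaries; sharing is possible, not certain */
		if (*start > v_start)
			*start = ALIGN_DOWN(*start, PUD_SIZE);
		if (*end < v_end)
			*end = ALIGN(*end, PUD_SIZE);
	}
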
>
> It is just that every inconsistency is worrying…
>
> >
> > @@ -1734,6 +1734,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
> > lru_add_drain();
> > mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> > address, address + size);
> > + if (is_vm_hugetlb_page(vma))
> > + adjust_range_if_pmd_sharing_possible(vma, &range.start,
> > + &range.end);
> > tlb_gather_mmu(&tlb, vma->vm_mm);
> > update_hiwater_rss(vma->vm_mm);
> > mmu_notifier_invalidate_range_start(&range);
> > @@ -1742,6 +1745,12 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
> > tlb_finish_mmu(&tlb);
> > }
> >
> > +void zap_vma_range(struct vm_area_struct *vma, unsigned long address,
> > + unsigned long size)
> > +{
> > + __zap_page_range_single(vma, address, size, NULL);
>
> Ugh. So zap_vma_range() would actually be emitted as a wrapper function that
> only calls __zap_page_range_single() (or, worse, __zap_page_range_single(),
> which is large, would be inlined), unless you use LTO.
>
> Another option is to declare __zap_page_range_size() in the header and move
> this one to the header as inline wrapper.

Ok, I will keep that in mind.

However, Peter seems to prefer just exposing the current
zap_page_range_single. The issue with exposing zap_page_range_single
as it is today is that we would also need to expose 'struct
zap_details', which currently is not visible outside mm/memory.c.
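
If we did go the inline-wrapper route, it could look something like
this in a header (sketch only; a forward declaration of struct
zap_details suffices here because the wrapper never dereferences it):

	struct zap_details;

	void __zap_page_range_single(struct vm_area_struct *vma,
			unsigned long address, unsigned long size,
			struct zap_details *details);

	static inline void zap_vma_range(struct vm_area_struct *vma,
			unsigned long address, unsigned long size)
	{
		/* NULL details: plain zap of the range, no filtering */
		__zap_page_range_single(vma, address, size, NULL);
	}
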
--
Mike Kravetz