linux-kernel - Re: [PATCH v2 1/5] mm/hugetlb: fix races when looking up a CONT-PTE size hugetlb page

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <Yw0F0/Fnpf6l9fgO@monkey>
Date:   Mon, 29 Aug 2022 11:30:43 -0700
From:   Mike Kravetz <mike.kravetz@...cle.com>
To:     "Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>
Cc:     David Hildenbrand <david@...hat.com>,
        Baolin Wang <baolin.wang@...ux.alibaba.com>,
        akpm@...ux-foundation.org, songmuchun@...edance.com,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 1/5] mm/hugetlb: fix races when looking up a CONT-PTE
 size hugetlb page

On 08/27/22 19:29, Aneesh Kumar K.V wrote:
> Mike Kravetz <mike.kravetz@...cle.com> writes:
> 
> >
> > On 08/25/22 09:25, David Hildenbrand wrote:
> >> > Is the primary concern the locking?  If so, I am not sure we have an issue.
> >> > As mentioned in your commit message, current code will use
> >> > pte_offset_map_lock().  pte_offset_map_lock uses pte_lockptr, and pte_lockptr
> >> > will either be the mm wide lock or pmd_page lock.  To me, it seems that
> >> > either would provide correct synchronization for CONT-PTE entries.  Am I
> >> > missing something or misreading the code?
> >> > 
> >> > I started looking at code cleanup suggested by David.  Here is a quick
> >> > patch (not tested and likely containing errors) to see if this is a step
> >> > in the right direction.
> >> > 
> >> > I like it because we get rid of/combine all those follow_huge_p*d
> >> > routines.
> >> > 
> >> 
> >> Yes, see comments below.
> >> 
> >> > From 35d117a707c1567ddf350554298697d40eace0d7 Mon Sep 17 00:00:00 2001
> >> > From: Mike Kravetz <mike.kravetz@...cle.com>
> >> > Date: Wed, 24 Aug 2022 15:59:15 -0700
> >> > Subject: [PATCH] hugetlb: call hugetlb_follow_page_mask for hugetlb pages in
> >> >  follow_page_mask
> >> > 
> >> > At the beginning of follow_page_mask, there currently is a call to
> >> > follow_huge_addr which 'may' handle hugetlb pages.  ia64 is the only
> >> > architecture which (incorrectly) provides a follow_huge_addr routine
> >> > that does not return error.  Instead, at each level of the page table a
> >> > check is made for a hugetlb entry.  If a hugetlb entry is found, a call
> >> > to a routine associated with that page table level such as
> >> > follow_huge_pmd is made.
> >> > 
> >> > All the follow_huge_p*d routines are basically the same.  In addition
> >> > huge page size can be derived from the vma, so we know where in the page
> >> > table a huge page would reside.  So, replace follow_huge_addr with a
> >> > new architecture independent routine which will provide the same
> >> > functionality as the follow_huge_p*d routines.  We can then eliminate
> >> > the p*d_huge checks in follow_page_mask page table walking as well as
> >> > the follow_huge_p*d routines themselves.>
> >> > follow_page_mask still has is_hugepd hugetlb checks during page table
> >> > walking.  This is due to these checks and follow_huge_pd being
> >> > architecture specific.  These can be eliminated if
> >> > hugetlb_follow_page_mask can be overwritten by architectures (powerpc)
> >> > that need to do follow_huge_pd processing.
> >> 
> >> But won't the
> >> 
> >> > +	/* hugetlb is special */
> >> > +	if (is_vm_hugetlb_page(vma))
> >> > +		return hugetlb_follow_page_mask(vma, address, flags);
> >> 
> >> code route everything via hugetlb_follow_page_mask() and all these
> >> (beloved) hugepd checks would essentially be unreachable?
> >> 
> >> At least my understanding is that hugepd only applies to hugetlb.
> >> 
> >> Can't we move the hugepd handling code into hugetlb_follow_page_mask()
> >> as well?
> >> 
> >> I mean, doesn't follow_hugetlb_page() also have to handle that hugepd
> >> stuff already ... ?
> >> 
> >> [...]
> >
> > I think so, but I got a little confused looking at the hugepd handling code.
> > Adding Aneesh who added support to follow_page_mask in the series at:
> > https://lore.kernel.org/linux-mm/1494926612-23928-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com/
> >
> > I believe you are correct in that follow_hugetlb_page must handle as well.
> >
> > One source of my confusion is the following in follow_huge_pd:
> >
> > 	/*
> > 	 * hugepage directory entries are protected by mm->page_table_lock
> > 	 * Use this instead of huge_pte_lockptr
> > 	 */
> > 	ptl = &mm->page_table_lock;
> > 	spin_lock(ptl);
> >
> > Yet, if follow_hugetlb_page handles hugepd then it is using huge_pte_lockptr
> > to get the lock pointer and is wrong?
> >
> > Hoping Aneesh can help clear up the confusion.
> 
> 
> I agree it is all confusing. At some point, the goal was to teach
> generic kernel page table walking code about hugepd entries. But looking
> at this again and considering we only have hugepd entries for hugetlb,
> may be the effort is not worth the complexity it adds.
> 
> ie,  instead of teaching generic page table walk about different hugetlb
> page table layouts we special case using is_vm_hugetlb_page(vma)
> wherever we can.

Thanks for your comments Aneesh.

I give David credit for suggesting that is would be simpler to just special
case hugetlb mappings here.  Based on your comments, I believe an arch
independent 'hugetlb_follow_page_mask' routine would handle all cases and
we can remove the 'if (p*d_huge)' blocks and 'if (is_hugepd)' blocks of
code from follow_page_mask.

Such a routine would call huge_pte_offset() which can be/is overwritten by
arch specific code.  In fact, the powerpc version of this already handles
hugepd entries.

> With respect to huge_pte_lockptr, it is tricky (hugepd entries are not
> PMD_SIZE) 
> 
> static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
> 					   struct mm_struct *mm, pte_t *pte)
> {
> 	if (huge_page_size(h) == PMD_SIZE)
> 		return pmd_lockptr(mm, (pmd_t *) pte);
> 	VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
> 	return &mm->page_table_lock;
> }

Ok, so I think you confirmed that huge_pte_lockptr would work for hugepd
entries as they are never PMD_SIZE.

I will be sure to cc you on the proposed changes.  Thanks,
-- 
Mike Kravetz