Message-ID: <5477d161-12e7-4475-a6e9-ff3921989673@arm.com>
Date: Wed, 19 Feb 2025 08:58:44 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: Anshuman Khandual <anshuman.khandual@....com>,
 Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>,
 Huacai Chen <chenhuacai@...nel.org>, WANG Xuerui <kernel@...0n.name>,
 Thomas Bogendoerfer <tsbogend@...ha.franken.de>,
 "James E.J. Bottomley" <James.Bottomley@...senPartnership.com>,
 Helge Deller <deller@....de>, Madhavan Srinivasan <maddy@...ux.ibm.com>,
 Michael Ellerman <mpe@...erman.id.au>, Nicholas Piggin <npiggin@...il.com>,
 Christophe Leroy <christophe.leroy@...roup.eu>,
 Naveen N Rao <naveen@...nel.org>, Paul Walmsley <paul.walmsley@...ive.com>,
 Palmer Dabbelt <palmer@...belt.com>, Albert Ou <aou@...s.berkeley.edu>,
 Heiko Carstens <hca@...ux.ibm.com>, Vasily Gorbik <gor@...ux.ibm.com>,
 Alexander Gordeev <agordeev@...ux.ibm.com>,
 Christian Borntraeger <borntraeger@...ux.ibm.com>,
 Sven Schnelle <svens@...ux.ibm.com>,
 Gerald Schaefer <gerald.schaefer@...ux.ibm.com>,
 "David S. Miller" <davem@...emloft.net>,
 Andreas Larsson <andreas@...sler.com>, Arnd Bergmann <arnd@...db.de>,
 Muchun Song <muchun.song@...ux.dev>,
 Andrew Morton <akpm@...ux-foundation.org>,
 Uladzislau Rezki <urezki@...il.com>, Christoph Hellwig <hch@...radead.org>,
 David Hildenbrand <david@...hat.com>,
 "Matthew Wilcox (Oracle)" <willy@...radead.org>,
 Mark Rutland <mark.rutland@....com>, Dev Jain <dev.jain@....com>,
 Kevin Brodsky <kevin.brodsky@....com>,
 Alexandre Ghiti <alexghiti@...osinc.com>
Cc: linux-arm-kernel@...ts.infradead.org, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH v2 2/4] arm64: hugetlb: Fix huge_ptep_get_and_clear() for
 non-present ptes

On 19/02/2025 08:45, Anshuman Khandual wrote:
> 
> 
> On 2/17/25 19:34, Ryan Roberts wrote:
>> arm64 supports multiple huge_pte sizes. Some of the sizes are covered by
>> a single pte entry at a particular level (PMD_SIZE, PUD_SIZE), and some
>> are covered by multiple ptes at a particular level (CONT_PTE_SIZE,
>> CONT_PMD_SIZE). So the function has to figure out the size from the
>> huge_pte pointer. This was previously done by walking the pgtable to
>> determine the level and by using the PTE_CONT bit to determine the
>> number of ptes at the level.
>>
>> But the PTE_CONT bit is only valid when the pte is present. For
>> non-present pte values (e.g. markers, migration entries), the previous
>> implementation was therefore erroneously determining the size. There is
>> at least one known caller in core-mm, move_huge_pte(), which may call
>> huge_ptep_get_and_clear() for a non-present pte. So we must be robust to
>> this case. Additionally, the "regular" ptep_get_and_clear() is robust to
>> being called for non-present ptes, so it makes sense to follow that
>> behaviour.
>>
>> Fix this by using the new sz parameter, which is now provided to the
>> function. Additionally, when clearing each pte in a contig range, don't
>> gather the access and dirty bits if the pte is not present.
>>
>> An alternative approach that would not require API changes would be to
>> store the PTE_CONT bit in a spare bit in the swap entry pte for the
>> non-present case. But it felt cleaner to follow other APIs' lead and
>> just pass in the size.
>>
>> As an aside, PTE_CONT is bit 52, which corresponds to bit 40 in the swap
>> entry offset field (layout of non-present pte). Since hugetlb is never
>> swapped to disk, this field will only be populated for markers, which
>> always set this bit to 0, and hwpoison swap entries, which set the offset
>> field to a PFN; so it would only ever be 1 for a 52-bit PA system where
>> memory in that high half was poisoned (I think!). So in practice, this
>> bit would almost always be zero for non-present ptes and we would only
>> clear the first entry if it was actually a contiguous block. That's
>> probably a less severe symptom than if it was always interpreted as 1
>> and cleared out potentially-present neighboring PTEs.
>>
>> Cc: stable@...r.kernel.org
>> Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit")
>> Signed-off-by: Ryan Roberts <ryan.roberts@....com>
>> ---
>>  arch/arm64/mm/hugetlbpage.c | 40 ++++++++++++++++---------------------
>>  1 file changed, 17 insertions(+), 23 deletions(-)
>>
>> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
>> index 06db4649af91..614b2feddba2 100644
>> --- a/arch/arm64/mm/hugetlbpage.c
>> +++ b/arch/arm64/mm/hugetlbpage.c
>> @@ -163,24 +163,23 @@ static pte_t get_clear_contig(struct mm_struct *mm,
>>  			     unsigned long pgsize,
>>  			     unsigned long ncontig)
>>  {
>> -	pte_t orig_pte = __ptep_get(ptep);
>> -	unsigned long i;
>> -
>> -	for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
>> -		pte_t pte = __ptep_get_and_clear(mm, addr, ptep);
>> -
>> -		/*
>> -		 * If HW_AFDBM is enabled, then the HW could turn on
>> -		 * the dirty or accessed bit for any page in the set,
>> -		 * so check them all.
>> -		 */
>> -		if (pte_dirty(pte))
>> -			orig_pte = pte_mkdirty(orig_pte);
>> -
>> -		if (pte_young(pte))
>> -			orig_pte = pte_mkyoung(orig_pte);
>> +	pte_t pte, tmp_pte;
>> +	bool present;
>> +
>> +	pte = __ptep_get_and_clear(mm, addr, ptep);
>> +	present = pte_present(pte);
> 
> pte_present() may not be evaluated for standard huge pages at [PMD|PUD]_SIZE
> e.g. when ncontig = 1 in the argument.

Sorry, I'm not quite sure what you're suggesting here. Are you proposing that
pte_present() should be moved into the loop so that we only actually call it
when we are going to consume it? I'm happy to do that if that's the preference,
but I thought it was neater to hoist it out of the loop.
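
For concreteness, the in-loop variant would presumably look something like the
sketch below (untested; as far as I can see it's functionally equivalent, since
pte never loses its present-ness inside the loop, so hoisting just avoids
re-evaluating it on every pass):

	while (--ncontig) {
		ptep++;
		addr += pgsize;
		tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
		/* check presence on every iteration instead of hoisting it */
		if (pte_present(pte)) {
			if (pte_dirty(tmp_pte))
				pte = pte_mkdirty(pte);
			if (pte_young(tmp_pte))
				pte = pte_mkyoung(pte);
		}
	}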

> 
>> +	while (--ncontig) {
> 
> Should this be converted into a for loop instead, just to be in sync with other
> similar iterators in this file?
> 
> for (i = 1; i < ncontig; i++, addr += pgsize, ptep++)
> {
> 	tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
> 	if (present) {
> 		if (pte_dirty(tmp_pte))
> 			pte = pte_mkdirty(pte);
> 		if (pte_young(tmp_pte))
> 			pte = pte_mkyoung(pte);
> 	}
> }

I think the way you have written this, it's incorrect. Let's say we have 16 ptes
in the block. We want to iterate over the last 15 of them (we have already read
pte 0). But you're iterating over the first 15, because addr and ptep aren't
incremented until after the first time around the loop. So we would need to
explicitly increment those two before entering the loop. But that is only
necessary if ncontig > 1. Personally I think my approach is neater...
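
For illustration, a for-loop version that visits the right entries would need
the increments at the top of the body (or an explicit bump of ptep/addr before
entering the loop). Rough sketch only, untested, and i would need declaring:

	for (i = 1; i < ncontig; i++) {
		ptep++;
		addr += pgsize;
		tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
		if (present) {
			if (pte_dirty(tmp_pte))
				pte = pte_mkdirty(pte);
			if (pte_young(tmp_pte))
				pte = pte_mkyoung(pte);
		}
	}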

> 
>> +		ptep++;
>> +		addr += pgsize;
>> +		tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
>> +		if (present) {
>> +			if (pte_dirty(tmp_pte))
>> +				pte = pte_mkdirty(pte);
>> +			if (pte_young(tmp_pte))
>> +				pte = pte_mkyoung(pte);
>> +		}
>>  	}
>> -	return orig_pte;
>> +	return pte;
>>  }
>>  
>>  static pte_t get_clear_contig_flush(struct mm_struct *mm,
>> @@ -401,13 +400,8 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
>>  {
>>  	int ncontig;
>>  	size_t pgsize;
>> -	pte_t orig_pte = __ptep_get(ptep);
>> -
>> -	if (!pte_cont(orig_pte))
>> -		return __ptep_get_and_clear(mm, addr, ptep);
>> -
>> -	ncontig = find_num_contig(mm, addr, ptep, &pgsize);
>>  
>> +	ncontig = num_contig_ptes(sz, &pgsize);
>>  	return get_clear_contig(mm, addr, ptep, pgsize, ncontig);
>>  }
>>  

