linux-kernel - Re: [RFC PATCH v2 18/20] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <de85dfb2-641e-4309-89b9-72699fc02698@csgroup.eu>
Date: Wed, 22 May 2024 09:32:17 +0000
From: Christophe Leroy <christophe.leroy@...roup.eu>
To: Nicholas Piggin <npiggin@...il.com>, Andrew Morton
	<akpm@...ux-foundation.org>, Jason Gunthorpe <jgg@...dia.com>, Peter Xu
	<peterx@...hat.com>, Oscar Salvador <osalvador@...e.de>, Michael Ellerman
	<mpe@...erman.id.au>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>, "linuxppc-dev@...ts.ozlabs.org"
	<linuxppc-dev@...ts.ozlabs.org>
Subject: Re: [RFC PATCH v2 18/20] powerpc/64s: Use contiguous PMD/PUD instead
 of HUGEPD



Le 22/05/2024 à 03:13, Nicholas Piggin a écrit :
> On Tue May 21, 2024 at 2:43 AM AEST, Christophe Leroy wrote:
>>
>>
>> Le 20/05/2024 à 14:54, Nicholas Piggin a écrit :
>>> On Sat May 18, 2024 at 5:00 AM AEST, Christophe Leroy wrote:
>>>> On book3s/64, the only user of hugepd is hash in 4k mode.
>>>>
>>>> All other setups (hash-64, radix-4, radix-64) use leaf PMD/PUD.
>>>>
>>>> Rework hash-4k to use contiguous PMD and PUD instead.
>>>>
>>>> In that setup there are only two huge page sizes: 16M and 16G.
>>>>
>>>> 16M sits at PMD level and 16G at PUD level.
>>>>
>>>> pte_update doesn't know page size, lets use the same trick as
>>>> hpte_need_flush() to get page size from segment properties. That's
>>>> not the most efficient way but let's do that until callers of
>>>> pte_update() provide page size instead of just a huge flag.
>>>>
>>>> Signed-off-by: Christophe Leroy <christophe.leroy@...roup.eu>
>>>> ---
>>>>    arch/powerpc/include/asm/book3s/64/hash-4k.h  | 15 --------
>>>>    arch/powerpc/include/asm/book3s/64/hash.h     | 38 +++++++++++++++----
>>>>    arch/powerpc/include/asm/book3s/64/hugetlb.h  | 38 -------------------
>>>>    .../include/asm/book3s/64/pgtable-4k.h        | 34 -----------------
>>>>    .../include/asm/book3s/64/pgtable-64k.h       | 20 ----------
>>>>    arch/powerpc/include/asm/hugetlb.h            |  4 ++
>>>>    .../include/asm/nohash/32/hugetlb-8xx.h       |  4 --
>>>>    .../powerpc/include/asm/nohash/hugetlb-e500.h |  4 --
>>>>    arch/powerpc/include/asm/page.h               |  8 ----
>>>>    arch/powerpc/mm/book3s64/hash_utils.c         | 11 ++++--
>>>>    arch/powerpc/mm/book3s64/pgtable.c            | 12 ------
>>>>    arch/powerpc/mm/hugetlbpage.c                 | 19 ----------
>>>>    arch/powerpc/mm/pgtable.c                     |  2 +-
>>>>    arch/powerpc/platforms/Kconfig.cputype        |  1 -
>>>>    14 files changed, 43 insertions(+), 167 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
>>>> index 6472b08fa1b0..c654c376ef8b 100644
>>>> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
>>>> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
>>>> @@ -74,21 +74,6 @@
>>>>    #define remap_4k_pfn(vma, addr, pfn, prot)	\
>>>>    	remap_pfn_range((vma), (addr), (pfn), PAGE_SIZE, (prot))
>>>>    
>>>> -#ifdef CONFIG_HUGETLB_PAGE
>>>> -static inline int hash__hugepd_ok(hugepd_t hpd)
>>>> -{
>>>> -	unsigned long hpdval = hpd_val(hpd);
>>>> -	/*
>>>> -	 * if it is not a pte and have hugepd shift mask
>>>> -	 * set, then it is a hugepd directory pointer
>>>> -	 */
>>>> -	if (!(hpdval & _PAGE_PTE) && (hpdval & _PAGE_PRESENT) &&
>>>> -	    ((hpdval & HUGEPD_SHIFT_MASK) != 0))
>>>> -		return true;
>>>> -	return false;
>>>> -}
>>>> -#endif
>>>> -
>>>>    /*
>>>>     * 4K PTE format is different from 64K PTE format. Saving the hash_slot is just
>>>>     * a matter of returning the PTE bits that need to be modified. On 64K PTE,
>>>> diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
>>>> index faf3e3b4e4b2..509811ca7695 100644
>>>> --- a/arch/powerpc/include/asm/book3s/64/hash.h
>>>> +++ b/arch/powerpc/include/asm/book3s/64/hash.h
>>>> @@ -4,6 +4,7 @@
>>>>    #ifdef __KERNEL__
>>>>    
>>>>    #include <asm/asm-const.h>
>>>> +#include <asm/book3s/64/slice.h>
>>>>    
>>>>    /*
>>>>     * Common bits between 4K and 64K pages in a linux-style PTE.
>>>> @@ -161,14 +162,10 @@ extern void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
>>>>    			    pte_t *ptep, unsigned long pte, int huge);
>>>>    unsigned long htab_convert_pte_flags(unsigned long pteflags, unsigned long flags);
>>>>    /* Atomic PTE updates */
>>>> -static inline unsigned long hash__pte_update(struct mm_struct *mm,
>>>> -					 unsigned long addr,
>>>> -					 pte_t *ptep, unsigned long clr,
>>>> -					 unsigned long set,
>>>> -					 int huge)
>>>> +static inline unsigned long hash__pte_update_one(pte_t *ptep, unsigned long clr,
>>>> +						 unsigned long set)
>>>>    {
>>>>    	__be64 old_be, tmp_be;
>>>> -	unsigned long old;
>>>>    
>>>>    	__asm__ __volatile__(
>>>>    	"1:	ldarx	%0,0,%3		# pte_update\n\
>>>> @@ -182,11 +179,38 @@ static inline unsigned long hash__pte_update(struct mm_struct *mm,
>>>>    	: "r" (ptep), "r" (cpu_to_be64(clr)), "m" (*ptep),
>>>>    	  "r" (cpu_to_be64(H_PAGE_BUSY)), "r" (cpu_to_be64(set))
>>>>    	: "cc" );
>>>> +
>>>> +	return be64_to_cpu(old_be);
>>>> +}
>>>> +
>>>> +static inline unsigned long hash__pte_update(struct mm_struct *mm,
>>>> +					 unsigned long addr,
>>>> +					 pte_t *ptep, unsigned long clr,
>>>> +					 unsigned long set,
>>>> +					 int huge)
>>>> +{
>>>> +	unsigned long old;
>>>> +
>>>> +	old = hash__pte_update_one(ptep, clr, set);
>>>> +
>>>> +	if (huge && IS_ENABLED(CONFIG_PPC_4K_PAGES)) {
>>>> +		unsigned int psize = get_slice_psize(mm, addr);
>>>> +		int nb, i;
>>>> +
>>>> +		if (psize == MMU_PAGE_16M)
>>>> +			nb = SZ_16M / PMD_SIZE;
>>>> +		else if (psize == MMU_PAGE_16G)
>>>> +			nb = SZ_16G / PUD_SIZE;
>>>> +		else
>>>> +			nb = 1;
>>>> +
>>>> +		for (i = 1; i < nb; i++)
>>>> +			hash__pte_update_one(ptep + i, clr, set);
>>>> +	}
>>>>    	/* huge pages use the old page table lock */
>>>>    	if (!huge)
>>>>    		assert_pte_locked(mm, addr);
>>>>    
>>>> -	old = be64_to_cpu(old_be);
>>>>    	if (old & H_PAGE_HASHPTE)
>>>>    		hpte_need_flush(mm, addr, ptep, old, huge);
>>>>    
>>>
>>> Nice series, I don't know this hugepd code very well but I'll try.
>>> Why do you have to replicate the PTE entry here? The hash table refill
>>> should always be working on the first PTE of the page otherwise we have
>>> bigger problems.
>>
>> I don't know how book3s/64 works exactly, but on nohash, when you get a
>> TLB miss exception the only thing you have is the address and you don't
>> know yes it is a hugepage so you get the PTE as if it was a 4k page and
>> it is only when you read that PTE that you know it is a hugepage.
>>
>> Ok, on book3s/64 the page size seems to be encoded inside the segment so
>> maybe it is a bit different but anyway the TLB miss exception (or DSI ?)
>> can happen at any address.
> 
> Right.
> 
> If you think of the hash page table as a software loaded TLB (which
> is how Linux kind of thinks of it), then DSI is a TLB miss. hash_page_x
> calls find the Linux pte and load that translation into hash page table.
> 
> One of the hard parts is keeping them coherent with low overhead. This
> requires pte bits H_PAGE_BUSY as a lock and H_PAGE_HASHPTE which means
> it might be in the hash table. So Linux PTE and hash PTE have to be
> 1:1 in general.
> 
> There are probably cases where we could get away from 1:1, but I would
> much prefer not to. Maybe read-only access would be okay though. But
> the hash_page will have to always operate on the 0th pte, which I think
> we get via segment size masking, same for any set / update / clear of
> the pte.
> 
>>>
>>> What paths look at the N > 0 PTEs of a contiguous page entry?
>>>
>>
>> pte_offset_kernel() or pte_offset_map_lock() will land on any contiguous
>> PTE based on the address handed to pte_index(), as if it was a standard
>> (4k or 64k) page.
>>
>> pte_index() doesn't know it is a hugepage, that's the reason why we need
>> to duplicate the entry.
> 
>  From the mm/ side of things, hugetlb page tables are always walked via
> the huge vma which knows the page size and could align address... I
> guess except for fast gup? Which should be read-only. So okay you do
> need to replicate huge ptes for fast gup at least. Any others?
> 
> There's going to need to be a little more to it. __hash_page_huge sets
> PTE accessed and dirty for example, so if we allow any PTE readers to
> check the non-0th pte we would have to do something about that.
> 
> How do you deal with dirty/accessed bits for other subarchs?

All nohash bail out of TLB miss handler when accessing a page which 
doesn have the ACCESSED bit or writing a page which doesn't have DIRTY 
bit, see commit 2c74e2586bb9 ("powerpc/40x: Rework 40x PTE access and 
TLB miss") and other commits it refers to.

Same for the 603 which is the nohash version of book3s/32, see commits 
f8b58c64eaef ("powerpc/603: let's handle PAGE_DIRTY directly") and 
84de6ab0e904 ("powerpc/603: don't handle PAGE_ACCESSED in TLB miss 
handlers.").

Only the hash version of book3s/32 still updated PTE in miss handler, 
see 
https://elixir.bootlin.com/linux/v6.9/source/arch/powerpc/mm/book3s32/hash_low.S#L146 
but there are no hugepages on book3s/32


> 
> We could just remove the hash_page setting of those bits and just cause
> a fault and require Linux mm to set them. At least for hugepages we
> could do that probably without any real performance worry.
> 
> Thanks,
> Nick