Message-Id: <87zidc1w7j.fsf@skywalker.in.ibm.com>
Date: Tue, 13 Jun 2017 07:32:24 +0530
From: "Aneesh Kumar K.V" <aneesh.kumar@...ux.vnet.ibm.com>
To: Ram Pai <linuxram@...ibm.com>
Cc: linuxppc-dev@...ts.ozlabs.org, linux-kernel@...r.kernel.org,
benh@...nel.crashing.org, paulus@...ba.org, mpe@...erman.id.au,
khandual@...ux.vnet.ibm.com, bsingharora@...il.com,
dave.hansen@...el.com, hbabu@...ibm.com
Subject: Re: [RFC PATCH 1/7 v1] powerpc: Free up four PTE bits to accommodate memory keys
Ram Pai <linuxram@...ibm.com> writes:
> On Mon, Jun 12, 2017 at 12:27:44PM +0530, Aneesh Kumar K.V wrote:
>> Ram Pai <linuxram@...ibm.com> writes:
>>
>> > Rearrange PTE bits to free up bits 3, 4, 5 and 6 for
>> > memory keys. Bits 3, 4, 5, 6 and 57 shall be used for memory
>> > keys.
>> >
>> > The patch does the following change to the 64K PTE format:
>> >
>> > H_PAGE_BUSY moves from bit 3 to bit 7.
>> > H_PAGE_F_SECOND, which occupied bit 4, moves to the second part
>> > of the pte.
>> > H_PAGE_F_GIX, which occupied bits 5, 6 and 7, also moves to the
>> > second part of the pte.
>> >
>> > The second part of the PTE will hold
>> > one (H_PAGE_F_SECOND|H_PAGE_F_GIX) value for a 64K-page-backed pte,
>> > and sixteen (H_PAGE_F_SECOND|H_PAGE_F_GIX) values for a 4k-backed
>> > pte.
>> >
>> > The four bits (H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
>> > are initialized to 0xF, indicating an invalid slot. If a hash page
>> > does get allocated to the 0xF slot, it is released and not used. In
>> > other words, even though 0xF is a valid slot we discard it and
>> > consider it an invalid slot (HPTE_SOFT_INVALID). This gives us an
>> > opportunity to not depend on a bit in the primary PTE in order to
>> > determine the validity of a slot.
>>
>> Do we need to do this for 64K hptes? H_PAGE_HASHPTE indicates whether a
>> slot is valid or not. For 4K hptes, we do need this, right? ie, only
>> when H_PAGE_COMBO is set do we need to consider 0xf as an invalid slot.
>
> For 64k hptes, you are right: we do not use 0xF to
> track the validity of a slot. We just depend on the H_PAGE_HASHPTE flag.
>
> For 4k hptes, we need to depend on both H_PAGE_HASHPTE as well as the
> value of the slot. H_PAGE_HASHPTE tells us whether any valid
> 4k HPTEs exist, and the 4-bit values in the second-part-of-the-pte tell
> us whether they are valid values.
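>
> ie. something like this untested sketch (hpte_4k_valid is a made-up
> name; the helpers are the ones this patch introduces):
>
> static inline bool hpte_4k_valid(real_pte_t rpte, unsigned int index)
> {
> 	/* no valid 4k HPTE at all unless H_PAGE_HASHPTE is set */
> 	if (!(pte_val(rpte.pte) & H_PAGE_HASHPTE))
> 		return false;
> 	/* a subpage hidx of 0xF marks an invalid slot */
> 	return !HPTE_SOFT_INVALID(__rpte_to_hidx(rpte, index));
> }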
>
> However, in either case we do not need H_PAGE_COMBO. That flag is not
> used for ptes. But we continue to use that flag for pmds to track
> hugepages, which is why I have not entirely divorced H_PAGE_COMBO from
> the 64K pagesize case.
Really? Maybe I am missing that in the patch. H_PAGE_COMBO indicates
whether a 64K linux page is mapped via 4k hash page table entries. I
don't see you changing that in __hash_page_4k(); we still continue
to do:
	new_pte = old_pte | H_PAGE_BUSY | _PAGE_ACCESSED | H_PAGE_COMBO;
	/*
	 * Check if the pte was already inserted into the hash table
	 * as a 64k HW page, and invalidate the 64k HPTE if so.
	 */
	if (!(old_pte & H_PAGE_COMBO)) {
		flush_hash_page(vpn, rpte, MMU_PAGE_64K, ssize, flags);
>
>>
>>
>> >
>> > When we release a 0xF slot we also release a legitimate primary
>> > slot and unmap that entry. This is to ensure that we do get
>> > a legitimate non-0xF slot the next time we retry for a slot.
>>
>> Can you explain this more? What is a primary slot here?
>
> I may not be using the right terminology here. But when I say slot, I
> mean the four bits that tell the position of the hpte in the hash
> buckets. Bit 0 indicates whether it is the primary or secondary hash
> bucket, and bits 1, 2 and 3 indicate the entry in the hash bucket.
>
> So when I say primary slot, I mean an entry in the primary hash bucket.
>
> The idea is, when hpte_insert returns a hpte which is cached in the 7th
> slot of the secondary bucket, i.e. 0xF, we discard it, and also release
> a random entry from the primary bucket, so that on retry we can get
> that entry.
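>
> ie. in code terms, with the existing _PTEIDX_* masks (untested sketch,
> helper names made up):
>
> static inline bool hidx_is_secondary(unsigned long hidx)
> {
> 	return hidx & _PTEIDX_SECONDARY;	/* 0x8: top bit of the nibble */
> }
>
> static inline unsigned long hidx_group_ix(unsigned long hidx)
> {
> 	return hidx & _PTEIDX_GROUP_IX;		/* 0x7: entry 0-7 in the bucket */
> }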
>
>
>>
>> >
>> > Though treating the 0xF slot as invalid reduces the number of
>> > available slots and may have an effect on performance, the
>> > probability of hitting a 0xF slot is extremely low.
>> >
>> > Compared to the current scheme, the above-described scheme reduces
>> > the number of false hash table updates significantly and has the
>> > added advantage of releasing four valuable PTE bits for other
>> > purposes.
>> >
>> > This idea was jointly developed by Paul Mackerras, Aneesh, Michael
>> > Ellerman and myself.
>> >
>> > The 4K PTE format remains unchanged currently.
>> >
>> > Signed-off-by: Ram Pai <linuxram@...ibm.com>
>> > ---
>> > arch/powerpc/include/asm/book3s/64/hash-4k.h | 12 +++++
>> > arch/powerpc/include/asm/book3s/64/hash-64k.h | 38 ++++++--------
>> > arch/powerpc/include/asm/book3s/64/hash.h | 8 ++-
>> > arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 ++
>> > arch/powerpc/mm/dump_linuxpagetables.c | 3 +-
>> > arch/powerpc/mm/hash64_4k.c | 12 ++---
>> > arch/powerpc/mm/hash64_64k.c | 73 +++++++++------------------
>> > arch/powerpc/mm/hash_utils_64.c | 38 +++++++++++++-
>> > arch/powerpc/mm/hugetlbpage-hash64.c | 16 +++---
>> > 9 files changed, 107 insertions(+), 98 deletions(-)
>> >
>> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
>> > index b4b5e6b..223d318 100644
>> > --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
>> > +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
>> > @@ -16,6 +16,18 @@
>> > #define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE)
>> > #define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
>> >
>> > +
>> > +/*
>> > + * Only supported by 4k linux page size
>>
>> That comment is confusing; you can drop it. It is part of hash-4k.h,
>> which means these defines are local to the 4k linux page size config.
>>
>> > + */
>
> right. ok.
>
>> > +#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
>> > +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
>> > +#define H_PAGE_F_GIX_SHIFT 56
>> > +
>> > +#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
>> > +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
>> > +
>> > +
>> > /* PTE flags to conserve for HPTE identification */
>> > #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
>> > H_PAGE_F_SECOND | H_PAGE_F_GIX)
>> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
>> > index 9732837..87ee22d 100644
>> > --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
>> > +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
>> > @@ -10,23 +10,21 @@
>> > * 64k aligned address free up few of the lower bits of RPN for us
>> > * We steal that here. For more details look at pte_pfn/pfn_pte()
>> > */
>> > -#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
>> > -#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
>> > +#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
>> > +#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
>> > +
>> > +#define H_PAGE_BUSY _RPAGE_RPN44 /* software: PTE & hash are busy */
>> > +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
>> > +
>> > /*
>> > * We need to differentiate between explicit huge page and THP huge
>> > * page, since THP huge page also need to track real subpage details
>> > */
>> > #define H_PAGE_THP_HUGE H_PAGE_4K_PFN
>> >
>> > -/*
>> > - * Used to track subpage group valid if H_PAGE_COMBO is set
>> > - * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
>> > - */
>> > -#define H_PAGE_COMBO_VALID (H_PAGE_F_GIX | H_PAGE_F_SECOND)
>> > -
>> > /* PTE flags to conserve for HPTE identification */
>> > -#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
>> > - H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
>> > +#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE)
>> > +
>>
>> You drop H_PAGE_COMBO here. Is that intentional?
>
> Yes, that is intentional. Do you see a downside? We don't depend on
> H_PAGE_COMBO for the 64K pagesize PTE anymore.
>
>>
>> We have code which does
>>
>> 	if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
>> 				       MMU_PAGE_64K, ssize,
>> 				       flags) == -1)
>> 		old_pte &= ~_PAGE_HPTEFLAGS;
>>
>>
>> I guess they expect the slot details to be cleared. With the above
>> _PAGE_HPTEFLAGS, that is not true.
>
> Since we don't depend on the COMBO flag, it should be fine. Let me know
> if you see a downside.
>
>>
>>
>> > /*
>> > * we support 16 fragments per PTE page of 64K size.
>> > */
>> > @@ -54,24 +52,18 @@ static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
>> > unsigned long *hidxp;
>> >
>> > rpte.pte = pte;
>> > - rpte.hidx = 0;
>> > - if (pte_val(pte) & H_PAGE_COMBO) {
>> > - /*
>> > - * Make sure we order the hidx load against the H_PAGE_COMBO
>> > - * check. The store side ordering is done in __hash_page_4K
>> > - */
>> > - smp_rmb();
>> > - hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
>> > - rpte.hidx = *hidxp;
>> > - }
>> > + /*
>> > + * The store side ordering is done in __hash_page_4K
>> > + */
>> > + smp_rmb();
>> > + hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
>> > + rpte.hidx = *hidxp;
>> > return rpte;
>> > }
>> >
>> > static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
>> > {
>> > - if ((pte_val(rpte.pte) & H_PAGE_COMBO))
>> > - return (rpte.hidx >> (index<<2)) & 0xf;
>> > - return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
>> > + return ((rpte.hidx >> (index<<2)) & 0xfUL);
>> > }
>> >
>> > #define __rpte_to_pte(r) ((r).pte)
>> > diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
>> > index 4e957b0..7742782 100644
>> > --- a/arch/powerpc/include/asm/book3s/64/hash.h
>> > +++ b/arch/powerpc/include/asm/book3s/64/hash.h
>> > @@ -8,11 +8,9 @@
>> > *
>> > */
>> > #define H_PTE_NONE_MASK _PAGE_HPTEFLAGS
>> > -#define H_PAGE_F_GIX_SHIFT 56
>> > -#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
>> > -#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
>> > -#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
>> > -#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
>> > +
>> > +#define INIT_HIDX (~0x0UL)
>>
>> But then you use it like
>>
>> rpte.hidx = INIT_HIDX;
>>
>> A better value would be INIT_HIDX = 0xF? Also, maybe INIT is the wrong
>> name there?
>
> No. We are initializing all the entries in the second-part-of-the-pte
> to 0xF, which essentially boils down to ~0x0UL (sixteen 4-bit entries,
> each 0xF, fill the whole 64 bits).
That unsigned long casting is what I was suggesting to remove. We may
want to treat it as 4 bits everywhere?
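
ie. keep the accessors nibble based, something like (untested sketch,
name made up):

static inline unsigned long hidx_get(unsigned long hidx, unsigned int index)
{
	/* each subpage owns a 4-bit field; 0xf marks it invalid */
	return (hidx >> (index * 4)) & 0xfUL;
}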
>
>>
>> > +#define HPTE_SOFT_INVALID(slot) ((slot & 0xfUL) == 0xfUL)
>>
>> Can you do that as a static inline?
>
> ok.
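>
> ie. something like (untested):
>
> static inline bool hpte_soft_invalid(unsigned long slot)
> {
> 	return ((slot & 0xfUL) == 0xfUL);
> }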
>
>>
>> >
>> > #ifdef CONFIG_PPC_64K_PAGES
>> > #include <asm/book3s/64/hash-64k.h>
>> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> > index 6981a52..cfb8169 100644
>> > --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> > +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> > @@ -435,6 +435,11 @@ extern int __hash_page_4K(unsigned long ea, unsigned long access,
>> > extern int __hash_page_64K(unsigned long ea, unsigned long access,
>> > unsigned long vsid, pte_t *ptep, unsigned long trap,
>> > unsigned long flags, int ssize);
>> > +extern unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
>> > + unsigned int subpg_index, unsigned long slot);
>> > +extern unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
>> > + int ssize, real_pte_t rpte, unsigned int subpg_index);
>> > +
>> > struct mm_struct;
>> > unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap);
>> > extern int hash_page_mm(struct mm_struct *mm, unsigned long ea,
>> > diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
>> > index 44fe483..b832ed3 100644
>> > --- a/arch/powerpc/mm/dump_linuxpagetables.c
>> > +++ b/arch/powerpc/mm/dump_linuxpagetables.c
>> > @@ -213,7 +213,7 @@ struct flag_info {
>> > .val = H_PAGE_4K_PFN,
>> > .set = "4K_pfn",
>> > }, {
>> > -#endif
>> > +#else
>> > .mask = H_PAGE_F_GIX,
>> > .val = H_PAGE_F_GIX,
>> > .set = "f_gix",
>> > @@ -224,6 +224,7 @@ struct flag_info {
>> > .val = H_PAGE_F_SECOND,
>> > .set = "f_second",
>> > }, {
>> > +#endif /* CONFIG_PPC_64K_PAGES */
>> > #endif
>> > .mask = _PAGE_SPECIAL,
>> > .val = _PAGE_SPECIAL,
>> > diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
>> > index 6fa450c..5b16416 100644
>> > --- a/arch/powerpc/mm/hash64_4k.c
>> > +++ b/arch/powerpc/mm/hash64_4k.c
>> > @@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>> > pte_t *ptep, unsigned long trap, unsigned long flags,
>> > int ssize, int subpg_prot)
>> > {
>> > + real_pte_t rpte;
>> > unsigned long hpte_group;
>> > unsigned long rflags, pa;
>> > unsigned long old_pte, new_pte;
>> > @@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>> > * need to add in 0x1 if it's a read-only user page
>> > */
>> > rflags = htab_convert_pte_flags(new_pte);
>> > + rpte = __real_pte(__pte(old_pte), ptep);
>> >
>> > if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
>> > !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
>> > @@ -64,12 +66,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>> > /*
>> > * There MIGHT be an HPTE for this pte
>> > */
>> > - hash = hpt_hash(vpn, shift, ssize);
>> > - if (old_pte & H_PAGE_F_SECOND)
>> > - hash = ~hash;
>> > - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
>> > - slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
>> > -
>> > + slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);
>>
>>
>> This confused me: the rest of the code assumes slot is the 4-bit
>> value, but get_hidx_slot is not exactly returning that.
>
> get_hidx_slot() is returning a 4-bit value, no?
It is not:

unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
			    int ssize, real_pte_t rpte, unsigned int subpg_index)
{
	unsigned long hash, slot, hidx;

	hash = hpt_hash(vpn, shift, ssize);
	hidx = __rpte_to_hidx(rpte, subpg_index);
	if (hidx & _PTEIDX_SECONDARY)
		hash = ~hash;
	slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
	slot += hidx & _PTEIDX_GROUP_IX;
	return slot;
}
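
ie. the 4-bit value would be just (untested):

	hidx = __rpte_to_hidx(rpte, subpg_index);	/* 4 bits: bucket + entry */

while what get_hidx_slot() returns is the full index of the hpte in the
hash table.
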
-aneesh