lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f59228dd-2186-c882-3774-c9778918cd31@arm.com>
Date:   Tue, 4 Jul 2023 12:09:31 +0100
From:   Ryan Roberts <ryan.roberts@....com>
To:     Catalin Marinas <catalin.marinas@....com>
Cc:     Will Deacon <will@...nel.org>, Ard Biesheuvel <ardb@...nel.org>,
        Marc Zyngier <maz@...nel.org>,
        Oliver Upton <oliver.upton@...ux.dev>,
        James Morse <james.morse@....com>,
        Suzuki K Poulose <suzuki.poulose@....com>,
        Zenghui Yu <yuzenghui@...wei.com>,
        Andrey Ryabinin <ryabinin.a.a@...il.com>,
        Alexander Potapenko <glider@...gle.com>,
        Andrey Konovalov <andreyknvl@...il.com>,
        Dmitry Vyukov <dvyukov@...gle.com>,
        Vincenzo Frascino <vincenzo.frascino@....com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Anshuman Khandual <anshuman.khandual@....com>,
        Matthew Wilcox <willy@...radead.org>,
        Yu Zhao <yuzhao@...gle.com>,
        Mark Rutland <mark.rutland@....com>,
        linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org
Subject: Re: [PATCH v1 11/14] arm64/mm: Wire up PTE_CONT for user mappings

On 03/07/2023 16:17, Catalin Marinas wrote:
> Hi Ryan,
> 
> Some comments below. I did not have time to trim down the quoted text,
> so you may need to scroll through it.

Thanks for the review!

Looking at the comments, I think they all relate to implementation. Does that
imply that you are happy with the shape/approach?

Talking with Anshuman yesterday, he suggested putting this behind a new Kconfig
option that defaults to disabled and also adding a command line option to
disable it when compiled in. I think that makes sense for now at least to reduce
risk of performance regression?

> 
> On Thu, Jun 22, 2023 at 03:42:06PM +0100, Ryan Roberts wrote:
>> With the ptep API sufficiently refactored, we can now introduce a new
>> "contpte" API layer, which transparently manages the PTE_CONT bit for
>> user mappings. Whenever it detects a set of PTEs that meet the
>> requirements for a contiguous range, the PTEs are re-painted with the
>> PTE_CONT bit.
>>
>> This initial change provides a baseline that can be optimized in future
>> commits. That said, fold/unfold operations (which imply tlb
>> invalidation) are avoided where possible with a few tricks for
>> access/dirty bit management.
>>
>> Write-enable and write-protect modifications are likely non-optimal and
>> likely incure a regression in fork() performance. This will be addressed
>> separately.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@....com>
>> ---
>>  arch/arm64/include/asm/pgtable.h | 137 ++++++++++++-
>>  arch/arm64/mm/Makefile           |   3 +-
>>  arch/arm64/mm/contpte.c          | 334 +++++++++++++++++++++++++++++++
>>  3 files changed, 466 insertions(+), 8 deletions(-)
>>  create mode 100644 arch/arm64/mm/contpte.c
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 31df4d73f9ac..17ea534bc5b0 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1115,6 +1115,71 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>  				    unsigned long addr, pte_t *ptep,
>>  				    pte_t old_pte, pte_t new_pte);
>>  
>> +/*
>> + * The contpte APIs are used to transparently manage the contiguous bit in ptes
>> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered
>> + * a private implementation detail of the public ptep API (see below).
>> + */
>> +extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte);
>> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte);
>> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte, unsigned int nr);
>> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep);
>> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep);
>> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep,
>> +				pte_t entry, int dirty);
>> +
>> +static inline pte_t *contpte_align_down(pte_t *ptep)
>> +{
>> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
>> +}
>> +
>> +static inline bool contpte_is_enabled(struct mm_struct *mm)
>> +{
>> +	/*
>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>> +	 * dynamically adding/removing the contig bit can cause page faults.
>> +	 * These racing faults are ok for user space, since they get serialized
>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>> +	 */
>> +
>> +	return mm != &init_mm;
>> +}
>> +
>> +static inline void contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, pte_t pte)
>> +{
>> +	/*
>> +	 * Only bother trying if both the virtual and physical addresses are
>> +	 * aligned and correspond to the last entry in a contig range. The core
>> +	 * code mostly modifies ranges from low to high, so this is the likely
>> +	 * the last modification in the contig range, so a good time to fold.
>> +	 */
>> +
>> +	bool valign = ((unsigned long)ptep >> 3) % CONT_PTES == CONT_PTES - 1;
>> +	bool palign = pte_pfn(pte) % CONT_PTES == CONT_PTES - 1;
>> +
>> +	if (contpte_is_enabled(mm) &&
>> +	    pte_present(pte) && !pte_cont(pte) &&
>> +	    valign && palign)
>> +		__contpte_try_fold(mm, addr, ptep, pte);
> 
> I would use pte_valid() here instead. pte_present() also includes the
> PTE_PROT_NONE option which we don't really care about as it's not
> accessible.

Yep good point. I'll audit all of this and make the appropriate changes for v2.

> 
> I've been discussing with Alexandru Elisei about PTE_PROT_NONE and
> whether we can use other bits from the pte even if they clash with other
> valid permissions. Since the pte is not valid, in theory we could as
> long as nothing corrupts the (like a cont bit). The background to this
> is multiple migrate types (not just NUMA) for the MTE tag carveout
> reuse.

ACK.

> 
>> +}
>> +
>> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, pte_t pte)
>> +{
>> +	if (contpte_is_enabled(mm) &&
>> +	    pte_present(pte) && pte_cont(pte))
>> +		__contpte_try_unfold(mm, addr, ptep, pte);
>> +}
> 
> Same here and probably most other places where pte_present() is used in
> this patch.

ACK.

> 
>> +
>>  /*
>>   * The below functions constitute the public API that arm64 presents to the
>>   * core-mm to manipulate PTE entries within the their page tables (or at least
>> @@ -1122,30 +1187,68 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>   * versions will automatically and transparently apply the contiguous bit where
>>   * it makes sense to do so. Therefore any users that are contig-aware (e.g.
>>   * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
>> - * private versions, which are prefixed with double underscore.
>> + * private versions, which are prefixed with double underscore. All of these
>> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
>> + * held.
>>   */
>>  
>>  #define ptep_get ptep_get
>>  static inline pte_t ptep_get(pte_t *ptep)
>>  {
>> -	return __ptep_get(ptep);
>> +	pte_t pte = __ptep_get(ptep);
>> +
>> +	if (!pte_present(pte) || !pte_cont(pte))
>> +		return pte;
>> +
>> +	return contpte_ptep_get(ptep, pte);
>> +}
>> +
>> +#define ptep_get_lockless ptep_get_lockless
>> +static inline pte_t ptep_get_lockless(pte_t *ptep)
>> +{
>> +	pte_t pte = __ptep_get(ptep);
>> +
>> +	if (!pte_present(pte) || !pte_cont(pte))
>> +		return pte;
>> +
>> +	return contpte_ptep_get_lockless(ptep);
>>  }
>>  
>>  static inline void set_pte(pte_t *ptep, pte_t pte)
>>  {
>> -	__set_pte(ptep, pte);
>> +	/*
>> +	 * We don't have the mm or vaddr so cannot unfold or fold contig entries
>> +	 * (since it requires tlb maintenance). set_pte() is not used in core
>> +	 * code, so this should never even be called. Regardless do our best to
>> +	 * service any call and emit a warning if there is any attempt to set a
>> +	 * pte on top of an existing contig range.
>> +	 */
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	WARN_ON_ONCE(pte_present(orig_pte) && pte_cont(orig_pte));
>> +	__set_pte(ptep, pte_mknoncont(pte));
> 
> Why the pte_mknoncont() here? Do we expect a contiguous pte? The warning
> only checks the old entry.

Originally, it was my intent that PTE_CONT bit would be totally private to this
layer and the bit should never leak to the generic code (i.e. ptep_get() would
clear it before returning the pte and all functions that accept a pte would WARN
if the bit was set on entry.

However, this approach proved problematic for accounting; I have a separate
change that logs the amount of memory mapped as contpte in
/proc/<pid>/smaps[_rollup]. For this to work, the PTE_CONT bit must be leaked to
the generic code (ptep_get() no longer explicitly clears it). But if we
deliberately leak it, then its possible that it will be set in functions that
take a pte, which would lead to incorrect behavior (potentially leading to a
contpte range that has some PTE_CONT bits set and others cleared). This happens
because there is generic code that follows a pattern like this:

  pte = ptep_get_and_clear(ptep)
  pte = modify_some_bits(pte)
  set_pte_at(pte)

To solve this, I'm explicitly clearing CONT_PTE from any pte that is passed in
to one of these functions.

> 
>>  }
>>  
>>  #define set_ptes set_ptes
>>  static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>>  				pte_t *ptep, pte_t pte, unsigned int nr)
>>  {
>> -	__set_ptes(mm, addr, ptep, pte, nr);
>> +	pte = pte_mknoncont(pte);
>> +
>> +	if (!contpte_is_enabled(mm))
>> +		__set_ptes(mm, addr, ptep, pte, nr);
>> +	else if (nr == 1) {
>> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +		__set_ptes(mm, addr, ptep, pte, nr);
>> +		contpte_try_fold(mm, addr, ptep, pte);
>> +	} else
>> +		contpte_set_ptes(mm, addr, ptep, pte, nr);
>>  }
>>  
>>  static inline void pte_clear(struct mm_struct *mm,
>>  				unsigned long addr, pte_t *ptep)
>>  {
>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>  	__pte_clear(mm, addr, ptep);
>>  }
>>  
>> @@ -1153,6 +1256,7 @@ static inline void pte_clear(struct mm_struct *mm,
>>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>  				unsigned long addr, pte_t *ptep)
>>  {
>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>  	return __ptep_get_and_clear(mm, addr, ptep);
>>  }
>>  
>> @@ -1160,21 +1264,33 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>  static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>  				unsigned long addr, pte_t *ptep)
>>  {
>> -	return __ptep_test_and_clear_young(vma, addr, ptep);
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
>> +		return __ptep_test_and_clear_young(vma, addr, ptep);
> 
> Since I've seen this construct a few times, you may want to turn it into
> a specific check: pte_valid_cont().

ACK - will do for v2.

> 
>> +
>> +	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>  }
>>  
>>  #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>>  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>  				unsigned long addr, pte_t *ptep)
>>  {
>> -	return __ptep_clear_flush_young(vma, addr, ptep);
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
>> +		return __ptep_clear_flush_young(vma, addr, ptep);
>> +
>> +	return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>  }
>>  
>>  #define __HAVE_ARCH_PTEP_SET_WRPROTECT
>>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
>>  				unsigned long addr, pte_t *ptep)
>>  {
>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>  	__ptep_set_wrprotect(mm, addr, ptep);
>> +	contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
>>  }
>>  
>>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>> @@ -1182,7 +1298,14 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>>  				unsigned long addr, pte_t *ptep,
>>  				pte_t entry, int dirty)
>>  {
>> -	return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	entry = pte_mknoncont(entry);
> 
> As in a few other places, it's not clear to me why the pte_mknoncont()
> is needed. Here I expect 'entry' to be cont if *ptep is cont.

See explanation above.

> 
>> +
>> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
>> +		return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> 
> Also wondering, can we have this check on 'entry' rather than
> 'orig_pte'? And maybe a warning if the cont bit differs between them.

No - the idea is that this API layer has exclusicve control over whether
PTE_CONT is set in the pgtable. Upper layers should never pass a pte with
CONT_PTE set (except for the corner case described above which we deal with by
explicitly clearing PTE_CONT from the passed in pte).

So the check must be on orig_pte - we are checking if a contpte range is present
over the pte we are about to modify. If it is, then we need to handle it
carefully (potentially by unfolding it first; handled by
contpte_ptep_set_access_flags()). If there is no contprte range, then we can
just handle it the "normal" way.

> 
>> +
>> +	return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>>  }
>>  
>>  #endif /* !__ASSEMBLY__ */
>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>> index dbd1bc95967d..70b6aba09b5d 100644
>> --- a/arch/arm64/mm/Makefile
>> +++ b/arch/arm64/mm/Makefile
>> @@ -2,7 +2,8 @@
>>  obj-y				:= dma-mapping.o extable.o fault.o init.o \
>>  				   cache.o copypage.o flush.o \
>>  				   ioremap.o mmap.o pgd.o mmu.o \
>> -				   context.o proc.o pageattr.o fixmap.o
>> +				   context.o proc.o pageattr.o fixmap.o \
>> +				   contpte.o
>>  obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
>>  obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
>>  obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> new file mode 100644
>> index 000000000000..e8e4a298fd53
>> --- /dev/null
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -0,0 +1,334 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (C) 2023 ARM Ltd.
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <asm/tlbflush.h>
>> +
>> +static void ptep_clear_flush_range(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, int nr)
>> +{
>> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>> +	unsigned long start_addr = addr;
>> +	int i;
>> +
>> +	for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
>> +		__pte_clear(mm, addr, ptep);
>> +
>> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>> +}
>> +
>> +static bool ptep_any_present(pte_t *ptep, int nr)
> 
> Valid?

ACK

> 
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < nr; i++, ptep++) {
>> +		if (pte_present(__ptep_get(ptep)))
>> +			return true;
>> +	}
>> +
>> +	return false;
>> +}
>> +
>> +static void contpte_fold(struct mm_struct *mm, unsigned long addr,
>> +			pte_t *ptep, pte_t pte, bool fold)
>> +{
>> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>> +	unsigned long start_addr;
>> +	pte_t *start_ptep;
>> +	int i;
>> +
>> +	start_ptep = ptep = contpte_align_down(ptep);
>> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>> +	pte = fold ? pte_mkcont(pte) : pte_mknoncont(pte);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>> +
>> +		if (pte_dirty(ptent))
>> +			pte = pte_mkdirty(pte);
>> +
>> +		if (pte_young(ptent))
>> +			pte = pte_mkyoung(pte);
>> +	}
> 
> I presume this can be unsafe if any of the ptes in the range differ, so
> we need some higher level check. 

Sorry I'm not quite sure what you mean here? The higher level check is where we
look at the current value of the target PTE; if PTE_CONT is set then we know it
is part of a contpte range. We are careful that PTE_CONT is set consistently for
all (valid) PTEs in a contpte range, so we only need to check 1 entry. There is
no risk of racing here because we are always serialized by the PTL.

> But that means we now have three loops
> for folding, one to check, another to clear and the last one via
> __set_ptes(). I guess we can't collapse the first two loops in a 'try'
> function as we need to do the cleaning (and would have to re-instate the
> old entries if they can't be made contiguous).

Yes 3 loops, and I don't see how you would reduce that. The good news is that
this folding path should be rarely taken; most qualifying ranges will be set via
set_ptes() so they are ritten "pre-folded". We only attempt to fold after
setting the pte at the _end_ of the range (see contpte_try_fold()), and the
checker loop in __contpte_try_fold() will usually exit on the second iteration
if the memory is not physically contiguous.

> 
>> +
>> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>> +
>> +	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>> +}
>> +
>> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>> +			pte_t *ptep, pte_t pte)
>> +{
>> +	/*
>> +	 * We have already checked that the virtual and pysical addresses are
>> +	 * correctly aligned for a contig mapping in contpte_try_fold() so the
>> +	 * remaining checks are to ensure that the contig range is fully covered
>> +	 * by a single folio, and ensure that all the ptes are present with
>> +	 * contiguous PFNs and matching prots.
>> +	 */
>> +
>> +	struct page *page = pte_page(pte);
>> +	struct folio *folio = page_folio(page);
>> +	unsigned long folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
>> +	unsigned long folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
>> +	unsigned long cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +	unsigned long cont_eaddr = cont_saddr + CONT_PTE_SIZE;
>> +	unsigned long pfn;
>> +	pgprot_t prot;
>> +	pte_t subpte;
>> +	pte_t *orig_ptep;
>> +	int i;
>> +
>> +	if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
>> +		return;
>> +
>> +	pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
>> +	prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>> +	orig_ptep = ptep;
>> +	ptep = contpte_align_down(ptep);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>> +		subpte = __ptep_get(ptep);
>> +		subpte = pte_mkold(pte_mkclean(subpte));
> 
> IIUC, this function assumes ptes that only differ by the dirty status
> can be contiguous. That's probably ok, with a chance of the dirty status
> spreading to the adjacent ptes in the fold function. Maybe add a comment
> on why this is ok (or why it doesn't happen).

Conceptually a contpte range only has a single access and dirty bit. So when
folding, we or all the access bits and all the dirty bits from the constituent
ptes to determine the single access and dirty bits for the contpte mapping. And
when unfolding, we take the single access and dirty bit for the contpte mapping
and apply those values to every individual entry.

So yes, we ignore the access and dirty values for the subptes when evaluating
whether a contiguous range exists. I'll add a comment.

> 
>> +
>> +		if (!pte_present(subpte) ||
>> +		    pte_pfn(subpte) != pfn ||
>> +		    pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
>> +			return;
>> +	}
>> +
>> +	contpte_fold(mm, addr, orig_ptep, pte, true);
>> +}
>> +
>> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +			pte_t *ptep, pte_t pte)
>> +{
>> +	/*
>> +	 * We have already checked that the ptes are contiguous in
>> +	 * contpte_try_unfold(), so we can unfold unconditionally here.
>> +	 */
>> +
>> +	contpte_fold(mm, addr, ptep, pte, false);
>> +}
> 
> So the pte_mkyoung(), pte_mkdirty() calls in contpte_fold() are mostly
> for the unfold case. Maybe it's clearer if we just have two separate
> functions (or document why the pte_mk*() are needed).

No that's not the case. In the unfold case, we need to "collect" the single
access and dirty bit from the contpte mapping (these may be in any of the
entries), and set the final values for all individual ptes during unfolding.

The obvious side effect here is that if any one page is dirty at fold time, the
whole range will be marked as dirty after folding, then at unfolding all pages
will be marked as dirty (same goes for access). This is the same concern that I
raised in the cover letter. I don't think this is a problem from the kernel's
point of view; the kernel will compress the per-page access/dirty info to
per-folio and we only fold if the whole range is covered by a single folio. But
user space could observe this "over-dirtying" through /proc/<pid>/pagemap. I'm
not sure if that's a problem in practice?

> 
>> +
>> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> +{
>> +	/*
>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>> +	 * of the contig range. We are guarranteed to be holding the PTL, so any
>> +	 * contiguous range cannot be unfolded or otherwise modified under our
>> +	 * feet.
>> +	 */
>> +
>> +	pte_t pte;
>> +	int i;
>> +
>> +	ptep = contpte_align_down(ptep);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++) {
>> +		pte = __ptep_get(ptep);
>> +
>> +		/*
>> +		 * Deal with the partial contpte_ptep_get_and_clear_full() case,
>> +		 * where some of the ptes in the range may be cleared but others
>> +		 * are still to do. See contpte_ptep_get_and_clear_full().
>> +		 */
>> +		if (pte_val(pte) == 0)
>> +			continue;
>> +
>> +		if (pte_dirty(pte))
>> +			orig_pte = pte_mkdirty(orig_pte);
>> +
>> +		if (pte_young(pte))
>> +			orig_pte = pte_mkyoung(orig_pte);
>> +	}
>> +
>> +	return orig_pte;
>> +}
>> +
>> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>> +{
>> +	/*
>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>> +	 * of the contig range. We may not be holding the PTL, so any contiguous
>> +	 * range may be unfolded/modified/refolded under our feet.
>> +	 */
>> +
>> +	pte_t orig_pte;
>> +	pgprot_t orig_prot;
>> +	pte_t *ptep;
>> +	unsigned long pfn;
>> +	pte_t pte;
>> +	pgprot_t prot;
>> +	int i;
>> +
>> +retry:
>> +	orig_pte = __ptep_get(orig_ptep);
>> +
>> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
>> +		return orig_pte;
> 
> I haven't looked through all the patches, so not entirely sure when this
> function is called. 

ptep_get_lockless() is one of the mm optional arch interfaces. arm64 doesn't
currently implement it, because ptep_get() (READ_ONCE()) is safe without the
lock being held. But with the introduction of contpte mappings, there are cases
now where we have to read the whole contpte range to gather access and dirty,
which obviously isn't atomic. And doing that without the PTL is harder than when
we have the PTL, so I've implemented ptep_get_lockless() so we can assume the
PTL is held in ptep_get() and do the simple thing there (the common case).

> But since you mention that the range may be
> folded/unfolded, how do we deal with pte_cont() racing with something
> setting the contig bit?

ptep_get_lockless() is inherrently racy. My intedntion was that we just need to
ensure we read a pte or contpte range that is consistent with itself.

> 
>> +
>> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
>> +	ptep = contpte_align_down(orig_ptep);
>> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>> +		pte = __ptep_get(ptep);
>> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>> +
>> +		if (!pte_present(pte) || !pte_cont(pte) ||
>> +		   pte_pfn(pte) != pfn ||
>> +		   pgprot_val(prot) != pgprot_val(orig_prot))
>> +			goto retry;
> 
> It needs better documenting, I don't understand what the retry here is
> for (presumably to handle some races). Do we care about some memory
> ordering as well? __pte_get() only takes care of reading the ptep once.

The intention is that the loop keeps retrying until it scans a whole contpte
range that is consistent with itself (i.e. PTE_CONT bit is set in all, pfn
increments monotonically and pgprots are all the same). If any of those
considtions are not true, it indicates we are racing with an update and need to
retry until its consistent. I'd need to think a bit more on whether we need
anything special for memory ordering...

To be honest, I'm not a big fan of this function. As far as I can tell, the only
user of ptep_get_lockless() that cares about access/dirty is ptdump. Perhaps we
can re-spec this to not return access/dirty info (that would simplify it back to
a READ_ONCE()), then figure out a way to hold the PTL for ptdump and use
ptep_get() which will return the access/dirty info correctly. Do you think
something like that could work?

> 
>> +
>> +		if (pte_dirty(pte))
>> +			orig_pte = pte_mkdirty(orig_pte);
>> +
>> +		if (pte_young(pte))
>> +			orig_pte = pte_mkyoung(orig_pte);
>> +	}
>> +
>> +	return orig_pte;
>> +}
>> +
>> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, pte_t pte, unsigned int nr)
>> +{
>> +	unsigned long next;
>> +	unsigned long end = addr + (nr << PAGE_SHIFT);
>> +	unsigned long pfn = pte_pfn(pte);
>> +	pgprot_t prot = pte_pgprot(pte);
>> +	pte_t orig_pte;
>> +
>> +	do {
>> +		next = pte_cont_addr_end(addr, end);
>> +		nr = (next - addr) >> PAGE_SHIFT;
>> +		pte = pfn_pte(pfn, prot);
>> +
>> +		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
>> +			pte = pte_mkcont(pte);
>> +		else
>> +			pte = pte_mknoncont(pte);
>> +
>> +		/*
>> +		 * If operating on a partial contiguous range then we must first
>> +		 * unfold the contiguous range if it was previously folded.
>> +		 * Otherwise we could end up with overlapping tlb entries.
>> +		 */
>> +		if (nr != CONT_PTES)
>> +			contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +
>> +		/*
>> +		 * If we are replacing ptes that were contiguous or if the new
>> +		 * ptes are contiguous and any of the ptes being replaced are
>> +		 * present, we need to clear and flush the range to prevent
>> +		 * overlapping tlb entries.
>> +		 */
>> +		orig_pte = __ptep_get(ptep);
>> +		if ((pte_present(orig_pte) && pte_cont(orig_pte)) ||
>> +		    (pte_cont(pte) && ptep_any_present(ptep, nr)))
>> +			ptep_clear_flush_range(mm, addr, ptep, nr);
>> +
>> +		__set_ptes(mm, addr, ptep, pte, nr);
>> +
>> +		addr = next;
>> +		ptep += nr;
>> +		pfn += nr;
>> +
>> +	} while (addr != end);
>> +}
>> +
>> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep)
>> +{
>> +	/*
>> +	 * ptep_clear_flush_young() technically requires us to clear the access
>> +	 * flag for a _single_ pte. However, the core-mm code actually tracks
>> +	 * access/dirty per folio, not per page. And since we only create a
>> +	 * contig range when the range is covered by a single folio, we can get
>> +	 * away with clearing young for the whole contig range here, so we avoid
>> +	 * having to unfold.
>> +	 */
>> +
>> +	int i;
>> +	int young = 0;
>> +
>> +	ptep = contpte_align_down(ptep);
>> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>> +		young |= __ptep_test_and_clear_young(vma, addr, ptep);
>> +
>> +	return young;
>> +}
>> +
>> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep)
>> +{
>> +	int young;
>> +
>> +	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
>> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +
>> +	if (young) {
>> +		/*
>> +		 * See comment in __ptep_clear_flush_young(); same rationale for
>> +		 * eliding the trailing DSB applies here.
>> +		 */
>> +		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
>> +					 PAGE_SIZE, true, 3);
>> +	}
>> +
>> +	return young;
>> +}
>> +
>> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep,
>> +					pte_t entry, int dirty)
>> +{
>> +	pte_t orig_pte;
>> +	int i;
>> +
>> +	/*
>> +	 * Gather the access/dirty bits for the contiguous range. If nothing has
>> +	 * changed, its a noop.
>> +	 */
>> +	orig_pte = ptep_get(ptep);
>> +	if (pte_val(orig_pte) == pte_val(entry))
>> +		return 0;
>> +
>> +	/*
>> +	 * We can fix up access/dirty bits without having to unfold/fold the
>> +	 * contig range. But if the write bit is changing, we need to go through
>> +	 * the full unfold/fold cycle.
>> +	 */
>> +	if (pte_write(orig_pte) == pte_write(entry)) {
> 
> Depending on the architecture version, pte_write() either checks a
> software only bit or it checks the DBM one.
> 
>> +		/*
>> +		 * No need to flush here; This is always "more permissive" so we
>> +		 * can only be _adding_ the access or dirty bit. And since the
>> +		 * tlb can't cache an entry without the AF set and the dirty bit
>> +		 * is a SW bit, there can be no confusion. For HW access
>> +		 * management, we technically only need to update the flag on a
>> +		 * single pte in the range. But for SW access management, we
>> +		 * need to update all the ptes to prevent extra faults.
>> +		 */
> 
> On pre-DBM hardware, a PTE_RDONLY entry (writable from the kernel
> perspective but clean) may be cached in the TLB and we do need flushing.

I don't follow; The Arm ARM says:

  IPNQBP When an Access flag fault is generated, the translation table entry
         causing the fault is not cached in a TLB.

So the entry can only be in the TLB if AF is already 1. And given the dirty bit
is SW, it shouldn't affect the TLB state. And this function promises to only
change the bits so they are more permissive (so AF=0 -> AF=1, D=0 -> D=1).

So I'm not sure what case you are describing here?

> 
>> +		ptep = contpte_align_down(ptep);
>> +		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +
>> +		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>> +			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
>> +	} else {
>> +		/*
>> +		 * No need to flush in __ptep_set_access_flags() because we just
>> +		 * flushed the whole range in __contpte_try_unfold().
>> +		 */
>> +		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
>> +		__ptep_set_access_flags(vma, addr, ptep, entry, 0);
>> +		contpte_try_fold(vma->vm_mm, addr, ptep, entry);
>> +	}
>> +
>> +	return 1;
>> +}
> 

Thanks,
Ryan

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ