linux-kernel - Re: [RFC PATCH v1 1/4] mm: Introduce ptep_get_lockless

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7aefa967-43aa-490b-ae0d-7d1455402e89@redhat.com>
Date: Tue, 26 Mar 2024 17:27:08 +0100
From: David Hildenbrand <david@...hat.com>
To: Ryan Roberts <ryan.roberts@....com>, Mark Rutland <mark.rutland@....com>,
 Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>,
 Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
 Jiri Olsa <jolsa@...nel.org>, Ian Rogers <irogers@...gle.com>,
 Adrian Hunter <adrian.hunter@...el.com>,
 Andrew Morton <akpm@...ux-foundation.org>,
 Muchun Song <muchun.song@...ux.dev>
Cc: linux-arm-kernel@...ts.infradead.org, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v1 1/4] mm: Introduce ptep_get_lockless_norecency()

On 15.02.24 13:17, Ryan Roberts wrote:
> With the introduction of contpte mapping support for arm64, that
> architecture's implementation of ptep_get_lockless() has become very
> complex due to the need to gather access and dirty bits from across all
> of the ptes in the contpte block. This requires careful implementation
> to ensure the returned value is consistent (because its not possible to
> read all ptes atomically), but even in the common case when there is no
> racing modification, we have to read all ptes, which gives an ~O(n^2)
> cost if the core-mm is iterating over a range, and performing a
> ptep_get_lockless() on each pte.
> 
> Solve this by introducing ptep_get_lockless_norecency(), which does not
> make any guarantees about access and dirty bits. Therefore it can simply
> read the single target pte.
> 
> At the same time, convert all call sites that previously used
> ptep_get_lockless() but don't care about access and dirty state.
> 

I'd probably split that part off.

> We may want to do something similar for ptep_get() (i.e.
> ptep_get_norecency()) in future; it doesn't suffer from the consistency
> problem because the PTL serializes it with any modifications, but does
> suffer the same O(n^2) cost.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@....com>
> ---
>   include/linux/pgtable.h | 37 ++++++++++++++++++++++++++++++++++---
>   kernel/events/core.c    |  2 +-
>   mm/hugetlb.c            |  2 +-
>   mm/khugepaged.c         |  2 +-
>   mm/memory.c             |  2 +-
>   mm/swap_state.c         |  2 +-
>   mm/swapfile.c           |  2 +-
>   7 files changed, 40 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index a36cf4e124b0..9dd40fdbd825 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -528,16 +528,47 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
>   #endif /* CONFIG_PGTABLE_LEVELS > 2 */
>   #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
> 
> -/*
> - * We require that the PTE can be read atomically.
> - */
>   #ifndef ptep_get_lockless
> +/**
> + * ptep_get_lockless - Get a pte without holding the page table lock. Young and
> + *                     dirty bits are guaranteed to accurately reflect the state
> + *                     of the pte at the time of the call.
> + * @ptep: Page table pointer for pte to get.
> + *
> + * If young and dirty information is not required, use
> + * ptep_get_lockless_norecency() which can be faster on some architectures.
> + *
> + * May be overridden by the architecture; otherwise, implemented using
> + * ptep_get(), on the assumption that it is atomic.
> + *
> + * Context: Any.
> + */

I think we usually say "Any context.". But I would just do it like idr.h:

"Any context. It is safe to call this function without locking in your 
code."

.. but is this true? We really want to say "without page table lock". 
Because there must be some way to prevent against concurrent page table 
freeing. For example, GUP-fast disables IRQs, whereby page table freeing 
code frees using RCU.

>   static inline pte_t ptep_get_lockless(pte_t *ptep)
>   {
>   	return ptep_get(ptep);
>   }
>   #endif
> 
> +#ifndef ptep_get_lockless_norecency
> +/**
> + * ptep_get_lockless_norecency - Get a pte without holding the page table lock.
> + *				 Young and dirty bits may not be accurate.
> + * @ptep: Page table pointer for pte to get.
> + *
> + * Prefer this over ptep_get_lockless() when young and dirty information is not
> + * required since it can be faster on some architectures.
> + *
> + * May be overridden by the architecture; otherwise, implemented using the more
> + * precise ptep_get_lockless().
> + *
> + * Context: Any.

Same comment.

> + */
> +static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
> +{
> +	return ptep_get_lockless(ptep);
> +}
> +#endif

[...]

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 68283e54c899..41dc44eb8454 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -7517,7 +7517,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>   	}
> 
>   	if (pte) {
> -		pte_t pteval = ptep_get_lockless(pte);
> +		pte_t pteval = ptep_get_lockless_norecency(pte);
> 
>   		BUG_ON(pte_present(pteval) && !pte_huge(pteval));
>   	}
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2771fc043b3b..1a6c9ed8237a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1019,7 +1019,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>   			}
>   		}
> 
> -		vmf.orig_pte = ptep_get_lockless(pte);
> +		vmf.orig_pte = ptep_get_lockless_norecency(pte);
>   		if (!is_swap_pte(vmf.orig_pte))
>   			continue;


Hm, I think you mentioned that we want to be careful with vmf.orig_pte.

-- 
Cheers,

David / dhildenb