[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <34a4bf01-6324-482b-a011-32974d68e02f@redhat.com>
Date: Wed, 9 Apr 2025 14:09:48 +1000
From: Gavin Shan <gshan@...hat.com>
To: Xavier <xavier_qy@....com>, dev.jain@....com, akpm@...ux-foundation.org,
baohua@...nel.org, ryan.roberts@....com, catalin.marinas@....com,
ioworker0@...il.com
Cc: linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
will@...nel.org
Subject: Re: [PATCH v2 1/1] mm/contpte: Optimize loop to reduce redundant
operations
Hi Xavier,
On 4/8/25 6:58 PM, Xavier wrote:
> This commit optimizes the contpte_ptep_get function by adding early
> termination logic. It checks if the dirty and young bits of orig_pte
> are already set and skips redundant bit-setting operations during
> the loop. This reduces unnecessary iterations and improves performance.
>
> Signed-off-by: Xavier <xavier_qy@....com>
> ---
> arch/arm64/mm/contpte.c | 22 ++++++++++++++++++++--
> 1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index bcac4f55f9c1..034d153d7d19 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -152,6 +152,18 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> }
> EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>
I'm wandering how it can work. More details are given below.
> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
> + do { \
> + int _start = start; \
> + pte_t *_ptep = ptep; \
> + while (_start++ < CONT_PTES) { \
> + if (pte_##flag(__ptep_get(_ptep++))) { \
> + orig_pte = pte_mk##flag(orig_pte); \
> + break; \
> + } \
> + } \
> + } while (0)
> +
CONT_PTES is 16 with the assumption of 4KB base page size. CHECK_CONTPTE_FLAG()
collects the flag for ptep[start, CONT_PTES].
> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> {
> /*
> @@ -169,11 +181,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> for (i = 0; i < CONT_PTES; i++, ptep++) {
> pte = __ptep_get(ptep);
>
> - if (pte_dirty(pte))
> + if (pte_dirty(pte)) {
> orig_pte = pte_mkdirty(orig_pte);
> + CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
> + break;
> + }
>
> - if (pte_young(pte))
> + if (pte_young(pte)) {
> orig_pte = pte_mkyoung(orig_pte);
> + CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
> + break;
> + }
> }
>
> return orig_pte;
There are two issues as I can see: (1) The loop stops on any dirty or young flag. Another
flag can be missed when one flag is seen. For example, when ptep[0] has both dirty/young
flag, only the dirty flag is collected. (2) CHECK_CONTPTE_FLAG() iterates ptep[i, CONT_PTES],
conflicting to the outer loop, which iterates ptep[0, CONT_PTES].
Besides, I also doubt how much performance can be gained by bailing early on (dirty | young).
However, it's avoided to cross the L1D cache line boundary if possible. With 4KB base page
size, 128 bytes are needed for ptep[CONT_PTES], equal to two cache lines. If we can bail
early with luck, we don't have to step into another cache line. Note that extra checks needs
more CPU cycles.
Thanks,
Gavin
Powered by blists - more mailing lists