Message-ID: <72ff1deb.b3a7.1963f5cec0c.Coremail.xavier_qy@163.com>
Date: Thu, 17 Apr 2025 00:09:36 +0800 (CST)
From: Xavier <xavier_qy@....com>
To: "Ryan Roberts" <ryan.roberts@....com>
Cc: dev.jain@....com, ioworker0@...il.com, 21cnbao@...il.com,
akpm@...ux-foundation.org, catalin.marinas@....com, david@...hat.com,
gshan@...hat.com, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, will@...nel.org, willy@...radead.org,
ziy@...dia.com
Subject: Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce
redundant operations
At 2025-04-16 20:54:47, "Ryan Roberts" <ryan.roberts@....com> wrote:
>On 15/04/2025 09:22, Xavier wrote:
>> This commit optimizes the contpte_ptep_get function by adding early
>> termination logic. It checks if the dirty and young bits of orig_pte
>> are already set and skips redundant bit-setting operations during
>> the loop. This reduces unnecessary iterations and improves performance.
>>
>> Signed-off-by: Xavier <xavier_qy@....com>
>> ---
>> arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
>> 1 file changed, 18 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index bcac4f55f9c1..0acfee604947 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> }
>> EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>>
>> +/* Note: in order to improve efficiency, using this macro will modify the
>> + * passed-in parameters.*/
>> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
>> + for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
>> + if (pte_##flag(__ptep_get(ptep))) { \
>> + orig_pte = pte_mk##flag(orig_pte); \
>> + break; \
>> + } \
>> + }
>
>I'm really not a fan of this macro, it just obfuscates what is going on. I'd
>personally prefer to see the 2 extra loops open coded below.
>
>Or even better, could you provide results comparing this 3 loop version to the
>simpler approach I suggested previously? If the performance is similar (which I
>expect it will be, especially given Barry's point that your test always ensures
>the first PTE is both young and dirty) then I'd prefer to go with the simpler code.
I will benchmark these implementations against each other. If the differences
are not significant, I will switch to the simpler approach. Following Barry's
suggestion, I will also add some more general test cases to verify whether the
optimization actually helps. I'm a bit busy at the moment, so it may take me a
few days to complete this.
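
For reference, my understanding of the simpler single-loop variant (the exact
proposal isn't quoted here, so treat this as a rough sketch of the idea rather
than the original suggestion): keep the existing loop and just break out once
both bits have been collected, e.g.:

	for (i = 0; i < CONT_PTES; i++, ptep++) {
		pte = __ptep_get(ptep);

		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);

		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);

		/* Nothing left to accumulate, stop scanning the block. */
		if (pte_dirty(orig_pte) && pte_young(orig_pte))
			break;
	}

That keeps the common case (first PTE already dirty and young) to a single
iteration without needing the extra loops or the macro.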
>> +
>> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> {
>> /*
>> @@ -169,11 +179,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> for (i = 0; i < CONT_PTES; i++, ptep++) {
>> pte = __ptep_get(ptep);
>>
>> - if (pte_dirty(pte))
>> + if (pte_dirty(pte)) {
>> orig_pte = pte_mkdirty(orig_pte);
>> + CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
>> + break;
>> + }
>>
>> - if (pte_young(pte))
>> + if (pte_young(pte)) {
>> orig_pte = pte_mkyoung(orig_pte);
>> + CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
>> + break;
>> + }
>> }
>>
>> return orig_pte;
>
>If we decide this is all worth the trouble, then I think we can (and *should*,
>in order to be consistent) pull a similar stunt in contpte_ptep_get_lockless().
OK. That function looks a bit more complicated; I'll take a look and try to
work out how it can be optimized along the same lines.
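
For what it's worth, as I read the mainline contpte_ptep_get_lockless(), the
loop also has to verify pfn/prot consistency across the whole contig block
(and retry on mismatch), so the loop itself can't terminate early; only the
dirty/young tests can be skipped once both bits are known. A rough, untested
sketch of that idea (variable names taken from my reading of the current loop,
so treat them as approximate):

	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
		pte = __ptep_get(ptep);
		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));

		/* The consistency checks must still cover every entry. */
		if (!pte_valid(pte) || pte_pfn(pte) != pfn ||
		    pgprot_val(prot) != pgprot_val(orig_prot))
			goto retry;

		if (!pte_dirty(orig_pte) && pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);

		if (!pte_young(orig_pte) && pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}

So the win there, if any, would be limited to avoiding the redundant flag
checks rather than shortening the loop.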
--
Thanks,
Xavier