Message-ID: <5d50fc6f.b19a.1963f250940.Coremail.xavier_qy@163.com>
Date: Wed, 16 Apr 2025 23:08:33 +0800 (CST)
From: Xavier <xavier_qy@....com>
To: "Catalin Marinas" <catalin.marinas@....com>
Cc: ryan.roberts@....com, dev.jain@....com, ioworker0@...il.com,
21cnbao@...il.com, akpm@...ux-foundation.org, david@...hat.com,
gshan@...hat.com, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, will@...nel.org, willy@...radead.org,
ziy@...dia.com
Subject: Re:Re: [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce
redundant operations
Hi Catalin,
At 2025-04-16 20:47:46, "Catalin Marinas" <catalin.marinas@....com> wrote:
>On Tue, Apr 15, 2025 at 04:22:04PM +0800, Xavier wrote:
>> Patch V3 changes the while loop to a for loop, following Dev's
>> suggestions.
>
>For some reason, my email (office365) rejected all these patches (not
>even quarantined); I only got the replies. Anyway, I can get them from
>the lore archive.
>
>> Meanwhile, to improve efficiency, the local variable definitions have
>> been removed. The macro is only used within the current function, so
>> this adds no additional risk. To verify the performance of Patch V3, a
>> test function was written that calls mlock() repeatedly in a loop,
>> forcing the kernel to call contpte_ptep_get() extensively.
>> The execution time and instruction statistics were traced with perf;
>> below are the results on a Qualcomm mobile phone SoC:
>>
>> Instruction Statistics - Before Optimization
>>          count  event_name                # rate (count / runtime)
>> 20,814,352 branch-load-misses # 662.244 K/sec
>> 41,894,986,323 branch-loads # 1.333 G/sec
>> 1,957,415 iTLB-load-misses # 62.278 K/sec
>> 49,872,282,100 iTLB-loads # 1.587 G/sec
>> 302,808,096 L1-icache-load-misses # 9.634 M/sec
>> 49,872,282,100 L1-icache-loads # 1.587 G/sec
>>
>> Total test time: 31.485237 seconds.
>>
>> Instruction Statistics - After Optimization
>>          count  event_name                # rate (count / runtime)
>> 19,340,524 branch-load-misses # 688.753 K/sec
>> 38,510,185,183 branch-loads # 1.371 G/sec
>> 1,812,716 iTLB-load-misses # 64.554 K/sec
>> 47,673,923,151 iTLB-loads # 1.698 G/sec
>> 675,853,661 L1-icache-load-misses # 24.068 M/sec
>> 47,673,923,151 L1-icache-loads # 1.698 G/sec
>>
>> Total test time: 28.108048 seconds.
>
>We'd need to reproduce these numbers on other platforms as well and with
>different page sizes. I hope Ryan can do some tests next week.
Of course, it would be even better if reviewers could also verify this
with some real-world scenarios.
>
>Purely looking at the patch, I don't like the complexity. I'd rather go
>with your v1 if the numbers are fairly similar (even if slightly slower).
The implementation of this macro is indeed a bit complicated. If
simplifying it does not hurt the performance, I will try to make it
simpler.
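For illustration only (this is a sketch of the idea, not the actual v1
patch or a tested change), a simpler variant could keep the plain loop
and just stop scanning once both flags have been accumulated:

pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
{
	pte_t pte;
	int i;

	ptep = contpte_align_down(ptep);

	for (i = 0; i < CONT_PTES; i++, ptep++) {
		pte = __ptep_get(ptep);

		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);

		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);

		/* Once both bits have been collected, the remaining
		 * entries cannot change the result, so exit early. */
		if (pte_dirty(orig_pte) && pte_young(orig_pte))
			break;
	}

	return orig_pte;
}

Whether something this simple keeps enough of the measured gain is
exactly what the perf numbers need to show.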
>
>However, I don't trust microbenchmarks like calling mlock() in a loop.
>It was hand-crafted to dirty the whole buffer (making ptes young+dirty)
>before mlock() to make the best out of the rewritten contpte_ptep_get().
>Are there any real-world workloads that would benefit from such a change?
>
>As it stands, I think this patch needs better justification.
>
Indeed, I used mlock() because it is a simple way to trigger a large
number of calls to contpte_ptep_get() from user space. Of course, this
function is also called in many other paths, such as madvise() and
mprotect(), but essentially there is no difference: any of them would
exercise contpte_ptep_get(), which is what the test is meant to measure.
It's true that the previous test case only exercised the best-case
scenario by dirtying the data first.
Next, I will also test the impact of the new patch on performance in
scenarios where the PTEs have neither the young nor the dirty flag set,
or have only the young flag set. A rough sketch of the microbenchmark
described above is included below for reference.
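This is only a simplified illustration; the actual test program differs
in details such as the buffer size and iteration count, and it assumes
the mapping is backed by large folios so that the PTEs form contpte
blocks:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define BUF_SIZE	(64UL * 1024 * 1024)	/* arbitrary */
#define ITERATIONS	1000			/* arbitrary */

int main(void)
{
	char *buf;
	int i;

	buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Dirty the whole buffer so the ptes become young + dirty
	 * (the best case for the rewritten contpte_ptep_get()). */
	memset(buf, 0x5a, BUF_SIZE);

	/* Each mlock()/munlock() pair walks every pte in the range,
	 * so the kernel calls contpte_ptep_get() repeatedly. Note
	 * RLIMIT_MEMLOCK must allow BUF_SIZE, or run as root. */
	for (i = 0; i < ITERATIONS; i++) {
		if (mlock(buf, BUF_SIZE)) {
			perror("mlock");
			return 1;
		}
		munlock(buf, BUF_SIZE);
	}

	munmap(buf, BUF_SIZE);
	return 0;
}

The event counts above come from perf; a perf stat invocation along
these lines would collect them (./contpte_test is just a placeholder
name for the compiled test):

perf stat -e branch-loads,branch-load-misses,iTLB-loads,iTLB-load-misses,L1-icache-loads,L1-icache-load-misses ./contpte_test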
--
Thanks,
Xavier