Message-ID: <Z_-m8s5EUrL4DAME@arm.com>
Date: Wed, 16 Apr 2025 13:47:46 +0100
From: Catalin Marinas <catalin.marinas@....com>
To: Xavier <xavier_qy@....com>
Cc: ryan.roberts@....com, dev.jain@....com, ioworker0@...il.com,
21cnbao@...il.com, akpm@...ux-foundation.org, david@...hat.com,
gshan@...hat.com, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, will@...nel.org, willy@...radead.org,
ziy@...dia.com
Subject: Re: [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce
redundant operations
On Tue, Apr 15, 2025 at 04:22:04PM +0800, Xavier wrote:
> Patch V3 changes the while loop to a for loop, following Dev's
> suggestions.
For some reason, my email (office365) rejected all these patches (not
even quarantined); I only got the replies. Anyway, I can get them from
the lore archive.
> Meanwhile, to improve efficiency, the local variable definitions have
> been removed. The macro is only used within the current function, so
> this introduces no additional risk. To verify the performance of the
> optimization in Patch V3, a test function was written: it calls mlock()
> repeatedly in a loop so that the kernel calls contpte_ptep_get()
> extensively, exercising the optimized path.
> The function's execution time and instruction counts were traced with
> perf; the results below are from a Qualcomm mobile phone SoC:
>
> Instruction Statistics - Before Optimization
> # count event_name # count / runtime
> 20,814,352 branch-load-misses # 662.244 K/sec
> 41,894,986,323 branch-loads # 1.333 G/sec
> 1,957,415 iTLB-load-misses # 62.278 K/sec
> 49,872,282,100 iTLB-loads # 1.587 G/sec
> 302,808,096 L1-icache-load-misses # 9.634 M/sec
> 49,872,282,100 L1-icache-loads # 1.587 G/sec
>
> Total test time: 31.485237 seconds.
>
> Instruction Statistics - After Optimization
> # count event_name # count / runtime
> 19,340,524 branch-load-misses # 688.753 K/sec
> 38,510,185,183 branch-loads # 1.371 G/sec
> 1,812,716 iTLB-load-misses # 64.554 K/sec
> 47,673,923,151 iTLB-loads # 1.698 G/sec
> 675,853,661 L1-icache-load-misses # 24.068 M/sec
> 47,673,923,151 L1-icache-loads # 1.698 G/sec
>
> Total test time: 28.108048 seconds.
We'd need to reproduce these numbers on other platforms as well and with
different page sizes. I hope Ryan can do some tests next week.
Purely looking at the patch, I don't like the complexity. I'd rather go
with your v1 if the numbers are fairly similar (even if slightly slower).
However, I don't trust microbenchmarks like calling mlock() in a loop.
It was hand-crafted to dirty the whole buffer (making the ptes
young+dirty) before mlock() to get the most out of the rewritten
contpte_ptep_get().
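(For reference, I imagine the test is roughly along the lines of the
sketch below. This is only my reconstruction, not the posted test
function: the buffer size, iteration count and timing are placeholders,
not the values behind the quoted numbers, and mlock()'ing a large buffer
may need root or a raised RLIMIT_MEMLOCK.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define BUF_SIZE	(64UL << 20)	/* 64 MiB, made up for illustration */
#define ITERATIONS	10000		/* made up for illustration */

int main(void)
{
	struct timespec start, end;
	char *buf;
	long i;

	buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Dirty every page so the ptes are young+dirty before mlock(). */
	memset(buf, 0x5a, BUF_SIZE);

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < ITERATIONS; i++) {
		/* Each mlock()/munlock() pass walks the buffer's ptes. */
		if (mlock(buf, BUF_SIZE) || munlock(buf, BUF_SIZE)) {
			perror("mlock/munlock");
			return 1;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("Total test time: %f seconds.\n",
	       (end.tv_sec - start.tv_sec) +
	       (end.tv_nsec - start.tv_nsec) / 1e9);

	munmap(buf, BUF_SIZE);
	return 0;
}
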
Are there any real-world workloads that would benefit from such a
change? As it stands, I think this patch needs better justification.
Thanks.
--
Catalin