Message-ID: <ad57ec62-00a8-4f7e-aad1-f28cc14d6af4@arm.com>
Date: Wed, 16 Apr 2025 13:48:57 +0100
From: Ryan Roberts <ryan.roberts@....com>
To: Xavier <xavier_qy@....com>, dev.jain@....com, ioworker0@...il.com,
21cnbao@...il.com
Cc: akpm@...ux-foundation.org, catalin.marinas@....com, david@...hat.com,
gshan@...hat.com, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, will@...nel.org, willy@...radead.org,
ziy@...dia.com
Subject: Re: [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant
operations
On 15/04/2025 09:22, Xavier wrote:
> Patch V3 changes the while loop to a for loop, following Dev's suggestion.
> To improve efficiency, the local variable definitions have also been removed;
> the macro is only used within the current function, so this carries no
> additional risk. To verify the optimization in Patch V3, I wrote a test
> function that repeatedly calls mlock in a loop, forcing the kernel to call
> contpte_ptep_get extensively so the effect of the optimization can be measured.
> The function's execution time and instruction counts were traced using perf;
> below are the results on a certain Qualcomm mobile phone chip:
Xavier, for some reason your emails aren't hitting my inbox - I'm only seeing
the replies from others. I'll monitor lore, but apologies if I'm slow to
respond - that's the reason.
Please start the first line of the commit with "arm64/mm" instead of "mm/contpte".
Also I noticed that Andrew put this into mm-new last night. I'd prefer that this
go via the arm64 tree, if we decide we want it.
>
> Instruction Statistics - Before Optimization
> # count event_name # count / runtime
> 20,814,352 branch-load-misses # 662.244 K/sec
> 41,894,986,323 branch-loads # 1.333 G/sec
> 1,957,415 iTLB-load-misses # 62.278 K/sec
> 49,872,282,100 iTLB-loads # 1.587 G/sec
> 302,808,096 L1-icache-load-misses # 9.634 M/sec
> 49,872,282,100 L1-icache-loads # 1.587 G/sec
>
> Total test time: 31.485237 seconds.
>
> Instruction Statistics - After Optimization
> # count event_name # count / runtime
> 19,340,524 branch-load-misses # 688.753 K/sec
> 38,510,185,183 branch-loads # 1.371 G/sec
> 1,812,716 iTLB-load-misses # 64.554 K/sec
> 47,673,923,151 iTLB-loads # 1.698 G/sec
> 675,853,661 L1-icache-load-misses # 24.068 M/sec
> 47,673,923,151 L1-icache-loads # 1.698 G/sec
>
> Total test time: 28.108048 seconds.
>
> Function Statistics - Before Optimization
> Arch: arm64
> Event: cpu-cycles (type 0, config 0)
> Samples: 1419716
> Event count: 99618088900
>
> Overhead Symbol
> 21.42% lock_release
> 21.26% lock_acquire
> 20.88% arch_counter_get_cntvct
> 14.32% _raw_spin_unlock_irq
> 6.79% contpte_ptep_get
> 2.20% test_contpte_perf
> 1.82% follow_page_pte
> 0.97% lock_acquired
> 0.97% rcu_is_watching
> 0.89% mlock_pte_range
> 0.84% sched_clock_noinstr
> 0.70% handle_softirqs.llvm.8218488130471452153
> 0.58% test_preempt_disable_long
> 0.57% _raw_spin_unlock_irqrestore
> 0.54% arch_stack_walk
> 0.51% vm_normal_folio
> 0.48% check_preemption_disabled
> 0.47% stackinfo_get_task
> 0.36% try_grab_folio
> 0.34% preempt_count
> 0.32% trace_preempt_on
> 0.29% trace_preempt_off
> 0.24% debug_smp_processor_id
>
> Function Statistics - After Optimization
> Arch: arm64
> Event: cpu-cycles (type 0, config 0)
> Samples: 1431006
> Event count: 118856425042
>
> Overhead Symbol
> 22.59% lock_release
> 22.13% arch_counter_get_cntvct
> 22.08% lock_acquire
> 15.32% _raw_spin_unlock_irq
> 2.26% test_contpte_perf
> 1.50% follow_page_pte
> 1.49% arch_stack_walk
> 1.30% rcu_is_watching
> 1.09% lock_acquired
> 1.07% sched_clock_noinstr
> 0.88% handle_softirqs.llvm.12507768597002095717
> 0.88% trace_preempt_off
> 0.76% _raw_spin_unlock_irqrestore
> 0.61% check_preemption_disabled
> 0.52% trace_preempt_on
> 0.50% mlock_pte_range
> 0.43% try_grab_folio
> 0.41% folio_mark_accessed
> 0.40% vm_normal_folio
> 0.38% test_preempt_disable_long
> 0.28% contpte_ptep_get
> 0.27% __traceiter_android_rvh_preempt_disable
> 0.26% debug_smp_processor_id
> 0.24% return_address
> 0.20% __pte_offset_map_lock
> 0.19% unwind_next_frame_record
>
> Assuming there is no problem with my test program, the results show a
> significant improvement both in the overall instruction count and in the
> execution time of contpte_ptep_get.
>
> If any reviewers have time, you can also test it on your machines for comparison.
> I have enabled THP and hugepages-64kB.
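For reference, on recent kernels the 64kB setting mentioned above is controlled
through the per-size transparent_hugepage sysfs interface; something like the
following should enable it (the exact paths are an assumption based on the
post-6.8 mTHP interface, and writing them needs root):

```shell
# Enable 64kB multi-size THP (assumes the per-size sysfs interface
# introduced with mTHP; requires root)
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled

# Confirm the setting took effect
cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
```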
>
> Test Function:
> ---
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/mman.h>
>
> #define PAGE_SIZE 4096
> #define CONT_PTES 16
> #define TEST_SIZE (4096 * CONT_PTES * PAGE_SIZE)
>
> void rwdata(char *buf)
> {
> for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
> buf[i] = 'a';
> volatile char c = buf[i]; /* force a read of each page */
> }
> }
> void test_contpte_perf(void)
> {
> char *buf;
> int ret = posix_memalign((void **)&buf, PAGE_SIZE, TEST_SIZE);
> if (ret != 0) {
> perror("posix_memalign failed");
> exit(EXIT_FAILURE);
> }
>
> rwdata(buf);
>
> for (int j = 0; j < 500; j++) {
> mlock(buf, TEST_SIZE);
>
> rwdata(buf);
>
> munlock(buf, TEST_SIZE);
This is a microbenchmark of a pathological case and it's showing ~11%
improvement. But in principle I'm ok with it. I have some comments on the
actual change though, which I'll send through in a separate email.
Thanks,
Ryan
> }
>
> free(buf);
> }
> ---
>
> Xavier (1):
> mm/contpte: Optimize loop to reduce redundant operations
>
> arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
> 1 file changed, 18 insertions(+), 2 deletions(-)
>