Message-ID: <87jz7cq9wh.fsf@oracle.com>
Date: Tue, 22 Apr 2025 12:22:06 -0700
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Raghavendra K T <raghavendra.kt@....com>
Cc: Ankur Arora <ankur.a.arora@...cle.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, x86@...nel.org, torvalds@...ux-foundation.org,
akpm@...ux-foundation.org, bp@...en8.de, dave.hansen@...ux.intel.com,
hpa@...or.com, mingo@...hat.com, luto@...nel.org, peterz@...radead.org,
paulmck@...nel.org, rostedt@...dmis.org, tglx@...utronix.de,
willy@...radead.org, jon.grimm@....com, bharata@....com,
boris.ostrovsky@...cle.com, konrad.wilk@...cle.com
Subject: Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing

Raghavendra K T <raghavendra.kt@....com> writes:

> On 4/14/2025 9:16 AM, Ankur Arora wrote:
>> This series adds multi-page clearing for hugepages. It is a rework
>> of [1], which took a detour through PREEMPT_LAZY [2].
>>
>> Why multi-page clearing? It improves upon the current
>> page-at-a-time approach by providing the processor with a hint as
>> to the real region size. A processor could use this hint to, for
>> instance, elide cacheline allocation when clearing a large region.
>>
>> This particular optimization is performed by REP; STOS on AMD Zen,
>> where regions larger than the L3 size are cleared with non-temporal
>> stores. This results in significantly better performance.
>>
>> We also see a performance improvement in cases where this
>> optimization is unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on
>> Intel): REP; STOS is typically microcoded, so its setup cost can
>> now be amortized over larger regions, and the hint allows the
>> hardware prefetcher to do a better job.
>> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>>
>>              mm/folio_zero_user    x86/folio_zero_user     change
>>              (GB/s +- stddev)      (GB/s +- stddev)
>>  pg-sz=1GB   16.51 +- 0.54%        42.80 +- 3.48%         + 159.2%
>>  pg-sz=2MB   11.89 +- 0.78%        16.12 +- 0.12%         +  35.5%
>>
>> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>>
>>              mm/folio_zero_user    x86/folio_zero_user     change
>>              (GB/s +- stddev)      (GB/s +- stddev)
>>  pg-sz=1GB    8.01 +- 0.24%        11.26 +- 0.48%         + 40.57%
>>  pg-sz=2MB    7.95 +- 0.30%        10.90 +- 0.26%         + 37.10%
>>
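
To make the mechanism concrete, here is a minimal userspace sketch of
the idea above (illustrative only, not the actual patch: clear_pages()
and the fixed PAGE_SIZE are stand-ins, it is x86-64 only, and the real
kernel path additionally has to care about preemption etc.):

  #define PAGE_SIZE 4096UL

  /*
   * One REP; STOSB over the whole extent: the byte count in RCX tells
   * the CPU the real region size, so the microcode setup cost is paid
   * once and an implementation (e.g. AMD Zen) can switch to
   * non-temporal stores for regions larger than the L3.
   */
  static inline void clear_pages(void *addr, unsigned long npages)
  {
          unsigned long len = npages * PAGE_SIZE;

          asm volatile("rep stosb"
                       : "+D" (addr), "+c" (len)
                       : "a" (0)             /* AL = fill byte 0 */
                       : "memory");
  }

  /*
   * Page-at-a-time baseline: each 4KiB clear looks independent, so
   * the CPU never sees the full extent.
   */
  static void clear_pages_baseline(void *addr, unsigned long npages)
  {
          unsigned long i;

          for (i = 0; i < npages; i++)
                  clear_pages((char *)addr + i * PAGE_SIZE, 1);
  }

The only difference between the two is the length the CPU gets to
see, which is exactly the hint this series is about.
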
> [...]
>
> Hello Ankur,
>
> Thank you for the patches. I was able to test them briefly with lazy
> preempt mode.

Thanks for testing.

> (I do understand that there could be a lot of churn based on Ingo's,
> Mateusz's, and others' comments.) But here goes:
>
> SUT: AMD EPYC 9B24 (Genoa) preempt=lazy
>
> metric = time taken in seconds (lower is better); total SIZE=64GB
>
>             mm/folio_zero_user   x86/folio_zero_user    change
> pg-sz=1GB   2.47044  +- 0.38%    1.060877 +- 0.07%     -57.06%
> pg-sz=2MB   5.098403 +- 0.01%    2.52015  +- 0.36%     -50.57%

Just to translate it into the same units I was using above:

            mm/folio_zero_user     x86/folio_zero_user
pg-sz=1GB   25.91 GB/s +- 0.38%    60.37 GB/s +- 0.07%
pg-sz=2MB   12.57 GB/s +- 0.01%    25.39 GB/s +- 0.36%
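
(For reference, the conversion is GB/s = SIZE / elapsed time; e.g.,
the mm/folio_zero_user pg-sz=1GB case is 64 GB / 2.47044 s ≈ 25.91
GB/s.)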

That's a decent improvement over Milan. Btw, are you using boost=1?

Also, any idea why the huge delta between the mm/folio_zero_user
pg-sz=2MB and pg-sz=1GB cases? Both of these clear a single 4K page
at a time, so the huge delta is a little head-scratching. There's a
gap on Milan as well, but it is much smaller.
Thanks
Ankur
> More details (1G example run):
>
> base kernel = 6.14 (preempt = lazy)
>
> mm/folio_zero_user
>
> Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):
>
>        2,476.47 msec task-clock               # 1.002 CPUs utilized              ( +- 0.39% )
>               5      context-switches         # 2.025 /sec                       ( +- 29.70% )
>               2      cpu-migrations           # 0.810 /sec                       ( +- 21.15% )
>             202      page-faults              # 81.806 /sec                      ( +- 0.18% )
>   7,348,664,233      cycles                   # 2.976 GHz                        ( +- 0.38% )  (38.39%)
>     878,805,326      stalled-cycles-frontend  # 11.99% frontend cycles idle      ( +- 0.74% )  (38.43%)
>     339,023,729      instructions             # 0.05 insn per cycle
>                                               # 2.53 stalled cycles per insn     ( +- 0.08% )  (38.47%)
>      88,579,915      branches                 # 35.873 M/sec                     ( +- 0.06% )  (38.51%)
>      17,369,776      branch-misses            # 19.55% of all branches           ( +- 0.04% )  (38.55%)
>   2,261,339,695      L1-dcache-loads          # 915.795 M/sec                    ( +- 0.06% )  (38.56%)
>   1,073,880,164      L1-dcache-load-misses    # 47.48% of all L1-dcache accesses ( +- 0.05% )  (38.56%)
>     511,231,988      L1-icache-loads          # 207.038 M/sec                    ( +- 0.25% )  (38.52%)
>         128,533      L1-icache-load-misses    # 0.02% of all L1-icache accesses  ( +- 0.40% )  (38.48%)
>          38,134      dTLB-loads               # 15.443 K/sec                     ( +- 4.22% )  (38.44%)
>          33,992      dTLB-load-misses         # 114.39% of all dTLB cache accesses ( +- 9.42% )  (38.40%)
>             156      iTLB-loads               # 63.177 /sec                      ( +- 13.34% )  (38.36%)
>             156      iTLB-load-misses         # 102.50% of all iTLB cache accesses ( +- 25.98% )  (38.36%)
>
>        2.47044 +- 0.00949 seconds time elapsed  ( +- 0.38% )
>
> x86/folio_zero_user
>
>        1,056.72 msec task-clock               # 0.996 CPUs utilized              ( +- 0.07% )
>              10      context-switches         # 9.436 /sec                       ( +- 3.59% )
>               3      cpu-migrations           # 2.831 /sec                       ( +- 11.33% )
>             200      page-faults              # 188.718 /sec                     ( +- 0.15% )
>   3,146,571,264      cycles                   # 2.969 GHz                        ( +- 0.07% )  (38.35%)
>      17,226,261      stalled-cycles-frontend  # 0.55% frontend cycles idle       ( +- 4.12% )  (38.44%)
>      14,130,553      instructions             # 0.00 insn per cycle
>                                               # 1.39 stalled cycles per insn     ( +- 1.59% )  (38.53%)
>       3,578,614      branches                 # 3.377 M/sec                      ( +- 1.54% )  (38.62%)
>         415,807      branch-misses            # 12.45% of all branches           ( +- 1.17% )  (38.62%)
>      22,208,699      L1-dcache-loads          # 20.956 M/sec                     ( +- 5.27% )  (38.60%)
>       7,312,684      L1-dcache-load-misses    # 27.79% of all L1-dcache accesses ( +- 8.46% )  (38.51%)
>       4,032,315      L1-icache-loads          # 3.805 M/sec                      ( +- 1.29% )  (38.48%)
>          15,094      L1-icache-load-misses    # 0.38% of all L1-icache accesses  ( +- 1.14% )  (38.39%)
>          14,365      dTLB-loads               # 13.555 K/sec                     ( +- 7.23% )  (38.38%)
>           9,477      dTLB-load-misses         # 65.36% of all dTLB cache accesses ( +- 12.05% )  (38.38%)
>              18      iTLB-loads               # 16.985 /sec                      ( +- 34.84% )  (38.38%)
>              67      iTLB-load-misses         # 158.39% of all iTLB cache accesses ( +- 48.32% )  (38.32%)
>
>        1.060877 +- 0.000766 seconds time elapsed  ( +- 0.07% )
>
> Thanks and Regards
> - Raghu
--
ankur