Message-ID: <0d6ba41c-0c90-4130-896a-26eabbd5bd24@amd.com>
Date: Tue, 22 Apr 2025 11:53:06 +0530
From: Raghavendra K T <raghavendra.kt@....com>
To: Ankur Arora <ankur.a.arora@...cle.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, x86@...nel.org
Cc: torvalds@...ux-foundation.org, akpm@...ux-foundation.org, bp@...en8.de,
dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
luto@...nel.org, peterz@...radead.org, paulmck@...nel.org,
rostedt@...dmis.org, tglx@...utronix.de, willy@...radead.org,
jon.grimm@....com, bharata@....com, boris.ostrovsky@...cle.com,
konrad.wilk@...cle.com
Subject: Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
On 4/14/2025 9:16 AM, Ankur Arora wrote:
> This series adds multi-page clearing for hugepages. It is a rework
> of [1] which took a detour through PREEMPT_LAZY [2].
>
> Why multi-page clearing? Multi-page clearing improves on the
> current page-at-a-time approach by giving the processor a hint
> about the real region size. The processor can use this hint to,
> for instance, elide cacheline allocation when clearing a large
> region.
>
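For illustration, a minimal sketch of the difference (helper names
below are hypothetical, not the actual API added by this series):

	/* Page-at-a-time: the CPU only ever sees PAGE_SIZE extents. */
	for (i = 0; i < nr_pages; i++)
		clear_user_highpage(page + i, addr + i * PAGE_SIZE);

	/* Multi-page: one call exposes the full extent, so the CPU
	 * can pick a better strategy (e.g. skip cacheline allocation).
	 * clear_pages() is an illustrative name only. */
	clear_pages(page_address(page), nr_pages);
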
> This particular optimization is done by REP; STOS on AMD Zen,
> where regions larger than the L3 cache are cleared with
> non-temporal stores.
>
> This results in significantly better performance.
>
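On x86 this primitive boils down to a single string store over the
whole extent; a minimal sketch of such a helper, assuming a plain
REP STOSB (not the actual implementation in the patches):

	static inline void clear_region(void *dst, unsigned long len)
	{
		/* REP STOSB: dst in %rdi, count in %rcx, fill byte
		 * in %al. Given the full extent up front, Zen can
		 * internally switch to non-temporal stores for
		 * regions larger than the L3. */
		asm volatile("rep stosb"
			     : "+D" (dst), "+c" (len)
			     : "a" (0)
			     : "memory");
	}
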
> We also see a performance improvement in cases where this
> optimization is unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB
> on Intel): REP; STOS is typically microcoded, so its setup cost
> can now be amortized over larger regions, and the hint allows the
> hardware prefetcher to do a better job.
>
> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>
> mm/folio_zero_user x86/folio_zero_user change
> (GB/s +- stddev) (GB/s +- stddev)
>
> pg-sz=1GB 16.51 +- 0.54% 42.80 +- 3.48% + 159.2%
> pg-sz=2MB 11.89 +- 0.78% 16.12 +- 0.12% + 35.5%
>
> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>
> mm/folio_zero_user x86/folio_zero_user change
> (GB/s +- stddev) (GB/s +- stddev)
>
> pg-sz=1GB 8.01 +- 0.24% 11.26 +- 0.48% + 40.57%
> pg-sz=2MB 7.95 +- 0.30% 10.90 +- 0.26% + 37.10%
>
[...]
Hello Ankur,

Thank you for the patches. I was able to test them briefly with
lazy preempt mode. (I do understand that there could be a lot of
churn based on Ingo's, Mateusz's, and others' comments.)

Here are the results:
SUT: AMD EPYC 9B24 (Genoa), preempt=lazy
Metric: time taken in seconds (lower is better), total SIZE=64GB

             mm/folio_zero_user    x86/folio_zero_user    change
             (sec +- stddev)       (sec +- stddev)        (% time reduction)

pg-sz=1GB    2.47044  +- 0.38%     1.060877 +- 0.07%      57.06
pg-sz=2MB    5.098403 +- 0.01%     2.52015  +- 0.36%      50.57
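
The map_hugetlb_1G binary itself is not included here; a minimal
reconstruction of what such a test presumably does (mmap hugetlb
memory and fault it in, so the kernel's folio_zero_user() path does
the clearing; sizes and flags below are assumptions):

	#include <stddef.h>
	#include <stdio.h>
	#include <sys/mman.h>

	#ifndef MAP_HUGE_1GB
	#define MAP_HUGE_1GB	(30 << 26)	/* 30 == log2(1GB), MAP_HUGE_SHIFT == 26 */
	#endif
	#define SIZE		(64UL << 30)	/* 64GB total */

	int main(void)
	{
		char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
			       MAP_HUGE_1GB, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		/* Touch one byte per hugepage: each fault zeroes a
		 * freshly allocated 1GB page. */
		for (size_t off = 0; off < SIZE; off += 1UL << 30)
			p[off] = 1;
		munmap(p, SIZE);
		return 0;
	}
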
More details (1G example run):
base kernel = 6.14 (preempt = lazy)
mm/folio_zero_user
 Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):

          2,476.47 msec task-clock                #    1.002 CPUs utilized               ( +-  0.39% )
                 5      context-switches          #    2.025 /sec                        ( +- 29.70% )
                 2      cpu-migrations            #    0.810 /sec                        ( +- 21.15% )
               202      page-faults               #   81.806 /sec                        ( +-  0.18% )
     7,348,664,233      cycles                    #    2.976 GHz                         ( +-  0.38% )  (38.39%)
       878,805,326      stalled-cycles-frontend   #   11.99% frontend cycles idle        ( +-  0.74% )  (38.43%)
       339,023,729      instructions              #    0.05  insn per cycle
                                                  #    2.53  stalled cycles per insn     ( +-  0.08% )  (38.47%)
        88,579,915      branches                  #   35.873 M/sec                       ( +-  0.06% )  (38.51%)
        17,369,776      branch-misses             #   19.55% of all branches             ( +-  0.04% )  (38.55%)
     2,261,339,695      L1-dcache-loads           #  915.795 M/sec                       ( +-  0.06% )  (38.56%)
     1,073,880,164      L1-dcache-load-misses     #   47.48% of all L1-dcache accesses   ( +-  0.05% )  (38.56%)
       511,231,988      L1-icache-loads           #  207.038 M/sec                       ( +-  0.25% )  (38.52%)
           128,533      L1-icache-load-misses     #    0.02% of all L1-icache accesses   ( +-  0.40% )  (38.48%)
            38,134      dTLB-loads                #   15.443 K/sec                       ( +-  4.22% )  (38.44%)
            33,992      dTLB-load-misses          #  114.39% of all dTLB cache accesses  ( +-  9.42% )  (38.40%)
               156      iTLB-loads                #   63.177 /sec                        ( +- 13.34% )  (38.36%)
               156      iTLB-load-misses          #  102.50% of all iTLB cache accesses  ( +- 25.98% )  (38.36%)

           2.47044 +- 0.00949 seconds time elapsed  ( +-  0.38% )
x86/folio_zero_user
          1,056.72 msec task-clock                #    0.996 CPUs utilized               ( +-  0.07% )
                10      context-switches          #    9.436 /sec                        ( +-  3.59% )
                 3      cpu-migrations            #    2.831 /sec                        ( +- 11.33% )
               200      page-faults               #  188.718 /sec                        ( +-  0.15% )
     3,146,571,264      cycles                    #    2.969 GHz                         ( +-  0.07% )  (38.35%)
        17,226,261      stalled-cycles-frontend   #    0.55% frontend cycles idle        ( +-  4.12% )  (38.44%)
        14,130,553      instructions              #    0.00  insn per cycle
                                                  #    1.39  stalled cycles per insn     ( +-  1.59% )  (38.53%)
         3,578,614      branches                  #    3.377 M/sec                       ( +-  1.54% )  (38.62%)
           415,807      branch-misses             #   12.45% of all branches             ( +-  1.17% )  (38.62%)
        22,208,699      L1-dcache-loads           #   20.956 M/sec                       ( +-  5.27% )  (38.60%)
         7,312,684      L1-dcache-load-misses     #   27.79% of all L1-dcache accesses   ( +-  8.46% )  (38.51%)
         4,032,315      L1-icache-loads           #    3.805 M/sec                       ( +-  1.29% )  (38.48%)
            15,094      L1-icache-load-misses     #    0.38% of all L1-icache accesses   ( +-  1.14% )  (38.39%)
            14,365      dTLB-loads                #   13.555 K/sec                       ( +-  7.23% )  (38.38%)
             9,477      dTLB-load-misses          #   65.36% of all dTLB cache accesses  ( +- 12.05% )  (38.38%)
                18      iTLB-loads                #   16.985 /sec                        ( +- 34.84% )  (38.38%)
                67      iTLB-load-misses          #  158.39% of all iTLB cache accesses  ( +- 48.32% )  (38.32%)

          1.060877 +- 0.000766 seconds time elapsed  ( +-  0.07% )
Thanks and Regards
- Raghu