Message-ID: <5570c6b9-4abd-1526-cd17-ed45f7d51b20@amd.com>
Date: Fri, 8 Sep 2023 07:48:16 +0530
From: Raghavendra K T <raghavendra.kt@....com>
To: Mateusz Guzik <mjguzik@...il.com>,
Ankur Arora <ankur.a.arora@...cle.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org,
akpm@...ux-foundation.org, luto@...nel.org, bp@...en8.de,
dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org,
willy@...radead.org, mgorman@...e.de, peterz@...radead.org,
rostedt@...dmis.org, tglx@...utronix.de, jon.grimm@....com,
bharata@....com, boris.ostrovsky@...cle.com, konrad.wilk@...cle.com
Subject: Re: [PATCH v2 0/9] x86/clear_huge_page: multi-page clearing
On 9/3/2023 1:44 PM, Mateusz Guzik wrote:
> On Wed, Aug 30, 2023 at 11:49:49AM -0700, Ankur Arora wrote:
>> This series adds a multi-page clearing primitive, clear_pages(),
>> which enables more effective use of x86 string instructions by
>> advertising the real region-size to be cleared.
>>
>> Region-size can be used as a hint by uarchs to optimize the
>> clearing.
>>
>> Also add allow_resched() which marks a code-section as allowing
>> rescheduling in the irqentry_exit path. This allows clear_pages()
>> to get by without having to call cond_resched() periodically.
>> (preempt_model_full() already handles this via
>> irqentry_exit_cond_resched(), so we handle this similarly for
>> preempt_model_none() and preempt_model_voluntary().)
>>
>> Performance
>> ==
>>
>> With this demand fault performance gets a decent increase:
>>
>> *Milan*      mm/clear_huge_page   x86/clear_huge_page   change
>>                    (GB/s)               (GB/s)
>>
>> pg-sz=2MB          14.55                19.29            +32.5%
>> pg-sz=1GB          19.34                49.60           +156.4%
>>
>> Milan (and some other AMD Zen uarchs tested) take advantage of the
>> hint to elide cacheline allocation for pg-sz=1GB. The cut-off for
>> this optimization seems to be at around region-size > LLC-size so
>> the pg-sz=2MB load still allocates cachelines.
>>
>
> Have you benchmarked clzero? It is an AMD-specific instruction issuing
> non-temporal stores. It is definitely something to try out for 1G pages.
>
> One would think rep stosq has to be at least not worse since the CPU is
> explicitly told what to do and is free to optimize it however it sees
> fit, but the rep prefix has a long history of underperforming.
>
> I'm not saying it is going to be better, but that this should be tested,
> albeit one can easily argue this can be done at a later date.
>
> I would do it myself but my access to AMD CPUs is limited.
>
Hello Mateusz,
I plugged in CLZERO unconditionally (even for the coherent path, with
sfence) on top of this series, for my earlier experiments.
Test: use mmap(MAP_HUGETLB) to demand-fault a 64GB region (NUMA
node0), for both base-hugepage-size=2M and 1GB:
perf stat -r 10 -d -d numactl -m 0 -N 0 <test>
SUT: AMD Bergamo, 2 nodes / 2 sockets, 128 cores per socket.
From that, the time taken with CLZERO is:
for 2M: 1.092125 s
for 1G: 0.997661 s
So overall, the 64GB experiment results look like this:
Time taken for the 64GB region (lower = better):

page-size   base      patched    (gain%)   patched-clzero   (gain%)
2M          5.0779    2.50623    (50.64)   1.092125         (78)
1G          2.50623   1.012439   (59.60)   0.997661         (60)
In summary, I see a further improvement even for the 2M base size
(~2.5x). Overall, CLZERO-based clearing looks promising, but we may
need threshold tuning and hint passing as done in Ankur's earlier
series:
Link:
https://lore.kernel.org/lkml/20220606202109.1306034-1-ankur.a.arora@oracle.com/
on top of the current series.
I still need to experiment further with different chunk sizes as well
as base sizes (for both clzero and rep stos).
Thanks and Regards
- Raghu
Run Details:
Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):

          996.34 msec task-clock               #    0.999 CPUs utilized               ( +-  0.02% )
               2      context-switches         #    2.007 /sec                        ( +- 21.34% )
               0      cpu-migrations           #    0.000 /sec
             212      page-faults              #  212.735 /sec                        ( +-  0.20% )
   3,116,497,471      cycles                   #    3.127 GHz                         ( +-  0.02% )  (35.66%)
         100,343      stalled-cycles-frontend  #    0.00% frontend cycles idle        ( +- 16.85% )  (35.75%)
       1,369,118      stalled-cycles-backend   #    0.04% backend cycles idle         ( +-  3.45% )  (35.86%)
   4,325,987,025      instructions             #    1.39  insn per cycle
                                               #    0.00  stalled cycles per insn     ( +-  0.02% )  (35.87%)
   1,078,119,163      branches                 #    1.082 G/sec                       ( +-  0.01% )  (35.87%)
          87,907      branch-misses            #    0.01% of all branches             ( +-  5.22% )  (35.83%)
      12,337,100      L1-dcache-loads          #   12.380 M/sec                       ( +-  5.44% )  (35.74%)
         280,300      L1-dcache-load-misses    #    2.48% of all L1-dcache accesses   ( +-  5.74% )  (35.64%)
       1,464,549      L1-icache-loads          #    1.470 M/sec                       ( +-  1.61% )  (35.63%)
          30,659      L1-icache-load-misses    #    2.12% of all L1-icache accesses   ( +-  3.30% )  (35.62%)
          17,366      dTLB-loads               #   17.426 K/sec                       ( +-  5.52% )  (35.63%)
          11,774      dTLB-load-misses         #   81.79% of all dTLB cache accesses  ( +-  7.94% )  (35.63%)
               0      iTLB-loads               #    0.000 /sec                        (35.63%)
               2      iTLB-load-misses         #    0.00% of all iTLB cache accesses  ( +-342.39% )  (35.64%)

        0.997661 +- 0.000150 seconds time elapsed  ( +-  0.02% )
Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb' (10 runs):

        1,089.97 msec task-clock               #    0.998 CPUs utilized               ( +-  0.03% )
               3      context-switches         #    2.750 /sec                        ( +- 15.11% )
               0      cpu-migrations           #    0.000 /sec
          32,917      page-faults              #   30.172 K/sec                       ( +-  0.00% )
   3,408,713,422      cycles                   #    3.124 GHz                         ( +-  0.03% )  (35.60%)
         982,417      stalled-cycles-frontend  #    0.03% frontend cycles idle        ( +-  2.77% )  (35.60%)
       8,495,409      stalled-cycles-backend   #    0.25% backend cycles idle         ( +-  6.12% )  (35.59%)
   4,970,939,278      instructions             #    1.46  insn per cycle
                                               #    0.00  stalled cycles per insn     ( +-  0.04% )  (35.64%)
   1,196,644,653      branches                 #    1.097 G/sec                       ( +-  0.03% )  (35.73%)
         196,584      branch-misses            #    0.02% of all branches             ( +-  2.79% )  (35.78%)
     226,254,284      L1-dcache-loads          #  207.388 M/sec                       ( +-  0.23% )  (35.78%)
       1,161,607      L1-dcache-load-misses    #    0.52% of all L1-dcache accesses   ( +-  3.27% )  (35.78%)
      21,757,775      L1-icache-loads          #   19.943 M/sec                       ( +-  0.66% )  (35.77%)
         165,503      L1-icache-load-misses    #    0.78% of all L1-icache accesses   ( +-  3.11% )  (35.78%)
       1,118,573      dTLB-loads               #    1.025 M/sec                       ( +-  1.38% )  (35.78%)
         415,943      dTLB-load-misses         #   37.10% of all dTLB cache accesses  ( +-  1.12% )  (35.78%)
              36      iTLB-loads               #   32.998 /sec                        ( +- 18.47% )  (35.74%)
          49,785      iTLB-load-misses         # 270570.65% of all iTLB cache accesses ( +-  0.34% )  (35.65%)

        1.092125 +- 0.000350 seconds time elapsed  ( +-  0.03% )