Message-ID: <fffd4dad-2cb9-4bc9-8a80-a70be687fd54@amd.com>
Date: Fri, 4 Jul 2025 13:45:13 +0530
From: Raghavendra K T <raghavendra.kt@....com>
To: Ankur Arora <ankur.a.arora@...cle.com>, linux-kernel@...r.kernel.org,
 linux-mm@...ck.org, x86@...nel.org
Cc: akpm@...ux-foundation.org, bp@...en8.de, dave.hansen@...ux.intel.com,
 hpa@...or.com, mingo@...hat.com, mjguzik@...il.com, luto@...nel.org,
 peterz@...radead.org, acme@...nel.org, namhyung@...nel.org,
 tglx@...utronix.de, willy@...radead.org, jon.grimm@....com, bharata@....com,
 boris.ostrovsky@...cle.com, konrad.wilk@...cle.com
Subject: Re: [PATCH v4 00/13] x86/mm: Add multi-page clearing


On 6/16/2025 10:52 AM, Ankur Arora wrote:
> This series adds multi-page clearing for hugepages, improving on the
> current page-at-a-time approach in two ways:
> 
>   - amortizes the per-page setup cost over a larger extent
>   - when using string instructions, exposes the real region size to the
>     processor. A processor could use that as a hint to optimize based
>     on the full extent size. AMD Zen uarchs, as an example, elide
>     allocation of cachelines for regions larger than L3-size.
> 
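[ As a rough illustration of the idea above -- just a sketch with a
made-up helper, not the actual patch: instead of clearing the extent
page-at-a-time, the whole mapped folio is handed to one contiguous
clear, so a string-instruction based memset sees the real region size. ]

#include <linux/mm.h>
#include <linux/string.h>

/* Illustrative only: clear a whole direct-mapped folio in one call. */
static void clear_extent_sketch(struct folio *folio)
{
        void *addr = folio_address(folio); /* assumes a lowmem/direct-map folio */
        unsigned long bytes = folio_size(folio);

        /* One contiguous clear: the CPU sees the full extent size. */
        memset(addr, 0, bytes);
}
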
> Demand faulting a 64GB region shows good performance improvements:
> 
>   $ perf bench mem map -p $page-size -f demand -s 64GB -l 5
> 
>                   mm/folio_zero_user    x86/folio_zero_user       change
>                    (GB/s  +- %stdev)     (GB/s  +- %stdev)
> 
>    pg-sz=2MB       11.82  +- 0.67%        16.48  +-  0.30%       + 39.4%
>    pg-sz=1GB       17.51  +- 1.19%        40.03  +-  7.26% [#]   +129.9%
> 
> [#] Only with preempt=full|lazy because cooperatively preempted models
> need regular invocations of cond_resched(). This limits the extent
> sizes that can be cleared as a unit.
> 
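[ Hypothetical sketch of what that limitation means in practice, not
code from the series: under cooperative preemption the clear has to be
chunked with cond_resched() between chunks, so the CPU never sees more
than one chunk at a time. CLEAR_CHUNK_BYTES is an assumed value, for
illustration only. ]

#include <linux/sched.h>
#include <linux/sizes.h>
#include <linux/string.h>
#include <linux/minmax.h>

#define CLEAR_CHUNK_BYTES       SZ_2M   /* assumed chunk size, illustration only */

/* Illustrative only: chunked clear with rescheduling points. */
static void clear_extent_chunked_sketch(void *addr, unsigned long bytes)
{
        while (bytes) {
                unsigned long chunk = min(bytes, (unsigned long)CLEAR_CHUNK_BYTES);

                memset(addr, 0, chunk);
                addr += chunk;
                bytes -= chunk;
                cond_resched(); /* needed on cooperatively preempted models */
        }
}
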
> Raghavendra also tested on AMD Genoa and that shows similar
> improvements [1].
> 
[...]
Sorry for coming back late on this:
It was nice to have this integrated into perf bench mem (easy to test :)).

I see a similar (almost identical) improvement again with the rebased
kernel and patchset.
Tested only with preempt=lazy and boost=1.

base    = 6.16-rc4 + patches 1-9 of this series
patched = 6.16-rc4 + all patches

SUT: Genoa+ AMD EPYC 9B24

  $ perf bench mem map -p $page-size -f populate -s 64GB -l 10
                    base               patched              change
   pg-sz=2MB       12.731939 GB/sec    26.304263 GB/sec     106.6%
   pg-sz=1GB       26.232423 GB/sec    61.174836 GB/sec     133.2%
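(Page size was passed via -p; the change column is patched/base - 1,
e.g. 26.304263 / 12.731939 - 1 ≈ +106.6%.)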

For 4KB page size there is a slight improvement (mostly noise).

Thanks and Regards
- Raghu

