Message-ID: <fffd4dad-2cb9-4bc9-8a80-a70be687fd54@amd.com>
Date: Fri, 4 Jul 2025 13:45:13 +0530
From: Raghavendra K T <raghavendra.kt@....com>
To: Ankur Arora <ankur.a.arora@...cle.com>, linux-kernel@...r.kernel.org,
 linux-mm@...ck.org, x86@...nel.org
Cc: akpm@...ux-foundation.org, bp@...en8.de, dave.hansen@...ux.intel.com,
 hpa@...or.com, mingo@...hat.com, mjguzik@...il.com, luto@...nel.org,
 peterz@...radead.org, acme@...nel.org, namhyung@...nel.org,
 tglx@...utronix.de, willy@...radead.org, jon.grimm@....com, bharata@....com,
 boris.ostrovsky@...cle.com, konrad.wilk@...cle.com
Subject: Re: [PATCH v4 00/13] x86/mm: Add multi-page clearing


On 6/16/2025 10:52 AM, Ankur Arora wrote:
> This series adds multi-page clearing for hugepages, improving on the
> current page-at-a-time approach in two ways:
> 
>   - amortizes the per-page setup cost over a larger extent
>   - when using string instructions, exposes the real region size to the
>     processor. A processor could use that as a hint to optimize based
>     on the full extent size. AMD Zen uarchs, as an example, elide
>     allocation of cachelines for regions larger than L3-size.
> 
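[ As a rough illustration of the idea above -- just a sketch with a
made-up helper, not the actual patch: instead of clearing the extent
page-at-a-time, the whole mapped folio is handed to one contiguous
clear, so a string-instruction based memset sees the real region size. ]

#include <linux/mm.h>
#include <linux/string.h>

/* Illustrative only: clear a whole direct-mapped folio in one call. */
static void clear_extent_sketch(struct folio *folio)
{
        void *addr = folio_address(folio); /* assumes a lowmem/direct-map folio */
        unsigned long bytes = folio_size(folio);

        /* One contiguous clear: the CPU sees the full extent size. */
        memset(addr, 0, bytes);
}
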
> Demand faulting a 64GB region shows good performance improvements:
> 
>   $ perf bench mem map -p $page-size -f demand -s 64GB -l 5
> 
>                   mm/folio_zero_user    x86/folio_zero_user       change
>                    (GB/s  +- %stdev)     (GB/s  +- %stdev)
> 
>    pg-sz=2MB       11.82  +- 0.67%        16.48  +-  0.30%       + 39.4%
>    pg-sz=1GB       17.51  +- 1.19%        40.03  +-  7.26% [#]   +129.9%
> 
> [#] Only with preempt=full|lazy because cooperatively preempted models
> need regular invocations of cond_resched(). This limits the extent
> sizes that can be cleared as a unit.
> 
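[ Hypothetical sketch of what that limitation means in practice, not
code from the series: under cooperative preemption the clear has to be
chunked with cond_resched() between chunks, so the CPU never sees more
than one chunk at a time. CLEAR_CHUNK_BYTES is an assumed value, for
illustration only. ]

#include <linux/sched.h>
#include <linux/sizes.h>
#include <linux/string.h>
#include <linux/minmax.h>

#define CLEAR_CHUNK_BYTES       SZ_2M   /* assumed chunk size, illustration only */

/* Illustrative only: chunked clear with rescheduling points. */
static void clear_extent_chunked_sketch(void *addr, unsigned long bytes)
{
        while (bytes) {
                unsigned long chunk = min(bytes, (unsigned long)CLEAR_CHUNK_BYTES);

                memset(addr, 0, chunk);
                addr += chunk;
                bytes -= chunk;
                cond_resched(); /* needed on cooperatively preempted models */
        }
}
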
> Raghavendra also tested on AMD Genoa and that shows similar
> improvements [1].
> 
[...]
Sorry for coming back late on this:
It was nice to have this integrated into perf bench mem (easy to test :)).

I see a similar (almost identical) improvement again with the rebased
kernel and patchset.
Tested only with preempt=lazy and boost=1.

base    = 6.16-rc4 + patches 1-9 of this series
patched = 6.16-rc4 + all patches

SUT: Genoa+ AMD EPYC 9B24

  $ perf bench mem map -p $page-size -f populate -s 64GB -l 10
                    base               patched              change
   pg-sz=2MB       12.731939 GB/sec    26.304263 GB/sec     106.6%
   pg-sz=1GB       26.232423 GB/sec    61.174836 GB/sec     133.2%
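(Page size was passed via -p; the change column is patched/base - 1,
e.g. 26.304263 / 12.731939 - 1 ≈ +106.6%.)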

For 4KB page size there is a slight improvement (mostly noise).

Thanks and Regards
- Raghu

