linux-kernel - Re: [PATCH v3 0/4] mm/folio_zero


Message-ID: <87tt6q3618.fsf@oracle.com>
Date: Mon, 14 Apr 2025 12:19:15 -0700
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Ingo Molnar <mingo@...nel.org>
Cc: Ankur Arora <ankur.a.arora@...cle.com>, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, x86@...nel.org, torvalds@...ux-foundation.org,
        akpm@...ux-foundation.org, bp@...en8.de, dave.hansen@...ux.intel.com,
        hpa@...or.com, mingo@...hat.com, luto@...nel.org, peterz@...radead.org,
        paulmck@...nel.org, rostedt@...dmis.org, tglx@...utronix.de,
        willy@...radead.org, jon.grimm@....com, bharata@....com,
        raghavendra.kt@....com, boris.ostrovsky@...cle.com,
        konrad.wilk@...cle.com
Subject: Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing


Ingo Molnar <mingo@...nel.org> writes:

> * Ankur Arora <ankur.a.arora@...cle.com> wrote:
>
>> We also see performance improvement for cases where this optimization is
>> unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on Intel) because
>> REP; STOS is typically microcoded which can now be amortized over
>> larger regions and the hint allows the hardware prefetcher to do a
>> better job.
>>
>> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>>
>>                  mm/folio_zero_user    x86/folio_zero_user     change
>>                   (GB/s  +- stddev)      (GB/s  +- stddev)
>>
>>   pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
>>   pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
>>
>> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>>
>>                  mm/folio_zero_user    x86/folio_zero_user     change
>>                   (GB/s +- stddev)      (GB/s +- stddev)
>>
>>   pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
>>   pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%
>
> How was this measured? Could you integrate this measurement as a new
> tools/perf/bench/ subcommand so that people can try it on different
> systems, etc.? There's already a 'perf bench mem' subcommand space
> where this feature could be added to.

This was a trivial standalone mmap() workload, similar to what qemu does
when creating a VM (or really any hugetlb mmap()).

x86-64-stosq (lib/memset_64.S::__memset) should have the same performance
characteristics, but it uses malloc() for allocation.

For this workload we want to control the allocation path as well. Let me
see if it makes sense to extend 'perf bench mem memset' to optionally
allocate via mmap(MAP_HUGETLB), or to add a new workload under
'perf bench mem' which does that.
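
For anyone who wants to try something similar before a perf bench
variant exists, below is a minimal sketch of that kind of workload. It
is not the program used for the numbers above: the helper name
fault_in_gbps() and its parameters are made up for illustration. It
maps anonymous memory, optionally with MAP_HUGETLB, and times how long
it takes to touch one byte per stride, so the measured cost is
dominated by the kernel faulting in and zeroing the pages.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * Map `len` bytes of anonymous memory and write one byte every `stride`
 * bytes, so that each page is faulted in (and zeroed by the kernel).
 * Returns the observed rate in GB/s, or -1.0 if mmap() fails.
 *
 * With use_hugetlb set, `len` should be a multiple of the huge page
 * size and hugetlb pages must have been reserved beforehand.
 */
static double fault_in_gbps(size_t len, size_t stride, int use_hugetlb)
{
	int flags = MAP_PRIVATE | MAP_ANONYMOUS;
	struct timespec t0, t1;
	volatile char *p;
	void *map;
	double secs;

	assert(len > 0 && stride > 0);

	if (use_hugetlb)
		flags |= MAP_HUGETLB;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
	if (map == MAP_FAILED)
		return -1.0;
	p = map;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (size_t off = 0; off < len; off += stride)
		p[off] = 1;	/* first write to a page triggers the fault */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	munmap(map, len);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	return secs > 0 ? (double)len / secs / 1e9 : 0.0;
}
```

Calling fault_in_gbps(1UL << 30, 2UL << 20, 1) would exercise the 2MB
hugetlb path, assuming 2MB is the default huge page size and enough
pages are reserved; passing 0 as the last argument falls back to
regular anonymous pages, which is handy for checking that the harness
itself works.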

Thanks for the review!

--
ankur