Message-ID: <87ikew6o3p.fsf@oracle.com>
Date: Wed, 26 Nov 2025 21:28:26 -0800
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Ankur Arora <ankur.a.arora@...cle.com>, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, x86@...nel.org, akpm@...ux-foundation.org,
        david@...nel.org, bp@...en8.de, dave.hansen@...ux.intel.com,
        hpa@...or.com, mingo@...hat.com, luto@...nel.org, peterz@...radead.org,
        tglx@...utronix.de, willy@...radead.org, raghavendra.kt@....com,
        boris.ostrovsky@...cle.com, konrad.wilk@...cle.com
Subject: Re: [PATCH v9 4/7] x86/mm: Simplify clear_page_*


Mateusz Guzik <mjguzik@...il.com> writes:

> On Fri, Nov 21, 2025 at 9:24 PM Ankur Arora <ankur.a.arora@...cle.com> wrote:
>> + * Switch between three implementations of page clearing based on CPU
>> + * capabilities:
>> + *
>> + *  - __clear_pages_unrolled(): the oldest, slowest and universally
>> + *    supported method. Zeroes via 8-byte MOV instructions unrolled 8x
>> + *    to write a 64-byte cacheline in each loop iteration.
>> + *
>> + *  - "REP; STOSQ": really old CPUs had crummy REP implementations.
>> + *    Vendor CPU setup code sets 'REP_GOOD' on CPUs where REP can be
>> + *    trusted. The instruction writes 8 bytes per REP iteration but
>> + *    CPUs can internally batch these together and do larger writes.
>> + *
>> + *  - "REP; STOSB": CPUs that enumerate 'ERMS' have an improved STOS
>> + *    implementation that is less picky about alignment and where
>> + *    STOSB (1-byte at a time) is actually faster than STOSQ (8-bytes
>> + *    at a time.)
>> + *
>
> I think this is somewhat odd commentary in this context.
>
> Note about "crummy REP implementations" should be in description of
> __clear_pages_unrolled as it justifies its existence (I think the
> routine would be best whacked btw, but I'm not going to argue about it
> in this thread).
> The description of STOSQ notes that the CPU can do more than 8 bytes at
> a time, while the description of STOSB makes no such clarification.
> At the same time, the note about being less picky about alignment has
> no significance in the context of page clearing, as the pages are,
> well, page aligned.

Good point. I'll rework the comment a little bit to align things better
(maybe reusing some of what you suggest below).
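
For reference, the "8-byte MOV unrolled 8x" shape described in the quoted
comment can be sketched in portable C. This is only an illustration: the
function name, the constants and the 4 KiB page size are assumptions made
for the example, the routine under discussion is not written this way, and
a compiler is free to turn the loop into something else entirely.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u	/* assumption: 4 KiB pages */
#define CACHELINE_BYTES 64u	/* assumption: 64-byte cachelines */

/*
 * Hypothetical userspace stand-in for the unrolled variant: zero
 * 'npages' pages starting at 'addr' using eight 8-byte stores per
 * loop iteration, i.e. one 64-byte cacheline per iteration.
 * 'addr' is expected to be suitably aligned for 8-byte stores.
 */
static void clear_pages_unrolled_sketch(void *addr, size_t npages)
{
	uint64_t *p = addr;
	size_t lines = npages * (PAGE_SIZE_BYTES / CACHELINE_BYTES);

	while (lines--) {
		p[0] = 0;
		p[1] = 0;
		p[2] = 0;
		p[3] = 0;
		p[4] = 0;
		p[5] = 0;
		p[6] = 0;
		p[7] = 0;
		p += 8;		/* advance by one cacheline */
	}
}
```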

> There is a fucky real-world problem with ERMS worth noting: there are
> hypervisor setups out there which *hide* the bit by default (no
> really, see Proxmox for example -- you get a bare bones pre-ERMS
> cpuid)
>
> With all this in mind, modulo poor grammar on my end, I would suggest
> something like this:
>
> <quote>
> There are 3 variants implemented:
> - REP; STOSB: used if the CPU supports "Enhanced REP MOVSB/STOSB" (aka
> ERMS), which is true for the majority of microarchitectures today
> - REP; STOSQ: fallback if the ERMS bit is not present
> - __clear_pages_unrolled: code for CPUs which are determined to have
> poor REP support; only concerns long-obsolete uarchs.
>
> Warning: some hypervisors are configured to expose a very limited set
> of capabilities in the guest, filtering out ERMS even if present. As
> such, the STOSQ variant is still in active use on some setups even when
> the hardware does not need it.
> </quote>

The last bit is useful context though maybe some of it fits better in
the commit message.
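
To make the selection order in the suggested text concrete, here is a
rough sketch of the decision as a plain C function. The enum, the function
name and the two boolean parameters are hypothetical stand-ins for the
'ERMS' and 'REP_GOOD' feature flags; the patch does not necessarily
express the choice as a run-time branch like this.

```c
#include <stdbool.h>

/* Hypothetical labels for the three variants discussed above. */
enum clear_variant {
	CLEAR_UNROLLED,		/* __clear_pages_unrolled() */
	CLEAR_REP_STOSQ,	/* "REP; STOSQ" */
	CLEAR_REP_STOSB,	/* "REP; STOSB" */
};

/*
 * Pick a variant from the feature flags the guest can see.
 * 'erms' and 'rep_good' are illustrative parameters, not kernel API.
 */
static enum clear_variant pick_clear_variant(bool erms, bool rep_good)
{
	if (erms)
		return CLEAR_REP_STOSB;	/* ERMS enumerated */
	if (rep_good)
		return CLEAR_REP_STOSQ;	/* fallback when ERMS is absent */
	return CLEAR_UNROLLED;		/* REP not trusted on this CPU */
}
```

With this ordering, a guest whose hypervisor hides ERMS on otherwise
capable hardware (erms == false, rep_good == true) ends up on the STOSQ
path, which is the case the warning describes.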

Thanks
ankur
