Message-ID: <CAGudoHFgDEEBgQK5PrEUAJsb=iFpsT5OJ8+7W8PV0CGNePR4JQ@mail.gmail.com>
Date: Wed, 26 Nov 2025 11:01:43 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Ankur Arora <ankur.a.arora@...cle.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org, 
	akpm@...ux-foundation.org, david@...nel.org, bp@...en8.de, 
	dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com, luto@...nel.org, 
	peterz@...radead.org, tglx@...utronix.de, willy@...radead.org, 
	raghavendra.kt@....com, boris.ostrovsky@...cle.com, konrad.wilk@...cle.com
Subject: Re: [PATCH v9 4/7] x86/mm: Simplify clear_page_*

On Fri, Nov 21, 2025 at 9:24 PM Ankur Arora <ankur.a.arora@...cle.com> wrote:
> + * Switch between three implementations of page clearing based on CPU
> + * capabilities:
> + *
> + *  - __clear_pages_unrolled(): the oldest, slowest and universally
> + *    supported method. Zeroes via 8-byte MOV instructions unrolled 8x
> + *    to write a 64-byte cacheline in each loop iteration.
> + *
> + *  - "REP; STOSQ": really old CPUs had crummy REP implementations.
> + *    Vendor CPU setup code sets 'REP_GOOD' on CPUs where REP can be
> + *    trusted. The instruction writes 8-byte per REP iteration but
> + *    CPUs can internally batch these together and do larger writes.
> + *
> + *  - "REP; STOSB": CPUs that enumerate 'ERMS' have an improved STOS
> + *    implementation that is less picky about alignment and where
> + *    STOSB (1-byte at a time) is actually faster than STOSQ (8-bytes
> + *    at a time.)
> + *

I think this is somewhat odd commentary in this context.

The note about "crummy REP implementations" belongs in the description of
__clear_pages_unrolled, as it justifies its existence (I think the
routine would be best whacked btw, but I'm not going to argue about it
in this thread).
The description of STOSQ notes the CPU can do more than 8 bytes at a time,
while the description of STOSB makes no such clarification.
At the same time, the note about being less picky about alignment has no
significance in the context of page clearing, as pages are, well, page
aligned.

There is a fucky real-world problem with ERMS worth noting: there are
hypervisor setups out there which *hide* the bit by default (no,
really, see Proxmox for example -- you get a bare-bones pre-ERMS
cpuid).
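
For reference, a guest only sees whatever CPUID reports to it. Below is a
minimal sketch of the check, assuming the documented location of the ERMS
flag (CPUID leaf 7, subleaf 0, EBX bit 9). The helper and macro names are
made up for illustration, and the raw EBX value is passed in so the snippet
does not depend on any particular CPUID wrapper:

```c
#include <stdbool.h>
#include <stdint.h>

/* ERMS is enumerated in CPUID.(EAX=7,ECX=0):EBX bit 9. */
#define CPUID_7_0_EBX_ERMS (UINT32_C(1) << 9)

/*
 * Hypothetical helper (not a kernel function): takes the EBX value
 * returned by CPUID leaf 7, subleaf 0 and reports whether the ERMS
 * bit is set. In a guest this reflects what the hypervisor chose to
 * expose, not necessarily what the hardware supports.
 */
static inline bool cpuid_ebx_has_erms(uint32_t ebx)
{
	return (ebx & CPUID_7_0_EBX_ERMS) != 0;
}
```

If the hypervisor masks the bit, this returns false in the guest no matter
what the host CPU is capable of.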

With all this in mind, modulo poor grammar on my end, I would suggest
something like this:

<quote>
There are 3 variants implemented:
- REP; STOSB: used if the CPU supports "Enhanced REP MOVSB/STOSB" (aka
ERMS), which is true for the majority of microarchitectures today
- REP; STOSQ: fallback if the ERMS bit is not present
- __clear_pages_unrolled: code for CPUs which are determined to have
poor REP support; only concerns long-obsolete uarchs.

Warning: some hypervisors are configured to expose a very limited set
of capabilities in the guest, filtering out ERMS even if present. As
such, the STOSQ variant is still in active use on some setups even when
the hardware does not need it.
</quote>
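
To make the selection order spelled out, here is a rough sketch in plain C.
This is not how the patch does it (the kernel patches the call site based on
CPU feature flags); the enum, struct and function names below are all
hypothetical and only model the priority described above:

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only; not the kernel's identifiers. */
enum clear_variant {
	CLEAR_UNROLLED,   /* __clear_pages_unrolled: unrolled 8-byte MOVs */
	CLEAR_REP_STOSQ,  /* REP; STOSQ */
	CLEAR_REP_STOSB,  /* REP; STOSB */
};

struct cpu_caps {
	bool rep_good;  /* REP can be trusted (REP_GOOD) */
	bool erms;      /* ERMS bit as visible to this (possibly virtual) CPU */
};

/* Pick a variant: ERMS first, then REP_GOOD, else the unrolled loop. */
static inline enum clear_variant pick_clear_variant(struct cpu_caps caps)
{
	if (caps.erms)
		return CLEAR_REP_STOSB;
	if (caps.rep_good)
		return CLEAR_REP_STOSQ;
	return CLEAR_UNROLLED;
}
```

A guest on ERMS-capable hardware whose hypervisor filters the bit would show
up here as rep_good = true, erms = false, and so land on the STOSQ variant.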
