Message-ID: <87ikew6o3p.fsf@oracle.com>
Date: Wed, 26 Nov 2025 21:28:26 -0800
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Ankur Arora <ankur.a.arora@...cle.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, x86@...nel.org, akpm@...ux-foundation.org,
david@...nel.org, bp@...en8.de, dave.hansen@...ux.intel.com,
hpa@...or.com, mingo@...hat.com, luto@...nel.org, peterz@...radead.org,
tglx@...utronix.de, willy@...radead.org, raghavendra.kt@....com,
boris.ostrovsky@...cle.com, konrad.wilk@...cle.com
Subject: Re: [PATCH v9 4/7] x86/mm: Simplify clear_page_*
Mateusz Guzik <mjguzik@...il.com> writes:
> On Fri, Nov 21, 2025 at 9:24 PM Ankur Arora <ankur.a.arora@...cle.com> wrote:
>> + * Switch between three implementations of page clearing based on CPU
>> + * capabilities:
>> + *
>> + * - __clear_pages_unrolled(): the oldest, slowest and universally
>> + * supported method. Zeroes via 8-byte MOV instructions unrolled 8x
>> + * to write a 64-byte cacheline in each loop iteration.
>> + *
>> + * - "REP; STOSQ": really old CPUs had crummy REP implementations.
>> + * Vendor CPU setup code sets 'REP_GOOD' on CPUs where REP can be
>> + * trusted. The instruction writes 8 bytes per REP iteration but
>> + * CPUs can internally batch these together and do larger writes.
>> + *
>> + * - "REP; STOSB": CPUs that enumerate 'ERMS' have an improved STOS
>> + * implementation that is less picky about alignment and where
>> + * STOSB (1 byte at a time) is actually faster than STOSQ (8 bytes
>> + * at a time).
>> + *
>
> I think this is somewhat odd commentary in this context.
>
> Note about "crummy REP implementations" should be in description of
> __clear_pages_unrolled as it justifies its existence (I think the
> routine would be best whacked btw, but I'm not going to argue about it
> in this thread).
> The description of STOSQ notes the CPU can do more than 8 bytes at a time,
> while the description of STOSB makes no such clarification.
> At the same time, the note about being less picky about alignment has no
> significance in the context of page clearing since pages are, well, page
> aligned.
Good point. I'll rework the comment a little bit to align things better
(maybe reusing some of what you suggest below).
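For readers following along, the "unrolled 8x" description in the quoted comment can be illustrated with a rough userspace sketch in plain C. This is not the kernel's assembly routine; the function name, the 4 KiB page size and the 64-byte cacheline size are assumptions made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumption: 4 KiB page */
#define SKETCH_CACHELINE 64u   /* assumption: 64-byte cacheline */

/*
 * Hypothetical stand-in for the unrolled variant: eight 8-byte stores
 * per loop iteration, so each iteration zeroes one 64-byte cacheline.
 * The caller must pass a buffer of SKETCH_PAGE_SIZE bytes that is
 * suitably aligned for uint64_t.
 */
static void sketch_clear_page_unrolled(void *page)
{
	uint64_t *p = page;
	size_t lines = SKETCH_PAGE_SIZE / SKETCH_CACHELINE;

	while (lines--) {
		p[0] = 0;
		p[1] = 0;
		p[2] = 0;
		p[3] = 0;
		p[4] = 0;
		p[5] = 0;
		p[6] = 0;
		p[7] = 0;
		p += 8; /* advance by one cacheline (8 * 8 bytes) */
	}
}
```

A compiler is free to turn this loop into something else entirely, so the sketch only shows the access pattern, not the instructions the kernel actually emits.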
> There is a fucky real-world problem with ERMS worth noting: there are
> hypervisor setups out there which *hide* the bit by default (no
> really, see Proxmox for example -- you get a bare bones pre-ERMS
> cpuid)
>
> With all this in mind, modulo poor grammar on my end, I would suggest
> something like this:
>
> <quote>
> There are 3 variants implemented:
> - REP; STOSB: used if the CPU supports "Enhanced REP MOVSB/STOSB" (aka
> ERMS), which is true for the majority of microarchitectures today
> - REP; STOSQ: fallback if the ERMS bit is not present
> - __clear_pages_unrolled: code for CPUs which are determined to have
> poor REP support; only concerns long-obsolete uarchs.
>
> Warning: some hypervisors are configured to expose a very limited set
> of capabilities in the guest, filtering out ERMS even if present. As
> such, the STOSQ variant is still in active use on some setups even when
> the hardware does not need it.
> </quote>
The last bit is useful context, though maybe some of it fits better in
the commit message.
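The preference order suggested above (ERMS first, then REP_GOOD, then the unrolled fallback) can be written down as a small C sketch. This is an illustration only: the enum, the function name and the plain runtime branch are hypothetical, and the two booleans merely stand in for the 'ERMS' and 'REP_GOOD' feature flags mentioned in the thread rather than showing how the kernel actually selects the routine.

```c
#include <stdbool.h>

/* Hypothetical labels for the three variants discussed in the thread. */
enum sketch_clear_variant {
	SKETCH_UNROLLED,  /* stands in for __clear_pages_unrolled() */
	SKETCH_REP_STOSQ, /* "REP; STOSQ" */
	SKETCH_REP_STOSB, /* "REP; STOSB" */
};

/*
 * Pick a variant from two feature flags. 'erms' and 'rep_good' are
 * assumed to mirror what the (possibly virtualized) CPU enumerates.
 */
static enum sketch_clear_variant sketch_pick_variant(bool rep_good, bool erms)
{
	if (erms)
		return SKETCH_REP_STOSB;
	if (rep_good)
		return SKETCH_REP_STOSQ;
	return SKETCH_UNROLLED;
}
```

With this ordering, a guest whose hypervisor hides ERMS but still reports REP_GOOD, i.e. sketch_pick_variant(true, false), ends up on the STOSQ path, which is the situation the warning describes.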
Thanks
ankur