Message-ID: <87tt6pw11t.fsf@oracle.com>
Date: Tue, 15 Apr 2025 14:46:22 -0700
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Ankur Arora <ankur.a.arora@...cle.com>, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, x86@...nel.org, torvalds@...ux-foundation.org,
        akpm@...ux-foundation.org, bp@...en8.de, dave.hansen@...ux.intel.com,
        hpa@...or.com, mingo@...hat.com, luto@...nel.org, peterz@...radead.org,
        paulmck@...nel.org, rostedt@...dmis.org, tglx@...utronix.de,
        willy@...radead.org, jon.grimm@....com, bharata@....com,
        raghavendra.kt@....com, boris.ostrovsky@...cle.com,
        konrad.wilk@...cle.com
Subject: Re: [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing


Mateusz Guzik <mjguzik@...il.com> writes:

> On Sun, Apr 13, 2025 at 08:46:07PM -0700, Ankur Arora wrote:
>> clear_pages_rep(), clear_pages_erms() use string instructions to zero
>> memory. When operating on more than a single page, we can use these
>> more effectively by explicitly advertising the region-size to the
>> processor, which can use that as a hint to optimize the clearing
>> (e.g., by eliding cacheline allocation).
>>
>> As a secondary benefit, string instructions are typically microcoded,
>> and working with larger regions helps amortize the cost of the decode.
>>
>> When zeroing a 2MB page, maximize spatial locality by clearing it in
>> three sections: the region to the left of the fault, the region to its
>> right, and the faulting page with its immediate neighbourhood, which is
>> cleared last.
>>
>> Performance
>> ==
>>
>> Use mmap(MAP_HUGETLB) to demand fault a 64GB region on the local
>> NUMA node.
>>
>> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>>
>>                  mm/folio_zero_user    x86/folio_zero_user     change
>>                   (GB/s  +- stddev)      (GB/s  +- stddev)
>>
>>   pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
>>   pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
>>
>> Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
>> allocation, so we see a dropoff in cacheline-allocations for pg-sz=1GB.
>>
>> pg-sz=1GB:
>>   -  9,250,034,512      cycles                           #    2.418 GHz                         ( +-  0.43% )  (46.16%)
>>   -    544,878,976      instructions                     #    0.06  insn per cycle
>>   -  2,331,332,516      L1-dcache-loads                  #  609.471 M/sec                       ( +-  0.03% )  (46.16%)
>>   -  1,075,122,960      L1-dcache-load-misses            #   46.12% of all L1-dcache accesses   ( +-  0.01% )  (46.15%)
>>
>>   +  3,688,681,006      cycles                           #    2.420 GHz                         ( +-  3.48% )  (46.01%)
>>   +     10,979,121      instructions                     #    0.00  insn per cycle
>>   +     31,829,258      L1-dcache-loads                  #   20.881 M/sec                       ( +-  4.92% )  (46.34%)
>>   +     13,677,295      L1-dcache-load-misses            #   42.97% of all L1-dcache accesses   ( +-  6.15% )  (46.32%)
>>
>> That's not the case with pg-sz=2MB, where we also perform better but
>> the number of cacheline allocations remains the same.
>>
>> It's not entirely clear why the performance for pg-sz=2MB improves. We
>> decode fewer instructions and the hardware prefetcher can do a better
>> job, but the perf stats for both of those fall short of explaining an
>> improvement of ~30%.
>>
>> pg-sz=2MB:
>>   - 13,110,306,584      cycles                           #    2.418 GHz                         ( +-  0.48% )  (46.13%)
>>   -    607,589,360      instructions                     #    0.05  insn per cycle
>>   -  2,416,130,434      L1-dcache-loads                  #  445.682 M/sec                       ( +-  0.08% )  (46.19%)
>>   -  1,080,187,594      L1-dcache-load-misses            #   44.71% of all L1-dcache accesses   ( +-  0.01% )  (46.18%)
>>
>>   +  9,624,624,178      cycles                           #    2.418 GHz                         ( +-  0.01% )  (46.13%)
>>   +    277,336,691      instructions                     #    0.03  insn per cycle
>>   +  2,251,220,599      L1-dcache-loads                  #  565.624 M/sec                       ( +-  0.01% )  (46.20%)
>>   +  1,092,386,130      L1-dcache-load-misses            #   48.52% of all L1-dcache accesses   ( +-  0.02% )  (46.19%)
>>
>> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>>
>>                  mm/folio_zero_user    x86/folio_zero_user     change
>>                   (GB/s +- stddev)      (GB/s +- stddev)
>>
>>   pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%
>>   pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
>>
>> For both page sizes, Icelakex behaves similarly to Milan at pg-sz=2MB: we
>> see a drop in cycles but there's no drop in cacheline allocation.
>>
>
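
For anyone wanting to reproduce the numbers, a rough reconstruction of the
benchmark described in the commit message (my reconstruction, not the actual
harness; huge-page size selection and NUMA placement, e.g. via numactl, are
left to external configuration):

#include <stdio.h>
#include <time.h>
#include <sys/mman.h>

int main(void)
{
	const size_t len = 64UL << 30;	/* 64GB */
	const size_t step = 2UL << 20;	/* one touch per 2MB page */
	struct timespec t0, t1;
	double secs;

	/* Requires hugepages reserved up front (vm.nr_hugepages). */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (size_t off = 0; off < len; off += step)
		p[off] = 1;	/* demand fault: the kernel zeroes the page */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.2f GB/s\n", (double)(len >> 30) / secs);

	munmap(p, len);
	return 0;
}
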
> Back when I was young and handsome and 32-bit x86 was king, people
> assumed 4K pages needed to be cleared with non-temporal stores to avoid
> evicting stuff from caches. I had never seen measurements showing this
> has the intended effect. Some time after this became a thing I did see
> measurements showing that this in fact *increases* cache misses. I am
> not saying this was necessarily the case for all x86 uarchs, merely that
> the sensible-sounding assumption turned bogus at some point (if it was
> ever legit).

That was a long time ago though ;-). And your point makes sense for
small page sizes. But consider that zeroing a 1GB page can easily blow
away an L3 cache for absolutely nothing gained -- probabilistically,
nothing from the page that remains in the cache will ever be
accessed.

Now, you could argue that the situation is less clear for 2MB pages.

> This brings me to the multi-stage clearing employed here for locality.
> While it sounds great on paper, for all I know it does not provide any
> advantage. It very well may be it is harmful by preventing the CPU from
> knowing what you are trying to do.
>
> I think doing this warrants obtaining stats from some real workloads,
> but given how time-consuming this can be I think it would be tolerable
> to skip it for now.
>
>> Performance for preempt=none|voluntary remains unchanged.
>>
>
> So I was under the impression the benefit would be realized for all
> kernels.
>
> I don't know how preemption support is implemented on Linux. Do you
> always get an IPI?

No. The need-resched bit is common. It's just that there's no preemption
via irqentry, only synchronous calls to cond_resched() (as you mention
below).

Zeroing via a subroutine-like instruction (rep; stos) is incompatible with
synchronous calls to cond_resched(), so this code is explicitly not called
for none/voluntary (see patch 3).

That said, I'll probably take Ingo's suggestion of chunking things up
in, say, 8MB/16MB portions for cooperative preemption models, along the
lines of the sketch below.
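
A minimal sketch of that chunked variant, assuming an 8MB chunk (names
and placement are mine, not part of this series):

#include <linux/mm.h>
#include <linux/minmax.h>
#include <linux/sched.h>
#include <linux/sizes.h>

/*
 * Hypothetical chunked clearing for cooperative preemption models: clear
 * in fixed-size portions with an explicit rescheduling point between
 * chunks, instead of one rep;stos covering the whole folio.
 */
#define CLEAR_CHUNK_PAGES	(SZ_8M / PAGE_SIZE)

static void clear_pages_chunked(void *addr, unsigned long npages)
{
	while (npages) {
		unsigned long n = min_t(unsigned long, npages, CLEAR_CHUNK_PAGES);

		clear_pages(addr, n);
		addr += n * PAGE_SIZE;
		npages -= n;

		/* Explicit preemption point for preempt=none|voluntary. */
		cond_resched();
	}
}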

Ankur


> I was thinking something like this: a per-cpu var akin to preemption
> count, but indicating the particular code section is fully preemptible
>
> Then:
>
> preemptible_enter();
> clear_pages();
> preemptible_exit();
>
> for simpler handling of the var it could prevent migration to other
> CPUs.
>
> then the IPI handler for preemption would check if ->preemptible is set
> + preemption disablement is zero, in which case it would take you off
> cpu.
>
> If this is a problem, then a better granularity would help (say 8 pages
> between cond_rescheds?)
>
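
For reference, a rough sketch of such a per-cpu marker (all names
hypothetical, not an existing kernel interface; the IPI-side wiring is
omitted):

#include <linux/percpu.h>
#include <linux/preempt.h>

/* Hypothetical "this section is fully preemptible" flag, per CPU. */
static DEFINE_PER_CPU(int, fully_preemptible);

static inline void preemptible_enter(void)
{
	/* Pin to this CPU so the flag stays meaningful for the IPI check. */
	migrate_disable();
	this_cpu_write(fully_preemptible, 1);
}

static inline void preemptible_exit(void)
{
	this_cpu_write(fully_preemptible, 0);
	migrate_enable();
}

/*
 * The reschedule IPI would then check this_cpu_read(fully_preemptible),
 * plus that preemption isn't otherwise disabled, before taking the
 * interrupted task off the CPU.
 */
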
>> Signed-off-by: Ankur Arora <ankur.a.arora@...cle.com>
>> ---
>>  arch/x86/mm/Makefile |  1 +
>>  arch/x86/mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/mm.h   |  1 +
>>  3 files changed, 62 insertions(+)
>>  create mode 100644 arch/x86/mm/memory.c
>>
>> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
>> index 32035d5be5a0..e61b4d331cdf 100644
>> --- a/arch/x86/mm/Makefile
>> +++ b/arch/x86/mm/Makefile
>> @@ -55,6 +55,7 @@ obj-$(CONFIG_MMIOTRACE_TEST)	+= testmmiotrace.o
>>  obj-$(CONFIG_NUMA)		+= numa.o numa_$(BITS).o
>>  obj-$(CONFIG_AMD_NUMA)		+= amdtopology.o
>>  obj-$(CONFIG_ACPI_NUMA)		+= srat.o
>> +obj-$(CONFIG_PREEMPTION)	+= memory.o
>>
>>  obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
>>  obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
>> diff --git a/arch/x86/mm/memory.c b/arch/x86/mm/memory.c
>> new file mode 100644
>> index 000000000000..99851c246fcc
>> --- /dev/null
>> +++ b/arch/x86/mm/memory.c
>> @@ -0,0 +1,60 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +#include <linux/mm.h>
>> +#include <linux/range.h>
>> +#include <linux/minmax.h>
>> +
>> +#ifndef CONFIG_HIGHMEM
>> +/*
>> + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user().
>> + *
>> + * Taking inspiration from the common code variant, we split the zeroing in
>> + * three parts: left of the fault, right of the fault, and up to 5 pages
>> + * in the immediate neighbourhood of the target page.
>> + *
>> + * Cleared in that order to keep cache lines of the target region hot.
>> + *
>> + * For gigantic pages, there is no expectation of cache locality so just do a
>> + * straight zero.
>> + */
>> +void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint)
>> +{
>> +	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>> +	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
>> +	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
>> +	int width = 2; /* pages cleared last on either side */
>> +	struct range r[3];
>> +	int i;
>> +
>> +	if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
>> +		clear_pages(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
>> +		goto out;
>> +	}
>> +
>> +	/*
>> +	 * Faulting page and its immediate neighbourhood. Cleared at the end to
>> +	 * ensure it sticks around in the cache.
>> +	 */
>> +	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
>> +			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
>> +
>> +	/* Region to the left of the fault */
>> +	r[1] = DEFINE_RANGE(pg.start,
>> +			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
>> +
>> +	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
>> +	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
>> +			    pg.end);
>> +
>> +	for (i = 0; i <= 2; i++) {
>> +		int len = range_len(&r[i]);
>> +
>> +		if (len > 0)
>> +			clear_pages(page_address(folio_page(folio, r[i].start)), len);
>> +	}
>> +
>> +out:
>> +	/* Explicitly invoke cond_resched() to handle any live patching necessary. */
>> +	cond_resched();
>> +}
>> +
>> +#endif /* CONFIG_HIGHMEM */
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index b7f13f087954..b57512da8173 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -4114,6 +4114,7 @@ enum mf_action_page_type {
>>  };
>>
>>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
>> +void folio_zero_user_preemptible(struct folio *fio, unsigned long addr_hint);
>>  void folio_zero_user(struct folio *folio, unsigned long addr_hint);
>>  int copy_user_large_folio(struct folio *dst, struct folio *src,
>>  			  unsigned long addr_hint,
>> --
>> 2.31.1
>>
>>


--
ankur
