Message-ID: <CAGudoHEuRgDEHQOAsK=SmFu29a3NUyLDL1r5PVuahxbdOR6jZg@mail.gmail.com>
Date: Wed, 16 Apr 2025 00:01:24 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Ankur Arora <ankur.a.arora@...cle.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org, 
	torvalds@...ux-foundation.org, akpm@...ux-foundation.org, bp@...en8.de, 
	dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com, luto@...nel.org, 
	peterz@...radead.org, paulmck@...nel.org, rostedt@...dmis.org, 
	tglx@...utronix.de, willy@...radead.org, jon.grimm@....com, bharata@....com, 
	raghavendra.kt@....com, boris.ostrovsky@...cle.com, konrad.wilk@...cle.com
Subject: Re: [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing

On Tue, Apr 15, 2025 at 11:46 PM Ankur Arora <ankur.a.arora@...cle.com> wrote:
>
>
> Mateusz Guzik <mjguzik@...il.com> writes:
>
> > On Sun, Apr 13, 2025 at 08:46:07PM -0700, Ankur Arora wrote:
> >> clear_pages_rep() and clear_pages_erms() use string instructions to
> >> zero memory. When operating on more than a single page, we can use
> >> these more effectively by explicitly advertising the region size to
> >> the processor, which can use that as a hint to optimize the clearing
> >> (e.g. by eliding cacheline allocation).
> >>
> >> As a secondary benefit, string instructions are typically microcoded,
> >> and working with larger regions helps amortize the cost of the decode.
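
For context, the advertised-region idea boils down to something like the
following sketch (an illustration, not the series' actual clear_pages()
implementation):

static inline void clear_pages_sketch(void *addr, unsigned long npages)
{
	unsigned long len = npages * PAGE_SIZE;

	/* A single REP STOSB over the whole region: the byte count in
	 * RCX advertises the full extent to the processor up front,
	 * instead of issuing one opaque PAGE_SIZE clear per page. */
	asm volatile("rep stosb"
		     : "+c" (len), "+D" (addr)
		     : "a" (0)
		     : "memory");
}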
> >>
> >> When zeroing a 2MB page, maximize spatial locality by clearing in
> >> three sections: the faulting page and its immediate neighbourhood,
> >> the region to its left, and the region to its right, with the local
> >> neighbourhood cleared last.
> >>
> >> Performance
> >> ==
> >>
> >> Use mmap(MAP_HUGETLB) to demand fault a 64GB region on the local
> >> NUMA node.
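
The harness is roughly the following sketch (the mmap flags and the 2MB
stride are assumptions, not the exact benchmark behind these numbers):

#define _GNU_SOURCE
#include <stdint.h>
#include <sys/mman.h>

#define REGION_SZ	(64UL << 30)	/* 64GB */
#define HPAGE_SZ	(2UL << 20)	/* pg-sz=2MB; use 1UL << 30 for 1GB */

static int demand_fault_region(void)
{
	char *p = mmap(NULL, REGION_SZ, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED)
		return -1;

	/* One store per huge page demand-faults it; each fault zeroes
	 * the whole page via folio_zero_user(). */
	for (uint64_t off = 0; off < REGION_SZ; off += HPAGE_SZ)
		p[off] = 1;

	munmap(p, REGION_SZ);
	return 0;
}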
> >>
> >> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
> >>
> >>                  mm/folio_zero_user    x86/folio_zero_user     change
> >>                   (GB/s  +- stddev)      (GB/s  +- stddev)
> >>
> >>   pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
> >>   pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
> >>
> >> Milan elides cacheline allocation above a threshold of LLC size
> >> (~32MB), so we see a dropoff in cacheline allocations for pg-sz=1GB.
> >>
> >> pg-sz=1GB:
> >>   -  9,250,034,512      cycles                           #    2.418 GHz                         ( +-  0.43% )  (46.16%)
> >>   -    544,878,976      instructions                     #    0.06  insn per cycle
> >>   -  2,331,332,516      L1-dcache-loads                  #  609.471 M/sec                       ( +-  0.03% )  (46.16%)
> >>   -  1,075,122,960      L1-dcache-load-misses            #   46.12% of all L1-dcache accesses   ( +-  0.01% )  (46.15%)
> >>
> >>   +  3,688,681,006      cycles                           #    2.420 GHz                         ( +-  3.48% )  (46.01%)
> >>   +     10,979,121      instructions                     #    0.00  insn per cycle
> >>   +     31,829,258      L1-dcache-loads                  #   20.881 M/sec                       ( +-  4.92% )  (46.34%)
> >>   +     13,677,295      L1-dcache-load-misses            #   42.97% of all L1-dcache accesses   ( +-  6.15% )  (46.32%)
> >>
> >> That's not the case with pg-sz=2MB, where we also perform better but
> >> the number of cacheline allocations remains the same.
> >>
> >> It's not entirely clear why the performance for pg-sz=2MB improves. We
> >> decode fewer instructions and the hardware prefetcher can do a better
> >> job, but the perf stats for neither of those are convincing enough to
> >> explain an improvement of ~30%.
> >>
> >> pg-sz=2MB:
> >>   - 13,110,306,584      cycles                           #    2.418 GHz                         ( +-  0.48% )  (46.13%)
> >>   -    607,589,360      instructions                     #    0.05  insn per cycle
> >>   -  2,416,130,434      L1-dcache-loads                  #  445.682 M/sec                       ( +-  0.08% )  (46.19%)
> >>   -  1,080,187,594      L1-dcache-load-misses            #   44.71% of all L1-dcache accesses   ( +-  0.01% )  (46.18%)
> >>
> >>   +  9,624,624,178      cycles                           #    2.418 GHz                         ( +-  0.01% )  (46.13%)
> >>   +    277,336,691      instructions                     #    0.03  insn per cycle
> >>   +  2,251,220,599      L1-dcache-loads                  #  565.624 M/sec                       ( +-  0.01% )  (46.20%)
> >>   +  1,092,386,130      L1-dcache-load-misses            #   48.52% of all L1-dcache accesses   ( +-  0.02% )  (46.19%)
> >>
> >> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
> >>
> >>                  mm/folio_zero_user    x86/folio_zero_user     change
> >>                   (GB/s +- stddev)      (GB/s +- stddev)
> >>
> >>   pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%
> >>   pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
> >>
> >> For both page sizes, Icelakex behaves similarly to Milan with
> >> pg-sz=2MB: we see a drop in cycles but no drop in cacheline
> >> allocations.
> >>
> >
> > Back when I was young and handsome and 32-bit x86 was king, people
> > assumed 4K pages needed to be cleared with non-temporal stores to avoid
> > evicting stuff from caches. I had never seen measurements showing this
> > had the intended effect. Some time after this became a thing, I did see
> > measurements showing that it in fact *increases* cache misses. I am not
> > saying this was necessarily the case for all x86 uarchs, merely that the
> > sensible-sounding assumption turned bogus at some point (if it was ever
> > legit).
>
> That was a long time ago though ;-). And your point makes sense for
> small pages. But consider that zeroing a 1GB page can easily blow away
> an L3 cache for absolutely nothing gained -- probabilistically, nothing
> from the page that remains in the cache will ever be accessed.
>
> Now, you could argue that the situation is less clear for 2MB pages.
>

Well, I was talking about 2MB. ;) I thought it was a foregone conclusion
that 1GB pages would be handled with non-temporal stores, but maybe I'm
crossing my wires.
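
For reference, the non-temporal variant under discussion is essentially
this (a kernel-style sketch, movnti-based):

static void clear_page_nt_sketch(void *page)
{
	unsigned long *p = page;
	unsigned long i;

	/* MOVNTI bypasses the cache hierarchy: the cleared lines never
	 * displace anything, but the target is cold afterwards. */
	for (i = 0; i < PAGE_SIZE / sizeof(*p); i++)
		asm volatile("movnti %1, %0" : "=m" (p[i]) : "r" (0UL));

	/* NT stores are weakly ordered; fence before the page is used. */
	asm volatile("sfence" ::: "memory");
}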

> > This brings me to the multi-stage clearing employed here for locality.
> > While it sounds great on paper, for all I know it does not provide any
> > advantage. It may very well be harmful by preventing the CPU from
> > recognizing what you are trying to do.
> >
> > I think doing this warrants obtaining stats from some real workloads,
> > but given how time consuming this can be I think it would be tolerable
> > to skip it for now.
> >
> >> Performance for preempt=none|voluntary remains unchanged.
> >>
> >
> > So I was under the impression the benefit would be realized for all
> > kernels.
> >
> > I don't know how preemption support is implemented on Linux. Do you
> > always get an IPI?
>
> No. The need-resched bit is common. It's just that there's no preemption
> via irqentry, only synchronous calls to cond_resched() (as you mention
> below).
>
> Zeroing via a subroutine-like instruction (rep; stos) is incompatible
> with synchronous calls to cond_resched(), so this code is explicitly not
> called for none/voluntary (see patch 3).
>
> That said, I'll probably take Ingo's suggestion of chunking things up
> in say 8/16MB portions for cooperative preemption models.

Makes sense, thanks.
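
Something like this, presumably (a sketch of the chunking idea; the
chunk size is a placeholder, and it assumes the series' clear_pages()):

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/sizes.h>

#define CLEAR_CHUNK_PAGES	(SZ_8M / PAGE_SIZE)	/* placeholder */

static void clear_pages_chunked_sketch(struct page *page, unsigned long npages)
{
	while (npages) {
		unsigned long n = min(npages, CLEAR_CHUNK_PAGES);

		clear_pages(page_address(page), n);
		page += n;
		npages -= n;
		/* Voluntary preemption point between chunks. */
		cond_resched();
	}
}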

>
>
> > I was thinking something like this: a per-cpu var akin to the
> > preemption count, but indicating that the particular code section is
> > fully preemptible.
> >
> > Then:
> >
> > preemptible_enter();
> > clear_pages();
> > preemptible_exit();
> >
> > For simpler handling of the var, it could prevent migration to other
> > CPUs.
> >
> > Then the IPI handler for preemption would check if ->preemptible is set
> > and the preemption-disable count is zero, in which case it would take
> > you off the CPU.
> >
> > If this is a problem, then a finer granularity would help (say, a
> > cond_resched() every 8 pages?)
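
To spell out what I had in mind (invented names; nothing like this
exists today, and the context-switch entry point is hand-waved):

#include <linux/percpu.h>
#include <linux/preempt.h>
#include <linux/sched.h>

static DEFINE_PER_CPU(int, fully_preemptible);

static inline void preemptible_enter(void)
{
	migrate_disable();	/* keep the per-CPU flag stable */
	this_cpu_write(fully_preemptible, 1);
}

static inline void preemptible_exit(void)
{
	this_cpu_write(fully_preemptible, 0);
	migrate_enable();
}

/* Run from the resched IPI: take the task off the CPU right away if
 * the interrupted section declared itself fully preemptible. */
static void resched_ipi_preempt_check_sketch(void)
{
	if (this_cpu_read(fully_preemptible) && !preempt_count())
		schedule();	/* hand-waving the actual preemption path */
}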
> >
> >> Signed-off-by: Ankur Arora <ankur.a.arora@...cle.com>
> >> ---
> >>  arch/x86/mm/Makefile |  1 +
> >>  arch/x86/mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++++
> >>  include/linux/mm.h   |  1 +
> >>  3 files changed, 62 insertions(+)
> >>  create mode 100644 arch/x86/mm/memory.c
> >>
> >> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> >> index 32035d5be5a0..e61b4d331cdf 100644
> >> --- a/arch/x86/mm/Makefile
> >> +++ b/arch/x86/mm/Makefile
> >> @@ -55,6 +55,7 @@ obj-$(CONFIG_MMIOTRACE_TEST)       += testmmiotrace.o
> >>  obj-$(CONFIG_NUMA)          += numa.o numa_$(BITS).o
> >>  obj-$(CONFIG_AMD_NUMA)              += amdtopology.o
> >>  obj-$(CONFIG_ACPI_NUMA)             += srat.o
> >> +obj-$(CONFIG_PREEMPTION)    += memory.o
> >>
> >>  obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)      += pkeys.o
> >>  obj-$(CONFIG_RANDOMIZE_MEMORY)                      += kaslr.o
> >> diff --git a/arch/x86/mm/memory.c b/arch/x86/mm/memory.c
> >> new file mode 100644
> >> index 000000000000..99851c246fcc
> >> --- /dev/null
> >> +++ b/arch/x86/mm/memory.c
> >> @@ -0,0 +1,60 @@
> >> +// SPDX-License-Identifier: GPL-2.0-or-later
> >> +#include <linux/mm.h>
> >> +#include <linux/range.h>
> >> +#include <linux/minmax.h>
> >> +
> >> +#ifndef CONFIG_HIGHMEM
> >> +/*
> >> + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user().
> >> + *
> >> + * Taking inspiration from the common code variant, we split the zeroing in
> >> + * three parts: left of the fault, right of the fault, and up to 5 pages
> >> + * in the immediate neighbourhood of the target page.
> >> + *
> >> + * Cleared in that order to keep cache lines of the target region hot.
> >> + *
> >> + * For gigantic pages, there is no expectation of cache locality so just do a
> >> + * straight zero.
> >> + */
> >> +void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint)
> >> +{
> >> +    unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
> >> +    const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
> >> +    const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
> >> +    int width = 2; /* pages cleared last on either side */
> >> +    struct range r[3];
> >> +    int i;
> >> +
> >> +    if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
> >> +            clear_pages(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
> >> +            goto out;
> >> +    }
> >> +
> >> +    /*
> >> +     * Faulting page and its immediate neighbourhood. Cleared at the end to
> >> +     * ensure it sticks around in the cache.
> >> +     */
> >> +    r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
> >> +                        clamp_t(s64, fault_idx + width, pg.start, pg.end));
> >> +
> >> +    /* Region to the left of the fault */
> >> +    r[1] = DEFINE_RANGE(pg.start,
> >> +                        clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
> >> +
> >> +    /* Region to the right of the fault: always valid for the common fault_idx=0 case. */
> >> +    r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
> >> +                        pg.end);
> >> +
> >> +    for (i = 0; i <= 2; i++) {
> >> +            int len = range_len(&r[i]);
> >> +
> >> +            if (len > 0)
> >> +                    clear_pages(page_address(folio_page(folio, r[i].start)), len);
> >> +    }
> >> +
> >> +out:
> >> +    /* Explicitly invoke cond_resched() to handle any live patching necessary. */
> >> +    cond_resched();
> >> +}
> >> +
> >> +#endif /* CONFIG_HIGHMEM */
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index b7f13f087954..b57512da8173 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -4114,6 +4114,7 @@ enum mf_action_page_type {
> >>  };
> >>
> >>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> >> +void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint);
> >>  void folio_zero_user(struct folio *folio, unsigned long addr_hint);
> >>  int copy_user_large_folio(struct folio *dst, struct folio *src,
> >>                        unsigned long addr_hint,
> >> --
> >> 2.31.1
> >>
> >>
>
>
> --
> ankur



-- 
Mateusz Guzik <mjguzik gmail.com>
