Message-ID: <87346o582b.fsf@oracle.com>
Date: Sun, 09 Nov 2025 23:20:15 -0800
From: Ankur Arora <ankur.a.arora@...cle.com>
To: "David Hildenbrand (Red Hat)" <david@...nel.org>
Cc: Ankur Arora <ankur.a.arora@...cle.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, x86@...nel.org, akpm@...ux-foundation.org,
bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com,
mingo@...hat.com, mjguzik@...il.com, luto@...nel.org,
peterz@...radead.org, acme@...nel.org, namhyung@...nel.org,
tglx@...utronix.de, willy@...radead.org, raghavendra.kt@....com,
boris.ostrovsky@...cle.com, konrad.wilk@...cle.com
Subject: Re: [PATCH v8 6/7] mm, folio_zero_user: support clearing page ranges
David Hildenbrand (Red Hat) <david@...nel.org> writes:
> On 27.10.25 21:21, Ankur Arora wrote:
>> Clear contiguous page ranges in folio_zero_user() instead of clearing
>> a page-at-a-time. This enables CPU specific optimizations based on
>> the length of the region.
>> Operating on arbitrarily large regions can lead to high preemption
>> latency under cooperative preemption models. So, limit the worst
>> case preemption latency via architecture specified PAGE_CONTIG_NR
>> units.
>> The resultant performance depends on the kinds of optimizations
>> available to the CPU for the region being cleared. Two classes of
>> of optimizations:
>> - clearing iteration costs can be amortized over a range larger
>> than a single page.
>> - cacheline allocation elision (seen on AMD Zen models).
>> Testing a demand fault workload shows an improved baseline from the
>> first optimization and a larger improvement when the region being
>> cleared is large enough for the second optimization.
>> AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):
>> $ perf bench mem map -p $pg-sz -f demand -s 64GB -l 5
>>                  page-at-a-time       contiguous clearing       change
>>                 (GB/s +- %stdev)       (GB/s +- %stdev)
>> pg-sz=2MB        12.92 +- 2.55%         17.03 +- 0.70%         + 31.8%   preempt=*
>> pg-sz=1GB        17.14 +- 2.27%         18.04 +- 1.05% [#]     +  5.2%   preempt=none|voluntary
>> pg-sz=1GB        17.26 +- 1.24%         42.17 +- 4.21%         +144.3%   preempt=full|lazy
>> [#] AMD Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
>> allocation, which is larger than ARCH_PAGE_CONTIG_NR, so
>> preempt=none|voluntary sees no improvement for pg-sz=1GB.
>> Also as mentioned earlier, the baseline improvement is not specific to
>> AMD Zen platforms. Intel Icelakex (pg-sz=2MB|1GB) sees an improvement
>> similar to the Milan pg-sz=2MB workload above (~30%).
>> Signed-off-by: Ankur Arora <ankur.a.arora@...cle.com>
>> Reviewed-by: Raghavendra K T <raghavendra.kt@....com>
>> Tested-by: Raghavendra K T <raghavendra.kt@....com>
>> ---
>>  include/linux/mm.h |  6 ++++++
>>  mm/memory.c        | 42 +++++++++++++++++++++---------------------
>>  2 files changed, 27 insertions(+), 21 deletions(-)
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index ecbcb76df9de..02db84667f97 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3872,6 +3872,12 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
>>  					unsigned int order) {}
>>  #endif /* CONFIG_DEBUG_PAGEALLOC */
>> +#ifndef ARCH_PAGE_CONTIG_NR
>> +#define PAGE_CONTIG_NR	1
>> +#else
>> +#define PAGE_CONTIG_NR	ARCH_PAGE_CONTIG_NR
>> +#endif
>
> The name is a bit misleading. We need something that tells us that this is for
> batch-processing (clearing? maybe later copying?) contig pages. Likely spelling
> out that this is for the non-preemptible case only.
>
> I assume we can drop the "CONTIG", just like clear_pages() doesn't contain it
> etc.
>
> CLEAR_PAGES_NON_PREEMPT_BATCH
>
> PROCESS_PAGES_NON_PREEMPT_BATCH
I think this version is clearer, and would be viable for copying as well.
> Can you remind me again why this is arch specific, and why the default is 1
> instead of, say 2,4,8 ... ?
So, the only use for this value is to decide a reasonable frequency
for calling cond_resched() when operating on hugepages.
And the idea was that the arch is best placed to choose a reasonably
safe value based on the expected spread of bandwidths it might see
across uarchs. The default choice of 1 was to keep the behaviour
close to what we have now.
Thinking about it now though, maybe it is better to instead do this
in common code. We could have two sets of defines,
PROCESS_PAGES_NON_PREEMPT_BATCH_{LARGE,SMALL}: the first for archs
that define __HAVE_ARCH_CLEAR_PAGES, the second for those without.
The first could be 8MB (so with a machine doing ~10 GB/s, a scheduling
latency of ~1ms), the second 512KB (with ~512 MB/s, ~1ms)?
>> +
>>  #ifndef __HAVE_ARCH_CLEAR_PAGES
>>  /**
>>   * clear_pages() - clear a page range for kernel-internal use.
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 74b45e258323..7781b2aa18a8 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -7144,40 +7144,40 @@ static inline int process_huge_page(
>>  	return 0;
>>  }
>> -static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
>> -				unsigned int nr_pages)
>> +/*
>> + * Clear contiguous pages chunking them up when running under
>> + * non-preemptible models.
>> + */
>> +static void clear_contig_highpages(struct page *page, unsigned long addr,
>> +				   unsigned int npages)
>>  {
>> -	unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>> -	int i;
>> +	unsigned int i, count, unit;
>> -	might_sleep();
>> -	for (i = 0; i < nr_pages; i++) {
>> +	unit = preempt_model_preemptible() ? npages : PAGE_CONTIG_NR;
>> +
>> +	for (i = 0; i < npages; ) {
>> +		count = min(unit, npages - i);
>> +		clear_user_highpages(page + i,
>> +				     addr + i * PAGE_SIZE, count);
>> +		i += count;
>
> Why not
>
> for (i = 0; i < nr_pages; i += count) {
>
> Also, I would leave the cond_resched(); where it was (before the invocation) to
> perform as little change as possible.
Both of those seem like good ideas.
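Folding both in, the loop would look roughly like this (sketch only,
keeping the naming from the posted patch; exact placement to be
confirmed in the next version):

	for (i = 0; i < npages; i += count) {
		count = min(unit, npages - i);
		/* Resched point ahead of each clear, as in the old code. */
		cond_resched();
		clear_user_highpages(page + i, addr + i * PAGE_SIZE, count);
	}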
Thanks
--
ankur