Message-ID: <d6f84476-e407-4d6b-a892-493c4359f86f@kernel.org>
Date: Wed, 7 Jan 2026 23:18:36 +0100
From: "David Hildenbrand (Red Hat)" <david@...nel.org>
To: Ankur Arora <ankur.a.arora@...cle.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, x86@...nel.org
Cc: akpm@...ux-foundation.org, bp@...en8.de, dave.hansen@...ux.intel.com,
hpa@...or.com, mingo@...hat.com, mjguzik@...il.com, luto@...nel.org,
peterz@...radead.org, tglx@...utronix.de, willy@...radead.org,
raghavendra.kt@....com, chleroy@...nel.org, ioworker0@...il.com,
lizhe.67@...edance.com, boris.ostrovsky@...cle.com, konrad.wilk@...cle.com
Subject: Re: [PATCH v11 8/8] mm: folio_zero_user: cache neighbouring pages
On 1/7/26 08:20, Ankur Arora wrote:
> folio_zero_user() does straight zeroing without caring about
> temporal locality for caches.
>
> This replaced commit c6ddfb6c5890 ("mm, clear_huge_page: move order
> algorithm into a separate function") where we cleared a page at a
> time converging to the faulting page from the left and the right.
>
> To retain limited temporal locality, split the clearing in three
> parts: the faulting page and its immediate neighbourhood, and the
> regions on its left and right. We clear the local neighbourhood last
> to maximize chances of it sticking around in the cache.
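
A rough userspace illustration of the three-way split described above, not the patch's code. The names (split_ranges, zero_region_sketch, SKETCH_PAGE_SIZE), the half-open range convention, and the left-then-right order are all assumptions made for this sketch; the only property taken from the description is that the neighbourhood of the faulting page is cleared last.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size, illustration only */

/* Half-open page-index range [start, end); empty when start == end. */
struct page_range {
	size_t start;
	size_t end;
};

/*
 * Split [0, nr_pages) into left region, right region, and the
 * neighbourhood of fault_idx (fault_idx +/- radius, clamped).
 * out[2] holds the neighbourhood so that it is processed last.
 */
static void split_ranges(size_t nr_pages, size_t fault_idx, size_t radius,
			 struct page_range out[3])
{
	size_t lo, hi;

	assert(fault_idx < nr_pages);

	lo = fault_idx > radius ? fault_idx - radius : 0;
	hi = fault_idx + radius + 1;
	if (hi > nr_pages)
		hi = nr_pages;

	out[0] = (struct page_range){ 0, lo };		/* left of the fault */
	out[1] = (struct page_range){ hi, nr_pages };	/* right of the fault */
	out[2] = (struct page_range){ lo, hi };		/* neighbourhood, last */
}

/* Zero one contiguous run of pages; a no-op for an empty range. */
static void clear_pages(unsigned char *base, struct page_range r)
{
	if (r.start < r.end)
		memset(base + r.start * SKETCH_PAGE_SIZE, 0,
		       (r.end - r.start) * SKETCH_PAGE_SIZE);
}

/* Clear the far regions first and the faulting neighbourhood last. */
static void zero_region_sketch(unsigned char *base, size_t nr_pages,
			       size_t fault_idx, size_t radius)
{
	struct page_range r[3];
	int i;

	split_ranges(nr_pages, fault_idx, radius, r);
	for (i = 0; i < 3; i++)
		clear_pages(base, r[i]);
}
```

With 8 pages, a fault at index 3 and a radius of 1, this clears pages 0-1, then 5-7, then 2-4, so the lines most likely to be touched next are the most recently written.
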
>
> Performance
> ===
>
> AMD Genoa (EPYC 9J14, cpus=2 sockets * 96 cores * 2 threads,
> memory=2.2 TB, L1d=16K/thread, L2=512K/thread, L3=2MB/thread)
>
> vm-scalability/anon-w-seq-hugetlb: this workload runs with 384 processes
> (one for each CPU) each zeroing anonymously mapped hugetlb memory which
> is then accessed sequentially.
>                         stime                   utime
>
>   discontiguous-page    1739.93 ( +- 6.15% )    1016.61 ( +- 4.75% )
>   contiguous-page       1853.70 ( +- 2.51% )    1187.13 ( +- 3.50% )
>   batched-pages         1756.75 ( +- 2.98% )    1133.32 ( +- 4.89% )
>   neighbourhood-last    1725.18 ( +- 4.59% )    1123.78 ( +- 7.38% )
>
> Both stime and utime respond largely as expected. There is a
> fair amount of run to run variation but the general trend is that the
> stime drops and utime increases. There are a few oddities, like
> contiguous-page performing very differently from batched-pages.
>
> As such this is likely an uncommon pattern where we saturate the memory
> bandwidth (since all CPUs are running the test) and at the same time
> are cache constrained because we access the entire region.
>
> Kernel make (make -j 12 bzImage):
>
>                         stime                  utime
>
>   discontiguous-page    199.29 ( +- 0.63% )    1431.67 ( +- 0.04% )
>   contiguous-page       193.76 ( +- 0.58% )    1433.60 ( +- 0.05% )
>   batched-pages         193.92 ( +- 0.76% )    1431.04 ( +- 0.08% )
>   neighbourhood-last    194.46 ( +- 0.68% )    1431.51 ( +- 0.06% )
>
> For make the utime stays relatively flat with a fairly small (-2.4%)
> improvement in the stime.
>
> Signed-off-by: Ankur Arora <ankur.a.arora@...cle.com>
> Reviewed-by: Raghavendra K T <raghavendra.kt@....com>
> Tested-by: Raghavendra K T <raghavendra.kt@....com>
> ---
Nothing jumped out at me, thanks!
Acked-by: David Hildenbrand (Red Hat) <david@...nel.org>
--
Cheers
David