Message-ID: <CAG48ez3=kLL4wBxAVSa2Ugrws+-RFQMdNY9jx5FAdbhpNt8fGg@mail.gmail.com>
Date: Thu, 7 Aug 2025 19:46:39 +0200
From: Jann Horn <jannh@...gle.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: kernel test robot <oliver.sang@...el.com>, Dev Jain <dev.jain@....com>, oe-lkp@...ts.linux.dev, 
	lkp@...el.com, linux-kernel@...r.kernel.org, 
	Andrew Morton <akpm@...ux-foundation.org>, Barry Song <baohua@...nel.org>, 
	Pedro Falcato <pfalcato@...e.de>, Anshuman Khandual <anshuman.khandual@....com>, 
	Bang Li <libang.li@...group.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>, 
	bibo mao <maobibo@...ngson.cn>, David Hildenbrand <david@...hat.com>, Hugh Dickins <hughd@...gle.com>, 
	Ingo Molnar <mingo@...nel.org>, Lance Yang <ioworker0@...il.com>, 
	Liam Howlett <liam.howlett@...cle.com>, Matthew Wilcox <willy@...radead.org>, 
	Peter Xu <peterx@...hat.com>, Qi Zheng <zhengqi.arch@...edance.com>, 
	Ryan Roberts <ryan.roberts@....com>, Vlastimil Babka <vbabka@...e.cz>, 
	Yang Shi <yang@...amperecomputing.com>, Zi Yan <ziy@...dia.com>, linux-mm@...ck.org
Subject: Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec
 37.3% regression

On Thu, Aug 7, 2025 at 7:41 PM Lorenzo Stoakes
<lorenzo.stoakes@...cle.com> wrote:
> On Thu, Aug 07, 2025 at 07:37:38PM +0200, Jann Horn wrote:
> > On Thu, Aug 7, 2025 at 10:28 AM Lorenzo Stoakes
> > <lorenzo.stoakes@...cle.com> wrote:
> > > On Thu, Aug 07, 2025 at 04:17:09PM +0800, kernel test robot wrote:
> > > > 94dab12d86cf77ff f822a9a81a31311d67f260aea96
> > > > ---------------- ---------------------------
> > > >          %stddev     %change         %stddev
> > > >              \          |                \
> > > >      13777 ± 37%     +45.0%      19979 ± 27%  numa-vmstat.node1.nr_slab_reclaimable
> > > >     367205            +2.3%     375703        vmstat.system.in
> > > >      55106 ± 37%     +45.1%      79971 ± 27%  numa-meminfo.node1.KReclaimable
> > > >      55106 ± 37%     +45.1%      79971 ± 27%  numa-meminfo.node1.SReclaimable
> > > >     559381           -37.3%     350757        stress-ng.bigheap.realloc_calls_per_sec
> > > >      11468            +1.2%      11603        stress-ng.time.system_time
> > > >     296.25            +4.5%     309.70        stress-ng.time.user_time
> > > >       0.81 ±187%    -100.0%       0.00        perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > >       9.36 ±165%    -100.0%       0.00        perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > >       0.81 ±187%    -100.0%       0.00        perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > >       9.36 ±165%    -100.0%       0.00        perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > >       5.50 ± 17%    +390.9%      27.00 ± 56%  perf-c2c.DRAM.local
> > > >     388.50 ± 10%    +114.7%     834.17 ± 33%  perf-c2c.DRAM.remote
> > > >       1214 ± 13%    +107.3%       2517 ± 31%  perf-c2c.HITM.local
> > > >     135.00 ± 19%    +130.9%     311.67 ± 32%  perf-c2c.HITM.remote
> > > >       1349 ± 13%    +109.6%       2829 ± 31%  perf-c2c.HITM.total
> > >
> > > Yeah, this looks pretty consistent too...
> >
> > FWIW, HITM has different meanings depending on exactly which
> > microarchitecture that test happened on; the message says it is from
> > Sapphire Rapids, which is a successor of Ice Lake, so HITM is less
> > meaningful than if it came from a pre-IceLake system (see
> > https://lore.kernel.org/all/CAG48ez3RmV6SsVw9oyTXxQXHp3rqtKDk2qwJWo9TGvXCq7Xr-w@mail.gmail.com/).
> >
> > To me those numbers mainly look like you're accessing a lot more
> > cache-cold data. (On pre-IceLake they would indicate cacheline
> > bouncing, but I guess here they probably don't.) And that makes sense,
> > since before the patch, this path was just moving PTEs around without
> > looking at the associated pages/folios; basically more or less like a
> > memcpy() on x86-64. But after the patch, for every 8 bytes that you
> > copy, you have to load a cacheline from the vmemmap to get the page.
>
> Yup this is representative of what my investigation is showing.
>
> I've narrowed it down but want to wait to report until I'm sure...
>
> But yeah we're doing a _lot_ more work.
>
> I'm leaning towards disabling this except for arm64 atm, tbh; mremap seems
> especially sensitive to this (I found issues with it in my abortive mremap
> anon merging work too, but I really expected it there...)
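
For illustration, here is a schematic sketch of the cost difference described
in the quoted text above. This is not the actual mremap PTE-move code; the
loop shape is an approximation, and only the helpers themselves (ptep_get(),
ptep_get_and_clear(), set_pte_at(), vm_normal_folio()) are real:

	/* Before the patch: the PTE-move loop only touches the page
	 * tables themselves, so it behaves roughly like a memcpy() of
	 * 8-byte entries on x86-64. */
	for (; old_addr < old_end; old_addr += PAGE_SIZE,
	     new_addr += PAGE_SIZE, old_pte++, new_pte++) {
		pte_t pte = ptep_get_and_clear(mm, old_addr, old_pte);

		set_pte_at(mm, new_addr, new_pte, pte);
	}

	/* After the patch: each entry additionally looks up its folio,
	 * which means reading a struct page from the vmemmap, i.e.
	 * typically one extra (likely cold) cacheline per PTE copied. */
	for (; old_addr < old_end; old_addr += PAGE_SIZE,
	     new_addr += PAGE_SIZE, old_pte++, new_pte++) {
		pte_t pte = ptep_get(old_pte);
		struct folio *folio = vm_normal_folio(vma, old_addr, pte);

		/* ... folio-based batching decision would go here ... */
		pte = ptep_get_and_clear(mm, old_addr, old_pte);
		set_pte_at(mm, new_addr, new_pte, pte);
	}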

Another approach would be to always read and write PTEs in
contpte-sized chunks here, without caring whether they're actually
contiguous, or something along those lines.
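
A very rough sketch of that chunked idea (again, not a patch): process the
PTEs in fixed-size groups, e.g. 16 entries to match arm64's contpte
granularity for 4K pages. The MOVE_PTE_CHUNK name and the chunk size are
made up for illustration, not existing mm/ code:

	#define MOVE_PTE_CHUNK	16

	pte_t chunk[MOVE_PTE_CHUNK];
	int i, n;

	while (old_addr < old_end) {
		n = min_t(int, MOVE_PTE_CHUNK,
			  (old_end - old_addr) >> PAGE_SHIFT);

		/* Read and clear a whole chunk of source PTEs in one pass,
		 * without looking at the folios behind them... */
		for (i = 0; i < n; i++)
			chunk[i] = ptep_get_and_clear(mm,
					old_addr + i * PAGE_SIZE,
					old_pte + i);

		/* ...then install them at the destination in one pass,
		 * whether or not they happen to be contiguous. */
		for (i = 0; i < n; i++)
			set_pte_at(mm, new_addr + i * PAGE_SIZE,
				   new_pte + i, chunk[i]);

		old_addr += n * PAGE_SIZE;
		new_addr += n * PAGE_SIZE;
		old_pte += n;
		new_pte += n;
	}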
