Message-ID: <41bdce39-f731-4a93-a91c-34035f2d2814@redhat.com>
Date: Thu, 7 Aug 2025 20:13:16 +0200
From: David Hildenbrand <david@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Jann Horn <jannh@...gle.com>, kernel test robot <oliver.sang@...el.com>,
Dev Jain <dev.jain@....com>, oe-lkp@...ts.linux.dev, lkp@...el.com,
linux-kernel@...r.kernel.org, Andrew Morton <akpm@...ux-foundation.org>,
Barry Song <baohua@...nel.org>, Pedro Falcato <pfalcato@...e.de>,
Anshuman Khandual <anshuman.khandual@....com>,
Bang Li <libang.li@...group.com>, Baolin Wang
<baolin.wang@...ux.alibaba.com>, bibo mao <maobibo@...ngson.cn>,
Hugh Dickins <hughd@...gle.com>, Ingo Molnar <mingo@...nel.org>,
Lance Yang <ioworker0@...il.com>, Liam Howlett <liam.howlett@...cle.com>,
Matthew Wilcox <willy@...radead.org>, Peter Xu <peterx@...hat.com>,
Qi Zheng <zhengqi.arch@...edance.com>, Ryan Roberts <ryan.roberts@....com>,
Vlastimil Babka <vbabka@...e.cz>, Yang Shi <yang@...amperecomputing.com>,
Zi Yan <ziy@...dia.com>, linux-mm@...ck.org
Subject: Re: [linus:master] [mm] f822a9a81a:
stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
On 07.08.25 20:04, Lorenzo Stoakes wrote:
> On Thu, Aug 07, 2025 at 08:01:51PM +0200, David Hildenbrand wrote:
>> On 07.08.25 19:51, Lorenzo Stoakes wrote:
>>> On Thu, Aug 07, 2025 at 07:46:39PM +0200, Jann Horn wrote:
>>>> On Thu, Aug 7, 2025 at 7:41 PM Lorenzo Stoakes
>>>> <lorenzo.stoakes@...cle.com> wrote:
>>>>> On Thu, Aug 07, 2025 at 07:37:38PM +0200, Jann Horn wrote:
>>>>>> On Thu, Aug 7, 2025 at 10:28 AM Lorenzo Stoakes
>>>>>> <lorenzo.stoakes@...cle.com> wrote:
>>>>>>> On Thu, Aug 07, 2025 at 04:17:09PM +0800, kernel test robot wrote:
>>>>>>>> 94dab12d86cf77ff f822a9a81a31311d67f260aea96
>>>>>>>> ---------------- ---------------------------
>>>>>>>>          %stddev     %change         %stddev
>>>>>>>>              \          |                \
>>>>>>>>      13777 ± 37%     +45.0%      19979 ± 27%  numa-vmstat.node1.nr_slab_reclaimable
>>>>>>>>     367205            +2.3%     375703        vmstat.system.in
>>>>>>>>      55106 ± 37%     +45.1%      79971 ± 27%  numa-meminfo.node1.KReclaimable
>>>>>>>>      55106 ± 37%     +45.1%      79971 ± 27%  numa-meminfo.node1.SReclaimable
>>>>>>>>     559381           -37.3%     350757        stress-ng.bigheap.realloc_calls_per_sec
>>>>>>>>      11468            +1.2%      11603        stress-ng.time.system_time
>>>>>>>>     296.25            +4.5%     309.70        stress-ng.time.user_time
>>>>>>>>       0.81 ±187%    -100.0%       0.00        perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>>>>       9.36 ±165%    -100.0%       0.00        perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>>>>       0.81 ±187%    -100.0%       0.00        perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>>>>       9.36 ±165%    -100.0%       0.00        perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>>>>       5.50 ± 17%    +390.9%      27.00 ± 56%  perf-c2c.DRAM.local
>>>>>>>>     388.50 ± 10%    +114.7%     834.17 ± 33%  perf-c2c.DRAM.remote
>>>>>>>>       1214 ± 13%    +107.3%       2517 ± 31%  perf-c2c.HITM.local
>>>>>>>>     135.00 ± 19%    +130.9%     311.67 ± 32%  perf-c2c.HITM.remote
>>>>>>>>       1349 ± 13%    +109.6%       2829 ± 31%  perf-c2c.HITM.total
>>>>>>>
>>>>>>> Yeah this also looks pretty consistent too...
>>>>>>
>>>>>> FWIW, HITM has different meanings depending on exactly which
>>>>>> microarchitecture that test happened on; the message says it is from
>>>>>> Sapphire Rapids, which is a successor of Ice Lake, so HITM is less
>>>>>> meaningful than if it came from a pre-IceLake system (see
>>>>>> https://lore.kernel.org/all/CAG48ez3RmV6SsVw9oyTXxQXHp3rqtKDk2qwJWo9TGvXCq7Xr-w@mail.gmail.com/).
>>>>>>
>>>>>> To me those numbers mainly look like you're accessing a lot more
>>>>>> cache-cold data. (On pre-IceLake they would indicate cacheline
>>>>>> bouncing, but I guess here they probably don't.) And that makes sense,
>>>>>> since before the patch, this path was just moving PTEs around without
>>>>>> looking at the associated pages/folios; basically more or less like a
>>>>>> memcpy() on x86-64. But after the patch, for every 8 bytes that you
>>>>>> copy, you have to load a cacheline from the vmemmap to get the page.
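
(Just to make the access pattern described above concrete, here is a purely
illustrative userspace toy - this is not the real move_ptes() code, the
function names are invented, and "meta" is only a stand-in for the
vmemmap/struct page array:)

#include <stdint.h>
#include <stddef.h>

struct fake_page { uint64_t pad[8]; };	/* 64 bytes, like one struct page / cacheline */

/* Before the patch: moving the PTEs is basically a memcpy() of 8-byte entries. */
static void move_entries_before(uint64_t *dst, const uint64_t *src, size_t n)
{
	for (size_t i = 0; i < n; i++)
		dst[i] = src[i];
}

/*
 * After the patch: each 8-byte entry copied also chases a 64-byte metadata
 * record, the way pfn_to_page() pulls in one vmemmap cacheline per PTE when
 * looking up the folio.
 */
static uint64_t move_entries_after(uint64_t *dst, const uint64_t *src, size_t n,
				   const struct fake_page *meta)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t pte = src[i];

		sum += meta[pte >> 12].pad[0];	/* the extra, likely cache-cold, load */
		dst[i] = pte;
	}
	return sum;	/* returned only so the loads can't be optimized away */
}
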
>>>>>
>>>>> Yup this is representative of what my investigation is showing.
>>>>>
>>>>> I've narrowed it down but want to wait to report until I'm sure...
>>>>>
>>>>> But yeah we're doing a _lot_ more work.
>>>>>
>>>>> I'm leaning towards disabling this except for arm64 atm tbh - it seems
>>>>> mremap is especially sensitive to this (I found issues with it in my
>>>>> abortive mremap anon merging stuff too, but I really expected it there...)
>>>>
>>>> Another approach would be to always read and write PTEs in
>>>> contpte-sized chunks here, without caring whether they're actually
>>>> contiguous or whatever, or something along those lines.
>>>
>>> Not sure I love that - you'd have to figure out the offset without a
>>> contpte batch, and can it vary? And we'd be doing this on non-arm64
>>> arches for what reason?
>>>
>>> And would it really solve anything? We'd still be looking at the folio -
>>> yes, less than now, but uselessly for arches that don't benefit?
>>>
>>> The basis of this series was (and I did explicitly ask) that it wouldn't harm
>>> other arches.
>>
>> We'd need some hint to detect "this is either small or unbatchable".
>>
>> Sure, we could use pte_batch_hint(), but I'm curious if x86 would also
>> benefit with larger folios (e.g., 64K, 128K) with this patch.
>
> For the record, I did think of using this before it was mentioned - a
> product of actually trying to get the data to back this up instead of
> just talking...
>
> Anyway, isn't that chicken and egg? We'd have to go get the folio to find
> out whether it's a large folio, and so incur the cost before we knew?
>
> So how could we make that workable?
E.g., a best-effort check if the next pte likely points at the next PFN.
But as Jann mentioned, there might actually be no benefit on other
architectures (benchmarking would probably tell us the real story).
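
Something along these lines, maybe - completely untested, and
worth_looking_at_folio() is just an invented name to show the shape of
the check (the caller would still have to make sure ptep + 1 stays
within the same page table and within the range being moved):

static inline bool worth_looking_at_folio(pte_t *ptep, pte_t pte)
{
	pte_t next;

	/* arm64 contpte already gives a cheap hint without touching the folio. */
	if (pte_batch_hint(ptep, pte) > 1)
		return true;

	/* Otherwise best effort: does the next PTE point at the next PFN? */
	next = ptep_get(ptep + 1);
	return pte_present(next) && pte_pfn(next) == pte_pfn(pte) + 1;
}

If that returns false we'd just move the single PTE like before and
never look at the folio at all.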
--
Cheers,
David / dhildenb