Message-ID: <cc727530-8535-4d98-9fc4-f6a36941ca75@arm.com>
Date: Thu, 7 Aug 2025 23:20:13 +0530
From: Dev Jain <dev.jain@....com>
To: Jann Horn <jannh@...gle.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
lkp@...el.com, linux-kernel@...r.kernel.org,
Andrew Morton <akpm@...ux-foundation.org>, Barry Song <baohua@...nel.org>,
Pedro Falcato <pfalcato@...e.de>,
Anshuman Khandual <anshuman.khandual@....com>,
Bang Li <libang.li@...group.com>, Baolin Wang
<baolin.wang@...ux.alibaba.com>, bibo mao <maobibo@...ngson.cn>,
David Hildenbrand <david@...hat.com>, Hugh Dickins <hughd@...gle.com>,
Ingo Molnar <mingo@...nel.org>, Lance Yang <ioworker0@...il.com>,
Liam Howlett <liam.howlett@...cle.com>, Matthew Wilcox
<willy@...radead.org>, Peter Xu <peterx@...hat.com>,
Qi Zheng <zhengqi.arch@...edance.com>, Ryan Roberts <ryan.roberts@....com>,
Vlastimil Babka <vbabka@...e.cz>, Yang Shi <yang@...amperecomputing.com>,
Zi Yan <ziy@...dia.com>, linux-mm@...ck.org
Subject: Re: [linus:master] [mm] f822a9a81a:
stress-ng.bigheap.realloc_calls_per_sec 37.3% regression

On 07/08/25 11:16 pm, Jann Horn wrote:
> On Thu, Aug 7, 2025 at 7:41 PM Lorenzo Stoakes
> <lorenzo.stoakes@...cle.com> wrote:
>> On Thu, Aug 07, 2025 at 07:37:38PM +0200, Jann Horn wrote:
>>> On Thu, Aug 7, 2025 at 10:28 AM Lorenzo Stoakes
>>> <lorenzo.stoakes@...cle.com> wrote:
>>>> On Thu, Aug 07, 2025 at 04:17:09PM +0800, kernel test robot wrote:
>>>>> 94dab12d86cf77ff f822a9a81a31311d67f260aea96
>>>>> ---------------- ---------------------------
>>>>>          %stddev    %change          %stddev
>>>>>              \          |                \
>>>>>      13777 ± 37%     +45.0%      19979 ± 27%  numa-vmstat.node1.nr_slab_reclaimable
>>>>>     367205            +2.3%     375703        vmstat.system.in
>>>>>      55106 ± 37%     +45.1%      79971 ± 27%  numa-meminfo.node1.KReclaimable
>>>>>      55106 ± 37%     +45.1%      79971 ± 27%  numa-meminfo.node1.SReclaimable
>>>>>     559381           -37.3%     350757        stress-ng.bigheap.realloc_calls_per_sec
>>>>>      11468            +1.2%      11603        stress-ng.time.system_time
>>>>>     296.25            +4.5%     309.70        stress-ng.time.user_time
>>>>>       0.81 ±187%    -100.0%       0.00        perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>       9.36 ±165%    -100.0%       0.00        perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>       0.81 ±187%    -100.0%       0.00        perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>       9.36 ±165%    -100.0%       0.00        perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>       5.50 ± 17%    +390.9%      27.00 ± 56%  perf-c2c.DRAM.local
>>>>>     388.50 ± 10%    +114.7%     834.17 ± 33%  perf-c2c.DRAM.remote
>>>>>       1214 ± 13%    +107.3%       2517 ± 31%  perf-c2c.HITM.local
>>>>>     135.00 ± 19%    +130.9%     311.67 ± 32%  perf-c2c.HITM.remote
>>>>>       1349 ± 13%    +109.6%       2829 ± 31%  perf-c2c.HITM.total
>>>> Yeah this also looks pretty consistent too...
>>> FWIW, HITM has different meanings depending on exactly which
>>> microarchitecture that test happened on; the message says it is from
>>> Sapphire Rapids, which is a successor of Ice Lake, so HITM is less
>>> meaningful than if it came from a pre-IceLake system (see
>>> https://lore.kernel.org/all/CAG48ez3RmV6SsVw9oyTXxQXHp3rqtKDk2qwJWo9TGvXCq7Xr-w@mail.gmail.com/).
>>>
>>> To me those numbers mainly look like you're accessing a lot more
>>> cache-cold data. (On pre-IceLake they would indicate cacheline
>>> bouncing, but I guess here they probably don't.) And that makes sense,
>>> since before the patch, this path was just moving PTEs around without
>>> looking at the associated pages/folios; basically more or less like a
>>> memcpy() on x86-64. But after the patch, for every 8 bytes that you
>>> copy, you have to load a cacheline from the vmemmap to get the page.
>> Yup this is representative of what my investigation is showing.
>>
>> I've narrowed it down but want to wait to report until I'm sure...
>>
>> But yeah we're doing a _lot_ more work.
>>
>> I'm leaning towards disabling except for arm64 atm tbh, seems mremap is
>> especially sensitive to this (I found issues with this with my abortive mremap
>> anon merging stuff too, but really expected it there...)
> Another approach would be to always read and write PTEs in
> contpte-sized chunks here, without caring whether they're actually
> contiguous or whatever, or something along those lines.

The initial approach was to guard all of this with pte_batch_hint(),
effectively enabling the optimization only on arm64. I guess that sounds
reasonable now.
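
To make the cost argument concrete, here is a rough user-space model of the
gating idea. It is only a sketch and not the kernel code: model_batch_hint(),
count_lookups(), MODEL_CONT_PTES and MODEL_PTE_CONT are all hypothetical names
made up for illustration, and the batch size is simply taken from the hint. It
counts how many per-page metadata lookups (standing in for the vmemmap
cacheline loads Jann describes) are needed to walk n entries, with and without
checking the hint first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for this model only; not taken from any kernel header. */
#define MODEL_CONT_PTES 16u           /* entries per contiguous block */
#define MODEL_PTE_CONT  (1ull << 52)  /* "contiguous" marker bit */

/*
 * Hypothetical stand-in for a batch hint: returns 1 unless the entry is
 * marked contiguous, otherwise the number of entries left in its block.
 */
static unsigned int model_batch_hint(size_t idx, uint64_t pte)
{
	if (!(pte & MODEL_PTE_CONT))
		return 1;
	return MODEL_CONT_PTES - (unsigned int)(idx & (MODEL_CONT_PTES - 1));
}

/*
 * Count the metadata lookups needed to walk n entries.
 * gate_on_hint == 0: look up metadata for every entry or batch.
 * gate_on_hint != 0: skip the lookup whenever the hint says "1".
 */
static size_t count_lookups(const uint64_t *ptes, size_t n, int gate_on_hint)
{
	size_t lookups = 0;
	size_t i = 0;

	assert(ptes != NULL || n == 0);

	while (i < n) {
		unsigned int hint = model_batch_hint(i, ptes[i]);
		size_t step = 1;

		if (!gate_on_hint || hint > 1) {
			lookups++;     /* models one vmemmap cacheline load */
			step = hint;   /* model: batch size comes from the hint */
		}
		if (step > n - i)
			step = n - i;  /* never step past the end of the range */
		i += step;
	}
	return lookups;
}
```

In this model, 32 entries with no contiguous marker cost 32 lookups ungated and
0 lookups when gated on the hint, while 32 entries that are all marked
contiguous cost 2 lookups either way. So the gate removes the extra work where
batching cannot help and leaves the batched case unchanged.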