Message-ID: <41bdce39-f731-4a93-a91c-34035f2d2814@redhat.com>
Date: Thu, 7 Aug 2025 20:13:16 +0200
From: David Hildenbrand <david@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Jann Horn <jannh@...gle.com>, kernel test robot <oliver.sang@...el.com>,
Dev Jain <dev.jain@....com>, oe-lkp@...ts.linux.dev, lkp@...el.com,
linux-kernel@...r.kernel.org, Andrew Morton <akpm@...ux-foundation.org>,
Barry Song <baohua@...nel.org>, Pedro Falcato <pfalcato@...e.de>,
Anshuman Khandual <anshuman.khandual@....com>,
Bang Li <libang.li@...group.com>, Baolin Wang
<baolin.wang@...ux.alibaba.com>, bibo mao <maobibo@...ngson.cn>,
Hugh Dickins <hughd@...gle.com>, Ingo Molnar <mingo@...nel.org>,
Lance Yang <ioworker0@...il.com>, Liam Howlett <liam.howlett@...cle.com>,
Matthew Wilcox <willy@...radead.org>, Peter Xu <peterx@...hat.com>,
Qi Zheng <zhengqi.arch@...edance.com>, Ryan Roberts <ryan.roberts@....com>,
Vlastimil Babka <vbabka@...e.cz>, Yang Shi <yang@...amperecomputing.com>,
Zi Yan <ziy@...dia.com>, linux-mm@...ck.org
Subject: Re: [linus:master] [mm] f822a9a81a:
stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
On 07.08.25 20:04, Lorenzo Stoakes wrote:
> On Thu, Aug 07, 2025 at 08:01:51PM +0200, David Hildenbrand wrote:
>> On 07.08.25 19:51, Lorenzo Stoakes wrote:
>>> On Thu, Aug 07, 2025 at 07:46:39PM +0200, Jann Horn wrote:
>>>> On Thu, Aug 7, 2025 at 7:41 PM Lorenzo Stoakes
>>>> <lorenzo.stoakes@...cle.com> wrote:
>>>>> On Thu, Aug 07, 2025 at 07:37:38PM +0200, Jann Horn wrote:
>>>>>> On Thu, Aug 7, 2025 at 10:28 AM Lorenzo Stoakes
>>>>>> <lorenzo.stoakes@...cle.com> wrote:
>>>>>>> On Thu, Aug 07, 2025 at 04:17:09PM +0800, kernel test robot wrote:
>>>>>>>> 94dab12d86cf77ff f822a9a81a31311d67f260aea96
>>>>>>>> ---------------- ---------------------------
>>>>>>>>          %stddev     %change         %stddev
>>>>>>>>              \          |                \
>>>>>>>>      13777 ± 37%     +45.0%      19979 ± 27%  numa-vmstat.node1.nr_slab_reclaimable
>>>>>>>>     367205            +2.3%     375703        vmstat.system.in
>>>>>>>>      55106 ± 37%     +45.1%      79971 ± 27%  numa-meminfo.node1.KReclaimable
>>>>>>>>      55106 ± 37%     +45.1%      79971 ± 27%  numa-meminfo.node1.SReclaimable
>>>>>>>>     559381           -37.3%     350757        stress-ng.bigheap.realloc_calls_per_sec
>>>>>>>>      11468            +1.2%      11603        stress-ng.time.system_time
>>>>>>>>     296.25            +4.5%     309.70        stress-ng.time.user_time
>>>>>>>>       0.81 ±187%    -100.0%       0.00        perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>>>>       9.36 ±165%    -100.0%       0.00        perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>>>>       0.81 ±187%    -100.0%       0.00        perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>>>>       9.36 ±165%    -100.0%       0.00        perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>>>>       5.50 ± 17%    +390.9%      27.00 ± 56%  perf-c2c.DRAM.local
>>>>>>>>     388.50 ± 10%    +114.7%     834.17 ± 33%  perf-c2c.DRAM.remote
>>>>>>>>       1214 ± 13%    +107.3%       2517 ± 31%  perf-c2c.HITM.local
>>>>>>>>     135.00 ± 19%    +130.9%     311.67 ± 32%  perf-c2c.HITM.remote
>>>>>>>>       1349 ± 13%    +109.6%       2829 ± 31%  perf-c2c.HITM.total
>>>>>>>
>>>>>>> Yeah this also looks pretty consistent too...
>>>>>>
>>>>>> FWIW, HITM has different meanings depending on exactly which
>>>>>> microarchitecture that test happened on; the message says it is from
>>>>>> Sapphire Rapids, which is a successor of Ice Lake, so HITM is less
>>>>>> meaningful than if it came from a pre-IceLake system (see
>>>>>> https://lore.kernel.org/all/CAG48ez3RmV6SsVw9oyTXxQXHp3rqtKDk2qwJWo9TGvXCq7Xr-w@mail.gmail.com/).
>>>>>>
>>>>>> To me those numbers mainly look like you're accessing a lot more
>>>>>> cache-cold data. (On pre-IceLake they would indicate cacheline
>>>>>> bouncing, but I guess here they probably don't.) And that makes sense,
>>>>>> since before the patch, this path was just moving PTEs around without
>>>>>> looking at the associated pages/folios; basically more or less like a
>>>>>> memcpy() on x86-64. But after the patch, for every 8 bytes that you
>>>>>> copy, you have to load a cacheline from the vmemmap to get the page.
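
(Just to make the access pattern described above concrete, here is a purely
illustrative userspace toy - this is not the real move_ptes() code, the
function names are invented, and "meta" is only a stand-in for the
vmemmap/struct page array:)

#include <stdint.h>
#include <stddef.h>

struct fake_page { uint64_t pad[8]; };	/* 64 bytes, like one struct page / cacheline */

/* Before the patch: moving the PTEs is basically a memcpy() of 8-byte entries. */
static void move_entries_before(uint64_t *dst, const uint64_t *src, size_t n)
{
	for (size_t i = 0; i < n; i++)
		dst[i] = src[i];
}

/*
 * After the patch: each 8-byte entry copied also chases a 64-byte metadata
 * record, the way pfn_to_page() pulls in one vmemmap cacheline per PTE when
 * looking up the folio.
 */
static uint64_t move_entries_after(uint64_t *dst, const uint64_t *src, size_t n,
				   const struct fake_page *meta)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t pte = src[i];

		sum += meta[pte >> 12].pad[0];	/* the extra, likely cache-cold, load */
		dst[i] = pte;
	}
	return sum;	/* returned only so the loads can't be optimized away */
}
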
>>>>>
>>>>> Yup this is representative of what my investigation is showing.
>>>>>
>>>>> I've narrowed it down but want to wait to report until I'm sure...
>>>>>
>>>>> But yeah we're doing a _lot_ more work.
>>>>>
>>>>> I'm leaning towards disabling this except for arm64 atm tbh - it seems
>>>>> mremap is especially sensitive to this (I found issues with it in my
>>>>> abortive mremap anon merging stuff too, but I really expected it there...)
>>>>
>>>> Another approach would be to always read and write PTEs in
>>>> contpte-sized chunks here, without caring whether they're actually
>>>> contiguous or whatever, or something along those lines.
>>>
>>> Not sure I love that - you'd have to figure out the offset without a
>>> contpte batch, and can it vary? And we'd be doing this on non-arm64
>>> arches for what reason?
>>>
>>> And would it really solve anything? We'd still be looking at the folio -
>>> yes, less than now, but uselessly for arches that don't benefit?
>>>
>>> The basis of this series was (and I did explicitly ask) that it wouldn't harm
>>> other arches.
>>
>> We'd need some hint to detect "this is either small or unbatchable".
>>
>> Sure, we could use pte_batch_hint(), but I'm curious if x86 would also
>> benefit with larger folios (e.g., 64K, 128K) with this patch.
>
> For the record, I did think of using this before it was mentioned - a
> product of actually trying to get the data to back this up instead of
> just talking...
>
> Anyway, isn't that chicken and egg? We'd have to go get the folio to find
> out whether it's a large folio, and so incur the cost before we knew?
>
> So how could we make that workable?
E.g., a best-effort check if the next pte likely points at the next PFN.
But as Jann mentioned, there might actually be no benefit on other
architectures (benchmarking would probably tell us the real story).
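
Something along these lines, maybe - completely untested, and
worth_looking_at_folio() is just an invented name to show the shape of
the check (the caller would still have to make sure ptep + 1 stays
within the same page table and within the range being moved):

static inline bool worth_looking_at_folio(pte_t *ptep, pte_t pte)
{
	pte_t next;

	/* arm64 contpte already gives a cheap hint without touching the folio. */
	if (pte_batch_hint(ptep, pte) > 1)
		return true;

	/* Otherwise best effort: does the next PTE point at the next PFN? */
	next = ptep_get(ptep + 1);
	return pte_present(next) && pte_pfn(next) == pte_pfn(pte) + 1;
}

If that returns false we'd just move the single PTE like before and
never look at the folio at all.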
--
Cheers,
David / dhildenb