Message-ID: <b804115c4df4a9283118329e06656c1c76b69b5c.camel@surriel.com>
Date: Wed, 29 Jan 2025 10:23:04 -0500
From: Rik van Riel <riel@...riel.com>
To: Qi Zheng <zhengqi.arch@...edance.com>
Cc: Peter Zijlstra <peterz@...radead.org>, David Hildenbrand
<david@...hat.com>, kernel test robot <oliver.sang@...el.com>,
oe-lkp@...ts.linux.dev, lkp@...el.com, linux-kernel@...r.kernel.org, Andrew
Morton <akpm@...ux-foundation.org>, Dave Hansen
<dave.hansen@...ux.intel.com>, Andy Lutomirski <luto@...nel.org>, Catalin
Marinas <catalin.marinas@....com>, David Rientjes <rientjes@...gle.com>,
Hugh Dickins <hughd@...gle.com>, Jann Horn <jannh@...gle.com>, Lorenzo
Stoakes <lorenzo.stoakes@...cle.com>, Matthew Wilcox <willy@...radead.org>,
Mel Gorman <mgorman@...e.de>, Muchun Song <muchun.song@...ux.dev>, Peter
Xu <peterx@...hat.com>, Will Deacon <will@...nel.org>, Zach O'Keefe
<zokeefe@...gle.com>, Dan Carpenter <dan.carpenter@...aro.org>, "Paul E.
McKenney" <paulmck@...nel.org>, Frederic Weisbecker <frederic@...nel.org>,
Neeraj Upadhyay <neeraj.upadhyay@...nel.org>
Subject: Re: [linus:master] [x86] 4817f70c25: stress-ng.mmapaddr.ops_per_sec
63.0% regression
On Wed, 2025-01-29 at 16:14 +0800, Qi Zheng wrote:
> On 2025/1/29 02:35, Rik van Riel wrote:
> >
> > That looks like the RCU freeing somehow bypassing the
> > per-cpu-pages, and hitting the zone->lock at page free
> > time, while regular freeing usually puts pages in the
> > CPU-local free page cache, without the lock?
>
> Take the following call stack as an example:
>
> @[
> _raw_spin_unlock_irqrestore+5
> free_one_page+85
> tlb_remove_table_rcu+140
> rcu_do_batch+424
> rcu_core+401
> handle_softirqs+204
> irq_exit_rcu+208
> sysvec_apic_timer_interrupt+113
> asm_sysvec_apic_timer_interrupt+26
> _raw_spin_unlock_irqrestore+29
> get_page_from_freelist+2014
> __alloc_frozen_pages_noprof+364
> alloc_pages_mpol+123
> alloc_pages_noprof+14
> get_free_pages_noprof+17
> __x64_sys_mincore+141
> do_syscall_64+98
> entry_SYSCALL_64_after_hwframe+118
> , stress-ng-mmapa]: 5301
>
> It looks like the following happened:
>
> get_page_from_freelist
>   --> rmqueue
>         --> rmqueue_pcplist
>               --> pcp_spin_trylock   (holds the pcp lock)
>                   __rmqueue_pcplist
>                   --> rmqueue_bulk
>                         --> spin_lock_irqsave(&zone->lock)
>                             __rmqueue
>                             spin_unlock_irqrestore(&zone->lock)
>
> <run softirq at this time>
>
> tlb_remove_table_rcu
>   --> free_frozen_pages
>         --> pcp = pcp_spin_trylock   (failed!!!)
>             if (!pcp)
>                 free_one_page
>
> It seems that the pcp lock is already held when tlb_remove_table_rcu()
> runs, so the trylock fails; the PCP is then bypassed and free_one_page()
> is called directly, which leads to the zone->lock hot spot.
>
> As for regular freeing, it is not performed in softirq context, so the
> above situation does not occur there.
>
> Right?
You are absolutely right!
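
For reference, the fallback you describe in free_frozen_pages() looks
roughly like this (a simplified sketch from memory, not the exact
mm/page_alloc.c code; helper names and arguments may be slightly off,
and pcp_trylock_prepare()/pcp_trylock_finish() etc. are omitted):

static void free_frozen_page_sketch(struct zone *zone, struct page *page,
				    unsigned long pfn, unsigned int order,
				    int migratetype)
{
	struct per_cpu_pages *pcp;

	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
	if (pcp) {
		/* Fast path: stash the page on the per-cpu free list. */
		free_unref_page_commit(zone, pcp, page, migratetype, order);
		pcp_spin_unlock(pcp);
	} else {
		/*
		 * Same-CPU reentry: the allocation path already holds
		 * the pcp lock, so the trylock fails and the page goes
		 * straight to the buddy list under zone->lock.
		 */
		free_one_page(zone, page, pfn, order, FPI_NONE);
	}
}

So every time the RCU batch runs from softirq while this CPU's pcp lock
is held, the freed page tables end up contending on zone->lock.
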
This raises an interesting question: should we keep
RCU from running callbacks while the pcp_spinlock is
held, and what would be the best way to do that?

Are there other corner cases where RCU callbacks
should not be running from softirq context at
irq reenable time?
Maybe the RCU callbacks should only run when
the current task has no spinlocks held, or
should they simply always run from some
kernel thread?
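
(For the kernel thread variant, I believe most of the machinery is
already there: CONFIG_RCU_NOCB_CPU plus the rcu_nocbs= boot parameter
offloads callback invocation to the rcuo kthreads, and, if I remember
right, rcutree.use_softirq=0 moves RCU core processing to the per-CPU
rcuc kthreads, e.g. booting with something like

	rcu_nocbs=0-15 rcutree.use_softirq=0

though whether we would want that behavior by default is a different
question.)
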
I'm really not sure what the right answer is...
--
All Rights Reversed.