Message-ID: <b804115c4df4a9283118329e06656c1c76b69b5c.camel@surriel.com>
Date: Wed, 29 Jan 2025 10:23:04 -0500
From: Rik van Riel <riel@...riel.com>
To: Qi Zheng <zhengqi.arch@...edance.com>
Cc: Peter Zijlstra <peterz@...radead.org>, David Hildenbrand
<david@...hat.com>, kernel test robot <oliver.sang@...el.com>,
oe-lkp@...ts.linux.dev, lkp@...el.com, linux-kernel@...r.kernel.org, Andrew
Morton <akpm@...ux-foundation.org>, Dave Hansen
<dave.hansen@...ux.intel.com>, Andy Lutomirski <luto@...nel.org>, Catalin
Marinas <catalin.marinas@....com>, David Rientjes <rientjes@...gle.com>,
Hugh Dickins <hughd@...gle.com>, Jann Horn <jannh@...gle.com>, Lorenzo
Stoakes <lorenzo.stoakes@...cle.com>, Matthew Wilcox <willy@...radead.org>,
Mel Gorman <mgorman@...e.de>, Muchun Song <muchun.song@...ux.dev>, Peter
Xu <peterx@...hat.com>, Will Deacon <will@...nel.org>, Zach O'Keefe
<zokeefe@...gle.com>, Dan Carpenter <dan.carpenter@...aro.org>, "Paul E.
McKenney" <paulmck@...nel.org>, Frederic Weisbecker <frederic@...nel.org>,
Neeraj Upadhyay <neeraj.upadhyay@...nel.org>
Subject: Re: [linus:master] [x86] 4817f70c25: stress-ng.mmapaddr.ops_per_sec
63.0% regression
On Wed, 2025-01-29 at 16:14 +0800, Qi Zheng wrote:
> On 2025/1/29 02:35, Rik van Riel wrote:
> >
> > That looks like the RCU freeing somehow bypassing the
> > per-cpu-pages, and hitting the zone->lock at page free
> > time, while regular freeing usually puts pages in the
> > CPU-local free page cache, without the lock?
>
> Take the following call stack as an example:
>
> @[
> _raw_spin_unlock_irqrestore+5
> free_one_page+85
> tlb_remove_table_rcu+140
> rcu_do_batch+424
> rcu_core+401
> handle_softirqs+204
> irq_exit_rcu+208
> sysvec_apic_timer_interrupt+113
> asm_sysvec_apic_timer_interrupt+26
> _raw_spin_unlock_irqrestore+29
> get_page_from_freelist+2014
> __alloc_frozen_pages_noprof+364
> alloc_pages_mpol+123
> alloc_pages_noprof+14
> get_free_pages_noprof+17
> __x64_sys_mincore+141
> do_syscall_64+98
> entry_SYSCALL_64_after_hwframe+118
> , stress-ng-mmapa]: 5301
>
> It looks like the following happened:
>
> get_page_from_freelist
>   --> rmqueue
>         --> rmqueue_pcplist
>               --> pcp_spin_trylock   (holds the pcp lock)
>                   __rmqueue_pcplist
>                   --> rmqueue_bulk
>                         --> spin_lock_irqsave(&zone->lock)
>                             __rmqueue
>                             spin_unlock_irqrestore(&zone->lock)
>
> <run softirq at this time>
>
> tlb_remove_table_rcu
>   --> free_frozen_pages
>         --> pcp = pcp_spin_trylock   (failed!!!)
>             if (!pcp)
>                 free_one_page
>
> It seems that the pcp lock is already held when tlb_remove_table_rcu()
> runs, so the trylock fails; the PCP is then bypassed and free_one_page()
> is called directly, which leads to the zone->lock hot spot.
>
> As for regular freeing, it is not performed in softirq context, so the
> above situation does not occur there.
>
> Right?
You are absolutely right!
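
For reference, the fallback you describe in free_frozen_pages() looks
roughly like this (a simplified sketch from memory, not the exact
mm/page_alloc.c code; helper names and arguments may be slightly off,
and pcp_trylock_prepare()/pcp_trylock_finish() etc. are omitted):

static void free_frozen_page_sketch(struct zone *zone, struct page *page,
				    unsigned long pfn, unsigned int order,
				    int migratetype)
{
	struct per_cpu_pages *pcp;

	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
	if (pcp) {
		/* Fast path: stash the page on the per-cpu free list. */
		free_unref_page_commit(zone, pcp, page, migratetype, order);
		pcp_spin_unlock(pcp);
	} else {
		/*
		 * Same-CPU reentry: the allocation path already holds
		 * the pcp lock, so the trylock fails and the page goes
		 * straight to the buddy list under zone->lock.
		 */
		free_one_page(zone, page, pfn, order, FPI_NONE);
	}
}

So every time the RCU batch runs from softirq while this CPU's pcp lock
is held, the freed page tables end up contending on zone->lock.
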
This raises an interesting question: should we keep
RCU from running callbacks while the pcp_spinlock is
held, and what would be the best way to do that?

Are there other corner cases where RCU callbacks
should not be running from softirq context at
irq reenable time?
Maybe the RCU callbacks should only run when
the current task has no spinlocks held, or
should they simply always run from some
kernel thread?
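
(For the kernel thread variant, I believe most of the machinery is
already there: CONFIG_RCU_NOCB_CPU plus the rcu_nocbs= boot parameter
offloads callback invocation to the rcuo kthreads, and, if I remember
right, rcutree.use_softirq=0 moves RCU core processing to the per-CPU
rcuc kthreads, e.g. booting with something like

	rcu_nocbs=0-15 rcutree.use_softirq=0

though whether we would want that behavior by default is a different
question.)
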
I'm really not sure what the right answer is...
--
All Rights Reversed.