Message-ID: <b804115c4df4a9283118329e06656c1c76b69b5c.camel@surriel.com>
Date: Wed, 29 Jan 2025 10:23:04 -0500
From: Rik van Riel <riel@...riel.com>
To: Qi Zheng <zhengqi.arch@...edance.com>
Cc: Peter Zijlstra <peterz@...radead.org>, David Hildenbrand
 <david@...hat.com>, kernel test robot <oliver.sang@...el.com>,
 oe-lkp@...ts.linux.dev, lkp@...el.com, linux-kernel@...r.kernel.org,
 Andrew Morton <akpm@...ux-foundation.org>, Dave Hansen
 <dave.hansen@...ux.intel.com>, Andy Lutomirski <luto@...nel.org>,
 Catalin Marinas <catalin.marinas@....com>, David Rientjes
 <rientjes@...gle.com>, Hugh Dickins <hughd@...gle.com>, Jann Horn
 <jannh@...gle.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 Matthew Wilcox <willy@...radead.org>, Mel Gorman <mgorman@...e.de>,
 Muchun Song <muchun.song@...ux.dev>, Peter Xu <peterx@...hat.com>,
 Will Deacon <will@...nel.org>, Zach O'Keefe <zokeefe@...gle.com>,
 Dan Carpenter <dan.carpenter@...aro.org>, "Paul E. McKenney"
 <paulmck@...nel.org>, Frederic Weisbecker <frederic@...nel.org>,
 Neeraj Upadhyay <neeraj.upadhyay@...nel.org>
Subject: Re: [linus:master] [x86] 4817f70c25: stress-ng.mmapaddr.ops_per_sec
 63.0% regression

On Wed, 2025-01-29 at 16:14 +0800, Qi Zheng wrote:
> On 2025/1/29 02:35, Rik van Riel wrote:
> > 
> > That looks like RCU freeing somehow bypassing the per-CPU pages
> > and hitting the zone->lock at page free time, while regular
> > freeing usually puts pages in the CPU-local free page cache
> > without taking that lock?
> 
> Take the following call stack as an example:
> 
> @[
> _raw_spin_unlock_irqrestore+5
> free_one_page+85
> tlb_remove_table_rcu+140
> rcu_do_batch+424
> rcu_core+401
> handle_softirqs+204
> irq_exit_rcu+208
> sysvec_apic_timer_interrupt+113
> asm_sysvec_apic_timer_interrupt+26
> _raw_spin_unlock_irqrestore+29
> get_page_from_freelist+2014
> __alloc_frozen_pages_noprof+364
> alloc_pages_mpol+123
> alloc_pages_noprof+14
> get_free_pages_noprof+17
> __x64_sys_mincore+141
> do_syscall_64+98
> entry_SYSCALL_64_after_hwframe+118
> , stress-ng-mmapa]: 5301
> 
> It looks like the following happened:
> 
> get_page_from_freelist
> --> rmqueue
>      --> rmqueue_pcplist
>          --> pcp_spin_trylock (hold the pcp lock)
>              __rmqueue_pcplist
>              --> rmqueue_bulk
>                  --> spin_lock_irqsave(&zone->lock)
>                      __rmqueue
>                      spin_unlock_irqrestore(&zone->lock)
> 
>                      <run softirq at this time>
> 
>                      tlb_remove_table_rcu
>                      --> free_frozen_pages
>                          --> pcp = pcp_spin_trylock (failed!!!)
>                              if (!pcp)
>                                  free_one_page
> 
> It seems that the pcp lock is held when tlb_remove_table_rcu() runs,
> so the trylock fails; the free then bypasses the PCP and calls
> free_one_page() directly, which creates the hot spot on the zone
> lock.
> 
> As for regular freeing, since the free is not performed in softirq
> context, the above situation does not occur.
> 
> Right?

You are absolutely right!

This raises an interesting question: should we keep
RCU from running callbacks while the pcp_spinlock is
held, and what would be the best way to do that?

Are there other corner cases where RCU callbacks
should not be running from softirq context at
irq reenable time?

Should RCU callbacks perhaps run only when
the current task holds no locks,
or should they simply always run from some
kernel thread?

I'm really not sure what the right answer is...
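As an aside on the kernel-thread option (noting only that the
mechanism exists, not that it resolves this regression), RCU already
supports moving callback processing out of softirq context via boot
parameters:

```
rcutree.use_softirq=0   # run RCU core processing in per-CPU rcuc kthreads
rcu_nocbs=<cpu-list>    # offload callbacks to rcuo kthreads for those CPUs
```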

-- 
All Rights Reversed.
