Message-ID: <b14a36da-ff29-4796-8f36-64704924232b@paulmck-laptop>
Date: Wed, 29 Jan 2025 11:19:02 -0800
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Qi Zheng <zhengqi.arch@...edance.com>
Cc: Rik van Riel <riel@...riel.com>, Matthew Wilcox <willy@...radead.org>,
Peter Zijlstra <peterz@...radead.org>,
David Hildenbrand <david@...hat.com>,
kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
lkp@...el.com, linux-kernel@...r.kernel.org,
Andrew Morton <akpm@...ux-foundation.org>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Andy Lutomirski <luto@...nel.org>,
Catalin Marinas <catalin.marinas@....com>,
David Rientjes <rientjes@...gle.com>,
Hugh Dickins <hughd@...gle.com>, Jann Horn <jannh@...gle.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Mel Gorman <mgorman@...e.de>, Muchun Song <muchun.song@...ux.dev>,
Peter Xu <peterx@...hat.com>, Will Deacon <will@...nel.org>,
Zach O'Keefe <zokeefe@...gle.com>,
Dan Carpenter <dan.carpenter@...aro.org>,
Frederic Weisbecker <frederic@...nel.org>,
Neeraj Upadhyay <neeraj.upadhyay@...nel.org>
Subject: Re: [linus:master] [x86] 4817f70c25: stress-ng.mmapaddr.ops_per_sec
63.0% regression
On Thu, Jan 30, 2025 at 01:53:06AM +0800, Qi Zheng wrote:
>
>
> On 2025/1/30 01:33, Qi Zheng wrote:
> >
> >
> > On 2025/1/30 00:53, Rik van Riel wrote:
> > > On Wed, 29 Jan 2025 08:36:12 -0800
> > > "Paul E. McKenney" <paulmck@...nel.org> wrote:
> > > > On Wed, Jan 29, 2025 at 11:14:29AM -0500, Rik van Riel wrote:
> > >
> > > > > Paul, does this look like it could do the trick,
> > > > > or do we need something else to make RCU freeing
> > > > > happy again?
> > > >
> > > > I don't claim to fully understand the issue, but this would prevent
> > > > any RCU grace periods starting subsequently from completing. It would
> > > > not prevent RCU callbacks from being invoked for RCU grace periods that
> > > > started earlier.
> > > >
> > > > So it won't prevent RCU callbacks from being invoked.
> > >
> > > That makes things clear! I guess we need a different approach.
> > >
> > > Qi, does the patch below resolve the regression for you?
> > >
> > > ---8<---
> > >
> > > From 5de4fa686fca15678a7e0a186852f921166854a3 Mon Sep 17 00:00:00 2001
> > > From: Rik van Riel <riel@...riel.com>
> > > Date: Wed, 29 Jan 2025 10:51:51 -0500
> > > Subject: [PATCH 2/2] mm,rcu: prevent RCU callbacks from running with pcp lock held
> > >
> > > Enabling MMU_GATHER_RCU_TABLE_FREE can create contention on the
> > > zone->lock. This turns out to be because in some configurations
> > > RCU callbacks are called when IRQs are re-enabled inside
> > > rmqueue_bulk, while the CPU is still holding the per-cpu pages lock.
> > >
> > > That results in the RCU callbacks being unable to grab the
> > > PCP lock, and taking the slow path with the zone->lock for
> > > each item freed.
> > >
> > > Speed things up by blocking RCU callbacks while holding the
> > > PCP lock.
> > >
> > > Signed-off-by: Rik van Riel <riel@...riel.com>
> > > Suggested-by: Paul McKenney <paulmck@...nel.org>
> > > Reported-by: Qi Zheng <zhengqi.arch@...edance.com>
> > > ---
> > > mm/page_alloc.c | 10 +++++++---
> > > 1 file changed, 7 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 6e469c7ef9a4..73e334f403fd 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -94,11 +94,15 @@ static DEFINE_MUTEX(pcp_batch_high_lock);
> > > #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
> > > /*
> > > - * On SMP, spin_trylock is sufficient protection.
> > > + * On SMP, spin_trylock is sufficient protection against recursion.
> > > * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP.
> > > + *
> > > + * Block softirq execution to prevent RCU frees from running in softirq
> > > + * context while this CPU holds the PCP lock, which could result in a
> > > + * whole bunch of frees contending on the zone->lock.
> > > */
> > > -#define pcp_trylock_prepare(flags) do { } while (0)
> > > -#define pcp_trylock_finish(flag) do { } while (0)
> > > +#define pcp_trylock_prepare(flags) local_bh_disable()
> > > +#define pcp_trylock_finish(flag) local_bh_enable()
> >
> > I just tested this, and it doesn't seem to improve much:
> >
> > root@...ian:~# stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --mmapaddr 64
> > stress-ng: info:  [671] dispatching hogs: 64 mmapaddr
> > stress-ng: info:  [671] successful run completed in 60.07s (1 min, 0.07 secs)
> > stress-ng: info:  [671] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
> > stress-ng: info:  [671]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
> > stress-ng: info:  [671] mmapaddr       19803127     60.01    235.20   1146.76    330007.29       14329.74
> > stress-ng: info: [671] for a 60.07s run time:
> > stress-ng: info: [671] 1441.59s available CPU time
> > stress-ng: info: [671] 235.57s user time ( 16.34%)
> > stress-ng: info: [671] 1147.20s system time ( 79.58%)
> > stress-ng: info: [671] 1382.77s total time ( 95.92%)
> > stress-ng: info: [671] load average: 41.42 11.91 4.10
> >
> > The _raw_spin_unlock_irqrestore hotspot still exists:
> >
> > 15.87% [kernel] [k] _raw_spin_unlock_irqrestore
> > 9.18% [kernel] [k] clear_page_rep
> > 7.03% [kernel] [k] do_syscall_64
> > 3.67% [kernel] [k] _raw_spin_lock
> > 3.28% [kernel] [k] __slab_free
> > 2.03% [kernel] [k] rcu_cblist_dequeue
> > 1.98% [kernel] [k] flush_tlb_mm_range
> > 1.88% [kernel] [k] lruvec_stat_mod_folio.part.131
> > 1.85% [kernel] [k] get_page_from_freelist
> > 1.64% [kernel] [k] kmem_cache_alloc_noprof
> > 1.61% [kernel] [k] tlb_remove_table_rcu
> > 1.39% [kernel] [k] mtree_range_walk
> > 1.36% [kernel] [k] __alloc_frozen_pages_noprof
> > 1.27% [kernel] [k] pmd_install
> > 1.24% [kernel] [k] memcpy_orig
> > 1.23% [kernel] [k] __call_rcu_common.constprop.77
> > 1.17% [kernel] [k] free_pgd_range
> > 1.15% [kernel] [k] pte_alloc_one
> >
> > The call stack is as follows:
> >
> > bpftrace -e 'k:_raw_spin_unlock_irqrestore {@[kstack,comm]=count();} interval:s:1 {exit();}'
> >
> > @[
> > _raw_spin_unlock_irqrestore+5
> > hrtimer_interrupt+289
> > __sysvec_apic_timer_interrupt+85
> > sysvec_apic_timer_interrupt+108
> > asm_sysvec_apic_timer_interrupt+26
> > tlb_remove_table_rcu+48
> > rcu_do_batch+424
> > rcu_core+401
> > handle_softirqs+204
> > irq_exit_rcu+208
> > sysvec_apic_timer_interrupt+61
> > asm_sysvec_apic_timer_interrupt+26
> > , stress-ng-mmapa]: 8
> >
> > tlb_remove_table_rcu() is called very rarely, so I guess the
> > PCP cache is basically empty at this time, resulting in the following
> > call stack:
>
> But I think this may be just an extreme test scenario, because my test
> machine has no other load at this time. Under normal workload, page
> table pages should only occupy a small part of the PCP cache, and
> delayed freeing should not have much impact on the PCP cache.
It might well be extreme, but it might also be well worth looking at
for people in environments where one in a million is the common case. ;-)
Thanx, Paul