Message-ID: <92fe6c2b-ef8d-4ed4-bf96-27f0760329b7@bytedance.com>
Date: Thu, 30 Jan 2025 01:53:06 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: Rik van Riel <riel@...riel.com>
Cc: "Paul E. McKenney" <paulmck@...nel.org>,
Matthew Wilcox <willy@...radead.org>, Peter Zijlstra <peterz@...radead.org>,
David Hildenbrand <david@...hat.com>,
kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
lkp@...el.com, linux-kernel@...r.kernel.org,
Andrew Morton <akpm@...ux-foundation.org>,
Dave Hansen <dave.hansen@...ux.intel.com>, Andy Lutomirski
<luto@...nel.org>, Catalin Marinas <catalin.marinas@....com>,
David Rientjes <rientjes@...gle.com>, Hugh Dickins <hughd@...gle.com>,
Jann Horn <jannh@...gle.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Mel Gorman <mgorman@...e.de>, Muchun Song <muchun.song@...ux.dev>,
Peter Xu <peterx@...hat.com>, Will Deacon <will@...nel.org>,
Zach O'Keefe <zokeefe@...gle.com>, Dan Carpenter <dan.carpenter@...aro.org>,
Frederic Weisbecker <frederic@...nel.org>,
Neeraj Upadhyay <neeraj.upadhyay@...nel.org>
Subject: Re: [linus:master] [x86] 4817f70c25: stress-ng.mmapaddr.ops_per_sec
63.0% regression
On 2025/1/30 01:33, Qi Zheng wrote:
>
>
> On 2025/1/30 00:53, Rik van Riel wrote:
>> On Wed, 29 Jan 2025 08:36:12 -0800
>> "Paul E. McKenney" <paulmck@...nel.org> wrote:
>>> On Wed, Jan 29, 2025 at 11:14:29AM -0500, Rik van Riel wrote:
>>
>>>> Paul, does this look like it could do the trick,
>>>> or do we need something else to make RCU freeing
>>>> happy again?
>>>
>>> I don't claim to fully understand the issue, but this would prevent
>>> any RCU grace periods starting subsequently from completing. It would
>>> not prevent RCU callbacks from being invoked for RCU grace periods that
>>> started earlier.
>>>
>>> So it won't prevent RCU callbacks from being invoked.
>>
>> That makes things clear! I guess we need a different approach.
>>
>> Qi, does the patch below resolve the regression for you?
>>
>> ---8<---
>>
>> From 5de4fa686fca15678a7e0a186852f921166854a3 Mon Sep 17 00:00:00 2001
>> From: Rik van Riel <riel@...riel.com>
>> Date: Wed, 29 Jan 2025 10:51:51 -0500
>> Subject: [PATCH 2/2] mm,rcu: prevent RCU callbacks from running with
>> pcp lock held
>>
>> Enabling MMU_GATHER_RCU_TABLE_FREE can create contention on the
>> zone->lock. This turns out to be because in some configurations
>> RCU callbacks are called when IRQs are re-enabled inside
>> rmqueue_bulk, while the CPU is still holding the per-cpu pages lock.
>>
>> That results in the RCU callbacks being unable to grab the
>> PCP lock, and taking the slow path with the zone->lock for
>> each item freed.
>>
>> Speed things up by blocking RCU callbacks while holding the
>> PCP lock.
>>
>> Signed-off-by: Rik van Riel <riel@...riel.com>
>> Suggested-by: Paul McKenney <paulmck@...nel.org>
>> Reported-by: Qi Zheng <zhengqi.arch@...edance.com>
>> ---
>> mm/page_alloc.c | 10 +++++++---
>> 1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 6e469c7ef9a4..73e334f403fd 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -94,11 +94,15 @@ static DEFINE_MUTEX(pcp_batch_high_lock);
>> #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
>> /*
>> - * On SMP, spin_trylock is sufficient protection.
>> + * On SMP, spin_trylock is sufficient protection against recursion.
>> * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP.
>> + *
>> + * Block softirq execution to prevent RCU frees from running in softirq
>> + * context while this CPU holds the PCP lock, which could result in a
>> + * whole bunch of frees contending on the zone->lock.
>> */
>> -#define pcp_trylock_prepare(flags) do { } while (0)
>> -#define pcp_trylock_finish(flag) do { } while (0)
>> +#define pcp_trylock_prepare(flags) local_bh_disable()
>> +#define pcp_trylock_finish(flag) local_bh_enable()
>
> I just tested this, and it doesn't seem to improve much:
>
> root@...ian:~# stress-ng --timeout 60 --times --verify --metrics \
>               --no-rand-seed --mmapaddr 64
> stress-ng: info: [671] dispatching hogs: 64 mmapaddr
> stress-ng: info: [671] successful run completed in 60.07s (1 min, 0.07 secs)
> stress-ng: info: [671] stressor   bogo ops  real time  usr time  sys time  bogo ops/s   bogo ops/s
> stress-ng: info: [671]                        (secs)    (secs)    (secs)  (real time)  (usr+sys time)
> stress-ng: info: [671] mmapaddr   19803127     60.01    235.20   1146.76    330007.29       14329.74
> stress-ng: info: [671] for a 60.07s run time:
> stress-ng: info: [671] 1441.59s available CPU time
> stress-ng: info: [671] 235.57s user time ( 16.34%)
> stress-ng: info: [671] 1147.20s system time ( 79.58%)
> stress-ng: info: [671] 1382.77s total time ( 95.92%)
> stress-ng: info: [671] load average: 41.42 11.91 4.10
>
> The _raw_spin_unlock_irqrestore hotspot still exists:
>
> 15.87% [kernel] [k] _raw_spin_unlock_irqrestore
> 9.18% [kernel] [k] clear_page_rep
> 7.03% [kernel] [k] do_syscall_64
> 3.67% [kernel] [k] _raw_spin_lock
> 3.28% [kernel] [k] __slab_free
> 2.03% [kernel] [k] rcu_cblist_dequeue
> 1.98% [kernel] [k] flush_tlb_mm_range
> 1.88% [kernel] [k] lruvec_stat_mod_folio.part.131
> 1.85% [kernel] [k] get_page_from_freelist
> 1.64% [kernel] [k] kmem_cache_alloc_noprof
> 1.61% [kernel] [k] tlb_remove_table_rcu
> 1.39% [kernel] [k] mtree_range_walk
> 1.36% [kernel] [k] __alloc_frozen_pages_noprof
> 1.27% [kernel] [k] pmd_install
> 1.24% [kernel] [k] memcpy_orig
> 1.23% [kernel] [k] __call_rcu_common.constprop.77
> 1.17% [kernel] [k] free_pgd_range
> 1.15% [kernel] [k] pte_alloc_one
>
> The call stack is as follows:
>
> bpftrace -e 'k:_raw_spin_unlock_irqrestore {@[kstack,comm]=count();}
> interval:s:1 {exit();}'
>
> @[
> _raw_spin_unlock_irqrestore+5
> hrtimer_interrupt+289
> __sysvec_apic_timer_interrupt+85
> sysvec_apic_timer_interrupt+108
> asm_sysvec_apic_timer_interrupt+26
> tlb_remove_table_rcu+48
> rcu_do_batch+424
> rcu_core+401
> handle_softirqs+204
> irq_exit_rcu+208
> sysvec_apic_timer_interrupt+61
> asm_sysvec_apic_timer_interrupt+26
> , stress-ng-mmapa]: 8
>
> tlb_remove_table_rcu() is called very rarely, so I guess the
> PCP cache is basically empty at this time, resulting in the following
> call stack:
>
But I think this may just be an extreme test scenario, since my test
machine has no other load at the moment. Under a normal workload, page
table pages should occupy only a small part of the PCP cache, so
delayed freeing should not have much impact on it.
Thanks!