Message-ID: <96f55f36-5cb7-4226-a8ea-374a7891ecff@bytedance.com>
Date: Thu, 30 Jan 2025 01:33:56 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: Rik van Riel <riel@...riel.com>
Cc: "Paul E. McKenney" <paulmck@...nel.org>,
Matthew Wilcox <willy@...radead.org>, Qi Zheng <zhengqi.arch@...edance.com>,
Peter Zijlstra <peterz@...radead.org>, David Hildenbrand <david@...hat.com>,
kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
lkp@...el.com, linux-kernel@...r.kernel.org,
Andrew Morton <akpm@...ux-foundation.org>,
Dave Hansen <dave.hansen@...ux.intel.com>, Andy Lutomirski
<luto@...nel.org>, Catalin Marinas <catalin.marinas@....com>,
David Rientjes <rientjes@...gle.com>, Hugh Dickins <hughd@...gle.com>,
Jann Horn <jannh@...gle.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Mel Gorman <mgorman@...e.de>, Muchun Song <muchun.song@...ux.dev>,
Peter Xu <peterx@...hat.com>, Will Deacon <will@...nel.org>,
Zach O'Keefe <zokeefe@...gle.com>, Dan Carpenter <dan.carpenter@...aro.org>,
Frederic Weisbecker <frederic@...nel.org>,
Neeraj Upadhyay <neeraj.upadhyay@...nel.org>
Subject: Re: [linus:master] [x86] 4817f70c25: stress-ng.mmapaddr.ops_per_sec
63.0% regression
On 2025/1/30 00:53, Rik van Riel wrote:
> On Wed, 29 Jan 2025 08:36:12 -0800
> "Paul E. McKenney" <paulmck@...nel.org> wrote:
>> On Wed, Jan 29, 2025 at 11:14:29AM -0500, Rik van Riel wrote:
>
>>> Paul, does this look like it could do the trick,
>>> or do we need something else to make RCU freeing
>>> happy again?
>>
>> I don't claim to fully understand the issue, but this would prevent
>> any RCU grace periods starting subsequently from completing. It would
>> not prevent RCU callbacks from being invoked for RCU grace periods that
>> started earlier.
>>
>> So it won't prevent RCU callbacks from being invoked.
>
> That makes things clear! I guess we need a different approach.
>
> Qi, does the patch below resolve the regression for you?
>
> ---8<---
>
> From 5de4fa686fca15678a7e0a186852f921166854a3 Mon Sep 17 00:00:00 2001
> From: Rik van Riel <riel@...riel.com>
> Date: Wed, 29 Jan 2025 10:51:51 -0500
> Subject: [PATCH 2/2] mm,rcu: prevent RCU callbacks from running with pcp lock
> held
>
> Enabling MMU_GATHER_RCU_TABLE_FREE can create contention on the
> zone->lock. This turns out to be because in some configurations
> RCU callbacks are called when IRQs are re-enabled inside
> rmqueue_bulk, while the CPU is still holding the per-cpu pages lock.
>
> That results in the RCU callbacks being unable to grab the
> PCP lock, and taking the slow path with the zone->lock for
> each item freed.
>
> Speed things up by blocking RCU callbacks while holding the
> PCP lock.
>
> Signed-off-by: Rik van Riel <riel@...riel.com>
> Suggested-by: Paul McKenney <paulmck@...nel.org>
> Reported-by: Qi Zheng <zhengqi.arch@...edance.com>
> ---
> mm/page_alloc.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6e469c7ef9a4..73e334f403fd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -94,11 +94,15 @@ static DEFINE_MUTEX(pcp_batch_high_lock);
>
> #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
> /*
> - * On SMP, spin_trylock is sufficient protection.
> + * On SMP, spin_trylock is sufficient protection against recursion.
> * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP.
> + *
> + * Block softirq execution to prevent RCU frees from running in softirq
> + * context while this CPU holds the PCP lock, which could result in a whole
> + * bunch of frees contending on the zone->lock.
> */
> -#define pcp_trylock_prepare(flags) do { } while (0)
> -#define pcp_trylock_finish(flag) do { } while (0)
> +#define pcp_trylock_prepare(flags) local_bh_disable()
> +#define pcp_trylock_finish(flag) local_bh_enable()
I just tested this, and it doesn't seem to improve much:
root@...ian:~# stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --mmapaddr 64
stress-ng: info: [671] dispatching hogs: 64 mmapaddr
stress-ng: info:  [671] successful run completed in 60.07s (1 min, 0.07 secs)
stress-ng: info:  [671] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [671]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [671] mmapaddr       19803127     60.01    235.20   1146.76    330007.29       14329.74
stress-ng: info: [671] for a 60.07s run time:
stress-ng: info: [671] 1441.59s available CPU time
stress-ng: info: [671] 235.57s user time ( 16.34%)
stress-ng: info: [671] 1147.20s system time ( 79.58%)
stress-ng: info: [671] 1382.77s total time ( 95.92%)
stress-ng: info: [671] load average: 41.42 11.91 4.10
The _raw_spin_unlock_irqrestore hotspot still exists:
15.87% [kernel] [k] _raw_spin_unlock_irqrestore
9.18% [kernel] [k] clear_page_rep
7.03% [kernel] [k] do_syscall_64
3.67% [kernel] [k] _raw_spin_lock
3.28% [kernel] [k] __slab_free
2.03% [kernel] [k] rcu_cblist_dequeue
1.98% [kernel] [k] flush_tlb_mm_range
1.88% [kernel] [k] lruvec_stat_mod_folio.part.131
1.85% [kernel] [k] get_page_from_freelist
1.64% [kernel] [k] kmem_cache_alloc_noprof
1.61% [kernel] [k] tlb_remove_table_rcu
1.39% [kernel] [k] mtree_range_walk
1.36% [kernel] [k] __alloc_frozen_pages_noprof
1.27% [kernel] [k] pmd_install
1.24% [kernel] [k] memcpy_orig
1.23% [kernel] [k] __call_rcu_common.constprop.77
1.17% [kernel] [k] free_pgd_range
1.15% [kernel] [k] pte_alloc_one
The call stack is as follows:
bpftrace -e 'k:_raw_spin_unlock_irqrestore {@[kstack,comm]=count();} interval:s:1 {exit();}'
@[
_raw_spin_unlock_irqrestore+5
hrtimer_interrupt+289
__sysvec_apic_timer_interrupt+85
sysvec_apic_timer_interrupt+108
asm_sysvec_apic_timer_interrupt+26
tlb_remove_table_rcu+48
rcu_do_batch+424
rcu_core+401
handle_softirqs+204
irq_exit_rcu+208
sysvec_apic_timer_interrupt+61
asm_sysvec_apic_timer_interrupt+26
, stress-ng-mmapa]: 8
tlb_remove_table_rcu() is called only rarely, so I guess the PCP cache
is basically empty at that point, resulting in call stacks like the
following:
@[
_raw_spin_unlock_irqrestore+5
__put_partials+218
kmem_cache_free+860
rcu_do_batch+424
rcu_core+401
handle_softirqs+204
do_softirq.part.23+59
__local_bh_enable_ip+91
get_page_from_freelist+399
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
get_free_pages_noprof+17
__x64_sys_mincore+141
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 776
@[
_raw_spin_unlock_irqrestore+5
get_page_from_freelist+2044
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
pte_alloc_one+30
__pte_alloc+42
move_page_tables+2285
move_vma+472
__do_sys_mremap+1759
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 1214
@[
_raw_spin_unlock_irqrestore+5
get_page_from_freelist+2044
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
get_free_pages_noprof+17
tlb_remove_table+82
free_pgd_range+655
free_pgtables+601
vms_clear_ptes.part.39+255
vms_complete_munmap_vmas+311
do_vmi_align_munmap+419
do_vmi_munmap+195
move_vma+802
__do_sys_mremap+1759
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 1631
@[
_raw_spin_unlock_irqrestore+5
get_page_from_freelist+2044
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
get_free_pages_noprof+17
tlb_remove_table+82
free_pgd_range+655
free_pgtables+601
vms_clear_ptes.part.39+255
vms_complete_munmap_vmas+311
do_vmi_align_munmap+419
do_vmi_munmap+195
__vm_munmap+177
__x64_sys_munmap+27
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 1672
@[
_raw_spin_unlock_irqrestore+5
get_page_from_freelist+2044
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
__pmd_alloc+52
__handle_mm_fault+1265
handle_mm_fault+195
__get_user_pages+690
populate_vma_page_range+127
__mm_populate+159
vm_mmap_pgoff+329
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 2042
@[
_raw_spin_unlock_irqrestore+5
get_partial_node.part.102+378
___slab_alloc.part.103+1180
__slab_alloc.isra.104+34
kmem_cache_alloc_noprof+192
mas_alloc_nodes+358
mas_store_gfp+183
do_vmi_align_munmap+398
do_vmi_munmap+195
__vm_munmap+177
__x64_sys_munmap+27
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 2219
@[
_raw_spin_unlock_irqrestore+5
get_page_from_freelist+2044
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
pte_alloc_one+30
__pte_alloc+42
do_pte_missing+2493
__handle_mm_fault+1914
handle_mm_fault+195
__get_user_pages+690
populate_vma_page_range+127
__mm_populate+159
vm_mmap_pgoff+329
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 2657
@[
_raw_spin_unlock_irqrestore+5
get_page_from_freelist+2044
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
get_free_pages_noprof+17
__x64_sys_mincore+141
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 5734
> #else
>
> /* UP spin_trylock always succeeds so disable IRQs to prevent re-entrancy. */