Message-ID: <96f55f36-5cb7-4226-a8ea-374a7891ecff@bytedance.com>
Date: Thu, 30 Jan 2025 01:33:56 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: Rik van Riel <riel@...riel.com>
Cc: "Paul E. McKenney" <paulmck@...nel.org>,
 Matthew Wilcox <willy@...radead.org>, Qi Zheng <zhengqi.arch@...edance.com>,
 Peter Zijlstra <peterz@...radead.org>, David Hildenbrand <david@...hat.com>,
 kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
 lkp@...el.com, linux-kernel@...r.kernel.org,
 Andrew Morton <akpm@...ux-foundation.org>,
 Dave Hansen <dave.hansen@...ux.intel.com>, Andy Lutomirski
 <luto@...nel.org>, Catalin Marinas <catalin.marinas@....com>,
 David Rientjes <rientjes@...gle.com>, Hugh Dickins <hughd@...gle.com>,
 Jann Horn <jannh@...gle.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 Mel Gorman <mgorman@...e.de>, Muchun Song <muchun.song@...ux.dev>,
 Peter Xu <peterx@...hat.com>, Will Deacon <will@...nel.org>,
 Zach O'Keefe <zokeefe@...gle.com>, Dan Carpenter <dan.carpenter@...aro.org>,
 Frederic Weisbecker <frederic@...nel.org>,
 Neeraj Upadhyay <neeraj.upadhyay@...nel.org>
Subject: Re: [linus:master] [x86] 4817f70c25: stress-ng.mmapaddr.ops_per_sec
 63.0% regression



On 2025/1/30 00:53, Rik van Riel wrote:
> On Wed, 29 Jan 2025 08:36:12 -0800
> "Paul E. McKenney" <paulmck@...nel.org> wrote:
>> On Wed, Jan 29, 2025 at 11:14:29AM -0500, Rik van Riel wrote:
> 
>>> Paul, does this look like it could do the trick,
>>> or do we need something else to make RCU freeing
>>> happy again?
>>
>> I don't claim to fully understand the issue, but this would prevent
>> any RCU grace periods starting subsequently from completing.  It would
>> not prevent RCU callbacks from being invoked for RCU grace periods that
>> started earlier.
>>
>> So it won't prevent RCU callbacks from being invoked.
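[A minimal kernel-style sketch of that distinction, illustrative only and not
from this thread (struct table_free and its helpers are made up): a callback
handed to call_rcu() belongs to a grace period that is already in flight, so
stalling grace periods that start later does not keep it from being invoked.]

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct table_free {
	struct rcu_head rcu;
	void *table;
};

/* Invoked from RCU_SOFTIRQ once the grace period covering the call_rcu()
 * below completes, no matter what happens to grace periods started later. */
static void free_table_cb(struct rcu_head *head)
{
	struct table_free *tf = container_of(head, struct table_free, rcu);

	kfree(tf->table);
	kfree(tf);
}

static void queue_table_free(struct table_free *tf)
{
	call_rcu(&tf->rcu, free_table_cb);
}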
> 
> That makes things clear! I guess we need a different approach.
> 
> Qi, does the patch below resolve the regression for you?
> 
> ---8<---
> 
>  From 5de4fa686fca15678a7e0a186852f921166854a3 Mon Sep 17 00:00:00 2001
> From: Rik van Riel <riel@...riel.com>
> Date: Wed, 29 Jan 2025 10:51:51 -0500
> Subject: [PATCH 2/2] mm,rcu: prevent RCU callbacks from running with pcp lock
>   held
> 
> Enabling MMU_GATHER_RCU_TABLE_FREE can create contention on the
> zone->lock.  This turns out to be because in some configurations
> RCU callbacks are called when IRQs are re-enabled inside
> rmqueue_bulk, while the CPU is still holding the per-cpu pages lock.
> 
> That results in the RCU callbacks being unable to grab the
> PCP lock, and taking the slow path with the zone->lock for
> each item freed.
> 
> Speed things up by blocking RCU callbacks while holding the
> PCP lock.
> 
> Signed-off-by: Rik van Riel <riel@...riel.com>
> Suggested-by: Paul McKenney <paulmck@...nel.org>
> Reported-by: Qi Zheng <zhengqi.arch@...edance.com>
> ---
>   mm/page_alloc.c | 10 +++++++---
>   1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6e469c7ef9a4..73e334f403fd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -94,11 +94,15 @@ static DEFINE_MUTEX(pcp_batch_high_lock);
>   
>   #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
>   /*
> - * On SMP, spin_trylock is sufficient protection.
> + * On SMP, spin_trylock is sufficient protection against recursion.
>    * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP.
> + *
> + * Block softirq execution to prevent RCU frees from running in softirq
> + * context while this CPU holds the PCP lock, which could result in a whole
> + * bunch of frees contending on the zone->lock.
>    */
> -#define pcp_trylock_prepare(flags)	do { } while (0)
> -#define pcp_trylock_finish(flag)	do { } while (0)
> +#define pcp_trylock_prepare(flags)	local_bh_disable()
> +#define pcp_trylock_finish(flag)	local_bh_enable()
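[A rough sketch of the ordering described in the patch's commit message,
illustrative only and not the real rmqueue_bulk(); the comments mark where the
softirq work slips in and what the local_bh_disable() above is meant to defer.]

#include <linux/mmzone.h>
#include <linux/spinlock.h>

static void rmqueue_bulk_sketch(struct zone *zone)
{
	unsigned long flags;

	/* The caller already holds the per-cpu pages (PCP) lock. */
	spin_lock_irqsave(&zone->lock, flags);
	/* ... move a batch of pages from the buddy lists to the PCP list ... */
	spin_unlock_irqrestore(&zone->lock, flags);

	/*
	 * IRQs are re-enabled here.  A pending timer interrupt can run
	 * softirqs on its way out, including RCU_SOFTIRQ, whose callbacks
	 * (e.g. tlb_remove_table_rcu) free pages.  Because this CPU still
	 * holds the PCP lock, their pcp_spin_trylock() fails and each free
	 * falls back to taking zone->lock.  The local_bh_disable() added to
	 * pcp_trylock_prepare() above is meant to defer that softirq until
	 * the PCP lock has been dropped.
	 */
}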

I just tested this, and it doesn't seem to improve much:

root@...ian:~# stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --mmapaddr 64
stress-ng: info:  [671] dispatching hogs: 64 mmapaddr
stress-ng: info:  [671] successful run completed in 60.07s (1 min, 0.07 secs)
stress-ng: info:  [671] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [671]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [671] mmapaddr       19803127     60.01    235.20   1146.76    330007.29     14329.74
stress-ng: info:  [671] for a 60.07s run time:
stress-ng: info:  [671]    1441.59s available CPU time
stress-ng: info:  [671]     235.57s user time   ( 16.34%)
stress-ng: info:  [671]    1147.20s system time ( 79.58%)
stress-ng: info:  [671]    1382.77s total time  ( 95.92%)
stress-ng: info:  [671] load average: 41.42 11.91 4.10

The _raw_spin_unlock_irqrestore hotspot still exists:

   15.87%  [kernel]  [k] _raw_spin_unlock_irqrestore
    9.18%  [kernel]  [k] clear_page_rep
    7.03%  [kernel]  [k] do_syscall_64
    3.67%  [kernel]  [k] _raw_spin_lock
    3.28%  [kernel]  [k] __slab_free
    2.03%  [kernel]  [k] rcu_cblist_dequeue
    1.98%  [kernel]  [k] flush_tlb_mm_range
    1.88%  [kernel]  [k] lruvec_stat_mod_folio.part.131
    1.85%  [kernel]  [k] get_page_from_freelist
    1.64%  [kernel]  [k] kmem_cache_alloc_noprof
    1.61%  [kernel]  [k] tlb_remove_table_rcu
    1.39%  [kernel]  [k] mtree_range_walk
    1.36%  [kernel]  [k] __alloc_frozen_pages_noprof
    1.27%  [kernel]  [k] pmd_install
    1.24%  [kernel]  [k] memcpy_orig
    1.23%  [kernel]  [k] __call_rcu_common.constprop.77
    1.17%  [kernel]  [k] free_pgd_range
    1.15%  [kernel]  [k] pte_alloc_one

The call stack is as follows:

bpftrace -e 'k:_raw_spin_unlock_irqrestore {@[kstack,comm]=count();} interval:s:1 {exit();}'

@[
     _raw_spin_unlock_irqrestore+5
     hrtimer_interrupt+289
     __sysvec_apic_timer_interrupt+85
     sysvec_apic_timer_interrupt+108
     asm_sysvec_apic_timer_interrupt+26
     tlb_remove_table_rcu+48
     rcu_do_batch+424
     rcu_core+401
     handle_softirqs+204
     irq_exit_rcu+208
     sysvec_apic_timer_interrupt+61
     asm_sysvec_apic_timer_interrupt+26
, stress-ng-mmapa]: 8

tlb_remove_table_rcu() is called very rarely, so I guess the PCP cache is
basically empty at that point, resulting in the following call stacks (a rough
sketch of the allocation-side fallback follows them):

@[
     _raw_spin_unlock_irqrestore+5
     __put_partials+218
     kmem_cache_free+860
     rcu_do_batch+424
     rcu_core+401
     handle_softirqs+204
     do_softirq.part.23+59
     __local_bh_enable_ip+91
     get_page_from_freelist+399
     __alloc_frozen_pages_noprof+364
     alloc_pages_mpol+123
     alloc_pages_noprof+14
     get_free_pages_noprof+17
     __x64_sys_mincore+141
     do_syscall_64+98
     entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 776
@[
     _raw_spin_unlock_irqrestore+5
     get_page_from_freelist+2044
     __alloc_frozen_pages_noprof+364
     alloc_pages_mpol+123
     alloc_pages_noprof+14
     pte_alloc_one+30
     __pte_alloc+42
     move_page_tables+2285
     move_vma+472
     __do_sys_mremap+1759
     do_syscall_64+98
     entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 1214
@[
     _raw_spin_unlock_irqrestore+5
     get_page_from_freelist+2044
     __alloc_frozen_pages_noprof+364
     alloc_pages_mpol+123
     alloc_pages_noprof+14
     get_free_pages_noprof+17
     tlb_remove_table+82
     free_pgd_range+655
     free_pgtables+601
     vms_clear_ptes.part.39+255
     vms_complete_munmap_vmas+311
     do_vmi_align_munmap+419
     do_vmi_munmap+195
     move_vma+802
     __do_sys_mremap+1759
     do_syscall_64+98
     entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 1631
@[
     _raw_spin_unlock_irqrestore+5
     get_page_from_freelist+2044
     __alloc_frozen_pages_noprof+364
     alloc_pages_mpol+123
     alloc_pages_noprof+14
     get_free_pages_noprof+17
     tlb_remove_table+82
     free_pgd_range+655
     free_pgtables+601
     vms_clear_ptes.part.39+255
     vms_complete_munmap_vmas+311
     do_vmi_align_munmap+419
     do_vmi_munmap+195
     __vm_munmap+177
     __x64_sys_munmap+27
     do_syscall_64+98
     entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 1672
@[
     _raw_spin_unlock_irqrestore+5
     get_page_from_freelist+2044
     __alloc_frozen_pages_noprof+364
     alloc_pages_mpol+123
     alloc_pages_noprof+14
     __pmd_alloc+52
     __handle_mm_fault+1265
     handle_mm_fault+195
     __get_user_pages+690
     populate_vma_page_range+127
     __mm_populate+159
     vm_mmap_pgoff+329
     do_syscall_64+98
     entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 2042
@[
     _raw_spin_unlock_irqrestore+5
     get_partial_node.part.102+378
     ___slab_alloc.part.103+1180
     __slab_alloc.isra.104+34
     kmem_cache_alloc_noprof+192
     mas_alloc_nodes+358
     mas_store_gfp+183
     do_vmi_align_munmap+398
     do_vmi_munmap+195
     __vm_munmap+177
     __x64_sys_munmap+27
     do_syscall_64+98
     entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 2219
@[
     _raw_spin_unlock_irqrestore+5
     get_page_from_freelist+2044
     __alloc_frozen_pages_noprof+364
     alloc_pages_mpol+123
     alloc_pages_noprof+14
     pte_alloc_one+30
     __pte_alloc+42
     do_pte_missing+2493
     __handle_mm_fault+1914
     handle_mm_fault+195
     __get_user_pages+690
     populate_vma_page_range+127
     __mm_populate+159
     vm_mmap_pgoff+329
     do_syscall_64+98
     entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 2657
@[
     _raw_spin_unlock_irqrestore+5
     get_page_from_freelist+2044
     __alloc_frozen_pages_noprof+364
     alloc_pages_mpol+123
     alloc_pages_noprof+14
     get_free_pages_noprof+17
     __x64_sys_mincore+141
     do_syscall_64+98
     entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 5734
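
[A rough sketch of that allocation-side fallback, simplified and not the
literal mm/page_alloc.c code (refill_batch_from_buddy() is a made-up helper):
when the per-cpu list is empty, the allocator refills it from the buddy free
lists under zone->lock with IRQs disabled, which is where the
_raw_spin_unlock_irqrestore samples inside get_page_from_freelist come from.]

#include <linux/mm.h>
#include <linux/mmzone.h>

static struct page *pcp_alloc_sketch(struct zone *zone,
				     struct list_head *pcp_list, int batch)
{
	unsigned long flags;
	struct page *page;

	if (list_empty(pcp_list)) {
		/* PCP list empty: refill a batch under the shared zone->lock. */
		spin_lock_irqsave(&zone->lock, flags);
		refill_batch_from_buddy(zone, pcp_list, batch);	/* hypothetical */
		spin_unlock_irqrestore(&zone->lock, flags);	/* the profiled unlock */
	}

	page = list_first_entry_or_null(pcp_list, struct page, lru);
	if (page)
		list_del(&page->lru);
	return page;
}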

>   #else
>   
>   /* UP spin_trylock always succeeds so disable IRQs to prevent re-entrancy. */
