Message-ID: <92fe6c2b-ef8d-4ed4-bf96-27f0760329b7@bytedance.com>
Date: Thu, 30 Jan 2025 01:53:06 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: Rik van Riel <riel@...riel.com>
Cc: "Paul E. McKenney" <paulmck@...nel.org>,
 Matthew Wilcox <willy@...radead.org>, Peter Zijlstra <peterz@...radead.org>,
 David Hildenbrand <david@...hat.com>,
 kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
 lkp@...el.com, linux-kernel@...r.kernel.org,
 Andrew Morton <akpm@...ux-foundation.org>,
 Dave Hansen <dave.hansen@...ux.intel.com>, Andy Lutomirski
 <luto@...nel.org>, Catalin Marinas <catalin.marinas@....com>,
 David Rientjes <rientjes@...gle.com>, Hugh Dickins <hughd@...gle.com>,
 Jann Horn <jannh@...gle.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 Mel Gorman <mgorman@...e.de>, Muchun Song <muchun.song@...ux.dev>,
 Peter Xu <peterx@...hat.com>, Will Deacon <will@...nel.org>,
 Zach O'Keefe <zokeefe@...gle.com>, Dan Carpenter <dan.carpenter@...aro.org>,
 Frederic Weisbecker <frederic@...nel.org>,
 Neeraj Upadhyay <neeraj.upadhyay@...nel.org>
Subject: Re: [linus:master] [x86] 4817f70c25: stress-ng.mmapaddr.ops_per_sec
 63.0% regression



On 2025/1/30 01:33, Qi Zheng wrote:
> 
> 
> On 2025/1/30 00:53, Rik van Riel wrote:
>> On Wed, 29 Jan 2025 08:36:12 -0800
>> "Paul E. McKenney" <paulmck@...nel.org> wrote:
>>> On Wed, Jan 29, 2025 at 11:14:29AM -0500, Rik van Riel wrote:
>>
>>>> Paul, does this look like it could do the trick,
>>>> or do we need something else to make RCU freeing
>>>> happy again?
>>>
>>> I don't claim to fully understand the issue, but this would prevent
>>> any RCU grace periods starting subsequently from completing.  It would
>>> not prevent RCU callbacks from being invoked for RCU grace periods that
>>> started earlier.
>>>
>>> So it won't prevent RCU callbacks from being invoked.
>>
>> That makes things clear! I guess we need a different approach.
>>
>> Qi, does the patch below resolve the regression for you?
>>
>> ---8<---
>>
>> From 5de4fa686fca15678a7e0a186852f921166854a3 Mon Sep 17 00:00:00 2001
>> From: Rik van Riel <riel@...riel.com>
>> Date: Wed, 29 Jan 2025 10:51:51 -0500
>> Subject: [PATCH 2/2] mm,rcu: prevent RCU callbacks from running with pcp lock held
>>
>> Enabling MMU_GATHER_RCU_TABLE_FREE can create contention on the
>> zone->lock.  This turns out to be because in some configurations
>> RCU callbacks are called when IRQs are re-enabled inside
>> rmqueue_bulk, while the CPU is still holding the per-cpu pages lock.
>>
>> That results in the RCU callbacks being unable to grab the
>> PCP lock, and taking the slow path with the zone->lock for
>> each item freed.
>>
>> Speed things up by blocking RCU callbacks while holding the
>> PCP lock.
>>
>> Signed-off-by: Rik van Riel <riel@...riel.com>
>> Suggested-by: Paul McKenney <paulmck@...nel.org>
>> Reported-by: Qi Zheng <zhengqi.arch@...edance.com>
>> ---
>>   mm/page_alloc.c | 10 +++++++---
>>   1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 6e469c7ef9a4..73e334f403fd 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -94,11 +94,15 @@ static DEFINE_MUTEX(pcp_batch_high_lock);
>>   #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
>>   /*
>> - * On SMP, spin_trylock is sufficient protection.
>> + * On SMP, spin_trylock is sufficient protection against recursion.
>>    * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP.
>> + *
>> + * Block softirq execution to prevent RCU frees from running in softirq
>> + * context while this CPU holds the PCP lock, which could result in a
>> + * whole bunch of frees contending on the zone->lock.
>>    */
>> -#define pcp_trylock_prepare(flags)    do { } while (0)
>> -#define pcp_trylock_finish(flag)    do { } while (0)
>> +#define pcp_trylock_prepare(flags)    local_bh_disable()
>> +#define pcp_trylock_finish(flag)    local_bh_enable()
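
[For context, the call sites bracket the whole PCP critical section
with these macros.  A simplified sketch of the free path in
mm/page_alloc.c, from memory; exact signatures and details vary by
kernel version:

	unsigned long __maybe_unused UP_flags;
	struct per_cpu_pages *pcp;

	pcp_trylock_prepare(UP_flags);	/* with this patch: local_bh_disable() */
	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
	if (pcp) {
		/* Fast path: the page goes onto the per-cpu list.  With
		 * softirqs blocked, rcu_core() cannot run on this CPU
		 * and find the PCP lock already held. */
		free_unref_page_commit(zone, pcp, page, migratetype, order);
		pcp_spin_unlock(pcp);
	} else {
		/* Trylock failed: fall back to the buddy free, which
		 * takes zone->lock -- the contention seen in the
		 * profile when RCU callbacks hit this path. */
		free_one_page(zone, page, pfn, order, FPI_NONE);
	}
	pcp_trylock_finish(UP_flags);	/* with this patch: local_bh_enable() */
]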
> 
> I just tested this, and it doesn't seem to improve much:
> 
> root@...ian:~# stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --mmapaddr 64
> stress-ng: info:  [671] dispatching hogs: 64 mmapaddr
> stress-ng: info:  [671] successful run completed in 60.07s (1 min, 0.07 secs)
> stress-ng: info:  [671] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
> stress-ng: info:  [671]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
> stress-ng: info:  [671] mmapaddr       19803127     60.01    235.20   1146.76    330007.29     14329.74
> stress-ng: info:  [671] for a 60.07s run time:
> stress-ng: info:  [671]    1441.59s available CPU time
> stress-ng: info:  [671]     235.57s user time   ( 16.34%)
> stress-ng: info:  [671]    1147.20s system time ( 79.58%)
> stress-ng: info:  [671]    1382.77s total time  ( 95.92%)
> stress-ng: info:  [671] load average: 41.42 11.91 4.10
> 
> The _raw_spin_unlock_irqrestore hotspot still exists:
> 
>    15.87%  [kernel]  [k] _raw_spin_unlock_irqrestore
>     9.18%  [kernel]  [k] clear_page_rep
>     7.03%  [kernel]  [k] do_syscall_64
>     3.67%  [kernel]  [k] _raw_spin_lock
>     3.28%  [kernel]  [k] __slab_free
>     2.03%  [kernel]  [k] rcu_cblist_dequeue
>     1.98%  [kernel]  [k] flush_tlb_mm_range
>     1.88%  [kernel]  [k] lruvec_stat_mod_folio.part.131
>     1.85%  [kernel]  [k] get_page_from_freelist
>     1.64%  [kernel]  [k] kmem_cache_alloc_noprof
>     1.61%  [kernel]  [k] tlb_remove_table_rcu
>     1.39%  [kernel]  [k] mtree_range_walk
>     1.36%  [kernel]  [k] __alloc_frozen_pages_noprof
>     1.27%  [kernel]  [k] pmd_install
>     1.24%  [kernel]  [k] memcpy_orig
>     1.23%  [kernel]  [k] __call_rcu_common.constprop.77
>     1.17%  [kernel]  [k] free_pgd_range
>     1.15%  [kernel]  [k] pte_alloc_one
> 
> The call stack is as follows:
> 
> bpftrace -e 'k:_raw_spin_unlock_irqrestore {@[kstack,comm]=count();} interval:s:1 {exit();}'
> 
> @[
>      _raw_spin_unlock_irqrestore+5
>      hrtimer_interrupt+289
>      __sysvec_apic_timer_interrupt+85
>      sysvec_apic_timer_interrupt+108
>      asm_sysvec_apic_timer_interrupt+26
>      tlb_remove_table_rcu+48
>      rcu_do_batch+424
>      rcu_core+401
>      handle_softirqs+204
>      irq_exit_rcu+208
>      sysvec_apic_timer_interrupt+61
>      asm_sysvec_apic_timer_interrupt+26
> , stress-ng-mmapa]: 8
> 
> tlb_remove_table_rcu() is called very rarely, so I guess the
> PCP cache is basically empty at this time, resulting in the following
> call stack:
> 

But I think this may just be an extreme test scenario, since my test
machine had no other load at the time. Under a normal workload, page
table pages should occupy only a small part of the PCP cache, so
delayed freeing should not have much impact on it.
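
[For reference, this is roughly what that delayed freeing looks like
under CONFIG_MMU_GATHER_RCU_TABLE_FREE; simplified from
mm/mmu_gather.c, and exact code varies by kernel version:

	static void tlb_remove_table_rcu(struct rcu_head *head)
	{
		struct mmu_table_batch *batch =
			container_of(head, struct mmu_table_batch, rcu);
		unsigned int i;

		/* Runs from rcu_core() in softirq context, a grace
		 * period after the unmap -- the frame seen in the
		 * stack above.  Each page table page goes back to the
		 * page allocator here, via the PCP fast path or, on
		 * trylock failure, the zone->lock slow path. */
		for (i = 0; i < batch->nr; i++)
			__tlb_remove_table(batch->tables[i]);

		free_page((unsigned long)batch);
	}

	static void tlb_remove_table_free(struct mmu_table_batch *batch)
	{
		call_rcu(&batch->rcu, tlb_remove_table_rcu);
	}
]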

Thanks!
