linux-kernel - Re: [linus:master] [x86] 4817f70c25: stress-ng.mmapaddr.ops_per

Message-ID: <8b661116-0a85-4928-91ed-3c01ebbf8d39@bytedance.com>
Date: Wed, 29 Jan 2025 01:06:55 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: David Hildenbrand <david@...hat.com>,
 Peter Zijlstra <peterz@...radead.org>
Cc: kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
 lkp@...el.com, linux-kernel@...r.kernel.org,
 Andrew Morton <akpm@...ux-foundation.org>,
 Dave Hansen <dave.hansen@...ux.intel.com>, Andy Lutomirski
 <luto@...nel.org>, Catalin Marinas <catalin.marinas@....com>,
 David Rientjes <rientjes@...gle.com>, Hugh Dickins <hughd@...gle.com>,
 Jann Horn <jannh@...gle.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 Matthew Wilcox <willy@...radead.org>, Mel Gorman <mgorman@...e.de>,
 Muchun Song <muchun.song@...ux.dev>, Peter Xu <peterx@...hat.com>,
 Will Deacon <will@...nel.org>, Zach O'Keefe <zokeefe@...gle.com>,
 Dan Carpenter <dan.carpenter@...aro.org>, Rik van Riel <riel@...riel.com>
Subject: Re: [linus:master] [x86] 4817f70c25: stress-ng.mmapaddr.ops_per_sec
 63.0% regression

Hi,

On 2025/1/28 21:42, David Hildenbrand wrote:
> On 28.01.25 14:28, Peter Zijlstra wrote:
>> On Tue, Jan 28, 2025 at 12:39:51PM +0100, David Hildenbrand wrote:
>>> On 28.01.25 12:31, Peter Zijlstra wrote:
>>
>>>>> I recall a recent series to select MMU_GATHER_RCU_TABLE_FREE on x86
>>>>> unconditionally (@Peter, @Rik).
>>>>
>>>> Those changes should not have made it to Linus yet.
>>>>
>>>> /me updates git and checks...
>>>>
>>>> nope, nothing changed there ... yet
>>>
>>> Sorry, I wasn't quite clear. CONFIG_PT_RECLAIM made it upstream, which
>>> has "select MMU_GATHER_RCU_TABLE_FREE" in kconfig.
>>>
>>> So I'm wondering if the degradation we see in this report is due to
>>> MMU_GATHER_RCU_TABLE_FREE being selected by CONFIG_PT_RECLAIM, and we'd
>>> get the same result (degradation) when unconditionally enabling
>>> MMU_GATHER_RCU_TABLE_FREE.
>>
>> Ah, yes, but a RHEL based config (as is the case here) should already
>> have it selected due to PARAVIRT.
> 
> Ah, right. Most distros will just have it enabled either way.
> 
> But that would then mean that MMU_GATHER_RCU_TABLE_FREE is not the cause
> for the regression here, and something else is going wrong.
> 

I reproduced the performance regression with the following command:

stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --mmapaddr 64

The results are as follows:

1) Enable CONFIG_PT_RECLAIM

stress-ng: info:  [826] dispatching hogs: 64 mmapaddr
stress-ng: info:  [826] successful run completed in 60.29s (1 min, 0.29 secs)
stress-ng: info:  [826] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [826]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [826] mmapaddr       17233711     60.01    238.47   1128.46    287178.92     12607.60
stress-ng: info:  [826] for a 60.29s run time:
stress-ng: info:  [826]    1447.07s available CPU time
stress-ng: info:  [826]     238.85s user time   ( 16.51%)
stress-ng: info:  [826]    1128.87s system time ( 78.01%)
stress-ng: info:  [826]    1367.72s total time  ( 94.52%)
stress-ng: info:  [826] load average: 48.64 20.73 7.82

2) Disable CONFIG_PT_RECLAIM

stress-ng: info:  [704] dispatching hogs: 64 mmapaddr
stress-ng: info:  [704] successful run completed in 60.05s (1 min, 0.05 secs)
stress-ng: info:  [704] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [704]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [704] mmapaddr       28440843     60.02    343.93   1090.70    473882.98     19824.51
stress-ng: info:  [704] for a 60.05s run time:
stress-ng: info:  [704]    1441.23s available CPU time
stress-ng: info:  [704]     344.30s user time   ( 23.89%)
stress-ng: info:  [704]    1091.12s system time ( 75.71%)
stress-ng: info:  [704]    1435.42s total time  ( 99.60%)
stress-ng: info:  [704] load average: 40.03 11.51 3.96
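
As a quick check of the two runs above, the relative drop in throughput can be computed directly from the reported bogo ops/s figures. This is only a sketch: the numbers are copied from the stress-ng output in this message, the helper name relative_drop is made up for illustration, and a single run per configuration says nothing about run-to-run variance.

```python
# Relative throughput drop between the two stress-ng runs quoted above.
# All figures are copied from the stress-ng output in this message;
# relative_drop() is a hypothetical helper, not part of any tool.

def relative_drop(baseline: float, measured: float) -> float:
    """Return how far `measured` falls below `baseline`, in percent."""
    return (baseline - measured) / baseline * 100.0

# bogo ops/s by wall-clock time
real_time = {"pt_reclaim_on": 287178.92, "pt_reclaim_off": 473882.98}
# bogo ops/s by usr+sys CPU time
cpu_time = {"pt_reclaim_on": 12607.60, "pt_reclaim_off": 19824.51}

real_drop = relative_drop(real_time["pt_reclaim_off"], real_time["pt_reclaim_on"])
cpu_drop = relative_drop(cpu_time["pt_reclaim_off"], cpu_time["pt_reclaim_on"])

print(f"real-time bogo ops/s drop: {real_drop:.1f}%")
print(f"usr+sys   bogo ops/s drop: {cpu_drop:.1f}%")
```

On this machine the drop works out to roughly 39% by real time and 36% by usr+sys time, so the regression reproduces here, although it is smaller than the 63.0% in the subject line.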

With CONFIG_PT_RECLAIM enabled, perf shows an additional hotspot,
_raw_spin_unlock_irqrestore:

   16.35%  [kernel]  [k] _raw_spin_unlock_irqrestore
    9.09%  [kernel]  [k] clear_page_rep
    6.92%  [kernel]  [k] do_syscall_64
    3.76%  [kernel]  [k] _raw_spin_lock
    3.27%  [kernel]  [k] __slab_free
    2.07%  [kernel]  [k] rcu_cblist_dequeue
    1.94%  [kernel]  [k] flush_tlb_mm_range
    1.87%  [kernel]  [k] lruvec_stat_mod_folio.part.130
    1.79%  [kernel]  [k] get_page_from_freelist
    1.61%  [kernel]  [k] tlb_remove_table_rcu
    1.58%  [kernel]  [k] kmem_cache_alloc_noprof
    1.43%  [kernel]  [k] mtree_range_walk

Its call stacks, collected with bpftrace, are as follows:

bpftrace -e 'k:_raw_spin_unlock_irqrestore {@[kstack,comm]=count();} interval:s:1 {exit();}'

@[
_raw_spin_unlock_irqrestore+5
free_one_page+85
rcu_do_batch+424
rcu_core+401
handle_softirqs+204
irq_exit_rcu+208
sysvec_apic_timer_interrupt+113
asm_sysvec_apic_timer_interrupt+26
_raw_spin_unlock_irqrestore+29
get_page_from_freelist+2014
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
get_free_pages_noprof+17
__x64_sys_mincore+141
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 2283
@[
_raw_spin_unlock_irqrestore+5
get_page_from_freelist+2014
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
pte_alloc_one+30
__pte_alloc+42
do_pte_missing+2499
__handle_mm_fault+1862
handle_mm_fault+195
__get_user_pages+690
populate_vma_page_range+127
__mm_populate+159
vm_mmap_pgoff+329
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 2443
@[
_raw_spin_unlock_irqrestore+5
get_page_from_freelist+2014
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
get_free_pages_noprof+17
__x64_sys_mincore+141
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 5184
@[
_raw_spin_unlock_irqrestore+5
free_one_page+85
tlb_remove_table_rcu+140
rcu_do_batch+424
rcu_core+401
handle_softirqs+204
irq_exit_rcu+208
sysvec_apic_timer_interrupt+113
asm_sysvec_apic_timer_interrupt+26
_raw_spin_unlock_irqrestore+29
get_page_from_freelist+2014
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
get_free_pages_noprof+17
__x64_sys_mincore+141
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 5301
@Error looking up stack id 4294967279 (pid -1): -1
[, stress-ng-mmapa]: 53366
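
To put rough numbers on the stacks above, the counts can be grouped by whether the stack passes through rcu_do_batch, that is, whether the unlock happens while an RCU callback is freeing pages. This is a sketch only: the counts and frame names are copied from the bpftrace output in this message, the grouping rule is an assumption chosen for illustration, and the 53366 samples whose stack could not be resolved are kept separate because nothing is known about where they come from.

```python
# Group the bpftrace kstack counts above by whether the stack passes
# through rcu_do_batch (pages freed from an RCU callback).
# Counts and frame names are copied from the output in this message;
# the grouping itself is an illustrative assumption.

resolved_stacks = [
    # (distinguishing frames, sample count)
    (("free_one_page", "rcu_do_batch", "__x64_sys_mincore"), 2283),
    (("get_page_from_freelist", "pte_alloc_one", "vm_mmap_pgoff"), 2443),
    (("get_page_from_freelist", "__x64_sys_mincore"), 5184),
    (("free_one_page", "tlb_remove_table_rcu", "rcu_do_batch",
      "__x64_sys_mincore"), 5301),
]
unresolved_samples = 53366  # "Error looking up stack id" entry

resolved_total = sum(count for _, count in resolved_stacks)
rcu_total = sum(count for frames, count in resolved_stacks
                if "rcu_do_batch" in frames)
table_free_total = sum(count for frames, count in resolved_stacks
                       if "tlb_remove_table_rcu" in frames)

print(f"resolved samples:            {resolved_total}")
print(f"via rcu_do_batch:            {rcu_total} "
      f"({rcu_total / resolved_total:.1%} of resolved)")
print(f"via tlb_remove_table_rcu:    {table_free_total}")
print(f"unresolved samples:          {unresolved_samples}")
```

About half of the resolved samples come from the RCU callback path, but the unresolved samples outnumber the resolved ones by more than three to one, so this split is only a hint.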

It seems to be related to CONFIG_MMU_GATHER_RCU_TABLE_FREE?

I will continue to investigate further.

Thanks!