linux-kernel - Re: [linus:master] [x86] 4817f70c25: stress-ng.mmapaddr.ops_per

Message-ID: <14159fb4-0c59-4653-9265-73f415e70063@bytedance.com>
Date: Wed, 29 Jan 2025 16:14:01 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: Rik van Riel <riel@...riel.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
 David Hildenbrand <david@...hat.com>,
 kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
 lkp@...el.com, linux-kernel@...r.kernel.org,
 Andrew Morton <akpm@...ux-foundation.org>,
 Dave Hansen <dave.hansen@...ux.intel.com>, Andy Lutomirski
 <luto@...nel.org>, Catalin Marinas <catalin.marinas@....com>,
 David Rientjes <rientjes@...gle.com>, Hugh Dickins <hughd@...gle.com>,
 Jann Horn <jannh@...gle.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 Matthew Wilcox <willy@...radead.org>, Mel Gorman <mgorman@...e.de>,
 Muchun Song <muchun.song@...ux.dev>, Peter Xu <peterx@...hat.com>,
 Will Deacon <will@...nel.org>, Zach O'Keefe <zokeefe@...gle.com>,
 Dan Carpenter <dan.carpenter@...aro.org>
Subject: Re: [linus:master] [x86] 4817f70c25: stress-ng.mmapaddr.ops_per_sec
 63.0% regression

Hi Rik,

On 2025/1/29 02:35, Rik van Riel wrote:
> On Wed, 2025-01-29 at 01:06 +0800, Qi Zheng wrote:
>>
>> I did reproduce the performance regression using the following test
>> program:
>>
>> stress-ng --timeout 60 --times --verify --metrics --no-rand-seed
>> --mmapaddr 64
>>
>> And its call stack is as follows:
>>
>> bpftrace -e 'k:_raw_spin_unlock_irqrestore {@[kstack,comm]=count();}
>> interval:s:1 {exit();}'
>>
>> @[
>> _raw_spin_unlock_irqrestore+5
>> free_one_page+85
>> rcu_do_batch+424
>> rcu_core+401
>> handle_softirqs+204
>> irq_exit_rcu+208
> 
> That looks like the RCU freeing somehow bypassing the
> per-cpu-pages, and hitting the zone->lock at page free
> time, while regular freeing usually puts pages in the
> CPU-local free page cache, without the lock?

Take the following call stack as an example:

@[
_raw_spin_unlock_irqrestore+5
free_one_page+85
tlb_remove_table_rcu+140
rcu_do_batch+424
rcu_core+401
handle_softirqs+204
irq_exit_rcu+208
sysvec_apic_timer_interrupt+113
asm_sysvec_apic_timer_interrupt+26
_raw_spin_unlock_irqrestore+29
get_page_from_freelist+2014
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
get_free_pages_noprof+17
__x64_sys_mincore+141
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 5301

It looks like the following happened:

get_page_from_freelist
--> rmqueue
     --> rmqueue_pcplist
         --> pcp_spin_trylock (hold the pcp lock)
             __rmqueue_pcplist
             --> rmqueue_bulk
                 --> spin_lock_irqsave(&zone->lock)
                     __rmqueue
                     spin_unlock_irqrestore(&zone->lock)

                     <run softirq at this time>

                     tlb_remove_table_rcu
                     --> free_frozen_pages
                         --> pcp = pcp_spin_trylock (failed!!!)
                             if (!pcp)
                                 free_one_page

It seems that the softirq interrupted a context that already held the
pcp lock, so when tlb_remove_table_rcu() runs, the pcp_spin_trylock() in
free_frozen_pages() fails. The free then bypasses the PCP and calls
free_one_page() directly, which makes the zone lock a hot spot.

As for regular freeing, it is not performed in softirq context, so the
above situation does not occur.

Right?
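
To make the suspected sequence concrete, here is a minimal user-space
sketch of it. It is only a single-threaded model under my assumptions,
not kernel code: every sim_* name is hypothetical, the pcp lock is
reduced to a boolean, and the zone lock is reduced to a counter of how
often it would have been taken.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; these are NOT the kernel's real structures. */
struct sim_pcp {
	bool locked;		/* models the per-CPU pcp lock */
	int cached_pages;	/* pages parked in the CPU-local list */
};

struct sim_zone {
	int lock_acquisitions;	/* how often the zone lock was taken */
	int free_pages;		/* pages returned straight to the zone */
};

/* Models a trylock: fails if the lock is already held on this CPU. */
bool sim_pcp_trylock(struct sim_pcp *pcp)
{
	if (pcp->locked)
		return false;
	pcp->locked = true;
	return true;
}

void sim_pcp_unlock(struct sim_pcp *pcp)
{
	pcp->locked = false;
}

/*
 * Models the free path described above: use the pcp list if the
 * trylock succeeds, otherwise fall back to the zone lock.
 */
void sim_free_page(struct sim_pcp *pcp, struct sim_zone *zone)
{
	if (sim_pcp_trylock(pcp)) {
		pcp->cached_pages++;
		sim_pcp_unlock(pcp);
		return;
	}
	/* Fallback: bypass the pcp list and take the zone lock. */
	zone->lock_acquisitions++;
	zone->free_pages++;
}

/*
 * Models an allocation that holds the pcp lock, refills from the
 * zone, and is then interrupted by a softirq that frees
 * nr_rcu_frees pages on the same CPU.
 */
void sim_alloc_interrupted_by_softirq(struct sim_pcp *pcp,
				      struct sim_zone *zone,
				      int nr_rcu_frees)
{
	bool got = sim_pcp_trylock(pcp);

	assert(got);			/* allocation owns the pcp lock */
	zone->lock_acquisitions++;	/* refill takes the zone lock once */

	/* <run softirq at this time>: every free fails the trylock. */
	for (int i = 0; i < nr_rcu_frees; i++)
		sim_free_page(pcp, zone);

	sim_pcp_unlock(pcp);
}
```

In this model, a sim_free_page() call made outside the interrupted
window lands in the pcp list without touching the zone lock, while each
free issued from inside sim_alloc_interrupted_by_softirq() takes the
zone lock once. That is the same pattern as the call stack above.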

> 
> I'm not quite sure why this would be happening, though.
> 
> Maybe the RCU batches are too big for the PCPs to
> hold them?
> 
> If that is the case, chances are more code paths are
> hitting that issue, and we should just fix it, rather
> than trying to bypass it.
> 
> Maybe the reason is more simple than that?
> 
> I have not found a place where it explicitly bypasses
> the PCPs, but who knows?
>