Message-ID: <ab2b2391-09c1-4801-b9bd-04aa8f7f23e7@suse.cz>
Date: Mon, 19 Feb 2024 11:17:52 +0100
From: Vlastimil Babka <vbabka@...e.cz>
To: Chengming Zhou <zhouchengming@...edance.com>,
David Rientjes <rientjes@...gle.com>,
Jianfeng Wang <jianfeng.w.wang@...cle.com>
Cc: cl@...ux.com, penberg@...nel.org, iamjoonsoo.kim@....com,
akpm@...ux-foundation.org, roman.gushchin@...ux.dev, 42.hyeyoo@...il.com,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] slub: avoid scanning all partial slabs in get_slabinfo()
On 2/19/24 10:29, Chengming Zhou wrote:
> On 2024/2/19 16:30, Vlastimil Babka wrote:
>> On 2/18/24 20:25, David Rientjes wrote:
>>> On Thu, 15 Feb 2024, Jianfeng Wang wrote:
>>>
>>>> When reading "/proc/slabinfo", the kernel needs to report the number of
>>>> free objects for each kmem_cache. The current implementation relies on
>>>> count_partial() that counts the number of free objects by scanning each
>>>> kmem_cache_node's partial slab list and summing free objects from all
>>>> partial slabs in the list. This process must hold the per-kmem_cache_node
>>>> spinlock and disable IRQs. Consequently, it can block slab allocation
>>>> requests on other CPU cores and cause timeouts for network devices etc.
>>>> if the partial slab list is long. In production, even the NMI watchdog can
>>>> be triggered because some slab caches have a long partial list: e.g.,
>>>> for "buffer_head", the number of partial slabs was observed to be ~1M
>>>> in one kmem_cache_node. This problem was also observed by several
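
For context, the counting path described above looks roughly like this (a
simplified sketch modeled on mm/slub.c's count_partial()/count_free(); exact
helper and field names vary between kernel versions):

/*
 * Walk the whole per-node partial list with the node's list_lock held and
 * IRQs disabled, summing the free objects of every slab on it. With ~1M
 * partial slabs, this loop is what stalls the other CPUs.
 */
static unsigned long count_partial_free(struct kmem_cache_node *n)
{
	unsigned long flags, free = 0;
	struct slab *slab;

	spin_lock_irqsave(&n->list_lock, flags);
	list_for_each_entry(slab, &n->partial, slab_list)
		free += slab->objects - slab->inuse;
	spin_unlock_irqrestore(&n->list_lock, flags);

	return free;
}
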
>
> Not sure if this situation is normal? It may be very fragmented, right?
>
> SLUB completely depends on the timing order to place partial slabs in the
> node's list, which may be suboptimal in some cases. Maybe we could introduce
> an anti-fragmentation mechanism like the fullness grouping in zsmalloc, i.e.
> have multiple partial lists grouped by fullness? Just some random thoughts... :)
Most likely that wouldn't be feasible. Freeing to a slab that is already on
the partial list is just a cmpxchg128 (unless the slab becomes empty), and the
additional list manipulation needed to maintain the grouping would kill that
performance.
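
To illustrate why that fast path is so cheap, here is a stand-alone user-space
sketch of the idea (my own illustration, not the actual mm/slub.c code; the
struct and function names below are made up): the freed object is pushed onto
the slab's freelist and the inuse count is dropped by a single wide
compare-exchange on the (freelist, counter) pair, with no lock taken and no
list touched.

/* compiles stand-alone, e.g.: cc -O2 -mcx16 -c free_sketch.c
 * (link with -latomic if used in a full program) */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* 16-byte state updated as one unit, like SLUB's freelist/counters pair. */
struct freelist_state {
	void	 *freelist;	/* head of the slab's free-object list */
	uint64_t  inuse;	/* allocated objects remaining in the slab */
};

struct slab_like {
	_Atomic struct freelist_state state;
};

/*
 * Free one object back to its slab. The loop retries the 128-bit
 * compare-exchange until it succeeds; no spinlock, no list manipulation.
 * Returns true when the slab just became empty; only then would the
 * expensive, locked partial-list handling be needed.
 */
bool free_to_slab(struct slab_like *s, void *object, void **freeptr)
{
	struct freelist_state old, next;

	old = atomic_load(&s->state);
	do {
		next = old;
		*freeptr = old.freelist;	/* link object in front of the old head */
		next.freelist = object;
		next.inuse = old.inuse - 1;
	} while (!atomic_compare_exchange_weak(&s->state, &old, next));

	return next.inuse == 0;
}

Adding fullness grouping would mean that this compare-exchange alone is no
longer enough: many frees would also have to take the node's list_lock to
move the slab between fullness lists, which is exactly the cost being avoided.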