linux-kernel - Re: [External] : Re: [PATCH] slub: limit number of slabs to scan in count


Message-ID: <a8e208fb-7842-4bca-9d2d-3aae21da030c@oracle.com>
Date: Fri, 12 Apr 2024 10:29:55 -0700
From: Jianfeng Wang <jianfeng.w.wang@...cle.com>
To: Vlastimil Babka <vbabka@...e.cz>,
        "Christoph Lameter (Ampere)" <cl@...ux.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org, penberg@...nel.org,
        rientjes@...gle.com, iamjoonsoo.kim@....com, akpm@...ux-foundation.org,
        junxiao.bi@...cle.com
Subject: Re: [External] : Re: [PATCH] slub: limit number of slabs to scan in
 count_partial()



On 4/12/24 12:48 AM, Vlastimil Babka wrote:
> On 4/11/24 7:02 PM, Christoph Lameter (Ampere) wrote:
>> On Thu, 11 Apr 2024, Jianfeng Wang wrote:
>>
>>> So, the fix is to limit the number of slabs to scan in
>>> count_partial(), and output an approximated result if the list is too
>>> long. Default to 10000 which should be enough for most sane cases.
>>
>>
>> That is a creative approach. The problem though is that objects on the 
>> partial lists are kind of sorted. The partial slabs with only a few 
>> objects available are at the start of the list so that allocations cause 
>> them to be removed from the partial list fast. Full slabs do not need to 
>> be tracked on any list.
>>
>> The partial slabs with few objects are put at the end of the partial list 
>> in the hope that the few objects remaining will also be freed which would 
>> allow the freeing of the slab folio.
>>
>> So the object density may be higher at the beginning of the list.
>>
>> kmem_cache_shrink() will explicitly sort the partial lists to put the 
>> partial pages in that order.
>>
>> Can you run some tests showing the difference between the estimation and 
>> the real count?

Yes.
On a server with one NUMA node, I created a workload that uses many dentry
objects. For "dentry", the partial list holds slightly more than 250000 slabs.
I then compared my approach of scanning N slabs from the list's head vs. the
original approach of scanning the full list. I got both results by running the
new and the original count_partial() and printing them in /proc/slabinfo.

N = 10000
my_result = 4741651
org_result = 4744966
diff = (org_result - my_result) / org_result = 0.00069 = 0.069 %

Increasing N further to 25000 only slightly improves the accuracy:
N = 15000 -> diff =  0.02 %
N = 20000 -> diff =  0.01 %
N = 25000 -> diff = -0.017 %

Based on these measurements, I think the difference between the estimation and
the real count is very small (i.e. less than 0.1% for N = 10000). The
benefit is significant: shorter execution time for get_slabinfo(), and no more
soft lockups or crashes caused by count_partial().
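
To make the comparison above concrete, here is a minimal user-space sketch of
the capped scan and of the diff formula. It is not the patch itself: struct
slab_stub, count_free_exact(), count_free_approx() and relative_diff() are
hypothetical names, the partial list is modelled as a plain array, and scaling
the partial sum by nr_partial / scanned is an assumed way of producing the
approximated result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a slab on a node's partial list. */
struct slab_stub {
	unsigned int objects;	/* objects the slab can hold */
	unsigned int inuse;	/* objects currently allocated */
};

/* Exact count: walk the whole list and sum the free objects. */
unsigned long count_free_exact(const struct slab_stub *list, size_t nr_partial)
{
	unsigned long x = 0;
	size_t i;

	for (i = 0; i < nr_partial; i++)
		x += list[i].objects - list[i].inuse;
	return x;
}

/*
 * Approximation: scan at most max_scan slabs from the head, then scale
 * the partial sum up to the full list length (assumed extrapolation).
 */
unsigned long count_free_approx(const struct slab_stub *list,
				size_t nr_partial, size_t max_scan)
{
	size_t scanned = nr_partial < max_scan ? nr_partial : max_scan;
	unsigned long x = count_free_exact(list, scanned);

	if (scanned == 0 || scanned == nr_partial)
		return x;	/* short list: the result is exact */
	return (unsigned long)((unsigned long long)x * nr_partial / scanned);
}

/* diff = (org_result - my_result) / org_result, as used above. */
double relative_diff(unsigned long exact, unsigned long approx)
{
	assert(exact != 0);
	return ((double)exact - (double)approx) / (double)exact;
}
```

With the numbers reported above, relative_diff(4744966, 4741651) evaluates to
a little under 0.0007. Because only the head is sampled, the estimate depends
on how the list happens to be ordered.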

> 
> Maybe we could also get a more accurate picture by counting N slabs from the
> head and N from the tail and approximating from both. Also not perfect, but
> could be able to answer the question if the kmem_cache is significantly
> fragmented. Which is probably the only information we can get from the
> slabinfo <active_objs> vs <num_objs>. IIRC the latter is always accurate,
> the former never because of cpu slabs, so we never know how many objects are
> exactly in use. By comparing both we can get an idea of the fragmentation,
> and if this change won't make that estimate significantly worse, it should
> be acceptable.
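
The head-and-tail idea quoted above could look roughly like the following
user-space sketch. It only illustrates the suggestion and is not kernel code:
struct slab_stub, free_in() and count_free_head_tail() are hypothetical names,
the partial list is modelled as an array, and the linear scaling by
nr_partial / (2 * n) is an assumption about how the two samples would be
combined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a slab on a node's partial list. */
struct slab_stub {
	unsigned int objects;	/* objects the slab can hold */
	unsigned int inuse;	/* objects currently allocated */
};

static unsigned long free_in(const struct slab_stub *s)
{
	assert(s->inuse <= s->objects);
	return s->objects - s->inuse;
}

/*
 * Sample n slabs from the head and n from the tail, then scale the sum
 * to the full list length. n == 0 is treated here as "no limit".
 */
unsigned long count_free_head_tail(const struct slab_stub *list,
				   size_t nr_partial, size_t n)
{
	unsigned long x = 0;
	size_t i;

	/* Short list (or no limit): scan everything, result is exact. */
	if (n == 0 || nr_partial <= 2 * n) {
		for (i = 0; i < nr_partial; i++)
			x += free_in(&list[i]);
		return x;
	}

	for (i = 0; i < n; i++)				/* head sample */
		x += free_in(&list[i]);
	for (i = nr_partial - n; i < nr_partial; i++)	/* tail sample */
		x += free_in(&list[i]);

	return (unsigned long)((unsigned long long)x * nr_partial / (2 * n));
}
```

Sampling both ends keeps the scan bounded at 2 * n slabs while making the
estimate less sensitive to a list whose object density differs between head
and tail.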