Message-ID: <653d0bf6-9acc-4625-a307-af195437e744@suse.cz>
Date: Tue, 16 Apr 2024 22:14:42 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: Jianfeng Wang <jianfeng.w.wang@...cle.com>,
"Christoph Lameter (Ampere)" <cl@...ux.com>
Cc: "linux-mm@...ck.org" <linux-mm@...ck.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"penberg@...nel.org" <penberg@...nel.org>,
"rientjes@...gle.com" <rientjes@...gle.com>,
"iamjoonsoo.kim@....com" <iamjoonsoo.kim@....com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
Junxiao Bi <junxiao.bi@...cle.com>
Subject: Re: [PATCH] slub: limit number of slabs to scan in count_partial()
On 4/16/24 8:58 PM, Jianfeng Wang wrote:
>
>
> On 4/15/24 12:35 AM, Vlastimil Babka wrote:
>> On 4/13/24 3:17 AM, Jianfeng Wang wrote:
>>>
>>>> On Apr 12, 2024, at 1:44 PM, Jianfeng Wang <jianfeng.w.wang@...cle.com> wrote:
>>>>
>>>> On 4/12/24 1:20 PM, Vlastimil Babka wrote:
>>>>> On 4/12/24 7:29 PM, Jianfeng Wang wrote:
>>>>>>
>>>>>> On 4/12/24 12:48 AM, Vlastimil Babka wrote:
>>>>>>> On 4/11/24 7:02 PM, Christoph Lameter (Ampere) wrote:
>>>>>>>> On Thu, 11 Apr 2024, Jianfeng Wang wrote:
>>>>>>>>
>>>>>>>>> So, the fix is to limit the number of slabs to scan in
>>>>>>>>> count_partial(), and output an approximate result if the list is too
>>>>>>>>> long. Default to 10000, which should be enough for most sane cases.
>>>>>>>>
>>>>>>>>
>>>>>>>> That is a creative approach. The problem though is that objects on the
>>>>>>>> partial lists are kind of sorted. The partial slabs with only a few
>>>>>>>> objects available are at the start of the list so that allocations cause
>>>>>>>> them to be removed from the partial list fast. Full slabs do not need to
>>>>>>>> be tracked on any list.
>>>>>>>>
>>>>>>>> The partial slabs with few objects are put at the end of the partial list
>>>>>>>> in the hope that the few objects remaining will also be freed which would
>>>>>>>> allow the freeing of the slab folio.
>>>>>>>>
>>>>>>>> So the object density may be higher at the beginning of the list.
>>>>>>>>
>>>>>>>> kmem_cache_shrink() will explicitly sort the partial lists to put the
>>>>>>>> partial pages in that order.
>>>>>>>>
>>>
>>> Realized that I could do "echo 1 > /sys/kernel/slab/dentry/shrink" to sort the list explicitly.
>>> After that, the numbers become:
>>> N = 10000 -> diff = 7.1 %
>>> N = 20000 -> diff = 5.7 %
>>> N = 25000 -> diff = 5.4 %
>>> So, expecting ~5-7% difference after shrinking.
>>>
>>>>>>>> Can you run some tests showing the difference between the estimation and
>>>>>>>> the real count?
>>>>>>
>>>>>> Yes.
>>>>>> On a server with one NUMA node, I created a test case that uses many
>>>>>> dentry objects.
>>>>>
>>>>> Could you describe in more detail how you make the dentry cache grow such
>>>>> a long partial slab list? Thanks.
>>>>>
>>>>
>>>> I utilized the fact that creating a folder creates a new dentry object,
>>>> while deleting a folder deletes all of its sub-folders' dentry objects.
>>>>
>>>> Then I created N folders, each containing M empty sub-folders. These
>>>> operations consume a large number of dentry objects in roughly sequential
>>>> order, so their slabs were very likely to be full slabs. After all the
>>>> folders were created, I deleted a subset of the N folders (i.e., one out
>>>> of every two). This created many holes, which turned a subset of the full
>>>> slabs into partial slabs.
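For reference, a minimal userspace sketch of that reproducer (untested; the
directory counts and the /tmp/dcache-test path are made-up placeholders, not
the values from your run):
---
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define NDIRS 10000     /* top-level folders (placeholder value) */
#define NSUBS 100       /* empty sub-folders per folder (placeholder value) */

int main(void)
{
        char path[128];
        int i, j;

        mkdir("/tmp/dcache-test", 0755);

        /*
         * Create N folders, each with M empty sub-folders. The dentries are
         * allocated in roughly sequential order, so their slabs end up full.
         */
        for (i = 0; i < NDIRS; i++) {
                snprintf(path, sizeof(path), "/tmp/dcache-test/d%d", i);
                mkdir(path, 0755);
                for (j = 0; j < NSUBS; j++) {
                        snprintf(path, sizeof(path),
                                 "/tmp/dcache-test/d%d/s%d", i, j);
                        mkdir(path, 0755);
                }
        }

        /*
         * Delete every other folder (sub-folders first, since rmdir() only
         * removes empty directories). Freeing those dentries punches holes
         * into the slabs and turns many full slabs into partial slabs.
         */
        for (i = 0; i < NDIRS; i += 2) {
                for (j = 0; j < NSUBS; j++) {
                        snprintf(path, sizeof(path),
                                 "/tmp/dcache-test/d%d/s%d", i, j);
                        rmdir(path);
                }
                snprintf(path, sizeof(path), "/tmp/dcache-test/d%d", i);
                rmdir(path);
        }
        return 0;
}
---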
>>
>> Thanks, right, so that's quite a deterministic way to achieve long partial
>> lists with a nearly uniform free/used ratio, so it's no wonder the resulting
>> accuracy is good and the diff is very small. But in practice the workloads
>> that lead to long lists will not be so uniform. The result after shrinking
>> shows what happens when there's a bias in which slabs we inspect due to the
>> sorting. Still, most of the slabs here have a near-uniform free/used ratio,
>> so the sorting doesn't make much of a difference. Another workload might.
>>
>> So what happens if you inspect X slabs from the head and X from the tail as
>> I suggested? That should help your test case even after you sort, and also
>> should in theory be more accurate even for less uniform workloads.
>
> Yes, the approach of counting from both directions and then approximating
> works better after sorting the partial list.
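To make the intuition concrete, here is a throwaway userspace simulation (plain
C, not kernel code) of a partial list whose per-slab free-object counts grow
linearly from head to tail, i.e. roughly the ordering Christoph described; the
list length, objects per slab and the linear distribution are made up purely
for illustration:
---
#include <stdio.h>

#define NSLABS  100000  /* length of the simulated partial list */
#define OBJS    21      /* objects per slab */
#define SAMPLE  5000    /* slabs sampled from each end */

int main(void)
{
        /*
         * free_objs[i] = free objects in slab i. The list is "sorted" so that
         * nearly-full slabs (few free objects) sit at the head and
         * nearly-empty slabs (many free objects) sit at the tail.
         */
        static int free_objs[NSLABS];
        long long total = 0, head = 0, both = 0;
        long long est_head, est_both;
        int i;

        for (i = 0; i < NSLABS; i++) {
                free_objs[i] = 1 + (OBJS - 2) * i / (NSLABS - 1);
                total += free_objs[i];
        }

        for (i = 0; i < SAMPLE; i++) {
                head += free_objs[i];                             /* head only */
                both += free_objs[i] + free_objs[NSLABS - 1 - i]; /* head+tail */
        }

        /* Scale each sample up to the full list length. */
        est_head = head * (long long)NSLABS / SAMPLE;
        est_both = both * (long long)NSLABS / (2 * SAMPLE);

        printf("true total     : %lld\n", total);
        printf("head-only est. : %lld (%+.1f%%)\n", est_head,
               100.0 * (est_head - total) / total);
        printf("head+tail est. : %lld (%+.1f%%)\n", est_both,
               100.0 * (est_both - total) / total);
        return 0;
}
---
With these made-up numbers the head-only estimate is biased far low, while the
head-plus-tail estimate lands very close to the true total; real lists are less
extreme, but the direction of the bias is the same.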
Yeah, I think we could go with that approach then. Let's do 5000 from each
side. You can check whether n->nr_partial < 10000 and, if so, just scan the
whole list in a single direction with no approximation; otherwise scan 5000
from each side and approximate. I think the code as you show it below will
scan some slabs in the middle of the list twice if there are between 5000 and
10000 slabs on the list, so checking n->nr_partial would avoid that. Thanks!
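Completely untested sketch of that structure, reusing the names from your
patch below (here MAX_PARTIAL_TO_SCAN is the per-direction limit, i.e. 5000):
---
#define MAX_PARTIAL_TO_SCAN 5000        /* per direction */

static unsigned long count_partial(struct kmem_cache_node *n,
                                   int (*get_count)(struct slab *))
{
        unsigned long flags;
        unsigned long x = 0;
        unsigned long scanned = 0;
        struct slab *slab;

        spin_lock_irqsave(&n->list_lock, flags);
        if (n->nr_partial <= 2 * MAX_PARTIAL_TO_SCAN) {
                /* Short list: count every slab, no approximation. */
                list_for_each_entry(slab, &n->partial, slab_list)
                        x += get_count(slab);
        } else {
                /*
                 * Long list: sample MAX_PARTIAL_TO_SCAN slabs from each end
                 * and scale by the list length. No slab is visited twice,
                 * because nr_partial > 2 * MAX_PARTIAL_TO_SCAN.
                 */
                list_for_each_entry(slab, &n->partial, slab_list) {
                        x += get_count(slab);
                        if (++scanned == MAX_PARTIAL_TO_SCAN)
                                break;
                }
                scanned = 0;
                list_for_each_entry_reverse(slab, &n->partial, slab_list) {
                        x += get_count(slab);
                        if (++scanned == MAX_PARTIAL_TO_SCAN)
                                break;
                }
                x = mult_frac(x, n->nr_partial, 2 * MAX_PARTIAL_TO_SCAN);
                x = min(x, node_nr_objs(n));
        }
        spin_unlock_irqrestore(&n->list_lock, flags);
        return x;
}
---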
> Here is what I did.
> ---
> +/* Scan limit; defaults to 10000, varied between 1000 and 20000 below. */
> +#define MAX_PARTIAL_TO_SCAN 10000
> +
> +static unsigned long count_partial(struct kmem_cache_node *n,
> +                                   int (*get_count)(struct slab *))
> +{
> +        unsigned long flags;
> +        unsigned long x = 0;
> +        unsigned long scanned = 0;
> +        struct slab *slab;
> +
> +        spin_lock_irqsave(&n->list_lock, flags);
> +        list_for_each_entry(slab, &n->partial, slab_list) {
> +                x += get_count(slab);
> +                if (++scanned > MAX_PARTIAL_TO_SCAN)
> +                        break;
> +        }
> +
> +        if (scanned > MAX_PARTIAL_TO_SCAN) {
> +                scanned = 0;
> +                list_for_each_entry_reverse(slab, &n->partial, slab_list) {
> +                        x += get_count(slab);
> +                        if (++scanned > MAX_PARTIAL_TO_SCAN) {
> +                                /* Approximate total count of objects */
> +                                x = mult_frac(x, n->nr_partial, scanned * 2);
> +                                x = min(x, node_nr_objs(n));
> +                                break;
> +                        }
> +                }
> +        }
> +        spin_unlock_irqrestore(&n->list_lock, flags);
> +        return x;
> +}
> ---
>
> Results:
> ---
> * Pre-shrink:
> MAX_PARTIAL_TO_SCAN | Diff (single-direction) | Diff (double-direction) |
> 1000 | 0.43 % | 0.80 % |
> 5000 | 0.06 % | 0.16 % |
> 10000 | 0.02 % | -0.003 % |
> 20000 | 0.009 % | 0.03 % |
>
> * After-shrink:
> MAX_PARTIAL_TO_SCAN | Diff (single-direction) | Diff (double-direction) |
> 1000 | 12.46 % | 3.60 % |
> 5000 | 5.38 % | 0.22 % |
> 10000 | 4.99 % | -0.06 % |
> 20000 | 4.86 % | -0.17 % |
> ---
>
> For MAX_PARTIAL_TO_SCAN >= 5000, count_partial() returns the exact
> object count whenever the partial list is no longer than
> MAX_PARTIAL_TO_SCAN slabs. Otherwise, it counts MAX_PARTIAL_TO_SCAN
> slabs from the head and from the tail, and the resulting approximation
> differs from the exact count by less than 1%.
>
> With a slightly larger limit (like 10000), count_partial() should
> produce the exact number in most cases (those that won't lead to a
> lockup) and avoid lockups with a good estimate otherwise.