Message-ID: <cee707e7-b7f7-4c21-8887-2cb69d73df93@suse.cz>
Date: Fri, 24 Oct 2025 10:55:20 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: Harry Yoo <harry.yoo@...cle.com>,
 Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
 Christoph Lameter <cl@...two.org>, David Rientjes <rientjes@...gle.com>,
 Roman Gushchin <roman.gushchin@...ux.dev>,
 Alexei Starovoitov <ast@...nel.org>, linux-mm <linux-mm@...ck.org>,
 LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] slab: fix slab accounting imbalance due to
 defer_deactivate_slab()

On 10/24/25 04:03, Harry Yoo wrote:
> On Thu, Oct 23, 2025 at 06:17:19PM -0700, Alexei Starovoitov wrote:
>> On Thu, Oct 23, 2025 at 5:00 PM Harry Yoo <harry.yoo@...cle.com> wrote:
>> >
>> > On Thu, Oct 23, 2025 at 04:13:37PM -0700, Alexei Starovoitov wrote:
>> > > On Thu, Oct 23, 2025 at 5:01 AM Vlastimil Babka <vbabka@...e.cz> wrote:
>> > > >
>> > > > Since commit af92793e52c3 ("slab: Introduce kmalloc_nolock() and
>> > > > kfree_nolock().") there's a possibility in alloc_single_from_new_slab()
>> > > > that we discard the newly allocated slab if we can't spin and we fail to
>> > > > trylock. As a result we don't perform inc_slabs_node() later in the
>> > > > function. Instead we perform a deferred deactivate_slab() which can
>> > > > either put the unaccounted slab on the partial list, or discard it
>> > > > immediately while performing dec_slabs_node(). Either way will cause an
>> > > > accounting imbalance.
>> > > >
>> > > > Fix this by not marking the slab as frozen, and using free_slab()
>> > > > instead of deactivate_slab() for non-frozen slabs in
>> > > > free_deferred_objects(). For CONFIG_SLUB_TINY, that's the only possible
>> > > > case. By not using discard_slab() we avoid dec_slabs_node().
>> > > >
>> > > > Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
>> > > > Signed-off-by: Vlastimil Babka <vbabka@...e.cz>
>> > > > ---
>> > > > Changes in v2:
>> > > > - Fix the problem differently. Harry pointed out that we can't move
>> > > >   inc_slabs_node() outside of list_lock protected regions as that would
>> > > >   reintroduce issues fixed by commit c7323a5ad078
>> > > > - Link to v1: https://patch.msgid.link/20251022-fix-slab-accounting-v1-1-27870ec363ce@suse.cz
>> > > > ---
>> > > >  mm/slub.c | 8 +++++---
>> > > >  1 file changed, 5 insertions(+), 3 deletions(-)
>> > > >
>> > > > diff --git a/mm/slub.c b/mm/slub.c
>> > > > index 23d8f54e9486..87a1d2f9de0d 100644
>> > > > --- a/mm/slub.c
>> > > > +++ b/mm/slub.c
>> > > > @@ -3422,7 +3422,6 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
>> > > >
>> > > >         if (!allow_spin && !spin_trylock_irqsave(&n->list_lock, flags)) {
>> > > >                 /* Unlucky, discard newly allocated slab */
>> > > > -               slab->frozen = 1;
>> > > >                 defer_deactivate_slab(slab, NULL);
>> > > >                 return NULL;
>> > > >         }
>> > > > @@ -6471,9 +6470,12 @@ static void free_deferred_objects(struct irq_work *work)
>> > > >                 struct slab *slab = container_of(pos, struct slab, llnode);
>> > > >
>> > > >  #ifdef CONFIG_SLUB_TINY
>> > > > -               discard_slab(slab->slab_cache, slab);
>> > > > +               free_slab(slab->slab_cache, slab);
>> > > >  #else
>> > > > -               deactivate_slab(slab->slab_cache, slab, slab->flush_freelist);
>> > > > +               if (slab->frozen)
>> > > > +                       deactivate_slab(slab->slab_cache, slab, slab->flush_freelist);
>> > > > +               else
>> > > > +                       free_slab(slab->slab_cache, slab);
>> > >
>> > > A bit odd to use the 'frozen' flag as such a signal.
>> > > I guess I'm worried that a truly !frozen slab can come here
>> > > via ___slab_alloc() -> retry_load_slab: -> defer_deactivate_slab().
>> > > And things will be much worse than just accounting.
>> >
>> > But the cpu slab must have been frozen before it's attached to
>> > c->slab?

Note that deactivate_slab() contains VM_BUG_ON(!old.frozen);
we would have seen this triggered if we were passing unfrozen slabs to
(defer_)deactivate_slab(). I assume it's also why the "unlucky, discard"
code marks it frozen before calling defer_deactivate_slab() (and this patch
removes that).
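
For reference, that check sits in deactivate_slab(); roughly (abridged
sketch from memory, not the verbatim mm/slub.c source):

static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
			    void *freelist)
{
	struct slab old, new;

	/* snapshot the slab's freelist and counters (frozen lives in
	 * the counters word) */
	old.freelist = slab->freelist;
	old.counters = slab->counters;
	VM_BUG_ON(!old.frozen);	/* only frozen (cpu) slabs belong here */

	new.frozen = 0;
	/* ... cmpxchg old -> new, then put the slab on the node's
	 * partial/full list or discard it ... */
}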

>> Is it?
>> the path is
>> c = slub_get_cpu_ptr(s->cpu_slab);
>> if (unlikely(c->slab)) {
>>    struct slab *flush_slab = c->slab;
>>    defer_deactivate_slab(flush_slab, ...);
>> 
>> I don't see why it would be frozen.

c->slab is always frozen; that's an invariant.
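
Paraphrasing the path Alexei quoted (abridged sketch) - by the time
retry_load_slab installs a slab into c->slab, that slab has already been
frozen, so anything we later flush out of c->slab is frozen too:

retry_load_slab:

	local_lock_irqsave(&s->cpu_slab->lock, flags);
	if (unlikely(c->slab)) {
		void *flush_freelist = c->freelist;
		struct slab *flush_slab = c->slab;

		c->slab = NULL;
		c->freelist = NULL;
		/* flush_slab came out of c->slab, hence frozen */
		defer_deactivate_slab(flush_slab, flush_freelist);
		goto retry_load_slab;
	}
	c->slab = slab;		/* slab was frozen before reaching here */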

> 
> Oh god. I was going to say the cpu slab is always frozen. It has been
> true for a very long time, but it seems it's not true after commit 90b1e56641
> ("mm/slub: directly load freelist from cpu partial slab in the likely case").

It's still true. That commit only removes VM_BUG_ON(!new.frozen); where
"new" is in fact the old state - when a slab is on the cpu partial list
it's not yet frozen. get_freelist() then sets new.frozen = freelist != NULL;
and we know that freelist can't be NULL for a slab on the cpu partial list.
The commit even added VM_BUG_ON(!freelist); on the get_freelist() result for
this case.
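
i.e. roughly (abridged; helper names from memory, they may differ
between versions):

static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
{
	struct slab new;
	unsigned long counters;
	void *freelist;

	do {
		freelist = slab->freelist;
		counters = slab->counters;
		new.counters = counters;
		new.inuse = slab->objects;
		/* frozen iff there are objects left to hand out */
		new.frozen = freelist != NULL;
	} while (!__slab_update_freelist(s, slab,
		freelist, counters,
		NULL, new.counters,
		"get_freelist"));

	return freelist;
}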

So I think we're fine?

> So I think you're right that a non-frozen slab can go through
> free_slab() in free_deferred_objects()...
> 
> But fixing this should be simple. Add something like
> freeze_and_get_freelist() and call it when SLUB takes a slab from the
> per-cpu partial slab list?
> 
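
For completeness, if that ever became necessary, I'd imagine it as a
get_freelist() variant that freezes unconditionally - hypothetical
sketch, no such helper exists in the tree today:

static inline void *freeze_and_get_freelist(struct kmem_cache *s,
					    struct slab *slab)
{
	struct slab new;
	unsigned long counters;
	void *freelist;

	do {
		freelist = slab->freelist;
		counters = slab->counters;
		new.counters = counters;
		new.inuse = slab->objects;
		new.frozen = 1;	/* freeze even if the freelist is empty */
	} while (!__slab_update_freelist(s, slab,
		freelist, counters,
		NULL, new.counters,
		"freeze_and_get_freelist"));

	return freelist;
}

But per the above I don't think we need it.
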
>> > > Maybe add
>> > >   inc_slabs_node(s, nid, slab->objects);
>> > > right before
>> > >   defer_deactivate_slab(slab, NULL);
>> > >   return NULL;
>> > >
>> > > I don't quite get why c7323a5ad078 is doing everything under n->list_lock.
>> > > It's been 3 years since then.
>> >
>> > When n->nr_slabs is inconsistent, validate_slab_node() might report an
>> > error (false positive) when someone wrote '1' to
>> > /sys/kernel/slab/<cache name>/validate
>> 
>> Ok. I see it now. It's that the actual number of elements on the n->full
>> list needs to match n->nr_slabs.
>> 
>> But then how is it not broken already?
>> I see that
>> alloc_single_from_new_slab()
>> unconditionally does inc_slabs_node(), but
> 
> It increments n->nr_slabs. It doesn't matter which list it's going to be
> added to, because it's the total number of slabs on that node.
> 
>> the slab itself is added either to the n->full or n->partial list.
> 
> and then n->nr_partial is also incremented if it's added to n->partial.
> 
>> And validate_slab_node() should be complaining already.
> 
> The debug routine checks if:
> - the number of slabs in n->partial == n->nr_partial
> - the number of slabs in n->full + n->partial == n->nr_slabs
> 
> under n->list_lock. So it's not broken?
> 
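
i.e., roughly - abridged from validate_slab_node(), which also
validates each slab's metadata while counting:

static int validate_slab_node(struct kmem_cache *s,
			      struct kmem_cache_node *n)
{
	unsigned long count = 0;
	struct slab *slab;
	unsigned long flags;

	spin_lock_irqsave(&n->list_lock, flags);

	list_for_each_entry(slab, &n->partial, slab_list)
		count++;
	if (count != n->nr_partial)
		pr_err("SLUB %s: %ld partial slabs counted but counter=%ld\n",
		       s->name, count, n->nr_partial);

	/* the full list is only maintained with debugging enabled */
	list_for_each_entry(slab, &n->full, slab_list)
		count++;
	if (count != node_nr_slabs(n))
		pr_err("SLUB: %s %ld slabs counted but counter=%ld\n",
		       s->name, count, node_nr_slabs(n));

	spin_unlock_irqrestore(&n->list_lock, flags);
	return count;
}
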
>> Anyway, I'm not arguing. Just trying to understand.
>> If you think the fix is fine, then go ahead.
> 

