Message-ID: <208e2b4a-cff5-432e-a330-74a8fc3a22cc@suse.cz>
Date: Tue, 22 Apr 2025 17:02:47 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: Suren Baghdasaryan <surenb@...gle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@...cle.com>,
Christoph Lameter <cl@...ux.com>, David Rientjes <rientjes@...gle.com>,
Roman Gushchin <roman.gushchin@...ux.dev>, Harry Yoo <harry.yoo@...cle.com>,
Uladzislau Rezki <urezki@...il.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, rcu@...r.kernel.org,
maple-tree@...ts.infradead.org
Subject: Re: [PATCH RFC v3 2/8] slab: add opt-in caching layer of percpu
sheaves
On 4/10/25 21:51, Suren Baghdasaryan wrote:
>> +static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
>> +{
>> + struct slub_percpu_sheaves *pcs;
>> +
>> + pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
>> +
>> + /* The cpu is not executing anymore so we don't need pcs->lock */
>> + sheaf_flush_unused(s, pcs->main);
>
> You are flushing pcs->main but sheaf_flush_unused() will account that
> in SHEAF_FLUSH_OTHER instead of SHEAF_FLUSH_MAIN. Perhaps
> sheaf_flush_unused() needs a parameter to indicate which counter to be
> incremented.
Hm, it's technically true, but it's a cpu hotremove operation, so it's rather
rare and not performance sensitive, and one could argue it's not a main sheaf
(of an active cpu) anymore. So I went with a comment rather than complicating
the code.
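Something along these lines (untested; the exact comment wording may still
change):

	/*
	 * The cpu is not executing anymore, so this is no longer the main
	 * sheaf of an active cpu; counting the flush as SHEAF_FLUSH_OTHER
	 * is acceptable here.
	 */
	sheaf_flush_unused(s, pcs->main);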
>> + if (pcs->spare) {
>> + sheaf_flush_unused(s, pcs->spare);
>> + free_empty_sheaf(s, pcs->spare);
>> + pcs->spare = NULL;
>> + }
>> +}
>> +
>> +static void pcs_destroy(struct kmem_cache *s)
>> +{
>> + int cpu;
>> +
>> + for_each_possible_cpu(cpu) {
>> + struct slub_percpu_sheaves *pcs;
>> +
>> + pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
>> +
>> + /* can happen when unwinding failed create */
>> + if (!pcs->main)
>> + continue;
>> +
>> + WARN_ON(pcs->spare);
>> +
>> + if (!WARN_ON(pcs->main->size)) {
>
> If pcs->main->size > 0, shouldn't we flush pcs->main? I understand
> it's not a normal situation but I think something like this would be
> more correct:
>
> if (WARN_ON(pcs->main->size))
> sheaf_flush_unused(s, pcs->main);
> free_empty_sheaf(s, pcs->main);
> pcs->main = NULL;
Initially I agreed, but then I realized we only reach this code when we have
already found that all slabs contain only free objects. So if we then find
some objects in sheaves, something must have gone terribly wrong and trying
to flush could only make things worse. I've added a comment instead.
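Roughly like this (untested sketch, the comment wording is illustrative):

	/*
	 * We have already verified that all slabs of this cache contain only
	 * free objects, so any objects still sitting in the main sheaf mean
	 * something has gone terribly wrong. Don't try to flush them, just
	 * warn and leak the sheaf.
	 */
	if (!WARN_ON(pcs->main->size)) {
		free_empty_sheaf(s, pcs->main);
		pcs->main = NULL;
	}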
>> + free_empty_sheaf(s, pcs->main);
>> + pcs->main = NULL;
>> + }
>> + }
>> +
>> + free_percpu(s->cpu_sheaves);
>> + s->cpu_sheaves = NULL;
>> +}
>> +
>> +static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
>> +{
>> + struct slab_sheaf *empty = NULL;
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(&barn->lock, flags);
>> +
>> + if (barn->nr_empty) {
>> + empty = list_first_entry(&barn->sheaves_empty,
>> + struct slab_sheaf, barn_list);
>> + list_del(&empty->barn_list);
>> + barn->nr_empty--;
>> + }
>> +
>> + spin_unlock_irqrestore(&barn->lock, flags);
>> +
>> + return empty;
>> +}
>> +
>> +static int barn_put_empty_sheaf(struct node_barn *barn,
>> + struct slab_sheaf *sheaf, bool ignore_limit)
>
> This ignore_limit in barn_put_empty_sheaf()/barn_put_full_sheaf() is
> sticking out and really used only inside rcu_free_sheaf() in the next
> patch. Every time I saw the return value of these functions ignored I
> had to remind myself that they pass ignore_limit=true, so the function
> can't really fail. Maybe we could get rid of this flag and do trimming
> of barn lists inside rcu_free_sheaf() separately in one go? IIUC
Good suggestion, I didn't realize the barn_put_*() functions are only used in
the fallback-due-to-a-race-or-cpu-migration paths, so I did as you suggested.
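With the ignore_limit flag gone the put side can't fail anymore and becomes a
void function, along these lines (untested, based on the code quoted above):

	static void barn_put_empty_sheaf(struct node_barn *barn,
					 struct slab_sheaf *sheaf)
	{
		unsigned long flags;

		spin_lock_irqsave(&barn->lock, flags);

		list_add(&sheaf->barn_list, &barn->sheaves_empty);
		barn->nr_empty++;

		spin_unlock_irqrestore(&barn->lock, flags);
	}

barn_put_full_sheaf() gets the same treatment.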
> because we ignore the limits in all other places, at the time of
> rcu_free_sheaf() we might end up with a barn having many more sheaves
> than the limit allows for, so trimming in bulk might be even more
> productive.
There are many caches with no rcu freeing happening, so making
rcu_free_sheaf() a bulk trimming point could be insufficient. I'd just leave
it for now and hope that ignoring the limits only due to races won't cause
barns to grow more than they can self-correct for by further
allocation/freeing.
>> +{
>> + unsigned long flags;
>> + int ret = 0;
>> +
>> + spin_lock_irqsave(&barn->lock, flags);
>> +
>> + if (!ignore_limit && barn->nr_empty >= MAX_EMPTY_SHEAVES) {
>> + ret = -E2BIG;
>> + } else {
>> + list_add(&sheaf->barn_list, &barn->sheaves_empty);
>> + barn->nr_empty++;
>> + }
>> +
>> + spin_unlock_irqrestore(&barn->lock, flags);
>> + return ret;
>> +}
>> +
>> +static int barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf,
>> + bool ignore_limit)
>
> Can this function be called for partially populated sheaves or only
> full ones? I think rcu_free_sheaf() in the next patch might end up
> calling it for a partially populated sheaf.
It can, but the code handles not-really-full sheaves fine and it's easier to
allow them than to try to be strict; there should just be a bit lower
efficiency in debug scenarios, and that's fine. Adding a comment.
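I.e. something like this above barn_put_full_sheaf() (exact wording to be
decided):

	/*
	 * The sheaf put here does not have to be completely full; callers in
	 * e.g. debug scenarios may pass a partially populated one. That only
	 * means slightly lower efficiency when it's later picked up, which
	 * is fine.
	 */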
>> +{
>> + unsigned long flags;
>> + int ret = 0;
>> +
>> + spin_lock_irqsave(&barn->lock, flags);
>> +
>> + if (!ignore_limit && barn->nr_full >= MAX_FULL_SHEAVES) {
>> + ret = -E2BIG;
>> + } else {
>> + list_add(&sheaf->barn_list, &barn->sheaves_full);
>> + barn->nr_full++;
>> + }
>> +
>> + spin_unlock_irqrestore(&barn->lock, flags);
>> + return ret;
>> +}
>> +
>> +/*
>> + * If a full sheaf is available, return it and put the supplied empty one to
>> + * barn. We ignore the limit on empty sheaves as the number of sheaves doesn't
>> + * change.
>
> I'm unclear why we ignore the limit here but not in
> barn_replace_full_sheaf(). Maybe because full sheaves consume more
> memory?
Yes.
> But then why do we mostly pass ignore_limit=true when invoking
> barn_put_full_sheaf()?
Not anymore with your suggestion.
> Explanation of this limit logic needs to be
> clarified.
>
>> +static __fastpath_inline
>> +unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>> +{
>> + struct slub_percpu_sheaves *pcs;
>> + struct slab_sheaf *main;
>> + unsigned int allocated = 0;
>> + unsigned int batch;
>> +
>> +next_batch:
>> + if (!localtry_trylock(&s->cpu_sheaves->lock))
>> + return allocated;
>> +
>> + pcs = this_cpu_ptr(s->cpu_sheaves);
>> +
>> + if (unlikely(pcs->main->size == 0)) {
>
> The above condition is unlikely only for the first batch. I think it's
> actually guaranteed on all consecutive batches once you do "goto
> next_batch", right?
Hm, yes, but perhaps bulk allocs larger than the batch size are rare enough
that it doesn't matter?
>> +static __fastpath_inline
>> +bool free_to_pcs(struct kmem_cache *s, void *object)
>> +{
>> + struct slub_percpu_sheaves *pcs;
>> +
>> +restart:
>> + if (!localtry_trylock(&s->cpu_sheaves->lock))
>> + return false;
>> +
>> + pcs = this_cpu_ptr(s->cpu_sheaves);
>> +
>> + if (unlikely(pcs->main->size == s->sheaf_capacity)) {
>> +
>> + struct slab_sheaf *empty;
>> +
>> + if (!pcs->spare) {
>> + empty = barn_get_empty_sheaf(pcs->barn);
>> + if (empty) {
>> + pcs->spare = pcs->main;
>> + pcs->main = empty;
>> + goto do_free;
>> + }
>> + goto alloc_empty;
>> + }
>> +
>> + if (pcs->spare->size < s->sheaf_capacity) {
>> + stat(s, SHEAF_SWAP);
>> + swap(pcs->main, pcs->spare);
>> + goto do_free;
>> + }
>> +
>> + empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
>
> This function reads easier now but if barn_replace_full_sheaf() could
> ignore the MAX_FULL_SHEAVES and barn list trimming could be done
> later, it would simplify this function even further.
It could increase barn lock contention to put full sheaves there over the
limit and then flush them afterwards, so I think it's better not to do that.
Especially with "slab: determine barn status racily outside of lock".
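To illustrate what I mean, the racy check is conceptually along these lines
(not the exact code from that patch; the surrounding flow is omitted):

	/*
	 * Peek at the barn counter without taking barn->lock; a stale value
	 * only means we occasionally make a suboptimal decision, which is
	 * acceptable for a heuristic.
	 */
	if (data_race(pcs->barn->nr_full) >= MAX_FULL_SHEAVES) {
		/* barn already looks full, don't bother trying to put there */
		...
	}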
>
> nit: I think the section below would be more readable if structured
> with fail handling blocks at the end. Something like this:
>
> next_batch:
> if (!localtry_trylock(&s->cpu_sheaves->lock))
> goto fallback;
>
> pcs = this_cpu_ptr(s->cpu_sheaves);
>
> if (likely(pcs->main->size < s->sheaf_capacity))
> goto do_free;
Nice suggestion, did that. Thanks.
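For reference, the rewritten fast path now reads roughly like this
(abbreviated sketch; the slow-path body is elided, the localtry_unlock()
pairing follows the rest of the series, and the stat/field names are
illustrative):

	restart:
		if (!localtry_trylock(&s->cpu_sheaves->lock))
			goto fallback;

		pcs = this_cpu_ptr(s->cpu_sheaves);

		if (likely(pcs->main->size < s->sheaf_capacity))
			goto do_free;

		/* ... handle a full main sheaf, possibly goto restart/fallback ... */

	do_free:
		/* the main sheaf now has room for the object */
		pcs->main->objects[pcs->main->size++] = object;

		localtry_unlock(&s->cpu_sheaves->lock);
		stat(s, FREE_PCS);
		return true;

	fallback:
		return false;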