linux-kernel - Re: [PATCH RFC 6/6] mm, slub: sheaf prefilling for guaranteed allocations

Message-ID: <dac2e02c-9a71-4ced-8ac1-7b15dea120c3@suse.cz>
Date: Tue, 19 Nov 2024 09:27:00 +0100
From: Vlastimil Babka <vbabka@...e.cz>
To: Hyeonggon Yoo <42.hyeyoo@...il.com>
Cc: Suren Baghdasaryan <surenb@...gle.com>,
 "Liam R. Howlett" <Liam.Howlett@...cle.com>, Christoph Lameter
 <cl@...ux.com>, David Rientjes <rientjes@...gle.com>,
 Pekka Enberg <penberg@...nel.org>, Joonsoo Kim <iamjoonsoo.kim@....com>,
 Roman Gushchin <roman.gushchin@...ux.dev>,
 "Paul E. McKenney" <paulmck@...nel.org>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 Matthew Wilcox <willy@...radead.org>, Boqun Feng <boqun.feng@...il.com>,
 Uladzislau Rezki <urezki@...il.com>, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, rcu@...r.kernel.org,
 maple-tree@...ts.infradead.org
Subject: Re: [PATCH RFC 6/6] mm, slub: sheaf prefilling for guaranteed
 allocations

On 11/19/24 03:29, Hyeonggon Yoo wrote:
> On Mon, Nov 18, 2024 at 11:26 PM Vlastimil Babka <vbabka@...e.cz> wrote:
>>
>> On 11/18/24 14:13, Hyeonggon Yoo wrote:
>> > On Wed, Nov 13, 2024 at 1:39 AM Vlastimil Babka <vbabka@...e.cz> wrote:
>> >> +
>> >> +/*
>> >> + * Allocate from a sheaf obtained by kmem_cache_prefill_sheaf()
>> >> + *
>> >> + * Guaranteed not to fail as many allocations as was the requested count.
>> >> + * After the sheaf is emptied, it fails - no fallback to the slab cache itself.
>> >> + *
>> >> + * The gfp parameter is meant only to specify __GFP_ZERO or __GFP_ACCOUNT
>> >> + * memcg charging is forced over limit if necessary, to avoid failure.
>> >> + */
>> >> +void *
>> >> +kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
>> >> +                                  struct slab_sheaf *sheaf)
>> >> +{
>> >> +       void *ret = NULL;
>> >> +       bool init;
>> >> +
>> >> +       if (sheaf->size == 0)
>> >> +               goto out;
>> >> +
>> >> +       ret = sheaf->objects[--sheaf->size];
>> >> +
>> >> +       init = slab_want_init_on_alloc(gfp, s);
>> >> +
>> >> +       /* add __GFP_NOFAIL to force successful memcg charging */
>> >> +       slab_post_alloc_hook(s, NULL, gfp | __GFP_NOFAIL, 1, &ret, init, s->object_size);
>> >
>> > Maybe I'm missing something, but how can this be used for non-sleepable contexts
>> > if __GFP_NOFAIL is used? I think we have to charge them when the sheaf
>>
>> AFAIK it forces memcg to simply charge even if allocated memory goes over
>> the memcg limit. So there's no issue with a non-sleepable context, there
>> shouldn't be memcg reclaim happening in that case.
> 
> Ok, but I am still worried about mem alloc profiling/memcg trying to
> allocate some memory with __GFP_NOFAIL flag and eventually passing it to
> the buddy allocator, which does not want __GFP_NOFAIL without
> __GFP_DIRECT_RECLAIM?
> e.g. memcg hook calls
> alloc_slab_obj_exts()->kcalloc_node()->....->alloc_pages()

alloc_slab_obj_exts() removes __GFP_NOFAIL via OBJCGS_CLEAR_MASK, so that's fine.
I think kmemleak_alloc_recursive() is also fine, as it ends up in
mem_pool_alloc(), which clears __GFP_NOFAIL via gfp_nested_mask().
I hope I'm not missing something else.
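
To illustrate the masking argument, here is a minimal userspace sketch, not
kernel code. The DEMO_* bit values, DEMO_CLEAR_MASK and demo_nested_flags()
are hypothetical stand-ins: they only model the idea that a nested metadata
allocation masks out the no-fail bit before the flags can reach the page
allocator.

```c
#include <assert.h>

/* Hypothetical flag bits for illustration; the kernel's real gfp bits differ. */
typedef unsigned int gfp_t;
#define DEMO_GFP_ZERO    0x01u
#define DEMO_GFP_ACCOUNT 0x02u
#define DEMO_GFP_NOFAIL  0x04u

/* Stand-in for a clear mask that, like OBJCGS_CLEAR_MASK, covers NOFAIL. */
#define DEMO_CLEAR_MASK  (DEMO_GFP_ACCOUNT | DEMO_GFP_NOFAIL)

/* Flags that the nested (metadata) allocation would see in this model. */
gfp_t demo_nested_flags(gfp_t gfp)
{
	return gfp & ~DEMO_CLEAR_MASK;
}

/* The caller forces NOFAIL for charging; the nested allocation must not see it. */
void demo_mask_self_check(void)
{
	gfp_t caller = DEMO_GFP_ZERO | DEMO_GFP_NOFAIL;

	assert((demo_nested_flags(caller) & DEMO_GFP_NOFAIL) == 0);
	assert((demo_nested_flags(caller) & DEMO_GFP_ZERO) != 0);
}
```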

>> > is returned
>> > via kmem_cache_prefill_sheaf(), just like users of bulk alloc/free?
>>
>> That would be very costly to charge/uncharge if most of the objects are not
>> actually used - it's what we want to avoid here.
>> Going over the memcgs limit a bit in a very rare case isn't considered such
>> an issue, for example Linus advocated such approach too in another context.
> 
> Thanks for the explanation! That was a point I was missing.
> 
>> > Best,
>> > Hyeonggon
>> >
>> >> +out:
>> >> +       trace_kmem_cache_alloc(_RET_IP_, ret, s, gfp, NUMA_NO_NODE);
>> >> +
>> >> +       return ret;
>> >> +}
>> >> +
>> >>  /*
>> >>   * To avoid unnecessary overhead, we pass through large allocation requests
>> >>   * directly to the page allocator. We use __GFP_COMP, because we will need to
>> >>
>> >> --
>> >> 2.47.0
>> >>
>>
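
For reference, the allocation semantics described in the quoted patch comment
(a prefilled sheaf hands out objects until it is empty, then fails with no
fallback) can be modelled in userspace roughly as below. This is a sketch under
assumptions: struct demo_sheaf, DEMO_SHEAF_CAPACITY, demo_prefill() and
demo_alloc_from_sheaf() are hypothetical names, and the hooks, memcg charging
and tracing done by the real function are left out.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SHEAF_CAPACITY 8

/* Hypothetical userspace stand-in for struct slab_sheaf. */
struct demo_sheaf {
	size_t size;                         /* objects still available */
	void *objects[DEMO_SHEAF_CAPACITY];  /* preallocated object pointers */
};

/* Fill the sheaf from a caller-owned pool; returns how many objects it took. */
size_t demo_prefill(struct demo_sheaf *sheaf, void **pool, size_t count)
{
	size_t i;

	if (count > DEMO_SHEAF_CAPACITY)
		count = DEMO_SHEAF_CAPACITY;
	for (i = 0; i < count; i++)
		sheaf->objects[i] = pool[i];
	sheaf->size = count;
	return count;
}

/* Pop the most recently stored object; NULL once the sheaf is empty. */
void *demo_alloc_from_sheaf(struct demo_sheaf *sheaf)
{
	if (sheaf->size == 0)
		return NULL;
	return sheaf->objects[--sheaf->size];
}

/* Two prefilled objects give exactly two successful allocations. */
void demo_sheaf_self_check(void)
{
	int a, b;
	void *pool[2] = { &a, &b };
	struct demo_sheaf sheaf;

	assert(demo_prefill(&sheaf, pool, 2) == 2);
	assert(demo_alloc_from_sheaf(&sheaf) == &b);
	assert(demo_alloc_from_sheaf(&sheaf) == &a);
	assert(demo_alloc_from_sheaf(&sheaf) == NULL);
}
```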