linux-kernel - Re: [linus:master] [bpf] c930472552: WARNING:at_kernel/bpf/memalloc.c:#bpf_mem_alloc

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <40ba7904-ade6-4a06-bb40-26b3853d0c0b@linux.dev>
Date:   Tue, 7 Nov 2023 08:04:37 -0800
From:   Yonghong Song <yonghong.song@...ux.dev>
To:     Hou Tao <houtao@...weicloud.com>,
        Alexei Starovoitov <ast@...nel.org>,
        kernel test robot <oliver.sang@...el.com>
Cc:     oe-lkp@...ts.linux.dev, lkp@...el.com,
        linux-kernel@...r.kernel.org, bpf@...r.kernel.org,
        "houtao1@...wei.com" <houtao1@...wei.com>
Subject: Re: [linus:master] [bpf] c930472552:
 WARNING:at_kernel/bpf/memalloc.c:#bpf_mem_alloc_init


On 11/7/23 2:43 AM, Hou Tao wrote:
> Hi,
>
> On 11/4/2023 12:49 AM, Yonghong Song wrote:
>> On 11/2/23 11:54 PM, Hou Tao wrote:
>>> Hi,
>>>
>>> On 11/3/2023 12:08 AM, Yonghong Song wrote:
>>>> On 11/2/23 6:40 AM, Hou Tao wrote:
>>>>> Hi Alexei,
>>>>>
>>>>> On 10/31/2023 4:01 PM, Hou Tao wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 10/30/2023 10:11 PM, kernel test robot wrote:
>>>>>>> hi, Hou Tao,
>>>>>>>
>>>>>>> we noticed a WARN_ONCE added in this commit was hit in our tests.
>>>>>>> FYI.
>>>>>>>
>>>>>>>
> SNIP
>>>>> I see what has happened. The problem is twofold:
>>>>> (1) The object_size of kmalloc-cg-96 is adjust from 96 to 128 due to
>>>>> slab merge in __kmem_cache_alias(). For SLAB, SLAB_HWCACHE_ALIGN is
>>>>> enabled by default for kmalloc slab, so align is 64 and size is 128
>>>>> for
>>>>> kmalloc-cg-96. So when unit_alloc() does kmalloc_node(96,
>>>>> __GFP_ACCOUNT,
>>>>> node), ksize() will return 128 instead of 96 for the returned pointer.
>>>>> SLUB has a similar merge logic, but because its align is 8 under
>>>>> x86-64,
>>>>> so the warning doesn't happen for i386 + SLUB, but I think the similar
>>>>> problem may exist for other architectures.
>>>>> (2) kmalloc_size_roundup() returns the object_size of kmalloc-96
>>>>> instead
>>>>> of kmalloc-cg-96, so bpf_mem_cache_adjust_size() doesn't adjust
>>>>> size_index accordingly. The reason why the object_size of
>>>>> kmalloc-96 is
>>>>> 96 instead of 128 is that there is slab merge for kmalloc-96.
>>>>>
>>>>> About how to fix the problem, I have two ideas:
>>>>> The first is to introduce kmalloc_size_roundup_flags(), so
>>>>> bpf_mem_cache_adjust_size() could use kmalloc_size_roundup_flags(size,
>>>>> __GFP_ACCOUNT) to get the object_size of kmalloc-cg-xxx. It could fix
>>>>> the warning for now, but the warning may pop-up occasionally due to
>>>>> SLUB
>>>>> merge and unusual slab align. The second is just using the
>>>>> bpf_mem_cache
>>>>> pointer to get the unit_size which is saved before the to-be-free
>>>>> pointer. Its downside is that it may can not be able to skip the free
>>>>> operation for pointer which is not allocated from bpf ma, but I
>>>>> think it
>>>>> is acceptable. I prefer the latter solution. What do you think ?
>>>> Is it possible that in bpf_mem_cache_adjust_size(), we do a series of
>>>> kmalloc (for supported bucket size) and call ksize() to get the actual
>>>> allocated object size. So eventually all possible allocated object
>>>> sizes
>>>> will be used for size_index[]. This will avoid all kind of special
>>>> corner cases due to config/macro/arch etc. WDYT?
>>> It is basically the same as the first proposed solution and it has the
>>> same flaw. The problem is that slab merge can happen in any time, so the
>>> return value of ksize() may change even all passed pointers are
>>> allocated from the same slab. Considering the following case: during the
>>> invocation of bpf_mem_cache_adjust_size() or the initialization of
>>> bpf_global_ma, there is no slab merge and ksize() for a 96-bytes object
>>> returns 96. But after these invocations, a new slab created by a kernel
>>> module is merged to kmalloc-cg-96 and the object_size of kmalloc-cg-96
>>> is adjust from 96 to 128 (which is possible for x86-64 + CONFIG_SLAB,
>>> because it is alignment requirement is 64 for 96-bytes slab). So soon or
>> So, the object_size for allocated objects in that is adjusted from 96
>> to 128
>> while previously allocated objects should have no change, it is merely
>> ksize(old_obj)
>> previous return 96, now returns 128, right? Okay, so this is indeed a
>> problem
>> since we use ksize() to decide the bucket.
> Yes. The object_size of underlying slab changes, so the return value of
> ksize() will change as well.
>>
>>> later, when bpf_global_ma frees a 96-byte-sized pointer which is
>>> allocated from a bpf_mem_cache in which unit_size is 96, bpf_mem_free()
>>> will free the pointer through a bpf_mem_cache in which unit_size is 128,
>>> because the return value of ksize() changes. Maybe we should introduce a
>>> new API in mm which returns size instead of object_size of underlying
>>> slab, so the return value will not change due to slab merge.
>> In this case, to avoid the warning, indeed we need to use '96' instead
>> of '128'.
>> So use the original ksize() return value is indeed a solution.
>> We could use the mechanism similar to percpu alloc to save '96' in the
>> memory.
> We have already saved the pointer of bpf_mem_cache in the extra space
> (aka LLIST_NODE_SZ) which is allocated together with the returned
> pointer, so I think we could use bpf_mem_cache->unit_size to get the
> size of the free pointer directly. I will check whether or not there is
> performance degradation before posting the patch.

See:

bpf_local_storage.c:            err = bpf_mem_alloc_init(&smap->selem_ma, smap->elem_size, false);
bpf_local_storage.c:            err = bpf_mem_alloc_init(&smap->storage_ma, sizeof(struct bpf_local_storage), false);
core.c: ret = bpf_mem_alloc_init(&bpf_global_ma, 0, false);
core.c: ret = bpf_mem_alloc_init(&bpf_global_percpu_ma, 0, true);
cpumask.c:      ret = bpf_mem_alloc_init(&bpf_cpumask_ma, sizeof(struct bpf_cpumask), false);
hashtab.c:              err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, false);
hashtab.c:                      err = bpf_mem_alloc_init(&htab->pcpu_ma,
memalloc.c:int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)

Some 'size' parameter in core.c is zero.
Not sure how exactly you will resolve this issue based on
bpf_mem_cache->unit_size. But looking forward to your patch!

>
> Regards,
> Tao
>