netdev - Re: [PATCH v2 bpf-next 09/13] bpf: Allow reuse from waiting_for_gp

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <2bf11b56-7494-c0a9-09d4-c9e41aaba850@meta.com>
Date: Tue, 27 Jun 2023 17:59:54 -0700
From: Alexei Starovoitov <ast@...a.com>
To: Hou Tao <houtao@...weicloud.com>,
        Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>, David Vernet <void@...ifault.com>,
        "Paul E. McKenney" <paulmck@...nel.org>, Tejun Heo <tj@...nel.org>,
        rcu@...r.kernel.org, Network Development <netdev@...r.kernel.org>,
        bpf <bpf@...r.kernel.org>, Kernel Team <kernel-team@...com>
Subject: Re: [PATCH v2 bpf-next 09/13] bpf: Allow reuse from
 waiting_for_gp_ttrace list.

On 6/26/23 12:16 AM, Hou Tao wrote:
> Hi,
> 
> On 6/26/2023 12:42 PM, Alexei Starovoitov wrote:
>> On Sun, Jun 25, 2023 at 8:30 PM Hou Tao <houtao@...weicloud.com> wrote:
>>> Hi,
>>>
>>> On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
>>>> From: Alexei Starovoitov <ast@...nel.org>
>>>>
>>>> alloc_bulk() can reuse elements from free_by_rcu_ttrace.
>>>> Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
>>>>
>>>> Signed-off-by: Alexei Starovoitov <ast@...nel.org>
>>>> ---
>>>>   kernel/bpf/memalloc.c | 9 +++++++++
>>>>   1 file changed, 9 insertions(+)
>>>>
>>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>>> index 692a9a30c1dc..666917c16e87 100644
>>>> --- a/kernel/bpf/memalloc.c
>>>> +++ b/kernel/bpf/memalloc.c
>>>> @@ -203,6 +203,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>>>>        if (i >= cnt)
>>>>                return;
>>>>
>>>> +     for (; i < cnt; i++) {
>>>> +             obj = llist_del_first(&c->waiting_for_gp_ttrace);
>>> After allowing to reuse elements from waiting_for_gp_ttrace, there may
>>> be concurrent llist_del_first() and llist_del_all() as shown below and
>>> llist_del_first() is not safe because the elements freed from free_rcu()
>>> could be reused immediately and head->first may be added back to
>>> c0->waiting_for_gp_ttrace by other process.
>>>
>>> // c0
>>> alloc_bulk()
>>>      llist_del_first(&c->waiting_for_gp_ttrace)
>>>
>>> // c1->tgt = c0
>>> free_rcu()
>>>      llist_del_all(&c->waiting_for_gp_ttrace)
>> I'm still thinking about how to fix the other issues you've reported,
>> but this one, I believe, is fine.
>> Are you basing 'not safe' on a comment?
>> Why xchg(&head->first, NULL); on one cpu and
>> try_cmpxchg(&head->first, &entry, next);
>> is unsafe?
> No, I didn't reason it only based on the comment in llist.h. The reason
> I though it was not safe because llist_del_first() may have ABA problem
> as pointed by you early, because after __free_rcu(), the elements on
> waiting_for_gp_ttrace would be available immediately and
> waiting_for_gp_ttrace->first may be reused and then added back to
> waiting_for_gp_ttrace->first again as shown below.
> 
> // c0->waiting_for_gp_ttrace
> A -> B -> C -> nil
>   
> P1:
> alloc_bulk(c0)
>      llist_del_first(&c0->waiting_for_gp_ttrace)
>          entry = A
>          next = B
>   
>      P2: __free_rcu
>          // A/B/C is freed back to slab
>          // c0->waiting_for_gp_ttrace->first = NULL
>          free_all(llist_del_all(c0))
>   
>      P3: alloc_bulk(c1)
>          // got A
>          __kmalloc()
>   
>      // free A (from c1), ..., last free X (allocated from c0)
>      P3: unit_free(c1)
>          // the last freed element X is from c0
>          c1->tgt = c0
>          c1->free_llist->first -> X -> Y -> ... -> A
>      P3: free_bulk(c1)
>          enque_to_free(c0)
>              c0->free_by_rcu_ttrace->first -> A -> ... -> Y -> X
>          __llist_add_batch(c0->waiting_for_gp_ttrace)
>              c0->waiting_for_gp_ttrace = A -> ... -> Y -> X

In theory that's possible, but for this to happen one cpu needs
to be thousand times slower than all others and since there is no
preemption in llist_del_first I don't think we need to worry about it.
Also with removal of _tail optimization the above 
llist_add_batch(waiting_for_gp_ttrace)
will become a loop, so reused element will be at the very end
instead of top, so one cpu to million times slower which is not realistic.

> P1:
>      // A is added back as first again
>      // but llist_del_first() didn't know
>      try_cmpxhg(&c0->waiting_for_gp_ttrace->first, A, B)
>      // c0->waiting_for_gp_trrace is corrupted
>      c0->waiting_for_gp_ttrace->first = B
>