Message-ID: <df46e0cc-cf0f-76a1-60a0-ef1ef8b8280a@huaweicloud.com>
Date: Thu, 29 Jun 2023 12:31:10 +0800
From: Hou Tao <houtao@...weicloud.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Tejun Heo <tj@...nel.org>, rcu@...r.kernel.org,
Network Development <netdev@...r.kernel.org>, bpf <bpf@...r.kernel.org>,
Kernel Team <kernel-team@...com>, Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>, David Vernet <void@...ifault.com>,
"Paul E. McKenney" <paulmck@...nel.org>
Subject: Re: [PATCH v3 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu()
similar to kfree_rcu().
Hi,
On 6/29/2023 11:42 AM, Alexei Starovoitov wrote:
> On Wed, Jun 28, 2023 at 7:24 PM Hou Tao <houtao@...weicloud.com> wrote:
>> Hi,
>>
>> On 6/28/2023 9:56 AM, Alexei Starovoitov wrote:
>>> From: Alexei Starovoitov <ast@...nel.org>
>>>
>>> Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu().
>>> Unlike bpf_mem_[cache_]free(), which links objects for immediate reuse into
>>> a per-cpu free list, the _rcu() flavor waits for an RCU grace period and then
>>> moves objects onto the free_by_rcu_ttrace list, where they wait for an RCU
>>> tasks trace grace period before being freed into slab.
>>>
>>> The life cycle of objects:
>>> alloc: dequeue free_llist
>>> free: enqueue free_llist
>>> free_rcu: enqueue free_by_rcu -> waiting_for_gp
>>> free_llist above high watermark -> free_by_rcu_ttrace
>>> after RCU GP waiting_for_gp -> free_by_rcu_ttrace
>>> free_by_rcu_ttrace -> waiting_for_gp_ttrace -> slab
>>>
>>> Signed-off-by: Alexei Starovoitov <ast@...nel.org>
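
(As an aside for readers following the thread: the hops above, written
against the fields of struct bpf_mem_cache. This is just my mental model
of the patch, not actual code; the callback names are the ones visible
in this series:)

	/*
	 * bpf_mem_cache_free():      obj -> c->free_llist (immediate reuse)
	 * bpf_mem_cache_free_rcu():  obj -> c->free_by_rcu
	 *   call_rcu():              free_by_rcu -> waiting_for_gp
	 * __free_by_rcu():           after a regular RCU GP,
	 *                            waiting_for_gp -> tgt->free_by_rcu_ttrace
	 * do_call_rcu_ttrace():      free_by_rcu_ttrace -> waiting_for_gp_ttrace
	 * __free_rcu():              after an RCU tasks trace GP,
	 *                            waiting_for_gp_ttrace -> slab (free_all())
	 */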
>> SNIP
>>> +static void __free_by_rcu(struct rcu_head *head)
>>> +{
>>> + struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
>>> + struct bpf_mem_cache *tgt = c->tgt;
>>> + struct llist_node *llnode;
>>> +
>>> + llnode = llist_del_all(&c->waiting_for_gp);
>>> + if (!llnode)
>>> + goto out;
>>> +
>>> + llist_add_batch(llnode, c->waiting_for_gp_tail, &tgt->free_by_rcu_ttrace);
>>> +
>>> + /* Objects went through regular RCU GP. Send them to RCU tasks trace */
>>> + do_call_rcu_ttrace(tgt);
>> I still see leaked free_by_rcu_ttrace objects without adding any extra
>> hack except using bpf_mem_cache_free_rcu() in htab.
> Please share the steps to repro.
First, I added the attached patch to check whether there are leaked
elements when free_mem_alloc_no_barrier() is called, then I ran the
following script to stress-test in a KVM VM with 72 CPUs and 64 GB of
memory:
#!/bin/bash

BASE=/all/code/linux_working/

{
        cd $BASE/tools/testing/selftests/bpf
        for x in $(seq 2)
        do
                while true; do ./test_maps &>/dev/null; done &
        done
} &

{
        cd $BASE/samples/bpf
        for y in $(seq 8)
        do
                while true; do ./map_perf_test &>/dev/null; done &
        done
} &

{
        cd $BASE/tools/testing/selftests/bpf
        for z in $(seq 8)
        do
                while true
                do
                        for name in overwrite batch_add_batch_del add_del_on_diff_cpu
                        do
                                ./bench htab-mem -w1 -d5 -a -p32 --use-case $name
                        done
                done &>/dev/null &
                sleep 0.3
        done
} &
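
The check itself is simple; the attached 0001-bpf-Check-leaked-objects.patch
does roughly the following before the per-cpu caches are freed (a simplified
sketch of the idea, not the exact diff; check_mem_cache() is a hypothetical
name):

	static void check_mem_cache(struct bpf_mem_cache *c)
	{
		/* At destroy time every per-cpu list must already be
		 * drained; anything still queued will never reach slab.
		 */
		WARN_ON_ONCE(!llist_empty(&c->free_by_rcu_ttrace));
		WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp_ttrace));
		WARN_ON_ONCE(!llist_empty(&c->free_by_rcu));
		WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp));
	}

Under this load the warning fires on free_by_rcu_ttrace.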
>
>> When bpf ma is freed through free_mem_alloc(), the following sequence
>> may lead to leak of free_by_rcu_ttrace:
>>
>> P1: bpf_mem_alloc_destroy()
>> P2: __free_by_rcu()
>>
>> // got false
>> P2: read c->draining
>>
>> P1: c->draining = true
>> P1: llist_del_all(&c->free_by_rcu_ttrace)
>>
>> // add to free_by_rcu_ttrace again
>> P2: llist_add_batch(..., &tgt->free_by_rcu_ttrace)
>> P2: do_call_rcu_ttrace()
>> // call_rcu_ttrace_in_progress is 1, so xchg returns 1
>> // and free_by_rcu_ttrace is not moved to waiting_for_gp_ttrace
>> P2: atomic_xchg(&c->call_rcu_ttrace_in_progress, 1)
>>
>> // got 1
>> P1: atomic_read(&c->call_rcu_ttrace_in_progress)
>> // objects in free_by_rcu_ttrace are leaked
>>
>> I think the race could be fixed by checking c->draining in
>> do_call_rcu_ttrace() when atomic_xchg() returns 1 as shown below:
> If the theory of the bug holds true then the fix makes sense,
> but did you repro without the fix and confirm you cannot repro with the fix?
> We should not add extra code based on a hunch.
Yes. With only the fix applied, the leak has not occurred in the last 13
hours of testing.
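
Since the diff was snipped above, the shape of the fix I tested is roughly
the following (a sketch against do_call_rcu_ttrace() in this patch, not the
exact diff):

	static void do_call_rcu_ttrace(struct bpf_mem_cache *c)
	{
		struct llist_node *llnode;

		if (atomic_xchg(&c->call_rcu_ttrace_in_progress, 1)) {
			if (unlikely(READ_ONCE(c->draining))) {
				/* Raced with bpf_mem_alloc_destroy():
				 * the in-flight callback will not requeue,
				 * so free the re-added objects here instead
				 * of leaking them.
				 */
				llnode = llist_del_all(&c->free_by_rcu_ttrace);
				free_all(llnode, !!c->percpu_size);
			}
			return;
		}
		/* ... the rest of do_call_rcu_ttrace() is unchanged ... */
	}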