[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAADnVQLFvTvg5nggOLMnV6BLpTp8K+b78ZyB3VNdV_T=Fhusmg@mail.gmail.com>
Date: Fri, 8 Nov 2024 09:00:56 -0800
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Xu Kuohai <xukuohai@...weicloud.com>
Cc: bpf <bpf@...r.kernel.org>, Network Development <netdev@...r.kernel.org>,
Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>, Martin KaFai Lau <martin.lau@...ux.dev>,
Eduard Zingerman <eddyz87@...il.com>, Yonghong Song <yonghong.song@...ux.dev>,
Kui-Feng Lee <thinker.li@...il.com>
Subject: Re: [PATCH bpf-next 1/2] bpf: Fix release of struct_ops map
On Fri, Nov 8, 2024 at 12:15 AM Xu Kuohai <xukuohai@...weicloud.com> wrote:
>
> From: Xu Kuohai <xukuohai@...wei.com>
>
> The test in the follow-up patch triggers the following kernel panic:
>
> Oops: int3: 0000 [#1] PREEMPT SMP PTI
> CPU: 0 UID: 0 PID: 465 Comm: test_progs Tainted: G OE 6.12.0-rc4-gd1d187515
> Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-pr4
> RIP: 0010:0xffffffffc0015041
> Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc ccc
> RSP: 0018:ffffc9000187fd20 EFLAGS: 00000246
> RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: ffffffff82c54639 RDI: 0000000000000000
> RBP: ffffc9000187fd48 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000001 R11: 000000004cba6383 R12: 0000000000000000
> R13: 0000000000000002 R14: ffff88810438b7a0 R15: ffffffff8369d7a0
> FS: 00007f05178006c0(0000) GS:ffff888236e00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f0508c94000 CR3: 0000000100d16003 CR4: 0000000000170ef0
> Call Trace:
> <TASK>
> ? show_regs+0x68/0x80
> ? die+0x3b/0x90
> ? exc_int3+0xca/0xe0
> ? asm_exc_int3+0x3e/0x50
> run_struct_ops+0x58/0xb0 [bpf_testmod]
> param_attr_store+0xa2/0x100
> module_attr_store+0x25/0x40
> sysfs_kf_write+0x50/0x70
> kernfs_fop_write_iter+0x146/0x1f0
> vfs_write+0x27e/0x530
> ksys_write+0x75/0x100
> __x64_sys_write+0x1d/0x30
> x64_sys_call+0x1d30/0x1f30
> do_syscall_64+0x68/0x140
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> RIP: 0033:0x7f051831727f
> Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 39 d5 f8 ff 48 8b 54 24 18 48 8b 74 24 108
> RSP: 002b:00007f05177ffdf0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 00007f05178006c0 RCX: 00007f051831727f
> RDX: 0000000000000002 RSI: 00007f05177ffe30 RDI: 0000000000000004
> RBP: 00007f05177ffe90 R08: 0000000000000000 R09: 000000000000000b
> R10: 0000000000000000 R11: 0000000000000293 R12: ffffffffffffff30
> R13: 0000000000000002 R14: 00007ffd926bfd50 R15: 00007f0517000000
> </TASK>
>
> It's because the sleepable prog is still running when the struct_ops
> map is released.
>
> To fix it, follow the approach used in bpf_tramp_image_put. First,
> before release struct_ops map, wait a rcu_tasks_trace gp for sleepable
> progs to finish. Then, wait a rcu_tasks gp for normal progs and the
> rest trampoline insns to finish.
>
> Additionally, switch to call_rcu to remove the synchronous waiting,
> as suggested by Alexei and Martin.
>
> Fixes: b671c2067a04 ("bpf: Retire the struct_ops map kvalue->refcnt.")
> Suggested-by: Alexei Starovoitov <ast@...nel.org>
> Suggested-by: Martin KaFai Lau <martin.lau@...ux.dev>
> Signed-off-by: Xu Kuohai <xukuohai@...wei.com>
> ---
> kernel/bpf/bpf_struct_ops.c | 37 +++++++++++++++++++------------------
> kernel/bpf/syscall.c | 7 ++++++-
> 2 files changed, 25 insertions(+), 19 deletions(-)
>
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index fda3dd2ee984..dd5f9bf12791 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -865,24 +865,6 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
> */
> if (btf_is_module(st_map->btf))
> module_put(st_map->st_ops_desc->st_ops->owner);
> -
> - /* The struct_ops's function may switch to another struct_ops.
> - *
> - * For example, bpf_tcp_cc_x->init() may switch to
> - * another tcp_cc_y by calling
> - * setsockopt(TCP_CONGESTION, "tcp_cc_y").
> - * During the switch, bpf_struct_ops_put(tcp_cc_x) is called
> - * and its refcount may reach 0 which then free its
> - * trampoline image while tcp_cc_x is still running.
> - *
> - * A vanilla rcu gp is to wait for all bpf-tcp-cc prog
> - * to finish. bpf-tcp-cc prog is non sleepable.
> - * A rcu_tasks gp is to wait for the last few insn
> - * in the tramopline image to finish before releasing
> - * the trampoline image.
> - */
> - synchronize_rcu_mult(call_rcu, call_rcu_tasks);
> -
> __bpf_struct_ops_map_free(map);
> }
>
> @@ -974,6 +956,25 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
> mutex_init(&st_map->lock);
> bpf_map_init_from_attr(map, attr);
>
> + /* The struct_ops's function may switch to another struct_ops.
> + *
> + * For example, bpf_tcp_cc_x->init() may switch to
> + * another tcp_cc_y by calling
> + * setsockopt(TCP_CONGESTION, "tcp_cc_y").
> + * During the switch, bpf_struct_ops_put(tcp_cc_x) is called
> + * and its refcount may reach 0 which then free its
> + * trampoline image while tcp_cc_x is still running.
> + *
> + * Since struct_ops prog can be sleepable, a rcu_tasks_trace gp
> + * is to wait for sleepable progs in the map to finish. Then a
> + * rcu_tasks gp is to wait for the normal progs and the last few
> + * insns in the tramopline image to finish before releasing the
> + * trampoline image.
> + *
> + * Also see the comment in function bpf_tramp_image_put.
> + */
> + WRITE_ONCE(map->free_after_mult_rcu_gp, true);
> +
> return map;
>
> errout_free:
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 8254b2973157..ae927f087f04 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -886,7 +886,12 @@ static void bpf_map_free_rcu_gp(struct rcu_head *rcu)
>
> static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu)
> {
> - if (rcu_trace_implies_rcu_gp())
> + struct bpf_map *map = container_of(rcu, struct bpf_map, rcu);
> +
> + if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS)
> + /* See comment in the end of bpf_struct_ops_map_alloc */
The fix makes sense, but pls copy paste a sentence here instead
of the above comment. Like:
"
rcu_tasks gp is necessary to wait for struct_ops bpf trampoline to finish.
Unlike all other bpf trampolines struct_ops trampoline is not
protected by percpu_ref.
"
> + call_rcu_tasks(rcu, bpf_map_free_rcu_gp);
> + else if (rcu_trace_implies_rcu_gp())
pw-bot: cr
Powered by blists - more mailing lists