Message-Id: <20250107085559.3081563-7-houtao@huaweicloud.com>
Date: Tue, 7 Jan 2025 16:55:58 +0800
From: Hou Tao <houtao@...weicloud.com>
To: bpf@...r.kernel.org,
netdev@...r.kernel.org
Cc: Martin KaFai Lau <martin.lau@...ux.dev>,
Alexei Starovoitov <alexei.starovoitov@...il.com>,
Andrii Nakryiko <andrii@...nel.org>,
Eduard Zingerman <eddyz87@...il.com>,
Song Liu <song@...nel.org>,
Hao Luo <haoluo@...gle.com>,
Yonghong Song <yonghong.song@...ux.dev>,
Daniel Borkmann <daniel@...earbox.net>,
KP Singh <kpsingh@...nel.org>,
Stanislav Fomichev <sdf@...ichev.me>,
Jiri Olsa <jolsa@...nel.org>,
John Fastabend <john.fastabend@...il.com>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
houtao1@...wei.com,
xukuohai@...wei.com
Subject: [PATCH bpf-next 6/7] bpf: Free element after unlock for pre-allocated htab
From: Hou Tao <houtao1@...wei.com>
During the update procedure, when overwriting an element in a
pre-allocated htab, the freeing of the old element is protected by the
bucket lock. The bucket lock is necessary because the old element has
already been stashed in htab->extra_elems by the time alloc_htab_elem()
returns. If the old element were instead freed after the bucket lock is
released, the stashed element could be reused by a concurrent update,
and the freeing of the old element would race with its reuse. However,
check_and_free_fields() may acquire a spin-lock, which violates the
lockdep rules because its caller already holds a raw spin-lock (the
bucket lock). The following warning is reported when this happens (a
minimal sketch of the offending lock nesting follows the trace):
BUG: scheduling while atomic: test_progs/676/0x00000003
3 locks held by test_progs/676:
#0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
#1: ffff88810e961188 (&htab->lockdep_key){....}-{2:2}, at: htab_map_update_elem+0x306/0x1500
#2: ffff8881f4eac1b8 (&base->softirq_expiry_lock){....}-{2:2}, at: hrtimer_cancel_wait_running+0xe9/0x1b0
Modules linked in: bpf_testmod(O)
Preemption disabled at:
[<ffffffff817837a3>] htab_map_update_elem+0x293/0x1500
CPU: 0 UID: 0 PID: 676 Comm: test_progs Tainted: G ... 6.12.0+ #11
Tainted: [W]=WARN, [O]=OOT_MODULE
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)...
Call Trace:
<TASK>
dump_stack_lvl+0x57/0x70
dump_stack+0x10/0x20
__schedule_bug+0x120/0x170
__schedule+0x300c/0x4800
schedule_rtlock+0x37/0x60
rtlock_slowlock_locked+0x6d9/0x54c0
rt_spin_lock+0x168/0x230
hrtimer_cancel_wait_running+0xe9/0x1b0
hrtimer_cancel+0x24/0x30
bpf_timer_delete_work+0x1d/0x40
bpf_timer_cancel_and_free+0x5e/0x80
bpf_obj_free_fields+0x262/0x4a0
check_and_free_fields+0x1d0/0x280
htab_map_update_elem+0x7fc/0x1500
bpf_prog_9f90bc20768e0cb9_overwrite_cb+0x3f/0x43
bpf_prog_ea601c4649694dbd_overwrite_timer+0x5d/0x7e
bpf_prog_test_run_syscall+0x322/0x830
__sys_bpf+0x135d/0x3ca0
__x64_sys_bpf+0x75/0xb0
x64_sys_call+0x1b5/0xa10
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
...
</TASK>
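For reference, the core issue is the lock nesting rather than anything
specific to the htab code. Below is a minimal, hypothetical sketch (the
names bucket_lock, inner_lock and bad_nesting are made up for
illustration and are not taken from this patch) of the nesting that
triggers the warning on PREEMPT_RT, where spinlock_t is a sleeping
rt_mutex while raw_spinlock_t still disables preemption:

/* Hypothetical illustration only. On PREEMPT_RT, spinlock_t is a
 * sleeping lock (rt_mutex) while raw_spinlock_t disables preemption,
 * so nesting a spinlock_t inside a raw_spinlock_t schedules while
 * atomic, which is what the trace above reports.
 */
#include <linux/spinlock.h>

static DEFINE_RAW_SPINLOCK(bucket_lock); /* stands in for the htab bucket lock */
static DEFINE_SPINLOCK(inner_lock);      /* stands in for e.g. softirq_expiry_lock */

static void bad_nesting(void)
{
	raw_spin_lock(&bucket_lock);	/* preemption disabled from here on */
	spin_lock(&inner_lock);		/* may sleep on PREEMPT_RT -> splat above */
	spin_unlock(&inner_lock);
	raw_spin_unlock(&bucket_lock);
}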
To fix the problem, the patch splits the reuse and the refill of the
per-cpu extra_elems into two independent parts: the per-cpu extra
element is reused while the bucket lock is held, and the old element is
refilled into the per-cpu extra_elems only after the bucket lock has
been released. With this split, it is safe to free the pre-allocated
element after the bucket lock is unlocked.
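As a rough standalone model of that split (not the exact kernel code;
the names extra_elem_slot, take_extra_elem and refill_extra_elem are
made up for illustration), the reuse side takes the element from the
per-cpu slot with an acquire load while the bucket lock is held, and
the refill side puts the freed element back with a release cmpxchg only
after the lock has been dropped, so the element's special fields are
fully freed before it becomes reusable:

#include <linux/atomic.h>
#include <linux/percpu.h>

struct htab_elem;

static DEFINE_PER_CPU(struct htab_elem *, extra_elem_slot);

/* Called with the bucket (raw) lock held. */
static struct htab_elem *take_extra_elem(void)
{
	struct htab_elem **slot = this_cpu_ptr(&extra_elem_slot);
	/* pairs with the cmpxchg_release() in refill_extra_elem() */
	struct htab_elem *e = smp_load_acquire(slot);

	/* may be NULL if a preempted updater has not refilled yet */
	if (e)
		WRITE_ONCE(*slot, NULL);
	return e;
}

/* Called after the bucket lock has been released. */
static void refill_extra_elem(struct htab_elem *old)
{
	struct htab_elem **slot = this_cpu_ptr(&extra_elem_slot);

	/* special fields of @old can be freed here, outside the lock */

	/* Refill only an empty slot; what to do when the slot is
	 * already occupied is left out of this sketch.
	 */
	cmpxchg_release(slot, NULL, old);
}

The acquire/release pairing is what allows the refill to happen outside
the bucket lock without a concurrent updater on the same CPU observing
a half-torn-down element.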
Reported-by: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Signed-off-by: Hou Tao <houtao1@...wei.com>
---
kernel/bpf/hashtab.c | 43 ++++++++++++++++---------------------------
1 file changed, 16 insertions(+), 27 deletions(-)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 9211df2adda4..83c96c8941f0 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1034,9 +1034,16 @@ static struct htab_elem *alloc_preallocated_htab_elem(struct bpf_htab *htab,
* use per-cpu extra elems to avoid freelist_pop/push
*/
pl_new = this_cpu_ptr(htab->extra_elems);
- l_new = *pl_new;
- *pl_new = old_elem;
- return l_new;
+ /* Paired with cmpxchg_release() in free_htab_elem() */
+ l_new = smp_load_acquire(pl_new);
+ /* extra_elems can be NULL if the current update operation
+ * preempts another update operation that hasn't yet refilled
+ * the per-cpu extra_elems.
+ */
+ if (l_new) {
+ WRITE_ONCE(*pl_new, NULL);
+ return l_new;
+ }
}
l = __pcpu_freelist_pop(&htab->freelist);
@@ -1139,7 +1146,6 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
struct htab_elem *l_new = NULL, *l_old;
struct hlist_nulls_head *head;
unsigned long flags;
- void *old_map_ptr;
struct bucket *b;
u32 key_size, hash;
int ret;
@@ -1200,7 +1206,8 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
copy_map_value_locked(map,
l_old->key + round_up(key_size, 8),
value, false);
- ret = 0;
+ /* don't free the reused old element */
+ l_old = NULL;
goto err;
}
@@ -1216,31 +1223,13 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
* concurrent search will find it before old elem
*/
hlist_nulls_add_head_rcu(&l_new->hash_node, head);
- if (l_old) {
+ if (l_old)
hlist_nulls_del_rcu(&l_old->hash_node);
-
- /* l_old has already been stashed in htab->extra_elems, free
- * its special fields before it is available for reuse. Also
- * save the old map pointer in htab of maps before unlock
- * and release it after unlock.
- */
- old_map_ptr = NULL;
- if (htab_is_prealloc(htab)) {
- if (map->ops->map_fd_put_ptr)
- old_map_ptr = fd_htab_map_get_ptr(map, l_old);
- check_and_free_fields(htab, l_old);
- }
- }
- htab_unlock_bucket(htab, b, hash, flags);
- if (l_old) {
- if (old_map_ptr)
- map->ops->map_fd_put_ptr(map, old_map_ptr, true);
- if (!htab_is_prealloc(htab))
- free_htab_elem(htab, l_old, false);
- }
- return 0;
err:
htab_unlock_bucket(htab, b, hash, flags);
+ /* refill per-cpu extra_elems for preallocated htab */
+ if (!ret && l_old)
+ free_htab_elem(htab, l_old, true);
return ret;
}
--
2.29.2