[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7f773c114001cbcd0c6ff21da9976eb0ba533421.camel@redhat.com>
Date: Tue, 13 Jun 2023 10:39:33 +0200
From: Paolo Abeni <pabeni@...hat.com>
To: Peilin Ye <yepeilin.cs@...il.com>, Jamal Hadi Salim <jhs@...atatu.com>,
Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>, "David
S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, Jakub
Kicinski <kuba@...nel.org>
Cc: Peilin Ye <peilin.ye@...edance.com>, Vlad Buslov <vladbu@...lanox.com>,
Pedro Tammela <pctammela@...atatu.com>, John Fastabend
<john.fastabend@...il.com>, Daniel Borkmann <daniel@...earbox.net>, Hillf
Danton <hdanton@...a.com>, Zhengchao Shao <shaozhengchao@...wei.com>,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org, Cong Wang
<cong.wang@...edance.com>
Subject: Re: [PATCH net 2/2] net/sched: qdisc_destroy() old ingress and
clsact Qdiscs before grafting
Hi,
On Sat, 2023-06-10 at 20:30 -0700, Peilin Ye wrote:
> From: Peilin Ye <peilin.ye@...edance.com>
>
> mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized
> in ingress_init() to point to net_device::miniq_ingress. ingress Qdiscs
> access this per-net_device pointer in mini_qdisc_pair_swap(). Similar
> for clsact Qdiscs and miniq_egress.
>
> Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER
> requests (thanks Hillf Danton for the hint), when replacing ingress or
> clsact Qdiscs, for example, the old Qdisc ("@old") could access the same
> miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"),
> causing race conditions [1] including a use-after-free bug in
> mini_qdisc_pair_swap() reported by syzbot:
>
> BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
> Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
> ...
> Call Trace:
> <TASK>
> __dump_stack lib/dump_stack.c:88 [inline]
> dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
> print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
> print_report mm/kasan/report.c:430 [inline]
> kasan_report+0x11c/0x130 mm/kasan/report.c:536
> mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
> tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
> tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
> tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
> tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
> tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
> ...
>
> @old and @new should not affect each other. In other words, @old should
> never modify miniq_{in,e}gress after @new, and @new should not update
> @old's RCU state.
>
> Fixing without changing sch_api.c turned out to be difficult (please
> refer to Closes: for discussions). Instead, make sure @new's first call
> always happen after @old's last call (in {ingress,clsact}_destroy()) has
> finished:
>
> In qdisc_graft(), return -EBUSY if @old has any ongoing filter requests,
> and call qdisc_destroy() for @old before grafting @new.
>
> Introduce qdisc_refcount_dec_if_one() as the counterpart of
> qdisc_refcount_inc_nz() used for filter requests. Introduce a
> non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
> just like qdisc_put() etc.
>
> Depends on patch "net/sched: Refactor qdisc_graft() for ingress and
> clsact Qdiscs".
>
> [1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
> TC_H_ROOT (no longer possible after commit c7cfbd115001 ("net/sched:
> sch_ingress: Only create under TC_H_INGRESS")) on eth0 that has 8
> transmission queues:
>
> Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2),
> then adds a flower filter X to A.
>
> Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
> b2) to replace A, then adds a flower filter Y to B.
>
> Thread 1 A's refcnt Thread 2
> RTM_NEWQDISC (A, RTNL-locked)
> qdisc_create(A) 1
> qdisc_graft(A) 9
>
> RTM_NEWTFILTER (X, RTNL-unlocked)
> __tcf_qdisc_find(A) 10
> tcf_chain0_head_change(A)
> mini_qdisc_pair_swap(A) (1st)
> |
> | RTM_NEWQDISC (B, RTNL-locked)
> RCU sync 2 qdisc_graft(B)
> | 1 notify_and_destroy(A)
> |
> tcf_block_release(A) 0 RTM_NEWTFILTER (Y, RTNL-unlocked)
> qdisc_destroy(A) tcf_chain0_head_change(B)
> tcf_chain0_head_change_cb_del(A) mini_qdisc_pair_swap(B) (2nd)
> mini_qdisc_pair_swap(A) (3rd) |
> ... ...
>
> Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to
> its mini Qdisc, b1. Then, A calls mini_qdisc_pair_swap() again during
> ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress
> packets on eth0 will not find filter Y in sch_handle_ingress().
>
> This is just one of the possible consequences of concurrently accessing
> miniq_{in,e}gress pointers.
>
> Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
> Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
> Reported-by: syzbot+b53a9c0d1ea4ad62da8b@...kaller.appspotmail.com
> Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
> Cc: Hillf Danton <hdanton@...a.com>
> Cc: Vlad Buslov <vladbu@...lanox.com>
> Signed-off-by: Peilin Ye <peilin.ye@...edance.com>
The fixes LGTM, but I guess this could deserve an explicit ack from
Jakub, as he raised to point for the full retry implementation.
Cheers,
Paolo
Powered by blists - more mailing lists