Message-ID: <CAKErNvqDfnQM03Npj7Z2dfz_ATcPPuwvSng6MqqX1q=g2z8AWQ@mail.gmail.com>
Date: Fri, 6 Jan 2023 14:03:16 +0200
From: Maxim Mikityanskiy <maxtram95@...il.com>
To: Rahul Rameshbabu <rrameshbabu@...dia.com>
Cc: netdev@...r.kernel.org, Tariq Toukan <tariqt@...dia.com>,
Gal Pressman <gal@...dia.com>,
Saeed Mahameed <saeedm@...dia.com>,
Jamal Hadi Salim <jhs@...atatu.com>,
Cong Wang <xiyou.wangcong@...il.com>,
Jiri Pirko <jiri@...nulli.us>,
Eric Dumazet <edumazet@...gle.com>
Subject: Re: [PATCH net] sch_htb: Fix prematurely activating netdev during htb_destroy_class_offload
On Thu, 5 Jan 2023 at 08:03, Rahul Rameshbabu <rrameshbabu@...dia.com> wrote:
>
> Maxim Mikityanskiy <maxtram95@...il.com> writes:
>
> > On Wed, 4 Jan 2023 at 19:53, Rahul Rameshbabu <rrameshbabu@...dia.com> wrote:
> >>
> >> When the netdev qdisc is correctly updated with the new qdisc before
> >> the old qdisc is destroyed, the netdev should not be activated until
> >> cleanup is completed. When htb_destroy_class_offload called
> >> htb_graft_helper, the netdev could be activated before cleanup was
> >> completed.
> >
> > Oh, so that's what was happening! Now I get the full picture:
> >
> > 1. The user does RTM_DELQDISC.
> > 2. qdisc_graft calls dev_deactivate, which sets dev_queue->qdisc to
> > NULL, but keeps dev_queue->qdisc_sleeping.
> > 3. The loop in qdisc_graft calls dev_graft_qdisc(dev_queue, new),
> > where new is NULL, for each queue.
> > 4. Then we get into htb_destroy_class_offload, and it's important
> > whether dev->qdisc is still HTB (before Eric's patch) or noop_qdisc
> > (after Eric's patch).
> > 5. If dev->qdisc is noop_qdisc, and htb_graft_helper accidentally
> > activates the netdev, attach_default_qdiscs will be called, and
> > dev_queue->qdisc will no longer be NULL for the rest of the queues,
> > hence the WARN_ON triggering.
> >
> > Nice catch indeed, premature activation of the netdev wasn't intended.
> >
> >> The new netdev qdisc
> >> may be used prematurely by queues before cleanup is done. Call
> >> dev_graft_qdisc in place of htb_graft_helper when destroying the htb to
> >> prevent premature netdev activation.
> >>
> >> Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
> >> Signed-off-by: Rahul Rameshbabu <rrameshbabu@...dia.com>
> >> Acked-by: Saeed Mahameed <saeedm@...dia.com>
> >> Cc: Eric Dumazet <edumazet@...gle.com>
> >> Cc: Maxim Mikityanskiy <maxtram95@...il.com>
> >> ---
> >> net/sched/sch_htb.c | 8 +++++---
> >> 1 file changed, 5 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> >> index 2238edece1a4..f62334ef016a 100644
> >> --- a/net/sched/sch_htb.c
> >> +++ b/net/sched/sch_htb.c
> >> @@ -1557,14 +1557,16 @@ static int htb_destroy_class_offload(struct Qdisc *sch, struct htb_class *cl,
> >>
> >> WARN_ON(!q);
> >> dev_queue = htb_offload_get_queue(cl);
> >> - old = htb_graft_helper(dev_queue, NULL);
> >> - if (destroying)
> >> + if (destroying) {
> >> + old = dev_graft_qdisc(dev_queue, NULL);
> >> /* Before HTB is destroyed, the kernel grafts noop_qdisc to
> >> * all queues.
> >> */
> >> WARN_ON(!(old->flags & TCQ_F_BUILTIN));
> >
> > Now regarding this WARN_ON, I have concerns about its correctness.
> >
> > Can the user replace the root qdisc from HTB to something else with a
> > single command? I.e. instead of `tc qdisc del dev eth2 root handle 1:`
> > do `tc qdisc replace ...` or whatever that causes qdisc_graft to be
> > called with new != NULL? If that is possible, then:
> >
> > 1. `old` won't be noop_qdisc, but rather the new qdisc (if it doesn't
> > support the attach callback) or the old one left from HTB (old == q,
> > if the new qdisc supports the attach callback). WARN_ON should
> > trigger.
> >
> > 2. We shouldn't even call dev_graft_qdisc in this case (if destroying
> > is true). Likewise, we shouldn't try to revert it on errors or call
> > qdisc_put on it.
> >
> > Could you please try to reproduce this scenario of triggering WARN_ON?
> > I remember testing it, and something actually prevented me from doing
> > a replacement, but maybe I just missed something back then.
> >
>
> Reproduction steps
>
> ip link set dev eth2 up
> ip link set dev eth2 up
> ip addr add 194.237.173.123/16 dev eth2
> tc qdisc add dev eth2 clsact
> tc qdisc add dev eth2 root handle 1: htb default 1 offload
> tc class add dev eth2 classid 1: parent root htb rate 18000mbit ceil 22500.0mbit burst 450000kbit cburst 450000kbit
> tc class add dev eth2 classid 1:3 parent 1: htb rate 3596mbit burst 89900kbit cburst 89900kbit
> tc qdisc replace dev eth2 root pfifo
>
> The warning is indeed triggered because the new root is pfifo rather
> than noop_qdisc. I agree with both points you brought up in the patch.
> When I saw the ternary in qdisc_graft for the rcu_assign_pointer call, I
> was worried about the case where new was an actual new qdisc rather
> than defaulting to noop_qdisc, but I assumed there were some guarantees
> for htb.
OK, so that means my WARN_ON relies on false guarantees.
> However, I see a number of functions, not
> just offload related ones, in the htb implementation that seem to depend
> on the assumption that the old qdisc can safely be accessed with helpers
> such as htb_graft_helper. One such example is htb_change_class.
htb_graft_helper is only used by the offload; you can see that all of
its call sites are guarded by if (q->offload).
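For context, the reason htb_graft_helper can activate the netdev at all
is that it wraps dev_graft_qdisc in a dev_deactivate/dev_activate pair.
Roughly, from memory (check sch_htb.c for the exact code, this may
differ in detail):

```c
/* Rough reconstruction of htb_graft_helper from net/sched/sch_htb.c,
 * written from memory; treat as illustrative, not authoritative.
 */
static struct Qdisc *htb_graft_helper(struct netdev_queue *dev_queue,
				      struct Qdisc *new_q)
{
	struct net_device *dev = dev_queue->dev;
	struct Qdisc *old_q;

	if (dev->flags & IFF_UP)
		dev_deactivate(dev);
	old_q = dev_graft_qdisc(dev_queue, new_q);
	if (new_q)
		new_q->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
	if (dev->flags & IFF_UP)
		dev_activate(dev);	/* <-- the premature activation during destroy */

	return old_q;
}
```

Calling this while qdisc_graft is still mid-teardown is what lets
attach_default_qdiscs run too early and repopulate dev_queue->qdisc.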
> The
> trivial solution I see is to change qdisc_graft to first do a
> rcu_assign_pointer with noop_qdisc, call notify_and_destroy, and only
> afterwards call rcu_assign_pointer with the new qdisc if defined. Let me
> know your thoughts on this.
I don't think it's a good idea to introduce such a change to generic
code just to fix HTB offload. It would basically remove the ability to
replace the qdisc atomically, and there would be windows of time during
which all packets are dropped.
> I believe the correct fix for a robust implementation of
> htb_destroy_class_offload would be to not depend on functions that
> retrieve the top level qdisc.
This is actually the case. The source of truth is HTB's internal data
structures: cl->leaf.q points to the queue qdisc, and
cl->leaf.offload_queue points to the queue itself (the latter is
especially useful when the qdisc is noop_qdisc and we can't retrieve
dev_queue from the qdisc itself). Whenever we take a qdisc from the
netdev_queue itself, it should only be to check consistency with the
source of truth (please tell me if you spotted any places where it's
used for something else).
What I suggest doing to fix this particular bug:
1. Remove the WARN_ON check for noop_qdisc on destroy; it's simply
invalid. Rephrase the comment to explain why you WARN_ON(old != q)
only when !destroying.
2. Don't graft anything on destroy; the kernel has already replaced
the qdisc with the right one (noop_qdisc on delete, the new qdisc on
replace).
3. qdisc_put down below shouldn't be called when destroying, because
we didn't graft anything ourselves. We should still call
qdisc_put(old) when err == 0 to drop the ref (we hold one ref for
cl->leaf.q and another for dev_queue->qdisc, which is normally the
same qdisc). And we should still graft old back on errors, but not
when destroying (we ignore all errors on destroy and just go on; we
can't fail).
>
> [ 384.474535] ------------[ cut here ]------------
> [ 384.476685] WARNING: CPU: 2 PID: 1038 at net/sched/sch_htb.c:1561 htb_destroy_class_offload+0x179/0x430 [sch_htb]
> [ 384.481217] Modules linked in: sch_htb sch_ingress xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_umad ib_ipoib ib_cm mlx5_ib ib_uverbs ib_core fuse mlx5_core
> [ 384.487081] CPU: 2 PID: 1038 Comm: tc Not tainted 6.1.0-rc2_for_upstream_min_debug_2022_10_24_15_44 #1
> [ 384.488414] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> [ 384.489987] RIP: 0010:htb_destroy_class_offload+0x179/0x430 [sch_htb]
> [ 384.490937] Code: 2b 04 25 28 00 00 00 0f 85 cb 02 00 00 48 83 c4 48 44 89 f0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 f6 45 10 01 0f 85 26 ff ff ff <0f> 0b e9 1f ff ff ff 4d 3b 7e 40 0f 84 d9 fe ff ff 0f 0b e9 d2 fe
> [ 384.493495] RSP: 0018:ffff88815162b840 EFLAGS: 00010246
> [ 384.494358] RAX: 000000000000002a RBX: ffff88810e040000 RCX: 0000000021800002
> [ 384.495461] RDX: 0000000021800000 RSI: 0000000000000246 RDI: ffff88810e0404c0
> [ 384.496581] RBP: ffff888151ea0c00 R08: 0000000100006174 R09: ffffffff82897070
> [ 384.497684] R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000001
> [ 384.498923] R13: ffff88810b189200 R14: ffff88810b189a00 R15: ffff888110060a00
> [ 384.500044] FS: 00007f7a2e7a3800(0000) GS:ffff88852cc80000(0000) knlGS:0000000000000000
> [ 384.501390] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 384.502339] CR2: 0000000000487598 CR3: 0000000151f41003 CR4: 0000000000370ea0
> [ 384.503458] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 384.504581] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 384.505699] Call Trace:
> [ 384.506231] <TASK>
> [ 384.506691] ? tcf_block_put+0x74/0xa0
> [ 384.507365] htb_destroy+0x142/0x3c0 [sch_htb]
> [ 384.508131] ? hrtimer_cancel+0x11/0x40
> [ 384.508832] ? rtnl_is_locked+0x11/0x20
> [ 384.509522] ? htb_reset+0xe3/0x1a0 [sch_htb]
> [ 384.510293] qdisc_destroy+0x3b/0xd0
> [ 384.510943] qdisc_graft+0x40b/0x590
> [ 384.511600] tc_modify_qdisc+0x577/0x870
> [ 384.512309] rtnetlink_rcv_msg+0x2a2/0x390
> [ 384.513031] ? rtnl_calcit.isra.0+0x120/0x120
> [ 384.513806] netlink_rcv_skb+0x54/0x100
> [ 384.514495] netlink_unicast+0x1f6/0x2c0
> [ 384.515190] netlink_sendmsg+0x237/0x490
> [ 384.515890] sock_sendmsg+0x33/0x40
> [ 384.516556] ____sys_sendmsg+0x1d1/0x1f0
> [ 384.517265] ___sys_sendmsg+0x72/0xb0
> [ 384.517942] ? ___sys_recvmsg+0x7c/0xb0
> [ 384.518631] __sys_sendmsg+0x51/0x90
> [ 384.519289] do_syscall_64+0x3d/0x90
> [ 384.519943] entry_SYSCALL_64_after_hwframe+0x46/0xb0
> [ 384.520794] RIP: 0033:0x7f7a2eaccc17
> [ 384.521449] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
> [ 384.524306] RSP: 002b:00007ffd8c62fe78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
> [ 384.525568] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7a2eaccc17
> [ 384.526703] RDX: 0000000000000000 RSI: 00007ffd8c62fee0 RDI: 0000000000000003
> [ 384.527828] RBP: 0000000063b66264 R08: 0000000000000001 R09: 00007f7a2eb8da40
> [ 384.528962] R10: 0000000000405aeb R11: 0000000000000246 R12: 0000000000000001
> [ 384.530097] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000485400
> [ 384.531221] </TASK>
> [ 384.531689] ---[ end trace 0000000000000000 ]---
>
> >> - else
> >> + } else {
> >> + old = htb_graft_helper(dev_queue, NULL);
> >> WARN_ON(old != q);
> >> + }
> >>
> >> if (cl->parent) {
> >> _bstats_update(&cl->parent->bstats_bias,
> >> --
> >> 2.36.2
> >>