netdev - Re: [PATCH net] sch_htb: Fix prematurely activating netdev during htb_destroy_class

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87v8ljme0q.fsf@nvidia.com>
Date:   Fri, 06 Jan 2023 15:36:37 -0800
From:   Rahul Rameshbabu <rrameshbabu@...dia.com>
To:     Maxim Mikityanskiy <maxtram95@...il.com>
Cc:     netdev@...r.kernel.org, Tariq Toukan <tariqt@...dia.com>,
        Gal Pressman <gal@...dia.com>,
        Saeed Mahameed <saeedm@...dia.com>,
        Jamal Hadi Salim <jhs@...atatu.com>,
        Cong Wang <xiyou.wangcong@...il.com>,
        Jiri Pirko <jiri@...nulli.us>,
        Eric Dumazet <edumazet@...gle.com>
Subject: Re: [PATCH net] sch_htb: Fix prematurely activating netdev during
 htb_destroy_class_offload

Hi Maxim,

I am working on an updated patch based on your comments. I have further
comments below.

-- Rahul Rameshbabu

Maxim Mikityanskiy <maxtram95@...il.com> writes:

> On Thu, 5 Jan 2023 at 08:03, Rahul Rameshbabu <rrameshbabu@...dia.com> wrote:
>>
>> Maxim Mikityanskiy <maxtram95@...il.com> writes:
>>
>> > On Wed, 4 Jan 2023 at 19:53, Rahul Rameshbabu <rrameshbabu@...dia.com> wrote:
>> >>
>> >> When the netdev qdisc is updated correctly with the new qdisc before
>> >> destroying the old qdisc, the netdev should not be activated till cleanup
>> >> is completed. When htb_destroy_class_offload called htb_graft_helper, the
>> >> netdev may be activated before cleanup is completed.
>> >
>> > Oh, so that's what was happening! Now I get the full picture:
>> >
>> > 1. The user does RTM_DELQDISC.
>> > 2. qdisc_graft calls dev_deactivate, which sets dev_queue->qdisc to
>> > NULL, but keeps dev_queue->qdisc_sleeping.
>> > 3. The loop in qdisc_graft calls dev_graft_qdisc(dev_queue, new),
>> > where new is NULL, for each queue.
>> > 4. Then we get into htb_destroy_class_offload, and it's important
>> > whether dev->qdisc is still HTB (before Eric's patch) or noop_qdisc
>> > (after Eric's patch).
>> > 5. If dev->qdisc is noop_qdisc, and htb_graft_helper accidentally
>> > activates the netdev, attach_default_qdiscs will be called, and
>> > dev_queue->qdisc will no longer be NULL for the rest of the queues,
>> > hence the WARN_ON triggering.
>> >
>> > Nice catch indeed, premature activation of the netdev wasn't intended.
>> >
>> >> The new netdev qdisc
>> >> may be used prematurely by queues before cleanup is done. Call
>> >> dev_graft_qdisc in place of htb_graft_helper when destroying the htb to
>> >> prevent premature netdev activation.
>> >>
>> >> Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
>> >> Signed-off-by: Rahul Rameshbabu <rrameshbabu@...dia.com>
>> >> Acked-by: Saeed Mahameed <saeedm@...dia.com>
>> >> Cc: Eric Dumazet <edumazet@...gle.com>
>> >> Cc: Maxim Mikityanskiy <maxtram95@...il.com>
>> >> ---
>> >>  net/sched/sch_htb.c | 8 +++++---
>> >>  1 file changed, 5 insertions(+), 3 deletions(-)
>> >>
>> >> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
>> >> index 2238edece1a4..f62334ef016a 100644
>> >> --- a/net/sched/sch_htb.c
>> >> +++ b/net/sched/sch_htb.c
>> >> @@ -1557,14 +1557,16 @@ static int htb_destroy_class_offload(struct Qdisc *sch, struct htb_class *cl,
>> >>
>> >>         WARN_ON(!q);
>> >>         dev_queue = htb_offload_get_queue(cl);
>> >> -       old = htb_graft_helper(dev_queue, NULL);
>> >> -       if (destroying)
>> >> +       if (destroying) {
>> >> +               old = dev_graft_qdisc(dev_queue, NULL);
>> >>                 /* Before HTB is destroyed, the kernel grafts noop_qdisc to
>> >>                  * all queues.
>> >>                  */
>> >>                 WARN_ON(!(old->flags & TCQ_F_BUILTIN));
>> >
>> > Now regarding this WARN_ON, I have concerns about its correctness.
>> >
>> > Can the user replace the root qdisc from HTB to something else with a
>> > single command? I.e. instead of `tc qdisc del dev eth2 root handle 1:`
>> > do `tc qdisc replace ...` or whatever that causes qdisc_graft to be
>> > called with new != NULL? If that is possible, then:
>> >
>> > 1. `old` won't be noop_qdisc, but rather the new qdisc (if it doesn't
>> > support the attach callback) or the old one left from HTB (old == q,
>> > if the new qdisc supports the attach callback). WARN_ON should
>> > trigger.
>> >
>> > 2. We shouldn't even call dev_graft_qdisc in this case (if destroying
>> > is true). Likewise, we shouldn't try to revert it on errors or call
>> > qdisc_put on it.
>> >
>> > Could you please try to reproduce this scenario of triggering WARN_ON?
>> > I remember testing it, and something actually prevented me from doing
>> > a replacement, but maybe I just missed something back then.
>> >
>>
>> Reproduction steps
>>
>>   ip link set dev eth2 up
>>   ip link set dev eth2 up
>>   ip addr add 194.237.173.123/16 dev eth2
>>   tc qdisc add dev eth2 clsact
>>   tc qdisc add dev eth2 root handle 1: htb default 1 offload
>>   tc class add dev eth2 classid 1: parent root htb rate 18000mbit ceil 22500.0mbit burst 450000kbit cburst 450000kbit
>>   tc class add dev eth2 classid 1:3 parent 1: htb rate 3596mbit burst 89900kbit cburst 89900kbit
>>   tc qdisc replace dev eth2 root pfifo
>>
>> The warning is indeed triggered because the new root is pfifo rather
>> than noop_qdisc. I agree with both points you brought up in the patch.
>> When I saw the ternary in qdisc_graft for the rcu_assign_pointer call, I
>> was worried about the case when new was defined as a new qdisc rather
>> than defaulting to noop_qdisc but assumed there were some guaratees for
>> htb.
>
> OK, so that means my WARN_ON relies on false guarantees.
>
>> However, I see a number of functions, not
>> just offload related ones, in the htb implementation that seem to depend
>> on the assumption that the old qdisc can safely be accessed with helpers
>> such as htb_graft_helper. One such example is htb_change_class.
>
> htb_graft_helper is only used by the offload, you can see that all
> usages are guarded by if (q->offload).
>
>> The
>> trivial solution I see is to change qdisc_graft to first do a
>> rcu_assign_pointer with noop_qdisc, call notify_and_destroy, and only
>> afterwards call rcu_assign_pointer with the new qdisc if defined. Let me
>> know your thoughts on this.
>
> I don't think it's a good idea to introduce such a change to generic
> code just to fix HTB offload. It would basically remove the ability to
> replace the qdisc atomically, and there will be periods of time when
> all packets are dropped.
>
>> I believe the correct fix for a robust implementation of
>> htb_destroy_class_offload would be to not depend on functions that
>> retrieve the top level qdisc.
>
> This is actually the case. The source of truth is internal data
> structures of HTB, specifically cl->leaf.q points to the queue qdisc,
> and cl->leaf.offload_queue points to the queue itself (the latter is
> very useful when the qdisc is noop_qdisc, and we can't retrieve
> dev_queue from the qdisc itself). Whenever we take a qdisc from the
> netdev_queue itself, it should be only to check consistency with the
> source of truth (please tell me if you spotted some places where it's
> used for something else).

Yeah, I did catch the qdisc taken from the netdev_queue used in such a
way but only in one place. This is in htb_change_class in what appears
to be an offload context (I misread in my earlier comment). It's just
the call to _bstats_update though. Should it be changed to
parent->leaf.q (since it should be the source of truth for the htb
structure in this context) in the event the WARN_ON is encountered
(shouldn't happen normally hence the WARN_ON to ensure this guarantee)?

  dev_queue = htb_offload_get_queue(parent);
  old_q = htb_graft_helper(dev_queue, NULL);
  WARN_ON(old_q != parent->leaf.q);
  offload_opt = (struct tc_htb_qopt_offload) {
    .command = TC_HTB_LEAF_TO_INNER,
    .classid = cl->common.classid,
    .parent_classid =
      TC_H_MIN(parent->common.classid),
    .rate = max_t(u64, hopt->rate.rate, rate64),
    .ceil = max_t(u64, hopt->ceil.rate, ceil64),
    .extack = extack,
  };
  err = htb_offload(dev, &offload_opt);
  if (err) {
    pr_err("htb: TC_HTB_LEAF_TO_INNER failed with err = %d\n",
            err);
    htb_graft_helper(dev_queue, old_q);
    goto err_kill_estimator;
  }
  _bstats_update(&parent->bstats_bias,
            u64_stats_read(&old_q->bstats.bytes),
            u64_stats_read(&old_q->bstats.packets));

>
> What I suggest to do to fix this particular bug:
>
> 1. Remove the WARN_ON check for noop_qdisc on destroy, it's simply
> invalid. Rephrase the comment to explain why you WARN_ON(old != q)
> only when !destroying.
>
> 2. Don't graft anything on destroy, the kernel has already replaced
> the qdisc with the right one (noop_qdisc on delete, the new qdisc on
> replace).
>
> 3. qdisc_put down below shouldn't be called when destroying, because
> we didn't graft anything ourselves. We should still call
> qdisc_put(old) when err == 0 to drop the ref (we hold one ref for
> cl->leaf.q and the other ref for dev_queue->qdisc, which is normally
> the same qdisc). And we should still graft old back on errors, but not
> when destroying (we ignore all errors on destroy and just go on, we
> can't fail).

Ack.

>
>>
>>   [  384.474535] ------------[ cut here ]------------
>>   [  384.476685] WARNING: CPU: 2 PID: 1038 at net/sched/sch_htb.c:1561 htb_destroy_class_offload+0x179/0x430 [sch_htb]
>>   [ 384.481217] Modules linked in: sch_htb sch_ingress xt_conntrack
>> xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat
>> br_netfilter overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi
>> rdma_cm iw_cm ib_umad ib_ipoib ib_cm mlx5_ib ib_uverbs ib_core fuse mlx5_core
>>   [  384.487081] CPU: 2 PID: 1038 Comm: tc Not tainted 6.1.0-rc2_for_upstream_min_debug_2022_10_24_15_44 #1
>>   [  384.488414] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
>>   [  384.489987] RIP: 0010:htb_destroy_class_offload+0x179/0x430 [sch_htb]
>>   [  384.490937] Code: 2b 04 25 28 00 00 00 0f 85 cb 02 00 00 48 83 c4 48 44 89 f0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 f6 45 10 01 0f 85 26 ff ff ff <0f> 0b e9 1f ff ff ff 4d 3b 7e 40 0f 84 d9 fe ff ff 0f 0b e9 d2 fe
>>   [  384.493495] RSP: 0018:ffff88815162b840 EFLAGS: 00010246
>>   [  384.494358] RAX: 000000000000002a RBX: ffff88810e040000 RCX: 0000000021800002
>>   [  384.495461] RDX: 0000000021800000 RSI: 0000000000000246 RDI: ffff88810e0404c0
>>   [  384.496581] RBP: ffff888151ea0c00 R08: 0000000100006174 R09: ffffffff82897070
>>   [  384.497684] R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000001
>>   [  384.498923] R13: ffff88810b189200 R14: ffff88810b189a00 R15: ffff888110060a00
>>   [  384.500044] FS:  00007f7a2e7a3800(0000) GS:ffff88852cc80000(0000) knlGS:0000000000000000
>>   [  384.501390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>   [  384.502339] CR2: 0000000000487598 CR3: 0000000151f41003 CR4: 0000000000370ea0
>>   [  384.503458] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>   [  384.504581] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>   [  384.505699] Call Trace:
>>   [  384.506231]  <TASK>
>>   [  384.506691]  ? tcf_block_put+0x74/0xa0
>>   [  384.507365]  htb_destroy+0x142/0x3c0 [sch_htb]
>>   [  384.508131]  ? hrtimer_cancel+0x11/0x40
>>   [  384.508832]  ? rtnl_is_locked+0x11/0x20
>>   [  384.509522]  ? htb_reset+0xe3/0x1a0 [sch_htb]
>>   [  384.510293]  qdisc_destroy+0x3b/0xd0
>>   [  384.510943]  qdisc_graft+0x40b/0x590
>>   [  384.511600]  tc_modify_qdisc+0x577/0x870
>>   [  384.512309]  rtnetlink_rcv_msg+0x2a2/0x390
>>   [  384.513031]  ? rtnl_calcit.isra.0+0x120/0x120
>>   [  384.513806]  netlink_rcv_skb+0x54/0x100
>>   [  384.514495]  netlink_unicast+0x1f6/0x2c0
>>   [  384.515190]  netlink_sendmsg+0x237/0x490
>>   [  384.515890]  sock_sendmsg+0x33/0x40
>>   [  384.516556]  ____sys_sendmsg+0x1d1/0x1f0
>>   [  384.517265]  ___sys_sendmsg+0x72/0xb0
>>   [  384.517942]  ? ___sys_recvmsg+0x7c/0xb0
>>   [  384.518631]  __sys_sendmsg+0x51/0x90
>>   [  384.519289]  do_syscall_64+0x3d/0x90
>>   [  384.519943]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>   [  384.520794] RIP: 0033:0x7f7a2eaccc17
>>   [  384.521449] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
>>   [  384.524306] RSP: 002b:00007ffd8c62fe78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
>>   [  384.525568] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7a2eaccc17
>>   [  384.526703] RDX: 0000000000000000 RSI: 00007ffd8c62fee0 RDI: 0000000000000003
>>   [  384.527828] RBP: 0000000063b66264 R08: 0000000000000001 R09: 00007f7a2eb8da40
>>   [  384.528962] R10: 0000000000405aeb R11: 0000000000000246 R12: 0000000000000001
>>   [  384.530097] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000485400
>>   [  384.531221]  </TASK>
>>   [  384.531689] ---[ end trace 0000000000000000 ]---
>>
>> >> -       else
>> >> +       } else {
>> >> +               old = htb_graft_helper(dev_queue, NULL);
>> >>                 WARN_ON(old != q);
>> >> +       }
>> >>
>> >>         if (cl->parent) {
>> >>                 _bstats_update(&cl->parent->bstats_bias,
>> >> --
>> >> 2.36.2
>> >>