Message-ID: <CACGkMEu_9aeb5xzKd7r+OFfud=kWOZYREa-8GcDzzppceKszjw@mail.gmail.com>
Date: Fri, 29 Nov 2024 10:18:04 +0800
From: Jason Wang <jasowang@...hat.com>
To: Koichiro Den <koichiro.den@...onical.com>
Cc: virtualization@...ts.linux.dev, mst@...hat.com, xuanzhuo@...ux.alibaba.com, 
	eperezma@...hat.com, andrew+netdev@...n.ch, davem@...emloft.net, 
	edumazet@...gle.com, kuba@...nel.org, pabeni@...hat.com, 
	netdev@...r.kernel.org, linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH] virtio_net: drain unconsumed tx completions if any before dql_reset

On Thu, Nov 28, 2024 at 12:25 PM Koichiro Den
<koichiro.den@...onical.com> wrote:
>
> On Thu, Nov 28, 2024 at 10:57:01AM +0800, Jason Wang wrote:
> > On Wed, Nov 27, 2024 at 12:08 PM Koichiro Den
> > <koichiro.den@...onical.com> wrote:
> > >
> > > On Wed, Nov 27, 2024 at 11:24:15AM +0800, Jason Wang wrote:
> > > > On Tue, Nov 26, 2024 at 12:44 PM Koichiro Den
> > > > <koichiro.den@...onical.com> wrote:
> > > > >
> > > > > On Tue, Nov 26, 2024 at 11:50:17AM +0800, Jason Wang wrote:
> > > > > > On Tue, Nov 26, 2024 at 10:42 AM Koichiro Den
> > > > > > <koichiro.den@...onical.com> wrote:
> > > > > > >
> > > > > > > When virtnet_close is followed by virtnet_open, there is a slight chance
> > > > > > > that some TX completions remain unconsumed. Those are handled during the
> > > > > > > first NAPI poll, but since dql_reset occurs just beforehand, it can lead
> > > > > > > to a crash [1].
> > > > > > >
> > > > > > > This issue can be reproduced by running: `while :; do ip l set DEV down;
> > > > > > > ip l set DEV up; done` under heavy network TX load from inside of the
> > > > > > > machine.
> > > > > > >
> > > > > > > To fix this, drain unconsumed TX completions if any before dql_reset,
> > > > > > > allowing BQL to start cleanly.
> > > > > > >
> > > > > > > ------------[ cut here ]------------
> > > > > > > kernel BUG at lib/dynamic_queue_limits.c:99!
> > > > > > > Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> > > > > > > CPU: 7 UID: 0 PID: 1598 Comm: ip Tainted: G    N 6.12.0net-next_main+ #2
> > > > > > > Tainted: [N]=TEST
> > > > > > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), \
> > > > > > > BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> > > > > > > RIP: 0010:dql_completed+0x26b/0x290
> > > > > > > Code: b7 c2 49 89 e9 44 89 da 89 c6 4c 89 d7 e8 ed 17 47 00 58 65 ff 0d
> > > > > > > 4d 27 90 7e 0f 85 fd fe ff ff e8 ea 53 8d ff e9 f3 fe ff ff <0f> 0b 01
> > > > > > > d2 44 89 d1 29 d1 ba 00 00 00 00 0f 48 ca e9 28 ff ff ff
> > > > > > > RSP: 0018:ffffc900002b0d08 EFLAGS: 00010297
> > > > > > > RAX: 0000000000000000 RBX: ffff888102398c80 RCX: 0000000080190009
> > > > > > > RDX: 0000000000000000 RSI: 000000000000006a RDI: 0000000000000000
> > > > > > > RBP: ffff888102398c00 R08: 0000000000000000 R09: 0000000000000000
> > > > > > > R10: 00000000000000ca R11: 0000000000015681 R12: 0000000000000001
> > > > > > > R13: ffffc900002b0d68 R14: ffff88811115e000 R15: ffff8881107aca40
> > > > > > > FS:  00007f41ded69500(0000) GS:ffff888667dc0000(0000)
> > > > > > > knlGS:0000000000000000
> > > > > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > > CR2: 0000556ccc2dc1a0 CR3: 0000000104fd8003 CR4: 0000000000772ef0
> > > > > > > PKRU: 55555554
> > > > > > > Call Trace:
> > > > > > >  <IRQ>
> > > > > > >  ? die+0x32/0x80
> > > > > > >  ? do_trap+0xd9/0x100
> > > > > > >  ? dql_completed+0x26b/0x290
> > > > > > >  ? dql_completed+0x26b/0x290
> > > > > > >  ? do_error_trap+0x6d/0xb0
> > > > > > >  ? dql_completed+0x26b/0x290
> > > > > > >  ? exc_invalid_op+0x4c/0x60
> > > > > > >  ? dql_completed+0x26b/0x290
> > > > > > >  ? asm_exc_invalid_op+0x16/0x20
> > > > > > >  ? dql_completed+0x26b/0x290
> > > > > > >  __free_old_xmit+0xff/0x170 [virtio_net]
> > > > > > >  free_old_xmit+0x54/0xc0 [virtio_net]
> > > > > > >  virtnet_poll+0xf4/0xe30 [virtio_net]
> > > > > > >  ? __update_load_avg_cfs_rq+0x264/0x2d0
> > > > > > >  ? update_curr+0x35/0x260
> > > > > > >  ? reweight_entity+0x1be/0x260
> > > > > > >  __napi_poll.constprop.0+0x28/0x1c0
> > > > > > >  net_rx_action+0x329/0x420
> > > > > > >  ? enqueue_hrtimer+0x35/0x90
> > > > > > >  ? trace_hardirqs_on+0x1d/0x80
> > > > > > >  ? kvm_sched_clock_read+0xd/0x20
> > > > > > >  ? sched_clock+0xc/0x30
> > > > > > >  ? kvm_sched_clock_read+0xd/0x20
> > > > > > >  ? sched_clock+0xc/0x30
> > > > > > >  ? sched_clock_cpu+0xd/0x1a0
> > > > > > >  handle_softirqs+0x138/0x3e0
> > > > > > >  do_softirq.part.0+0x89/0xc0
> > > > > > >  </IRQ>
> > > > > > >  <TASK>
> > > > > > >  __local_bh_enable_ip+0xa7/0xb0
> > > > > > >  virtnet_open+0xc8/0x310 [virtio_net]
> > > > > > >  __dev_open+0xfa/0x1b0
> > > > > > >  __dev_change_flags+0x1de/0x250
> > > > > > >  dev_change_flags+0x22/0x60
> > > > > > >  do_setlink.isra.0+0x2df/0x10b0
> > > > > > >  ? rtnetlink_rcv_msg+0x34f/0x3f0
> > > > > > >  ? netlink_rcv_skb+0x54/0x100
> > > > > > >  ? netlink_unicast+0x23e/0x390
> > > > > > >  ? netlink_sendmsg+0x21e/0x490
> > > > > > >  ? ____sys_sendmsg+0x31b/0x350
> > > > > > >  ? avc_has_perm_noaudit+0x67/0xf0
> > > > > > >  ? cred_has_capability.isra.0+0x75/0x110
> > > > > > >  ? __nla_validate_parse+0x5f/0xee0
> > > > > > >  ? __pfx___probestub_irq_enable+0x3/0x10
> > > > > > >  ? __create_object+0x5e/0x90
> > > > > > >  ? security_capable+0x3b/0x70
> > > > > > >  rtnl_newlink+0x784/0xaf0
> > > > > > >  ? avc_has_perm_noaudit+0x67/0xf0
> > > > > > >  ? cred_has_capability.isra.0+0x75/0x110
> > > > > > >  ? stack_depot_save_flags+0x24/0x6d0
> > > > > > >  ? __pfx_rtnl_newlink+0x10/0x10
> > > > > > >  rtnetlink_rcv_msg+0x34f/0x3f0
> > > > > > >  ? do_syscall_64+0x6c/0x180
> > > > > > >  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > > > > >  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
> > > > > > >  netlink_rcv_skb+0x54/0x100
> > > > > > >  netlink_unicast+0x23e/0x390
> > > > > > >  netlink_sendmsg+0x21e/0x490
> > > > > > >  ____sys_sendmsg+0x31b/0x350
> > > > > > >  ? copy_msghdr_from_user+0x6d/0xa0
> > > > > > >  ___sys_sendmsg+0x86/0xd0
> > > > > > >  ? __pte_offset_map+0x17/0x160
> > > > > > >  ? preempt_count_add+0x69/0xa0
> > > > > > >  ? __call_rcu_common.constprop.0+0x147/0x610
> > > > > > >  ? preempt_count_add+0x69/0xa0
> > > > > > >  ? preempt_count_add+0x69/0xa0
> > > > > > >  ? _raw_spin_trylock+0x13/0x60
> > > > > > >  ? trace_hardirqs_on+0x1d/0x80
> > > > > > >  __sys_sendmsg+0x66/0xc0
> > > > > > >  do_syscall_64+0x6c/0x180
> > > > > > >  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > > > > > RIP: 0033:0x7f41defe5b34
> > > > > > > Code: 15 e1 12 0f 00 f7 d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00
> > > > > > > f3 0f 1e fa 80 3d 35 95 0f 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00
> > > > > > > f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
> > > > > > > RSP: 002b:00007ffe5336ecc8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
> > > > > > > RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f41defe5b34
> > > > > > > RDX: 0000000000000000 RSI: 00007ffe5336ed30 RDI: 0000000000000003
> > > > > > > RBP: 00007ffe5336eda0 R08: 0000000000000010 R09: 0000000000000001
> > > > > > > R10: 00007ffe5336f6f9 R11: 0000000000000202 R12: 0000000000000003
> > > > > > > R13: 0000000067452259 R14: 0000556ccc28b040 R15: 0000000000000000
> > > > > > >  </TASK>
> > > > > > > [...]
> > > > > > > ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
> > > > > > >
> > > > > > > Fixes: c8bd1f7f3e61 ("virtio_net: add support for Byte Queue Limits")
> > > > > > > Cc: <stable@...r.kernel.org> # v6.11+
> > > > > > > Signed-off-by: Koichiro Den <koichiro.den@...onical.com>
> > > > > > > ---
> > > > > > >  drivers/net/virtio_net.c | 37 +++++++++++++++++++++++++++++--------
> > > > > > >  1 file changed, 29 insertions(+), 8 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > > > > index 64c87bb48a41..3e36c0470600 100644
> > > > > > > --- a/drivers/net/virtio_net.c
> > > > > > > +++ b/drivers/net/virtio_net.c
> > > > > > > @@ -513,7 +513,7 @@ static struct sk_buff *virtnet_skb_append_frag(struct sk_buff *head_skb,
> > > > > > >                                                struct sk_buff *curr_skb,
> > > > > > >                                                struct page *page, void *buf,
> > > > > > >                                                int len, int truesize);
> > > > > > > -static void virtnet_xsk_completed(struct send_queue *sq, int num);
> > > > > > > +static void virtnet_xsk_completed(struct send_queue *sq, int num, bool drain);
> > > > > > >
> > > > > > >  enum virtnet_xmit_type {
> > > > > > >         VIRTNET_XMIT_TYPE_SKB,
> > > > > > > @@ -580,7 +580,8 @@ static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> > > > > > >  }
> > > > > > >
> > > > > > >  static void __free_old_xmit(struct send_queue *sq, struct netdev_queue *txq,
> > > > > > > -                           bool in_napi, struct virtnet_sq_free_stats *stats)
> > > > > > > +                           bool in_napi, struct virtnet_sq_free_stats *stats,
> > > > > > > +                           bool drain)
> > > > > > >  {
> > > > > > >         struct xdp_frame *frame;
> > > > > > >         struct sk_buff *skb;
> > > > > > > @@ -620,7 +621,8 @@ static void __free_old_xmit(struct send_queue *sq, struct netdev_queue *txq,
> > > > > > >                         break;
> > > > > > >                 }
> > > > > > >         }
> > > > > > > -       netdev_tx_completed_queue(txq, stats->napi_packets, stats->napi_bytes);
> > > > > > > +       if (!drain)
> > > > > > > +               netdev_tx_completed_queue(txq, stats->napi_packets, stats->napi_bytes);
> > > > > > >  }
> > > > > > >
> > > > > > >  static void virtnet_free_old_xmit(struct send_queue *sq,
> > > > > > > @@ -628,10 +630,21 @@ static void virtnet_free_old_xmit(struct send_queue *sq,
> > > > > > >                                   bool in_napi,
> > > > > > >                                   struct virtnet_sq_free_stats *stats)
> > > > > > >  {
> > > > > > > -       __free_old_xmit(sq, txq, in_napi, stats);
> > > > > > > +       __free_old_xmit(sq, txq, in_napi, stats, false);
> > > > > > >
> > > > > > >         if (stats->xsk)
> > > > > > > -               virtnet_xsk_completed(sq, stats->xsk);
> > > > > > > +               virtnet_xsk_completed(sq, stats->xsk, false);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void virtnet_drain_old_xmit(struct send_queue *sq,
> > > > > > > +                                  struct netdev_queue *txq)
> > > > > > > +{
> > > > > > > +       struct virtnet_sq_free_stats stats = {0};
> > > > > > > +
> > > > > > > +       __free_old_xmit(sq, txq, false, &stats, true);
> > > > > > > +
> > > > > > > +       if (stats.xsk)
> > > > > > > +               virtnet_xsk_completed(sq, stats.xsk, true);
> > > > > > >  }
> > > > > >
> > > > > > Are we sure this can drain the queue? Note that the device is not stopped.
> > > > >
> > > > > Thanks for reviewing. netif_tx_wake_queue can be invoked before the "drain"
> > > > > point I added, e.g. via virtnet_config_changed_work, so it seems that I need
> > > > > to ensure it's stopped (DRV_XOFF) before the "drain" and wake it afterwards.
> > > > > Please let me know if I'm mistaken.
> > > >
> > > > Not sure I get you, but I meant we don't reset the device so it can
> > >
> > > I was wondering whether there would be a scenario where the tx queue is
> > > woken up and some new packets from the upper layer reach dql_queued()
> > > before the drain point, which could also cause the crash.
> >
> > Ok.
> >
> > >
> > > > keep raising tx interrupts:
> > > >
> > > > virtnet_drain_old_xmit()
> > > > netdev_tx_reset_queue()
> > > > skb_xmit_done()
> > > > napi_enable()
> > > > netdev_tx_completed_queue() // here we might still surprise the bql?
> > >
> > > Indeed, virtqueue_disable_cb() is needed before the drain point.
> >
> > Two problems:
> >
> > 1) device/virtqueue is not reset, it can still process the packets
> > after virtnet_drain_old_xmit()
> > 2) virtqueue_disable_cb() just does its best effort, it can't
> > guarantee no interrupt after that.
> >
> > To drain TX, the only reliable way seems to be:
> >
> > 1) reset a virtqueue (or a device)
> > 2) drain by using free_old_xmit()
> > 3) netif_reset_tx_queue() // btw this seems better done in close, not open
>
> Thank you for the clarification.
> As for 1), I may be missing something, but what if VIRTIO_F_RING_RESET is
> not supported? A device reset in this context feels a bit excessive to me
> (just my two cents), at least if done in virtnet_open, so as you said, it
> seems better done in close in that regard as well.

Probably, most NICs have a way to stop their TX, which is missing in
virtio-net. So we don't have any choice other than a device reset when
there's no virtqueue reset.

Or we can limit BQL to devices with virtqueue reset; AF_XDP used to suffer
from this exact issue when there was no virtqueue reset. That's why
virtqueue reset was invented and AF_XDP is limited to devices that
support it.

>
> >
> > Or I wonder if this can be easily fixed by just removing
> > netdev_tx_reset_queue()?
>
> Perhaps introducing a variant of dql_reset() that clears all stall history
> would suffice, to avoid an excessively long stall_max possibly being
> recorded?

Not sure; it looks like the problem remains: BQL might still be surprised
when it gets completed TX packets while it thinks it hasn't sent any?

Thanks

>
> Thanks
>
> >
> > Thanks
> >
> > >
> > > Thanks
> > >
> > > >
> > > > Thanks
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > >  /* Converting between virtqueue no. and kernel tx/rx queue no.
> > > > > > > @@ -1499,7 +1512,8 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
> > > > > > >         /* Avoid to wakeup napi meanless, so call __free_old_xmit instead of
> > > > > > >          * free_old_xmit().
> > > > > > >          */
> > > > > > > -       __free_old_xmit(sq, netdev_get_tx_queue(dev, sq - vi->sq), true, &stats);
> > > > > > > +       __free_old_xmit(sq, netdev_get_tx_queue(dev, sq - vi->sq), true,
> > > > > > > +                       &stats, false);
> > > > > > >
> > > > > > >         if (stats.xsk)
> > > > > > >                 xsk_tx_completed(sq->xsk_pool, stats.xsk);
> > > > > > > @@ -1556,10 +1570,13 @@ static int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
> > > > > > >         return 0;
> > > > > > >  }
> > > > > > >
> > > > > > > -static void virtnet_xsk_completed(struct send_queue *sq, int num)
> > > > > > > +static void virtnet_xsk_completed(struct send_queue *sq, int num, bool drain)
> > > > > > >  {
> > > > > > >         xsk_tx_completed(sq->xsk_pool, num);
> > > > > > >
> > > > > > > +       if (drain)
> > > > > > > +               return;
> > > > > > > +
> > > > > > >         /* If this is called by rx poll, start_xmit and xdp xmit we should
> > > > > > >          * wakeup the tx napi to consume the xsk tx queue, because the tx
> > > > > > >          * interrupt may not be triggered.
> > > > > > > @@ -3041,6 +3058,7 @@ static void virtnet_disable_queue_pair(struct virtnet_info *vi, int qp_index)
> > > > > > >
> > > > > > >  static int virtnet_enable_queue_pair(struct virtnet_info *vi, int qp_index)
> > > > > > >  {
> > > > > > > +       struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, qp_index);
> > > > > > >         struct net_device *dev = vi->dev;
> > > > > > >         int err;
> > > > > > >
> > > > > > > @@ -3054,7 +3072,10 @@ static int virtnet_enable_queue_pair(struct virtnet_info *vi, int qp_index)
> > > > > > >         if (err < 0)
> > > > > > >                 goto err_xdp_reg_mem_model;
> > > > > > >
> > > > > > > -       netdev_tx_reset_queue(netdev_get_tx_queue(vi->dev, qp_index));
> > > > > > > +       /* Drain any unconsumed TX skbs transmitted before the last virtnet_close */
> > > > > > > +       virtnet_drain_old_xmit(&vi->sq[qp_index], txq);
> > > > > > > +
> > > > > > > +       netdev_tx_reset_queue(txq);
> > > > > > >         virtnet_napi_enable(vi->rq[qp_index].vq, &vi->rq[qp_index].napi);
> > > > > > >         virtnet_napi_tx_enable(vi, vi->sq[qp_index].vq, &vi->sq[qp_index].napi);
> > > > > > >
> > > > > > > --
> > > > > > > 2.43.0
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > >
> > > >
> > >
> >
>

