netdev - Re: vhost_iotlb_miss tight loop lockup - RE vhost/net: Defer TX queue re-enable until after sendmsg

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CACGkMEvggRncZegemhR9fnkRDGJh1G3jgjycDG0ZX8RKg2-X-Q@mail.gmail.com>
Date: Thu, 11 Sep 2025 13:49:05 +0800
From: Jason Wang <jasowang@...hat.com>
To: "Michael S. Tsirkin" <mst@...hat.com>
Cc: Jon Kohler <jon@...anix.com>, 
	"patchwork-bot+netdevbpf@...nel.org" <patchwork-bot+netdevbpf@...nel.org>, 
	"eperezma@...hat.com" <eperezma@...hat.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>, 
	"virtualization@...ts.linux.dev" <virtualization@...ts.linux.dev>, 
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: vhost_iotlb_miss tight loop lockup - RE vhost/net: Defer TX queue
 re-enable until after sendmsg

On Thu, Sep 11, 2025 at 4:11 AM Michael S. Tsirkin <mst@...hat.com> wrote:
>
> On Wed, Sep 10, 2025 at 06:58:18PM +0000, Jon Kohler wrote:
> >
> >
> > > On May 5, 2025, at 9:40 PM, patchwork-bot+netdevbpf@...nel.org wrote:
> > >
> > > Hello:
> > >
> > > This patch was applied to netdev/net-next.git (main)
> > > by Jakub Kicinski <kuba@...nel.org>:
> >
> > Hey all,
> > Writing to fire up a flare and point out a problem that we’re seeing
> > with this patch internally, specifically when we enable iommu on the
> > virtio-net device.
> >
> > With this patch applied on 6.12.y-based bare metal instance and then
> > starting a 6.12.y based guest with iommu enabled, we see lockups
> > within the guest in short order, as well as vmm (qemu) stuck in a tight
> > loop responding to iommu misses from vhost net loop.
> >
> > We've bisected this in our internal tree, and for sure it is this
> > patch that is alledgedly causing the problem, so I wanted to point out
> > there is some sort of issue here.
> >
> > Working on trying to figure this out, but if jumps off the page to
> > anyone, happy to take advice!
> >
> > Flamegraph:
> > https://gist.github.com/JonKohler/0e83c014230ab59ddc950f10441335f1#file-iotlb-lockup-svg
> >
> > Guest dmesg errors like so:
> > [   66.081694] virtio_net virtio0 eth0: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 5500 ms
> > [   68.145155] virtio_net virtio0 eth0: TX timeout on queue: 0, sq: output.0, vq: 0x1, name: output.0, 7560000 usecs ago
> > [  112.907012] virtio_net virtio0 eth0: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 5568 ms
> > [  124.117540] virtio_net virtio0 eth0: TX timeout on queue: 0, sq: output.0, vq: 0x1, name: output.0, 16776000 usecs ago
> > [  124.118050] virtio_net virtio0 eth0: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 16776 ms
> > [  124.118447] virtio_net virtio0 eth0: TX timeout on queue: 0, sq: output.0, vq: 0x1, name: output.0, 16776000 usecs ago
> >
> > Host level top output
> > 3992758 qemu      20   0   16.6g  52168  26704 R  99.9   0.0  21:23.72 qemu-kvm       <<< this is the qemu main thread
> > 3992769 qemu      20   0   16.6g  52168  26704 R  58.8   0.0  13:33.44 vhost-3992758 <<< this is the vhost-net kthread
> >
> > For qemu-kvm main thread:
> > Samples: 13K of event 'cycles:P', 4000 Hz, Event count (approx.): 5131922583 lost: 0/0 drop: 0/0
> >   Children      Self  Shared Object     Symbol
> > -   87.41%     0.30%  [kernel]          [k] entry_SYSCALL_64_after_hwframe
> >    - 87.11% entry_SYSCALL_64_after_hwframe
> >       - do_syscall_64
> >          - 44.79% ksys_write
> >             - 43.74% vfs_write
> >                - 40.96% vhost_chr_write_iter
> >                   - 38.22% vhost_process_iotlb_msg
> >                      - 13.72% vhost_iotlb_add_range_ctx
> >                         - 7.43% vhost_iotlb_map_free
> >                            - 4.37% vhost_iotlb_itree_remove
> >                                 rb_next
> >                              1.78% __rb_erase_color
> >                              0.73% kfree
> >                           1.15% __rb_insert_augmented
> >                           0.68% __kmalloc_cache_noprof
> >                      - 10.73% vhost_vq_work_queue
> >                         - 7.65% try_to_wake_up
> >                            - 2.55% ttwu_queue_wakelist
> >                               - 1.72% __smp_call_single_queue
> >                                    1.36% call_function_single_prep_ipi
> >                            - 1.32% __task_rq_lock
> >                               - _raw_spin_lock
> >                                    native_queued_spin_lock_slowpath
> >                            - 1.30% select_task_rq
> >                               - select_task_rq_fair
> >                                  - 0.88% wake_affine
> >                                       available_idle_cpu
> >                           2.06% llist_add_batch
> >                      - 4.05% __mutex_lock.constprop.0
> >                           2.14% mutex_spin_on_owner
> >                           0.72% osq_lock
> >                        3.00% mutex_lock
> >                      - 1.72% kfree
> >                         - 1.16% __slab_free
> >                              slab_update_freelist.constprop.0.isra.0
> >                        1.37% _raw_spin_lock
> >                        1.08% mutex_unlock
> >                     1.98% _copy_from_iter
> >                - 1.86% rw_verify_area
> >                   - security_file_permission
> >                      - 1.13% file_has_perm
> >                           0.69% avc_has_perm
> >               0.63% fdget_pos
> >          - 27.86% syscall_exit_to_user_mode
> >             - syscall_exit_to_user_mode_prepare
> >                - 25.96% __audit_syscall_exit
> >                   - 25.03% __audit_filter_op
> >                        6.66% audit_filter_rules.constprop.0
> >                  1.27% audit_reset_context.part.0.constprop.0
> >          - 10.86% ksys_read
> >             - 9.37% vfs_read
> >                - 6.67% vhost_chr_read_iter
> >                     1.48% _copy_to_iter
> >                     1.36% _raw_spin_lock
> >                   - 1.30% __wake_up
> >                        0.81% _raw_spin_lock_irqsave
> >                   - 1.25% vhost_enqueue_msg
> >                        _raw_spin_lock
> >                - 1.83% rw_verify_area
> >                   - security_file_permission
> >                      - 1.03% file_has_perm
> >                           0.64% avc_has_perm
> >               0.65% fdget_pos
> >               0.57% fput
> >          - 2.56% syscall_trace_enter
> >             - 1.25% __seccomp_filter
> >                  seccomp_run_filters
> >               0.54% __audit_syscall_entry
> >
> > vhost-net thread
> > Samples: 20K of event 'cycles:P', 4000 Hz, Event count (approx.): 7796456297 lost: 0/0 drop: 0/0
> >   Children      Self  Shared Object     Symbol
> > -  100.00%     3.38%  [kernel]          [k] vhost_task_fn
> >      38.26% 0xffffffff930bb8c0
> >    - 3.36% 0
> >         ret_from_fork_asm
> >         ret_from_fork
> >    - 1.16% vhost_task_fn
> >       - 2.35% vhost_run_work_list
> >          - 1.67% handle_tx
> >             - 7.09% __mutex_lock.constprop.0
> >                  6.64% mutex_spin_on_owner
> >             - 0.84% vq_meta_prefetch
> >                - 3.22% iotlb_access_ok
> >                     2.50% vhost_iotlb_itree_first
> >               0.80% mutex_lock
> >             - 0.75% handle_tx_copy
> >            0.86% llist_reverse_order
> >
> > >
> > > On Wed, 30 Apr 2025 19:04:28 -0700 you wrote:
> > >> In handle_tx_copy, TX batching processes packets below ~PAGE_SIZE and
> > >> batches up to 64 messages before calling sock->sendmsg.
> > >>
> > >> Currently, when there are no more messages on the ring to dequeue,
> > >> handle_tx_copy re-enables kicks on the ring *before* firing off the
> > >> batch sendmsg. However, sock->sendmsg incurs a non-zero delay,
> > >> especially if it needs to wake up a thread (e.g., another vhost worker).
> > >>
> > >> [...]
> > >
> > > Here is the summary with links:
> > >  - [net-next,v3] vhost/net: Defer TX queue re-enable until after sendmsg
> > >    https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_netdev_net-2Dnext_c_8c2e6b26ffe2&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=0XoR6N9VbkaJ_wBENy8Z28uDdqjCe4HRNCyV-8o4etqXeEJOqoFFGjeGGP5sQcmt&s=-X8si_rU8pXKNyWNNzBqx5Fmv-ut9w2gS5E6coMDApM&e=
> > >
> > > You are awesome, thank you!
> > > --
> > > Deet-doot-dot, I am a bot.
> > > https://urldefense.proofpoint.com/v2/url?u=https-3A__korg.docs.kernel.org_patchwork_pwbot.html&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=0XoR6N9VbkaJ_wBENy8Z28uDdqjCe4HRNCyV-8o4etqXeEJOqoFFGjeGGP5sQcmt&s=sydedZsBCMSJM9_Ldw6Al-BplvM7FokLwV_80bJpGnM&e=
> > >
> > >
> >
>
>
> Well it seems that if  get_tx_bufs failed with -EAGAIN then we
> previously bailed out, but now we will redo poll and so on, forever.

Something like this, the vhost_vq_avail_empty() will cause the
vhost_poll_queue() to be queued in this case.

Let me post a patch to fix that.

Thanks

>
>
> No?
>
>
> --
> MST
>