Message-ID: <154EA998-3FBB-41E9-B07E-4841B027B1B5@nutanix.com>
Date: Wed, 10 Sep 2025 18:58:18 +0000
From: Jon Kohler <jon@...anix.com>
To: "patchwork-bot+netdevbpf@...nel.org" <patchwork-bot+netdevbpf@...nel.org>
CC: "mst@...hat.com" <mst@...hat.com>,
"jasowang@...hat.com"
<jasowang@...hat.com>,
"eperezma@...hat.com" <eperezma@...hat.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"virtualization@...ts.linux.dev"
<virtualization@...ts.linux.dev>,
"netdev@...r.kernel.org"
<netdev@...r.kernel.org>,
"linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>
Subject: vhost_iotlb_miss tight loop lockup - RE vhost/net: Defer TX queue
re-enable until after sendmsg
> On May 5, 2025, at 9:40 PM, patchwork-bot+netdevbpf@...nel.org wrote:
>
> Hello:
>
> This patch was applied to netdev/net-next.git (main)
> by Jakub Kicinski <kuba@...nel.org>:
Hey all,
Writing to send up a flare and point out a problem that we're seeing
with this patch internally, specifically when we enable iommu on the
virtio-net device.
With this patch applied on a 6.12.y-based bare metal instance and then
starting a 6.12.y-based guest with iommu enabled, we see lockups
within the guest in short order, as well as the vmm (qemu) stuck in a
tight loop responding to IOTLB misses from the vhost-net worker loop.
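
For context, the cycle qemu is stuck servicing is the vhost device-IOTLB
protocol on the vhost fd (assuming the v2 IOTLB message format): vhost
posts a VHOST_IOTLB_MISS message, and the VMM answers with a
VHOST_IOTLB_UPDATE, which is consumed by vhost_process_iotlb_msg() on the
host side. A minimal userspace sketch of that loop is below (not QEMU's
actual code; translate_iova() is a made-up stand-in for the viommu
lookup):

#include <stdint.h>
#include <unistd.h>
#include <linux/vhost.h>   /* struct vhost_msg_v2, VHOST_IOTLB_* */

/* Hypothetical viommu lookup: iova -> (uaddr, size, perm). */
extern int translate_iova(uint64_t iova, uint64_t *uaddr,
                          uint64_t *size, uint8_t *perm);

static void serve_iotlb_misses(int vhost_fd)
{
        struct vhost_msg_v2 msg;

        for (;;) {
                /* Blocks until vhost hits a missing device-IOTLB entry. */
                if (read(vhost_fd, &msg, sizeof(msg)) != sizeof(msg))
                        break;
                if (msg.type != VHOST_IOTLB_MSG_V2 ||
                    msg.iotlb.type != VHOST_IOTLB_MISS)
                        continue;

                struct vhost_msg_v2 reply = { 0 };

                reply.type = VHOST_IOTLB_MSG_V2;
                reply.iotlb.type = VHOST_IOTLB_UPDATE;
                reply.iotlb.iova = msg.iotlb.iova;
                if (translate_iova(msg.iotlb.iova, &reply.iotlb.uaddr,
                                   &reply.iotlb.size, &reply.iotlb.perm))
                        continue;   /* real code would report the failure */

                /* This write is what shows up as vhost_chr_write_iter()
                 * -> vhost_process_iotlb_msg() in the profile below. */
                write(vhost_fd, &reply, sizeof(reply));
        }
}

Every TX buffer that misses in the device IOTLB has to bounce through one
of these read/write round trips, which lines up with where the qemu main
thread and the vhost worker are burning time in the profiles below.
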
We've bisected this in our internal tree, and the bisect lands squarely
on this patch, so I wanted to point out that there is some sort of
issue here.
Working on trying to figure this out, but if anything jumps off the
page to anyone, I'm happy to take advice!
Flamegraph:
https://gist.github.com/JonKohler/0e83c014230ab59ddc950f10441335f1#file-iotlb-lockup-svg
Guest dmesg errors like so:
[ 66.081694] virtio_net virtio0 eth0: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 5500 ms
[ 68.145155] virtio_net virtio0 eth0: TX timeout on queue: 0, sq: output.0, vq: 0x1, name: output.0, 7560000 usecs ago
[ 112.907012] virtio_net virtio0 eth0: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 5568 ms
[ 124.117540] virtio_net virtio0 eth0: TX timeout on queue: 0, sq: output.0, vq: 0x1, name: output.0, 16776000 usecs ago
[ 124.118050] virtio_net virtio0 eth0: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 16776 ms
[ 124.118447] virtio_net virtio0 eth0: TX timeout on queue: 0, sq: output.0, vq: 0x1, name: output.0, 16776000 usecs ago
Host-level top output:
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
3992758 qemu      20   0   16.6g  52168  26704 R  99.9   0.0  21:23.72 qemu-kvm        <<< this is the qemu main thread
3992769 qemu      20   0   16.6g  52168  26704 R  58.8   0.0  13:33.44 vhost-3992758   <<< this is the vhost-net kthread
For qemu-kvm main thread:
Samples: 13K of event 'cycles:P', 4000 Hz, Event count (approx.): 5131922583 lost: 0/0 drop: 0/0
Children Self Shared Object Symbol
- 87.41% 0.30% [kernel] [k] entry_SYSCALL_64_after_hwframe
- 87.11% entry_SYSCALL_64_after_hwframe
- do_syscall_64
- 44.79% ksys_write
- 43.74% vfs_write
- 40.96% vhost_chr_write_iter
- 38.22% vhost_process_iotlb_msg
- 13.72% vhost_iotlb_add_range_ctx
- 7.43% vhost_iotlb_map_free
- 4.37% vhost_iotlb_itree_remove
rb_next
1.78% __rb_erase_color
0.73% kfree
1.15% __rb_insert_augmented
0.68% __kmalloc_cache_noprof
- 10.73% vhost_vq_work_queue
- 7.65% try_to_wake_up
- 2.55% ttwu_queue_wakelist
- 1.72% __smp_call_single_queue
1.36% call_function_single_prep_ipi
- 1.32% __task_rq_lock
- _raw_spin_lock
native_queued_spin_lock_slowpath
- 1.30% select_task_rq
- select_task_rq_fair
- 0.88% wake_affine
available_idle_cpu
2.06% llist_add_batch
- 4.05% __mutex_lock.constprop.0
2.14% mutex_spin_on_owner
0.72% osq_lock
3.00% mutex_lock
- 1.72% kfree
- 1.16% __slab_free
slab_update_freelist.constprop.0.isra.0
1.37% _raw_spin_lock
1.08% mutex_unlock
1.98% _copy_from_iter
- 1.86% rw_verify_area
- security_file_permission
- 1.13% file_has_perm
0.69% avc_has_perm
0.63% fdget_pos
- 27.86% syscall_exit_to_user_mode
- syscall_exit_to_user_mode_prepare
- 25.96% __audit_syscall_exit
- 25.03% __audit_filter_op
6.66% audit_filter_rules.constprop.0
1.27% audit_reset_context.part.0.constprop.0
- 10.86% ksys_read
- 9.37% vfs_read
- 6.67% vhost_chr_read_iter
1.48% _copy_to_iter
1.36% _raw_spin_lock
- 1.30% __wake_up
0.81% _raw_spin_lock_irqsave
- 1.25% vhost_enqueue_msg
_raw_spin_lock
- 1.83% rw_verify_area
- security_file_permission
- 1.03% file_has_perm
0.64% avc_has_perm
0.65% fdget_pos
0.57% fput
- 2.56% syscall_trace_enter
- 1.25% __seccomp_filter
seccomp_run_filters
0.54% __audit_syscall_entry
For the vhost-net thread:
Samples: 20K of event 'cycles:P', 4000 Hz, Event count (approx.): 7796456297 lost: 0/0 drop: 0/0
Children Self Shared Object Symbol
- 100.00% 3.38% [kernel] [k] vhost_task_fn
38.26% 0xffffffff930bb8c0
- 3.36% 0
ret_from_fork_asm
ret_from_fork
- 1.16% vhost_task_fn
- 2.35% vhost_run_work_list
- 1.67% handle_tx
- 7.09% __mutex_lock.constprop.0
6.64% mutex_spin_on_owner
- 0.84% vq_meta_prefetch
- 3.22% iotlb_access_ok
2.50% vhost_iotlb_itree_first
0.80% mutex_lock
- 0.75% handle_tx_copy
0.86% llist_reverse_order
>
> On Wed, 30 Apr 2025 19:04:28 -0700 you wrote:
>> In handle_tx_copy, TX batching processes packets below ~PAGE_SIZE and
>> batches up to 64 messages before calling sock->sendmsg.
>>
>> Currently, when there are no more messages on the ring to dequeue,
>> handle_tx_copy re-enables kicks on the ring *before* firing off the
>> batch sendmsg. However, sock->sendmsg incurs a non-zero delay,
>> especially if it needs to wake up a thread (e.g., another vhost worker).
>>
>> [...]
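
To recap what the patch changed (rough paraphrase, not the actual diff):
handle_tx_copy() used to re-enable guest notifications as soon as
vhost_get_vq_desc() reported an empty ring, and only then flush the
pending batch via vhost_tx_batch()/sendmsg; with the patch, the batch is
flushed first and notifications are re-enabled afterwards, re-polling if
new descriptors arrived in the meantime:

/* Old ordering (paraphrased): */
        if (head == vq->num) {
                /* ring empty: kick re-enabled *before* the batch is sent */
                if (unlikely(vhost_enable_notify(&net->dev, vq))) {
                        vhost_disable_notify(&net->dev, vq);
                        continue;   /* more descriptors arrived, keep going */
                }
                break;
        }
        ...
        vhost_tx_batch(net, nvq, sock, &msg);   /* sendmsg of the batch */

/* New ordering (paraphrased): */
        if (head == vq->num)
                break;              /* leave kicks disabled for now */
        ...
        vhost_tx_batch(net, nvq, sock, &msg);   /* flush the batch first */
        if (unlikely(vhost_enable_notify(&net->dev, vq))) {
                vhost_disable_notify(&net->dev, vq);
                vhost_poll_queue(&vq->poll);    /* re-run for the new work */
        }
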
>
> Here is the summary with links:
> - [net-next,v3] vhost/net: Defer TX queue re-enable until after sendmsg
>   https://git.kernel.org/netdev/net-next/c/8c2e6b26ffe2
>
> You are awesome, thank you!
> --
> Deet-doot-dot, I am a bot.
> https://korg.docs.kernel.org/patchwork/pwbot.html
>
>