[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250826034937-mutt-send-email-mst@kernel.org>
Date: Tue, 26 Aug 2025 03:52:59 -0400
From: "Michael S. Tsirkin" <mst@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, Jason Wang <jasowang@...hat.com>,
	kvm@...r.kernel.org, virtualization@...ts.linux.dev,
	netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH 1/3] vhost_task: KVM: Don't wake KVM x86's recovery
 thread if vhost task was killed
On Mon, Aug 25, 2025 at 05:40:09PM -0700, Sean Christopherson wrote:
> Add a vhost_task_wake_safe() variant to handle the case where a vhost task
> has exited due to a signal, i.e. before being explicitly stopped by the
> owner of the task, and use the "safe" API in KVM when waking NX hugepage
> recovery tasks.  This fixes a bug where KVM will attempt to wake a task
> that has exited, which ultimately results in all manner of badness, e.g.
> 
>   Oops: general protection fault, probably for non-canonical address 0xff0e899fa1566052: 0000 [#1] SMP
>   CPU: 51 UID: 0 PID: 53807 Comm: tee Tainted: G S         O        6.17.0-smp--38183c31756a-next #826 NONE
>   Tainted: [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE
>   Hardware name: Google LLC Indus/Indus_QC_03, BIOS 30.110.0 09/13/2024
>   RIP: 0010:queued_spin_lock_slowpath+0x123/0x250
>   Code: ... <48> 89 8c 02 c0 da 47 a2 83 79 08 00 75 08 f3 90 83 79 08 00 74 f8
>   RSP: 0018:ffffbf55cffe7cf8 EFLAGS: 00010006
>   RAX: ff0e899fff0e8562 RBX: 0000000000d00000 RCX: ffffa39b40aefac0
>   RDX: 0000000000000030 RSI: fffffffffffffff8 RDI: ffffa39d0592e68c
>   RBP: 0000000000d00000 R08: 00000000ffffff80 R09: 0000000400000000
>   R10: ffffa36cce4fe401 R11: 0000000000000800 R12: 0000000000000003
>   R13: 0000000000000000 R14: ffffa39d0592e68c R15: ffffa39b9e672000
>   FS:  00007f233b2e9740(0000) GS:ffffa39b9e672000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 00007f233b39fda0 CR3: 00000004d031f002 CR4: 00000000007726f0
>   PKRU: 55555554
>   Call Trace:
>    <TASK>
>    _raw_spin_lock_irqsave+0x50/0x60
>    try_to_wake_up+0x4f/0x5d0
>    set_nx_huge_pages+0xe4/0x1c0 [kvm]
>    param_attr_store+0x89/0xf0
>    module_attr_store+0x1e/0x30
>    kernfs_fop_write_iter+0xe4/0x160
>    vfs_write+0x2cb/0x420
>    ksys_write+0x7f/0xf0
>    do_syscall_64+0x6f/0x1f0
>    entry_SYSCALL_64_after_hwframe+0x4b/0x53
>   RIP: 0033:0x7f233b4178b3
>   R13: 0000000000000002 R14: 00000000226ff3d0 R15: 0000000000000002
>    </TASK>
> 
> Provide an API in vhost task instead of forcing KVM to solve the problem,
> as KVM would literally just add an equivalent to VHOST_TASK_FLAGS_KILLED,
> along with a new lock to protect said flag.  In general, forcing simple
> usage of vhost task to care about signals _and_ take non-trivial action to
> do the right thing isn't developer friendly, and is likely to lead to
> similar bugs in the future.
> 
> Debugged-by: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
> Link: https://lore.kernel.org/all/aKkLEtoDXKxAAWju@google.com
> Link: https://lore.kernel.org/all/aJ_vEP2EHj6l0xRT@google.com
> Suggested-by: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
> Fixes: d96c77bd4eeb ("KVM: x86: switch hugepage recovery thread to vhost_task")
> Cc: stable@...r.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
OK but I dislike the API.
Default APIs should be safe. So vhost_task_wake_safe should be
vhost_task_wake
This also reduces the changes to kvm.
It does not look like we need the "unsafe" variant, so pls drop it.
If we do need it, it should be called __vhost_task_wake.
> ---
>  arch/x86/kvm/mmu/mmu.c           |  2 +-
>  include/linux/sched/vhost_task.h |  1 +
>  kernel/vhost_task.c              | 42 +++++++++++++++++++++++++++++---
>  3 files changed, 41 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6e838cb6c9e1..d11730467fd4 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7376,7 +7376,7 @@ static void kvm_wake_nx_recovery_thread(struct kvm *kvm)
>  	struct vhost_task *nx_thread = READ_ONCE(kvm->arch.nx_huge_page_recovery_thread);
>  
>  	if (nx_thread)
> -		vhost_task_wake(nx_thread);
> +		vhost_task_wake_safe(nx_thread);
>  }
>  
>  static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp)
> diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h
> index 25446c5d3508..5d5c187088f7 100644
> --- a/include/linux/sched/vhost_task.h
> +++ b/include/linux/sched/vhost_task.h
> @@ -10,5 +10,6 @@ struct vhost_task *vhost_task_create(bool (*fn)(void *),
>  void vhost_task_start(struct vhost_task *vtsk);
>  void vhost_task_stop(struct vhost_task *vtsk);
>  void vhost_task_wake(struct vhost_task *vtsk);
> +void vhost_task_wake_safe(struct vhost_task *vtsk);
>  
>  #endif /* _LINUX_SCHED_VHOST_TASK_H */
> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
> index bc738fa90c1d..5aa8ddf88d01 100644
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c
> @@ -67,18 +67,54 @@ static int vhost_task_fn(void *data)
>  	do_exit(0);
>  }
>  
> +static void __vhost_task_wake(struct vhost_task *vtsk)
> +{
> +	wake_up_process(vtsk->task);
> +}
> +
>  /**
>   * vhost_task_wake - wakeup the vhost_task
>   * @vtsk: vhost_task to wake
>   *
> - * wake up the vhost_task worker thread
> + * Wake up the vhost_task worker thread.  The caller is responsible for ensuring
> + * that the task hasn't exited.
>   */
>  void vhost_task_wake(struct vhost_task *vtsk)
>  {
> -	wake_up_process(vtsk->task);
> +	/*
> +	 * Checking VHOST_TASK_FLAGS_KILLED can race with signal delivery, but
> +	 * a race can only result in false negatives and this is just a sanity
> +	 * check, i.e. if KILLED is set, the caller is buggy no matter what.
> +	 */
> +	if (WARN_ON_ONCE(test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags)))
> +		return;
> +
> +	__vhost_task_wake(vtsk);
>  }
>  EXPORT_SYMBOL_GPL(vhost_task_wake);
>  
> +/**
> + * vhost_task_wake_safe - wakeup the vhost_task if it hasn't been killed
> + * @vtsk: vhost_task to wake
> + *
> + * Wake up the vhost_task worker thread if the task hasn't exited, e.g. due to
> + * a signal.
> + */
> +void vhost_task_wake_safe(struct vhost_task *vtsk)
> +{
> +	guard(mutex)(&vtsk->exit_mutex);
> +
> +	/* Attempting to wake a task that has been explicitly stopped is a bug. */
> +	if (WARN_ON_ONCE(test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)))
> +		return;
> +
> +	if (test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags))
> +		return;
> +
> +	__vhost_task_wake(vtsk);
> +}
> +EXPORT_SYMBOL_GPL(vhost_task_wake_safe);
> +
>  /**
>   * vhost_task_stop - stop a vhost_task
>   * @vtsk: vhost_task to stop
> @@ -91,7 +127,7 @@ void vhost_task_stop(struct vhost_task *vtsk)
>  	mutex_lock(&vtsk->exit_mutex);
>  	if (!test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags)) {
>  		set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
> -		vhost_task_wake(vtsk);
> +		__vhost_task_wake(vtsk);
>  	}
>  	mutex_unlock(&vtsk->exit_mutex);
>  
> -- 
> 2.51.0.261.g7ce5a0a67e-goog
Powered by blists - more mailing lists
 
