linux-kernel - Re: [GIT PULL] locking/urgent for v6.17-rc1


Message-ID: <aKz38QetUrDfKP8P@google.com>
Date: Mon, 25 Aug 2025 16:55:29 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>, Borislav Petkov <bp@...en8.de>, 
	Thomas Gleixner <tglx@...utronix.de>, Peter Zijlstra <peterz@...radead.org>, x86-ml <x86@...nel.org>, 
	lkml <linux-kernel@...r.kernel.org>
Subject: Re: [GIT PULL] locking/urgent for v6.17-rc1

On Mon, Aug 25, 2025, Sebastian Andrzej Siewior wrote:
> On 2025-08-22 17:28:02 [-0700], Sean Christopherson wrote:
> kvm-nx-lpage-recovery shares the mm but it grabs a reference.
> It might be a coincidence but the task, on which the wakeup chokes,
> seems to be gone according to my traces. And with
> 
> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c
> @@ -75,7 +84,10 @@ static int vhost_task_fn(void *data)
>   */
>  void vhost_task_wake(struct vhost_task *vtsk)
>  {
> -	wake_up_process(vtsk->task);
> +	mutex_lock(&vtsk->exit_mutex);
> +	if (!test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags))
> +		wake_up_process(vtsk->task);
> +	mutex_unlock(&vtsk->exit_mutex);
>  }
>  EXPORT_SYMBOL_GPL(vhost_task_wake);
>  
> it doesn't crash anymore. Could it be attempting to wake a task that is gone?

Oh fudge, that indeed is what's happening.

Each VM that KVM creates has a kvm-nx-lpage-recovery task, and KVM wakes all such
tasks across all VMs in response to any change to the hugepage recovery settings,
i.e. when privileged userspace changes any of the associated module params.

KVM holds a global lock when walking the list of VMs and so guarantees the VM
hasn't fully exited, but nothing prevents the recovery task from getting a signal
and exiting long before the VM is destroyed.  hardware_disable_test is (deliberately?)
not very tidy, and exits without explicitly closing the VM and vCPU fds, and so
its recovery task gets terminated via signal instead of by KVM explicitly calling
vhost_task_stop() when the VM is being destroyed.

The basic gist of the above diff works, but unfortunately simply taking
vtsk->exit_mutex in vhost_task_wake() doesn't appear to be an option because the
vhost code appears to have gone through a lot of effort to avoid waking an exited
task.
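
For anyone following along, the gist of the guarded wake can be sketched in
userspace with pthreads.  This is only an analogy, not kernel code and not the
eventual fix: struct worker, worker_wake() and worker_kill() are hypothetical
names, a condition variable stands in for wake_up_process(), and the "killed"
flag is set by an external caller here, whereas the real task marks itself
killed when it handles the signal.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct vhost_task. */
struct worker {
	pthread_mutex_t exit_mutex;	/* serializes wake against exit */
	pthread_cond_t cond;		/* stands in for the task's wait state */
	bool killed;			/* plays the role of VHOST_TASK_FLAGS_KILLED */
	bool pending;			/* work was requested */
	pthread_t thread;
};

static void *worker_fn(void *data)
{
	struct worker *w = data;

	pthread_mutex_lock(&w->exit_mutex);
	while (!w->killed) {
		if (w->pending) {
			w->pending = false;	/* "do" the requested work */
			continue;
		}
		pthread_cond_wait(&w->cond, &w->exit_mutex);
	}
	pthread_mutex_unlock(&w->exit_mutex);
	return NULL;
}

static int worker_start(struct worker *w)
{
	pthread_mutex_init(&w->exit_mutex, NULL);
	pthread_cond_init(&w->cond, NULL);
	w->killed = false;
	w->pending = false;
	return pthread_create(&w->thread, NULL, worker_fn, w);
}

/* Wake the worker only if it has not exited; returns true if it was woken. */
static bool worker_wake(struct worker *w)
{
	bool woke = false;

	pthread_mutex_lock(&w->exit_mutex);
	if (!w->killed) {
		w->pending = true;
		pthread_cond_signal(&w->cond);
		woke = true;
	}
	pthread_mutex_unlock(&w->exit_mutex);
	return woke;
}

/* Mark the worker as gone under the same mutex, then reap it. */
static void worker_kill(struct worker *w)
{
	pthread_mutex_lock(&w->exit_mutex);
	w->killed = true;
	pthread_cond_signal(&w->cond);
	pthread_mutex_unlock(&w->exit_mutex);
	pthread_join(w->thread, NULL);
}
```

The point of the sketch is that the "is it gone?" check and the wakeup happen
under the same mutex that the exit path takes, so a late waker sees the flag
and backs off instead of touching a task that no longer exists.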

I think we can also add some sanity checks and hints to help future users of the
vhost task code avoid running into the same problem.

I'll post a proper series.

Thanks a ton, I owe you a drink of your choice :-)

> > Strace on hardware_disable_test spewed a whole pile of these
> > 
> >   wait4(32861, 0x7ffc66475dec, WNOHANG, NULL) = 0
> >   futex(0x7fb735c43000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
> 
> That is a shared FUTEX and is probably part of pthread_join().
> 
> > immediately before the crash.  I assume it corresponds to this:
> > 
> > 		/* Child is still running, keep waiting. */
> > 		if (pid != waitpid(pid, &status, WNOHANG))
> > 			continue;
> > 
> > I also got a new splat on the "WARN_ON_ONCE(ret < 0);" at the end of __futex_ref_atomic_end().
> > This happened during boot; AFAICT our userspace was setting up cgroups.  In this
> > case, the system hung and I had to reboot.
> 
> This is odd
> 
> >   ------------[ cut here ]------------
> >   WARNING: CPU: 45 PID: 0 at kernel/futex/core.c:1604 futex_ref_rcu+0xbf/0xf0
> …
> > Heh, and two more when booting a different system.  Guess it's my lucky day.
> > This time whatever went sideways didn't appear to be fatal as the system booted
> > and I could ssh in.  One is the same WARN as above, and the second WARN on the
> > system hit the
> > 
> >   WARN_ON_ONCE(atomic_long_read(&mm->futex_atomic) != 0);
> > 
> > in futex_hash_allocate().
> 
> This means the counters don't add up after the switch. Not sure how. This
> seems to be a random task but it might be part of the previous splat.

Yeah, IIRC, those only showed up when I kexec'd into a new kernel instead of doing
a normal reboot, so it may have been some weird leftovers and/or PEBKAC?  I'll
file a new bug report if I see either of those warnings again.