Message-ID: <d8c69b7a-43ca-41b7-af8a-6eb1772c55a4@efficios.com>
Date: Mon, 25 Aug 2025 13:54:19 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Thomas Gleixner <tglx@...utronix.de>, LKML <linux-kernel@...r.kernel.org>
Cc: Jens Axboe <axboe@...nel.dk>, Paolo Bonzini <pbonzini@...hat.com>,
Sean Christopherson <seanjc@...gle.com>, Wei Liu <wei.liu@...nel.org>,
Dexuan Cui <decui@...rosoft.com>, Peter Zijlstra <peterz@...radead.org>,
"Paul E. McKenney" <paulmck@...nel.org>, Boqun Feng <boqun.feng@...il.com>,
x86@...nel.org, Arnd Bergmann <arnd@...db.de>,
Heiko Carstens <hca@...ux.ibm.com>,
Christian Borntraeger <borntraeger@...ux.ibm.com>,
Sven Schnelle <svens@...ux.ibm.com>, Huacai Chen <chenhuacai@...nel.org>,
Paul Walmsley <paul.walmsley@...ive.com>, Palmer Dabbelt <palmer@...belt.com>
Subject: Re: [patch V2 07/37] rseq, virt: Retrigger RSEQ after vcpu_run()
On 2025-08-23 12:39, Thomas Gleixner wrote:
> Hypervisors invoke resume_user_mode_work() before entering the guest, which
> clears TIF_NOTIFY_RESUME. The @regs argument is NULL as there is no user
> space context available to them, so the rseq notify handler skips
> inspecting the critical section, but updates the CPU/MM CID values
> unconditionally so that the eventual pending rseq event is not lost on the
> way to user space.
>
> This is a pointless exercise as the task might be rescheduled before
> actually returning to user space and it creates unnecessary work in the
> vcpu_run() loops.
One question here: AFAIU, this removes the updates to the cpu_id_start,
cpu_id, mm_cid, and node_id fields on entry to guest mode. This means
that while the vCPU is running in guest mode, the host hypervisor
thread has stale rseq fields, until it eventually returns to the
hypervisor's host userspace (from the ioctl).
Considering the rseq uapi documentation, this should not matter.
Each of those fields has this statement:
"This field should only be read by the thread which registered this data
structure."
I can, however, think of use-cases for reading the rseq fields from other
hypervisor threads to figure out information about thread placement.
Doing so would go against the documented uapi, though.
I'd rather ask whether anyone is misusing the uapi in that way before
going ahead with the change, just to prevent surprises.
I'm OK with the re-trigger of rseq, as it does indeed appear to fix
an issue, but I'm concerned about the ABI impact of skipping the
rseq_update_cpu_node_id() on return to virt userspace.
Thoughts?
Thanks,
Mathieu
>
> It's way more efficient to ignore that invocation based on @regs == NULL
> and let the hypervisors re-raise TIF_NOTIFY_RESUME after returning from the
> vcpu_run() loop before returning from the ioctl().
>
> This ensures that a pending RSEQ update is not lost and the IDs are updated
> before returning to user space.
>
> Once the RSEQ handling is decoupled from TIF_NOTIFY_RESUME, this turns into
> a NOOP.
>
> Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
> Cc: Paolo Bonzini <pbonzini@...hat.com>
> Cc: Sean Christopherson <seanjc@...gle.com>
> Cc: Wei Liu <wei.liu@...nel.org>
> Cc: Dexuan Cui <decui@...rosoft.com>
> ---
> drivers/hv/mshv_root_main.c | 2 +
> include/linux/rseq.h | 17 +++++++++
> kernel/rseq.c | 76 +++++++++++++++++++++++---------------------
> virt/kvm/kvm_main.c | 3 +
> 4 files changed, 62 insertions(+), 36 deletions(-)
>
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -585,6 +585,8 @@ static long mshv_run_vp_with_root_schedu
> }
> } while (!vp->run.flags.intercept_suspend);
>
> + rseq_virt_userspace_exit();
> +
> return ret;
> }
>
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -38,6 +38,22 @@ static __always_inline void rseq_exit_to
> }
>
> /*
> + * KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
> + * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
> + * that case just to do it eventually again before returning to user space,
> + * the entry resume_user_mode_work() invocation is ignored as the register
> + * argument is NULL.
> + *
> + * After returning from guest mode, they have to invoke this function to
> + * re-raise TIF_NOTIFY_RESUME if necessary.
> + */
> +static inline void rseq_virt_userspace_exit(void)
> +{
> + if (current->rseq_event_pending)
> + set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
> +}
> +
> +/*
> * If parent process has a registered restartable sequences area, the
> * child inherits. Unregister rseq for a clone with CLONE_VM set.
> */
> @@ -68,6 +84,7 @@ static inline void rseq_execve(struct ta
> static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
> static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
> static inline void rseq_sched_switch_event(struct task_struct *t) { }
> +static inline void rseq_virt_userspace_exit(void) { }
> static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
> static inline void rseq_execve(struct task_struct *t) { }
> static inline void rseq_exit_to_user_mode(void) { }
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -422,50 +422,54 @@ void __rseq_handle_notify_resume(struct
> {
> struct task_struct *t = current;
> int ret, sig;
> + bool event;
> +
> + /*
> + * If invoked from hypervisors before entering the guest via
> + * resume_user_mode_work(), then @regs is a NULL pointer.
> + *
> + * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
> + * it before returning from the ioctl() to user space when
> + * rseq_event.sched_switch is set.
> + *
> + * So it's safe to ignore here instead of pointlessly updating it
> + * in the vcpu_run() loop.
> + */
> + if (!regs)
> + return;
>
> if (unlikely(t->flags & PF_EXITING))
> return;
>
> /*
> - * If invoked from hypervisors or IO-URING, then @regs is a NULL
> - * pointer, so fixup cannot be done. If the syscall which led to
> - * this invocation was invoked inside a critical section, then it
> - * will either end up in this code again or a possible violation of
> - * a syscall inside a critical region can only be detected by the
> - * debug code in rseq_syscall() in a debug enabled kernel.
> + * Read and clear the event pending bit first. If the task
> + * was not preempted or migrated or a signal is on the way,
> + * there is no point in doing any of the heavy lifting here
> + * on production kernels. In that case TIF_NOTIFY_RESUME
> + * was raised by some other functionality.
> + *
> + * This is correct because the read/clear operation is
> + * guarded against scheduler preemption, which makes it CPU
> + * local atomic. If the task is preempted right after
> + * re-enabling preemption then TIF_NOTIFY_RESUME is set
> + * again and this function is invoked another time _before_
> + * the task is able to return to user mode.
> + *
> + * On a debug kernel, invoke the fixup code unconditionally
> + * with the result handed in to allow the detection of
> + * inconsistencies.
> */
> - if (regs) {
> - /*
> - * Read and clear the event pending bit first. If the task
> - * was not preempted or migrated or a signal is on the way,
> - * there is no point in doing any of the heavy lifting here
> - * on production kernels. In that case TIF_NOTIFY_RESUME
> - * was raised by some other functionality.
> - *
> - * This is correct because the read/clear operation is
> - * guarded against scheduler preemption, which makes it CPU
> - * local atomic. If the task is preempted right after
> - * re-enabling preemption then TIF_NOTIFY_RESUME is set
> - * again and this function is invoked another time _before_
> - * the task is able to return to user mode.
> - *
> - * On a debug kernel, invoke the fixup code unconditionally
> - * with the result handed in to allow the detection of
> - * inconsistencies.
> - */
> - bool event;
> -
> - scoped_guard(RSEQ_EVENT_GUARD) {
> - event = t->rseq_event_pending;
> - t->rseq_event_pending = false;
> - }
> + scoped_guard(RSEQ_EVENT_GUARD) {
> + event = t->rseq_event_pending;
> + t->rseq_event_pending = false;
> + }
>
> - if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
> - ret = rseq_ip_fixup(regs, event);
> - if (unlikely(ret < 0))
> - goto error;
> - }
> + if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
> + ret = rseq_ip_fixup(regs, event);
> + if (unlikely(ret < 0))
> + goto error;
> }
> +
> if (unlikely(rseq_update_cpu_node_id(t)))
> goto error;
> return;
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -49,6 +49,7 @@
> #include <linux/lockdep.h>
> #include <linux/kthread.h>
> #include <linux/suspend.h>
> +#include <linux/rseq.h>
>
> #include <asm/processor.h>
> #include <asm/ioctl.h>
> @@ -4466,6 +4467,8 @@ static long kvm_vcpu_ioctl(struct file *
> r = kvm_arch_vcpu_ioctl_run(vcpu);
> vcpu->wants_to_run = false;
>
> + rseq_virt_userspace_exit();
> +
> trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
> break;
> }
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com