Date: Mon, 8 Apr 2024 14:56:29 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: "Paul E. McKenney" <paulmck@...nel.org>
Cc: Marcelo Tosatti <mtosatti@...hat.com>, Leonardo Bras <leobras@...hat.com>, 
	Paolo Bonzini <pbonzini@...hat.com>, Frederic Weisbecker <frederic@...nel.org>, 
	Neeraj Upadhyay <quic_neeraju@...cinc.com>, Joel Fernandes <joel@...lfernandes.org>, 
	Josh Triplett <josh@...htriplett.org>, Boqun Feng <boqun.feng@...il.com>, 
	Steven Rostedt <rostedt@...dmis.org>, Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, 
	Lai Jiangshan <jiangshanlai@...il.com>, Zqiang <qiang.zhang1211@...il.com>, kvm@...r.kernel.org, 
	linux-kernel@...r.kernel.org, rcu@...r.kernel.org
Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

On Mon, Apr 08, 2024, Paul E. McKenney wrote:
> On Mon, Apr 08, 2024 at 01:06:00PM -0700, Sean Christopherson wrote:
> > On Mon, Apr 08, 2024, Paul E. McKenney wrote:
> > > > > > +	if (vcpu->wants_to_run)
> > > > > > +		context_tracking_guest_start_run_loop();
> > > > > 
> > > > > At this point, if this is a nohz_full CPU, it will no longer report
> > > > > quiescent states until the grace period is at least one second old.
> > > > 
> > > > I don't think I follow the "will no longer report quiescent states" issue.  Are
> > > > you saying that this would prevent guest_context_enter_irqoff() from reporting
> > > > that the CPU is entering a quiescent state?  If so, that's an issue that would
> > > > need to be resolved regardless of what heuristic we use to determine whether or
> > > > not a CPU is likely to enter a KVM guest.
> > > 
> > > Please allow me to start over.  Are interrupts disabled at this point,
> > 
> > Nope, IRQs are enabled.
> > 
> > Oof, I'm glad you asked, because I was going to say that there's one exception,
> > kvm_sched_in(), which is KVM's notifier for when a preempted task/vCPU is scheduled
> > back in.  But I forgot that kvm_sched_{in,out}() don't use vcpu_{load,put}(),
> > i.e. would need explicit calls to context_tracking_guest_{stop,start}_run_loop().
> > 
> > > and, if so, will they remain disabled until the transfer of control to
> > > the guest has become visible to RCU via the context-tracking code?
> > > 
> > > Or has the context-tracking code already made the transfer of control
> > > to the guest visible to RCU?
> > 
> > Nope.  The call to __ct_user_enter(CONTEXT_GUEST) or rcu_virt_note_context_switch()
> > happens later, just before the actual VM-Enter.  And that call does happen with
> > IRQs disabled (and IRQs stay disabled until the CPU enters the guest).
> 
> OK, then we can have difficulties with long-running interrupts hitting
> this range of code.  It is unfortunately not unheard-of for interrupts
> plus trailing softirqs to run for tens of seconds, even minutes.

Ah, and if that occurs, *and* KVM is slow to re-enter the guest, then there will
be a massive lag before the CPU gets back into a quiescent state.
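
(For reference, the hint being discussed boils down to something like the
rough sketch below.  Only the context_tracking_guest_{start,stop}_run_loop()
names come from the patch; the per-CPU variable and the RCU-side query are
illustrative.)

  #include <linux/percpu.h>

  /* Rough sketch, not the actual patch. */
  static DEFINE_PER_CPU(bool, guest_run_loop_hint);

  static inline void context_tracking_guest_start_run_loop(void)
  {
  	this_cpu_write(guest_run_loop_hint, true);
  }

  static inline void context_tracking_guest_stop_run_loop(void)
  {
  	this_cpu_write(guest_run_loop_hint, false);
  }

  /* Illustrative RCU-side query, e.g. for rcu_pending(): if the CPU
   * claims to be headed back into a guest, treat it as being about to
   * hit a quiescent state instead of invoking rcu_core(). */
  static inline bool rcu_cpu_in_guest_run_loop(void)
  {
  	return this_cpu_read(guest_run_loop_hint);
  }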

> One counter-argument is that that softirq would take scheduling-clock
> interrupts, and would eventually make rcu_core() run.

Considering that this behavior would be unique to nohz_full CPUs, how much
responsibility does RCU have to ensure a sane setup?  E.g. if a softirq runs for
multiple seconds on a nohz_full CPU whose primary role is to run a KVM vCPU, then
whatever real-time workload the vCPU is running is already doomed.

> But does a rcu_sched_clock_irq() from a guest OS have its "user"
> argument set?

No, and it shouldn't, at least not on x86 (I assume other architectures are
similar, but I don't actually know for sure).
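
The scheduling-clock path derives "user" from the interrupted register state,
roughly like the sketch below (simplified from the tick code, not an exact
copy), and after a VM-Exit those regs look like host kernel mode:

  #include <linux/ptrace.h>	/* pt_regs, user_mode() */

  /* Simplified sketch of the tick path, not an exact copy. */
  static void tick_sched_handle_sketch(struct pt_regs *regs)
  {
  	/* user_mode(regs) is false here: the interrupted state looks
  	 * like host kernel code, per the below. */
  	update_process_times(user_mode(regs));

  	/* update_process_times() in turn calls
  	 * rcu_sched_clock_irq(user_tick) with user_tick == 0. */
  }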

On x86, the IRQ that the kernel sees looks like it came from host kernel
code.  And on AMD (SVM), the IRQ doesn't just "look" like it came from the host
kernel, the IRQ really does get vectored/handled in the host kernel.  Intel CPUs
have a performance optimization where the IRQ gets "eaten" as part of the VM-Exit,
and so KVM synthesizes a stack frame and does a manual CALL to invoke the IRQ
handler.
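
Roughly, the VMX side looks like the sketch below (illustrative, not KVM's
actual exit handling; synthesize_frame_and_call() is a hypothetical stand-in
for the asm that builds the frame):

  #include <asm/desc_defs.h>	/* gate_desc, gate_offset() */

  /* Illustrative sketch, not KVM's actual code. */
  extern unsigned long host_idt_base;	/* host IDT base, captured at setup */

  static void handle_external_interrupt_sketch(u32 exit_intr_info)
  {
  	unsigned int vector = exit_intr_info & 0xff;	/* vector field */
  	gate_desc *desc = (gate_desc *)host_idt_base + vector;

  	/* The real asm pushes a hardware-style frame (SS, RSP, RFLAGS,
  	 * CS) and then CALLs the handler so its IRET works normally. */
  	synthesize_frame_and_call(gate_offset(desc));	/* hypothetical */
  }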

And that's just for IRQs that actually arrive while the guest is running.  IRQs
that arrive while KVM is active, e.g. while running its vcpu_run() loop, are
"pure" host IRQs.
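
I.e. the ordering is roughly the sketch below (simplified, not KVM's actual
run loop; hw_vm_enter() is a stand-in for the real VM-Enter, and
guest_context_enter_irqoff() is the helper mentioned earlier):

  /* Simplified ordering, not KVM's actual run loop. */
  static void vcpu_enter_guest_sketch(struct kvm_vcpu *vcpu)
  {
  	/* IRQs that fire here are ordinary host IRQs. */

  	local_irq_disable();

  	/* Make the transition to the guest visible to RCU/context
  	 * tracking; IRQs stay disabled until the CPU is in the guest. */
  	guest_context_enter_irqoff();

  	hw_vm_enter(vcpu);	/* stand-in for the real VM-Enter */

  	guest_context_exit_irqoff();
  	local_irq_enable();
  }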
