linux-kernel - Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <3b2c222b-9ef7-43e2-8ab3-653a5ee824d4@paulmck-laptop>
Date: Fri, 3 May 2024 15:00:49 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Leonardo Bras <leobras@...hat.com>, Paolo Bonzini <pbonzini@...hat.com>,
	Frederic Weisbecker <frederic@...nel.org>,
	Neeraj Upadhyay <quic_neeraju@...cinc.com>,
	Joel Fernandes <joel@...lfernandes.org>,
	Josh Triplett <josh@...htriplett.org>,
	Boqun Feng <boqun.feng@...il.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
	Lai Jiangshan <jiangshanlai@...il.com>,
	Zqiang <qiang.zhang1211@...il.com>,
	Marcelo Tosatti <mtosatti@...hat.com>, kvm@...r.kernel.org,
	linux-kernel@...r.kernel.org, rcu@...r.kernel.org
Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

On Fri, May 03, 2024 at 02:29:57PM -0700, Sean Christopherson wrote:
> On Fri, May 03, 2024, Leonardo Bras wrote:
> > > KVM can provide that information with much better precision, e.g. KVM
> > > knows when when it's in the core vCPU run loop.
> > 
> > That would not be enough.
> > I need to present the application/problem to make a point:
> > 
> > - There is multiple  isolated physical CPU (nohz_full) on which we want to 
> >   run KVM_RT vcpus, which will be running a real-time (low latency) task.
> > - This task should not miss deadlines (RT), so we test the VM to make sure 
> >   the maximum latency on a long run does not exceed the latency requirement
> > - This vcpu will run on SCHED_FIFO, but has to run on lower priority than
> >   rcuc, so we can avoid stalling other cpus.
> > - There may be some scenarios where the vcpu will go back to userspace
> >   (from KVM_RUN ioctl), and that does not mean it's good to interrupt the 
> >   this to run other stuff (like rcuc).
> >
> > Now, I understand it will cover most of our issues if we have a context 
> > tracking around the vcpu_run loop, since we can use that to decide not to 
> > run rcuc on the cpu if the interruption hapenned inside the loop.
> > 
> > But IIUC we can have a thread that "just got out of the loop" getting 
> > interrupted by the timer, and asked to run rcu_core which will be bad for 
> > latency.
> > 
> > I understand that the chance may be statistically low, but happening once 
> > may be enough to crush the latency numbers.
> > 
> > Now, I can't think on a place to put this context trackers in kvm code that 
> > would avoid the chance of rcuc running improperly, that's why the suggested 
> > timeout, even though its ugly.
> > 
> > About the false-positive, IIUC we could reduce it if we reset the per-cpu 
> > last_guest_exit on kvm_put.
> 
> Which then opens up the window that you're trying to avoid (IRQ arriving just
> after the vCPU is put, before the CPU exits to userspace).
> 
> If you want the "entry to guest is imminent" status to be preserved across an exit
> to userspace, then it seems liek the flag really should be a property of the task,
> not a property of the physical CPU.  Similar to how rcu_is_cpu_rrupt_from_idle()
> detects that an idle task was interrupted, that goal is to detect if a vCPU task
> was interrupted.
> 
> PF_VCPU is already "taken" for similar tracking, but if we want to track "this
> task will soon enter an extended quiescent state", I don't see any reason to make
> it specific to vCPU tasks.  Unless the kernel/KVM dynamically manages the flag,
> which as above will create windows for false negatives, the kernel needs to
> trust userspace to a certaine extent no matter what.  E.g. even if KVM sets a
> PF_xxx flag on the first KVM_RUN, nothing would prevent userspace from calling
> into KVM to get KVM to set the flag, and then doing something else entirely with
> the task.
> 
> So if we're comfortable relying on the 1 second timeout to guard against a
> misbehaving userspace, IMO we might as well fully rely on that guardrail.  I.e.
> add a generic PF_xxx flag (or whatever flag location is most appropriate) to let
> userspace communicate to the kernel that it's a real-time task that spends the
> overwhelming majority of its time in userspace or guest context, i.e. should be
> given extra leniency with respect to rcuc if the task happens to be interrupted
> while it's in kernel context.

But if the task is executing in host kernel context for quite some time,
then the host kernel's RCU really does need to take evasive action.

On the other hand, if that task is executing in guest context (either
kernel or userspace), then the host kernel's RCU can immediately report
that task's quiescent state.

Too much to ask for the host kernel's RCU to be able to sense the
difference?  ;-)

							Thanx, Paul