Message-ID: <20170925025751.GB30813@amt.cnet>
Date:   Sun, 24 Sep 2017 23:57:53 -0300
From:   Marcelo Tosatti <mtosatti@...hat.com>
To:     Paolo Bonzini <pbonzini@...hat.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
        mingo@...hat.com, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO
 hypercall

On Sun, Sep 24, 2017 at 09:05:44AM -0400, Paolo Bonzini wrote:
> 
> 
> ----- Original Message -----
> > From: "Peter Zijlstra" <peterz@...radead.org>
> > To: "Paolo Bonzini" <pbonzini@...hat.com>
> > Cc: "Marcelo Tosatti" <mtosatti@...hat.com>, "Konrad Rzeszutek Wilk" <konrad.wilk@...cle.com>, mingo@...hat.com,
> > kvm@...r.kernel.org, linux-kernel@...r.kernel.org, "Thomas Gleixner" <tglx@...utronix.de>
> > Sent: Saturday, September 23, 2017 3:41:14 PM
> > Subject: Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
> > 
> > On Sat, Sep 23, 2017 at 12:56:12PM +0200, Paolo Bonzini wrote:
> > > On 22/09/2017 14:55, Peter Zijlstra wrote:
> > > > You just explained it yourself. If the thread that needs to complete
> > > > what you're waiting on has lower priority, it will _never_ get to run if
> > > > you're busy waiting on it.
> > > > 
> > > > This is _trivial_.
> > > > 
> > > > And even for !RT it can be quite costly, because you can end up having
> > > > to burn your entire slot of CPU time before you run the other task.
> > > > 
> > > > Userspace spinning is _bad_, do not do this.
> > > 
> > > This is not userspace spinning, it is guest spinning---which behaves
> > > in effectively the same way, but which you cannot quite avoid.
> > 
> > So I'm virt illiterate and have no clue how all this works; but
> > wasn't this a vmexit? (that's what Marcelo traced). And once you've
> > done a vmexit you're a regular task again, not a vcpu.
> 
> His trace simply shows that the timer tick happened and the SCHED_NORMAL
> thread was preempted.  Bumping the vCPU thread to SCHED_FIFO drops
> the scheduler tick (the system is NOHZ_FULL) and thus 1) the frequency
> of EXTERNAL_INTERRUPT vmexits drops to one per second and 2) the thread
> is not preempted anymore.
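
(For reference, bumping a vCPU thread to SCHED_FIFO from the host is just a
sched_setscheduler() call on its task.  A minimal sketch; the tid and
priority below are placeholders, not what QEMU/libvirt actually use:)

#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

/* Sketch: make the vCPU thread (tid) SCHED_FIFO so the scheduler tick
 * stops preempting it on a NOHZ_FULL cpu. */
static int bump_vcpu_to_fifo(pid_t tid, int prio)
{
        struct sched_param sp = { .sched_priority = prio };

        if (sched_setscheduler(tid, SCHED_FIFO, &sp) < 0) {
                perror("sched_setscheduler");
                return -1;
        }
        return 0;
}
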
> 
> > > But I agree that the solution is properly prioritizing threads that can
> > > interrupt the VCPU, and using PI mutexes.

That's exactly what the patch does: the prioritization is not fixed in
time, but depends on whether or not vcpu-0 is in a spinlock protected
section.

Are you suggesting a different prioritization? Can you describe it
please, even if incomplete?

> > 
> > Right, if you want to run RT VCPUs the whole emulator/vcpu interaction
> > needs to be designed for RT.
> > 
> > > I'm not a priori opposed to paravirt scheduling primitives, but I am not
> > > at all sure that it's required.
> > 
> > Problem is that the proposed thing doesn't solve anything. There is
> > nothing that prohibits the guest from triggering a vmexit while holding
> > a spinlock and landing in the self-same problems.
> 
> Well, part of configuring virt for RT (at all levels: host hypervisor+QEMU
> and guest kernel+userspace) is that vmexits while holding a spinlock are either
> confined to one vCPU or are handled in the host hypervisor very quickly, say
> in less than 2000 clock cycles.
> 
> So I'm not denying that Marcelo's approach solves the problem, but it's very
> heavyweight and it masks an important misconfiguration (as you write above,
> everything needs to be RT and the priorities must be designed carefully).

I think you are missing the following point:

"vcpu0 can be interrupted when its not in a spinlock protected section, 
otherwise it can't."

So you _have_ to communicate to the host when the guest enters/leaves a
critical section.
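
A minimal sketch of what that guest-side communication could look like,
assuming the hypercall takes a single boost/unboost flag (the real
KVM_HC_RT_PRIO ABI and the actual hook points are whatever the patch
defines):

#include <linux/spinlock.h>
#include <linux/kvm_para.h>

/* Guest side, sketch only: signal the host when vcpu0 enters/leaves a
 * spinlock protected section shared with realtime vcpus.
 * Assumed ABI: 1 = boost, 0 = unboost. */
static inline void kvm_rt_prio(unsigned long boost)
{
        kvm_hypercall1(KVM_HC_RT_PRIO, boost);
}

static inline void rt_shared_spin_lock(raw_spinlock_t *lock)
{
        kvm_rt_prio(1);         /* must not be preempted from here on */
        raw_spin_lock(lock);
}

static inline void rt_shared_spin_unlock(raw_spinlock_t *lock)
{
        raw_spin_unlock(lock);
        kvm_rt_prio(0);         /* emulator thread may preempt again */
}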

So the point of "everything needs to be RT and the priorities must be
designed carefully" is this:

	WHEN in a spinlock protected section (more specifically, a
	spinlock protected section _shared with realtime vcpus_),

	priority of vcpu0 > priority of emulator thread

	OTHERWISE

	priority of vcpu0 < priority of emulator thread.

(*)

So the emulator thread can interrupt vcpu0 and inject interrupts into it.
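
On the host side, rule (*) amounts to flipping vcpu0's scheduling class
relative to the emulator thread whenever that signal arrives.  A sketch,
not the actual patch; EMU_FIFO_PRIO stands in for the emulator thread's
SCHED_FIFO priority:

#include <linux/sched.h>

/* Host side, sketch only: boost vcpu0 above the emulator thread while it
 * reports being inside a shared spinlock protected section, otherwise
 * drop it back to SCHED_NORMAL so the emulator thread (SCHED_FIFO) wins
 * and can inject interrupts. */
static void vcpu0_apply_rt_rule(struct task_struct *vcpu0, bool in_crit)
{
        struct sched_param sp = {
                .sched_priority = in_crit ? EMU_FIFO_PRIO + 1 : 0,
        };

        sched_setscheduler_nocheck(vcpu0, in_crit ? SCHED_FIFO : SCHED_NORMAL,
                                   &sp);
}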

> 
> _However_, even if you do this, you may want to put the less important vCPUs
> and the emulator threads on the same physical CPU.  In that case, the vCPU
> can be placed at SCHED_RR to avoid starvation (while the emulator thread needs
> to stay at SCHED_FIFO and higher priority).  Some kind of trick that bumps
> spinlock critical sections in that vCPU to SCHED_FIFO, for a limited time only,
> might still be useful.
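
(That placement could be set up roughly as below.  A sketch with
placeholder tids, priorities and cpu number: pin both threads to one
physical CPU, keep the emulator thread SCHED_FIFO above a SCHED_RR vCPU:)

#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

/* Sketch only: an emulator thread and a non-realtime vCPU share one
 * physical CPU; the emulator thread stays SCHED_FIFO at the higher
 * priority, the vCPU runs SCHED_RR below it so it round-robins with
 * other RR tasks instead of starving them. */
static int place_on_shared_cpu(pid_t emu_tid, pid_t vcpu_tid, int cpu)
{
        struct sched_param emu = { .sched_priority = 3 };
        struct sched_param vcpu = { .sched_priority = 2 };
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);

        if (sched_setaffinity(emu_tid, sizeof(set), &set) ||
            sched_setaffinity(vcpu_tid, sizeof(set), &set))
                return -1;

        if (sched_setscheduler(emu_tid, SCHED_FIFO, &emu) ||
            sched_setscheduler(vcpu_tid, SCHED_RR, &vcpu))
                return -1;

        return 0;
}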

Anything that violates (*) above is going to cause excessive latencies
in realtime vcpus, via:

PCPU-0:
	* vcpu-0 grabs spinlock A.
	* an event wakes up the emulator thread; vcpu-0 is scheduled out,
	  the emulator thread is scheduled in.
PCPU-1:
	* a realtime vcpu tries to grab spinlock A and busy spins until
	  the emulator thread completes and vcpu-0 can run again.

So it's more than useful, it's necessary.

I'm open to suggestions for better ways to solve this problem while the
emulator thread shares a physical CPU with vcpu-0 (which is something
users are interested in, for obvious economic reasons), but:

	1) I don't get the point of Peter's rejection.

	2) I don't see how SCHED_RR can help the situation.
