linux-kernel - Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT

Message-ID: <20170926224925.GA9119@amt.cnet>
Date:   Tue, 26 Sep 2017 19:49:28 -0300
From:   Marcelo Tosatti <mtosatti@...hat.com>
To:     Paolo Bonzini <pbonzini@...hat.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
        mingo@...hat.com, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO
 hypercall

On Mon, Sep 25, 2017 at 05:12:42PM +0200, Paolo Bonzini wrote:
> On 25/09/2017 11:13, Peter Zijlstra wrote:
> > On Sun, Sep 24, 2017 at 11:57:53PM -0300, Marcelo Tosatti wrote:
> >> I think you are missing the following point:
> >>
> >> "vcpu0 can be interrupted when its not in a spinlock protected section, 
> >> otherwise it can't."
> 
> Who says that?  Certainly a driver can dedicate a single VCPU to
> periodic polling of the device, in such a way that the polling does not
> require a spinlock.

This sequence:


VCPU-0					VCPU-1 (running realtime workload)

takes spinlock A
scheduled out				
					spinlock(A) (busy spins until
						     VCPU-0 is scheduled
						     back in)
scheduled in
finishes execution of 
code under protected section
releases spinlock(A)			

					takes spinlock(A)

You get that point, right?

(*)
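
To put numbers on (*), here is a minimal sketch of that timeline. It is
an illustrative model, not kernel or KVM code: the function name, its
parameters, and the assumption that VCPU-0 is scheduled out immediately
after taking the lock are all hypothetical.

```python
def vcpu1_spin_time_us(critical_section_us, preempted_for_us, vcpu1_arrives_at_us):
    """Time VCPU-1 busy-spins on spinlock A in the sequence (*).

    All times are in microseconds, measured from the moment VCPU-0
    takes the lock. Assumption: VCPU-0 is scheduled out right after
    taking the lock, so the lock is released only after the preemption
    delay plus the length of the critical section.
    """
    release_at_us = preempted_for_us + critical_section_us
    # VCPU-1 spins from its arrival until the release; never negative.
    return max(0, release_at_us - vcpu1_arrives_at_us)


# Without preemption, VCPU-1 waits at most for the critical section.
print(vcpu1_spin_time_us(2, 0, 1))
# With preemption, the whole scheduled-out interval is added to the wait.
print(vcpu1_spin_time_us(2, 50, 1))
```

In this model the spin time on the realtime VCPU grows one-for-one with
the time VCPU-0 spends scheduled out.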

> >> So you _have_ to communicate to the host when the guest enters/leaves a
> >> critical section.
> >>
> >> So this point of "everything needs to be RT and the priorities must be
> >> designed carefully", is this: 
> >>
> >> 	WHEN in spinlock protected section (more specifically, when 
> >> 	spinlock protected section _shared with realtime vcpus_),
> >>
> >> 	priority of vcpu0 > priority of emulator thread
> >>
> >> 	OTHERWISE
> >>
> >> 	priority of vcpu0 < priority of emulator thread.
> 
> This is _not_ designed carefully, this is messy.

This is very precise to me. What is "messy" about it? (It's clearly
defined.)
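
As a sketch, the rule quoted above fits in a few lines. This is only an
illustration of the rule: the function name and the +1/-1 offsets are
hypothetical, and it assumes SCHED_FIFO-style numbering where a larger
number means higher priority.

```python
def vcpu0_priority(in_shared_critical_section, emulator_priority):
    """Pick vcpu0's RT priority relative to the emulator thread.

    Assumption: larger number = higher priority (as with SCHED_FIFO),
    and emulator_priority leaves room for one level above and below.
    """
    if in_shared_critical_section:
        # Holding a lock shared with realtime vcpus: must not be preempted
        # by the emulator thread.
        return emulator_priority + 1
    # Otherwise the emulator thread may interrupt vcpu0.
    return emulator_priority - 1


print(vcpu0_priority(True, 50) > 50)
print(vcpu0_priority(False, 50) < 50)
```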

> The emulator thread can interrupt the VCPU thread, so it has to be at
> higher RT priority (+ priority inheritance of mutexes).  

It can only do that _when_ the VCPU thread is not running a critical
section which a higher priority task depends on.

> Once you have
> done that we can decide on other approaches that e.g. let you get more
> sharing by placing housekeeping VCPUs at SCHED_NORMAL or SCHED_RR.

Well, if someone looks at (*), he sees that if the interruption delay
(the time between "scheduled out" and "scheduled in" in that diagram)
exceeds a given threshold, the realtime task on vcpu1 also exceeds
its processing-time threshold.

So when you say "The emulator thread can interrupt the VCPU thread", 
you're saying that it has to be modified to interrupt for a maximum
amount of time (say 15us).

Is that what you are suggesting?

> >> So emulator thread can interrupt and inject interrupts to vcpu0.
> > 
> > spinlock protected regions are not everything. What about lock-free
> > constructs where CPU's spin-wait on one another (there's plenty).
> > 
> > And I'm clearly ignorant of how this emulation thread works, but why
> > would it run for a long time? Either it is needed for forward progress
> > of the VCPU or its not. If its not, it shouldn't run.
> 
> The emulator thread 1) should not run for long period of times indeed,
> and 2) it is needed for forward progress of the VCPU.  So it has to be
> at higher RT priority.  I agree with Peter, sorry.  Spinlocks are a red
> herring here.
> 
> Paolo

Paolo, you don't control how many interruptions by the emulator thread
happen per second. So if you let the emulator thread interrupt the
VCPU thread at all times, without some kind of bounding
of these interruptions per time unit, you have a problem similar
to (*) (where the realtime task is scheduled).
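
A rough sketch of why the rate matters as well as the length of each
interruption. The function names, the window, and the 20us budget are
hypothetical; only the 15us figure is the example used earlier in this
message.

```python
def worst_case_stolen_us(interruptions_per_window, max_interruption_us):
    """Upper bound on the time taken from the VCPU thread in one window,
    assuming every interruption runs for its maximum allowed length."""
    return interruptions_per_window * max_interruption_us


def within_budget(interruptions_per_window, max_interruption_us, budget_us):
    """True if the worst case still fits the realtime task's budget."""
    return worst_case_stolen_us(interruptions_per_window, max_interruption_us) <= budget_us


# One bounded interruption of 15us fits a (hypothetical) 20us budget.
print(within_budget(1, 15, 20))
# Ten of them in the same window do not, although each is bounded.
print(within_budget(10, 15, 20))
```

Bounding the length of a single interruption is therefore not enough in
this model; the number of interruptions per time unit has to be bounded
too.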

Another approach to the problem was suggested to OpenStack.