Message-ID: <CAEXW_YTfgemRBKRv2UNjsOLhokxvvmHbVVj1JLtVmhywKtqeHA@mail.gmail.com>
Date: Fri, 15 Dec 2023 10:20:03 -0500
From: Joel Fernandes <joel@...lfernandes.org>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Vineeth Remanan Pillai <vineeth@...byteword.org>, Ben Segall <bsegall@...gle.com>, 
	Borislav Petkov <bp@...en8.de>, Daniel Bristot de Oliveira <bristot@...hat.com>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, Dietmar Eggemann <dietmar.eggemann@....com>, 
	"H . Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>, 
	Mel Gorman <mgorman@...e.de>, Paolo Bonzini <pbonzini@...hat.com>, Andy Lutomirski <luto@...nel.org>, 
	Peter Zijlstra <peterz@...radead.org>, Steven Rostedt <rostedt@...dmis.org>, 
	Thomas Gleixner <tglx@...utronix.de>, Valentin Schneider <vschneid@...hat.com>, 
	Vincent Guittot <vincent.guittot@...aro.org>, Vitaly Kuznetsov <vkuznets@...hat.com>, 
	Wanpeng Li <wanpengli@...cent.com>, Suleiman Souhlal <suleiman@...gle.com>, 
	Masami Hiramatsu <mhiramat@...gle.com>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org, 
	x86@...nel.org, Tejun Heo <tj@...nel.org>, Josh Don <joshdon@...gle.com>, 
	Barret Rhoden <brho@...gle.com>, David Vernet <dvernet@...a.com>
Subject: Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

Hi Sean,
Nice to see your quick response to the RFC, thanks. I wanted to
clarify some points below:

On Thu, Dec 14, 2023 at 3:13 PM Sean Christopherson <seanjc@...gle.com> wrote:
>
> On Thu, Dec 14, 2023, Vineeth Remanan Pillai wrote:
> > On Thu, Dec 14, 2023 at 11:38 AM Sean Christopherson <seanjc@...gle.com> wrote:
> > Now when I think about it, the implementation seems to
> > suggest that we are putting policies in kvm. Ideally, the goal is:
> > - guest scheduler communicates the priority requirements of the workload
> > - kvm applies the priority to the vcpu task.
>
> Why?  Tasks are tasks, why does KVM need to get involved?  E.g. if the problem
> is that userspace doesn't have the right knobs to adjust the priority of a task
> quickly and efficiently, then wouldn't it be better to solve that problem in a
> generic way?

No, it is not only about tasks. We also boost anything RT or above,
such as softirqs, irqs, etc. Could you please take a look at the other
patches? Also, Vineeth, please make this clear in the next revision.

> > > Pushing the scheduling policies to host userspace would allow for far more control
> > > and flexibility.  E.g. a heavily paravirtualized environment where host userspace
> > > knows *exactly* what workloads are being run could have wildly different policies
> > > than an environment where the guest is a fairly vanilla Linux VM that has received
> > > a small amount of enlightenment.
> > >
> > > Lastly, if the concern/argument is that userspace doesn't have the right knobs
> > > to (quickly) boost vCPU tasks, then the proposed sched_ext functionality seems
> > > tailor made for the problems you are trying to solve.
> > >
> > > https://lkml.kernel.org/r/20231111024835.2164816-1-tj%40kernel.org
> > >
> > You are right, sched_ext is a good choice to have policies
> > implemented. In our case, we would need a communication mechanism as
> > well and hence we thought kvm would work best to be a medium between
> > the guest and the host.
>
> Making KVM be the medium may be convenient and the quickest way to get a PoC
> out the door, but effectively making KVM a middle-man is going to be a huge net
> negative in the long term.  Userspace can communicate with the guest just as
> easily as KVM, and if you make KVM the middle-man, then you effectively *must*
> define a relatively rigid guest/host ABI.

At the moment, the only ABI is a shared memory structure and a custom
MSR. This is no different from the existing steal-time accounting,
where a structure is similarly shared between host and guest; we could
perhaps augment that structure with additional fields instead of
adding a new one. On the ABI point, we have deliberately tried to keep
it simple (for example, a few months ago we had hypercalls, and we
went to great lengths to eliminate those).
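
For reference, here is a minimal sketch of what augmenting the
steal-time structure might look like. Only the existing fields of
struct kvm_steal_time (from arch/x86/include/uapi/asm/kvm_para.h) are
real; the sched_* fields are hypothetical and not from our patches:

struct kvm_steal_time {
	__u64 steal;
	__u32 version;
	__u32 flags;
	__u8  preempted;
	__u8  u8_pad[3];
	/* hypothetical: guest-requested scheduling hints, carved out
	 * of the existing padding so the layout stays compatible */
	__u32 sched_policy;	/* e.g. SCHED_FIFO */
	__u32 sched_priority;	/* RT priority requested by guest */
	__u32 pad[9];
};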

> If instead the contract is between host userspace and the guest, the ABI can be
> much more fluid, e.g. if you (or any setup) can control at least some amount of
> code that runs in the guest

I see your point of view. One way to achieve this would be to run a
BPF program in the VMEXIT path to implement the boosting; KVM would
then just call a hook. Would that alleviate some of your concerns?
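
As a purely illustrative sketch of that idea -- the hook and the
bpf_run_vcpu_boost_hook() helper are made-up names, not an existing
KVM or BPF API, while sched_set_fifo()/sched_set_normal() are real
kernel helpers:

/* Hypothetical KVM-side hook on the VMEXIT path: the policy lives in
 * an attached BPF program, KVM only applies the result. */
static void kvm_sched_boost_on_vmexit(struct kvm_vcpu *vcpu)
{
	int boost = bpf_run_vcpu_boost_hook(vcpu);	/* hypothetical */

	if (boost > 0)
		sched_set_fifo(current);	/* boost the vCPU task */
	else if (boost == 0)
		sched_set_normal(current, 0);	/* back to CFS */
}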

> then the contract between the guest and host doesn't
> even need to be formally defined, it could simply be a matter of bundling host
> and guest code appropriately.
>
> If you want to land support for a given contract in upstream repositories, e.g.
> to broadly enable paravirt scheduling support across a variety of userspace VMMs
> and/or guests, then yeah, you'll need a formal ABI.  But that's still not a good
> reason to have KVM define the ABI.  Doing it in KVM might be a wee bit easier because
> it's largely just a matter of writing code, and LKML provides a centralized channel
> for getting buyin from all parties.  But defining an ABI that's independent of the
> kernel is absolutely doable, e.g. see the many virtio specs.
>
> I'm not saying KVM can't help, e.g. if there is information that is known only
> to KVM, but the vast majority of the contract doesn't need to be defined by KVM.

The key to making this patch work is the VMEXIT path, which is only
available to KVM. If we do anything later, it might be too late: we
have to intervene *before* the scheduler takes the vCPU thread off the
CPU. Similarly, when an interrupt is injected into the guest, we have
to boost the vCPU before the "vCPU run" stage -- anything later might
be too late.
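
To illustrate the ordering constraint -- vcpu_enter_guest() and
kvm_cpu_has_injectable_intr() are real functions in arch/x86/kvm, but
kvm_vcpu_boost() is a hypothetical placeholder for whatever applies
the priority:

static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
{
	...
	/* An injectable interrupt is pending: boost the vCPU task
	 * here, *before* entering the guest, so the scheduler does
	 * not take it off the CPU first. */
	if (kvm_cpu_has_injectable_intr(vcpu))
		kvm_vcpu_boost(vcpu);		/* hypothetical */
	...
}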

Also, you mentioned the tick path in the other email: we have no
control over the host tick preempting the vCPU thread; the guest *will
VMEXIT* on the host tick. On ChromeOS we run multiple VMs, and
overcommitting is very common, especially on devices with a smaller
number of CPUs.

Just to clarify, this isn't a "quick PoC". We have been working on
this for many months; it was hard to get it working correctly and to
handle all the corner cases. We are finally at a point where it just
works (TM), and the code is roughly half the size of what we initially
started with.

thanks,

 - Joel
