linux-kernel - Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)

Message-ID: <19ecf8c8-d5ac-4cfb-a650-cf072ced81ce@efficios.com>
Date: Fri, 12 Jul 2024 10:09:03 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Joel Fernandes <joel@...lfernandes.org>,
 Vineeth Remanan Pillai <vineeth@...byteword.org>
Cc: Sean Christopherson <seanjc@...gle.com>, Ben Segall <bsegall@...gle.com>,
 Borislav Petkov <bp@...en8.de>,
 Daniel Bristot de Oliveira <bristot@...hat.com>,
 Dave Hansen <dave.hansen@...ux.intel.com>,
 Dietmar Eggemann <dietmar.eggemann@....com>, "H . Peter Anvin"
 <hpa@...or.com>, Ingo Molnar <mingo@...hat.com>,
 Juri Lelli <juri.lelli@...hat.com>, Mel Gorman <mgorman@...e.de>,
 Paolo Bonzini <pbonzini@...hat.com>, Andy Lutomirski <luto@...nel.org>,
 Peter Zijlstra <peterz@...radead.org>, Thomas Gleixner <tglx@...utronix.de>,
 Valentin Schneider <vschneid@...hat.com>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 Vitaly Kuznetsov <vkuznets@...hat.com>, Wanpeng Li <wanpengli@...cent.com>,
 Steven Rostedt <rostedt@...dmis.org>, Suleiman Souhlal
 <suleiman@...gle.com>, Masami Hiramatsu <mhiramat@...nel.org>,
 himadrics@...ia.fr, kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
 x86@...nel.org, graf@...zon.com, drjunior.org@...il.com
Subject: Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority
 management)

On 2024-07-12 08:57, Joel Fernandes wrote:
> On Mon, Jun 24, 2024 at 07:01:19AM -0400, Vineeth Remanan Pillai wrote:
[...]
>> Existing use cases
>> -------------------------
>>
>> - A latency sensitive workload on the guest might need more than one
>> time slice to complete, but should not block any higher priority task
>> in the host. In our design, the latency sensitive workload shares its
>> priority requirements to host(RT priority, cfs nice value etc). Host
>> implementation of the protocol sets the priority of the vcpu task
>> accordingly so that the host scheduler can make an educated decision
>> on the next task to run. This makes sure that host processes and vcpu
>> tasks compete fairly for the cpu resource.

AFAIU, the information you need to convey to achieve this is the priority
of the task within the guest. This information needs to reach the host
scheduler so that it can make an informed decision.

One thing that is unclear here is the acceptable overhead/latency for
pushing this information from guest to host. Is a hypercall OK, or does
it need to be exchanged over a memory mapping shared between guest and
host?

Hypercalls provide simple ABIs across guest/host, and they allow
the guest to immediately notify the host (similar to an interrupt).

A shared memory mapping will require a carefully crafted ABI layout,
and will only allow the host to use the information provided when
the host runs. Therefore, if the choice is to share this information
only through shared memory, the host scheduler will only be able to
read it when it runs, i.e. on a hypercall, an interrupt, and so on.
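
As a rough illustration of the shared-memory option, here is a minimal
userspace sketch in C11 (not kernel code). The structure and function
names are hypothetical and are not taken from the patch series. It only
shows that the guest can publish a priority cheaply, while the host sees
whatever value is current at the moment it happens to sample it.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-vcpu area mapped by both guest and host. */
struct pv_sched_shared {
	_Atomic int32_t guest_prio;	/* written by guest, read by host */
};

/* Guest side: plain store into the shared area, no exit to the host. */
void guest_publish_prio(struct pv_sched_shared *s, int32_t prio)
{
	atomic_store_explicit(&s->guest_prio, prio, memory_order_release);
}

/*
 * Host side: can only be called from a point where the host already
 * runs (hypercall, interrupt, ...). Intermediate values published by
 * the guest between two samples are never observed.
 */
int32_t host_sample_prio(struct pv_sched_shared *s)
{
	return atomic_load_explicit(&s->guest_prio, memory_order_acquire);
}
```

With a hypercall, by contrast, the guest would trap to the host on every
update, so the host could act immediately at the cost of an exit.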

>> - Guest should be able to notify the host that it is running a lower
>> priority task so that the host can reschedule it if needed. As
>> mentioned before, the guest shares the priority with the host and the
>> host takes a better scheduling decision.

It is unclear to me whether this information needs to be "pushed"
from guest to host (e.g. hypercall) in a way that allows the host
to immediately act on this information, or if it is OK to have the
host read this information when its scheduler happens to run.

>> - Proactive vcpu boosting for events like interrupt injection.
>> Depending on the guest for boost request might be too late as the vcpu
>> might not be scheduled to run even after interrupt injection. Host
>> implementation of the protocol boosts the vcpu tasks priority so that
>> it gets a better chance of immediately being scheduled and guest can
>> handle the interrupt with minimal latency. Once the guest is done
>> handling the interrupt, it can notify the host and lower the priority
>> of the vcpu task.

This appears to be a scenario where the host sets a "high priority", and
the guest clears it when it is done with the irq handler. I guess it can
be done either way (hypercall or shared memory), but the choice would
depend on the parameters identified above: acceptable overhead vs
acceptable latency to inform the host scheduler.
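
The "host sets, guest clears" handshake could be sketched as follows if
done over shared memory. This is a userspace C11 illustration under
assumed names (struct pv_boost and both functions are hypothetical),
not an implementation from the series.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical boost flag living in a per-vcpu shared area. */
struct pv_boost {
	atomic_bool boosted;
};

/* Host side: mark the vcpu as boosted when injecting an interrupt. */
void host_boost_on_irq_inject(struct pv_boost *b)
{
	atomic_store(&b->boosted, true);
}

/*
 * Guest side: called when the irq handler is done. Returns true if a
 * boost was actually pending, in which case the guest may additionally
 * want to notify the host (e.g. with a hypercall) rather than wait for
 * the host to notice the cleared flag the next time it runs.
 */
bool guest_irq_done(struct pv_boost *b)
{
	return atomic_exchange(&b->boosted, false);
}
```

Whether the guest follows the clear with an explicit notification is
exactly the overhead vs latency tradeoff discussed above.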

>> - Guests which assign specialized tasks to specific vcpus can share
>> that information with the host so that host can try to avoid
>> colocation of those cpus in a single physical cpu. for eg: there are
>> interrupt pinning use cases where specific cpus are chosen to handle
>> critical interrupts and passing this information to the host could be
>> useful.

How frequently is this topology expected to change? Is it something that
is set once when the guest starts and then stays fixed? How often it
changes will likely affect the tradeoffs here.

>> - Another use case is the sharing of cpu capacity details between
>> guest and host. Sharing the host cpu's load with the guest will enable
>> the guest to schedule latency sensitive tasks on the best possible
>> vcpu. This could be partially achievable by steal time, but steal time
>> is more apparent on busy vcpus. There are workloads which are mostly
>> sleepers, but wake up intermittently to serve short latency sensitive
>> workloads. input event handlers in chrome is one such example.

OK, so for this use case the information goes the other way around: from
host to guest. Here the shared mapping seems better than polling the state
through a hypercall.
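
For the host-to-guest direction, one possible shape is a
sequence-counter protected record that the host updates and the guest
reads without exiting. The sketch below is userspace C11 with
hypothetical names and fields, assumes a single writer (the host), and
is only meant to show the read-retry pattern, not a proposed ABI.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical host-to-guest record in a per-vcpu shared area. */
struct pv_capacity {
	atomic_uint seq;		/* odd while an update is in progress */
	_Atomic uint32_t capacity;
	_Atomic uint32_t load;
};

/* Host side (single writer): publish a consistent capacity/load pair. */
void host_publish_capacity(struct pv_capacity *c, uint32_t cap, uint32_t load)
{
	unsigned int s = atomic_load_explicit(&c->seq, memory_order_relaxed);

	atomic_store_explicit(&c->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&c->capacity, cap, memory_order_relaxed);
	atomic_store_explicit(&c->load, load, memory_order_relaxed);
	atomic_store_explicit(&c->seq, s + 2, memory_order_release);
}

/* Guest side: retry until a snapshot is read with no update in between. */
void guest_read_capacity(struct pv_capacity *c, uint32_t *cap, uint32_t *load)
{
	unsigned int s1, s2;

	do {
		s1 = atomic_load_explicit(&c->seq, memory_order_acquire);
		*cap = atomic_load_explicit(&c->capacity, memory_order_relaxed);
		*load = atomic_load_explicit(&c->load, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s2 = atomic_load_explicit(&c->seq, memory_order_relaxed);
	} while ((s1 & 1) || s1 != s2);
}
```

The guest reads this at its own pace, for instance when picking a vcpu
for a waking latency-sensitive task, with no exit to the host.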

>>
>> Data from the prototype implementation shows promising improvement in
>> reducing latencies. Data was shared in the v1 cover letter. We have
>> not implemented the capacity based placement policies yet, but plan to
>> do that soon and have some real numbers to share.
>>
>> Ideas brought up during offlist discussion
>> -------------------------------------------------------
>>
>> 1. rseq based timeslice extension mechanism[1]
>>
>> While the rseq based mechanism helps in giving the vcpu task one more
>> time slice, it will not help in the other use cases. We had a chat
>> with Steve and the rseq mechanism was mainly for improving lock
>> contention and would not work best with vcpu boosting considering all
>> the use cases above. RT or high priority tasks in the VM would often
>> need more than one time slice to complete its work and at the same,
>> should not be hurting the host workloads. The goal for the above use
>> cases is not requesting an extra slice, but to modify the priority in
>> such a way that host processes and guest processes get a fair way to
>> compete for cpu resources. This also means that vcpu task can request
>> a lower priority when it is running lower priority tasks in the VM.
> 
> I was looking at the rseq on request from the KVM call, however it does not
> make sense to me yet how to expose the rseq area via the Guest VA to the host
> kernel.  rseq is for userspace to kernel, not VM to kernel.
> 
> Steven Rostedt said as much as well, thoughts? Add Mathieu as well.

I'm not sure that rseq would help at all here, but I think we may want to
borrow concepts of data sitting in shared memory across privilege levels
and apply them to VMs.

If some of the ideas end up being useful *outside* of the context of VMs,
then I'd be willing to consider adding fields to rseq. But as long as it is
VM-specific, I suspect you'd be better off with dedicated per-vcpu pages
which you can safely share across host/guest kernels.

> 
> This idea seems to suffer from the same vDSO over-engineering below, rseq
> does not seem to fit.
> 
> Steven Rostedt told me, what we instead need is a tracepoint callback in a
> driver, that does the boosting.

I utterly dislike changing the system behavior through tracepoints. They were
designed to observe the system, not modify its behavior. If people start abusing
them, then subsystem maintainers will stop adding them. Please don't do that.
Add a notifier or think about integrating what you are planning to add into the
driver instead.
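
To make the notifier suggestion concrete, here is a small userspace
sketch of a callback chain in C. It is loosely modelled on the idea of
a kernel notifier chain but does not use the kernel API; every name in
it (pv_notifier, PV_EVENT_IRQ_INJECT, the priority value 99) is
hypothetical. The point is that the hook is explicitly meant to change
behavior, unlike a tracepoint.

```c
#include <stddef.h>

/* Hypothetical notifier block: one callback per registered user. */
struct pv_notifier {
	int (*call)(struct pv_notifier *nb, unsigned long event, void *data);
	struct pv_notifier *next;
};

struct pv_notifier_head {
	struct pv_notifier *first;
};

/* Add a notifier at the head of the chain. */
void pv_notifier_register(struct pv_notifier_head *h, struct pv_notifier *nb)
{
	nb->next = h->first;
	h->first = nb;
}

/* Invoke every callback; return how many reported handling the event. */
int pv_notifier_call_chain(struct pv_notifier_head *h, unsigned long event,
			   void *data)
{
	int handled = 0;

	for (struct pv_notifier *nb = h->first; nb; nb = nb->next)
		handled += nb->call(nb, event, data) ? 1 : 0;
	return handled;
}

enum { PV_EVENT_IRQ_INJECT = 1 };	/* hypothetical event id */

/* Example callback: raise a (hypothetical) vcpu priority on irq injection. */
int boost_on_irq(struct pv_notifier *nb, unsigned long event, void *data)
{
	int *prio = data;

	(void)nb;
	if (event != PV_EVENT_IRQ_INJECT)
		return 0;
	*prio = 99;
	return 1;
}
```

A driver would register boost_on_irq() once and the interrupt injection
path would call the chain, keeping the behavior change visible at the
call site.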

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com