linux-kernel - Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)

Message-ID: <CAO7JXPhMfibNsX6Nx902PRo7_A2b4Rnc3UP=bpKYeOuQnHvtrw@mail.gmail.com>
Date: Mon, 24 Jun 2024 07:01:19 -0400
From: Vineeth Remanan Pillai <vineeth@...byteword.org>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Ben Segall <bsegall@...gle.com>, Borislav Petkov <bp@...en8.de>, 
	Daniel Bristot de Oliveira <bristot@...hat.com>, Dave Hansen <dave.hansen@...ux.intel.com>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, "H . Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...hat.com>, 
	Juri Lelli <juri.lelli@...hat.com>, Mel Gorman <mgorman@...e.de>, 
	Paolo Bonzini <pbonzini@...hat.com>, Andy Lutomirski <luto@...nel.org>, 
	Peter Zijlstra <peterz@...radead.org>, Thomas Gleixner <tglx@...utronix.de>, 
	Valentin Schneider <vschneid@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Vitaly Kuznetsov <vkuznets@...hat.com>, Wanpeng Li <wanpengli@...cent.com>, 
	Steven Rostedt <rostedt@...dmis.org>, Joel Fernandes <joel@...lfernandes.org>, 
	Suleiman Souhlal <suleiman@...gle.com>, Masami Hiramatsu <mhiramat@...nel.org>, himadrics@...ia.fr, 
	kvm@...r.kernel.org, linux-kernel@...r.kernel.org, x86@...nel.org, 
	graf@...zon.com, drjunior.org@...il.com
Subject: Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)

> > Roughly summarizing an off-list discussion.
> >
> >  - Discovering schedulers should be handled outside of KVM and the kernel, e.g.
> >    similar to how userspace uses PCI, VMBUS, etc. to enumerate devices to the guest.
> >
> >  - "Negotiating" features/hooks should also be handled outside of the kernel,
> >    e.g. similar to how VirtIO devices negotiate features between host and guest.
> >
> >  - Pushing PV scheduler entities to KVM should either be done through an exported
> >    API, e.g. if the scheduler is provided by a separate kernel module, or by a
> >    KVM or VM ioctl() (especially if the desire is to have per-VM schedulers).
> >
> > I think those were the main takeaways?  Vineeth and Joel, please chime in on
> > anything I've missed or misremembered.
> >
> Thanks for the summary of the offlist discussion. All the points are
> captured; just some minor additions. The v2 implementation moved the
> scheduling policies out of kvm into a separate entity called the
> pvsched driver, which could be implemented as a kernel module or bpf
> program. But the handshake between guest and host to decide which
> pvsched driver to attach was still going through kvm, so it was
> suggested to move this handshake (discovery and negotiation) outside
> of kvm. The idea is to have a virtual device exposed by the VMM which
> would take care of the handshake. The guest driver for this device
> would talk to the device to understand the pvsched details on the
> host and pass the shared memory details. Once the handshake is
> completed, the device is responsible for loading the pvsched driver
> (the bpf program or kernel module responsible for implementing the
> policies). The pvsched driver will register to the tracepoints
> exported by kvm and handle the callbacks from then on. The scheduling
> will be taken care of by the host scheduler; the pvsched driver on the
> host is responsible only for setting the policies (placement,
> priorities etc).
>
> With the above approach, the only change in kvm would be the internal
> tracepoints for pvsched. The host kernel will also be unchanged and
> all the complexity moves to the VMM and the pvsched driver. The guest
> kernel will have a new driver to talk to the virtual pvsched device,
> and this driver would hook into the guest kernel to pass scheduling
> information to the host (via tracepoints).
>
Noting down the recent offlist discussion and details of our response.

Based on the previous discussions, we came up with a modified
design focusing on minimal kvm changes. The design is as follows:
- Guest and host share scheduling information via a shared memory
region. The layout of the memory region, the information shared, and
the actions and policies are defined by the pvsched protocol. This
protocol is implemented by a BPF program or a kernel module.
- Host exposes a virtual device (pvsched device) to the guest. This
device is the mechanism through which host and guest handshake and
negotiate to decide on the pvsched protocol to use. The virtual device
is implemented in the VMM in userland as it is not in the
performance-critical path.
- Guest loads a pvsched driver during device enumeration. The driver
initiates the protocol handshake and negotiation with the host and
decides on the protocol. This driver creates a per-cpu shared memory
region and shares the GFN with the device in the host. Guest also
loads the BPF program that implements the protocol in the guest.
- Once the VMM has all the information needed (per-cpu shared memory
GFN, vcpu task pids etc), it loads the BPF program which implements
the protocol on the host.
- BPF program on the host registers to the tracepoints in kvm to get
callbacks on events of interest like VMENTER, VMEXIT, interrupt
injection etc. Similarly, the guest BPF program registers to
tracepoints in the guest kernel for events of interest like sched
wakeup, sched switch, enqueue, dequeue, irq entry/exit etc.
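
To make the handshake and shared memory steps above a bit more concrete,
here is a rough sketch in C. Everything in it is hypothetical and only
for illustration: the feature bits, struct pvsched_shared and the two
functions are not taken from the patch series, and the real layout would
be whatever the negotiated pvsched protocol defines. The negotiation is
modelled on the VirtIO style mentioned earlier in the thread, where only
the features offered by both sides end up enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits; the real set would be defined by the protocol. */
#define PVSCHED_F_PRIO_SHARE (1u << 0) /* guest shares task priority with host */
#define PVSCHED_F_IRQ_BOOST  (1u << 1) /* host boosts vcpu on interrupt injection */
#define PVSCHED_F_CAPACITY   (1u << 2) /* host shares cpu capacity with guest */

/*
 * Hypothetical per-vcpu shared region. The guest driver would allocate
 * one per cpu and pass its GFN to the pvsched device.
 */
struct pvsched_shared {
	uint32_t version;       /* protocol version agreed during handshake */
	uint32_t features;      /* negotiated feature bits */
	int32_t  guest_prio;    /* written by guest, read by host */
	uint32_t host_capacity; /* written by host, read by guest */
};

/* VirtIO-style negotiation: only features both sides offer are enabled. */
uint32_t pvsched_negotiate(uint32_t host_features, uint32_t guest_features)
{
	return host_features & guest_features;
}

/* Initialise one region after the handshake; -1 if no feature is common. */
int pvsched_shared_init(struct pvsched_shared *s, uint32_t version,
			uint32_t host_features, uint32_t guest_features)
{
	uint32_t agreed = pvsched_negotiate(host_features, guest_features);

	assert(s != NULL);
	if (agreed == 0)
		return -1;
	s->version = version;
	s->features = agreed;
	s->guest_prio = 0;
	s->host_capacity = 0;
	return 0;
}
```

In the design described here this logic would live in the VMM's virtual
device and the guest driver, not in kvm.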

The above design is minimally invasive to kvm and the core kernel. It
implements the protocol as loadable programs, and the protocol
handshake and negotiation through the virtual device framework. The
protocol implementation takes care of information sharing and policy
enforcement, and the scheduler handles the actual scheduling decisions.
A sample policy implementation (boosting for latency sensitive
workloads, as an example) could be included in the kernel for reference.

We had an offlist discussion about the above design and a couple of
ideas were suggested as alternatives. We took an action item to
study the feasibility of the alternatives. The rest of this mail lists
the use cases (not an exhaustive list) and our feasibility
investigations.

Existing use cases
------------------

- A latency sensitive workload on the guest might need more than one
time slice to complete, but should not block any higher priority task
in the host. In our design, the latency sensitive workload shares its
priority requirements with the host (RT priority, cfs nice value etc).
The host implementation of the protocol sets the priority of the vcpu
task accordingly so that the host scheduler can make an educated
decision on the next task to run. This makes sure that host processes
and vcpu tasks compete fairly for the cpu resource.
- Guest should be able to notify the host that it is running a lower
priority task so that the host can reschedule it if needed. As
mentioned before, the guest shares the priority with the host and the
host makes a better scheduling decision.
- Proactive vcpu boosting for events like interrupt injection.
Depending on the guest for the boost request might be too late, as the
vcpu might not be scheduled to run even after interrupt injection. The
host implementation of the protocol boosts the vcpu task's priority so
that it gets a better chance of being scheduled immediately and the
guest can handle the interrupt with minimal latency. Once the guest is
done handling the interrupt, it can notify the host and lower the
priority of the vcpu task.
- Guests which assign specialized tasks to specific vcpus can share
that information with the host so that the host can try to avoid
colocating those vcpus on a single physical cpu. For example, there are
interrupt pinning use cases where specific cpus are chosen to handle
critical interrupts, and passing this information to the host could be
useful.
- Another use case is the sharing of cpu capacity details between
guest and host. Sharing the host cpu's load with the guest will enable
the guest to schedule latency sensitive tasks on the best possible
vcpu. This could be partially achieved with steal time, but steal time
is more apparent on busy vcpus. There are workloads which are mostly
sleepers, but wake up intermittently to serve short latency sensitive
work. Input event handlers in Chrome are one such example.
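
The first three use cases can be summed up as one host-side decision:
which priority to apply to the vcpu task at a given moment. Below is a
minimal sketch of that decision, assuming a made-up struct pvsched_vcpu
and an admin-set cap on the RT priority a guest may obtain. None of
these names come from the prototype, and actually applying the result
(e.g. via the host scheduler's normal priority interfaces) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum pvsched_cls { PVSCHED_CFS, PVSCHED_RT };

/* Priority request shared by the guest: nice (-20..19) or RT prio (1..99). */
struct pvsched_req {
	enum pvsched_cls cls;
	int prio;
};

/* Hypothetical host-side state for one vcpu task. */
struct pvsched_vcpu {
	struct pvsched_req guest; /* last request read from shared memory */
	bool irq_boost;           /* set on interrupt injection, cleared by guest */
	int rt_cap;               /* highest RT priority the host grants a guest */
	int boost_rt_prio;        /* RT priority used for proactive boosting */
};

static int clamp_int(int v, int lo, int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Priority the host would apply to the vcpu task right now. */
struct pvsched_req pvsched_effective(const struct pvsched_vcpu *v)
{
	struct pvsched_req out;

	assert(v != NULL);
	out = v->guest;
	/* Proactive boost: do not wait for the guest to ask for it. */
	if (v->irq_boost &&
	    (out.cls != PVSCHED_RT || out.prio < v->boost_rt_prio)) {
		out.cls = PVSCHED_RT;
		out.prio = v->boost_rt_prio;
	}
	/* The host stays in control: clamp whatever was requested. */
	if (out.cls == PVSCHED_RT)
		out.prio = clamp_int(out.prio, 1, v->rt_cap);
	else
		out.prio = clamp_int(out.prio, -20, 19);
	return out;
}
```

Clearing irq_boost once the guest reports that the interrupt is handled
drops the vcpu task back to whatever the guest last requested, which
also covers the case where the guest asks for a lower priority.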

Data from the prototype implementation shows promising improvement in
reducing latencies. Data was shared in the v1 cover letter. We have
not implemented the capacity based placement policies yet, but plan to
do that soon and have some real numbers to share.
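
Purely to illustrate the kind of decision a capacity based placement
policy would make on the guest side (this is not the planned
implementation; the function name and the idea of a per-vcpu capacity
array read from shared memory are assumptions), the simplest form would
be to pick the vcpu whose host cpu reports the most spare capacity:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the vcpu whose host cpu reports the highest
 * capacity, or -1 if there are no vcpus. Ties go to the lowest index.
 */
int pvsched_pick_vcpu(const uint32_t *host_capacity, size_t nr_vcpus)
{
	size_t i, best = 0;

	assert(host_capacity != NULL || nr_vcpus == 0);
	if (nr_vcpus == 0)
		return -1;
	for (i = 1; i < nr_vcpus; i++)
		if (host_capacity[i] > host_capacity[best])
			best = i;
	return (int)best;
}
```

A real policy would also have to respect task affinity and the guest
scheduler's own load balancing; the sketch ignores both.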

Ideas brought up during offlist discussion
------------------------------------------

1. rseq based timeslice extension mechanism[1]

While the rseq based mechanism helps in giving the vcpu task one more
time slice, it will not help in the other use cases. We had a chat
with Steve: the rseq mechanism was mainly meant for reducing lock
contention and would not be the best fit for vcpu boosting considering
all the use cases above. RT or high priority tasks in the VM would
often need more than one time slice to complete their work and, at the
same time, should not hurt the host workloads. The goal for the above
use cases is not to request an extra slice, but to modify the priority
in such a way that host processes and guest processes can compete
fairly for cpu resources. This also means that the vcpu task can
request a lower priority when it is running lower priority tasks in
the VM.

2. vDSO approach

Regarding the vDSO approach, we had a look at it and feel that without
a major redesign of vDSO, it might be difficult to meet the
requirements. vDSO is currently implemented as a read-only memory
region shared with processes. For this to work with virtualization, we
would need to map a similar region into the guest, and it would have to
be read-write. This is more or less what we are also proposing, but
with minimal changes in the core kernel. With the current design, the
shared memory region would be the responsibility of the virtual
pvsched device framework.

Sorry for the long mail. Please have a look and let us know your thoughts :-)

Thanks,

[1]: https://lore.kernel.org/all/20231025235413.597287e1@gandalf.local.home/