Message-ID: <e8cd99b4c4f93a581203449db9caee29b9751373.camel@amazon.com>
Date: Wed, 26 Feb 2025 20:27:22 +0000
From: "Sieber, Fernand" <sieberf@...zon.com>
To: "seanjc@...gle.com" <seanjc@...gle.com>
CC: "x86@...nel.org" <x86@...nel.org>, "peterz@...radead.org"
<peterz@...radead.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "mingo@...hat.com" <mingo@...hat.com>,
"vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "nh-open-source@...zon.com"
<nh-open-source@...zon.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>
Subject: Re: [RFC PATCH 0/3] kvm,sched: Add gtime halted
On Tue, 2025-02-25 at 18:17 -0800, Sean Christopherson wrote:
>
>
> On Tue, Feb 18, 2025, Fernand Sieber wrote:
> > With guest hlt, pause and mwait pass through, the hypervisor loses
> > visibility on real guest cpu activity. From the point of view of the
> > host, such vcpus are always 100% active even when the guest is
> > completely halted.
> >
> > Typically hlt, pause and mwait pass through is only implemented on
> > non-timeshared pcpus. However, there are cases where this assumption
> > cannot be strictly met as some occasional housekeeping work needs to be
>
> What housekeeping work?
In the case that we want to solve, housekeeping work is mainly userspace
tasks implementing hypervisor functionality such as gathering metrics,
performing health checks, handling VM lifecycle, etc.

The platforms don't have dedicated cpus for housekeeping purposes and try
as much as possible to fully dedicate the cpus to VMs, hence HLT/MWAIT
pass through. The housekeeping load is low but can still interfere with
guests that are running very latency-sensitive operations on a subset of
vCPUs (e.g. idle poll), which is what we want to detect and avoid.
>
> > scheduled on such cpus while we generally want to preserve the pass
> > through performance gains. This applies for systems which don't have
> > dedicated cpus for housekeeping purposes.
> >
> > In such cases, the lack of visibility of the hypervisor is problematic
> > from a load balancing point of view. In the absence of a better signal,
> > it will preempt vcpus at random. For example it could decide to
> > interrupt a vcpu doing critical idle poll work while another vcpu sits
> > idle.
> >
> > Another motivation for gaining visibility into real guest cpu activity
> > is to enable the hypervisor to vend metrics about it for external
> > consumption.
>
> Such as?
An example is feeding VM utilisation metrics to other systems, such as
auto scaling of guest applications. While it is possible to implement
this functionality purely on the guest side, having the hypervisor handle
it means that it's available out of the box for all VMs in a standard
way, without relying on guest-side configuration.
>
> > In this RFC we introduce the concept of guest halted time to address
> > these concerns. Guest halted time (gtime_halted) accounts for cycles
> > spent in guest mode while the cpu is halted. gtime_halted relies on
> > measuring the mperf msr register (x86) around VM enter/exits to compute
> > the number of unhalted cycles; halted cycles are then derived from the
> > tsc difference minus the mperf difference.
>
> IMO, there are better ways to solve this than having KVM sample MPERF
> on every entry and exit.
>
> The kernel already samples APERF/MPERF on every tick and provides that
> information via /proc/cpuinfo, just use that. If your userspace is
> unable to use /proc/cpuinfo or similar, that needs to be explained.
If I understand correctly, what you are suggesting is to have userspace
regularly sample these values to detect the most idle CPUs and then use
CPU affinity to repin housekeeping tasks to them. While this is possible,
it essentially requires implementing another scheduling layer in
userspace through constant re-pinning of tasks. It also requires
constantly identifying the full set of tasks that can induce undesirable
overhead so that they can be pinned accordingly. For these reasons we
would rather have the logic implemented directly in the scheduler.
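
For illustration, here is a rough sketch of what I understand that
userspace approach would look like. This is purely hypothetical code: it
assumes the APERF/MPERF derived "cpu MHz" field in /proc/cpuinfo is the
signal you have in mind, and it uses a single placeholder pid where in
practice we would have to track a whole set of housekeeping tasks.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#define NR_CPUS			64
#define HOUSEKEEPING_PID	1234	/* placeholder, one task out of many */

static double effective_mhz[NR_CPUS];

/* Parse the APERF/MPERF derived effective frequency of each cpu. */
static void sample_cpu_mhz(void)
{
	FILE *f = fopen("/proc/cpuinfo", "r");
	char line[256];
	int cpu = -1;

	if (!f)
		return;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "processor : %d", &cpu) == 1)
			continue;
		if (cpu >= 0 && cpu < NR_CPUS)
			sscanf(line, "cpu MHz : %lf", &effective_mhz[cpu]);
	}
	fclose(f);
}

int main(void)
{
	for (;;) {
		cpu_set_t set;
		int cpu, idlest = 0;

		sample_cpu_mhz();
		for (cpu = 1; cpu < NR_CPUS; cpu++)
			if (effective_mhz[cpu] < effective_mhz[idlest])
				idlest = cpu;

		/* Constantly re-pin the housekeeping task(s) to the idlest cpu. */
		CPU_ZERO(&set);
		CPU_SET(idlest, &set);
		sched_setaffinity(HOUSEKEEPING_PID, sizeof(set), &set);

		sleep(1);
	}
}

Even in this toy form it amounts to a second load balancer running in
userspace, and it has to know up front about every task that can cause
interference, which is the duplication we'd like to avoid.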
>
> And if you're running vCPUs on tickless CPUs, and you're doing
> HLT/MWAIT passthrough, *and* you want to schedule other tasks on those
> CPUs, then IMO you're abusing all of those things and it's not KVM's
> problem to solve, especially now that sched_ext is a thing.
We are running vCPUs with ticks; the rest of your observations are
correct.
>
> > gtime_halted is exposed in /proc/<pid>/stat as a new entry, which
> > enables users to monitor real guest activity.
> >
> > gtime_halted is also plumbed into the scheduler infrastructure to
> > discount halted cycles from fair load accounting. This enlightens the
> > load balancer to real guest activity for better task placement.
> >
> > This initial RFC has a few limitations and open questions:
> > * only the x86 infrastructure is supported as it relies on architecture
> >   dependent registers. Future development will extend this to ARM.
> > * we assume that mperf accumulates at the same rate as tsc. While I am
> >   not certain whether this assumption is ever violated, the spec doesn't
> >   seem to offer this guarantee [1] so we may want to calibrate mperf.
> > * the sched enlightenment logic relies on periodic gtime_halted updates.
> >   As such, it is incompatible with nohz full because this could result
> >   in long periods of no update followed by a massive halted time update
> >   which doesn't play well with the existing PELT integration. It is
> >   possible to address this limitation with generalized, more complex
> >   accounting.
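
To make the accounting quoted above concrete, per VM entry/exit it boils
down to something like the following sketch (illustrative only, not the
actual patch code; the helper names are made up, and it assumes the usual
asm/msr.h accessors):

/*
 * Illustrative sketch: MPERF only counts unhalted cycles, so the gap
 * between the TSC delta and the MPERF delta over a guest entry/exit
 * window is the time the vcpu spent halted in guest mode.
 */
static u64 tsc_at_enter, mperf_at_enter;

static void gtime_halted_on_enter(void)	/* made-up helper name */
{
	tsc_at_enter = rdtsc();
	rdmsrl(MSR_IA32_MPERF, mperf_at_enter);
}

static u64 gtime_halted_on_exit(void)		/* made-up helper name */
{
	u64 tsc_now = rdtsc(), mperf_now;

	rdmsrl(MSR_IA32_MPERF, mperf_now);

	/* halted cycles = elapsed tsc cycles - unhalted (mperf) cycles */
	return (tsc_now - tsc_at_enter) - (mperf_now - mperf_at_enter);
}

The caveat in the second bullet above is exactly that this subtraction is
only correct if mperf and tsc tick at the same rate.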