Message-ID: <Z755r4S_7BLbHlWa@google.com>
Date: Tue, 25 Feb 2025 18:17:19 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Fernand Sieber <sieberf@...zon.com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>, Paolo Bonzini <pbonzini@...hat.com>, x86@...nel.org,
kvm@...r.kernel.org, linux-kernel@...r.kernel.org, nh-open-source@...zon.com
Subject: Re: [RFC PATCH 0/3] kvm,sched: Add gtime halted
On Tue, Feb 18, 2025, Fernand Sieber wrote:
> With guest hlt, pause and mwait pass through, the hypervisor loses
> visibility on real guest cpu activity. From the point of view of the
> host, such vcpus are always 100% active even when the guest is
> completely halted.
>
> Typically hlt, pause and mwait pass through is only implemented on
> non-timeshared pcpus. However, there are cases where this assumption
> cannot be strictly met as some occasional housekeeping work needs to be
What housekeeping work?
> scheduled on such cpus while we generally want to preserve the pass
> through performance gains. This applies to systems that don't have
> dedicated cpus for housekeeping purposes.
>
> In such cases, the lack of visibility of the hypervisor is problematic
> from a load balancing point of view. In the absence of a better signal,
> it will preempt vcpus at random. For example, it could decide to interrupt
> a vcpu doing critical idle poll work while another vcpu sits idle.
>
> Another motivation for gaining visibility into real guest cpu activity
> is to enable the hypervisor to vend metrics about it for external
> consumption.
Such as?
> In this RFC we introduce the concept of guest halted time to address
> these concerns. Guest halted time (gtime_halted) accounts for cycles
> spent in guest mode while the cpu is halted. gtime_halted relies on
> measuring the mperf msr register (x86) around VM enter/exits to compute
> the number of unhalted cycles; halted cycles are then derived from the
> tsc difference minus the mperf difference.
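For reference, the derivation described above can be sketched as follows. This is only an illustration of the arithmetic, not code from the series: the function and parameter names are hypothetical, and it assumes (as the cover letter does) that mperf counts at the same rate as tsc while the CPU is unhalted.

```python
MASK64 = (1 << 64) - 1  # both counters are 64-bit MSRs and may wrap


def halted_cycles(tsc_enter, tsc_exit, mperf_enter, mperf_exit):
    """Estimate halted cycles for one guest run (hypothetical helper).

    tsc_* and mperf_* are raw counter samples taken at VM enter and VM exit.
    Assumes mperf advances at the tsc rate while the CPU is unhalted.
    """
    # Modular subtraction keeps the deltas correct across a counter wrap.
    tsc_delta = (tsc_exit - tsc_enter) & MASK64
    mperf_delta = (mperf_exit - mperf_enter) & MASK64
    # If mperf ran ahead of tsc, the rate assumption was violated for this
    # sample; clamp to zero rather than report negative halted time.
    return max(tsc_delta - mperf_delta, 0)
```

With samples of tsc 1000 -> 5000 and mperf 200 -> 1200, this attributes 3000 of the 4000 elapsed cycles to halted time.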
IMO, there are better ways to solve this than having KVM sample MPERF on every
entry and exit.
The kernel already samples APERF/MPERF on every tick and provides that information
via /proc/cpuinfo, just use that. If your userspace is unable to use /proc/cpuinfo
or similar, that needs to be explained.
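As a rough sketch of what consuming that from userspace could look like, the snippet below parses the per-CPU "cpu MHz" field out of /proc/cpuinfo. The helper names are made up for illustration, and it assumes the x86 layout in which each "processor" line is followed by a "cpu MHz" line; whether that field carries enough information for the use case in question is a separate matter.

```python
def parse_cpu_mhz(cpuinfo_text):
    """Map logical CPU number -> reported 'cpu MHz' value (hypothetical helper)."""
    mhz = {}
    cpu = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between CPU stanzas
        key = key.strip()
        value = value.strip()
        if key == "processor":
            cpu = int(value)
        elif key == "cpu MHz" and cpu is not None:
            mhz[cpu] = float(value)
    return mhz


def read_cpu_mhz(path="/proc/cpuinfo"):
    """Read and parse the live file; the default path assumes Linux procfs."""
    with open(path) as f:
        return parse_cpu_mhz(f.read())
```

Calling read_cpu_mhz() periodically and comparing values across vCPU-hosting CPUs would be one way to poll this without adding any sampling to KVM's entry/exit path.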
And if you're running vCPUs on tickless CPUs, and you're doing HLT/MWAIT passthrough,
*and* you want to schedule other tasks on those CPUs, then IMO you're abusing all
of those things and it's not KVM's problem to solve, especially now that sched_ext
is a thing.
> gtime_halted is exposed to /proc/<pid>/stat as a new entry, which enables
> users to monitor real guest activity.
>
> gtime_halted is also plumbed to the scheduler infrastructure to discount
> halted cycles from fair load accounting. This enlightens the load
> balancer to real guest activity for better task placement.
>
> This initial RFC has a few limitations and open questions:
> * only the x86 infrastructure is supported as it relies on architecture
> dependent registers. Future development will extend this to ARM.
> * we assume that mperf accumulates at the same rate as tsc. While I am
> not certain whether this assumption is ever violated, the spec doesn't
> seem to offer this guarantee [1] so we may want to calibrate mperf.
> * the sched enlightenment logic relies on periodic gtime_halted updates.
> As such, it is incompatible with nohz full because this could result
> in long periods of no update followed by a massive halted time update
> which doesn't play well with the existing PELT integration. It is
> possible to address this limitation with generalized, more complex
> accounting.