linux-kernel - Re: [RFC PATCH 0/3] kvm,sched: Add gtime halted

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Z755r4S_7BLbHlWa@google.com>
Date: Tue, 25 Feb 2025 18:17:19 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Fernand Sieber <sieberf@...zon.com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, 
	Vincent Guittot <vincent.guittot@...aro.org>, Paolo Bonzini <pbonzini@...hat.com>, x86@...nel.org, 
	kvm@...r.kernel.org, linux-kernel@...r.kernel.org, nh-open-source@...zon.com
Subject: Re: [RFC PATCH 0/3] kvm,sched: Add gtime halted

On Tue, Feb 18, 2025, Fernand Sieber wrote:
> With guest hlt, pause and mwait pass through, the hypervisor loses
> visibility on real guest cpu activity. From the point of view of the
> host, such vcpus are always 100% active even when the guest is
> completely halted.
> 
> Typically hlt, pause and mwait pass through is only implemented on
> non-timeshared pcpus. However, there are cases where this assumption
> cannot be strictly met as some occasional housekeeping work needs to be

What housekeeping work?

> scheduled on such cpus while we generally want to preserve the pass
> through performance gains. This applies for system which don't have
> dedicated cpus for housekeeping purposes.
> 
> In such cases, the lack of visibility of the hypervisor is problematic
> from a load balancing point of view. In the absence of a better signal,
> it will preemt vcpus at random. For example it could decide to interrupt
> a vcpu doing critical idle poll work while another vcpu sits idle.
> 
> Another motivation for gaining visibility into real guest cpu activity
> is to enable the hypervisor to vend metrics about it for external
> consumption.

Such as?

> In this RFC we introduce the concept of guest halted time to address
> these concerns. Guest halted time (gtime_halted) accounts for cycles
> spent in guest mode while the cpu is halted. gtime_halted relies on
> measuring the mperf msr register (x86) around VM enter/exits to compute
> the number of unhalted cycles; halted cycles are then derived from the
> tsc difference minus the mperf difference.

IMO, there are better ways to solve this than having KVM sample MPERF on every
entry and exit.

The kernel already samples APERF/MPREF on every tick and provides that information
via /proc/cpuinfo, just use that.  If your userspace is unable to use /proc/cpuinfo
or similar, that needs to be explained.

And if you're running vCPUs on tickless CPUs, and you're doing HLT/MWAIT passthrough,
*and* you want to schedule other tasks on those CPUs, then IMO you're abusing all
of those things and it's not KVM's problem to solve, especially now that sched_ext
is a thing.

> gtime_halted is exposed to proc/<pid>/stat as a new entry, which enables
> users to monitor real guest activity.
> 
> gtime_halted is also plumbed to the scheduler infrastructure to discount
> halted cycles from fair load accounting. This enlightens the load
> balancer to real guest activity for better task placement.
> 
> This initial RFC has a few limitations and open questions:
> * only the x86 infrastructure is supported as it relies on architecture
>   dependent registers. Future development will extend this to ARM.
> * we assume that mperf accumulates as the same rate as tsc. While I am
>   not certain whether this assumption is ever violated, the spec doesn't
>   seem to offer this guarantee [1] so we may want to calibrate mperf.
> * the sched enlightenment logic relies on periodic gtime_halted updates.
>   As such, it is incompatible with nohz full because this could result
>   in long periods of no update followed by a massive halted time update
>   which doesn't play well with the existing PELT integration. It is
>   possible to address this limitation with generalized, more complex
>   accounting.