Message-ID: <20250218202618.567363-1-sieberf@amazon.com>
Date: Tue, 18 Feb 2025 22:26:00 +0200
From: Fernand Sieber <sieberf@...zon.com>
To: <sieberf@...zon.com>, Ingo Molnar <mingo@...hat.com>, Peter Zijlstra
<peterz@...radead.org>, Vincent Guittot <vincent.guittot@...aro.org>, "Paolo
Bonzini" <pbonzini@...hat.com>, <x86@...nel.org>, <kvm@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <nh-open-source@...zon.com>
Subject: [RFC PATCH 0/3] kvm,sched: Add gtime halted
With guest hlt, pause and mwait passthrough, the hypervisor loses
visibility into real guest cpu activity. From the point of view of the
host, such vcpus are always 100% active even when the guest is
completely halted.
Typically, hlt, pause and mwait passthrough is only enabled on
non-timeshared pcpus. However, there are cases where this assumption
cannot be strictly met, as some occasional housekeeping work needs to be
scheduled on such cpus while we generally want to preserve the
passthrough performance gains. This applies to systems which don't have
dedicated cpus for housekeeping purposes.
In such cases, the hypervisor's lack of visibility is problematic
from a load balancing point of view. In the absence of a better signal,
it will preempt vcpus at random. For example, it could decide to
interrupt a vcpu doing critical idle poll work while another vcpu sits
idle.
Another motivation for gaining visibility into real guest cpu activity
is to enable the hypervisor to vend metrics about it for external
consumption.
In this RFC we introduce the concept of guest halted time to address
these concerns. Guest halted time (gtime_halted) accounts for cycles
spent in guest mode while the cpu is halted. gtime_halted relies on
reading the mperf MSR (x86) around VM entry and exit to compute the
number of unhalted cycles; halted cycles are then derived as the tsc
delta minus the mperf delta.
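A minimal sketch of the accounting idea follows; the helper names and
static snapshot variables are illustrative only, not the patch's actual
symbols (the real implementation lives in arch/x86/kvm/x86.c and keeps
per-vcpu state):

	#include <asm/msr.h>	/* rdtsc(), rdmsrl(), MSR_IA32_MPERF */

	/*
	 * MPERF counts cycles only while the cpu is unhalted; the TSC
	 * counts cycles unconditionally. The halted cycles in a window
	 * are therefore the tsc delta minus the mperf delta.
	 */
	static u64 tsc_enter, mperf_enter;

	static void snapshot_on_vmenter(void)
	{
		tsc_enter = rdtsc();
		rdmsrl(MSR_IA32_MPERF, mperf_enter);
	}

	static u64 halted_cycles_on_vmexit(void)
	{
		u64 tsc_exit = rdtsc(), mperf_exit;

		rdmsrl(MSR_IA32_MPERF, mperf_exit);
		return (tsc_exit - tsc_enter) - (mperf_exit - mperf_enter);
	}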
gtime_halted is exposed in /proc/<pid>/stat as a new field, which
enables users to monitor real guest activity.
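For illustration, a small userspace reader, assuming the new counter is
appended as the last field of the stat line (the exact field position
is an assumption of this sketch, not something the patch guarantees):

	#include <stdio.h>
	#include <string.h>

	/* Return the last field of /proc/<pid>/stat, assumed here to
	 * be gtime_halted; returns -1 on error. */
	static long long read_gtime_halted(int pid)
	{
		char path[64], line[4096];
		long long val = -1;
		FILE *f;

		snprintf(path, sizeof(path), "/proc/%d/stat", pid);
		f = fopen(path, "r");
		if (!f)
			return -1;
		if (fgets(line, sizeof(line), f)) {
			/* strip the trailing newline so strrchr finds
			 * the last real field */
			line[strcspn(line, "\n")] = '\0';
			char *last = strrchr(line, ' ');
			if (last)
				sscanf(last + 1, "%lld", &val);
		}
		fclose(f);
		return val;
	}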
gtime_halted is also plumbed to the scheduler infrastructure to discount
halted cycles from fair load accounting. This enlightens the load
balancer to real guest activity for better task placement.
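Conceptually, the discounting can be pictured as below; this is a
sketch only, not the patch's actual code, and gtime_halted_delta_ns()
is a hypothetical helper (the real integration in kernel/sched/pelt.c
differs in detail):

	/*
	 * Clip the halted time out of the elapsed delta before it
	 * reaches the PELT accumulation, so a halted vcpu decays like
	 * an idle task instead of appearing 100% busy.
	 */
	static u64 pelt_effective_delta(struct task_struct *p, u64 now,
					u64 last_update_time)
	{
		u64 delta = now - last_update_time;
		u64 halted = gtime_halted_delta_ns(p);

		if (halted > delta)
			halted = delta;	/* clamp: no negative runtime */
		return delta - halted;	/* only unhalted time adds load */
	}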
This initial RFC has a few limitations and open questions:
* only the x86 infrastructure is supported as it relies on
architecture-dependent registers. Future development will extend this
to ARM.
* we assume that mperf accumulates at the same rate as tsc. While I am
not certain whether this assumption is ever violated, the spec doesn't
seem to offer this guarantee [1], so we may want to calibrate mperf
(see the calibration sketch after the quote below).
* the sched enlightenment logic relies on periodic gtime_halted updates.
As such, it is incompatible with nohz_full because this could result
in long periods of no update followed by a massive halted time update
which doesn't play well with the existing PELT integration. It is
possible to address this limitation with generalized, more complex
accounting.
[1] https://cdrdv2.intel.com/v1/dl/getContent/671427
"The TSC, IA32_MPERF, and IA32_FIXED_CTR2 operate at close to the
maximum non-turbo frequency, which is equal to the product of scalable
bus frequency and maximum non-turbo ratio."
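If calibration turns out to be necessary, one possible shape, purely
hypothetical, is to measure both counters over a busy window at boot
and rescale later mperf deltas (mdelay() busy-waits, so the cpu stays
unhalted and mperf keeps counting during the window):

	#include <linux/delay.h>	/* mdelay() */
	#include <linux/math64.h>	/* div64_u64() */
	#include <asm/msr.h>

	/* Hypothetical: derive a 16.16 fixed-point tsc/mperf ratio. */
	static u64 mperf_to_tsc_ratio;

	static void calibrate_mperf(void)
	{
		u64 tsc0, tsc1, mperf0, mperf1;

		tsc0 = rdtsc();
		rdmsrl(MSR_IA32_MPERF, mperf0);
		mdelay(10);		/* busy-wait: cpu stays unhalted */
		tsc1 = rdtsc();
		rdmsrl(MSR_IA32_MPERF, mperf1);
		mperf_to_tsc_ratio =
			div64_u64((tsc1 - tsc0) << 16, mperf1 - mperf0);
	}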
Fernand Sieber (3):
fs/proc: Add gtime halted to proc/<pid>/stat
kvm/x86: Add support for gtime halted
sched,x86: Make the scheduler guest unhalted aware
Documentation/filesystems/proc.rst | 1 +
arch/x86/include/asm/tsc.h | 1 +
arch/x86/kernel/tsc.c | 13 +++++++++
arch/x86/kvm/x86.c | 30 +++++++++++++++++++++
fs/proc/array.c | 7 ++++-
include/linux/sched.h | 5 ++++
include/linux/sched/signal.h | 1 +
kernel/exit.c | 1 +
kernel/fork.c | 2 +-
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 25 ++++++++++++++++++
kernel/sched/pelt.c | 42 +++++++++++++++++++++++++-----
kernel/sched/sched.h | 2 ++
13 files changed, 122 insertions(+), 9 deletions(-)
=== TESTING ===
For testing I use a host running a VM via QEMU, and I simulate host
interference via instances of stress.
The VM uses 16 vCPUs, which are pinned to pCPUs 0-15. Each vCPU is
pinned to a dedicated pCPU which follows the 'mostly non-timeshared CPU'
model.
We use QEMU's -overcommit cpu-pm=on flag to enable hlt, mwait and
pause passthrough.
On the host, alongside qEMU, there are 8 stressors pinned to the same
CPUs (taskset -c 0-15 stress --cpu 8).
The VM then runs rtla on 8 cores to measure host interference. With the
enlightenment in the patch we expect the load balancer to move the
stressors to the remaining 8 idle cores and to mostly eliminate
interference.
With enlightenment:
rtla hwnoise -c 0-7 -P f:50 -p 27000 -r 26000 -d 2m -T 1000 -q --warm-up 60
Hardware-related Noise
duration: 0 00:02:00 | time is in us
CPU Period Runtime Noise % CPU Aval Max Noise Max Single HW NMI
0 #4443 115518000 0 100.00000 0 0 0 0
1 #4442 115512416 144178 99.87518 4006 4006 37 0
2 #4443 115518000 0 100.00000 0 0 0 0
3 #4443 115518000 0 100.00000 0 0 0 0
4 #4443 115518000 0 100.00000 0 0 0 0
5 #4443 115518000 0 100.00000 0 0 0 0
6 #4444 115547479 11018 99.99046 4006 4006 3 0
7 #4444 115544000 12015 99.98960 4005 4005 3 0
Baseline without patches:
rtla hwnoise -c 0-7 -P f:50 -p 27000 -r 26000 -d 2m -T 1000 -q --warm-up 60
Hardware-related Noise
duration: 0 00:02:00 | time is in us
CPU Period Runtime Noise % CPU Aval Max Noise Max Single HW NMI
0 #4171 112394904 36139505 67.84595 29015 13006 4533 0
1 #4153 111960227 38277963 65.81110 29015 13006 4748 0
2 #3882 108016483 73845612 31.63486 29017 16005 8628 0
3 #3881 108088929 73946692 31.58717 30017 14006 8636 0
4 #4177 112380299 36646487 67.39064 28018 14007 4551 0
5 #4157 112059732 37863899 66.21096 28017 13005 4689 0
6 #4166 112312643 37458217 66.64826 29016 14005 4653 0
7 #4157 112034934 36922368 67.04387 29015 14006 4609 0
--
2.43.0