Message-ID: <f114eb3a8a21e1cd1a120db32258340504464458.camel@amazon.com>
Date: Thu, 27 Feb 2025 07:20:11 +0000
From: "Sieber, Fernand" <sieberf@...zon.com>
To: "seanjc@...gle.com" <seanjc@...gle.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"x86@...nel.org" <x86@...nel.org>, "peterz@...radead.org"
<peterz@...radead.org>, "mingo@...hat.com" <mingo@...hat.com>,
"vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
"pbonzini@...hat.com" <pbonzini@...hat.com>, "nh-open-source@...zon.com"
<nh-open-source@...zon.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>
Subject: Re: [RFC PATCH 0/3] kvm,sched: Add gtime halted

On Wed, 2025-02-26 at 13:00 -0800, Sean Christopherson wrote:
> On Wed, Feb 26, 2025, Fernand Sieber wrote:
> > On Tue, 2025-02-25 at 18:17 -0800, Sean Christopherson wrote:
> > > > In this RFC we introduce the concept of guest halted time to
> > > > address these concerns. Guest halted time (gtime_halted)
> > > > accounts for cycles spent in guest mode while the cpu is halted.
> > > > gtime_halted relies on measuring the mperf msr register (x86)
> > > > around VM enter/exits to compute the number of unhalted cycles;
> > > > halted cycles are then derived from the tsc difference minus the
> > > > mperf difference.
> > >
> > > IMO, there are better ways to solve this than having KVM sample
> > > MPERF on every entry and exit.
> > >
> > > The kernel already samples APERF/MPERF on every tick and provides
> > > that information via /proc/cpuinfo, just use that. If your
> > > userspace is unable to use /proc/cpuinfo or similar, that needs to
> > > be explained.
> >
> > If I understand correctly, what you are suggesting is to have
> > userspace regularly sample these values to detect the most idle
> > CPUs and then use CPU affinity to repin housekeeping tasks to them.
> > While that is possible, it essentially requires implementing
> > another scheduling layer in userspace through constant re-pinning
> > of tasks. It also requires constantly identifying the full set of
> > tasks that can induce undesirable overhead so that they can be
> > pinned accordingly. For these reasons we would rather have the
> > logic implemented directly in the scheduler.
> >
> > > And if you're running vCPUs on tickless CPUs, and you're doing
> > > HLT/MWAIT passthrough, *and* you want to schedule other tasks on
> > > those CPUs, then IMO you're abusing all of those things and it's
> > > not KVM's problem to solve, especially now that sched_ext is a
> > > thing.
> >
> > We are running vCPUs with ticks; the rest of your observations are
> > correct.
>
> If there's a host tick, why do you need KVM's help to make scheduling
> decisions? It sounds like what you want is a scheduler that is
> primarily driven by MPERF (and APERF?), and sched_tick() =>
> arch_scale_freq_tick() already knows about MPERF.

Having the measurement around VM enter/exit makes it easy to attribute
the unhalted cycles to a specific task (vCPU), which solves both of our
use cases, VM metrics and scheduling. A rough sketch of that sampling
is below.
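
For illustration only, a minimal sketch of the idea under assumed names
(the stats struct, field names and hook points are hypothetical, not
the actual patch):

#include <asm/msr.h>	/* rdmsrl(), rdtsc(), MSR_IA32_MPERF */

struct vcpu_halt_stats {	/* hypothetical, for illustration */
	u64 tsc_enter;
	u64 mperf_enter;
	u64 gtime_halted;
};

/* called just before VM enter */
static void vcpu_halt_stats_enter(struct vcpu_halt_stats *s)
{
	rdmsrl(MSR_IA32_MPERF, s->mperf_enter);
	s->tsc_enter = rdtsc();
}

/* called just after VM exit */
static void vcpu_halt_stats_exit(struct vcpu_halt_stats *s)
{
	u64 mperf_exit, tsc_exit, unhalted, total;

	rdmsrl(MSR_IA32_MPERF, mperf_exit);
	tsc_exit = rdtsc();

	/* MPERF only counts while the CPU is not halted (C0) */
	unhalted = mperf_exit - s->mperf_enter;
	/* TSC counts wall cycles regardless of halt state */
	total = tsc_exit - s->tsc_enter;

	/* halted guest cycles = TSC delta minus MPERF delta */
	s->gtime_halted += total - unhalted;
}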

That said, we may be able to avoid it and achieve the same results,
i.e.:

* the VM metrics use case can be solved by using /proc/cpuinfo from
  userspace;
* for the scheduling use case, the tick-based sampling of MPERF means
  we could potentially introduce a correcting factor on the PELT
  accounting of pinned vCPU tasks based on its value (similar to what I
  do in the last patch of the series); see the sketch below.
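
Roughly, and purely as a sketch with assumed names (scale_vcpu_delta()
and the place it would be called from are hypothetical, not the actual
patch): discount the vCPU's PELT delta over a tick by the fraction of
cycles the CPU was actually unhalted:

#include <linux/math64.h>	/* div64_u64() */

/*
 * Scale a PELT time delta for a pinned vCPU task by the unhalted
 * fraction of the tick, derived from the MPERF/TSC deltas that the
 * tick already samples.
 */
static u64 scale_vcpu_delta(u64 delta, u64 mperf_delta, u64 tsc_delta)
{
	if (!tsc_delta)
		return delta;

	return div64_u64(delta * mperf_delta, tsc_delta);
}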

The combination of these would remove the requirement of adding any
logic around VM enter/exit to support our use cases.

I'm happy to prototype that if we agree it's going in the right
direction.

Amazon Development Centre (South Africa) (Proprietary) Limited
29 Gogosoa Street, Observatory, Cape Town, Western Cape, 7925, South Africa
Registration Number: 2004 / 034463 / 07