linux-kernel - Re: [PATCH v3] sched: Don't try to catch up excess steal time.

Message-ID: <5d3cd82f54aed784a0fd912654a86330eda40c9a.camel@infradead.org>
Date: Mon, 18 Nov 2024 10:14:50 +0000
From: David Woodhouse <dwmw2@...radead.org>
To: Peter Zijlstra <peterz@...radead.org>, Suleiman Souhlal
	 <suleiman@...gle.com>
Cc: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>, 
 Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann
 <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben
 Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin
 Schneider <vschneid@...hat.com>, Paolo Bonzini <pbonzini@...hat.com>,
 seanjc@...gle.com, Srikar Dronamraju <srikar@...ux.ibm.com>, 
 joelaf@...gle.com, vineethrp@...gle.com, linux-kernel@...r.kernel.org, 
 kvm@...r.kernel.org, ssouhlal@...ebsd.org
Subject: Re: [PATCH v3] sched: Don't try to catch up excess steal time.

On Mon, 2024-11-18 at 10:34 +0100, Peter Zijlstra wrote:
> On Mon, Nov 18, 2024 at 01:37:45PM +0900, Suleiman Souhlal wrote:
> > When steal time exceeds the measured delta when updating clock_task, we
> > currently try to catch up the excess in future updates.
> > However, this results in inaccurate run times for the future things using
> > clock_task, in some situations, as they end up getting additional steal
> > time that did not actually happen.
> > This is because there is a window between reading the elapsed time in
> > update_rq_clock() and sampling the steal time in update_rq_clock_task().
> > If the VCPU gets preempted between those two points, any additional
> > steal time is accounted to the outgoing task even though the calculated
> > delta did not actually contain any of that "stolen" time.
> > When this race happens, we can end up with steal time that exceeds the
> > calculated delta, and the previous code would try to catch up that excess
> > steal time in future clock updates, which is given to the next,
> > incoming task, even though it did not actually have any time stolen.
> > 
> > This behavior is particularly bad when steal time can be very long,
> > which we've seen when trying to extend steal time to contain the duration
> > that the host was suspended [0]. When this happens, clock_task stays
> > frozen, during which the running task stays running for the whole
> > duration, since its run time doesn't increase.
> > However the race can happen even under normal operation.
> > 
> > Ideally we would read the elapsed cpu time and the steal time atomically,
> > to prevent this race from happening in the first place, but doing so
> > is non-trivial.
> > 
> > Since the time between those two points isn't otherwise accounted anywhere,
> > neither to the outgoing task nor the incoming task (because the "end of
> > outgoing task" and "start of incoming task" timestamps are the same),
> > I would argue that the right thing to do is to simply drop any excess steal
> > time, in order to prevent these issues.
> > 
> > [0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/
> > 
> > Signed-off-by: Suleiman Souhlal <suleiman@...gle.com>
> 
> Right.. uhm.. I don't particularly care much either way. Are other
> people with virt clue okay with this?

I'm slightly dubious because now we may systemically lose accounted
steal time where before it was all at least accounted *somewhere*, and
we might reasonably have expected the slight inaccuracies to balance
out over time.

But this *does* fix the main problem I was seeing, that the kernel will
currently just keep attributing steal time to processes *forever* if a
buggy hypervisor lets it step backwards. So I can live with it.

Acked-by: David Woodhouse <dwmw@...zon.co.uk>
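
The trade-off discussed above can be sketched with a small userspace model.
This is not the kernel code: struct rq_model, update_catch_up() and
update_drop_excess() are hypothetical names made up for illustration, and the
model only assumes that steal time is clamped to the measured delta and that
the two behaviours differ in how the "already accounted" steal value is
advanced.

```c
#include <stdint.h>

/* Hypothetical, simplified model of per-runqueue steal accounting. */
struct rq_model {
	uint64_t clock_task;	/* time credited to tasks */
	uint64_t prev_steal;	/* steal clock value already accounted for */
};

/*
 * "Catch up" behaviour: only the clamped amount is marked as accounted,
 * so any excess steal is carried over into later updates.
 */
static void update_catch_up(struct rq_model *rq, uint64_t delta,
			    uint64_t steal_clock)
{
	uint64_t steal = steal_clock - rq->prev_steal;	/* wraps if the clock went backwards */

	if (steal > delta)
		steal = delta;
	rq->prev_steal += steal;
	rq->clock_task += delta - steal;
}

/*
 * "Drop excess" behaviour: the sampled steal clock is recorded as-is,
 * so whatever did not fit into this delta is discarded.
 */
static void update_drop_excess(struct rq_model *rq, uint64_t delta,
			       uint64_t steal_clock)
{
	uint64_t steal = steal_clock - rq->prev_steal;	/* wraps if the clock went backwards */

	if (steal > delta)
		steal = delta;
	rq->prev_steal = steal_clock;
	rq->clock_task += delta - steal;
}
```

In this model, two consecutive updates with delta = 10 and a steal clock
reading of 100 leave clock_task at 0 under update_catch_up(), because the
excess 90 is still being paid off, while update_drop_excess() lets the second
update credit the full 10. If the steal clock steps backwards (say prev_steal
is 100 and the reading is 50), the unsigned subtraction wraps to a huge value;
update_catch_up() then keeps clamping to delta on every update, whereas
update_drop_excess() resynchronises after one update. The cost, as noted
above, is that the dropped excess is never accounted anywhere.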