linux-kernel - Re: [PATCH v3] sched: Don't try to catch up excess steal time.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20241118093454.GF39245@noisy.programming.kicks-ass.net>
Date: Mon, 18 Nov 2024 10:34:54 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Suleiman Souhlal <suleiman@...gle.com>
Cc: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Valentin Schneider <vschneid@...hat.com>,
	Paolo Bonzini <pbonzini@...hat.com>, seanjc@...gle.com,
	Srikar Dronamraju <srikar@...ux.ibm.com>,
	David Woodhouse <dwmw2@...radead.org>, joelaf@...gle.com,
	vineethrp@...gle.com, linux-kernel@...r.kernel.org,
	kvm@...r.kernel.org, ssouhlal@...ebsd.org
Subject: Re: [PATCH v3] sched: Don't try to catch up excess steal time.

On Mon, Nov 18, 2024 at 01:37:45PM +0900, Suleiman Souhlal wrote:
> When steal time exceeds the measured delta when updating clock_task, we
> currently try to catch up the excess in future updates.
> However, this results in inaccurate run times for the future things using
> clock_task, in some situations, as they end up getting additional steal
> time that did not actually happen.
> This is because there is a window between reading the elapsed time in
> update_rq_clock() and sampling the steal time in update_rq_clock_task().
> If the VCPU gets preempted between those two points, any additional
> steal time is accounted to the outgoing task even though the calculated
> delta did not actually contain any of that "stolen" time.
> When this race happens, we can end up with steal time that exceeds the
> calculated delta, and the previous code would try to catch up that excess
> steal time in future clock updates, which is given to the next,
> incoming task, even though it did not actually have any time stolen.
> 
> This behavior is particularly bad when steal time can be very long,
> which we've seen when trying to extend steal time to contain the duration
> that the host was suspended [0]. When this happens, clock_task stays
> frozen, during which the running task stays running for the whole
> duration, since its run time doesn't increase.
> However the race can happen even under normal operation.
> 
> Ideally we would read the elapsed cpu time and the steal time atomically,
> to prevent this race from happening in the first place, but doing so
> is non-trivial.
> 
> Since the time between those two points isn't otherwise accounted anywhere,
> neither to the outgoing task nor the incoming task (because the "end of
> outgoing task" and "start of incoming task" timestamps are the same),
> I would argue that the right thing to do is to simply drop any excess steal
> time, in order to prevent these issues.
> 
> [0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/
> 
> Signed-off-by: Suleiman Souhlal <suleiman@...gle.com>

Right.. uhm.. I don't particularly care much either way. Are other
people with virt clue okay with this?

> ---
>  kernel/sched/core.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a1c353a62c56..13f70316ef39 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -766,13 +766,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>  #endif
>  #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
>  	if (static_key_false((&paravirt_steal_rq_enabled))) {
> -		steal = paravirt_steal_clock(cpu_of(rq));
> +		u64 prev_steal;
> +
> +		steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
>  		steal -= rq->prev_steal_time_rq;
>  
>  		if (unlikely(steal > delta))
>  			steal = delta;
>  
> -		rq->prev_steal_time_rq += steal;
> +		rq->prev_steal_time_rq = prev_steal;
>  		delta -= steal;
>  	}
>  #endif
> -- 
> 2.47.0.338.g60cca15819-goog
>