linux-kernel - Re: [PATCH v3] sched: Don't try to catch up excess steal time.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20241118143311.3ca94405@gandalf.local.home>
Date: Mon, 18 Nov 2024 14:33:11 -0500
From: Steven Rostedt <rostedt@...dmis.org>
To: Suleiman Souhlal <suleiman@...gle.com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
 Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
 <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
 Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin
 Schneider <vschneid@...hat.com>, Paolo Bonzini <pbonzini@...hat.com>,
 seanjc@...gle.com, Srikar Dronamraju <srikar@...ux.ibm.com>, David
 Woodhouse <dwmw2@...radead.org>, joelaf@...gle.com, vineethrp@...gle.com,
 linux-kernel@...r.kernel.org, kvm@...r.kernel.org, ssouhlal@...ebsd.org
Subject: Re: [PATCH v3] sched: Don't try to catch up excess steal time.

On Mon, 18 Nov 2024 13:37:45 +0900
Suleiman Souhlal <suleiman@...gle.com> wrote:

> When steal time exceeds the measured delta when updating clock_task, we
> currently try to catch up the excess in future updates.
> However, this results in inaccurate run times for the future things using
> clock_task, in some situations, as they end up getting additional steal
> time that did not actually happen.
> This is because there is a window between reading the elapsed time in
> update_rq_clock() and sampling the steal time in update_rq_clock_task().
> If the VCPU gets preempted between those two points, any additional
> steal time is accounted to the outgoing task even though the calculated
> delta did not actually contain any of that "stolen" time.
> When this race happens, we can end up with steal time that exceeds the
> calculated delta, and the previous code would try to catch up that excess
> steal time in future clock updates, which is given to the next,
> incoming task, even though it did not actually have any time stolen.
> 
> This behavior is particularly bad when steal time can be very long,
> which we've seen when trying to extend steal time to contain the duration
> that the host was suspended [0]. When this happens, clock_task stays
> frozen, during which the running task stays running for the whole
> duration, since its run time doesn't increase.
> However the race can happen even under normal operation.
> 
> Ideally we would read the elapsed cpu time and the steal time atomically,
> to prevent this race from happening in the first place, but doing so
> is non-trivial.
> 
> Since the time between those two points isn't otherwise accounted anywhere,
> neither to the outgoing task nor the incoming task (because the "end of
> outgoing task" and "start of incoming task" timestamps are the same),
> I would argue that the right thing to do is to simply drop any excess steal
> time, in order to prevent these issues.
> 
> [0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/
> 
> Signed-off-by: Suleiman Souhlal <suleiman@...gle.com>
> ---
> v3:
> - Reword commit message.
> - Revert back to v1 code, since it's more understandable.
> 
> v2: https://lore.kernel.org/lkml/20240911111522.1110074-1-suleiman@google.com
> - Slightly changed to simply moving one line up instead of adding
>   new variable.
> 
> v1: https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com
> ---
>  kernel/sched/core.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a1c353a62c56..13f70316ef39 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -766,13 +766,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>  #endif
>  #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
>  	if (static_key_false((&paravirt_steal_rq_enabled))) {
> -		steal = paravirt_steal_clock(cpu_of(rq));
> +		u64 prev_steal;
> +
> +		steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
>  		steal -= rq->prev_steal_time_rq;
>  
>  		if (unlikely(steal > delta))
>  			steal = delta;

So is the problem just the above if statement? That is, delta is already
calculated, but if we get interrupted by the host before steal is
calculated and the time then becomes greater than delta, the time
difference between delta and steal gets pushed off to the next task, right?

-- Steve



>  
> -		rq->prev_steal_time_rq += steal;
> +		rq->prev_steal_time_rq = prev_steal;
>  		delta -= steal;
>  	}
>  #endif