linux-kernel - Re: [PATCH] sched: Don't try to catch up excess steal time.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAJWu+oqp9sUDOvKB23p+_C1cTvFj8sQptfz30UwrWJyKhf1ckg@mail.gmail.com>
Date: Tue, 6 Aug 2024 18:51:36 -0400
From: Joel Fernandes <joelaf@...gle.com>
To: Suleiman Souhlal <suleiman@...gle.com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, 
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Valentin Schneider <vschneid@...hat.com>, Paolo Bonzini <pbonzini@...hat.com>, vineethrp@...gle.com, 
	linux-kernel@...r.kernel.org, kvm@...r.kernel.org, ssouhlal@...ebsd.org
Subject: Re: [PATCH] sched: Don't try to catch up excess steal time.

On Tue, Aug 6, 2024 at 7:13 AM Suleiman Souhlal <suleiman@...gle.com> wrote:
>
> When steal time exceeds the measured delta when updating clock_task, we
> currently try to catch up the excess in future updates.
> However, this results in inaccurate run times for the future clock_task
> measurements, as they end up getting additional steal time that did not
> actually happen, from the previous excess steal time being paid back.
>
> For example, suppose a task in a VM runs for 10ms and had 15ms of steal
> time reported while it ran. clock_task rightly doesn't advance. Then, a
> different task runs on the same rq for 10ms without any time stolen.
> Because of the current catch up mechanism, clock_sched inaccurately ends
> up advancing by only 5ms instead of 10ms even though there wasn't any
> actual time stolen. The second task is getting charged for less time
> than it ran, even though it didn't deserve it.
> In other words, tasks can end up getting more run time than they should
> actually get.
>
> So, we instead don't make future updates pay back past excess stolen time.
>
> Signed-off-by: Suleiman Souhlal <suleiman@...gle.com>
> ---
>  kernel/sched/core.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bcf2c4cc0522..42b37da2bda6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -728,13 +728,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>  #endif
>  #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
>         if (static_key_false((&paravirt_steal_rq_enabled))) {
> -               steal = paravirt_steal_clock(cpu_of(rq));
> +               u64 prev_steal;
> +
> +               steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
>                 steal -= rq->prev_steal_time_rq;
>
>                 if (unlikely(steal > delta))
>                         steal = delta;
>
> -               rq->prev_steal_time_rq += steal;
> +               rq->prev_steal_time_rq = prev_steal;
>                 delta -= steal;

Makes sense, but wouldn't this patch also do the following: If vCPU
task is the only one running and has a large steal time, then
sched_tick() will only freeze the clock for a shorter period, and not
give future credits to the vCPU task itself?  Maybe it does not matter
(and I probably don't understand the code enough) but thought I would
mention.

I am also not sure if the purpose of stealtime is to credit individual
tasks, or rather all tasks on the runqueue because the "whole
runqueue" had time stolen.. No where in this function is it dealing
with individual tasks but rather the rq itself.

Thoughts?

 - Joel



>         }
>  #endif
> --
> 2.46.0.rc2.264.g509ed76dc8-goog
>