lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZrK65hAa-yDoG6F7@freefall.freebsd.org>
Date: Wed, 7 Aug 2024 00:08:06 +0000
From: Suleiman Souhlal <ssouhlal@...ebsd.org>
To: Joel Fernandes <joelaf@...gle.com>
Cc: Suleiman Souhlal <suleiman@...gle.com>, Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Juri Lelli <juri.lelli@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Valentin Schneider <vschneid@...hat.com>,
	Paolo Bonzini <pbonzini@...hat.com>, vineethrp@...gle.com,
	linux-kernel@...r.kernel.org, kvm@...r.kernel.org
Subject: Re: [PATCH] sched: Don't try to catch up excess steal time.

On Tue, Aug 06, 2024 at 06:51:36PM -0400, Joel Fernandes wrote:
> On Tue, Aug 6, 2024 at 7:13 AM Suleiman Souhlal <suleiman@...gle.com> wrote:
> >
> > When steal time exceeds the measured delta when updating clock_task, we
> > currently try to catch up the excess in future updates.
> > However, this results in inaccurate run times for the future clock_task
> > measurements, as they end up getting additional steal time that did not
> > actually happen, from the previous excess steal time being paid back.
> >
> > For example, suppose a task in a VM runs for 10ms and had 15ms of steal
> > time reported while it ran. clock_task rightly doesn't advance. Then, a
> > different task runs on the same rq for 10ms without any time stolen.
> > Because of the current catch up mechanism, clock_sched inaccurately ends
> > up advancing by only 5ms instead of 10ms even though there wasn't any
> > actual time stolen. The second task is getting charged for less time
> > than it ran, even though it didn't deserve it.
> > In other words, tasks can end up getting more run time than they should
> > actually get.
> >
> > So, we instead don't make future updates pay back past excess stolen time.
> >
> > Signed-off-by: Suleiman Souhlal <suleiman@...gle.com>
> > ---
> >  kernel/sched/core.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index bcf2c4cc0522..42b37da2bda6 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -728,13 +728,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> >  #endif
> >  #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> >         if (static_key_false((&paravirt_steal_rq_enabled))) {
> > -               steal = paravirt_steal_clock(cpu_of(rq));
> > +               u64 prev_steal;
> > +
> > +               steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
> >                 steal -= rq->prev_steal_time_rq;
> >
> >                 if (unlikely(steal > delta))
> >                         steal = delta;
> >
> > -               rq->prev_steal_time_rq += steal;
> > +               rq->prev_steal_time_rq = prev_steal;
> >                 delta -= steal;
> 
> Makes sense, but wouldn't this patch also do the following: If vCPU
> task is the only one running and has a large steal time, then
> sched_tick() will only freeze the clock for a shorter period, and not
> give future credits to the vCPU task itself?  Maybe it does not matter
> (and I probably don't understand the code enough) but thought I would
> mention.

The patch should still be doing the right thing in that situation:
The clock will be frozen for the whole duration that it ran, and delta
will be 0.
The current excess amount is not relevant to the future, as far as I can
tell.
The pre-patch code is giving the rq extra time that it hadn't measured.
I don't really see why it should be getting that extra time.

> 
> I am also not sure if the purpose of stealtime is to credit individual
> tasks, or rather all tasks on the runqueue because the "whole
> runqueue" had time stolen.. No where in this function is it dealing
> with individual tasks but rather the rq itself.

This function is used to update clock_task, which *is* relevant to
individual tasks. It is used to calculate how long tasks ran for (and
for load averages).

-- Suleiman

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ