Message-ID: <CAJWu+oqR=SMKHd1EqvRm3nvz7e1r4e7Rj76hJ8jhDQQkNVo0ig@mail.gmail.com>
Date: Tue, 19 Nov 2024 09:10:41 +0900
From: Joel Fernandes <joelaf@...gle.com>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: Suleiman Souhlal <suleiman@...gle.com>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>, Paolo Bonzini <pbonzini@...hat.com>, seanjc@...gle.com,
Srikar Dronamraju <srikar@...ux.ibm.com>, David Woodhouse <dwmw2@...radead.org>, vineethrp@...gle.com,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org, ssouhlal@...ebsd.org
Subject: Re: [PATCH v3] sched: Don't try to catch up excess steal time.
On Tue, Nov 19, 2024 at 4:32 AM Steven Rostedt <rostedt@...dmis.org> wrote:
>
> On Mon, 18 Nov 2024 13:37:45 +0900
> Suleiman Souhlal <suleiman@...gle.com> wrote:
>
> > When steal time exceeds the measured delta when updating clock_task, we
> > currently try to catch up the excess in future updates.
> > However, this results in inaccurate run times for the future things using
> > clock_task, in some situations, as they end up getting additional steal
> > time that did not actually happen.
> > This is because there is a window between reading the elapsed time in
> > update_rq_clock() and sampling the steal time in update_rq_clock_task().
> > If the VCPU gets preempted between those two points, any additional
> > steal time is accounted to the outgoing task even though the calculated
> > delta did not actually contain any of that "stolen" time.
> > When this race happens, we can end up with steal time that exceeds the
> > calculated delta, and the previous code would try to catch up that excess
> > steal time in future clock updates, which is given to the next,
> > incoming task, even though it did not actually have any time stolen.
> >
> > This behavior is particularly bad when steal time can be very long,
> > which we've seen when trying to extend steal time to contain the duration
> > that the host was suspended [0]. When this happens, clock_task stays
> > frozen, during which the running task stays running for the whole
> > duration, since its run time doesn't increase.
> > However the race can happen even under normal operation.
> >
> > Ideally we would read the elapsed cpu time and the steal time atomically,
> > to prevent this race from happening in the first place, but doing so
> > is non-trivial.
> >
> > Since the time between those two points isn't otherwise accounted anywhere,
> > neither to the outgoing task nor the incoming task (because the "end of
> > outgoing task" and "start of incoming task" timestamps are the same),
> > I would argue that the right thing to do is to simply drop any excess steal
> > time, in order to prevent these issues.
> >
> > [0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/
> >
> > Signed-off-by: Suleiman Souhlal <suleiman@...gle.com>
> > ---
> > v3:
> > - Reword commit message.
> > - Revert back to v1 code, since it's more understandable.
> >
> > v2: https://lore.kernel.org/lkml/20240911111522.1110074-1-suleiman@google.com
> > - Slightly changed to simply moving one line up instead of adding
> > new variable.
> >
> > v1: https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com
> > ---
> > kernel/sched/core.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index a1c353a62c56..13f70316ef39 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -766,13 +766,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> > #endif
> > #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> > if (static_key_false((&paravirt_steal_rq_enabled))) {
> > - steal = paravirt_steal_clock(cpu_of(rq));
> > + u64 prev_steal;
> > +
> > + steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
> > steal -= rq->prev_steal_time_rq;
> >
> > if (unlikely(steal > delta))
> > steal = delta;
>
> So is the problem just the above if statement? That is, delta is already
> calculated, but if we get interrupted by the host before steal is
> calculated and the time then becomes greater than delta, the time
> difference between delta and steal gets pushed off to the next task, right?
Pretty much. Capping steal to delta means the rest of the steal is
pushed off to future clock updates. With this patch, the remaining
steal is discarded instead.
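
To make the difference concrete, here is a minimal user-space sketch of
the two behaviors. It is a model, not the kernel code: struct rq_model,
update_old() and update_new() are hypothetical names, and the new variant
assumes that the part of the hunk not quoted above stores the raw steal
clock reading (prev_steal) into prev_steal_time_rq.

```c
#include <stdint.h>

/* Hypothetical stand-in for the relevant per-runqueue state (not struct rq). */
struct rq_model {
	uint64_t prev_steal_time_rq;	/* steal clock value already accounted */
	uint64_t clock_task;		/* task clock, excludes stolen time */
};

/*
 * Old behavior: steal is capped to delta, and only the capped amount is
 * added to prev_steal_time_rq. Any excess is still "owed" and shows up
 * again as steal on the next update.
 */
static void update_old(struct rq_model *rq, int64_t delta, uint64_t steal_clock)
{
	int64_t steal = (int64_t)(steal_clock - rq->prev_steal_time_rq);

	if (steal > delta)
		steal = delta;
	rq->prev_steal_time_rq += (uint64_t)steal;
	rq->clock_task += (uint64_t)(delta - steal);
}

/*
 * Behavior with the patch (assumed from the quoted part of the hunk):
 * steal is still capped to delta, but the raw steal clock reading is
 * recorded, so the excess is dropped instead of carried forward.
 */
static void update_new(struct rq_model *rq, int64_t delta, uint64_t steal_clock)
{
	uint64_t prev_steal = steal_clock;
	int64_t steal = (int64_t)(steal_clock - rq->prev_steal_time_rq);

	if (steal > delta)
		steal = delta;
	rq->prev_steal_time_rq = prev_steal;
	rq->clock_task += (uint64_t)(delta - steal);
}
```

Take a first update with delta = 10 and a steal clock reading of 30,
followed by a second update with delta = 10 and no new steal. In the old
variant the second interval is still charged 10 units of steal, so
clock_task stays at 0. In the new variant clock_task advances by 10.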
Thanks
Joel