Message-ID: <Y9pU4cunJd3aI9+S@lothringen>
Date: Wed, 1 Feb 2023 13:02:41 +0100
From: Frederic Weisbecker <frederic@...nel.org>
To: Hillf Danton <hdanton@...a.com>
Cc: Thomas Gleixner <tglx@...utronix.de>,
Yu Liao <liaoyu15@...wei.com>, fweisbec@...il.com,
mingo@...nel.org, liwei391@...wei.com, adobriyan@...il.com,
mirsad.todorovac@....unizg.hr, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH RFC] tick/nohz: fix data races in get_cpu_idle_time_us()
On Wed, Feb 01, 2023 at 12:53:02PM +0800, Hillf Danton wrote:
> On Tue, 31 Jan 2023 15:44:00 +0100 Thomas Gleixner <tglx@...utronix.de> wrote:
> >
> > Seriously, this procfs accuracy is the least of the problems, and if
> > this were the only issue then we could trivially fix it by declaring
> > that the procfs output might go backwards. It's an estimate after all.
> > If there were a real reason to ensure monotonicity there, then we
> > could easily do that in the readout code.
> >
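(For illustration, such a readout-side fix could be a simple clamp; the
per-CPU cache below is hypothetical, just a sketch of the idea:)

	/*
	 * Hypothetical monotonicity clamp for the procfs readout path:
	 * never report less idle time than previously reported for the
	 * CPU (reader/reader races ignored for brevity).
	 */
	static DEFINE_PER_CPU(u64, last_idle_us);

	static u64 clamped_idle_time_us(int cpu)
	{
		u64 *last = &per_cpu(last_idle_us, cpu);
		u64 idle = get_cpu_idle_time_us(cpu, NULL);

		if (idle < *last)
			return *last;	/* estimate went backwards, clamp */
		*last = idle;
		return idle;
	}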
> > But the real issue is that both get_cpu_idle_time_us() and
> > get_cpu_iowait_time_us() can invoke update_ts_time_stats(), which is
> > way worse than the above procfs idle time going backwards.
> >
> > If update_ts_time_stats() is invoked concurrently for the same CPU,
> > then ts->idle_sleeptime and ts->iowait_sleeptime turn into random
> > numbers.
> >
> > This was broken 12 years ago by commit 595aac488b54 ("sched:
> > Introduce a function to update the idle statistics").
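(To make the lost update concrete: the racy section in
update_ts_time_stats() is essentially this read-modify-write, lightly
simplified:)

	delta = ktime_sub(now, ts->idle_entrytime);
	if (nr_iowait_cpu(cpu) > 0)
		/* Two racing callers read the same sleeptime... */
		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
	else
		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
	/*
	 * ...and one store overwrites the other, while both deltas were
	 * measured from the same idle_entrytime, so time gets accounted
	 * twice or lost.
	 */
	ts->idle_entrytime = now;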
>
> [...]
>
> >
> > P.S.: I hate the spinlock in the idle code path, but I don't have a
> > better idea.
>
> Provided the percpu rule is enforced, the random numbers mentioned above
> could be eliminated without adding another spinlock.
>
> Hillf
> +++ b/kernel/time/tick-sched.c
> @@ -640,13 +640,26 @@ static void tick_nohz_update_jiffies(kti
> /*
> * Updates the per-CPU time idle statistics counters
> */
> -static void
> -update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
> +static u64 update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now,
> + int io, u64 *last_update_time)
> {
> ktime_t delta;
>
> + if (last_update_time)
> + *last_update_time = ktime_to_us(now);
> +
> if (ts->idle_active) {
> delta = ktime_sub(now, ts->idle_entrytime);
> +
> + /* update is only expected on the local CPU */
> + if (cpu != smp_processor_id()) {
Why not just update it on idle exit only, then?
> + if (io)
I fear it's not up to the caller to decide whether the idle time is IO or not.
> + delta = ktime_add(ts->iowait_sleeptime, delta);
> + else
> + delta = ktime_add(ts->idle_sleeptime, delta);
> + return ktime_to_us(delta);
> + }
> +
> if (nr_iowait_cpu(cpu) > 0)
> ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
> else
But you kept the old update above.
So if this is not the local CPU, what do you do?
You'd need to return (without updating iowait_sleeptime):

    ts->iowait_sleeptime + ktime_sub(ktime_get(), ts->idle_entrytime)

Right? But then you may race with the local updater, at the risk of
returning the delta accounted twice. So you need at least a seqcount.
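(Roughly like the below, where idle_sleeptime_seq is a hypothetical
seqcount_t added to struct tick_sched; only a sketch:)

	/* Writer side, local idle exit: */
	write_seqcount_begin(&ts->idle_sleeptime_seq);
	ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
	ts->idle_entrytime = now;
	write_seqcount_end(&ts->idle_sleeptime_seq);

	/* Reader side, possibly remote: */
	do {
		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
		sleeptime = ts->iowait_sleeptime;
		if (ts->idle_active)
			sleeptime = ktime_add(sleeptime,
					ktime_sub(ktime_get(), ts->idle_entrytime));
	} while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));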
But in the end, nr_iowait_cpu() is broken because that counter can be
decremented remotely and so the whole thing is beyond repair:
         CPU 0                         CPU 1                         CPU 2
         -----                         -----                         -----
    //io_schedule() TASK A
    current->in_iowait = 1
    rq(0)->nr_iowait++
    //switch to idle
                                 // READ /proc/stat
                                 // See nr_iowait_cpu(0) == 1
                                 return ts->iowait_sleeptime +
                                     ktime_sub(ktime_get(), ts->idle_entrytime)
                                                               //try_to_wake_up(TASK A)
                                                               rq(0)->nr_iowait--
    //idle exit
    // See nr_iowait_cpu(0) == 0
    ts->idle_sleeptime += ktime_sub(ktime_get(), ts->idle_entrytime)
Thanks.