Message-ID: <20191105124351.GN4131@hirez.programming.kicks-ass.net>
Date: Tue, 5 Nov 2019 13:43:51 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Scott Wood <swood@...hat.com>
Cc: Thomas Gleixner <tglx@...utronix.de>,
Frederic Weisbecker <frederic@...nel.org>,
Ingo Molnar <mingo@...nel.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] timers/nohz: Update nohz load even if tick already stopped

On Tue, Nov 05, 2019 at 01:30:58AM -0600, Scott Wood wrote:
> The warning is due to kernel/sched/idle.c not updating curr->se.exec_start.
Ah, indeed so.
> While debugging I noticed an issue with a particular load pattern. The CPU
> goes non-nohz for a brief time at an interval very close to twice
> tick_period. When the tick is started, the timer expiration is more than
> tick_period in the past, so hrtimer_forward() tries to catch up by adding
> 2*tick_period to the expiration. Then the tick is stopped before that new
> expiration, and when the tick is woken up the expiry is again advanced by
> 2*tick_period with the timer never actually running. sched_tick_remote()
> does fire every second, but there are streaks of several seconds where it
> keeps catching the CPU in a non-nohz state, so neither the normal nor remote
> ticks are calling calc_load_nohz_remote().
>
> Is there a reason to not just remove the hrtimer_forward() from
> tick_nohz_restart(), letting the timer fire if it's in the past, which will
> take care of doing hrtimer_forward()?
I'll have to look into that. I always get confused by all that nohz code
:/
> As for the warning in sched_tick_remote(), it seems like a test for time
> since the last tick on this cpu (remote or otherwise) would be better than
> relying on curr->se.exec_start, in order to detect things like this.
I don't think we have a timestamp that is shared between the remote and
local tick. Also, there is a reason this warning uses the task time
accounting: there used to be (as in, I can't find it in a hurry) code
that could not deal with >u32 (~4s) clock updates.
The below should have idle keep the timestamp up-to-date. Keeping
idle->se.sum_exec_runtime accurate doesn't seem too interesting; the
idle code already keeps track of total idle time.
---
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -381,6 +381,7 @@ static void put_prev_task_idle(struct rq
static void set_next_task_idle(struct rq *rq, struct task_struct *next)
{
+ next->se.exec_start = rq_clock_task(rq);
update_idle_core(rq);
schedstat_inc(rq->sched_goidle);
}
@@ -417,6 +418,7 @@ dequeue_task_idle(struct rq *rq, struct
*/
static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)
{
+ curr->se.exec_start = rq_clock_task(rq);
}
static void switched_to_idle(struct rq *rq, struct task_struct *p)