[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180817102734.GA4253@codeblueprint.co.uk>
Date: Fri, 17 Aug 2018 11:27:34 +0100
From: Matt Fleming <matt@...eblueprint.co.uk>
To: Valentin Schneider <valentin.schneider@....com>
Cc: Peter Zijlstra <peterz@...radead.org>,
linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...nel.org>,
Mike Galbraith <umgwanakikbuti@...il.com>
Subject: Re: [PATCH] sched/fair: Avoid divide by zero when rebalancing domains
On Thu, 05 Jul, at 05:54:02PM, Valentin Schneider wrote:
> On 05/07/18 14:27, Matt Fleming wrote:
> > On Thu, 05 Jul, at 11:10:42AM, Valentin Schneider wrote:
> >> Hi,
> >>
> >> On 04/07/18 15:24, Matt Fleming wrote:
> >>> It's possible that the CPU doing nohz idle balance hasn't had its own
> >>> load updated for many seconds. This can lead to huge deltas between
> >>> rq->avg_stamp and rq->clock when rebalancing, and has been seen to
> >>> cause the following crash:
> >>>
> >>> divide error: 0000 [#1] SMP
> >>> Call Trace:
> >>> [<ffffffff810bcba8>] update_sd_lb_stats+0xe8/0x560
>
> My confusion comes from not seeing where that crash happens. Would you mind
> sharing the associated line number? I can feel the "how did I not see this"
> from there but it can't be helped :(
The divide by zero comes from scale_rt_capacity() where 'total' is a
u64 but gets truncated when passed to div_u64() since the divisor
parameter is u32.
Sure, you could use div64_u64() instead, but the real issue is that
the load hasn't been updated for a very long time and that we're
trying to balance the domains with stale data.
Powered by blists - more mailing lists