linux-kernel - Re: [Linux 5.18-rc1] WARNING: CPU: 1 PID: 0 at kernel/sched/fair.c:3355 update_blocked

Message-ID: <675544de-3369-e26e-65ba-3b28fff5c126@gnuweeb.org>
Date:   Tue, 5 Apr 2022 20:13:42 +0700
From:   Ammar Faizi <ammarfaizi2@...weeb.org>
To:     Dietmar Eggemann <dietmar.eggemann@....com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Cc:     Ben Segall <bsegall@...gle.com>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        GNU/Weeb Mailing List <gwml@...r.gnuweeb.org>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Mel Gorman <mgorman@...e.de>,
        Peter Zijlstra <peterz@...radead.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Vincent Guittot <vincent.guittot@...aro.org>
Subject: Re: [Linux 5.18-rc1] WARNING: CPU: 1 PID: 0 at
 kernel/sched/fair.c:3355 update_blocked_averages

On 4/5/22 7:21 PM, Dietmar Eggemann wrote:
> Tried to recreate the issue but no success so far. I used your config
> file, clang-14 and a Xeon CPU E5-2690 v2 (2 sockets 40 CPUs) with 20
> two-level cgroupv1 taskgroups '/X/Y' with 'hackbench (10 groups, 40 fds)
> + idling' running in all '/X/Y/'.
> 
> What userspace are you running?

HP laptop, Intel i7-1165G7, 8 CPUs, 16 GB of RAM, Ubuntu 21.10. I use it as
my daily workstation: compiling kernels, browsing, and coding.

> There seemed to be some pressure on your machine when it happened?

Yeah, might be, I don't fully remember the activity at the time it
happened, though.

>> <6>[13420.623334][    C7] perf: interrupt took too long (2530 > 2500), lowering kernel.perf_event_max_sample_rate to 78900
> 
> Maybe you could split the SCHED_WARN_ON so we know which signal causes this?

OK, I will apply the diff on top of 5.18-rc1 and start using it for my daily
routine tomorrow morning. Let's see if I can hit this bug again. I will send
an update later.

Thank you.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d4bd299d67ab..0d45e09e5bfc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3350,9 +3350,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>           * Make sure that rounding and/or propagation of PELT values never
>           * break this.
>           */
> -       SCHED_WARN_ON(cfs_rq->avg.load_avg ||
> -                     cfs_rq->avg.util_avg ||
> -                     cfs_rq->avg.runnable_avg);
> +       SCHED_WARN_ON(cfs_rq->avg.load_avg);
> +       SCHED_WARN_ON(cfs_rq->avg.util_avg);
> +       SCHED_WARN_ON(cfs_rq->avg.runnable_avg);
> 
>          return true;
>   }
> 
> [...]
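
To illustrate why splitting the warning helps, here is a minimal user-space
sketch in C. It is only an approximation of the idea in the diff above:
struct avg_sketch, WARN_ON_SKETCH, check_combined and check_split are
hypothetical names made up for this example, and the macro is a stand-in
that does not reproduce the kernel's SCHED_WARN_ON. With one combined
condition, the report only says that some signal was nonzero; with three
separate checks, each report names the signal that fired.

```c
#include <stdio.h>

/* Hypothetical stand-in for the three PELT signals tested in the diff. */
struct avg_sketch {
	unsigned long load_avg;
	unsigned long util_avg;
	unsigned long runnable_avg;
};

/*
 * Hypothetical stand-in for a warn-on-condition macro: prints the
 * stringified condition with file and line, and evaluates to 1 if the
 * condition was true, 0 otherwise.
 */
#define WARN_ON_SKETCH(cond) \
	((cond) ? (fprintf(stderr, "WARNING at %s:%d: %s\n", \
			   __FILE__, __LINE__, #cond), 1) : 0)

enum { SIG_LOAD = 1, SIG_UTIL = 2, SIG_RUNNABLE = 4 };

/* One combined check: the report cannot tell which signal was nonzero. */
static int check_combined(const struct avg_sketch *avg)
{
	return WARN_ON_SKETCH(avg->load_avg ||
			      avg->util_avg ||
			      avg->runnable_avg);
}

/* Three separate checks: each report identifies its own signal. */
static int check_split(const struct avg_sketch *avg)
{
	int mask = 0;

	if (WARN_ON_SKETCH(avg->load_avg))
		mask |= SIG_LOAD;
	if (WARN_ON_SKETCH(avg->util_avg))
		mask |= SIG_UTIL;
	if (WARN_ON_SKETCH(avg->runnable_avg))
		mask |= SIG_RUNNABLE;

	return mask;
}
```

For a value with only util_avg nonzero, check_combined() reports the whole
three-way expression, while check_split() prints a single warning for
avg->util_avg and returns SIG_UTIL.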


-- 
Ammar Faizi