linux-kernel - Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <87msr0l94p.ffs@tglx>
Date: Thu, 14 Mar 2024 23:51:50 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: Florian Fainelli <f.fainelli@...il.com>, Boqun Feng <boqun.feng@...il.com>
Cc: Frederic Weisbecker <frederic@...nel.org>, "Russell King (Oracle)"
 <linux@...linux.org.uk>, Joel Fernandes <joel@...lfernandes.org>,
 Anna-Maria Behnsen <anna-maria@...utronix.de>, Linus Torvalds
 <torvalds@...ux-foundation.org>, linux-kernel@...r.kernel.org,
 kernel-team@...a.com, paulmck@...nel.org, mingo@...nel.org,
 rcu@...r.kernel.org, neeraj.upadhyay@....com, urezki@...il.com,
 qiang.zhang1211@...il.com, bigeasy@...utronix.de, chenzhongjin@...wei.com,
 yangjihong1@...wei.com, rostedt@...dmis.org, Justin Chen
 <justin.chen@...adcom.com>
Subject: Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes
 for v6.9]

On Thu, Mar 14 2024 at 14:53, Florian Fainelli wrote:
> On 3/14/24 14:21, Thomas Gleixner wrote:
>> 8ca1836769d758e4fbf5851bb81e181c52193f5d is related, but dos not fully
>> explain the fail. I haven't yet spotted where this goes into lala land.
>
> It was a lot harder to generate the same issue on cold boot against 
> 8ca1836769d758e4fbf5851bb81e181c52193f5d,

That's good as it points into the exactly right direction as far as I
can tell from the data we have, but I might be wrong at the end.

> however, just like against 36e40df35d2c1891fe58241640c7c95de4aa739b,
> it would happen resuming from suspend to DRAM whereby the CPU core(s)
> lost their power and had to be re-initialized. Eventually I got a cold
> boot log:
>
> https://gist.github.com/ffainelli/b5684585c78518a5492cbbf1c7dce16e

The picture is similar to the one before:

<idle>-0         2d.... 3016627us : tmigr_update_events: child=00000000 group=6b74d49d group_lvl=0 child_active=0 group_active=1 nextevt=3023000000 next_expiry=3023000000 child_evt_expiry=0 child_evtcpu=0
<idle>-0         2d.... 3016628us : tmigr_group_set_cpu_inactive: group=6b74d49d lvl=0 numa=0 active=1 migrator=1 parent=00000000 childmask=4
<idle>-0         2d.... 3016629us : tmigr_cpu_idle: cpu=2 parent=6b74d49d nextevt=3023000000 wakeup=9223372036854775807

<idle>-0         0d.... 3016684us : tmigr_group_set_cpu_inactive: group=6b74d49d lvl=0 numa=0 active=0 migrator=ff parent=00000000 childmask=1
<idle>-0         0d.... 3016685us : tmigr_cpu_idle: cpu=0 parent=6b74d49d nextevt=9223372036854775807 wakeup=9223372036854775807
<idle>-0         0d.... 3024623us : tmigr_cpu_new_timer_idle: cpu=0 parent=6b74d49d nextevt=9223372036854775807 wakeup=9223372036854775807

<idle>-0         1d.s.. 162080532us : timer_cancel: timer=2e281df7

Just a different CPU this time.

The first expiring timer:

kcompact-42        1d.... 2552891us : timer_start: timer=2e281df7 function=process_timeout expires=4294670348 [timeout=500] bucket_expiry=4294670352 cpu=1 idx=66 flags=

Last expiry before going south:

   <idle>-0         1..s.. 3006620us : timer_expire_entry: timer=6f47b280 function=process_timeout now=4294670302 baseclk=4294670302

4294670352 - 4294670302 = 50

3006620us + 50*1000us = 3056620us

So the previous observation of hitting the exact point of the last CPU
going idle does not matter.

What really bothers me is:

<idle>-0         2d.... 3016629us : tmigr_cpu_idle: cpu=2 parent=6b74d49d nextevt=3023000000 wakeup=9223372036854775807

which has an event between these events:

<idle>-0         0d.... 3016685us : tmigr_cpu_idle: cpu=0 parent=6b74d49d nextevt=9223372036854775807 wakeup=9223372036854775807
<idle>-0         0d.... 3024623us : tmigr_cpu_new_timer_idle: cpu=0 parent=6b74d49d nextevt=9223372036854775807 wakeup=9223372036854775807

But that event is before the next expiring timer. Confused, but Frederic
just told me on IRC he's on to something.

> Does the consistent ~159s mean anything?

I don't think so. It might be the limitation of the clockevent device,
the maximum sleep time restriction or some other unrelated event (device
interrupt, hrtimer) which happens to pop up after this time for some
reason.

But it's definitely not relevant to the problem itself. It's just the
thing which brings the machine back to life. Otherwise it might sit
there forever.

Thanks,

        tglx