Message-ID: <CAOf5uwkJ9p=WywtsyhWk+x=7M_AodmAjg6mV2_-AeTjmJFUGAQ@mail.gmail.com>
Date: Mon, 12 Aug 2024 10:45:05 +0200
From: Michael Nazzareno Trimarchi <michael@...rulasolutions.com>
To: Peng Fan <peng.fan@....com>
Cc: LKML <linux-kernel@...r.kernel.org>, dl-linux-imx <linux-imx@....com>,
Fabio Estevam <festevam@...il.com>, Shawn Guo <shawnguo@...nel.org>
Subject: Re: imx6q random crashing using 4 cpus
Hi Peng
On Mon, Aug 12, 2024 at 10:33 AM Peng Fan <peng.fan@....com> wrote:
>
> Hi,
> > Subject: imx6q random crashing using 4 cpus
> >
> > Hi all
> >
> > I'm getting random crashes, including segmentation faults of services, if I
> > boot a custom imx6q design with all the CPUs (nr_cpus=3 works). I did not
> > find anyone who has raised this problem in the past, but I would like to
> > know whether you have seen it in your experience. The silicon revision is
> > 1.6 for the imx6q.
> >
> > I have tested
> >
> > 6.10.3
>
> Upstream kernel?
>
This is the upstream kernel.
> > 6.6
>
> Is this the upstream kernel or the NXP-released 6.6 kernel?
>
6.6-fslc, but I have tested plain 6.6 LTS too; same instability.
> Does an older kernel version work well?
>
What revision do you suggest? I can easily test them all.
> >
> > I have tried removing the idle state, increasing the core voltage, etc.
>
> cpuidle.off=1 does not help, right?
>
I have gotten rid of the cpuidle init in mach-imx6q and tested cpuidle.off=1 too.
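For reference, the relevant boot arguments I have been cycling through are
roughly these (the rest of the command line is unchanged, and nr_cpus=4 is just
the explicit form of the default all-cores boot):

    cpuidle.off=1    (cpuidle disabled completely)
    nr_cpus=3        (stable)
    nr_cpus=4        (random stalls / segfaults)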
> I cannot recall clearly about the LDO; I remember there are LDO-enabled
> and LDO-disabled configurations. Have you checked the LDO?
I can try to not use the LDO from the PMIC and use the internal one instead.
>
> > Those CPUs are industrial
> > grade and they can run up to 800 MHz.
> >
> > All kernels look OK if I reduce the number of CPUs. Here is one of the
> > backtraces, for instance:
> >
> > [ OK ] Stopped target Preparation for Network.
> > [ 134.671302] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> > [ 134.677247] rcu: 2-...0: (1 GPs behind) idle=3c74/1/0x40000000 softirq=1197/1201 fqs=421
>
> CPU 2 seems stuck.
I have seen that, but it does not get stuck with 3 CPUs. I have also seen that
the power supply is grouped as CPUs 0-1 and CPUs 2-3. Is it possible that it is
something connected to the power supply, or anything else that makes the core
unstable?
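If it helps to isolate the 2-3 pair, a small pinned spinner like the sketch
below is what I would use to load only those two cores. It is generic userspace
code, nothing imx-specific, and the file name is just an example:

/* cpu_spin.c - pin one busy thread per requested CPU, e.g. ./cpu_spin 2 3
 * build: gcc -O2 -pthread -o cpu_spin cpu_spin.c
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <unistd.h>

static void *spin(void *arg)
{
        long cpu = (long)arg;
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        /* pin this thread to the requested CPU, then burn cycles forever */
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        for (;;)
                ;
        return NULL;
}

int main(int argc, char **argv)
{
        pthread_t tid;
        int i;

        /* one spinner thread per CPU number given on the command line */
        for (i = 1; i < argc; i++)
                pthread_create(&tid, NULL, spin, (void *)atol(argv[i]));

        pause();        /* keep the process alive until it is killed */
        return 0;
}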
>
> > [ 134.685445] rcu: (detected by 0, t=2106 jiffies, g=1449, q=175 ncpus=4)
> > [ 134.692158] Sending NMI from CPU 0 to CPUs 2:
> > [ 144.696530] rcu: rcu_sched kthread starved for 995 jiffies! g1449 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=1
> > [ 144.706543] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
> > [ 144.715506] rcu: RCU grace-period kthread stack dump:
> > [ 144.720563] task:rcu_sched state:I stack:0 pid:14 tgid:14 ppid:2 flags:0x00000000
> > [ 144.729890] Call trace:
> > [ 144.729902]  __schedule from schedule+0x24/0x90
> > [ 144.737008]  schedule from schedule_timeout+0x88/0x100
> > [ 144.742175]  schedule_timeout from rcu_gp_fqs_loop+0xec/0x4c4
> > [ 144.747955]  rcu_gp_fqs_loop from rcu_gp_kthread+0xc4/0x154
> > [ 144.753556]  rcu_gp_kthread from kthread+0xdc/0xfc
> > [ 144.758381]  kthread from ret_from_fork+0x14/0x20
> > [ 144.763108] Exception stack(0xf0875fb0 to 0xf0875ff8)
> > [ 144.768172] 5fa0: 00000000 00000000 00000000 00000000
> > [ 144.776360] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > [ 144.784546] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> > [ 144.791169] rcu: Stack dump where RCU GP kthread last ran:
> > [ 144.796659] Sending NMI from CPU 0 to CPUs 1:
> > [ 144.801027] NMI backtrace for cpu 1 skipped: idling at default_idle_call+0x28/0x3c
> > [ 144.809643] sysrq: This sysrq operation is disabled.
>
> Have you ever tried using JTAG to see the CPU status?
> Is the CPU in the idle loop?
> Is the CPU running at an invalid address and hung?
I need to check.
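Also, the "sysrq: This sysrq operation is disabled" line in the log above
suggests sysrq is blocked here; it is probably worth enabling it before the
next run so the backtrace triggers are not refused, e.g. via sysctl:

    kernel.sysrq = 1    (or sysrq_always_enabled on the kernel command line)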
Michael
>
> Regards,
> Peng.
>
> >
> > I'm trying to figure out what the problem could be, but I don't have a
> > similar reference to compare against.
> >
> > Michael
> >
> > --
> > Michael Nazzareno Trimarchi
> > Co-Founder & Chief Executive Officer
> > M. +39 347 913 2170
> > michael@...rulasolutions.com
> > __________________________________
> >
> > Amarula Solutions BV
> > Joop Geesinkweg 125, 1114 AB, Amsterdam, NL T. +31 (0)85 111 9172
> > info@...rulasolutions.com
> > www.amarulasolutions.com
--
Michael Nazzareno Trimarchi
Co-Founder & Chief Executive Officer
M. +39 347 913 2170
michael@...rulasolutions.com
__________________________________
Amarula Solutions BV
Joop Geesinkweg 125, 1114 AB, Amsterdam, NL
T. +31 (0)85 111 9172
info@...rulasolutions.com
www.amarulasolutions.com