netdev - Re: stmmac on Banana PI CPU stalls since Linux 6.6

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <8efb36c2-a696-4de7-b3d7-2238d4ab5ebb@lunn.ch>
Date: Sun, 21 Jan 2024 22:52:56 +0100
From: Andrew Lunn <andrew@...n.ch>
To: Marc Haber <mh+netdev@...schlus.de>
Cc: alexandre.torgue@...s.st.com, Jose Abreu <joabreu@...opsys.com>,
	Chen-Yu Tsai <wens@...e.org>,
	Jernej Skrabec <jernej.skrabec@...il.com>,
	Samuel Holland <samuel@...lland.org>,
	Jisheng Zhang <jszhang@...nel.org>, netdev@...r.kernel.org
Subject: Re: stmmac on Banana PI CPU stalls since Linux 6.6

On Sun, Jan 21, 2024 at 09:17:32PM +0100, Marc Haber wrote:
> Hi,
> 
> I am running a bunch of Banana Pis with Debian stable and unstable but
> with a bleeding edge kernel. Since kernel 6.6, especially the test
> system running Debian unstable is plagued by self-detected stalls on
> CPU. The system seems to continue running normally locally but doesn't
> answer on the network any more. Sometimes, after a few hours, things
> heal themselves.
> 
> Here is an example log output:
> [73929.363030] rcu: INFO: rcu_sched self-detected stall on CPU
> [73929.368653] rcu:     1-....: (5249 ticks this GP) idle=d15c/1/0x40000002 softirq=471343/471343 fqs=2625
> [73929.377796] rcu:     (t=5250 jiffies g=851349 q=113 ncpus=2)
> [73929.383205] CPU: 1 PID: 14512 Comm: atop Tainted: G             L     6.6.0-zgbpi-armmp-lpae+ #1
> [73929.383222] Hardware name: Allwinner sun7i (A20) Family
> [73929.383233] PC is at stmmac_get_stats64+0x64/0x20c [stmmac]
> [73929.383363] LR is at dev_get_stats+0x44/0x144
> [73929.383389] pc : [<bf126db0>]    lr : [<c09525e8>]    psr: 200f0013
> [73929.383401] sp : f0c59c78  ip : f0c59df8  fp : c2bb8000
> [73929.383412] r10: 00800001  r9 : c3443dd8  r8 : 00000143
> [73929.383423] r7 : 00000001  r6 : 00000000  r5 : c2bbb000  r4 : 00000001
> [73929.383434] r3 : 0004c891  r2 : c2bbae48  r1 : f0c59d30  r0 : c2bb8000
> [73929.383447] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> [73929.383463] Control: 30c5387d  Table: 49b553c0  DAC: a7f66f60
> [73929.383486]  stmmac_get_stats64 [stmmac] from dev_get_stats+0x44/0x144

Hi Marc

https://elixir.bootlin.com/linux/v6.7.1/source/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c#L6949

My _guess_ would be, its stuck in one of the loops which look like:

		do {
			start = u64_stats_fetch_begin(&txq_stats->syncp);
			tx_packets = txq_stats->tx_packets;
			tx_bytes   = txq_stats->tx_bytes;
		} while (u64_stats_fetch_retry(&txq_stats->syncp, start));

Next time you get a backtrace, could you do:

make drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst. You can then
use whatever it is reporting for:

PC is at stmmac_get_stats64+0x64/0x20c [stmmac]

to find where it is in the listing.

Once we know if its the RX or the TX loop, we have a better idea where
to look for an unbalanced u64_stats_update_begin() /
u64_stats_update_end().

> I am running a bisect attempt since before christmas, but since it takes
> up to a day for the issue to show themselves on a "bad" kernel, I'll let
> "good" kernels run for four days until I declare them good. That takes a
> lot of wall clock (or better, wall calendar) time.

You might be able to speed it up with:

while true ; do cat /proc/net/dev > /dev/null ; done

and iperf or similar to generate a lot of traffic.

    Andrew