netdev - Re: stmmac on Banana PI CPU stalls since Linux 6.6

Message-ID: <ZbOOG_yyCUgK_2b1@torres.zugschlus.de>
Date: Fri, 26 Jan 2024 11:48:59 +0100
From: Marc Haber <mh+netdev@...schlus.de>
To: Andrew Lunn <andrew@...n.ch>
Cc: alexandre.torgue@...s.st.com, Jose Abreu <joabreu@...opsys.com>,
	Chen-Yu Tsai <wens@...e.org>,
	Jernej Skrabec <jernej.skrabec@...il.com>,
	Samuel Holland <samuel@...lland.org>,
	Jisheng Zhang <jszhang@...nel.org>, netdev@...r.kernel.org
Subject: Re: stmmac on Banana PI CPU stalls since Linux 6.6

On Thu, Jan 25, 2024 at 07:01:40PM +0100, Marc Haber wrote:
> On Sun, Jan 21, 2024 at 10:52:56PM +0100, Andrew Lunn wrote:
> > On Sun, Jan 21, 2024 at 09:17:32PM +0100, Marc Haber wrote:
> > > Hi,
> > > 
> > > I am running a bunch of Banana Pis with Debian stable and unstable but
> > > with a bleeding edge kernel. Since kernel 6.6, especially the test
> > > system running Debian unstable is plagued by self-detected stalls on
> > > CPU. The system seems to continue running normally locally but doesn't
> > > answer on the network any more. Sometimes, after a few hours, things
> > > heal themselves.
> > > 
> > > Here is an example log output:
> > > [73929.363030] rcu: INFO: rcu_sched self-detected stall on CPU
> > > [73929.368653] rcu:     1-....: (5249 ticks this GP) idle=d15c/1/0x40000002 softirq=471343/471343 fqs=2625
> > > [73929.377796] rcu:     (t=5250 jiffies g=851349 q=113 ncpus=2)
> > > [73929.383205] CPU: 1 PID: 14512 Comm: atop Tainted: G             L     6.6.0-zgbpi-armmp-lpae+ #1
> > > [73929.383222] Hardware name: Allwinner sun7i (A20) Family
> > > [73929.383233] PC is at stmmac_get_stats64+0x64/0x20c [stmmac]
> > > [73929.383363] LR is at dev_get_stats+0x44/0x144
> > > [73929.383389] pc : [<bf126db0>]    lr : [<c09525e8>]    psr: 200f0013
> > > [73929.383401] sp : f0c59c78  ip : f0c59df8  fp : c2bb8000
> > > [73929.383412] r10: 00800001  r9 : c3443dd8  r8 : 00000143
> > > [73929.383423] r7 : 00000001  r6 : 00000000  r5 : c2bbb000  r4 : 00000001
> > > [73929.383434] r3 : 0004c891  r2 : c2bbae48  r1 : f0c59d30  r0 : c2bb8000
> > > [73929.383447] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> > > [73929.383463] Control: 30c5387d  Table: 49b553c0  DAC: a7f66f60
> > > [73929.383486]  stmmac_get_stats64 [stmmac] from dev_get_stats+0x44/0x144
> > 
> > Hi Marc
> > 
> > https://elixir.bootlin.com/linux/v6.7.1/source/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c#L6949
> 
> That is just for reference to the source? Or am I supposed to do
> something with that link?
> 
> > My _guess_ would be that it's stuck in one of the loops which look like:
> > 
> > 		do {
> > 			start = u64_stats_fetch_begin(&txq_stats->syncp);
> > 			tx_packets = txq_stats->tx_packets;
> > 			tx_bytes   = txq_stats->tx_bytes;
> > 		} while (u64_stats_fetch_retry(&txq_stats->syncp, start));
> > 
> > Next time you get a backtrace, could you do:
> > 
> > make drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst. You can then
> > use whatever it is reporting for:

So, if I have in my current backtrace:
PC is at stmmac_get_stats64+0x48/0x20c [stmmac]
I look in the generated stmmac_main.lst for the function
stmmac_get_stats64:
00005e9c <stmmac_get_stats64>:
{
    5e9c:       e92d47f0        push    {r4, r5, r6, r7, r8, r9, sl, lr}
    5ea0:       e52de004        push    {lr}            @ (str lr, [sp, #-4]!)
    5ea4:       ebfffffe        bl      0 <__gnu_mcount_nc>
                        5ea4: R_ARM_CALL        __gnu_mcount_nc
        u32 tx_cnt = priv->plat->tx_queues_to_use;
    5ea8:       e2805a03        add     r5, r0, #12288  @ 0x3000
    5eac:       e59535c0        ldr     r3, [r5, #1472] @ 0x5c0
    5eb0:       e5937078        ldr     r7, [r3, #120]  @ 0x78
        u32 rx_cnt = priv->plat->rx_queues_to_use;
    5eb4:       e5934074        ldr     r4, [r3, #116]  @ 0x74
        for (q = 0; q < tx_cnt; q++) {
    5eb8:       e3570000        cmp     r7, #0
    5ebc:       12802db9        addne   r2, r0, #11840  @ 0x2e40
    5ec0:       12822008        addne   r2, r2, #8
    5ec4:       13a06000        movne   r6, #0
    5ec8:       1a00000b        bne     5efc <stmmac_get_stats64+0x60>
    5ecc:       ea000026        b       5f6c <stmmac_get_stats64+0xd0>
        local_irq_restore(flags);
}

The address in the first line is the base address, so the instruction in
question is at 0x5e9c + 0x48 = 0x5ee4, which is already outside the function?!
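To double-check the arithmetic, here is a minimal Python sketch. It assumes
the usual kernel backtrace notation symbol+offset/size, where the second
number (0x20c here) is the size of the function in bytes. The helper name
listing_address is made up for illustration and is not part of any kernel
tooling.

```python
def listing_address(base: int, offset: int, size: int) -> tuple[int, bool]:
    """Map a backtrace 'symbol+offset/size' onto an address in the .lst file.

    base   -- address of the symbol in the listing (0x5e9c for stmmac_get_stats64)
    offset -- first number after the '+' in the backtrace
    size   -- number after the '/', assumed to be the function size in bytes
    Returns the listing address and whether the offset lies within the size.
    """
    addr = base + offset
    return addr, 0 <= offset < size


addr, inside = listing_address(0x5E9C, 0x48, 0x20C)
print(hex(addr), inside)
```

If that reading of the notation is right, 0x5ee4 lies below 0x5e9c + 0x20c =
0x60a8, so the instruction should still belong to the function, and the
listing text for it would have to be further down in the .lst than the
excerpt above.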

> My bisect eventually completed and identified
> 2eb85b750512cc5dc5a93d5ff00e1f83b99651db as the first bad commit.
> Sadly, it doesn't contain any loops, no calls to u64_stats_update_begin()
> or u64_stats_update_end() or other suspicious things to the casual
> reader.
> 
> I have backed that commit out of 6.7.1 and have booted that kernel.
> It has not been running long enough to be able to say something yet.

That didn't fix the hangs. In the stall reports, the PC is at one of:
stmmac_get_stats64+0x34/0x20c
stmmac_get_stats64+0x38/0x20c
stmmac_get_stats64+0x3c/0x20c
stmmac_get_stats64+0x40/0x20c
stmmac_get_stats64+0x44/0x20c
stmmac_get_stats64+0x48/0x20c
stmmac_get_stats64+0x4c/0x20c
stmmac_get_stats64+0x50/0x20c
stmmac_get_stats64+0x54/0x20c
stmmac_get_stats64+0x58/0x20c
stmmac_get_stats64+0x5c/0x20c
stmmac_get_stats64+0x60/0x20c
stmmac_get_stats64+0x64/0x20c
(sorted, uniq, about 66 instances in about 18 hours)
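A list like the one above can be produced from the kernel log with a few
lines of Python. This is only a sketch of the sort/uniq step, not the exact
commands used. The function name count_pc_offsets is hypothetical, and the
sample lines are taken from the log excerpts earlier in this thread.

```python
import re
from collections import Counter

# Matches e.g. "PC is at stmmac_get_stats64+0x64/0x20c [stmmac]"
PC_RE = re.compile(r"PC is at (\w+)\+(0x[0-9a-f]+)/(0x[0-9a-f]+)")


def count_pc_offsets(lines):
    """Count how often each (symbol, offset) pair shows up as the stalled PC."""
    counts = Counter()
    for line in lines:
        match = PC_RE.search(line)
        if match:
            symbol, offset = match.group(1), int(match.group(2), 16)
            counts[(symbol, offset)] += 1
    return counts


sample = [
    "[73929.383233] PC is at stmmac_get_stats64+0x64/0x20c [stmmac]",
    "[73929.383363] LR is at dev_get_stats+0x44/0x144",
    "PC is at stmmac_get_stats64+0x48/0x20c [stmmac]",
]
for (symbol, offset), n in sorted(count_pc_offsets(sample).items()):
    print(f"{symbol}+{offset:#x}: {n}")
```

Lines that do not contain a "PC is at" entry, such as the LR line, are
skipped.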

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421