lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240126085122.21e0a8a2@meshulam.tesarici.cz>
Date: Fri, 26 Jan 2024 08:51:22 +0100
From: Petr Tesařík <petr@...arici.cz>
To: Florian Fainelli <f.fainelli@...il.com>
Cc: Andrew Lunn <andrew@...n.ch>, Marc Haber <mh+netdev@...schlus.de>,
 alexandre.torgue@...s.st.com, Jose Abreu <joabreu@...opsys.com>, Chen-Yu
 Tsai <wens@...e.org>, Jernej Skrabec <jernej.skrabec@...il.com>, Samuel
 Holland <samuel@...lland.org>, Jisheng Zhang <jszhang@...nel.org>,
 netdev@...r.kernel.org
Subject: Re: stmmac on Banana PI CPU stalls since Linux 6.6

On Thu, 25 Jan 2024 12:00:46 -0800
Florian Fainelli <f.fainelli@...il.com> wrote:

> On 1/25/24 11:54, Andrew Lunn wrote:
> >> I have checked out 2eb85b750512cc5dc5a93d5ff00e1f83b99651db (which is
> >> the first bad commit that the bisect eventually identified) and tried
> >> running:
> >>
> >> [56/4504]mh@fan:~/linux/git/linux ((2eb85b750512...)) $ make BUILDARCH="amd64" ARCH="arm" KBUILD_DEBARCH="armhf" CROSS_COMPILE="arm-linux-gnueabihf-" drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst
> >>    SYNC    include/config/auto.conf.cmd
> >>    SYSHDR  arch/arm/include/generated/uapi/asm/unistd-oabi.h
> >>    SYSHDR  arch/arm/include/generated/uapi/asm/unistd-eabi.h
> >>    HOSTCC  scripts/kallsyms
> >>    UPD     include/config/kernel.release
> >>    UPD     include/generated/uapi/linux/version.h
> >>    UPD     include/generated/utsrelease.h
> >>    SYSNR   arch/arm/include/generated/asm/unistd-nr.h
> >>    SYSTBL  arch/arm/include/generated/calls-oabi.S
> >>    SYSTBL  arch/arm/include/generated/calls-eabi.S
> >>    CC      scripts/mod/empty.o
> >>    MKELF   scripts/mod/elfconfig.h
> >>    HOSTCC  scripts/mod/modpost.o
> >>    CC      scripts/mod/devicetable-offsets.s
> >>    UPD     scripts/mod/devicetable-offsets.h
> >>    HOSTCC  scripts/mod/file2alias.o
> >>    HOSTCC  scripts/mod/sumversion.o
> >>    HOSTLD  scripts/mod/modpost
> >>    CC      kernel/bounds.s
> >>    CC      arch/arm/kernel/asm-offsets.s
> >>    UPD     include/generated/asm-offsets.h
> >>    CALL    scripts/checksyscalls.sh
> >>    CHKSHA1 include/linux/atomic/atomic-arch-fallback.h
> >>    CHKSHA1 include/linux/atomic/atomic-instrumented.h
> >>    MKLST   drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst
> >> ./scripts/makelst: 1: arithmetic expression: expecting EOF: "0x - 0x00000000"
> >> [57/4505]mh@fan:~/linux/git/linux ((2eb85b750512...)) $
> >>
> >> That is not what it was suppsoed to yield, right?  
> > 
> > No. But did it actually generate
> > drivers/net/ethernet/stmicro/stmmac/stmmac_main.lst Sometime errors
> > like this are not always fatal.
> >   
> >> My bisect eventually completed and identified
> >> 2eb85b750512cc5dc5a93d5ff00e1f83b99651db as the first bad commit.  
> > 
> > I can make a guess.
> > 
> > -       memset(&priv->xstats, 0, sizeof(struct stmmac_extra_stats));
> > 
> > Its removed, not moved later. Deep within this structure is the
> > stmmac_txq_stats and stmmac_rxq_stats which this function is supposed
> > to return, and the two syncp variables are in it as well.
> > 
> > My guess is, they have an invalid state, when this memset is missing.
> > 
> > Try putting the memset back.
> > 
> > I also guess that is not the real fix, there are missing calls to
> > u64_stats_init().  
> 
> Did not Petr try to address the same problem essentially:
> 
> https://lore.kernel.org/netdev/20240105091556.15516-1-petr@tesarici.cz/
> 
> this was not deemed the proper solution and I don't think one has been 
> posted since then, but it looks about your issue here Marc.

Yes, it looks like the same issue I ran into on my NanoPi. I'm sorry
I've been busy with other things lately, so I could not test and submit
my changes.

Essentially, the write side of the statistics seqlock is not protected
and will eventually miss an increment, causing the read side to spin
forever. The final plan is to split the statistics into three parts:

1. fields updated only under the tx queue lock,
2. fields updated only during NAPI poll,
3. fields updated only from interrupt context,

The first two groups can each have its own seqlock. The third group
(actually a single counter) can be converted to a per-CPU variable. The
read side will then aggregate the values as appropriate.

I hope I can find some time for this bug again during the coming weekend
(it's not for my day job). It's motivating to know that I'm not the
only affected person on the planet. ;-)

Petr T

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ