Message-ID: <CAK4VdL08sdZV7o7Bw=cutdmoCEi1NYB-yisstLqRuH7QcHOHvA@mail.gmail.com>
Date:   Sun, 6 Mar 2022 10:40:47 +0100
From:   Erico Nunes <nunes.erico@...il.com>
To:     Heiner Kallweit <hkallweit1@...il.com>
Cc:     Jerome Brunet <jbrunet@...libre.com>,
        Martin Blumenstingl <martin.blumenstingl@...glemail.com>,
        Alexandre Torgue <alexandre.torgue@...s.st.com>,
        Giuseppe Cavallaro <peppe.cavallaro@...com>,
        Jose Abreu <joabreu@...opsys.com>,
        Kevin Hilman <khilman@...libre.com>,
        Neil Armstrong <narmstrong@...libre.com>,
        linux-amlogic@...ts.infradead.org, netdev@...r.kernel.org,
        "open list:ARM/Rockchip SoC..." <linux-rockchip@...ts.infradead.org>,
        linux-sunxi@...ts.linux.dev
Subject: Re: net: stmmac: dwmac-meson8b: interface sometimes does not come up
 at boot

On Wed, Mar 2, 2022 at 5:35 PM Heiner Kallweit <hkallweit1@...il.com> wrote:
> When using polling the time difference between aneg complete and
> PHY state machine run is random in the interval 0 .. 1s.
> Hence there's a certain chance that the difference is too small
> to avoid the issue.
>
> > If I understand the proposed patch correctly, it is mostly about the phy
> > IRQ. Since I reproduce without the IRQ, I suppose it is not the
> > problem we were looking for (might still be a problem worth fixing -
> > the phy is not "rock-solid" when it comes to aneg - I already tried
> > stabilising it a few years ago)
>
> Below is a slightly improved version of the test patch. It doesn't sleep
> in the (threaded) interrupt handler and lets the workqueue do it.
>
> Maybe Amlogic is aware of a potentially related silicon issue?
>
> >
> > TBH, It bothers me that I reproduced w/o the IRQ. The idea makes
> > sense :/
> >
> >>
> [...]
> >
>
>
> diff --git a/drivers/net/phy/meson-gxl.c b/drivers/net/phy/meson-gxl.c
> index 7e7904fee..a3318ae01 100644
> --- a/drivers/net/phy/meson-gxl.c
> +++ b/drivers/net/phy/meson-gxl.c
> @@ -209,12 +209,7 @@ static int meson_gxl_config_intr(struct phy_device *phydev)
>                 if (ret)
>                         return ret;
>
> -               val = INTSRC_ANEG_PR
> -                       | INTSRC_PARALLEL_FAULT
> -                       | INTSRC_ANEG_LP_ACK
> -                       | INTSRC_LINK_DOWN
> -                       | INTSRC_REMOTE_FAULT
> -                       | INTSRC_ANEG_COMPLETE;
> +               val = INTSRC_LINK_DOWN | INTSRC_ANEG_COMPLETE;
>                 ret = phy_write(phydev, INTSRC_MASK, val);
>         } else {
>                 val = 0;
> @@ -240,7 +235,10 @@ static irqreturn_t meson_gxl_handle_interrupt(struct phy_device *phydev)
>         if (irq_status == 0)
>                 return IRQ_NONE;
>
> -       phy_trigger_machine(phydev);
> +       if (irq_status & INTSRC_ANEG_COMPLETE)
> +               phy_queue_state_machine(phydev, msecs_to_jiffies(100));
> +       else
> +               phy_trigger_machine(phydev);
>
>         return IRQ_HANDLED;
>  }
> --
> 2.35.1
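
The timing argument quoted above can be put into numbers with a rough
sketch. It assumes the gap between aneg complete and the next polled
state machine run is uniformly distributed over the poll interval, and
that some minimum settle time is needed. Both are assumptions: the
function name is hypothetical, and the 100 ms used in the usage note is
only borrowed from the delay in the test patch, not a measured value.

```c
/*
 * Hypothetical helper, not kernel code.
 *
 * Estimates the chance that a polled PHY state machine run lands less
 * than min_gap_ms after autonegotiation completes, ASSUMING the gap is
 * uniformly distributed in [0, poll_interval_ms).
 */
double prob_gap_too_small(double poll_interval_ms, double min_gap_ms)
{
	/* No settle time required: the gap can never be too small. */
	if (min_gap_ms <= 0.0)
		return 0.0;

	/* Required settle time covers the whole interval: always too small. */
	if (poll_interval_ms <= min_gap_ms)
		return 1.0;

	/* Uniform distribution: fraction of the interval below the threshold. */
	return min_gap_ms / poll_interval_ms;
}
```

Under these assumptions, prob_gap_too_small(1000.0, 100.0) comes out at
about 0.1, so polling alone would leave a noticeable share of boots
inside the suspect window, which would fit the issue also showing up
without the IRQ.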

I did a lot of testing with this patch, and it seems to improve things.
For me it completely resolves the original issue, the more easily
reproducible one where I would see "Link is Up" but the interface did
not really work.
In over a thousand jobs, that issue never reproduced with this patch.

I do see a different issue now, but it is even less frequent and
harder to reproduce. In those over a thousand jobs, I have seen it
only about 4 times.
The difference is that now, when the issue happens, the link is not
even reported as Up. The output differs a bit from the original one,
but it is the same in every instance where the issue reproduced. It
looks like this (note that there is no longer a "Link is Down"/"Link is
Up" pair):

[    2.186151] meson8b-dwmac c9410000.ethernet eth0: PHY [0.e40908ff:08] driver [Meson GXL Internal PHY] (irq=48)
[    2.191582] meson8b-dwmac c9410000.ethernet eth0: Register MEM_TYPE_PAGE_POOL RxQ-0
[    2.208713] meson8b-dwmac c9410000.ethernet eth0: No Safety Features support found
[    2.210673] meson8b-dwmac c9410000.ethernet eth0: PTP not supported by HW
[    2.218083] meson8b-dwmac c9410000.ethernet eth0: configuring for phy/rmii link mode
[   22.227444] Waiting up to 100 more seconds for network.
[   42.231440] Waiting up to 80 more seconds for network.
[   62.235437] Waiting up to 60 more seconds for network.
[   82.239437] Waiting up to 40 more seconds for network.
[  102.243439] Waiting up to 20 more seconds for network.
[  122.243446] Sending DHCP requests ...
[  130.113944] random: fast init done
[  134.219441] ... timed out!
[  194.559562] IP-Config: Retrying forever (NFS root)...
[  194.624630] meson8b-dwmac c9410000.ethernet eth0: PHY [0.e40908ff:08] driver [Meson GXL Internal PHY] (irq=48)
[  194.630739] meson8b-dwmac c9410000.ethernet eth0: Register MEM_TYPE_PAGE_POOL RxQ-0
[  194.649138] meson8b-dwmac c9410000.ethernet eth0: No Safety Features support found
[  194.651113] meson8b-dwmac c9410000.ethernet eth0: PTP not supported by HW
[  194.657931] meson8b-dwmac c9410000.ethernet eth0: configuring for phy/rmii link mode
[  196.313602] meson8b-dwmac c9410000.ethernet eth0: Link is Up - 100Mbps/Full - flow control off
[  196.339463] Sending DHCP requests ., OK
...
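
For sorting a large batch of job logs, a small classifier along these
lines might help separate this failure mode from good boots. It is only
a sketch: the enum and function names are hypothetical, and it assumes
that a good boot prints "Link is Up" before the first "Waiting up to"
message, as suggested by the log above. It does not detect the original
issue, where "Link is Up" was printed but the interface did not work.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of one boot log. */
enum boot_result {
	BOOT_LINK_UP_IN_TIME,	/* "Link is Up" seen before any network wait */
	BOOT_LINK_UP_LATE,	/* "Link is Up" only after a "Waiting up to" line */
	BOOT_LINK_NEVER_UP,	/* no "Link is Up" anywhere in the log */
};

/*
 * Classify a complete boot log held in one NUL-terminated string.
 * The two marker strings are taken from the kernel output shown above.
 */
enum boot_result classify_boot_log(const char *log)
{
	const char *up = strstr(log, "Link is Up");
	const char *wait = strstr(log, "Waiting up to");

	if (up == NULL)
		return BOOT_LINK_NEVER_UP;

	/* Both markers point into the same string, so comparing is valid. */
	if (wait != NULL && wait < up)
		return BOOT_LINK_UP_LATE;

	return BOOT_LINK_UP_IN_TIME;
}
```

Run against the log above, this would return BOOT_LINK_UP_LATE, since
the first "Waiting up to" line comes well before the link is reported
as Up.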


I don't remember seeing output like this in the previous tests.
Is there any further improvement we can make to the patch based on this?

Thanks

Erico
