Message-ID: <a4d3fef1-d410-c029-cdff-4d90f578e2da@gmail.com>
Date:   Wed, 2 Mar 2022 17:34:58 +0100
From:   Heiner Kallweit <hkallweit1@...il.com>
To:     Jerome Brunet <jbrunet@...libre.com>,
        Erico Nunes <nunes.erico@...il.com>,
        Martin Blumenstingl <martin.blumenstingl@...glemail.com>
Cc:     Alexandre Torgue <alexandre.torgue@...s.st.com>,
        Giuseppe Cavallaro <peppe.cavallaro@...com>,
        Jose Abreu <joabreu@...opsys.com>,
        Kevin Hilman <khilman@...libre.com>,
        Neil Armstrong <narmstrong@...libre.com>,
        linux-amlogic@...ts.infradead.org, netdev@...r.kernel.org,
        "open list:ARM/Rockchip SoC..." <linux-rockchip@...ts.infradead.org>,
        linux-sunxi@...ts.linux.dev
Subject: Re: net: stmmac: dwmac-meson8b: interface sometimes does not come up
 at boot

On 02.03.2022 14:39, Jerome Brunet wrote:
> 
> On Wed 02 Mar 2022 at 12:01, Heiner Kallweit <hkallweit1@...il.com> wrote:
> 
>> On 02.03.2022 11:33, Erico Nunes wrote:
>>> On Sat, Feb 26, 2022 at 2:53 PM Heiner Kallweit <hkallweit1@...il.com> wrote:
>>>> Just to rule out that the PHY may be involved:
>>>> - Does the issue occur with internal and/or external PHY?
>>>
>>> My target boards have the internal phy only. It is not possible for me
>>> at the moment to test it with an external phy.
>>>
>>>> - Issue still occurs in PHY polling mode? (disable PHY interrupt in dts)
>>>
>>> Thanks for suggesting this. I did tests with this and it seems to be a
>>> workaround.
>>> With phy interrupt on recent kernels (around v5.17-rc3) I'm able to
>>> reproduce the issue relatively easily over a batch of a hundred jobs.
>>> With my tests with the phy in polling mode, I have not been able to
>>> reproduce so far, even with several hundred jobs.
>>>
>> It's my understanding that in the problem case the "aneg complete"
>> interrupt fires, but no data flows.
>> This might indicate a timing issue. According to the meson PHY driver
>> (I don't have the datasheet) the PHY doesn't have a "link up" interrupt
>> source, just the mentioned "aneg complete".
>>
>> Below is an experimental patch that delays link-up processing a
>> little and drops the interrupt sources that aren't needed.
>> Could you please test it with PHY interrupts enabled?
>>
>>
>> By the way, to all:
>> I found that interrupt mode is broken in fixed (aneg disabled) mode,
>> because link-up isn't signaled. Experiments showed that irq source
>> bit 7 can be used to fix this, but this bit isn't documented in the
>> driver.
>>
>>> For completeness I also tested 46f69ded988d (from my initial analysis)
>>> and setting the phy to polling mode there does not make a difference,
>>> issue still reproduces. So it may have been a different bug. Though I
>>> guess at this point we can disregard that and focus on the current
>>> kernel.
>>>
>>> I tried adding a few debugs and delays to the interrupt code path in
>>> drivers/net/phy/meson-gxl.c but nothing gave me useful info so far.
>>>
>>> Do you have more advice on how to proceed from here?
>>>
>>> Thanks
>>>
>>> Erico
>>
>> Heiner
> 
> Hi,
> 
> I also did some tests on my side, mostly with v5.10.93 ATM.
> It is true that I can recall seeing this issue only on boards using the
> internal PHY (g12 and gxl board for me - I don't have meson8b boards)
> 
> I tried on the u200 (g12 based). Being the ref design it has both
> the internal and external interfaces and I can choose.
> 
> To my surprise, I could not reproduce the issue on it with the internal
> PHY ... until I noticed that eMMC was initialising more or less at the
> same time as the network.
> 
> I disabled the eMMC, out of curiosity, and the issue was back.
> Like Heiner, I suspect a timing issue - at this stage, I can't tell if it
> is PHY related though.
> 
> I also tried with the external phy, could not reproduce. Unfortunately,
> as we can see from the first test on the u200, not reproducing is not
> really proof, and it is difficult to conclude anything.
> 
> Like Erico, I tried bisecting but I ended up on a BT merge ... Clearly
> inconclusive :(
> 
> Disabling the IRQ is an interesting test but, on my side, I have mixed
> results (on the libretech-cc this time):
> 
> * I first tried quickly while bisecting, on commit
>   5.6.0-rc3-01434-g8d4ccd7770e7:
>   - With IRQ => NOK
>   - POLL => NOK
> 
> Seeing Erico's report, I thought maybe I mixed things up so I tried again,
> double-checked that IRQs were disabled ... still broken. There was another
> commit where I reproduced it without IRQ, but I lost track of it.
> 
> * I also tried on v5.10.93:
>   - With IRQ => NOK
>   - POLL => OK ... (well, I got bored before the issue showed up)
> 
> It seems that switching to polling, in some cases, changes the timings
> just enough to hide the issue ... but not always. Unless I forgot to
> consider something else ?? Ideas ?
> 
With polling, the delay between aneg complete and the next PHY state
machine run is random in the interval 0 .. 1s.
Hence there's a certain chance that the delay is too small
to avoid the issue.
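
As a rough illustration of the polling argument: if the poll phase is
uniformly distributed over the poll interval, and the link needs some
settle time after aneg complete, the chance of polling too early is the
settle time divided by the interval. The sketch below is plain user-space
C, not driver code. The function name is made up for illustration, and
any settle time passed in is an assumption, since the real value (if
there is one) is not known.

```c
#include <assert.h>

/*
 * Probability that the poll-driven state machine runs less than
 * settle_ms after "aneg complete", assuming the poll phase is
 * uniformly distributed over poll_interval_ms.
 * Both parameters are hypothetical inputs, not measured values.
 */
static double prob_poll_too_early(double settle_ms, double poll_interval_ms)
{
	assert(poll_interval_ms > 0.0);

	if (settle_ms <= 0.0)
		return 0.0;	/* no settle time needed: never too early */
	if (settle_ms >= poll_interval_ms)
		return 1.0;	/* every poll lands inside the window */

	return settle_ms / poll_interval_ms;
}
```

For example, prob_poll_too_early(100.0, 1000.0) evaluates to about 0.1,
i.e. with an assumed 100 ms settle time and a 1 s poll interval, roughly
one link-up in ten would still fall inside the window.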

> If I understand the proposed patch correctly, it is mostly about the phy
> IRQ. Since I reproduce without the IRQ, I suppose it is not the
> problem we were looking for (might still be a problem worth fixing -
> the phy is not "rock-solid" when it comes to aneg - I already tried
> stabilising it a few years ago)

Below is a slightly improved version of the test patch. Instead of
sleeping in the (threaded) interrupt handler, it lets the workqueue
apply the delay.
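
The decision the handler makes can be modelled in user space as follows.
This is only a sketch for reasoning about the logic, not kernel code:
the function name and the ANEG_SETTLE_MS constant are hypothetical, and
the bit positions are assumed to match the INTSRC_* definitions in
drivers/net/phy/meson-gxl.c and should be checked against the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions; verify against meson-gxl.c before relying on them. */
#define INTSRC_LINK_DOWN	(1u << 4)
#define INTSRC_ANEG_COMPLETE	(1u << 6)

/* Experimental delay before link-up processing; not a datasheet value. */
#define ANEG_SETTLE_MS		100

/*
 * Returns the delay in ms before the PHY state machine should run,
 * or -1 if no interrupt source is pending (the IRQ_NONE case).
 */
static int state_machine_delay_ms(uint16_t irq_status)
{
	if (irq_status == 0)
		return -1;

	/* Link-up path: defer, so the delay happens in the workqueue. */
	if (irq_status & INTSRC_ANEG_COMPLETE)
		return ANEG_SETTLE_MS;

	/* Everything else (e.g. link down) is handled immediately. */
	return 0;
}
```

In this model a link-down event still triggers the state machine right
away, and only the aneg complete event is deferred.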

Maybe Amlogic is aware of a potentially related silicon issue?

> 
> TBH, it bothers me that I reproduced w/o the IRQ. The idea makes
> sense :/
> 
>>
[...]
> 


diff --git a/drivers/net/phy/meson-gxl.c b/drivers/net/phy/meson-gxl.c
index 7e7904fee..a3318ae01 100644
--- a/drivers/net/phy/meson-gxl.c
+++ b/drivers/net/phy/meson-gxl.c
@@ -209,12 +209,7 @@ static int meson_gxl_config_intr(struct phy_device *phydev)
 		if (ret)
 			return ret;
 
-		val = INTSRC_ANEG_PR
-			| INTSRC_PARALLEL_FAULT
-			| INTSRC_ANEG_LP_ACK
-			| INTSRC_LINK_DOWN
-			| INTSRC_REMOTE_FAULT
-			| INTSRC_ANEG_COMPLETE;
+		val = INTSRC_LINK_DOWN | INTSRC_ANEG_COMPLETE;
 		ret = phy_write(phydev, INTSRC_MASK, val);
 	} else {
 		val = 0;
@@ -240,7 +235,10 @@ static irqreturn_t meson_gxl_handle_interrupt(struct phy_device *phydev)
 	if (irq_status == 0)
 		return IRQ_NONE;
 
-	phy_trigger_machine(phydev);
+	if (irq_status & INTSRC_ANEG_COMPLETE)
+		phy_queue_state_machine(phydev, msecs_to_jiffies(100));
+	else
+		phy_trigger_machine(phydev);
 
 	return IRQ_HANDLED;
 }
-- 
2.35.1

