netdev - Re: Need help with mdiobus

Date:   Thu, 20 Oct 2016 14:55:17 +0200
From:   Zefir Kurtisi <zefir.kurtisi@...atec.com>
To:     Timur Tabi <timur@...eaurora.org>, Andrew Lunn <andrew@...n.ch>
Cc:     Florian Fainelli <f.fainelli@...il.com>, netdev@...r.kernel.org
Subject: Re: Need help with mdiobus_register and phy

On 10/19/2016 02:16 PM, Timur Tabi wrote:
> Zefir Kurtisi wrote:
>> Ok then, if you can wait some days, I'll prepare and provide you a more detailed
>> failure report to allow you testing if the issue happens with other NICs.
> 
> That sounds great.
> 

I am now able to reproduce the issue reliably again. Let's see if this is helpful
enough to check whether the problem is unique to our setup.

As described earlier, the HW is as follows:
(copper)--[at8031]--(SGMII)--[eTSEC/mpc85xx]

We are using OpenWRT/LEDE as SW platform, so I can't run the tests on the latest
kernel, but I can provide a reference for inspecting the problem. The kernel used
is 4.4.21 with the relevant drivers (at803x and gianfar_driver) being either clean
or patched with OWRT/LEDE modifications (the issue happens in either case).

The process for detecting the failure is rather trivial: continuously put the
Ethernet interface down and up and try to get some data through until it fails.
The script I used to run on the device (over serial) is:
#----------------------------------------------------------
HOST=192.168.1.100
TIMEOUT=20

while true; do
	# bounce the interface to force a new link negotiation
	ifconfig eth0 down
	ifconfig eth0 up

	# one ping with an overall deadline; stop at the first failure
	ping "$HOST" -c 1 -w "$TIMEOUT" || {
		echo "Ping failed after $TIMEOUT seconds"
		exit 1
	}
done
#----------------------------------------------------------
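
For comparing runs across other NICs, it may help to record how many down/up
cycles succeed before the first failure. The following is only a sketch of such a
variant, not the script used for the results in this mail: the function names
(cycle_link, probe, run_until_failure) and the IFACE variable are made up for
illustration, and it assumes a `date` that supports +%s (busybox and GNU do).

```shell
#!/bin/sh
# Hypothetical variant of the loop above: counts good cycles until a failure.
HOST=${HOST:-192.168.1.100}
TIMEOUT=${TIMEOUT:-20}
IFACE=${IFACE:-eth0}

# Overridable hooks, so the loop can be dry-run without touching a real NIC.
cycle_link() { ifconfig "$IFACE" down && ifconfig "$IFACE" up; }
probe()      { ping -c 1 -w "$TIMEOUT" "$HOST" >/dev/null 2>&1; }

run_until_failure() {
	count=0
	start=$(date +%s)
	# stops at the first failed link cycle or failed ping
	while cycle_link && probe; do
		count=$((count + 1))
	done
	echo "failed after $count good cycles, $(( $(date +%s) - start )) s"
}
```

Calling run_until_failure on the device starts the loop; unlike the original, it
also stops if ifconfig itself fails.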
In my setup, it takes less than 5 minutes for the script to exit. Once the PHY is
in this state, the relevant registers read from the at803x are as follows:
* copper status              (0x01): 0x796d
* fiber  status              (0x01): 0x6149
* copper phy specific status (0x11): 0xbc10
* fiber  phy specific status (0x11): 0x8100
* copper/fiber status        (0x1b): 0x020e

In essence:
* the copper link is up with autonegotiation completed
  (switch/peer link indication is also up), but
* the SGMII link is down
  (bits LINK and MR_AN_COMPLETE in the fiber PHY specific status are cleared),
* although SGMII is synchronized (SYNC_STATUS bit is set in the same register)
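
As a cross-check, the two standard status values (register 0x01) listed above can
be decoded by hand. This is a minimal sketch that only looks at the generic IEEE
802.3 clause 22 bits (bit 2 = link status, bit 5 = autonegotiation complete); the
function name decode_bmsr is made up here, and the vendor-specific register 0x11
is deliberately not decoded, since its layout has to come from the PHY datasheet.

```shell
#!/bin/sh
# Decode the two clause 22 status-register (0x01) bits relevant here.
decode_bmsr() {
	val=$(( $1 ))
	link=down
	aneg=no
	[ $(( val & 0x0004 )) -ne 0 ] && link=up    # bit 2: link status
	[ $(( val & 0x0020 )) -ne 0 ] && aneg=yes   # bit 5: autoneg complete
	echo "link=$link an_complete=$aneg"
}

decode_bmsr 0x796d   # copper status     -> link=up an_complete=yes
decode_bmsr 0x6149   # fiber/SGMII status -> link=down an_complete=no
```

Note that the link status bit in register 0x01 latches low, so a single read may
report a stale "down"; the values above agree with the PHY specific status anyway.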

The SGMII link does not recover from this state. The only reliable way I found to
get it up again is to do a power down/up cycle on the fiber/SGMII side, which is
what the patch I contributed does.

Now the interesting (and, for me, new) part: if I remove the at803x driver from
the system and use the generic PHY driver instead, the problem does not happen (or
at least it did not while running for one full day). I don't see any significant
differences between what at803x does beyond genphy (the GPIO-based HW reset is not
run for the 8031 in at803x_link_change_notify()). Using genphy is not an option
for us, since we also have the PHY attached to fiber and need dedicated handling.

Hope this helps as a start; let me know if you need more information.

Cheers,
Zefir