netdev - Re: [PATCH 2/2] at803x: double check SGMII side autoneg

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:   Thu, 1 Jun 2017 13:45:19 +0200
From:   Zefir Kurtisi <zefir.kurtisi@...atec.com>
To:     Timur Tabi <timur@...eaurora.org>,
        Matthias May <matthias.may@...atec.com>,
        Andrew Lunn <andrew@...n.ch>
Cc:     netdev@...r.kernel.org, f.fainelli@...il.com,
        David Miller <davem@...emloft.net>,
        Manoj Iyer <manoj.iyer@...onical.com>, jhugo@...eaurora.org
Subject: Re: [PATCH 2/2] at803x: double check SGMII side autoneg

Hello Timur, all,

sorry for being late to chime in here, as Matthias said I have been AFK while this
discussion took place.

The related patch I committed is meant to double check that the SGMII side is done
aneg when copper side is. Usually it does, but this expectation admittedly might
fail in interrupt mode. What seems to happen is: after a power-down-up cycle the
at803x restarts aneg at copper and SGMII sides. In polling mode, if you run into
the case where copper is up but the SGMII is not, you'll see the message and in
(one of) the next poll cycle(s) SGMII should be ready too and the situation
recovers. In interrupt mode, obviously once you hit into the partially done aneg,
you won't recover since no additional link-change interrupt is going to happen.

Wit that, Tamur is right claiming the double-check breaks the driver in interrupt
mode. I could think of several work-arounds to fix it, e.g.
1) check if there is an SGMII link-change interrupt source and enable it
2) when we run into partial aneg completion, temporary switch to polling mode and
back to interrupt mode when at803x_aneg_done() succeeds

Alas, we need to rewind back to whether these double-checks are required at all.
As Matthias explained, there is at least one combination of devices that have
documented issues establishing the SGMII link - in our case it was the at8031
attached to a P1010's eTSEC (errata A-004187, fixed with chip rev. 2.01). Chances
are high that our company is the only one using this exact early version of
problematic devices. Therefore, and since the problem is at ETH side, we moved our
local workarounds into the gianfar driver who on demand restarts SGMII to recover
from the stale link.

I guess we need to decide whether we generally need to handle permanent aneg
failures on the SGMII link. If we expect that it must not fail (like we assumed
until we saw it failing), I agree with Timur and support reverting of the related
commit f62265b53e. If otherwise we want this potential failure to be handled
correctly, things become arbitrary complex. Essentially, we need to handle such
PHYs as a combination of their two sides (copper + SGMII) as virtual sub-PHYs. The
phylib might support that in a future version, but for now this seems like a lot
of work required to handle a rare problem.

tl;dr: ACK with reverting f62265b53e

Cheers,
Zefir

On 05/24/2017 03:29 PM, Timur Tabi wrote:
> On 5/24/17 2:18 AM, Matthias May wrote:
>> With the patch: When the copper side is seen as up, it also checks if aneg of
>> the SGMII link is done.
>> As far as i know SGMII can not be run without aneg, since it is always Gbit with
>> aneg mandatory.
>> If SGMII aneg is not done, then, well aneg is not done and thus 0 is returned.
>>
>> Internally we have this patch extended so we don't only report that aneg is not
>> done but also reset the link.
>> Eventually aneg on the SGMII side can be completed and the link comes up.
> 
> I would really like to test this patch.
> 
>> Why do you think that frames are able to go through when aneg is reported as not
>> done by the PHY?
> 
> I have two theories:
> 
> 1. The warning message is bogus.  The link actually is okay, but the driver thinks
> that it isn't.
> 
> 2. The link is not okay, but it automatically fixes itself soon after the
> at803x_aneg_done() finishes.
> 
>> Since aneg is mandatory for SGMII this can as well be seen as "link not up", not?
> 
> The problem is that even though the link is up, the driver has returned "0", so
> the kernel thinks that autonegotiation has not finished. at803x_aneg_done() is
> never called again, and so I think the kernel is disabling the interface is some
> secret way.
>