[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <717229a4-f7f6-837d-3d58-756b516a8605@gmx.net>
Date: Fri, 10 Jan 2020 16:53:06 +0000
From: ѽ҉ᶬḳ℠ <vtol@....net>
To: Russell King - ARM Linux admin <linux@...linux.org.uk>
Cc: Andrew Lunn <andrew@...n.ch>, netdev@...r.kernel.org
Subject: Re: [drivers/net/phy/sfp] intermittent failure in state machine
checks
On 10/01/2020 16:32, Russell King - ARM Linux admin wrote:
> On Fri, Jan 10, 2020 at 03:45:15PM +0000, ѽ҉ᶬḳ℠ wrote:
>> On 10/01/2020 15:09, Russell King - ARM Linux admin wrote:
>>> On Fri, Jan 10, 2020 at 03:02:51PM +0000, ѽ҉ᶬḳ℠ wrote:
>>>> On 10/01/2020 12:53, Russell King - ARM Linux admin wrote:
>>>>>>> Which is also indicating everything is correct. When the problem
>>>>>>> occurs, check the state of the signals again as close as possible
>>>>>>> to the event - it depends how long the transceiver keeps it
>>>>>>> asserted. You will probably find tx-fault is indicating
>>>>>>> "in hi IRQ".
>>>>>> just discovered userland - gpioinfo pca9538 - which seems more verbose
>>>>>>
>>>>>> gpiochip2 - 8 lines:
>>>>>> line 0: unnamed "tx-fault" input active-high [used]
>>>>>> line 1: unnamed "tx-disable" output active-high [used]
>>>>>> line 2: unnamed "rate-select0" input active-high [used]
>>>>>> line 3: unnamed "los" input active-high [used]
>>>>>> line 4: unnamed "mod-def0" input active-low [used]
>>>>>> line 5: unnamed unused input active-high
>>>>>> line 6: unnamed unused input active-high
>>>>>> line 7: unnamed unused input active-high
>>>>>>
>>>>>> The above is depicting the current state with the module
>>>>>> working, i.e. being
>>>>>> online. Will do some testing and report back, not sure yet
>>>>>> how to keep a
>>>>>> close watch relating to the failure events.
>>>>> However, that doesn't give the current levels of the inputs, so it's
>>>>> useless for the purpose I've asked for.
>>>> Fair enough. Operational (online) state
>>>>
>>>> gpiochip2: GPIOs 504-511, parent: i2c/8-0071, pca9538, can sleep:
>>>> gpio-504 ( |tx-fault ) in lo IRQ
>>>> gpio-505 ( |tx-disable ) out lo
>>>> gpio-506 ( |rate-select0 ) in lo
>>>> gpio-507 ( |los ) in lo IRQ
>>>> gpio-508 ( |mod-def0 ) in lo IRQ
>>>>
>>>> And the same remained (unchanged) during/after the events (as
>>>> closely I was
>>>> able to monitor) -> module transmit fault indicated
>>> Try:
>>>
>>> while ! grep -A4 'tx-fault.*in hi' /sys/kernel/debug/gpio; do :; done
>>>
>>> which may have a better chance of catching it.
>>>
>> Suppose you are not interested in what happens with ifdown wan, so just for
>> posterity
>>
>> gpio-504 ( |tx-fault ) in hi IRQ
>> gpio-505 ( |tx-disable ) out hi
>> gpio-506 ( |rate-select0 ) in lo
>> gpio-507 ( |los ) in lo IRQ
>> gpio-508 ( |mod-def0 ) in lo IRQ
> Right, because the state of TX_FAULT is not defined while TX_DISABLE
> is high.
>
>> When the iif is brought up again and happens to trigger a transmit fault the
>> hi is not being triggered however. And it did not try 5 times to recover
>> from the fault, unless dmesg missed some
>>
>> [Fri Jan 10 15:30:57 2020] mvneta f1034000.ethernet eth2: Link is Down
>> [Fri Jan 10 15:30:57 2020] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not
>> ready
> Here is where you brought the interface down.
>
>> [Fri Jan 10 15:31:13 2020] mvneta f1034000.ethernet eth2: configuring for
>> inband/1000base-x link mode
> Here is where you brought the interface back up.
>
>> [Fri Jan 10 15:31:13 2020] sfp sfp: module transmit fault indicated
> The module failed to deassert TX_FAULT within the 300ms window.
>
>> [Fri Jan 10 15:31:15 2020] mvneta f1034000.ethernet eth2: Link is Up -
>> 1Gbps/Full - flow control off
>> [Fri Jan 10 15:31:16 2020] sfp sfp: module transmit fault recovered
> I'm not sure why these two messages are this way round; they should
> be the other way (I guess that's something to do with printk() no
> longer being synchronous.) Given that it seems to have taken two
> seconds to recover, that will be two reset attempts.
>
>> [Fri Jan 10 15:31:16 2020] mvneta f1034000.ethernet eth2: Link is Down
>> [Fri Jan 10 15:31:16 2020] sfp sfp: module transmit fault indicated
> The module re-asserts TX_FAULT...
>
>> [Fri Jan 10 15:31:19 2020] sfp sfp: module persistently indicates fault,
>> disabling
> After another three attempts, the module is still asserting TX_FAULT.
>
> We don't report every attempt made to recover the fault; that would
> flood the log. Instead, we report when the fault occurred, then try
> to reset the fault every second for up to five attempts _total_. If
> we exhaust the attempts, you get "module persistently indicates fault,
> disabling". If successfully recovered, you'll get "module transmit
> fault recovered".
>
> We don't reset the retries after a successful recovery as that would
> mean a transitory safety fault occurring once every few seconds would
> endlessly fill the kernel log, and also may go unnoticed. If it's
> spitting out safety faults like that, the module would be faulty.
>
Seems that the debug avenue has been exhausted, short of running SFP.C
in debug mode.
There is still no explanation why the module passes the 300 ms deassert
TX_FAULT test most of the time but fails intermittently at other times,
being kind of incoherent. Maybe it is just wishful thinking but it seems
a bit far-fetched that the module is really causing this, least the
readings from GPIO do not provide any such indicator.
Could there be something choking / blocking the communication channel
between the module and the kernel, some kernel code getting stuck /
leaked in memory?
Could the ipupdown routine, which has its own implementation in OpenWrt,
be an interfering agent, e.g. the way it constructs or tears down the
iif, though I do not see how?
Powered by blists - more mailing lists