[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <3d5e9b4d-d87c-4376-b698-4cccdbfb81ff@savoirfairelinux.com>
Date: Fri, 14 Jun 2024 17:14:48 +0200
From: Enguerrand de Ribaucourt <enguerrand.de-ribaucourt@...oirfairelinux.com>
To: Andrew Lunn <andrew@...n.ch>
Cc: netdev@...r.kernel.org, hkallweit1@...il.com, linux@...linux.org.uk,
woojung.huh@...rochip.com, UNGLinuxDriver@...rochip.com, horms@...nel.org,
Tristram.Ha@...rochip.com, Arun.Ramadoss@...rochip.com
Subject: Re: [PATCH net v6 3/3] net: dsa: microchip: monitor potential faults
in half-duplex mode
On 14/06/2024 15:37, Andrew Lunn wrote:
> On Fri, Jun 14, 2024 at 09:46:42AM +0000, Enguerrand de Ribaucourt wrote:
>> The errata DS80000754 recommends monitoring potential faults in
>> half-duplex mode for the KSZ9477 family.
>>
>> half-duplex is not very common so I just added a critical message
>> when the fault conditions are detected. The switch can be expected
>> to be unable to communicate anymore in these states and a software
>> reset of the switch would be required which I did not implement.
>
> If i'm reading this code correctly, every 30 seconds it will test to
> see if the link is half duplex, and is so, log onetime this could lead
> to problems. Also, every 30 seconds, if the statistics counts indicate
> there has been a late collision, it will log a rate limited
> message. Given the 30 second poll interval, rate limiting is probably
> pointless, and every one will get logged. The last print, i have no
> idea what resource you are talking about. Will it also likely print
> once every 30 seconds?
The MIB statistics are read every 5 seconds. It is defined in:
dev->mib_read_interval = msecs_to_jiffies(5000);
So indeed, _ratelimit having a 5 second back-off, it is redundant.
The second print is about two resources:
- Packet Memory Available Block Count (PMAVBC)
- TX Queue Blocks Used Count (TXQBU/QM_TX_CNT)
According to the errata, they are leaky in half duplex mode. Once
depleted, packets can't be exchanged.
Like the late collision counter, these are not going to go back to
normal values without a reset. So yes, the critical message will keeping
being displayed. Do you think a dev_crit_once would be more appropriate?
>
> Is the idea here, we want to notice when this is happening, and get an
> idea if it is worth implementing the software reset? Do we want to add
> a "Please report if you see this" to the commit message or the log
> messages themselves?
I'm fine with just monitoring and printing in my use case. My devices
are unlikely to encounter half-duplex peers and the app can recover. I
simply need to guarantee I can detect this condition. For my use case,
implementing the software reset wasn't worth the effort. One of the main
obstacles is that the VLAN table would need to be reconfigured, so it
would require remembering it from the driver. Also transmission would
inevitably be stopped for a few seconds.
For generic users, I think the warning is necessary since it indicates
the switch will stop functioning after prolonged use in half-duplex. We
could elaborate the warning message to point to the datasheet. That way
users could understand why their configuration is affected, what are the
risks, and possible avoidance or recovery mechanisms. If this errata is
critical to someone who can't avoid it, then we could implement the
software reset.
Proposal for a v7:
- dev_crit_ratelimit -> dev_crit_once
- dev_warn_once -> point to datasheet + add "report if needed"
What is your opinion on this?
Thanks,
>
> Andrew
--
Savoir-faire Linux
Enguerrand de Ribaucourt
Powered by blists - more mailing lists