netdev - Re: [RFC net-next] net: track locally triggered link loss

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20220521112646.0d3c0a8a@kernel.org>
Date:   Sat, 21 May 2022 11:26:46 -0700
From:   Jakub Kicinski <kuba@...nel.org>
To:     Andrew Lunn <andrew@...n.ch>
Cc:     netdev@...r.kernel.org, linux@...linux.org.uk, olteanv@...il.com,
        hkallweit1@...il.com, f.fainelli@...il.com, saeedm@...dia.com,
        michael.chan@...adcom.com
Subject: Re: [RFC net-next] net: track locally triggered link loss

On Sat, 21 May 2022 16:23:16 +0200 Andrew Lunn wrote:
> > For a system which wants to monitor link quality on the local end =>
> > i.e. whether physical hardware has to be replaced - differentiating
> > between (1) and (2) doesn't really matter, they are both non-events.  
> 
> Maybe data centres should learn something from the automotive world.
> It seems like most T1 PHYs have a signal quality value, which is
> exposed via netlink in the link info message. And it is none invasive.

There were attempts at this (also on the PCIe side of the NIC)
but AFAIU there is no general standard of the measurement or the
quality metric so it's hard to generalize.

> Many PHYs also have counters of receive errors, framing errors
> etc. These can be reported via ethtool --phy-stats.

Ack, they are, I've added the APIs already and we use those.
Symbol errors during carrier and FEC corrected/uncorrected blocks.
Basic FCS errors, too.

IDK what the relative false-positive rate of different sources of
information are to be honest. The monitoring team asked me about
the link flaps and the situation in Linux is indeed less than ideal.

> SFPs expose SNR ratios in their module data, transmit and receive
> powers etc, via ethtool -m and hwmon.
> 
> There is also ethtool --cable-test. It is invasive, in that it
> requires the link to go down, but it should tell you about broken
> pairs. However, you probably know that already, a monitoring system
> which has not noticed the link dropping to 100Mbps so it only uses two
> pairs is not worth the money you paired for it.

Last hop in DC is all copper DACs. Not sure there's a standard
--cable-test for DACs :S

> Now, it seems like very few, if any, firmware driven Ethernet card
> actually make use of these features. You need cards which Linux is
> actually driving the hardware. But these APIs are available for
> anybody to use. Don't data centre users have enough purchasing power
> they can influence firmware/driver writers to actually use these APIs?
> And i think the results would be better than trying to count link
> up/down.

Let's separate new and old devices.

For new products customer can stipulate requirements and they usually
get implemented. I'd love to add more requirements for signal quality 
and error reporting. It'd need to be based on standards because each
vendor cooking their own units does not scale. Please send pointers 
my way!

Old products are a different ball game, and that's where we care about
basic info like link flaps. Vendors EOL a product and you're lucky to
get bug fixes. Servers live longer and longer and obviously age
correlates with failure rates so we need to monitor those devices.