netdev - Re: [PATCH net-next v2 2/3] net: dsa: add Arrow SpeedChips XRS700x driver


Message-ID: <20201127140340.3bad5985@kicinski-fedora-pc1c0hjn.DHCP.thefacebook.com>
Date:   Fri, 27 Nov 2020 14:03:40 -0800
From:   Jakub Kicinski <kuba@...nel.org>
To:     Vladimir Oltean <olteanv@...il.com>
Cc:     Andrew Lunn <andrew@...n.ch>,
        George McCollister <george.mccollister@...il.com>,
        Vivien Didelot <vivien.didelot@...il.com>,
        Florian Fainelli <f.fainelli@...il.com>,
        "David S . Miller" <davem@...emloft.net>, netdev@...r.kernel.org,
        "open list:OPEN FIRMWARE AND..." <devicetree@...r.kernel.org>
Subject: Re: [PATCH net-next v2 2/3] net: dsa: add Arrow SpeedChips XRS700x
 driver

On Fri, 27 Nov 2020 23:23:42 +0200 Vladimir Oltean wrote:
> On Fri, Nov 27, 2020 at 01:13:46PM -0800, Jakub Kicinski wrote:
> > On Fri, 27 Nov 2020 21:47:14 +0100 Andrew Lunn wrote:  
> > > > Is the periodic refresh really that awful? We're mostly talking error
> > > > counters here so every second or every few seconds should be perfectly
> > > > fine.  
> > >
> > > > Humm, i would prefer error counts to be more correct than anything
> > > > else. When debugging issues, you generally don't care how many packets
> > > > worked. It is how many failed that you are interested in, and how that
> > > > number of failures increases.
> >
> > Right, but not sure I'd use the word "correct". Perhaps "immediately up
> > to date"?
> >
> > High speed NICs usually go through a layer of firmware before they
> > access the stats, IOW even if we always synchronously ask for the stats
> > in the kernel - in practice a lot of NICs (most?) will return some form
> > of cached information.
> >  
> > > So long as these counters are still in ethtool -S, i guess it does not
> > > matter. That i do trust to be accurate, and probably consistent across
> > > the counters it returns.  
> >
> > Not in the NIC designs I'm familiar with.
> >
> > But anyway - this only matters in some strict testing harness, right?
> > Normal users will look at the stats after they noticed issues (so minutes
> > / hours later) or at the very best they'll look at a graph, which will
> > hardly require <1sec accuracy as to when an error occurred.
> 
> Either way, can we conclude that ndo_get_stats64 is not a replacement
> for ethtool -S, since the latter is blocking and, if implemented correctly,
> can return the counters at the time of the call (therefore making sure
> that anything that happened before the syscall has been accounted into
> the retrieved values), and the former isn't?

ethtool -S stats are not 100% up to date. Not on Netronome, Intel,
Broadcom or Mellanox NICs AFAIK.
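
To make the cached-versus-synchronous distinction concrete, here is a
minimal sketch in Python (not kernel code). It assumes a hypothetical
read_hw() callable standing in for a slow, blocking hardware or firmware
read. The class and method names are made up for illustration and do not
correspond to any driver's API.

```python
class CachedCounters:
    """Toy model of a device whose stats are served from a cache."""

    def __init__(self, read_hw):
        # read_hw is a hypothetical callable that performs the slow,
        # blocking read and returns a dict of counter name -> value.
        self._read_hw = read_hw
        self._snapshot = dict(read_hw())

    def refresh(self):
        # Would run from periodic work (say, once a second), in a
        # context where blocking on the hardware is acceptable.
        self._snapshot = dict(self._read_hw())

    def get(self):
        # Non-blocking read path: returns the last snapshot and never
        # touches the hardware, so it can lag by up to one period.
        return dict(self._snapshot)
```

With this shape, anything that happened after the last refresh() is not
visible to get() until the next refresh, which is the staleness being
debated here.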

> The whole discussion started because you said we shouldn't expose some
> statistics counters in ethtool as long as they have a standardized
> equivalent. Well, I think we still should.

Users must have access to stats via standard Linux interfaces with
well-defined semantics. We cannot continue to live in a world where the
user has to guess the driver-specific name in ethtool -S to find out the
number of CRC errors...

I know it may not matter to a driver developer, and it didn't matter
much to me when I was one, because in my drivers they always had the
same name. But trying to monitor a fleet of hardware from multiple
vendors is very painful with the status quo; we must do better.
We can't have users scrape through what is basically a debug interface
to get to vital information.
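
As a rough illustration of the difference for a monitoring tool, the
sketch below contrasts the two approaches in Python. The first function
reads the standard rx_crc_errors counter from sysfs
(/sys/class/net/<iface>/statistics/). The second has to pattern-match
names in "ethtool -S" style output. The function names, the sysfs_root
parameter and the "crc"/"fcs" substrings are assumptions made for this
example, not anything defined by ethtool or by a particular driver.

```python
from pathlib import Path


def read_rx_crc_errors(iface, sysfs_root="/sys/class/net"):
    """Read the standard CRC error counter for one interface."""
    path = Path(sysfs_root) / iface / "statistics" / "rx_crc_errors"
    return int(path.read_text().strip())


def guess_crc_counters(ethtool_output, hints=("crc", "fcs")):
    """Heuristically pick CRC-looking counters out of `ethtool -S` text.

    The hints are guesses: each driver is free to name its counters
    differently, which is exactly the problem.
    """
    found = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.rpartition(":")
        name, value = name.strip(), value.strip()
        if not sep or not value.isdigit():
            continue  # skips the "NIC statistics:" header and odd lines
        if any(hint in name.lower() for hint in hints):
            found[name] = int(value)
    return found
```

The first function works the same way on any interface whose driver
fills in the standard counter. The second can return zero, one or
several matches depending on the vendor, and the caller still has to
decide which of them means "CRC errors".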

I'd really love to find a way out of the procfs issue, but I'm not sure
if there is one.