Date:   Sat, 28 Nov 2020 00:42:42 +0200
From:   Vladimir Oltean <olteanv@...il.com>
To:     Jakub Kicinski <kuba@...nel.org>
Cc:     George McCollister <george.mccollister@...il.com>,
        Andrew Lunn <andrew@...n.ch>,
        Vivien Didelot <vivien.didelot@...il.com>,
        Florian Fainelli <f.fainelli@...il.com>,
        "David S . Miller" <davem@...emloft.net>, netdev@...r.kernel.org,
        "open list:OPEN FIRMWARE AND..." <devicetree@...r.kernel.org>
Subject: Re: [PATCH net-next v2 2/3] net: dsa: add Arrow SpeedChips XRS700x
 driver

On Fri, Nov 27, 2020 at 01:37:53PM -0800, Jakub Kicinski wrote:
> Replying to George's email 'cause I didn't get Vladimir's email from
> the ML.
>
> On Fri, 27 Nov 2020 14:58:29 -0600 George McCollister wrote:
> > > 100 Kbps = 12.5 KB/s.
> > > sja1105 has 93 64-bit counters, and during every counter refresh cycle I
>
> Are these 93 for one port? That sounds like a lot... There are usually
> ~10 stats (per port) that are relevant to the standard netdev stats.

I would like to report rx_errors as an aggregate of a few other
discrete counters that are spread far apart in the register map. Hence
the idea of reading the entire region delegated to port counters using
a single SPI transaction.
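To illustrate what I mean, here is a minimal sketch, not the actual
sja1105 code: the region base, length and counter offsets below are
hypothetical, and it assumes a regmap configured for 64-bit values.

#include <linux/kernel.h>
#include <linux/regmap.h>
#include <linux/types.h>

#define PORT_CNT_BASE	0x400	/* hypothetical start of the counter region */
#define PORT_CNT_WORDS	93	/* 93 x 64-bit counters per port */

/* Hypothetical offsets of the discrete error counters in that region */
enum { CNT_CRC_ERR = 5, CNT_SIZE_ERR = 40, CNT_POLICE_DROP = 88 };

static int port_read_rx_errors(struct regmap *map, u64 *rx_errors)
{
	u64 buf[PORT_CNT_WORDS];
	int err;

	/* One burst transaction instead of three scattered reads */
	err = regmap_bulk_read(map, PORT_CNT_BASE, buf, ARRAY_SIZE(buf));
	if (err)
		return err;

	*rx_errors = buf[CNT_CRC_ERR] + buf[CNT_SIZE_ERR] +
		     buf[CNT_POLICE_DROP];

	return 0;
}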

> > Yeah, that's quite big. The xrs700x counters are only 16 bit. They
> > need to be polled on an interval anyway or they will roll over.
>
> Yup! That's pretty common.
>
> > > would need to get some counters from the beginning of that range, some
> > > from the middle and some from the end. With all the back-and-forth
> > > between the sja1105 driver and the SPI controller driver, and the
> > > protocol overhead associated with creating a "SPI read" message, it is
> > > all in all more efficient to just issue a burst read operation for all
> > > the counters, even ones that I'm not going to use. So let's go with
> > > that, 93x8 bytes (and ignore protocol overhead) = 744 bytes of SPI I/O
> > > per second. At a throughput of 12.5 KB/s, that takes 59 ms to complete,
> > > and that's just for the raw I/O, that thing which keeps the SPI mutex
> > > locked. You know what else I could do during that time? Anything else!
> > > Like for example perform PTP timestamp reconstruction, which has a hard
> > > deadline at 135 ms after the packet was received, and would appreciate
> > > if the SPI mutex was not locked for 59 ms every second.
> >
> > Indeed, however if you need to acquire this data at all it's going to
> > burden the system at that time so unless you're able to stretch out
> > the reads over a length of time whether or not you're polling every
> > second or once a day may not matter if you're never able to miss a
> > deadline.
>
> Exactly, either way you gotta prepare for users polling those stats.
> A design where stats are read synchronously and user (an unprivileged
> user, BTW) has the ability to disturb the operation of the system
> sounds really flaky.

So I was probably overreacting with the previous estimate, since
- nobody in their right mind is going to run sja1105 at a SPI clock of
  100 KHz, but at something closer to 100 times that value.
- other buses, like I2C/MDIO, don't really have the notion of variable
  buffer sizes and burst transactions. So the wait time has a smaller
  upper bound there, because of the implicit "cond_resched" given by the
  need to read word by word (see the sketch after this list). In
  retrospect it would be wiser not to use burst reads for getting stats
  over SPI either.
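For contrast, this is roughly what the MDIO case looks like (register
addresses hypothetical): every 16-bit word is its own bus transaction,
and mdiobus_read() takes and releases the bus mutex internally, so
other waiters get a chance between words, whereas a SPI burst holds
the controller for the entire transfer.

#include <linux/phy.h>
#include <linux/types.h>

#define CNT_REG_BASE	0x100	/* hypothetical counter register base */

static int read_counter_words(struct mii_bus *bus, int addr,
			      u16 *words, int num)
{
	int i, val;

	for (i = 0; i < num; i++) {
		/* The bus mutex is held only for this single word */
		val = mdiobus_read(bus, addr, CNT_REG_BASE + i);
		if (val < 0)
			return val;
		words[i] = val;
	}

	return 0;
}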

But to your point: the worst case is worse than periodic readouts, and
the system needs to be resilient against it. Some things are finicky no
matter what you do and how well you protect them, like phc2sys. For
those, you need to be more radical and do the cross timestamping as
close to the actual I/O as possible:
https://lwn.net/Articles/798467/
But nonetheless, point taken. As long as the system withstands the
worst case, the impact on latency is not catastrophic, although it is
still there. But who wouldn't like a system that achieves the same
thing by doing less? I feel that not sorting this out now will deter
many DSA driver writers from using ndo_get_stats64 if they don't
otherwise have a reason to schedule a periodic stats workqueue anyway.
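For reference, the periodic workqueue approach being discussed would
look roughly like this (the bus accessor is hypothetical, this is not
a real driver): poll the 16-bit hardware counters often enough that
they cannot wrap twice between readouts, accumulate into 64-bit shadow
counters, and let .ndo_get_stats64 copy the cached values without
touching the bus at all.

#include <linux/jiffies.h>
#include <linux/kernel.h>
#include <linux/spinlock.h>
#include <linux/types.h>
#include <linux/workqueue.h>

struct port_stats {
	spinlock_t lock;	/* protects the fields below */
	u64 rx_packets;		/* 64-bit shadow counter */
	u16 last_rx_packets;	/* last raw 16-bit reading */
	struct delayed_work work;
};

u16 hw_read_rx_packets(void);	/* hypothetical bus accessor */

static void port_stats_poll(struct work_struct *work)
{
	struct port_stats *ps = container_of(work, struct port_stats,
					     work.work);
	u16 raw = hw_read_rx_packets();	/* slow bus I/O, outside the lock */

	spin_lock(&ps->lock);
	/* u16 arithmetic makes the wraparound delta come out right */
	ps->rx_packets += (u16)(raw - ps->last_rx_packets);
	ps->last_rx_packets = raw;
	spin_unlock(&ps->lock);

	/* Must run more than once per counter wrap period */
	schedule_delayed_work(&ps->work, HZ);
}

With that in place, .ndo_get_stats64 only takes ps->lock and copies
the shadow counters out, so an unprivileged reader can never
monopolize the bus.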

> > > And all of that, for what benefit? Honestly any periodic I/O over the
> > > management interface is too much I/O, unless there is any strong reason
> > > to have it.
> >
> > True enough.
> >
> > > Also, even the simple idea of providing out-of-date counters to user
> > > space running in syscall context has me scratching my head. I can only
> > > think of all the driver selftests that check statistics counters, then
> > > send a packet, then check the counters again afterwards. What would
> > > those need to do, put in a sleep to make sure the counters were
> > > updated?
>
> Frankly life sounds simpler on the embedded networking planet than it is
> on the one I'm living on ;) High speed systems are often eventually
> consistent. Either because stats are gathered from HW periodically by
> the FW, or RCU grace period has to expire, or workqueue has to run,
> etc. etc. I know it's annoying for writing tests but it's manageable.

Yes, but all of these are problems which I do not have, so I don't see
why I would want to suffer from others' challenges.

> If there is a better alternative I'm all ears but having /proc and
> ifconfig return zeros for error counts while ip link doesn't will lead
> to too much confusion IMO. While delayed update of stats is a fact of
> life for _years_ now (hence it was baked into the ethtool -C API).

I don't know the net/core folder well enough to say anything definitive
about it; nothing looked obviously possible to me. Although, speaking
as somebody who doesn't understand how other pieces of code get away
with exposing the same object through two different types of locking
(like br_vlan_get_pvid and br_vlan_get_pvid_rcu), maybe one approach to
try would be to take the RTNL in net-procfs.c instead of RCU. But this
is just me speculating about things I don't understand.
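To make the pattern I'm referring to concrete, here is a reduced sketch
(br_vlan_get_pvid and br_vlan_get_pvid_rcu are real; the struct and
accessors below are made up): the same RCU-protected pointer gets one
accessor for callers holding the RTNL and another for callers inside an
RCU read-side critical section.

#include <linux/rcupdate.h>
#include <linux/rtnetlink.h>

struct cfg;			/* some RCU-managed configuration object */

struct obj {
	struct cfg __rcu *cfg;
};

/* For control-path callers that hold the RTNL */
static struct cfg *obj_get_cfg(struct obj *o)
{
	return rtnl_dereference(o->cfg);
}

/* For lockless readers inside rcu_read_lock(), as net-procfs.c is */
static struct cfg *obj_get_cfg_rcu(struct obj *o)
{
	return rcu_dereference(o->cfg);
}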
