netdev - Re: [RFC PATCH net-next 2/7] net: ethtool: add support for Frame Preemption and MAC Merge layer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220910163619.fchn6kwgtvaszgcb@skbuf>
Date:   Sat, 10 Sep 2022 16:36:20 +0000
From:   Vladimir Oltean <vladimir.oltean@....com>
To:     Vinicius Costa Gomes <vinicius.gomes@...el.com>
CC:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        Michal Kubecek <mkubecek@...e.cz>,
        Claudiu Manoil <claudiu.manoil@....com>,
        Xiaoliang Yang <xiaoliang.yang_1@....com>,
        Kurt Kanzenbach <kurt@...utronix.de>,
        Rui Sousa <rui.sousa@....com>,
        Ferenc Fejes <ferenc.fejes@...csson.com>,
        Richie Pearn <richard.pearn@....com>
Subject: Re: [RFC PATCH net-next 2/7] net: ethtool: add support for Frame
 Preemption and MAC Merge layer

On Fri, Sep 09, 2022 at 05:19:35PM -0700, Vinicius Costa Gomes wrote:
> Vladimir Oltean <vladimir.oltean@....com> writes:
> 
> > On Tue, Aug 23, 2022 at 05:35:11PM -0700, Vinicius Costa Gomes wrote:
> >> Yes, as the limits are not in the UAPI this is a minor point, agreed.
> >> 
> >> What I am concerned is about use cases not handled by IEEE 802.1Q, for
> >> example, thinking about that other thread about PCP/DSCP: how about a
> >> network that wants to have as many priorities as the DSCP field (6 bits)
> >> allows, together with frame preemption (in the future we may have
> >> hardware that supports that). I am only thinking that we should not
> >> close that door.
> >> 
> >> Again, minor point.
> >
> > Since Linux seems to have settled on TC_PRIO_MAX, TC_QOPT_MAX_QUEUE
> > constants of 16 rather than the 8 you'd expect, I guess I don't have
> > that big of a problem to also expose 16 admin-status values for FP.
> > I can limit the drivers I have to only look at the first 8 entries.
> > But what do 16 priorities mean to ethtool? 16 priorities are a tc thing.
> > Does ethtool even have a history of messing with packet priorities?
> > What am I even doing? lol.
> 
> That's a good point. ethtool doesn't has a history of knowing about
> priorities (and what they even mean). But I like the flexibility that
> this approach provides.

Fair. On the other hand, we need to weigh the pros and cons of moving
the adminStatus to other places.

dcbnl

- pro: is a hardware-only interface, just like ethtool, which is
  suitable for a hardware-only feature like FP

- pro: knows about priorities (app table, PFC). I keep coming back to
  PFC as an example because it has the absolute exact same problems as
  FP, which is that it is defined per priority, but there is a "NOTE 2 -
  Mixing PFC and non-PFC priorities in the same queue results in non-PFC
  traffic being paused causing congestion spreading, and therefore is
  not recommended."

- con: I think the dcbnl app table has the same limitation that
  priorities go only up to 8 (the comment above struct dcb_app says
  "@priority: 3-bit unsigned integer indicating priority for IEEE")

- con: where do dcbnl's responsibilities begin, and where do they end,
  exactly? My understanding is that DCB and TSN are exactly the same
  kind of thing, i.e. extensions to 802.1Q (and 802.3, in case of TSN)
  which were made in order to align Ethernet (and bridges) with a set of
  vertical use cases which weren't quite possible before. All is well,
  except TSN was integrated quite differently into the Linux kernel,
  i.o.w. we don't have a "tsnnl" like there is a "dcbnl", but things are
  much more fine-grained, and integrated with the subsystem that they
  extend, rather than creating a subsystem aligned to a use case. So the
  question becomes, is anything we're doing here, for TSN, extending
  DCB? With Microchip's desire to add non-standard APP table selectors
  for VLAN PCP/DEI prioritization, I think the general movement is
  towards more reuse of dcbnl constructs, and towards forgetting that
  dcbnl is for DCB. But there are many things I don't understand about
  dcbnl, for example why is it possible to set an interface's prio-tc
  map using dcb, and what does it even mean in relationship to
  tc-mqprio, tc-taprio etc (or even with "priomap" from tc-ets).

  $ dcb ets set dev eth0 prio-tc 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7

  Jakub had an answer which explained this in terms of Conway's law:
  https://patchwork.kernel.org/project/netdevbpf/cover/20220520011538.1098888-1-vinicius.gomes@intel.com/#24870325
  but as beautiful that answer is, as many questions it still leaves for
  me. Like if we were to accept dcbnl as part of the UAPI for TSN, how
  does this fact fit within our story that the already use the tc-taprio
  map to map priorities to traffic classes?

ethtool:

- pro: is in the same subsystem as MAC merge layer configuration (for
  which I have no doubts that ethtool is the right place)

- pro: is not aligned to any specific vertical use case, like dcbnl is

- con: knows nothing about priorities

- con: for hardware where FP adminStatus is per traffic class rather
  than per priority, we'll have to change the adminStatus not only from
  changes made via ethtool to the FP adminStatus, but also from changes
  made to the prio_tc_map (tc I guess?). But this may also be a subtle
  pro for dcbnl, since that has its own prio_tc_map, so we'd remain
  within the same subsystem?

tc (part of tc-mqprio and tc-taprio):

- open question: does the FP adminStatus influence scheduling or not?
  Normally I'd say no, adminStatus only decides whether the eMAC or pMAC
  will be used to send a frame, and this is orthogonal to scheduling.
  But 802.1Q says that "It should be noted that, other things being
  equal, designating a priority as “express” effectively increases its
  priority above that of any priority designated as “preemptible”."
  Is this phrase enough to justify a modification of the qdisc dequeue
  method in, say, taprio, which would prioritize a lower priority qdisc
  queue A over a higher one B, if A is express but B is preemptable?

- pro: assuming that in lack of tc-mqprio or tc-taprio, an interface's
  number of traffic classes will collapse to 1, then putting FP
  adminStatus here makes it trivial to support hardware with FP per
  traffic class

- con: more difficult for user space to change FP adminStatus
  independently of the current qdisc?

> >> > No; as mentioned, going through mqprio's "map" and "queues" to resolve
> >> > the priority to a queue is something that I intended to be possible.
> >> > Sure, it's not handy either. It would have been handier if the
> >> > "admin-status" array was part of the tc-mqprio config, like you did in
> >> > your RFC.
> >> 
> >> Just to be sure, say that my plan is that I document that for the igc
> >> driver, the mapping of priorities for the "FP" command is prio (0) ->
> >> queue (0), prio (1) -> queue (1), and so on. Note that the mqprio
> >> mapping could be that "skb->priority" 10 goes to TC 1 which is mapped to
> >> queue 0. Would this be ok?
> >
> > If skb->priority 10 goes to TC 1 via the prio_tc_map, and finally lands
> > in queue 0 where your actual FP knob is, then I'd expect the driver to
> > follow the reverse path between queue -> tc -> prios mapped to that tc.
> > And I would not state that prio 0 is queue 0 and skb->priority values
> > from the whole spectrum may land there, kthxbye. Also, didn't you say
> > earlier that "lowest queue index (0) has highest priority"? In Linux,
> > skb->priority 0 has lower priority that skb->priority 1. How does that
> > work out?
> >
> 
> Ah, sorry for the confusion, when I said "lowest queue index (0) has
> highest priority" it was more in the sense that queue 0 has higher
> precedence, packets from it are fetched first, that kind of thing.

Fetched first by whom?
In ENETC, the transmission selection chooses between TX BD rings whose
configured priority maps to the same traffic class based on a weighted
round robin dequeuing policy. Groups of TX BD rings of different
priorities are in strict priority dequeuing relative to each other.
And the number of the TX BD ring has nothing to do with its priority.

> > In fact, for both enetc and felix, I have to do this exact thing as you,
> > except that my hardware knobs are halfway in the path (per tc, not per
> > queue). The reason why I'm not doing it is because we only consider the
> > 1:1 prio_tc_map, so we use "prio" and "tc" interchangeably.
> >
> 
> Hm, that's interesting. I think that's a difference on how we are
> modeling our traffic, here's one very common mqprio config for AVB use
> cases:
> 
> $ tc qdisc replace dev $IFACE parent root handle 100 mqprio \
>       num_tc 3 \
>       map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
>       queues 1@0 1@1 2@2 \
>       hw 0
> 
> One priority for Class A traffic, one priority for Class B traffic and
> multiple priorities mapped to a single TC, the rest is Best Effort and
> it's mapped to multiple queues. The way we model traffic, TCs and
> priorities are different concepts.

So here's what I don't understand. Why do you need 16 priority values,
when in the end you'll map them to only 3 egress traffic classes anyway?

> One important detail is that in TSN mode, we use the "priority" (meaning
> precedence in this context) fetch order for the queues: queue 0 first,
> then queue 1, and so on.

So the number of the queue is what actually contributes to the egress
scheduling algorithm of your NIC? Strange. Tell me something so I'm sure
I'm understanding you properly. netdev_pick_tx() will pick a TX queue
using skb_tx_hash(), which hashes between the netdev queues mapped to
the same traffic class as the traffic class that skb->priority is mapped to.
In your case, you have 2@2 (2 TX queues for traffic class 2), call these
queues 2 and 3. You're saying that hardware prioritizes queue 2 over 3.
But the network stack doesn't know that. It hashes equally packets that
go to tc 2 between queue 2 and queue 3. In our case, this misconfiguration
would be immediately obvious, because we could have an offloaded tc-taprio
schedule in ENETC, and Linux could be enqueuing into a ring that ultimately
has a different schedule than software knows about. How does this work
out for you?

> > * The enetc verification state machine is off by default. The felix
> >   verification state machine is on by default. I guess we should figure
> >   out a reasonable default for all drivers out there?
> >
> 
> From the language of IEEE 802.3 ("disableVerify" and friends), it seems
> that the expectation is that verification is enabled by default. *But*
> from a interoperability point of view, I would only enable verification
> by default after some testing in the wild. Or only send the verify
> packet if frame preemption is also enabled. 

I wonder whether the answer to this one is actually "Additional Ethernet
Capabilities TLV" (802.3 clause 79.3.7). It would be good if we could
all start off with FP/MM disabled, verification state machine included,
until the LLDP daemon receives a TLV that says PreemptSupported and
PreemptEnabled are true. If both local and remote preemption are
supported and enabled, we tell the kernel to enable the MAC merge layer
and kick off verification. In turn this will alter PreemptActive that
gets set in future replies. I don't know, the flow may be different.

> > * For some reason, I notice that the verification state machine remains
> >   FAILED on felix when it links to enetc and that has verification
> >   enabled. I need to disable it on enetc, enable it again, and only then
> >   will the MAC merge interrupt on felix say that verification completed
> >   successfully. I think I'll need to put a quirk in enetc_pl_mac_link_up()
> >   to toggle verification off and on, if it was requested, in order for
> >   this to be seamless to the user.
> >
> > If it's only these, nothing seems exactly major. However, I assume this
> > is only the beginning. I fully expect there to be one more round of
> > "why not dcbnl? why not tc?" nitpicks, for which I'm not exactly
> > mentally ready (if credible arguments are to be put forward, I'd rather
> > have them now while I haven't yet put in the extra work in the direction
> > I'm about to).
> 
> I really know the feeling "not mentally ready".
> 
> I am still in the "tc" camp. But I cannot deny that this interface seems
> to work for the uses that we have in mind. That's (at least) good enough
> for me ;-)

To me right now, FP adminStatus is not exactly something that the
scheduler is concerned with, and this is why tc is kind of my last
choice. But I agree that it makes certain things simpler. Plus, it seems
to be the only subsystem aware of 16 priorities and 16 traffic classes,
which you seem to care about.