netdev - Re: [RFC PATCH net-next 2/7] net: ethtool: add support for Frame Preemption and MAC Merge layer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220907205711.hmvp7nbyyp7c73u5@skbuf>
Date:   Wed, 7 Sep 2022 20:57:11 +0000
From:   Vladimir Oltean <vladimir.oltean@....com>
To:     Vinicius Costa Gomes <vinicius.gomes@...el.com>
CC:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        Michal Kubecek <mkubecek@...e.cz>,
        Claudiu Manoil <claudiu.manoil@....com>,
        Xiaoliang Yang <xiaoliang.yang_1@....com>,
        Kurt Kanzenbach <kurt@...utronix.de>,
        Rui Sousa <rui.sousa@....com>,
        Ferenc Fejes <ferenc.fejes@...csson.com>,
        Richie Pearn <richard.pearn@....com>
Subject: Re: [RFC PATCH net-next 2/7] net: ethtool: add support for Frame
 Preemption and MAC Merge layer

On Tue, Aug 23, 2022 at 05:35:11PM -0700, Vinicius Costa Gomes wrote:
> Yes, as the limits are not in the UAPI this is a minor point, agreed.
> 
> What I am concerned is about use cases not handled by IEEE 802.1Q, for
> example, thinking about that other thread about PCP/DSCP: how about a
> network that wants to have as many priorities as the DSCP field (6 bits)
> allows, together with frame preemption (in the future we may have
> hardware that supports that). I am only thinking that we should not
> close that door.
> 
> Again, minor point.

Since Linux seems to have settled on TC_PRIO_MAX, TC_QOPT_MAX_QUEUE
constants of 16 rather than the 8 you'd expect, I guess I don't have
that big of a problem to also expose 16 admin-status values for FP.
I can limit the drivers I have to only look at the first 8 entries.
But what do 16 priorities mean to ethtool? 16 priorities are a tc thing.
Does ethtool even have a history of messing with packet priorities?
What am I even doing? lol.

> Sorry if I made you go that deep in the spec.
> 
> I was only trying to say that for some implementations it's
> difficult/impossible to have totally orthogonal priorities vs. queue
> assignments, and that the commit message needs to take that into account
> (because the API introduced in this commit is implementation
> independent).

Yes, the comment is totally valid. However, between "difficult" and
"impossible" there is a big leap, and if it is really impossible, I'd
like to know precisely why.

> I, truly, admire your effort (and the time you took) and your
> explanation. I think I understand where you come from better. Thank you.
> 
> Taking a few steps back (I don't like this expression, but anyway), I
> think that strict conformance is not an objective of the Linux network
> stack as a whole. What I mean by "not strict": that it doesn't try to
> follow the letter of the "baggy pants" model that IEEE 802.* and friends
> define. Being efficient/useful has a higher importance.
> 
> But being certifiable is a target/ideal, i.e. the "on the wire" and user
> visible (statistics, etc) behavior can/"should be able" to be configured
> to pass conformance tests or even better, interoperability tests.
> 
> Now back on track, in my mental model, this simplifies things to a
> couple of questions that I ask myself about this RFC, in particular
> about this priority idea:
>  - Is this idea useful? My answer is yes.
>  - Can this idea be used to configure a certifiable system? Also, yes.
> 
> So, I have no problems with it.
> 

I've taken quite a few steps back now, unfortunately I'm still back to
where I was :)

I've talked to more people within NXP, explained to them that the
standard technically does not disallow the existence of FP for single
queue devices, for this and that reason. The only responses I got varied
from the equivalent of a frightened "brrr", to a resounding "no, that
isn't what was intended". Of course I tried to dig deeper, and I was
told that the configuration is per (PCP) priority because this is the
externally visible label carried by packets, so an external management
system that has to coordinate FP across a LAN does not need to know how
many traffic classes each node has, or how the priorities map to the
traffic classes. Only the externally visible behavior is important,
which is that packets with priority (PCP) X are sent on the wire using a
preemptable SFD or a normal SFD, because the configuration also affects
what the other nodes in the network expect. You might contrast this with
tc-taprio, where the open/closed gates really are defined per traffic
class, and you might wonder why it wasn't important there for the
configuration to also be handled using the externally-visible priority
(PCP) as input. Yes, but with tc-taprio, you can schedule within your
traffic classes all you want, but this does not change the externally
visible appearance of a frame, so it doesn't matter. The scheduling is
orthogonal to whether a packet will be sent as preemptable or not.

Is this explanation satisfactory, and where does it leave us? For me it
isn't, and it leaves me nowhere new, but it's the best I got.

So ok, single TX queue devices with FP probably are possibly a fun
physics class experiment, but practically minded vendors won't implement
them. But does the alternate justification given change my design decision
in any way (i.e. expose "preemptable priorities" in ethtool as opposed
to "preemptable queues" in tc-mqprio and tc-taprio as you did)? No.
It just becomes a matter of enforcing the recommended restriction that
preemptable and express priorities shouldn't be mixed within the same
traffic class, something which my current patch set does not do.

[ yes, "should" is a technical term in IEEE standards which means "not
  mandatory", and that's precisely how the constraint from 12.30.1.1.1
  was formulated ]

I guess when push comes to shove, somebody will have to answer the
question of why was FP exposed in Linux by something other than what the
standard defined it for (prio), and I wouldn't know how to answer that.

> >>  - Question: would it be against the intention of the API to have a 1:1
> >>  map of priorities to queues?
> >
> > No; as mentioned, going through mqprio's "map" and "queues" to resolve
> > the priority to a queue is something that I intended to be possible.
> > Sure, it's not handy either. It would have been handier if the
> > "admin-status" array was part of the tc-mqprio config, like you did in
> > your RFC.
> 
> Just to be sure, say that my plan is that I document that for the igc
> driver, the mapping of priorities for the "FP" command is prio (0) ->
> queue (0), prio (1) -> queue (1), and so on. Note that the mqprio
> mapping could be that "skb->priority" 10 goes to TC 1 which is mapped to
> queue 0. Would this be ok?

If skb->priority 10 goes to TC 1 via the prio_tc_map, and finally lands
in queue 0 where your actual FP knob is, then I'd expect the driver to
follow the reverse path between queue -> tc -> prios mapped to that tc.
And I would not state that prio 0 is queue 0 and skb->priority values
from the whole spectrum may land there, kthxbye. Also, didn't you say
earlier that "lowest queue index (0) has highest priority"? In Linux,
skb->priority 0 has lower priority that skb->priority 1. How does that
work out?

In fact, for both enetc and felix, I have to do this exact thing as you,
except that my hardware knobs are halfway in the path (per tc, not per
queue). The reason why I'm not doing it is because we only consider the
1:1 prio_tc_map, so we use "prio" and "tc" interchangeably.

It can't even be any other way given the current code, since the struct
tc_taprio_qopt_offload passed via ndo_setup_tc does not even contain the
tc_mqprio_qopt sub-structure with the prio_tc_map. Drivers are probably
expected to look, from the ndo_setup_tc hook, at their dev->prio_tc_map
as changed by netdev_set_prio_tc_map() in taprio_change(). Why this API
is different compared to struct tc_mqprio_qopt_offload, I don't know.

What prevents you from doing this? I'd like to understand as precisely
as you can explain, to see how what I'm proposing really sounds when
applied to Intel hardware.

> >>  - Deal breaker: fixed 8 prios;
> >
> > idk, I don't think that's our biggest problem right now honestly.
> 
> Perhaps I am being too optimistic, but the way I see, your proposal is
> already expressible enough to be used in a "certifiable" system. I was
> too harsh with "deal breaker.
> 
> So, what is the biggest problem that you see?
> 
> Overall, no issues from my side.

Let me recap my action items out loud for v2 (some of them weren't
discussed publicly per se).

* Enforce the constraint recommended by 12.30.1.1.1, somewhere as a
  static inline helper function that can be called by drivers, from the
  tc-taprio/tc-mqprio and ethtool-fp code paths. Note that I could very
  quickly run into a very deep rabbit hole here, if I actually bother to
  offload to hardware (enetc) the prio_tc_map communicated from tc-mqprio.
  I'll try to avoid that. I noticed that the mqprio_parse_opt() function
  doesn't validate the provided queue offsets and count if hw offload is
  requested (therefore allowing overlapping queue ranges). I really
  dislike that, because it was probably done for a reason.

* Port the isochron script I have for enetc (endpoint) FP latency
  measurements to kselftest format.

* Do something such that eMAC/pMAC counters are also summed up somewhere
  centrally by ethtool, and reported to the user either individually or
  aggregated.

* Reorganize the netlink UAPI such as to remove ETHTOOL_A_FP_PARAM_TABLE
  and just have multiple ETHTOOL_A_FP_PARAM_ENTRY nests in the parent.

* The enetc verification state machine is off by default. The felix
  verification state machine is on by default. I guess we should figure
  out a reasonable default for all drivers out there?

* For some reason, I notice that the verification state machine remains
  FAILED on felix when it links to enetc and that has verification
  enabled. I need to disable it on enetc, enable it again, and only then
  will the MAC merge interrupt on felix say that verification completed
  successfully. I think I'll need to put a quirk in enetc_pl_mac_link_up()
  to toggle verification off and on, if it was requested, in order for
  this to be seamless to the user.

If it's only these, nothing seems exactly major. However, I assume this
is only the beginning. I fully expect there to be one more round of
"why not dcbnl? why not tc?" nitpicks, for which I'm not exactly
mentally ready (if credible arguments are to be put forward, I'd rather
have them now while I haven't yet put in the extra work in the direction
I'm about to).