Date:   Thu, 15 Sep 2022 14:14:18 +0000
From:   Vladimir Oltean <>
To:     Vinicius Costa Gomes <>
CC:     "" <>,
        "David S. Miller" <>,
        Eric Dumazet <>,
        Jakub Kicinski <>,
        Paolo Abeni <>,
        Michal Kubecek <>,
        Claudiu Manoil <>,
        Xiaoliang Yang <>,
        Kurt Kanzenbach <>,
        Rui Sousa <>,
        Ferenc Fejes <>,
        Richie Pearn <>
Subject: Re: [RFC PATCH net-next 2/7] net: ethtool: add support for Frame
 Preemption and MAC Merge layer

I won't reply to every point, as that would just diverge more and more
from what I'm really trying to settle on.

On Tue, Sep 13, 2022 at 07:59:29PM -0700, Vinicius Costa Gomes wrote:
> For the long term, perhaps the solution is to have a single place that
> does this mapping (from skb to traffic classes and then to queues) more
> centralized. 'dcb' would populate that as would 'mqprio/taprio'. This
> seems to imply that ndo_select_queue() is an escape hatch, that in the
> ideal world it would not exist. Would this create worse problems?
> (Reality is pointing that it doesn't seem that we can avoid making
> drivers aware of this map, so this idea doesn't seem to help the
> situation much)

That central place is called netdev_set_prio_tc_map(). It is called from
qdiscs and drivers alike. W.r.t. your "ndo_select_queue() escape hatch",
the commit message of 4f57c087de9b ("net: implement mechanism for HW
based QOS") provides some nice insight into networking history, 11 years
ago.

> > ethtool:
> >
> > - pro: is in the same subsystem as MAC merge layer configuration (for
> >   which I have no doubts that ethtool is the right place)
> >
> > - pro: is not aligned to any specific vertical use case, like dcbnl is
> >
> > - con: knows nothing about priorities
> >
> > - con: for hardware where FP adminStatus is per traffic class rather
> >   than per priority, we'll have to change the adminStatus not only from
> >   changes made via ethtool to the FP adminStatus, but also from changes
> >   made to the prio_tc_map (tc I guess?). But this may also be a subtle
> >   pro for dcbnl, since that has its own prio_tc_map, so we'd remain
> >   within the same subsystem?
> If the driver also supports DCB, it will have the offload functions
> implemented, so it will be aware of changes in that map.

Not so simple. DCB has many ops, you don't have to implement all of them.

> The problem now is mqprio: if it isn't offloaded, the driver is
> not aware of any changes. But mqprio doesn't implement .change(),
> which makes things a bit easier.
> 'ethtool' is gaining my vote.

Well, ethtool is certainly not aware of changes to the prio:tc map
either, very much less so than the driver :)

> > tc (part of tc-mqprio and tc-taprio):
> >
> > - open question: does the FP adminStatus influence scheduling or not?
> >   Normally I'd say no, adminStatus only decides whether the eMAC or pMAC
> >   will be used to send a frame, and this is orthogonal to scheduling.
> >   But 802.1Q says that "It should be noted that, other things being
> >   equal, designating a priority as “express” effectively increases its
> >   priority above that of any priority designated as “preemptible”."
> >   Is this phrase enough to justify a modification of the qdisc dequeue
> >   method in, say, taprio, which would prioritize a lower priority qdisc
> >   queue A over a higher one B, if A is express but B is preemptable?
> I understood this as: if a preemptible packet (higher priority) and
> express packet (lower priority) arrive at the MAC at the same time, the
> express packet will finish transmitting before the preemptible, causing
> the express packet to have an 'effective' higher priority.
> I don't think we need to change anything on the dequeueing in the qdisc
> side (and trying to solve it is going to be awkward, having to check the
> neighbor qdiscs for packets being enqueued/peeked, ugh!).

Agree. So bottom line, express/preemptable does not affect the
scheduling algorithm. This is one reservation I have against FP in tc.
Maybe this reservation goes away if we restrict the new option to only
the hardware offload modes of the respective qdiscs.

> (as a note: for i225, traffic in queues marked as express will only
> preempt traffic from lower numbered queues, this behavior doesn't seem
> to be forbidden by the standard)

So I want the frame preemption adminStatus to work hand in hand with the
prio:tc map of the netdev, no matter where/how the adminStatus lands.

Whereas your hardware implementation does not work hand in hand with it.
You'll have to add some restrictions and do some clarification work, sorry.

> For the hardware I work with:
>  - i210: two modes, (1) round robin, by default, or in (2) Qav-mode
>    ("TSN"), strict priority, fixed by the queue index.
>  - i225/i226: two modes, (1) round robin, by default, or in (2) TSN
>     mode, strict priority, configurable queue priority.
> (to give the full picture, we can also configure how the descriptors are
> going to be fetched from host memory, but as host memory is not usually
> where the congestion happens, it's less important for this)

So this is to say that i210 has 1 traffic class in mode (1), and a
number of traffic classes equal to the number of queues, in mode (2)?
As for i225/i226, the number of traffic classes is given by the number
of uniquely configured queue priorities? Or what would you call a
traffic class, exactly?

> > So here's what I don't understand. Why do you need 16 priority values,
> > when in the end you'll map them to only 3 egress traffic classes anyway?
> >
> Basically because of how prio_tc_map works and the fact that I want TC 0
> to be mapped to queue 0. i.e. if an entry is not specified in the map,
> it's going to be mapped to TC 0. I want to avoid that; historically, in
> our models, TC 0 is the "important" one.

Strange. I wonder who else expects traffic class 0 to be the important one?
This is clearly a case of 2 vendors using the exact same tooling and
commands, but a configuration script working in the exact opposite way.

> More or less, repeating a bit what I said before, on i210 it was fixed,
> on i225/i226 it can be changed, but we decided to keep the default, less
> surprises for our users.

For yours, sure, maybe. For others.. :)

> In (some of) our traffic models, the only kind of traffic class that we
> allow to go through multiple queues is Best Effort.

With the mention that half (or more) of the best effort traffic will be
Better Effort, due to your secret prioritization scheme within the same
traffic class. And netdev_pick_tx() will essentially be a Russian
roulette, because it hashes packets with the same skb->priority towards
queues of different actual priorities.

> And even that works out because taprio "translates" from traffic
> classes to queues when it sends the offload information to the driver,
> i.e. the driver knows the schedule of queues, not traffic classes.

Which is incredibly strange to me, since the standard clearly defines
Qbv gates to be per traffic class, and in ENETC, even if we have 2 TX
queues for the same traffic class (one per CPU), the hardware schedule
is still per traffic class and not per independent TX queue (BD ring).

How does this work for i225/i226, if 2 queues are configured for the
same dequeue priority? Do the taprio gates still take effect per queue?

> That's a good point (your understanding of the flow is similar to mine).
> This seems a good idea. But we may have to wait for a bit until we have
> a LLDP implementation that supports this.

This is circular; we have a downstream patch to openlldp that adds
support for this TLV, but it can't go anywhere until there is mainline
kernel support.

What about splitting out MAC merge support from FP support, and me
concentrating on the MAC merge layer for now? They're independent up to
a pretty high level. The MAC merge layer is supposed to be controlled by
the LLDP daemon and be pretty much plug-and-play, while the FP adminStatus
is supposed to be set by some high level administrator, like a NETCONF
client.

We can argue some more about how to best expose the FP adminStatus.
Right now, I'm thinking that for all intents and purposes, the
adminStatus really only makes practical sense when we have at least 2
traffic classes: one onto which we can map the preemptable priorities,
and one onto which we can map the express ones. In turn, it means that
the 'priorities collapsing into the same traffic class' problem that we
have when we delete the taprio/mqprio qdisc can be solved if we put the
FP adminStatus into the same netlink interface that also configures the
prio:tc map and the number of traffic classes (i.e. yes, in tc).

I'll let someone more knowledgeable of our sysrepo-tsn implementation of
a NETCONF server (like Xiaoliang, maybe even Rui or Richie?)
comment on whether this would create any undesirable side effect.
The implication is that user space, when asked to change the adminStatus,
will have to figure out whether we are using a taprio, or an mqprio
qdisc, and create different netlink attributes to talk to the kernel to
request that configuration. User space will also have to send an error
towards the NETCONF client if no such qdisc exists (meaning that the
user wants to combine priorities of different express/preemptable values
in the same traffic class, TC0). The qdiscs will also centrally validate
whether such invalid mixed configurations are requested.

I'm mostly OK with this: a UAPI where FP adminStatus is per priority,
but it gets translated into per-tc by the qdisc, and passed to the
driver through ndo_setup_tc(). I am certainly still very much against
the "preemptable queue mask" that you proposed, and IMO you'll have to
do something and clarify what a traffic class even means for Intel
hardware, especially in relationship to multiple queues per tc
(mqprio queues 2@1).

Anyone else who has an opinion of any sort, please feel free.
