Message-ID: <20230125131011.hs64czbvv6n3tojh@skbuf>
Date:   Wed, 25 Jan 2023 15:10:11 +0200
From:   Vladimir Oltean <vladimir.oltean@....com>
To:     Vinicius Costa Gomes <vinicius.gomes@...el.com>
Cc:     netdev@...r.kernel.org, John Fastabend <john.fastabend@...il.com>,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        Claudiu Manoil <claudiu.manoil@....com>,
        Camelia Groza <camelia.groza@....com>,
        Xiaoliang Yang <xiaoliang.yang_1@....com>,
        Gerhard Engleder <gerhard@...leder-embedded.com>,
        Alexander Duyck <alexander.duyck@...il.com>,
        Kurt Kanzenbach <kurt@...utronix.de>,
        Ferenc Fejes <ferenc.fejes@...csson.com>,
        Tony Nguyen <anthony.l.nguyen@...el.com>,
        Jesse Brandeburg <jesse.brandeburg@...el.com>,
        Jacob Keller <jacob.e.keller@...el.com>
Subject: Re: [RFC PATCH net-next 00/11] ENETC mqprio/taprio cleanup

On Tue, Jan 24, 2023 at 05:11:29PM -0800, Vinicius Costa Gomes wrote:
> Hi Vladimir,
> 
> Sorry for the delay. I had to sleep on this for a bit.

No problem, thanks for responding.

> > Vinicius said that for some Intel NICs, prioritization at the egress
> > scheduler stage is fundamentally attached to TX queues rather than
> > traffic classes.
> >
> > In other words, in the "popular" mqprio configuration documented by him:
> >
> > $ tc qdisc replace dev $IFACE parent root handle 100 mqprio \
> >       num_tc 3 \
> >       map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
> >       queues 1@0 1@1 2@2 \
> >       hw 0
> >
> > there are 3 Linux traffic classes and 4 TX queues. The TX queues are
> > organized in strict priority fashion, like this: TXQ 0 has highest prio
> > (hardware dequeue precedence for TX scheduler), TXQ 3 has lowest prio.
> > Packets classified by Linux to TC 2 are hashed between TXQ 2 and TXQ 3,
> > but the hardware has higher precedence for TXQ 2 over TXQ 3, and Linux
> > doesn't know that.
> >
> > I am surprised by this fact, and this isn't how ENETC works at all.
> > For ENETC, we try to prioritize on TCs rather than TXQs, and TC 7 has
> > higher priority than TC 0. For us, groups of TXQs that map to the same
> > TC have the same egress scheduling priority. It is possible (and maybe
> > useful) to have 2 TXQs per TC (one TXQ per CPU). Patch 07/11 tries to
> > make that more clear.
> 
> That makes me think, making "queues" visible on mqprio/taprio perhaps
> was a mistake. Perhaps if we only had the "prio to tc" map, and relied
> on drivers implementing .ndo_select_queue(), that would be less
> problematic. And for devices with tens/hundreds of queues, this "no
> queues exposed to the user" model sounds better. Anyway... just
> wondering.
> 
> Perhaps something to think about for mqprio/taprio/etc "the next generation" ;-)

Hmm, not sure I wanted to go there with my proposal. I think the fact
that taprio allows specifying how many TXQs are used per TC (and
starting with which TXQ offset) is a direct consequence of the fact that
mqprio had that in its UAPI. Today, mapping TXQs to TCs certainly
requires hardware-level knowledge, so that software prioritization
(netdev_core_pick_tx()) coincides with the hardware prioritization
scheme. That requirement of prior knowledge makes a given taprio/mqprio
configuration less portable across systems/vendors, which IMO is the
problem.
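
To put the software side in concrete terms, here is a rough userspace
model (simplified; not the actual netdev_core_pick_tx()/skb_tx_hash()
code) of how the mqprio configuration quoted above drives queue
selection. The hash inside a TC's queue range is exactly where hardware
precedence between TXQs stays invisible to software:

#include <stdint.h>
#include <stdio.h>

/* prio_tc_map and tc_to_txq as set up by the quoted mqprio config:
 * num_tc 3, map 2 2 1 0 2 ..., queues 1@0 1@1 2@2
 */
static const uint8_t prio_tc_map[16] = { 2, 2, 1, 0, 2, 2, 2, 2,
                                         2, 2, 2, 2, 2, 2, 2, 2 };
static const struct { uint16_t count, offset; } tc_to_txq[3] = {
        { 1, 0 }, { 1, 1 }, { 2, 2 },
};

/* Simplified model of the software TXQ pick: prio -> TC, then hash
 * within the TC's queue range. Hardware precedence between TXQ 2 and
 * TXQ 3 never enters this picture.
 */
static unsigned int pick_txq(uint8_t skb_prio, uint32_t flow_hash)
{
        uint8_t tc = prio_tc_map[skb_prio & 0xf];

        return tc_to_txq[tc].offset + flow_hash % tc_to_txq[tc].count;
}

int main(void)
{
        printf("prio 3 -> TXQ %u\n", pick_txq(3, 0)); /* TC 0 -> TXQ 0 */
        printf("prio 0 -> TXQ %u or %u\n",
               pick_txq(0, 0), pick_txq(0, 1)); /* TC 2 -> TXQ 2 or 3 */
        return 0;
}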

But I wouldn't jump to your conclusion that it was a mistake to even
expose TXQs to user space. I would argue that maybe the problem is that
not *enough* information about TXQs is exposed to user space. I could
imagine it being useful for user space to be able to probe information
such as

- this netdev has strict prioritization (i.e. for this netdev, egress
  scheduling priority is attached directly to TXQs, and each TXQ is
  required to have a unique priority)

- this netdev has round robin egress scheduling between TXQs (or some
  other fairness scheme; which one?)
  - is the round robin scheduling weighted? what are the weights, and
    are they configurable? should skb_tx_hash() take the weights into
    consideration?

- this netdev has 2 layers of egress scheduling, first being strict
  priority and the other being round robin

Based on this kind of information, it would become possible to automate
writing an mqprio configuration that maps TCs to TXQs in a portable way,
leaving user space sockets concerned only with the packet priority API.
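
Something along these lines is what I have in mind, purely as an
illustration (hypothetical; no such UAPI exists today, and all names
below are made up):

#include <stdint.h>

/* Hypothetical and purely illustrative; nothing like this exists in the
 * kernel today. One possible shape for per-netdev egress scheduler
 * information that user space (or tc) could query before building an
 * mqprio/taprio mapping.
 */
enum txq_sched_policy {
        TXQ_SCHED_STRICT_PRIO,  /* each TXQ has a unique, fixed priority */
        TXQ_SCHED_ROUND_ROBIN,  /* fair scheduling between TXQs */
        TXQ_SCHED_WRR,          /* weighted round robin */
};

struct txq_sched_level {
        enum txq_sched_policy policy;
        uint32_t weights[64];           /* per-TXQ WRR weights, if any */
        uint8_t weights_configurable;   /* can user space change them? */
};

struct netdev_txq_sched_info {
        uint8_t num_levels;             /* e.g. 2: strict prio on top of RR */
        struct txq_sched_level level[2];
};

With information like this, a tool could compute the TC-to-TXQ mapping
itself, instead of the user hard-coding per-vendor knowledge into the
"queues" argument.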

I guess different people would want the kernel to expose even more, or
slightly different, information about TXQs. I would be interested to
know what that is.

> > Furthermore (and this is really the biggest point of contention), myself
> > and Vinicius have the fundamental disagreement whether the 802.1Qbv
> > (taprio) gate mask should be passed to the device driver per TXQ or per
> > TC. This is what patch 11/11 is about.
> 
> I think that I was being annoying because I believed that some
> implementation detail of the netdev prio_tc_map and the way that netdev
> select TX queues (the "core of how mqprio works") would leak, and it
> would be easier/more correct to make other vendors adapt themselves to
> the "Intel"/"queues have priorities" model. But I stand corrected, as
> you (and others) have proven.

The problem with gates per TXQ is that it doesn't answer the obvious
question of how that works out when there is more than one TXQ per TC.
With the clarification that "gates per TXQ" requires that there is a
single TXQ per TC, this effectively becomes just a matter of changing
the indices of set bits in the gate mask (TC 3 may correspond to TXQ
offset 5), which is essentially what Gerhard seems to want to see with
tsnep. That is something I don't have a problem with.
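
i.e. per gate entry, something as simple as this (untested sketch;
tc_to_txq being whatever the driver derived from the mqprio
configuration):

#include <stdint.h>

struct tc_txq { uint16_t count, offset; };

/* Untested sketch: with exactly one TXQ per TC, translating a per-TC
 * gate mask into a per-TXQ gate mask is only a matter of moving each
 * set bit to the TXQ offset of its TC (e.g. TC 3 -> TXQ 5).
 */
static uint32_t tc_gates_to_txq_gates(uint32_t tc_gate_mask,
                                      const struct tc_txq *tc_to_txq,
                                      int num_tc)
{
        uint32_t txq_gate_mask = 0;
        int tc;

        for (tc = 0; tc < num_tc; tc++)
                if (tc_gate_mask & (1U << tc))
                        txq_gate_mask |= 1U << tc_to_txq[tc].offset;

        return txq_gate_mask;
}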

But I may want, as a sanity measure, to enforce that the mqprio queue
count for each TC is no more than 1 ;) Otherwise, we fall into that
problem I keep repeating: skb_tx_hash() arbitrarily hashes between 2
TXQs, both of which have an open gate in software (allowing traffic to
pass), but in hardware, one TXQ has an open gate and the other a closed
one.
So half the traffic goes into the bitbucket, because software doesn't
know what hardware does/expects.
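
Concretely, the check I have in mind looks roughly like this at offload
validation time (sketch only; the function name and message are mine,
not from the series):

#include <linux/errno.h>
#include <linux/netlink.h>
#include <linux/pkt_sched.h>

/* Sketch only, not from the series: refuse per-TXQ gate offload unless
 * every TC is backed by exactly one TXQ, because skb_tx_hash() spreads
 * traffic across TXQs whose gates hardware may drive differently.
 */
static int taprio_validate_one_txq_per_tc(const struct tc_mqprio_qopt *qopt,
                                          struct netlink_ext_ack *extack)
{
        int tc;

        for (tc = 0; tc < qopt->num_tc; tc++) {
                if (qopt->count[tc] > 1) {
                        NL_SET_ERR_MSG(extack,
                                       "Per-TXQ gates require exactly one TXQ per TC");
                        return -EINVAL;
                }
        }

        return 0;
}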

So please ACK this issue and my proposal to break your "popular" mqprio
configuration.

> In short, I am not opposed to this idea. This capability operation
> really opens some possibilities. The patches look clean.

Yeah, gotta thank Jakub for that.

> I'll play with the patches later in the week, quite swamped at this
> point.

Regarding the patches - I plan to send a v2 anyway, because patch 08/11
"net/sched: taprio: pass mqprio queue configuration to ndo_setup_tc()"
doesn't quite work the way I'd hoped. Specifically, taprio must hold a
persistent struct tc_mqprio_qopt, rather than just juggling with what it
received last time via TCA_TAPRIO_ATTR_PRIOMAP. This is because taprio
supports some level of dynamic reconfiguration via taprio_change(), and
the TCA_TAPRIO_ATTR_PRIOMAP would be NULL when reconfigured (because the
priomap isn't what has changed). Currently this will result in passing a
NULL (actually disabled) mqprio configuration to ndo_setup_tc(), but
what I really want is to pass the *old* mqprio configuration.
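
Roughly what I mean for v2 is this (hand-waved sketch, not the actual
patch):

#include <linux/pkt_sched.h>
#include <linux/string.h>
#include <net/netlink.h>

/* Hand-waved sketch, not the actual patch: keep a persistent copy of the
 * last accepted mqprio qopt in taprio's private data, and keep using it
 * when a taprio_change() message carries no TCA_TAPRIO_ATTR_PRIOMAP.
 */
struct taprio_sched_sketch {
        /* ... existing taprio private state ... */
        struct tc_mqprio_qopt mqprio;   /* copy, not a pointer into the nlmsg */
};

static void taprio_remember_mqprio(struct taprio_sched_sketch *q,
                                   const struct nlattr *priomap_attr)
{
        /* length/consistency of the attribute assumed validated elsewhere */
        if (priomap_attr)
                memcpy(&q->mqprio, nla_data(priomap_attr), sizeof(q->mqprio));

        /* the offload path then always hands &q->mqprio to ndo_setup_tc(),
         * instead of a NULL/"disabled" qopt when the attribute was absent
         */
}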
