netdev - Re: [PATCH net v1] net: taprio offload: enforce qdisc to netdev queue mapping

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1a5057b2-06f4-1621-6c34-d643f5b2a09d@oss.nxp.com>
Date:   Tue, 18 May 2021 16:34:12 +0200
From:   Yannick Vignon <yannick.vignon@....nxp.com>
To:     Vinicius Costa Gomes <vinicius.gomes@...el.com>,
        Jakub Kicinski <kuba@...nel.org>
Cc:     Jamal Hadi Salim <jhs@...atatu.com>,
        Cong Wang <xiyou.wangcong@...il.com>,
        Jiri Pirko <jiri@...nulli.us>,
        "David S. Miller" <davem@...emloft.net>, netdev@...r.kernel.org,
        Joakim Zhang <qiangqing.zhang@....com>,
        sebastien.laveze@....nxp.com,
        Yannick Vignon <yannick.vignon@....com>,
        Kurt Kanzenbach <kurt@...utronix.de>,
        Ivan Khoronzhuk <ivan.khoronzhuk@...aro.org>,
        Vladimir Oltean <olteanv@...il.com>,
        Vedang Patel <vedang.patel@...el.com>,
        Michael Walle <michael@...le.cc>
Subject: Re: [PATCH net v1] net: taprio offload: enforce qdisc to netdev queue
 mapping

On 5/18/2021 12:42 AM, Vinicius Costa Gomes wrote:
> Yannick Vignon <yannick.vignon@....nxp.com> writes:
> 
>> On 5/15/2021 1:47 AM, Vinicius Costa Gomes wrote:
>>> Jakub Kicinski <kuba@...nel.org> writes:
>>>
>>>> On Fri, 14 May 2021 13:40:58 -0700 Vinicius Costa Gomes wrote:
>>>>> Jakub Kicinski <kuba@...nel.org> writes:
>>>>>> You haven't CCed anyone who worked on this Qdisc in the last 2 years :/
>>>>>> CCing them now. Comments, anyone?
>>>>>
>>>>> I guess I should suggest myself as maintainer, to reduce chances of this
>>>>> happening again.
>>>>
>>>> Yes, please.
>>>>
>>>>>> This looks like a very drastic change. Are you expecting the qdisc will
>>>>>> always be bypassed?
>>>>>
>>>>> Only when running in full offload mode it will be bypassed.
>>>>>
>>>>> And it's kind of by design, in offload mode, the idea was: configure the
>>>>> netdev traffic class to queue mapping, send the schedule to the hardware
>>>>> and stay out of the way.
>>>>>
>>>>> But as per Yannick's report, it seems that taprio doesn't stay enough
>>>>> out of the yay.
>>>>>
>>>>>> After a 1 minute looks it seems like taprio is using device queues in
>>>>>> strict priority fashion. Maybe a different model is needed, but a qdisc
>>>>>> with:
>>>>>>
>>>>>> enqueue()
>>>>>> {
>>>>>> 	WARN_ONCE(1)
>>>>>> }
>>>>>>
>>>>>> really doesn't look right to me.
>>>>>
>>
>> My idea was to follow the logic of the other qdiscs dealing with
>> hardware multiqueue, namely mq and mqprio. Those do not have any
>> enqueue/dequeue callbacks, but instead define an attach callback to map
>> the child qdiscs to the HW queues. However, for taprio all those
>> callbacks are already defined by the time we choose between software and
>> full-offload, so the WARN_ONCE was more out of extra caution in case I
>> missed something. If my understanding is correct however, it would
>> probably make sense to put a BUG() instead, since those code paths
>> should never trigger with this patch.
>>
>> OTOH what did bother me a bit is that because I needed an attach
>> callback for the full-offload case, I ended up duplicating some code
>> from qdisc_graft in the attach callback, so that the software case would
>> continue behaving as is.
>>
>> Those complexities could be removed by pulling out the full-offload case
>> into its own qdisc, but as I said it has other drawbacks.
>>
>>>>> This patch takes the "stay out of the way" to the extreme, I kind of
>>>>> like it/I am not opposed to it, if I had this idea a couple of years
>>>>> ago, perhaps I would have used this same approach.
>>>>
>>>> Sorry for my ignorance, but for TXTIME is the hardware capable of
>>>> reordering or the user is supposed to know how to send packets?
>>>
>>> At least the hardware that I am familiar with doesn't reorder packets.
>>>
>>> For TXTIME, we have ETF (the qdisc) that re-order packets. The way
>>> things work when taprio and ETF are used together is something like
>>> this: taprio only has enough knowledge about TXTIME to drop packets that
>>> would be transmitted outside their "transmission window" (e.g. for
>>> traffic class 0 the transmission window is only for 10 to 50, the TXTIME
>>> for a packet is 60, this packet is "invalid" and is dropped). And then
>>> when the packet is enqueued to the "child" ETF, it's re-ordered and then
>>> sent to the driver.
>>>
>>> And this is something that this patch breaks, the ability of dropping
>>> those invalid packets (I really wouldn't like to do this verification
>>> inside our drivers). Thanks for noticing this.
>>>
>>
>> Hmm, indeed, I missed that check (we don't use ETF currently). I'm not
>> sure of the best way forward, but here are a few thoughts:
>> . The problem only arises for full-offload taprio, not for the software
>> or TxTime-assisted cases.
>> . I'm not sure mixing taprio(full-offload) with etf(no-offload) is very
>> useful, at least with small gate intervals: it's likely you will miss
>> your window when trying to send a packet at exactly the right time in
>> software (I am usually testing taprio with a 2ms period and a 4µs
>> interval for the RT stream).
>> . That leaves the case of taprio(full-offload) with etf(offload). Right
>> now with the current stmmac driver config, a packet whose tstamp is
>> outside its gate interval will be sent on the next interval (and block
>> the queue).
> 
> This is the case that is a bit problematic with our hardware. (full taprio
> offload + ETF offload).

Based on your previous comment to Michael, this is starting to look a 
lot like a work-around for a hardware bug. Would moving it to the 
corresponding driver really be out of the question?
Note: there are currently only 4 drivers implementing ETF (including 2 
from Intel), so validating their behavior with late packets would likely 
be doable if needed.

>> . The stmmac hardware supports an expiryTime, currently unsupported in
>> the stmmac driver, which I think could be used to drop packets whose
>> tstamps are wrong (the packet would be dropped once the tstamp
>> "expires"). We'd need to add an API for configuration though, and it
>> should be noted that the stmmac config for this is global to the MAC,
>> not per-queue (so a config through sch-etf would affect all queues).
>> . In general using taprio(full-offload) with etf(offload) will incur a
>> small latency penalty: you need to post the packet before the ETF qdisc
>> wakes up (plus some margin), and the ETF qdisc must wake up before the
>> tx stamp (plus some margin). If possible (number of streams/apps <
>> number of hw queues), it would be better to just use
>> taprio(full-offload) alone, since the app will need to post the packet
>> before the gate opens (so plus one margin, not 2).
> 
> It really depends on the workload, and how the schedule is organized,
> but yeah, that might be possible (for some cases :-)).
> 
>>
>>
>>>>
>>>> My biggest problem with this patch is that unless the application is
>>>> very careful that WARN_ON_ONCE(1) will trigger. E.g. if softirq is
>>>> servicing the queue when the application sends - the qdisc will not
>>>> be bypassed, right?
>>
>> See above, unless I'm mistaken the "root" qdisc is never
>> enqueued/dequeued for multi-queue aware qdiscs.
>>
> 
> That's true, mq and mqprio don't have enqueue()/dequeue(), but I think
> that's more a detail of their implementation than a rule (that no
> multiqueue-aware root qdisc should implement enqueue()/dequeue()).
> 
> That is, from my point of view, there's nothing wrong in having a root
> qdisc that's also a shaper/scheduler.
> 
>>>>> I am now thinking if this idea locks us out of anything.
>>>>>
>>>>> Anyway, a nicer alternative would exist if we had a way to tell the core
>>>>> "this qdisc should be bypassed" (i.e. don't call enqueue()/dequeue())
>>>>> after init() runs.
>>>>
>>
>> Again, I don't think enqueue/dequeue are called unless the HW queues
>> point to the root qdisc. But this does raise an interesting point: the
>> "scheduling" issue I observed was on the dequeue side, when all the
>> queues were dequeued within the RT process context. If we could point
>> the enqueue side to the taprio qdisc and the dequeue side to the child
>> qdiscs, that would probably work (but I fear that would be a significant
>> change in the way the qdisc code works).
> 
> I am wondering if there's a simpler solution, right now (as you pointed
> out) taprio traverses all queues during dequeue(), that's the problem.
> 
> What I am thinking is if doing something like:
> 
>       sch->dev_queue - netdev_get_tx_queue(dev, 0);
> 
> To get the queue "index" and then only dequeuing packets for that queue,
> would solve the issue. (A bit ugly, I know).
> 
> I just wanted to write the idea down to see if any one else could find
> any big issues with it. I will try to play with it a bit, if no-one
> beats me to it.

I looked into it this morning, but I'm not sure how that would work. 
With the current code, when qdisc_run() is executed sch->dev_queue will 
always be 0 (sch being the taprio qdisc), so you'd need my proposed 
patch or something similar for sch to point to the child qdiscs, but at 
that point you'd also be bypassing the taprio qdisc on enqueue.

If removing the TxTime check inside taprio_queue remains difficult, I am 
wondering if keeping my patch with the change below would work (not tested):
***********************************************************************
@@ -4206,7 +4206,10 @@ static int __dev_queue_xmit(struct sk_buff *skb, 
struct net_device *sb_dev)
                 skb_dst_force(skb);

         txq = netdev_core_pick_tx(dev, skb, sb_dev);
-       q = rcu_dereference_bh(txq->qdisc);
+       if (dev->qdisc->enqueue)
+               q = dev->qdisc;
+       else
+               q = rcu_dereference_bh(txq->qdisc);

         trace_net_dev_queue(skb);
         if (q->enqueue) {
***********************************************************************
One risk though for multiqueue-aware qdiscs that define an enqueue (only 
taprio is in that case today) is that the child qdisc selected by 
enqueue would not match the txq from netdev_core_pick_tx, so in the end 
we would call qdisc_run on the wrong qdisc...



>>
>>>> I don't think calling enqueue() and dequeue() is a problem. The problem
>>>> is that RT process does unrelated work.
>>>
>>> That is true. But this seems like a much bigger (or at least more
>>> "core") issue.
>>>
>>>
>>> Cheers,
>>>
>>
> 
> 
> Cheers,
>