Date:   Tue, 27 Mar 2018 16:26:39 -0700
From:   Jesus Sanchez-Palencia <jesus.sanchez-palencia@...el.com>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     netdev@...r.kernel.org, jhs@...atatu.com, xiyou.wangcong@...il.com,
        jiri@...nulli.us, vinicius.gomes@...el.com,
        richardcochran@...il.com, anna-maria@...utronix.de,
        henrik@...tad.us, John Stultz <john.stultz@...aro.org>,
        levi.pearson@...man.com, edumazet@...gle.com, willemb@...gle.com,
        mlichvar@...hat.com
Subject: Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

Hi Thomas,


On 03/25/2018 04:46 AM, Thomas Gleixner wrote:
> On Fri, 23 Mar 2018, Jesus Sanchez-Palencia wrote:
>> On 03/22/2018 03:52 PM, Thomas Gleixner wrote:
>>> So what's the plan for this? Having TAS as a separate entity or TAS feeding
>>> into the proposed 'basic' time transmission thing?
>>
>> The second one, I guess.
>
> That's just wrong. It won't work. See below.

Yes, our proposal does not handle the scenarios you are bringing into the
discussion.

I think we have more points of convergence than divergence already. I will just
go through some pieces of the discussion first, and then let's see if we can
agree on where we are trying to go.



>
>> Elaborating, the plan is at some point having TAS as a separate entity,
>> but which can use tbs for one of its classes (and cbs for another, and
>> strict priority for everything else, etc).
>>
>> Basically, the design would be something along the lines of 'taprio'. A root qdisc
>> that is both time and priority aware, and capable of running a schedule for the
>> port. That schedule can run inside the kernel with hrtimers, or just be
>> offloaded into the controller if Qbv is supported on HW.
>>
>> Because it would expose the inner traffic classes in a mq / mqprio / prio style,
>> then it would allow for other per-queue qdiscs to be attached to it. On a system
>> using the i210, for instance, we could then have tbs installed on traffic class
>> 0 just dialing hw offload. The Qbv schedule would be running in SW on the TAS
>> entity (i.e. 'taprio') which would be setting the packets' txtime before
>> dequeueing packets on a fast path -> tbs -> NIC.
>>
>> Similarly, other qdiscs, like cbs, could be installed if all that traffic class
>> requires is traffic shaping once its 'gate' is allowed to execute the selected
>> tx algorithm attached to it.
>>
>>> I've not yet seen a convincing argument why this low level stuff with all
>>> of its weird flavours is superior over something which reflects the basic
>>> operating principle of TSN.
>>
>>
>> As you know, not all TSN systems are designed the same. Take AVB systems, for
>> example. These are not always running on networks that are aware of any time
>> schedule, or at least not quite like what is described by Qbv.
>>
>> On those systems there is usually a certain number of streams with different
>> priorities that care mostly about having their bandwidth reserved along the
>> network. The applications running on such systems are usually based on AVTP,
>> thus they already have to calculate and set the "avtp presentation time"
>> per-packet themselves. A Qbv scheduler would probably provide very little
>> benefit to this domain, IMHO. For "talkers" of these AVB systems, shaping
>> traffic using txtime (i.e. tbs) can provide a low-jitter alternative to cbs, for
>> instance.
>
> You're looking at it from particular use cases and try to accommodate for
> them in the simplest possible way. I don't think that cuts it.
>
> Let's take a step back and look at it from a more general POV without
> trying to make it fit to any of the standards first. I'm deliberately NOT
> using any of the standard defined terms.
>
> At the (local) network level you have always an explicit plan. This plan
> might range from no plan at all to a very elaborate plan which is strict
> about when each node is allowed to TX a particular class of packets.


Ok, we are aligned here.


>
> So lets assume we have the following picture:
>
>    	       	  [NIC]
> 		    |
> 	 [ Time slice manager ]
>
> Now in the simplest case, the time slice manager has no constraints and
> exposes a single input which allows the application to say: "Send my packet
> at time X". There is no restriction on 'time X' except if there is a time
> collision with an already queued packet or the requested TX time has
> already passed. That's close to what you implemented.
>
>   Is the TX timestamp which you defined in the user space ABI a fixed
>   scheduling point or is it a deadline?
>
>   That's an important distinction and for this all to work across various
>   use cases you need a way to express that in the ABI. It might be an
>   implicit property of the socket/channel to which the application connects
>   to but still you want to express it from the application side to do
>   proper sanity checking.
>
>   Just think about stuff like audio/video streaming. The point of
>   transmission does not have to be fixed if you have some intelligent
>   controller at the receiving end which can buffer stuff. The only relevant
>   information is the deadline, i.e. the latest point in time where the
>   packet needs to go out on the wire in order to keep the stream steady at
>   the consumer side. Having the notion of a deadline and that's the only
>   thing the provider knows about allows you proper utilization by using an
>   appropriate scheduling algorithm like EDF.
>
>   Contrary to that you want very explicit TX points for applications like
>   automation control. For this kind of use case there is no wiggle room, it
>   has to go out at a fixed time because that's the way control systems
>   work.
>
>   This is missing right now and you want to get that right from the very
>   beginning. Duct taping it on the interface later on is a bad idea.


Agreed that this is needed. On the SO_TXTIME + tbs proposal, I believe it's been
covered by the (per-packet) SCM_DROP_IF_LATE. Do you think we need a different
mechanism for expressing that?


>
> Now lets go one step further and create two time slices for whatever
> purpose still on the single node (not network wide). You want to do that
> because you want temporal separation of services. The reason might be
> bandwidth guarantee, collision avoidance or whatever.
>
>   How does the application which was written for the simple manager which
>   had no restrictions learn about this?
>
>   Does it learn it the hard way because now the packets which fall into the
>   reserved timeslice are rejected? The way you created your interface, the
>   answer is yes. That's patently bad as it requires changing the
>   application once it runs on a partitioned node.
>
>   So you really want a way for the application to query the timing
>   constraints and perhaps other properties of the channel it connects
>   to. And you want that now before the first application starts to use the
>   new ABI. If the application developer does not use it, you still have to
>   fix the application, but you have to fix it because the developer was a
>   lazy bastard and not because the design was bad. That's a major
>   difference.


Ok, this is something that we considered in the past, but the feedback here
drove us in a different direction. The overall input we got was that
applications would have to be adjusted, or that userspace would have to handle
the coordination between applications somehow (e.g. a daemon could be developed
separately to accommodate the fully dynamic use-cases, etc).


>
> Now that we have two time slices, I'm coming back to your idea of having
> your proposed qdisc as the entity which sits right at the network
> interface. Lets assume the following:
>
>    [Slice 1: Timed traffic ] [Slice 2: Other Traffic]
>
>   Lets assume further that 'Other traffic' has no idea about time slices at
>   all. It's just stuff like ssh, http, etc. So if you keep that design
>
>        	         [ NIC ]
>   	            |
>            [ Time slice manager ]
> 	       |          |
>      [ Timed traffic ]  [ Other traffic ]
>
>   feeding into your proposed TBS thingy, then in case of underutilization
>   of the 'Timed traffic' slot you prevent utilization of remaining time by
>   pulling 'Other traffic' into the empty slots because 'Other traffic' is
>   restricted to Slice 2 and 'Timed traffic' does not know about 'Other
>   traffic' at all. And no, you cannot make TBS magically pull packets from
>   'Other traffic' just because it's not designed for it. So your design
>   becomes strictly partitioned and forces underutilization.
>
>   That becomes even worse when you switch to the proposed full hardware
>   offloading scheme. In that case the only way to do admission control is
>   the TX time of the farthest out packet which is already queued. That
>   might work for a single application which controls all of the network
>   traffic, but it won't ever work for something more flexible. The more I
>   think about it the less interesting full hardware offload becomes. It's
>   nice if you have a fully strict scheduling plan for everything, but then
>   your admission control is bogus once you have more than one channel as
>   input. So yes, it can be used when the card supports it and you have
>   other ways to enforce admission control w/o hurting utilization or if you
>   don't care about utilization at all. It's also useful for channels which
>   are strictly isolated and have a defined TX time. Such traffic can be
>   directly fed into the hardware.


This is a new requirement for the entire discussion.

If I'm not missing anything, however, underutilization of the time slots is only
a problem:

1) for the fully dynamic use-cases, and
2) because now you are designing applications in terms of time slices, right?

We have not thought of making any of the proposed qdiscs capable of (optionally)
adjusting the "time slices", mainly because this is not a problem we had faced
before. Our assumption was that per-port Tx schedules would only be used
for static systems. In other words, no, we didn't think that re-balancing the
slots was a requirement, not even for 'taprio'.
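
As a sanity check of the underutilization argument, here is a toy calculation
(all numbers invented for illustration) of what strict partitioning costs when
the timed slice is not filled:

```python
# Back-of-the-envelope model of the strict-partitioning problem: unused time
# in the 'timed' slice cannot be borrowed by 'other' traffic, so each slice's
# usable time is capped at its own demand and its own length.
def cycle_utilization(timed_slice, other_slice, timed_demand, other_demand):
    """Fraction of one cycle actually carrying traffic (times in us)."""
    cycle = timed_slice + other_slice
    used = min(timed_demand, timed_slice) + min(other_demand, other_slice)
    return used / cycle

# Timed traffic uses only 200us of its 600us slice; other traffic could fill
# 500us but is capped at its 400us slice -> 400us of the cycle sits idle.
u = cycle_utilization(timed_slice=600, other_slice=400,
                      timed_demand=200, other_demand=500)
print(f"{u:.0%}")  # 60%
```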


>
> Coming back to the overall scheme. If you start upfront with a time slice
> manager which is designed to:
>
>   - Handle multiple channels
>
>   - Expose the time constraints, properties per channel
>
> then you can fit all kind of use cases, whether designed by committee or
> not. You can configure that thing per node or network wide. It does not
> make a difference. The only difference are the resulting constraints.


Ok, and I believe the above was covered by what we had proposed before, unless
what you meant by time constraints is beyond the configured port schedule.

Are you suggesting that we'll need to have a kernel entity that is not only
aware of the current traffic classes 'schedule', but also of the resources that
are still available for new streams to be accommodated into the classes? Putting
it differently, is the TAS you envision just an entity that runs a schedule, or
is it a time-aware 'orchestrator'?


>
> We really want to accommodate everything between the 'no restrictions' and
> the 'full network wide explicit plan' case. And it's not rocket science
> once you realize that the 'no restrictions' case is just a subset of the
> 'full network wide explicit plan' simply because it exposes a single
> channel where:
>
> 	slice period = slice length.
>
> It's that easy, but at the same time you teach the application from the
> very beginning to ask for the time constraints so if it runs on a more
> sophisticated system/network, then it will see a different slice period and
> a different slice length and can accommodate or react in a useful way
> instead of just dying on the 17th packet it tries to send because it is
> rejected.


Ok.
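
The "'no restrictions' is just a subset" observation can be sketched with the
abstract channel model from above, i.e. a (start, period, length) tuple. This
is only an illustration of the equivalence, not any proposed API:

```python
# A channel's gate is open when the offset into its period falls inside the
# slice length. The 'no restrictions' case is then simply period == length.
def gate_open(now_ns, start_ns, period_ns, length_ns):
    offset = (now_ns - start_ns) % period_ns
    return offset < length_ns

# A restricted channel: open for the first 300ns of every 1000ns period.
assert gate_open(now_ns=100, start_ns=0, period_ns=1000, length_ns=300)
assert not gate_open(now_ns=700, start_ns=0, period_ns=1000, length_ns=300)

# 'No restrictions': slice period == slice length -> the gate is always open.
assert all(gate_open(t, 0, 1000, 1000) for t in range(0, 5000, 137))
```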


>
> We really want to design for this as we want to be able to run the video
> stream on the same node and network which does robot control without
> changing the video application. That's not a theoretical problem. These use
> cases exist today, but they are forced to use different networks for the
> two. But if you look at the utilization of both then they very well fit
> into one and industry certainly wants to go for that.
>
> That implies that you need constraint aware applications from the very
> beginning and that requires a proper ABI in the first place. The proposed
> ad hoc mode does not qualify. Please be aware, that you are creating a user
> space ABI and not a random in kernel interface which can be changed at any
> given time.
>
> So lets look once more at the picture in an abstract way:
>
>      	       [ NIC ]
> 	          |
> 	 [ Time slice manager ]
> 	    |           |
>          [ Ch 0 ] ... [ Ch N ]
>
> So you have a bunch of properties here:
>
> 1) Number of Channels ranging from 1 to N
>
> 2) Start point, slice period and slice length per channel

Ok, so we agree that a TAS entity is needed. Assuming that channels are traffic
classes, do you have something else in mind other than a new root qdisc?


>
> 3) Queueing modes assigned per channel. Again that might be anything from
>    'feed through' over FIFO, PRIO to more complex things like EDF.
>
>    The queueing mode can also influence properties like the meaning of the
>    TX time, i.e. strict or deadline.


Ok, but how are the queueing modes assigned / configured per channel?

Just to make sure we re-visit some ideas from the past:

* TAS:

   The idea we are currently exploring is to add a "time-aware", priority based
   qdisc, that also exposes the Tx queues available and provides a mechanism for
   mapping priority <-> traffic class <-> Tx queues in a similar fashion as
   mqprio. We are calling this qdisc 'taprio', and its 'tc' cmd line would be:

   $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4      \
         map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3                           \
         queues 0 1 2 3                                                \
         sched-file gates.sched [base-time <interval>]                 \
         [cycle-time <interval>] [extension-time <interval>]

   <file> is multi-line, with each line being of the following format:
   <cmd> <gate mask> <interval in nanoseconds>

   Qbv only defines one <cmd>: "S" for 'SetGates'

   For example:

   S 0x01 300
   S 0x03 500

   This means that there are two intervals, the first will have the gate
   for traffic class 0 open for 300 nanoseconds, the second will have
   both traffic classes open for 500 nanoseconds.
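
For clarity, this is how I read that example. A toy interpreter (Python,
illustrative only; this is my interpretation of the format above, not the
taprio code):

```python
# Each "S <gate mask> <interval>" entry opens the traffic classes set in
# <gate mask> for <interval> nanoseconds, and the entries repeat as one cycle.
def parse_sched(text):
    entries = []
    for line in text.strip().splitlines():
        cmd, mask, interval = line.split()
        assert cmd == "S"                 # Qbv's only command: SetGates
        entries.append((int(mask, 16), int(interval)))
    return entries

def open_classes(entries, t_ns):
    """Return the set of traffic classes whose gates are open at t_ns."""
    cycle = sum(interval for _, interval in entries)
    offset = t_ns % cycle
    for mask, interval in entries:
        if offset < interval:
            return {tc for tc in range(8) if mask & (1 << tc)}
        offset -= interval

sched = parse_sched("S 0x01 300\nS 0x03 500")   # cycle = 800ns
print(open_classes(sched, 100))   # {0}     (inside the first 300ns interval)
print(open_classes(sched, 400))   # {0, 1}  (inside the next 500ns interval)
```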


It would handle multiple channels and expose their constraints / properties.
Each channel also becomes a traffic class, so other qdiscs can be attached to
them separately.


So, in summary, because our entire design is based on qdisc interfaces, what we
had proposed was a root qdisc (the time slice manager, as you put it) that allows
for other qdiscs to be attached to each channel. The inner qdiscs define the
queueing modes for each channel, and tbs is just one of those modes. I
understand now that you want to allow for fully dynamic use-cases to be
supported as well, which we hadn't covered with our TAS proposal before because
we hadn't envisioned it being used for these systems' design.

Have I missed anything?

Thanks,
Jesus



>
> Please sit back and map your use cases, standards or whatever you care
> about into the above and I would be very surprised if they don't fit.
>
> Thanks,
>
> 	tglx
>
>
>
>
