Message-ID: <20111114144358.GB27284@hmsreliant.think-freely.org>
Date: Mon, 14 Nov 2011 09:43:58 -0500
From: Neil Horman <nhorman@...driver.com>
To: Dave Taht <dave.taht@...il.com>
Cc: netdev@...r.kernel.org,
John Fastabend <john.r.fastabend@...el.com>,
Robert Love <robert.w.love@...el.com>,
"David S. Miller" <davem@...emloft.net>
Subject: Re: net: Add network priority cgroup
On Mon, Nov 14, 2011 at 01:32:04PM +0100, Dave Taht wrote:
> On Mon, Nov 14, 2011 at 12:47 PM, Neil Horman <nhorman@...driver.com> wrote:
> > On Wed, Nov 09, 2011 at 02:57:33PM -0500, Neil Horman wrote:
> >> Data Center Bridging environments are currently somewhat limited in their
> >> ability to provide a general mechanism for controlling traffic priority.
> >> Specifically they are unable to administratively control the priority at which
> >> various types of network traffic are sent.
> >>
> >> Currently, the only ways to set the priority of a network buffer are:
> >>
> >> 1) Through the use of the SO_PRIORITY socket option
> >> 2) By using low level hooks, like a tc action
> >>
> >> (1) is difficult from an administrative perspective because it requires the
> >> application to be coded so that it does not just assume the default priority
> >> is sufficient, and to expose an administrative interface to allow priority
> >> adjustment. Such a solution is not scalable in a DCB environment.
> >>
> >> (2) is also difficult, as it requires constant administrative oversight of
> >> applications so as to build appropriate rules to match traffic belonging to
> >> various classes, so that priority can be appropriately set. It is further
> >> limiting when DCB enabled hardware is in use, due to the fact that tc rules are
> >> only run after a root qdisc has been selected (DCB enabled hardware may reserve
> >> hw queues for various traffic classes and needs the priority to be set prior to
> >> selecting the root qdisc)
> >>
> >>
> >> I've discussed various solutions with John Fastabend, and we saw a cgroup as
> >> being a good general solution to this problem. The network priority cgroup
> >> allows for a per-interface priority map to be built per cgroup. Any traffic
> >> originating from an application in a cgroup, that does not explicitly set its
> >> priority with SO_PRIORITY will have its priority assigned to the value
> >> designated for that group on that interface. This allows a user space daemon,
> >> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
> >> based on the APP_TLV value received and administratively assign applications to
> >> that priority using the existing cgroup utility infrastructure.
> >>
> >> Tested by John and myself, with good results
> >>
> >> Signed-off-by: Neil Horman <nhorman@...driver.com>
> >> CC: John Fastabend <john.r.fastabend@...el.com>
> >> CC: Robert Love <robert.w.love@...el.com>
> >> CC: "David S. Miller" <davem@...emloft.net>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe netdev" in
> >> the body of a message to majordomo@...r.kernel.org
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>
> >
> > Bump, any other thoughts here? Dave T. has some reasonable thoughts regarding
> > the use of skb->priority, but IMO they really seem orthogonal to the purpose of
> > this change. Any other reviews would be welcome.
>
> Well, in part I've been playing catchup in the hope that lldp and
> openlldp and/or this dcb netlink layer that I don't know anything
> about (pointers please?) could help somehow to resolve the semantic
> mess skb->priority has become in the first place.
>
> I liked what was described here.
>
> "What if we did at least carve out the DCB functionality away from
> skb->priority? Since, AIUI, we're only concerning ourselves with
> locally generated traffic here, we're talking
> about skbs that have a socket attached to them. Instead of indexing
> the prio_tc_map with skb->priority, we could index it with
> skb->dev->priomap[skb->sk->prioidx] (as provided by this patch). The cgroup
> then could be, instead of a strict priority cgroup, a queue_selector cgroup (or
> something more appropriately named), and we don't have to touch skb->priority at
> all. I'd really rather not start down that road until I got more opinions and
> consensus on that, but it seems like a pretty good solution, one that would
> allow hardware queue selection in systems that use things like DCB to co-exist
> with software queueing features."
>
I was initially ok with this, but the more I think about it, the more I think
it's just not needed (see further down in this email for my reasoning). John,
Rob, do you have any thoughts here?
> The piece that still kind of bothered me about the original proposal
> (and perhaps this one) was that setting SO_PRIORITY in an app means
> 'give my packets more mojo'.
>
> Taking something that took unprioritized packets and assigned them and
> *them only* to a hardware queue struck me as possibly deprioritizing
> the 'more mojo wanted' packets in the app(s), as they would end up in
> some other, possibly overloaded, hardware queue.
>
I don't really see what you mean by this at all. Taking packets with no
priority and assigning them a priority doesn't really have an effect on
pre-prioritized packets. Or rather it shouldn't. You can certainly create a
problem by having apps prioritized according to conflicting semantic rules, but
that strikes me as administrative error. Garbage in...Garbage out.
> So a cgroup that moves all of the packets from an application into a
> given hardware queue, and then gets scheduled normally according to
> skb->priority and friends (software queue, default of pfifo_fast,
> etc), seems to make some sense to me. (I wouldn't mind if we had
> abstractions for software queues, too, like, I need a software queue
> with these properties, find me a place for it on the hardware - but
> I'm dreaming)
>
> One open question is where do packets generated from other subsystems
> end up, if you are using a cgroup for the app? arp, dns, etc?
>
The overriding rule is the association of an skb to a socket. If a transmitted
frame has skb->sk set in dev_queue_xmit, then we interrogate its priority index
as set when we passed through the sendmsg code at the top of the stack.
Otherwise, the frame's behavior is unchanged from what it is today.
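
To make that rule concrete, here is a rough user-space sketch of the check on the transmit path. It is a model, not the patch itself: the struct names, MAX_PRIOIDX, and model_update_prio() are hypothetical, and it assumes that a priority of 0 means "not set via SO_PRIORITY". Only priomap and prioidx are taken from the discussion above.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PRIOIDX 8   /* hypothetical cap on cgroup indices */

/* Simplified stand-ins for the kernel structures. Field names follow the
 * thread (priomap, prioidx); the layouts are illustrative only. */
struct sock_model {
    uint32_t prioidx;               /* cgroup index recorded at sendmsg time */
};

struct netdev_model {
    uint32_t priomap[MAX_PRIOIDX];  /* per-interface map: prioidx -> priority */
};

struct skb_model {
    const struct sock_model *sk;    /* NULL for frames with no socket */
    const struct netdev_model *dev;
    uint32_t priority;              /* assumption: 0 means "not set" */
};

/* Apply the cgroup-assigned priority, if any, before queue selection. */
void model_update_prio(struct skb_model *skb)
{
    if (skb->sk == NULL)            /* no socket: frame is left untouched */
        return;
    if (skb->priority != 0)         /* an explicit priority always wins */
        return;
    if (skb->sk->prioidx >= MAX_PRIOIDX)
        return;                     /* out-of-range index: do nothing */
    skb->priority = skb->dev->priomap[skb->sk->prioidx];
}
```

Frames with no socket attached fall through the first check, so their handling does not change in this model.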
> So to rephrase your original description from this:
>
> >> Any traffic originating from an application in a cgroup, that does not explicitly set its
> >> priority with SO_PRIORITY will have its priority assigned to the value
> >> designated for that group on that interface. This allows a user space daemon,
> >> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
> >> based on the APP_TLV value received and administratively assign applications to
> >> that priority using the existing cgroup utility infrastructure.
>
> To this:
>
> "Any traffic originating from an application in a cgroup, will have
> its hardware queue assigned to the value designated for that group on
> that interface. This allows a user space daemon, when conducting LLDP
> negotiation with a DCB enabled peer to create a cgroup based on the
> APP_TLV value received and administratively assign applications to
> that hardware queue using the existing cgroup utility infrastructure."
>
As above, I'm split brained about this. I'm ok with the idea of making this a
queue selection cgroup, and separating it from priority, but at the same time,
in the context of DCB, we really are assigning priority here, so it seems a bit
false to do something that is not priority. I also like the fact that it
provides administrative control in a way that netfilter and tc don't really
enable.
> Assuming we're on the same page here, what the heck is APP_TLV?
>
LLDP does layer 2 discovery with peer networking devices. It does so using sets
of type/length/value (TLV) tuples. The types carry various bits of information,
such as which priority groups are available on the network. The APP TLV conveys
application or feature specific information. For instance, there is an iSCSI
APP TLV that tells the host that "on the interface you received this TLV, iSCSI
traffic must be sent at priority X". The idea is that, on receipt of this
TLV, the DCB daemon can create an iSCSI network priority cgroup instance and
augment the cgroup rules file such that, when the user space iSCSI daemon is
started, its traffic automatically transmits at the appropriate priority.
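
For anyone unfamiliar with the wire format, a minimal sketch of decoding one LLDP TLV header may help. Each TLV starts with a two-octet header holding a 7-bit type and a 9-bit length, followed by the value. The struct and function names below are hypothetical, and the sketch stops at the generic header; it does not attempt to decode the contents of an APP TLV.

```c
#include <stddef.h>
#include <stdint.h>

struct lldp_tlv {
    uint8_t type;           /* 7-bit TLV type */
    uint16_t len;           /* 9-bit length of the value, in octets */
    const uint8_t *value;   /* points into the caller's buffer */
};

/* Decode one TLV at buf. Returns the number of octets consumed (header
 * plus value), or 0 if the buffer is too short to hold the whole TLV. */
size_t lldp_tlv_parse(const uint8_t *buf, size_t buflen, struct lldp_tlv *out)
{
    if (buflen < 2)
        return 0;

    /* The header is big-endian: top 7 bits are the type, low 9 the length. */
    uint16_t hdr = (uint16_t)((buf[0] << 8) | buf[1]);
    out->type = (uint8_t)(hdr >> 9);
    out->len = hdr & 0x01ff;

    if (buflen - 2 < out->len)
        return 0;

    out->value = buf + 2;
    return 2 + (size_t)out->len;
}
```

A daemon would call this in a loop over the received LLDPDU, advancing by the returned count, and hand the TLVs it cares about to whatever code builds the cgroup configuration.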
Actually, as I sit here thinking about it, I'm less supportive of separating the
assignment of hw queue / root qdisc from priority. I say that because it's
really just not necessary, at least not in a DCB environment. In a DCB
environment, we use priority exclusively to select the root qdisc and the
associated hardware queue. Any software queue policing/grooming that takes
place after that by definition can't make any decisions based on skb->priority,
as the priority will be the same for each frame in the given parent queue (since
we used skb->priority to assign the parent queue). You can still use RED or
some other queue discipline to manage queue length, but you have to understand
that priority won't be a packet-differentiating factor there. If you made this
a separate queue selection variable, you could manipulate priority independently,
but it would be nonsensical, because the intent would still be to send all
packets on that queue at the same priority, since that's what that hardware queue
is configured for.
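
As a rough illustration of that argument, the sketch below models queue selection driven purely by priority. The names (dcb_model, tc_queue_offset, model_pick_queue) and the table sizes are hypothetical, and it assumes a one-to-one mapping of priority to traffic class; only the idea of a prio_tc_map comes from the discussion above.

```c
#include <stdint.h>

#define NUM_PRIO 8      /* hypothetical number of priorities */
#define NUM_TC   8      /* hypothetical number of traffic classes */

struct dcb_model {
    uint8_t prio_tc_map[NUM_PRIO];      /* priority -> traffic class */
    uint16_t tc_queue_offset[NUM_TC];   /* first hw queue of each class */
};

/* Pick the hardware queue for a frame from its priority alone. */
uint16_t model_pick_queue(const struct dcb_model *d, uint32_t priority)
{
    uint8_t tc = d->prio_tc_map[priority & (NUM_PRIO - 1)];

    if (tc >= NUM_TC)       /* malformed map entry: fall back to queue 0 */
        return 0;
    return d->tc_queue_offset[tc];
}
```

Because the queue is a pure function of the priority, every frame that lands in a given queue under a one-to-one map carries the same priority, which is why a qdisc attached below that point gains nothing by inspecting skb->priority.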
Alternatively, if you don't use DCB, then this cgroup becomes somewhat
worthless anyway, because no one has a need to select a specific hardware queue.
If you want to categorize traffic into classes you can already use the net_cls
cgroup, which works quite well for bandwidth allocation, and that still frees up
the skb->priority assignment to use as you see fit.

In the end, I think it's just plain old more useful to assign priority here than
some new thing.

Neil
> > John, Robert, if you're supportive of these changes, some Acks would be
> > appreciated.
>
>
> >
> >
> > Regards
> > Neil
> >
> >
>
>
>
> --
> Dave Täht
> SKYPE: davetaht
> US Tel: 1-239-829-5608
> FR Tel: 0638645374
> http://www.bufferbloat.net
>