Message-ID: <4D2233A4.9050701@intel.com>
Date: Mon, 03 Jan 2011 12:37:56 -0800
From: John Fastabend <john.r.fastabend@...el.com>
To: Jarek Poplawski <jarkao2@...il.com>
CC: "davem@...emloft.net" <davem@...emloft.net>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"hadi@...erus.ca" <hadi@...erus.ca>,
"shemminger@...tta.com" <shemminger@...tta.com>,
"tgraf@...radead.org" <tgraf@...radead.org>,
"eric.dumazet@...il.com" <eric.dumazet@...il.com>,
"bhutchings@...arflare.com" <bhutchings@...arflare.com>,
"nhorman@...driver.com" <nhorman@...driver.com>
Subject: Re: [net-next-2.6 PATCH v2 3/3] net_sched: implement a root container qdisc sch_mclass

On 1/3/2011 9:02 AM, Jarek Poplawski wrote:
> On Sun, Jan 02, 2011 at 09:43:27PM -0800, John Fastabend wrote:
>> On 12/30/2010 3:37 PM, Jarek Poplawski wrote:
>>> John Fastabend wrote:
>>>> This implements a mclass 'multi-class' queueing discipline that by
>>>> default creates multiple mq qdiscs, one for each traffic class. Each
>>>> mq qdisc then owns a range of queues per the netdev_tc_txq mappings.
>>>
>>> Is it really necessary to add one more abstraction layer for this
>>> functionality, which is probably not used (or even asked for by
>>> users) very often? Why can't mclass simply do these few extra
>>> things itself instead of attaching (and changing) mq?
>>>
>>
>> The statistics work nicely when the mq qdisc is used.
>
> Well, I sometimes add leaf qdiscs only to get class stats with less
> typing, too ;-)
>
>>
>> qdisc mclass 8002: root tc 4 map 0 1 2 3 0 1 2 3 1 1 1 1 1 1 1 1
>> queues:(0:1) (2:3) (4:5) (6:15)
>> Sent 140 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
>> backlog 0b 0p requeues 0
>> qdisc mq 8003: parent 8002:1
>> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>> backlog 0b 0p requeues 0
>> qdisc mq 8004: parent 8002:2
>> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>> backlog 0b 0p requeues 0
>> qdisc mq 8005: parent 8002:3
>> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>> backlog 0b 0p requeues 0
>> qdisc mq 8006: parent 8002:4
>> Sent 140 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
>> backlog 0b 0p requeues 0
>> qdisc sfq 8007: parent 8005:1 limit 127p quantum 1514b
>> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>> backlog 0b 0p requeues 0
>> qdisc sfq 8008: parent 8005:2 limit 127p quantum 1514b
>> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>> backlog 0b 0p requeues 0
>>
>> The mclass qdisc gives the statistics for the interface, and the statistics on each mq qdisc then give the statistics for that traffic class. Also, when the mq qdisc is used with this abstraction, other qdiscs can be grafted onto the queues; sch_sfq is grafted in the example above.
>
> IMHO, these tc offsets and counts simply form a two-level hierarchy
> (classes with leaf subclasses), similar to (or simpler than) other
> classful qdiscs, which manage it all inside one module. Of course,
> we could think of another way of organizing the code, but that
> should rather have been done at the beginning of the scheduler
> design. The mq qdisc broke the design a bit by adding a fake root,
> but I doubt we should go deeper unless it's necessary. Doing mclass
> (or something) as a more complex alternative to mq should be enough.
> Why couldn't mclass graft sch_sfq the same way as mq does?
>
If you also want to graft a scheduler onto a traffic class, you're stuck. Such a qdisc doesn't exist yet, but I would like to have a software implementation of the currently offloaded DCB ETS scheduler. The 802.1Qaz spec allows a different scheduling algorithm to be used on each traffic class, and with the current implementation mclass could graft these scheduling schemes onto each traffic class independently:
                           mclass
                             |
  --------------------------------------------------------
  |      |       |           |     |      |      |       |
mq_tbf mq_tbf mq_ets       mq_ets  mq     mq   mq_wrr  greedy
                 |           |
             ---------   ---------
             |   |   |   |   |   |
            red red red red red red
Perhaps being concerned with hypothetical qdiscs is a bit of a stretch, but I would at least like not to preclude this work from being done in the future.
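To make the hypothetical concrete, the dequeue path of such a software ETS qdisc could look roughly like the sketch below. This is illustrative only; struct ets_sched, its fields, and ets_dequeue are names made up for this sketch, and no such code exists yet:

/* Illustrative sketch only: a DRR-flavored dequeue such as a software
 * 802.1Qaz ETS qdisc might use across the bands of one traffic class.
 * struct ets_sched and all of its fields are invented for this sketch.
 */
struct ets_sched {
	struct Qdisc	**bands;	/* one child qdisc per band */
	u32		*quantum;	/* per-band byte allowance */
	u32		*deficit;	/* credit remaining this round */
	unsigned int	nbands;
	unsigned int	cur;		/* band currently being served */
};

static struct sk_buff *ets_dequeue(struct Qdisc *sch)
{
	struct ets_sched *q = qdisc_priv(sch);
	unsigned int i;

	for (i = 0; i < q->nbands; i++, q->cur = (q->cur + 1) % q->nbands) {
		struct Qdisc *child = q->bands[q->cur];
		struct sk_buff *skb = child->ops->peek(child);

		if (!skb)
			continue;
		if (qdisc_pkt_len(skb) > q->deficit[q->cur]) {
			/* Band is out of credit: replenish it and let
			 * the next band send this round.
			 */
			q->deficit[q->cur] += q->quantum[q->cur];
			continue;
		}
		q->deficit[q->cur] -= qdisc_pkt_len(skb);
		skb = child->dequeue(child);
		if (skb)
			sch->q.qlen--;
		return skb;
	}
	return NULL;
}

The point is just that each traffic class could carry its own scheduling algorithm underneath mclass without the root having to know anything about it.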
>>
>> Although I am not too hung up on this use case, it does seem like a good abstraction to me. Is it strictly necessary, though? No; looking at the class statistics of mclass could also be used to get stats per traffic class.
>
> I am not too hung up on this either, especially if it's OK to others,
> especially to DaveM ;-)
>
>>
>>> ...
>>>> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>>>> index 0af57eb..723ee52 100644
>>>> --- a/include/net/sch_generic.h
>>>> +++ b/include/net/sch_generic.h
>>>> @@ -50,6 +50,7 @@ struct Qdisc {
>>>> #define TCQ_F_INGRESS 4
>>>> #define TCQ_F_CAN_BYPASS 8
>>>> #define TCQ_F_MQROOT 16
>>>> +#define TCQ_F_MQSAFE 32
>>>
>>> If every other qdisc added a flag for qdiscs it likes...
>>>
>>
>> then we run out of bits and get unneeded complexity. I think I will drop the MQSAFE bit completely and let user space catch this. The worst that should happen is that the noop qdisc gets used.
>
> Maybe you're right. On the other hand, usually flags are added for
> more general purposes, and optimal/wrong configs are a matter of
> documentation.
>
So we handle this with documentation.
>>
>>>> @@ -709,7 +709,13 @@ static void attach_default_qdiscs(struct net_device *dev)
>>>> dev->qdisc = txq->qdisc_sleeping;
>>>> atomic_inc(&dev->qdisc->refcnt);
>>>> } else {
>>>> - qdisc = qdisc_create_dflt(txq, &mq_qdisc_ops, TC_H_ROOT);
>>>> + if (dev->num_tc)
>>>
>>> Actually, where this num_tc is expected to be set? I can see it inside
>>> mclass only, with unsetting on destruction, but probably I miss something.
>>
>> Either through mclass, as you noted, or a driver could set num_tc. One of the RFCs I sent out had ixgbe setting num_tc when DCB was enabled.
>
> OK, I probably missed this second possibility in the last version.
>
> ...
>>>> + /* Unwind attributes on failure */
>>>> + u8 unwnd_tc = dev->num_tc;
>>>> + u8 unwnd_map[16];
>>>
>>> [TC_MAX_QUEUE] ?
>>
>> Actually TC_BITMASK+1 is probably more accurate. This array maps the skb priority to a traffic class after the priority is masked with TC_BITMASK.
>>
>>>
>>>> + struct netdev_tc_txq unwnd_txq[16];
>>>> +
>>
>> Although unwnd_txq should be TC_MAX_QUEUE.
> ...
>>>> + /* Always use supplied priority mappings */
>>>> + for (i = 0; i < 16; i++) {
>>>
>>> i < qopt->num_tc ?
>>
>> Nope, TC_BITMASK+1 here. If we only have 4 tcs, for example, we still need to map all 16 priority values to a tc.
>
> OK, anyway, all these '16' should be 'upgraded'.
Yes. I'll do this in the next version.
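Concretely, I expect the unwind path to end up something like this (untested sketch; I'm assuming the prio_tc_map naming from the earlier patches in the series, which may still change):

	/* Unwind attributes on failure. unwnd_map needs an entry for
	 * every masked skb priority, unwnd_txq one for every possible
	 * traffic class.
	 */
	u8 unwnd_tc = dev->num_tc;
	u8 unwnd_map[TC_BITMASK + 1];
	struct netdev_tc_txq unwnd_txq[TC_MAX_QUEUE];
	int i;

	memcpy(unwnd_map, dev->prio_tc_map, sizeof(unwnd_map));
	memcpy(unwnd_txq, dev->tc_to_txq, sizeof(unwnd_txq));

	/* Always use the supplied priority mappings: even with fewer
	 * than 16 traffic classes, all TC_BITMASK+1 priority values
	 * still need to map to some tc.
	 */
	for (i = 0; i < TC_BITMASK + 1; i++)
		netdev_set_prio_tc_map(dev, i, qopt->prio_tc_map[i]);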
Thanks,
John.
>
> Thanks,
> Jarek P.