Message-ID: <20160331211852.2d228976@redhat.com>
Date: Thu, 31 Mar 2016 21:18:52 +0200
From: Jesper Dangaard Brouer <brouer@...hat.com>
To: Michael Ma <make0818@...il.com>
Cc: brouer@...hat.com, netdev@...r.kernel.org
Subject: Re: qdisc spin lock
On Wed, 30 Mar 2016 00:20:03 -0700 Michael Ma <make0818@...il.com> wrote:
> I know this might be an old topic so bear with me – what we are facing
> is that applications are sending small packets using hundreds of
> threads so the contention on spin lock in __dev_xmit_skb increases the
> latency of dev_queue_xmit significantly. We’re building a network QoS
> solution to avoid interference of different applications using HTB.
Yes, as you have noticed with HTB there is a single qdisc lock, and
contention obviously happens :-)
It is possible with different tricks to make it scale. I believe
Google is using a variant of HTB, and it scales for them. They have
not open-sourced their modifications to HTB (which likely also involves
a great deal of setup tricks).
If your purpose is to limit traffic/bandwidth per "cloud" instance,
then you can use a different TC setup structure, such as MQ with one
HTB qdisc assigned per MQ queue (where the MQ queues are bound to each
CPU/HW queue)... But you have to figure out this setup yourself...
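A minimal sketch of what such an MQ-plus-HTB layout could look like, written as a small Python helper that only builds the tc command lines and does not run them. The device name, queue count, per-queue rate, and the handle numbering (11:, 12:, ...) are assumptions for illustration, not taken from this thread. Steering traffic onto particular queues (for example via XPS) is not covered.

```python
def mq_htb_commands(dev, num_queues, rate_per_queue):
    """Build tc command lines for an MQ root with one HTB qdisc per MQ queue.

    Hypothetical helper: nothing is executed, the strings are only returned.
    """
    if num_queues < 1:
        raise ValueError("num_queues must be at least 1")

    # MQ root: exposes one class (1:1, 1:2, ...) per hardware TX queue.
    cmds = [f"tc qdisc replace dev {dev} root handle 1: mq"]

    for q in range(1, num_queues + 1):
        # tc parses handles and class IDs as hexadecimal numbers.
        mq_class = format(q, "x")
        child = format(0x10 + q, "x")  # assumed numbering: 11:, 12:, ...
        # Each MQ queue gets its own HTB instance, hence its own qdisc lock.
        cmds.append(
            f"tc qdisc add dev {dev} parent 1:{mq_class} "
            f"handle {child}: htb default 1"
        )
        # One shaping class per HTB instance; the rate is an assumed value.
        cmds.append(
            f"tc class add dev {dev} parent {child}: "
            f"classid {child}:1 htb rate {rate_per_queue}"
        )
    return cmds


if __name__ == "__main__":
    for line in mq_htb_commands("eth0", 4, "250mbit"):
        print(line)
```

Note that a rate set this way applies per queue, so it is not by itself an aggregate limit across all queues of the device.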
> But in this case when some applications send massive small packets in
> parallel, the application to be protected will get its throughput
> affected (because it’s doing synchronous network communication using
> multiple threads and throughput is sensitive to the increased latency)
>
> Here is the profiling from perf:
>
> - 67.57% iperf [kernel.kallsyms] [k] _spin_lock
>    - 99.94% dev_queue_xmit
>       - 96.91% _spin_lock
>       - 2.62% __qdisc_run
>          - 98.98% sch_direct_xmit
>             - 99.98% _spin_lock
>
> As far as I understand, the design of TC is to simplify the locking scheme
> and minimize the work in __qdisc_run so that throughput won’t be
> affected, especially with large packets. However if the scenario is
> that multiple classes in the queueing discipline only have the shaping
> limit, there isn’t really a necessary correlation between different
> classes. The only synchronization point should be when the packet is
> dequeued from the qdisc queue and enqueued to the transmit queue of
> the device. My question is – is it worth investing in avoiding the
> locking contention by partitioning the queue/lock so that this
> scenario is addressed with relatively smaller latency?
Yes, there is a lot to gain, but it is not easy ;-)
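To make the idea of partitioning the queue/lock concrete, here is a userspace toy model, not kernel code. The class name, the modulo mapping from class ID to partition, and the use of Python threads are all assumptions for illustration. It only shows that senders in different classes stop sharing a lock; the final shared step of handing packets to the device transmit queue is left out.

```python
import threading
from collections import deque


class PartitionedQueue:
    """Toy model: one FIFO and one lock per partition, not one global lock."""

    def __init__(self, num_partitions):
        if num_partitions < 1:
            raise ValueError("num_partitions must be at least 1")
        self._queues = [deque() for _ in range(num_partitions)]
        self._locks = [threading.Lock() for _ in range(num_partitions)]

    def _index(self, class_id):
        # Assumed mapping: traffic class ID modulo the partition count.
        return class_id % len(self._queues)

    def enqueue(self, class_id, packet):
        i = self._index(class_id)
        # Only senders that map to the same partition contend here.
        with self._locks[i]:
            self._queues[i].append(packet)

    def dequeue(self, class_id):
        i = self._index(class_id)
        with self._locks[i]:
            return self._queues[i].popleft() if self._queues[i] else None


if __name__ == "__main__":
    pq = PartitionedQueue(4)
    workers = [
        threading.Thread(
            target=lambda c=c: [pq.enqueue(c, n) for n in range(1000)]
        )
        for c in range(4)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(pq.dequeue(0), pq.dequeue(1))
```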
> I must have oversimplified a lot of details since I’m not familiar
> with the TC implementation at this point – just want to get your input
> in terms of whether this is a worthwhile effort or there is something
> fundamental that I’m not aware of. If this is just a matter of quite
> some additional work, would also appreciate help outlining the
> required work here.
>
> Also would appreciate if there is any information about the latest
> status of this work http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf
This article seems to be very low quality... spelling errors, only 5
pages, no real code, etc.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer