Message-ID: <CAAmHdhw9bQkCm7uehRZ9mTetMzafdXxWhYj16f8W-YvSz8V4=g@mail.gmail.com>
Date:	Wed, 30 Mar 2016 00:20:03 -0700
From:	Michael Ma <make0818@...il.com>
To:	netdev@...r.kernel.org
Subject: qdisc spin lock

Hi -

I know this might be an old topic, so bear with me. What we are facing
is that applications send small packets from hundreds of threads, and
the contention on the spin lock in __dev_xmit_skb significantly
increases the latency of dev_queue_xmit. We're building a network QoS
solution based on HTB to avoid interference between applications, but
when some applications send massive numbers of small packets in
parallel, the application we want to protect sees its throughput drop
(it does synchronous network communication from multiple threads, so
its throughput is sensitive to the added latency).
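
To make the serialization pattern concrete, here is a tiny user-space
model I put together (purely illustrative, not kernel code;
qdisc_root_lock, NSENDERS etc. are made-up names). It only mimics the
structure: every sender thread takes the same lock for both the
enqueue and the dequeue/transmit step, which is where we contend in
__dev_xmit_skb:

/* toy_root_lock.c -- toy model of a single qdisc root lock shared by
 * all sender threads.  Build with: gcc -O2 -pthread toy_root_lock.c
 * Illustration only, not kernel code. */
#include <pthread.h>
#include <stdio.h>

#define NSENDERS  64          /* "hundreds of threads" in our case */
#define PKTS      100000      /* small packets per sender */

static pthread_spinlock_t qdisc_root_lock;  /* models qdisc_lock(q) */
static long enqueued, transmitted;          /* protected by the lock */

static void *sender(void *arg)
{
    (void)arg;
    for (int i = 0; i < PKTS; i++) {
        /* Every sender serializes here: in __dev_xmit_skb() both the
         * enqueue and the qdisc dequeue run under the per-qdisc root
         * lock, regardless of which class the packet maps to. */
        pthread_spin_lock(&qdisc_root_lock);
        enqueued++;              /* models q->enqueue() */
        transmitted++;           /* models qdisc_run() / dequeue */
        pthread_spin_unlock(&qdisc_root_lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NSENDERS];

    pthread_spin_init(&qdisc_root_lock, PTHREAD_PROCESS_PRIVATE);
    for (int i = 0; i < NSENDERS; i++)
        pthread_create(&tid[i], NULL, sender, NULL);
    for (int i = 0; i < NSENDERS; i++)
        pthread_join(tid[i], NULL);

    printf("enqueued=%ld transmitted=%ld\n", enqueued, transmitted);
    return 0;
}

The point is just the structure: all senders funnel through one lock,
no matter which HTB class their traffic belongs to.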

Here is the profile from perf:

-  67.57%  iperf  [kernel.kallsyms]  [k] _spin_lock
   - 99.94% dev_queue_xmit
        96.91% _spin_lock
      - 2.62% __qdisc_run
         - 98.98% sch_direct_xmit
              99.98% _spin_lock
           1.01% _spin_lock

As far as I understand, the TC design deliberately keeps the locking
scheme simple and minimizes the work done in __qdisc_run so that
throughput isn't affected, especially with large packets. However, if
the classes in the queueing discipline only have shaping limits, there
isn't really any necessary correlation between different classes: the
only synchronization point should be when a packet is dequeued from
the qdisc queue and enqueued to the transmit queue of the device. My
question is: is it worth investing in avoiding the lock contention by
partitioning the queue/lock so that this scenario can be handled with
lower latency?
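
As a strawman for the kind of partitioning I mean (again just an
illustration with made-up names, ignoring the actual HTB shaping
logic): per-class locks cover the per-class queues, and the only
shared lock left guards the hand-off to the device transmit queue,
analogous to the txq lock taken in sch_direct_xmit():

/* toy_partitioned.c -- toy model of the partitioning idea: per-class
 * locks for enqueue/dequeue, device txq lock only at the hand-off.
 * Build: gcc -O2 -pthread toy_partitioned.c   (illustration only) */
#include <pthread.h>
#include <stdio.h>

#define NCLASSES  8
#define NSENDERS  64
#define PKTS      100000

struct toy_class {
    pthread_spinlock_t lock;    /* per-class, so classes don't contend */
    long backlog;
};

static struct toy_class cls[NCLASSES];
static pthread_spinlock_t txq_lock;         /* only shared sync point */
static long transmitted;

static void *sender(void *arg)
{
    int c = (int)(long)arg % NCLASSES;      /* this sender's class */

    for (int i = 0; i < PKTS; i++) {
        /* enqueue: only senders of the same class contend */
        pthread_spin_lock(&cls[c].lock);
        cls[c].backlog++;
        pthread_spin_unlock(&cls[c].lock);

        /* dequeue from the class queue, still class-local */
        pthread_spin_lock(&cls[c].lock);
        cls[c].backlog--;
        pthread_spin_unlock(&cls[c].lock);

        /* hand-off to the device: the single remaining shared lock */
        pthread_spin_lock(&txq_lock);
        transmitted++;
        pthread_spin_unlock(&txq_lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NSENDERS];

    pthread_spin_init(&txq_lock, PTHREAD_PROCESS_PRIVATE);
    for (int i = 0; i < NCLASSES; i++)
        pthread_spin_init(&cls[i].lock, PTHREAD_PROCESS_PRIVATE);
    for (long i = 0; i < NSENDERS; i++)
        pthread_create(&tid[i], NULL, sender, (void *)i);
    for (int i = 0; i < NSENDERS; i++)
        pthread_join(tid[i], NULL);

    printf("transmitted=%ld\n", transmitted);
    return 0;
}

Obviously this glosses over how shared HTB state (borrowing, rate
accounting across classes) would be handled, which may well be the
fundamental part I'm missing.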

I have probably oversimplified a lot of details since I'm not familiar
with the TC implementation at this point. I just want your input on
whether this is a worthwhile effort or whether there is something
fundamental I'm not aware of. If it is just a matter of a fair amount
of additional work, I would also appreciate help outlining what that
work would involve.

I would also appreciate any information about the latest status of the
work described here: http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf

Thanks,
Ke Ma
