Message-ID: <vbf1s6c5mlz.fsf@mellanox.com>
Date: Thu, 20 Dec 2018 13:55:25 +0000
From: Vlad Buslov <vladbu@...lanox.com>
To: Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>
CC: Jiri Pirko <jiri@...nulli.us>,
Linux Kernel Network Developers <netdev@...r.kernel.org>,
Jamal Hadi Salim <jhs@...atatu.com>,
David Miller <davem@...emloft.net>,
Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>
Subject: Re: [PATCH net-next v2 01/17] net: sched: refactor mini_qdisc_pair_swap() to use workqueue
On Wed 19 Dec 2018 at 04:27, Cong Wang <xiyou.wangcong@...il.com> wrote:
> On Mon, Dec 17, 2018 at 2:30 AM Jiri Pirko <jiri@...nulli.us> wrote:
>>
>> Sun, Dec 16, 2018 at 07:52:18PM CET, xiyou.wangcong@...il.com wrote:
>> >On Sun, Dec 16, 2018 at 8:32 AM Vlad Buslov <vladbu@...lanox.com> wrote:
>> >>
>> >> On Thu 13 Dec 2018 at 23:32, Cong Wang <xiyou.wangcong@...il.com> wrote:
>> >> > On Tue, Dec 11, 2018 at 2:19 AM Vlad Buslov <vladbu@...lanox.com> wrote:
>> >> >>
>> >> >> As a part of the effort to remove dependency on rtnl lock, cls API is being
>> >> >> converted to use fine-grained locking mechanisms instead of global rtnl
>> >> >> lock. However, chain_head_change callback for ingress Qdisc is a sleeping
>> >> >> function and cannot be executed while holding a spinlock.
>> >> >
>> >> >
>> >> > Why does it have to be a spinlock not a mutex?
>> >> >
>> >> > I've read your cover letter and this changelog, I don't find any
>> >> > answer.
>> >>
>> >> My initial implementation used mutex. However, it was changed to
>> >> spinlock by Jiri's request during internal review.
>> >>
>> >
>> >So what's the answer to my question? :)
>>
>> Yeah, my concern against mutexes was that we would need one per
>> every block and every chain. I find that quite heavy and believe it
>> is better to use spinlocks in those cases. This patch is a side
>> effect of that. Do you think it would be better to have mutexes
>> instead of spinlocks?
>
> My only concern with spinlocks is that we have to go async as we
> can't block. This is almost always error-prone, especially
> when dealing with tcf block. I had to give up on a spinlock
> for idrinfo->lock; please take a look at:
>
> commit 95278ddaa15cfa23e4a06ee9ed7b6ee0197c500b
> Author: Cong Wang <xiyou.wangcong@...il.com>
> Date: Tue Oct 2 12:50:19 2018 -0700
>
> net_sched: convert idrinfo->lock from spinlock to a mutex
>
>
> There are indeed some cases in the kernel where we take multiple
> mutexes, for example:
>
> /*
> * bd_mutex locking:
> *
> * mutex_lock(part->bd_mutex)
> * mutex_lock_nested(whole->bd_mutex, 1)
> */
>
> So, how heavy are they compared with spinlocks?
>
> Thanks.
Hi Cong,

I did a quick-and-dirty mutex-based implementation of my cls API
patchset (removed the async miniqp refactoring and changed the block
and chain lock types to mutexes) and ran some benchmarks to measure the
exact impact of mutexes on cls API performance (both the rule and chain
APIs).
1. Parallel flower insertion rate (create 5 million flower filters on
the same chain/prio with 10 tc instances) - same performance:

SPINLOCK
$ time ls add_sw* | xargs -n 1 -P 100 sudo tc -force -b
real 0m12.183s
user 0m35.013s
sys 1m24.947s

MUTEX
$ time ls add_sw* | xargs -n 1 -P 100 sudo tc -force -b
real 0m12.246s
user 0m35.417s
sys 1m25.330s
2. Parallel flower insertion rate (create 100k flower filters, each on
a new instance of chain/tp, 10 tc instances) - the mutex implementation
is ~17% slower:

SPINLOCK:
$ time ls add_sw* | xargs -n 1 -P 100 sudo tc -force -b
real 2m19.285s
user 0m1.863s
sys 5m41.157s

MUTEX:
$ time ls add_sw* | xargs -n 1 -P 100 sudo tc -force -b
real 2m43.495s
user 0m1.664s
sys 8m32.483s
3. Chain insertion rate with the cls chain API (create 100k empty
chains, 1 tc instance) - ~3% difference:

SPINLOCK:
$ time sudo tc -b add_chains
real 0m46.822s
user 0m0.239s
sys 0m46.437s

MUTEX:
$ time sudo tc -b add_chains
real 0m48.600s
user 0m0.230s
sys 0m48.227s
The only case where performance is significantly impacted by the mutex
is the second test. This happens because chain lookup is a linear
search that is performed while holding the highly contended block lock.
The perf profile for the mutex version confirms this:
+ 91.83% 0.00% tc [kernel.vmlinux] [k] entry_SYSCALL_64_after_hwframe
+ 91.83% 0.00% tc [kernel.vmlinux] [k] do_syscall_64
+ 91.80% 0.00% tc libc-2.25.so [.] __libc_sendmsg
+ 91.78% 0.00% tc [kernel.vmlinux] [k] __sys_sendmsg
+ 91.77% 0.00% tc [kernel.vmlinux] [k] ___sys_sendmsg
+ 91.77% 0.00% tc [kernel.vmlinux] [k] sock_sendmsg
+ 91.77% 0.00% tc [kernel.vmlinux] [k] netlink_sendmsg
+ 91.77% 0.01% tc [kernel.vmlinux] [k] netlink_unicast
+ 91.77% 0.00% tc [kernel.vmlinux] [k] netlink_rcv_skb
+ 91.71% 0.01% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
+ 91.69% 0.03% tc [kernel.vmlinux] [k] tc_new_tfilter
+ 90.96% 26.45% tc [kernel.vmlinux] [k] __tcf_chain_get
+ 64.30% 0.01% tc [kernel.vmlinux] [k] __mutex_lock.isra.7
+ 39.36% 39.18% tc [kernel.vmlinux] [k] osq_lock
+ 24.92% 24.81% tc [kernel.vmlinux] [k] mutex_spin_on_owner
Based on these results we can conclude that use-cases with a
significant number of chains on a single block will be affected by
using mutexes in the cls API.
Regards,
Vlad