[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <vbfef9fvzc3.fsf@mellanox.com>
Date: Mon, 14 Jan 2019 19:00:48 +0000
From: Vlad Buslov <vladbu@...lanox.com>
To: Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>
CC: Jiri Pirko <jiri@...nulli.us>,
Linux Kernel Network Developers <netdev@...r.kernel.org>,
Jamal Hadi Salim <jhs@...atatu.com>,
David Miller <davem@...emloft.net>,
Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>
Subject: Re: [PATCH net-next v2 01/17] net: sched: refactor
mini_qdisc_pair_swap() to use workqueue
On Thu 20 Dec 2018 at 13:55, Vlad Buslov <vladbu@...lanox.com> wrote:
> On Wed 19 Dec 2018 at 04:27, Cong Wang <xiyou.wangcong@...il.com> wrote:
>> On Mon, Dec 17, 2018 at 2:30 AM Jiri Pirko <jiri@...nulli.us> wrote:
>>>
>>> Sun, Dec 16, 2018 at 07:52:18PM CET, xiyou.wangcong@...il.com wrote:
>>> >On Sun, Dec 16, 2018 at 8:32 AM Vlad Buslov <vladbu@...lanox.com> wrote:
>>> >>
>>> >> On Thu 13 Dec 2018 at 23:32, Cong Wang <xiyou.wangcong@...il.com> wrote:
>>> >> > On Tue, Dec 11, 2018 at 2:19 AM Vlad Buslov <vladbu@...lanox.com> wrote:
>>> >> >>
>>> >> >> As a part of the effort to remove dependency on rtnl lock, cls API is being
>>> >> >> converted to use fine-grained locking mechanisms instead of global rtnl
>>> >> >> lock. However, chain_head_change callback for ingress Qdisc is a sleeping
>>> >> >> function and cannot be executed while holding a spinlock.
>>> >> >
>>> >> >
>>> >> > Why does it have to be a spinlock not a mutex?
>>> >> >
>>> >> > I've read your cover letter and this changelog, I don't find any
>>> >> > answer.
>>> >>
>>> >> My initial implementation used mutex. However, it was changed to
>>> >> spinlock by Jiri's request during internal review.
>>> >>
>>> >
>>> >So what's the answer to my question? :)
>>>
>>> Yeah, my concern agains mutexes was that it would be needed to have one
>>> per every block and per every chain. I find it quite heavy and I believe
>>> it is better to use spinlock in those cases. This patch is a side effect
>>> of that. Do you think it would be better to have mutexes instead of
>>> spinlocks?
>>
>> My only concern with spinlock is we have to go async as we
>> can't block. This is almost always error-prone especially
>> when dealing with tcf block. I had to give up with spinlock
>> for idrinfo->lock, please take a look at:
>>
>> commit 95278ddaa15cfa23e4a06ee9ed7b6ee0197c500b
>> Author: Cong Wang <xiyou.wangcong@...il.com>
>> Date: Tue Oct 2 12:50:19 2018 -0700
>>
>> net_sched: convert idrinfo->lock from spinlock to a mutex
>>
>>
>> There are indeed some cases in kernel we do take multiple
>> mutex'es, for example,
>>
>> /*
>> * bd_mutex locking:
>> *
>> * mutex_lock(part->bd_mutex)
>> * mutex_lock_nested(whole->bd_mutex, 1)
>> */
>>
>> So, how heavy are they comparing with spinlocks?
>>
>> Thanks.
>
> Hi Cong,
>
> I did quick-and-dirty mutex-based implementation of my cls API patchset
> (removed async miniqp refactoring, changed block and chain lock types to
> mutex lock) and performed some benchmarks to determine exact mutex
> impact on cls API performance (both rule and chains API).
>
> 1. Parallel flower insertion rate (create 5 million flower filters on
> same chain/prio with 10 tc instances) - same performance:
>
> SPINLOCK
> $ time ls add_sw* | xargs -n 1 -P 100 sudo tc -force -b
>
> real 0m12.183s
> user 0m35.013s
> sys 1m24.947s
>
> MUTEX
> $ time ls add_sw* | xargs -n 1 -P 100 sudo tc -force -b
>
> real 0m12.246s
> user 0m35.417s
> sys 1m25.330s
>
> 2. Parallel flower insertion rate (create 100k flower filters, each on
> new instance of chain/tp, 10 tc instances) - mutex implementation is
> ~17% slower:
>
> SPINLOCK:
> $ time ls add_sw* | xargs -n 1 -P 100 sudo tc -force -b
>
> real 2m19.285s
> user 0m1.863s
> sys 5m41.157s
>
> MUTEX:
> multichain$ time ls add_sw* | xargs -n 1 -P 100 sudo tc -force -b
>
> real 2m43.495s
> user 0m1.664s
> sys 8m32.483s
>
> 3. Chains insertion rate with cls chain API (create 100k empty chains, 1
> tc instance) - ~3% difference:
>
> SPINLOCK:
> $ time sudo tc -b add_chains
>
> real 0m46.822s
> user 0m0.239s
> sys 0m46.437s
>
> MUTEX:
> $ time sudo tc -b add_chains
>
> real 0m48.600s
> user 0m0.230s
> sys 0m48.227s
>
>
> Only case where performance is significantly impacted by mutex is second
> test. This happens because chain lookup is a linear search that is
> performed while holding highly contested block lock. Perf profile for
> mutex version confirms this:
>
> + 91.83% 0.00% tc [kernel.vmlinux] [k] entry_SYSCALL_64_after_hwframe
> + 91.83% 0.00% tc [kernel.vmlinux] [k] do_syscall_64
> + 91.80% 0.00% tc libc-2.25.so [.] __libc_sendmsg
> + 91.78% 0.00% tc [kernel.vmlinux] [k] __sys_sendmsg
> + 91.77% 0.00% tc [kernel.vmlinux] [k] ___sys_sendmsg
> + 91.77% 0.00% tc [kernel.vmlinux] [k] sock_sendmsg
> + 91.77% 0.00% tc [kernel.vmlinux] [k] netlink_sendmsg
> + 91.77% 0.01% tc [kernel.vmlinux] [k] netlink_unicast
> + 91.77% 0.00% tc [kernel.vmlinux] [k] netlink_rcv_skb
> + 91.71% 0.01% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
> + 91.69% 0.03% tc [kernel.vmlinux] [k] tc_new_tfilter
> + 90.96% 26.45% tc [kernel.vmlinux] [k] __tcf_chain_get
> + 64.30% 0.01% tc [kernel.vmlinux] [k] __mutex_lock.isra.7
> + 39.36% 39.18% tc [kernel.vmlinux] [k] osq_lock
> + 24.92% 24.81% tc [kernel.vmlinux] [k] mutex_spin_on_owner
>
> Based on these results we can conclude that use-cases with significant
> amount of chains on single block will be affected by using mutex in cls
> API.
>
> Regards,
> Vlad
Hi Cong,
Any comments regarding benchmark results? Are you okay with
spinlock-based implementation?
Thanks,
Vlad
Powered by blists - more mailing lists