Message-ID: <vbfblz123vt.fsf@mellanox.com>
Date: Thu, 13 Jun 2019 10:09:32 +0000
From: Vlad Buslov <vladbu@...lanox.com>
To: Jiri Pirko <jiri@...nulli.us>
CC: "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Vlad Buslov <vladbu@...lanox.com>,
"pablo@...filter.org" <pablo@...filter.org>,
"xiyou.wangcong@...il.com" <xiyou.wangcong@...il.com>,
"jhs@...atatu.com" <jhs@...atatu.com>, mlxsw <mlxsw@...lanox.com>,
Alex Kushnarov <alexanderk@...lanox.com>,
"pabeni@...hat.com" <pabeni@...hat.com>
Subject: Re: tc tp creation performance degradation since kernel 5.1
On Thu 13 Jun 2019 at 11:11, Jiri Pirko <jiri@...nulli.us> wrote:
> I made a mistake during measurements, sorry about that.
>
> This is the correct script:
> -----------------------------------------------------------------------
> #!/bin/bash
>
> dev=testdummy
> ip link add name $dev type dummy
> ip link set dev $dev up
> tc qdisc add dev $dev ingress
>
> tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp")
> pref_id=1
>
> while [ $pref_id -lt 20000 ]
> do
> echo "filter add dev $dev ingress proto ip pref $pref_id flower action drop" >> $tmp_file_name
> #echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name
> ((pref_id++))
> done
>
> start=$(date +"%s")
> tc -b $tmp_file_name
> stop=$(date +"%s")
> echo "Insertion duration: $(($stop - $start)) sec"
> rm -f $tmp_file_name
>
> ip link del dev $dev
> -----------------------------------------------------------------------
>
> Note the commented-out matchall. I don't see the regression with
> matchall. However, I do see it with flower:
> kernel 5.1
> Insertion duration: 4 sec
> kernel 5.2
> Insertion duration: 163 sec
>
> I don't see any significant difference in perf:
> kernel 5.1
> 77.24% tc [kernel.vmlinux] [k] tcf_chain_tp_find
> 1.67% tc [kernel.vmlinux] [k] mutex_spin_on_owner
> 1.44% tc [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
> 0.93% tc [kernel.vmlinux] [k] idr_get_free
> 0.79% tc_pref_scale_o [kernel.vmlinux] [k] do_syscall_64
> 0.69% tc [kernel.vmlinux] [k] finish_task_switch
> 0.53% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
> 0.49% tc [kernel.vmlinux] [k] __memset
> 0.36% tc_pref_scale_o libc-2.28.so [.] malloc
> 0.30% tc_pref_scale_o libc-2.28.so [.] _int_free
> 0.24% tc [kernel.vmlinux] [k] __memcpy
> 0.23% tc [cls_flower] [k] fl_change
> 0.23% tc [kernel.vmlinux] [k] __nla_validate_parse
> 0.22% tc [kernel.vmlinux] [k] __slab_alloc
>
>
> kernel 5.2
> 75.57% tc [kernel.kallsyms] [k] tcf_chain_tp_find
> 2.70% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
> 1.13% tc_pref_scale_o [kernel.kallsyms] [k] do_syscall_64
> 0.87% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
> 0.86% ip [kernel.kallsyms] [k] finish_task_switch
> 0.67% tc [kernel.kallsyms] [k] memset
> 0.63% tc [kernel.kallsyms] [k] mutex_spin_on_owner
> 0.52% tc_pref_scale_o libc-2.28.so [.] malloc
> 0.48% tc [kernel.kallsyms] [k] idr_get_free
> 0.46% tc [kernel.kallsyms] [k] fl_change
> 0.42% tc_pref_scale_o libc-2.28.so [.] _int_free
> 0.35% tc_pref_scale_o libc-2.28.so [.] __GI___strlen_sse2
> 0.35% tc_pref_scale_o libc-2.28.so [.] __mbrtowc
> 0.34% tc_pref_scale_o libc-2.28.so [.] __fcntl64_nocancel_adjusted
>
> Any ideas?
Thanks for providing the reproduction script!

I've investigated the problem and found the root cause. First of all, I
noticed that CPU utilization during the problematic tc run is quite low
(<10%), so I decided to investigate why tc sleeps so much. I used bcc
and obtained the following off-CPU trace (uninteresting traces are
omitted for brevity):
~$ sudo /usr/share/bcc/tools/offcputime -K -p `pgrep -nx tc`
Tracing off-CPU time (us) of PID 2069 by kernel stack... Hit Ctrl-C to end.
...
finish_task_switch
__sched_text_start
schedule
schedule_timeout
wait_for_completion
__wait_rcu_gp
synchronize_rcu
fl_change
tc_new_tfilter
rtnetlink_rcv_msg
netlink_rcv_skb
netlink_unicast
netlink_sendmsg
sock_sendmsg
___sys_sendmsg
__sys_sendmsg
do_syscall_64
entry_SYSCALL_64_after_hwframe
- tc (2069)
142284953
As you can see, 142 seconds (142284953 us, almost the whole 163 second
run) are spent sleeping in synchronize_rcu(). Each filter in the script
uses a unique pref, so every insertion instantiates a new flower tp with
its own masks hashtable and pays one grace period (~7 ms here) when
creating its mask. The call is in the fl_create_new_mask() function:
	err = rhashtable_replace_fast(&head->ht, &mask->ht_node,
				      &newmask->ht_node, mask_ht_params);
	if (err)
		goto errout_destroy;

	/* Wait until any potential concurrent users of mask are finished */
	synchronize_rcu();
The justification for this is described in a comment in
fl_check_assign_mask() (the caller of fl_create_new_mask()):
	/* Insert mask as temporary node to prevent concurrent creation of mask
	 * with same key. Any concurrent lookups with same key will return
	 * -EAGAIN because mask's refcnt is zero. It is safe to insert
	 * stack-allocated 'mask' to masks hash table because we call
	 * synchronize_rcu() before returning from this function (either in case
	 * of error or after replacing it with heap-allocated mask in
	 * fl_create_new_mask()).
	 */
	fnew->mask = rhashtable_lookup_get_insert_fast(&head->ht,
						       &mask->ht_node,
						       mask_ht_params);
The offending commit is part of my series that implements unlocked
flower: 195c234d15c9 ("net: sched: flower: handle concurrent mask
insertion")
The justification presented in it is no longer relevant since Ivan
Vecera changed the mask to be dynamically allocated in commit
2cddd2014782 ("net/sched: cls_flower: allocate mask dynamically in
fl_change()"). With this we can simply change fl_change() to free the
temporary mask after an RCU grace period and remove the offending
synchronize_rcu() call.
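Something along these lines (a rough, untested sketch; it assumes we can
reuse the rwork member that struct fl_flow_mask already has for deferred
mask freeing, together with the existing tcf_queue_work() helper):

/* Free the temporary mask from a workqueue after an RCU grace period,
 * so any concurrent readers that found it in the masks hashtable are
 * guaranteed to be done with it before it is freed.
 */
static void fl_uninit_mask_free_work(struct work_struct *work)
{
	struct fl_flow_mask *mask = container_of(to_rcu_work(work),
						 struct fl_flow_mask, rwork);

	kfree(mask);
}

Then fl_change() (and its error paths) would call

	tcf_queue_work(&mask->rwork, fl_uninit_mask_free_work);

instead of freeing the temporary mask directly.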
Any other suggestions?