Message-ID: <20190613141821.GA2242@nanopsycho>
Date: Thu, 13 Jun 2019 16:18:21 +0200
From: Jiri Pirko <jiri@...nulli.us>
To: Vlad Buslov <vladbu@...lanox.com>
Cc: "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"pablo@...filter.org" <pablo@...filter.org>,
"xiyou.wangcong@...il.com" <xiyou.wangcong@...il.com>,
"jhs@...atatu.com" <jhs@...atatu.com>, mlxsw <mlxsw@...lanox.com>,
Alex Kushnarov <alexanderk@...lanox.com>,
"pabeni@...hat.com" <pabeni@...hat.com>
Subject: Re: tc tp creation performance degradation since kernel 5.1
Thu, Jun 13, 2019 at 01:26:17PM CEST, vladbu@...lanox.com wrote:
>
>On Thu 13 Jun 2019 at 14:11, Jiri Pirko <jiri@...nulli.us> wrote:
>> Thu, Jun 13, 2019 at 12:09:32PM CEST, vladbu@...lanox.com wrote:
>>>On Thu 13 Jun 2019 at 11:11, Jiri Pirko <jiri@...nulli.us> wrote:
>>>> I made a mistake during measurements, sorry about that.
>>>>
>>>> This is the correct script:
>>>> -----------------------------------------------------------------------
>>>> #!/bin/bash
>>>>
>>>> dev=testdummy
>>>> ip link add name $dev type dummy
>>>> ip link set dev $dev up
>>>> tc qdisc add dev $dev ingress
>>>>
>>>> tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp")
>>>> pref_id=1
>>>>
>>>> while [ $pref_id -lt 20000 ]
>>>> do
>>>> echo "filter add dev $dev ingress proto ip pref $pref_id flower action drop" >> $tmp_file_name
>>>> #echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name
>>>> ((pref_id++))
>>>> done
>>>>
>>>> start=$(date +"%s")
>>>> tc -b $tmp_file_name
>>>> stop=$(date +"%s")
>>>> echo "Insertion duration: $(($stop - $start)) sec"
>>>> rm -f $tmp_file_name
>>>>
>>>> ip link del dev $dev
>>>> -----------------------------------------------------------------------
>>>>
>>>> Note the commented-out matchall. I don't see the regression with
>>>> matchall. However, I do see it with flower:
>>>> kernel 5.1
>>>> Insertion duration: 4 sec
>>>> kernel 5.2
>>>> Insertion duration: 163 sec
>>>>
>>>> I don't see any significant difference in perf:
>>>> kernel 5.1
>>>> 77.24% tc [kernel.vmlinux] [k] tcf_chain_tp_find
>>>> 1.67% tc [kernel.vmlinux] [k] mutex_spin_on_owner
>>>> 1.44% tc [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
>>>> 0.93% tc [kernel.vmlinux] [k] idr_get_free
>>>> 0.79% tc_pref_scale_o [kernel.vmlinux] [k] do_syscall_64
>>>> 0.69% tc [kernel.vmlinux] [k] finish_task_switch
>>>> 0.53% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
>>>> 0.49% tc [kernel.vmlinux] [k] __memset
>>>> 0.36% tc_pref_scale_o libc-2.28.so [.] malloc
>>>> 0.30% tc_pref_scale_o libc-2.28.so [.] _int_free
>>>> 0.24% tc [kernel.vmlinux] [k] __memcpy
>>>> 0.23% tc [cls_flower] [k] fl_change
>>>> 0.23% tc [kernel.vmlinux] [k] __nla_validate_parse
>>>> 0.22% tc [kernel.vmlinux] [k] __slab_alloc
>>>>
>>>> kernel 5.2
>>>> 75.57% tc [kernel.kallsyms] [k] tcf_chain_tp_find
>>>> 2.70% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
>>>> 1.13% tc_pref_scale_o [kernel.kallsyms] [k] do_syscall_64
>>>> 0.87% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
>>>> 0.86% ip [kernel.kallsyms] [k] finish_task_switch
>>>> 0.67% tc [kernel.kallsyms] [k] memset
>>>> 0.63% tc [kernel.kallsyms] [k] mutex_spin_on_owner
>>>> 0.52% tc_pref_scale_o libc-2.28.so [.] malloc
>>>> 0.48% tc [kernel.kallsyms] [k] idr_get_free
>>>> 0.46% tc [kernel.kallsyms] [k] fl_change
>>>> 0.42% tc_pref_scale_o libc-2.28.so [.] _int_free
>>>> 0.35% tc_pref_scale_o libc-2.28.so [.] __GI___strlen_sse2
>>>> 0.35% tc_pref_scale_o libc-2.28.so [.] __mbrtowc
>>>> 0.34% tc_pref_scale_o libc-2.28.so [.] __fcntl64_nocancel_adjusted
>>>>
>>>> Any ideas?
>>>
>>>Thanks for providing the reproduction script!
>>>
>>>I've investigated the problem and found the root cause. First of all, I
>>>noticed that CPU utilization during the problematic tc run is quite low
>>>(<10%), so I decided to investigate why tc sleeps so much. I used bcc
>>>and obtained the following off-CPU trace (uninteresting traces are
>>>omitted for brevity):
>>>
>>>~$ sudo /usr/share/bcc/tools/offcputime -K -p `pgrep -nx tc`
>>>Tracing off-CPU time (us) of PID 2069 by kernel stack... Hit Ctrl-C to end.
>>>...
>>> finish_task_switch
>>> __sched_text_start
>>> schedule
>>> schedule_timeout
>>> wait_for_completion
>>> __wait_rcu_gp
>>> synchronize_rcu
>>> fl_change
>>> tc_new_tfilter
>>> rtnetlink_rcv_msg
>>> netlink_rcv_skb
>>> netlink_unicast
>>> netlink_sendmsg
>>> sock_sendmsg
>>> ___sys_sendmsg
>>> __sys_sendmsg
>>> do_syscall_64
>>> entry_SYSCALL_64_after_hwframe
>>> - tc (2069)
>>> 142284953
>>>
>>>As you can see, 142 seconds are spent sleeping in synchronize_rcu().
>>>The offending code is in the fl_create_new_mask() function:
>>>
>>> err = rhashtable_replace_fast(&head->ht, &mask->ht_node,
>>> &newmask->ht_node, mask_ht_params);
>>> if (err)
>>> goto errout_destroy;
>>>
>>> /* Wait until any potential concurrent users of mask are finished */
>>> synchronize_rcu();
>>>
>>>The justification for this is described in a comment in
>>>fl_check_assign_mask() (a user of fl_create_new_mask()):
>>>
>>> /* Insert mask as temporary node to prevent concurrent creation of mask
>>> * with same key. Any concurrent lookups with same key will return
>>> * -EAGAIN because mask's refcnt is zero. It is safe to insert
>>> * stack-allocated 'mask' to masks hash table because we call
>>> * synchronize_rcu() before returning from this function (either in case
>>> * of error or after replacing it with heap-allocated mask in
>>> * fl_create_new_mask()).
>>> */
>>> fnew->mask = rhashtable_lookup_get_insert_fast(&head->ht,
>>> &mask->ht_node,
>>> mask_ht_params);
>>>
>>>The offending commit is part of my series that implements unlocked
>>>flower: 195c234d15c9 ("net: sched: flower: handle concurrent mask
>>>insertion")
>>>
>>>The justification presented in it is no longer relevant since Ivan
>>>Vecera changed the mask to be dynamically allocated in commit
>>>2cddd2014782 ("net/sched: cls_flower: allocate mask dynamically in
>>>fl_change()"). With this, we can just change fl_change() to deallocate
>>>the temporary mask after an rcu grace period and remove the offending
>>>synchronize_rcu() call.
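>>>
>>>As a rough sketch of the idea (untested, just to illustrate), the tail
>>>of fl_create_new_mask() would simply lose the barrier:
>>>
>>>	err = rhashtable_replace_fast(&head->ht, &mask->ht_node,
>>>				      &newmask->ht_node, mask_ht_params);
>>>	if (err)
>>>		goto errout_destroy;
>>>
>>>	/* No synchronize_rcu() here anymore: any concurrent users of the
>>>	 * temporary mask are instead handled by freeing it only after an
>>>	 * rcu grace period from fl_change().
>>>	 */
>>>	return newmask;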
>>>
>>>Any other suggestions?
>>
>> So basically you just change synchronize_rcu() to kfree_rcu(mask),
>> correct?
>
>Not really. I remove synchronize_rcu() and change all kfree(mask) calls
>in fl_change() to tcf_queue_work(&mask->rwork, fl_mask_free_work), which
>uses queue_rcu_work() internally. This allows us to deallocate
>fl_flow_mask in the same manner on all code paths and doesn't require
>any extensions to the fl_flow_mask struct (kfree_rcu() would require
>extending it with an rcu_head).
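>
>As a minimal sketch (reusing the existing fl_mask_free_work() helper and
>the rwork member it already operates on), every place in fl_change()
>that currently does kfree(mask) would instead do:
>
>	/* Free the temporary mask only after an rcu grace period, so any
>	 * concurrent lookups that found it in the masks hashtable are
>	 * guaranteed to be done with it.
>	 */
>	tcf_queue_work(&mask->rwork, fl_mask_free_work);
>
>Since fl_flow_mask already embeds the rcu_work that fl_mask_free_work()
>operates on, no struct changes are needed.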
Got it. Makes sense to me. Thanks!