[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5a985705-11e5-1575-a049-723accb97608@mellanox.com>
Date: Tue, 20 Dec 2016 08:22:41 +0200
From: Shahar Klein <shahark@...lanox.com>
To: Cong Wang <xiyou.wangcong@...il.com>
CC: <shahark@...lanox.com>, Or Gerlitz <gerlitz.or@...il.com>,
Daniel Borkmann <daniel@...earbox.net>,
Linux Netdev List <netdev@...r.kernel.org>,
Roi Dayan <roid@...lanox.com>,
David Miller <davem@...emloft.net>,
Jiri Pirko <jiri@...lanox.com>,
John Fastabend <john.fastabend@...il.com>,
"Hadar Hen Zion" <hadarh@...lanox.com>
Subject: Re: Soft lockup in tc_classify
On 12/19/2016 7:58 PM, Cong Wang wrote:
> Hello,
>
> On Mon, Dec 19, 2016 at 8:39 AM, Shahar Klein <shahark@...lanox.com> wrote:
>>
>>
>> On 12/13/2016 12:51 AM, Cong Wang wrote:
>>>
>>> On Mon, Dec 12, 2016 at 1:18 PM, Or Gerlitz <gerlitz.or@...il.com> wrote:
>>>>
>>>> On Mon, Dec 12, 2016 at 3:28 PM, Daniel Borkmann <daniel@...earbox.net>
>>>> wrote:
>>>>
>>>>> Note that there's still the RCU fix missing for the deletion race that
>>>>> Cong will still send out, but you say that the only thing you do is to
>>>>> add a single rule, but no other operation in involved during that test?
>>>>
>>>>
>>>> What's missing to have the deletion race fixed? making a patch or
>>>> testing to a patch which was sent?
>>>
>>>
>>> If you think it would help for this problem, here is my patch rebased
>>> on the latest net-next.
>>>
>>> Again, I don't see how it could help this case yet, especially I don't
>>> see how we could have a loop in this singly linked list.
>>>
>>
>> I've applied cong's patch and hit a different lockup(full log attached):
>
>
> Are you sure this is really different? For me, it is still inside the loop
> in tc_classify(), with only a slightly different offset.
>
>
>>
>> Daniel suggested I'll add a print:
>> case RTM_DELTFILTER:
>> - err = tp->ops->delete(tp, fh);
>> + printk(KERN_ERR "DEBUGG:SK %s:%d\n", __func__, __LINE__);
>> + err = tp->ops->delete(tp, fh, &last);
>> if (err == 0) {
>>
>> and I couldn't see this print in the output.....
>
> Hmm, that is odd, if this never prints, then my patch should not make any
> difference.
>
> There are still two other cases where we could change tp->next, so do you
> mind to add two more printk's for debugging?
>
> Attached is the delta patch.
>
> Thanks!
>
I've added a slightly different debug print:
@@ -368,11 +375,12 @@ static int tc_ctl_tfilter(struct sk_buff *skb,
struct nlmsghdr *n)
if (tp_created) {
RCU_INIT_POINTER(tp->next,
rtnl_dereference(*back));
rcu_assign_pointer(*back, tp);
+ printk(KERN_ERR "DEBUGG:SK add/change filter by: %pf
tp=%p tp->next=%p\n", tp->ops->get, tp, tp->next);
}
tfilter_notify(net, skb, n, tp, fh, RTM_NEWTFILTER, false);
full output attached:
[ 283.290271] Mirror/redirect action on
[ 283.305031] DEBUGG:SK add/change filter by: fl_get [cls_flower]
tp=ffff9432d704df60 tp->next= (null)
[ 283.322563] DEBUGG:SK add/change filter by: fl_get [cls_flower]
tp=ffff9436e718d240 tp->next= (null)
[ 283.359997] GACT probability on
[ 283.365923] DEBUGG:SK add/change filter by: fl_get [cls_flower]
tp=ffff9436e718d3c0 tp->next=ffff9436e718d240
[ 283.378725] DEBUGG:SK add/change filter by: fl_get [cls_flower]
tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
[ 283.391310] DEBUGG:SK add/change filter by: fl_get [cls_flower]
tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
[ 283.403923] DEBUGG:SK add/change filter by: fl_get [cls_flower]
tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
[ 283.416542] DEBUGG:SK add/change filter by: fl_get [cls_flower]
tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
[ 308.538571] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s!
[swapper/0:0]
Thanks
Shahar
View attachment "tp_p_debug.log" of type "text/plain" (18147 bytes)
Powered by blists - more mailing lists