Message-ID: <7f1d67f1-3a2c-2e74-bb86-c02a56370526@gmail.com>
Date: Tue, 28 Sep 2021 16:25:39 -0700
From: Eric Dumazet <eric.dumazet@...il.com>
To: Ben Greear <greearb@...delatech.com>,
Eric Dumazet <eric.dumazet@...il.com>,
netdev <netdev@...r.kernel.org>
Subject: Re: 5.15-rc3+ crash in fq-codel?
On 9/28/21 3:00 PM, Ben Greear wrote:
> On 9/27/21 5:16 PM, Ben Greear wrote:
>> On 9/27/21 5:04 PM, Ben Greear wrote:
>>> On 9/27/21 4:49 PM, Eric Dumazet wrote:
>>>>
>>>>
>>>> On 9/27/21 4:30 PM, Ben Greear wrote:
>>>>> Hello,
>>>>>
>>>>> In a hacked upon kernel, I'm getting crashes in fq-codel when doing bi-directional
>>>>> pktgen traffic on top of mac-vlans. Unfortunately for me, I've made big changes to
>>>>> pktgen so I cannot easily run this test on stock kernels, and there is some chance
>>>>> some of my hackings have caused this issue.
>>>>>
>>>>> But, in case others have seen similar, please let me know. I shall go digging
>>>>> in the meantime...
>>>>>
>>>>> Looks to me like 'skb' is NULL in line 120 below.
>>>>
>>>>
>>>> pktgen must not be used in a mode where a single skb
>>>> is cloned and reused, if packet needs to be stored in a qdisc.
>>>>
>>>> Qdiscs of all sorts assume skb->next/prev can be used as
>>>> anchors in their lists.
>>>>
>>>> If the same skb is queued multiple times, lists are corrupted.
>>>>
>>>> Please double check your clone_skb pktgen setup.
>>>>
>>>> I thought we had IFF_TX_SKB_SHARING for this, and that macvlan was properly clearing this bit.
>>>
>>> My pktgen config was not using any duplicated queueing in this case.
>>>
>>> I changed to pfifo fast and so far it is stable for ~10 minutes, where before it would crash
>>> within a minute. I'll let it bake overnight....
>>
>> Still running stable. I also notice we have been using fq-codel for a while and haven't noticed
>> this problem (next most recent kernel we might have run similar test on would be 5.13-ish).
>>
>> I'll duplicate this test on our older kernels tomorrow to see if it looks like a regression or
>> if we just haven't actually done this exact test in a while...
>
> We can reproduce this crash as far back as 5.4 using fq-codel, with our pktgen driving mac-vlans.
> We did not try any kernels older than 5.4.
> We cannot reproduce with pfifo on 5.15-rc3 on an overnight run.
> We cannot reproduce with user-space UDP traffic on any kernel/qdisc combination.
> Our pktgen is configured for multi-skb of 0 (no multiple submits of the same skb).
>
> While looking briefly at fq-codel, I didn't notice any locking in the code that crashed.
> Any chance that it makes assumptions that would be incorrect with pktgen running multiple
> threads (one thread per mac-vlan) on top of a single qdisc belonging to the underlying NIC?
>
Qdiscs are protected by a qdisc spinlock, so fq-codel does not have to
lock anything itself in its enqueue() and dequeue() methods.
I guess your local changes to pktgen might be to blame.
pfifo is much simpler than fq-codel; it uses fewer fields from the skb.
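
To illustrate the earlier point in this thread that queueing the same skb
twice corrupts a qdisc list, here is a minimal user-space sketch. It is not
kernel code: struct pkt, struct fifo and the fifo_* helpers are hypothetical
stand-ins for an skb and a head/tail FIFO whose link lives inside the packet
itself (as skb->next does). It only assumes the queue keeps a separate length
counter alongside the list.

```c
#include <stddef.h>

/* Hypothetical stand-in for an skb: the node carries its own link. */
struct pkt {
    struct pkt *next;
    int id;
};

/* Hypothetical head/tail FIFO with a separately tracked length. */
struct fifo {
    struct pkt *head;
    struct pkt *tail;
    unsigned int qlen;
};

/* Append p at the tail. Correct only if p is not already queued. */
void fifo_enqueue(struct fifo *q, struct pkt *p)
{
    if (q->head == NULL)
        q->head = p;
    else
        q->tail->next = p;
    q->tail = p;
    p->next = NULL;   /* clobbers the existing link if p is already queued */
    q->qlen++;
}

/* Remove and return the head packet, or NULL if the list is empty. */
struct pkt *fifo_dequeue(struct fifo *q)
{
    struct pkt *p = q->head;

    if (p == NULL)
        return NULL;
    q->head = p->next;
    p->next = NULL;
    q->qlen--;
    return p;
}

/* Count the packets actually reachable from the head. */
unsigned int fifo_walk(const struct fifo *q)
{
    unsigned int n = 0;
    const struct pkt *p;

    for (p = q->head; p != NULL; p = p->next)
        n++;
    return n;
}

/*
 * Enqueue a, b, then a again. Stores the length counter in *qlen_out
 * and returns the number of packets still reachable from the head.
 */
unsigned int demo_double_enqueue(unsigned int *qlen_out)
{
    struct pkt a = { NULL, 1 }, b = { NULL, 2 };
    struct fifo q = { NULL, NULL, 0 };

    fifo_enqueue(&q, &a);
    fifo_enqueue(&q, &b);
    fifo_enqueue(&q, &a);   /* same packet queued a second time */

    *qlen_out = q.qlen;
    return fifo_walk(&q);
}
```

In this sketch the counter and the list disagree after the second enqueue of
the same packet: the counter has been bumped three times, but resetting the
head packet's link cuts the chain right after it. A consumer that trusts the
counter will eventually get NULL back from fifo_dequeue(), which is the kind
of symptom a NULL skb in a dequeue path would be consistent with. Whether that
is what happens in the reported crash is not established here.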