Date:   Tue, 28 Sep 2021 15:00:51 -0700
From:   Ben Greear <greearb@...delatech.com>
To:     Eric Dumazet <eric.dumazet@...il.com>,
        netdev <netdev@...r.kernel.org>
Subject: Re: 5.15-rc3+ crash in fq-codel?

On 9/27/21 5:16 PM, Ben Greear wrote:
> On 9/27/21 5:04 PM, Ben Greear wrote:
>> On 9/27/21 4:49 PM, Eric Dumazet wrote:
>>>
>>>
>>> On 9/27/21 4:30 PM, Ben Greear wrote:
>>>> Hello,
>>>>
>>>> In a hacked upon kernel, I'm getting crashes in fq-codel when doing bi-directional
>>>> pktgen traffic on top of mac-vlans.  Unfortunately for me, I've made big changes to
>>>> pktgen so I cannot easily run this test on stock kernels, and there is some chance
>>>> some of my hackings have caused this issue.
>>>>
>>>> But, in case others have seen similar, please let me know.  I shall go digging
>>>> in the meantime...
>>>>
>>>> Looks to me like 'skb' is NULL in line 120 below.
>>>
>>>
>>> pktgen must not be used in a mode where a single skb
>>> is cloned and reused, if packet needs to be stored in a qdisc.
>>>
>>> qdisc of all sorts assume skb->next/prev can be used as
>>> anchor in their list.
>>>
>>> If the same skb is queued multiple times, lists are corrupted.
>>>
>>> Please double check your clone_skb pktgen setup.
>>>
>>> I thought we had IFF_TX_SKB_SHARING for this, and that macvlan was properly clearing this bit.
>>
>> My pktgen config was not using any duplicated queueing in this case.
>>
>> I changed to pfifo fast and so far it is stable for ~10 minutes, where before it would crash
>> within a minute.  I'll let it bake overnight....
> 
> Still running stable.  I also notice we have been using fq-codel for a while and haven't noticed
> this problem (next most recent kernel we might have run similar test on would be 5.13-ish).
> 
> I'll duplicate this test on our older kernels tomorrow to see if it looks like a regression or
> if we just haven't actually done this exact test in a while...

We can reproduce this crash as far back as 5.4 using fq-codel, with our pktgen driving mac-vlans.
We did not try any kernels older than 5.4.
We cannot reproduce with pfifo on 5.15-rc3 on an overnight run.
We cannot reproduce with user-space UDP traffic on any kernel/qdisc combination.
Our pktgen is configured for multi-skb of 0 (no multiple submits of the same skb).
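
For reference, the failure mode Eric described above (the same skb queued twice on a list threaded through skb->next) can be sketched in user-space C. This is only an illustration under stated assumptions: `struct pkt`, `struct flow`, `flow_add`, `flow_pop` and `demo_reachable_after_double_enqueue` are hypothetical names, not the fq-codel code, and with multi-skb of 0 this should not be what our config does.

```c
#include <stddef.h>

/* Hypothetical stand-in for an skb: the queue link lives inside the packet. */
struct pkt {
    struct pkt *next;
    int id;
};

/* Hypothetical per-flow FIFO that threads packets through pkt->next. */
struct flow {
    struct pkt *head;
    struct pkt *tail;
    int qlen; /* what the queue believes it holds */
};

/* Append to the tail; assumes p is not already on any list. */
static void flow_add(struct flow *f, struct pkt *p)
{
    if (f->head == NULL)
        f->head = p;
    else
        f->tail->next = p;
    f->tail = p;
    p->next = NULL; /* clobbers the old link if p was already queued */
    f->qlen++;
}

/* Pop from the head; returns NULL when nothing is reachable. */
static struct pkt *flow_pop(struct flow *f)
{
    struct pkt *p = f->head;

    if (p == NULL)
        return NULL;
    f->head = p->next;
    p->next = NULL;
    return p;
}

/* Queue a, b, then a again; return how many packets can still be popped. */
static int demo_reachable_after_double_enqueue(void)
{
    struct pkt a = { NULL, 1 };
    struct pkt b = { NULL, 2 };
    struct flow f = { NULL, NULL, 0 };
    int reachable = 0;

    flow_add(&f, &a);
    flow_add(&f, &b);
    flow_add(&f, &a); /* same packet queued a second time */

    while (flow_pop(&f) != NULL)
        reachable++;

    return reachable; /* f.qlen is 3 at this point */
}
```

In this sketch the second add of `a` resets `a.next` to NULL, so `b` is no longer reachable from the head while the accounting still says three packets are queued. A dequeue path that trusts the accounting and skips the NULL check would then dereference a NULL head.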

While looking briefly at fq-codel, I didn't notice any locking in the code that crashed.
Any chance that it makes assumptions that would be incorrect with pktgen running multiple
threads (one thread per mac-vlan) on top of a single qdisc belonging to the underlying NIC?

Thanks,
Ben
