netdev - Re: 5.15-rc3+ crash in fq-codel?

Message-ID: <f3f1378d-6839-cd23-9e2c-4668947c2345@gmail.com>
Date:   Wed, 29 Sep 2021 16:28:44 -0700
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Ben Greear <greearb@...delatech.com>,
        netdev <netdev@...r.kernel.org>
Subject: Re: 5.15-rc3+ crash in fq-codel?



On 9/29/21 4:21 PM, Eric Dumazet wrote:
> 
> 
> On 9/29/21 12:07 PM, Ben Greear wrote:
>> On 9/28/21 4:25 PM, Eric Dumazet wrote:
>>>
>>>
>>> On 9/28/21 3:00 PM, Ben Greear wrote:
>>>> On 9/27/21 5:16 PM, Ben Greear wrote:
>>>>> On 9/27/21 5:04 PM, Ben Greear wrote:
>>>>>> On 9/27/21 4:49 PM, Eric Dumazet wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 9/27/21 4:30 PM, Ben Greear wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> In a hacked upon kernel, I'm getting crashes in fq-codel when doing bi-directional
>>>>>>>> pktgen traffic on top of mac-vlans.  Unfortunately for me, I've made big changes to
>>>>>>>> pktgen so I cannot easily run this test on stock kernels, and there is some chance
>>>>>>>> some of my hackings have caused this issue.
>>>>>>>>
>>>>>>>> But, in case others have seen similar, please let me know.  I shall go digging
>>>>>>>> in the meantime...
>>>>>>>>
>>>>>>>> Looks to me like 'skb' is NULL in line 120 below.
>>>>>>>
>>>>>>>
>>>>>>> pktgen must not be used in a mode where a single skb
>>>>>>> is cloned and reused, if packet needs to be stored in a qdisc.
>>>>>>>
>>>>>>> qdisc of all sorts assume skb->next/prev can be used as
>>>>>>> anchor in their list.
>>>>>>>
>>>>>>> If the same skb is queued multiple times, lists are corrupted.
>>>>>>>
>>>>>>> Please double check your clone_skb pktgen setup.
>>>>>>>
>>>>>>> I thought we had IFF_TX_SKB_SHARING for this, and that macvlan was properly clearing this bit.
>>>>>>
>>>>>> My pktgen config was not using any duplicated queueing in this case.
>>>>>>
>>>>>> I changed to pfifo fast and so far it is stable for ~10 minutes, where before it would crash
>>>>>> within a minute.  I'll let it bake overnight....
>>>>>
>>>>> Still running stable.  I also notice we have been using fq-codel for a while and haven't noticed
>>>>> this problem (next most recent kernel we might have run similar test on would be 5.13-ish).
>>>>>
>>>>> I'll duplicate this test on our older kernels tomorrow to see if it looks like a regression or
>>>>> if we just haven't actually done this exact test in a while...
>>>>
>>>> We can reproduce this crash as far back as 5.4 using fq-codel, with our pktgen driving mac-vlans.
>>>> We did not try any kernels older than 5.4.
>>>> We cannot reproduce with pfifo on 5.15-rc3 on an overnight run.
>>>> We cannot reproduce with user-space UDP traffic on any kernel/qdisc combination.
>>>> Our pktgen is configured for multi-skb of 0 (no multiple submits of the same skb)
>>>>
>>>> While looking briefly at fq-codel, I didn't notice any locking in the code that crashed.
>>>> Any chance that it makes assumptions that would be incorrect with pktgen running multiple
>>>> threads (one thread per mac-vlan) on top of a single qdisc belonging to the underlying NIC?
>>>>
>>>
>>>
>>> qdisc are protected by a qdisc spinlock.
>>>
>>> fq-codel does not have to lock anything in its enqueue() and dequeue() methods.
>>>
>>> I guess your local changes to pktgen might be to blame.
>>>
>>> pfifo is much simpler than fq-codel; it uses fewer fields from skb.
>>
>> I looked through my pktgen, and the skb creation and setup code looks pretty
>> similar to upstream pktgen.
>>
>> I also added this debugging code:
>>
>> [greearb@...-dt4 linux-5.15.dev.y]$ git diff
>> diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
>> index bb0cd6d3d2c2..56e22106e19d 100644
>> --- a/net/sched/sch_fq_codel.c
>> +++ b/net/sched/sch_fq_codel.c
>> @@ -165,6 +165,11 @@ static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets,
>>         len = 0;
>>         i = 0;
>>         do {
>> +               if (!flow->head) {
>> +                       pr_err("fq-codel-drop: idx: %d maxbacklog: %d  threshold: %d max_packets: %d len: %d i: %d\n",
>> +                              idx, maxbacklog, threshold, max_packets, len, i);
>> +                       BUG_ON(1);
>> +               }
>>                 skb = dequeue_head(flow);
>>                 len += qdisc_pkt_len(skb);
>>                 mem += get_codel_cb(skb)->mem_usage;
>>
>> The printout I see when this hits is:
>>
>>
>> fq-codel-drop: idx: 955 maxbacklog: 7756222  threshold: 3878111 max_packets: 64 len: 93868 i: 62
>> kernel BUG at net/sched/sch_fq_codel.c:171!
>> .....
>>
>> So, I guess this means that the backlog byte counter is out of sync with the packet queue somehow?
>>
>> Any suggestions for what kinds of issues in pktgen could cause this?
> 
> Modifications to skbs after they were queued to the qdisc.
> 
> qdisc_pkt_len(skb) uses skb->cb[] storage. Make sure to not use it.
> 
> 

Actually, the bug seems to be in how pktgen handles NET_XMIT_CN.

You would probably hit the same issue with any other qdisc that also returns NET_XMIT_CN.
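
To illustrate the NET_XMIT_CN point, here is a minimal user-space sketch, not the actual pktgen code or fix. The three return-code values and net_xmit_eval() mirror include/linux/netdevice.h and are redefined so the snippet builds outside the kernel. struct tx_stats, skb_may_be_queued() and account_xmit() are hypothetical names made up for this example. The assumption, as I understand the qdisc convention, is that NET_XMIT_CN signals congestion (some packet was dropped) while the skb just submitted may still be sitting in the queue, so a sender should not treat it as a plain failure and reuse or resubmit that skb.

```c
#include <assert.h>
#include <stddef.h>

/* Values mirror include/linux/netdevice.h; redefined here only so
 * that this sketch compiles in user space. */
#define NET_XMIT_SUCCESS 0x00
#define NET_XMIT_DROP    0x01	/* skb dropped */
#define NET_XMIT_CN      0x02	/* congestion notification */

/* Same mapping as the kernel macro: CN is not reported as an error. */
#define net_xmit_eval(e) ((e) == NET_XMIT_CN ? 0 : (e))

/* Hypothetical per-device counters, for illustration only. */
struct tx_stats {
	unsigned long queued;	/* skb handed off, qdisc may still hold it */
	unsigned long errors;	/* skb definitely not queued */
};

/* Hypothetical helper: returns 1 when the qdisc may still reference
 * the skb, i.e. the sender must not touch or resubmit it. */
static int skb_may_be_queued(int ret)
{
	return net_xmit_eval(ret) == 0;
}

/* Hypothetical accounting: NET_XMIT_CN is counted like a successful
 * hand-off, because the skb may be linked into a qdisc list. */
static void account_xmit(struct tx_stats *st, int ret)
{
	assert(st != NULL);
	if (skb_may_be_queued(ret))
		st->queued++;
	else
		st->errors++;
}
```

A caller would pass the return value of its transmit call to account_xmit() and only consider the skb free for reuse when skb_may_be_queued() returns 0 (and even then only if it holds its own reference). Whether this matches what the modified pktgen does is something only the actual code can tell.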