Message-ID: <88bc8a03-da44-fc15-f032-fe5cb592958b@candelatech.com>
Date:   Wed, 29 Sep 2021 12:07:04 -0700
From:   Ben Greear <greearb@...delatech.com>
To:     Eric Dumazet <eric.dumazet@...il.com>,
        netdev <netdev@...r.kernel.org>
Subject: Re: 5.15-rc3+ crash in fq-codel?

On 9/28/21 4:25 PM, Eric Dumazet wrote:
> 
> 
> On 9/28/21 3:00 PM, Ben Greear wrote:
>> On 9/27/21 5:16 PM, Ben Greear wrote:
>>> On 9/27/21 5:04 PM, Ben Greear wrote:
>>>> On 9/27/21 4:49 PM, Eric Dumazet wrote:
>>>>>
>>>>>
>>>>> On 9/27/21 4:30 PM, Ben Greear wrote:
>>>>>> Hello,
>>>>>>
>>>>>> In a hacked upon kernel, I'm getting crashes in fq-codel when doing bi-directional
>>>>>> pktgen traffic on top of mac-vlans.  Unfortunately for me, I've made big changes to
>>>>>> pktgen so I cannot easily run this test on stock kernels, and there is some chance
>>>>>> some of my hackings have caused this issue.
>>>>>>
>>>>>> But, in case others have seen similar, please let me know.  I shall go digging
>>>>>> in the meantime...
>>>>>>
>>>>>> Looks to me like 'skb' is NULL in line 120 below.
>>>>>
>>>>>
>>>>> pktgen must not be used in a mode where a single skb
>>>>> is cloned and reused, if packet needs to be stored in a qdisc.
>>>>>
>>>>> Qdiscs of all sorts assume skb->next/prev can be used as
>>>>> anchors in their lists.
>>>>>
>>>>> If the same skb is queued multiple times, lists are corrupted.
>>>>>
>>>>> Please double check your clone_skb pktgen setup.
>>>>>
>>>>> I thought we had IFF_TX_SKB_SHARING for this, and that macvlan was properly clearing this bit.
>>>>
>>>> My pktgen config was not using any duplicated queueing in this case.
>>>>
>>>> I changed to pfifo fast and so far it is stable for ~10 minutes, where before it would crash
>>>> within a minute.  I'll let it bake overnight....
>>>
>>> Still running stable.  I also notice we have been using fq-codel for a while and haven't noticed
>>> this problem (next most recent kernel we might have run similar test on would be 5.13-ish).
>>>
>>> I'll duplicate this test on our older kernels tomorrow to see if it looks like a regression or
>>> if we just haven't actually done this exact test in a while...
>>
>> We can reproduce this crash as far back as 5.4 using fq-codel, with our pktgen driving mac-vlans.
>> We did not try any kernels older than 5.4.
>> We cannot reproduce with pfifo on 5.15-rc3 on an overnight run.
>> We cannot reproduce with user-space UDP traffic on any kernel/qdisc combination.
>> Our pktgen is configured for multi-skb of 0 (no multiple submits of the same skb).
>>
>> While looking briefly at fq-codel, I didn't notice any locking in the code that crashed.
>> Any chance that it makes assumptions that would be incorrect with pktgen running multiple
>> threads (one thread per mac-vlan) on top of a single qdisc belonging to the underlying NIC?
>>
> 
> 
> Qdiscs are protected by a qdisc spinlock.
> 
> fq-codel does not have to lock anything in its enqueue() and dequeue() methods.
> 
> I guess your local changes to pktgen might be to blame.
> 
> pfifo is much simpler than fq-codel; it uses fewer fields from skb.

I looked through my pktgen, and the skb creation and setup code looks pretty
similar to upstream pktgen.

I also added this debugging code:

[greearb@...-dt4 linux-5.15.dev.y]$ git diff
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index bb0cd6d3d2c2..56e22106e19d 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -165,6 +165,11 @@ static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets,
         len = 0;
         i = 0;
         do {
+               if (!flow->head) {
+                       pr_err("fq-codel-drop: idx: %d maxbacklog: %d  threshold: %d max_packets: %d len: %d i: %d\n",
+                              idx, maxbacklog, threshold, max_packets, len, i);
+                       BUG_ON(1);
+               }
                 skb = dequeue_head(flow);
                 len += qdisc_pkt_len(skb);
                 mem += get_codel_cb(skb)->mem_usage;

The printout I see when this hits is:


fq-codel-drop: idx: 955 maxbacklog: 7756222  threshold: 3878111 max_packets: 64 len: 93868 i: 62
kernel BUG at net/sched/sch_fq_codel.c:171!
.....

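So flow 955 ran out of packets after 62 drops (93868 bytes), even though its backlog
counter claimed 7756222 bytes.  I guess this means that the backlog byte counter is
out of sync with the packet queue somehow?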
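
To illustrate what I mean, here is a rough user-space sketch of the drop loop.
It is not kernel code: struct pkt, struct flow, and drop_half() are made-up
stand-ins for sk_buff, the fq-codel flow, and the loop in fq_codel_drop(), and
the backlog is passed in by hand so it can disagree with the list on purpose.
The loop keeps dequeuing until half of the recorded backlog is gone, so if the
counter is larger than what is really queued, the list runs empty first.

```c
#include <stddef.h>

/* Hypothetical stand-ins for sk_buff and the per-flow queue; not kernel types. */
struct pkt {
	struct pkt *next;
	unsigned int len;
};

struct flow {
	struct pkt *head;
	struct pkt *tail;
};

/* Append a packet to the tail of the flow's singly linked list. */
static void flow_enqueue(struct flow *f, struct pkt *p)
{
	p->next = NULL;
	if (f->head)
		f->tail->next = p;
	else
		f->head = p;
	f->tail = p;
}

/* Remove and return the head packet, or NULL if the flow is empty. */
static struct pkt *flow_dequeue(struct flow *f)
{
	struct pkt *p = f->head;

	if (p) {
		f->head = p->next;
		p->next = NULL;
	}
	return p;
}

/*
 * Drop packets until at least half of the *recorded* backlog has been
 * dropped or max_packets is reached.  Returns the number of packets
 * dropped, or -1 if the list ran empty first, which is the point where
 * a loop without a NULL check would dereference a NULL packet.
 */
static int drop_half(struct flow *f, unsigned int backlog,
		     unsigned int max_packets, unsigned int *dropped_bytes)
{
	unsigned int threshold = backlog >> 1;
	unsigned int len = 0;
	unsigned int i = 0;

	do {
		struct pkt *p = flow_dequeue(f);

		if (!p) {
			*dropped_bytes = len;
			return -1;
		}
		len += p->len;
	} while (++i < max_packets && len < threshold);

	*dropped_bytes = len;
	return (int)i;
}
```

With a backlog value that matches the queued bytes, drop_half() stops at the
threshold.  With an inflated backlog value it returns -1 after draining the
list, which matches the pattern in the printout above: the threshold is far
above the number of bytes actually dequeued when the head goes NULL.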

Any suggestions for what kinds of issues in pktgen could cause this?

Thanks,
Ben

-- 
Ben Greear <greearb@...delatech.com>
Candela Technologies Inc  http://www.candelatech.com
