netdev - Re: [PATCH] net: netem: fix skb length BUG_ON in __skb_to

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <05fa74b9-6e3b-7bd3-fa1c-c02e37a521f8@huawei.com>
Date:   Tue, 26 Feb 2019 21:02:10 +0800
From:   Sheng Lan <lansheng@...wei.com>
To:     Stephen Hemminger <stephen@...workplumber.org>
CC:     <davem@...emloft.net>, <netdev@...r.kernel.org>,
        <netem@...ts.linux-foundation.org>, <xuhanbing@...wei.com>,
        <zhengshaoyu@...wei.com>, <jiqin.ji@...wei.com>,
        <liuzhiqiang26@...wei.com>, <yuehaibing@...wei.com>
Subject: Re: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec




> On Mon, 25 Feb 2019 22:49:39 +0800
> Sheng Lan <lansheng@...wei.com> wrote:
> 
>> From: Sheng Lan <lansheng@...wei.com>
>> Subject: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec
>>
>> It can be reproduced by following steps:
>> 1. virtio_net NIC is configured with gso/tso on
>> 2. configure nginx as http server with an index file bigger than 1M bytes
>> 3. use tc netem to produce duplicate packets and delay:
>>    tc qdisc add dev eth0 root netem delay 100ms 10ms 30% duplicate 90%
>> 4. continually curl the nginx http server to get index file on client
>> 5. BUG_ON is seen quickly
>>
>> [10258690.371129] kernel BUG at net/core/skbuff.c:4028!
>> [10258690.371748] invalid opcode: 0000 [#1] SMP PTI
>> [10258690.372094] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G        W         5.0.0-rc6 #2
>> [10258690.372094] RSP: 0018:ffffa05797b43da0 EFLAGS: 00010202
>> [10258690.372094] RBP: 00000000000005ea R08: 0000000000000000 R09: 00000000000005ea
>> [10258690.372094] R10: ffffa0579334d800 R11: 00000000000002c0 R12: 0000000000000002
>> [10258690.372094] R13: 0000000000000000 R14: ffffa05793122900 R15: ffffa0578f7cb028
>> [10258690.372094] FS:  0000000000000000(0000) GS:ffffa05797b40000(0000) knlGS:0000000000000000
>> [10258690.372094] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [10258690.372094] CR2: 00007f1a6dc00868 CR3: 000000001000e000 CR4: 00000000000006e0
>> [10258690.372094] Call Trace:
>> [10258690.372094]  <IRQ>
>> [10258690.372094]  skb_to_sgvec+0x11/0x40
>> [10258690.372094]  start_xmit+0x38c/0x520 [virtio_net]
>> [10258690.372094]  dev_hard_start_xmit+0x9b/0x200
>> [10258690.372094]  sch_direct_xmit+0xff/0x260
>> [10258690.372094]  __qdisc_run+0x15e/0x4e0
>> [10258690.372094]  net_tx_action+0x137/0x210
>> [10258690.372094]  __do_softirq+0xd6/0x2a9
>> [10258690.372094]  irq_exit+0xde/0xf0
>> [10258690.372094]  smp_apic_timer_interrupt+0x74/0x140
>> [10258690.372094]  apic_timer_interrupt+0xf/0x20
>> [10258690.372094]  </IRQ>
>>
>> In __skb_to_sgvec, the skb->len is not equal to the sum of the skb's
>> linear data size and nonlinear data size, thus BUG_ON triggered. The
>> bad skb's nonlinear data size is less than skb->data_len, because the
>> skb is cloned and a part of related cloned skb's nonlinear data is
>> split off.
>>
>> Duplicate packet is cloned by skb_clone in netem_enqueue and may be delayed
>> some time in qdisc. Due to the delay time, the original skb will be pushed
>> again later in __tcp_push_pending_frames when tcp receives new packets.
>> In tcp_write_xmit, when the tcp_mss_split_point returns a smaller limit,
>> the original skb will be fragmented and the skb's nonlinear data will be
>> split off. The length of the skb cloned by netem will not be updated.
>> When we use virtio_net NIC, the duplicated cloned skb will be filled into
>> a scatter-gather list in __skb_to_sgvec and trigger the BUG_ON.
>>
>> Here I replace the skb_clone with skb_copy in netem_enqueue to ensure
>> the duplicated skb's nonlinear data is independent.
>>
>> Signed-off-by: Sheng Lan <lansheng@...wei.com>
>> Reported-by: Qin Ji <jiqin.ji@...wei.com>
>>
>> Fixes: 0afb51e7 ("netem: reinsert for duplication")
> 
> This sounds like a bug in the other layers (either TCP or Virtio net)
> not handling a cloned skb properly.
> 

I have traced the route of skb by printk, let me take an example to describe the problem to make it clearly:
Mss value equals to 1448. Limit value is the split size when tcp do tso_fragment, is depending on the size of the sending congestion window and mss value.

TCP layer transmit the index file to client, the original skb1 size is large:
...
tcp_write_xmit            (skb1->data_len == 62264, limit == 2*mss == 2896)
tso_fragment              (it needs to be fragmented by limit value)
skb_split                 (after split, skb1->data_len == 2896, skb_shinfo(skb1)->frags[0] == 2896, skb_shinfo(skb1)->nr_frags == 1)
...
netem_enqueue             (netem construct a duplicate packet of skb1 by skb_clone)
skb2 = skb_clone(skb1)    (skb1->data_len == skb2->data_len == 2896, skb1 and skb2 share the nonlinear data frags[0] == 2896)
waiting 30ms              (skb1 and skb2 will be delayed in qdisc queue due to the netem delay configuration)


TCP layer receives new packets and trys to retransmit the skb1:
tcp_rcv_established
__tcp_push_pending_frames
tcp_write_xmit            (skb1->data_len == 2896, cwnd size decreased or packets in flight increased, cause the limit decreased to 1*mss == 1448)
tso_fragment              (limit value is less than skb1->data_len, skb1 will be fragmented again)
skb_split                 (the second time split, skb1 is cloned now and share nonlinear data with skb2.
                           skb1->data_len == 1448, frags[0] == 1448, a part of shared nonlinear data has been split off)

Now we can see:
skb1->data_len == 1448
skb2->data_len == 2896    (2896 is wrong value)
frags[0] == 1448


The route of skb2:
netem_enqueue             (netem construct a duplicate packet, skb2 = skb_clone(skb1), put skb2 into queue)
waiting 30ms              (delayed packet)
...
netem_dequeue             (skb2->data_len == 2896, frags[0] == 1448)
sch_direct_xmit
dev_hard_start_xmit
xmit_one
start_xmit                [virtio_net driver]
skb_to_sgvec              (BUG_ON here )


The key is that skb be split by skb_split in tso_fragment again after netem clone it and share nonlinear data with another skb.
Processing of TCP seems OK, which push and fragment delayed packets in write queue. And virtio_net is the trigger of the BUG_ON.
So I replaced skb_clone with skb_copy in netem_enqueue, and the method worked. Currently, I have no better idea to fix it,
would you give me some inspiring advice ? If I am wrong, please correct me.

Thanks