[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <375ee6d3-ef86-c653-5d8d-df48cea8b3ba@gmail.com>
Date: Tue, 26 Feb 2019 07:52:00 -0800
From: Eric Dumazet <eric.dumazet@...il.com>
To: Sheng Lan <lansheng@...wei.com>,
Stephen Hemminger <stephen@...workplumber.org>
Cc: davem@...emloft.net, netdev@...r.kernel.org,
netem@...ts.linux-foundation.org, xuhanbing@...wei.com,
zhengshaoyu@...wei.com, jiqin.ji@...wei.com,
liuzhiqiang26@...wei.com, yuehaibing@...wei.com
Subject: Re: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec
On 02/26/2019 05:02 AM, Sheng Lan wrote:
>
>
>
>> On Mon, 25 Feb 2019 22:49:39 +0800
>> Sheng Lan <lansheng@...wei.com> wrote:
>>
>>> From: Sheng Lan <lansheng@...wei.com>
>>> Subject: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec
>>>
>>> It can be reproduced by following steps:
>>> 1. virtio_net NIC is configured with gso/tso on
>>> 2. configure nginx as http server with an index file bigger than 1M bytes
>>> 3. use tc netem to produce duplicate packets and delay:
>>> tc qdisc add dev eth0 root netem delay 100ms 10ms 30% duplicate 90%
>>> 4. continually curl the nginx http server to get index file on client
>>> 5. BUG_ON is seen quickly
>>>
>>> [10258690.371129] kernel BUG at net/core/skbuff.c:4028!
>>> [10258690.371748] invalid opcode: 0000 [#1] SMP PTI
>>> [10258690.372094] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G W 5.0.0-rc6 #2
>>> [10258690.372094] RSP: 0018:ffffa05797b43da0 EFLAGS: 00010202
>>> [10258690.372094] RBP: 00000000000005ea R08: 0000000000000000 R09: 00000000000005ea
>>> [10258690.372094] R10: ffffa0579334d800 R11: 00000000000002c0 R12: 0000000000000002
>>> [10258690.372094] R13: 0000000000000000 R14: ffffa05793122900 R15: ffffa0578f7cb028
>>> [10258690.372094] FS: 0000000000000000(0000) GS:ffffa05797b40000(0000) knlGS:0000000000000000
>>> [10258690.372094] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [10258690.372094] CR2: 00007f1a6dc00868 CR3: 000000001000e000 CR4: 00000000000006e0
>>> [10258690.372094] Call Trace:
>>> [10258690.372094] <IRQ>
>>> [10258690.372094] skb_to_sgvec+0x11/0x40
>>> [10258690.372094] start_xmit+0x38c/0x520 [virtio_net]
>>> [10258690.372094] dev_hard_start_xmit+0x9b/0x200
>>> [10258690.372094] sch_direct_xmit+0xff/0x260
>>> [10258690.372094] __qdisc_run+0x15e/0x4e0
>>> [10258690.372094] net_tx_action+0x137/0x210
>>> [10258690.372094] __do_softirq+0xd6/0x2a9
>>> [10258690.372094] irq_exit+0xde/0xf0
>>> [10258690.372094] smp_apic_timer_interrupt+0x74/0x140
>>> [10258690.372094] apic_timer_interrupt+0xf/0x20
>>> [10258690.372094] </IRQ>
>>>
>>> In __skb_to_sgvec, the skb->len is not equal to the sum of the skb's
>>> linear data size and nonlinear data size, thus BUG_ON triggered. The
>>> bad skb's nonlinear data size is less than skb->data_len, because the
>>> skb is cloned and a part of related cloned skb's nonlinear data is
>>> split off.
>>>
>>> Duplicate packet is cloned by skb_clone in netem_enqueue and may be delayed
>>> some time in qdisc. Due to the delay time, the original skb will be pushed
>>> again later in __tcp_push_pending_frames when tcp receives new packets.
>>> In tcp_write_xmit, when the tcp_mss_split_point returns a smaller limit,
>>> the original skb will be fragmented and the skb's nonlinear data will be
>>> split off. The length of the skb cloned by netem will not be updated.
>>> When we use virtio_net NIC, the duplicated cloned skb will be filled into
>>> a scatter-gather list in __skb_to_sgvec and trigger the BUG_ON.
>>>
>>> Here I replace the skb_clone with skb_copy in netem_enqueue to ensure
>>> the duplicated skb's nonlinear data is independent.
>>>
>>> Signed-off-by: Sheng Lan <lansheng@...wei.com>
>>> Reported-by: Qin Ji <jiqin.ji@...wei.com>
>>>
>>> Fixes: 0afb51e7 ("netem: reinsert for duplication")
>>
>> This sounds like a bug in the other layers (either TCP or Virtio net)
>> not handling a cloned skb properly.
>>
>
> I have traced the route of skb by printk, let me take an example to describe the problem to make it clearly:
> Mss value equals to 1448. Limit value is the split size when tcp do tso_fragment, is depending on the size of the sending congestion window and mss value.
>
> TCP layer transmit the index file to client, the original skb1 size is large:
> ...
> tcp_write_xmit (skb1->data_len == 62264, limit == 2*mss == 2896)
> tso_fragment (it needs to be fragmented by limit value)
> skb_split (after split, skb1->data_len == 2896, skb_shinfo(skb1)->frags[0] == 2896, skb_shinfo(skb1)->nr_frags == 1)
> ...
> netem_enqueue (netem construct a duplicate packet of skb1 by skb_clone)
> skb2 = skb_clone(skb1) (skb1->data_len == skb2->data_len == 2896, skb1 and skb2 share the nonlinear data frags[0] == 2896)
> waiting 30ms (skb1 and skb2 will be delayed in qdisc queue due to the netem delay configuration)
>
>
> TCP layer receives new packets and trys to retransmit the skb1:
> tcp_rcv_established
> __tcp_push_pending_frames
> tcp_write_xmit (skb1->data_len == 2896, cwnd size decreased or packets in flight increased, cause the limit decreased to 1*mss == 1448)
tcp_write_xmit() only deals with packet in the write queue,
they never were sent. They can not be any clone of them by definition, since
skbs in the TCP write queue are private to TCP stack,
Once a packet is sent, the master skb is moved to the rtx rb-tree,
while the clone is going through lower stacks.
When/if a retransmit is due, we always make sure there is no clone on it,
look at the various calls to skb_unclone()
> tso_fragment (limit value is less than skb1->data_len, skb1 will be fragmented again)
> skb_split (the second time split, skb1 is cloned now and share nonlinear data with skb2.
> skb1->data_len == 1448, frags[0] == 1448, a part of shared nonlinear data has been split off)
>
> Now we can see:
> skb1->data_len == 1448
> skb2->data_len == 2896 (2896 is wrong value)
> frags[0] == 1448
>
>
> The route of skb2:
> netem_enqueue (netem construct a duplicate packet, skb2 = skb_clone(skb1), put skb2 into queue)
> waiting 30ms (delayed packet)
> ...
> netem_dequeue (skb2->data_len == 2896, frags[0] == 1448)
> sch_direct_xmit
> dev_hard_start_xmit
> xmit_one
> start_xmit [virtio_net driver]
> skb_to_sgvec (BUG_ON here )
>
>
> The key is that skb be split by skb_split in tso_fragment again after netem clone it and share nonlinear data with another skb.
> Processing of TCP seems OK, which push and fragment delayed packets in write queue. And virtio_net is the trigger of the BUG_ON.
> So I replaced skb_clone with skb_copy in netem_enqueue, and the method worked. Currently, I have no better idea to fix it,
> would you give me some inspiring advice ? If I am wrong, please correct me.
>
> Thanks
>
Powered by blists - more mailing lists