Message-ID: <dd75923d-a079-7db0-6903-a31fff062d26@huawei.com>
Date: Wed, 27 Feb 2019 19:26:57 +0800
From: Sheng Lan <lansheng@...wei.com>
To: Eric Dumazet <eric.dumazet@...il.com>,
Stephen Hemminger <stephen@...workplumber.org>
CC: <davem@...emloft.net>, <netdev@...r.kernel.org>,
<netem@...ts.linux-foundation.org>, <xuhanbing@...wei.com>,
<zhengshaoyu@...wei.com>, <jiqin.ji@...wei.com>,
<liuzhiqiang26@...wei.com>, <yuehaibing@...wei.com>
Subject: Re: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec

On 2019/2/26 23:52, Eric Dumazet wrote:
>
>
> On 02/26/2019 05:02 AM, Sheng Lan wrote:
>>
>>
>>
>>> On Mon, 25 Feb 2019 22:49:39 +0800
>>> Sheng Lan <lansheng@...wei.com> wrote:
>>>
>>>> From: Sheng Lan <lansheng@...wei.com>
>>>> Subject: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec
>>>>
>>>> It can be reproduced by the following steps:
>>>> 1. virtio_net NIC is configured with gso/tso on
>>>> 2. configure nginx as an http server with an index file bigger than 1M bytes
>>>> 3. use tc netem to produce duplicate packets and delay:
>>>> tc qdisc add dev eth0 root netem delay 100ms 10ms 30% duplicate 90%
>>>> 4. continually curl the nginx http server from the client to fetch the index file
>>>> 5. the BUG_ON is hit quickly
>>>>
>>>> [10258690.371129] kernel BUG at net/core/skbuff.c:4028!
>>>> [10258690.371748] invalid opcode: 0000 [#1] SMP PTI
>>>> [10258690.372094] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G W 5.0.0-rc6 #2
>>>> [10258690.372094] RSP: 0018:ffffa05797b43da0 EFLAGS: 00010202
>>>> [10258690.372094] RBP: 00000000000005ea R08: 0000000000000000 R09: 00000000000005ea
>>>> [10258690.372094] R10: ffffa0579334d800 R11: 00000000000002c0 R12: 0000000000000002
>>>> [10258690.372094] R13: 0000000000000000 R14: ffffa05793122900 R15: ffffa0578f7cb028
>>>> [10258690.372094] FS: 0000000000000000(0000) GS:ffffa05797b40000(0000) knlGS:0000000000000000
>>>> [10258690.372094] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [10258690.372094] CR2: 00007f1a6dc00868 CR3: 000000001000e000 CR4: 00000000000006e0
>>>> [10258690.372094] Call Trace:
>>>> [10258690.372094] <IRQ>
>>>> [10258690.372094] skb_to_sgvec+0x11/0x40
>>>> [10258690.372094] start_xmit+0x38c/0x520 [virtio_net]
>>>> [10258690.372094] dev_hard_start_xmit+0x9b/0x200
>>>> [10258690.372094] sch_direct_xmit+0xff/0x260
>>>> [10258690.372094] __qdisc_run+0x15e/0x4e0
>>>> [10258690.372094] net_tx_action+0x137/0x210
>>>> [10258690.372094] __do_softirq+0xd6/0x2a9
>>>> [10258690.372094] irq_exit+0xde/0xf0
>>>> [10258690.372094] smp_apic_timer_interrupt+0x74/0x140
>>>> [10258690.372094] apic_timer_interrupt+0xf/0x20
>>>> [10258690.372094] </IRQ>
>>>>
>>>> In __skb_to_sgvec(), skb->len is not equal to the sum of the skb's
>>>> linear data size and nonlinear data size, so the BUG_ON is triggered.
>>>> The bad skb's nonlinear data size is less than its skb->data_len,
>>>> because the skb was cloned and part of the nonlinear data it shares
>>>> with the original skb was later split off.
>>>>
>>>> The duplicate packet is created by skb_clone() in netem_enqueue() and may
>>>> be delayed for some time in the qdisc. Because of that delay, the original
>>>> skb will be pushed again later by __tcp_push_pending_frames() when TCP
>>>> receives new packets. In tcp_write_xmit(), when tcp_mss_split_point()
>>>> returns a smaller limit, the original skb is fragmented and part of its
>>>> nonlinear data is split off, while the length recorded in the skb cloned
>>>> by netem is not updated. With a virtio_net NIC, the duplicated (cloned)
>>>> skb is then filled into a scatter-gather list in __skb_to_sgvec() and
>>>> triggers the BUG_ON.
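
To make the sharing explicit, here is a condensed, illustrative sketch of the
sequence above (not literal kernel code; "skb_new" stands for the buffer that
tso_fragment() allocates, and the sizes come from the example further down):

	/* netem duplicates by cloning: skb2 gets a private header but
	 * shares the paged data (skb_shared_info and its frags) with skb1. */
	struct sk_buff *skb2 = skb_clone(skb1, GFP_ATOMIC);
	/* skb1->data_len == skb2->data_len == 2896 */

	/* TCP later refragments the original at a smaller limit:
	 * skb_split() moves part of the shared frag array into skb_new and
	 * updates skb1's length fields -- it knows nothing about skb2. */
	skb_split(skb1, skb_new, 1448);
	/* skb1->data_len == 1448 now, but skb2->data_len still claims 2896
	 * while the frags skb2 shares with skb1 only cover 1448 bytes, so
	 * __skb_to_sgvec(skb2, ...) runs out of data before len reaches 0
	 * and the BUG_ON fires. */
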
>>>>
>>>> Here I replace skb_clone() with skb_copy() in netem_enqueue() to ensure
>>>> that the duplicated skb's nonlinear data is independent.
>>>>
>>>> Signed-off-by: Sheng Lan <lansheng@...wei.com>
>>>> Reported-by: Qin Ji <jiqin.ji@...wei.com>
>>>>
>>>> Fixes: 0afb51e7 ("netem: reinsert for duplication")
>>>
>>> This sounds like a bug in the other layers (either TCP or Virtio net)
>>> not handling a cloned skb properly.
>>>
>>
>> I traced the route of the skb with printk; let me take an example to describe the problem clearly:
>> The MSS equals 1448. The limit is the split size used when TCP does tso_fragment(); it depends on the size of the send congestion window and the MSS.
>>
>> The TCP layer transmits the index file to the client; the original skb1 is large:
>> ...
>> tcp_write_xmit (skb1->data_len == 62264, limit == 2*mss == 2896)
>> tso_fragment (skb1 must be fragmented at the limit)
>> skb_split (after the split: skb1->data_len == 2896, skb_shinfo(skb1)->frags[0] holds 2896 bytes, skb_shinfo(skb1)->nr_frags == 1)
>> ...
>> netem_enqueue (netem constructs a duplicate of skb1 via skb_clone)
>> skb2 = skb_clone(skb1) (skb1->data_len == skb2->data_len == 2896; skb1 and skb2 share the nonlinear data, frags[0] == 2896)
>> waiting 30ms (skb1 and skb2 are delayed in the qdisc queue by the netem delay configuration)
>>
>>
>> The TCP layer receives new packets and tries to retransmit skb1:
>> tcp_rcv_established
>> __tcp_push_pending_frames
>> tcp_write_xmit (skb1->data_len == 2896; the cwnd decreased or packets in flight increased, so the limit dropped to 1*mss == 1448)
>
> tcp_write_xmit() only deals with packets in the write queue;
> they have never been sent. There can be no clones of them by definition, since
> skbs in the TCP write queue are private to the TCP stack.
>
> Once a packet is sent, the master skb is moved to the rtx rb-tree,
> while the clone goes down through the lower stacks.
>
> When/if a retransmit is due, we always make sure there is no clone of it;
> look at the various calls to skb_unclone().
I traced it again and found that the skb was never sent: the master skb was still
in the write queue, because tcp_transmit_skb() returned 1 (NET_XMIT_DROP), so it
can be retransmitted. The NET_XMIT_DROP error value comes from netem_enqueue(),
when the length of the qdisc queue is greater than the queue limit.
In netem_enqueue() the skb is cloned before the NET_XMIT_DROP error value is
returned, so the master skb stays in the write queue even though it has been
cloned in netem_enqueue(). This can cause the master skb to be retransmitted
and fragmented again while it is cloned.
I think there is a potential risk that tso_fragment() gets a cloned skb whenever
the skb is cloned by a lower layer.
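
As far as I can see, the retransmit paths guard against this but the write-queue
fast path does not; condensed from net/ipv4/tcp_output.c (not an exact copy):

	/* tcp_fragment() makes the skb private before editing it: */
	if (skb_unclone(skb, gfp))
		return -ENOMEM;

	/* tso_fragment(), however, splits an all-paged write-queue skb
	 * directly, relying on the assumption that write-queue skbs are
	 * never cloned: */
	skb_split(skb, buff, len);
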
I tried to fix it by moving the error-return statement in front of the
skb_clone() in netem_enqueue(), and it works.
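
Condensed, the problematic pre-patch ordering in netem_enqueue() looks like this
(illustrative, not the full function):

	/* 1) duplicate: enqueue a clone at the qdisc root; the clone
	 *    shares its paged data with the master skb TCP still owns. */
	if (count > 1) {
		struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);

		if (skb2)
			rootq->enqueue(skb2, rootq, to_free);
	}

	/* 2) only afterwards check the queue limit. */
	if (unlikely(sch->q.qlen >= sch->limit))
		return qdisc_drop_all(skb, sch, to_free);
		/* NET_XMIT_DROP makes TCP keep the master skb in its write
		 * queue, where tso_fragment() may skb_split() it later --
		 * while the clone from step 1) is still sitting in the
		 * qdisc.  The patch below swaps 1) and 2). */
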
The statement in netem_enqueue() that constructs corrupt packets returns
NET_XMIT_DROP too. To fix this completely, should I move the corruption handling
in front of the skb_clone() as well?
Please correct me if I am wrong; I need your advice.
Thanks
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 75046ec..615a341 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -474,6 +474,9 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	if (q->latency || q->jitter || q->rate)
 		skb_orphan_partial(skb);
 
+	if (unlikely(sch->q.qlen >= sch->limit))
+		return qdisc_drop_all(skb, sch, to_free);
+
 	/*
 	 * If we need to duplicate packet, then re-insert at top of the
 	 * qdisc tree, since parent queuer expects that only one
@@ -521,9 +524,6 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 			1<<(prandom_u32() % 8);
 	}
 
-	if (unlikely(sch->q.qlen >= sch->limit))
-		return qdisc_drop_all(skb, sch, to_free);
-
 	qdisc_qstats_backlog_inc(sch, skb);
 	cb = netem_skb_cb(skb);
--