Message-ID: <dd75923d-a079-7db0-6903-a31fff062d26@huawei.com>
Date: Wed, 27 Feb 2019 19:26:57 +0800
From: Sheng Lan <lansheng@...wei.com>
To: Eric Dumazet <eric.dumazet@...il.com>,
Stephen Hemminger <stephen@...workplumber.org>
CC: <davem@...emloft.net>, <netdev@...r.kernel.org>,
<netem@...ts.linux-foundation.org>, <xuhanbing@...wei.com>,
<zhengshaoyu@...wei.com>, <jiqin.ji@...wei.com>,
<liuzhiqiang26@...wei.com>, <yuehaibing@...wei.com>
Subject: Re: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec

On 2019/2/26 23:52, Eric Dumazet wrote:
>
>
> On 02/26/2019 05:02 AM, Sheng Lan wrote:
>>
>>
>>
>>> On Mon, 25 Feb 2019 22:49:39 +0800
>>> Sheng Lan <lansheng@...wei.com> wrote:
>>>
>>>> From: Sheng Lan <lansheng@...wei.com>
>>>> Subject: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec
>>>>
>>>> It can be reproduced by the following steps:
>>>> 1. virtio_net NIC is configured with gso/tso on
>>>> 2. configure nginx as an http server with an index file bigger than 1M bytes
>>>> 3. use tc netem to produce duplicate packets and delay:
>>>> tc qdisc add dev eth0 root netem delay 100ms 10ms 30% duplicate 90%
>>>> 4. continually curl the nginx http server from the client to fetch the index file
>>>> 5. the BUG_ON is hit quickly
>>>>
>>>> [10258690.371129] kernel BUG at net/core/skbuff.c:4028!
>>>> [10258690.371748] invalid opcode: 0000 [#1] SMP PTI
>>>> [10258690.372094] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G W 5.0.0-rc6 #2
>>>> [10258690.372094] RSP: 0018:ffffa05797b43da0 EFLAGS: 00010202
>>>> [10258690.372094] RBP: 00000000000005ea R08: 0000000000000000 R09: 00000000000005ea
>>>> [10258690.372094] R10: ffffa0579334d800 R11: 00000000000002c0 R12: 0000000000000002
>>>> [10258690.372094] R13: 0000000000000000 R14: ffffa05793122900 R15: ffffa0578f7cb028
>>>> [10258690.372094] FS: 0000000000000000(0000) GS:ffffa05797b40000(0000) knlGS:0000000000000000
>>>> [10258690.372094] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [10258690.372094] CR2: 00007f1a6dc00868 CR3: 000000001000e000 CR4: 00000000000006e0
>>>> [10258690.372094] Call Trace:
>>>> [10258690.372094] <IRQ>
>>>> [10258690.372094] skb_to_sgvec+0x11/0x40
>>>> [10258690.372094] start_xmit+0x38c/0x520 [virtio_net]
>>>> [10258690.372094] dev_hard_start_xmit+0x9b/0x200
>>>> [10258690.372094] sch_direct_xmit+0xff/0x260
>>>> [10258690.372094] __qdisc_run+0x15e/0x4e0
>>>> [10258690.372094] net_tx_action+0x137/0x210
>>>> [10258690.372094] __do_softirq+0xd6/0x2a9
>>>> [10258690.372094] irq_exit+0xde/0xf0
>>>> [10258690.372094] smp_apic_timer_interrupt+0x74/0x140
>>>> [10258690.372094] apic_timer_interrupt+0xf/0x20
>>>> [10258690.372094] </IRQ>
>>>>
>>>> In __skb_to_sgvec(), skb->len is not equal to the sum of the skb's
>>>> linear data size and nonlinear data size, so the BUG_ON is triggered.
>>>> The bad skb's nonlinear data size is less than its skb->data_len,
>>>> because the skb was cloned and part of the nonlinear data it shares
>>>> with the original skb was later split off.
>>>>
>>>> The duplicate packet is created by skb_clone() in netem_enqueue() and may
>>>> be delayed for some time in the qdisc. Because of that delay, the original
>>>> skb will be pushed again later by __tcp_push_pending_frames() when TCP
>>>> receives new packets. In tcp_write_xmit(), when tcp_mss_split_point()
>>>> returns a smaller limit, the original skb is fragmented and part of its
>>>> nonlinear data is split off, while the length recorded in the skb cloned
>>>> by netem is not updated. With a virtio_net NIC, the duplicated (cloned)
>>>> skb is then filled into a scatter-gather list in __skb_to_sgvec() and
>>>> triggers the BUG_ON.
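
To make the sharing explicit, here is a condensed, illustrative sketch of the
sequence above (not literal kernel code; "skb_new" stands for the buffer that
tso_fragment() allocates, and the sizes come from the example further down):

	/* netem duplicates by cloning: skb2 gets a private header but
	 * shares the paged data (skb_shared_info and its frags) with skb1. */
	struct sk_buff *skb2 = skb_clone(skb1, GFP_ATOMIC);
	/* skb1->data_len == skb2->data_len == 2896 */

	/* TCP later refragments the original at a smaller limit:
	 * skb_split() moves part of the shared frag array into skb_new and
	 * updates skb1's length fields -- it knows nothing about skb2. */
	skb_split(skb1, skb_new, 1448);
	/* skb1->data_len == 1448 now, but skb2->data_len still claims 2896
	 * while the frags skb2 shares with skb1 only cover 1448 bytes, so
	 * __skb_to_sgvec(skb2, ...) runs out of data before len reaches 0
	 * and the BUG_ON fires. */
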
>>>>
>>>> Here I replace skb_clone() with skb_copy() in netem_enqueue() to ensure
>>>> that the duplicated skb's nonlinear data is independent.
>>>>
>>>> Signed-off-by: Sheng Lan <lansheng@...wei.com>
>>>> Reported-by: Qin Ji <jiqin.ji@...wei.com>
>>>>
>>>> Fixes: 0afb51e7 ("netem: reinsert for duplication")
>>>
>>> This sounds like a bug in the other layers (either TCP or Virtio net)
>>> not handling a cloned skb properly.
>>>
>>
>> I traced the route of the skb with printk; let me take an example to describe the problem clearly:
>> The MSS equals 1448. The limit is the split size used when TCP does tso_fragment(); it depends on the size of the send congestion window and the MSS.
>>
>> The TCP layer transmits the index file to the client; the original skb1 is large:
>> ...
>> tcp_write_xmit (skb1->data_len == 62264, limit == 2*mss == 2896)
>> tso_fragment (skb1 must be fragmented at the limit)
>> skb_split (after the split: skb1->data_len == 2896, skb_shinfo(skb1)->frags[0] holds 2896 bytes, skb_shinfo(skb1)->nr_frags == 1)
>> ...
>> netem_enqueue (netem constructs a duplicate of skb1 via skb_clone)
>> skb2 = skb_clone(skb1) (skb1->data_len == skb2->data_len == 2896; skb1 and skb2 share the nonlinear data, frags[0] == 2896)
>> waiting 30ms (skb1 and skb2 are delayed in the qdisc queue by the netem delay configuration)
>>
>>
>> The TCP layer receives new packets and tries to retransmit skb1:
>> tcp_rcv_established
>> __tcp_push_pending_frames
>> tcp_write_xmit (skb1->data_len == 2896; the cwnd decreased or packets in flight increased, so the limit dropped to 1*mss == 1448)
>
> tcp_write_xmit() only deals with packets in the write queue;
> they have never been sent. There can be no clones of them by definition, since
> skbs in the TCP write queue are private to the TCP stack.
>
> Once a packet is sent, the master skb is moved to the rtx rb-tree,
> while the clone goes down through the lower stacks.
>
> When/if a retransmit is due, we always make sure there is no clone of it;
> look at the various calls to skb_unclone().
I traced it again and found that the skb was never sent: the master skb was still
in the write queue, because tcp_transmit_skb() returned 1 (NET_XMIT_DROP), so it
can be retransmitted. The NET_XMIT_DROP error value comes from netem_enqueue(),
when the length of the qdisc queue is greater than the queue limit.
In netem_enqueue() the skb is cloned before the NET_XMIT_DROP error value is
returned, so the master skb stays in the write queue even though it has been
cloned in netem_enqueue(). This can cause the master skb to be retransmitted
and fragmented again while it is cloned.
I think there is a potential risk that tso_fragment() gets a cloned skb whenever
the skb is cloned by a lower layer.
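
As far as I can see, the retransmit paths guard against this but the write-queue
fast path does not; condensed from net/ipv4/tcp_output.c (not an exact copy):

	/* tcp_fragment() makes the skb private before editing it: */
	if (skb_unclone(skb, gfp))
		return -ENOMEM;

	/* tso_fragment(), however, splits an all-paged write-queue skb
	 * directly, relying on the assumption that write-queue skbs are
	 * never cloned: */
	skb_split(skb, buff, len);
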
I tried to fix it by moving the error-return statement in front of the
skb_clone() in netem_enqueue(), and it works.
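
Condensed, the problematic pre-patch ordering in netem_enqueue() looks like this
(illustrative, not the full function):

	/* 1) duplicate: enqueue a clone at the qdisc root; the clone
	 *    shares its paged data with the master skb TCP still owns. */
	if (count > 1) {
		struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);

		if (skb2)
			rootq->enqueue(skb2, rootq, to_free);
	}

	/* 2) only afterwards check the queue limit. */
	if (unlikely(sch->q.qlen >= sch->limit))
		return qdisc_drop_all(skb, sch, to_free);
		/* NET_XMIT_DROP makes TCP keep the master skb in its write
		 * queue, where tso_fragment() may skb_split() it later --
		 * while the clone from step 1) is still sitting in the
		 * qdisc.  The patch below swaps 1) and 2). */
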
The statement in netem_enqueue() that constructs corrupt packets returns
NET_XMIT_DROP too. To fix this completely, should I move the corruption handling
in front of the skb_clone() as well?
Please correct me if I am wrong; I need your advice.
Thanks
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 75046ec..615a341 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -474,6 +474,9 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	if (q->latency || q->jitter || q->rate)
 		skb_orphan_partial(skb);
 
+	if (unlikely(sch->q.qlen >= sch->limit))
+		return qdisc_drop_all(skb, sch, to_free);
+
 	/*
 	 * If we need to duplicate packet, then re-insert at top of the
 	 * qdisc tree, since parent queuer expects that only one
@@ -521,9 +524,6 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 			1<<(prandom_u32() % 8);
 	}
 
-	if (unlikely(sch->q.qlen >= sch->limit))
-		return qdisc_drop_all(skb, sch, to_free);
-
 	qdisc_qstats_backlog_inc(sch, skb);
 	cb = netem_skb_cb(skb);
--