[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <198418193.20140501213951@eikelenboom.it>
Date: Thu, 1 May 2014 21:39:51 +0200
From: Sander Eikelenboom <linux@...elenboom.it>
To: Zoltan Kiss <zoltan.kiss@...rix.com>,
Eric Dumazet <eric.dumazet@...il.com>
CC: netdev@...r.kernel.org, xen-devel@...ts.xen.org,
Ian Campbell <Ian.Campbell@...rix.com>,
"David S. Miller" <davem@...emloft.net>
Subject: Re: [Xen-devel] [3.15-rc3] Bisected: xen-netback mangles packets between two guests on a bridge since merge of "TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy" series.
Thursday, May 1, 2014, 7:39:36 PM, you wrote:
> Thursday, May 1, 2014, 5:46:01 PM, you wrote:
>> On 01/05/14 14:59, Sander Eikelenboom wrote:
>>>
>>> Thursday, May 1, 2014, 3:37:41 PM, you wrote:
>>>
>>>> On 30/04/14 23:25, Sander Eikelenboom wrote:
>>>>>
>>>>> Wednesday, April 30, 2014, 10:53:39 PM, you wrote:
>>>>>
>>>>>> On 30/04/14 11:45, Sander Eikelenboom wrote:
>>>>>>> Hi Zoltan,
>>>>>>>
>>>>>>> Your series "TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy", merged into mainline with merge commit 4caeccb4de76440e433a15009636e77d003eb3d6,
>>>>>>> seem to introduce a subtle bug on network traffic between 2 guests on a bridge on the same host.
>>>>>>> I have one guest running apache as webdav server with SSL and another guest that is using that is uploading large files to that webdav server.
>>>>>>> Small requests (some get's and propfind's) seem to work ok, but when the bulk uploading begins it fails with:
>>>>>>>
>>>>>>> Attempt 1 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL routines:SSL3_READ_BYTES:sslv3 alert bad record mac
>>>>>>> Attempt 2 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL routines:SSL3_READ_BYTES:sslv3 alert bad record mac
>>>>>>> Attempt 3 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL routines:SSL3_READ_BYTES:sslv3 alert bad record mac
>>>>>>> Attempt 4 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL routines:SSL3_READ_BYTES:sslv3 alert bad record mac
>>>>>>>
>>>>>>> So some how large (probably fragmented) packets can get mangled when from guest to guest on the same host.
>>>>>>> I don't see this with clients that upload large files from external sources.
>>>>>>> Probably if SSL wasn't complaining it would probably be unnoticed for longer and doing some silent corruption.
>>>>>>>
>>>>>>> I first blamed openssl, since it started around all the latest openssl mayhem and updates, but it turns out it is all xen-netback related again.
>>>>>>>
>>>>>>> Since these commits break bisectabillity:
>>>>>>> - 1bb332af4cd889e4b64dacbf4a793ceb3a70445d (note in commit message && kernel panic)
>>>>>>> - 62bad3199a4c20505fc36c169deef20b25e17c5f (kernel panic)
>>>>>>> i stopped bisecting at this point.
>>>>>>>
>>>>>>> The upside is .. it's 100% reproduceable :-)
>>>>>> That's good :) Can you take tcpdump captures along the way (sending
>>>>>> guest, dom0, receiving guest), and try to work out which packets are
>>>>>> different, and where? Although taking captures in Dom0 might change your
>>>>>> result, as it triggers the pages to be copied and unmapped before they
>>>>>> reach their target.
>>>>>
>>>>>> Thanks,
>>>>>> Zoli
>>>>>
>>>>>
>>>>> Hrrmm that sounds like a lot of data and a lot of work ..
>>>> If you could make captures in the sending and receiving guest with
>>>> tcpdump (take care of increasing snaplen so the whole packet is there,
>>>> and filter to the SSH connection itself), and upload it somewhere for
>>>> me, that would be enough for start. I will try to work out where the
>>>> corruption happens.
>>>> Also, do you have timestamps for the above mentioned log entries? I
>>>> guess they appear on the receiving side.
>>>> And some info about the componenets on the server, so I can work out
>>>> where is that _ssl.c:1415, and which part of the packet it actually
>>>> looks for.
>>>
>>> They appear on the sending side (duplicity) .. the receiving side (apache +
>>> mod_dav + ssl | gnu_tls) gives a "Could not get next bucket brigade (URI:"
>> I will try to repro this case in house. What versions of these
>> components you used?
> Both guests are debian wheezy.
> The webdav server has:
> ii apache2-mpm-event 2.2.22-13+deb7u1 amd64 Apache HTTP Server - event driven model
> ii apache2-utils 2.2.22-13+deb7u1 amd64 uti
> ii apache2.2-bin 2.2.22-13+deb7u1 amd64 Apa
> ii apache2.2-common 2.2.22-13+deb7u1 amd64 Apa
> ii libapache2-mod-gnutls 0.5.10-1.1 amd64 Apa
> ii libssl1.0.0:amd64 1.0.1e-2+deb7u7 amd64 SSL
> ii openssl 1.0.1e-2+deb7u7 amd64 Sec
> The guest with duplicity currently has a duplicity version from unstable
> recompiled for wheezy. But i previously also tried a downgrade to the standard
> wheezy version. It uses the webdav backend and a volumesize of 100MB.
> Unfortunately it seems duplicity doesn't bail out at first instance, it seems it
> only reports error after the so the full tcpdumps i got are also 100MB each.
> Since the error seems to happen when it's going through
> "xenvif_handle_frag_list", i have added a bunch of ratelimited printk's.
> Will run that for both the cases:
> skb->truesize -= skb->data_len;
> skb->truesize -= nskb->data_len;
> Let's see what that does different and if that gives an insight in what is going
> wrong.
The debug patch that i used is attached.
Here is some of the output, skb->data_len size sometimes varies slightly as does skb->truesize (is that allowed to become > 64k ?).
All of it attached in dmesg.txt
[ 647.256275] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 1: skb->data_len: 64890, skb->len: 65018, skb->truesize: 65658, nskb->data_len: 0, nskb->len: 0, nskb->truesize: 768 MAX_SKB_FRAGS:17 offset:128
[ 647.273483] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 2: skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:128
[ 647.291478] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:0, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:4224
[ 647.310199] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:1, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:8320
[ 647.329874] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:2, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:12416
[ 647.350268] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:3, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:16512
[ 647.371562] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:4, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:20608
[ 647.393707] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:5, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:24704
[ 647.416604] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:6, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:28800
[ 647.440482] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:7, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:32896
[ 647.465490] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:8, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:36992
[ 647.491216] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:9, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:41088
[ 647.517698] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:10, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:45184
[ 647.545022] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:11, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:49280
[ 647.573482] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:12, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:53376
[ 647.602458] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:13, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:57472
[ 647.632271] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:14, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:61568
[ 647.662608] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:15, len: 3658, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:65226
[ 647.693632] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 4: i:16, skb->data_len: 65098, skb->len: 65226, skb->truesize: 66304, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:65226
<snip>
[ 848.058828] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 1: skb->data_len: 63493, skb->len: 63621, skb->truesize: 64261, nskb->data_len: 0, nskb->len: 0, nskb->truesize: 768 MAX_SKB_FRAGS:17 offset:128
[ 848.096299] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 2: skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:128
[ 848.133427] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:0, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:4224
[ 848.170519] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:1, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:8320
[ 848.207578] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:2, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:12416
[ 848.244847] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:3, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:16512
[ 848.282044] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:4, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:20608
[ 848.319114] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:5, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:24704
[ 848.356198] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:6, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:28800
[ 848.393369] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:7, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:32896
[ 848.430450] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:8, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:36992
[ 848.467631] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:9, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:41088
[ 848.504711] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:10, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:45184
[ 848.541759] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:11, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:49280
[ 848.578914] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:12, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:53376
[ 848.616120] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:13, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:57472
[ 848.653251] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:14, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:61568
[ 848.690269] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:15, len: 3658, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:65226
[ 848.727394] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 4: i:16, skb->data_len: 65098, skb->len: 65226, skb->truesize: 66304, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:65226
>> Zoli
>>>
>>>
>>>>>
>>>>> how ever .. could it be just a type and would the following make sense ?
>>>>>
>>>>> diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
>>>>> index 7666540..abeea10 100644
>>>>> --- a/drivers/net/xen-netback/netback.c
>>>>> +++ b/drivers/net/xen-netback/netback.c
>>>>> @@ -1366,7 +1366,7 @@ static int xenvif_handle_frag_list(struct xenvif *vif, struct sk_buff *skb)
>>>>>
>>>>> xenvif_fill_frags(vif, nskb);
>>>>> /* Subtract frags size, we will correct it later */
>>>>> - skb->truesize -= skb->data_len;
>>>>> + skb->truesize -= nskb->data_len;
>>>>> skb->len += nskb->len;
>>>>> skb->data_len += nskb->len;
>>>
>>>> Nope, that's correct there: after that skb->truesize will be the size of
>>>> the struct plus the linear buffer itself. The code is just about the
>>>> ditch the original fragments plus the skb on the frag_list. When the new
>>>> pages are created, it will update it again.
>>>
>>> Well i just went a head and tried this .. and the uploading does seem to work fine with this change
>>> .. (that obviously doesn't say anything about correctness)
>>>
>>>> Also, this code path runs only if the guest sends more slots we can
>>>> handle (so we put the extra one to the frag_list until we can get rid of
>>>> it). On Linux it can only happen with 3.2 or older guest kernels, and
>>>> only occasionally. As you said, this is 100% reproducible, so I would
>>>> doubt the problem is with this part of the code.
>>>
>>> Well this assumption seems to be incorrect:
>>> - both dom0 and guest kernels are 3.15-rc3's.
>>> - but we do end up in this code path
>>>
>>>> Zoli
>>>
>>>
Download attachment "netback-debug6.diff" of type "application/octet-stream" (3082 bytes)
View attachment "dmesg.txt" of type "text/plain" (70159 bytes)
Powered by blists - more mailing lists