netdev - Re: regression: UFO removal breaks kvm live migration

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAF=yD-LuUeDuL9YWPJD9ykOZ0QCjNeznPDr6whqZ9NGMNF12Mw@mail.gmail.com>
Date:   Fri, 17 Nov 2017 09:31:23 -0500
From:   Willem de Bruijn <willemdebruijn.kernel@...il.com>
To:     Jason Wang <jasowang@...hat.com>
Cc:     David Miller <davem@...emloft.net>,
        Michal Kubecek <mkubecek@...e.cz>,
        Network Development <netdev@...r.kernel.org>,
        "Michael S. Tsirkin" <mst@...hat.com>,
        Vlad Yasevic <vyasevic@...hat.com>,
        Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: regression: UFO removal breaks kvm live migration

On Fri, Nov 10, 2017 at 12:32 AM, Willem de Bruijn
<willemdebruijn.kernel@...il.com> wrote:
> On Wed, Nov 8, 2017 at 9:53 PM, Jason Wang <jasowang@...hat.com> wrote:
>>
>>
>> On 2017年11月08日 20:32, David Miller wrote:
>>>
>>> From: Jason Wang <jasowang@...hat.com>
>>> Date: Wed, 8 Nov 2017 17:25:48 +0900
>>>
>>>> On 2017年11月08日 17:08, Willem de Bruijn wrote:
>>>>>
>>>>> That won't help in the short term. I'm still reading up to see if
>>>>> there are
>>>>> any other options besides reimplement or advertise-but-drop, such as
>>>>> an implicit trigger that would make the guest renegotiate. It's
>>>>> unlikely, but
>>>>> worth a look..
>>>>
>>>> Yes, this looks hard. And even if we can manage to do this, it looks
>>>> an overkill since it will impact all guest after migration.
>>>
>>> Like Willem I would much prefer "advertise-but-drop" if it works.
>>
>>
>> This makes migration work but all guest UFO traffic will stall.
>>
>>>
>>> In the long term feature renegotiation triggers are a must.
>>>
>>> There is no way for us to remove features otherwise.
>>
>>
>> We can remove if we don't break userspace(guest).
>>
>>> In my opinion
>>> this will even make migrations more powerful.
>>
>>
>> But this does not help for guest running old version of kernel which still
>> think UFO work.
>
> Indeed, if we have to support live migration of arbitrary old guests
> without any expectations on hypervisor version either, features can
> simply never be reverted, even if a negotiation interface exists.
>
> At least for upcoming features and devices, guest code should not
> have this expectation, but from the start allow renegation such as
> CTRL_GUEST_OFFLOADS [1] based on a host trigger. But for
> tuntap TUNSETOFFLOAD it seems that ship has sailed.
>
> Okay, I will send a patch to reinstate UFO for this use case (only). There
> is some related work in tap_handle_frame and packet_direct_xmit to
> segment directly in the device. I will be traveling the next few days, so
> it won't be in time for 4.14 (but can go in stable later, of course).

I'm finishing up and running some tests. The majority of the patch is a
straightforward partial revert of the patchset, so while fairly large for a
patch to net (~150 lines, esp. in udp[46]_ufo_fragment), that is all
thoroughly tested code. Notably absent are the protocol layer and
hardware support (NETIF_F_UFO) portions.

The only open issue is whether to rely on existing skb_gso_segment
processing in the transmit path from validate_xmit_skb or to add new
skb_gso_segment calls directly to tun_get_user, tap_get_user and
pf_packet. Tun has to loop around four different ways of injecting
packets into the device. Something like the below snippet.

More conservative is to introduce no completely new code and rely on
validate_xmit_skb, but that means having to protect the entire stack
against skbs with SKB_GSO_UDP, so also bringing back some
checksum and fragment handling snippets in gre_gso_segment,
__skb_udp_tunnel_segment, act_csum and openvswitch.

A third option is to send the conservative approach to net, then
in net-next follow up with a patch to plug the SKB_GSO_UDP
directly in the devices and revert the tunnel/act/openvswitch stanzas
I'm leaning towards that approach.

@@ -1380,7 +1380,7 @@ static ssize_t tun_get_user(struct tun_struct
*tun, struct tun_file *tfile,
                            int noblock, bool more)
 {
        struct tun_pi pi = { 0, cpu_to_be16(ETH_P_IP) };
-       struct sk_buff *skb;
+       struct sk_buff *skb, *segs = NULL;
        size_t total_len = iov_iter_count(from);
        size_t len = total_len, align = tun->align, linear;
        struct virtio_net_hdr gso = { 0 };
@@ -1552,12 +1552,33 @@ static ssize_t tun_get_user(struct tun_struct
*tun, struct tun_file *tfile,
        }

        rxhash = __skb_get_hash_symmetric(skb);
+
+       if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP) {
+               skb_push(skb, ETH_HLEN);
+               segs = __skb_gso_segment(skb, netif_skb_features(skb), false);
+
+               if (IS_ERR(segs)) {
+                       kfree_skb(skb);
+                       return PTR_ERR(segs);
+               }
+
+               if (segs) {
+                       consume_skb(skb);
+                       skb = segs;
+               }
+again:
+               skb_pull(skb, ETH_HLEN);
+               segs = skb->next;
+               skb->next = NULL;
+       }
+
 #ifndef CONFIG_4KSTACKS
-        tun_rx_batched(tun, tfile, skb, more);
+       tun_rx_batched(tun, tfile, skb, more || segs);
 #else
        netif_rx_ni(skb);
 #endif

+       if (segs) {
+               skb = segs;
+               goto again;
+       }