netdev - Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKgT0Ucp8WhkORpVVy2VWwuvxZHZjXKn6YfHfVaLAVVybf3t5w@mail.gmail.com>
Date:   Tue, 27 Feb 2018 13:16:21 -0800
From:   Alexander Duyck <alexander.duyck@...il.com>
To:     Jiri Pirko <jiri@...nulli.us>
Cc:     Sridhar Samudrala <sridhar.samudrala@...el.com>,
        "Michael S. Tsirkin" <mst@...hat.com>,
        Stephen Hemminger <stephen@...workplumber.org>,
        David Miller <davem@...emloft.net>,
        Netdev <netdev@...r.kernel.org>,
        virtualization@...ts.linux-foundation.org,
        virtio-dev@...ts.oasis-open.org,
        "Brandeburg, Jesse" <jesse.brandeburg@...el.com>,
        "Duyck, Alexander H" <alexander.h.duyck@...el.com>,
        Jakub Kicinski <kubakici@...pl>,
        Jason Wang <jasowang@...hat.com>,
        Siwei Liu <loseweigh@...il.com>
Subject: Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a
 passthru device

On Tue, Feb 27, 2018 at 12:49 AM, Jiri Pirko <jiri@...nulli.us> wrote:
> Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.duyck@...il.com wrote:
>>On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko <jiri@...nulli.us> wrote:
>>> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudrala@...el.com wrote:
>>>>Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>>>>used by hypervisor to indicate that virtio_net interface should act as
>>>>a backup for another device with the same MAC address.
>>>>
>>>>Ppatch 2 is in response to the community request for a 3 netdev
>>>>solution.  However, it creates some issues we'll get into in a moment.
>>>>It extends virtio_net to use alternate datapath when available and
>>>>registered. When BACKUP feature is enabled, virtio_net driver creates
>>>>an additional 'bypass' netdev that acts as a master device and controls
>>>>2 slave devices.  The original virtio_net netdev is registered as
>>>>'backup' netdev and a passthru/vf device with the same MAC gets
>>>>registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>>associated with the same 'pci' device.  The user accesses the network
>>>>interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>>as default for transmits when it is available with link up and running.
>>>
>>> Sorry, but this is ridiculous. You are apparently re-implemeting part
>>> of bonding driver as a part of NIC driver. Bond and team drivers
>>> are mature solutions, well tested, broadly used, with lots of issues
>>> resolved in the past. What you try to introduce is a weird shortcut
>>> that already has couple of issues as you mentioned and will certanly
>>> have many more. Also, I'm pretty sure that in future, someone comes up
>>> with ideas like multiple VFs, LACP and similar bonding things.
>>
>>The problem with the bond and team drivers is they are too large and
>>have too many interfaces available for configuration so as a result
>>they can really screw this interface up.
>>
>>Essentially this is meant to be a bond that is more-or-less managed by
>>the host, not the guest. We want the host to be able to configure it
>>and have it automatically kick in on the guest. For now we want to
>>avoid adding too much complexity as this is meant to be just the first
>>step. Trying to go in and implement the whole solution right from the
>>start based on existing drivers is going to be a massive time sink and
>>will likely never get completed due to the fact that there is always
>>going to be some other thing that will interfere.
>>
>>My personal hope is that we can look at doing a virtio-bond sort of
>>device that will handle all this as well as providing a communication
>>channel, but that is much further down the road. For now we only have
>>a single bit so the goal for now is trying to keep this as simple as
>>possible.
>
> I have another usecase that would require the solution to be different
> then what you suggest. Consider following scenario:
> - baremetal has 2 sr-iov nics
> - there is a vm, has 1 VF from each nics: vf0, vf1. No virtio_net
> - baremetal would like to somehow tell the VM to bond vf0 and vf1
>   together and how this bonding should be configured, according to how
>   the VF representors are configured on the baremetal (LACP for example)
>
> The baremetal could decide to remove any VF during the VM runtime, it
> can add another VF there. For migration, it can add virtio_net. The VM
> should be inctructed to bond all interfaces together according to how
> baremetal decided - as it knows better.
>
> For this we need a separate communication channel from baremetal to VM
> (perhaps something re-usable already exists), we need something to
> listen to the events coming from this channel (kernel/userspace) and to
> react accordingly (create bond/team, enslave, etc).
>
> Now the question is: is it possible to merge the demands you have and
> the generic needs I described into a single solution? From what I see,
> that would be quite hard/impossible. So at the end, I think that we have
> to end-up with 2 solutions:
> 1) virtio_net, netvsc in-driver bonding - very limited, stupid, 0config
>    solution that works for all (no matter what OS you use in VM)
> 2) team/bond solution with assistance of preferably userspace daemon
>    getting info from baremetal. This is not 0config, but minimal config
>    - user just have to define this "magic bonding" should be on.
>    This covers all possible usecases, including multiple VFs, RDMA, etc.
>
> Thoughts?

So that is about what I had in mind. We end up having to do something
completely different to support this more complex solution. I think we
might have referred to it as v2/v3 in a different thread, and
virt-bond in this thread.

Basically we need some sort of PCI or PCIe topology mapping for the
devices that can be translated into something we can communicate over
the communication channel. After that we also have the added
complexity of how do we figure out which Tx path we want to choose.
This is one of the reasons why I was thinking of something like a eBPF
blob that is handed up from the host side and into the guest to select
the Tx queue. That way when we add some new approach such as a
NUMA/cpu based netdev selection then we just provide an eBPF blob that
does that. Most of this is just theoretical at this point though since
I haven't had a chance to look into it too deeply yet. If you want to
take something like this on the help would always be welcome.. :)

The other thing I am looking at is trying to find a good way to do
dirty page tracking in the hypervisor using something like a
para-virtual IOMMU. However I don't have any ETA on that as I am just
starting out and have limited development time. If we get that in
place we can leave the VF in the guest until the very last moments
instead of having to remove it before we start the live migration.

- Alex