[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <eac018bf-58fe-188d-0dad-f454c9affebb@intel.com>
Date: Tue, 13 Mar 2018 17:28:07 -0700
From: "Samudrala, Sridhar" <sridhar.samudrala@...el.com>
To: Siwei Liu <loseweigh@...il.com>
Cc: "Michael S. Tsirkin" <mst@...hat.com>,
Stephen Hemminger <stephen@...workplumber.org>,
David Miller <davem@...emloft.net>,
Netdev <netdev@...r.kernel.org>, Jiri Pirko <jiri@...nulli.us>,
virtio-dev@...ts.oasis-open.org,
"Brandeburg, Jesse" <jesse.brandeburg@...el.com>,
Alexander Duyck <alexander.h.duyck@...el.com>,
Jakub Kicinski <kubakici@...pl>
Subject: Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when
available
On 3/12/2018 3:44 PM, Siwei Liu wrote:
> Apologies, still some comments going. Please see inline.
>
> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
> <sridhar.samudrala@...el.com> wrote:
>> This patch enables virtio_net to switch over to a VF datapath when a VF
>> netdev is present with the same MAC address. It allows live migration
>> of a VM with a direct attached VF without the need to setup a bond/team
>> between a VF and virtio net device in the guest.
>>
>> The hypervisor needs to enable only one datapath at any time so that
>> packets don't get looped back to the VM over the other datapath. When a VF
>> is plugged, the virtio datapath link state can be marked as down. The
>> hypervisor needs to unplug the VF device from the guest on the source host
>> and reset the MAC filter of the VF to initiate failover of datapath to
>> virtio before starting the migration. After the migration is completed,
>> the destination hypervisor sets the MAC filter on the VF and plugs it back
>> to the guest to switch over to VF datapath.
>>
>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>> created that acts as a master device and tracks the state of the 2 lower
>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>> passthru device with the same MAC is registered as 'active' netdev.
>>
>> This patch is based on the discussion initiated by Jesse on this thread.
>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@...el.com>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@...el.com>
>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@...el.com>
>> ---
>> drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 682 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index bcd13fe906ca..f2860d86c952 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -30,6 +30,8 @@
>> #include <linux/cpu.h>
>> #include <linux/average.h>
>> #include <linux/filter.h>
>> +#include <linux/netdevice.h>
>> +#include <linux/pci.h>
>> #include <net/route.h>
>> #include <net/xdp.h>
>>
>> @@ -206,6 +208,9 @@ struct virtnet_info {
>> u32 speed;
>>
>> unsigned long guest_offloads;
>> +
>> + /* upper netdev created when BACKUP feature enabled */
>> + struct net_device *bypass_netdev;
>> };
>>
>> struct padded_vnet_hdr {
>> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>> }
>> }
>>
>> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>> + size_t len)
>> +{
>> + struct virtnet_info *vi = netdev_priv(dev);
>> + int ret;
>> +
>> + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>> + return -EOPNOTSUPP;
>> +
>> + ret = snprintf(buf, len, "_bkup");
>> + if (ret >= len)
>> + return -EOPNOTSUPP;
>> +
>> + return 0;
>> +}
>> +
>> static const struct net_device_ops virtnet_netdev = {
>> .ndo_open = virtnet_open,
>> .ndo_stop = virtnet_close,
>> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = {
>> .ndo_xdp_xmit = virtnet_xdp_xmit,
>> .ndo_xdp_flush = virtnet_xdp_flush,
>> .ndo_features_check = passthru_features_check,
>> + .ndo_get_phys_port_name = virtnet_get_phys_port_name,
>> };
>>
>> static void virtnet_config_changed_work(struct work_struct *work)
>> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device *vdev)
>> return 0;
>> }
>>
>> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
>> + * When BACKUP feature is enabled, an additional netdev(bypass netdev)
>> + * is created that acts as a master device and tracks the state of the
>> + * 2 lower netdevs. The original virtio_net netdev is registered as
>> + * 'backup' netdev and a passthru device with the same MAC is registered
>> + * as 'active' netdev.
>> + */
>> +
>> +/* bypass state maintained when BACKUP feature is enabled */
>> +struct virtnet_bypass_info {
>> + /* passthru netdev with same MAC */
>> + struct net_device __rcu *active_netdev;
>> +
>> + /* virtio_net netdev */
>> + struct net_device __rcu *backup_netdev;
>> +
>> + /* active netdev stats */
>> + struct rtnl_link_stats64 active_stats;
>> +
>> + /* backup netdev stats */
>> + struct rtnl_link_stats64 backup_stats;
>> +
>> + /* aggregated stats */
>> + struct rtnl_link_stats64 bypass_stats;
>> +
>> + /* spinlock while updating stats */
>> + spinlock_t stats_lock;
>> +};
>> +
>> +static void virtnet_bypass_child_open(struct net_device *dev,
>> + struct net_device *child_netdev)
>> +{
>> + int err = dev_open(child_netdev);
>> +
>> + if (err)
>> + netdev_warn(dev, "unable to open slave: %s: %d\n",
>> + child_netdev->name, err);
>> +}
> Why ignoring the error?? i.e. virtnet_bypass_child_open should have
> return value. I don't believe the caller is in a safe context to
> guarantee the dev_open always succeeds.
OK. Will handle this in the next revision.
>
>> +
>> +static int virtnet_bypass_open(struct net_device *dev)
>> +{
>> + struct virtnet_bypass_info *vbi = netdev_priv(dev);
>> + struct net_device *child_netdev;
>> +
>> + netif_carrier_off(dev);
>> + netif_tx_wake_all_queues(dev);
>> +
>> + child_netdev = rtnl_dereference(vbi->active_netdev);
>> + if (child_netdev)
>> + virtnet_bypass_child_open(dev, child_netdev);
> Handle the error?
>
>> +
>> + child_netdev = rtnl_dereference(vbi->backup_netdev);
>> + if (child_netdev)
>> + virtnet_bypass_child_open(dev, child_netdev);
> Handle the error and unwind?
Sure.
<snip>
> +
> +static int virtnet_bypass_register_child(struct net_device *child_netdev)
> +{
> + struct virtnet_bypass_info *vbi;
> + struct net_device *dev;
> + bool backup;
> + int ret;
> +
> + if (child_netdev->addr_len != ETH_ALEN)
> + return NOTIFY_DONE;
> +
> + /* We will use the MAC address to locate the virtnet_bypass netdev
> + * to associate with the child netdev. If we don't find a matching
> + * bypass netdev, move on.
> + */
> + dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
> + child_netdev->perm_addr);
> + if (!dev)
> + return NOTIFY_DONE;
> +
> + vbi = netdev_priv(dev);
> + backup = (child_netdev->dev.parent == dev->dev.parent);
> + if (backup ? rtnl_dereference(vbi->backup_netdev) :
> + rtnl_dereference(vbi->active_netdev)) {
> + netdev_info(dev,
> + "%s attempting to join bypass dev when %s already present\n",
> + child_netdev->name, backup ? "backup" : "active");
> + return NOTIFY_DONE;
> + }
> +
> + /* Avoid non pci devices as active netdev */
> + if (!backup && (!child_netdev->dev.parent ||
> + !dev_is_pci(child_netdev->dev.parent)))
> + return NOTIFY_DONE;
> +
> There's a problem here in terms of error (particularly on the active
> slave, e.g. VF), see below:
>
>> + ret = netdev_rx_handler_register(child_netdev,
>> + virtnet_bypass_handle_frame, dev);
>> + if (ret != 0) {
>> + netdev_err(child_netdev,
>> + "can not register bypass receive handler (err = %d)\n",
>> + ret);
>> + goto rx_handler_failed;
>> + }
>> +
>> + ret = netdev_upper_dev_link(child_netdev, dev, NULL);
>> + if (ret != 0) {
>> + netdev_err(child_netdev,
>> + "can not set master device %s (err = %d)\n",
>> + dev->name, ret);
>> + goto upper_link_failed;
>> + }
>> +
>> + child_netdev->flags |= IFF_SLAVE;
>> +
>> + if (netif_running(dev)) {
>> + ret = dev_open(child_netdev);
>> + if (ret && (ret != -EBUSY)) {
>> + netdev_err(dev, "Opening child %s failed ret:%d\n",
>> + child_netdev->name, ret);
>> + goto err_interface_up;
>> + }
>> + }
>> +
>> + /* Align MTU of child with master */
>> + ret = dev_set_mtu(child_netdev, dev->mtu);
>> + if (ret) {
>> + netdev_err(dev,
>> + "unable to change mtu of %s to %u register failed\n",
>> + child_netdev->name, dev->mtu);
>> + goto err_set_mtu;
>> + }
> If any of the above calls returns non-zero, the current code steps
> back to undo what's being done on that spefic slave previously. For
> instance, if the netdev_rx_handler_register returns non-zero because
> of busy rx handler, this register_child function would give up
> enslaving the VF and leave the upper virtio_bypass interface behind
> once it returns.
virtio_bypass interface is the upper master netdev and it is always present when
BACKUP feature is enabled.
If there is failure with enslaving VF, there will be 2 netdevs,
master virtio_bypass and slave virtio_net and the VM can be migrated.
> I am not sure if it's a good idea to leave the
> virtio_bypass around if running into failure: the guest is not
> migratable as the VF doesn't have a backup path,
Are you talking about a failure when registering backup netdev? This should not
happen, but i guess we can improve error handling in such scenario.
> And perhaps the worse
> part is that, it now has two interfaces with identical MAC address but
> one of them is invalid (user cannot use the virtio interface as it has
> a dampened datapath). IMHO the virtio_bypass and its lower netdev
> should be destroyed at all when it fails to bind the VF, and
> technically, there should be some way to propogate the failure status
> to the hypervisor/backend, indicating that the VM is not migratable
> because of guest software errors (maybe by clearing out the backup
> feature from the guest virtio driver so host can see/learn it).
In BACKUP mode, user can only use the upper virtio_bypass netdev and that will
always be there. Any failure to enslave VF netdev is not fatal, but i will see
if we can improve the error handling of failure to enslave backup netdev.
Also, i don't think the BACKUP feature bit is negotiable with the host.
Thanks
Sridhar
Powered by blists - more mailing lists