Date: Fri, 2 Mar 2018 16:09:27 -0800
From: Siwei Liu <loseweigh@...il.com>
To: "Samudrala, Sridhar" <sridhar.samudrala@...el.com>
Cc: "Michael S. Tsirkin" <mst@...hat.com>,
Stephen Hemminger <stephen@...workplumber.org>,
David Miller <davem@...emloft.net>,
Netdev <netdev@...r.kernel.org>, Jiri Pirko <jiri@...nulli.us>,
virtio-dev@...ts.oasis-open.org,
"Brandeburg, Jesse" <jesse.brandeburg@...el.com>,
Alexander Duyck <alexander.h.duyck@...el.com>,
Jakub Kicinski <kubakici@...pl>
Subject: Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use
VF datapath when available
On Fri, Mar 2, 2018 at 3:12 PM, Samudrala, Sridhar
<sridhar.samudrala@...el.com> wrote:
> On 3/2/2018 1:11 PM, Siwei Liu wrote:
>>
>> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
>> <sridhar.samudrala@...el.com> wrote:
>>>
>>> This patch enables virtio_net to switch over to a VF datapath when a VF
>>> netdev is present with the same MAC address. It allows live migration
>>> of a VM with a direct attached VF without the need to setup a bond/team
>>> between a VF and virtio net device in the guest.
>>>
>>> The hypervisor needs to enable only one datapath at any time so that
>>> packets don't get looped back to the VM over the other datapath. When a
>>> VF is plugged, the virtio datapath link state can be marked as down. The
>>> hypervisor needs to unplug the VF device from the guest on the source
>>> host and reset the MAC filter of the VF to initiate failover of datapath
>>> to virtio before starting the migration. After the migration is
>>> completed, the destination hypervisor sets the MAC filter on the VF and
>>> plugs it back to the guest to switch over to VF datapath.
>>>
>>> When BACKUP feature is enabled, an additional netdev (bypass netdev) is
>>> created that acts as a master device and tracks the state of the 2 lower
>>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>>> passthru device with the same MAC is registered as 'active' netdev.
>>>
>>> This patch is based on the discussion initiated by Jesse on this thread.
>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>>
>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@...el.com>
>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@...el.com>
>>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@...el.com>
>>> ---
>>>  drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
>>>  1 file changed, 682 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index bcd13fe906ca..f2860d86c952 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -30,6 +30,8 @@
>>> #include <linux/cpu.h>
>>> #include <linux/average.h>
>>> #include <linux/filter.h>
>>> +#include <linux/netdevice.h>
>>> +#include <linux/pci.h>
>>> #include <net/route.h>
>>> #include <net/xdp.h>
>>>
>>> @@ -206,6 +208,9 @@ struct virtnet_info {
>>> u32 speed;
>>>
>>> unsigned long guest_offloads;
>>> +
>>> + /* upper netdev created when BACKUP feature enabled */
>>> + struct net_device *bypass_netdev;
>>> };
>>>
>>> struct padded_vnet_hdr {
>>> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>>> }
>>> }
>>>
>>> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>>> + size_t len)
>>> +{
>>> + struct virtnet_info *vi = netdev_priv(dev);
>>> + int ret;
>>> +
>>> + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>>> + return -EOPNOTSUPP;
>>> +
>>> + ret = snprintf(buf, len, "_bkup");
>>> + if (ret >= len)
>>> + return -EOPNOTSUPP;
>>> +
>>> + return 0;
>>> +}
>>> +
>>
>> What if systemd/udevd is not new enough to enforce the
>> n<phys_port_name> naming? Would virtio_bypass get a different name
>> than the original virtio_net? Should we detect this earlier and fall
>> back to legacy mode without creating the bypass netdev and enslaving
>> the VF?
>
>
> If udev doesn't support renaming of the devices, then the upper bypass
> device should get the original name and the lower virtio netdev will get
> the next name.
If you have two virtio-nets (say eth0 and eth1) before the
update, does the virtio_bypass interface on the first virtio get eth0
while the backup netdev gets eth1? Then the IP address originally on
eth1 gets configured on the backup interface?
> Hopefully the distros updating the kernel will also move to the new
> systemd/udev.
This is not reliable. I'd opt for a new udev API for this, and fall
back to the legacy behavior if it isn't available.
>
>
>
>>
>>> static const struct net_device_ops virtnet_netdev = {
>>> .ndo_open = virtnet_open,
>>> .ndo_stop = virtnet_close,
>>> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = {
>>> .ndo_xdp_xmit = virtnet_xdp_xmit,
>>> .ndo_xdp_flush = virtnet_xdp_flush,
>>> .ndo_features_check = passthru_features_check,
>>> + .ndo_get_phys_port_name = virtnet_get_phys_port_name,
>>> };
>>>
>>> static void virtnet_config_changed_work(struct work_struct *work)
>>> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device *vdev)
>>> return 0;
>>> }
>>>
>>> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
>>> + * When BACKUP feature is enabled, an additional netdev(bypass netdev)
>>> + * is created that acts as a master device and tracks the state of the
>>> + * 2 lower netdevs. The original virtio_net netdev is registered as
>>> + * 'backup' netdev and a passthru device with the same MAC is registered
>>> + * as 'active' netdev.
>>> + */
>>> +
>>> +/* bypass state maintained when BACKUP feature is enabled */
>>> +struct virtnet_bypass_info {
>>> + /* passthru netdev with same MAC */
>>> + struct net_device __rcu *active_netdev;
>>> +
>>> + /* virtio_net netdev */
>>> + struct net_device __rcu *backup_netdev;
>>> +
>>> + /* active netdev stats */
>>> + struct rtnl_link_stats64 active_stats;
>>> +
>>> + /* backup netdev stats */
>>> + struct rtnl_link_stats64 backup_stats;
>>> +
>>> + /* aggregated stats */
>>> + struct rtnl_link_stats64 bypass_stats;
>>> +
>>> + /* spinlock while updating stats */
>>> + spinlock_t stats_lock;
>>> +};
>>> +
>>> +static void virtnet_bypass_child_open(struct net_device *dev,
>>> + struct net_device *child_netdev)
>>> +{
>>> + int err = dev_open(child_netdev);
>>> +
>>> + if (err)
>>> + netdev_warn(dev, "unable to open slave: %s: %d\n",
>>> + child_netdev->name, err);
>>> +}
>>> +
>>> +static int virtnet_bypass_open(struct net_device *dev)
>>> +{
>>> + struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> + struct net_device *child_netdev;
>>> +
>>> + netif_carrier_off(dev);
>>> + netif_tx_wake_all_queues(dev);
>>> +
>>> + child_netdev = rtnl_dereference(vbi->active_netdev);
>>> + if (child_netdev)
>>> + virtnet_bypass_child_open(dev, child_netdev);
>>> +
>>> + child_netdev = rtnl_dereference(vbi->backup_netdev);
>>> + if (child_netdev)
>>> + virtnet_bypass_child_open(dev, child_netdev);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static int virtnet_bypass_close(struct net_device *dev)
>>> +{
>>> + struct virtnet_bypass_info *vi = netdev_priv(dev);
>>> + struct net_device *child_netdev;
>>> +
>>> + netif_tx_disable(dev);
>>> +
>>> + child_netdev = rtnl_dereference(vi->active_netdev);
>>> + if (child_netdev)
>>> + dev_close(child_netdev);
>>> +
>>> + child_netdev = rtnl_dereference(vi->backup_netdev);
>>> + if (child_netdev)
>>> + dev_close(child_netdev);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static netdev_tx_t virtnet_bypass_drop_xmit(struct sk_buff *skb,
>>> + struct net_device *dev)
>>> +{
>>> + atomic_long_inc(&dev->tx_dropped);
>>> + dev_kfree_skb_any(skb);
>>> + return NETDEV_TX_OK;
>>> +}
>>> +
>>> +static bool virtnet_bypass_xmit_ready(struct net_device *dev)
>>> +{
>>> + return netif_running(dev) && netif_carrier_ok(dev);
>>> +}
>>> +
>>> +static netdev_tx_t virtnet_bypass_start_xmit(struct sk_buff *skb,
>>> + struct net_device *dev)
>>> +{
>>> + struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> + struct net_device *xmit_dev;
>>> +
>>> + /* Try xmit via active netdev followed by backup netdev */
>>> + xmit_dev = rcu_dereference_bh(vbi->active_netdev);
>>> + if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) {
>>> + xmit_dev = rcu_dereference_bh(vbi->backup_netdev);
>>> + if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
>>> + return virtnet_bypass_drop_xmit(skb, dev);
>>> + }
>>> +
>>> + skb->dev = xmit_dev;
>>> + skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>>> +
>>> + return dev_queue_xmit(skb);
>>> +}
>>> +
>>> +static u16 virtnet_bypass_select_queue(struct net_device *dev,
>>> + struct sk_buff *skb, void *accel_priv,
>>> + select_queue_fallback_t fallback)
>>> +{
>>> + /* This helper function exists to help dev_pick_tx get the correct
>>> + * destination queue. Using a helper function skips a call to
>>> + * skb_tx_hash and will put the skbs in the queue we expect on their
>>> + * way down to the bonding driver.
>>> + */
>>> + u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>>> +
>>> + /* Save the original txq to restore before passing to the driver */
>>> + qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>>> +
>>> + if (unlikely(txq >= dev->real_num_tx_queues)) {
>>> + do {
>>> + txq -= dev->real_num_tx_queues;
>>> + } while (txq >= dev->real_num_tx_queues);
>>> + }
>>> +
>>> + return txq;
>>> +}
>>> +
>>> +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>>> + * that some drivers can provide 32bit values only.
>>> + */
>>> +static void virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res,
>>> + const struct rtnl_link_stats64 *_new,
>>> + const struct rtnl_link_stats64 *_old)
>>> +{
>>> + const u64 *new = (const u64 *)_new;
>>> + const u64 *old = (const u64 *)_old;
>>> + u64 *res = (u64 *)_res;
>>> + int i;
>>> +
>>> + for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>>> + u64 nv = new[i];
>>> + u64 ov = old[i];
>>> + s64 delta = nv - ov;
>>> +
>>> + /* detects if this particular field is 32bit only */
>>> + if (((nv | ov) >> 32) == 0)
>>> + delta = (s64)(s32)((u32)nv - (u32)ov);
>>> +
>>> + /* filter anomalies, some drivers reset their stats
>>> + * at down/up events.
>>> + */
>>> + if (delta > 0)
>>> + res[i] += delta;
>>> + }
>>> +}
>>> +
>>> +static void virtnet_bypass_get_stats(struct net_device *dev,
>>> + struct rtnl_link_stats64 *stats)
>>> +{
>>> + struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> + const struct rtnl_link_stats64 *new;
>>> + struct rtnl_link_stats64 temp;
>>> + struct net_device *child_netdev;
>>> +
>>> + spin_lock(&vbi->stats_lock);
>>> + memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
>>> +
>>> + rcu_read_lock();
>>> +
>>> + child_netdev = rcu_dereference(vbi->active_netdev);
>>> + if (child_netdev) {
>>> + new = dev_get_stats(child_netdev, &temp);
>>> + virtnet_bypass_fold_stats(stats, new, &vbi->active_stats);
>>> + memcpy(&vbi->active_stats, new, sizeof(*new));
>>> + }
>>> +
>>> + child_netdev = rcu_dereference(vbi->backup_netdev);
>>> + if (child_netdev) {
>>> + new = dev_get_stats(child_netdev, &temp);
>>> + virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
>>> + memcpy(&vbi->backup_stats, new, sizeof(*new));
>>> + }
>>> +
>>> + rcu_read_unlock();
>>> +
>>> + memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
>>> + spin_unlock(&vbi->stats_lock);
>>> +}
>>> +
>>> +static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
>>> +{
>>> + struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> + struct net_device *child_netdev;
>>> + int ret = 0;
>>> +
>>> + child_netdev = rcu_dereference(vbi->active_netdev);
>>> + if (child_netdev) {
>>> + ret = dev_set_mtu(child_netdev, new_mtu);
>>> + if (ret)
>>> + return ret;
>>> + }
>>> +
>>> + child_netdev = rcu_dereference(vbi->backup_netdev);
>>> + if (child_netdev) {
>>> + ret = dev_set_mtu(child_netdev, new_mtu);
>>> + if (ret)
>>> + netdev_err(child_netdev,
>>> + "Unexpected failure to set mtu to %d\n",
>>> + new_mtu);
>>
>> Shouldn't we unwind the MTU config on active_netdev if we fail to set
>> it on backup_netdev?
>
>
> backup_netdev is virtio_net and it is not expected to fail as it supports
> MTU all the way up to 64K.
Hmmm, then why do you need to check the return value of setting the
MTU on backup_netdev here? And it's not completely impossible for
setting the MTU to fail for some other reason at the lower level. We
should not build on that assumption.
>
>
>
>>
>>> + }
>>> +
>>> + dev->mtu = new_mtu;
>>> + return 0;
>>> +}
>>> +
>>> +static void virtnet_bypass_set_rx_mode(struct net_device *dev)
>>> +{
>>> + struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> + struct net_device *child_netdev;
>>> +
>>> + rcu_read_lock();
>>> +
>>> + child_netdev = rcu_dereference(vbi->active_netdev);
>>> + if (child_netdev) {
>>> + dev_uc_sync_multiple(child_netdev, dev);
>>> + dev_mc_sync_multiple(child_netdev, dev);
>>> + }
>>> +
>>> + child_netdev = rcu_dereference(vbi->backup_netdev);
>>> + if (child_netdev) {
>>> + dev_uc_sync_multiple(child_netdev, dev);
>>> + dev_mc_sync_multiple(child_netdev, dev);
>>> + }
>>> +
>>
>> If the VF comes up later than when set_rx_mode is called, where do you
>> sync up the unicast and multicast addresses?
>
>
> I could include syncing up of the unicast and multicast lists along with
> the MTU when the child netdev is registered.
>
>
>
>>
>> The rest looks good.
>>
>> Thanks,
>> -Siwei
>>
>>
>