Date: Fri, 2 Mar 2018 16:09:27 -0800
From: Siwei Liu <loseweigh@...il.com>
To: "Samudrala, Sridhar" <sridhar.samudrala@...el.com>
Cc: "Michael S. Tsirkin" <mst@...hat.com>,
Stephen Hemminger <stephen@...workplumber.org>,
David Miller <davem@...emloft.net>,
Netdev <netdev@...r.kernel.org>, Jiri Pirko <jiri@...nulli.us>,
virtio-dev@...ts.oasis-open.org,
"Brandeburg, Jesse" <jesse.brandeburg@...el.com>,
Alexander Duyck <alexander.h.duyck@...el.com>,
Jakub Kicinski <kubakici@...pl>
Subject: Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use
VF datapath when available
On Fri, Mar 2, 2018 at 3:12 PM, Samudrala, Sridhar
<sridhar.samudrala@...el.com> wrote:
> On 3/2/2018 1:11 PM, Siwei Liu wrote:
>>
>> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
>> <sridhar.samudrala@...el.com> wrote:
>>>
>>> This patch enables virtio_net to switch over to a VF datapath when a VF
>>> netdev is present with the same MAC address. It allows live migration
>>> of a VM with a direct attached VF without the need to setup a bond/team
>>> between a VF and virtio net device in the guest.
>>>
>>> The hypervisor needs to enable only one datapath at any time so that
>>> packets don't get looped back to the VM over the other datapath. When a
>>> VF is plugged, the virtio datapath link state can be marked as down. The
>>> hypervisor needs to unplug the VF device from the guest on the source
>>> host and reset the MAC filter of the VF to initiate failover of datapath
>>> to virtio before starting the migration. After the migration is
>>> completed, the destination hypervisor sets the MAC filter on the VF and
>>> plugs it back to the guest to switch over to VF datapath.
>>>
>>> When BACKUP feature is enabled, an additional netdev (bypass netdev) is
>>> created that acts as a master device and tracks the state of the 2 lower
>>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>>> passthru device with the same MAC is registered as 'active' netdev.
>>>
>>> This patch is based on the discussion initiated by Jesse on this thread.
>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>>
>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@...el.com>
>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@...el.com>
>>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@...el.com>
>>> ---
>>>  drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
>>>  1 file changed, 682 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index bcd13fe906ca..f2860d86c952 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -30,6 +30,8 @@
>>> #include <linux/cpu.h>
>>> #include <linux/average.h>
>>> #include <linux/filter.h>
>>> +#include <linux/netdevice.h>
>>> +#include <linux/pci.h>
>>> #include <net/route.h>
>>> #include <net/xdp.h>
>>>
>>> @@ -206,6 +208,9 @@ struct virtnet_info {
>>> u32 speed;
>>>
>>> unsigned long guest_offloads;
>>> +
>>> + /* upper netdev created when BACKUP feature enabled */
>>> + struct net_device *bypass_netdev;
>>> };
>>>
>>> struct padded_vnet_hdr {
>>> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>>> }
>>> }
>>>
>>> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>>> + size_t len)
>>> +{
>>> + struct virtnet_info *vi = netdev_priv(dev);
>>> + int ret;
>>> +
>>> + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>>> + return -EOPNOTSUPP;
>>> +
>>> + ret = snprintf(buf, len, "_bkup");
>>> + if (ret >= len)
>>> + return -EOPNOTSUPP;
>>> +
>>> + return 0;
>>> +}
>>> +
>>
>> What if systemd/udevd is not new enough to enforce the
>> n<phys_port_name> naming? Would virtio_bypass get a different name
>> than the original virtio_net? Should we detect this earlier and fall
>> back to legacy mode without creating the bypass netdev and enslaving
>> the VF?
>
>
> If udev doesn't support renaming of the devices, then the upper bypass
> device should get the original name and the lower virtio netdev will get
> the next name.
If you have two virtio-nets (say eth0 and eth1) before the
update, does the virtio_bypass interface on the first virtio get eth0
while the backup netdev gets eth1? Then the IP address originally on
eth1 gets configured on the backup interface?
> Hopefully the distros updating the kernel will also move to the new
> systemd/udev.
This is not reliable. I'd opt for a new udev API for this, and fall
back to the legacy behavior if it isn't available.
>
>
>
>>
>>> static const struct net_device_ops virtnet_netdev = {
>>> .ndo_open = virtnet_open,
>>> .ndo_stop = virtnet_close,
>>> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = {
>>> .ndo_xdp_xmit = virtnet_xdp_xmit,
>>> .ndo_xdp_flush = virtnet_xdp_flush,
>>> .ndo_features_check = passthru_features_check,
>>> + .ndo_get_phys_port_name = virtnet_get_phys_port_name,
>>> };
>>>
>>> static void virtnet_config_changed_work(struct work_struct *work)
>>> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device *vdev)
>>> return 0;
>>> }
>>>
>>> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
>>> + * When BACKUP feature is enabled, an additional netdev(bypass netdev)
>>> + * is created that acts as a master device and tracks the state of the
>>> + * 2 lower netdevs. The original virtio_net netdev is registered as
>>> + * 'backup' netdev and a passthru device with the same MAC is registered
>>> + * as 'active' netdev.
>>> + */
>>> +
>>> +/* bypass state maintained when BACKUP feature is enabled */
>>> +struct virtnet_bypass_info {
>>> + /* passthru netdev with same MAC */
>>> + struct net_device __rcu *active_netdev;
>>> +
>>> + /* virtio_net netdev */
>>> + struct net_device __rcu *backup_netdev;
>>> +
>>> + /* active netdev stats */
>>> + struct rtnl_link_stats64 active_stats;
>>> +
>>> + /* backup netdev stats */
>>> + struct rtnl_link_stats64 backup_stats;
>>> +
>>> + /* aggregated stats */
>>> + struct rtnl_link_stats64 bypass_stats;
>>> +
>>> + /* spinlock while updating stats */
>>> + spinlock_t stats_lock;
>>> +};
>>> +
>>> +static void virtnet_bypass_child_open(struct net_device *dev,
>>> + struct net_device *child_netdev)
>>> +{
>>> + int err = dev_open(child_netdev);
>>> +
>>> + if (err)
>>> + netdev_warn(dev, "unable to open slave: %s: %d\n",
>>> + child_netdev->name, err);
>>> +}
>>> +
>>> +static int virtnet_bypass_open(struct net_device *dev)
>>> +{
>>> + struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> + struct net_device *child_netdev;
>>> +
>>> + netif_carrier_off(dev);
>>> + netif_tx_wake_all_queues(dev);
>>> +
>>> + child_netdev = rtnl_dereference(vbi->active_netdev);
>>> + if (child_netdev)
>>> + virtnet_bypass_child_open(dev, child_netdev);
>>> +
>>> + child_netdev = rtnl_dereference(vbi->backup_netdev);
>>> + if (child_netdev)
>>> + virtnet_bypass_child_open(dev, child_netdev);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static int virtnet_bypass_close(struct net_device *dev)
>>> +{
>>> + struct virtnet_bypass_info *vi = netdev_priv(dev);
>>> + struct net_device *child_netdev;
>>> +
>>> + netif_tx_disable(dev);
>>> +
>>> + child_netdev = rtnl_dereference(vi->active_netdev);
>>> + if (child_netdev)
>>> + dev_close(child_netdev);
>>> +
>>> + child_netdev = rtnl_dereference(vi->backup_netdev);
>>> + if (child_netdev)
>>> + dev_close(child_netdev);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static netdev_tx_t virtnet_bypass_drop_xmit(struct sk_buff *skb,
>>> + struct net_device *dev)
>>> +{
>>> + atomic_long_inc(&dev->tx_dropped);
>>> + dev_kfree_skb_any(skb);
>>> + return NETDEV_TX_OK;
>>> +}
>>> +
>>> +static bool virtnet_bypass_xmit_ready(struct net_device *dev)
>>> +{
>>> + return netif_running(dev) && netif_carrier_ok(dev);
>>> +}
>>> +
>>> +static netdev_tx_t virtnet_bypass_start_xmit(struct sk_buff *skb,
>>> + struct net_device *dev)
>>> +{
>>> + struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> + struct net_device *xmit_dev;
>>> +
>>> + /* Try xmit via active netdev followed by backup netdev */
>>> + xmit_dev = rcu_dereference_bh(vbi->active_netdev);
>>> + if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) {
>>> + xmit_dev = rcu_dereference_bh(vbi->backup_netdev);
>>> + if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
>>> + return virtnet_bypass_drop_xmit(skb, dev);
>>> + }
>>> +
>>> + skb->dev = xmit_dev;
>>> + skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>>> +
>>> + return dev_queue_xmit(skb);
>>> +}
>>> +
>>> +static u16 virtnet_bypass_select_queue(struct net_device *dev,
>>> + struct sk_buff *skb, void *accel_priv,
>>> + select_queue_fallback_t fallback)
>>> +{
>>> + /* This helper function exists to help dev_pick_tx get the correct
>>> + * destination queue. Using a helper function skips a call to
>>> + * skb_tx_hash and will put the skbs in the queue we expect on their
>>> + * way down to the bonding driver.
>>> + */
>>> + u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>>> +
>>> + /* Save the original txq to restore before passing to the driver */
>>> + qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>>> +
>>> + if (unlikely(txq >= dev->real_num_tx_queues)) {
>>> + do {
>>> + txq -= dev->real_num_tx_queues;
>>> + } while (txq >= dev->real_num_tx_queues);
>>> + }
>>> +
>>> + return txq;
>>> +}
>>> +
>>> +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>>> + * that some drivers can provide 32bit values only.
>>> + */
>>> +static void virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res,
>>> + const struct rtnl_link_stats64 *_new,
>>> + const struct rtnl_link_stats64 *_old)
>>> +{
>>> + const u64 *new = (const u64 *)_new;
>>> + const u64 *old = (const u64 *)_old;
>>> + u64 *res = (u64 *)_res;
>>> + int i;
>>> +
>>> + for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>>> + u64 nv = new[i];
>>> + u64 ov = old[i];
>>> + s64 delta = nv - ov;
>>> +
>>> + /* detects if this particular field is 32bit only */
>>> + if (((nv | ov) >> 32) == 0)
>>> + delta = (s64)(s32)((u32)nv - (u32)ov);
>>> +
>>> + /* filter anomalies, some drivers reset their stats
>>> + * at down/up events.
>>> + */
>>> + if (delta > 0)
>>> + res[i] += delta;
>>> + }
>>> +}
>>> +
>>> +static void virtnet_bypass_get_stats(struct net_device *dev,
>>> + struct rtnl_link_stats64 *stats)
>>> +{
>>> + struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> + const struct rtnl_link_stats64 *new;
>>> + struct rtnl_link_stats64 temp;
>>> + struct net_device *child_netdev;
>>> +
>>> + spin_lock(&vbi->stats_lock);
>>> + memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
>>> +
>>> + rcu_read_lock();
>>> +
>>> + child_netdev = rcu_dereference(vbi->active_netdev);
>>> + if (child_netdev) {
>>> + new = dev_get_stats(child_netdev, &temp);
>>> + virtnet_bypass_fold_stats(stats, new, &vbi->active_stats);
>>> + memcpy(&vbi->active_stats, new, sizeof(*new));
>>> + }
>>> +
>>> + child_netdev = rcu_dereference(vbi->backup_netdev);
>>> + if (child_netdev) {
>>> + new = dev_get_stats(child_netdev, &temp);
>>> + virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
>>> + memcpy(&vbi->backup_stats, new, sizeof(*new));
>>> + }
>>> +
>>> + rcu_read_unlock();
>>> +
>>> + memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
>>> + spin_unlock(&vbi->stats_lock);
>>> +}
>>> +
>>> +static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
>>> +{
>>> + struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> + struct net_device *child_netdev;
>>> + int ret = 0;
>>> +
>>> + child_netdev = rcu_dereference(vbi->active_netdev);
>>> + if (child_netdev) {
>>> + ret = dev_set_mtu(child_netdev, new_mtu);
>>> + if (ret)
>>> + return ret;
>>> + }
>>> +
>>> + child_netdev = rcu_dereference(vbi->backup_netdev);
>>> + if (child_netdev) {
>>> + ret = dev_set_mtu(child_netdev, new_mtu);
>>> + if (ret)
>>> + netdev_err(child_netdev,
>>> + "Unexpected failure to set mtu to %d\n",
>>> + new_mtu);
>>
>> Shouldn't we unwind the MTU config on active_netdev if we fail to set
>> it on backup_netdev?
>
>
> backup_netdev is virtio_net and it is not expected to fail as it supports
> MTU all the way up to 64K.
Hmmm, then why do you need to check the return value of setting the
MTU on backup_netdev here? And it's not completely impossible for
setting the MTU to fail for some other reason at the lower level. We
should not build on that assumption.
>
>
>
>>
>>> + }
>>> +
>>> + dev->mtu = new_mtu;
>>> + return 0;
>>> +}
>>> +
>>> +static void virtnet_bypass_set_rx_mode(struct net_device *dev)
>>> +{
>>> + struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> + struct net_device *child_netdev;
>>> +
>>> + rcu_read_lock();
>>> +
>>> + child_netdev = rcu_dereference(vbi->active_netdev);
>>> + if (child_netdev) {
>>> + dev_uc_sync_multiple(child_netdev, dev);
>>> + dev_mc_sync_multiple(child_netdev, dev);
>>> + }
>>> +
>>> + child_netdev = rcu_dereference(vbi->backup_netdev);
>>> + if (child_netdev) {
>>> + dev_uc_sync_multiple(child_netdev, dev);
>>> + dev_mc_sync_multiple(child_netdev, dev);
>>> + }
>>> +
>>
>> If the VF comes up later than when set_rx_mode is called, where do you
>> sync up the unicast and multicast addresses?
>
>
> I could include syncing up of the unicast and multicast lists along with
> the MTU when the child netdev is registered.
>
>
>
>>
>> The rest looks good.
>>
>> Thanks,
>> -Siwei
>>
>>
>