Message-ID: <20180611082253.631219cf@xeon-e3>
Date: Mon, 11 Jun 2018 08:22:53 -0700
From: Stephen Hemminger <stephen@...workplumber.org>
To: Siwei Liu <loseweigh@...il.com>
Cc: "Michael S. Tsirkin" <mst@...hat.com>,
Jiri Pirko <jiri@...nulli.us>, kys@...rosoft.com,
haiyangz@...rosoft.com, David Miller <davem@...emloft.net>,
"Samudrala, Sridhar" <sridhar.samudrala@...el.com>,
Netdev <netdev@...r.kernel.org>,
Stephen Hemminger <sthemmin@...rosoft.com>
Subject: Re: [PATCH net] failover: eliminate callback hell
On Fri, 8 Jun 2018 17:42:21 -0700
Siwei Liu <loseweigh@...il.com> wrote:
> On Fri, Jun 8, 2018 at 5:02 PM, Stephen Hemminger
> <stephen@...workplumber.org> wrote:
> > On Fri, 8 Jun 2018 16:44:12 -0700
> > Siwei Liu <loseweigh@...il.com> wrote:
> >
> >> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
> >> <stephen@...workplumber.org> wrote:
> >> > On Fri, 8 Jun 2018 15:25:59 -0700
> >> > Siwei Liu <loseweigh@...il.com> wrote:
> >> >
> >> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
> >> >> <stephen@...workplumber.org> wrote:
> >> >> > On Wed, 6 Jun 2018 15:30:27 +0300
> >> >> > "Michael S. Tsirkin" <mst@...hat.com> wrote:
> >> >> >
> >> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
> >> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@...workplumber.org wrote:
> >> >> >> > >The net failover should be a simple library, not a virtual
> >> >> >> > >object with function callbacks (see callback hell).
> >> >> >> >
> >> >> >> > Why just a library? It should do a common things. I think it should be a
> >> >> >> > virtual object. Looks like your patch again splits the common
> >> >> >> > functionality into multiple drivers. That is kind of backwards attitude.
> >> >> >> > I don't get it. We should rather focus on fixing the mess the
> >> >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> >> >> >> > model.
> >> >> >>
> >> >> >> So it seems that at least one benefit for netvsc would be better
> >> >> >> handling of renames.
> >> >> >>
> >> >> >> Question is how can this change to 3-netdev happen? Stephen is
> >> >> >> concerned about risk of breaking some userspace.
> >> >> >>
> >> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >> >> >> address, and you said then "why not use existing network namespaces
> >> >> >> rather than inventing a new abstraction". So how about it then? Do you
> >> >> >> want to find a way to use namespaces to hide the PV device for netvsc
> >> >> >> compatibility?
> >> >> >>
> >> >> >
> >> >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> >> >> > startups that all demand eth0 always be present. And VF may come and go.
> >> >> > After this history, there is a strong motivation not to change how kernel
> >> >> > behaves. Switching to 3 device model would be perceived as breaking
> >> >> > existing userspace.
> >> >> >
> >> >> > With virtio you can work it out with the distro's yourself.
> >> >> > There is no pre-existing semantics to deal with.
> >> >> >
> >> >> > For the virtio, I don't see the need for IFF_HIDDEN.
> >> >>
> >> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
> >> >> that flag, as well as the 1-netdev model, is to have a means to
> >> >> inherit the interface name from the VF, and to eliminate playing hacks
> >> >> around renaming devices, customizing udev rules and et al. Why
> >> >> inheriting VF's name important? To allow existing config/setup around
> >> >> VF continues to work across kernel feature upgrade. Most of network
> >> >> config files in all distros are based on interface names. Few are MAC
> >> >> address based but making lower slaves hidden would cover the rest. And
> >> >> most importantly, preserving the same level of user experience as
> >> >> using raw VF interface once getting all ndo_ops and ethtool_ops
> >> >> exposed. This is essential to realize transparent live migration that
> >> >> users dont have to learn and be aware of the undertaken.
> >> >
> >> > Inheriting the VF name will fail in the migration scenario.
> >> > It is perfectly reasonable to migrate a guest to another machine where
> >> > the VF PCI address is different. And since current udev/systemd model
> >> > is to base network device name off of PCI address, the device will change
> >> > name when guest is migrated.
> >> >
> >> The scenario of having VF on a different PCI address on post migration
> >> is essentially equal to plugging in a new NIC. Why it has to pair with
> >> the original PV? A separate PV device should be in place to pair the
> >> new VF.
> >
> > The host only guarantees that the PV device will be on the same network.
> > It does not make any PCI guarantees. The way Windows works is to find
> > the device based on "serial number" which is an Hyper-V specific attribute
> > of PCI devices.
> >
> > I considered naming off of serial number but that won't work for the
> > case where PV device is present first and VF arrives later. The serial
> > number is attribute of VF, not the PV which is there first.
>
> I assume the PV can get that information ahead of time before VF
> arrives? Without it how do you match the device when you see a VF
> coming with some serial number? Is it possible for PV to get the
> matching SN even earlier during probe time? Or it has to depend on the
> presence of vPCI bridge to generate this SN?
No. The PV device does not know ahead of time, and there are scenarios
where the serial and PCI info can change when the VF does arrive. These
are test cases (not something people usually do). Example on WS2016:
  Guest is configured with two or more vswitches and NICs.
  SR-IOV is not enabled.
Later:
  On the Hyper-V console (or PowerShell command line) on the host, SR-IOV
  is enabled on the second NIC.
  The guest will be notified of a new PCI device; the "serial number"
  will be 1.
If the same process is repeated, but this time the first NIC has
SR-IOV enabled, it will get serial # 1.
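
The two cases above can be sketched as a toy model. This is purely
illustrative Python, not Hyper-V or kernel code: the Host class,
enable_sriov(), serial_based_name() and the NIC labels are hypothetical,
and the simple counter is an assumption that only reproduces the
"first VF gets serial 1" behaviour described above.

```python
class Host:
    """Toy host: hands out VF serial numbers in the order SR-IOV is enabled."""

    def __init__(self, nics):
        self.nics = list(nics)   # synthetic (PV) NICs present at guest boot
        self.vf_serial = {}      # NIC label -> serial of its VF, once attached
        self._next_serial = 1    # assumption: serials are handed out in order

    def enable_sriov(self, nic):
        """Hot-add a VF behind the given PV NIC and return its serial."""
        if nic not in self.nics:
            raise ValueError(f"unknown NIC: {nic}")
        serial = self._next_serial
        self._next_serial += 1
        self.vf_serial[nic] = serial
        return serial


def serial_based_name(serial):
    """Hypothetical naming scheme that derives a name from the VF serial."""
    return f"vf-sn{serial}"


# Case 1: SR-IOV is enabled on the second NIC only.
case1 = Host(["nic1", "nic2"])
# At boot no VF exists, so the PV devices have no serial to name from.
boot_serials = dict(case1.vf_serial)
name_case1 = serial_based_name(case1.enable_sriov("nic2"))

# Case 2: same guest, but SR-IOV is enabled on the first NIC instead.
case2 = Host(["nic1", "nic2"])
name_case2 = serial_based_name(case2.enable_sriov("nic1"))

# Both names are identical although they refer to different PV NICs.
print(boot_serials, name_case1, name_case2)
```

In this sketch the same serial-derived name ends up attached to different
PV NICs depending on the order of operations, and nothing is known at
boot, which is why the serial cannot serve as the identity of the PV
device.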
I agree with Jakub. What you are proposing is backwards. The VF
must be thought of as a dependent of the PV device, not vice versa.
> >
> > Your ideas about having the PCI information of the VF form the name
> > of the failover device have the same problem. The PV device may
> > be the only one present on boot.
>
> Yeah, this is a chicken-egg problem indeed, and that was the reason
> why I supply the BDF info for PV to name the master interface.
> However, the ACPI PCI slot needs to depend on the PCI bus enumeration
> so that can't be predictable. Would it make sense to only rename when
> the first time a matching VF appears and PV interface isn't brought
> up, then the failover master would always stick to the name
> afterwards? I think it should cover most scenarios as it's usually
> during boot time (dracut) the VF first appears and the PV interface at
> the time then shouldn't have been configured yet.
>
> -Siwei
>
> >
> >
> >> > On Azure, the VF maybe removed (by host) at any time and then later
> >> > reattached. There is no guarantee that VF will show back up at
> >> > the same synthetic PCI address. It will likely have a different
> >> > PCI domain value.
> >>
> >> This is something QEMU can do and make sure the PCI address is
> >> consistent after migration.
> >>
> >> -Siwei
> >