Message-ID: <CADGSJ21+g+HmjYW6hp03oMK96irx+g9y6nUoD37X-UwR=MBnYA@mail.gmail.com>
Date: Mon, 11 Jun 2018 12:23:33 -0700
From: Siwei Liu <loseweigh@...il.com>
To: Stephen Hemminger <stephen@...workplumber.org>
Cc: "Michael S. Tsirkin" <mst@...hat.com>,
Jiri Pirko <jiri@...nulli.us>, kys@...rosoft.com,
haiyangz@...rosoft.com, David Miller <davem@...emloft.net>,
"Samudrala, Sridhar" <sridhar.samudrala@...el.com>,
Netdev <netdev@...r.kernel.org>,
Stephen Hemminger <sthemmin@...rosoft.com>
Subject: Re: [PATCH net] failover: eliminate callback hell
On Mon, Jun 11, 2018 at 8:22 AM, Stephen Hemminger
<stephen@...workplumber.org> wrote:
> On Fri, 8 Jun 2018 17:42:21 -0700
> Siwei Liu <loseweigh@...il.com> wrote:
>
>> On Fri, Jun 8, 2018 at 5:02 PM, Stephen Hemminger
>> <stephen@...workplumber.org> wrote:
>> > On Fri, 8 Jun 2018 16:44:12 -0700
>> > Siwei Liu <loseweigh@...il.com> wrote:
>> >
>> >> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
>> >> <stephen@...workplumber.org> wrote:
>> >> > On Fri, 8 Jun 2018 15:25:59 -0700
>> >> > Siwei Liu <loseweigh@...il.com> wrote:
>> >> >
>> >> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
>> >> >> <stephen@...workplumber.org> wrote:
>> >> >> > On Wed, 6 Jun 2018 15:30:27 +0300
>> >> >> > "Michael S. Tsirkin" <mst@...hat.com> wrote:
>> >> >> >
>> >> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> >> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@...workplumber.org wrote:
>> >> >> >> > >The net failover should be a simple library, not a virtual
>> >> >> >> > >object with function callbacks (see callback hell).
>> >> >> >> >
>> >> >> >> > Why just a library? It should do the common things. I think it should
>> >> >> >> > be a virtual object. Looks like your patch again splits the common
>> >> >> >> > functionality into multiple drivers. That is a kind of backwards
>> >> >> >> > attitude. I don't get it. We should rather focus on fixing the mess
>> >> >> >> > the introduction of netvsc-bonding caused and switch netvsc to the
>> >> >> >> > 3-netdev model.
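For what it's worth, the two styles being argued about look roughly like
this (a minimal sketch, loosely modeled on the series under discussion;
the exact names are my own and may differ from the actual patches):

    /* "Virtual object" style: the core owns a failover object and
     * calls back into the paravirtual driver through an ops table. */
    struct failover_ops {
            int (*slave_pre_register)(struct net_device *slave,
                                      struct net_device *failover_dev);
            void (*slave_link_change)(struct net_device *slave,
                                      struct net_device *failover_dev);
    };

    /* "Library" style: the driver stays in control and just calls
     * helper functions at the points it chooses. */
    int failover_slave_register(struct net_device *slave);
    void failover_slave_unregister(struct net_device *slave);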
>> >> >> >>
>> >> >> >> So it seems that at least one benefit for netvsc would be better
>> >> >> >> handling of renames.
>> >> >> >>
>> >> >> >> The question is how this change to 3-netdev can happen. Stephen is
>> >> >> >> concerned about the risk of breaking some userspace.
>> >> >> >>
>> >> >> >> Stephen, this seems to be the use case that IFF_HIDDEN was trying to
>> >> >> >> address, and you said then "why not use existing network namespaces
>> >> >> >> rather than inventing a new abstraction". So how about it then? Do you
>> >> >> >> want to find a way to use namespaces to hide the PV device for netvsc
>> >> >> >> compatibility?
>> >> >> >>
>> >> >> >
>> >> >> > Netvsc can't work with the 3-netdev model. MS has worked with enough
>> >> >> > distros and startups that all demand eth0 always be present. And the
>> >> >> > VF may come and go. After this history, there is a strong motivation
>> >> >> > not to change how the kernel behaves. Switching to the 3-device model
>> >> >> > would be perceived as breaking existing userspace.
>> >> >> >
>> >> >> > With virtio you can work it out with the distros yourself.
>> >> >> > There is no pre-existing semantics to deal with.
>> >> >> >
>> >> >> > For the virtio, I don't see the need for IFF_HIDDEN.
>> >> >>
>> >> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> >> >> that flag, as well as the 1-netdev model, is to have a means to
>> >> >> inherit the interface name from the VF, and to eliminate hacks
>> >> >> around renaming devices, customizing udev rules, et al. Why is
>> >> >> inheriting the VF's name important? To allow existing config/setup
>> >> >> around the VF to continue to work across a kernel feature upgrade.
>> >> >> Most network config files in all distros are based on interface
>> >> >> names. A few are MAC address based, and making lower slaves hidden
>> >> >> would cover the rest. And most importantly, it preserves the same
>> >> >> level of user experience as using the raw VF interface, once all
>> >> >> ndo_ops and ethtool_ops are exposed. This is essential to realizing
>> >> >> transparent live migration that users don't have to learn about or
>> >> >> even be aware of.
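To be concrete about what the flag was meant to do: the whole point is
just to let the core skip hidden lower slaves when enumerating links, so
userspace only ever sees the master that inherited the VF's name. A
minimal sketch, assuming the proposed (never-merged) IFF_HIDDEN priv
flag and a made-up dump helper:

    for_each_netdev(net, dev) {
            if (dev->priv_flags & IFF_HIDDEN)
                    continue;       /* lower slave stays invisible */
            if (dump_one_link(skb, dev) < 0)  /* hypothetical helper */
                    break;
    }

Nothing else about the slaves changes; they are still there, just not
enumerated by default.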
>> >> >
>> >> > Inheriting the VF name will fail in the migration scenario.
>> >> > It is perfectly reasonable to migrate a guest to another machine where
>> >> > the VF PCI address is different. And since the current udev/systemd
>> >> > model is to base the network device name off of the PCI address, the
>> >> > device will change name when the guest is migrated.
>> >> >
>> >> The scenario of having the VF at a different PCI address after
>> >> migration is essentially the same as plugging in a new NIC. Why does
>> >> it have to pair with the original PV? A separate PV device should be
>> >> in place to pair with the new VF.
>> >
>> > The host only guarantees that the PV device will be on the same network.
>> > It does not make any PCI guarantees. The way Windows works is to find
>> > the device based on "serial number", which is a Hyper-V-specific attribute
>> > of PCI devices.
>> >
>> > I considered naming off of the serial number, but that won't work for
>> > the case where the PV device is present first and the VF arrives later.
>> > The serial number is an attribute of the VF, not of the PV, which is
>> > there first.
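To make sure we mean the same thing by "matching": what I had in mind is
roughly the following (a hypothetical sketch, not the netvsc code;
dev_get_serial() is a made-up accessor, and I'm assuming the
netif_is_failover() helper from the failover series):

    /* Pair an arriving VF with the PV master that carries the same
     * serial number, if any. */
    static struct net_device *find_pv_by_serial(u32 serial)
    {
            struct net_device *dev;

            for_each_netdev(&init_net, dev) {
                    if (netif_is_failover(dev) &&
                        dev_get_serial(dev) == serial)
                            return dev;
            }
            return NULL;
    }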
>>
>> I assume the PV can get that information ahead of time, before the VF
>> arrives? Without it, how do you match the device when you see a VF
>> coming with some serial number? Is it possible for the PV to get the
>> matching SN even earlier, at probe time? Or does it have to depend on
>> the presence of the vPCI bridge to generate this SN?
>
> NO. The PV device does not know ahead of time, and there are scenarios
> where the serial and PCI info can change when it does arrive. These
> are test cases (not something people usually do). Example on WS2016:
> Guest configured with two or more vswitches and NICs.
> SR-IOV is not enabled.
>
> Later:
> On the Hyper-V console (or PowerShell command line) on the host, SR-IOV
> is enabled on the second NIC.
>
> The guest will be notified of the new PCI device; the "serial number"
> will be 1.
>
> If the same process is repeated, but in this case the first NIC has
> SR-IOV enabled, it will get serial # 1.
>
>
> I agree with Jakub. What you are proposing is backwards. The VF
> must be thought of as a dependent of the PV device, not vice versa.
I didn't insist that netvsc move to the same 1-netdev model, did I? I
understand Hyper-V has its own specific design that's hard to get
around.

All I said was that transparent live migration and the 1-netdev model
should work for passthrough with virtio as the helper under QEMU. As I
recall, the initial intent was to use virtio as a migration helper
rather than having the VF as an acceleration path. The latter, as far
as I know, is Hyper-V's point of view. I don't know where those side
features come from, or why pursuing transparent live migration is
backwards.
-Siwei
>
>> >
>> > Your ideas about having the PCI information of the VF form the name
>> > of the failover device have the same problem. The PV device may
>> > be the only one present on boot.
>>
>> Yeah, this is a chicken-and-egg problem indeed, and that was the reason
>> why I supplied the BDF info for the PV to name the master interface.
>> However, the ACPI PCI slot depends on the PCI bus enumeration, so it
>> can't be predicted. Would it make sense to rename only the first time
>> a matching VF appears while the PV interface isn't brought up, with
>> the failover master then sticking to that name afterwards? I think
>> that should cover most scenarios, as it's usually during boot time
>> (dracut) that the VF first appears, and the PV interface shouldn't
>> have been configured by then.
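Concretely, something along these lines (a rough sketch only; the real
thing would need rtnl locking and per-device state rather than a
global flag):

    /* Inherit the VF-derived name the first time a matching VF shows
     * up, but only if the failover master hasn't been brought up yet;
     * after that the name sticks. */
    static bool name_inherited;

    static void failover_maybe_inherit_name(struct net_device *failover_dev,
                                            const char *vf_name)
    {
            if (name_inherited || (failover_dev->flags & IFF_UP))
                    return;
            if (dev_change_name(failover_dev, vf_name) >= 0)
                    name_inherited = true;
    }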
>>
>> -Siwei
>>
>> >
>> >
>> >> > On Azure, the VF may be removed (by the host) at any time and then
>> >> > later reattached. There is no guarantee that the VF will show back up
>> >> > at the same synthetic PCI address. It will likely have a different
>> >> > PCI domain value.
>> >>
>> >> This is something QEMU can do to make sure the PCI address is
>> >> consistent after migration.
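Right -- e.g. if the VF is always assigned with an explicit guest
address, it comes back at the same BDF after migration (an illustrative
command line only; the host BDF and IDs here are made up):

    -device vfio-pci,host=0000:3b:00.2,bus=pci.0,addr=0x5,id=hostdev0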
>> >>
>> >> -Siwei
>> >
>