[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <A93B8FB2-9B60-417B-BE7A-94173B8FE584@oracle.com>
Date: Thu, 28 Feb 2019 02:00:48 +0200
From: Liran Alon <liran.alon@...cle.com>
To: "Michael S. Tsirkin" <mst@...hat.com>
Cc: si-wei liu <si-wei.liu@...cle.com>,
"Samudrala, Sridhar" <sridhar.samudrala@...el.com>,
Siwei Liu <loseweigh@...il.com>, Jiri Pirko <jiri@...nulli.us>,
Stephen Hemminger <stephen@...workplumber.org>,
David Miller <davem@...emloft.net>,
Netdev <netdev@...r.kernel.org>,
virtualization@...ts.linux-foundation.org,
virtio-dev <virtio-dev@...ts.oasis-open.org>,
"Brandeburg, Jesse" <jesse.brandeburg@...el.com>,
Alexander Duyck <alexander.h.duyck@...el.com>,
Jakub Kicinski <kubakici@...pl>,
Jason Wang <jasowang@...hat.com>
Subject: Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC
PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use
the bypass framework)
> On 28 Feb 2019, at 1:50, Michael S. Tsirkin <mst@...hat.com> wrote:
>
> On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
>>
>>
>> On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
>>> On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
>>>>
>>>> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
>>>>> On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
>>>>>> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
>>>>>>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>>>>>>>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>>>>>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>>>>>>>>>> Sorry for replying to this ancient thread. There was some remaining
>>>>>>>>>>>> issue that I don't think the initial net_failover patch got addressed
>>>>>>>>>>>> cleanly, see:
>>>>>>>>>>>>
>>>>>>>>>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.launchpad.net_ubuntu_-2Bsource_linux_-2Bbug_1815268&d=DwIBAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0&m=aL-QfUoSYx8r0XCOBkcDtF8f-cYxrJI3skYLFTb8XJE&s=yk6Nqv3a6_JMzyrXKY67h00FyNrDJyQ-PYMFffDSTXM&e=
>>>>>>>>>>>>
>>>>>>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>>>>>>>>>> not specifically writtten for such kernel automatic enslavement.
>>>>>>>>>>>> Specifically, if it is a bond or team, the slave would typically get
>>>>>>>>>>>> renamed *before* virtual device gets created, that's what udev can
>>>>>>>>>>>> control (without getting netdev opened early by the other part of
>>>>>>>>>>>> kernel) and other userspace components for e.g. initramfs,
>>>>>>>>>>>> init-scripts can coordinate well in between. The in-kernel
>>>>>>>>>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>>>>>>>>>> which don't provides a solution if user care about consistent naming
>>>>>>>>>>>> on the slave netdevs specifically.
>>>>>>>>>>>>
>>>>>>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>>>>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>>>>>>>>>> problem ever since. Please share your mind how to proceed and solve
>>>>>>>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>>>>>>>>>> Above says:
>>>>>>>>>>>
>>>>>>>>>>> there's no motivation in the systemd/udevd community at
>>>>>>>>>>> this point to refactor the rename logic and make it work well with
>>>>>>>>>>> 3-netdev.
>>>>>>>>>>>
>>>>>>>>>>> What would the fix be? Skip slave devices?
>>>>>>>>>>>
>>>>>>>>>> There's nothing user can get if just skipping slave devices - the
>>>>>>>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
>>>>>>>>>> next reboot, while the rest may conform to the naming scheme (ens3
>>>>>>>>>> and such). There's no way one can fix this in userspace alone - when
>>>>>>>>>> the failover is created the enslaved netdev was opened by the kernel
>>>>>>>>>> earlier than the userspace is made aware of, and there's no
>>>>>>>>>> negotiation protocol for kernel to know when userspace has done
>>>>>>>>>> initial renaming of the interface. I would expect netdev list should
>>>>>>>>>> at least provide the direction in general for how this can be
>>>>>>>>>> solved...
>>>>>>> I was just wondering what did you mean when you said
>>>>>>> "refactor the rename logic and make it work well with 3-netdev" -
>>>>>>> was there a proposal udev rejected?
>>>>>> No. I never believed this particular issue can be fixed in userspace alone.
>>>>>> Previously someone had said it could be, but I never see any work or
>>>>>> relevant discussion ever happened in various userspace communities (for e.g.
>>>>>> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
>>>>>> of the issue derives from the kernel, it makes more sense to start from
>>>>>> netdev, work out and decide on a solution: see what can be done in the
>>>>>> kernel in order to fix it, then after that engage userspace community for
>>>>>> the feasibility...
>>>>>>
>>>>>>> Anyway, can we write a time diagram for what happens in which order that
>>>>>>> leads to failure? That would help look for triggers that we can tie
>>>>>>> into, or add new ones.
>>>>>>>
>>>>>> See attached diagram.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>>>>>>>>> to only work with the master failover device.
>>>>>>>> Where does this expectation come from?
>>>>>>>>
>>>>>>>> Admin users may have ethtool or tc configurations that need to deal with
>>>>>>>> predictable interface name. Third-party app which was built upon specifying
>>>>>>>> certain interface name can't be modified to chase dynamic names.
>>>>>>>>
>>>>>>>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
>>>>>>>> offload settings post boot for specific workload. Those images won't work
>>>>>>>> well if the name is constantly changing just after couple rounds of live
>>>>>>>> migration.
>>>>>>> It should be possible to specify the ethtool configuration on the
>>>>>>> master and have it automatically propagated to the slave.
>>>>>>>
>>>>>>> BTW this is something we should look at IMHO.
>>>>>> I was elaborating a few examples that the expectation and assumption that
>>>>>> user/admin scripts only deal with master failover device is incorrect. It
>>>>>> had never been taken good care of, although I did try to emphasize it from
>>>>>> the very beginning.
>>>>>>
>>>>>> Basically what you said about propagating the ethtool configuration down to
>>>>>> the slave is the key pursuance of 1-netdev model. However, what I am seeking
>>>>>> now is any alternative that can also fix the specific udev rename problem,
>>>>>> before concluding that 1-netdev is the only solution. Generally a 1-netdev
>>>>>> scheme would take time to implement, while I'm trying to find a way out to
>>>>>> fix this particular naming problem under 3-netdev.
>>>>>>
>>>>>>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>>>>>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>>>>>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>>>>>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
>>>>>>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>>>>>>>> model as much transparent to a real NIC as possible, while a hidden netns is
>>>>>>>> just the vehicle). However, I recall there was resistance around this
>>>>>>>> discussion that even the concept of hiding itself is a taboo for Linux
>>>>>>>> netdev. I would like to summon potential alternatives before concluding
>>>>>>>> 1-netdev is the only solution too soon.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Siwei
>>>>>>> Your scripts would not work at all then, right?
>>>>>> At this point we don't claim images with such usage as SR-IOV live
>>>>>> migrate-able. We would flag it as live migrate-able until this ethtool
>>>>>> config issue is fully addressed and a transparent live migration solution
>>>>>> emerges in upstream eventually.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>>>>>>> -Siwei
>>>>>>>>>>
>>>>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: virtio-dev-unsubscribe@...ts.oasis-open.org
>>>>>>> For additional commands, e-mail: virtio-dev-help@...ts.oasis-open.org
>>>>>>>
>>>>>> net_failover(kernel) | network.service (user) | systemd-udevd (user)
>>>>>> --------------------------------------------------+------------------------------+--------------------------------------------
>>>>>> (standby virtio-net and net_failover | |
>>>>>> devices created and initialized, | |
>>>>>> i.e. virtnet_probe()-> | |
>>>>>> net_failover_create() | |
>>>>>> was done.) | |
>>>>>> | |
>>>>>> | runs `ifup ens3' -> |
>>>>>> | ip link set dev ens3 up |
>>>>>> net_failover_open() | |
>>>>>> dev_open(virtnet_dev) | |
>>>>>> virtnet_open(virtnet_dev) | |
>>>>>> netif_carrier_on(failover_dev) | |
>>>>>> ... | |
>>>>>> | |
>>>>>> (VF hot plugged in) | |
>>>>>> ixgbevf_probe() | |
>>>>>> register_netdev(ixgbevf_netdev) | |
>>>>>> netdev_register_kobject(ixgbevf_netdev) | |
>>>>>> kobject_add(ixgbevf_dev) | |
>>>>>> device_add(ixgbevf_dev) | |
>>>>>> kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) | |
>>>>>> netlink_broadcast() | |
>>>>>> ... | |
>>>>>> call_netdevice_notifiers(NETDEV_REGISTER) | |
>>>>>> failover_event(..., NETDEV_REGISTER, ...) | |
>>>>>> failover_slave_register(ixgbevf_netdev) | |
>>>>>> net_failover_slave_register(ixgbevf_netdev) | |
>>>>>> dev_open(ixgbevf_netdev) | |
>>>>>> | |
>>>>>> | |
>>>>>> | | received ADD uevent from netlink fd
>>>>>> | | ...
>>>>>> | | udev-builtin-net_id.c:dev_pci_slot()
>>>>>> | | (decided to renamed 'eth0' )
>>>>>> | | ip link set dev eth0 name ens4
>>>>>> (dev_change_name() returns -EBUSY as | |
>>>>>> ixgbevf_netdev->flags has IFF_UP) | |
>>>>>> | |
>>>>>>
>>>>> Given renaming slaves does not work anyway:
>>>> I was actually thinking what if we relieve the rename restriction just for
>>>> the failover slave? What the impact would be? I think users don't care about
>>>> slave being renamed when it's in use, especially the initial rename.
>>>> Thoughts?
>>>>
>>>>> would it work if we just
>>>>> hard-coded slave names instead?
>>>>>
>>>>> E.g.
>>>>> 1. fail slave renames
>>>>> 2. rename of failover to XX automatically renames standby to XXnsby
>>>>> and primary to XXnpry
>>>> That wouldn't help. The time when the failover master gets renamed, the VF
>>>> may not be present.
>>> In this scheme if VF is not there it will be renamed immediately after registration.
>> Who will be responsible to rename the slave, the kernel?
>
> That's the idea.
>
>> Note the master's
>> name may or may not come from the userspace. If it comes from the userspace,
>> should the userspace daemon change their expectation not to name/rename
>> _any_ slaves (today there's no distinction)?
>
> Yes the idea would be to fail renaming slaves.
>
>> How do users know which name to
>> trust, depending on which wins the race more often? Say if kernel wants a
>> ens3npry name while userspace wants it named as ens4.
>>
>> -Siwei
>
> With this approach kernel will deny attempts by userspace to rename
> slaves. Slaves will always be named XXXnsby and XXnpry. Master renames
> will rename both slaves.
>
> It seems pretty solid to me, the only issue is that in theory userspace
> can use a name like XXXnsby for something else. But this seems unlikely.
I’m fond of this idea and I have similar opinion.
I think it simplifies the issue here.
I don’t see a real reason for customer to define udev rule to rename a net-failover slave to have different postfix.
-Liran
>
>
>>>
>>>> I don't like the idea to delay exposing failover master
>>>> until VF is hot plugged in (probably subject to various failures) later.
>>>>
>>>> Thanks,
>>>> -Siwei
>>>
>>> I agree, this was not what I meant.
>>>
>>>>>
Powered by blists - more mailing lists