[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8b7a9cd5-3696-65c2-5656-a1c8eb174344@xen.org>
Date: Tue, 18 May 2021 07:57:16 +0100
From: Paul Durrant <xadimgnik@...il.com>
To: Marek Marczykowski-Górecki
<marmarek@...isiblethingslab.com>,
"Durrant, Paul" <pdurrant@...zon.co.uk>
Cc: Michael Brown <mbrown@...systems.co.uk>,
"xen-devel@...ts.xenproject.org" <xen-devel@...ts.xenproject.org>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"wei.liu@...nel.org" <wei.liu@...nel.org>,
Ian Jackson <iwj@...project.org>, Wei Liu <wl@....org>,
Anthony PERARD <anthony.perard@...rix.com>
Subject: Re: [PATCH] xen-netback: Check for hotplug-status existence before
watching
On 17/05/2021 22:43, Marek Marczykowski-Górecki wrote:
> On Tue, May 11, 2021 at 12:46:38PM +0000, Durrant, Paul wrote:
>>> -----Original Message-----
>>> From: Marek Marczykowski-Górecki <marmarek@...isiblethingslab.com>
>>> Sent: 11 May 2021 11:45
>>> To: Durrant, Paul <pdurrant@...zon.co.uk>
>>> Cc: Michael Brown <mbrown@...systems.co.uk>; paul@....org; xen-devel@...ts.xenproject.org;
>>> netdev@...r.kernel.org; wei.liu@...nel.org
>>> Subject: RE: [EXTERNAL] [PATCH] xen-netback: Check for hotplug-status existence before watching
>>>
>>> On Tue, May 11, 2021 at 12:40:54PM +0200, Marek Marczykowski-Górecki wrote:
>>>> On Tue, May 11, 2021 at 07:06:55AM +0000, Durrant, Paul wrote:
>>>>>> -----Original Message-----
>>>>>> From: Marek Marczykowski-Górecki <marmarek@...isiblethingslab.com>
>>>>>> Sent: 10 May 2021 20:43
>>>>>> To: Michael Brown <mbrown@...systems.co.uk>; paul@....org
>>>>>> Cc: paul@....org; xen-devel@...ts.xenproject.org; netdev@...r.kernel.org; wei.liu@...nel.org;
>>> Durrant,
>>>>>> Paul <pdurrant@...zon.co.uk>
>>>>>> Subject: RE: [EXTERNAL] [PATCH] xen-netback: Check for hotplug-status existence before watching
>>>>>>
>>>>>> On Mon, May 10, 2021 at 08:06:55PM +0100, Michael Brown wrote:
>>>>>>> If you have a suggested patch, I'm happy to test that it doesn't reintroduce
>>>>>>> the regression bug that was fixed by this commit.
>>>>>>
>>>>>> Actually, I've just tested with a simple reloading xen-netfront module. It
>>>>>> seems in this case, the hotplug script is not re-executed. In fact, I
>>>>>> think it should not be re-executed at all, since the vif interface
>>>>>> remains in place (it just gets NO-CARRIER flag).
>>>>>>
>>>>>> This brings a question, why removing hotplug-status in the first place?
>>>>>> The interface remains correctly configured by the hotplug script after
>>>>>> all. From the commit message:
>>>>>>
>>>>>> xen-netback: remove 'hotplug-status' once it has served its purpose
>>>>>>
>>>>>> Removing the 'hotplug-status' node in netback_remove() is wrong; the script
>>>>>> may not have completed. Only remove the node once the watch has fired and
>>>>>> has been unregistered.
>>>>>>
>>>>>> I think the intention was to remove 'hotplug-status' node _later_ in
>>>>>> case of quickly adding and removing the interface. Is that right, Paul?
>>>>>
>>>>> The removal was done to allow unbind/bind to function correctly. IIRC before the original patch
>>> doing a bind would stall forever waiting for the hotplug status to change, which would never happen.
>>>>
>>>> Hmm, in that case maybe don't remove it at all in the backend, and let
>>>> it be cleaned up by the toolstack, when it removes other backend-related
>>>> nodes?
>>>
>>> No, unbind/bind _does_ require hotplug script to be called again.
>>>
>>
>> Yes, sorry I was misremembering. My memory is hazy but there was definitely a problem at the time with leaving the node in place.
>>
>>> When exactly vif interface appears in the system (starts to be available
>>> for the hotplug script)? Maybe remove 'hotplug-status' just before that
>>> point?
>>>
>>
>> I really can't remember any detail. Perhaps try reverting both patches then and check that the unbind/rmmod/modprobe/bind sequence still works (and the backend actually makes it into connected state).
>
> Ok, I've tried this. I've reverted both commits, then used your test
> script from the 9476654bd5e8ad42abe8ee9f9e90069ff8e60c17:
>
> This has been tested by running iperf as a server in the test VM and
> then running a client against it in a continuous loop, whilst also
> running:
>
> while true;
> do echo vif-$DOMID-$VIF >unbind;
> echo down;
> rmmod xen-netback;
> echo unloaded;
> modprobe xen-netback;
> cd $(pwd);
> brctl addif xenbr0 vif$DOMID.$VIF;
> ip link set vif$DOMID.$VIF up;
> echo up;
> sleep 5;
> done
>
> in dom0 from /sys/bus/xen-backend/drivers/vif to continuously unbind,
> unload, re-load, re-bind and re-plumb the backend.
>
> In fact, the need to call `brctl` and `ip link` manually is exactly
> because the hotplug script isn't executed. When I execute it manually,
> the backend properly gets back to working. So, removing 'hotplug-status'
> was in the correct place (netback_remove). The missing part is the toolstack
> calling the hotplug script on xen-netback re-bind.
>
Why is that missing? We're going behind the back of the toolstack to do
the unbind and bind so why should we expect it to re-execute a hotplug
script?
> In this case, I'm not sure what is the proper way. If I restart
> xendriverdomain service (I do run the backend in domU), it properly
> executes hotplug script and the backend interface gets properly
> configured. But it doesn't do it on its own. It seems to be related to
> device "state" in xenstore. The specific part of the libxl is
> backend_watch_callback():
> https://github.com/xen-project/xen/blob/master/tools/libs/light/libxl_device.c#L1664
>
> ddev = search_for_device(dguest, dev);
> if (ddev == NULL && state == XenbusStateClosed) {
> /*
> * Spurious state change, device has already been disconnected
> * or never attached.
> */
> goto skip;
> } else if (ddev == NULL) {
> rc = add_device(egc, nested_ao, dguest, dev);
> if (rc > 0)
> free_ao = true;
> } else if (state == XenbusStateClosed && online == 0) {
> rc = remove_device(egc, nested_ao, dguest, ddev);
> if (rc > 0)
> free_ao = true;
> check_and_maybe_remove_guest(gc, ddomain, dguest);
> }
>
> In short: if device gets XenbusStateInitWait for the first time (ddev ==
> NULL case), it goes to add_device() which executes the hotplug script
> and stores the device.
> Then, if device goes to XenbusStateClosed + online==0 state, then it
> executes hotplug script again (with "offline" parameter) and forgets the
> device. If you unbind the driver, the device stays in
> XenbusStateConnected state (in xenstore), and after you bind it again,
> it goes to XenbusStateInitWait. It don't think it goes through
> XenbusStateClosed, and online stays at 1 too, so libxl doesn't execute
> the hotplug script again.
This is pretty key. The frontend should not notice an unbind/bind i.e.
there should be no evidence of it happening by examining states in
xenstore (from the guest side).
Paul
>
> Some solution could be to add an extra case at the end, like "ddev !=
> NULL && state == XenbusStateInitWait && hotplug-status != connected".
> And make sure xl devd won't call the same hotplug script multiple times
> for the same device _at the same time_ (I'm not sure about the async
> machinery here).
>
> But even if xl devd (aka xendriverdomain service) gets "fixes" to
> execute hotplug script in that case, I don't think it would work in
> backend in dom0 case - there, I think nothing watches already configured
> vif interfaces (there is no xl devd daemon in dom0, and xl background
> process watches only domain death and cdrom eject events).
>
> I'm adding toolstack maintainers, maybe they'll have some idea...
>
> In any case, the issue is not calling the hotplug script, responsible
> for configuring newly created vif interface. Not kernel waiting for it.
> So, I think both commits should still be reverted.
>
Powered by blists - more mailing lists