Message-ID: <KL1P15301MB0006B36AD458D43907DF82B0BF6D0@KL1P15301MB0006.APCP153.PROD.OUTLOOK.COM>
Date: Tue, 29 May 2018 19:58:09 +0000
From: Dexuan Cui <decui@...rosoft.com>
To: "Michael Kelley (EOSG)" <Michael.H.Kelley@...rosoft.com>,
'Lorenzo Pieralisi' <lorenzo.pieralisi@....com>,
'Bjorn Helgaas' <bhelgaas@...gle.com>,
"'linux-pci@...r.kernel.org'" <linux-pci@...r.kernel.org>,
KY Srinivasan <kys@...rosoft.com>,
Stephen Hemminger <sthemmin@...rosoft.com>,
"'olaf@...fle.de'" <olaf@...fle.de>,
"'apw@...onical.com'" <apw@...onical.com>,
"'jasowang@...hat.com'" <jasowang@...hat.com>
CC: "'linux-kernel@...r.kernel.org'" <linux-kernel@...r.kernel.org>,
"'driverdev-devel@...uxdriverproject.org'"
<driverdev-devel@...uxdriverproject.org>,
Haiyang Zhang <haiyangz@...rosoft.com>,
"'vkuznets@...hat.com'" <vkuznets@...hat.com>,
"'marcelo.cerri@...onical.com'" <marcelo.cerri@...onical.com>
Subject: RE: [PATCH] PCI: hv: Do not wait forever on a device that has
disappeared
> From: Michael Kelley (EOSG)
> Sent: Monday, May 28, 2018 17:19
>
> While this patch solves the immediate problem of getting hung waiting
> for a response from Hyper-V that will never come, there's another scenario
> to look at that I think introduces a race. Suppose the guest VM issues a
> vmbus_sendpacket() request in one of the cases covered by this patch,
> and suppose that Hyper-V queues a response to the request, and then
> immediately follows with a rescind request. Processing the response will
> get queued to a tasklet associated with the channel, while processing the
> rescind will get queued to a tasklet associated with the top-level vmbus
> connection. From what I can see, the code doesn't impose any ordering
> on processing the two. If the rescind is processed first, the new
> wait_for_response() function may wake up, notice the rescind flag, and
> return an error. Its caller will return an error, and in doing so pop the
> completion packet off the stack. When the response is processed later,
> it will try to signal completion via a completion packet that no longer
> exists, and memory corruption will likely occur.
>
> Am I missing anything that would prevent this scenario from happening?
> It is admittedly low probability, and a solution seems non-trivial. I haven't
> looked specifically, but a similar scenario is probably possible with the
> drivers for other VMbus devices. We should work on a generic solution.
>
> Michael
Thanks for spotting the race!
IMO we can disable the per-channel tasklet to close the race window:
--- a/drivers/pci/host/pci-hyperv.c
+++ b/drivers/pci/host/pci-hyperv.c
@@ -565,6 +565,7 @@ static int wait_for_response(struct hv_device *hdev,
 {
 	while (true) {
 		if (hdev->channel->rescind) {
+			tasklet_disable(&hdev->channel->callback_event);
 			dev_warn_once(&hdev->device, "The device is gone.\n");
 			return -ENODEV;
 		}
This way, when we exit the loop, we're sure hv_pci_onchannelcallback() cannot
run anymore, because tasklet_disable() waits for a running tasklet to finish
before returning. What do you think of this?
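
To make the race concrete, here is a minimal sketch of the caller side
(abridged from the hv_pci_protocol_negotiation()-style callers in
pci-hyperv.c; packet setup and error handling are omitted, so treat the
details as illustrative):

	struct pci_packet *pkt;
	struct hv_pci_compl comp_pkt;
	int ret;

	pkt = kzalloc(sizeof(*pkt) + sizeof(struct pci_version_request),
		      GFP_KERNEL);
	if (!pkt)
		return -ENOMEM;

	/* The completion lives on this function's stack. */
	init_completion(&comp_pkt.host_event);
	pkt->completion_func = hv_pci_generic_compl;
	pkt->compl_ctxt = &comp_pkt;

	ret = vmbus_sendpacket(hdev->channel, &pkt->message,
			       sizeof(struct pci_version_request),
			       (unsigned long)pkt, VM_PKT_DATA_INBAND,
			       VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
	if (!ret)
		ret = wait_for_response(hdev, &comp_pkt.host_event);

	/*
	 * If wait_for_response() saw the rescind flag and returned
	 * -ENODEV, we return from here and comp_pkt is popped off the
	 * stack.  If the channel tasklet later runs
	 * hv_pci_onchannelcallback() for the already-queued response,
	 * it calls complete() on the dead stack completion -- the
	 * corruption Michael describes.  Disabling the tasklet before
	 * returning from wait_for_response() closes that window.
	 */
	return ret;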
It looks like the list of the other vmbus devices that can be hot-removed is:
the hv_utils devices
hv_sock devices
storvsc device
netvsc device
As far as I checked, the first three types of devices don't have this "send a
request to the host and wait forever for the response" pattern. NetVSC has the
same pattern, so it should be fixed as well.
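
For reference, the rescind-aware wait from the patch under discussion looks
roughly like this (netvsc would need an equivalent; callers invoke it right
after vmbus_sendpacket()):

static int wait_for_response(struct hv_device *hdev,
			     struct completion *comp)
{
	while (true) {
		/* Bail out once the host has rescinded the channel. */
		if (hdev->channel->rescind) {
			dev_warn_once(&hdev->device, "The device is gone.\n");
			return -ENODEV;
		}

		/* Poll every 1/10 second so a rescind is noticed soon. */
		if (wait_for_completion_timeout(comp, HZ / 10))
			break;
	}

	return 0;
}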
-- Dexuan