[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <SA1PR21MB1335549F59339A9848101158BF5BA@SA1PR21MB1335.namprd21.prod.outlook.com>
Date: Thu, 15 Jun 2023 04:27:45 +0000
From: Dexuan Cui <decui@...rosoft.com>
To: Lorenzo Pieralisi <lpieralisi@...nel.org>
CC: "bhelgaas@...gle.com" <bhelgaas@...gle.com>,
"davem@...emloft.net" <davem@...emloft.net>,
"edumazet@...gle.com" <edumazet@...gle.com>,
Haiyang Zhang <haiyangz@...rosoft.com>,
Jake Oshins <jakeo@...rosoft.com>,
"kuba@...nel.org" <kuba@...nel.org>, "kw@...ux.com" <kw@...ux.com>,
KY Srinivasan <kys@...rosoft.com>,
"leon@...nel.org" <leon@...nel.org>,
"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
"Michael Kelley (LINUX)" <mikelley@...rosoft.com>,
"pabeni@...hat.com" <pabeni@...hat.com>,
"robh@...nel.org" <robh@...nel.org>,
"saeedm@...dia.com" <saeedm@...dia.com>,
"wei.liu@...nel.org" <wei.liu@...nel.org>,
Long Li <longli@...rosoft.com>,
"boqun.feng@...il.com" <boqun.feng@...il.com>,
Saurabh Singh Sengar <ssengar@...rosoft.com>,
"helgaas@...nel.org" <helgaas@...nel.org>,
"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Jose Teuttli Carranco <josete@...rosoft.com>,
"stable@...r.kernel.org" <stable@...r.kernel.org>
Subject: RE: [PATCH v3 2/6] PCI: hv: Fix a race condition in hv_irq_unmask()
that can cause panic
> From: Lorenzo Pieralisi <lpieralisi@...nel.org>
> Sent: Thursday, May 25, 2023 3:16 AM
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -643,6 +643,11 @@ static void hv_arch_irq_unmask(struct irq_data
> > *data)
> > pbus = pdev->bus;
> > hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> > int_desc = data->chip_data;
> > + if (!int_desc) {
> > + dev_warn(&hbus->hdev->device, "%s() can not unmask irq %u\n",
> > + __func__, data->irq);
> > + return;
> > + }
>
> That's a check that should be there regardless ?
Yes. Normally data->chip_data is set at the end of hv_compose_msi_msg(),
and it should not be NULL. However, in rare circumstances, we might see a
failure in hv_compose_msi_msg(), e.g. the hypervisor/host might return an
error in comp.comp_pkt.completion_status (at least this is possible in theory).
In case we see a failure in hv_compose_msi_msg(), data->chip_data stays
with its default value of NULL; because the return type of
hv_compose_msi_msg() is "void", we can not return an error to the upper
layer; later, when the upper layer calls request_irq() -> ... -> hv_irq_unmask(),
hv_arch_irq_unmask() crashes because data->chip_data is NULL -- with this
check, we're able to error out gracefully, and the user can better understand
what goes wrong.
> > spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
> >
> > @@ -1911,12 +1916,6 @@ static void hv_compose_msi_msg(struct
> irq_data *data, struct msi_msg *msg)
> > hv_pci_onchannelcallback(hbus);
> > spin_unlock_irqrestore(&channel->sched_lock, flags);
> >
> > - if (hpdev->state == hv_pcichild_ejecting) {
> > - dev_err_once(&hbus->hdev->device,
> > - "the device is being ejected\n");
> > - goto enable_tasklet;
> > - }
> > -
> > udelay(100);
> > }
>
> I don't understand why this code is in hv_compose_msi_msg() in the first
> place (and why only in that function ?) to me this looks like you are
> adding plasters in the code that can turn out to be problematic while
> ejecting a device, this does not seem robust at all - that's my opinion.
The code was incorrectly added by
de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in hv_compose_msi_msg()")
de0aa7b2f97d says
"
2. If the host is ejecting the VF device before we reach
hv_compose_msi_msg(), in a UP VM, we can hang in hv_compose_msi_msg()
forever, because at this time the host doesn't respond to the
CREATE_INTERRUPT request.
"
de0aa7b2f97d implies that the host doesn't respond to the guest's
CREATE_INTERRUPT request once the guest receives the PCI_EJECT message -- this
is incorrect: after the guest receives the PCI_EJECT message, actually the host
still responds to the guest's request, as long as the guest sends the request within
1 minute AND the guest doesn't send a PCI_EJECTION_COMPLETE message to
the host in hv_eject_device_work(). The real issue is that currently the guest
can send PCI_EJECTION_COMPLETE to the host before the guest finishes the
device-probing/removing handling -- once the guest sends PCI_EJECTION_COMPLETE,
the host unassigns the PCI device from the guest and ignores any request
from the guest.
So here the check "hpdev->state == hv_pcichild_ejecting" is incorrect. We
should remove the check since it can cause a panic (see the commit messsage
for the detailed explanation)
The "premature PCI_EJECTION_COMPLETE" issue is resolved by:
[PATCH v3 5/6] PCI: hv: Add a per-bus mutex state_lock
> Feel free to merge this code, I can't ACK it, sorry.
>
> Lorenzo
Thanks for sharing the thougths!
Powered by blists - more mailing lists