linux-kernel - RE: [PATCH v3 2/6] PCI: hv: Fix a race condition in hv_irq

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <SA1PR21MB1335549F59339A9848101158BF5BA@SA1PR21MB1335.namprd21.prod.outlook.com>
Date:   Thu, 15 Jun 2023 04:27:45 +0000
From:   Dexuan Cui <decui@...rosoft.com>
To:     Lorenzo Pieralisi <lpieralisi@...nel.org>
CC:     "bhelgaas@...gle.com" <bhelgaas@...gle.com>,
        "davem@...emloft.net" <davem@...emloft.net>,
        "edumazet@...gle.com" <edumazet@...gle.com>,
        Haiyang Zhang <haiyangz@...rosoft.com>,
        Jake Oshins <jakeo@...rosoft.com>,
        "kuba@...nel.org" <kuba@...nel.org>, "kw@...ux.com" <kw@...ux.com>,
        KY Srinivasan <kys@...rosoft.com>,
        "leon@...nel.org" <leon@...nel.org>,
        "linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
        "Michael Kelley (LINUX)" <mikelley@...rosoft.com>,
        "pabeni@...hat.com" <pabeni@...hat.com>,
        "robh@...nel.org" <robh@...nel.org>,
        "saeedm@...dia.com" <saeedm@...dia.com>,
        "wei.liu@...nel.org" <wei.liu@...nel.org>,
        Long Li <longli@...rosoft.com>,
        "boqun.feng@...il.com" <boqun.feng@...il.com>,
        Saurabh Singh Sengar <ssengar@...rosoft.com>,
        "helgaas@...nel.org" <helgaas@...nel.org>,
        "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Jose Teuttli Carranco <josete@...rosoft.com>,
        "stable@...r.kernel.org" <stable@...r.kernel.org>
Subject: RE: [PATCH v3 2/6] PCI: hv: Fix a race condition in hv_irq_unmask()
 that can cause panic

> From: Lorenzo Pieralisi <lpieralisi@...nel.org>
> Sent: Thursday, May 25, 2023 3:16 AM
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -643,6 +643,11 @@ static void hv_arch_irq_unmask(struct irq_data
> >  *data)
> >  	pbus = pdev->bus;
> >  	hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> >  	int_desc = data->chip_data;
> > +	if (!int_desc) {
> > +		dev_warn(&hbus->hdev->device, "%s() can not unmask irq %u\n",
> > +			 __func__, data->irq);
> > +		return;
> > +	}
> 
> That's a check that should be there regardless ?
Yes. Normally data->chip_data is set at the end of hv_compose_msi_msg(),
and it should not be NULL. However, in rare circumstances, we might see a
failure in hv_compose_msi_msg(), e.g. the hypervisor/host might return an
error in comp.comp_pkt.completion_status (at least this is possible in theory).

In case we see a failure in hv_compose_msi_msg(), data->chip_data stays
with its default value of NULL; because the return type of
hv_compose_msi_msg() is "void", we can not return an error to the upper
layer; later, when the upper layer calls request_irq() -> ... -> hv_irq_unmask(),
hv_arch_irq_unmask() crashes because data->chip_data is NULL -- with this
check, we're able to error out gracefully, and the user can better understand
what goes wrong.

> >  	spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
> >
> > @@ -1911,12 +1916,6 @@ static void hv_compose_msi_msg(struct
> irq_data *data, struct msi_msg *msg)
> >  		hv_pci_onchannelcallback(hbus);
> >  		spin_unlock_irqrestore(&channel->sched_lock, flags);
> >
> > -		if (hpdev->state == hv_pcichild_ejecting) {
> > -			dev_err_once(&hbus->hdev->device,
> > -				     "the device is being ejected\n");
> > -			goto enable_tasklet;
> > -		}
> > -
> >  		udelay(100);
> >  	}
> 
> I don't understand why this code is in hv_compose_msi_msg() in the first
> place (and why only in that function ?) to me this looks like you are
> adding plasters in the code that can turn out to be problematic while
> ejecting a device, this does not seem robust at all - that's my opinion.

The code was incorrectly added by
de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in hv_compose_msi_msg()")

de0aa7b2f97d says

"
2. If the host is ejecting the VF device before we reach
    hv_compose_msi_msg(), in a UP VM, we can hang in hv_compose_msi_msg()
    forever, because at this time the host doesn't respond to the
    CREATE_INTERRUPT request.
"

de0aa7b2f97d implies that the host doesn't respond to the guest's
CREATE_INTERRUPT request once the guest receives the PCI_EJECT message -- this
is incorrect: after the guest receives the PCI_EJECT message, actually the host
still responds to the guest's request, as long as the guest sends the request within
1 minute AND the guest doesn't send a PCI_EJECTION_COMPLETE message to
the host in hv_eject_device_work(). The real issue is that currently the guest
can send PCI_EJECTION_COMPLETE to the host before the guest finishes the
device-probing/removing handling -- once the guest sends PCI_EJECTION_COMPLETE,
the host unassigns the PCI device from the guest and ignores any request
from the guest.

So here the check "hpdev->state == hv_pcichild_ejecting" is incorrect. We
should remove the check since it can cause a panic (see the commit messsage
for the detailed explanation)

The "premature PCI_EJECTION_COMPLETE" issue is resolved by:
[PATCH v3 5/6] PCI: hv: Add a per-bus mutex state_lock

> Feel free to merge this code, I can't ACK it, sorry.
> 
> Lorenzo
Thanks for sharing the thougths!