lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5279A191.9030807@oracle.com>
Date:	Wed, 06 Nov 2013 09:55:29 +0800
From:	Zhenzhong Duan <zhenzhong.duan@...cle.com>
To:	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>
CC:	Bjorn Helgaas <bhelgaas@...gle.com>,
	"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	xen-devel <xen-devel@...ts.xen.org>,
	Feng Jin <joe.jin@...cle.com>,
	Sucheta Chakraborty <sucheta.chakraborty@...gic.com>,
	Jingoo Han <jg1.han@...sung.com>
Subject: Re: [PATCH 2/3] PCI: Refactor MSI/MSIX mask restore code to fix interrupt
 lost issue


On 2013-11-05 22:32, Konrad Rzeszutek Wilk wrote:
> On Wed, Oct 30, 2013 at 10:20:40AM +0800, Zhenzhong Duan wrote:
>> On 2013-10-30 05:58, Bjorn Helgaas wrote:
>>> On Wed, Oct 16, 2013 at 02:33:04PM +0800, Zhenzhong Duan wrote:
>>>> Driver init call graph under baremetal:
>>>> driver_init->
>>>>      msix_capability_init->
>>>>          msix_program_entries->
>>>>              msix_mask_irq->
>>>>                  entry->masked = 1
>>>>      request_irq->
>>>>          __setup_irq->
>>>>              irq_startup->
>>>>                  unmask_msi_irq->
>>>>                      msix_mask_irq->
>>>>                          entry->masked = 0;
>>>>
>>>> So entry->masked is always updated with newest value and its value could be used
>>>> to restore to mask register in device.
>>>>
>>>> But in initial domain (aka priviliged guest), it's different.
>>>> Driver init call graph under initial domain:
>>>> driver_init->
>>>>      msix_capability_init->
>>>>          msix_program_entries->
>>>>              msix_mask_irq->
>>>>                  entry->masked = 1
>>>>      request_irq->
>>>>          __setup_irq->
>>>>              irq_startup->
>>>>                  __startup_pirq->
>>>>                      EVTCHNOP_bind_pirq hypercall    (trap into Xen)
>>>> [Xen:]
>>>> pirq_guest_bind->
>>>>      startup_msi_irq->
>>>>          unmask_msi_irq->
>>>>              msi_set_mask_bit->
>>>>                  entry->msi_attrib.masked = 0;
>> The right mask value is saved in entry->msi_attrib.masked on Xen.
>>>> So entry->msi_attrib.masked in xen side always has newest value. entry->masked
>>>> in initial domain is untouched and is 1 after msix_capability_init.
>>> If we run the following sequence:
>>>
>>>      pci_enable_msix()
>>>      request_irq()
>>>
>>> don't we end up with the MSI IRQ unmasked if we're on bare metal but masked
>>> if we're on Xen?  It seems like we'd want it unmasked in both cases, so I
>>> expected your patch to do something to make it unmasked if we're on Xen.
>>> But I don't think it does, does it?
>>>
>>> As far as I can tell, this patch only changes the pci_restore_state()
>>> path.  I think that part makes sense.
>>>
>>> Bjorn
>> It's unmasked on Xen too. This is just what this patch try to fix.
>> In PHYSDEVOP_restore_msi hypercall, xen did the right thing that did
>> by kernel in baremetal.
> Part of this problem is that all of the interrupt vector setting (either
> be it GSI, MSI or MSI-X) is handled by the hypervisor. That means the kernel
> consults the hypervisor for the right 'vector' value for all of the different
> types of interrupts. And that 'vector' value is not even used - the interrupts
> first hit the hypervisor - which dispatches them to the guest via a software
> event channel mechanism (a bitmap of 'active' events - and an event can be
> tied to a physical interrupt or an IPI, etc).
Yes, for below sequence, request_irq calls EVTCHNOP_bind_pirq hypercall 
and Xen will
get MSI IRQ unmasked.

       pci_enable_msix()
       request_irq()

> Even more recently we have been clamping down - so that the kernel pagetables
> for the MSI-X tables for example are R/O - so it can't write (or over-write)
> with a different vector value (or the same one). The hypervisor is the one
> that does this change.
>
> Perhaps a different way of fixing this is making the '__msi_mask_irq' and
> '__msix_mask_irq' be part of the x86.msi function ops? And then the platform
> can over-write it with its own mechanism for masking/unmasking? (and in case
> of Xen it would be a nop as that has already been done by the hypervisor?)
The idea looks good
> The 'write_msi_msg' we don't have to worry about as it is only used by
> default_restore_msi_irqs (which is part of the x86.msi and can be over-written).
>
> Perhaps something like this (Testing it right now):
I'd suggest to test with qlogic card with which the bug only reproduce.

zduan
> diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
> index 828a156..0f1be11 100644
> --- a/arch/x86/include/asm/x86_init.h
> +++ b/arch/x86/include/asm/x86_init.h
> @@ -172,6 +172,7 @@ struct x86_platform_ops {
>   
>   struct pci_dev;
>   struct msi_msg;
> +struct msi_desc;
>   
>   struct x86_msi_ops {
>   	int (*setup_msi_irqs)(struct pci_dev *dev, int nvec, int type);
> @@ -182,6 +183,8 @@ struct x86_msi_ops {
>   	void (*teardown_msi_irqs)(struct pci_dev *dev);
>   	void (*restore_msi_irqs)(struct pci_dev *dev, int irq);
>   	int  (*setup_hpet_msi)(unsigned int irq, unsigned int id);
> +	u32 (*msi_mask_irq)(struct msi_desc *desc, u32 mask, u32 flag);
> +	u32 (*msix_mask_irq)(struct msi_desc *desc, u32 flag);
>   };
>   
>   struct IO_APIC_route_entry;
> diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
> index 8ce0072..021783b 100644
> --- a/arch/x86/kernel/x86_init.c
> +++ b/arch/x86/kernel/x86_init.c
> @@ -116,6 +116,8 @@ struct x86_msi_ops x86_msi = {
>   	.teardown_msi_irqs	= default_teardown_msi_irqs,
>   	.restore_msi_irqs	= default_restore_msi_irqs,
>   	.setup_hpet_msi		= default_setup_hpet_msi,
> +	.msi_mask_irq		= default_msi_mask_irq,
> +	.msix_mask_irq		= default_msix_mask_irq,
>   };
>   
>   /* MSI arch specific hooks */
> @@ -138,6 +140,14 @@ void arch_restore_msi_irqs(struct pci_dev *dev, int irq)
>   {
>   	x86_msi.restore_msi_irqs(dev, irq);
>   }
> +u32 arch_msi_mask_irq(struct msi_desc *desc, u32 mask, u32 flag)
> +{
> +	return x86_msi.msi_mask_irq(desc, mask, flag);
> +}
> +u32 arch_msix_mask_irq(struct msi_desc *desc, u32 flag)
> +{
> +	return x86_msi.msix_mask_irq(desc, flag);
> +}
>   #endif
>   
>   struct x86_io_apic_ops x86_io_apic_ops = {
> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
> index 48e8461..5eee495 100644
> --- a/arch/x86/pci/xen.c
> +++ b/arch/x86/pci/xen.c
> @@ -382,7 +382,14 @@ static void xen_teardown_msi_irq(unsigned int irq)
>   {
>   	xen_destroy_irq(irq);
>   }
> -
> +static u32 xen_nop_msi_mask_irq(struct msi_desc *desc, u32 mask, u32 flag)
> +{
> +	return 0;
> +}
> +static u32 xen_nop_msix_mask_irq(struct msi_desc *desc, u32 flag)
> +{
> +	return 0;
> +}
>   #endif
>   
>   int __init pci_xen_init(void)
> @@ -406,6 +413,8 @@ int __init pci_xen_init(void)
>   	x86_msi.setup_msi_irqs = xen_setup_msi_irqs;
>   	x86_msi.teardown_msi_irq = xen_teardown_msi_irq;
>   	x86_msi.teardown_msi_irqs = xen_teardown_msi_irqs;
> +	x86_msi.msi_mask_irq = xen_nop_msi_mask_irq;
> +	x86_msi.msix_mask_irq = xen_nop_msix_mask_irq;
>   #endif
>   	return 0;
>   }
> @@ -485,6 +494,8 @@ int __init pci_xen_initial_domain(void)
>   	x86_msi.setup_msi_irqs = xen_initdom_setup_msi_irqs;
>   	x86_msi.teardown_msi_irq = xen_teardown_msi_irq;
>   	x86_msi.restore_msi_irqs = xen_initdom_restore_msi_irqs;
> +	x86_msi.msi_mask_irq = xen_nop_msi_mask_irq;
> +	x86_msi.msix_mask_irq = xen_nop_msix_mask_irq;
>   #endif
>   	xen_setup_acpi_sci();
>   	__acpi_register_gsi = acpi_register_gsi_xen;
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index d5f90d6..7916699 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -185,7 +185,7 @@ static inline __attribute_const__ u32 msi_enabled_mask(u16 control)
>    * reliably as devices without an INTx disable bit will then generate a
>    * level IRQ which will never be cleared.
>    */
> -static u32 __msi_mask_irq(struct msi_desc *desc, u32 mask, u32 flag)
> +u32 default_msi_mask_irq(struct msi_desc *desc, u32 mask, u32 flag)
>   {
>   	u32 mask_bits = desc->masked;
>   
> @@ -199,9 +199,14 @@ static u32 __msi_mask_irq(struct msi_desc *desc, u32 mask, u32 flag)
>   	return mask_bits;
>   }
>   
> +__weak u32 arch_msi_mask_irq(struct msi_desc *desc, u32 mask, u32 flag)
> +{
> +	return default_msi_mask_irq(desc, mask, flag);
> +}
> +
>   static void msi_mask_irq(struct msi_desc *desc, u32 mask, u32 flag)
>   {
> -	desc->masked = __msi_mask_irq(desc, mask, flag);
> +	desc->masked = arch_msi_mask_irq(desc, mask, flag);
>   }
>   
>   /*
> @@ -211,7 +216,7 @@ static void msi_mask_irq(struct msi_desc *desc, u32 mask, u32 flag)
>    * file.  This saves a few milliseconds when initialising devices with lots
>    * of MSI-X interrupts.
>    */
> -static u32 __msix_mask_irq(struct msi_desc *desc, u32 flag)
> +u32 default_msix_mask_irq(struct msi_desc *desc, u32 flag)
>   {
>   	u32 mask_bits = desc->masked;
>   	unsigned offset = desc->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
> @@ -224,9 +229,14 @@ static u32 __msix_mask_irq(struct msi_desc *desc, u32 flag)
>   	return mask_bits;
>   }
>   
> +__weak u32 arch_msix_mask_irq(struct msi_desc *desc, u32 flag)
> +{
> +	return default_msix_mask_irq(desc, flag);
> +}
> +
>   static void msix_mask_irq(struct msi_desc *desc, u32 flag)
>   {
> -	desc->masked = __msix_mask_irq(desc, flag);
> +	desc->masked = arch_msix_mask_irq(desc, flag);
>   }
>   
>   static void msi_set_mask_bit(struct irq_data *data, u32 flag)
> @@ -902,7 +912,7 @@ void pci_msi_shutdown(struct pci_dev *dev)
>   	pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &ctrl);
>   	mask = msi_capable_mask(ctrl);
>   	/* Keep cached state to be restored */
> -	__msi_mask_irq(desc, mask, ~mask);
> +	arch_msi_mask_irq(desc, mask, ~mask);
>   
>   	/* Restore dev->irq to its default pin-assertion irq */
>   	dev->irq = desc->msi_attrib.default_irq;
> @@ -998,7 +1008,7 @@ void pci_msix_shutdown(struct pci_dev *dev)
>   	/* Return the device with MSI-X masked as initial states */
>   	list_for_each_entry(entry, &dev->msi_list, list) {
>   		/* Keep cached states to be restored */
> -		__msix_mask_irq(entry, 1);
> +		arch_msix_mask_irq(entry, 1);
>   	}
>   
>   	msix_set_enable(dev, 0);
> diff --git a/include/linux/msi.h b/include/linux/msi.h
> index b17ead8..87cce50 100644
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -64,6 +64,8 @@ void arch_restore_msi_irqs(struct pci_dev *dev, int irq);
>   
>   void default_teardown_msi_irqs(struct pci_dev *dev);
>   void default_restore_msi_irqs(struct pci_dev *dev, int irq);
> +u32 default_msi_mask_irq(struct msi_desc *desc, u32 mask, u32 flag);
> +u32 default_msix_mask_irq(struct msi_desc *desc, u32 flag);
>   
>   struct msi_chip {
>   	struct module *owner;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ