lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200507121850.GB85463@otc-nc-03>
Date:   Thu, 7 May 2020 05:18:50 -0700
From:   "Raj, Ashok" <ashok.raj@...el.com>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     "Raj, Ashok" <ashok.raj@...ux.intel.com>,
        Evan Green <evgreen@...omium.org>,
        Mathias Nyman <mathias.nyman@...ux.intel.com>, x86@...nel.org,
        linux-pci <linux-pci@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Bjorn Helgaas <bhelgaas@...gle.com>,
        "Ghorai, Sukumar" <sukumar.ghorai@...el.com>,
        "Amara, Madhusudanarao" <madhusudanarao.amara@...el.com>,
        "Nandamuri, Srikanth" <srikanth.nandamuri@...el.com>,
        Ashok Raj <ashok.raj@...el.com>
Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

Hi Thomas

We did a bit more tracing and it looks like the IRR check is actually
not happening on the right cpu. See below.

On Tue, May 05, 2020 at 11:47:26PM +0200, Thomas Gleixner wrote:
> >
> > msi_set_affinit ()
> > {
> > ....
> >         unlock_vector_lock();
> >
> >         /*
> >          * Check whether the transition raced with a device interrupt and
> >          * is pending in the local APICs IRR. It is safe to do this outside
> >          * of vector lock as the irq_desc::lock of this interrupt is still
> >          * held and interrupts are disabled: The check is not accessing the
> >          * underlying vector store. It's just checking the local APIC's
> >          * IRR.
> >          */
> >         if (lapic_vector_set_in_irr(cfg->vector))
> >                 irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
> 
> No. This catches the transitional interrupt to the new vector on the
> original CPU, i.e. the one which is running that code.

Mathias added some trace to his xhci driver when the isr is called.

Below is the tail of my trace with last two times xhci_irq isr is called:

    <idle>-0     [003] d.h.   200.277971: xhci_irq: xhci irq
    <idle>-0     [003] d.h.   200.278052: xhci_irq: xhci irq

Just trying to follow your steps below with traces. The traces follow
the same comments in the source.

> 
> Again the steps are:
> 
>  1) Allocate new vector on new CPU

        /* Allocate a new target vector */
        ret = parent->chip->irq_set_affinity(parent, mask, force);

migration/3-24    [003] d..1   200.283012: msi_set_affinity: msi_set_affinity: quirk: 1: new vector allocated, new cpu = 0

> 
>  2) Set new vector on original CPU

        /* Redirect it to the new vector on the local CPU temporarily */
        old_cfg.vector = cfg->vector;
        irq_msi_update_msg(irqd, &old_cfg);

migration/3-24    [003] d..1   200.283033: msi_set_affinity: msi_set_affinity: Redirect to new vector 33 on old cpu 6
> 
>  3) Set new vector on new CPU

        /* Now transition it to the target CPU */
        irq_msi_update_msg(irqd, cfg);

     migration/3-24    [003] d..1   200.283044: msi_set_affinity: msi_set_affinity: Transition to new target cpu 0 vector 33



     if (lapic_vector_set_in_irr(cfg->vector))
	irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);


migration/3-24    [003] d..1   200.283046: msi_set_affinity: msi_set_affinity: Update Done [IRR 0]: irq 123 localsw: Nvec 33 Napic 0

> 
> So we have 3 points where an interrupt can fire:
> 
>  A) Before #2
> 
>  B) After #2 and before #3
> 
>  C) After #3
> 
> #A is hitting the old vector which is still valid on the old CPU and
>    will be handled once interrupts are enabled with the correct irq
>    descriptor - Normal operation (same as with maskable MSI)
> 
> #B This must be checked in the IRR because the there is no valid vector
>    on the old CPU.

The check for IRR seems like on a random cpu3 vs checking for the new vector 33
on old cpu 6?

This is the place when we force the retrigger without the IRR check things seem to fix itself.

> 
> #C is handled on the new vector on the new CPU
> 


Did we miss something? 

Cheers,
Ashok

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ