[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87wo5nj48a.fsf@nanos.tec.linutronix.de>
Date: Thu, 07 May 2020 14:53:41 +0200
From: Thomas Gleixner <tglx@...utronix.de>
To: "Raj\, Ashok" <ashok.raj@...el.com>
Cc: "Raj\, Ashok" <ashok.raj@...ux.intel.com>,
Evan Green <evgreen@...omium.org>,
Mathias Nyman <mathias.nyman@...ux.intel.com>, x86@...nel.org,
linux-pci <linux-pci@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>,
Bjorn Helgaas <bhelgaas@...gle.com>,
"Ghorai\, Sukumar" <sukumar.ghorai@...el.com>,
"Amara\, Madhusudanarao" <madhusudanarao.amara@...el.com>,
"Nandamuri\, Srikanth" <srikanth.nandamuri@...el.com>,
Ashok Raj <ashok.raj@...el.com>
Subject: Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug
Ashok,
"Raj, Ashok" <ashok.raj@...el.com> writes:
> We did a bit more tracing and it looks like the IRR check is actually
> not happening on the right cpu. See below.
What?
> On Tue, May 05, 2020 at 11:47:26PM +0200, Thomas Gleixner wrote:
>> >
>> > msi_set_affinit ()
>> > {
>> > ....
>> > unlock_vector_lock();
>> >
>> > /*
>> > * Check whether the transition raced with a device interrupt and
>> > * is pending in the local APICs IRR. It is safe to do this outside
>> > * of vector lock as the irq_desc::lock of this interrupt is still
>> > * held and interrupts are disabled: The check is not accessing the
>> > * underlying vector store. It's just checking the local APIC's
>> > * IRR.
>> > */
>> > if (lapic_vector_set_in_irr(cfg->vector))
>> > irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
>>
>> No. This catches the transitional interrupt to the new vector on the
>> original CPU, i.e. the one which is running that code.
>
> Mathias added some trace to his xhci driver when the isr is called.
>
> Below is the tail of my trace with last two times xhci_irq isr is called:
>
> <idle>-0 [003] d.h. 200.277971: xhci_irq: xhci irq
> <idle>-0 [003] d.h. 200.278052: xhci_irq: xhci irq
>
> Just trying to follow your steps below with traces. The traces follow
> the same comments in the source.
>
>>
>> Again the steps are:
>>
>> 1) Allocate new vector on new CPU
>
> /* Allocate a new target vector */
> ret = parent->chip->irq_set_affinity(parent, mask, force);
>
> migration/3-24 [003] d..1 200.283012: msi_set_affinity: msi_set_affinity: quirk: 1: new vector allocated, new cpu = 0
>
>>
>> 2) Set new vector on original CPU
>
> /* Redirect it to the new vector on the local CPU temporarily */
> old_cfg.vector = cfg->vector;
> irq_msi_update_msg(irqd, &old_cfg);
>
> migration/3-24 [003] d..1 200.283033: msi_set_affinity: msi_set_affinity: Redirect to new vector 33 on old cpu 6
On old CPU 6? This runs on CPU 3 which is wrong to begin with.
>> 3) Set new vector on new CPU
>
> /* Now transition it to the target CPU */
> irq_msi_update_msg(irqd, cfg);
>
> migration/3-24 [003] d..1 200.283044: msi_set_affinity: msi_set_affinity: Transition to new target cpu 0 vector 33
>
>
>
> if (lapic_vector_set_in_irr(cfg->vector))
> irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
>
>
> migration/3-24 [003] d..1 200.283046: msi_set_affinity: msi_set_affinity: Update Done [IRR 0]: irq 123 localsw: Nvec 33 Napic 0
>
>>
>> So we have 3 points where an interrupt can fire:
>>
>> A) Before #2
>>
>> B) After #2 and before #3
>>
>> C) After #3
>>
>> #A is hitting the old vector which is still valid on the old CPU and
>> will be handled once interrupts are enabled with the correct irq
>> descriptor - Normal operation (same as with maskable MSI)
>>
>> #B This must be checked in the IRR because the there is no valid vector
>> on the old CPU.
>
> The check for IRR seems like on a random cpu3 vs checking for the new vector 33
> on old cpu 6?
The whole sequence runs on CPU 3. If old CPU was 6 then this should
never run on CPU 3.
> This is the place when we force the retrigger without the IRR check things seem to fix itself.
It's not fixing it. It's papering over the root cause.
> Did we miss something?
Yes, you missed to analyze why this runs on CPU3 when old CPU is 6. But
the last interrupt actually was on CPU3.
> <idle>-0 [003] d.h. 200.278052: xhci_irq: xhci irq
Can you please provide the full trace and the patch you used to generate
it?
Thanks,
tglx
Powered by blists - more mailing lists