Message-ID: <221f6dadb6d8ce06f30a24baaa2777e90d75b130.camel@redhat.com>
Date: Mon, 26 Jan 2026 17:26:47 -0500
From: Radu Rendec <rrendec@...hat.com>
To: Jon Hunter <jonathanh@...dia.com>, Thomas Gleixner <tglx@...nel.org>,
	Manivannan Sadhasivam <mani@...nel.org>
Cc: Daniel Tsai <danielsftsai@...gle.com>, Marek Behún <kabel@...nel.org>,
	Krishna Chaitanya Chundru <quic_krichai@...cinc.com>,
	Bjorn Helgaas <bhelgaas@...gle.com>, Rob Herring <robh@...nel.org>,
	Krzysztof Wilczyński <kwilczynski@...nel.org>,
	Lorenzo Pieralisi <lpieralisi@...nel.org>,
	Jingoo Han <jingoohan1@...il.com>, Brian Masney <bmasney@...hat.com>,
	Eric Chanudet <echanude@...hat.com>,
	Alessandro Carminati <acarmina@...hat.com>,
	Jared Kangas <jkangas@...hat.com>, linux-pci@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	"linux-tegra@...r.kernel.org" <linux-tegra@...r.kernel.org>
Subject: Re: [PATCH v3 3/3] PCI: dwc: Enable MSI affinity support

Hi Jon,

On Mon, 2026-01-26 at 22:07 +0000, Jon Hunter wrote:
> On 26/01/2026 07:59, Thomas Gleixner wrote:
> > On Thu, Jan 22 2026 at 18:31, Radu Rendec wrote:
> > > The CPUs are taken offline one by one, starting with CPU 7. The code in
> > > question runs on the dying CPU, and with hardware interrupts disabled
> > > on all CPUs. The (simplified) call stack looks like this:
> > >
> > > irq_migrate_all_off_this_cpu
> > > for_each_active_irq
> > > migrate_one_irq
> > > irq_do_set_affinity
> > > irq_chip_redirect_set_affinity (via chip->irq_set_affinity)
> > >
> > > The debug patch I gave you adds:
> > > * a printk to irq_chip_redirect_set_affinity (which is very small)
> > > * a printk at the beginning of migrate_one_irq
> > >
> > > Also, the call to irq_do_set_affinity is almost the last thing that
> > > happens in migrate_one_irq, and that for_each_active_irq loop is quite
> > > small too. So, there isn't much happening between the printk in
> > > irq_chip_redirect_set_affinity for the msi irq (which we do see in the
> > > log) and the printk in migrate_one_irq for the next irq (which we don't
> > > see).
> >
> > This doesn't make any sense at all. irq_chip_redirect_set_affinity() is
> > only accessing interrupt descriptor associated memory and the new
> > redirection CPU is the same as the previous one as the mask changes from
> > 0xff to 0x7f and therefore cpumask_first() yields 0 in both cases.
> >
> > According to the provided dmesg, this happens on linux-next.
> >
> > Jon, can you please validate that this happens as well on
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/msi
>
>
> I tried this branch and I see suspend failing with that branch too. If I
> revert this change on top of your branch or -next, I don't see any
> problems.
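
Thomas's point about cpumask_first() checks out, by the way: going from
0xff to 0x7f only clears the top bit of the mask, so the lowest set bit
(and with it the redirection target) is still CPU 0. A throwaway
userspace sketch, just to illustrate the bit math (first_cpu() is my
stand-in for cpumask_first(), not kernel code):

#include <stdio.h>

/*
 * Stand-in for cpumask_first(): index of the lowest set bit.
 * Only valid for a non-zero mask.
 */
static int first_cpu(unsigned int mask)
{
	return __builtin_ctz(mask);
}

int main(void)
{
	/* 0xff = CPUs 0-7 online; 0x7f = CPU 7 offline. */
	printf("%d %d\n", first_cpu(0xff), first_cpu(0x7f));
	return 0;	/* prints "0 0" */
}

So the target CPU doesn't change across this hotplug step, and the
set_affinity callback itself shouldn't be doing anything interesting
here.
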
The closest hardware I have access to is a Jetson Xavier NX, and you
already mentioned you couldn't reproduce the issue there (and it looks
like I can't even get hold of that board anyway). So I'm going to ask
you to test a few more things for me.

Can you please apply the patch below on top of the previous one I sent?
The suspect is the desc->lock raw spinlock taken in
irq_migrate_all_off_this_cpu(), although I can't think of any reason
why it shouldn't be free. But I don't have any better idea, and I would
like to narrow down the spot where hotplug gets stuck.

diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
index d8c62547f9d06..69c44da68e3a9 100644
--- a/kernel/irq/cpuhotplug.c
+++ b/kernel/irq/cpuhotplug.c
@@ -178,9 +178,11 @@ void irq_migrate_all_off_this_cpu(void)
 	for_each_active_irq(irq) {
 		bool affinity_broken;
 
+		pr_info("%s: irq %u\n", __func__, irq);
 		desc = irq_to_desc(irq);
 		scoped_guard(raw_spinlock, &desc->lock) {
 			affinity_broken = migrate_one_irq(desc);
+			pr_info("%s: migrate_one_irq -> %u\n", __func__, affinity_broken);
 			if (affinity_broken && desc->affinity_notify)
 				irq_affinity_schedule_notify_work(desc);
 		}
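
Assuming the loop really does get stuck on desc->lock, the tail of the
log should make it obvious: the last line would be an "irq %u" print
with no matching "migrate_one_irq" result after it. Something like this
(irq numbers made up for illustration):

irq_migrate_all_off_this_cpu: irq 25
irq_migrate_all_off_this_cpu: migrate_one_irq -> 0
irq_migrate_all_off_this_cpu: irq 26    <- stuck, no result line follows

If both lines show up for every irq, then we get past this loop and the
problem is somewhere else.
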
--
Thanks,
Radu