Message-ID: <3edbf9bd-4542-40d3-888c-470e793a46c0@nvidia.com>
Date: Fri, 23 Jan 2026 13:25:28 +0000
From: Jon Hunter <jonathanh@...dia.com>
To: Radu Rendec <rrendec@...hat.com>, Thomas Gleixner <tglx@...utronix.de>,
Manivannan Sadhasivam <mani@...nel.org>
Cc: Daniel Tsai <danielsftsai@...gle.com>, Marek Behún
<kabel@...nel.org>, Krishna Chaitanya Chundru <quic_krichai@...cinc.com>,
Bjorn Helgaas <bhelgaas@...gle.com>, Rob Herring <robh@...nel.org>,
Krzysztof Wilczyński <kwilczynski@...nel.org>,
Lorenzo Pieralisi <lpieralisi@...nel.org>, Jingoo Han
<jingoohan1@...il.com>, Brian Masney <bmasney@...hat.com>,
Eric Chanudet <echanude@...hat.com>,
Alessandro Carminati <acarmina@...hat.com>, Jared Kangas
<jkangas@...hat.com>, linux-pci@...r.kernel.org,
linux-kernel@...r.kernel.org,
"linux-tegra@...r.kernel.org" <linux-tegra@...r.kernel.org>
Subject: Re: [PATCH v3 3/3] PCI: dwc: Enable MSI affinity support
On 22/01/2026 23:31, Radu Rendec wrote:
...
> Thanks very much for running the test and for the logs. The good news
> is that good ol' printk debugging seems to be working, and the last
> message in the log is indeed related to dw-pci irq affinity control,
> which is what the patch touches. So we're on to something. The bad
> news is that I can't yet figure out what's wrong.
>
> The CPUs are taken offline one by one, starting with CPU 7. The code
> in question runs on the dying CPU, with hardware interrupts disabled
> on all CPUs. The (simplified) call stack looks like this:
>
> irq_migrate_all_off_this_cpu
>   for_each_active_irq
>     migrate_one_irq
>       irq_do_set_affinity
>         irq_chip_redirect_set_affinity (via chip->irq_set_affinity)
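>
> For reference, the surrounding loop in kernel/irq/cpuhotplug.c looks
> roughly like this (a simplified sketch from memory, so the details
> may differ in your tree):
>
>   void irq_migrate_all_off_this_cpu(void)
>   {
>           struct irq_desc *desc;
>           unsigned int irq;
>
>           for_each_active_irq(irq) {
>                   desc = irq_to_desc(irq);
>                   raw_spin_lock(&desc->lock);
>                   /*
>                    * Ends up in irq_do_set_affinity() and, with the
>                    * patch applied, in irq_chip_redirect_set_affinity()
>                    * via chip->irq_set_affinity.
>                    */
>                   migrate_one_irq(desc);
>                   raw_spin_unlock(&desc->lock);
>           }
>   }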
>
> The debug patch I gave you adds:
> * a printk to irq_chip_redirect_set_affinity (which is very small)
> * a printk at the beginning of migrate_one_irq
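>
> Concretely, the added lines look more or less like this (an
> illustrative sketch, not the exact diff I sent; placement depends on
> the kernel version):
>
>   /* at the top of migrate_one_irq() */
>   pr_info("%s: irq %u\n", __func__,
>           irq_desc_get_irq_data(desc)->irq);
>
>   /* inside irq_chip_redirect_set_affinity() */
>   pr_info("%s: irq %u\n", __func__, d->irq);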
>
> Also, the call to irq_do_set_affinity is almost the last thing that
> happens in migrate_one_irq, and that for_each_active_irq loop is
> quite small too. So, there isn't much happening between the printk in
> irq_chip_redirect_set_affinity for the MSI IRQ (which we do see in
> the log) and the printk in migrate_one_irq for the next IRQ (which we
> don't see).
>
> My first thought is to add more printks between those two points and
> narrow down the spot where it gets stuck.
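>
> Something along these lines, for example (hypothetical placement; the
> exact spots depend on the tree):
>
>   /* in migrate_one_irq(), right after the affinity change */
>   err = irq_do_set_affinity(d, affinity, false);
>   pr_info("%s: irq %u: irq_do_set_affinity() returned %d\n",
>           __func__, d->irq, err);
>
>   /* in irq_migrate_all_off_this_cpu(), after each iteration */
>   pr_info("%s: irq %u migrated\n", __func__, irq);
>
> If the first of those shows up for the MSI IRQ but the second one
> doesn't, we know the hang is somewhere in between.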
>
> I think the fastest way to debug it is if I can test it myself. I
> tried to reproduce the issue on a Jetson AGX Orin, and I couldn't. By
> the way, how often does it hang? E.g., out of, say, 10 suspend
> attempts, how many fail?
For Jetson AGX Xavier it fails on the first suspend attempt.
> I do have access to a Jetson Xavier NX (in theory), but it looks like
> there's a lab issue with that board, which will hopefully get sorted
> out tomorrow. If I can't get hold of that board (or can't reproduce
> the problem on it), I may ask you to try a few other things. In any
> case, I'll update this thread again either tomorrow or (more likely)
> early next week.
Weirdly, I don't see this with the Jetson Xavier NX. However, it could
still be worth trying, but you may wish to revert this change [0]
first, because it is causing other issues on the Jetson Xavier NX.
Jon
[0] https://lore.kernel.org/linux-tegra/e32b0819-2c29-4c83-83d5-e28dc4b2b01f@nvidia.com/
--
nvpublic