Message-ID: <3edbf9bd-4542-40d3-888c-470e793a46c0@nvidia.com>
Date: Fri, 23 Jan 2026 13:25:28 +0000
From: Jon Hunter <jonathanh@...dia.com>
To: Radu Rendec <rrendec@...hat.com>, Thomas Gleixner <tglx@...utronix.de>,
Manivannan Sadhasivam <mani@...nel.org>
Cc: Daniel Tsai <danielsftsai@...gle.com>, Marek Behún
<kabel@...nel.org>, Krishna Chaitanya Chundru <quic_krichai@...cinc.com>,
Bjorn Helgaas <bhelgaas@...gle.com>, Rob Herring <robh@...nel.org>,
Krzysztof Wilczyński <kwilczynski@...nel.org>,
Lorenzo Pieralisi <lpieralisi@...nel.org>, Jingoo Han
<jingoohan1@...il.com>, Brian Masney <bmasney@...hat.com>,
Eric Chanudet <echanude@...hat.com>,
Alessandro Carminati <acarmina@...hat.com>, Jared Kangas
<jkangas@...hat.com>, linux-pci@...r.kernel.org,
linux-kernel@...r.kernel.org,
"linux-tegra@...r.kernel.org" <linux-tegra@...r.kernel.org>
Subject: Re: [PATCH v3 3/3] PCI: dwc: Enable MSI affinity support
On 22/01/2026 23:31, Radu Rendec wrote:
...
> Thanks very much for running the test and for the logs. The good news
> is that good ol' printk debugging seems to be working, and the last
> message in the log is indeed related to dw-pci irq affinity control,
> which is what the patch touches. So we're on to something. The bad
> news is that I can't yet figure out what's wrong.
>
> The CPUs are taken offline one by one, starting with CPU 7. The code
> in question runs on the dying CPU, with hardware interrupts disabled
> on all CPUs. The (simplified) call stack looks like this:
>
> irq_migrate_all_off_this_cpu
>   for_each_active_irq
>     migrate_one_irq
>       irq_do_set_affinity
>         irq_chip_redirect_set_affinity (via chip->irq_set_affinity)
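>
> For reference, the surrounding loop in kernel/irq/cpuhotplug.c looks
> roughly like this (a simplified sketch from memory, so the details
> may differ in your tree):
>
>   void irq_migrate_all_off_this_cpu(void)
>   {
>           struct irq_desc *desc;
>           unsigned int irq;
>
>           for_each_active_irq(irq) {
>                   desc = irq_to_desc(irq);
>                   raw_spin_lock(&desc->lock);
>                   /*
>                    * Ends up in irq_do_set_affinity() and, with the
>                    * patch applied, in irq_chip_redirect_set_affinity()
>                    * via chip->irq_set_affinity.
>                    */
>                   migrate_one_irq(desc);
>                   raw_spin_unlock(&desc->lock);
>           }
>   }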
>
> The debug patch I gave you adds:
> * a printk to irq_chip_redirect_set_affinity (which is very small)
> * a printk at the beginning of migrate_one_irq
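>
> Concretely, the added lines look more or less like this (an
> illustrative sketch, not the exact diff I sent; placement depends on
> the kernel version):
>
>   /* at the top of migrate_one_irq() */
>   pr_info("%s: irq %u\n", __func__,
>           irq_desc_get_irq_data(desc)->irq);
>
>   /* inside irq_chip_redirect_set_affinity() */
>   pr_info("%s: irq %u\n", __func__, d->irq);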
>
> Also, the call to irq_do_set_affinity is almost the last thing that
> happens in migrate_one_irq, and that for_each_active_irq loop is
> quite small too. So, there isn't much happening between the printk in
> irq_chip_redirect_set_affinity for the MSI IRQ (which we do see in
> the log) and the printk in migrate_one_irq for the next IRQ (which we
> don't see).
>
> My first thought is to add more printks between those two points and
> narrow down the spot where it gets stuck.
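>
> Something along these lines, for example (hypothetical placement; the
> exact spots depend on the tree):
>
>   /* in migrate_one_irq(), right after the affinity change */
>   err = irq_do_set_affinity(d, affinity, false);
>   pr_info("%s: irq %u: irq_do_set_affinity() returned %d\n",
>           __func__, d->irq, err);
>
>   /* in irq_migrate_all_off_this_cpu(), after each iteration */
>   pr_info("%s: irq %u migrated\n", __func__, irq);
>
> If the first of those shows up for the MSI IRQ but the second one
> doesn't, we know the hang is somewhere in between.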
>
> I think the fastest way to debug it is if I can test it myself. I
> tried to reproduce the issue on a Jetson AGX Orin, and I couldn't. By
> the way, how often does it hang? E.g., out of, say, 10 suspend
> attempts, how many fail?
For Jetson AGX Xavier it fails on the first suspend attempt.
> I do have access to a Jetson Xavier NX (in theory), but it looks like
> there's a lab issue with that board, which will hopefully get sorted
> out tomorrow. If I can't get hold of that board (or can't reproduce
> the problem on it), I may ask you to try a few other things. In any
> case, I'll update this thread again either tomorrow or (more likely)
> early next week.
Weirdly, I don't see this with the Jetson Xavier NX. However, it could
still be worth trying, but you may wish to revert this change [0]
first, because it is causing other issues on the Jetson Xavier NX.
Jon
[0] https://lore.kernel.org/linux-tegra/e32b0819-2c29-4c83-83d5-e28dc4b2b01f@nvidia.com/
--
nvpublic