Message-ID: <96c9d483f67be02fa1dba736fea465216d0c3269.camel@redhat.com>
Date: Thu, 22 Jan 2026 18:31:01 -0500
From: Radu Rendec <rrendec@...hat.com>
To: Jon Hunter <jonathanh@...dia.com>, Thomas Gleixner <tglx@...utronix.de>,
  Manivannan Sadhasivam	 <mani@...nel.org>
Cc: Daniel Tsai <danielsftsai@...gle.com>, Marek Behún	
 <kabel@...nel.org>, Krishna Chaitanya Chundru <quic_krichai@...cinc.com>, 
 Bjorn Helgaas <bhelgaas@...gle.com>, Rob Herring <robh@...nel.org>,
 Krzysztof Wilczyński	 <kwilczynski@...nel.org>, Lorenzo
 Pieralisi <lpieralisi@...nel.org>, Jingoo Han	 <jingoohan1@...il.com>,
 Brian Masney <bmasney@...hat.com>, Eric Chanudet	 <echanude@...hat.com>,
 Alessandro Carminati <acarmina@...hat.com>, Jared Kangas	
 <jkangas@...hat.com>, linux-pci@...r.kernel.org,
 linux-kernel@...r.kernel.org,  "linux-tegra@...r.kernel.org"	
 <linux-tegra@...r.kernel.org>
Subject: Re: [PATCH v3 3/3] PCI: dwc: Enable MSI affinity support

Hi Jon,

On Wed, 2026-01-21 at 14:00 +0000, Jon Hunter wrote:
> 
> On 20/01/2026 22:30, Radu Rendec wrote:
> 
> ...
> 
> > > So it looks like it is hanging when disabling the non-boot CPUs. So far
> > > it only appears to happen on Tegra194.
> > > 
> > > Let me know if you have any suggestions.
> > 
> > Ouch. I'm afraid this is going to be much harder to figure out than the
> > previous one, especially since I can't get access easily to a board to
> > test on. I will try to reserve a board and reproduce the bug.
> > 
> > Meanwhile, if you (or someone else in your team) can spare a few cycles,
> > could you please try to reproduce the bug again with the debug patch
> > below applied, and a few other changes:
> >   * enable debug messages in kernel/irq/cpuhotplug.c;
> >   * save the contents of /proc/interrupts to a file before suspending;
> >   * add "no_console_suspend" to the kernel command line (although it
> >     looks like you already have it).
> > 
> > It will be much more verbose during suspend but hopefully we can at
> > least figure out how far along it goes and how it's related to the MSI
> > affinity configuration.
> 
> 
> Thanks. I have dumped the boot log with the prints here:
> 
> https://pastebin.com/G8c2ssdt
> 
> And the dump of /proc/interrupts here:
> 
> https://pastebin.com/Wqzxw3r6
> 
> Looks like the last thing I see entering suspend is ...
> 
>   irq_chip_redirect_set_affinity: irq 162 mask 0x7f
> 
> That appears to be a PCIe interrupt. Let me know if there are more tests 
> I can run.

Thanks very much for running the test and for the logs. The good news
is good ol' printk debugging seems to be working, and the last message
in the log is indeed related to dw-pci irq affinity control, which is
what the patch touches. So we're on to something. The bad news is I
can't yet figure out what's wrong.

The CPUs are taken offline one by one, starting with CPU 7. The code in
question runs on the dying CPU, and with hardware interrupts disabled
on all CPUs. The (simplified) call stack looks like this:

irq_migrate_all_off_this_cpu
  for_each_active_irq
    migrate_one_irq
      irq_do_set_affinity
        irq_chip_redirect_set_affinity (via chip->irq_set_affinity)
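
To make the shape of that loop concrete, here is a toy Python model of it.
This is only a sketch and not the kernel code: the function name
migrate_all_off_cpu, the dictionary of affinities, and the fallback to "all
remaining CPUs" are simplifications I am assuming for illustration.

```python
def migrate_all_off_cpu(dying_cpu, online_cpus, irq_affinity):
    """Toy model: retarget every IRQ whose affinity includes dying_cpu.

    irq_affinity maps an IRQ number to the set of CPUs it may target
    (a hypothetical stand-in for the per-IRQ affinity mask).
    Returns the ordered trace of (irq, sorted new affinity) updates.
    """
    remaining = set(online_cpus) - {dying_cpu}
    trace = []
    for irq in sorted(irq_affinity):        # stands in for for_each_active_irq
        affinity = irq_affinity[irq]
        if dying_cpu not in affinity:       # nothing to fix up for this IRQ
            continue
        # Keep the CPUs that are still available; if none are left,
        # fall back to every remaining CPU.
        new_affinity = (affinity & remaining) or set(remaining)
        irq_affinity[irq] = new_affinity    # stands in for irq_do_set_affinity
        trace.append((irq, sorted(new_affinity)))
    return trace


# IRQ numbers 20 and 163 are made up; only 162 appears in the log.
irqs = {20: {0}, 162: set(range(8)), 163: {7}}
for irq, cpus in migrate_all_off_cpu(7, range(8), irqs):
    print(f"irq {irq} -> cpus {cpus}")
```

In this model each update is one step of the loop, so a hang right after one
update means the next iteration never started.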

The debug patch I gave you adds:
 * a printk to irq_chip_redirect_set_affinity (which is very small)
 * a printk at the beginning of migrate_one_irq

Also, the call to irq_do_set_affinity is almost the last thing that
happens in migrate_one_irq, and the for_each_active_irq loop is quite
small too. So there isn't much happening between the printk in
irq_chip_redirect_set_affinity for the MSI IRQ (which we do see in the
log) and the printk in migrate_one_irq for the next IRQ (which we don't
see).
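
As a small aid for reading the logs, the sketch below pulls the last
irq_chip_redirect_set_affinity line out of a captured log and decodes its
mask. It assumes the line format quoted above ("irq N mask 0xHEX") and that
bit n of the mask corresponds to CPU n; the helper names are made up.

```python
import re

# Matches the debug line format quoted in this thread, e.g.
#   irq_chip_redirect_set_affinity: irq 162 mask 0x7f
LINE_RE = re.compile(
    r"irq_chip_redirect_set_affinity: irq (\d+) mask 0x([0-9a-fA-F]+)"
)


def cpus_in_mask(mask):
    """Return the CPU numbers whose bits are set (assumes bit n == CPU n)."""
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def last_affinity_update(log_text):
    """Return (irq, cpus) for the last matching line, or None if absent."""
    matches = LINE_RE.findall(log_text)
    if not matches:
        return None
    irq, mask_hex = matches[-1]
    return int(irq), cpus_in_mask(int(mask_hex, 16))


sample = "irq_chip_redirect_set_affinity: irq 162 mask 0x7f\n"
print(last_affinity_update(sample))  # (162, [0, 1, 2, 3, 4, 5, 6])
```

Under that assumption, 0x7f covers CPUs 0 through 6, which is consistent with
CPU 7 being the one going offline at that point.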

My first thought is to add more printk's between those two and narrow
down the spot where it gets stuck.

I think the fastest way to debug this is for me to test it myself. I tried
to reproduce the issue on a Jetson AGX Orin and couldn't. By the way,
how often does it hang? For example, out of 10 suspend attempts, how many
fail?

I do have access to a Jetson Xavier NX (in theory) but it looks like
there's a lab issue with that board, which hopefully gets sorted out
tomorrow. If I can't get a hold of that board (or can't reproduce the
problem on it), I may ask you to try a few other things. In any case,
I'll update this thread again either tomorrow or (more likely) early
next week.

-- 
Thanks,
Radu

