linux-kernel - Re: [PATCH v2 1/2] genirq: Retain depth for managed IRQs across CPU hotplug

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMcHhXrT-y3EotxrcCZ0Pj8Sic6wsPSmRiW7NSzdG=9iH8xqKg@mail.gmail.com>
Date: Wed, 11 Jun 2025 08:56:40 +0200
From: Aleksandrs Vinarskis <alex.vinarskis@...il.com>
To: Brian Norris <briannorris@...omium.org>
Cc: Thomas Gleixner <tglx@...utronix.de>, Tsai Sung-Fu <danielsftsai@...gle.com>, 
	Douglas Anderson <dianders@...omium.org>, linux-kernel@...r.kernel.org, 
	Johan Hovold <johan@...nel.org>
Subject: Re: [PATCH v2 1/2] genirq: Retain depth for managed IRQs across CPU hotplug

On Tue, 10 Jun 2025 at 22:07, Brian Norris <briannorris@...omium.org> wrote:
>
> Hi Alex,
>
> On Mon, Jun 09, 2025 at 08:19:58PM +0200, Aleksandrs Vinarskis wrote:
> > On Mon, 9 Jun 2025 at 19:13, Brian Norris <briannorris@...omium.org> wrote:
> > > On Fri, Jun 06, 2025 at 02:21:54PM +0200, Aleksandrs Vinarskis wrote:
> > > > It appears that this commit introduces a critical bug observed on at least
> > > > some Qualcomm Snapdragon X1E/X1P laptops, rendering the suspend function
> > > > unusable.
>
> For my reference, are these laptops represented by the
> arch/arm64/boot/dts/qcom/x1e80100.dtsi family of device trees? I'm just
> trying to reason through what sorts of driver(s) are in use here, in
> case there's something I'm overlooking, as I don't have the laptop in
> question to test.

Hi,

Yes. Dell XPS 9345 is arch/arm64/boot/dts/qcom/x1e80100.dtsi based,
and Asus Zenbook A14 is arch/arm64/boot/dts/qcom/x1p42100.dtsi based,
which is a derivative but has a slightly different PCIe setup. So far
both laptops would behave in the same ways.

>
> > > > With this change in place, after successful suspend the device either:
> > > > 1. Cannot wake up at all. Screen stays black, even though PM has existed
> > > > suspend (observed by external LEDs controlled by PM)
> > > >
> > > > 2. Wakes up eventually after minutes (instead of seconds) with SSD related
> > > > errors in dmesg. System still exhibits errors eg. UI icons are not properly
> > > > loaded, WiFi does not (always) connect.
>
> FYI, my assumption here based on the log snippets and the patch in
> question is that "only" the NVMe driver's IRQs are getting b0rked by my
> change. I could imagine that would produce the above symptoms in most
> laptop configurations, because failing disk I/O will likely block most
> wakeup-related activity, and cause all sorts of UI and system daemon
> (e.g., WiFi supplicant) misbehavior.
>
> > > I'm sorry to hear this has caused regressions. I don't yet know why your
> > > particular problems have occurred, but I did notice last week that there
> > > were some issues with the patch in question. I wrote a patch which I'll
> > > append, and have started (but not completely finished) testing it.
> > > Perhaps you could try it out and let me know how it goes?
> >
> > Hi Brian,
> >
> > I have tested your attached patch in addition to the original one, and
> > unfortunately it did not resolve the problem on either of the two
> > laptops: neither managed to wake up, just like before.
> > Will be happy to promptly test other proposed solutions.
>
> Thanks for the testing. I've found a few problems with my proposed
> patch, and I've come up with the appended alternative that solves them.
> Could you give it a try?

Just tested, and it appears to solve it, though I see some errors on
wakeup that I don't remember seeing before. I will test-drive this
setup for a day to provide better feedback and confirm if it is
related to the fixup or not.

>
> Also, if it's not too much trouble (and especially if my patch still
> doesn't help you), could you also provide a more complete kernel log and
> kernel .config file? (Attachment is fine with me. Or a direct email, if
> somehow the lists don't like it.) It's possible that would give me more
> hints as to what's going wrong for you.

I will share the logs with and without the fixup by private email
attachment in a bit.

Thanks for looking into this,
Alex


>
> Thanks,
> Brian
>
> diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
> index b0e0a7332993..3e873c5ce623 100644
> --- a/kernel/irq/chip.c
> +++ b/kernel/irq/chip.c
> @@ -205,12 +205,15 @@ __irq_startup_managed(struct irq_desc *desc, const struct cpumask *aff,
>
>  void irq_startup_managed(struct irq_desc *desc)
>  {
> +       struct irq_data *d = irq_desc_get_irq_data(desc);
> +
>         /*
>          * Only start it up when the disable depth is 1, so that a disable,
>          * hotunplug, hotplug sequence does not end up enabling it during
>          * hotplug unconditionally.
>          */
>         desc->depth--;
> +       irqd_clr_managed_shutdown(d);
>         if (!desc->depth)
>                 irq_startup(desc, IRQ_RESEND, IRQ_START_COND);
>  }
> diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
> index f07529ae4895..755346ea9819 100644
> --- a/kernel/irq/cpuhotplug.c
> +++ b/kernel/irq/cpuhotplug.c
> @@ -210,13 +210,6 @@ static void irq_restore_affinity_of_irq(struct irq_desc *desc, unsigned int cpu)
>             !irq_data_get_irq_chip(data) || !cpumask_test_cpu(cpu, affinity))
>                 return;
>
> -       /*
> -        * Don't restore suspended interrupts here when a system comes back
> -        * from S3. They are reenabled via resume_device_irqs().
> -        */
> -       if (desc->istate & IRQS_SUSPENDED)
> -               return;
> -
>         if (irqd_is_managed_and_shutdown(data))
>                 irq_startup_managed(desc);
>