linux-kernel - Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

Message-ID: <alpine.DEB.2.20.1712281531120.1899@nanos>
Date:   Thu, 28 Dec 2017 15:48:15 +0100 (CET)
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Alexandru Chirvasitu <achirvasub@...il.com>
cc:     Dou Liyang <douly.fnst@...fujitsu.com>,
        Pavel Machek <pavel@....cz>,
        kernel list <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...hat.com>,
        "Maciej W. Rozycki" <macro@...ux-mips.org>,
        Mikael Pettersson <mikpelinux@...il.com>,
        Josh Poulson <jopoulso@...rosoft.com>,
        Mihai Costache <v-micos@...rosoft.com>,
        Stephen Hemminger <sthemmin@...rosoft.com>,
        Marc Zyngier <marc.zyngier@....com>, linux-pci@...r.kernel.org,
        Haiyang Zhang <haiyangz@...rosoft.com>,
        Dexuan Cui <decui@...rosoft.com>,
        Simon Xiao <sixiao@...rosoft.com>,
        Saeed Mahameed <saeedm@...lanox.com>,
        Jork Loeser <Jork.Loeser@...rosoft.com>,
        Bjorn Helgaas <bhelgaas@...gle.com>,
        devel@...uxdriverproject.org, KY Srinivasan <kys@...rosoft.com>
Subject: Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> On Thu, Dec 28, 2017 at 12:00:47PM +0100, Thomas Gleixner wrote:
> > Ok, let's take a step back. The bisect/kexec attempts led us away from the
> > initial problem which is the machine locking up after login, right?
> >
> 
> Yes; sorry about that.

Nothing to be sorry about.

>     x86/vector: Replace the raw_spin_lock() with
> 
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 7504491..e5bab02 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -726,6 +726,7 @@ static int apic_set_affinity(struct irq_data *irqd,
>                              const struct cpumask *dest, bool force)
>  {
>         struct apic_chip_data *apicd = apic_chip_data(irqd);
> +       unsigned long flags;
>         int err;
>  
>         /*
> @@ -740,13 +741,13 @@ static int apic_set_affinity(struct irq_data *irqd,
>             (apicd->is_managed || apicd->can_reserve))
>                 return IRQ_SET_MASK_OK;
>  
> -       raw_spin_lock(&vector_lock);
> +       raw_spin_lock_irqsave(&vector_lock, flags);
>         cpumask_and(vector_searchmask, dest, cpu_online_mask);
>         if (irqd_affinity_is_managed(irqd))
>                 err = assign_managed_vector(irqd, vector_searchmask);
>         else
>                 err = assign_vector_locked(irqd, vector_searchmask);
> -       raw_spin_unlock(&vector_lock);
> +       raw_spin_unlock_irqrestore(&vector_lock, flags);
>         return err ? err : IRQ_SET_MASK_OK;
>  }
> 
> With this, I still get the lockup messages after login, but not the
> freezes!

That's really interesting. There should be no code path which calls into
that with interrupts enabled. I assume you never ran that kernel with
CONFIG_PROVE_LOCKING=y.
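
To illustrate why the irqsave variant can matter at all, here is a minimal
userspace sketch. It is not kernel code: every model_* name is hypothetical,
the interrupt flag and the lock are plain booleans, and it assumes that some
interrupt handler on the same CPU also wants the same lock. Under that
assumption, holding the lock with interrupts enabled leaves a window in which
the handler could spin on a lock its own CPU already holds.

```c
/* Userspace model only; all model_* names are hypothetical. */
#include <assert.h>
#include <stdbool.h>

static bool model_irqs_enabled = true;	/* stands in for the CPU interrupt flag */
static bool model_lock_held;		/* stands in for vector_lock */

/* Modeled on raw_spin_lock(): takes the lock, leaves interrupt state alone. */
static void model_lock(void)
{
	assert(!model_lock_held);
	model_lock_held = true;
}

static void model_unlock(void)
{
	assert(model_lock_held);
	model_lock_held = false;
}

/* Modeled on raw_spin_lock_irqsave(): save state, disable, then lock. */
static bool model_lock_irqsave(void)
{
	bool flags = model_irqs_enabled;

	model_irqs_enabled = false;
	model_lock();
	return flags;
}

/* Modeled on raw_spin_unlock_irqrestore(): unlock, then restore saved state. */
static void model_unlock_irqrestore(bool flags)
{
	model_unlock();
	model_irqs_enabled = flags;
}

/*
 * True if an interrupt could arrive on this CPU while the lock is held.
 * Assumption: the handler for that interrupt would try to take the lock.
 */
static bool model_self_deadlock_possible(void)
{
	return model_lock_held && model_irqs_enabled;
}
```

In this model the plain lock is only safe when the caller has already
disabled interrupts, which is the invariant the reply above expects to hold
for every caller.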

Find below a debug patch which should show us the call chain for that
case. Please apply that on top of Dou's patch so the machine stays
accessible. Plain output from dmesg is sufficient.
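
As a rough illustration of what a one-shot check of this kind does, the
following userspace sketch reports the first call made with interrupts
enabled and stays silent afterwards. It is only a model under stated
assumptions: the model_* names are hypothetical, the interrupt state is a
plain boolean, and the real kernel macro additionally dumps a stack trace,
which is the part that reveals the call chain.

```c
/* Userspace model only; all model_* names are hypothetical. */
#include <stdbool.h>
#include <stdio.h>

static bool model_irqs_on = true;	/* stands in for the CPU interrupt flag */
static int model_warnings;		/* number of warnings actually printed */

static bool model_irqs_disabled(void)
{
	return !model_irqs_on;
}

/* Stand-in for the function being instrumented. */
static int model_set_affinity(void)
{
	static bool warned;	/* one flag for this call site: warn once only */

	if (!model_irqs_disabled() && !warned) {
		warned = true;
		model_warnings++;
		fprintf(stderr,
			"WARNING: model_set_affinity() called with interrupts enabled\n");
	}

	/* The real work would follow here. */
	return 0;
}
```

Calling model_set_affinity() repeatedly with model_irqs_on set prints a single
warning, so an unexpected caller is reported without flooding the log.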

> The lockups register in the log, which I am attaching (see below for
> attachment naming conventions).

Hmm. Those are RCU lockups, and the backtrace on the CPU which gets the stall
looks very familiar. I'd like to see the output of the debug patch first, and
then I'll send you another pile of patches which might cure that RCU issue.

Thanks,

	tglx

8<-------------------
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -729,6 +729,8 @@ static int apic_set_affinity(struct irq_
 	unsigned long flags;
 	int err;
 
+	WARN_ON_ONCE(!irqs_disabled());
+
 	/*
 	 * Core code can call here for inactive interrupts. For inactive
 	 * interrupts which use managed or reservation mode there is no