Message-ID: <20171228142117.GA10658@chirva-slack.chirva-slack>
Date:   Thu, 28 Dec 2017 09:21:17 -0500
From:   Alexandru Chirvasitu <achirvasub@...il.com>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     Dou Liyang <douly.fnst@...fujitsu.com>,
        Pavel Machek <pavel@....cz>,
        kernel list <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...hat.com>,
        "Maciej W. Rozycki" <macro@...ux-mips.org>,
        Mikael Pettersson <mikpelinux@...il.com>,
        Josh Poulson <jopoulso@...rosoft.com>,
        Mihai Costache <v-micos@...rosoft.com>,
        Stephen Hemminger <sthemmin@...rosoft.com>,
        Marc Zyngier <marc.zyngier@....com>, linux-pci@...r.kernel.org,
        Haiyang Zhang <haiyangz@...rosoft.com>,
        Dexuan Cui <decui@...rosoft.com>,
        Simon Xiao <sixiao@...rosoft.com>,
        Saeed Mahameed <saeedm@...lanox.com>,
        Jork Loeser <Jork.Loeser@...rosoft.com>,
        Bjorn Helgaas <bhelgaas@...gle.com>,
        devel@...uxdriverproject.org, KY Srinivasan <kys@...rosoft.com>
Subject: Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

Thanks for all of this!

On Thu, Dec 28, 2017 at 12:00:47PM +0100, Thomas Gleixner wrote:
> On Wed, 20 Dec 2017, Alexandru Chirvasitu wrote:
> > On Wed, Dec 20, 2017 at 11:58:57AM +0800, Dou Liyang wrote:
> > > At 12/20/2017 08:31 AM, Thomas Gleixner wrote:
> > > > > I had never heard of 'bisect' before this casual mention (you might tell
> > > > > I am a bit out of my depth). I've since applied it to Linus' tree between
> > > > 
> > > > > bebc608 Linux 4.14 (good)
> > > > > 
> > > > > and
> > > > > 
> > > > > 4fbd8d1 Linux 4.15-rc1 (bad)
> > > > 
> > > > Is Linus current head 4.15-rc4 bad as well?
> > > > 
> > > [...]
> > 
> > Yes. Exactly the same symptoms on
> > 
> > 1291a0d5 Linux 4.15-rc4
> > 
> > compiled just now from Linus' tree. 
> 
> Ok, lets take a step back. The bisect/kexec attempts led us away from the
> initial problem which is the machine locking up after login, right?
>

Yes; sorry about that.

> Could you try the patch below on top of Linus tree (rc5+)?
> 
> Thanks,
> 
> 	tglx
> 
> 8<---------------
> --- a/arch/x86/kernel/apic/apic_flat_64.c
> +++ b/arch/x86/kernel/apic/apic_flat_64.c
> @@ -151,7 +151,7 @@ static struct apic apic_flat __ro_after_
>  	.apic_id_valid			= default_apic_id_valid,
>  	.apic_id_registered		= flat_apic_id_registered,
>  
> -	.irq_delivery_mode		= dest_LowestPrio,
> +	.irq_delivery_mode		= dest_Fixed,
>  	.irq_dest_mode			= 1, /* logical */
>  
>  	.disable_esr			= 0,
> --- a/arch/x86/kernel/apic/probe_32.c
> +++ b/arch/x86/kernel/apic/probe_32.c
> @@ -105,7 +105,7 @@ static struct apic apic_default __ro_aft
>  	.apic_id_valid			= default_apic_id_valid,
>  	.apic_id_registered		= default_apic_id_registered,
>  
> -	.irq_delivery_mode		= dest_LowestPrio,
> +	.irq_delivery_mode		= dest_Fixed,
>  	/* logical delivery broadcast to all CPUs: */
>  	.irq_dest_mode			= 1,
>  
> --- a/arch/x86/kernel/apic/x2apic_cluster.c
> +++ b/arch/x86/kernel/apic/x2apic_cluster.c
> @@ -184,7 +184,7 @@ static struct apic apic_x2apic_cluster _
>  	.apic_id_valid			= x2apic_apic_id_valid,
>  	.apic_id_registered		= x2apic_apic_id_registered,
>  
> -	.irq_delivery_mode		= dest_LowestPrio,
> +	.irq_delivery_mode		= dest_Fixed,
>  	.irq_dest_mode			= 1, /* logical */
>  
>  	.disable_esr			= 0,
> 

I tried both patches that you guys sent in the last couple of
messages. I applied each one separately to the last 4.15-rc5 kernel I
had (the one for which I sent Dou the journalctl output). Both diffs
below are against that version.

Results follow.


(1)

Dou's patch:

------------------------------------------------------------

    x86/vector: Replace the raw_spin_lock() with

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 7504491..e5bab02 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -726,6 +726,7 @@ static int apic_set_affinity(struct irq_data *irqd,
                             const struct cpumask *dest, bool force)
 {
        struct apic_chip_data *apicd = apic_chip_data(irqd);
+       unsigned long flags;
        int err;
 
        /*
@@ -740,13 +741,13 @@ static int apic_set_affinity(struct irq_data *irqd,
            (apicd->is_managed || apicd->can_reserve))
                return IRQ_SET_MASK_OK;
 
-       raw_spin_lock(&vector_lock);
+       raw_spin_lock_irqsave(&vector_lock, flags);
        cpumask_and(vector_searchmask, dest, cpu_online_mask);
        if (irqd_affinity_is_managed(irqd))
                err = assign_managed_vector(irqd, vector_searchmask);
        else
                err = assign_vector_locked(irqd, vector_searchmask);
-       raw_spin_unlock(&vector_lock);
+       raw_spin_unlock_irqrestore(&vector_lock, flags);
        return err ? err : IRQ_SET_MASK_OK;
 }

------------------------------------------------------------

With this, I still get the lockup messages after login, but not the
freezes!

The lockups register in the log, which I am attaching (see below for
attachment naming conventions).

The computer's still clearly impaired (ethernet won't connect again
for instance, and the CPU distress messages happen periodically
throughout the tty session), but at least it's logged now.

---

(2)

Thomas' patch:

------------------------------------------------------------

    apic patch from tglx

diff --git a/arch/x86/kernel/apic/apic_flat_64.c b/arch/x86/kernel/apic/apic_flat_64.c
index aa85690..1f734d4 100644
--- a/arch/x86/kernel/apic/apic_flat_64.c
+++ b/arch/x86/kernel/apic/apic_flat_64.c
@@ -151,7 +151,7 @@ static struct apic apic_flat __ro_after_init = {
        .apic_id_valid                  = default_apic_id_valid,
        .apic_id_registered             = flat_apic_id_registered,
 
-       .irq_delivery_mode              = dest_LowestPrio,
+       .irq_delivery_mode              = dest_Fixed,
        .irq_dest_mode                  = 1, /* logical */
 
        .disable_esr                    = 0,
diff --git a/arch/x86/kernel/apic/probe_32.c b/arch/x86/kernel/apic/probe_32.c
index fa22017..765cded 100644
--- a/arch/x86/kernel/apic/probe_32.c
+++ b/arch/x86/kernel/apic/probe_32.c
@@ -105,7 +105,7 @@ static struct apic apic_default __ro_after_init = {
        .apic_id_valid                  = default_apic_id_valid,
        .apic_id_registered             = default_apic_id_registered,
 
-       .irq_delivery_mode              = dest_LowestPrio,
+       .irq_delivery_mode              = dest_Fixed,
        /* logical delivery broadcast to all CPUs: */
        .irq_dest_mode                  = 1,
 
diff --git a/arch/x86/kernel/apic/x2apic_cluster.c b/arch/x86/kernel/apic/x2apic_cluster.c
index 622f13c..39568bd 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -184,7 +184,7 @@ static struct apic apic_x2apic_cluster __ro_after_init = {
        .apic_id_valid                  = x2apic_apic_id_valid,
        .apic_id_registered             = x2apic_apic_id_registered,
 
-       .irq_delivery_mode              = dest_LowestPrio,
+       .irq_delivery_mode              = dest_Fixed,
        .irq_dest_mode                  = 1, /* logical */
 
        .disable_esr                    = 0,

------------------------------------------------------------

This gives me the same disabling lockups as before, i.e. I have to
reboot. Consequently, the log I'm attaching for this kernel probably
won't be of much use, because I could only retrieve it after the
reboot, with `journalctl --boot=-1`.

It might still be of some use.

---

The log files I'm attaching comply with the following naming pattern:

'dou' means the log comes from the kernel patched with Dou's patch;

'thms' refers to Thomas' patch;

'jrnl' means it came from journalctl with various --boot=? options;

'dmesg' means it came from calling dmesg;

'noparam' means the kernel was booted with no additional parameters;

'debug' means it was called with 'apic=debug'. 
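
As a rough illustration of the naming pattern above, here is a small
Python sketch that decodes an attachment name. The helper name
`describe_log` and the field order log-<source>-<patch>-<params> are
my assumptions, inferred from the attachment names on this message;
nothing here comes from the kernel tree.

```python
# Hypothetical helper: decode names such as "log-jrnl-dou-noparam".
# The field order is an assumption inferred from the attachment names.
SOURCES = {"jrnl": "journalctl", "dmesg": "dmesg"}
PATCHES = {"dou": "Dou's patch", "thms": "Thomas' patch"}
PARAMS = {
    "noparam": "no additional kernel parameters",
    "noparams": "no additional kernel parameters",  # accept both spellings
    "debug": "apic=debug",
}


def describe_log(name):
    """Return (log source, patch applied, boot parameters) for a log name."""
    prefix, source, patch, params = name.split("-")
    if prefix != "log":
        raise ValueError(f"unexpected prefix in {name!r}")
    return SOURCES[source], PATCHES[patch], PARAMS[params]


if __name__ == "__main__":
    for name in ("log-jrnl-dou-noparam", "log-jrnl-thms-debug"):
        print(name, "->", describe_log(name))
```

A name with a tag missing from these tables raises KeyError, which
seemed preferable to guessing.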

---

P.S.

It was very considerate of you to send the attachment, Dou, but that
shouldn't be necessary anymore; the issues I had with 'git apply' were
whitespace-related, and I've managed to resolve them.

So anything copy-pastable directly in the message body should do if I
try more patches.
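
For anyone hitting a similar 'git apply' failure, the Python sketch
below shows one way a patch can stop matching: I am assuming here that
tabs were turned into spaces somewhere in the mail path, and the
function name `same_modulo_whitespace` is made up for the example. It
only demonstrates the comparison; 'git apply' has an
--ignore-whitespace option that relaxes matching of context lines in a
similar spirit.

```python
# Sketch under an assumption: tabs in the patch became spaces in transit.
def same_modulo_whitespace(a, b):
    """True if two lines match once runs of spaces and tabs are collapsed."""
    return a.split() == b.split()


# The same removed line, once indented with tabs and once with spaces.
tab_line = "-\t.irq_delivery_mode\t\t= dest_LowestPrio,"
space_line = "-       .irq_delivery_mode              = dest_LowestPrio,"

if __name__ == "__main__":
    print("exact match:           ", tab_line == space_line)
    print("match modulo whitespace:", same_modulo_whitespace(tab_line, space_line))
```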

Thank you!

View attachment "log-jrnl-dou-noparam" of type "text/plain" (99975 bytes)

View attachment "log-jrnl-dou-debug" of type "text/plain" (104495 bytes)

View attachment "log-dmesg-dou-noparam" of type "text/plain" (62501 bytes)

View attachment "log-jrnl-thms-debug" of type "text/plain" (73252 bytes)
