Date:	Sat, 13 Jun 2015 09:15:47 +0200
From:	Ingo Molnar <mingo@...nel.org>
To:	Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>
Cc:	mingo@...hat.com, tglx@...utronix.de, hpa@...or.com, pavel@....cz,
	rjw@...ysocki.net, x86@...nel.org, linux-pm@...r.kernel.org,
	linux-kernel@...r.kernel.org, Denys Vlasenko <dvlasenk@...hat.com>,
	Andy Lutomirski <luto@...capital.net>,
	Borislav Petkov <bp@...en8.de>,
	Brian Gerst <brgerst@...il.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	"Kleen, Andi" <andi.kleen@...el.com>
Subject: [PATCH, DEBUG] x86/32: Add small delay after resume


* Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com> wrote:

> >  Also, could you please describe how the failure triggers in your system: how 
> > many times do you have to suspend/resume to trigger the segfaults, and is 
> > there anything that makes the failures less or more likely?
>
> It is very random. Sometimes only a few hundred tries reproduce this issue. Other 
> times it requires thousands of tries (sometimes it is not reproducible at all for 
> days). It is very time sensitive.

So the very same kernel image will produce different crash patterns depending on 
the time of day? That suggests heat/hardware problems.

> [...] A small delay or some debug code in the resume path prevents the crash.

Fun...

> The BIOS folks created a special version to check whether they are corrupting DS, 
> but they were not able to catch any corruption. [...]

So is it true that we always execute wakeup_pmode_return first after we return 
from the BIOS?

If so then the BIOS touching DS cannot be an issue, as we re-initialize all 
segment selectors, which reloads the descriptors:

ENTRY(wakeup_pmode_return)
wakeup_pmode_return:
        movw    $__KERNEL_DS, %ax
        movw    %ax, %ss
        movw    %ax, %ds
        movw    %ax, %es
        movw    %ax, %fs
        movw    %ax, %gs

        # reload the gdt, as we need the full 32 bit address
        lidt    saved_idt
        lldt    saved_ldt
        ljmp    $(__KERNEL_CS), $1f
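
To illustrate why the reload matters, here is a minimal C model (a sketch, not 
kernel code) of what a selector load does: the CPU re-reads the descriptor from 
the table and overwrites the hidden, cached part of the segment register, so any 
stale state left behind earlier is discarded. The struct layout, the function 
name load_segreg() and the selector value used in the usage note are all 
hypothetical simplifications; real descriptors carry more fields (type, DPL, 
granularity, ...).

```c
#include <stdint.h>

/* Simplified descriptor: only base and limit are modelled (assumption). */
struct descriptor {
	uint32_t base;
	uint32_t limit;
};

/* A segment register: visible selector plus hidden descriptor cache. */
struct segreg {
	uint16_t selector;
	struct descriptor cached;
};

/*
 * Model of "movw %ax, %ds": store the selector and refresh the hidden
 * cache from the descriptor table. Bits 3..15 of a selector are the
 * table index; bits 0..2 (RPL and table indicator) are ignored here.
 */
static void load_segreg(struct segreg *sr, uint16_t selector,
			const struct descriptor *gdt)
{
	sr->selector = selector;
	sr->cached = gdt[selector >> 3];
}
```

For example, with a table whose entry 2 describes a flat segment (base 0, limit 
0xfffff), calling load_segreg(&ds, 0x10, gdt) on a ds whose cached base was 
garbage leaves ds.cached.base == 0 again, whatever it held before.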

> [...] Since these are specially deployed systems running a critical application, 
> we need to request the tests again with your changes. That may take a long time.

So my second patch is clearly broken as per Brian Gerst's comments.

What I would suggest is to try a patch that adds just 100 NOPs or so - attached 
below. This patch will add a small delay without any side effects (other than 
changing the kernel image layout).

If that makes the crash go away, then I'd say the likelihood that it's 
hardware-related increases substantially: maybe a PLL has not yet stabilized 
sufficiently after resume, or there's some latent heat sensitivity and the fan 
has not started up yet, etc.

( You can then use this simple delay generating patch in production systems as 
  well, to work around the problem. Maybe convince the BIOS folks to add a delay 
  like this to their resume path before they call Linux. )
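
As a rough sanity check on how long such a delay is, here is a back-of-the-envelope 
sketch in C. The helper names, the NOP throughput and the clock frequency are all 
assumptions for illustration, not measurements of any particular CPU; right after 
resume, with cold caches, the real figure may well differ.

```c
#include <stdint.h>

/*
 * Cycles needed to get through `nops` NOPs if the core handles
 * `nops_per_cycle` of them per cycle, rounded up.
 * `nops_per_cycle` must be non-zero.
 */
static uint64_t nop_delay_cycles(uint64_t nops, uint64_t nops_per_cycle)
{
	return (nops + nops_per_cycle - 1) / nops_per_cycle;
}

/* Convert a cycle count to nanoseconds at a core clock given in MHz. */
static double cycles_to_ns(uint64_t cycles, double mhz)
{
	return (double)cycles * 1000.0 / mhz;
}
```

Under these assumptions, nop_delay_cycles(100, 4) gives 25 cycles and 
nop_delay_cycles(100, 1) gives 100 cycles, so 100 NOPs land somewhere between a 
couple of dozen and a hundred cycles; at an assumed 1000 MHz clock, 
cycles_to_ns(100, 1000.0) is 100 ns.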

Thanks,

	Ingo

=================>

 arch/x86/kernel/acpi/wakeup_32.S | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kernel/acpi/wakeup_32.S b/arch/x86/kernel/acpi/wakeup_32.S
index 665c6b7d2ea9..ef26999da80a 100644
--- a/arch/x86/kernel/acpi/wakeup_32.S
+++ b/arch/x86/kernel/acpi/wakeup_32.S
@@ -10,6 +10,12 @@
 
 ENTRY(wakeup_pmode_return)
 wakeup_pmode_return:
+
+	/* Timing delay of a few dozen cycles: give the hardware some time to recover */
+	.rept 100
+	nop
+	.endr
+
 	movw	$__KERNEL_DS, %ax
 	movw	%ax, %ss
 	movw	%ax, %ds