lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 23 Dec 2021 12:07:01 -0800
From:   "Luck, Tony" <tony.luck@...el.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Borislav Petkov <bp@...en8.de>, x86@...nel.org,
        linux-kernel@...r.kernel.org, patches@...ts.linux.dev,
        Youquan Song <youquan.song@...el.com>
Subject: [PATCH v2] x86/mce: Reduce number of machine checks taken during
 recovery


From: Youquan Song <youquan.song@...el.com>

When any of the copy functions in arch/x86/lib/copy_user_64.S take a
fault, the fixup code copies the remaining byte count from %ecx to %edx
and unconditionally jumps to .Lcopy_user_handle_tail to continue the
copy in case any more bytes can be copied.

If the fault was #PF this may copy more bytes (because the page fault
handler might have fixed the fault). But when the fault is a machine
check the original copy code will have copied all the way to the poisoned
cache line. So .Lcopy_user_handle_tail will just take another machine
check for no good reason.

Every code path to .Lcopy_user_handle_tail comes from an exception fixup
path, so add a check there to check the trap type (in %eax) and simply
return the count of remaining bytes if the trap was a machine check.

Doing this reduces the number of machine checks taken during synthetic
tests from four to three.

As well as reducing the number of machine checks, this also allows
Skylake generation Xeons to recover some cases that currently fail. The
is because "rep movsb" is only recoverable when source and destination
are well aligned and the byte count is large. That useless call to
.Lcopy_user_handle_tail may violate one or more of these conditions and
generate a fatal machine check.

[Tony: Added more details to commit message]

Signed-off-by: Youquan Song <youquan.song@...el.com>
Signed-off-by: Tony Luck <tony.luck@...el.com>
---
 arch/x86/lib/copy_user_64.S | 9 +++++++++
 1 file changed, 9 insertions(+)

V2: Update based on comments from PeterZ
	- Rebase to tip x86/core branch
	- Add comment documenting eax is now part of calling convention
	  for .Lcopy_user_handle_tail
	- Trim commit comment to make it easier to see that I did
	  audit all the call sequences for .Lcopy_user_handle_tail

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index e6ac38587b40..26781cbe7e37 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -212,6 +212,7 @@ EXPORT_SYMBOL(copy_user_enhanced_fast_string)
  * Don't try to copy the tail if machine check happened
  *
  * Input:
+ * eax x86 trap number - set by fixup_excpetion()
  * rdi destination
  * rsi source
  * rdx count
@@ -220,12 +221,20 @@ EXPORT_SYMBOL(copy_user_enhanced_fast_string)
  * eax uncopied bytes or 0 if successful.
  */
 SYM_CODE_START_LOCAL(.Lcopy_user_handle_tail)
+	cmp $X86_TRAP_MC,%eax
+	je 3f
+
 	movl %edx,%ecx
 1:	rep movsb
 2:	mov %ecx,%eax
 	ASM_CLAC
 	RET
 
+3:
+	movl %edx,%eax
+	ASM_CLAC
+	RET
+
 	_ASM_EXTABLE_CPY(1b, 2b)
 
 .Lcopy_user_handle_align:

base-commit: 82a8954acd93ae95d6252fb93a3d210c8f71b093
-- 
2.31.1

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ