linux-kernel - Re: 4.15-rc6 PTI regression: L1 TLB mismatch MCE on Athlon64

Message-ID: <20180102212706.5gtevvg4rr7rfy5o@pd.tnic>
Date:   Tue, 2 Jan 2018 22:27:06 +0100
From:   Borislav Petkov <bp@...en8.de>
To:     Meelis Roos <mroos@...ux.ee>
Cc:     Linux Kernel list <linux-kernel@...r.kernel.org>, x86@...nel.org,
        linux-edac@...r.kernel.org, Tom Lendacky <thomas.lendacky@....com>
Subject: Re: 4.15-rc6 PTI regression: L1 TLB mismatch MCE on Athlon64

On Tue, Jan 02, 2018 at 10:49:16PM +0200, Meelis Roos wrote:
> This is on a socket 939 Athlon64 3500+, with PTI enabled.

LOL.

> [  316.384669] mce: [Hardware Error]: Machine check events logged
> [  316.384698] [Hardware Error]: Corrected error, no action required.
> [  316.384719] [Hardware Error]: CPU:0 (f:2f:2) MC1_STATUS[-|CE|-|-|AddrV]: 0x9400000000010011
> [  316.384742] [Hardware Error]: Error Addr: 0x0000ffff81e000e0

That's the [47:12] slice of the virtual address which it tried to execute.
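As a side note, the status word in the quoted log can be unpacked by hand. The sketch below assumes the architectural MCi_STATUS flag positions (VAL=63 ... PCC=57), the AMD TLB error-code layout 0000 0000 0001 TTLL, and an extended error code in bits 19:16 for this CPU family. The function and table names are made up for illustration; this is not the kernel's decoder.

```python
# Hypothetical helper, not kernel code. Bit positions are assumptions
# based on the architectural MCi_STATUS layout and AMD's TLB error format.
STATUS = 0x9400000000010011  # value from the quoted log

FLAG_BITS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
             "MISCV": 59, "ADDRV": 58, "PCC": 57}
TT = {0: "instruction", 1: "data", 2: "generic"}   # transaction type
LL = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}     # cache level


def decode_status(status):
    """Split a 64-bit status word into flags and error-code fields."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    error_code = status & 0xFFFF
    ext_code = (status >> 16) & 0xF  # assumed width: bits 19:16
    tlb = None
    # TLB errors have the form 0000 0000 0001 TTLL.
    if (error_code & 0xFFF0) == 0x0010:
        tlb = (TT.get((error_code >> 2) & 3, "reserved"), LL[error_code & 3])
    return {"flags": flags, "error_code": error_code,
            "ext_code": ext_code, "tlb": tlb}


if __name__ == "__main__":
    result = decode_status(STATUS)
    print(result["flags"])
    print(hex(result["error_code"]), result["ext_code"], result["tlb"])
```

With these assumptions, VAL, EN and ADDRV are set while UC, OVER, MISCV and PCC are clear, which lines up with the "[-|CE|-|-|AddrV]" summary in the log.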

According to our map in mm.txt:

ffff800000000000 - ffff87ffffffffff (=43 bits) guard hole, reserved for hypervisor

vs

ffff81e000e0...

which makes me think: WTF now?!

I don't see any hypervisor happening in dmesg...
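The range comparison above is plain arithmetic and can be checked quickly. The sketch below tries two readings of the logged value, both of which are assumptions rather than anything confirmed in this message: (a) the hex digits ffff81e000e0 are the upper bits of a 64-bit address, as in the side-by-side comparison with the mm.txt line, and (b) the value is the low 48 bits of a virtual address that needs sign extension to become canonical. The helper names are hypothetical.

```python
# Illustrative only; which reading matches the hardware is not settled here.
GUARD_HOLE = (0xFFFF800000000000, 0xFFFF87FFFFFFFFFF)  # range quoted from mm.txt
ERROR_ADDR = 0x0000FFFF81E000E0                        # value from the quoted log

MASK_48 = (1 << 48) - 1


def sign_extend_48(value):
    """Treat value as VA[47:0] and copy bit 47 into bits 63:48."""
    value &= MASK_48
    if value & (1 << 47):
        value |= 0xFFFF << 48
    return value


def in_guard_hole(addr):
    """Return True if addr lies inside the quoted guard-hole range."""
    low, high = GUARD_HOLE
    return low <= addr <= high


# Reading (a): the 12 hex digits are address bits 63:16.
as_upper_bits = (ERROR_ADDR & MASK_48) << 16
# Reading (b): the value is VA[47:0], sign-extended to 64 bits.
as_low_48 = sign_extend_48(ERROR_ADDR)

if __name__ == "__main__":
    for label, addr in (("upper bits", as_upper_bits), ("low 48 bits", as_low_48)):
        print(f"{label}: {addr:#018x} in guard hole: {in_guard_hole(addr)}")
```

Under reading (a) the address falls inside the guard hole; under reading (b) it does not, so the conclusion depends on how the logged value is interpreted.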

> [  316.384757] [Hardware Error]: MC1 Error: L1 TLB multimatch.
> [  316.384774] [Hardware Error]: cache level: L1, tx: INSN
> 
> These MCE-s do not happen on 4.14 and 4.15.0-rc4-00041-gace52288edf0. 
> They do happen on each boot into 4.15-rc6. Will try to bisect.

Please do. And try -rc5 too.

And then Linus' pti merges:

52c90f2d32bfa7d6eccd66a56c44ace1f78fbadd
5aa90a84589282b87666f92b6c3c917c8080a9bf
caf9a82657b313106aae8f4a35936c116a152299
64a48099b3b31568ac45716b7fafcb74a0c2fcfe

> I understand there exist patches that turn off PTI on AMD CPUs but the 
> MCE-s seem still interesting.

Yes, there is:

https://lkml.kernel.org/r/20171227054354.20369.94587.stgit@tlendack-t1.amdoffice.net

> 
> Same kernel with "nopti" boot command line option does not show the 
> MCE-s either.
> 
> When the MCE-s happen, they happen with 5 minute interval or slightly 
> more, like this (excerpt from grep mce: /var/log/kern.log, not full 
> dmesg). The first ones always happen at 316 and 627 seconds after 
> bootup.

That's the default 5-minute check interval for corrected errors. You can do

# echo 10 > /sys/devices/system/machinecheck/machinecheck0/check_interval

to decrease it.
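If you'd rather script that than type it, here is a minimal sketch that performs the same write as the echo above. The function name and the `root` parameter are made up for illustration (the parameter only exists so the function can be pointed at a scratch directory); writing to the real sysfs file needs root.

```python
from pathlib import Path

# Directory holding the per-CPU machinecheck entries, as in the echo command.
MCE_SYSFS = Path("/sys/devices/system/machinecheck")


def set_check_interval(seconds, cpu=0, root=MCE_SYSFS):
    """Write a polling interval in seconds to machinecheck<cpu>/check_interval.

    Hypothetical helper: it does nothing beyond the write that
    'echo N > .../check_interval' performs.
    """
    if seconds < 0:
        raise ValueError("interval must be non-negative")
    target = Path(root) / f"machinecheck{cpu}" / "check_interval"
    target.write_text(f"{seconds}\n")
    return target


if __name__ == "__main__":
    # Same effect as: echo 10 > .../machinecheck0/check_interval
    print(set_check_interval(10))
```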

-- 
Regards/Gruss,
    Boris.
