linux-kernel - issue with x86 FPU state after suspend to ram

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-Id: <1354301523-5252-1-git-send-email-vpalatin@chromium.org>
Date:	Fri, 30 Nov 2012 10:52:02 -0800
From:	Vincent Palatin <vpalatin@...omium.org>
To:	Ingo Molnar <mingo@...hat.com>, "H. Peter Anvin" <hpa@...or.com>,
	linux-kernel@...r.kernel.org,
	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Thomas Gleixner <tglx@...utronix.de>, x86@...nel.org,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Jarkko Sakkinen <jarkko.sakkinen@...el.com>,
	Duncan Laurie <dlaurie@...omium.org>,
	Olof Johansson <olofj@...omium.org>
Subject: issue with x86 FPU state after suspend to ram

Hi,

On a 4-core Ivybridge platform, when doing a lot of suspend-to-ram/resume
cycles, we were observing processes randomly killed by a SIGFPE.
When dumping the FPU registers state on the SIGFPE (usually a floating stack
underflow/overflow on a floating point arithmetic operation), the FPU registers
looks empty or at least corrupted which was more or less impossible with
respect to the disassembled floating point code.

After doing more tracing, in the faulty case, the process seems to be keeping
FPU ownership over a secondary CPU unplug/re-plug triggered by the suspend.
Then it's doing a lazy restore of its FPU context (ie just using the current
FPU hardware registers as he is the owner) instead of writing them back
to the hardware from the version previously saved in the task context,
despite the fact the whole FPU hardware state has been lost.

Just invalidating the "fpu_owner_task" when disabling a secondary CPU seems
to solve my issue (it's already reset for the primary CPU).

By the way, when FPU the lazy restore patch was discussed back in february,
Ingo commented (in http://permalink.gmane.org/gmane.linux.kernel/1255423) :
"
I guess the CPU hotplug case deserves a comment in the code: CPU
hotplug + replug of the same (but meanwhile reset) CPU is safe
because fpu_owner_task[cpu] gets reset to NULL.
"
That contradicts my previous observation, so maybe I have totally overlooked
something in this mechanism.
Can you comment ?

I'm still putting my patch proposal in this thread.
The issue seems to exist since 3.4 after the FPU lazy restore was actually
implemented by commit 7e16838d "i387: support lazy restore of FPU state".
But the issue is mainly visible on 3.4 and 3.6 since on tip of tree, it is
hidden by the eager fpu implementation for platforms with xsave support,
but it still happens with eagerfpu=off.

To apply this change to 3.4, "this_cpu_write" needs to be replaced by
percpu_write.

--
Vincent

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/