Date:	Tue, 2 Aug 2016 07:37:56 -0700
From:	Dave Hansen <dave@...1.net>
To:	Vlastimil Babka <vbabka@...e.cz>, linux-kernel@...r.kernel.org
Cc:	x86@...nel.org, linux-api@...r.kernel.org,
	linux-arch@...r.kernel.org, linux-mm@...ck.org,
	torvalds@...ux-foundation.org, akpm@...ux-foundation.org,
	luto@...nel.org, mgorman@...hsingularity.net,
	dave.hansen@...ux.intel.com, arnd@...db.de
Subject: Re: [PATCH 09/10] x86, pkeys: allow configuration of init_pkru

On 08/02/2016 01:28 AM, Vlastimil Babka wrote:
> On 07/29/2016 06:30 PM, Dave Hansen wrote:
>> From: Dave Hansen <dave.hansen@...ux.intel.com>
>> But, having PKRU be 0 (its init value) provides some nonzero
>> amount of optimization potential to the hardware.  It can, for
>> instance, skip writes to the XSAVE buffer when it knows that PKRU
>> is in its init state.
> 
> I'm not very happy with tuning options that need the admin to make a
> choice between reliability and performance. Is there no way to
> optimize similarly for a non-zero init state?

The init state is architecturally defined and the overhead comes from
hardware cost when the register is not in its 'init state'.  There's
nothing I can think of that we can do in software to work around this.

I did try a few things with our XSAVE/XRSTOR code to optimize this since
most tasks will have the same PKRU value, but they didn't pan out and
added more overhead than they removed.

>> The cost of losing this optimization is approximately 100 cycles
>> per context switch for a workload which is lightly using XSAVE
>> state (something not using AVX much).  The overhead comes from a
>> combination of actually manipulating PKRU and the overhead of
>> pulling in an extra cacheline.
> 
> So the cost is in extra steps in software, not in hardware as you
> mentioned above?

There are two sources of overhead: a RDPKRU/WRPKRU pair of instructions
at fpu__clear() time (mostly called via execve()) and overhead in the
XSAVE and XRSTOR instructions that occurs at context-switch time.

Taking the PKRU state out of the 'init state' makes us read at least one
additional cacheline during XRSTOR, plus some additional work inside the
instruction that the processor has to do to shuffle registers in/out of
memory.  This, I consider hardware overhead.

>> This overhead is not huge, but it's also not something that I
>> think we should unconditionally inflict on everyone.
> 
> Here, everyone really means all processes on the system that have
> never heard of PKEs and pay the cost just because the kernel was
> configured for it?

Yes, all processes on all systems that have memory protection keys
enabled in hardware.  In a normal workload that's context switching 1000
times a second, ~100 cycles per switch adds up to about 100,000 cycles
per second, or roughly 3 in every 100,000 cycles on a 3GHz processor.
I haven't been able to measure that other than by instrumenting the
XSAVE/XRSTOR paths themselves.
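
The estimate above is simple arithmetic and can be checked with a short
sketch. The inputs are the figures quoted in this thread (about 100
cycles per context switch, 1000 switches per second, a 3GHz clock); the
function name and parameter names are made up for illustration, and the
result is only as good as those assumed inputs.

```python
# Back-of-the-envelope check of the per-context-switch PKRU overhead.
# All defaults are the rough figures from this thread, not measurements.

def pkru_overhead_fraction(cycles_per_switch=100,
                           switches_per_sec=1000,
                           clock_hz=3_000_000_000):
    """Fraction of all CPU cycles spent on the extra XSAVE/XRSTOR work."""
    extra_cycles_per_sec = cycles_per_switch * switches_per_sec
    return extra_cycles_per_sec / clock_hz


if __name__ == "__main__":
    frac = pkru_overhead_fraction()
    # Roughly 3 in every 100,000 cycles, i.e. about 0.003%.
    print(f"fraction of cycles: {frac:.7f}")
    print(f"as a percentage:    {frac * 100:.4f}%")
```

Changing switches_per_sec shows how the cost scales for workloads that
context switch more heavily than the 1000/s assumed here.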

I also expect the relative overhead to decrease as more pervasive AVX
use increases the overall overhead of XSAVE. (AVX state is ~1k and PKU's
64b of space pales in comparison).

> But in that case, all PTEs use the key 0 anyway, so the non-zero default
> actually provides no extra reliability/security?

Correct.  It provides no additional security or reliability for
processes not using protection keys.

> Seems suboptimal that
> admins of such systems have to recognize such a situation themselves
> and change the default?

To be honest, I don't think anyone will notice.  Most folks will run a
kernel with PKU support on the new hardware that contains this feature
from day one and they'll never know about the 0.003% performance penalty
that I *think* this might cause.  Say that the processor with protection
keys is 5% faster than its predecessor (made up number), it will now
appear to be 4.996% faster.
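
A minimal sketch of that last comparison, assuming the penalty simply
scales throughput by (1 - overhead). That multiplicative model, the
function name, and the made-up 5% speedup are all illustrative
assumptions, not measurements.

```python
# How a tiny fixed overhead changes an apparent generation-over-generation
# speedup. The 5% speedup is the made-up number from the text above and the
# overhead is the ~100,000 cycles/s out of 3GHz estimate from this thread.

def apparent_speedup(raw_speedup, overhead_fraction):
    """Speedup seen once a fraction of cycles is lost to extra overhead."""
    return (1.0 + raw_speedup) * (1.0 - overhead_fraction) - 1.0


if __name__ == "__main__":
    overhead = 100 * 1000 / 3_000_000_000  # about 0.003% of all cycles
    seen = apparent_speedup(0.05, overhead)
    # Lands just under 5%, a difference far too small to notice in practice.
    print(f"apparent speedup: {seen * 100:.4f}%")
```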