Message-ID: <CAObL_7GCDsfXWRJgkNk7c44GNF0JhQPAH_P0WiYHK7QUX1Bcaw@mail.gmail.com>
Date:	Sun, 24 Jul 2011 17:07:08 -0400
From:	Andrew Lutomirski <luto@....edu>
To:	linux-kernel@...r.kernel.org, x86 <x86@...nel.org>
Subject: [RFC] syscall calling convention, stts/clts, and xstate latency

I was trying to understand the FPU/xstate saving code, and I ran some
benchmarks with surprising results.  These are all on a Sandy Bridge
i7-2600.  Please take all numbers with a grain of salt -- they're in
tight-ish loops and don't really take into account real-world cache
effects.

A clts/stts pair takes about 80 ns.  Accessing extended state from
userspace with TS set takes 239 ns.  A kernel_fpu_begin /
kernel_fpu_end pair with no userspace xstate access takes 80 ns
(presumably 79 of those 80 are the clts/stts).  (Note: The numbers in
this paragraph were measured using a hacked-up kernel and KVM.)

With nonzero ymm state, xsave + clflush (on the first cacheline of
xstate) + xrstor takes 128 ns.  With hot cache, xsave = 24 ns, xsaveopt
(with unchanged state) = 16 ns, and xrstor = 40 ns.

With nonzero xmm state but zero ymm state, xsave+xrstor drops to 38 ns
and xsaveopt saves another 5 ns.

Zeroing the state completely with vzeroall adds 2 ns.  I'm not sure
what's going on there.

All of this makes me think that, at least on Sandy Bridge, lazy xstate
saving is a bad optimization -- if the cache is being nice,
save/restore is faster than twiddling the TS bit.  And the cost of the
trap when TS is set blows everything else away.


Which brings me to another question: what do you think about declaring
some of the extended state to be clobbered by syscall?  Ideally, we'd
treat syscall like a regular function and clobber everything except
the floating point control word and mxcsr.  More conservatively, we'd
leave xmm and x87 state but clobber ymm.  This would let us keep the
cost of the state save and restore down when kernel_fpu_begin is used
in a syscall path and when a context switch happens as a result of a
syscall.

glibc does *not* mark the xmm registers as clobbered when it issues
syscalls, but I suspect that everything everywhere that issues
syscalls does it from a function, and functions are implicitly assumed
to clobber extended state.  (And if anything out there assumes that
ymm state is preserved, I'd be amazed.)


--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/