Message-ID: <CABV8kRyi-5wyiCV3HsPfFx6x1_icV72BSy+5eK8UC3UCexTSCA@mail.gmail.com>
Date: Tue, 7 Apr 2020 00:44:39 -0400
From: Keno Fischer <keno@...iacomputing.com>
To: Andy Lutomirski <luto@...capital.net>
Cc: Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, x86@...nel.org,
"H. Peter Anvin" <hpa@...or.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Andi Kleen <andi@...stfloor.org>,
Kyle Huey <khuey@...ehuey.com>,
"Robert O'Callahan" <robert@...llahan.org>
Subject: Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
On Mon, Apr 6, 2020 at 11:58 PM Andy Lutomirski <luto@...capital.net> wrote:
>
>
> > On Apr 6, 2020, at 6:13 PM, Keno Fischer <keno@...iacomputing.com> wrote:
> >
> > This is a follow-up to my from two-years ago [1].
>
> Your changelog is missing an explanation of why this is useful. Why would a user program want to change XCR0?
Ah, sorry - I wasn't sure what the convention was around repeating the
applicable parts from the v1 changelog in this email.
Here's the description from the v1 patch:
> The rr (http://rr-project.org/) debugger provides user space
> record-and-replay functionality by carefully controlling the process
> environment in order to ensure completely deterministic execution
> of recorded traces. The recently added ARCH_SET_CPUID arch_prctl
> allows rr to move traces across (Intel) machines, by allowing cpuid
> invocations to be reliably recorded and replayed. This works very
> well, with one catch: It is currently not possible to replay a
> recording from a machine supporting a smaller set of XCR0 state
> components on one supporting a larger set. This is because the
> value of XCR0 is observable in userspace (either by explicit
> xgetbv or by looking at the result of xsave) and since glibc
> does observe this value, replay divergence is almost immediate.
> I also suspect that people interested in process (or container)
> live-migration may eventually care about this if a migration happens
> in between a userspace xsave and a corresponding xrstor.
>
> We encounter this problem quite frequently since most of our users
> are using pre-Skylake systems (and thus don't support the AVX512
> state components), while we recently upgraded our main development
> machines to Skylake.
Basically, for rr to work, we need to tightly control any user-visible
CPU behavior, either by putting the CPU in the right state or by
trapping and emulating (as we do for rdtsc, cpuid, etc). XCR0 controls
a bunch of user-visible CPU behavior, namely:
1) The size of the xsave region if xsave is passed an all-ones mask
(which is fairly common)
2) The return value of xgetbv
3) Whether instructions making use of the relevant xstate component trap
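To make points 1) and 2) concrete, here is a rough sketch (not rr code)
of how the standard-format xsave area size follows from XCR0, and of the
subset check a replayer would need before it could present a recorded
XCR0 on a different host. The component offsets and sizes are an
assumption: they match what CPUID leaf 0xD reports on typical Intel CPUs,
but real code has to query CPUID instead of hardcoding them. The function
names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One state component in the standard (non-compacted) XSAVE layout. */
struct xstate_component {
    unsigned bit;     /* bit position in XCR0 */
    uint32_t offset;  /* byte offset within the XSAVE area */
    uint32_t size;    /* size in bytes */
};

/*
 * ASSUMPTION: illustrative layout as typically reported by CPUID leaf 0xD
 * on Intel CPUs; real code must query CPUID. x87 (bit 0) and SSE (bit 1)
 * live in the 512-byte legacy region and need no entry here.
 */
static const struct xstate_component components[] = {
    { 2,  576,  256 },  /* AVX: upper halves of YMM0-15 */
    { 3,  960,   64 },  /* MPX bound registers */
    { 4, 1024,   64 },  /* MPX bound config/status */
    { 5, 1088,   64 },  /* AVX-512 opmask registers k0-k7 */
    { 6, 1152,  512 },  /* AVX-512 upper halves of ZMM0-15 */
    { 7, 1664, 1024 },  /* AVX-512 ZMM16-31 */
};

/* 512-byte legacy region plus 64-byte XSAVE header. */
#define XSAVE_LEGACY_AND_HEADER 576u

/* Hypothetical helper: size xsave needs for everything enabled in xcr0. */
uint32_t xsave_area_size(uint64_t xcr0)
{
    uint32_t size = XSAVE_LEGACY_AND_HEADER;

    assert(xcr0 & 1);  /* the x87 bit of XCR0 is architecturally always set */
    for (size_t i = 0; i < sizeof(components) / sizeof(components[0]); i++) {
        if (xcr0 & (UINT64_C(1) << components[i].bit)) {
            uint32_t end = components[i].offset + components[i].size;
            if (end > size)
                size = end;
        }
    }
    return size;
}

/*
 * Hypothetical helper: a recorded XCR0 can only be presented on a host
 * whose hardware supports every component that was enabled at record time.
 */
int host_can_present_xcr0(uint64_t recorded_xcr0, uint64_t host_xcr0)
{
    return (recorded_xcr0 & ~host_xcr0) == 0;
}
```

With this layout, a trace recorded with XCR0 = 0x7 (x87, SSE, AVX) expects
a smaller xsave area than a host running with XCR0 = 0xe7 (AVX-512 enabled
as well) would produce, which is the kind of divergence described above.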
In the v1 review, it was suggested that user space could be adjusted to
deal with these issues by always checking support in cpuid first (which
we can already emulate). Unfortunately, we don't control the environment
on the record side: rr supports recording on any Intel CPU from the past
decade (with the exception of a few that have microarchitecture bugs
causing problems) and on kernel versions back to 3.11. Patching user
space is therefore a no-go for us. rr is also a debugging tool, so we
want to be able to help users debug their programs even when they get
uses of these instructions wrong.
Another suggestion in the v1 review was to use a VM instead with an appropriate
XCR0 value. That does mostly work, but has some problems:
1) The performance is quite a bit worse (particularly if we're already
replaying in a virtualized environment)
2) We may want to simultaneously replay tasks with different XCR0
values. This comes into play e.g. when recording a distributed system
where different nodes in the system are on hosts with different
hardware configurations (the reason you want to replay them jointly
rather than node-by-node is that this way you can avoid recording any
inter-node communication, since you can just recompute it from the
trace).
As a result, doing this with fully-featured VMs isn't an attractive
proposition. I had looked into doing something more light-weight using
the raw KVM API or something analogous to what Project Dune did
(http://dune.scs.stanford.edu/ - basically implementing Linux user
space, but where the threads run in guest CPL0 rather than host CPL3).
My conclusion was that this approach too would require significant
kernel modification to work well (as well as having the noted
performance problems in virtualized environments).
Sorry if this is too much of an info dump, but I hope this gives some color.
Keno