[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b86e4d31-70db-42de-bf54-4ffb03b5cba0@FreeBSD.org>
Date: Thu, 14 Mar 2024 10:36:48 -0700
From: John Baldwin <jhb@...eBSD.org>
To: Dave Hansen <dave.hansen@...el.com>,
Vignesh Balasubramanian <vigbalas@....com>, linux-kernel@...r.kernel.org,
linux-toolchains@...r.kernel.org
Cc: mpe@...erman.id.au, npiggin@...il.com, christophe.leroy@...roup.eu,
aneesh.kumar@...nel.org, naveen.n.rao@...ux.ibm.com, ebiederm@...ssion.com,
keescook@...omium.org, x86@...nel.org, linuxppc-dev@...ts.ozlabs.org,
linux-mm@...ck.org, bpetkov@....com, jinisusan.george@....com, matz@...e.de,
binutils@...rceware.org, felix.willgerodt@...el.com
Subject: Re: [PATCH 1/1] x86/elf: Add a new .note section containing Xfeatures
information to x86 core files
On 3/14/24 10:10 AM, Dave Hansen wrote:
> On 3/14/24 09:45, John Baldwin wrote:
>> On 3/14/24 8:37 AM, Dave Hansen wrote:
>>> On 3/14/24 04:23, Vignesh Balasubramanian wrote:
>>>> Add a new .note section containing type, size, offset and flags of
>>>> every xfeature that is present.
>>>
>>> Mechanically, I'd much rather have all of that info in the cover letter
>>> in the actual changelog instead.
>>>
>>> I'd also love to see a practical example of what an actual example core
>>> dump looks like on two conflicting systems:
>>>
>>> * Total XSAVE size
>>> * XCR0 value
>>> * XSTATE_BV from the core dump
>>> * XFEATURE offsets for each feature
>>
>> I noticed this when I bought an AMD Ryzen 9 5900X based system for
>> my desktop running FreeBSD and found that the XSAVE core dump notes
>> were not recognized by GDB (FreeBSD dumps an XSAVE register set note
>> that matches the same layout of NT_X86_XSTATE used by Linux).
>
> I just want to make sure that you heard what I asked. I'd like to see a
> practical example of how the real-world enumeration changes between two
> real world systems.
>
> Is that possible?
>
> Here's the raw CPUID data from the XSAVE region on my laptop:
>
>> 0x0000000d 0x00: eax=0x000002e7 ebx=0x00000a88 ecx=0x00000a88 edx=0x00000000
>> 0x0000000d 0x01: eax=0x0000000f ebx=0x00000998 ecx=0x00003900 edx=0x00000000
>> 0x0000000d 0x02: eax=0x00000100 ebx=0x00000240 ecx=0x00000000 edx=0x00000000
>> 0x0000000d 0x05: eax=0x00000040 ebx=0x00000440 ecx=0x00000000 edx=0x00000000
>> 0x0000000d 0x06: eax=0x00000200 ebx=0x00000480 ecx=0x00000000 edx=0x00000000
>> 0x0000000d 0x07: eax=0x00000400 ebx=0x00000680 ecx=0x00000000 edx=0x00000000
>> 0x0000000d 0x08: eax=0x00000080 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
>> 0x0000000d 0x09: eax=0x00000008 ebx=0x00000a80 ecx=0x00000000 edx=0x00000000
>> 0x0000000d 0x0b: eax=0x00000010 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
>> 0x0000000d 0x0c: eax=0x00000018 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
>> 0x0000000d 0x0d: eax=0x00000008 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
>
> Could we get that for an impacted AMD system, please?
>
> cpuid -1 --raw | grep " 0x0000000d "
>
> should do it.
0x0000000d 0x00: eax=0x00000207 ebx=0x00000988 ecx=0x00000988 edx=0x00000000
0x0000000d 0x01: eax=0x0000000f ebx=0x00000348 ecx=0x00001800 edx=0x00000000
0x0000000d 0x02: eax=0x00000100 ebx=0x00000240 ecx=0x00000000 edx=0x00000000
0x0000000d 0x09: eax=0x00000008 ebx=0x00000980 ecx=0x00000000 edx=0x00000000
0x0000000d 0x0b: eax=0x00000010 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
0x0000000d 0x0c: eax=0x00000018 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
Here, I think the ebx value for the 0x09 leaf (PKRU) is the relevant difference
here, it is 0xa80 on your laptop and 0x980 on the AMD CPU. (This is the
missing MPX gap on AMD.)
>>> This is pretty close to just a raw dump of the XSAVE CPUID leaves.
>>> Rather than come up with an XSAVE-specific ABI that depends on CPUID
>>> *ANYWAY* (because it dumps the "flags" register aka. ECX), maybe we
>>> should just bite the bullet and dump out (some of) the raw CPUID space.
>>
>> So the current note I initially proposed and implemented for FreeBSD
>> (https://reviews.freebsd.org/D42136) and an initial patch set for GDB
>> (https://sourceware.org/pipermail/gdb-patches/2023-October/203083.html)
>> do indeed dump a raw set of CPUID leaves. The version I have for FreeBSD
>> only dumps the raw leaf values for leaf 0x0d though the note format is
>> extensible should additional leaves be needed in the future. One of the
>> questions if we wanted to use a CPUID leaf note is which leaves to dump
>> (e.g. do you dump all of them, or do you just dump the subset that is
>> currently needed).
>
> You dump what is needed and add to the dump over time.
That is what I started with, yes, but am attempting to anticipate future
problems in my list of caveats.
>> Another quirky question is what to do about systems with hetergeneous
>> cores (E vs P for example).
> That's irrelevant for now. The cores may be heterogeneous but the
> userspace ISA and (and thus XSAVE formats) are identical. If they're
> not, then we have bigger problems on our hands.
Yes, I agree on the bigger problems and hope we don't have to solve
them.
>> Currently those systems use the same XSAVE layout across all cores,
>> but other CPUID leaves do already vary across cores on those systems.
>
> There shouldn't be any CPUID leaves that differ _and_ matter to
> userspace and thus core dumps.
Today that is true, yes. I'm fine with making that tradeoff (along
with only dumping a subset of leaves) so long as the consensus is that
is an acceptable tradeoff to make.
>> However, there are other wrinkles with the leaf approach. Namely, one
>> of the use cases that I currently have an ugly hack for in GDB is if
>> you are using gdb against a remote host running gdbserver and then use
>> 'gcore' to generate a core dump. GDB needs to write out a NT_X86_XSTATE
>> note, but that note requires a layout. What GDB does today is just pick
>> a known Intel layout based on the XCR0 mask. However, GDB should ideally
>> start writing out whatever new note we adopt here, so if we dump raw
>> CPUID leaves it means extending the GDB remote protocol so we can query
>> the CPUID leaves from the remote host. On the other hand, if we choose a
>> more abstract format as proposed in this patch, the local GDB (or LLDB
>> or whatever) can generate whatever synthetic layout it wants to write
>> the local NT_X86_XSTATE. (NB: A relevant detail here is that the GDB
>> remote protocol does not pass the entire XSAVE state across as a block,
>> instead gdbserver parses individual register values for AVX, etc.
>> registers and those decoded register values are passed over the
>> protocol.)
>
> So the gdb side says, "Give me PKRU" and the remote side parses the
> XSAVE image, finds PKRU, and sends it over the wire?
Yes.
>> Another question is potentially supporting compact XSAVE format in
>> for NT_X86_XSTATE. Today Linux has some complicated code to re-expand
>> the compat XSAVE format back out to the standard layout for ptrace() and
>> process core dumps.
>
> Yeah, but supporting the compacted format in NT_X86_XSTATE doesn't help
> us at all. We still intermingle user and supervisor state and that
> needs to get repacked _anyway_.
Fair enough.
> In other words, no matter what we do, it's going to be complicated
> because the userspace buffer can't have supervisor state and the kernel
> buffer does have it. The compacted format mismatch is the least of our
> problems.
>
>> (FreeBSD doesn't yet make use of XSAVEC so we
>> haven't yet dealt with that problem.)
>
> ... or XSAVES, which is actually the most relevant here.
>
> Backing up... there are two approaches here:
>
> 1. Dump out raw x86-specific gunk, aka. CPUID contents itself. There
> are a billion ways to do this and lots of complications, including
> the remote protocol implications
> or
> 2. Define an abstract format that works anywhere, not just on x86 and
> not just for XSAVE.
>
> There's no (sane) middle ground. The implementation here (in this
> patch) is fundamentally x86-specific and pretends to be some kind of
> abstracted x86-independent format.
Well, are there other register notes that could benefit from an approach
like this? Most other register notes I'm aware of on various architectures
either have a fixed layout (like the typical general purpose register notes),
or they have a fixed set of registers but the size of individual registers
can vary (thinks like SME or RISC-V's vector extension). XSAVE is the only
one I'm aware of that packs multiple register sets into a single note.
To step back a bit, another approach that could be taken (and I'm not sure
it is worth it at this point) would to stop dumping a single XSAVE note
and dump a separate register note for each feature. That is, dump a note
for AVX (the upper bits of ymmX), a note for PKRU, etc. I think if I had
to pick a strategy at the very beginning that's what I would choose now,
but this isn't the very beginning and that sort of change is likely too
disruptive. (This approach is what happens on other arches today in effect,
e.g. on AArch64.)
--
John Baldwin
Powered by blists - more mailing lists