linux-kernel - Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAEf4BzaU+citw78376tnN11R0ypyN+nZKTjYN7P1y67Rq+_D+g@mail.gmail.com>
Date: Wed, 5 Feb 2025 15:16:11 -0800
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Indu Bhagat <indu.bhagat@...cle.com>
Cc: Josh Poimboeuf <jpoimboe@...nel.org>, x86@...nel.org, 
	Peter Zijlstra <peterz@...radead.org>, Steven Rostedt <rostedt@...dmis.org>, 
	Ingo Molnar <mingo@...nel.org>, Arnaldo Carvalho de Melo <acme@...nel.org>, linux-kernel@...r.kernel.org, 
	Mark Rutland <mark.rutland@....com>, 
	Alexander Shishkin <alexander.shishkin@...ux.intel.com>, Jiri Olsa <jolsa@...nel.org>, 
	Namhyung Kim <namhyung@...nel.org>, Ian Rogers <irogers@...gle.com>, 
	Adrian Hunter <adrian.hunter@...el.com>, linux-perf-users@...r.kernel.org, 
	Mark Brown <broonie@...nel.org>, linux-toolchains@...r.kernel.org, 
	Jordan Rome <jordalgo@...a.com>, Sam James <sam@...too.org>, linux-trace-kernel@...r.kernel.org, 
	Jens Remus <jremus@...ux.ibm.com>, Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, 
	Florian Weimer <fweimer@...hat.com>, Andy Lutomirski <luto@...nel.org>, 
	Masami Hiramatsu <mhiramat@...nel.org>, Weinan Liu <wnliu@...gle.com>
Subject: Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading
 .sframe headers

On Thu, Jan 30, 2025 at 1:21 PM Indu Bhagat <indu.bhagat@...cle.com> wrote:
>
> On 1/27/25 5:10 PM, Andrii Nakryiko wrote:
> >>>>> +struct sframe_preamble {
> >>>>> +       u16     magic;
> >>>>> +       u8      version;
> >>>>> +       u8      flags;
> >>>>> +} __packed;
> >>>>> +
> >>>>> +struct sframe_header {
> >>>>> +       struct sframe_preamble preamble;
> >>>>> +       u8      abi_arch;
> >>>>> +       s8      cfa_fixed_fp_offset;
> >>>>> +       s8      cfa_fixed_ra_offset;
> >>>>> +       u8      auxhdr_len;
> >>>>> +       u32     num_fdes;
> >>>>> +       u32     num_fres;
> >>>>> +       u32     fre_len;
> >>>>> +       u32     fdes_off;
> >>>>> +       u32     fres_off;
> >>>>> +} __packed;
> >>>>> +
> >>>>> +struct sframe_fde {
> >>>>> +       s32     start_addr;
> >>>>> +       u32     func_size;
> >>>>> +       u32     fres_off;
> >>>>> +       u32     fres_num;
> >>>>> +       u8      info;
> >>>>> +       u8      rep_size;
> >>>>> +       u16 padding;
> >>>>> +} __packed;
> >>>> I couldn't understand from SFrame itself, but why do sframe_header,
> >>>> sframe_preamble, and sframe_fde have to be marked __packed, if it's
> >>>> all naturally aligned (intentionally and by design)?..
> >>> Right, but the spec says they're all packed.  Maybe the point is that
> >>> some future sframe version is free to introduce unaligned fields.
> >>>
> >> SFrame specification aims to keep SFrame header and SFrame FDE members
> >> at aligned boundaries in future versions.
> >>
> >> Only SFrame FRE related accesses may have unaligned accesses.
> > Yeah, and it's actually bothering me quite a lot 🙂 I have a tentative
> > proposal, maybe we can discuss this for SFrame v3? Let me briefly
> > outline the idea.
> >
>
> I looked at the idea below.  It could work wrt unaligned accesses.
>
> Speaking of unaligned accesses, I will ask away: Is the reason to avoid
> unaligned accesses performance hit or are there other practical reasons
> to it ?

Performance hit on architectures like x86-64 that do support
unaligned, but it's actually a CPU error for some other architectures,
so you'd need to code with that in mind, making local aligned copies,
etc. In general, I'd say it's a bit of a red flag that a format that
is meant to be memory-mapped (effectively) and used without
pre-processing requires dealing with unaligned accesses. So if we can
fix that, that would be a win.

>
> > So, currently in v2, FREs within FDEs use an array-of-structs layout.
> > If we use preudo-C type definitions, it would be something like this
> > for FDE + its FREs:
> >
> > struct FDE_and_FREs {
> >      struct sframe_func_desc_entry fde_metadata;
> >
> >      union FRE {
> >          struct FRE8 {
> >              u8 sfre_start_address;
> >              u8 sfre_info;
> >              u8|u16|u32 offsets[M];
> >          }
> >          struct FRE16 {
> >              u16 sfre_start_address;
> >              u16 sfre_info;
> >              u8|u16|u32 offsets[M];
> >          }
> >          struct FRE32 {
> >              u32 sfre_start_address;
> >              u32 sfre_info;
> >              u8|u16|u32 offsets[M];
> >          }
> >      } fres[N] __packed;
> > };
> >
> > where all fres[i]s are one of those FRE8/FRE16/FRE32, so start
> > addresses have the same size, but each FRE has potentially different
> > offsets sizing, so there is no common alignment, and so everything has
> > to be packed and unaligned.
> >
> > But what if we take a struct-of-arrays approach and represent it more like:
> >
> > struct FDE_and_FREs {
> >      struct sframe_func_desc_entry fde_metadata;
> >      u8|u16|u32 start_addrs[N]; /* can extend to u64 as well */
> >      u8 sfre_infos[N];
> >      u8 offsets8[M8];
> >      u16 offsets16[M16] __aligned(2);
> >      u32 offsets32[M32] __aligned(4);
> >      /* we can naturally extend to support also u64 offsets */
> > };
> >
> > i.e., we split all FRE records into their three constituents: start
> > addresses, info bytes, and then each FRE can fall into either 8-, 16-,
> > or 32-bit offsets "bucket". We collect all the offsets, depending on
> > their size, into these aligned offsets{8,16,32} arrays (with natural
> > extension to 64 bits, if necessary), with at most wasting 1-3 bytes to
> > ensure proper alignment everywhere.
> >
> > Note, at this point we need to decide if we want to make FREs binary
> > searchable or not.
> >
> > If not, we don't really need anything extra. As we process each
> > start_addrs[i] and sfre_infos[i] to find matching FRE, we keep track
> > of how many 8-, 16-, and 32-bit offsets already processed FREs
> > consumed, and when we find the right one, we know exactly the starting
> > index within offset{8,16,32}. Done.
> >
> > But if we were to make FREs binary searchable, we need to basically
> > have an index of offset pointers to quickly find offsetsX[j] position
> > corresponding to FRE #i. For that, we can have an extra array right
> > next to start_addrs, "semantically parallel" to it:
> >
> > u8|u16|u32 start_addrs[N];
> > u8|u16|u32 offset_idxs[N];
> >
> > where start_addrs[i] corresponds to offset_idxs[i], and offset_idxs[i]
> > points to the first offset corresponding to FRE #i in offsetX[] array
> > (depending on FRE's "bitness"). This is a bit more storage for this
> > offset index, but for FDEs with lots of FREs this might be a
> > worthwhile tradeoff.
> >
> > Few points:
> >    a) we can decide this "binary searchability" per-FDE, and for FDEs
> > with 1-2-3 FREs not bother, while those with more FREs would be
> > searchable ones with index. So we can combine both fast lookups,
> > natural alignment of on-disk format, and compactness. The presence of
> > index is just another bit in FDE metadata.
>
> I have been going back and forth on this one. So there seem to be the
> following options here:
>    #1. Make "binary searchability" a per-FDE decision.
>    #2. Make "binary searchability" a per-section decision (I expect
> aarch64 to have very low number of FREs per FDE).
>    #3. Bake "binary searchability" into the SFrame FRE specification.
> So its always ON for all FDEs.  The advantage is that it makes stack
> tracers simpler to implement with less code.
>
> I do think #2, #3 appear simpler in concept.

Whichever makes it easier across the entire stack (compiler, linker,
kernel/unwinder). As long as binary searchability is possible,
especially for FDEs with lots of FREs. Making it per-FDE just allows
to pick most compact (but still with good performance) representation.

>
> >    b) bitness of offset_idxs[] can be coupled with bitness of
> > start_addrs (for simplicity), or could be completely independent and
> > identified by FDE's metadata (2 more bits to define this just like
> > start_addr bitness is defined). Independent probably would be my
> > preference, with linker (or whoever will be producing .sframe data)
> > can pick the smallest bitness that is sufficient to represent
> > everything.
> >
>
> ATM, GAS does apply special logic to decide the bitness of start_addrs
> per function, and ld just uses that info.  Coupling the bitness of
> offset_idx with bitness of start_addrs will be easy (or _easier_ I
> think), but for now, I leave it as "should be doable" :)

Those offsets are relative to fde's start_addr, right? So, generally
speaking, should be usually small? I my understanding is correct, then
yeah, coupling is probably ok.

>
> > Yes, it's a bit more complicated to draw and explain, but everything
> > will be nicely aligned, extensible to 64 bits, and (optionally at
> > least) binary searchable. Implementation-wise on the kernel side it
> > shouldn't be significantly more involved. Maybe the compiler would
> > need to be a bit smarter when producing FDE data, but it's no rocket
> > science.
> >
> > Thoughts?
>
> Combining the requirements from your email and Josh's follow up:
>    - No unaligned accesses
>    - Sorted FREs
>
> I would put compaction as a "good to have" requirement.  It appears to
> me that any compaction will mean a sort of post-processing which will
> interfere with JIT usecase.
>

sgtm