Message-ID: <CAEf4BzYvrAAJv0+2FceJfizRURUsWxUBzRL-VOquM+VtAe-goA@mail.gmail.com>
Date: Wed, 29 Jan 2025 16:02:34 -0800
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Josh Poimboeuf <jpoimboe@...nel.org>
Cc: Indu Bhagat <indu.bhagat@...cle.com>, x86@...nel.org, 
	Peter Zijlstra <peterz@...radead.org>, Steven Rostedt <rostedt@...dmis.org>, 
	Ingo Molnar <mingo@...nel.org>, Arnaldo Carvalho de Melo <acme@...nel.org>, linux-kernel@...r.kernel.org, 
	Mark Rutland <mark.rutland@....com>, 
	Alexander Shishkin <alexander.shishkin@...ux.intel.com>, Jiri Olsa <jolsa@...nel.org>, 
	Namhyung Kim <namhyung@...nel.org>, Ian Rogers <irogers@...gle.com>, 
	Adrian Hunter <adrian.hunter@...el.com>, linux-perf-users@...r.kernel.org, 
	Mark Brown <broonie@...nel.org>, linux-toolchains@...r.kernel.org, 
	Jordan Rome <jordalgo@...a.com>, Sam James <sam@...too.org>, linux-trace-kernel@...r.kernel.org, 
	Jens Remus <jremus@...ux.ibm.com>, Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, 
	Florian Weimer <fweimer@...hat.com>, Andy Lutomirski <luto@...nel.org>, 
	Masami Hiramatsu <mhiramat@...nel.org>, Weinan Liu <wnliu@...gle.com>
Subject: Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading
 .sframe headers

On Tue, Jan 28, 2025 at 6:02 PM Josh Poimboeuf <jpoimboe@...nel.org> wrote:
>
> On Mon, Jan 27, 2025 at 05:10:27PM -0800, Andrii Nakryiko wrote:
> > > Yes, in theory, it is allowed (as per the specification) to have an
> > > SFrame section with zero number of FDEs/FREs.  But since such a section
> > > will not be useful, I share the opinion that it makes sense to disallow
> > > it in the current unwinding contexts, for now (JIT usecase may change
> > > things later).
> > >
> >
> > I disagree, actually. If it's a legal thing, it shouldn't be randomly
> > rejected. If we later make use of that, we'd have to worry not to
> > accidentally cause problems on older kernels that arbitrarily rejected
> > empty FDE just because it didn't make sense at some point (without
> > causing any issues).
>
> If such older kernels don't do anything with the section anyway, what's
> the point of pretending they do?
>
> Returning an error would actually make more sense as it communicates
> that the kernel doesn't support whatever hypothetical thing you're
> trying to do with 0 FDEs.
>
> > > SFRAME_F_FRAME_POINTER flag is not being set currently by GAS/GNU ld at all.
> > >
> > > >>> +               dbg("no fde/fre entries\n");
> > > >>> +               return -EINVAL;
> > > >>> +       }
> > > >>> +
> > > >>> +       header_end = sec->sframe_start + SFRAME_HEADER_SIZE(shdr);
> > > >>> +       if (header_end >= sec->sframe_end) {
> > > >>
> > > >> if we allow zero FDEs/FREs, header_end == sec->sframe_end is legal, right?
> > > >
> > > > I suppose so, but again I'm not seeing any reason to support that.
> >
> > Let's invert this. Is there any reason why it shouldn't be supported? ;)
>
> It's simple, we don't add code to "support" some vague hypothetical.
>
> For whatever definition of "support", since there's literally nothing
> the kernel can do with that.

If it's legal under the format specification, there is no reason for
the kernel to proactively reject it, IMO. But ok, whatever.

>
> > But what if we take a struct-of-arrays approach and represent it more like:
> >
> > struct FDE_and_FREs {
> >     struct sframe_func_desc_entry fde_metadata;
> >     u8|u16|u32 start_addrs[N]; /* can extend to u64 as well */
> >     u8 sfre_infos[N];
> >     u8 offsets8[M8];
> >     u16 offsets16[M16] __aligned(2);
> >     u32 offsets32[M32] __aligned(4);
> >     /* we can naturally extend to support also u64 offsets */
> > };
> >
> > i.e., we split all FRE records into their three constituents: start
> > addresses, info bytes, and then each FRE can fall into either 8-, 16-,
> > or 32-bit offsets "bucket". We collect all the offsets, depending on
> > their size, into these aligned offsets{8,16,32} arrays (with natural
> > extension to 64 bits, if necessary), with at most wasting 1-3 bytes to
> > ensure proper alignment everywhere.
>
> Makes sense.  Though I also have another idea below.
>
> > Note, at this point we need to decide if we want to make FREs binary
> > searchable or not.
> >
> > If not, we don't really need anything extra. As we process each
> > start_addrs[i] and sfre_infos[i] to find matching FRE, we keep track
> > of how many 8-, 16-, and 32-bit offsets already processed FREs
> > consumed, and when we find the right one, we know exactly the starting
> > index within offset{8,16,32}. Done.
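
The linear-scan bookkeeping described above can be sketched in a few lines of Python (a model only: the (start_addr, offset_width, n_offsets) tuples stand in for the decoded start_addrs[]/sfre_infos[] arrays, not the actual SFrame encodings):

```python
def find_fre_linear(pc, fres):
    """Find the FRE covering pc in the struct-of-arrays layout.

    fres: list of (start_addr, offset_width, n_offsets) tuples,
    sorted by start_addr (modeling start_addrs[] + sfre_infos[]).
    Returns (offset_width, first_offset_index, n_offsets), i.e. which
    offsets{8,16,32} bucket the matching FRE uses and where its
    offsets begin within that bucket.
    """
    consumed = {8: 0, 16: 0, 32: 0}  # offsets used up by earlier FREs
    match = None
    for start, width, n in fres:
        if start > pc:
            break  # sorted arrays: later FREs can't cover pc
        # record the bucket index *before* counting this FRE's offsets
        match = (width, consumed[width], n)
        consumed[width] += n
    return match
```

So the non-indexed variant needs no extra storage at all, at the cost of an O(n) walk over the FDE's FREs per unwind step.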
> >
> > But if we were to make FREs binary searchable, we need to basically
> > have an index of offset pointers to quickly find offsetsX[j] position
> > corresponding to FRE #i. For that, we can have an extra array right
> > next to start_addrs, "semantically parallel" to it:
> >
> > u8|u16|u32 start_addrs[N];
> > u8|u16|u32 offset_idxs[N];
>
> Binary search would definitely help.  I did a crude histogram for "FREs
> per FDE" for a few binaries on my test system:
>
> gdb (the biggest binary on my test system):
>
> 10th Percentile: 1
> 20th Percentile: 1
> 30th Percentile: 1
> 40th Percentile: 3
> 50th Percentile: 5
> 60th Percentile: 8
> 70th Percentile: 11
> 80th Percentile: 14
> 90th Percentile: 16
> 100th Percentile: 472
>
> bash:
>
> 10th Percentile: 1
> 20th Percentile: 1
> 30th Percentile: 3
> 40th Percentile: 5
> 50th Percentile: 7
> 60th Percentile: 9
> 70th Percentile: 12
> 80th Percentile: 16
> 90th Percentile: 17
> 100th Percentile: 46
>
> glibc:
>
> 10th Percentile: 1
> 20th Percentile: 1
> 30th Percentile: 1
> 40th Percentile: 1
> 50th Percentile: 4
> 60th Percentile: 6
> 70th Percentile: 9
> 80th Percentile: 14
> 90th Percentile: 16
> 100th Percentile: 44
>
> libpython:
>
> 10th Percentile: 1
> 20th Percentile: 3
> 30th Percentile: 4
> 40th Percentile: 6
> 50th Percentile: 8
> 60th Percentile: 11
> 70th Percentile: 12
> 80th Percentile: 16
> 90th Percentile: 20
> 100th Percentile: 112
>
> So binary search would help in a lot of cases.

Yep, agreed, seems like a worthwhile thing to support, given the
stats. (I suspect big production C++ applications might be even worse
in this regard.)

>
> However, if we're going that route, we might want to even consider a
> completely revamped data layout.  For example:
>
> One insight is that the vast majority of (cfa, fp, ra) tuples aren't
> unique.  They could be deduped by storing the unique tuples in a
> standalone 'fre_data' array which is referenced by another
> address-specific array.
>
>   struct fre_data {
>         s8|s16|s32 cfa, fp, ra;
>         u8 info;
>   };
>   struct fre_data fre_data[num_fre_data];
>
> The storage sizes of cfa/fp/ra can be a constant specified in the global
> sframe header.  It's fine all being the same size as it looks like this
> array wouldn't typically be more than a few hundred entries anyway.
>
> Then there would be an array of sorted FRE entries which reference the
> fre_data[] entries:
>
>   struct fre {
>         s32|s64 start_address;
>         u8|u16 fre_data_idx;

Even u16 seems dangerous; I'd use u32. Not sure it's worth limiting
the format to just 64K unique combinations.

>
>   } __packed;
>   struct fre fres[num_fres];
>
> (For alignment reasons that should probably be two separate arrays, even
> though not ideal for cache locality)
>
> Here again the field storage sizes would be specified in the global
> sframe header.
>
> Note FDEs aren't even needed here as the unwinder doesn't need to know
> when a function begins/ends.  The only info needed by the unwinder is
> just the fre_data struct.  So a simple binary search of fres[] is all
> that's really needed.
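
The lookup against this deduplicated layout is then just one binary search plus one table dereference. A minimal Python sketch (bisect stands in for the kernel's binary search; array names follow the structs above, but this is a model, not the proposed on-disk encoding):

```python
from bisect import bisect_right

def unwind_lookup(pc, fre_starts, fre_data_idx, fre_data):
    """Binary-search the sorted FRE start addresses for the last
    entry <= pc, then dereference the shared fre_data[] table.

    fre_starts:   sorted start addresses (fres[].start_address)
    fre_data_idx: parallel indices (fres[].fre_data_idx)
    fre_data:     deduplicated (cfa, fp, ra, info) records
    """
    i = bisect_right(fre_starts, pc) - 1
    if i < 0:
        return None  # pc precedes the first FRE
    return fre_data[fre_data_idx[i]]
```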
>
> But wait, there's more...
>
> The binary search could be made significantly faster using a small fast
> lookup array which is indexed evenly across the text address offset
> space, similar to what ORC does:
>
>   u32 fre_chunks[num_chunks];
>
> The text address space (starting at offset 0) can be split into
> 'num_chunks' chunks of size 'chunk_size'.  The value of
> fre_chunks[offset/chunk_size] is an index into the fres[] array.
>
> Taking my gdb binary as an example:
>
> .text is 6417058 bytes, with 146997 total sframe FREs.  Assuming a chunk
> size of 1024, fre_chunks[] needs 6417058/1024 = 6267 entries.
>
> For unwinding at text offset 0x400000, the index into fre_chunks[] would
> be 0x400000/1024 = 4096.  If fre_chunks[4096] == 96074 and
> fre_chunks[4096+1] == 96098, you need only do a binary search of the 24
> entries between fres[96074] && fres[96098] rather than searching the
> entire 146997-entry array.
>
> .sframe size calculation:
>
>     374 unique fre_data entries (out of 146997 total FREs!)
>     = 374 * (2 * 3) = 2244 bytes
>
>     146997 fre entries
>     = 146997 * (4 + 2) = 881982 bytes
>
>     .text size 6417058 (chunk_size = 1024, num_chunks=6267)
>     = 6267 * 4 = 25068 bytes (u32 entries)
>
> Approximate total .sframe size would be 2244 + 881982 + 25068 = ~909k,
> plus negligible header size.  Which is smaller than the v2 .sframe on my
> gdb binary (985k).
>
> With the chunked lookup table, the avg lookup is:
>
>   log2(146997/6267) = ~4.5 iterations
>
> whereas a full binary search would be:
>
>   log2(146997) = 17 iterations
>
> So assuming I got all that math right, it's over 3x faster and the
> binary is smaller (or at least should be roughly comparable).

I'm not sure about this chunked lookup approach for arbitrary user
space applications. Those executable sections can be a) big and b)
discontiguous. E.g., here are the three main executable sections of
one of the production binaries I looked at:

...
  [17] .bolt.org.text    PROGBITS         000000000b00e640  0ae0d640
       0000000011ad621c  0000000000000000  AX       0     0     64
...
  [48] .text             PROGBITS         000000001e600000  1ce00000
       0000000000775dd8  0000000000000000  AX       0     0     2097152
  [49] .text.cold        PROGBITS         000000001ed75e00  1d575e00
       00000000007d3271  0000000000000000  AX       0     0     64
...

Total text size is about 300MB:
>>> 0x0000000000775dd8 + 0x00000000007d3271 + 0x0000000011ad621c
312603237

Section #17 ends at:

>>> hex(0x0000000011ad621c + 0x000000000b00e640)
'0x1cae485c'

.text, however, starts at 0x1e600000, so we have a gap of ~28MB:

>>> 0x000000001e600000 - 0x1cae485c
28424100

So unless we do something more clever to support multiple
discontiguous chunks, this seems like a bad fit for user space.

I think having all this just binary searchable is already a big win
anyway, and should be plenty fast, no?

>
> Of course the downside is it's an all new format.  Presumably the linker
> would need to do more work than it's currently doing, e.g., find all the
> duplicates and set up the data accordingly.

I do like the idea of deduplicating those records and just indexing
them, as in practice this should probably be much more compact. So it
might be worth looking into this some more. I'll defer to Indu.

>
> --
> Josh
