Message-ID: <18064090-3418-4005-b35e-1afaeb2b4c95@oracle.com>
Date: Thu, 6 Nov 2025 12:42:41 -0800
From: Indu Bhagat <indu.bhagat@...cle.com>
To: Fangrui Song <maskray@...rceware.org>
Cc: linux-toolchains@...r.kernel.org, linux-perf-users@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: Concerns about SFrame viability for userspace stack walking
On 11/6/25 1:20 AM, Fangrui Song wrote:
> On Wed, Nov 5, 2025 at 4:45 PM Indu Bhagat <indu.bhagat@...cle.com> wrote:
>>
>> On 11/5/25 12:21 AM, Fangrui Song wrote:
>>>> On Tue, Nov 4, 2025 at 1:21 AM Indu <indu.bhagat@...cle.com> wrote:
>>>> On 2025-10-29 11:53 p.m., Fangrui Song wrote:
>>>>> I've been following the SFrame discussion and wanted to share some
>>>>> concerns about its viability for userspace adoption, based on concrete
>>>>> measurements and comparison with existing compact unwind implementations
>>>>> in LLVM.
>>>>>
>>>>> **Size overhead concerns**
>>>>>
>>>>> Measurements on an x86-64 clang binary show that .sframe (8.87 MiB) is
>>>>> approximately 10% larger than the combined size of .eh_frame
>>>>> and .eh_frame_hdr (8.06 MiB total).
>>>>> This is problematic because .eh_frame cannot be eliminated - it contains
>>>>> essential information for restoring callee-saved registers, LSDA, and
>>>>> personality information needed for debugging (e.g. reading local
>>>>> variables in a coredump) and C++ exception handling.
>>>>>
>>>>> This means adopting SFrame would result in carrying both formats, with a
>>>>> large net size increase.
>>>>>
>>>>> **Learning from existing compact unwind implementations**
>>>>>
>>>>> It's worth noting that LLVM has had a battle-tested compact unwind
>>>>> format in production use since 2009 with OS X 10.6, which transitioned
>>>>> to using CFI directives in 2013 [1]. The efficiency gains are dramatic:
>>>>>
>>>>> __text section: 0x4a55470 bytes
>>>>> __unwind_info section: 0x79060 bytes (0.6% of __text)
>>>>> __eh_frame section: 0x58 bytes
>>>>>
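A quick arithmetic check of the percentages quoted above, as a sketch that uses only the section sizes given in the message. The variable names are illustrative and not taken from any tool.

```python
# Mach-O clang binary section sizes, as quoted in the message.
text_size = 0x4a55470         # __text
unwind_info_size = 0x79060    # __unwind_info
eh_frame_size = 0x58          # __eh_frame (residual DWARF fallback)

unwind_ratio = unwind_info_size / text_size
print(f"__unwind_info / __text = {unwind_ratio:.2%}")

# ELF x86-64 clang binary figures, in MiB, as quoted in the message.
sframe_mib = 8.87
eh_total_mib = 8.06           # .eh_frame + .eh_frame_hdr combined

sframe_ratio = sframe_mib / eh_total_mib
print(f".sframe / (.eh_frame + .eh_frame_hdr) = {sframe_ratio:.2f}")
```

The two ratios have different denominators (code size in one case, existing unwind data in the other), so they are not directly comparable with each other.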
>>>>
>>>> I believe this is only synchronous? If yes, do you think this is a fair
>>>> measurement to compare against ?
>>>>
>>>> Does the compact unwind info scheme work well for cases of
>>>> shrink-wrapping ? How about the case of AArch64, where the ABI does not
>>>> mandate if and where frame record is created ?
>>>>
>>>> For the numbers above, does it ensure precise stack traces ?
>>>>
>>>> From the Apple Compact Unwinding Format document
>>>> (https://faultlore.com/blah/compact-unwinding/),
>>>> "One consequence of only having one opcode for a whole function is that
>>>> functions will generally have incorrect instructions for the function’s
>>>> prologue (where callee-saved registers are individually PUSHed onto the
>>>> stack before the rest of the stack space is allocated)."
>>>>
>>>> "Presumably this isn’t a very big deal, since there’s very few
>>>> situations where unwinding would involve a function still executing its
>>>> prologue/epilogue."
>>>>
>>>> Well, getting precise stack traces is a big deal and the users want them.
>>>
>>> **Shrink-wrapping and precise stack traces**: Yes, compact unwind
>>> handles these through an extension proposed by OpenVMS (not yet
>>> upstreamed to LLVM):
>>> https://lists.llvm.org/pipermail/llvm-dev/2018-January/120741.html
>>>
>>
>> Thanks for the link.
>>
>> The above questions were strictly in the context of the battle-tested
>> "The Apple Compact Unwinding Format" in production in the lld/MachO
>> implementation, not for the proposed OpenVMS extensions.
>>
>> Is it possible to get answers to those questions with that context in place?
>>
>> If shrink-wrapping and precise stack traces aren't supported without the
>> OpenVMS extension (that is not yet implemented), aren't we comparing
>> apples vs pears here ?
>
> You're right to ask for clarification.
> The extended compact unwind information works with shrink wrapping.
>
Sorry, again, not asking about the "extended".
If I may: So, this is a convoluted way of saying the current
implementation of the Apple Compact Unwind Info (lld/MachO, which was
used to get the data) does not support shrink wrapping. The
documentation of the format I am referring to is
https://faultlore.com/blah/compact-unwinding/
That said, the point I have been driving to:
The Apple Compact Unwind format
(https://faultlore.com/blah/compact-unwinding/) supports neither shrink
wrapping nor asynchronous stack walking. Comparing that
data to what SFrame gives is comparing apples to pears. Misleading.
(The reason I asked the question to begin with is because I wasn't sure
if the documentation is out of date).
> For context, a FDE in .eh_frame costs at least 20 bytes (often 30+),
> plus its associated .eh_frame_hdr entry costs 8 bytes.
> Even a larger compact unwind descriptor at 8 bytes yields significant
> savings compared to .eh_frame. Tripling that to 24 bytes is still a
> substantial win.
>
> Additionally, very few functions benefit from shrink wrapping
> optimization. When needed, we require multiple unwind description
> records (typically 3).
>
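The per-function arithmetic behind the quoted claim can be sketched as follows. The byte counts are the ones stated in the message; the function names are hypothetical, and the sketch deliberately ignores shared overheads (CIEs, lookup indexes) that the message does not quantify.

```python
FDE_MIN_BYTES = 20        # minimum FDE size in .eh_frame, per the message
HDR_ENTRY_BYTES = 8       # .eh_frame_hdr entry that accompanies each FDE
DESCRIPTOR_BYTES = 8      # the "larger" compact unwind descriptor
SHRINK_WRAP_RECORDS = 3   # typical record count for a shrink-wrapped function


def eh_frame_cost(fde_bytes=FDE_MIN_BYTES):
    """Bytes of .eh_frame plus .eh_frame_hdr spent on one function."""
    return fde_bytes + HDR_ENTRY_BYTES


def compact_cost(records=1):
    """Bytes of compact unwind descriptors spent on one function."""
    return records * DESCRIPTOR_BYTES


for label, records in [("regular", 1), ("shrink-wrapped", SHRINK_WRAP_RECORDS)]:
    print(f"{label}: {compact_cost(records)} bytes vs "
          f"{eh_frame_cost()} bytes (minimum FDE)")
```

This only restates the sizes claimed for the proposed descriptors; it says nothing about whether those descriptors carry enough information for asynchronous unwinding.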
>>> Technical details of the extension:
>>>
>>> - A single unwind group describes a (prologue_part1, prologue_part2,
>>> body, epilogue) tuple.
>>> - The prologue is conceptually split into two parts: the first part
>>> extends up to and including the instruction that decreases RSP; the
>>> second part extends to a point after the last preserved register is
>>> saved but before any preserved register is modified (this location is
>>> not unique, providing flexibility).
>>> + When unwinding in the prologue, the RSP register value can be
>>> inferred from the PC and the set of saved registers.
>>> - Since register restoration is idempotent (restoring preserved
>>> registers multiple times during unwinding causes no harm), there is no
>>> need to describe `pop $reg` sequences. The unwind group needs just one
>>> bit to describe whether the 1-byte `ret` instruction is present.
>>
>> Is this true for the case of asynchronous stack tracing too ?
>
> Yes. I believe it means the epilogue mirrors the prologue. Since we
> know which registers were saved in the prologue, we can infer the pop
> instructions in the epilogue and compute the SP offset when unwinding
> in the middle of an epilogue.
>
This is not asynchronous then.
This meddles with the core business of an optimizing compiler which may
want to organize epilogue/prologue differently.
>>> - The `length` field in the compact unwind group descriptor is
>>> repurposed to describe the prologue's two parts.
>>> - By composing multiple unwind groups, potentially with zero-sized
>>> prologues or omitting `ret` instructions in epilogues, it can describe
>>> functions with shrink wrapping or tail duplication optimization.
>>> - Null frame groups (with no prologue or epilogue) are the default and
>>> can describe trampolines and PLT stubs.
>>
>> PLT stubs may use stack (push to stack). As per the document "A null
>> frame (MODE = 8) is the simplest possible frame, with no allocated stack
>> of either kind (hence no saved registers)". So null frame can be used
>> for PLT only if the functions invoking the PLT stub were using an
>> RBP-based frame. Isn't it ?
>> BTW, both EH Frame and SFrame have a specific, condensed
>> representation for metadata for PLT entries.
>
> A profiler can trivially retrieve the return address using the default
> rule: if a code region is not covered by metadata, assume the return
> address is available at *rsp (x86-64) or in the link register (most
> other architectures).
>
> This ld-generated unwind info feature is largely obsolete nowadays due
> to the prevailing use of -Wl,-z,relro,-z,now (BIND_NOW). PLT entries
> behave as functions without a prologue, so a profiler can trivially
> retrieve the return address using the default unwinding rule.
>
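The "default rule" described above can be sketched like this. Everything here is a hypothetical illustration: the table layout, the `regs` dictionary, and the `read_u64` callback are assumptions made for the example, not the interface of any real profiler.

```python
import bisect


def find_unwind_entry(table, pc):
    """table: list of (start, end, rule) sorted by start; end is exclusive."""
    starts = [start for start, _, _ in table]
    i = bisect.bisect_right(starts, pc) - 1
    if i >= 0 and table[i][0] <= pc < table[i][1]:
        return table[i][2]
    return None


def return_address(pc, regs, read_u64, table, arch="x86-64"):
    rule = find_unwind_entry(table, pc)
    if rule is not None:
        # Region covered by metadata: apply its rule.
        return rule(regs, read_u64)
    # Default rule: region not covered by any metadata.
    if arch == "x86-64":
        return read_u64(regs["rsp"])   # return address at *rsp
    return regs["lr"]                  # link register on most other targets
```

As the reply above points out, this default only holds while nothing has been pushed since entry; a PLT stub that pushes to the stack moves the return address away from `*rsp`.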
>>>
>>
>> Anyway, thanks for the summary.
>>
>> I see that the OpenVMS extension for asynchronous compact unwind descriptors
>> is in an RFC state ATM. But a few important observations and questions:
>>
>> - As noted in the recently revived discussion,
>> https://discourse.llvm.org/t/rfc-improving-compact-x86-64-compact-unwind-descriptors/47471,
>> there is going to be a *non-negligible* size overhead as soon as you
>> move towards a specification for asynchronous (vs the current
>> specification that caters to synchronous only). Now add to it, the
>> quirks of each architecture/ABI :). Any comments ?
>
> As mentioned, even a larger compact unwind descriptor at 8 bytes
> yields significant savings compared to .eh_frame, and is also
> substantially smaller than SFrame.
>
>> - From the document: "Use of any preserved register must be delayed
>> until all of the preserved registers have been saved."
>> Q: Does this work well with optimizing compilers ? Is this an ABI
>> change being asked for multiple architectures ?
>
> I think this is about support for callee-saved registers, a feature
> SFrame doesn't have.
>
SFrame doesn't have it, because it doesn't need to carry this information
for stack tracing. The OpenVMS RFC effort, OTOH, is about subsuming
.eh_frame and being _the_ stack tracing/stack unwinding format. The latter
*has to* work this out.
> I need to think about the details, but this thread is probably not the
> best place to discuss them.
>
Absolutely, I agree, not the best place or time to pin down the details
of an RFC at all. But I cannot let an unfair argument just fly by.
The point I am driving to with these questions around the OpenVMS
asynchronous info RFC:
- 'OpenVMS extensions for asynchronous stack unwinding' is an RFC which
still needs work.
- It remains to be seen how this proposal manages the fine line of
space-efficiency while trying to be the go-to format for asynchronous
stack unwinding together with fast, precise and low-overhead stack tracing.
- SFrame is for stack tracing only. Subsuming .eh_frame is not in the
plans.
>> - From the document: "It appears technically feasible for a null frame
>> function to have a personality routine. However, the utility of such a
>> capability seems too meager to justify allowing this. We propose to not
>> support this." and "If the first attempt to lookup an unwind group for
>> an exception address fails, then it is (tentatively) assumed to have
>> occurred within a null frame function or in a part of a function
>> that is adequately described by a null frame. The presumed return
>> address is (virtually or actually) popped from the top of stack and
>> looked up. This second attempted lookup must succeed, in which case
>> processing continues normally. A failure is a fatal error."
>> Q: Is this a problem, especially because the goal is to evolve the
>> OpenVMS RFC proposal to subsume .eh_frame ?
>
> I think this just hard-encodes the default rule, similar to what
> SFrame does: "AMD64 ABI mandates that the RA be saved at a fixed
> offset from the CFA when entering a new function."
>
> While I haven't given this much thought yet, I don't think this
> introduces problems that SFrame doesn't have.
>
Correction: Not true. This is configurable in SFrame. s390x needs RA
tracking (not fixed offset) and is supported in SFrame.
>> Are there people actively working towards bringing this to fruition?
>>
>>> Now, to compare this against SFrame's space efficiency for synchronous
>>> unwinding, I've built llvm-mc, opt, and clang with
>>> -fno-asynchronous-unwind-tables -funwind-tables across multiple build
>>> configurations (clang vs gcc, frame pointer vs sframe).
>>> [snip]
>>>
>>> .sframe for sync is not noticeably smaller than that for async. This
>>> is probably because
>>> there are still many DW_CFA_advance_loc ops even in
>>> -fno-asynchronous-unwind-tables -funwind-tables builds.
>>>
>>
>> It's possible that it's because in the Apple Compact Unwind Format, the linker
>> optimizes compact unwind descriptors into the three-level paged
>> structure, effectively de-duplicating some content.
>
> Yes, the linker does perform deduplication and builds the paged index
> structure. However, the fundamental compactness comes from the
> encoding itself: each regular function is described with just 4 bytes
> in the common encoding, compared to .sframe's much larger per-FDE
> overhead.
> The two-level lookup table optimization amplifies this advantage.
>
>>>>> (On macOS you can check the section size with
>>>>> objdump --arch x86_64 -h clang and dump the unwind info with
>>>>> objdump --arch x86_64 --unwind-info clang)
>>>>>
>>>>> OpenVMS's x86-64 port, which is ELF-based, also adopted this format as
>>>>> documented in their "VSI OpenVMS Calling Standard" and their 2018 post:
>>>>> https://discourse.llvm.org/t/rfc-asynchronous-unwind-tables-attribute/59282
>>>>>
>>>>> The compact unwind format achieves this efficiency through a two-level
>>>>> page table structure. It describes common frame layouts compactly and
>>>>> falls back to DWARF only when necessary, allowing most DWARF CFI entries
>>>>> to be eliminated while maintaining full functionality. For more details,
>>>>> see: https://faultlore.com/blah/compact-unwinding/ and the lld/MachO
>>>>> implementation
>>>>> https://github.com/llvm/llvm-project/blob/main/lld/MachO/UnwindInfoSection.cpp
>>>>>
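The general idea of a two-level lookup can be sketched as below. This is a generic illustration under stated assumptions, not the actual `__unwind_info` binary layout: the page size, the tuple shapes, and the function names are all hypothetical.

```python
import bisect


def build_index(entries, page_size=4):
    """entries: (function_start, encoding) pairs in any order.

    Returns (first_level, pages): first_level holds the first function
    start of each page, and pages holds the sorted per-page entries.
    """
    entries = sorted(entries)
    pages = [entries[i:i + page_size]
             for i in range(0, len(entries), page_size)]
    first_level = [page[0][0] for page in pages]
    return first_level, pages


def lookup(index, pc):
    """Return the encoding of the last function starting at or before pc."""
    first_level, pages = index
    p = bisect.bisect_right(first_level, pc) - 1   # level 1: pick the page
    if p < 0:
        return None                                # pc precedes every entry
    page = pages[p]
    starts = [start for start, _ in page]
    i = bisect.bisect_right(starts, pc) - 1        # level 2: search the page
    return page[i][1]
```

Because `build_index` sorts its input, it accepts records that were simply concatenated in arbitrary order, which is the property the "who builds the index" question further down the thread turns on. The sketch does not track function end addresses, so a `pc` in a gap between functions resolves to the preceding entry.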
>>>>
>>>> How does your vision of "linker-friendly" stack tracing/stack unwinding
>>>> format reconcile with these suggested approaches ? As far as I can tell,
>>>> these formats also require linker created indexes and are
>>>> non-concatenable (custom handling in every linker). Something you've
>>>> had "significant concerns" about.
>>>>
>>
>> This question is unanswered: What do you think about
>> "linker-friendliness" of the current implementation of the lld/MachO
>> implementation of the compact unwind format in LLVM ?
>
> The linker input and output use different section names, so a dumb
> linker would work as long as the runtime accepts the concatenated
> sections.
>
> My vision for an ELF compact unwind format uses separate section names
> for link-time vs. runtime representations. The compiler output format
> should be concatenable, with linker index-building as an optional
> optimization that improves performance but isn't mandatory for
> correctness.
>
> I'm going to add more details to
> https://maskray.me/blog/2025-09-28-remarks-on-sframe
>
>
>>>
>>> We can distinguish between linking-time and execution-time
>>> representations by using different section names.
>>> The OpenVMS specification says:
>>>
>>> It is useful to note that the run-time representation of unwind
>>> information can vary from little more than a simple concatenation of
>>> the compile-time information to a substantial rewriting of unwind
>>> information by the linker. The proposal favors simple concatenation
>>> while maintaining the same ordering of groups as their associated
>>> code.
>>>
>>> The runtime library can build this index at runtime and cache it to disk.
>>>
>>
>> This will include the dynamic linker and the stack tracer in the Linux
>> kernel (the latter when stack tracing user space stacks). Do you think
>> this is feasible ?
>>
>>> Once the design becomes sufficiently stable, we can introduce an
>>> opt-in linker option --xxxxframe-index that builds an index from
>>> recognized format versions while reporting warnings for unrecognized
>>> ones. We need to carefully design this mechanism to be stable and robust,
>>> avoiding frequent format updates.
>>>
>>>> From
>>>> https://docs.vmssoftware.com/vsi-openvms-calling-standard/#STACK_UNWIND_EXCEPTION_X86_64:
>>>> "The unwind dispatch table (see Section B.3.1, ''Unwind Dispatch
>>>> Table'') is created by the linker using information in the unwind
>>>> descriptors (see Section B.3.2, ''DWARF Unwind Descriptors'' and Section
>>>> B.3.3, ''Compact Unwind Description'') provided by compilers. The linker
>>>> may use the provided unwind descriptors directly or replace them with
>>>> equivalent optimized forms based on its optimization strategies."
>>>>
>>>> Above all, do users want a solution which requires falling back on
>>>> DWARF-based processing for precise stack tracing ?
>>>
>>> The key distinction is that compact unwind handles the vast majority
>>> of functions without DWARF—the macOS measurements show __unwind_info
>>> at 0.6% of __text size with __eh_frame reduced to negligible size
>>> (0x58 bytes). While SFrame also cannot handle all frames, compact
>>> unwind achieves dramatic size reductions by making DWARF the exception
>>> rather than requiring it alongside a supplementary format.
>>>
>>
>> As we have tried to reason, this is a misleading comparison. The compact
>> unwind tables format:
>> - needs to be extended for asynchronous stack unwinding
>> - needs to be extended for other ABI/architectures
>> - Making it concatenable / linker-friendly will also likely impose
>> some negative effects on size.
>
> The format supports i386, x86-64, aarch32, and aarch64. The OpenVMS
> proposal demonstrates that supporting asynchronous unwinding is
> straightforward.
>
> Making it linker-friendly does not impose negative effects on the
> output section size.
>
OK, well, I agree to disagree :)
Looking forward to some movement on the OpenVMS asynchronous unwind RFC
to see resolution to some of the issues, and some data to back that claim.
>>> The DWARF fallback provides flexibility for additional coverage when
>>> needed, but nothing is lost (at least for the clang binary on macOS)
>>> if DWARF fallback were disabled in a hypothetical future linux-perf
>>> implementation.
>>>
>>
>> Fair enough, that's something for linux-perf/kernel to decide. Once the
>> OpenVMS RFC is sufficiently shaped to become a viable replacement for
>> .eh_frame, this question will be for the stakeholders to decide.
>
> Agreed. My concern is that .sframe is being deployed before we've
> fully explored whether a more compact and efficient alternative is
> achievable.
>
>
>>>>> **The AArch64 case: size matters even more**
>>>>>
>>>>> The size consideration becomes even more critical for AArch64, which is
>>>>> heavily deployed on mobile phones.
>>>>> There's an active feature request for compact unwind support in the
>>>>> AArch64 ABI: https://github.com/ARM-software/abi-aa/issues/344
>>>>> This underscores the broader industry need for efficient unwind
>>>>> information that doesn't duplicate data or significantly increase binary
>>>>> size.
>>>>>
>>>>
>>>> Our measurements with a dataset of about 1400 userspace artifacts
>>>> (binaries and shared libraries) show that the SFrame/(EH Frame + EH
>>>> Frame HDR) ratio is:
>>>> - Average of 0.70 on AArch64.
>>>> - Average of 1.00 on x86_64.
>>>>
>>>> Projecting the size of what you observe for clang binary on x86_64 to
>>>> conclude the size ratio on AArch64 is not very wise to do.
>>>>
>>>> Whether the size impact is worth the benefit: it's a choice for users to
>>>> make. SFrame offers the users fast, precise stack traces with simple
>>>> stack tracers.
>>>
>>> Thank you for providing the AArch64 measurements. Even with a 0.70x ratio on
>>> AArch64, this represents substantial memory overhead when considering:
>>>
>>> .eh_frame is already large and being complained about.
>>> Being unable to eliminate it (needed for debugging and C++ exceptions)
>>> and adding 0.70x more means significant additional overhead for users.
>>>
>>>>> There are at least two formats the ELF one can learn from: LLVM's
>>>>> compact unwind format (aarch64) and Windows ARM64 Frame Unwind Code.
>>>>>
>>>>
>>>> Please, if you have any concrete suggestions (keeping the above goals in
>>>> mind), you already know how/where to engage.
>>>
>>> I've provided concrete suggestions throughout this discussion.
>>>
>>
>> Apologies, I should have been more precise. And I ask because you know
>> the details about both SFrame and the variants of Compact Unwind
>> Descriptor formats at this point :). If you have concrete suggestions to
>> improve the SFrame format for size, please let us know.
>
> At this point, I'm not certain about specific modifications to .sframe
> itself. I think we should start from scratch, drawing ideas from
> compact unwind information and Windows ARM64.
>
> The existing compact unwind information uses the following 4-byte descriptor:
>
> uint32_t mode_specific_encoding : 24; // vary with different modes
>
> uint32_t mode : 4; // UNWIND_X86_64_MODE_MASK == UNWIND_ARM64_MODE_MASK
>
> uint32_t personality_index : 2;
> uint32_t has_lsda : 1;
> uint32_t is_not_function_start : 1;
>
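Decoding such a 4-byte descriptor can be sketched with plain masks. The mask values below are the constants from LLVM's compact_unwind_encoding.h as best recalled (personality index in bits 28-29, LSDA flag in bit 30); treat them as an assumption to check against the header. The `decode` function and the dictionary keys are hypothetical names used only for this example.

```python
# Assumed to match LLVM's compact_unwind_encoding.h; verify before relying on them.
UNWIND_IS_NOT_FUNCTION_START = 0x80000000
UNWIND_HAS_LSDA = 0x40000000
UNWIND_PERSONALITY_MASK = 0x30000000
UNWIND_MODE_MASK = 0x0F000000        # same mask for x86-64 and arm64
MODE_SPECIFIC_MASK = 0x00FFFFFF      # low 24 bits, meaning depends on mode


def decode(encoding):
    """Split a 32-bit compact unwind encoding into its fields."""
    return {
        "mode_specific_encoding": encoding & MODE_SPECIFIC_MASK,
        "mode": (encoding & UNWIND_MODE_MASK) >> 24,
        "personality_index": (encoding & UNWIND_PERSONALITY_MASK) >> 28,
        "has_lsda": bool(encoding & UNWIND_HAS_LSDA),
        "is_not_function_start": bool(encoding & UNWIND_IS_NOT_FUNCTION_START),
    }


print(decode(0x51000123))
```

Explicit masks are used rather than a bit-field struct because bit-field layout is implementation-defined in C, so a struct listing is only a schematic of the encoding.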
Thanks.
SFrame is not for stack unwinding. Subsuming .eh_frame is a topic for
another day. SFrame does not intend to go that route.
> We probably need a less-restricted version and account for different
> architecture needs. The result would still be significantly smaller
> than SFrame v2 and the future v3 (unless it's completely rewritten).
>
> We should probably design an optional two-level lookup table mechanism
> for additional savings (at the cost of linker friendliness).
>
>>>>> **Path forward**
>>>>>
>>>>> Unless SFrame can actually replace .eh_frame (rather than supplementing
>>>>> it as an accelerator for linux-perf) and demonstrate sizes smaller
>>>>> than .eh_frame - matching the efficiency of existing compact unwind
>>>>> approaches — I question its practical viability for userspace.
>>>>> The current design appears to add overhead rather than reduce it.
>>>>> This isn't to suggest we should simply adopt the existing compact unwind
>>>>> format wholesale.
>>>>> The x86-64 design dates back to 2009 or earlier, and there are likely
>>>>> improvements we can make. However, we should aim for similar or better
>>>>> efficiency gains.
>>>>>
>>>>> For additional context, I've documented my detailed analysis at:
>>>>>
>>>>> - https://maskray.me/blog/2025-09-28-remarks-on-sframe (covering
>>>>> mandatory index building problems, section group compliance and garbage
>>>>> collection issues, and version compatibility challenges)
>>>>
>>>> The GC issue is a bug currently being tracked, with a target milestone of 2.46.
>>>>> - https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-offs
>>>>> (size analysis)
>>>>>
>
> The GC issue would not have happened at all if we had used multiple
> sections and thought about ELF and linker convention :)
Thanks for engaging.