lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <20251120120838.1754634-11-sashal@kernel.org>
Date: Thu, 20 Nov 2025 07:08:20 -0500
From: Sasha Levin <sashal@...nel.org>
To: patches@...ts.linux.dev,
	stable@...r.kernel.org
Cc: Jiri Olsa <jolsa@...nel.org>,
	Song Liu <song@...nel.org>,
	Alexei Starovoitov <ast@...nel.org>,
	"Steven Rostedt (Google)" <rostedt@...dmis.org>,
	Sasha Levin <sashal@...nel.org>,
	peterz@...radead.org,
	mingo@...hat.com,
	acme@...nel.org,
	namhyung@...nel.org,
	linux-perf-users@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: [PATCH AUTOSEL 6.17-5.4] Revert "perf/x86: Always store regs->ip in perf_callchain_kernel()"

From: Jiri Olsa <jolsa@...nel.org>

[ Upstream commit 6d08340d1e354787d6c65a8c3cdd4d41ffb8a5ed ]

This reverts commit 83f44ae0f8afcc9da659799db8693f74847e66b3.

Currently we store initial stacktrace entry twice for non-HW ot_regs, which
means callers that fail perf_hw_regs(regs) condition in perf_callchain_kernel.

It's easy to reproduce this bpftrace:

  # bpftrace -e 'tracepoint:sched:sched_process_exec { print(kstack()); }'
  Attaching 1 probe...

        bprm_execve+1767
        bprm_execve+1767
        do_execveat_common.isra.0+425
        __x64_sys_execve+56
        do_syscall_64+133
        entry_SYSCALL_64_after_hwframe+118

When perf_callchain_kernel calls unwind_start with first_frame, AFAICS
we do not skip regs->ip, but it's added as part of the unwind process.
Hence reverting the extra perf_callchain_store for non-hw regs leg.

I was not able to bisect this, so I'm not really sure why this was needed
in v5.2 and why it's not working anymore, but I could see double entries
as far as v5.10.

I did the test for both ORC and framepointer unwind with and without the
this fix and except for the initial entry the stacktraces are the same.

Acked-by: Song Liu <song@...nel.org>
Signed-off-by: Jiri Olsa <jolsa@...nel.org>
Link: https://lore.kernel.org/r/20251104215405.168643-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@...nel.org>
Acked-by: Steven Rostedt (Google) <rostedt@...dmis.org>
Signed-off-by: Sasha Levin <sashal@...nel.org>
---

LLM Generated explanations, may be completely bogus:

Now let me compile my comprehensive analysis:

## COMPREHENSIVE ANALYSIS: Perf Stack Trace Duplicate Entry Bug

### 1. COMMIT MESSAGE ANALYSIS

**Subject**: Revert "perf/x86: Always store regs->ip in
perf_callchain_kernel()"

**Key Information**:
- This is a **revert** of commit
  83f44ae0f8afcc9da659799db8693f74847e66b3
- **Problem**: Stores initial stacktrace entry **twice** for non-HW
  pt_regs
- **Reproducer provided**: Easy reproduction with bpftrace showing
  duplicate `bprm_execve+1767` entry
- **Testing**: Tested with both ORC and framepointer unwinders
- **Maintainer support**: Acked-by Song Liu (original author), Steven
  Rostedt (tracing), signed by Alexei Starovoitov (BPF)
- **Note**: This is commit 46b2650126939 which is a backport of mainline
  fix 6d08340d1e354

### 2. DEEP CODE RESEARCH AND TECHNICAL ANALYSIS

**Understanding the Bug Mechanism**:

The bug occurs in `perf_callchain_kernel()` function. Let me trace
through the code flow:

**Current buggy code (lines 2790-2796)**:
```c
if (perf_callchain_store(entry, regs->ip))  // ALWAYS stores regs->ip
first
    return;

if (perf_hw_regs(regs))
    unwind_start(&state, current, regs, NULL);
else
    unwind_start(&state, current, NULL, (void *)regs->sp);  // Also
stores regs->ip during unwind!
```

**Fixed code**:
```c
if (perf_hw_regs(regs)) {
    if (perf_callchain_store(entry, regs->ip))  // Only store for HW
regs
        return;
    unwind_start(&state, current, regs, NULL);
} else {
    unwind_start(&state, current, NULL, (void *)regs->sp);  // Let
unwinder add regs->ip
}
```

**What is `perf_hw_regs()`?**
>From line 2774-2777, it checks if registers came from an IRQ/exception
handler:
```c
static bool perf_hw_regs(struct pt_regs *regs)
{
    return regs->flags & X86_EFLAGS_FIXED;
}
```

**The Two Paths**:

1. **HW regs path** (IRQ/exception context):
   - Registers captured by hardware interrupt
   - `perf_hw_regs(regs)` returns `true`
   - Need to explicitly store `regs->ip`
   - Call `unwind_start(&state, current, regs, NULL)` with full regs

2. **Non-HW regs path** (software context like tracepoints):
   - Registers captured by `perf_arch_fetch_caller_regs()` (line 614-619
     in perf_event.h)
   - `perf_hw_regs(regs)` returns `false`
   - Call `unwind_start(&state, current, NULL, (void *)regs->sp)` with
     only stack pointer
   - **The unwinder automatically includes regs->ip during the unwinding
     process**
   - **BUG**: The buggy code ALSO stores it explicitly, causing
     duplication

**Bug Introduction History**:

The buggy commit 83f44ae0f8afcc9da659799db8693f74847e66b3 was introduced
in June 2019 (v5.2-rc7) to fix a different issue where RIP wasn't being
saved for BPF selftests. However, that fix was overly broad - it
unconditionally stored regs->ip even for non-HW regs where the unwinder
would already include it.

**Affected Versions**: v5.2 through v6.17 (over 6 years of stable
kernels!)

### 3. SECURITY ASSESSMENT

**No security implications** - this is a correctness bug in stack trace
generation, not a security vulnerability. No CVE, no privilege
escalation, no information leak beyond what's already visible.

### 4. FEATURE VS BUG FIX CLASSIFICATION

**Definitively a BUG FIX**:
- Fixes incorrect stack trace output (duplicate entries)
- Does NOT add new functionality
- Restores correct behavior that was broken in v5.2

### 5. CODE CHANGE SCOPE ASSESSMENT

**Extremely small and surgical**:
- **1 file changed**: `arch/x86/events/core.c`
- **7 lines modified**: Moving the `perf_callchain_store()` call inside
  the `if` block
- **Single function affected**: `perf_callchain_kernel()`
- **No API changes, no new exports, no dependencies**

### 6. BUG TYPE AND SEVERITY

**Bug Type**: Logic error - incorrect conditional placement causing
duplicate stack trace entry

**Severity**: **MEDIUM to HIGH**
- **User-visible**: YES - anyone using perf/eBPF tools sees duplicate
  entries
- **Functional impact**: Corrupts observability data, misleading
  debugging information
- **Common scenario**: Affects ALL users collecting stack traces from
  tracepoints, kprobes, raw_tp
- **No system crashes**: But breaks critical debugging/profiling
  workflows

### 7. USER IMPACT EVALUATION

**Very High Impact**:

**Affected Users**:
- **eBPF/BPF users**: bpftrace, bcc tools (very common in production)
- **Performance engineers**: Using perf for profiling with stack traces
- **System administrators**: Debugging with tracepoints
- **Cloud providers**: Running observability tools
- **Container environments**: Kubernetes/Docker using eBPF monitoring

**Real-world scenario**:
```bash
# bpftrace -e 'tracepoint:sched:sched_process_exec { print(kstack()); }'
bprm_execve+1767
bprm_execve+1767    # <-- DUPLICATE! Confusing and wrong
do_execveat_common.isra.0+425
...
```

This makes stack traces misleading and harder to analyze. For production
observability, **correct stack traces are essential**.

### 8. REGRESSION RISK ANALYSIS

**Very Low Regression Risk**:

1. **Simple revert**: Reverting a previous change, well understood
2. **Tested thoroughly**:
   - Tested with ORC and framepointer unwinders
   - BPF selftests added (commit 5b98eca7fae8e)
   - Easy reproduction case provided
3. **Maintainer consensus**: Multiple Acked-by from subsystem
   maintainers
4. **In mainline since Nov 2025**: Has been in v6.18-rc6+ for testing
5. **Localized change**: Only affects one function, no ripple effects

### 9. MAINLINE STABILITY

**Strong mainline stability**:
- **Mainline commit**: 6d08340d1e354 (Nov 5, 2025)
- **First appeared**: v6.18-rc6
- **Testing**: Includes reproducible test case and BPF selftests
- **Reviews**: Acked by Song Liu (original author acknowledging the
  revert is correct)
- **Signed-off**: Alexei Starovoitov (BPF maintainer)

### 10. STABLE KERNEL RULES COMPLIANCE

**Checking against stable kernel criteria**:

1. ✅ **Obviously correct**: YES - simple logic fix, moves conditional
   correctly
2. ✅ **Fixes real bug**: YES - duplicate stack trace entries (easy to
   reproduce)
3. ✅ **Important issue**: YES - breaks observability for common
   workflows
4. ✅ **Small and contained**: YES - 7 lines in one function
5. ✅ **No new features**: CORRECT - only fixes existing functionality
6. ✅ **Applies cleanly**: YES - this IS a stable backport
   (46b2650126939)

**Backport Status**: This commit (46b2650126939) is already a backport
to stable of mainline commit 6d08340d1e354, committed by Sasha Levin on
Nov 17, 2025.

### ADDITIONAL CONTEXT

**Why was the original fix needed in 2019?**
The original commit 83f44ae0f8afcc9da659799db8693f74847e66b3 was fixing
a BPF selftest where RIP wasn't being saved. However, the fix was too
broad and caused this duplication issue.

**Why revert now?**
The author (Jiri Olsa) notes: "I was not able to bisect this, so I'm not
really sure why this was needed in v5.2 and why it's not working
anymore, but I could see double entries as far as v5.10." The unwinder
behavior has evolved, and the explicit store is no longer needed (and
causes duplication) for non-HW regs.

**Related commits**:
- There's a companion fix (e9e73b80b50ef) for fgraph/BPF stack unwinding
  also backported
- BPF selftests added to prevent regression

### CONCLUSION

This commit fixes a **real, user-visible bug** that affects a **very
common use case** (stack trace collection via perf/eBPF tools). The bug
has existed since v5.2 (2019) and affects ALL stable kernels from v5.2
through v6.17.

**Correctness**: The fix is small, surgical, well-tested, and has strong
maintainer support.

**Impact**: High - affects production observability workflows used by
system administrators, performance engineers, and cloud providers.

**Risk**: Very low - simple logic fix, thoroughly tested, in mainline
since Nov 2025.

**Stable Tree Alignment**: This perfectly fits stable kernel criteria -
it's a small, obviously correct fix for an important user-visible bug
that affects commonly-used functionality.

**YES**

 arch/x86/events/core.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 7610f26dfbd90..38f7102e2dacc 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2787,13 +2787,13 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
 		return;
 	}
 
-	if (perf_callchain_store(entry, regs->ip))
-		return;
-
-	if (perf_hw_regs(regs))
+	if (perf_hw_regs(regs)) {
+		if (perf_callchain_store(entry, regs->ip))
+			return;
 		unwind_start(&state, current, regs, NULL);
-	else
+	} else {
 		unwind_start(&state, current, NULL, (void *)regs->sp);
+	}
 
 	for (; !unwind_done(&state); unwind_next_frame(&state)) {
 		addr = unwind_get_return_address(&state);
-- 
2.51.0


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ