Message-ID: <87zf60avr1.ffs@tglx>
Date: Mon, 26 Jan 2026 22:50:58 +0100
From: Thomas Gleixner <tglx@...nel.org>
To: Peter Zijlstra <peterz@...radead.org>, David Matlack <dmatlack@...gle.com>
Cc: LKML <linux-kernel@...r.kernel.org>, Michael Jeanson
<mjeanson@...icios.com>, Jens Axboe <axboe@...nel.dk>, Mathieu Desnoyers
<mathieu.desnoyers@...icios.com>, "Paul E. McKenney" <paulmck@...nel.org>,
X86 ML <x86@...nel.org>, Sean Christopherson <seanjc@...gle.com>, Wei Liu
<wei.liu@...nel.org>
Subject: Re: SIGSEGVs after 39a167560a61 ("rseq: Optimize event setting")

On Mon, Jan 26 2026 at 21:47, Peter Zijlstra wrote:
> On Mon, Jan 26, 2026 at 11:46:27AM -0800, David Matlack wrote:
>> I started seeing SIGSEGVs in Google's remote test executor when
>> running on hosts at v6.19-rc6. Bisecting led me to this commit:
>>
>> 39a167560a61 ("rseq: Optimize event setting")
>>
>> I discovered this issue while running VFIO selftests against v6.19-rc6,
>> but realized the issue has nothing to do with the selftests themselves.
>> Even running "sleep" as the test is enough to trigger this issue in the
>> executor.
>>
>> I know that Google uses rseq in its userspace software stack, so I
>> assume this is some bad interaction between that implementation and
>> commit 39a167560a61.
>>
>> Unfortunately, the remote test executor that is receiving the SIGSEGV is
>> not open source so I don't have a repro I can share. But I can easily
>> reproduce the issue with my setup so I'd be happy to help with testing
>> any fixes or debug patches.
>>
>> I've attached the .config that I used when reproducing this issue. The
>> host I am using is an Intel server with EMR CPUs in case that matters.
>
> Is this using tcmalloc? If so, that is somewhat expected because
> tcmalloc is known to violate upstream rseq ABI. IIRC you should get a
> nice splat if you enable rseq debug mode (echo 1 > /debug/rseq/debug).

The correctness of these changes has been validated by the rseq
selftests and I don't see how that commit would violate the guaranteed
ABI.

> Perhaps this is the nudge Google needs to go fix this.

The real question is whether the segfault is triggered from the rseq
sanity checks or if the application segfaults because it relies on
something which is not guaranteed by the ABI. As this is secret sauce,
I can't tell.

Just for the record: I tried to build tcmalloc and get some tests done
with it, but the documentation is abysmal and I have no intention to
debug that bazel insanity.

Thanks,
tglx