Message-ID: <CALzav=f=01o9b9pSJjAv0YLPSXpnh3jdNpt0=yACTd8FjBMCbQ@mail.gmail.com>
Date: Mon, 26 Jan 2026 14:27:21 -0800
From: David Matlack <dmatlack@...gle.com>
To: Thomas Gleixner <tglx@...nel.org>
Cc: Peter Zijlstra <peterz@...radead.org>, LKML <linux-kernel@...r.kernel.org>,
Michael Jeanson <mjeanson@...icios.com>, Jens Axboe <axboe@...nel.dk>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, "Paul E. McKenney" <paulmck@...nel.org>,
X86 ML <x86@...nel.org>, Sean Christopherson <seanjc@...gle.com>, Wei Liu <wei.liu@...nel.org>
Subject: Re: SIGSEGVs after 39a167560a61 ("rseq: Optimize event setting")
On Mon, Jan 26, 2026 at 1:51 PM Thomas Gleixner <tglx@...nel.org> wrote:
>
> On Mon, Jan 26 2026 at 21:47, Peter Zijlstra wrote:
> > On Mon, Jan 26, 2026 at 11:46:27AM -0800, David Matlack wrote:
> >> I started seeing SIGSEGVs in Google's remote test executor when
> >> running on hosts at v6.19-rc6. Bisecting led me to this commit:
> >>
> >> 39a167560a61 ("rseq: Optimize event setting")
> >>
> >> I discovered this issue while running VFIO selftests against v6.19-rc6,
> >> but realized the issue has nothing to do with the selftests themselves.
> >> Even running "sleep" as the test is enough to trigger this issue in the
> >> executor.
> >>
> >> I know that Google uses rseq in its userspace software stack, so I
> >> assume this is some bad interaction between that implementation and
> >> commit 39a167560a61.
> >>
> >> Unfortunately, the remote test executor that is receiving the SIGSEGV is
> >> not open source so I don't have a repro I can share. But I can easily
> >> reproduce the issue with my setup so I'd be happy to help with testing
> >> any fixes or debug patches.
> >>
> >> I've attached the .config that I used when reproducing this issue. The
> >> host I am using is an Intel server with EMR CPUs in case that matters.
> >
> > Is this using tcmalloc? If so, that is somewhat expected because
> > tcmalloc is known to violate upstream rseq ABI. IIRC you should get a
> > nice splat if you enable rseq debug mode (echo 1 > /debug/rseq/debug).
>
> The correctness of these changes has been validated by the rseq
> selftests and I don't see how that commit would violate the guaranteed
> ABI.
>
> > Perhaps this is the nudge Google needs to go fix this.
>
> The real question is whether the segfault is triggered from the rseq
> sanity checks or if the application segfaults because it relies on
> something which is not guaranteed by the ABI. As this is
> secret sauce, I can't tell.

I tried enabling /debug/rseq/debug but many of the daemons on my host
started crash-looping so much that I wasn't able to even run my test.

Next I tried disabling CONFIG_RSEQ and as expected the issue went
away. I will use that for now to unblock my VFIO testing.

I have reported the tcmalloc regression internally within Google to
figure out what next step they want to take.

>
> Just for the record: I tried to build tcmalloc and get some tests done
> with it, but the documentation is abysmal and I have no intention to
> debug that bazel insanity.
>
> Thanks,
>
> tglx