Message-ID: <5f3843ba-31d4-4acf-b3b9-0d2a131dca23@efficios.com>
Date: Wed, 28 Jan 2026 06:28:33 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Dmitry Vyukov <dvyukov@...gle.com>
Cc: David Matlack <dmatlack@...gle.com>, Thomas Gleixner <tglx@...nel.org>,
Marco Elver <elver@...gle.com>, Peter Zijlstra <peterz@...radead.org>,
LKML <linux-kernel@...r.kernel.org>, Michael Jeanson
<mjeanson@...icios.com>, Jens Axboe <axboe@...nel.dk>,
"Paul E. McKenney" <paulmck@...nel.org>, X86 ML <x86@...nel.org>,
Sean Christopherson <seanjc@...gle.com>, Wei Liu <wei.liu@...nel.org>
Subject: Re: SIGSEGVs after 39a167560a61 ("rseq: Optimize event setting")

On 2026-01-28 03:54, Dmitry Vyukov wrote:
> On Tue, 27 Jan 2026 at 21:34, Mathieu Desnoyers
> <mathieu.desnoyers@...icios.com> wrote:
>>
>> +CC Dmitry and Marco.
>>
>> On 2026-01-26 17:35, Mathieu Desnoyers wrote:
>>> On 2026-01-26 17:27, David Matlack wrote:
>>>> On Mon, Jan 26, 2026 at 1:51 PM Thomas Gleixner <tglx@...nel.org> wrote:
>>> [...]
>>>>>> Perhaps this is the nudge Google needs to go fix this.
>>>>>
>>>>> The real question is whether the segfault is triggered from the rseq
>>>>> sanity checks or if the application segfaults because it relies on
>>>>> something which is not guaranteed by the ABI. As this is
>>>>> secret sauce, I can't tell.
>>>>
>>>> I tried enabling /debug/rseq/debug but many of the daemons on my host
>>>> started crash-looping so much that I wasn't able to even run my test.
>>>>
>>>> Next I tried disabling CONFIG_RSEQ and as expected the issue went
>>>> away. I will use that for now to unblock my VFIO testing.
>>>>
>>>> I have reported the tcmalloc regression internally within Google to
>>>> figure out what next step they want to take.
>>>
>>> Note that I've proposed to help out the tcmalloc people a few
>>> times in the past years to fix this, but I've been told that
>>> it was not a priority on their end, and that they would not be
>>> able to even test whatever I would come up with.
>
> I see the patch sets t->rseq.event.ids_changed when cid changes.
> David, perhaps we miss the same update when the vcpu changes, b/c
> logically vcpu is the same as cid?
> Otherwise I don't see why this change would interfere with tcmalloc's
> [ab]use of rseq. Fences (membarrier) should continue to update rseq
> unconditionally for tcmalloc, but that's not the scheduler fast paths.

I suspect that tcmalloc's scheme no longer works: it aliases the rseq
cpu_id_start field with its own data structure, corrupts the field's
content, and expects the kernel to rewrite it on every preemption. With
this change, the kernel only updates the field when the cpu_id actually
changes.
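
To illustrate the suspected failure mode, here is a toy user-space model
in C. It is only a sketch under stated assumptions: it is neither kernel
code nor tcmalloc code, every toy_* name is made up for illustration, and
the struct keeps only the one field discussed here. It contrasts an
"update on every preemption" policy with an "update only when the cpu id
changes" policy after user space has overwritten cpu_id_start.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the user-space rseq area (hypothetical, one field only). */
struct toy_rseq {
	uint32_t cpu_id_start;
};

/* Toy stand-in for the kernel's per-task view (hypothetical). */
struct toy_task {
	struct toy_rseq *area;	/* registered user-space area */
	uint32_t last_cpu;	/* cpu id the "kernel" last wrote */
};

/* Assumed older behaviour: rewrite the field on every preemption. */
void toy_resume_always(struct toy_task *t, uint32_t cpu)
{
	assert(t && t->area);
	t->area->cpu_id_start = cpu;
	t->last_cpu = cpu;
}

/* Assumed newer behaviour: rewrite the field only when the cpu id changes. */
void toy_resume_on_change(struct toy_task *t, uint32_t cpu)
{
	assert(t && t->area);
	if (cpu == t->last_cpu)
		return;		/* same cpu: the field is left untouched */
	t->area->cpu_id_start = cpu;
	t->last_cpu = cpu;
}

/*
 * User space clobbers cpu_id_start with its own data, then the task is
 * preempted and resumes on the same cpu. Returns true if the field holds
 * the cpu id again afterwards.
 */
bool toy_field_restored(void (*resume)(struct toy_task *, uint32_t))
{
	struct toy_rseq area = { .cpu_id_start = 3 };
	struct toy_task task = { .area = &area, .last_cpu = 3 };

	area.cpu_id_start = 0xdeadbeef;	/* user space overwrites the field */
	resume(&task, 3);		/* preempted, resumes on cpu 3 */
	return area.cpu_id_start == 3;
}
```

In this model, toy_field_restored(toy_resume_always) is true, while
toy_field_restored(toy_resume_on_change) is false: the clobbered value
survives a preemption that does not migrate the task, which is the
situation I suspect tcmalloc now runs into.
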
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com