linux-kernel - Re: Fwd: Persistent rt_sigreturn segfaults on KVM VMs after upgrade to 5.15

Message-ID: <ZGY9twXBuTWpliAB@google.com>
Date:   Thu, 18 May 2023 08:01:11 -0700
From:   Sean Christopherson <seanjc@...gle.com>
To:     Bagas Sanjaya <bagasdotme@...il.com>
Cc:     Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Linux Regressions <regressions@...ts.linux.dev>,
        Linux KVM <kvm@...r.kernel.org>,
        Paolo Bonzini <pbonzini@...hat.com>, Theodor Milkov <tm@....bg>
Subject: Re: Fwd: Persistent rt_sigreturn segfaults on KVM VMs after upgrade
 to 5.15

On Thu, May 18, 2023, Bagas Sanjaya wrote:
> On 5/18/23 20:57, Bagas Sanjaya wrote:
> > Hi,
> > 
> > I notice a regression report on Bugzilla [1]. Quoting from it:
> > 
> >> I'm experiencing sporadic but persistent segmentation faults on the KVM
> >> VMs I manage. These faults began appearing after upgrading from Linux
> >> Kernel 4.x to 5.15.59. I further upgraded to 5.15.91 and transitioned the
> >> userspace from Debian 10 (buster) to Debian 11 (bullseye), yet the issues
> >> persist. Notably, the libc has also changed in the process as seen in the
> >> following error logs:

Was the host or guest kernel upgraded?  If the guest kernel was upgraded, it's
unlikely, though still possible, that this is a KVM bug.

> >> post.sh[21952]: bad frame in rt_sigreturn frame:000072db65961bb8
> >> ip:6c25f82a9a5d sp:72db65962168 orax:ffffffffffffffff in
> >> libc-2.28.so[6c25f8294000+147000]
> >>
> >> cron[7626]: bad frame in rt_sigreturn frame:000073ddebeb6ff8
> >> ip:72ad9f44d594 sp:73ddebeb75a8 orax:ffffffffffffffff in
> >> libc-2.28.so[72ad9f3a9000+147000]
> >>
> >> cron[64687]: bad frame in rt_sigreturn frame:000073265764b038
> >> ip:67c7b5a0f14a sp:73265764b5f0 orax:ffffffffffffffff in
> >> libc-2.31.so[67c7b596f000+159000]
> >>
> >> worker.py[54568]: bad frame in rt_sigreturn frame:000078eef6591cf8
> >> ip:6c9f9b2a604e sp:78eef6592298 orax:ffffffffffffffff in
> >> libpthread-2.31.so[6c9f9b29a000+10000]
> >>
> >>
> >> The segmentation faults occur 1-3 times daily across approximately 1000
> >> VMs running on hundreds of (supermicro, intel cpu) bare-metal servers.
> >> Currently, there's no reliable way for me to reproduce the issue. I
> >> initially considered this bug -
> >> https://www.spinics.net/lists/linux-tip-commits/msg61293.html - as a
> >> possible cause, but judging from the comments it likely isn't.
> >>
> >> The best approximation to a reproducer I have is a Python script that
> >> initiates several child processes and continuously sends them a SIGUSR1
> >> signal. Still, it takes a few hours to trigger the issue even when running
> >> this script on several hundred VMs.
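
For reference, a reproducer along those lines might look like the sketch below.
This is not the reporter's script, which isn't included in the report. The
names run_stress() and child_main(), the child count, and the duration are all
assumptions made for illustration. The parent forks a few children, floods them
with SIGUSR1 for a fixed time, then kills them and records which signal ended
each one. Anything other than SIGKILL (e.g. SIGSEGV) would be a hit.

```python
import os
import signal
import time


def child_main():
    """Child: take SIGUSR1 in a busy loop until the parent kills it."""
    signal.signal(signal.SIGUSR1, lambda signum, frame: None)
    # SIGUSR1 was blocked across fork(); unblock only once the handler exists.
    signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGUSR1})
    while True:
        sum(range(1000))  # stay in user space so signals interrupt real work


def run_stress(num_children=4, duration_s=1.0):
    """Fork children, signal them continuously for duration_s seconds.

    Returns (signals_sent, {pid: terminating signal number or None}).
    """
    # Block SIGUSR1 so a child cannot be hit before it installs its handler.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})
    pids = []
    try:
        for _ in range(num_children):
            pid = os.fork()
            if pid == 0:
                try:
                    child_main()
                finally:
                    os._exit(1)  # never fall back into the parent's code
            pids.append(pid)
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)

    sent = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        for pid in pids:
            os.kill(pid, signal.SIGUSR1)
            sent += 1

    results = {}
    for pid in pids:
        # A child that already crashed is a zombie; kill() still succeeds and
        # waitpid() reports the signal that originally terminated it.
        os.kill(pid, signal.SIGKILL)
        _, status = os.waitpid(pid, 0)
        results[pid] = os.WTERMSIG(status) if os.WIFSIGNALED(status) else None
    return sent, results


if __name__ == "__main__":
    sent, results = run_stress(num_children=4, duration_s=5.0)
    crashed = {p: s for p, s in results.items() if s != signal.SIGKILL}
    print(f"sent {sent} signals; unexpected terminations: {crashed}")
```

Looping this in the guest and checking dmesg for "bad frame in rt_sigreturn"
would show whether a given kernel is affected, though per the report it may
still take hours across many VMs to trigger.
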
> >>
> >> Switching to the 6.x kernel isn't immediately feasible as these are
> >> production systems with specific requirements. The transition is planned
> >> but will likely take several months.
> >>
> >> I'm looking for suggestions on how to more reliably reproduce this
> >> problem. Then I could try different old and new kernels and maybe narrow
> >> it down.
> > 
> > See bugzilla for the full thread.
> > 
> > Anyway, I'm adding it to regzbot:
> > 
> > #regzbot introduced: v4.19..v5.15 https://bugzilla.kernel.org/show_bug.cgi?id=217457
> > #regzbot title: bad frame in rt_sigreturn (libc-related?) regression after 5.15 upgrade
> > 
> 
> Oops, I forgot to add the reporter:
> 
> #regzbot from: Theodor Milkov <tm@....bg>
> 
> Sorry for the inconvenience.
> 
> -- 
> An old man doll... just what I always wanted! - Clara
>