linux-kernel - Re: segfaults of processes while being killed after commit "mm: make the page fault mmap locking killable"


Message-ID: <85876d36-ca1f-4ba4-9065-4e7fc58329c0@proxmox.com>
Date:   Wed, 26 Jul 2023 08:51:24 +0200
From:   Thomas Lamprecht <t.lamprecht@...xmox.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>,
        Fiona Ebner <f.ebner@...xmox.com>,
        "Eric W. Biederman" <ebiederm@...ssion.com>,
        Oleg Nesterov <oleg@...hat.com>
Cc:     akpm@...ux-foundation.org,
        Wolfgang Bumiller <w.bumiller@...xmox.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: segfaults of processes while being killed after commit "mm: make
 the page fault mmap locking killable"

On 25/07/2023 18:38, Linus Torvalds wrote:
> But before we revert it, would you mind trying out the attached
> trivial patch instead?

Not Fiona, but as I was still online yesterday, I already got around to
trying that patch out, after adding the missing `tsk` task_struct param
to the fatal_signal_pending call.
With the patched kernel booted, the original case we found in the wild
went from logging a segfault roughly twice per hour to none at all, and
that over a bit more than 10h of uptime.
Fiona might have a more definitive confirmation, as IIRC she has a
better (= faster) reproducer that she used for bisecting.

> 
> I'd also still be interested if the symptoms were anything else than
> 'show_unhandled_signals' causing the show_signal_msg() dance, and
> resulting in a message something like
> 
>     a.out[1567]: segfault at xyz ip [..] likely on CPU X
> 
> in dmesg...

Exactly, it was just like that, with no actual fallout. The messages
were like:

> pverados[2183248]: segfault at 55e5a00f9ae0 ip 000055e5a00f9ae0 sp 00007ffc0720bea8 error 14 in perl[55e5a00d4000+195000] likely on CPU 10 (core 4, socket 0)
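
For readers decoding the line above, here is a minimal sketch that splits
out the fault address, instruction pointer and error code. It assumes the
`error` field is printed in hexadecimal and that the bits follow the x86
X86_PF_* flag layout; the regex and the helper names are hypothetical and
not taken from any kernel tooling.

```python
import re

# x86 page-fault error-code bits, named after the kernel's X86_PF_* flags
# (assumed layout; check arch/x86 headers for the kernel in question).
X86_PF_BITS = {
    0: "PROT",   # protection violation (clear: page not present)
    1: "WRITE",  # write access (clear: read)
    2: "USER",   # fault happened in user mode
    3: "RSVD",   # reserved bit set in a page-table entry
    4: "INSTR",  # fault was an instruction fetch
    5: "PK",     # protection-key violation
}

LINE = (
    "pverados[2183248]: segfault at 55e5a00f9ae0 ip 000055e5a00f9ae0 "
    "sp 00007ffc0720bea8 error 14 in perl[55e5a00d4000+195000] "
    "likely on CPU 10 (core 4, socket 0)"
)


def decode_pf_error(code_hex: str) -> list[str]:
    """Return the names of the flags set in a hex error code."""
    code = int(code_hex, 16)
    return [name for bit, name in X86_PF_BITS.items() if code & (1 << bit)]


def parse_segfault_line(line: str) -> dict:
    """Extract fault address, ip and decoded error flags from a dmesg line."""
    m = re.search(r"segfault at ([0-9a-f]+) ip ([0-9a-f]+) .* error ([0-9a-f]+)", line)
    if m is None:
        raise ValueError("not a segfault line")
    addr, ip, err = m.groups()
    return {
        "addr": int(addr, 16),
        "ip": int(ip, 16),
        "flags": decode_pf_error(err),
    }


if __name__ == "__main__":
    print(parse_segfault_line(LINE))
```

Under those assumptions, the fault address equals the instruction pointer
and the flags decode to a user-mode instruction fetch from a page that was
not present.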

And the slightly odd code triggering this was basically a fork, where
the child wrote a message to the parent via a unix socket pair and
then called exit. The parent read that message and then sent a SIGKILL
to the child process, i.e., the child's exit and the parent's kill of
the child process would be pretty closely aligned, basically racing
with each other.
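
The pattern described above can be sketched roughly as follows. This is
not the original Perl code, only a minimal Python approximation of the
same fork / socket pair / SIGKILL sequence; the function name and the
message content are made up for illustration, and whether it actually
triggers the log message on an affected kernel is untested.

```python
import os
import signal
import socket


def race_once() -> int:
    """Fork a child that reports in and exits while the parent SIGKILLs it.

    Returns the raw wait status of the child.
    """
    parent_sock, child_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    pid = os.fork()
    if pid == 0:
        # Child: tell the parent we are done, then exit right away.
        parent_sock.close()
        child_sock.sendall(b"done")
        os._exit(0)

    # Parent: wait for the child's message ...
    child_sock.close()
    msg = parent_sock.recv(16)
    parent_sock.close()
    if msg != b"done":
        raise RuntimeError("unexpected message from child: %r" % msg)

    # ... then kill the child, which is most likely exiting at this very
    # moment. Sending a signal to a not-yet-reaped child does not fail.
    os.kill(pid, signal.SIGKILL)
    _, status = os.waitpid(pid, 0)
    return status


if __name__ == "__main__":
    exited = killed = 0
    for _ in range(1000):
        status = race_once()
        if os.WIFSIGNALED(status):
            killed += 1
        else:
            exited += 1
    print(f"exited normally: {exited}, killed by signal: {killed}")
```

Depending on who wins the race, the child is reaped either as exited
with status 0 or as killed by SIGKILL; the loop just counts both cases.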

cheers,
 Thomas