linux-kernel - Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAHk-=whJqZLKPR-cpX-V4wJTXVX-_tG5Vjuj2q9knvKGCPdfkg@mail.gmail.com>
Date:   Sun, 11 Jun 2023 20:14:25 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Dave Chinner <david@...morbit.com>
Cc:     "Darrick J. Wong" <djwong@...nel.org>,
        Zorro Lang <zlang@...hat.com>, linux-xfs@...r.kernel.org,
        "Eric W. Biederman" <ebiederm@...ssion.com>,
        Mike Christie <michael.christie@...cle.com>,
        "Michael S. Tsirkin" <mst@...hat.com>, linux-kernel@...r.kernel.org
Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report]
 fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

On Sun, Jun 11, 2023 at 7:22 PM Dave Chinner <david@...morbit.com> wrote:
>
> I guess the regression fix needs a regression fix....

Yup.

>From the description of the problem, it sounds like this happens on
real hardware, no vhost anywhere?

Or maybe Darrick (who doesn't see the issue) is running on raw
hardware, and you and Zorro are running in a virtual environment?

It sounds like zap_other_threads() and coredump_task_exit() do not
agree about the core_state->nr_threads counting, which is part of what
changed there.

[ Goes off to look ]

Hmm. Both seem to be using the same test for

    (t->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER

which I don't love - I don't think io_uring threads should participate
in core dumping either, so I think the test could just be

    (t->flags & PF_IO_WORKER)

but that shouldn't be the issue here.

But according to

  https://lore.kernel.org/all/20230611124836.whfktwaumnefm5z5@zlang-mailbox/

it's clearly hanging in wait_for_completion_state() in
coredump_wait(), so it really looks like some confusion about that
core_waiters (aka core_state->nr_threads) count.

Oh. Humm. Mike changed that initial rough patch of mine, and I had
moved the "if you don't participate in c ore dumps" test up also past
the "do_coredump()" logic.

And I think it's horribly *wrong* for a thread that doesn't get
counted for core-dumping to go into do_coredump(), because then it
will set the "core_state" to possibly be the core-state of the vhost
thread that isn't even counted.

So *maybe* this attached patch might fix it? I haven't thought very
deeply about this, but vhost workers most definitely shouldn't call
do_coredump(), since they are then not counted.

(And again, I think we should just check that PF_IO_WORKER bit, not
use this more complex test, but that's a separate and bigger change).

              Linus

View attachment "patch.diff" of type "text/x-patch" (554 bytes)