linux-kernel - Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <87r0qhrrvr.fsf@email.froward.int.ebiederm.org>
Date:   Mon, 12 Jun 2023 03:45:12 -0500
From:   "Eric W. Biederman" <ebiederm@...ssion.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Dave Chinner <david@...morbit.com>,
        "Darrick J. Wong" <djwong@...nel.org>,
        Zorro Lang <zlang@...hat.com>, linux-xfs@...r.kernel.org,
        Mike Christie <michael.christie@...cle.com>,
        "Michael S. Tsirkin" <mst@...hat.com>, linux-kernel@...r.kernel.org
Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report]
 fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

Linus Torvalds <torvalds@...ux-foundation.org> writes:

> On Sun, Jun 11, 2023 at 10:49 PM Dave Chinner <david@...morbit.com> wrote:
>>
>> On Sun, Jun 11, 2023 at 10:34:29PM -0700, Linus Torvalds wrote:
>> >
>> > So that "!=" should obviously have been a "==".
>>
>> Same as without the condition - all the fsstress tasks hang in
>> do_coredump().
>
> Ok, that at least makes sense. Your "it made things worse" made me go
> "What?" until I noticed the stupid backwards test.
>
> I'm not seeing anything else that looks odd in that commit
> f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps
> regression").
>
> Let's see if somebody else goes "Ahh" when they wake up tomorrow...

It feels like there have been about half a dozen bugs pointed out in
that version of the patch.  I am going to have to sleep before I can get
as far as "Ahh"

One thing that really stands out for me is.

if (test_if_loop_should_continue) {
	set_current_state(TASK_INTERRUPTIBLE);
        schedule();
}

/* elsewhere */
llist_add(...);
wake_up_process()

So it is possible that the code can sleep indefinitely waiting for a
wake-up that has already come, because the order of set_current_state
and the test are in the wrong order.

Unfortunately I don't see what would effect a coredump on a process that
does not trigger the vhost_worker code.

About the only thing I can image is if io_uring is involved.  Some of
the PF_IO_WORKER code was changed, and the test
"((t->flags & (PF_USER_WORKER | PF_IO_WORKER)) != PF_USER_WORKER)"
was sprinkled around.

That is the only code outside of vhost specific code that was changed.

Is io_uring involved in the cases that hang?

Eric