Message-ID: <f2c28043-59e6-0aee-b8bf-df38525ee899@leemhuis.info>
Date:   Tue, 20 Sep 2022 11:11:50 +0200
From:   Thorsten Leemhuis <regressions@...mhuis.info>
To:     Dusty Mabe <dusty@...tymabe.com>, Ming Lei <ming.lei@...hat.com>,
        Christoph Hellwig <hch@....de>
Cc:     Jens Axboe <axboe@...nel.dk>, linux-block@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org,
        "regressions@...ts.linux.dev" <regressions@...ts.linux.dev>
Subject: Re: regression caused by block: freeze the queue earlier in
 del_gendisk

Hi, this is your Linux kernel regression tracker.

On 13.09.22 04:36, Dusty Mabe wrote:
> On 9/12/22 21:55, Ming Lei wrote:
>> On Mon, Sep 12, 2022 at 09:16:18AM +0200, Christoph Hellwig wrote:
>>> On Fri, Sep 09, 2022 at 04:24:40PM +0800, Ming Lei wrote:
>>>> On Wed, Sep 07, 2022 at 09:33:24AM +0200, Christoph Hellwig wrote:
>>>>> On Thu, Sep 01, 2022 at 03:06:08PM +0800, Ming Lei wrote:
>>>>>> It is a bit hard to associate the above commit with reported issue.
>>>>>
>>>>> So the messages clearly are about something trying to open a device
>>>>> that went away at the block layer, but somehow does not get removed
>>>>> in time by udev (which seems to be a userspace bug in CoreOS).  But
>>>>> even with that we really should not hang.
>>>>
>>>> Xiao Ni provides one script[1] which can reproduce the issue more or less.
>>>
>>> I've run the reproducer 10000 times on current mainline, and while
>>> it prints one of the autoloading messages per run, I've not actually
>>> seen any kind of hang.
>>
>> I can't reproduce the hang either.
> 
> I obviously can reproduce the issue with the test in our Fedora CoreOS
> test suite. It's part of a framework (i.e. it's not simply some script
> you can run) but it is very reproducible so one can add some instrumentation
> to the kernel and feed it through a build/test cycle to see different
> results or logs.
> 
> I'm willing to share this with other people (maybe a screen share or
> some written down instructions) if anyone would be interested.

This thread looks stalled; or was there any progress in the past week?
If not: Fedora apparently removed the patch from their kernels a while
ago, as quite a few users were hitting it. What is preventing us from
doing the same in mainline and 5.19.y until the issue can be resolved?
The description of a09b314005f3 ("block: freeze the queue earlier in
del_gendisk") doesn't sound like the change does something crucial that
can't wait a bit. I might be totally wrong about that, but I think it's
my duty to ask that question at this point.

>> What I meant is that new raid disk can be added by mdadm after stopping
>> the imsm container and raid disk with the autoloading messages printed,
>> I understand this behavior isn't correct, but I am not familiar with
>> raid enough.
>>
>> It might be related to the delayed deletion of the gendisk from the wq
>> & md kobj release handler.
>>
>> During reboot, if mdadm does this stupid thing without stopping, the hang
>> could be caused.
>>
>> I think the root-cause question is why mdadm tries to open/add a new
>> raid bdev crazily during reboot.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply; it's in everyone's interest to set the public record straight.