Date:   Sat, 3 Sep 2022 09:47:16 -0400
From:   Dusty Mabe <dusty@...tymabe.com>
To:     Ming Lei <ming.lei@...hat.com>
Cc:     Jens Axboe <axboe@...nel.dk>, linux-block@...r.kernel.org,
        linux-kernel@...r.kernel.org, hch@....de,
        linux-raid@...r.kernel.org
Subject: Re: regression caused by block: freeze the queue earlier in
 del_gendisk



On 9/1/22 03:06, Ming Lei wrote:
> Hi Dusty,

Hi Ming,

> 
> On Fri, Aug 26, 2022 at 12:15:22PM -0400, Dusty Mabe wrote:
>> Hey All,
>>
>> I think I've found a regression introduced by:
>>
>> a09b314 o block: freeze the queue earlier in del_gendisk
>>
>> In Fedora CoreOS we have tests that set up RAID1 on the /boot/ and /root/ partitions
>> and then subsequently removes one of the disks to simulate a failure. Sometime recently
> 
> Do you have test case which doesn't need raid1 over /boot or /root? such
> as by create raid1 over two disks, then mount & remove one of device, ...
> 
> It isn't easy to setup/observe such test case and observe what is wrong.

I don't have such a test case. For Fedora CoreOS we have a very
specific partition layout [1], so it's not easy to change that
and still run our test framework.

That being said, there are plenty of people in the bug report [2]
who are reporting this as well, so they might have other
test cases they can share.

[1] https://github.com/coreos/fedora-coreos-tracker/blob/main/Design.md#disk-layout
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2121791

> 
>> this test started timing out occasionally. Looking a bit closer it appears instances are
>> getting stuck during reboot with a bunch of looping messages:
>>
>> ```
>> [   17.978854] block device autoloading is deprecated and will be removed.
>> [   17.982555] block device autoloading is deprecated and will be removed.
>> [   17.985537] block device autoloading is deprecated and will be removed.
>> [   17.987546] block device autoloading is deprecated and will be removed.
>> [   17.989540] block device autoloading is deprecated and will be removed.
>> [   17.991547] block device autoloading is deprecated and will be removed.
>> [   17.993555] block device autoloading is deprecated and will be removed.
>> [   17.995539] block device autoloading is deprecated and will be removed.
>> [   17.997577] block device autoloading is deprecated and will be removed.
>> [   17.999544] block device autoloading is deprecated and will be removed.
>> [   22.979465] blkdev_get_no_open: 1666 callbacks suppressed
>> ...
>> ...
>> ...
>> [  618.221270] blkdev_get_no_open: 1664 callbacks suppressed
>> [  618.221273] block device autoloading is deprecated and will be removed.
>> [  618.224274] block device autoloading is deprecated and will be removed.
>> [  618.227267] block device autoloading is deprecated and will be removed.
>> [  618.229274] block device autoloading is deprecated and will be removed.
>> [  618.231277] block device autoloading is deprecated and will be removed.
>> [  618.233277] block device autoloading is deprecated and will be removed.
>> [  618.235282] block device autoloading is deprecated and will be removed.
>> [  618.237370] block device autoloading is deprecated and will be removed.
>> [  618.239356] block device autoloading is deprecated and will be removed.
>> [  618.241290] block device autoloading is deprecated and will be removed.
>> ```
>>
>> Using the Fedora kernels I narrowed it down to being introduced between 
>> `kernel-5.19.0-0.rc3.27.fc37` (good) and `kernel-5.19.0-0.rc4.33.fc37` (bad).
>>
>> I then did a bisect and found:
>>
>> ```
>> $ git bisect bad
>> a09b314005f3a0956ebf56e01b3b80339df577cc is the first bad commit
>> commit a09b314005f3a0956ebf56e01b3b80339df577cc
>> Author: Christoph Hellwig <hch@....de>
>> Date:   Tue Jun 14 09:48:27 2022 +0200
>>
>>     block: freeze the queue earlier in del_gendisk
> 
> It is a bit hard to associate the above commit with reported issue.

Indeed, though I think there is now enough empirical evidence
pointing directly at this commit. It may ultimately turn out not to
be the root cause, but it's definitely related.

Dusty
