linux-kernel - Re: Lockup of (raid5 or raid6) + vdo after taking out a disk under load

Message-ID: <9de7c031-58e6-56df-6b9c-b10952551d02@redhat.com>
Date: Wed, 31 Jul 2024 17:33:41 -0400
From: Matthew Sakai <msakai@...hat.com>
To: Konstantin Kharlamov <Hi-Angel@...dex.ru>,
 Yu Kuai <yukuai1@...weicloud.com>, Song Liu <song@...nel.org>,
 linux-raid@...r.kernel.org, linux-kernel@...r.kernel.org,
 "yangerkun@...wei.com" <yangerkun@...wei.com>,
 "yukuai (C)" <yukuai3@...wei.com>
Cc: dm-devel@...ts.linux.dev
Subject: Re: Lockup of (raid5 or raid6) + vdo after taking out a disk under
 load


On 7/31/24 10:14, Konstantin Kharlamov wrote:
> CC'ing VDO maintainers, because the problem is only reproducible with
> VDO, so potentially they might have some ideas.

I don't see anything that implicates VDO directly. The blocked VDO 
threads (with the test patch) seem to be stuck in raid5_make_request() 
so it seems like the raid itself is not handling requests in a timely 
manner.

There is one potentially useful detail, however: VDO mostly submits 4K 
bios. The large number of smaller bios may be exacerbating an issue in 
the raid5.

Matt

> On Mon, 2024-07-22 at 20:56 +0300, Konstantin Kharlamov wrote:
>> Hi, sorry for the delay, I had to give away the nodes and we had a week
>> of teambuilding and company party, so for the past week I only managed
>> to hack away stripping debug symbols, get another node and set it up.
>>
>> Experiments below are based off of vanilla 6.9.8 kernel *without* your
>> patch.
>>
>> On Mon, 2024-07-15 at 09:56 +0800, Yu Kuai wrote:
>>> Line number will be helpful.
>>
>> So, after tinkering with building scripts I managed to build modules
>> with debug symbols (not the kernel itself but should be good enough),
>> but for some reason kernel doesn't show line numbers in stacktraces. No
>> idea what could be causing it, so I had to decode line numbers
>> manually, below is an output where I inserted line numbers for raid456
>> manually after decoding them with `gdb`.
>>
>>      […]
>>      [ 1677.293366]  <TASK>
>>      [ 1677.293661]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
>>      [ 1677.293972]  ? _raw_spin_unlock_irq+0x10/0x30
>>      [ 1677.294276]  ? _raw_spin_unlock_irq+0xa/0x30
>>      [ 1677.294586]  raid5d at drivers/md/raid5.c:6572
>>      [ 1677.294910]  md_thread+0xc1/0x170
>>      [ 1677.295228]  ? __pfx_autoremove_wake_function+0x10/0x10
>>      [ 1677.295545]  ? __pfx_md_thread+0x10/0x10
>>      [ 1677.295870]  kthread+0xff/0x130
>>      [ 1677.296189]  ? __pfx_kthread+0x10/0x10
>>      [ 1677.296498]  ret_from_fork+0x30/0x50
>>      [ 1677.296810]  ? __pfx_kthread+0x10/0x10
>>      [ 1677.297112]  ret_from_fork_asm+0x1a/0x30
>>      [ 1677.297424]  </TASK>
>>      […]
>>      [ 1705.296253]  <TASK>
>>      [ 1705.296554]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
>>      [ 1705.296864]  ? _raw_spin_unlock_irq+0x10/0x30
>>      [ 1705.297172]  ? _raw_spin_unlock_irq+0xa/0x30
>>      [ 1677.294586]  raid5d at drivers/md/raid5.c:6597
>>      [ 1705.297794]  md_thread+0xc1/0x170
>>      [ 1705.298099]  ? __pfx_autoremove_wake_function+0x10/0x10
>>      [ 1705.298409]  ? __pfx_md_thread+0x10/0x10
>>      [ 1705.298714]  kthread+0xff/0x130
>>      [ 1705.299022]  ? __pfx_kthread+0x10/0x10
>>      [ 1705.299333]  ret_from_fork+0x30/0x50
>>      [ 1705.299641]  ? __pfx_kthread+0x10/0x10
>>      [ 1705.299947]  ret_from_fork_asm+0x1a/0x30
>>      [ 1705.300257]  </TASK>
>>      […]
>>      [ 1733.296255]  <TASK>
>>      [ 1733.296556]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
>>      [ 1733.296862]  ? _raw_spin_unlock_irq+0x10/0x30
>>      [ 1733.297170]  ? _raw_spin_unlock_irq+0xa/0x30
>>      [ 1677.294586]  raid5d at drivers/md/raid5.c:6572
>>      [ 1733.297792]  md_thread+0xc1/0x170
>>      [ 1733.298096]  ? __pfx_autoremove_wake_function+0x10/0x10
>>      [ 1733.298403]  ? __pfx_md_thread+0x10/0x10
>>      [ 1733.298711]  kthread+0xff/0x130
>>      [ 1733.299018]  ? __pfx_kthread+0x10/0x10
>>      [ 1733.299330]  ret_from_fork+0x30/0x50
>>      [ 1733.299637]  ? __pfx_kthread+0x10/0x10
>>      [ 1733.299943]  ret_from_fork_asm+0x1a/0x30
>>      [ 1733.300251]  </TASK>
>>
>>> Meanwhile, can you check if the underlying disks have IO while raid5
>>> is stuck, via /sys/block/[device]/inflight.
>>
>> The two devices that are left after the 3rd one is removed have these
>> numbers, which don't change with time:
>>
>>      [Mon Jul 22 20:18:06 @ ~]:> for d in dm-19 dm-17; do echo -n $d; cat /sys/block/$d/inflight; done
>>      dm-19       9        1
>>      dm-17      11        2
>>      [Mon Jul 22 20:18:11 @ ~]:> for d in dm-19 dm-17; do echo -n $d; cat /sys/block/$d/inflight; done
>>      dm-19       9        1
>>      dm-17      11        2
>>
>> They also don't change after I return the disk back (which is to be
>> expected I guess, given that the lockup doesn't go away).
>>
>>>>
>>>>> At first, can the problem reproduce with raid1/raid10? If not, this
>>>>> is probably a raid5 bug.
>>>>
>>>> This is not reproducible with raid1 (i.e. no lockups for raid1), I
>>>> tested that. I didn't test raid10, if you want I can try (but probably
>>>> only after the weekend, because today I was asked to give the nodes
>>>> away, for the weekend at least, to someone else).
>>>
>>> Yes, please try raid10 as well. For now I'll say this is a raid5
>>> problem.
>>
>> Tested: raid10 works just fine, i.e. no lockup and fio continues
>> having non-zero IOPS.
>>
>>>>> The best would be if I could reproduce this problem myself.
>>>>> The problem is that I don't understand step 4: turning off the jbod
>>>>> slot's power, is this only possible on a real machine, or can I do
>>>>> this in my VM?
>>>>
>>>> Well, let's say that if it is possible, I don't know a way to do that.
>>>> The `sg_ses` commands that I used
>>>>
>>>> 	sg_ses --dev-slot-num=9 --set=3:4:1   /dev/sg26 # turning off
>>>> 	sg_ses --dev-slot-num=9 --clear=3:4:1 /dev/sg26 # turning on
>>>>
>>>> …set and clear the value of the 3:4:1 bit, where the bit is defined
>>>> by the JBOD's manufacturer datasheet. The 3:4:1 specifically is
>>>> defined by "AIC" manufacturer. That means the command as-is is
>>>> unlikely to work on different hardware.
>>>
>>> I have never done this before, I'll try.
>>>>
>>>> Well, while on it, do you have any thoughts why just using a
>>>> `echo 1 > /sys/block/sdX/device/delete` doesn't reproduce it? Does
>>>> perhaps kernel not emulate device disappearance too well?
>>>
>>> echo 1 > delete just deletes the disk from the kernel, and scsi/dm-raid
>>> will know that this disk is deleted. In the other case, however, the
>>> disk stays in the kernel: dm-raid is not aware that the underlying
>>> disks are problematic, and IO will still be generated and issued.
>>>
>>> Thanks,
>>> Kuai
>
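
The inflight check quoted above (sampling /sys/block/[device]/inflight twice and comparing) can be wrapped in a small script. This is only a sketch of that manual check, not something posted in the thread: the function names, `SYSFS_ROOT` and `INTERVAL` are made up for illustration, and counters that stay non-zero and unchanged only suggest stuck requests, they do not prove it.

```shell
#!/bin/sh
# Sketch: flag devices whose in-flight counters are non-zero and identical
# across two samples. SYSFS_ROOT and INTERVAL are hypothetical knobs added
# for illustration; they are not kernel or md interfaces.
SYSFS_ROOT=${SYSFS_ROOT:-/sys/block}
INTERVAL=${INTERVAL:-5}

# Print "<reads> <writes>" for one device, with whitespace normalised.
read_inflight() {
    read -r nr_reads nr_writes < "$SYSFS_ROOT/$1/inflight" &&
        echo "$nr_reads $nr_writes"
}

# Usage: check_inflight dm-19 dm-17
check_inflight() {
    before=""
    # First sample: one "<dev> <reads> <writes>" line per device.
    for d in "$@"; do
        before="$before$d $(read_inflight "$d")
"
    done
    sleep "$INTERVAL"
    # Second sample: compare against the first one.
    printf '%s' "$before" | while read -r d r w; do
        now=$(read_inflight "$d")
        if [ "$now" = "$r $w" ] && [ "$r $w" != "0 0" ]; then
            echo "$d: stuck (reads=$r writes=$w)"
        else
            echo "$d: ok ($now)"
        fi
    done
}
```

With the two devices from the report this would be called as `check_inflight dm-19 dm-17`; a longer `INTERVAL` makes a false "stuck" verdict on a merely busy device less likely.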