Message-ID: <9c60881e-d28f-d8d5-099c-b9678bd69db9@huaweicloud.com>
Date: Mon, 15 Jul 2024 09:56:05 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Konstantin Kharlamov <Hi-Angel@...dex.ru>,
 Yu Kuai <yukuai1@...weicloud.com>, Song Liu <song@...nel.org>,
 linux-raid@...r.kernel.org, linux-kernel@...r.kernel.org,
 "yangerkun@...wei.com" <yangerkun@...wei.com>,
 "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: Lockup of (raid5 or raid6) + vdo after taking out a disk under
 load

Hi,

On 2024/07/13 21:50, Konstantin Kharlamov wrote:
> On Sat, 2024-07-13 at 19:06 +0800, Yu Kuai wrote:
>> Hi,
>>
>> On 2024/07/12 20:11, Konstantin Kharlamov wrote:
>>> Good news: your diff seems to have fixed the problem! I would have
>>> to test more extensively in another environment to be completely
>>> sure, but following the minimal steps-to-reproduce I can no longer
>>> trigger it, so the problem does seem fixed.
>>
>> That's good. :)
>>>
>>> Bad news: there's a new lockup now 😄 This one seems to happen after
>>> the disk is returned; unless returning it just happens to coincide
>>> with the stack traces showing up, which is still possible even
>>> though I re-tested multiple times, because the traces (below) don't
>>> always appear. However, even when the traces do not appear, the IO
>>> load of the fio running in the background drops to zero, so
>>> something definitely seems wrong.
>>
>> OK, I need to investigate this further. The call stack is not very
>> helpful.
> 
> Is it not helpful because of the missing line numbers, or in general?
> If it's the missing line numbers, I'll try to fix that. We're using
> some Debian scripts that create deb packages, and they don't handle
> debug information well (it goes into a separate package, but even with
> that package installed the kernel traces still don't have line
> numbers). I haven't looked into it yet, but I can if that would help.

Line numbers will be helpful. Meanwhile, can you check whether the
underlying disks still have IO in flight while raid5 is stuck, via
/sys/block/[device]/inflight?
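
Something like the following should do (just a rough sketch; sd[b-e]
here is a placeholder for the raid5 member disks, please adjust):

	# print the "reads writes" currently in flight for each member disk
	for d in /sys/block/sd[b-e]; do
		echo "$d: $(cat $d/inflight)"
	done

For the line numbers, if you have a vmlinux with debug info around, the
in-tree scripts/decode_stacktrace.sh should be able to annotate the
trace, e.g. ./scripts/decode_stacktrace.sh vmlinux < trace.txt.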
> 
>> First, can the problem be reproduced with raid1/raid10? If not, it
>> is probably a raid5 bug.
> 
> This is not reproducible with raid1 (i.e. no lockups for raid1); I
> tested that. I didn't test raid10; I can try if you want (but probably
> only after the weekend, because today I was asked to hand the nodes
> over, at least for the weekend, to someone else).

Yes, please try raid10 as well. For now I'd say this is a raid5
problem.
> 
>> The best outcome would be if I could reproduce this problem myself.
>> What I don't understand is step 4, turning off the JBOD slot's power:
>> is this only possible on a real machine, or can I do it in my VM?
> 
> Well, let's just say that if it is possible, I don't know a way to do
> it. The `sg_ses` commands that I used
> 
> 	sg_ses --dev-slot-num=9 --set=3:4:1   /dev/sg26 # turning off
> 	sg_ses --dev-slot-num=9 --clear=3:4:1 /dev/sg26 # turning on
> 
> …set and clear the 3:4:1 bit, where the bit is defined by the JBOD
> manufacturer's datasheet. The 3:4:1 value specifically is defined by
> the manufacturer "AIC", which means the commands as-is are unlikely to
> work on different hardware.

I have never done this before; I'll try.
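
(If it helps to double-check on your side, I think the same syntax can
read the bit back, e.g.:

	sg_ses --dev-slot-num=9 --get=3:4:1 /dev/sg26

but I haven't tried it, so please treat that as a guess.)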
> 
> Well, while we're at it, do you have any thoughts on why just using
> `echo 1 > /sys/block/sdX/device/delete` doesn't reproduce it? Does the
> kernel perhaps not emulate device disappearance well enough?

`echo 1 > delete` just deletes the disk from the kernel, so scsi/dm-raid
knows that the disk is gone. With the other method, however, the disk
stays in the kernel: dm-raid is not aware that the underlying disk is
problematic, and IO will still be generated and issued to it.
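
If you want to emulate the "disk stays but stops working" case without
the JBOD, one way that might be close enough (just a sketch, not tested,
and /dev/vdb is a placeholder for a member disk) is to build the array
on top of a dm linear device and later reload it with the error target,
so every IO to that member fails while the disk itself stays in the
kernel:

	# wrap the member disk before creating the array
	SZ=$(blockdev --getsz /dev/vdb)
	dmsetup create member0 --table "0 $SZ linear /dev/vdb 0"
	# ... create the raid array on /dev/mapper/member0 ...

	# later, "kill" the member: all IO to it now fails,
	# but no removal event is generated
	dmsetup suspend member0
	dmsetup reload member0 --table "0 $SZ error"
	dmsetup resume member0

This fails IO immediately instead of timing out like a powered-off slot
does, so it is not exactly the same, but it could be a way to try this
in a VM.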

Thanks,
Kuai


