Message-ID: <cd0f300f-49a0-235a-099c-e76af5ed7c05@huaweicloud.com>
Date: Thu, 7 Mar 2024 11:34:57 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: junxiao.bi@...cle.com, Song Liu <song@...nel.org>,
Linux regressions mailing list <regressions@...ts.linux.dev>
Cc: gregkh@...uxfoundation.org, linux-kernel@...r.kernel.org,
linux-raid@...r.kernel.org, stable@...r.kernel.org,
Dan Moulding <dan@...m.net>, "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system;
successfully bisected
Hi,
On 2024/03/02 4:26, junxiao.bi@...cle.com wrote:
> Hi Dan & Song,
>
> I have not root-caused this yet, but I would like to share some findings
> from the vmcore Dan shared. From what I can see, this doesn't look like an
> md issue, but rather something wrong in the block layer or below.
I would like to take a look at the vmcore as well. How is Dan sharing the
vmcore? I can't find it in the thread.
Thanks,
Kuai
>
> 1. There were multiple processes hung on IO for over 15 minutes.
>
> crash> ps -m | grep UN
> [0 00:15:50.424] [UN] PID: 957 TASK: ffff88810baa0ec0 CPU: 1
> COMMAND: "jbd2/dm-3-8"
> [0 00:15:56.151] [UN] PID: 1835 TASK: ffff888108a28ec0 CPU: 2
> COMMAND: "dd"
> [0 00:15:56.187] [UN] PID: 876 TASK: ffff888108bebb00 CPU: 3
> COMMAND: "md0_reclaim"
> [0 00:15:56.185] [UN] PID: 1914 TASK: ffff8881015e6740 CPU: 1
> COMMAND: "kworker/1:2"
> [0 00:15:56.255] [UN] PID: 403 TASK: ffff888101351d80 CPU: 7
> COMMAND: "kworker/u21:1"
>
> 2. Let's pick md0_reclaim and take a look: it is waiting for the superblock
> update to complete. We can see there are two pending superblock writes and
> other pending IO on the underlying physical disks, which is what left these
> processes hung (a sketch of the wait pattern follows the output below).
>
> crash> bt 876
> PID: 876 TASK: ffff888108bebb00 CPU: 3 COMMAND: "md0_reclaim"
> #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
> #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
> #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
> #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
> #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
> #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
> #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
> #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
> #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1
>
> crash> mddev.pending_writes,disks 0xffff888108335800
> pending_writes = {
> counter = 2 <<<<<<<<<< 2 active super block write
> },
> disks = {
> next = 0xffff88810ce85a00,
> prev = 0xffff88810ce84c00
> },
> crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending
> 0xffff88810ce85a00
> ffff88810ce85a00
> kobj.name = 0xffff8881067c1a00 "dev-dm-1",
> nr_pending = {
> counter = 0
> },
> ffff8881083ace00
> kobj.name = 0xffff888100a93280 "dev-sde",
> nr_pending = {
> counter = 10 <<<<
> },
> ffff8881010ad200
> kobj.name = 0xffff8881012721c8 "dev-sdc",
> nr_pending = {
> counter = 8 <<<<<
> },
> ffff88810ce84c00
> kobj.name = 0xffff888100325f08 "dev-sdd",
> nr_pending = {
> counter = 2 <<<<<
> },
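>
> For reference, this is the pattern md0_reclaim is blocked in: md bumps
> mddev->pending_writes for each superblock bio it submits, super_written()
> (the bi_end_io seen in step 3 below) drops the counter and wakes sb_wait,
> and md_super_wait() sleeps until the counter reaches 0, which never happens
> here since the counter is stuck at 2. The snippet below is only a minimal
> userspace model of that pattern, paraphrased from the usual
> drivers/md/md.c behaviour from memory; the exact kernel code may differ.
>
> #include <pthread.h>
> #include <stdio.h>
>
> static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
> static pthread_cond_t sb_wait = PTHREAD_COND_INITIALIZER;
> static int pending_writes;              /* analogue of mddev->pending_writes */
>
> static void submit_sb_write(void)       /* analogue of md_super_write() */
> {
>         pthread_mutex_lock(&lock);
>         pending_writes++;
>         pthread_mutex_unlock(&lock);
> }
>
> static void super_written_model(void)   /* analogue of super_written() */
> {
>         pthread_mutex_lock(&lock);
>         if (--pending_writes == 0)
>                 pthread_cond_broadcast(&sb_wait);
>         pthread_mutex_unlock(&lock);
> }
>
> static void md_super_wait_model(void)   /* analogue of md_super_wait() */
> {
>         pthread_mutex_lock(&lock);
>         while (pending_writes != 0)
>                 pthread_cond_wait(&sb_wait, &lock);
>         pthread_mutex_unlock(&lock);
> }
>
> int main(void)
> {
>         submit_sb_write();
>         submit_sb_write();              /* the vmcore shows the counter at 2 */
>         super_written_model();
>         super_written_model();          /* both completions must arrive ... */
>         md_super_wait_model();          /* ... or this blocks forever, like PID 876 */
>         puts("superblock writes completed");
>         return 0;
> }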
>
> 3. From the block layer, I can find the inflight IO for the md superblock
> write, which has been pending for 955s; that matches the hung time of
> "md0_reclaim".
>
> crash>
> request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io
> ffff888103b4c300
> q = 0xffff888103a00d80,
> mq_hctx = 0xffff888103c5d200,
> cmd_flags = 38913,
> rq_flags = 139408,
> start_time_ns = 1504179024146,
> bio = 0x0,
> biotail = 0xffff888120758e40,
> state = MQ_RQ_COMPLETE,
> __data_len = 0,
> flush = {
> seq = 3, <<<< REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA
> saved_end_io = 0x0
> },
> end_io = 0xffffffff815186e0 <mq_flush_data_end_io>,
>
> crash> p tk_core.timekeeper.tkr_mono.base
> $1 = 2459916243002
> crash> eval 2459916243002-1504179024146
> hexadecimal: de86609f28
> decimal: 955737218856 <<<<<<< IO pending time is 955s
> octal: 15720630117450
> binary:
> 0000000000000000000000001101111010000110011000001001111100101000
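>
> In seconds that is 955,737,218,856 ns / 1e9 ≈ 955.7 s, i.e. roughly
> 15 minutes 56 seconds, which lines up with the [0 00:15:56.*] UN times
> shown by ps -m in step 1.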
>
> crash> bio.bi_iter,bi_end_io 0xffff888120758e40
> bi_iter = {
> bi_sector = 8, <<<< super block offset
> bi_size = 0,
> bi_idx = 0,
> bi_bvec_done = 0
> },
> bi_end_io = 0xffffffff817dca50 <super_written>,
> crash> dev -d | grep ffff888103a00d80
> 8 ffff8881003ab000 sdd ffff888103a00d80 0 0 0
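>
> (Per the dev -d output, request queue ffff888103a00d80 belongs to sdd, so
> this stuck superblock write is sitting on dev-sdd, one of the rdevs shown
> with a non-zero nr_pending in step 2.)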
>
> 4. Checking the above request: even though its state is "MQ_RQ_COMPLETE",
> it is still pending. That's because each md superblock write was marked
> with REQ_PREFLUSH | REQ_FUA, so it is handled in 3 steps: pre_flush, data,
> and post_flush. As each step completes, it is recorded in
> "request.flush.seq"; here the value is 3, i.e. REQ_FSEQ_PREFLUSH |
> REQ_FSEQ_DATA, so the last step, "post_flush", has not been done. Another
> weird thing is that blk_flush_queue.flush_data_in_flight is still 1 even
> though the "data" step has already completed.
>
> crash> blk_mq_hw_ctx.fq 0xffff888103c5d200
> fq = 0xffff88810332e240,
> crash> blk_flush_queue 0xffff88810332e240
> struct blk_flush_queue {
> mq_flush_lock = {
> {
> rlock = {
> raw_lock = {
> {
> val = {
> counter = 0
> },
> {
> locked = 0 '\000',
> pending = 0 '\000'
> },
> {
> locked_pending = 0,
> tail = 0
> }
> }
> }
> }
> }
> },
> flush_pending_idx = 1,
> flush_running_idx = 1,
> rq_status = 0 '\000',
> flush_pending_since = 4296171408,
> flush_queue = {{
> next = 0xffff88810332e250,
> prev = 0xffff88810332e250
> }, {
> next = 0xffff888103b4c348, <<<< the request is in this list
> prev = 0xffff888103b4c348
> }},
> flush_data_in_flight = 1, >>>>>> still 1
> flush_rq = 0xffff888103c2e000
> }
>
> crash> list 0xffff888103b4c348
> ffff888103b4c348
> ffff88810332e260
>
> crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of hw queue
> tag = -1,
> state = MQ_RQ_IDLE,
> ref = {
> counter = 0
> },
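>
> (The flush_rq of this hctx being MQ_RQ_IDLE with ref 0 suggests no
> post-flush is actually in flight, while the hung request sits parked on the
> pending flush list above, so the sequence apparently never advances past
> DATA.)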
>
> 5. It looks like the block layer or the underlying driver (scsi/virtio-scsi)
> may have an issue that leaves the IO request from the md layer stuck in a
> partially completed state. I can't see how this could be related to commit
> bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
> raid5d"").
>
>
> Dan,
>
> Are you able to reproduce this using a regular scsi disk? I would like to
> rule out whether this is related to virtio-scsi.
>
> And I see from the vmcore that the kernel version is 6.8.0-rc5. Is this the
> official mainline v6.8-rc5, without any other patches?
>
>
> Thanks,
>
> Junxiao.
>
> On 2/23/24 6:13 PM, Song Liu wrote:
>> Hi,
>>
>> On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
>> Leemhuis) <regressions@...mhuis.info> wrote:
>>> On 21.02.24 00:06, Dan Moulding wrote:
>>>> Just a friendly reminder that this regression still exists on the
>>>> mainline. It has been reverted in 6.7 stable. But I upgraded a
>>>> development system to 6.8-rc5 today and immediately hit this issue
>>>> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>>> Song Liu, what's the status here? I am aware that you fixed quite a
>>> few regressions recently, but it seems like resolving this one has
>>> stalled. Or were you able to reproduce the issue or make some progress
>>> and I just missed it?
>> Sorry for the delay with this issue. I have been occupied with some
>> other stuff this week.
>>
>> I haven't had any luck reproducing this issue. I will spend more time
>> looking into it next week.
>>
>>> And if not, what's the way forward here with regard to the release of 6.8?
>>> Revert the culprit and try again later? Or is that not an option for one
>>> reason or another?
>> If we don't make progress on it in the next week, we will do the revert,
>> the same as we did with the stable kernels.
>>
>>> Or do we assume that this is not a real issue? That it's caused by some
>>> oddity (bit-flip in the metadata or something like that?) only to be
>>> found in Dan's setup?
>> I don't think this is because of oddities. Hopefully we can get more
>> information about this soon.
>>
>> Thanks,
>> Song
>>
>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>> --
>>> Everything you wanna know about Linux kernel regression tracking:
>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>> If I did something stupid, please tell me, as explained on that page.
>>>
>>> #regzbot poke
>>>
>
>