Message-ID: <CAPhsuW49L8B9K8QFg68v=zG9ywMehUTD18DaG4PexEt-3mzQqQ@mail.gmail.com>
Date: Wed, 24 Jan 2024 16:01:47 -0800
From: Song Liu <song@...nel.org>
To: Dan Moulding <dan@...m.net>, junxiao.bi@...cle.com
Cc: gregkh@...uxfoundation.org, linux-kernel@...r.kernel.org,
linux-raid@...r.kernel.org, regressions@...ts.linux.dev,
stable@...r.kernel.org, yukuai1@...weicloud.com
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system;
successfully bisected
Thanks for the information!

On Tue, Jan 23, 2024 at 3:58 PM Dan Moulding <dan@...m.net> wrote:
>
> > This appears the md thread hit some infinite loop, so I would like to
> > know what it is doing. We can probably get the information with the
> > perf tool, something like:
> >
> > perf record -a
> > perf report
>
> Here you go!
>
> # Total Lost Samples: 0
> #
> # Samples: 78K of event 'cycles'
> # Event count (approx.): 83127675745
> #
> # Overhead Command Shared Object Symbol
> # ........ ............... .............................. ..................................................
> #
> 49.31% md0_raid5 [kernel.kallsyms] [k] handle_stripe
> 18.63% md0_raid5 [kernel.kallsyms] [k] ops_run_io
> 6.07% md0_raid5 [kernel.kallsyms] [k] handle_active_stripes.isra.0
> 5.50% md0_raid5 [kernel.kallsyms] [k] do_release_stripe
> 3.09% md0_raid5 [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> 2.48% md0_raid5 [kernel.kallsyms] [k] r5l_write_stripe
> 1.89% md0_raid5 [kernel.kallsyms] [k] md_wakeup_thread
> 1.45% ksmd [kernel.kallsyms] [k] ksm_scan_thread
> 1.37% md0_raid5 [kernel.kallsyms] [k] stripe_is_lowprio
> 0.87% ksmd [kernel.kallsyms] [k] memcmp
> 0.68% ksmd [kernel.kallsyms] [k] xxh64
> 0.56% md0_raid5 [kernel.kallsyms] [k] __wake_up_common
> 0.52% md0_raid5 [kernel.kallsyms] [k] __wake_up
> 0.46% ksmd [kernel.kallsyms] [k] mtree_load
> 0.44% ksmd [kernel.kallsyms] [k] try_grab_page
> 0.40% ksmd [kernel.kallsyms] [k] follow_p4d_mask.constprop.0
> 0.39% md0_raid5 [kernel.kallsyms] [k] r5l_log_disk_error
> 0.37% md0_raid5 [kernel.kallsyms] [k] _raw_spin_lock_irq
> 0.33% md0_raid5 [kernel.kallsyms] [k] release_stripe_list
> 0.31% md0_raid5 [kernel.kallsyms] [k] release_inactive_stripe_list
It appears the md thread is indeed busy: nearly half of the samples are
in handle_stripe(), with most of the rest in its callees, so md0_raid5
is spinning through stripes rather than sleeping. I haven't had any luck
reproducing this on my hosts. Could you please try the following change
and see whether it fixes the issue (without reverting 0de40f76d567)? I
will keep trying to reproduce the issue on my side.
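
For context on why the thread can spin: the check that 0de40f76d567
reverted made raid5d() back off while a superblock update was pending.
That reverted hunk was roughly of the following shape (a sketch using
identifiers from drivers/md/raid5.c, not necessarily the exact hunk and
not the change being proposed here):

	/*
	 * Sketch: if a superblock update is pending, drop device_lock
	 * and wait for the update to complete instead of re-entering
	 * handle_active_stripes() and spinning in handle_stripe().
	 */
	if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
		spin_unlock_irq(&conf->device_lock);
		wait_event(mddev->sb_wait,
			   !test_bit(MD_SB_CHANGE_PENDING,
				     &mddev->sb_flags));
		spin_lock_irq(&conf->device_lock);
	}

Without a wait of that kind, raid5d() keeps iterating over the same
stuck stripes, which would match the handle_stripe-dominated profile
above.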
Junxiao,
Please also help look into this.
Thanks,
Song