Message-ID: <CAPhsuW7QHq4e+cHvZcw8c=ePpeSM69UKTEi8P40=-jOZn+YyyA@mail.gmail.com>
Date: Tue, 30 Jan 2024 20:55:39 -0800
From: Song Liu <song@...nel.org>
To: Yu Kuai <yukuai1@...weicloud.com>
Cc: Blazej Kucman <blazej.kucman@...ux.intel.com>, Dan Moulding <dan@...m.net>, carlos@...ica.ufpr.br, 
	gregkh@...uxfoundation.org, junxiao.bi@...cle.com, 
	linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org, 
	regressions@...ts.linux.dev, stable@...r.kernel.org, 
	"yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system;
 successfully bisected

On Tue, Jan 30, 2024 at 6:41 PM Yu Kuai <yukuai1@...weicloud.com> wrote:
>
> Hi, Blazej!
>
> On 2024/01/31 0:26, Blazej Kucman wrote:
> > Hi,
> >
> > On Fri, 26 Jan 2024 08:46:10 -0700
> > Dan Moulding <dan@...m.net> wrote:
> >>
> >> That's a good suggestion, so I switched it to use XFS. The hang
> >> still reproduces, so this is probably a different problem from the
> >> known ext4 one.
> >>
> >
> > Our daily tests targeting mdadm/md also detected a problem with
> > symptoms identical to those described in this thread.
> >
> > The issue was detected with IMSM metadata, but it also reproduces
> > with native metadata.
> > NVMe disks behind a VMD controller were used.
> >
> > Scenario:
> > 1. Create a raid10 array:
> > mdadm --create /dev/md/r10d4s128-15_A --level=10 --chunk=128
> > --raid-devices=4 /dev/nvme6n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme0n1
> > --size=7864320 --run
> > 2. Create a filesystem:
> > mkfs.ext4 /dev/md/r10d4s128-15_A
> > 3. Set one raid member faulty:
> > mdadm --set-faulty /dev/md/r10d4s128-15_A /dev/nvme3n1
> > 4. Stop the raid devices:
> > mdadm -Ss
> >
> > Expected result:
> > The raid stops without kernel hangs or errors.
> >
> > Actual result:
> > The command "mdadm -Ss" hangs,
> > and a hung_task warning appears in the OS.
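
For reference, Blazej's four steps condense to roughly the script below
(a sketch assembled from the commands above; the device names, array
name, and size are specific to his setup and need adjusting elsewhere):

#!/bin/bash
# Reproducer sketch for the reported md flush hang.

# 1. Create the raid10 array.
mdadm --create /dev/md/r10d4s128-15_A --level=10 --chunk=128 \
    --raid-devices=4 /dev/nvme6n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme0n1 \
    --size=7864320 --run

# 2. Create a filesystem (XFS reproduces the hang as well, per Dan).
mkfs.ext4 /dev/md/r10d4s128-15_A

# 3. Fail one member.
mdadm --set-faulty /dev/md/r10d4s128-15_A /dev/nvme3n1

# 4. Stop all arrays; on affected kernels this command never returns
#    and a hung_task warning shows up in dmesg.
mdadm -Ss
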
>
> Can you test the following patch?
>
> Thanks!
> Kuai
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index e3a56a958b47..a8db84c200fe 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -578,8 +578,12 @@ static void submit_flushes(struct work_struct *ws)
>                          rcu_read_lock();
>                  }
>          rcu_read_unlock();
> -       if (atomic_dec_and_test(&mddev->flush_pending))
> +       if (atomic_dec_and_test(&mddev->flush_pending)) {
> +               /* The pair is percpu_ref_get() from md_flush_request() */
> +               percpu_ref_put(&mddev->active_io);
> +
>                  queue_work(md_wq, &mddev->flush_work);
> +       }
>   }
>
>   static void md_submit_flush_data(struct work_struct *ws)

This fixes the issue in my tests. Please submit the official patch.
Also, we should add a test in mdadm/tests to cover this case.
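
Something along these lines could be a starting point (an untested
sketch; $md0 and $dev0..$dev3 follow the conventions of the existing
mdadm/tests harness, and the test name is made up):

# tests/25raid10-stop-faulty (hypothetical name)
# Stopping an array that has a faulty member must not hang.
mdadm -CR $md0 -l10 -c128 -n4 $dev0 $dev1 $dev2 $dev3
mkfs.ext4 $md0
mdadm --set-faulty $md0 $dev3
# On kernels with the regression, the next command never returns and
# the harness times out, failing the test.
mdadm -Ss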

Thanks,
Song
