Message-ID: <c4f84de0-ee0e-4e29-a9f5-346823bb3d53@eyal.emu.id.au>
Date: Sat, 23 Dec 2023 10:00:05 +1100
From: eyal@...l.emu.id.au
To: linux-raid@...r.kernel.org, linux-ext4@...r.kernel.org,
carlos@...ica.ufpr.br
Subject: Re: parity raid and ext4 get stuck in writes
On 23/12/23 07:48, Carlos Carvalho wrote:
> This is finally a summary of a long-standing problem. When lots of writes to
> many files are issued in a short time, the kernel gets stuck and stops sending
> write requests to the disks. Sometimes it recovers and finally sends the
> modified pages to permanent storage; sometimes it does not, and eventually other
> functions degrade and the machine crashes.
>
> A simple way to reproduce: expand a kernel source tree, like
> xzcat linux-6.5.tar.xz | tar x -f -
>
> With the default vm settings for dirty_background_ratio and dirty_ratio this
> will finish quickly with ~1.5GB of dirty pages in ram and ~100k inodes to be
> written and the kernel gets stuck.
>
> The bug exists in all 6.* kernels; I've tested the latest release of all
> 6.[1-6]. However some conditions must exist for the problem to appear:
>
> - there must be many inodes to be flushed; just many bytes in a few files don't
> show the problem
> - it happens only with ext4 on a parity raid array
This may be unrelated, but there is an open problem that looks somewhat similar.
It is tracked at
https://bugzilla.kernel.org/show_bug.cgi?id=217965
If your fs is mounted with a non-zero 'stripe=' (as RAID arrays usually are),
try to get around the issue with
$ sudo mount -o remount,stripe=0 YourFS
If it makes a difference then you may be looking at a similar issue.
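For reference, a quick way to see whether a mount carries a stripe= option is to look at its entry in /proc/mounts. The sketch below parses a sample mount line rather than a live system; the device and mount point shown are made up for illustration:

```shell
# Hypothetical mount line; on a real system you would read /proc/mounts:
#   grep ' ext4 ' /proc/mounts
line='/dev/md0 /data ext4 rw,relatime,stripe=640 0 0'

# Extract the numeric stripe value from the options field
stripe=$(printf '%s\n' "$line" | grep -o 'stripe=[0-9]*' | cut -d= -f2)
echo "stripe=$stripe"

# If non-zero, the workaround above would be:
#   sudo mount -o remount,stripe=0 /data
```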
> I've moved one of our arrays to xfs and everything works fine, so it's either
> specific to ext4 or xfs is not affected. When the lockup happens the flush
> kworker starts using 100% cpu permanently. I have not observed the bug in
> raid10, only in raid[56].
>
> The problem is more easily triggered with 6.[56] but 6.1 is also affected.
The issue was seen in kernels 6.5 and later but not in 6.4, so maybe not the same thing.
> Limiting dirty_bytes and dirty_background_bytes to low values reduces the
> probability of lockup, probably because the process generating writes is
> stopped before too many files are created.
HTH
--
Eyal at Home (eyal@...l.emu.id.au)