Message-ID: <c4f84de0-ee0e-4e29-a9f5-346823bb3d53@eyal.emu.id.au>
Date: Sat, 23 Dec 2023 10:00:05 +1100
From: eyal@...l.emu.id.au
To: linux-raid@...r.kernel.org, linux-ext4@...r.kernel.org, carlos@...ica.ufpr.br
Subject: Re: parity raid and ext4 get stuck in writes

On 23/12/23 07:48, Carlos Carvalho wrote:
> This is finally a summary of a long-standing problem. When lots of writes to
> many files are sent in a short time, the kernel gets stuck and stops sending
> write requests to the disks. Sometimes it recovers and finally sends the
> modified pages to permanent storage; sometimes it does not, and eventually
> other functions degrade and the machine crashes.
>
> A simple way to reproduce: expand a kernel source tree, like
>     xzcat linux-6.5.tar.xz | tar x -f -
>
> With the default vm settings for dirty_background_ratio and dirty_ratio this
> will finish quickly with ~1.5GB of dirty pages in ram and ~100k inodes to be
> written, and the kernel gets stuck.
>
> The bug exists in all 6.* kernels; I've tested the latest release of all
> 6.[1-6]. However, some conditions must exist for the problem to appear:
>
> - there must be many inodes to be flushed; just many bytes in a few files
>   don't show the problem
> - it happens only with ext4 on a parity raid array

This may be unrelated, but there is an open problem that looks somewhat
similar. It is tracked at
    https://bugzilla.kernel.org/show_bug.cgi?id=217965

If your fs is mounted with a non-zero 'stripe=' (as RAID arrays usually are),
try to get around the issue with
    $ sudo mount -o remount,stripe=0 YourFS
If it makes a difference then you may be looking at a similar issue.

> I've moved one of our arrays to xfs and everything works fine, so it's either
> specific to ext4 or xfs is not affected. When the lockup happens the flush
> kworker starts using 100% cpu permanently. I have not observed the bug in
> raid10, only in raid[56].
>
> The problem is more easily triggered with 6.[56] but 6.1 is also affected.

The issue was seen in kernels 6.5 and later but not in 6.4, so maybe it is not
the same thing.

> Limiting dirty_bytes and dirty_background_bytes to low values reduces the
> probability of lockup, probably because the process generating writes is
> stopped before too many files are created.

HTH

-- 
Eyal at Home (eyal@...l.emu.id.au)
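For anyone trying the dirty-limit mitigation mentioned above, the limits can be
changed at run time with the usual vm sysctls; the byte values below are only
illustrative, not values known to avoid this lockup:

    $ sudo sysctl -w vm.dirty_background_bytes=67108864   # 64 MiB
    $ sudo sysctl -w vm.dirty_bytes=268435456              # 256 MiB

Writing the *_bytes knobs takes precedence over dirty_background_ratio and
dirty_ratio (the kernel zeroes whichever counterpart is not in use), so the
ratio settings do not need to be touched as well. Whether a filesystem is
currently mounted with a non-zero 'stripe=' option can be seen in /proc/mounts,
e.g.
    $ grep ' ext4 ' /proc/mounts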