Message-ID: <ZYX2AS8isUHtbMXe@fisica.ufpr.br>
Date: Fri, 22 Dec 2023 17:48:01 -0300
From: Carlos Carvalho <carlos@...ica.ufpr.br>
To: linux-ext4@...r.kernel.org, linux-raid@...r.kernel.org
Subject: parity raid and ext4 get stuck in writes

This is, finally, a summary of a long-standing problem. When many writes to
many files are issued in a short time, the kernel gets stuck and stops sending
write requests to the disks. Sometimes it recovers and eventually writes the
modified pages to permanent storage; sometimes it does not, other functions
gradually degrade, and the machine crashes.

A simple way to reproduce it is to unpack a kernel source tree, for example:

    xzcat linux-6.5.tar.xz | tar x -f -

With the default vm settings for dirty_background_ratio and dirty_ratio, the
extraction finishes quickly, leaving ~1.5GB of dirty pages in RAM and ~100k
inodes to be written, and then the kernel gets stuck.
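
To watch the dirty pages accumulate (or fail to drain) during the extraction, something like the following can be used. It is only a sketch: the function name dirty_summary is made up for illustration, and it assumes the usual "Dirty:" and "Writeback:" lines of /proc/meminfo.

```shell
#!/bin/sh
# Print the Dirty and Writeback counters (in kB) from meminfo-formatted input.
# Reads stdin, so it works on /proc/meminfo or on a saved copy of it.
# dirty_summary is a hypothetical helper name, not an existing tool.
dirty_summary() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { printf "%s %s kB\n", $1, $2 }'
}

# Take one sample if /proc/meminfo is available on this system.
if [ -r /proc/meminfo ]; then
    dirty_summary < /proc/meminfo
fi
```

Running it in a loop (for example once per second) while the tar extraction is in progress should show whether Dirty keeps shrinking or stays flat after the extraction ends.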

The bug exists in all 6.* kernels; I've tested the latest release of each of
6.[1-6]. However, some conditions must hold for the problem to appear:

- there must be many inodes to be flushed; many bytes in just a few files
  don't trigger the problem
- it happens only with ext4 on a parity raid array

I've moved one of our arrays to xfs and everything works fine, so the problem
is either specific to ext4 or at least does not affect xfs. When the lockup
happens, the flush kworker starts using 100% CPU permanently. I have not
observed the bug on raid10, only on raid[56].

The problem is more easily triggered with 6.[56], but 6.1 is also affected.

Limiting dirty_bytes and dirty_background_bytes to low values reduces the
probability of a lockup, probably because the process generating the writes is
stopped before too many files are created.
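
As a sketch of that workaround, the helper below prints the sysctl commands for a given pair of limits. The function name dirty_limit_cmds and the 256 MiB / 64 MiB values are assumptions chosen for illustration, not the values used on the affected machines; nothing is applied until the output is run as root.

```shell
#!/bin/sh
# Print sysctl commands that cap dirty memory at the given sizes in MiB.
# Nothing is applied here; pipe the output to "sh" as root to apply it.
# dirty_limit_cmds is a hypothetical helper name.
dirty_limit_cmds() {
    dirty_mib=$1
    background_mib=$2
    echo "sysctl -w vm.dirty_bytes=$((dirty_mib * 1024 * 1024))"
    echo "sysctl -w vm.dirty_background_bytes=$((background_mib * 1024 * 1024))"
}

# Illustrative values only: 256 MiB hard limit, 64 MiB background threshold.
dirty_limit_cmds 256 64
```

Note that writing vm.dirty_bytes makes vm.dirty_ratio read as 0 (and likewise for the background pair), since only one of each pair is in effect at a time.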
