Message-ID: <20250806144650.GA778805@mit.edu>
Date: Wed, 6 Aug 2025 10:46:50 -0400
From: "Theodore Ts'o" <tytso@....edu>
To: Mingyu He <mingyu.he@...pee.com>
Cc: Andreas Dilger <adilger.kernel@...ger.ca>, linux-ext4@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [BUG] ext4: mballoctor issue observed in fs/ext4/mballoc.c
ext4_mb_regular_allocator on kernel 6.6

On Wed, Aug 06, 2025 at 04:26:49PM +0800, Mingyu He wrote:
> Hi EXT4 maintainers,
>
> I would like to report a potential bug related to the ext4 allocator
> implementation, specifically in the file `fs/ext4/mballoc.c`.

Yeah, this is a known issue with ext4's RAID support. The problem is
that we're trying too hard to find a precise RAID stripe alignment.
There are a couple of things that could be done to solve the issue,
but none of them are easy.

* Cache the number of stripe aligned regions in a particular block
  group, so we can skip the block groups where searching for a
  stripe alignment is hopeless. This will reduce the CPU time spent
  searching all of the block groups for each alignment, but on a
  freshly mounted disk, initial allocations will still be slow since
  we would need to read the block allocation bitmaps into memory and
  then process them. We would also need to keep the cache of the
  number of stripe aligned regions to a minimum.

* Have a hard limit on the amount of time (either wall clock time or
  CPU time) spent searching for stripe aligned bitmaps. If none are
  available, bail out early.

* Use a more efficient in-memory data structure for storing the free
  block information. Today, we use a buddy bitmap, which is great if
  we are doing power of two allocations (which for non-RAID file
  systems, we try to do whenever possible, up to trying to allocate
  more space than what was asked for in case the user tries to append
  to the file later). If the RAID stripe size is power-of-two
  aligned, the buddy bitmap would be fine, but very often, that isn't
  the case. This still requires initially reading the block bitmap
  into memory in order to convert to that more efficient in-memory
  data structure, but it is simpler than...

* Use a more efficient on-disk data structure, such as a b-tree. This
  requires an on-disk format change, which means we would need to
  update e2fsprogs, and we would have to worry about backwards
  compatibility in case the file system needs to be mounted on an
  older kernel.
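
To make the first two options a bit more concrete, here is a rough
userspace sketch in Python. It is not kernel code and does not reflect
how mballoc.c is structured; every name in it is hypothetical. It
assumes each block group is a plain list of booleans (True means the
block is in use), that alignment is measured from the start of the
group (a simplification), and that a per-group count of stripe aligned
free regions has already been computed. The search skips groups whose
cached count is zero and gives up once a time budget is spent.

```python
import time
from typing import Callable, List, Optional, Sequence, Tuple


def count_stripe_aligned_free(bitmap: Sequence[bool], stripe: int) -> int:
    """Count stripe-sized, stripe-aligned runs of free blocks in one group.

    bitmap[i] is True when block i is in use (hypothetical representation).
    """
    count = 0
    for start in range(0, len(bitmap) - stripe + 1, stripe):
        if not any(bitmap[start:start + stripe]):
            count += 1
    return count


def first_stripe_aligned_free(bitmap: Sequence[bool], stripe: int) -> Optional[int]:
    """Return the offset of the first free stripe-aligned run, or None."""
    for start in range(0, len(bitmap) - stripe + 1, stripe):
        if not any(bitmap[start:start + stripe]):
            return start
    return None


def find_stripe_aligned(
    groups: List[Sequence[bool]],
    stripe: int,
    aligned_counts: List[int],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[Tuple[int, int]]:
    """Return (group, offset), or None if nothing was found within the budget."""
    deadline = clock() + budget_seconds
    for group, bitmap in enumerate(groups):
        if aligned_counts[group] == 0:
            continue  # cached count says this group is hopeless; skip the scan
        if clock() >= deadline:
            return None  # budget spent; caller falls back to an unaligned allocation
        offset = first_stripe_aligned_free(bitmap, stripe)
        if offset is not None:
            return (group, offset)
    return None


# Three tiny 12-block groups and a non-power-of-two stripe of 6 blocks.
groups = [
    [True] * 12,                          # completely full
    [False] * 3 + [True] + [False] * 8,   # first stripe broken, second free
    [True] * 6 + [False] * 6,             # second stripe free
]
stripe = 6
counts = [count_stripe_aligned_free(b, stripe) for b in groups]  # [0, 1, 1]
hit = find_stripe_aligned(groups, stripe, counts, budget_seconds=1.0)  # (1, 6)
```

Note that the counts have to be computed from the bitmaps in the first
place, which is the "freshly mounted disk" cost mentioned above; the
sketch only shows what the cache buys once it is populated.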

If someone is interested in working on these options (which I view as
a new feature, not as a bug fix), please contact me and I'm happy to
discuss further.

Alternatively, a workaround is to simply disable the RAID stripe
information in the superblock. You can do this via "tune2fs -E
stripe_width 0 /dev/sdXX". For a file system which is fragmented such
that finding stripe aligned free space is hopeless, this isn't going
to hurt, and it will definitely help. In the most recent version of
e2fsprogs, this is now the default in mke2fs for non-rotational (e.g.,
thin provisioned, or flash based) storage devices:

commit b61f182b2de1ea75cff935037883ba1a8c7db623
Author: Theodore Ts'o <tytso@....edu>
Date:   Sun May 4 14:07:14 2025 -0400

    mke2fs: don't set the raid stripe for non-rotational devices by default

    The ext4 block allocator is not at all efficient when it is asked to
    enforce RAID alignment. It is especially bad for flash-based devices,
    or when the file system is highly fragmented. For non-rotational
    devices, it's fine to set the stride parameter (which controls
    spreading the allocation bitmaps across the RAID component devices,
    which always makes sense); but for the stripe parameter (which asks the
    ext4 block allocator to try _very_ hard to find RAID stripe aligned
    devices) it's probably not a good idea.

    Add new mke2fs.conf parameters with the defaults:

    [defaults]
       set_raid_stride = always
       set_raid_stripe = disk

    Even for RAID arrays based on HDD's, we can still have problems for
    highly fragmented file systems. This will need to be solved in the
    kernel, probably by having some kind of wall clock or CPU time
    limitation for each block allocation or adding some kind of
    optimization which is faster than using our current buddy bitmap
    implementation, especially if the stripe size is not a multiple of a
    power of two. But for SSD's, it's much less likely to make sense even
    if we have an optimized block allocator, because if you've paid $$$
    for a flash-based RAID array, the cost/benefit tradeoffs of doing less
    optimized stripe RMW cycles versus the block allocator time and CPU
    overhead is harder to justify without a lot of optimization effort.

    If and when we can improve the ext4 kernel implementation (and it gets
    rolled out to users using LTS kernels), we can change the defaults.
    And of course, system administrators can always change
    /etc/mke2fs.conf settings.

    Signed-off-by: Theodore Ts'o <tytso@....edu>
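
For anyone who wants to check whether an existing file system has the
stripe width set before applying the tune2fs workaround, a minimal
read-only sketch in Python follows. The field offsets (s_raid_stride
at 0x164, s_raid_stripe_width at 0x170, s_magic at 0x38, superblock at
byte 1024) are taken from my reading of the kernel's ext4 on-disk
layout documentation and should be treated as assumptions to verify;
the function names are made up for illustration. "dumpe2fs -h" should
report the same values and is the safer thing to trust.

```python
import struct

EXT4_SB_OFFSET = 1024          # primary superblock starts 1024 bytes into the device
EXT4_MAGIC = 0xEF53            # expected value of s_magic
OFF_MAGIC = 0x38               # s_magic, __le16
OFF_RAID_STRIDE = 0x164        # s_raid_stride, __le16 (assumed offset)
OFF_RAID_STRIPE_WIDTH = 0x170  # s_raid_stripe_width, __le32 (assumed offset)


def parse_raid_fields(sb: bytes) -> dict:
    """Extract the RAID stride and stripe width from a raw superblock buffer."""
    if len(sb) < OFF_RAID_STRIPE_WIDTH + 4:
        raise ValueError("superblock buffer too short")
    (magic,) = struct.unpack_from("<H", sb, OFF_MAGIC)
    if magic != EXT4_MAGIC:
        raise ValueError("bad magic 0x%04x, not an ext2/3/4 superblock" % magic)
    (stride,) = struct.unpack_from("<H", sb, OFF_RAID_STRIDE)
    (stripe_width,) = struct.unpack_from("<I", sb, OFF_RAID_STRIPE_WIDTH)
    return {"stride": stride, "stripe_width": stripe_width}


def read_raid_fields(path: str) -> dict:
    """Read the primary superblock from a device or image file (read-only)."""
    with open(path, "rb") as dev:
        dev.seek(EXT4_SB_OFFSET)
        return parse_raid_fields(dev.read(1024))
```

A stripe_width of 0 means the allocator has no stripe to align to,
which is the state the tune2fs command above puts the file system in.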

Cheers,

- Ted