Message-ID: <20251208083246.320965-3-yukuai@fnnas.com>
Date: Mon, 8 Dec 2025 16:32:46 +0800
From: Yu Kuai <yukuai@...as.com>
To: tytso@....edu,
adilger.kernel@...ger.ca,
linux-ext4@...r.kernel.org
Cc: linux-kernel@...r.kernel.org,
yukuai@...as.com
Subject: [PATCH 2/2] ext4: align preallocation size to stripe width
When stripe width (io_opt) is configured, align the predicted
preallocation size to stripe boundaries. This ensures optimal I/O
performance on RAID and other striped storage devices by avoiding
partial stripe operations.
The current implementation uses hardcoded size predictions (16KB, 32KB,
64KB, etc.) that are not stripe-aware. As a result, physical block
offsets on disk can be misaligned with stripe boundaries, leading to
read-modify-write penalties and reduced performance on RAID arrays.
This patch makes size prediction stripe-aware by using multiples of
stripe size (1x, 2x, 4x, 8x, 16x, 32x) when s_stripe is set.
Additionally, the start offset is aligned to stripe boundaries using
rounddown(), which works correctly for both power-of-2 and non-power-of-2
stripe sizes. For devices without stripe configuration, the original
behavior is preserved.
The predicted size is limited to max free chunk size (2 << bsbits) to
ensure reasonable allocation requests, with the limit rounded down to
maintain stripe alignment.
Test case:
Device: 32-disk RAID5, 64KB chunk size
Stripe: 496 blocks (31 data disks × 16 blocks/disk)
Before patch (misaligned physical offsets):
ext: logical_offset: physical_offset: length:
0: 0.. 63487: 34816.. 98303: 63488
1: 63488..126975: 100352..163839: 63488
2: 126976..190463: 165888..229375: 63488
3: 190464..253951: 231424..294911: 63488
4: 253952..262143: 296960..305151: 8192
Physical offsets: 34816 % 496 = 96 (misaligned)
100352 % 496 = 160 (misaligned)
165888 % 496 = 224 (misaligned)
→ Causes partial stripe writes on RAID
After patch (aligned physical offsets):
ext: logical_offset: physical_offset: length:
0: 0.. 17855: 9920.. 27775: 17856
1: 17856.. 42159: 34224.. 58527: 24304
2: 42160.. 73407: 65968.. 97215: 31248
3: 73408.. 97711: 99696..123999: 24304
... (all extents aligned until EOF)
Physical offsets: 9920 % 496 = 0 (aligned)
34224 % 496 = 0 (aligned)
65968 % 496 = 0 (aligned)
Extent lengths: 17856=496×36, 24304=496×49, 31248=496×63
→ Optimal RAID performance, no partial stripe writes
Benefits:
- Eliminates read-modify-write operations on RAID arrays
- Improves sequential write performance on striped devices
- Maintains proper alignment throughout file lifetime
- Works with any stripe size (power-of-2 or not)
Signed-off-by: Yu Kuai <yukuai@...as.com>
---
fs/ext4/mballoc.c | 60 +++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 58 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index eb46a4f5fb4f..dbd0b239cc96 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4500,7 +4500,10 @@ static inline bool ext4_mb_check_size(loff_t req, loff_t size,
/*
* Predict file size for preallocation. Returns the predicted size
- * in bytes and sets start_off if alignment is needed for large files.
+ * in bytes. When stripe width (io_opt) is configured, returns sizes
+ * that are multiples of stripe for optimal RAID performance.
+ *
+ * Sets start_off if alignment is needed for large files.
*/
static loff_t ext4_mb_predict_file_size(struct ext4_sb_info *sbi,
struct ext4_allocation_context *ac,
@@ -4511,6 +4514,59 @@ static loff_t ext4_mb_predict_file_size(struct ext4_sb_info *sbi,
*start_off = 0;
+ /*
+ * For RAID/striped devices, align preallocation size to stripe
+ * width (io_opt) for optimal I/O performance. Use power-of-2
+ * multiples of stripe size for size prediction.
+ */
+ if (sbi->s_stripe) {
+ loff_t stripe_bytes = (loff_t)sbi->s_stripe << bsbits;
+ loff_t max_size = (loff_t)max << bsbits;
+
+ /*
+ * If the stripe is larger than the max free chunk size, we can't
+ * do stripe-aligned allocation. Fall back to traditional
+ * size prediction. This can happen with very large stripe
+ * configurations on small block sizes.
+ */
+ if (stripe_bytes > max_size)
+ goto no_stripe;
+
+ if (size <= stripe_bytes) {
+ size = stripe_bytes;
+ } else if (size <= stripe_bytes * 2) {
+ size = stripe_bytes * 2;
+ } else if (size <= stripe_bytes * 4) {
+ size = stripe_bytes * 4;
+ } else if (size <= stripe_bytes * 8) {
+ size = stripe_bytes * 8;
+ } else if (size <= stripe_bytes * 16) {
+ size = stripe_bytes * 16;
+ } else if (size <= stripe_bytes * 32) {
+ size = stripe_bytes * 32;
+ } else {
+ size = roundup(size, stripe_bytes);
+ }
+
+ /*
+ * Limit size to max free chunk size, rounded down to
+ * stripe alignment.
+ */
+ if (size > max_size)
+ size = rounddown(max_size, stripe_bytes);
+
+ /*
+ * Align start offset to stripe boundary for large allocations
+ * to ensure both start and size are stripe-aligned.
+ */
+ *start_off = rounddown((loff_t)ac->ac_o_ex.fe_logical << bsbits,
+ stripe_bytes);
+
+ return size;
+ }
+
+no_stripe:
+ /* No stripe: use traditional hardcoded size prediction */
if (size <= 16 * 1024) {
size = 16 * 1024;
} else if (size <= 32 * 1024) {
@@ -4556,7 +4612,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
{
struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
struct ext4_super_block *es = sbi->s_es;
- int bsbits, max;
+ int bsbits;
loff_t size, start_off = 0, end;
loff_t orig_size __maybe_unused;
ext4_lblk_t start;
--
2.51.0