Message-Id: <9E89B4B6-BF01-4B6B-94BA-8A44112F8A3A@dilger.ca>
Date: Fri, 6 Feb 2026 16:53:28 -0700
From: Andreas Dilger <adilger@...ger.ca>
To: Mario Lohajner <mario_lohajner@...ketmail.com>
Cc: Baokun Li <libaokun1@...wei.com>,
tytso@....edu,
linux-ext4@...r.kernel.org,
linux-kernel@...r.kernel.org,
Yang Erkun <yangerkun@...wei.com>,
libaokun9@...il.com
Subject: Re: [PATCH] ext4: add optional rotating block allocation policy
On Feb 5, 2026, at 05:23, Mario Lohajner <mario_lohajner@...ketmail.com> wrote:
>
> Let me briefly restate the intent, focusing on the fundamentals.
>
> Rotalloc is not wear leveling (and is intentionally not named as such).
> It is an allocation policy whose goal is to reduce allocation hotspots by
> enforcing mount-wide sequential allocation. Wear leveling, if any,
> remains a device/firmware concern and is explicitly out of scope.
>
> While WL motivated part of this work, the main added value of this
> patch is allocator separation.
> The policy indirection (aka vectored allocator) allows allocation
> strategies that are orthogonal to the regular allocator to operate
> outside the hot path, preserving existing heuristics and improving
> maintainability. Rotalloc is one such case; when disabled (the
> default), there is literally no behavioral change.
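>
> As a rough illustration of the indirection (a sketch only, built from
> the names declared in the patch; the actual super.c and mballoc.c hunks
> may differ in detail):
>
>     /* at mount time, in super.c */
>     if (test_opt(sb, ROTALLOC)) {
>         sbi->s_rotalloc_cursor = 0;
>         spin_lock_init(&sbi->s_rotalloc_lock);
>         sbi->s_mb_new_blocks = ext4_mb_rotating_allocator;
>     } else {
>         sbi->s_mb_new_blocks = ext4_mb_regular_allocator;
>     }
>
>     /* in ext4_mb_new_blocks(), the hot path simply calls the vector */
>     err = sbi->s_mb_new_blocks(ac);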
>
> The cursor is global because the policy is global (KISS applied here :-).
> It operates at the mount level, not per inode, stream, or CPU.
> Splitting the cursor would still require keeping all cursors in sync;
> once any allocator diverges, it either has to scan more blocks or
> sequentiality collapses back into locality and hotspots reappear.
> A single cursor is therefore intentional.
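>
> A minimal sketch of the cursor update (the helper name is hypothetical;
> the fields are the ones added by the patch, and the patch reads the
> cursor at allocator entry and advances it only after a successful
> allocation):
>
>     /* advance the mount-wide cursor past the group just used */
>     static void rotalloc_advance(struct ext4_sb_info *sbi,
>                                  ext4_group_t used, ext4_group_t ngroups)
>     {
>         spin_lock(&sbi->s_rotalloc_lock);
>         sbi->s_rotalloc_cursor = (used + 1) % ngroups;
>         spin_unlock(&sbi->s_rotalloc_lock);
>     }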
>
> The rotating allocator itself is a working prototype.
> It was written with minimal diff and clarity in mind to make the policy
> reviewable. Refinements and simplifications are expected and welcome.
>
> Regarding discard/trim: while discard prepares blocks for reuse and
> signals that a block is free, it does not implement wear leveling by
> itself. Rotalloc operates at a higher layer; by promoting sequentiality,
> it reduces block/group allocation hotspots regardless of underlying
> device behavior.

I think there are two main reasons why a round-robin allocation policy is
of interest:
- maximizing the time between block group re-use also maximizes the time
during which blocks can be freed in a group, reducing fragmentation
- this also improves the ability of fstrim to process larger segments of
  free storage compared to "-o discard", which is sub-optimal when
  processing recently-freed blocks (they may be small/unaligned relative
  to the erase blocks/RAID stripes of the storage, so nothing may
  actually be trimmed)
The latter is what motivated the "-o fstrim" feature[1] that we developed
for Lustre as a lighter-weight alternative to "-o discard": it uses the
fstrim infrastructure on an ongoing basis to trim whole groups that have
recently freed space, without tracking all of the freed extents
individually.
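
As a rough sketch of the idea behind [1] (illustrative only: mainline
already tracks a per-group "was trimmed" bit in memory for FITRIM, the
linked patches persist it on disk and drive it from a mount option, and
the trim helper below is a placeholder rather than code from those
patches):

    /* inside the per-group loop of an fstrim pass, roughly: */
    if (EXT4_MB_GRP_WAS_TRIMMED(grp))
        continue;               /* trimmed and nothing freed since */
    ret = trim_free_extents_in_group(sb, group, minlen);  /* placeholder */
    if (ret >= 0)
        EXT4_MB_GRP_SET_TRIMMED(grp);
    /* freeing blocks in a group clears the bit, so the next periodic
     * pass revisits only groups with recently-freed space */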
Cheers, Andreas
[1] Patches to implement persistent TRIMMED flag, stats, and "fstrim":
https://github.com/lustre/lustre-release/blob/master/ldiskfs/kernel_patches/patches/rhel10.0/ext4-introduce-EXT4_BG_TRIMMED-to-optimize-fstrim.patch
https://github.com/lustre/lustre-release/blob/master/ldiskfs/kernel_patches/patches/rhel10.0/ext4-add-DISCARD-stats.patch
https://github.com/lustre/lustre-release/blob/master/ldiskfs/kernel_patches/patches/rhel9.4/ext4-add-fstrim-mount-option.patch
> Since it is not in line with the current allocator goals, it is
> implemented as an optional policy.
>
> Best regards,
> Mario
>
> PS: thank you for acknowledging that there are workloads and scenarios
> where this method is worthwhile :-).
>
> On 05. 02. 2026. 04:52, Baokun Li wrote:
>> On 2026-02-04 19:06, Mario Lohajner wrote:
>>> Hello Baokun Li,
>>>
>>> This response was originally intended for Andreas.
>>> I'm sending you the full copy to provide context for your query,
>>> rather than writing a separate response.
>>>
>>> Yes, the main motive for this allocator is flash wear leveling,
>>> but it is not strictly a wear leveling mechanism, and it is not named
>>> as such for a reason.
>>> Wear leveling may (or may not) exist at the device/hardware level.
>>> The goal of this policy is not to "fix" that.
>>>
>> As Ted mentioned in another thread, wear leveling is media-dependent.
>> Most drives can handle wear leveling effectively enough just via the
>> discard command.
>>
>> If you are using UFS, F2FS might be a solid choice. However, for raw
>> NAND flash, UBIFS (via UBI) or JFFS2 would be more appropriate.
>>
>> A single global goal would cause severe contention in multi-CPU
>> scenarios, which is precisely why the stream allocation goal was split
>> into multiple ones.
>>
>> Furthermore, constantly overriding the inode goal leads to significant
>> file fragmentation, as it often misses opportunities for extent merging.
>>
>> If we truly want to implement ext4_mb_rotating_allocator, we should strip
>> out inode goal, stream allocation, and optimize_scan, rather than simply
>> cloning ext4_mb_regular_allocator and forcing a goal setting.
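>>
>> For illustration, the main loop of such a stripped-down allocator could
>> be roughly as small as the sketch below (assuming ext4_mb_scan_group(ac,
>> group) as in current mballoc.c, and ignoring the criteria stages and
>> error handling; this is not code from the patch):
>>
>>     spin_lock(&sbi->s_rotalloc_lock);
>>     goal = sbi->s_rotalloc_cursor;
>>     spin_unlock(&sbi->s_rotalloc_lock);
>>
>>     for (i = 0; i < ngroups; i++) {
>>         group = (goal + i) % ngroups;
>>         err = ext4_mb_scan_group(ac, group);
>>         if (err || ac->ac_status == AC_STATUS_FOUND)
>>             break;
>>     }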
>>
>>
>> Cheers,
>> Baokun
>>
>>> This policy helps avoid allocation hotspots at mount start by
>>> distributing allocations sequentially across the entire mount,
>>> not just a file or allocation stream.
>>>
>>> At the block/group allocation level, the file system is fairly stochastic
>>> and timing-sensitive. Rather than providing raw benchmark data, I prefer
>>> to explain the design analytically:
>>> The vectored separation of the new allocator ensures that the performance
>>> of the regular allocator is maintained (literally unchanged).
>>> The overhead of the new rotating allocator is minimal and occurs outside
>>> of the "hot loop": the cursor is retrieved early at the start and updated
>>> upon successful allocation, which is negligible with respect to IO latency.
>>> Because allocations proceed sequentially, latency is comparable to, or
>>> better than, that of the regular allocator.
>>> Having separate allocators increases maintainability and independence
>>> with minimal (virtually no) overhead.
>>>
>>> This policy benefits workloads with frequent large or small allocations,
>>> while keeping file fragmentation and slack space minimal.
>>> It is a conscious trade-off: sacrificing locality in favor of reinforced
>>> sequentiality.
>>> Of course, this is not optimal for classic HDDs, but NVMe drives behave
>>> differently.
>>> For this reason, the policy is optional per mount, turned off by default,
>>> and can be toggled at mount time.
>>>
>>> Best regards,
>>> Mario
>>>
>>> On 04. 02. 2026. 07:29, Baokun Li wrote:
>>>> On 2026-02-04 11:31, Mario Lohajner wrote:
>>>>> Add support for the rotalloc allocation policy as a new mount
>>>>> option. The policy rotates the starting block group for new allocations.
>>>>>
>>>>> Changes:
>>>>> - fs/ext4/ext4.h
>>>>> rotalloc policy declared, extend sb with cursor, vector & lock
>>>>>
>>>>> - fs/ext4/mballoc.h
>>>>> expose allocator functions for vectoring in super.c
>>>>>
>>>>> - fs/ext4/super.c
>>>>> parse rotalloc mnt opt, init cursor, lock and allocator vector
>>>>>
>>>>> - fs/ext4/mballoc.c
>>>>> add rotalloc allocator, vectored allocator call in new_blocks
>>>>>
>>>>> The policy is selected via a mount option and does not change the
>>>>> on-disk format or default allocation behavior. It preserves existing
>>>>> allocation heuristics within a block group while distributing
>>>>> allocations across block groups in a deterministic sequential manner.
>>>>>
>>>>> The rotating allocator is implemented as a separate allocation path
>>>>> selected at mount time. This avoids conditional branches in the regular
>>>>> allocator and keeps allocation policies isolated.
>>>>> This also allows the rotating allocator to evolve independently in the
>>>>> future without increasing complexity in the regular allocator.
>>>>>
>>>>> The policy was tested using v6.18.6 stable locally with the new mount
>>>>> option "rotalloc" enabled, confirmed working as desribed!
>>>>>
>>>>> Signed-off-by: Mario Lohajner <mario_lohajner@...ketmail.com>
>>>>> ---
>>>>> fs/ext4/ext4.h | 8 +++
>>>>> fs/ext4/mballoc.c | 152 ++++++++++++++++++++++++++++++++++++++++++++--
>>>>> fs/ext4/mballoc.h | 3 +
>>>>> fs/ext4/super.c | 18 +++++-
>>>>> 4 files changed, 175 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>>> index 56112f201cac..cbbb7c05d7a2 100644
>>>>> --- a/fs/ext4/ext4.h
>>>>> +++ b/fs/ext4/ext4.h
>>>>> @@ -229,6 +229,9 @@ struct ext4_allocation_request {
>>>>> unsigned int flags;
>>>>> };
>>>>> +/* expose rotalloc allocator argument pointer type */
>>>>> +struct ext4_allocation_context;
>>>>> +
>>>>> /*
>>>>> * Logical to physical block mapping, used by ext4_map_blocks()
>>>>> *
>>>>> @@ -1230,6 +1233,7 @@ struct ext4_inode_info {
>>>>> * Mount flags set via mount options or defaults
>>>>> */
>>>>> #define EXT4_MOUNT_NO_MBCACHE 0x00001 /* Do not use mbcache */
>>>>> +#define EXT4_MOUNT_ROTALLOC 0x00002 /* Use rotalloc policy/allocator */
>>>>> #define EXT4_MOUNT_GRPID 0x00004 /* Create files with directory's group */
>>>>> #define EXT4_MOUNT_DEBUG 0x00008 /* Some debugging messages */
>>>>> #define EXT4_MOUNT_ERRORS_CONT 0x00010 /* Continue on errors */
>>>>> @@ -1559,6 +1563,10 @@ struct ext4_sb_info {
>>>>> unsigned long s_mount_flags;
>>>>> unsigned int s_def_mount_opt;
>>>>> unsigned int s_def_mount_opt2;
>>>>> + /* Rotalloc cursor, lock & new_blocks allocator vector */
>>>>> + unsigned int s_rotalloc_cursor;
>>>>> + spinlock_t s_rotalloc_lock;
>>>>> + int (*s_mb_new_blocks)(struct ext4_allocation_context *ac);
>>>>> ext4_fsblk_t s_sb_block;
>>>>> atomic64_t s_resv_clusters;
>>>>> kuid_t s_resuid;
>>>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>>>>> index 56d50fd3310b..74f79652c674 100644
>>>>> --- a/fs/ext4/mballoc.c
>>>>> +++ b/fs/ext4/mballoc.c
>>>>> @@ -2314,11 +2314,11 @@ static void ext4_mb_check_limits(struct ext4_allocation_context *ac,
>>>>> * stop the scan and use it immediately
>>>>> *
>>>>> * * If free extent found is smaller than goal, then keep retrying
>>>>> - * upto a max of sbi->s_mb_max_to_scan times (default 200). After
>>>>> + * up to a max of sbi->s_mb_max_to_scan times (default 200). After
>>>>> * that stop scanning and use whatever we have.
>>>>> *
>>>>> * * If free extent found is bigger than goal, then keep retrying
>>>>> - * upto a max of sbi->s_mb_min_to_scan times (default 10) before
>>>>> + * up to a max of sbi->s_mb_min_to_scan times (default 10) before
>>>>> * stopping the scan and using the extent.
>>>>> *
>>>>> *
>>>>> @@ -2981,7 +2981,7 @@ static int ext4_mb_scan_group(struct ext4_allocation_context *ac,
>>>>> return ret;
>>>>> }
>>>>> -static noinline_for_stack int
>>>>> +noinline_for_stack int
>>>>> ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>>>> {
>>>>> ext4_group_t i;
>>>>> @@ -3012,7 +3012,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>>>> * is greater than equal to the sbi_s_mb_order2_reqs
>>>>> * You can tune it via /sys/fs/ext4/<partition>/mb_order2_req
>>>>> * We also support searching for power-of-two requests only for
>>>>> - * requests upto maximum buddy size we have constructed.
>>>>> + * requests up to maximum buddy size we have constructed.
>>>>> */
>>>>> if (i >= sbi->s_mb_order2_reqs && i <= MB_NUM_ORDERS(sb)) {
>>>>> if (is_power_of_2(ac->ac_g_ex.fe_len))
>>>>> @@ -3101,6 +3101,144 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>>>> return err;
>>>>> }
>>>>> +/* Rotating allocator (rotalloc mount option) */
>>>>> +noinline_for_stack int
>>>>> +ext4_mb_rotating_allocator(struct ext4_allocation_context *ac)
>>>>> +{
>>>>> + ext4_group_t i, goal;
>>>>> + int err = 0;
>>>>> + struct super_block *sb = ac->ac_sb;
>>>>> + struct ext4_sb_info *sbi = EXT4_SB(sb);
>>>>> + struct ext4_buddy e4b;
>>>>> +
>>>>> + BUG_ON(ac->ac_status == AC_STATUS_FOUND);
>>>>> +
>>>>> + /* Set the goal from s_rotalloc_cursor */
>>>>> + spin_lock(&sbi->s_rotalloc_lock);
>>>>> + goal = sbi->s_rotalloc_cursor;
>>>>> + spin_unlock(&sbi->s_rotalloc_lock);
>>>>> + ac->ac_g_ex.fe_group = goal;
>>>>> +
>>>>> + /* first, try the goal */
>>>>> + err = ext4_mb_find_by_goal(ac, &e4b);
>>>>> + if (err || ac->ac_status == AC_STATUS_FOUND)
>>>>> + goto out;
>>>>> +
>>>>> + if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
>>>>> + goto out;
>>>>> +
>>>>> + /*
>>>>> + * ac->ac_2order is set only if the fe_len is a power of 2
>>>>> + * if ac->ac_2order is set we also set criteria to CR_POWER2_ALIGNED
>>>>> + * so that we try exact allocation using buddy.
>>>>> + */
>>>>> + i = fls(ac->ac_g_ex.fe_len);
>>>>> + ac->ac_2order = 0;
>>>>> + /*
>>>>> + * We search using buddy data only if the order of the request
>>>>> + * is greater than equal to the sbi_s_mb_order2_reqs
>>>>> + * You can tune it via /sys/fs/ext4/<partition>/mb_order2_req
>>>>> + * We also support searching for power-of-two requests only for
>>>>> + * requests up to maximum buddy size we have constructed.
>>>>> + */
>>>>> + if (i >= sbi->s_mb_order2_reqs && i <= MB_NUM_ORDERS(sb)) {
>>>>> + if (is_power_of_2(ac->ac_g_ex.fe_len))
>>>>> + ac->ac_2order = array_index_nospec(i - 1,
>>>>> + MB_NUM_ORDERS(sb));
>>>>> + }
>>>>> +
>>>>> + /* if stream allocation is enabled, use global goal */
>>>>> + if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>>>>> + int hash = ac->ac_inode->i_ino % sbi->s_mb_nr_global_goals;
>>>>> +
>>>>> + ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_groups[hash]);
>>>>> + ac->ac_g_ex.fe_start = -1;
>>>>> + ac->ac_flags &= ~EXT4_MB_HINT_TRY_GOAL;
>>>> Rotating block allocation looks a lot like stream allocation—they both
>>>> pick up from where the last successful allocation left off.
>>>>
>>>> I noticed that the stream allocation's global goal is now split up.
>>>> Is there an advantage to keeping it as a single goal?
>>>> Alternatively, do you see any downsides to this split in your use case?
>>>>
>>>>