Message-Id: <9E89B4B6-BF01-4B6B-94BA-8A44112F8A3A@dilger.ca>
Date: Fri, 6 Feb 2026 16:53:28 -0700
From: Andreas Dilger <adilger@...ger.ca>
To: Mario Lohajner <mario_lohajner@...ketmail.com>
Cc: Baokun Li <libaokun1@...wei.com>,
tytso@....edu,
linux-ext4@...r.kernel.org,
linux-kernel@...r.kernel.org,
Yang Erkun <yangerkun@...wei.com>,
libaokun9@...il.com
Subject: Re: [PATCH] ext4: add optional rotating block allocation policy
On Feb 5, 2026, at 05:23, Mario Lohajner <mario_lohajner@...ketmail.com> wrote:
>
> Let me briefly restate the intent, focusing on the fundamentals.
>
> Rotalloc is not wear leveling (and is intentionally not named as such).
> It is an allocation policy whose goal is to reduce allocation hotspots by
> enforcing mount-wide sequential allocation. Wear leveling, if any,
> remains a device/firmware concern and is explicitly out of scope.
>
> While WL motivated part of this work, the main added value of this
> patch is allocator separation.
> The policy indirection (aka vectored allocator) allows allocation
> strategies that are orthogonal to the regular allocator to operate
> outside the hot path, preserving existing heuristics and improving
> maintainability. Rotalloc is one such case; when disabled (the
> default), there is literally no behavioral change.
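>
> As a rough illustration of the indirection (a sketch only, built from
> the names declared in the patch; the actual super.c and mballoc.c hunks
> may differ in detail):
>
>     /* at mount time, in super.c */
>     if (test_opt(sb, ROTALLOC)) {
>         sbi->s_rotalloc_cursor = 0;
>         spin_lock_init(&sbi->s_rotalloc_lock);
>         sbi->s_mb_new_blocks = ext4_mb_rotating_allocator;
>     } else {
>         sbi->s_mb_new_blocks = ext4_mb_regular_allocator;
>     }
>
>     /* in ext4_mb_new_blocks(), the hot path simply calls the vector */
>     err = sbi->s_mb_new_blocks(ac);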
>
> The cursor is global because the policy is global (KISS applied here :-).
> It operates at the mount level, not per inode, stream, or CPU.
> Splitting the cursor would still require keeping all cursors in sync;
> once any allocator diverges, it either has to scan more blocks or
> sequentiality collapses back into locality and hotspots reappear.
> A single cursor is therefore intentional.
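>
> A minimal sketch of the cursor update (the helper name is hypothetical;
> the fields are the ones added by the patch, and the patch reads the
> cursor at allocator entry and advances it only after a successful
> allocation):
>
>     /* advance the mount-wide cursor past the group just used */
>     static void rotalloc_advance(struct ext4_sb_info *sbi,
>                                  ext4_group_t used, ext4_group_t ngroups)
>     {
>         spin_lock(&sbi->s_rotalloc_lock);
>         sbi->s_rotalloc_cursor = (used + 1) % ngroups;
>         spin_unlock(&sbi->s_rotalloc_lock);
>     }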
>
> The rotating allocator itself is a working prototype.
> It was written with minimal diff and clarity in mind to make the policy
> reviewable. Refinements and simplifications are expected and welcome.
>
> Regarding discard/trim: while discard prepares blocks for reuse and
> signals that a block is free, it does not implement wear leveling by
> itself. Rotalloc operates at a higher layer; by promoting sequentiality,
> it reduces block/group allocation hotspots regardless of underlying
> device behavior.

I think there are two main reasons why a round-robin allocation policy is
of interest:
- maximizing the time between block group re-use also maximizes the time
during which blocks can be freed in a group, reducing fragmentation
- this also improves the ability of fstrim to process larger segments of
  free storage compared to "-o discard", which is sub-optimal when
  processing recently-freed blocks (they may be small/unaligned relative
  to the erase blocks/RAID stripes of the storage, so nothing may
  actually be trimmed)
The latter is what motivated the "-o fstrim" feature[1] that we developed
for Lustre as a lighter-weight alternative to "-o discard": it uses the
fstrim infrastructure on an ongoing basis to trim whole groups that have
recently freed space, without tracking all of the freed extents
individually.
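
As a rough sketch of the idea behind [1] (illustrative only: mainline
already tracks a per-group "was trimmed" bit in memory for FITRIM, the
linked patches persist it on disk and drive it from a mount option, and
the trim helper below is a placeholder rather than code from those
patches):

    /* inside the per-group loop of an fstrim pass, roughly: */
    if (EXT4_MB_GRP_WAS_TRIMMED(grp))
        continue;               /* trimmed and nothing freed since */
    ret = trim_free_extents_in_group(sb, group, minlen);  /* placeholder */
    if (ret >= 0)
        EXT4_MB_GRP_SET_TRIMMED(grp);
    /* freeing blocks in a group clears the bit, so the next periodic
     * pass revisits only groups with recently-freed space */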
Cheers, Andreas
[1] Patches to implement persistent TRIMMED flag, stats, and "fstrim":
https://github.com/lustre/lustre-release/blob/master/ldiskfs/kernel_patches/patches/rhel10.0/ext4-introduce-EXT4_BG_TRIMMED-to-optimize-fstrim.patch
https://github.com/lustre/lustre-release/blob/master/ldiskfs/kernel_patches/patches/rhel10.0/ext4-add-DISCARD-stats.patch
https://github.com/lustre/lustre-release/blob/master/ldiskfs/kernel_patches/patches/rhel9.4/ext4-add-fstrim-mount-option.patch
> Since it is not in line with the current allocator goals, it is
> implemented as an optional policy.
>
> Best regards,
> Mario
>
> PS: thank you for acknowledging that there are workloads and scenarios
> where this method is worthwhile :-).
>
> On 05. 02. 2026. 04:52, Baokun Li wrote:
>> On 2026-02-04 19:06, Mario Lohajner wrote:
>>> Hello Baokun Li,
>>>
>>> This response was originally intended for Andreas.
>>> I'm sending you the full copy to provide context for your query,
>>> rather than writing a separate response.
>>>
>>> Yes, the main motive for this allocator is flash wear leveling,
>>> but it is not strictly a wear leveling mechanism, and it is not named
>>> as such for a reason.
>>> Wear leveling may (or may not) exist at the device/hardware level.
>>> The goal of this policy is not to "fix" that.
>>>
>> As Ted mentioned in another thread, wear leveling is media-dependent.
>> Most drives can handle wear leveling effectively enough just via the
>> discard command.
>>
>> If you are using UFS, F2FS might be a solid choice. However, for raw
>> NAND flash, UBIFS (via UBI) or JFFS2 would be more appropriate.
>>
>> A single global goal would cause severe contention in multi-CPU
>> scenarios, which is precisely why the stream allocation goal was split
>> into multiple ones.
>>
>> Furthermore, constantly overriding the inode goal leads to significant
>> file fragmentation, as it often misses opportunities for extent merging.
>>
>> If we truly want to implement ext4_mb_rotating_allocator, we should strip
>> out inode goal, stream allocation, and optimize_scan, rather than simply
>> cloning ext4_mb_regular_allocator and forcing a goal setting.
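>>
>> For illustration, the main loop of such a stripped-down allocator could
>> be roughly as small as the sketch below (assuming ext4_mb_scan_group(ac,
>> group) as in current mballoc.c, and ignoring the criteria stages and
>> error handling; this is not code from the patch):
>>
>>     spin_lock(&sbi->s_rotalloc_lock);
>>     goal = sbi->s_rotalloc_cursor;
>>     spin_unlock(&sbi->s_rotalloc_lock);
>>
>>     for (i = 0; i < ngroups; i++) {
>>         group = (goal + i) % ngroups;
>>         err = ext4_mb_scan_group(ac, group);
>>         if (err || ac->ac_status == AC_STATUS_FOUND)
>>             break;
>>     }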
>>
>>
>> Cheers,
>> Baokun
>>
>>> This policy helps avoid allocation hotspots at mount start by
>>> distributing allocations sequentially across the entire mount,
>>> not just a file or allocation stream.
>>>
>>> At the block/group allocation level, the file system is fairly stochastic
>>> and timing-sensitive. Rather than providing raw benchmark data, I prefer
>>> to explain the design analytically:
>>> The vectored separation of the new allocator ensures that the performance
>>> of the regular allocator is maintained (literally unchanged).
>>> The overhead of the new rotating allocator is minimal and occurs outside
>>> of the "hot loop": the cursor is retrieved early at the start and updated
>>> upon successful allocation, which is negligible with respect to IO latency.
>>> Because allocations proceed sequentially, latency is comparable to, or
>>> better than, that of the regular allocator.
>>> Having separate allocators increases maintainability and independence
>>> with minimal (virtually no) overhead.
>>>
>>> This policy benefits workloads with frequent large or small allocations,
>>> while keeping file fragmentation and slack space minimal.
>>> It is a conscious trade-off: sacrificing locality in favor of reinforced
>>> sequentiality.
>>> Of course, this is not optimal for classic HDDs, but NVMe drives behave
>>> differently.
>>> For this reason, the policy is optional per mount, turned off by default,
>>> and can be toggled at mount time.
>>>
>>> Best regards,
>>> Mario
>>>
>>> On 04. 02. 2026. 07:29, Baokun Li wrote:
>>>> On 2026-02-04 11:31, Mario Lohajner wrote:
>>>>> Add support for the rotalloc allocation policy as a new mount
>>>>> option. The policy rotates the starting block group for new allocations.
>>>>>
>>>>> Changes:
>>>>> - fs/ext4/ext4.h
>>>>> rotalloc policy declared, extend sb with cursor, vector & lock
>>>>>
>>>>> - fs/ext4/mballoc.h
>>>>> expose allocator functions for vectoring in super.c
>>>>>
>>>>> - fs/ext4/super.c
>>>>> parse rotalloc mnt opt, init cursor, lock and allocator vector
>>>>>
>>>>> - fs/ext4/mballoc.c
>>>>> add rotalloc allocator, vectored allocator call in new_blocks
>>>>>
>>>>> The policy is selected via a mount option and does not change the
>>>>> on-disk format or default allocation behavior. It preserves existing
>>>>> allocation heuristics within a block group while distributing
>>>>> allocations across block groups in a deterministic sequential manner.
>>>>>
>>>>> The rotating allocator is implemented as a separate allocation path
>>>>> selected at mount time. This avoids conditional branches in the regular
>>>>> allocator and keeps allocation policies isolated.
>>>>> This also allows the rotating allocator to evolve independently in the
>>>>> future without increasing complexity in the regular allocator.
>>>>>
>>>>> The policy was tested using v6.18.6 stable locally with the new mount
>>>>> option "rotalloc" enabled, confirmed working as desribed!
>>>>>
>>>>> Signed-off-by: Mario Lohajner <mario_lohajner@...ketmail.com>
>>>>> ---
>>>>> fs/ext4/ext4.h | 8 +++
>>>>> fs/ext4/mballoc.c | 152 ++++++++++++++++++++++++++++++++++++++++++++--
>>>>> fs/ext4/mballoc.h | 3 +
>>>>> fs/ext4/super.c | 18 +++++-
>>>>> 4 files changed, 175 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>>> index 56112f201cac..cbbb7c05d7a2 100644
>>>>> --- a/fs/ext4/ext4.h
>>>>> +++ b/fs/ext4/ext4.h
>>>>> @@ -229,6 +229,9 @@ struct ext4_allocation_request {
>>>>> unsigned int flags;
>>>>> };
>>>>> +/* expose rotalloc allocator argument pointer type */
>>>>> +struct ext4_allocation_context;
>>>>> +
>>>>> /*
>>>>> * Logical to physical block mapping, used by ext4_map_blocks()
>>>>> *
>>>>> @@ -1230,6 +1233,7 @@ struct ext4_inode_info {
>>>>> * Mount flags set via mount options or defaults
>>>>> */
>>>>> #define EXT4_MOUNT_NO_MBCACHE 0x00001 /* Do not use mbcache */
>>>>> +#define EXT4_MOUNT_ROTALLOC 0x00002 /* Use rotalloc policy/allocator */
>>>>> #define EXT4_MOUNT_GRPID 0x00004 /* Create files with directory's group */
>>>>> #define EXT4_MOUNT_DEBUG 0x00008 /* Some debugging messages */
>>>>> #define EXT4_MOUNT_ERRORS_CONT 0x00010 /* Continue on errors */
>>>>> @@ -1559,6 +1563,10 @@ struct ext4_sb_info {
>>>>> unsigned long s_mount_flags;
>>>>> unsigned int s_def_mount_opt;
>>>>> unsigned int s_def_mount_opt2;
>>>>> + /* Rotalloc cursor, lock & new_blocks allocator vector */
>>>>> + unsigned int s_rotalloc_cursor;
>>>>> + spinlock_t s_rotalloc_lock;
>>>>> + int (*s_mb_new_blocks)(struct ext4_allocation_context *ac);
>>>>> ext4_fsblk_t s_sb_block;
>>>>> atomic64_t s_resv_clusters;
>>>>> kuid_t s_resuid;
>>>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>>>>> index 56d50fd3310b..74f79652c674 100644
>>>>> --- a/fs/ext4/mballoc.c
>>>>> +++ b/fs/ext4/mballoc.c
>>>>> @@ -2314,11 +2314,11 @@ static void ext4_mb_check_limits(struct ext4_allocation_context *ac,
>>>>> * stop the scan and use it immediately
>>>>> *
>>>>> * * If free extent found is smaller than goal, then keep retrying
>>>>> - * upto a max of sbi->s_mb_max_to_scan times (default 200). After
>>>>> + * up to a max of sbi->s_mb_max_to_scan times (default 200). After
>>>>> * that stop scanning and use whatever we have.
>>>>> *
>>>>> * * If free extent found is bigger than goal, then keep retrying
>>>>> - * upto a max of sbi->s_mb_min_to_scan times (default 10) before
>>>>> + * up to a max of sbi->s_mb_min_to_scan times (default 10) before
>>>>> * stopping the scan and using the extent.
>>>>> *
>>>>> *
>>>>> @@ -2981,7 +2981,7 @@ static int ext4_mb_scan_group(struct ext4_allocation_context *ac,
>>>>> return ret;
>>>>> }
>>>>> -static noinline_for_stack int
>>>>> +noinline_for_stack int
>>>>> ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>>>> {
>>>>> ext4_group_t i;
>>>>> @@ -3012,7 +3012,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>>>> * is greater than equal to the sbi_s_mb_order2_reqs
>>>>> * You can tune it via /sys/fs/ext4/<partition>/mb_order2_req
>>>>> * We also support searching for power-of-two requests only for
>>>>> - * requests upto maximum buddy size we have constructed.
>>>>> + * requests up to maximum buddy size we have constructed.
>>>>> */
>>>>> if (i >= sbi->s_mb_order2_reqs && i <= MB_NUM_ORDERS(sb)) {
>>>>> if (is_power_of_2(ac->ac_g_ex.fe_len))
>>>>> @@ -3101,6 +3101,144 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>>>> return err;
>>>>> }
>>>>> +/* Rotating allocator (rotalloc mount option) */
>>>>> +noinline_for_stack int
>>>>> +ext4_mb_rotating_allocator(struct ext4_allocation_context *ac)
>>>>> +{
>>>>> + ext4_group_t i, goal;
>>>>> + int err = 0;
>>>>> + struct super_block *sb = ac->ac_sb;
>>>>> + struct ext4_sb_info *sbi = EXT4_SB(sb);
>>>>> + struct ext4_buddy e4b;
>>>>> +
>>>>> + BUG_ON(ac->ac_status == AC_STATUS_FOUND);
>>>>> +
>>>>> + /* Set the goal from s_rotalloc_cursor */
>>>>> + spin_lock(&sbi->s_rotalloc_lock);
>>>>> + goal = sbi->s_rotalloc_cursor;
>>>>> + spin_unlock(&sbi->s_rotalloc_lock);
>>>>> + ac->ac_g_ex.fe_group = goal;
>>>>> +
>>>>> + /* first, try the goal */
>>>>> + err = ext4_mb_find_by_goal(ac, &e4b);
>>>>> + if (err || ac->ac_status == AC_STATUS_FOUND)
>>>>> + goto out;
>>>>> +
>>>>> + if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
>>>>> + goto out;
>>>>> +
>>>>> + /*
>>>>> + * ac->ac_2order is set only if the fe_len is a power of 2
>>>>> + * if ac->ac_2order is set we also set criteria to CR_POWER2_ALIGNED
>>>>> + * so that we try exact allocation using buddy.
>>>>> + */
>>>>> + i = fls(ac->ac_g_ex.fe_len);
>>>>> + ac->ac_2order = 0;
>>>>> + /*
>>>>> + * We search using buddy data only if the order of the request
>>>>> + * is greater than equal to the sbi_s_mb_order2_reqs
>>>>> + * You can tune it via /sys/fs/ext4/<partition>/mb_order2_req
>>>>> + * We also support searching for power-of-two requests only for
>>>>> + * requests up to maximum buddy size we have constructed.
>>>>> + */
>>>>> + if (i >= sbi->s_mb_order2_reqs && i <= MB_NUM_ORDERS(sb)) {
>>>>> + if (is_power_of_2(ac->ac_g_ex.fe_len))
>>>>> + ac->ac_2order = array_index_nospec(i - 1,
>>>>> + MB_NUM_ORDERS(sb));
>>>>> + }
>>>>> +
>>>>> + /* if stream allocation is enabled, use global goal */
>>>>> + if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>>>>> + int hash = ac->ac_inode->i_ino % sbi->s_mb_nr_global_goals;
>>>>> +
>>>>> + ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_groups[hash]);
>>>>> + ac->ac_g_ex.fe_start = -1;
>>>>> + ac->ac_flags &= ~EXT4_MB_HINT_TRY_GOAL;
>>>> Rotating block allocation looks a lot like stream allocation—they both
>>>> pick up from where the last successful allocation left off.
>>>>
>>>> I noticed that the stream allocation's global goal is now split up.
>>>> Is there an advantage to keeping it as a single goal?
>>>> Alternatively, do you see any downsides to this split in your use case?
>>>>
>>>>