Message-ID: <606941c7-2a0d-44c7-a848-188212686a78@rocketmail.com>
Date: Thu, 5 Feb 2026 13:23:18 +0100
From: Mario Lohajner <mario_lohajner@...ketmail.com>
To: Baokun Li <libaokun1@...wei.com>, tytso@....edu
Cc: adilger.kernel@...ger.ca, linux-ext4@...r.kernel.org,
 linux-kernel@...r.kernel.org, Yang Erkun <yangerkun@...wei.com>,
 libaokun9@...il.com
Subject: Re: [PATCH] ext4: add optional rotating block allocation policy

Let me briefly restate the intent, focusing on the fundamentals.

Rotalloc is not wear leveling (and is intentionally not named as such).
It is an allocation policy whose goal is to reduce allocation hotspots by
enforcing mount-wide sequential allocation. Wear leveling, if any,
remains a device/firmware concern and is explicitly out of scope.

While wear leveling motivated part of this work, the main added value of
this patch is allocator separation.
The policy indirection (aka vectored allocator) allows allocation
strategies that are orthogonal to the regular allocator to operate
outside the hot path, preserving existing heuristics and improving
maintainability. Rotalloc is one such case; when disabled (the default),
there is literally no behavioral change.
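
To make the indirection concrete, here is a minimal sketch of the call
site, using the names from the patch (s_mb_new_blocks,
ext4_mb_regular_allocator, ext4_mb_rotating_allocator); the actual hunk
in ext4_mb_new_blocks() is not quoted below, so read this as an
illustration rather than the literal diff:

        /*
         * struct ext4_sb_info carries the per-mount allocator vector:
         *   int (*s_mb_new_blocks)(struct ext4_allocation_context *ac);
         * The allocation path dispatches through it, so the regular
         * allocator body contains no rotalloc-specific branches.
         */
        *errp = sbi->s_mb_new_blocks(ac);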

The cursor is global because the policy is global (KISS applied here :-).
It operates at the mount level, not per inode, stream, or CPU.
Splitting the cursor would still require keeping all cursors in sync;
once any cursor diverges, the allocator either has to scan more block
groups or sequentiality collapses back into locality and hotspots
reappear. A single cursor is therefore intentional.
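
For reference, the cursor handling boils down to a few lines; the read
side is visible in the quoted patch below, and the advance-on-success
step is shown here in simplified form (that hunk sits further down in
the patch and is not quoted in this thread):

        /* pick the starting group from the mount-wide cursor */
        spin_lock(&sbi->s_rotalloc_lock);
        goal = sbi->s_rotalloc_cursor;
        spin_unlock(&sbi->s_rotalloc_lock);
        ac->ac_g_ex.fe_group = goal;

        /* ... group scan starting at 'goal' ... */

        /* on success, advance the cursor past the group just used,
         * wrapping around at the end of the filesystem */
        spin_lock(&sbi->s_rotalloc_lock);
        sbi->s_rotalloc_cursor = (ac->ac_b_ex.fe_group + 1) %
                                 ext4_get_groups_count(sb);
        spin_unlock(&sbi->s_rotalloc_lock);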

The rotating allocator itself is a working prototype.
It was written with a minimal diff and clarity in mind to make the policy
reviewable. Refinements and simplifications are expected and welcome.

Regarding discard/trim: while discard prepares blocks for reuse and
signals that a block is free, it does not implement wear leveling by
itself. Rotalloc operates at a higher layer; by promoting sequentiality,
it reduces block and block-group allocation hotspots regardless of the
underlying device behavior.
Since this trade-off is not in line with the current allocator's goals,
it is implemented as an optional policy.
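
Everything policy-specific is decided once at mount time, which is what
keeps it fully opt-in. A simplified sketch of the super.c side (that hunk
is not quoted below; the test_opt() usage assumes the usual ext4
convention for the new EXT4_MOUNT_ROTALLOC flag):

        /* default to the regular allocator; switch the vector only
         * when "rotalloc" was requested at mount time */
        sbi->s_rotalloc_cursor = 0;
        spin_lock_init(&sbi->s_rotalloc_lock);
        if (test_opt(sb, ROTALLOC))
                sbi->s_mb_new_blocks = ext4_mb_rotating_allocator;
        else
                sbi->s_mb_new_blocks = ext4_mb_regular_allocator;

Enabling it is then just "mount -o rotalloc ..."; leaving the option out
keeps stock behavior.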

Best regards,
Mario

PS: thank you for acknowledging that there are workloads and scenarios
where this method is worthwhile :-).

On 05. 02. 2026. 04:52, Baokun Li wrote:
> On 2026-02-04 19:06, Mario Lohajner wrote:
>> Hello Baokun Li,
>>
>> This response was originally intended for Andreas.
>> I'm sending you the full copy to provide context for your query,
>> rather than writing a separate response.
>>
>> Yes, the main motive for this allocator is flash wear leveling,
>> but it is not strictly a wear leveling mechanism, and it is not named
>> as such for a reason.
>> Wear leveling may (or may not) exist at the device/hardware level.
>> The goal of this policy is not to "fix" that.
>>
> As Ted mentioned in another thread, wear leveling is media-dependent.
> Most drivers can handle wear leveling effectively enough just via the
> discard command.
>
> If you are using UFS, F2FS might be a solid choice. However, for raw
> NAND flash, UBIFS (via UBI) or JFFS2 would be more appropriate.
>
> A single global goal would cause severe contention in multi-CPU
> scenarios, which is precisely why the stream allocation goal was split
> into multiple ones.
>
> Furthermore, constantly overriding the inode goal leads to significant
> file fragmentation, as it often misses opportunities for extent merging.
>
> If we truly want to implement ext4_mb_rotating_allocator, we should strip
> out inode goal, stream allocation, and optimize_scan, rather than simply
> cloning ext4_mb_regular_allocator and forcing a goal setting.
>
>
> Cheers,
> Baokun
>
>> This policy helps avoid allocation hotspots at mount start by
>> distributing allocations sequentially across the entire mount,
>> not just a file or allocation stream.
>>
>> At the block/group allocation level, the file system is fairly stochastic
>> and timing-sensitive. Rather than providing raw benchmark data, I prefer
>> to explain the design analytically:
>> The vectored separation of the new allocator ensures that the performance
>> of the regular allocator is maintained (literally unchanged).
>> The overhead of the new rotating allocator is minimal and occurs outside
>> of the "hot loop":
>> the cursor is retrieved early at the start, updated upon successful
>> allocation,
>> and is negligible with respect to IO latency.
>> Because allocations proceed sequentially, latency is comparable to
>> or better than the regular allocator.
>> Having separated allocators increases maintainability and independence
>> with minimal (virtually no) overhead.
>>
>> This policy benefits workloads with frequent large or small allocations,
>> while keeping file fragmentation and slack space minimal.
>> It is a conscious trade-off: sacrificing locality in favor of reinforced
>> sequentiality.
>> Of course, this is not optimal for classic HDDs, but NVMe drives behave
>> differently.
>> For this reason, the policy is optional per mount, turned off by default,
>> and can be toggled at mount time.
>>
>> Best regards,
>> Mario
>>
>> On 04. 02. 2026. 07:29, Baokun Li wrote:
>>> On 2026-02-04 11:31, Mario Lohajner wrote:
>>>> Add support for the rotalloc allocation policy as a new mount
>>>> option. Policy rotates the starting block group for new allocations.
>>>>
>>>> Changes:
>>>> - fs/ext4/ext4.h
>>>>      rotalloc policy declared, extend sb with cursor, vector & lock
>>>>
>>>> - fs/ext4/mballoc.h
>>>>      expose allocator functions for vectoring in super.c
>>>>
>>>> - fs/ext4/super.c
>>>>      parse rotalloc mnt opt, init cursor, lock and allocator vector
>>>>
>>>> - fs/ext4/mballoc.c
>>>>      add rotalloc allocator, vectored allocator call in new_blocks
>>>>
>>>> The policy is selected via a mount option and does not change the
>>>> on-disk format or default allocation behavior. It preserves existing
>>>> allocation heuristics within a block group while distributing
>>>> allocations across block groups in a deterministic sequential manner.
>>>>
>>>> The rotating allocator is implemented as a separate allocation path
>>>> selected at mount time. This avoids conditional branches in the regular
>>>> allocator and keeps allocation policies isolated.
>>>> This also allows the rotating allocator to evolve independently in the
>>>> future without increasing complexity in the regular allocator.
>>>>
>>>> The policy was tested using v6.18.6 stable locally with the new mount
>>>> option "rotalloc" enabled, confirmed working as described!
>>>>
>>>> Signed-off-by: Mario Lohajner <mario_lohajner@...ketmail.com>
>>>> ---
>>>>    fs/ext4/ext4.h    |   8 +++
>>>>    fs/ext4/mballoc.c | 152 ++++++++++++++++++++++++++++++++++++++++++++--
>>>>    fs/ext4/mballoc.h |   3 +
>>>>    fs/ext4/super.c   |  18 +++++-
>>>>    4 files changed, 175 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>> index 56112f201cac..cbbb7c05d7a2 100644
>>>> --- a/fs/ext4/ext4.h
>>>> +++ b/fs/ext4/ext4.h
>>>> @@ -229,6 +229,9 @@ struct ext4_allocation_request {
>>>>        unsigned int flags;
>>>>    };
>>>>    +/* expose rotalloc allocator argument pointer type */
>>>> +struct ext4_allocation_context;
>>>> +
>>>>    /*
>>>>     * Logical to physical block mapping, used by ext4_map_blocks()
>>>>     *
>>>> @@ -1230,6 +1233,7 @@ struct ext4_inode_info {
>>>>     * Mount flags set via mount options or defaults
>>>>     */
>>>>    #define EXT4_MOUNT_NO_MBCACHE        0x00001 /* Do not use mbcache */
>>>> +#define EXT4_MOUNT_ROTALLOC            0x00002 /* Use rotalloc
>>>> policy/allocator */
>>>>    #define EXT4_MOUNT_GRPID        0x00004    /* Create files with
>>>> directory's group */
>>>>    #define EXT4_MOUNT_DEBUG        0x00008    /* Some debugging messages */
>>>>    #define EXT4_MOUNT_ERRORS_CONT        0x00010    /* Continue on
>>>> errors */
>>>> @@ -1559,6 +1563,10 @@ struct ext4_sb_info {
>>>>        unsigned long s_mount_flags;
>>>>        unsigned int s_def_mount_opt;
>>>>        unsigned int s_def_mount_opt2;
>>>> +    /* Rotalloc cursor, lock & new_blocks allocator vector */
>>>> +    unsigned int s_rotalloc_cursor;
>>>> +    spinlock_t s_rotalloc_lock;
>>>> +    int (*s_mb_new_blocks)(struct ext4_allocation_context *ac);
>>>>        ext4_fsblk_t s_sb_block;
>>>>        atomic64_t s_resv_clusters;
>>>>        kuid_t s_resuid;
>>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>>>> index 56d50fd3310b..74f79652c674 100644
>>>> --- a/fs/ext4/mballoc.c
>>>> +++ b/fs/ext4/mballoc.c
>>>> @@ -2314,11 +2314,11 @@ static void ext4_mb_check_limits(struct
>>>> ext4_allocation_context *ac,
>>>>     *   stop the scan and use it immediately
>>>>     *
>>>>     * * If free extent found is smaller than goal, then keep retrying
>>>> - *   upto a max of sbi->s_mb_max_to_scan times (default 200). After
>>>> + *   up to a max of sbi->s_mb_max_to_scan times (default 200). After
>>>>     *   that stop scanning and use whatever we have.
>>>>     *
>>>>     * * If free extent found is bigger than goal, then keep retrying
>>>> - *   upto a max of sbi->s_mb_min_to_scan times (default 10) before
>>>> + *   up to a max of sbi->s_mb_min_to_scan times (default 10) before
>>>>     *   stopping the scan and using the extent.
>>>>     *
>>>>     *
>>>> @@ -2981,7 +2981,7 @@ static int ext4_mb_scan_group(struct
>>>> ext4_allocation_context *ac,
>>>>        return ret;
>>>>    }
>>>>    -static noinline_for_stack int
>>>> +noinline_for_stack int
>>>>    ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>>>    {
>>>>        ext4_group_t i;
>>>> @@ -3012,7 +3012,7 @@ ext4_mb_regular_allocator(struct
>>>> ext4_allocation_context *ac)
>>>>         * is greater than equal to the sbi_s_mb_order2_reqs
>>>>         * You can tune it via /sys/fs/ext4/<partition>/mb_order2_req
>>>>         * We also support searching for power-of-two requests only for
>>>> -     * requests upto maximum buddy size we have constructed.
>>>> +     * requests up to maximum buddy size we have constructed.
>>>>         */
>>>>        if (i >= sbi->s_mb_order2_reqs && i <= MB_NUM_ORDERS(sb)) {
>>>>            if (is_power_of_2(ac->ac_g_ex.fe_len))
>>>> @@ -3101,6 +3101,144 @@ ext4_mb_regular_allocator(struct
>>>> ext4_allocation_context *ac)
>>>>        return err;
>>>>    }
>>>>    +/* Rotating allocator (rotalloc mount option) */
>>>> +noinline_for_stack int
>>>> +ext4_mb_rotating_allocator(struct ext4_allocation_context *ac)
>>>> +{
>>>> +    ext4_group_t i, goal;
>>>> +    int err = 0;
>>>> +    struct super_block *sb = ac->ac_sb;
>>>> +    struct ext4_sb_info *sbi = EXT4_SB(sb);
>>>> +    struct ext4_buddy e4b;
>>>> +
>>>> +    BUG_ON(ac->ac_status == AC_STATUS_FOUND);
>>>> +
>>>> +    /* Set the goal from s_rotalloc_cursor */
>>>> +    spin_lock(&sbi->s_rotalloc_lock);
>>>> +    goal = sbi->s_rotalloc_cursor;
>>>> +    spin_unlock(&sbi->s_rotalloc_lock);
>>>> +    ac->ac_g_ex.fe_group = goal;
>>>> +
>>>> +    /* first, try the goal */
>>>> +    err = ext4_mb_find_by_goal(ac, &e4b);
>>>> +    if (err || ac->ac_status == AC_STATUS_FOUND)
>>>> +        goto out;
>>>> +
>>>> +    if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
>>>> +        goto out;
>>>> +
>>>> +    /*
>>>> +     * ac->ac_2order is set only if the fe_len is a power of 2
>>>> +     * if ac->ac_2order is set we also set criteria to CR_POWER2_ALIGNED
>>>> +     * so that we try exact allocation using buddy.
>>>> +     */
>>>> +    i = fls(ac->ac_g_ex.fe_len);
>>>> +    ac->ac_2order = 0;
>>>> +    /*
>>>> +     * We search using buddy data only if the order of the request
>>>> +     * is greater than equal to the sbi_s_mb_order2_reqs
>>>> +     * You can tune it via /sys/fs/ext4/<partition>/mb_order2_req
>>>> +     * We also support searching for power-of-two requests only for
>>>> +     * requests up to maximum buddy size we have constructed.
>>>> +     */
>>>> +    if (i >= sbi->s_mb_order2_reqs && i <= MB_NUM_ORDERS(sb)) {
>>>> +        if (is_power_of_2(ac->ac_g_ex.fe_len))
>>>> +            ac->ac_2order = array_index_nospec(i - 1,
>>>> +                               MB_NUM_ORDERS(sb));
>>>> +    }
>>>> +
>>>> +    /* if stream allocation is enabled, use global goal */
>>>> +    if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>>>> +        int hash = ac->ac_inode->i_ino % sbi->s_mb_nr_global_goals;
>>>> +
>>>> +        ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_groups[hash]);
>>>> +        ac->ac_g_ex.fe_start = -1;
>>>> +        ac->ac_flags &= ~EXT4_MB_HINT_TRY_GOAL;
>>> Rotating block allocation looks a lot like stream allocation—they both
>>> pick up from where the last successful allocation left off.
>>>
>>> I noticed that the stream allocation's global goal is now split up.
>>> Is there an advantage to keeping it as a single goal?
>>> Alternatively, do you see any downsides to this split in your use case?
>>>
>>>
