Message-ID: <26d60068-d149-4c53-a432-8b9db6b7e6a5@rocketmail.com>
Date: Fri, 6 Feb 2026 20:25:24 +0100
From: Mario Lohajner <mario_lohajner@...ketmail.com>
To: Theodore Tso <tytso@....edu>
Cc: Baokun Li <libaokun1@...wei.com>, adilger.kernel@...ger.ca,
linux-ext4@...r.kernel.org, linux-kernel@...r.kernel.org,
Yang Erkun <yangerkun@...wei.com>, libaokun9@...il.com
Subject: Re: [PATCH] ext4: add optional rotating block allocation policy
On 06. 02. 2026. 02:42, Theodore Tso wrote:
> On Thu, Feb 05, 2026 at 01:23:18PM +0100, Mario Lohajner wrote:
>> Let me briefly restate the intent, focusing on the fundamentals.
>>
>> Rotalloc is not wear leveling (and is intentionally not named as such).
>> It is an allocation policy whose goal is to reduce allocation hotspots by
>> enforcing mount-wide sequential allocation. Wear leveling, if any,
>> remains a device/firmware concern and is explicitly out of scope.
>> While WL motivated part of this work,
>
> Yes, but *why* are you trying to reduce allocation hotspots? What
> problem are you trying to solve? And actually, you are making
> allocation hotspots *worse* since with global cursor, by definition
> there is a single, super-hotspot. This will cause scalability issues
> on a system with multiple CPU's trying to write in parallel.
Greetings Ted,
First off, apologies for the delayed reply — your emails somehow ended
up in my spam folder! I hope this doesn’t happen again.
Also, sorry for the lengthy responses; I really want to make my points
clear.
I’m not proposing that ext4 should implement or control wear leveling.
WL clearly does (or does not) exist below the FS layer and is opaque to
us (we have no way of knowing).
What is observable in practice, however, is persistent allocation
locality near the beginning of the LBA space under real workloads, and a
corresponding concentration of wear in that area; interestingly, this
appears to be vendor-agnostic. The force within is very strong :-)
The elephant:
My concern is a potential policy interaction: filesystem locality
policies tend to concentrate hot metadata and early allocations. During
deallocation, we naturally discard/trim those blocks ASAP to make them
ready for write, thus optimizing for speed, while at the same time
signaling them as free. Meanwhile, an underlying WL policy (if present)
tries to consume free blocks opportunistically.
If these two interact poorly, the result can be a sustained bias toward
low-LBA hot regions, as observed in practice.
The elephant in the room is wear / hotspots at the start of the LBA
space.
>
>> the main added value of this patch is allocator separation.
>> The policy indirection (aka vectored allocator) allows allocation
>> strategies that are orthogonal to the regular allocator to operate
>> outside the hot path, preserving existing heuristics and improving
>> maintainability.
>
> Allocator separation is not necessarily an unalloyed good thing.
> By having duplicated code, it means that if we need to make a change
> in infrastructure code, we might now need to make it in multiple code
> paths. It is also one more code path that we have to test and
> maintain. So there is a real cost from the perspective of upstream
> maintenance.
My goal was to keep the regular allocator intact and trivially clean.
Baokun observed this correctly — I use all the existing heuristics; the
only tweak is to ‘fix the goal’ (i.e., set where allocation starts),
which then advances sequentially toward the region most likely to
contain empty, unused space, at which point allocations become nearly
instantaneous.
Being orthogonal in principle, the two allocators/policies are meant
to live independently of each other.
Alternatively, we could drop the separation entirely and add a few
conditional branches to the regular allocator to the same effect,
but this introduces overhead, potential branch mispredictions, and all
the associated shenanigans (minor but not insignificant).
Separation avoids that, at the minimal cost of maintaining 20-ish extra
lines of code.
(memory we have; time is scarce)
>
> Also, because having a single global allocation point (your "cursor")
> is going to absolutely *trash* performance, especially for high speed
> NVMe devices connected to high count CPU's, it's not clear to me why
> performance is necessary for rotalloc.
>
>> The rotating allocator itself is a working prototype.
>> It was written with minimal diff and clarity in mind to make the policy
>> reviewable. Refinements and simplifications are expected and welcome.
>
> OK, so this sounds like it's not ready for prime time....
I don’t consider it “not ready for prime time.” It is a rather simple
refinement of the existing allocator, producing clean, contiguous
layouts with sequential allocation across the LBA space, without an
increase in complexity and with equal or lower latency.
Further refinements are anticipated and welcome — not because the
current approach is flawed, but because this seems like an area where we
can reasonably ask whether it can be even better.
>
>> Regarding discard/trim: while discard prepares blocks for reuse and
>> signals that a block is free, it does not implement wear leveling by
>> itself. Rotalloc operates at a higher layer; by promoting sequentiality,
>> it reduces block/group allocation hotspots regardless of underlying
>> device behavior.
>> Since it is not in line with the current allocator goals, it is
>> implemented as an optional policy.
>
> Again, what is the high level goal of rotalloc? What specific
> hardware and workload are you trying to optimize for? If you want to
> impose a maintenance overhead on upstream, you need to justify why the
> maintenance overhead is worth it. And so that means you need to be a
> bit more explicit about what specific real-world problem you are
> trying to solve....
>
> - Ted
Again, we’re not focusing solely on wear leveling here, but since we
can’t influence the WL implementation itself, the only lever we have is
our own allocation policy.
The question I’m trying to sanity-check is whether we can avoid
reinforcing this pattern, and instead aim for an allocation strategy
that helps minimize the issue—or even avoid it entirely if possible.
Even though this pattern is clear in practice, I’m not claiming it
applies universally, only that it appears often enough to be worth
discussing at the policy level. For that reason, it seems reasonable to
treat this as an optional policy choice, disabled by default.
Sincerely,
Mario