Message-ID: <16f17918-9186-4416-bbde-b93482933d8b@rocketmail.com>
Date: Sat, 7 Feb 2026 13:45:06 +0100
From: Mario Lohajner <mario_lohajner@...ketmail.com>
To: Theodore Tso <tytso@....edu>
Cc: Baokun Li <libaokun1@...wei.com>, adilger.kernel@...ger.ca,
linux-ext4@...r.kernel.org, linux-kernel@...r.kernel.org,
Yang Erkun <yangerkun@...wei.com>, libaokun9@...il.com
Subject: Re: [PATCH] ext4: add optional rotating block allocation policy
On 2/7/26 06:31, Theodore Tso wrote:
> On Fri, Feb 06, 2026 at 08:25:24PM +0100, Mario Lohajner wrote:
>> What is observable in practice, however, is persistent allocation locality
>> near the beginning of the LBA space under real workloads, and a
>> corresponding concentration of wear in that area; interestingly, it seems
>> to be vendor-agnostic. The force within is very strong :-)
>
> This is simply not true. Data blocks are *not* located in the
> low-numbered LBA's in any kind of reasonable real-world situation. Why
> do you think this is true, and what was the experiment that led you to
> believe this?
>
> Let me show you *my* experiment:
>
> root@...-xfstests:~# /sbin/mkfs.ext4 -qF /dev/vdc 5g
> root@...-xfstests:~# mount /dev/vdc /vdc
> [ 171.091299] EXT4-fs (vdc): mounted filesystem 06dd464f-1c3a-4a2b-b3dd-e937c1e7624f r/w with ordered data mode. Quota mode: none.
> root@...-xfstests:~# tar -C /vdc -xJf /vtmp/ext4-6.12.tar.xz
> root@...-xfstests:~# ls -li /vdc
> total 1080
> 31018 -rw-r--r-- 1 15806 15806 496 Dec 12 2024 COPYING
> 347 -rw-r--r-- 1 15806 15806 105095 Dec 12 2024 CREDITS
> 31240 drwxr-xr-x 75 15806 15806 4096 Dec 12 2024 Documentation
> 31034 -rw-r--r-- 1 15806 15806 2573 Dec 12 2024 Kbuild
> 31017 -rw-r--r-- 1 15806 15806 555 Dec 12 2024 Kconfig
> 30990 drwxr-xr-x 6 15806 15806 4096 Dec 12 2024 LICENSES
> 323 -rw-r--r-- 1 15806 15806 781906 Dec 1 21:34 MAINTAINERS
> 19735 -rw-r--r-- 1 15806 15806 68977 Dec 1 21:34 Makefile
> 14 -rw-r--r-- 1 15806 15806 726 Dec 12 2024 README
> 1392 drwxr-xr-x 23 15806 15806 4096 Dec 12 2024 arch
> 669 drwxr-xr-x 3 15806 15806 4096 Dec 1 21:34 block
> 131073 drwxr-xr-x 2 15806 15806 4096 Dec 12 2024 certs
> 31050 drwxr-xr-x 4 15806 15806 4096 Dec 1 21:34 crypto
> 143839 drwxr-xr-x 143 15806 15806 4096 Dec 12 2024 drivers
> 140662 drwxr-xr-x 81 15806 15806 4096 Dec 1 21:34 fs
> 134043 drwxr-xr-x 32 15806 15806 4096 Dec 12 2024 include
> 31035 drwxr-xr-x 2 15806 15806 4096 Dec 1 21:34 init
> 140577 drwxr-xr-x 2 15806 15806 4096 Dec 1 21:34 io_uring
> 140648 drwxr-xr-x 2 15806 15806 4096 Dec 1 21:34 ipc
> 771 drwxr-xr-x 22 15806 15806 4096 Dec 1 21:34 kernel
> 143244 drwxr-xr-x 20 15806 15806 12288 Dec 1 21:34 lib
> 11 drwx------ 2 root root 16384 Feb 6 16:34 lost+found
> 22149 drwxr-xr-x 6 15806 15806 4096 Dec 1 21:34 mm
> 19736 drwxr-xr-x 72 15806 15806 4096 Dec 12 2024 net
> 42649 drwxr-xr-x 7 15806 15806 4096 Dec 1 21:34 rust
> 349 drwxr-xr-x 42 15806 15806 4096 Dec 12 2024 samples
> 42062 drwxr-xr-x 19 15806 15806 12288 Dec 1 21:34 scripts
> 15 drwxr-xr-x 15 15806 15806 4096 Dec 1 21:34 security
> 131086 drwxr-xr-x 27 15806 15806 4096 Dec 12 2024 sound
> 22351 drwxr-xr-x 45 15806 15806 4096 Dec 12 2024 tools
> 31019 drwxr-xr-x 4 15806 15806 4096 Dec 12 2024 usr
> 324 drwxr-xr-x 4 15806 15806 4096 Dec 12 2024 virt
>
> Note how different directories have different inode numbers, which are
> in different block groups. This is how we naturally spread block
> allocations across different block groups. This is *specifically* to
> spread block allocations across the entire storage device. So for example:
>
> root@...-xfstests:~# filefrag -v /vdc/arch/Kconfig
> Filesystem type is: ef53
> File size of /vdc/arch/Kconfig is 51709 (13 blocks of 4096 bytes)
> ext: logical_offset: physical_offset: length: expected: flags:
> 0: 0.. 12: 67551.. 67563: 13: last,eof
> /vdc/arch/Kconfig: 1 extent found
>
> root@...-xfstests:~# filefrag -v /vdc/sound/Makefile
> Filesystem type is: ef53
> File size of /vdc/sound/Makefile is 562 (1 block of 4096 bytes)
> ext: logical_offset: physical_offset: length: expected: flags:
> 0: 0.. 0: 574197.. 574197: 1: last,eof
> /vdc/sound/Makefile: 1 extent found
>
> See? They are spread across the LBA's. Quod Erat Demonstrandum.
>
> By the way, spreading block allocations across LBA's was not done
> out of concern for flash storage. The ext2, ext3, and ext4
> file systems have had this support for over a quarter of a century,
> because spreading the blocks across the file system avoids file
> fragmentation. It's a technique that we took from BSD's Fast File
> System, called the Orlov algorithm. For more information, see [1], or
> the ext4 sources [2].
>
> [1] https://en.wikipedia.org/wiki/Orlov_block_allocator
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/tree/fs/ext4/ialloc.c#n398
>
>> My concern is a potential policy interaction: filesystem locality
>> policies tend to concentrate hot metadata and early allocations. During
>> deallocation, we naturally discard/trim those blocks ASAP to make them
>> ready for write, thus optimizing for speed, while at the same time signaling
>> them as free. Meanwhile, an underlying WL policy (if present) tries to
>> consume free blocks opportunistically.
>> If these two interact poorly, the result can be a sustained bias toward
>> low-LBA hot regions (as observable in practice).
>> The elephant is in the room and is called “wear” / hotspots at the LBA
>> start.
>
> First of all, most of the "sustained bias towards low-LBA regions" is
> not because of where data blocks are located, but because of the
> location of static metadata blocks: in particular, the superblock,
> block group descriptors, and the allocation bitmaps. Having static
> metadata is not unique to ext2/ext3/ext4. The FAT file system has the
> File Allocation Table in low-numbered LBA's, which is constantly
> updated whenever blocks are allocated. Even log-structured and
> copy-on-write file systems, such as btrfs, f2fs, and ZFS, have a
> superblock at a static location which gets rewritten at every file
> system commit.
>
> Secondly, *because* all file systems rewrite certain LBA's, and
> because of how flash erase blocks work, pretty much all flash
> translation layers of the past two decades are *designed* to deal
> with it. Because of digital cameras and the FAT file system, pretty
> much all flash storage does *not* have a static mapping between a
> particular LBA and a specific set of flash cells. The fact that you
> keep asserting that "hotspots at the LBA start" are a problem
> indicates to me that you don't understand how SSD's work in real life.
>
> So I commend to you these two articles:
>
> https://flashdba.com/2014/06/20/understanding-flash-blocks-pages-and-program-erases/
> https://flashdba.com/2014/09/17/understanding-flash-the-flash-translation-layer/
>
> These web pages date from 12 years ago, because SSD technology is, by
> 2026, very old technology in an industry where two years == infinity.
>
> For a more academic perspective, there is this paper from the 2009
> First International Conference on Advances in System Simulation,
> published by researchers from Pennsylvania State University:
>
> https://www.cse.psu.edu/~buu1/papers/ps/flashsim.pdf
>
> FlashSim is available as open source, and has since been used by many
> other researchers to explore improvements in flash translation layers.
> And even the most basic FTL algorithms mean that your proposed
> RotAlloc is ***pointless***. If you think otherwise, you're going to
> need to provide convincing evidence.
Hi Ted,
Let me try to clarify this in a way that avoids talking past each other.
I fully agree with the allocator theory, the Orlov algorithm, and with
your demonstration.
I am not disputing *anything*, nor have I ever intended to.
The pattern I keep referring to as “observable in practice” is about
repeated free -> reallocate cycles, allocator restart points, and reuse
bias - i.e., which regions of the address space are revisited most
frequently over time.
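To make "reuse bias" concrete rather than hand-wavy, here is a minimal
userspace sketch (illustration only, not part of the patch; the device
path, sample count and interval are placeholders) of how per-group
free-block counts could be sampled with dumpe2fs, so that groups whose
counts keep changing between samples stand out as the ones being
revisited:

#!/usr/bin/env python3
# Illustration only: sample per-block-group free block/cluster counts via
# dumpe2fs and report which groups change most often between samples.
# Device path, sample count and interval are placeholders; counts reflect
# the on-disk bitmaps, so expect some lag on a live filesystem.
import re
import subprocess
import sys
import time
from collections import defaultdict

DEV = sys.argv[1] if len(sys.argv) > 1 else "/dev/vdc"  # placeholder
SAMPLES = 10                                            # arbitrary
INTERVAL = 60                                           # seconds, arbitrary

def free_per_group(dev):
    out = subprocess.run(["dumpe2fs", dev], capture_output=True,
                         text=True).stdout
    groups, cur = {}, None
    for line in out.splitlines():
        m = re.match(r"Group (\d+):", line)
        if m:
            cur = int(m.group(1))
            continue
        m = re.search(r"(\d+) free (?:blocks|clusters)", line)
        if m and cur is not None:
            groups[cur] = int(m.group(1))
    return groups

changes = defaultdict(int)
prev = free_per_group(DEV)
for _ in range(SAMPLES):
    time.sleep(INTERVAL)
    cur = free_per_group(DEV)
    for g, free in cur.items():
        if prev.get(g) != free:
            changes[g] += 1   # the free count moved => this group was touched
    prev = cur

for g, n in sorted(changes.items(), key=lambda kv: -kv[1])[:20]:
    print(f"group {g:6d} changed in {n}/{SAMPLES} samples")

(Run as root against the block device; dumpe2fs is happy to read a
mounted filesystem.)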
>
>> Again, we’re not focusing solely on wear leveling here, but since we
>> can’t influence the WL implementation itself, the only lever we have is
>> our own allocation policy.
>
> You claim that you're not focusing on wear leveling, but every single
> justification for your changes references "wear / hotspotting". I'm
> trying to tell you that it's not an issue. If you think it *could* be
> an issue, *ever*, you need to provide *proof* --- at the very least,
> proof that you understand things like how flash erase blocks work, how
> flash translation layers work, and how the Orlov block allocation
> algorithm works. Because with all due respect, it appears that you
> are profoundly ignorant, and it's not clear why we should be
> respecting your opinion and your arguments. If you think we should,
> you really need to up your game.
>
> Regards,
>
> - Ted
Although I admitted being WL-inspired right from the start, I maintain
that *this is not* wear leveling - WL deals with reallocations,
translations, amplification history... This simply *is not* that.
Calling it "wear leveling" would be like an election promise - it might,
but probably won’t, come true.
The question I’m raising is much narrower: whether allocator
policy choices can unintentionally reinforce reuse patterns under
certain workloads - and whether offering an *alternative policy* is
reasonable (and, I dare say, in some cases better suited).
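To make that narrower question concrete as well, here is a deliberately
toy model (not ext4 code and not the patch; the group-selection details
are invented purely for illustration) of why the choice of the
allocator's starting group matters for how often particular groups are
revisited when a workload keeps freeing and reallocating files under
the same few parent directories:

#!/usr/bin/env python3
# Toy model (not ext4 code): compare how often each "block group" is
# picked as the allocation goal when the starting group is locality-biased
# versus rotated, under a workload that keeps freeing and reallocating
# small files whose parents cluster in a few groups.
import random
from collections import Counter

GROUPS = 64          # pretend filesystem with 64 block groups
CYCLES = 100_000     # repeated allocate/free cycles

def locality_goal(parent_group, _state):
    # Simplified stand-in for a locality-first policy: start at the parent.
    return parent_group

def rotating_goal(_parent_group, state):
    # Simplified stand-in for the proposed idea: advance a cursor each time.
    state["cursor"] = (state["cursor"] + 1) % GROUPS
    return state["cursor"]

def simulate(goal_fn):
    touched = Counter()
    state = {"cursor": 0}
    for _ in range(CYCLES):
        parent = random.choice((0, 1, 2, 3))  # parents cluster in a few groups
        touched[goal_fn(parent, state)] += 1
    return touched

for name, fn in (("locality", locality_goal), ("rotating", rotating_goal)):
    top_group, hits = simulate(fn).most_common(1)[0]
    print(f"{name:9s}: hottest group {top_group} took "
          f"{100 * hits / CYCLES:.1f}% of the goal picks")

Real workloads and the real mballoc paths are of course far messier than
this; the only point of the toy is that the starting-group policy is a
lever, not that it proves anything about wear.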
I was consciously avoiding turning this into a "your stats vs. my stats"
and/or "your methods vs. my methods" discussion.
However, to avoid arguing from theory alone, I will follow up with a
small set of real-world examples.
https://github.com/mlohajner/elephant-in-the-room
These are snapshots from different systems, illustrating the point I’m
presenting here. Provided as-is, without annotations; while they do not
show the allocation bitmap explicitly, they are statistically correlated
with the most frequently used blocks/groups across the LBA space.
Given that another maintainer has already expressed support for making
this an *optional policy, disabled by default*, I believe this discussion
is less about allocator theory correctness and more about whether
accommodating real-world workload diversity is desirable.
Regards,
Mario
P.S.
I'm so altruistic I dare say this out loud:
At this point, my other concern is this: if we reach common ground and
make it optional, and it truly helps more than it hurts, who will
actually ever use it? :-)
(Assuming end users even know it exists, so that they can adopt it as a
natural progression/improvement.)