Message-ID: <16f17918-9186-4416-bbde-b93482933d8b@rocketmail.com>
Date: Sat, 7 Feb 2026 13:45:06 +0100
From: Mario Lohajner <mario_lohajner@...ketmail.com>
To: Theodore Tso <tytso@....edu>
Cc: Baokun Li <libaokun1@...wei.com>, adilger.kernel@...ger.ca,
 linux-ext4@...r.kernel.org, linux-kernel@...r.kernel.org,
 Yang Erkun <yangerkun@...wei.com>, libaokun9@...il.com
Subject: Re: [PATCH] ext4: add optional rotating block allocation policy

On 2/7/26 06:31, Theodore Tso wrote:
> On Fri, Feb 06, 2026 at 08:25:24PM +0100, Mario Lohajner wrote:
>> What is observable in practice, however, is persistent allocation locality
>> near the beginning of the LBA space under real workloads, and a
>> corresponding concentration of wear in that area; interestingly, it seems
>> to be vendor-agnostic. The force within is very strong :-)
> 
> This is simply not true.  Data blocks are *not* located in the
> low-numbered LBA's in any kind of reasonable real-world situation.  Why
> do you think this is true, and what was your experiment that led you to
> believe this?
> 
> Let me show you *my* experiment:
> 
> root@...-xfstests:~# /sbin/mkfs.ext4 -qF /dev/vdc 5g
> root@...-xfstests:~# mount /dev/vdc /vdc
> [  171.091299] EXT4-fs (vdc): mounted filesystem 06dd464f-1c3a-4a2b-b3dd-e937c1e7624f r/w with ordered data mode. Quota mode: none.
> root@...-xfstests:~# tar -C /vdc -xJf /vtmp/ext4-6.12.tar.xz
> root@...-xfstests:~# ls -li /vdc
> total 1080
>   31018 -rw-r--r--   1 15806 15806    496 Dec 12  2024 COPYING
>     347 -rw-r--r--   1 15806 15806 105095 Dec 12  2024 CREDITS
>   31240 drwxr-xr-x  75 15806 15806   4096 Dec 12  2024 Documentation
>   31034 -rw-r--r--   1 15806 15806   2573 Dec 12  2024 Kbuild
>   31017 -rw-r--r--   1 15806 15806    555 Dec 12  2024 Kconfig
>   30990 drwxr-xr-x   6 15806 15806   4096 Dec 12  2024 LICENSES
>     323 -rw-r--r--   1 15806 15806 781906 Dec  1 21:34 MAINTAINERS
>   19735 -rw-r--r--   1 15806 15806  68977 Dec  1 21:34 Makefile
>      14 -rw-r--r--   1 15806 15806    726 Dec 12  2024 README
>    1392 drwxr-xr-x  23 15806 15806   4096 Dec 12  2024 arch
>     669 drwxr-xr-x   3 15806 15806   4096 Dec  1 21:34 block
> 131073 drwxr-xr-x   2 15806 15806   4096 Dec 12  2024 certs
>   31050 drwxr-xr-x   4 15806 15806   4096 Dec  1 21:34 crypto
> 143839 drwxr-xr-x 143 15806 15806   4096 Dec 12  2024 drivers
> 140662 drwxr-xr-x  81 15806 15806   4096 Dec  1 21:34 fs
> 134043 drwxr-xr-x  32 15806 15806   4096 Dec 12  2024 include
>   31035 drwxr-xr-x   2 15806 15806   4096 Dec  1 21:34 init
> 140577 drwxr-xr-x   2 15806 15806   4096 Dec  1 21:34 io_uring
> 140648 drwxr-xr-x   2 15806 15806   4096 Dec  1 21:34 ipc
>     771 drwxr-xr-x  22 15806 15806   4096 Dec  1 21:34 kernel
> 143244 drwxr-xr-x  20 15806 15806  12288 Dec  1 21:34 lib
>      11 drwx------   2 root  root   16384 Feb  6 16:34 lost+found
>   22149 drwxr-xr-x   6 15806 15806   4096 Dec  1 21:34 mm
>   19736 drwxr-xr-x  72 15806 15806   4096 Dec 12  2024 net
>   42649 drwxr-xr-x   7 15806 15806   4096 Dec  1 21:34 rust
>     349 drwxr-xr-x  42 15806 15806   4096 Dec 12  2024 samples
>   42062 drwxr-xr-x  19 15806 15806  12288 Dec  1 21:34 scripts
>      15 drwxr-xr-x  15 15806 15806   4096 Dec  1 21:34 security
> 131086 drwxr-xr-x  27 15806 15806   4096 Dec 12  2024 sound
>   22351 drwxr-xr-x  45 15806 15806   4096 Dec 12  2024 tools
>   31019 drwxr-xr-x   4 15806 15806   4096 Dec 12  2024 usr
>     324 drwxr-xr-x   4 15806 15806   4096 Dec 12  2024 virt
> 
> Note how different directories have different inode numbers, which are
> in different block groups.  This is how we naturally spread block
> allocations across different block groups.  This is *specifically* to
> spread block allocations across the entire storage device.  So for example:
> 
> root@...-xfstests:~# filefrag -v /vdc/arch/Kconfig
> Filesystem type is: ef53
> File size of /vdc/arch/Kconfig is 51709 (13 blocks of 4096 bytes)
>   ext:     logical_offset:        physical_offset: length:   expected: flags:
>     0:        0..      12:      67551..     67563:     13:             last,eof
> /vdc/arch/Kconfig: 1 extent found
> 
> root@...-xfstests:~# filefrag -v /vdc/sound/Makefile
> Filesystem type is: ef53
> File size of /vdc/sound/Makefile is 562 (1 block of 4096 bytes)
>   ext:     logical_offset:        physical_offset: length:   expected: flags:
>     0:        0..       0:     574197..    574197:      1:             last,eof
> /vdc/sound/Makefile: 1 extent found
> 
> See?  They are not all in low-numbered LBA's.  Quod Erat Demonstrandum.
> 
> By the way, spreading block allocations across LBA's was not done
> because of a concern about flash storage.  The ext2, ext3, and ext4
> filesystems have had this support for over a quarter of a century,
> because spreading the blocks across the file system avoids file
> fragmentation.  It's a technique that we took from BSD's Fast File
> System, called the Orlov algorithm.  For more information, see [1], or
> in the ext4 sources[2].
> 
> [1] https://en.wikipedia.org/wiki/Orlov_block_allocator
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/tree/fs/ext4/ialloc.c#n398
> 
>> My concern is a potential policy interaction: filesystem locality
>> policies tend to concentrate hot metadata and early allocations. During
>> deallocation, we naturally discard/trim those blocks ASAP to make them
>> ready for write, thus optimizing for speed, while at the same time signaling
>> them as free. Meanwhile, an underlying WL policy (if present) tries to
>> consume free blocks opportunistically.
>> If these two interact poorly, the result can be a sustained bias toward
>> low-LBA hot regions (as observable in practice).
>> The elephant in the room is called “wear”: hotspots at the LBA start.
> 
> First of all, most of the "sustained bias towards low-LBA regions" is
> not because of where data blocks are located, but because of the
> location of static metadata blocks: in particular, the superblock,
> block group descriptors, and the allocation bitmaps.  Having static
> metadata is not unique to ext2/ext3/ext4.  The FAT file system has the
> File Allocation Table in low-numbered LBA's, which is constantly
> updated whenever blocks are allocated.  Even log-structured file
> systems, such as btrfs, f2fs, and ZFS, have a superblock at a static
> location which gets rewritten at every file system commit.
> 
> Secondly, *because* all file systems rewrite certain LBA's, and because
> of how flash erase blocks work, pretty much all flash translation layers
> of the past two decades are *designed* to deal with it.
> Because of digital cameras and the FAT file system, pretty much all
> flash storage devices do *not* have a static mapping between a particular LBA
> and a specific set of flash cells.  The fact that you keep asserting
> that "hotspots at the LBA start" is a problem indicates to me that you
> don't understand how SSD's work in real life.
> 
> So I commend to you these two articles:
> 
> https://flashdba.com/2014/06/20/understanding-flash-blocks-pages-and-program-erases/
> https://flashdba.com/2014/09/17/understanding-flash-the-flash-translation-layer/
> 
> These web pages date from 12 years ago; by 2026, SSD technology is
> very old technology in an industry where two years == infinity.
> 
> For a more academic perspective, there's a paper from the 2009 First
> International Conference on Advances in System Simulation, published by
> researchers from Pennsylvania State University:
> 
>      https://www.cse.psu.edu/~buu1/papers/ps/flashsim.pdf
> 
> FlashSim is available as open source, and has since been used by
> many other researchers to explore improvements in flash translation
> layers.  And even the most basic FTL algorithms mean that your proposed
> RotAlloc is ***pointless***.  If you think otherwise, you're going to
> need to provide convincing evidence.

Hi Ted,

Let me try to clarify this in a way that avoids talking past each other.

I fully agree with the allocator theory, the Orlov algorithm, and with
your demonstration.
I am not disputing *anything*, nor have I ever intended to.

The pattern I keep referring to as “observable in practice” is about
repeated free -> reallocate cycles, allocator restart points, and reuse
bias - i.e., which regions of the address space are revisited most
frequently over time.
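
To make "revisited most frequently over time" concrete, below is the
kind of measurement I have in mind.  This is only a minimal sketch, not
the tooling behind my numbers: it assumes dumpe2fs(8) is installed, that
its per-group output matches the regexes below, and that the device path
(here your example /dev/vdc) is replaced with the filesystem actually
being observed.  It snapshots the per-group free-block counts twice and
reports the block groups whose counts moved the most in between, as a
rough proxy for allocate/free churn:

#!/usr/bin/env python3
# Minimal sketch: estimate per-block-group churn between two snapshots.
# Assumptions: dumpe2fs(8) is available, we have permission to read the
# device, and its "Group N:" / "NNN free blocks" lines match the regexes.
import re
import subprocess
import time

DEVICE = "/dev/vdc"          # replace with the device being observed
INTERVAL_SECONDS = 600       # time between the two snapshots

GROUP_RE = re.compile(r"^Group (\d+):")
FREE_RE = re.compile(r"(\d+) free blocks")

def free_blocks_per_group(device):
    """Parse `dumpe2fs <device>` into {group_number: free_block_count}."""
    out = subprocess.run(["dumpe2fs", device],
                         capture_output=True, text=True, check=True).stdout
    groups, current = {}, None
    for line in out.splitlines():
        m = GROUP_RE.match(line)
        if m:
            current = int(m.group(1))
            continue
        m = FREE_RE.search(line)
        if m and current is not None and current not in groups:
            groups[current] = int(m.group(1))
    return groups

before = free_blocks_per_group(DEVICE)
time.sleep(INTERVAL_SECONDS)
after = free_blocks_per_group(DEVICE)

# Groups whose free-block count moved the most are the ones being
# revisited; print the ten largest absolute deltas.
deltas = {g: abs(after.get(g, 0) - before.get(g, 0)) for g in before}
for group, delta in sorted(deltas.items(), key=lambda kv: -kv[1])[:10]:
    print(f"group {group:6d}  free-block delta {delta}")

(Two snapshots obviously miss blocks freed and reallocated within the
interval, so treat the output as an illustration, not a benchmark.)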

> 
>> Again, we’re not focusing solely on wear leveling here, but since we
>> can’t influence the WL implementation itself, the only lever we have is
>> our own allocation policy.
> 
> You claim that you're not focusing on wear leveling, but every single
> justification for your changes references "wear / hotspotting".  I'm
> trying to tell you that it's not an issue.  If you think it *could* be
> an issue, *ever*, you need to provide *proof* --- at the very least,
> proof that you understand things like how flash erase blocks work, how
> flash translation layers work, and how the Orlov block allocation
> algorithm works.  Because with all due respect, it appears that you
> are profoundly ignorant, and it's not clear why we should be
> respecting your opinion and your arguments.  If you think we should,
> you really need to up your game.
> 
> Regards,
> 
> 					- Ted

Although I admitted being WL-inspired right from the start, I maintain
that *this is not* wear leveling: WL deals with reallocations,
translations, and amplification history, and this simply *is not* that.
Calling it "wear leveling" would be like an election promise: it might,
but probably won’t, come true.

The question I’m raising is much narrower: whether allocator
policy choices can unintentionally reinforce reuse patterns under
certain workloads, and whether offering an *alternative policy* is
reasonable (and, I dare say, in some cases preferable).
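
To illustrate that question (and only that), here is a deliberately toy
model.  It is *not* ext4's allocator and *not* the proposed patch; the
group count, workload mix, and both policies are invented purely for
illustration.  It runs the same random allocate/free workload against a
"restart from a fixed goal" search and a "rotating start" search, and
counts how often each group receives an allocation:

#!/usr/bin/env python3
# Toy model only: NOT ext4's allocator and NOT the proposed patch.
# It shows how the choice of where a free-space search *starts* can skew
# which groups are revisited under a steady allocate/free workload.
import random
from collections import Counter

GROUPS = 64                  # pretend block groups
SLOTS_PER_GROUP = 128        # pretend blocks per group
STEPS = 200_000

def run(policy):
    free = [SLOTS_PER_GROUP] * GROUPS   # free "blocks" left in each group
    live = []                           # group index of each allocated block
    touched = Counter()                 # allocations landing in each group
    rotor = 0
    for _ in range(STEPS):
        if live and random.random() < 0.5:
            g = live.pop(random.randrange(len(live)))   # free a random block
            free[g] += 1
            continue
        # Pick where the search for a free group starts.
        if policy == "fixed-goal":
            start = 0                   # always restart near the front
        else:                           # "rotating"
            start = rotor
            rotor = (rotor + 1) % GROUPS
        for off in range(GROUPS):       # first group with space wins
            g = (start + off) % GROUPS
            if free[g] > 0:
                free[g] -= 1
                live.append(g)
                touched[g] += 1
                break
    return touched

for policy in ("fixed-goal", "rotating"):
    print(policy, "hottest groups:", run(policy).most_common(5))

The fixed-goal variant piles almost every allocation into the first few
groups, while the rotating variant spreads them nearly evenly; that is
the only kind of policy-induced reuse bias I am talking about, nothing
about the FTL underneath.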

I was consciously avoiding turning this into a “your stats vs. my stats”
or “your methods vs. my methods” discussion.
However, to avoid arguing from theory alone, here is a small set of
real-world examples:

https://github.com/mlohajner/elephant-in-the-room

These are snapshots from different systems, illustrating the point I’m
making here. They are provided as-is, without annotations; while they do
not show the allocation bitmap explicitly, they correlate statistically
with the most frequently used blocks/groups across the LBA space.
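
For anyone who would rather reproduce something similar locally than
take that repository at face value, here is one more minimal sketch
(again my own illustration, not the tool that produced those snapshots).
It assumes filefrag(8) is available, that its -v extent lines match the
regex below, and that the files passed on the command line are the
frequently rewritten files of interest; it then histograms where those
files physically sit, in coarse slices of the block range:

#!/usr/bin/env python3
# Minimal sketch: histogram the physical placement of the given files.
# Assumptions: filefrag(8) is available and its "-v" extent lines match
# the regex below.  Usage: placement-hist.py FILE...
import re
import subprocess
import sys
from collections import Counter

FILES = sys.argv[1:]         # e.g. the most frequently rewritten files
BUCKETS = 16                 # coarse slices of the physical block range

# filefrag -v extent lines look like:
#   0:        0..      12:      67551..     67563:     13:    last,eof
EXTENT_RE = re.compile(r"^\s*\d+:\s*\d+\.\.\s*\d+:\s*(\d+)\.\.\s*(\d+):")

extents = []
for path in FILES:
    out = subprocess.run(["filefrag", "-v", path],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        m = EXTENT_RE.match(line)
        if m:
            extents.append((int(m.group(1)), int(m.group(2))))

if not extents:
    sys.exit("no extents found")

# Bucket each extent by its starting block, relative to the highest block
# we saw (a crude stand-in for the size of the filesystem).
highest = max(end for _, end in extents)
hist = Counter()
for start, end in extents:
    hist[start * BUCKETS // (highest + 1)] += end - start + 1

for bucket in range(BUCKETS):
    print(f"slice {bucket:2d}/{BUCKETS}: {hist.get(bucket, 0):8d} blocks")

The script name and the file list are of course placeholders; point it
at whatever your workload actually rewrites.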

Given that another maintainer has already expressed support for making
this an *optional policy, disabled by default*, I believe this discussion
is less about the correctness of allocator theory and more about whether
accommodating real-world workload diversity is desirable.

Regards,
Mario

P.S.
I'm so altruistic I dare say this out loud:
At this point, my other concern is this: if we reach common ground and 
make it optional, and it truly helps more than it hurts, who will 
actually ever use it? :-)
(Assuming end users even know it exists, so that adopting it feels
like a natural progression/improvement.)
