Message-ID: <20260207053106.GA87551@macsyma.lan>
Date: Sat, 7 Feb 2026 00:31:06 -0500
From: "Theodore Tso" <tytso@....edu>
To: Mario Lohajner <mario_lohajner@...ketmail.com>
Cc: Baokun Li <libaokun1@...wei.com>, adilger.kernel@...ger.ca,
linux-ext4@...r.kernel.org, linux-kernel@...r.kernel.org,
Yang Erkun <yangerkun@...wei.com>, libaokun9@...il.com
Subject: Re: [PATCH] ext4: add optional rotating block allocation policy
On Fri, Feb 06, 2026 at 08:25:24PM +0100, Mario Lohajner wrote:
> What is observable in practice, however, is persistent allocation locality
> near the beginning of the LBA space under real workloads, and a
> corresponding concentration of wear in that area, interestingly it seems to
> be vendor-agnostic. The force within is very strong :-)
This is simply not true. Data blocks are *not* confined to low-numbered
LBAs in any kind of reasonable real-world situation. Why do you think
this is true, and what was the experiment that led you to believe this?
Let me show you *my* experiment:
root@...-xfstests:~# /sbin/mkfs.ext4 -qF /dev/vdc 5g
root@...-xfstests:~# mount /dev/vdc /vdc
[ 171.091299] EXT4-fs (vdc): mounted filesystem 06dd464f-1c3a-4a2b-b3dd-e937c1e7624f r/w with ordered data mode. Quota mode: none.
root@...-xfstests:~# tar -C /vdc -xJf /vtmp/ext4-6.12.tar.xz
root@...-xfstests:~# ls -li /vdc
total 1080
31018 -rw-r--r-- 1 15806 15806 496 Dec 12 2024 COPYING
347 -rw-r--r-- 1 15806 15806 105095 Dec 12 2024 CREDITS
31240 drwxr-xr-x 75 15806 15806 4096 Dec 12 2024 Documentation
31034 -rw-r--r-- 1 15806 15806 2573 Dec 12 2024 Kbuild
31017 -rw-r--r-- 1 15806 15806 555 Dec 12 2024 Kconfig
30990 drwxr-xr-x 6 15806 15806 4096 Dec 12 2024 LICENSES
323 -rw-r--r-- 1 15806 15806 781906 Dec 1 21:34 MAINTAINERS
19735 -rw-r--r-- 1 15806 15806 68977 Dec 1 21:34 Makefile
14 -rw-r--r-- 1 15806 15806 726 Dec 12 2024 README
1392 drwxr-xr-x 23 15806 15806 4096 Dec 12 2024 arch
669 drwxr-xr-x 3 15806 15806 4096 Dec 1 21:34 block
131073 drwxr-xr-x 2 15806 15806 4096 Dec 12 2024 certs
31050 drwxr-xr-x 4 15806 15806 4096 Dec 1 21:34 crypto
143839 drwxr-xr-x 143 15806 15806 4096 Dec 12 2024 drivers
140662 drwxr-xr-x 81 15806 15806 4096 Dec 1 21:34 fs
134043 drwxr-xr-x 32 15806 15806 4096 Dec 12 2024 include
31035 drwxr-xr-x 2 15806 15806 4096 Dec 1 21:34 init
140577 drwxr-xr-x 2 15806 15806 4096 Dec 1 21:34 io_uring
140648 drwxr-xr-x 2 15806 15806 4096 Dec 1 21:34 ipc
771 drwxr-xr-x 22 15806 15806 4096 Dec 1 21:34 kernel
143244 drwxr-xr-x 20 15806 15806 12288 Dec 1 21:34 lib
11 drwx------ 2 root root 16384 Feb 6 16:34 lost+found
22149 drwxr-xr-x 6 15806 15806 4096 Dec 1 21:34 mm
19736 drwxr-xr-x 72 15806 15806 4096 Dec 12 2024 net
42649 drwxr-xr-x 7 15806 15806 4096 Dec 1 21:34 rust
349 drwxr-xr-x 42 15806 15806 4096 Dec 12 2024 samples
42062 drwxr-xr-x 19 15806 15806 12288 Dec 1 21:34 scripts
15 drwxr-xr-x 15 15806 15806 4096 Dec 1 21:34 security
131086 drwxr-xr-x 27 15806 15806 4096 Dec 12 2024 sound
22351 drwxr-xr-x 45 15806 15806 4096 Dec 12 2024 tools
31019 drwxr-xr-x 4 15806 15806 4096 Dec 12 2024 usr
324 drwxr-xr-x 4 15806 15806 4096 Dec 12 2024 virt
Note how different directories have different inode numbers, which land
in different block groups. This is how we naturally spread block
allocations across different block groups, and it is done
*specifically* to spread allocations across the entire storage device.
So for example:
root@...-xfstests:~# filefrag -v /vdc/arch/Kconfig
Filesystem type is: ef53
File size of /vdc/arch/Kconfig is 51709 (13 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 12: 67551.. 67563: 13: last,eof
/vdc/arch/Kconfig: 1 extent found
root@...-xfstests:~# filefrag -v /vdc/sound/Makefile
Filesystem type is: ef53
File size of /vdc/sound/Makefile is 562 (1 block of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 0: 574197.. 574197: 1: last,eof
/vdc/sound/Makefile: 1 extent found
See? They are not concentrated in low-numbered LBAs. Quod erat
demonstrandum.
By the way, spreading block allocations across LBAs was not done
because of a concern about flash storage. The ext2, ext3, and ext4
file systems have had this support for over a quarter of a century,
because spreading the blocks across the file system avoids file
fragmentation. It's a technique that we took from BSD's Fast File
System, called the Orlov algorithm. For more information, see [1], or
look in the ext4 sources[2].
[1] https://en.wikipedia.org/wiki/Orlov_block_allocator
[2] https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/tree/fs/ext4/ialloc.c#n398
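Purely as an illustration (this is *not* the ext4 code; see
find_group_orlov() in fs/ext4/ialloc.c for the real thing, and the
per-group numbers below are made up), here is a toy sketch of the
spreading idea: a new top-level directory is placed in a block group
with above-average free inodes and free blocks, and the search starts
at a varying point, so successive directory trees land in different
regions of the device:

/*
 * Toy illustration of the spreading idea behind the Orlov allocator.
 * Not the real ext4 code; the group counts are invented.
 */
#include <stdio.h>

#define NR_GROUPS 8

struct group {
        unsigned int free_inodes;
        unsigned int free_blocks;
};

/*
 * Pick a block group for a new top-level directory: prefer a group
 * with above-average free inodes and free blocks, scanning from a
 * varying starting point so that successive directories spread out.
 */
static int pick_group_for_dir(struct group *g, int start)
{
        unsigned int avg_inodes = 0, avg_blocks = 0;
        int i;

        for (i = 0; i < NR_GROUPS; i++) {
                avg_inodes += g[i].free_inodes;
                avg_blocks += g[i].free_blocks;
        }
        avg_inodes /= NR_GROUPS;
        avg_blocks /= NR_GROUPS;

        for (i = 0; i < NR_GROUPS; i++) {
                int grp = (start + i) % NR_GROUPS;

                if (g[grp].free_inodes >= avg_inodes &&
                    g[grp].free_blocks >= avg_blocks)
                        return grp;
        }
        return start;           /* fall back to the starting group */
}

int main(void)
{
        /* made-up per-group free counts for a small toy file system */
        struct group groups[NR_GROUPS] = {
                { 100,  4000 }, { 800, 30000 }, { 750, 28000 },
                { 900, 31000 }, { 200,  9000 }, { 850, 29000 },
                { 400, 15000 }, { 950, 32000 },
        };
        int i;

        for (i = 0; i < NR_GROUPS; i++) {
                /*
                 * The real code starts the search at a pseudo-random
                 * group for top-level directories; fake that here
                 * with a simple stride.
                 */
                int grp = pick_group_for_dir(groups, (i * 3) % NR_GROUPS);

                printf("dir%d -> block group %d\n", i, grp);
                groups[grp].free_inodes--;
                groups[grp].free_blocks -= 10;
        }
        return 0;
}

Regular files then inherit an allocation goal near their directory's
block group, which is why arch/Kconfig and sound/Makefile above ended
up at physical blocks 67551 and 574197 respectively.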
> My concern is a potential policy interaction: filesystem locality
> policies tend to concentrate hot metadata and early allocations. During
> deallocation, we naturally discard/trim those blocks ASAP to make them
> ready for write, thus optimizing for speed, while at the same time signaling
> them as free. Meanwhile, an underlying WL policy (if present) tries to
> consume free blocks opportunistically.
> If these two interact poorly, the result can be a sustained bias toward
> low-LBA hot regions (as observable in practice).
> The elephant is in the room and is called “wear” / hotspots at the LBA
> start.
First of all, most of the "sustained bias towards low-LBA regions" is
not because of where data blocks are located, but because of the
location of static metadata blocks: in particular, the superblock, the
block group descriptors, and the allocation bitmaps. Having static
metadata is not unique to ext2/ext3/ext4. The FAT file system has the
File Allocation Table in low-numbered LBAs, and it is constantly
updated whenever blocks are allocated. Even copy-on-write and
log-structured file systems, such as btrfs, f2fs, and ZFS, have a
superblock at a static location which gets rewritten at every file
system commit.
Secondly, *because* all file systems rewrite certain LBAs, and because
of how flash erase blocks work, pretty much all flash translation
layers of the past two decades are *designed* to deal with this.
Thanks to digital cameras and the FAT file system, pretty much all
flash storage does *not* have a static mapping between a particular
LBA and a specific set of flash cells. The fact that you keep
asserting that "hotspots at the LBA start" are a problem indicates to
me that you don't understand how SSDs work in real life.
So I commend to you these two articles:
https://flashdba.com/2014/06/20/understanding-flash-blocks-pages-and-program-erases/
https://flashdba.com/2014/09/17/understanding-flash-the-flash-translation-layer/
These web pages date from twelve years ago; that's because SSD
technology is, in 2026, very old technology in an industry where two
years == infinity.
For a more academic perspective, there's this paper from the 2009
First International Conference on Advances in System Simulation,
published by researchers from Pennsylvania State University:
https://www.cse.psu.edu/~buu1/papers/ps/flashsim.pdf
FlashSim is available as open source, and it has since been used by
many other researchers to explore improvements in Flash Translation
Layers. Even the most basic FTL algorithms mean that your proposed
RotAlloc is ***pointless***. If you think otherwise, you're going to
need to provide convincing evidence.
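To make that concrete, here is a toy page-mapped FTL sketch (purely
illustrative; real FTLs add garbage collection, wear counters, hybrid
block/page mappings, and so on). Rewriting the *same* LBA over and
over gets redirected to a different physical flash page each time,
which is exactly why a "hot" LBA does not translate into a hot set of
flash cells:

/*
 * Toy page-mapped FTL.  Purely illustrative: real FTLs add garbage
 * collection, wear counters, hybrid mappings, and much more.
 */
#include <stdio.h>

#define NR_PAGES 16

static int l2p[NR_PAGES];       /* logical page (LBA) -> physical page */
static int valid[NR_PAGES];     /* is this physical page a live copy?  */
static int next_free;           /* next unwritten physical page        */

/*
 * "Write" an LBA: invalidate the old physical copy and remap the LBA
 * to a fresh physical page.  Reclaiming the invalidated pages
 * (garbage collection) is deliberately left out of this toy.
 */
static void ftl_write(int lba)
{
        if (l2p[lba] >= 0)
                valid[l2p[lba]] = 0;    /* old copy becomes garbage */

        l2p[lba] = next_free;           /* remap to a fresh page */
        valid[next_free] = 1;
        next_free = (next_free + 1) % NR_PAGES;

        printf("write LBA %d -> physical page %d\n", lba, l2p[lba]);
}

int main(void)
{
        int i;

        for (i = 0; i < NR_PAGES; i++)
                l2p[i] = -1;            /* nothing mapped yet */

        /*
         * Hammer LBA 0 (think: the FAT, or an ext4 superblock).  Each
         * rewrite lands on a different physical page.
         */
        for (i = 0; i < 6; i++)
                ftl_write(0);
        return 0;
}

In other words, by the time the FTL is done, which *logical* block
addresses the file system rewrites most often has very little to do
with which flash cells actually wear out.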
> Again, we’re not focusing solely on wear leveling here, but since we
> can’t influence the WL implementation itself, the only lever we have is
> our own allocation policy.
You claim that you're not focusing on wear leveling, but every single
justification for your changes references "wear / hotspotting". I'm
trying to tell you that it's not an issue. If you think it *could* be
an issue, *ever*, you need to provide *proof* --- at the very least,
proof that you understand things like how flash erase blocks work, how
flash translation layers work, and how the Orlov block allocation
algorithm works. Because with all due respect, it appears that you
are profoundly ignorant, and it's not clear why we should be
respecting your opinion and your arguments. If you think we should,
you really need to up your game.
Regards,
- Ted