linux-kernel - Re: [GIT PULL] bcachefs fixes for 6.15-rc4

Message-ID: <CAHk-=wgtibpSH3+th-YjbQUSZVMbGNxG87oBDeqx+UkbHWejGw@mail.gmail.com>
Date: Sun, 27 Apr 2025 19:53:26 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Kent Overstreet <kent.overstreet@...ux.dev>
Cc: Eric Biggers <ebiggers@...nel.org>, Autumn Ashton <misyl@...ggi.es>, 
	Matthew Wilcox <willy@...radead.org>, "Theodore Ts'o" <tytso@....edu>, linux-bcachefs@...r.kernel.org, 
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [GIT PULL] bcachefs fixes for 6.15-rc4

On Sun, 27 Apr 2025 at 19:34, Kent Overstreet <kent.overstreet@...ux.dev> wrote:
>
> Do you mean to say that we invented yet another incompatible unicode
> casefolding scheme?
>
> Dear god, why?

Oh, Unicode itself comes with multiple "you can do this" schemes.

It's designed by committee, and meant for different situations and
different uses.  Because the rules for things like sorting names are
wildly different even for the same language, just for different
contexts.

Think of Unicode as "several decades of many different people coming
together, all having very different use cases".

So you find four different normalization forms, all with different use-cases.

And guess what? The only actual *valid* scheme for a filesystem is
none of the four. Literally. It's to say "we don't normalize".
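
As a rough illustration of the four forms mentioned above (NFC, NFD, NFKC, NFKD), the sketch below runs one short string through each of them using Python's standard unicodedata module. The sample string is an arbitrary choice made for this example, not something taken from any filesystem.

```python
import unicodedata

# Sample name (chosen for illustration): precomposed "é" followed by
# the "ﬁ" ligature, which only has a compatibility decomposition.
name = "\u00e9\ufb01"

forms = {
    form: unicodedata.normalize(form, name)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, text in forms.items():
    # Print code points so the differences between forms are visible.
    print(form, [f"U+{ord(ch):04X}" for ch in text])
```

For this input all four results are distinct code point sequences, which is the point: "normalized" does not name a single representation.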

Because the normalization forms are not meant to be some kind of "you
should do this".

They are meant as a kind of "if you are going to do X, then you can
normalize into form Y, which makes doing X easier". And often the
normalized form should only ever be an intermediate _temporary_ form
for doing comparisons, not the actual form you save things in.
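
A minimal sketch of "normalized form as a temporary" might look like the following. The helper name comparison_key and the NFD-plus-casefold recipe are assumptions made for illustration; this is not the kernel's implementation. The stored name is kept exactly as it was created, and the folded form exists only for the duration of the comparison.

```python
import unicodedata

def comparison_key(name: str) -> str:
    """Hypothetical helper: a throwaway key for caseless comparison."""
    folded = unicodedata.normalize("NFD", name).casefold()
    # Case folding can produce unnormalized text, so normalize again.
    return unicodedata.normalize("NFD", folded)

# The name exactly as the user created it: "Cafe" + combining acute.
stored = "Cafe\u0301.txt"
# What a later lookup asks for: upper case, precomposed "É".
wanted = "CAF\u00c9.TXT"

# The keys are temporaries; neither string is rewritten.
match = comparison_key(stored) == comparison_key(wanted)
print(match, repr(stored))
```

The design choice being illustrated is that the key is derived on demand and thrown away, so what the user wrote is what a later listing would hand back.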

Sadly, people so often get it wrong.

For example, one very typical "you got it wrong, because you didn't
understand the problem" case is to do comparisons by normalizing both
sides (in one of the normalization forms) and then doing the
comparison in that form.

And guess what? 99.9% of the time, you just wasted enormous amounts of
time, because you could have done the comparison first *without* any
normalization at all, because equality is equality even when neither
side is normalized.

And the *common* case is that you are comparing things that are in the
same form. For example, in filesystem operations, 99.999% of the time
when you do a 'stat()' the *source* of the 'stat()' is typically a
'readdir()' operation. So you are going to be using the same exact
form that the filesystem ALREADY HAD, and it's going to be an exact
match, and there will NEVER EVER be any case folding issues in those
situations.

But the "simplistic" way to do it is to always normalize - which
involves allocating temporary storage for the new form, doing a fairly
expensive transformation including case folding, and then comparing
those things.
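
The "compare first, normalize only if that fails" approach described above can be sketched roughly as follows. Everything here is hypothetical scaffolding for illustration (the fold and lookup helpers, the list standing in for a directory, the call counter); a real filesystem would use hashed lookups and its own folding tables.

```python
import unicodedata
from typing import List, Optional

slow_path_calls = 0  # how many times we paid for normalization

def fold(name: str) -> str:
    """Hypothetical folding helper: NFD plus case folding."""
    global slow_path_calls
    slow_path_calls += 1
    folded = unicodedata.normalize("NFD", name).casefold()
    return unicodedata.normalize("NFD", folded)

def lookup(entries: List[str], wanted: str) -> Optional[str]:
    # Fast path: exact equality needs no normalization at all.
    for entry in entries:
        if entry == wanted:
            return entry
    # Slow path: only now build the temporary folded forms.
    key = fold(wanted)
    for entry in entries:
        if fold(entry) == key:
            return entry
    return None

directory = ["Makefile", "Cafe\u0301.txt", "README"]

# readdir() followed by stat(): names come back exactly as stored,
# so every lookup is an exact match.
for name in list(directory):
    assert lookup(directory, name) == name
calls_after_readdir_loop = slow_path_calls

# Only a differently spelled name falls through to the slow path.
hit = lookup(directory, "CAF\u00c9.TXT")
print(calls_after_readdir_loop, repr(hit), slow_path_calls)
```

In this sketch the readdir-then-stat loop never touches the folding code; only the one lookup that uses a different spelling does.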

Christ.

The pure incompetence in case-insensitivity *hurts*.

And what is so sad is that all of this is self-inflicted damage by
filesystem people who SHOULD NOT HAVE DONE THE COMPLEXITY IN THE FIRST
PLACE!

It's a classic case of

  "Doctor, doctor, it hurts when I hit myself in the balls with this hammer"

and then people wonder why I claim the answer still remains -
and always will remain - "Don't do that then".

               Linus