linux-kernel - Re: [GIT PULL] bcachefs fixes for 6.15-rc4

Message-ID: <jqodgs2ui26a2rvphdlxbabcwertvollwquiwg7eekai4pmwxl@amgb2mfhh3lb>
Date: Sun, 27 Apr 2025 23:22:37 -0400
From: Kent Overstreet <kent.overstreet@...ux.dev>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Eric Biggers <ebiggers@...nel.org>, Autumn Ashton <misyl@...ggi.es>, 
	Matthew Wilcox <willy@...radead.org>, Theodore Ts'o <tytso@....edu>, linux-bcachefs@...r.kernel.org, 
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [GIT PULL] bcachefs fixes for 6.15-rc4

On Sun, Apr 27, 2025 at 07:53:26PM -0700, Linus Torvalds wrote:
> On Sun, 27 Apr 2025 at 19:34, Kent Overstreet <kent.overstreet@...ux.dev> wrote:
> >
> > Do you mean to say that we invented yet another incompatible unicode
> > casefolding scheme?
> >
> > Dear god, why?
> 
> Oh, Unicode itself comes with multiple "you can do this" schemes.
> 
> It's designed by committee, and meant for different situations and
> different uses.  Because the rules for things like sorting names are
> wildly different even for the same language, just for different
> contexts.
> 
> Think of Unicode as "several decades of many different people coming
> together, all having very different use cases".
> 
> So you find four different normalization forms, all with different use-cases.

I'm still dying to know why we had to invent our own, though. The
proliferation of standards is just ridiculous.

> And guess what? The only actual *valid* scheme for a filesystem is
> none of the four. Literally. It's to say "we don't normalize".
> 
> Because the normalization forms are not meant to be some kind of "you
> should do this".
> 
> They are meant as a kind of "if you are going to do X, then you can
> normalize into form Y, which makes doing X easier". And often the
> normalized form should only ever be an intermediate _temporary_ form
> for doing comparisons, not the actual form you save things in.
> 
> Sadly, people so often get it wrong.
> 
> For example, one very typical "you got it wrong, because you didn't
> understand the problem" case is to do comparisons by normalizing both
> sides (in one of the normalization forms) and then doing the
> comparison in that form.
> 
> And guess what? 99.9% of the time, you just wasted enormous amounts of
> time, because you could have done the comparison first *without* any
> normalization at all, because equality is equality even when neither
> side is normalized.

Yeah, that's another point in favor of "index both the normalized and
un-normalized form".

That is, the normalized index is an optional extra: it doesn't have to
exist, and we only check it if the lookup in the un-normalized index
fails.

Case-insensitive capable filesystems could act just like normal
filesystems, unless specific pids opted into the extra "normalized
lookups" path.
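
A rough sketch of that two-index lookup, in Python for brevity rather than kernel C. Everything here is hypothetical and for illustration only: the `Directory` class, the `fold_key` helper, and the `casefold` flag (standing in for a per-process opt-in) are assumptions, not bcachefs code, and NFD plus `str.casefold()` is just a stand-in for whatever folding scheme a filesystem would really use.

```python
import unicodedata


def fold_key(name: str) -> str:
    # Stand-in folding scheme (an assumption): NFD, casefold, NFD again.
    # Used only as a temporary comparison key, never stored as the name.
    folded = unicodedata.normalize("NFD", name).casefold()
    return unicodedata.normalize("NFD", folded)


class Directory:
    def __init__(self):
        self.exact = {}   # un-normalized name -> inode number
        self.folded = {}  # folded key -> un-normalized name (optional index)

    def create(self, name: str, inode: int) -> None:
        key = fold_key(name)
        if key in self.folded:
            raise FileExistsError(name)
        # The name is stored exactly as given; only the key is folded.
        self.exact[name] = inode
        self.folded[key] = name

    def lookup(self, name: str, casefold: bool = False):
        # Fast path: exact match, no normalization work at all.
        if name in self.exact:
            return self.exact[name]
        # Callers that did not opt in see a plain case-sensitive filesystem.
        if not casefold:
            return None
        # Slow path: consult the folded index only after the exact miss.
        raw = self.folded.get(fold_key(name))
        return self.exact[raw] if raw is not None else None


d = Directory()
d.create("Caf\u00e9.txt", 42)
print(d.lookup("Caf\u00e9.txt"))                      # exact hit
print(d.lookup("CAFE\u0301.TXT"))                     # miss without opt-in
print(d.lookup("CAFE\u0301.TXT", casefold=True))      # hit via folded index
```

The point of the shape is that the common case (the name matches byte for byte) never pays for normalization, and the folded index could be dropped entirely without affecting callers that never opted in.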