Message-ID: <jqodgs2ui26a2rvphdlxbabcwertvollwquiwg7eekai4pmwxl@amgb2mfhh3lb>
Date: Sun, 27 Apr 2025 23:22:37 -0400
From: Kent Overstreet <kent.overstreet@...ux.dev>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Eric Biggers <ebiggers@...nel.org>, Autumn Ashton <misyl@...ggi.es>,
Matthew Wilcox <willy@...radead.org>, Theodore Ts'o <tytso@....edu>, linux-bcachefs@...r.kernel.org,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [GIT PULL] bcachefs fixes for 6.15-rc4

On Sun, Apr 27, 2025 at 07:53:26PM -0700, Linus Torvalds wrote:
> On Sun, 27 Apr 2025 at 19:34, Kent Overstreet <kent.overstreet@...ux.dev> wrote:
> >
> > Do you mean to say that we invented yet another incompatible unicode
> > casefolding scheme?
> >
> > Dear god, why?
>
> Oh, Unicode itself comes with multiple "you can do this" schemes.
>
> It's designed by committee, and meant for different situations and
> different uses. Because the rules for things like sorting names are
> wildly different even for the same language, just for different
> contexts.
>
> Think of Unicode as "several decades of many different people coming
> together, all having very different use cases".
>
> So you find four different normalization forms, all with different use-cases.

I'm still dying to know why we had to invent our own, though. The
proliferation of standards is just ridiculous.

> And guess what? The only actual *valid* scheme for a filesystem is
> none of the four. Literally. It's to say "we don't normalize".
>
> Because the normalization forms are not meant to be some kind of "you
> should do this".
>
> They are meant as a kind of "if you are going to do X, then you can
> normalize into form Y, which makes doing X easier". And often the
> normalized form should only ever be an intermediate _temporary_ form
> for doing comparisons, not the actual form you save things in.
>
> Sadly, people so often get it wrong.
>
> For example, one very typical "you got it wrong, because you didn't
> understand the problem" case is to do comparisons by normalizing both
> sides (in one of the normalization forms) and then doing the
> comparison in that form.
>
> And guess what? 99.9% of the time, you just wasted enormous amounts of
> time, because you could have done the comparison first *without* any
> normalization at all, because equality is equality even when neither
> side is normalized.

Yeah, that's another point in favor of "index both the normalized and
un-normalized form".

i.e.: the normalized index is a special thing that doesn't have to
exist, and we only check it if the lookup in the un-normalized index
fails.

Case-insensitive capable filesystems could act just like normal
filesystems, unless specific pids opted into the extra "normalized
lookups" path.
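
A rough sketch of the two-index idea, in Python rather than kernel C and
not how bcachefs or any other filesystem actually implements it. The names
(Directory, fold_name, the casefold flag standing in for a per-process
opt-in) are hypothetical, and the folding shown (NFD plus Python's
str.casefold) is just an assumption for illustration. The exact-match index
is always consulted first, so the normalization cost is only paid when that
lookup misses and the caller has opted in:

```python
import unicodedata


def fold_name(name: str) -> str:
    # Hypothetical folding: decompose (NFD), apply Unicode case folding,
    # then decompose again in case folding produced composed characters.
    decomposed = unicodedata.normalize("NFD", name)
    return unicodedata.normalize("NFD", decomposed.casefold())


class Directory:
    """Toy directory with an exact index and an optional folded index."""

    def __init__(self):
        self.exact = {}   # stored name -> inode number
        self.folded = {}  # folded name -> stored name

    def create(self, name: str, inode: int) -> None:
        if name in self.exact:
            raise FileExistsError(name)
        self.exact[name] = inode
        # Assumption: on a folded collision the first name created wins.
        # The email does not say how collisions would be handled.
        self.folded.setdefault(fold_name(name), name)

    def lookup(self, name: str, casefold: bool = False):
        # Fast path: byte-for-byte match, no normalization at all.
        if name in self.exact:
            return self.exact[name]
        # Callers that did not opt in see a plain case-sensitive filesystem.
        if not casefold:
            return None
        # Slow path: only reached when the exact lookup failed.
        stored = self.folded.get(fold_name(name))
        return None if stored is None else self.exact[stored]
```

With this, lookup("makefile") on a directory containing "Makefile" misses
by default, and only finds it when called with casefold=True.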