Message-ID: <20250428031307.GE6134@sol.localdomain>
Date: Sun, 27 Apr 2025 20:13:07 -0700
From: Eric Biggers <ebiggers@...nel.org>
To: Kent Overstreet <kent.overstreet@...ux.dev>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Autumn Ashton <misyl@...ggi.es>,
Matthew Wilcox <willy@...radead.org>, Theodore Ts'o <tytso@....edu>,
linux-bcachefs@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [GIT PULL] bcachefs fixes for 6.15-rc4
On Sun, Apr 27, 2025 at 11:01:20PM -0400, Kent Overstreet wrote:
> On Sun, Apr 27, 2025 at 07:39:46PM -0700, Linus Torvalds wrote:
> > On Sun, 27 Apr 2025 at 19:22, Eric Biggers <ebiggers@...nel.org> wrote:
> > >
> > > I suspect that all that was really needed was case-insensitivity of ASCII a-z.
> >
> > Yes. That's my argument. I think anything else ends up being a
> > mistake. MAYBE extend it to the first 256 characters in Unicode (aka
> > "Latin1").
> >
> > Case folding on a-z is the only thing you could really effectively
> > rely on in user space even in the DOS times, because different
> > codepages would make for different rules for the upper 128 characters
> > anyway, and you could be in a situation where you literally couldn't
> > copy files from one floppy to another, because two files that had
> > distinct names on one floppy would have the *same* name on another
> > one.
> >
> > Of course, that was mostly a weird corner case that almost nobody ever
> > actually saw in practice, because very few people even used anything
> > else than the default codepage.
> >
> > And the same is afaik still true on NT, although practically speaking
> > I suspect it went from "unusual" to "really doesn't happen EVER in
> > practice".
>
> I'm having trouble finding anything authoritative, but what I'm seeing
> indicates that NTFS does do Unicode casefolding (and their own
> incompatible version, at that).
NTFS "just" has a 65536-entry table that maps UTF-16 code units to their
"upper case" equivalents. So it only does 1-to-1 codepoint mappings, and only
for U+FFFF and below.
I suspect that it's the same, or at least nearly the same, as what
https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt calls "simple"
casefolding (as opposed to "full" casefolding), but only for U+FFFF and below.
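To make the "simple, U+FFFF and below" restriction concrete, here is a minimal Python sketch. It assumes the "code; status; mapping; # name" line format documented in CaseFolding.txt, where statuses C and S together make up simple folding. The function name and the embedded sample lines are only for illustration, and nothing here is claimed to match the actual NTFS table.

```python
# A few sample lines in the CaseFolding.txt format (illustration only;
# a real run would read the whole file instead).
SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
10400; C; 10428; # DESERET CAPITAL LETTER LONG I
"""


def parse_simple_bmp_folding(text):
    """Hypothetical helper: build a 1-to-1 folding table for U+FFFF and below.

    Keeps only status C (common) and S (simple) entries, which always map
    one codepoint to one codepoint. F (full) and T (Turkic) are skipped.
    """
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        code, status, mapping = fields[0], fields[1], fields[2]
        if status not in ("C", "S"):
            continue
        src = int(code, 16)
        dst = int(mapping, 16)
        # Assumed restriction: ignore anything outside U+0000..U+FFFF.
        if src <= 0xFFFF and dst <= 0xFFFF:
            table[src] = dst
    return table


table = parse_simple_bmp_folding(SAMPLE)
```

With the sample above, only U+0041 and U+1E9E end up in the table. The sharp s expansion to "ss" is a full (F) mapping and the Deseret entry is above U+FFFF, so both are dropped.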
Of course, to implement the same with Linux's UTF-8 names, we won't be able to
just do a simple table lookup like Windows does. But it could still be
implemented -- we'd just decode the Unicode codepoints from the string and apply
the same mapping from there. Still much simpler than normalization.
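As a rough sketch of the decode-then-map idea (not kernel code, and not the NTFS table), the following Python decodes a UTF-8 name, applies a 1-to-1 mapping to each codepoint at or below U+FFFF, and leaves everything else alone. The table is a stand-in built from Python's str.upper(), keeping only single-codepoint results; fold_name() and names_equal() are hypothetical names for illustration.

```python
def build_table():
    """Stand-in 1-to-1 upcase table for U+0000..U+FFFF.

    Assumption: derived from Python's str.upper(), keeping only mappings
    that yield exactly one codepoint. This is NOT the NTFS table.
    """
    table = {}
    for cp in range(0x10000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not valid codepoints in UTF-8
        up = chr(cp).upper()
        if len(up) == 1 and ord(up) <= 0xFFFF and ord(up) != cp:
            table[cp] = ord(up)
    return table


UPCASE = build_table()


def fold_name(name: bytes) -> bytes:
    """Decode UTF-8, map each codepoint 1-to-1, and re-encode.

    Codepoints above U+FFFF, and those without a table entry, pass through
    unchanged. Invalid UTF-8 raises UnicodeDecodeError; how a filesystem
    should treat such names is a separate policy question.
    """
    decoded = name.decode("utf-8")
    folded = "".join(chr(UPCASE.get(ord(ch), ord(ch))) for ch in decoded)
    return folded.encode("utf-8")


def names_equal(a: bytes, b: bytes) -> bool:
    """Compare two UTF-8 names under the 1-to-1 mapping."""
    return fold_name(a) == fold_name(b)
```

Under this stand-in table, b"README.txt" and b"readme.TXT" compare equal, while "straße" and "STRASSE" do not, because the sharp s has no single-codepoint uppercase form and is left untouched. No normalization is done at any point.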
- Eric