Message-ID: <CAHk-=wgF7AjODAyO9n+8SfTiQd9-=zTLKh4SQP-xEpeMUPHvAw@mail.gmail.com>
Date: Wed, 11 Dec 2024 11:22:37 -0800
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Gabriel Krisman Bertazi <krisman@...e.de>
Cc: Jaegeuk Kim <jaegeuk@...nel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>, "hanqi@...o.com" <hanqi@...o.com>
Subject: Re: Unicode conversion issue
On Wed, 11 Dec 2024 at 08:08, Gabriel Krisman Bertazi <krisman@...e.de> wrote:
>
> It seems commit 5c26d2f1d3f5 ("unicode: Don't special case ignorable
> code points") has affected more than ignorable code points, because that
> U+2764 is not marked as Ignorable in the unicode database.
It's not U+2764 - "Heavy Black Heart".
It's U+2764 _and_ U+FE0F - "Variation Selector-16 (VS16)"
And VS16 asks that the heart be shown as an emoji, which in turn turns
that black heart red.
And presumably that VS16 is one of those idiotic "ignorable" characters.
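To make the two-code-point sequence concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The variable `name` and the helper `describe` are illustrative names, not anything from the kernel's utf8 code. It lists the code points in the string and checks that Python's own `str.casefold()` leaves them untouched, since neither code point has a case mapping.

```python
import unicodedata

# The file name under discussion: the heart followed by the variation selector.
name = "\u2764\ufe0f"

def describe(s):
    """Return (code point, Unicode name) pairs for every character in s."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in s]

for code_point, uni_name in describe(name):
    print(code_point, uni_name)

# Neither code point has a case mapping, so a pure case fold is a no-op here.
print(name.casefold() == name)
```

What looks like a single glyph on screen is two code points, and only the second one is a candidate for being dropped as "ignorable".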
Christ, I don't understand why some people still think that
casefolding is sane. It damn well isn't, exactly because it causes
these kinds of insane situations, because the "case folding" of "mark
it as an emoji" is damn well undefined.
> I still think the solution to the original issue is eliminating
> ignorable code points, and that should be fine. Let me look at why this
> block of characters is mishandled.
I suspect we'll have to revert, and re-examine.
Of course, in the meantime, somebody has probably already created
files with the *new* hashing, so even reverting might not "fix" the
issue.
The real fix is to not do casefolding, or at least to never *EVER*
trust the hashing of case-folded crap, because the hash is
fundamentally not reliable.
What a case-folding filesystem should do is
(a) preserve case and hash with that preserved case (which is
equivalent to NOT DOING CASE FOLDING! The user gave you binary data,
you *treat* it as binary sacred data instead of corrupting it)
(b) only use case folding for "I didn't find the exact case, let's
do an approximate search".
but decades of history have shown that filesystem people seem to be
unable to understand the whole notion of "you don't screw with people's
data".
That (a) guarantees that you get sane semantics for 1:1 names and that
you can *always* access the file using the preserved case.
And (b) is the "you get the insane case folded semantics for the
insane situation where it's needed, and never anywhere else".
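The (a)/(b) scheme can be sketched as a toy in-memory directory. This is only an illustration under stated assumptions: the `Directory` class and its methods are hypothetical, Python's `str.casefold()` stands in for whatever folding a real filesystem would use, and the linear scan in the fallback path is there for clarity, not as a suggested on-disk design.

```python
class Directory:
    """Toy directory: exact names are the only hash key."""

    def __init__(self):
        # (a) Keyed by the name exactly as the user supplied it.
        self._entries = {}

    def create(self, name, value):
        self._entries[name] = value

    def lookup(self, name):
        # (a) Exact lookup first: the stored name is never reinterpreted,
        # so a file can always be reached through its preserved name.
        if name in self._entries:
            return self._entries[name]

        # (b) Only when the exact name is missing, fall back to an
        # approximate, case-insensitive search over the stored names.
        folded = name.casefold()
        for stored, value in self._entries.items():
            if stored.casefold() == folded:
                return value
        return None


d = Directory()
d.create("README.md", "readme contents")
d.create("\u2764\ufe0f", "heart contents")

print(d.lookup("README.md"))      # exact hit, path (a)
print(d.lookup("readme.MD"))      # fallback hit, path (b)
print(d.lookup("\u2764\ufe0f"))   # exact hit, independent of any folding rule
print(d.lookup("missing"))        # no match on either path
```

Because the hash key is the untouched name, a change to the folding rules can only affect the fallback path, never the ability to open a file by the name it was created with.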
Alternatively, case-folding should only fold the really damn obvious
cases. That was the problem with the horrendous "ignorable code
points", where case-folding reacted to non-case characters by simply
ignoring them.
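A small sketch of why dropping ignorable code points changes name identity. The set `IGNORABLE_SAMPLE` is a hand-picked assumption containing a few code points that carry the Unicode Default_Ignorable_Code_Point property (Python's `unicodedata` does not expose that property), and both fold functions are hypothetical stand-ins, not the kernel's implementation.

```python
# Hand-picked sample of default-ignorable code points (assumption, not the full set):
# SOFT HYPHEN, ZERO WIDTH JOINER, VARIATION SELECTOR-16.
IGNORABLE_SAMPLE = {"\u00ad", "\u200d", "\ufe0f"}

def fold_dropping_ignorables(name):
    """Case-fold and silently discard the sample ignorable code points."""
    return "".join(c for c in name.casefold() if c not in IGNORABLE_SAMPLE)

def fold_keeping_ignorables(name):
    """Case-fold only; every code point the user supplied survives."""
    return name.casefold()

plain_heart = "\u2764"
emoji_heart = "\u2764\ufe0f"

# Under the dropping rule the two distinct names collapse into one key.
print(fold_dropping_ignorables(plain_heart) == fold_dropping_ignorables(emoji_heart))

# Under the keeping rule they stay distinct.
print(fold_keeping_ignorables(plain_heart) == fold_keeping_ignorables(emoji_heart))
```

The same name yields a different folded key depending on which rule is in force, so a hash computed under one rule will not in general match a lookup performed under the other.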
Damn how I hate broken filesystems that "interpret" the data that they
are given. Pure unadulterated garbage.
If I wanted made-up random crap and hallucinations, I'd ask ChatGPT,
not my filesystem.
Linus