Message-ID: <5ea8aeb1-3760-4d00-baac-a81a4c4c3986@froggi.es>
Date: Mon, 28 Apr 2025 03:05:19 +0100
From: Autumn Ashton <misyl@...ggi.es>
To: Kent Overstreet <kent.overstreet@...ux.dev>,
Eric Biggers <ebiggers@...nel.org>
Cc: Matthew Wilcox <willy@...radead.org>, Theodore Ts'o <tytso@....edu>,
Linus Torvalds <torvalds@...ux-foundation.org>,
linux-bcachefs@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [GIT PULL] bcachefs fixes for 6.15-rc4
On 4/28/25 2:43 AM, Kent Overstreet wrote:
> On Sun, Apr 27, 2025 at 06:30:59PM -0700, Eric Biggers wrote:
>> On Sun, Apr 27, 2025 at 08:55:30PM -0400, Kent Overstreet wrote:
>>> The thing is, that's exactly what we're doing. ext4 and bcachefs both
>>> refer to a specific revision of the folding rules: for ext4 it's
>>> specified in the superblock, for bcachefs it's hardcoded for the moment.
>>>
>>> I don't think this is the ideal approach, though.
>>>
>>> That means the folding rules are "whatever you got when you mkfs'd".
>>> Think about what that means if you've got a fleet of machines, of
>>> different ages, but all updated in sync: that's a really annoying way
>>> for gremlins of the "why does this machine act differently" variety to
>>> creep in.
>>>
>>> What I'd prefer is for the unicode folding rules to be transparently and
>>> automatically updated when the kernel is updated, so that behaviour
>>> stays in sync. That would behave more the way users would expect.
>>>
>>> But I only gave this real thought just over the past few days, and doing
>>> this safely and correctly would require some fairly significant changes
>>> to the way casefolding works.
>>>
>>> We'd have to ensure that lookups via the case sensitive name always
>>> work, even if the casefolding table the dirent was created with gives
>>> different results than the currently active casefolding table.
>>>
>>> That would require storing two different "dirents" for each real dirent,
>>> one normalized and one un-normalized, because we'd have to do an
>>> un-normalized lookup if the normalized lookup fails (and vice versa).
>>> Which should be completely fine from a performance POV, assuming we have
>>> working negative dentries.
>>>
>>> But, if the unicode folding rules are stable enough (and one would hope
>>> they are), hopefully all this is a non-issue.
>>>
>>> I'd have to gather more input from users of casefolding on other
>>> filesystems before saying what our long term plans (if any) will be.
>>
>> Wouldn't lookups via the case-sensitive name keep working even if the
>> case-insensitivity rules change? It's lookups via a case-insensitive name that
>> could start producing different results. Applications can depend on
>> case-insensitive lookups being done in a certain way, so changing the
>> case-insensitivity rules can be risky.
>
> No, because right now on a case-insensitive filesystem we _only_ do the
> lookup with the normalized name.
>
>> Regardless, the long-term plan for the case-insensitivity rules should be to
>> deprecate the current set of rules, which does Unicode normalization which is
>> way overkill. It should be replaced with a simple version of case-insensitivity
>> that matches what FAT does. And *possibly* also a version that matches what
>> NTFS does (a u16 upcase_table[65536] indexed by UTF-16 coding units), if someone
>> really needs that.
>>
>> As far as I know, that was all that was really needed in the first place.
>>
>> People misunderstood the problem as being about language support, rather than
>> about compatibility with legacy filesystems. And as a result they incorrectly
>> decided they should do Unicode normalization, which is way too complex and has
>> all sorts of weird properties.
>
> Believe me, I do see the appeal of that.
>
> One of the things I should really float with e.g. Valve is the
> possibility of providing tooling/auditing to make it easy to fix
> userspace code that's doing lookups that only work with casefolding.

This is not really about fixing userspace code that expects casefolding,
or providing some form of stopgap there.

The main need there is Proton/Wine, a compat layer for Windows apps that
needs to pretend it's on NTFS, and everything there expects casefolding
to work.

No auditing/tooling required; we know the problem, and it is unavoidable.

I agree with the call-out about Unicode normalization being odd, though.
When I was implementing casefolding for bcachefs, I immediately thought
that full normalization was a huge hammer for the intended purpose,
compared with just a big table...
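
For a sense of what "just a big table" could look like, here is a rough
sketch of the NTFS-style scheme Eric describes above: one upcase entry per
UTF-16 code unit, compared unit by unit. Python is used only for brevity.
`build_upcase_table`, `utf16_units` and `names_equal` are made-up names, and
the table is derived from Python's own Unicode data, which is an assumption
for illustration and not the table any real filesystem ships.

```python
def build_upcase_table():
    """Return a 65536-entry list mapping each UTF-16 code unit to its upcase."""
    table = list(range(0x10000))          # default: every unit maps to itself
    for unit in range(0x10000):
        if 0xD800 <= unit <= 0xDFFF:
            continue                      # surrogate halves are left alone
        up = chr(unit).upper()
        # Keep only one-unit-to-one-unit mappings (so ß stays ß, not "SS").
        if len(up) == 1 and ord(up) < 0x10000:
            table[unit] = ord(up)
    return table


UPCASE = build_upcase_table()


def utf16_units(name):
    """Split a string into its UTF-16 code units."""
    data = name.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def names_equal(a, b):
    """Case-insensitive comparison via the table; lengths never change."""
    ua, ub = utf16_units(a), utf16_units(b)
    return len(ua) == len(ub) and all(UPCASE[x] == UPCASE[y] for x, y in zip(ua, ub))
```

With this, "ReadMe.TXT" and "readme.txt" compare equal, while "Straße" and
"STRASSE" do not, since no table entry ever changes the length of a name.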

FWIR, there are actually two forms of casefolding in Unicode: full
casefolding (C+F), where one character can fold to several (e.g. ß -> ss),
and the simpler one, simple casefolding (C+S), where lengths don't change
and the mapping is one code point to one code point.
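
Roughly, the difference looks like this. Python's `str.casefold()` does full
casefolding; Python has no built-in simple casefolding, so `simple_fold` and
the `SIMPLE_ONLY` table below are hypothetical stand-ins that cover just a
couple of the S entries from Unicode's CaseFolding.txt, not the complete data.

```python
# Hand-picked subset of the "S" (simple-only) mappings. This is an assumption
# for illustration; the real CaseFolding.txt lists more of them.
SIMPLE_ONLY = {
    "\u1e9e": "\u00df",  # LATIN CAPITAL LETTER SHARP S -> ß
    "\u1f88": "\u1f80",  # GREEK CAPITAL ALPHA WITH PSILI AND PROSGEGRAMMENI
}


def full_fold(name):
    """Full casefolding (C+F): a character may expand, e.g. ß -> ss."""
    return name.casefold()


def simple_fold(name):
    """Simple casefolding (C+S) sketch: always one code point in, one out."""
    out = []
    for ch in name:
        folded = ch.casefold()
        if len(folded) == 1:
            out.append(folded)                   # common (C) mapping, or unchanged
        else:
            out.append(SIMPLE_ONLY.get(ch, ch))  # S mapping if listed, else keep
    return "".join(out)
```

So "Straße" and "STRASSE" match under full casefolding (both become
"strasse") but not under simple casefolding, where "Straße" folds to
"straße" and keeps its length.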
- Autumn ✨
>
> And, another thing I'd like is a way to make casefolding per-process, so
> that it could be opt-in for the programs that need it - so that new code
> isn't accidentally depending on casefolding.
>
> That's something we really should have, anyways.
>
> But, as much as we might hate it, casefolding is something that users
> like and do expect in other contexts, so if casefolding is going to
> exist (as more than just a compatibility thing for legacy code) - it
> really ought to be unicode, and utf8 really has won at this point.
>
> Mainly though, it's not a decision I care to revisit, I intend to stick
> with casefolding that's compatible with how it's done on our other
> filesystems where it's widely used.
>