Message-ID: <8736whpevj.fsf@collabora.co.uk>
Date: Tue, 17 Jul 2018 20:27:28 -0400
From: Gabriel Krisman Bertazi <krisman@...labora.co.uk>
To: "Theodore Y. Ts'o" <tytso@....edu>
Cc: linux-ext4@...r.kernel.org, darrick.wong@...cle.com,
kernel@...labora.com
Subject: Re: [PATCH 00/20] EXT4 encoding support
"Theodore Y. Ts'o" <tytso@....edu> writes:
> So maybe what we need to talk about is having a feature called
> EXT4_FEATURE_INCOMPAT_CHARSET_ENCODING, which enables two fields in
> the superblock. One is the encoding identifier (and 8 or 16 bits is
> probably *plenty*), and the other is the "encoding flags" field.
The current patchset makes encoding an INCOMPAT feature, but I'm
using 32 bits for the encoding identifier. I will change it to 16 bits
in the next iteration of the patch.
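
To make the field sizes concrete, here is a minimal Python sketch of a 16-bit encoding identifier next to an encoding flags field. It is only an illustration: the 16-bit width of the flags field, the flag names, and the helper functions are assumptions made for this example and are not taken from the patchset. Little-endian is used because ext4 on-disk structures are little-endian.

```python
import struct

# Hypothetical flag bits, invented for this sketch only.
FLAG_DEFAULT_CASEFOLD = 0x0001
FLAG_DEFAULT_NORMALIZE = 0x0002

# Two little-endian unsigned 16-bit fields: encoding id, then encoding flags.
# The width of the flags field is an assumption, not something the thread fixes.
ENCODING_FIELDS = struct.Struct("<HH")


def pack_encoding(encoding_id, flags):
    """Serialize the two fields; raises struct.error if a value exceeds 16 bits."""
    return ENCODING_FIELDS.pack(encoding_id, flags)


def unpack_encoding(raw):
    """Return (encoding_id, flags) from the 4-byte packed form."""
    return ENCODING_FIELDS.unpack(raw)
```

With this layout the two fields occupy 4 bytes in total, and an identifier that does not fit in 16 bits is rejected at pack time.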
> Some of these flags might specify an encoding --- e.g., the file
> system supports normalization- and/or case- insensitive lookups in an
> efficient way by normalizing the string before calculating the
> dir_index hash. Some of these might specify the default behavior
> (e.g., case-insensitive or normalization-insensitive) file lookups if
> not overridden by a mount option.
I like the idea of encoding flags for selecting the default for
case/normalization sensitivity. But I'm not really sure about a flag
stating support for normalized hashes. It could be made redundant with
the feature/casefold flag itself, if we make tune2fs or similar rehash
the disk when enabling/disabling the encoding feature flag.
    Feature flag is set -> Hash(normalization(x))
    Feature flag and parent inode casefold flag are set -> Hash(casefold(x))
The casefold superblock flag would state whether the casefold inode
flag defaults to true or false.
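
The two rules above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the patchset's code: `hash_key`, `dirent_hash`, and their parameters are hypothetical names, CRC32 stands in for the real dir_index hash, NFKD stands in for the normalization function, and Python's `str.casefold()` stands in for the casefold function.

```python
import unicodedata
import zlib


def normalize(name):
    """Stand-in normalization function: Unicode NFKD."""
    return unicodedata.normalize("NFKD", name)


def casefold(name):
    """Stand-in casefold: normalize, fold case, then normalize again."""
    return normalize(normalize(name).casefold())


def hash_key(name, encoding_feature, parent_casefold):
    """Pick the bytes that get hashed, following the two rules above."""
    if not encoding_feature:
        return name.encode("utf-8")            # feature off: hash the name as-is
    if parent_casefold:
        return casefold(name).encode("utf-8")  # Hash(casefold(x))
    return normalize(name).encode("utf-8")     # Hash(normalization(x))


def dirent_hash(name, encoding_feature, parent_casefold):
    """CRC32 is only a placeholder for the real directory hash."""
    return zlib.crc32(hash_key(name, encoding_feature, parent_casefold))
```

With the feature flag set, the precomposed and decomposed spellings of the same name produce the same key, and with the parent's casefold flag also set, "README" and "readme" do as well. With the feature flag clear, the raw bytes are hashed and all of these stay distinct.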
> This assumes that normalization and case sensitivity are completely
> orthogonal.
I'm thinking of casefolding as a special case of the normalization
problem, just because its semantics are interesting for users. In fact,
it could be seen as just a different normalization function, from the
implementation point of view.
So it is not completely orthogonal per se, but it does deserve some
special attention to be more useful, such as being per-directory and
carrying its own activation flags.
> The other thing is there seems to be some debate (and Apple isn't even
> consistent over time) over what kind of normalization is considered
> "best" or "correct". e.g., NFD, NFC, NFKD, NFKC. And if you want to
> export the file system over APFS, it might make a difference which one
> you use. (This is usually the point where some people will assert
> that teaching everyone in the world English really *would* be easier
> than supporting full I18N. :-) Is this something we can or should
> consider when deciding what we want to support in Linux long-term?
Since the implementation is normalization-preserving on-disk, isn't this
something that can be changed in the future if it is ever needed?
Provided we can rehash the dentries if we need to change the
normalization, a flag in the superblock, stating what normalization
method is used, should suffice if we ever want to support other
normalization methods. I have to say, it is not in my plans to support
anything other than NFKD. :)
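
A toy Python model of that argument, assuming names are stored exactly as created and only the lookup key is normalized. The `Directory` class and its methods are hypothetical and exist only for this sketch. The dict stands in for the hashed directory index.

```python
import unicodedata


class Directory:
    """Toy directory: names are stored as given, indexed by a normalized key."""

    def __init__(self, form="NFKD"):
        self.form = form   # plays the role of the superblock normalization field
        self.index = {}    # normalized key -> name exactly as it was created

    def _key(self, name):
        return unicodedata.normalize(self.form, name)

    def create(self, name):
        key = self._key(name)
        if key in self.index:
            raise FileExistsError(name)
        self.index[key] = name

    def lookup(self, name):
        """Return the stored spelling of an equivalent name, or None."""
        return self.index.get(self._key(name))

    def rehash(self, new_form):
        """Rebuild the index under another normalization form."""
        rebuilt = {}
        for stored in self.index.values():
            key = unicodedata.normalize(new_form, stored)
            if key in rebuilt:
                raise ValueError(f"{stored!r} collides under {new_form}")
            rebuilt[key] = stored
        self.form = new_form
        self.index = rebuilt
```

Because the stored names are never rewritten, switching forms only requires rebuilding the index. One caveat the sketch makes visible: moving to a more aggressive form (for example NFC to NFKD) can make two existing names equivalent, so a real rehash tool would need a policy for such collisions.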
> ... and what I'm really asking is do we really want to be specifying
> whether or not normalization is a Thing as a property of the encoding,
> or a property of the file system (or object, or document) that uses
> that particular encoding?
I see normalization as an inherent property of the encoding, since, for
the user, equivalent strings should mean the same thing in the natural
language. But I see the point of filesystems wanting to ignore
normalization. I am leaning towards the permissive route, where this
can be enabled/disabled when loading an NLS charset table. This way we
can merge utf8 and utf8n, and satisfy the normalization case, while
keeping compatibility with older users. What do you think?
--
Gabriel Krisman Bertazi