[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <875xnqudr1.fsf@mailhost.krisman.be>
Date: Wed, 11 Dec 2024 16:10:58 -0500
From: Gabriel Krisman Bertazi <krisman@...e.de>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Jaegeuk Kim <jaegeuk@...nel.org>, Linux Kernel Mailing List
<linux-kernel@...r.kernel.org>, "hanqi@...o.com" <hanqi@...o.com>,
"Theodore Ts'o" <tytso@....edu>
Subject: Re: Unicode conversion issue
Linus Torvalds <torvalds@...ux-foundation.org> writes:
> On Wed, 11 Dec 2024 at 11:58, Linus Torvalds
> <torvalds@...ux-foundation.org> wrote:
>>
>> The problem is that all the filesystems basically do some variation of
>>
>> if (IS_CASEFOLDED(dir) ..) {
>>
>> len = utf8_casefold(sb->s_encoding, orig_name,
>> new_name, MAXLEN);
>>
>> and then they use that "new_name" for both hashing and for comparisons.
>
> Oh, actually, f2fs does pass in the original name to
> generic_ci_match(), so I think this is solvable.
>
> The solution involves just telling f2fs to ignore the hash if it has
> seen odd characters.
>
> So I think f2fs could actually do something like this:
>
> --- a/fs/f2fs/dir.c
> +++ b/fs/f2fs/dir.c
> @@ -67,6 +67,7 @@ int f2fs_init_casefolded_name(const struct inode *dir,
> /* fall back to treating name as opaque byte sequence */
> return 0;
> }
> + fname->ignore_hash = utf8_oddname(fname->usr_fname);
> fname->cf_name.name = buf;
> fname->cf_name.len = len;
> }
> @@ -231,7 +232,7 @@ struct f2fs_dir_entry
> *f2fs_find_target_dentry(const struct f2fs_dentry_ptr *d,
> continue;
> }
>
> - if (de->hash_code == fname->hash) {
> + if (fname->ignore_hash || de->hash_code == fname->hash) {
> res = f2fs_match_name(d->inode, fname,
> d->filename[bit_pos],
> le16_to_cpu(de->name_len));
This solves it for directories with inlined dirents
(FI_INLINE_DENTRY). but for large directories, we use fname->hash to
find the right block to start the search. So, we'd need to walk through
the entire case-insensitive directory. In ext4, the issue only exists
on large directories, because we don't care about the hash on small
directories.
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -521,6 +521,7 @@ struct f2fs_filename {
>
> /* The dirhash of this filename */
> f2fs_hash_t hash;
> + bool ignore_hash;
>
> #ifdef CONFIG_FS_ENCRYPTION
> /*
>
> where that "utf8_oddname()" is the one that goes "this filename
> contains unhashable characters".
>
> I didn't look very closely at what ext4 does, but it seems to already
> have a pattern for "don't even look at the hash because it's not
> reliable", so I think ext4 can do something similar.
> So then all you actually need is that utf8_oddname() that recognizes
> those ignored code-points.
>
> So I take it all back: option (1) actually doesn't look that bad, and
> would make reverting commit 5c26d2f1d3f5 ("unicode: Don't special case
> ignorable code points") unnecessary.
I think we really need to revert it. The simplest way to implement
utf8_oddname is having the full database with the Ignorable code points
available. We can then add a flag in the same data structure indicating
this is an Ignorable codepoint that should be dismissed by the
utf8_strncasecmp when doing the casefold, while still using the full
string for the hash.
--
Gabriel Krisman Bertazi
Powered by blists - more mailing lists