Message-ID: <CAHk-=wiC3evUXq8QTcOBFTMu1wsUR_dYiS8eGxy0Hh7VbL55yA@mail.gmail.com>
Date: Wed, 11 Dec 2024 12:18:25 -0800
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Gabriel Krisman Bertazi <krisman@...e.de>
Cc: Jaegeuk Kim <jaegeuk@...nel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>, "hanqi@...o.com" <hanqi@...o.com>,
"Theodore Ts'o" <tytso@....edu>
Subject: Re: Unicode conversion issue

On Wed, 11 Dec 2024 at 11:58, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> The problem is that all the filesystems basically do some variation of
>
> if (IS_CASEFOLDED(dir) ..) {
>
> len = utf8_casefold(sb->s_encoding, orig_name,
> new_name, MAXLEN);
>
> and then they use that "new_name" for both hashing and for comparisons.

Oh, actually, f2fs does pass in the original name to
generic_ci_match(), so I think this is solvable.

The solution involves just telling f2fs to ignore the hash if it has
seen odd characters.

So I think f2fs could actually do something like this:

--- a/fs/f2fs/dir.c
+++ b/fs/f2fs/dir.c
@@ -67,6 +67,7 @@ int f2fs_init_casefolded_name(const struct inode *dir,
 			/* fall back to treating name as opaque byte sequence */
 			return 0;
 		}
+		fname->ignore_hash = utf8_oddname(fname->usr_fname);
 		fname->cf_name.name = buf;
 		fname->cf_name.len = len;
 	}
@@ -231,7 +232,7 @@ struct f2fs_dir_entry *f2fs_find_target_dentry(const struct f2fs_dentry_ptr *d,
 			continue;
 		}

-		if (de->hash_code == fname->hash) {
+		if (fname->ignore_hash || de->hash_code == fname->hash) {
 			res = f2fs_match_name(d->inode, fname,
 					      d->filename[bit_pos],
 					      le16_to_cpu(de->name_len));
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -521,6 +521,7 @@ struct f2fs_filename {

 	/* The dirhash of this filename */
 	f2fs_hash_t hash;
+	bool ignore_hash;

 #ifdef CONFIG_FS_ENCRYPTION
 	/*

where that "utf8_oddname()" is the one that goes "this filename
contains unhashable characters".
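
To make the effect of that second hunk concrete, here is a minimal userspace sketch of a lookup loop where the hash acts only as a pre-filter that can be bypassed. Everything prefixed with toy_ is hypothetical and heavily simplified; it is not the f2fs code, and the name comparison is a plain byte compare standing in for the real case-insensitive match.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the f2fs structures. */
struct toy_dentry {
	uint32_t hash_code;	/* hash stored when the entry was created */
	const char *name;
};

struct toy_filename {
	const char *name;	/* name being looked up */
	uint32_t hash;		/* hash computed for this lookup */
	bool ignore_hash;	/* set when the name contains "odd" characters */
};

/* Stand-in for the real name comparison; here just a byte compare. */
static bool toy_match_name(const struct toy_filename *fname,
			   const struct toy_dentry *de)
{
	return strcmp(fname->name, de->name) == 0;
}

/* Return the index of the matching entry, or -1 if none matches. */
static int toy_find_dentry(const struct toy_dentry *dir, size_t n,
			   const struct toy_filename *fname)
{
	for (size_t i = 0; i < n; i++) {
		/* The hash is only a pre-filter; skip it when unreliable. */
		if (fname->ignore_hash || dir[i].hash_code == fname->hash) {
			if (toy_match_name(fname, &dir[i]))
				return (int)i;
		}
	}
	return -1;
}
```

With ignore_hash clear, an entry whose stored hash differs from the lookup hash is never compared by name, so it cannot be found. With ignore_hash set, every entry in the block is compared, which costs more but no longer depends on the hash being stable.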

I didn't look very closely at what ext4 does, but it seems to already
have a pattern for "don't even look at the hash because it's not
reliable", so I think ext4 can do something similar.

So then all you actually need is that utf8_oddname() that recognizes
those ignored code-points.
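
As a rough illustration of what such a helper might look like, here is a userspace sketch. utf8_oddname() is only proposed in this message, so the function below is a hypothetical toy version: it takes a byte buffer and length rather than the name structure used in the diff, and its table is a small illustrative subset of Unicode default-ignorable code points, not the full set and not the kernel's data. It also assumes that malformed input is handled elsewhere, so it simply returns false for it, and it does not reject overlong encodings.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative subset of Unicode default-ignorable code points.
 * NOT the complete list, and not taken from the kernel's tables.
 */
static bool toy_is_ignorable(uint32_t cp)
{
	return cp == 0x00AD ||				/* soft hyphen */
	       (cp >= 0x200B && cp <= 0x200F) ||	/* zero-width space .. RLM */
	       (cp >= 0x2060 && cp <= 0x2064) ||	/* word joiner .. invisible plus */
	       (cp >= 0xFE00 && cp <= 0xFE0F) ||	/* variation selectors */
	       cp == 0xFEFF;				/* zero-width no-break space */
}

/* Hypothetical: true if the UTF-8 name contains an ignorable code point. */
static bool toy_utf8_oddname(const unsigned char *s, size_t len)
{
	size_t i = 0;

	while (i < len) {
		uint32_t cp = s[i];
		size_t extra;

		if (cp < 0x80) {
			extra = 0;
		} else if ((cp & 0xE0) == 0xC0) {
			cp &= 0x1F;
			extra = 1;
		} else if ((cp & 0xF0) == 0xE0) {
			cp &= 0x0F;
			extra = 2;
		} else if ((cp & 0xF8) == 0xF0) {
			cp &= 0x07;
			extra = 3;
		} else {
			return false;	/* malformed lead byte: left to the caller */
		}

		if (extra > len - i - 1)
			return false;	/* truncated sequence */

		for (size_t k = 1; k <= extra; k++) {
			if ((s[i + k] & 0xC0) != 0x80)
				return false;	/* bad continuation byte */
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}

		if (toy_is_ignorable(cp))
			return true;
		i += extra + 1;
	}
	return false;
}
```

A real version would presumably consult the same Unicode data the casefolding code already uses instead of a hard-coded range list.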

So I take it all back: option (1) actually doesn't look that bad, and
would make reverting commit 5c26d2f1d3f5 ("unicode: Don't special case
ignorable code points") unnecessary.

Linus