linux-ext4 - Re: [PATCH RESEND v2 00/25] Ext4 Encoding and Case-insensitive support


Message-ID: <20181012192401.GA20322@thunk.org>
Date:   Fri, 12 Oct 2018 15:24:01 -0400
From:   "Theodore Y. Ts'o" <tytso@....edu>
To:     "Darrick J. Wong" <darrick.wong@...cle.com>
Cc:     Gabriel Krisman Bertazi <krisman@...labora.co.uk>,
        linux-ext4@...r.kernel.org
Subject: Re: [PATCH RESEND v2 00/25] Ext4 Encoding and Case-insensitive
 support

On Thu, Oct 11, 2018 at 03:23:59PM -0700, Darrick J. Wong wrote:
> 
> Hmmm, I'm curious, why pick NFKD specifically?  AFAICT Linux userspace
> environments (I only tried with GNOME and KDE) use NF[K]C....
>
> Is there a particular reason you picked NFKD?  Ohhh, right, because this
> series is a derivative of the ~2014 XFS case folding patchset.  Hmm, so
> looking at the ext4 changes, I guess what you do is add a custom ->d_hash
> function so that the dentries are hashed by hash(nfkd(fname))?  Which
> makes it easy to have link() look for names that will conflict after
> normalization?

This would be true for NFKC or NFC as well though, right?  So the
tradeoff of NF[K]C vs NF[K]D is that NFC is more compact from an
encoding perspective.  For e with an acute accent (U+00E9), NFC would
encode it in UTF-8 as C3 A9, while NFD would encode it as 65 CC 81.
So from an encoding perspective there would be a benefit to using 'C'
versus 'D'.  But MacOS X by default canonicalizes to 'D', not 'C'.  I
assume that's the rationale for using NFKD versus NFKC?
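
The byte counts above can be checked from userspace.  This is only a
minimal sketch using Python's standard unicodedata module, not
anything from the patch series; the variable names are made up for
illustration.

```python
import unicodedata

# U+00E9: LATIN SMALL LETTER E WITH ACUTE (precomposed form)
e_acute = "\u00e9"

# NFC keeps the single precomposed code point; NFD splits it into
# 'e' followed by U+0301 COMBINING ACUTE ACCENT.
nfc_bytes = unicodedata.normalize("NFC", e_acute).encode("utf-8")
nfd_bytes = unicodedata.normalize("NFD", e_acute).encode("utf-8")

print("NFC:", nfc_bytes.hex(" "))  # c3 a9     (2 bytes)
print("NFD:", nfd_bytes.hex(" "))  # 65 cc 81  (3 bytes)
```

So for this character the decomposed form costs one extra byte per
accented letter on disk.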

As far as the 'K' versus "non-K" distinction, I imagine the main issue
is that a user could cut and paste something like "She\uFB03eld"
(U+FB03 is the "ffi" ligature), which it makes sense to canonicalize
to "Sheffield".  This is *not* a canonicalization which MacOS X does
(it uses NFD, not NFKD) but from a compatibility perspective, it's not
a problem since:

NFD:	Sheffield -> Sheffield
	She\uFB03eld -> She\uFB03eld

NFKD:	Sheffield -> Sheffield
	She\uFB03eld -> Sheffield
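
The table above can be reproduced with a short script.  Again this is
just a sketch with Python's unicodedata module under the assumption
that its NFD/NFKD tables match what the kernel code would use; the
names "plain" and "ligature" are illustrative only.

```python
import unicodedata

plain = "Sheffield"
ligature = "She\uFB03eld"  # U+FB03: LATIN SMALL LIGATURE FFI

# Map each normalization form to the normalized result of both inputs.
results = {
    form: [unicodedata.normalize(form, name) for name in (plain, ligature)]
    for form in ("NFD", "NFKD")
}

for form, (from_plain, from_ligature) in results.items():
    # ascii() escapes the ligature so the output is readable on any terminal.
    print(form, ascii(plain), "->", ascii(from_plain))
    print(form, ascii(ligature), "->", ascii(from_ligature))

# Under NFKD both inputs collapse to the same name; under NFD they do not.
print("NFKD collision:", results["NFKD"][0] == results["NFKD"][1])
print("NFD collision: ", results["NFD"][0] == results["NFD"][1])
```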

Given it's really painful to type the string She\uFB03eld into a
terminal, it seems to make sense that even if the user tries to create
a file with that string, the actual file name that gets created
should be "Sheffield".

And hence, that's the argument for why the best on-disk encoding for
Linux file systems should be NFKD.

Does that seem right to everyone?

						- Ted