[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150311184252.GA25276@bolet.org>
Date: Wed, 11 Mar 2015 19:42:52 +0100
From: Thomas Pornin <pornin@...et.org>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] NFC v NFD UTF-8 Normalization Re: [Was output specifics]
On Mon, Mar 09, 2015 at 03:41:51PM -0600, Jeffrey Goldberg wrote:
> But now to the NFC/NFD debate. The case for NFC is obvious. It's what
> is widely used already.
>
> The case for NFD is if one wanted to do something like what Facebook
> does with CAPS-LOCK. If your password fails, Facebook will
> automatically and silently retry it with case shifted.
>
> So if someone wanted to do something similar with ö and o, it is much
> easier to engage in such transformations with NFD.
This picture is, I think, enlightening:
http://unicode.org/reports/tr15/NormalizationOverview.GIF
If you want to remove accents and do similar transformations, then NFKD
may be preferable. For instance, in NFD, the "fi" ligature (should be
"fi" but it does not display well in my terminal font) is a single code
point U+FB01, while NFKD will make it a "f" followed by an "i" (U+0066
U+0069). Of course, it depends on whether you want to consider the
ligature "fi" as equal to or distinct from the two-letter sequence "fi".
You can even get that sort of question with very mundane latin-1
characters ("æ").
Also noteworthy is the fact that casing is a complex matter in all
generality. For instance, the switched-case equivalent of "sTRASSE" can
be either "Straße" and "Strasse" -- so it may depend on how much the
user is intent on using correct German typography (for a notion of
"correct" than can occasionally change, as it did in 1996 in Germany,
but not in Switzerland, for a number of words involving indeed the "ß").
Another problematic case is related to U+01F1, a letter which occurs in
Polish, Hungarian, Slovak, and a few other languages; it somewhat looks
like "DZ". That digraph has a lowercase variant (looks like "dz") AND a
third possible case called "titlecase", which looks like "Dz".
The bottom-line is that if you want to do smart things with casing
and accents and ligatures, and want to do it properly, then:
-- it will depend on what _exactly_ you want to do;
-- it will be hard.
Normalization is only a first step, and won't be sufficient (unless you
are ready to tolerate approximations, which may or may not make sense
for passwords, depending on the context). If we defined PHS to use NFD
(or NFKD), then:
-- People who just want to "hash a password" will have to include some
normalization library to pre-process whatever they got from the user
interface, which _usually_ produces NFC. Some language frameworks
already include such a library (e.g. Java, C#/.NET...), others do not
(typically C).
-- Peter's argument applies: some people will just take the bytes as
they got them, and I suspect this will be a rather common case.
"Maximizing interoperability" really means "doing the same as almost
everybody else"; if most other people use NFC (either explicitly, or,
more often, simply because they did not care enough and just used "the
bytes") then maximum interoperability will be achieved by using NFC.
So my conclusion is: if we want to specify (at "SHOULD" or "RECOMMENDED"
level) some encoding rules (and I believe we want to include such a
specification), then it should be NFC. Otherwise it won't work as
announced (i.e. it won't "maximize interoperability"), even if it might
be preferable for better support of some other features.
--Thomas Pornin
Powered by blists - more mailing lists