Message-ID: <TY1PR01MB15782010C68C0C568A6AE68690DA0@TY1PR01MB1578.jpnprd01.prod.outlook.com>
Date:   Tue, 14 Apr 2020 09:29:32 +0000
From:   "Kohada.Tetsuhiro@...MitsubishiElectric.co.jp" 
        <Kohada.Tetsuhiro@...MitsubishiElectric.co.jp>
To:     'Pali Rohár' <pali@...nel.org>
CC:     "viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
        "'linux-fsdevel@...r.kernel.org'" <linux-fsdevel@...r.kernel.org>,
        "'linux-kernel@...r.kernel.org'" <linux-kernel@...r.kernel.org>,
        "'namjae.jeon@...sung.com'" <namjae.jeon@...sung.com>,
        "'sj1557.seo@...sung.com'" <sj1557.seo@...sung.com>,
        "Mori.Takahiro@...MitsubishiElectric.co.jp" 
        <Mori.Takahiro@...MitsubishiElectric.co.jp>
Subject: RE: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points
 above U+FFFF

> We do not know how code points above U+FFFF could be converted to upper case. 

Code points above U+FFFF do not need to be converted to uppercase.

> Basically from exfat specification can be deduced it only for
> U+0000 .. U+FFFF code points. 

The exFAT specification (sec. 7.2.5.1) says:
-- table shall cover the complete Unicode character range (from character codes 0000h to FFFFh inclusive).

The terms UCS-2, UCS-4, and UTF-16 do not appear in the exFAT specification.
It just says "Unicode".

> Second problem is that all MS filesystems (vfat, ntfs and exfat) do not use UCS-2 nor UTF-16, but rather some mix between
> it. Basically any sequence of 16bit values (except those :/<>... vfat chars) is valid, even unpaired surrogate half. So
> surrogate pair (two 16bit values) represents one unicode code point (as in UTF-16), but one unpaired surrogate half is
> also valid and represent (invalid) unicode code point of its value. In unicode are not defined code points for values
> of single / half surrogate.

Microsoft's file systems use the UCS-4 code set encoded as UTF-16.
The character type is basically 'wchar_t' (16 bits).
The description "0000h to FFFFh" also assumes the use of 'wchar_t'.

This "0000h to FFFFh" range also includes the surrogate characters (U+D800 to U+DFFF),
but these should not be converted to upper case.
Passing a surrogate character to RtlUpcaseUnicodeChar() on Windows just returns the same value.
(* RtlUpcaseUnicodeChar() is one of the Windows native APIs.)

If the upcase-table contains entries for surrogate characters, exfat_toupper() will convert them incorrectly.
With the current implementation, the results of exfat_utf8_d_cmp() and exfat_uniname_ncmp() may then differ.

The standard exFAT upcase-table does not contain surrogate characters, so the problem does not occur in practice.
To be stricter, D800h to DFFFh should be excluded, either when loading the upcase-table or in exfat_toupper().
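
A minimal userspace sketch of the exfat_toupper()-side check could look like the
following. The names is_surrogate() and toupper_no_surrogate() are hypothetical,
and it assumes a 64K-entry table in which a zero entry means "no mapping"; it is
not the in-tree code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true for any UTF-16 surrogate code unit. */
static int is_surrogate(uint16_t c)
{
	return c >= 0xD800 && c <= 0xDFFF;
}

/*
 * Hypothetical stand-in for an upcase lookup. Surrogate code units are
 * returned unchanged even if the (possibly bogus) table has an entry for
 * them. Assumption: a zero table entry means "no mapping".
 */
static uint16_t toupper_no_surrogate(const uint16_t *upcase_table, uint16_t c)
{
	if (is_surrogate(c))
		return c;
	return upcase_table[c] ? upcase_table[c] : c;
}
```

The same guard could instead be applied once while loading the table, by
skipping entries whose index falls in D800h to DFFFh.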

> Therefore if we talk about encoding UTF-16 vs UTF-32 we first need to fix a way how to handle those non-representative
> values in VFS encoding (iocharset=) as UTF-8 is not able to represent it too. One option is to extend UTF-8 to WTF-8 
> encoding [1] (yes, this is a real and make sense!) and then ideally change exfat_toupper() to UTF-32 without restriction 
> for surrogate pairs values.

WTF-8 is new to me.
That's an interesting idea, but is it needed for exfat?

Characters above U+FFFF appear as:
 - in UTF-32, a value of 0x10000 or more
 - in UTF-16, values from 0xd800 to 0xdfff
I think the rule for these is simply "don't convert to uppercase."
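
To illustrate why the two ranges above describe the same characters, here is a
small sketch (helper names are hypothetical) that splits a code point above
U+FFFF into its UTF-16 surrogate pair and expresses the "don't convert" test
for each representation.

```c
#include <stdint.h>

/* Hypothetical helper: split a code point in U+10000..U+10FFFF into a pair. */
static void utf32_to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
	cp -= 0x10000;					/* 20 bits remain */
	*hi = (uint16_t)(0xD800 | (cp >> 10));		/* D800..DBFF */
	*lo = (uint16_t)(0xDC00 | (cp & 0x3FF));	/* DC00..DFFF */
}

/* "Don't convert to uppercase" test when iterating UTF-32 values. */
static int skip_upcase_utf32(uint32_t cp)
{
	return cp >= 0x10000;
}

/* The same test when iterating 16-bit UTF-16 code units. */
static int skip_upcase_utf16(uint16_t u)
{
	return u >= 0xD800 && u <= 0xDFFF;
}
```

For example, U+1F600 becomes the pair 0xD83D 0xDE00, and both halves are caught
by the UTF-16 test.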

If the File Name Directory Entry contains illegal surrogate characters (such as an unpaired surrogate half),
they will simply be ignored by utf16s_to_utf8s().
The string after UTF-8 conversion does not include any illegal byte sequences.
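
As a rough userspace illustration of that "ignore and move on" behaviour (this
is a sketch with a hypothetical name, not the kernel's utf16s_to_utf8s(), and
it assumes host-endian input and no NUL handling):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Convert UTF-16 to UTF-8, silently dropping unpaired surrogate halves.
 * Returns the number of bytes written; stops early if out is too small.
 */
static size_t utf16_to_utf8_skip_bad(const uint16_t *in, size_t inlen,
				     uint8_t *out, size_t outlen)
{
	size_t i = 0, n = 0, need;
	uint32_t cp;

	while (i < inlen) {
		cp = in[i++];
		if (cp >= 0xDC00 && cp <= 0xDFFF)
			continue;	/* lone low surrogate: drop it */
		if (cp >= 0xD800 && cp <= 0xDBFF) {
			if (i >= inlen || in[i] < 0xDC00 || in[i] > 0xDFFF)
				continue;	/* lone high surrogate: drop it */
			cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i++] - 0xDC00);
		}

		need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
		if (outlen - n < need)
			break;

		if (need == 1) {
			out[n++] = (uint8_t)cp;
		} else if (need == 2) {
			out[n++] = (uint8_t)(0xC0 | (cp >> 6));
			out[n++] = (uint8_t)(0x80 | (cp & 0x3F));
		} else if (need == 3) {
			out[n++] = (uint8_t)(0xE0 | (cp >> 12));
			out[n++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
			out[n++] = (uint8_t)(0x80 | (cp & 0x3F));
		} else {
			out[n++] = (uint8_t)(0xF0 | (cp >> 18));
			out[n++] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
			out[n++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
			out[n++] = (uint8_t)(0x80 | (cp & 0x3F));
		}
	}
	return n;
}
```

With this approach the output never contains an encoded surrogate, so the
resulting byte sequence stays valid UTF-8 at the cost of losing the bad units.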

> Btw, same problem with UTF-16 also in vfat, ntfs and also in iso/joliet kernel drivers.

Ugh...

BR
---
Kohada Tetsuhiro <Kohada.Tetsuhiro@...MitsubishiElectric.co.jp>