Message-ID: <TY1PR01MB15782010C68C0C568A6AE68690DA0@TY1PR01MB1578.jpnprd01.prod.outlook.com>
Date: Tue, 14 Apr 2020 09:29:32 +0000
From: "Kohada.Tetsuhiro@...MitsubishiElectric.co.jp"
<Kohada.Tetsuhiro@...MitsubishiElectric.co.jp>
To: 'Pali Rohár' <pali@...nel.org>
CC: "viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
"'linux-fsdevel@...r.kernel.org'" <linux-fsdevel@...r.kernel.org>,
"'linux-kernel@...r.kernel.org'" <linux-kernel@...r.kernel.org>,
"'namjae.jeon@...sung.com'" <namjae.jeon@...sung.com>,
"'sj1557.seo@...sung.com'" <sj1557.seo@...sung.com>,
"Mori.Takahiro@...MitsubishiElectric.co.jp"
<Mori.Takahiro@...MitsubishiElectric.co.jp>
Subject: RE: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points
above U+FFFF
> We do not know how code points above U+FFFF could be converted to upper case.
Code points above U+FFFF do not need to be converted to uppercase.
> Basically from exfat specification can be deduced it only for
> U+0000 .. U+FFFF code points.
The exFAT specification (sec. 7.2.5.1) says ...
-- table shall cover the complete Unicode character range (from character codes 0000h to FFFFh inclusive).
UCS-2, UCS-4, and UTF-16 terms do not appear in the exfat specification.
It just says "Unicode".
> Second problem is that all MS filesystems (vfat, ntfs and exfat) do not use UCS-2 nor UTF-16, but rather some mix between
> it. Basically any sequence of 16bit values (except those :/<>... vfat chars) is valid, even unpaired surrogate half. So
> surrogate pair (two 16bit values) represents one unicode code point (as in UTF-16), but one unpaired surrogate half is
> also valid and represent (invalid) unicode code point of its value. In unicode are not defined code points for values
> of single / half surrogate.
Microsoft's file systems use the UTF-16 encoded UCS-4 code set.
The character type is basically 'wchar_t'(16bit).
The description "0000h to FFFFh" also assumes the use of 'wchar_t'.
This "0000h to FFFFh" range also includes the surrogate characters (U+D800 to U+DFFF),
but these should not be converted to upper case.
Passing a surrogate character to RtlUpcaseUnicodeChar() on Windows just returns the same value.
(* RtlUpcaseUnicodeChar() is one of the Windows native APIs)
If the upcase-table contains entries for surrogate characters, exfat_toupper() will convert them incorrectly.
With the current implementation, the results of exfat_utf8_d_cmp() and exfat_uniname_ncmp() may then differ.
The normal exfat upcase-table does not contain surrogate characters, so the problem does not occur.
To be more strict...
D800h to DFFFh should be excluded when loading upcase-table or in exfat_toupper().
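As a rough sketch of the second option (filtering inside the toupper helper), something like the following could work. This is not the driver code: is_surrogate() and toupper_skip_surrogates() are hypothetical names, and the flat 65536-entry table where 0 means "no mapping" is an assumption made only for the illustration.

```c
#include <stdint.h>

/* Hypothetical helper: true for any UTF-16 surrogate half (D800h..DFFFh). */
static int is_surrogate(uint16_t c)
{
	return (c & 0xF800) == 0xD800;
}

/*
 * Hypothetical stand-in for exfat_toupper().
 * Assumption: "table" is a flat 65536-entry upcase table in which
 * an entry of 0 means "no mapping for this character".
 */
static uint16_t toupper_skip_surrogates(const uint16_t *table, uint16_t c)
{
	/* Never upcase a surrogate half, whatever the table says. */
	if (is_surrogate(c))
		return c;

	return table[c] ? table[c] : c;
}
```

With this check, a broken upcase-table that maps D800h..DFFFh to something else can no longer change the result, which matches what RtlUpcaseUnicodeChar() is described as doing above.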
> Therefore if we talk about encoding UTF-16 vs UTF-32 we first need to fix a way how to handle those non-representative
> values in VFS encoding (iocharset=) as UTF-8 is not able to represent it too. One option is to extend UTF-8 to WTF-8
> encoding [1] (yes, this is a real and make sense!) and then ideally change exfat_toupper() to UTF-32 without restriction
> for surrogate pairs values.
WTF-8 is new to me.
That's an interesting idea, but is it needed for exfat?
For characters over U+FFFF,
 - For UTF-32, a value of 0x10000 or more
 - For UTF-16, a value from 0xd800 to 0xdfff
I think the rule for these is simply "don't convert to uppercase".
If the File Name Directory Entry contains illegal surrogate characters (such as an unpaired surrogate half),
they will simply be ignored by utf16s_to_utf8s().
The string after UTF-8 conversion does not include any illegal byte sequence.
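To illustrate the behaviour described above, here is a simplified, self-contained sketch. It is not the kernel's utf16s_to_utf8s(); the function name utf16_to_utf8_drop_unpaired() is hypothetical, and host-endian input is assumed. It only shows the idea: a valid pair becomes one 4-byte sequence, and an unpaired half is dropped.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, simplified converter (NOT the kernel's utf16s_to_utf8s()).
 * Encodes host-endian UTF-16 as UTF-8 and silently drops any unpaired
 * surrogate half. Returns the number of bytes written; stops early if
 * the output buffer cannot hold the next character.
 */
static size_t utf16_to_utf8_drop_unpaired(const uint16_t *in, size_t inlen,
					  uint8_t *out, size_t outlen)
{
	size_t i = 0, o = 0;

	while (i < inlen) {
		uint32_t cp = in[i++];
		size_t need;

		if ((cp & 0xFC00) == 0xD800) {
			/* High half: valid only if a low half follows. */
			if (i < inlen && (in[i] & 0xFC00) == 0xDC00)
				cp = 0x10000 + ((cp & 0x3FF) << 10) +
				     (in[i++] & 0x3FF);
			else
				continue;	/* unpaired high half: drop */
		} else if ((cp & 0xFC00) == 0xDC00) {
			continue;		/* stray low half: drop */
		}

		need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
		if (outlen - o < need)
			break;

		if (need == 1) {
			out[o++] = (uint8_t)cp;
		} else if (need == 2) {
			out[o++] = (uint8_t)(0xC0 | (cp >> 6));
			out[o++] = (uint8_t)(0x80 | (cp & 0x3F));
		} else if (need == 3) {
			out[o++] = (uint8_t)(0xE0 | (cp >> 12));
			out[o++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
			out[o++] = (uint8_t)(0x80 | (cp & 0x3F));
		} else {
			out[o++] = (uint8_t)(0xF0 | (cp >> 18));
			out[o++] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
			out[o++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
			out[o++] = (uint8_t)(0x80 | (cp & 0x3F));
		}
	}
	return o;
}
```

Because the unpaired halves never reach the output, the UTF-8 side only ever sees well-formed sequences, which is why WTF-8 may not be strictly necessary for this case.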
> Btw, same problem with UTF-16 also in vfat, ntfs and also in iso/joliet kernel drivers.
Ugh...
BR
---
Kohada Tetsuhiro <Kohada.Tetsuhiro@...MitsubishiElectric.co.jp>