Message-ID: <1a4c545dc7f14e33b7e59321a0aab868@AcuMS.aculab.com>
Date: Mon, 20 Jan 2020 15:47:22 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Pali Rohár' <pali.rohar@...il.com>
CC: OGAWA Hirofumi <hirofumi@...l.parknet.co.jp>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"Theodore Y. Ts'o" <tytso@....edu>,
Namjae Jeon <linkinjeon@...il.com>,
"Gabriel Krisman Bertazi" <krisman@...labora.com>
Subject: RE: vfat: Broken case-insensitive support for UTF-8
From: Pali Rohár
> Sent: 20 January 2020 15:20
...
> This is not possible. There is 1:1 mapping between UTF-8 sequence and
> Unicode code point. wchar_t in kernel represent either one Unicode code
> point (limited up to U+FFFF in NLS framework functions) or 2bytes in
> UTF-16 sequence (only in utf8s_to_utf16s() and utf16s_to_utf8s()
> functions).
Unfortunately there is neither a 1:1 mapping of all possible byte sequences
to wchar_t (or Unicode code points), nor a 1:1 mapping of all possible
wchar_t values to UTF-8.
Really both need to be defined - even for otherwise 'invalid' sequences.
Even unpaired 16-bit surrogate values (0xd800 to 0xdfff) can appear on
their own in Windows filesystems (according to Wikipedia).
It is all too easy to get sequences of values that cannot be converted
to/from UTF-8.
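A rough userspace illustration of both failure directions, as a sketch only.
It uses Python's strict codecs, not the kernel NLS functions, and the helper
names utf8_to_text() and utf16_units_to_utf8() are made up for this example:

```python
import struct


def utf8_to_text(raw):
    """Return the decoded string, or None if raw is not well-formed UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


def utf16_units_to_utf8(units):
    """Convert a list of 16-bit code units to UTF-8 bytes.

    Returns None when the units contain an unpaired surrogate, which
    strict UTF-16 decoding rejects.
    """
    packed = struct.pack("<%dH" % len(units), *units)
    try:
        return packed.decode("utf-16-le").encode("utf-8")
    except UnicodeDecodeError:
        return None


# Byte sequences that have no code point under strict UTF-8 rules.
for raw in (b"\xff", b"\xc0\xaf", b"\xed\xa0\x80"):
    print(raw, "->", utf8_to_text(raw))

# 16-bit values that have no strict UTF-8 encoding (lone surrogates).
for units in ([0xD800], [0xDC00, 0x0041]):
    print([hex(u) for u in units], "->", utf16_units_to_utf8(units))
```

Every case in the two loops comes back as None, whereas a properly paired
surrogate sequence such as 0xd83d 0xde00 converts without trouble. A
filesystem that has to accept such names needs some defined behaviour for
them in place of the None.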
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)